Previous Article | matchIT Hub Index | Next Article |
matchIT Fuzzy
The default matchIT algorithm for fuzzy string matching allows for only 1 error of one of the following types: insertions, deletions, substitutions, transpositions, and wide transpositions. Where: a transposition is where two adjacent characters are swapped, and a wide transposition is where two non-adjacent characters are swapped.
The matchIT Fuzzy compare algorithm is equivalent to Damerau-Levenshtein with an edit distance of 1, except that Damerau-Levenshtein doesn't allow for wide transpositions.
Damerau-Levenshtein
This is an extension to the original Levenshtein distance algorithm that handles transpositions as well as insertions, deletions, and substitutions. Both algorithms are in the public domain.
The algorithms produce a string metric ("edit distance") for measuring the similarity between two strings (i.e. the number of changes to get from one string to the other). matchIT then produces a fuzzy score between 0 and 1, which is a measure of the edit distance in relation to the length of the longer string. Two configuration settings provide fine tuning of fuzzy matching Compare | Fuzzy | MaximumEditDistance and Compare | Fuzzy | MinumumScore.
Each string comparison must satisfy both constraints. See below for some examples that provide a rational for both constraints.
Examples
Using the default settings, these strings will be matched:
Type | First word | Second word | Edit distance | Score |
Identical | JO | JO | 0 | 1 |
Identical | JOHN | JOHN | 0 | 1 |
Identical | JONATHAN | JONATHAN | 0 | 1 |
Fuzzy (reversal) | JO | OJ | 1 | 0.5 |
Fuzzy (insertion) | AB | ABC | 1 | 0.667 |
Fuzzy (substitution) (#) | A | B | 1 | 0.5 |
Fuzzy (substitution) | JOHN | JAHN | 1 | 0.75 |
Fuzzy (transposition) | JOHN | JHON | 1 | 0.75 |
Fuzzy (insertion) | REBECCA | REBECCCA | 1 | 0.875 |
An exception is made for single-letter matches. matchIT only allows single-letter matches if they’re similar (i.e. G and J, S and F, M and N), and does not compare them using Damerau-Levenshtein.
Using the default settings, the following strings will not be matched – either because the edit distance exceeds the maximum (1) or the score hasn’t reached the minimum (0.5):
Type | First word | Second word | Edit distance | Score |
Fuzzy (insertions) (*) | ANTON | ANTHONY | 2 | 0.714 |
Fuzzy (insertions) (*) | REBECCA | REBBECCCA | 2 | 0.778 |
Fuzzy (substitutions) (*) | LEONARD | LOENRAD | 2 | 0.857 |
Fuzzy (reversal) (*) | JON | NOJ | 2 | 0.333 |
Fuzzy (reversal) | JOHN | NHOJ | 3 | 0.25 |
Different (+) | JO | ED | 2 | 0 |
Different | DAVE | ED | 3 | 0.25 |
Different | JOHN | DAVE | 4 | 0 |
Different | NOTTINGHAM | GLOUCESTER | 10 | 0 |
Containment | NOTTINGHAM | NOTTINGHAMSHIRE | 5 | 0.667 |
By increasing the MinimumEditDistance to 2, the three fuzzy matches marked (*) will be found. However, the row marked with (+) still won’t be matched, even though its edit distance is 2, because its score is below the minimum of 0.5. Hence the need for both configuration options.
(Of course, some of these matches are going to be found because they’re phonetically identical or similar (e.g. Rebecca and Rebbeccca), these examples are just to illustrate what can be achieve by using Damerau-Levenshtein regardless of phoneticization.)
Previous Article | matchIT Hub Index | Next Article |