Appendix B - Advanced Configuration Guide - Fuzzy Algorithms – Software Support

matchIT Fuzzy

The default matchIT algorithm for fuzzy string matching allows for only 1 error of one of the following types: insertions, deletions, substitutions, transpositions, and wide transpositions. Where: a transposition is where two adjacent characters are swapped, and a wide transposition is where two non-adjacent characters are swapped.

The matchIT Fuzzy compare algorithm is equivalent to Damerau-Levenshtein with an edit distance of 1, except that Damerau-Levenshtein doesn't allow for wide transpositions.

Damerau-Levenshtein

This is an extension to the original Levenshtein distance algorithm that handles transpositions as well as insertions, deletions, and substitutions. Both algorithms are in the public domain.

The algorithms produce a string metric ("edit distance") for measuring the similarity between two strings (i.e. the number of changes to get from one string to the other). matchIT then produces a fuzzy score between 0 and 1, which is a measure of the edit distance in relation to the length of the longer string. Two configuration settings provide fine tuning of fuzzy matching Compare | Fuzzy | MaximumEditDistance and Compare | Fuzzy | MinumumScore.

Each string comparison must satisfy both constraints. See below for some examples that provide a rational for both constraints.

Examples

Using the default settings, these strings will be matched:

Type	First word	Second word	Edit distance	Score
Identical	JO	JO	0	1
Identical	JOHN	JOHN	0	1
Identical	JONATHAN	JONATHAN	0	1
Fuzzy (reversal)	JO	OJ	1	0.5
Fuzzy (insertion)	AB	ABC	1	0.667
Fuzzy (substitution) (#)	A	B	1	0.5
Fuzzy (substitution)	JOHN	JAHN	1	0.75
Fuzzy (transposition)	JOHN	JHON	1	0.75
Fuzzy (insertion)	REBECCA	REBECCCA	1	0.875

An exception is made for single-letter matches. matchIT only allows single-letter matches if they’re similar (i.e. G and J, S and F, M and N), and does not compare them using Damerau-Levenshtein.

Using the default settings, the following strings will not be matched – either because the edit distance exceeds the maximum (1) or the score hasn’t reached the minimum (0.5):

Type	First word	Second word	Edit distance	Score
Fuzzy (insertions) (*)	ANTON	ANTHONY	2	0.714
Fuzzy (insertions) (*)	REBECCA	REBBECCCA	2	0.778
Fuzzy (substitutions) (*)	LEONARD	LOENRAD	2	0.857
Fuzzy (reversal) (*)	JON	NOJ	2	0.333
Fuzzy (reversal)	JOHN	NHOJ	3	0.25
Different (+)	JO	ED	2	0
Different	DAVE	ED	3	0.25
Different	JOHN	DAVE	4	0
Different	NOTTINGHAM	GLOUCESTER	10	0
Containment	NOTTINGHAM	NOTTINGHAMSHIRE	5	0.667

By increasing the MinimumEditDistance to 2, the three fuzzy matches marked (*) will be found. However, the row marked with (+) still won’t be matched, even though its edit distance is 2, because its score is below the minimum of 0.5. Hence the need for both configuration options.

(Of course, some of these matches are going to be found because they’re phonetically identical or similar (e.g. Rebecca and Rebbeccca), these examples are just to illustrate what can be achieve by using Damerau-Levenshtein regardless of phoneticization.)

matchIT Hub Index

matchIT Fuzzy

Damerau-Levenshtein

Examples

Related articles