During the matching step, records with the same match keys are grouped together and then compared with each other. The matchIT API uses the scoring weights for specific fields to determine an overall match score. If the score reaches the defined threshold score for a given match level (configured in the minimum score settings), then a matching pair will be written to the results table along with the score achieved.
The scoring weights have the following categories:
- Sure – generally means a match is certain (for example when scoring a name comparison, both the forenames and surnames of the records might be the same)
- Likely – there are small differences between the records for the field being compared.
- Possible – there are notable differences between the records for the field being compared, but the fields could still be a match.
- OneEmpty – one record has a blank entry for the field being compared, e.g. one record has no postcode when scoring a postcode/zip comparison.
- BothEmpty – both records have blank entries for the field being compared.
For example, if the minimum score setting for Individual Level matching is 80, then the combined component scores for Name, Address, Postcode, Email, Telephone, Organization and Custom must be greater than or equal to 80. The component scores themselves are driven by the weights settings, which dictate the score achieved when specific elements are compared (depending upon how similar those elements are).
For example, by default when two names are compared at individual level, should they be considered to be a Sure match, then the name component of the overall score would be 60. If the postcodes when compared were considered to only be Possible matches, then the default weight indicates that the postcodes would score 15.
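The arithmetic described above can be sketched as follows. This is a minimal illustration, not the matchIT API: the function names and the dictionary layout are hypothetical, and only the values 80, 60 and 15 come from the text.

```python
# Hypothetical illustration; these names are not part of the matchIT API.
MINIMUM_SCORE_INDIVIDUAL = 80  # example minimum score from the text


def overall_score(component_scores):
    """Add the per-component scores together into one overall match score."""
    return sum(component_scores.values())


def is_match(component_scores, minimum_score=MINIMUM_SCORE_INDIVIDUAL):
    """A pair is reported when the overall score reaches the minimum score."""
    return overall_score(component_scores) >= minimum_score


# Name rated Sure (60) and postcode rated Possible (15), as in the text.
scores = {"name": 60, "postcode": 15}
print(overall_score(scores), is_match(scores))  # 75 False
```

With only these two components the pair scores 75 and falls short of the threshold, so another component (address, email and so on) would need to contribute at least 5 more points for the pair to be reported.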
Name/Company Comparisons
matchIT SQL uses a name matching matrix to decide whether a name comparison should be rated Sure, Likely, Possible and so on. Installing matchIT SQL creates the following folder:
C:\matchIT SQL\config\matchingMatrices
This folder contains matching matrices for the various matching levels. To illustrate how a matrix works, consider the following XML:
<lastnames match="equal">
  <firstnames match="equal">
    <middlenames match="equal">sure</middlenames>
    <middlenames match="both_empty">sure</middlenames>
    <middlenames match="one_empty">sure</middlenames>
    <middlenames match="approx">likely</middlenames>
    <middlenames match="contains">likely</middlenames>
    <middlenames match="unequal">possible</middlenames>
  </firstnames>
This indicates that when the last names of the two records being compared are the same and the first names are also the same, the result of the comparison depends on the data in the middle name fields. For example, if the middle names are also equal, the result is sure. The score that this sure match is worth depends on the weight defined in your configuration file for a Sure match on name (at the corresponding matching level).
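To make the lookup concrete, the sketch below reads the fragment shown above with Python's standard xml.etree.ElementTree module and returns the result for a given combination of comparison outcomes. It is an illustration only: the lookup function is hypothetical, a closing lastnames tag has been added so that the fragment parses on its own, and this is not how matchIT itself reads the matrix.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, closed off so that it is well-formed XML.
MATRIX_XML = """
<lastnames match="equal">
  <firstnames match="equal">
    <middlenames match="equal">sure</middlenames>
    <middlenames match="both_empty">sure</middlenames>
    <middlenames match="one_empty">sure</middlenames>
    <middlenames match="approx">likely</middlenames>
    <middlenames match="contains">likely</middlenames>
    <middlenames match="unequal">possible</middlenames>
  </firstnames>
</lastnames>
"""


def lookup(root, lastnames, firstnames, middlenames):
    """Return the matrix result for three comparison outcomes, or None."""
    if root.get("match") != lastnames:
        return None
    for first in root.findall("firstnames"):
        if first.get("match") != firstnames:
            continue
        for middle in first.findall("middlenames"):
            if middle.get("match") == middlenames:
                return middle.text.strip()
    return None


root = ET.fromstring(MATRIX_XML.strip())
print(lookup(root, "equal", "equal", "approx"))   # likely
print(lookup(root, "equal", "equal", "unequal"))  # possible
```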
There are similar matching matrices for comparing organisation names, the only difference being that name1, name2, name3 are used instead of lastnames, firstnames, middlenames:
<name1 match="equal">
  <name2 match="equal">
    <name3 match="equal">sure</name3>
    <name3 match="both_empty">sure</name3>
    <name3 match="one_empty">sure</name3>
    <name3 match="approx">sure</name3>
    <name3 match="contains">likely</name3>
    <name3 match="unequal">likely</name3>
  </name2>
findIT S2/matchIT Web uses the same underlying scoring, but its matrices hold decimal values instead of sure, likely and possible. This gives finer granularity, which is useful because results from that tool are more commonly reviewed manually. An additional 'starts_with' match type is also exposed so that some fuzzier matches can be found.
<lastnames match="equal">
  <firstnames match="starts_with">
    <middlenames match="equal">0.92</middlenames>
    <middlenames match="both_empty">0.92</middlenames>
    <middlenames match="approx">0.90</middlenames>
    <middlenames match="contains">0.90</middlenames>
    <middlenames match="one_empty">0.89</middlenames>
    <middlenames match="unequal">0.49</middlenames>
  </firstnames>
The matrices used by matchIT SQL/matchIT Hub can similarly be edited to include decimal values.
For non-name data, the matching result is determined as follows:
UK Specific Postcodes
The following rules are applied during comparisons of UK postcode fields:
- Equal (postout & postin present) -> SURE MATCH
- Fuzzy differences (1-char insertion/deletion/replacement, 2-char transposition) where both postout and postin are present -> LIKELY MATCH
- Equal postouts, but both records missing postin -> LIKELY MATCH
- Equal postouts, but one record missing postin -> POSSIBLE MATCH
- Fuzzy differences in the postout sections (1-char insertion/deletion/replacement, 2-char transposition), but both records missing postins -> POSSIBLE MATCH
- Other differences -> NO MATCH
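The rules above can be sketched in Python as follows. This is an illustration only, not matchIT's implementation: the function names, the (postout, postin) tuple representation, and the choice to apply the fuzzy test to the concatenated postcode when both parts are present are all assumptions.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one insertion, deletion,
    replacement, or swap of two adjacent characters."""
    if a == b:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True  # single replacement
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])  # adjacent transposition
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        # Deleting one character from the longer string must give the shorter.
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False


def compare_uk_postcode(a, b):
    """Compare two (postout, postin) tuples; postin may be an empty string.
    Assumes both postouts are non-empty."""
    out_a, in_a = a
    out_b, in_b = b
    if in_a and in_b:
        if (out_a, in_a) == (out_b, in_b):
            return "sure"
        if within_one_edit(out_a + in_a, out_b + in_b):
            return "likely"
        return "none"
    if not in_a and not in_b:
        if out_a == out_b:
            return "likely"
        return "possible" if within_one_edit(out_a, out_b) else "none"
    # Exactly one record is missing its postin.
    return "possible" if out_a == out_b else "none"


print(compare_uk_postcode(("SW1A", "1AA"), ("SW1A", "1AB")))  # likely
print(compare_uk_postcode(("SW1A", "1AA"), ("SW1A", "")))     # possible
```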
Postcode/Zip Comparisons
The following rules are applied during comparisons of postcode/zip fields:
- equal (if 9 digit zip) -> SURE MATCH
- part equal (e.g. 5 digit zips match) -> LIKELY MATCH
- fuzzy differences (1-char insertion/deletion/replacement, 2-char transposition) -> POSSIBLE MATCH
- not equal -> NO MATCH
Address Comparisons
Address scores are calculated as a percentage of the address Sure weight.
For addresses, Likely and Possible act only as thresholds; they are not used to assign a score in the way that they are for names, companies, postcodes etc.
An address that scores less than Possible will score 0.
If mustMatchLocation is enabled (the default), the address score required depends on the postcode score:
- Postcode scores 0 -> address must score at least Likely
- Postcode scores more than 0 but less than Sure -> address must score at least Possible
- Postcode scores Sure -> address can score anything
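A minimal sketch of the address rules above, under stated assumptions: the function names, the idea of passing address similarity as a percentage, and the weight values used in the example (Sure 40, Likely 30, Possible 20 for address; Sure 20 for postcode) are all hypothetical and are not taken from a matchIT configuration. The sketch only reports whether the mustMatchLocation rule is satisfied; what happens to a pair that fails it is not covered here.

```python
def address_score(similarity_percent, sure_weight, possible_weight):
    """Score the address as a percentage of the Sure weight,
    dropping to 0 when the result is below the Possible threshold."""
    score = sure_weight * similarity_percent / 100
    return score if score >= possible_weight else 0


def passes_must_match_location(postcode_score, postcode_sure,
                               addr_score, addr_likely, addr_possible):
    """Check the mustMatchLocation rule for one pair of records."""
    if postcode_score == 0:
        return addr_score >= addr_likely
    if postcode_score < postcode_sure:
        return addr_score >= addr_possible
    return True  # postcode scored Sure, so any address score is accepted


# Hypothetical weights: address Sure=40, Likely=30, Possible=20; postcode Sure=20.
print(address_score(90, 40, 20))                          # 36.0
print(address_score(40, 40, 20))                          # 0
print(passes_must_match_location(0, 20, 25, 30, 20))      # False
```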
Date of Birth Comparisons
The following rules are applied during comparisons of date of birth fields:
- equal -> SURE MATCH
- fuzzy differences (1-char insertion/deletion/replacement, 2-char transposition, transposed day/month) -> LIKELY MATCH
- same year -> POSSIBLE MATCH
- not equal -> NO MATCH
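As an illustration of the date of birth rules, the sketch below compares two dates held as "YYYY-MM-DD" strings. The string format, the function names, and the order in which the rules are tested (fuzzy before same year) are assumptions made for the example; matchIT's own representation may differ.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one insertion, deletion,
    replacement, or swap of two adjacent characters."""
    if a == b:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False


def compare_dob(a, b):
    """Compare two dates of birth given as 'YYYY-MM-DD' strings."""
    if a == b:
        return "sure"
    year_a, month_a, day_a = a.split("-")
    year_b, month_b, day_b = b.split("-")
    day_month_swapped = (year_a == year_b
                         and month_a == day_b and day_a == month_b)
    if within_one_edit(a, b) or day_month_swapped:
        return "likely"
    if year_a == year_b:
        return "possible"
    return "none"


print(compare_dob("1980-03-04", "1980-04-03"))  # likely
print(compare_dob("1980-03-04", "1980-11-23"))  # possible
```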
Email Comparisons
The following rules are applied during comparisons of email fields:
- equal -> SURE MATCH
- fuzzy (1-char insertion/deletion/replacement, 2-char transposition) -> LIKELY MATCH
- not equal -> NO MATCH
Custom Field Comparisons
The following rules are applied during comparisons of custom fields:
- equal -> SURE MATCH
- fuzzy differences (1-char insertion/deletion/replacement, 2-char transposition) -> LIKELY MATCH
- containment (if at least 10 chars) -> POSSIBLE MATCH
- not equal -> NO MATCH
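The custom field rules can be sketched in the same way. The function names are hypothetical, and the reading of "at least 10 chars" as applying to the shorter (contained) value is an assumption, since the text does not say which value the limit refers to.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one insertion, deletion,
    replacement, or swap of two adjacent characters."""
    if a == b:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False


MIN_CONTAINMENT_LENGTH = 10  # from the text; assumed to apply to the shorter value


def compare_custom(a, b):
    """Compare two non-empty custom field values."""
    if a == b:
        return "sure"
    if within_one_edit(a, b):
        return "likely"
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if len(shorter) >= MIN_CONTAINMENT_LENGTH and shorter in longer:
        return "possible"
    return "none"


print(compare_custom("ACC-000123", "ACC-000124"))     # likely
print(compare_custom("ACC-000123-UK", "ACC-000123"))  # possible
```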