The following settings, with the exception of Nationality, are considered advanced and can be omitted from a product designed to be simple to use:
- Matching weights
- Miscellaneous and custom fields
- Nationality
- Test run.
Matching Weights
For full details refer to the Advanced Configuration Guide | Matching Rules.
Weights are set separately for each level you want to match at. For example, for Individual level matching the weights go here in the configuration XML:
<settings>
<advanced>
<matchingRules>
<individualLevel>
<weights>…</weights>
</individualLevel>
</matchingRules>
</advanced>
</settings>
For example, the default weights for the name and address elements for Address level matching are:
<weights>
<name sure="0" likely="0" possible="0" oneEmpty="0" bothEmpty="0" />
<organization sure="0" likely="0" possible="0" oneEmpty="0" bothEmpty="0" />
<address sure="40" likely="30" possible="20" oneEmpty="5" bothEmpty="5" />
<postcode sure="30" likely="20" possible="15" oneEmpty="5" bothEmpty="5" />
</weights>
And for Individual level matching they are:
<weights>
<name sure="60" likely="40" possible="25" oneEmpty="5" bothEmpty="24" />
<organization sure="0" likely="0" possible="0" oneEmpty="0" bothEmpty="0" />
<address sure="40" likely="30" possible="20" oneEmpty="5" bothEmpty="5" />
<postcode sure="30" likely="20" possible="15" oneEmpty="5" bothEmpty="5" />
</weights>
The weights for each component are added together to give the overall score. So, the maximum score for Address level using the weights above is 70 (40 + 30), whilst for Individual level it is 130 (60 + 40 + 30). Changing the weights therefore changes the range of possible scores, which has an impact on the Minimum Threshold Scores.
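The arithmetic above can be checked with a short script. This is a minimal sketch in Python, not part of the product: the helper name max_score is hypothetical, and it assumes the maximum score is reached when every component scores its largest weight (the sure weight in these examples).

```python
import xml.etree.ElementTree as ET

# Default weights copied from the two examples above.
ADDRESS_LEVEL = """
<weights>
<name sure="0" likely="0" possible="0" oneEmpty="0" bothEmpty="0" />
<organization sure="0" likely="0" possible="0" oneEmpty="0" bothEmpty="0" />
<address sure="40" likely="30" possible="20" oneEmpty="5" bothEmpty="5" />
<postcode sure="30" likely="20" possible="15" oneEmpty="5" bothEmpty="5" />
</weights>
"""

INDIVIDUAL_LEVEL = """
<weights>
<name sure="60" likely="40" possible="25" oneEmpty="5" bothEmpty="24" />
<organization sure="0" likely="0" possible="0" oneEmpty="0" bothEmpty="0" />
<address sure="40" likely="30" possible="20" oneEmpty="5" bothEmpty="5" />
<postcode sure="30" likely="20" possible="15" oneEmpty="5" bothEmpty="5" />
</weights>
"""


def max_score(weights_xml):
    """Sum the largest weight of each component (hypothetical helper)."""
    root = ET.fromstring(weights_xml.strip())
    return sum(
        max(int(value) for value in element.attrib.values())
        for element in root
    )


print(max_score(ADDRESS_LEVEL))     # 70
print(max_score(INDIVIDUAL_LEVEL))  # 130
```

Re-running a check like this after editing the weights shows how far the top of the score range has moved, which is a useful prompt to revisit the minimum threshold.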
Miscellaneous and Custom Fields
In addition to matching on name and address elements there are miscellaneous other fields (such as email & telephone) and up to 9 custom fields that can be configured.
To include these other fields in the fuzzy matching scoring mechanism you must set weights for them, as their weights otherwise default to zero.
For example, to add telephone, email and one custom field you would add something like:
<weights>
<telephone sure="15" likely="10" possible="5" oneEmpty="0" bothEmpty="0" />
<email sure="15" likely="10" possible="5" oneEmpty="0" bothEmpty="0" />
<customField1 sure="15" likely="10" possible="5" oneEmpty="0" bothEmpty="0" />
</weights>
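To see what these extra weights do to a score, the sketch below adds up the weights for one hypothetical pair of records. It is an illustration only: pair_score, the WEIGHTS dictionary and the example match levels are assumptions made for this example, and the way the matching engine decides which level (sure, likely, possible, oneEmpty, bothEmpty) each field reaches is not shown here.

```python
# Individual level defaults plus the three extra fields from the example above.
EXTRA = {"sure": 15, "likely": 10, "possible": 5, "oneEmpty": 0, "bothEmpty": 0}
WEIGHTS = {
    "name": {"sure": 60, "likely": 40, "possible": 25, "oneEmpty": 5, "bothEmpty": 24},
    "address": {"sure": 40, "likely": 30, "possible": 20, "oneEmpty": 5, "bothEmpty": 5},
    "postcode": {"sure": 30, "likely": 20, "possible": 15, "oneEmpty": 5, "bothEmpty": 5},
    "telephone": dict(EXTRA),
    "email": dict(EXTRA),
    "customField1": dict(EXTRA),
}


def pair_score(levels, weights):
    """Add the weight of the level reached by each field.

    Fields with no weights configured contribute zero, mirroring the
    default-to-zero behaviour described above.
    """
    return sum(
        weights.get(field, {}).get(level, 0)
        for field, level in levels.items()
    )


# Hypothetical pair: names likely match, address and postcode are sure,
# telephone is sure, and only one of the two records has an email.
levels = {
    "name": "likely",
    "address": "sure",
    "postcode": "sure",
    "telephone": "sure",
    "email": "oneEmpty",
}

# The same pair scored with name and address weights only.
name_and_address_only = {k: WEIGHTS[k] for k in ("name", "address", "postcode")}

print(pair_score(levels, name_and_address_only))  # 110
print(pair_score(levels, WEIGHTS))                # 125
```

In this sketch the matching telephone lifts the pair from 110 to 125, so adding weights for extra fields shifts scores relative to any minimum threshold already in use.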
See Email & Telephone for more details.
Nationality
Choose the nationality that represents the majority of the data being processed. If processing international data from multiple countries, specify “Other”. The nationality setting aids with the parsing of all input data.
<advanced>
<nationality>USA</nationality>
</advanced>
Test Run
The user interface could include an option to restrict the output to a sample of N matching pairs/groups for each score. This allows the user to review a subset of the results easily and adjust the minimum scores.
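One way a host application might implement such a sample is sketched below in Python. The function name sample_per_score and the shape of the match records (a dictionary with ids and score entries) are assumptions for illustration, not part of the API.

```python
from collections import defaultdict


def sample_per_score(matches, n):
    """Keep at most n matches for each distinct score, preserving input order."""
    kept = defaultdict(list)
    for match in matches:
        bucket = kept[match["score"]]
        if len(bucket) < n:
            bucket.append(match)
    return dict(kept)


# Hypothetical matching pairs as they might come back from a run.
matches = [
    {"ids": (1, 2), "score": 95},
    {"ids": (3, 4), "score": 95},
    {"ids": (5, 6), "score": 95},
    {"ids": (7, 8), "score": 80},
]

sample = sample_per_score(matches, n=2)
for score in sorted(sample, reverse=True):
    print(score, [m["ids"] for m in sample[score]])
```

With n=2 the third pair scoring 95 is dropped, so the user reviews a bounded number of examples at every score rather than the full output.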
For users wishing to fine-tune the minimum score, the best approach is to reduce it a little and then spend some time looking at the matches that fall either side of the default minimum score. For example, temporarily lower the threshold to 10-20 points below the default (80 for Individual level) and examine the results scoring below 80 to see if there are any genuine matches. Likewise, examine the matches scoring 80-95 for false matches.
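The two review bands described above can be separated as follows. This is a minimal sketch: review_bands and its parameter names are hypothetical, and the values 80, 20 and 95 are simply the figures quoted in the text.

```python
def review_bands(matches, threshold=80, margin=20, upper=95):
    """Split matches into the two bands worth reviewing by hand.

    below: scores from threshold - margin up to, but not including, threshold
           (look for genuine matches here)
    above: scores from threshold up to and including upper
           (look for false matches here)
    """
    below = [m for m in matches if threshold - margin <= m["score"] < threshold]
    above = [m for m in matches if threshold <= m["score"] <= upper]
    return below, above


# Hypothetical output from a run with the threshold lowered to 60.
results = [{"ids": (i, i + 1), "score": s}
           for i, s in enumerate([62, 75, 79, 80, 88, 97])]

below, above = review_bands(results)
print([m["score"] for m in below])  # [62, 75, 79]
print([m["score"] for m in above])  # [80, 88]
```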
If there are lots of genuine matches that fall below the default minimum score, the user can either simply lower the minimum score, or make a note of the similarities in those records (for example, the email, telephone or date of birth may be the same) and put together one or more exact keys, to run after the fuzzy matching, that will pick them up.
Similarly, for any false matches that achieve the default minimum score or higher, the user can either increase the minimum score, or look for common differences in those false matches (for example, the email, telephone or date of birth may be different) and tighten up the keys accordingly by adding these elements to the existing match keys. For example, tacking email onto a particularly loose key ensures that pairs are no longer picked up by that key unless the email matches.
Email & Telephone
There are a couple of approaches to matching on email & telephone, with different levels of complexity.
Exact matching on email alone/telephone alone
This is the most common end-user scenario and picks up types of matches in addition to those found by name & address matching.
Exact keys[1] are checked before fuzzy keys. So if two records are an exact match, then they won’t be fuzzy matched too.
This is configured by defining email and telephone as exact keys. You can use the raw input data as the match keys. If email is available, add UPPER(TRIM(Email)) as a key. Similarly, if telephone is available, add TRIM(Telephone) as a key, e.g.:
<keys>
<exact>
<key>UPPER(TRIM(Email))</key>
<key>TRIM(Telephone)</key>
</exact>
<fuzzy>
<key>PostOut + Name1</key>
<key>Name1 + AddressKey</key>
</fuzzy>
</keys>
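To illustrate what the two exact keys above achieve, the following Python sketch emulates them outside the product. It is not the API's implementation: email_key, telephone_key and exact_groups are hypothetical names, Python's strip() and upper() stand in for TRIM and UPPER, and skipping blank keys is an assumption made for this example.

```python
from collections import defaultdict


def email_key(record):
    """Python analogue of UPPER(TRIM(Email)); None when the field is blank."""
    value = (record.get("Email") or "").strip().upper()
    return value or None


def telephone_key(record):
    """Python analogue of TRIM(Telephone); None when the field is blank."""
    value = (record.get("Telephone") or "").strip()
    return value or None


def exact_groups(records, key_func):
    """Group record Ids that share an identical key value."""
    groups = defaultdict(list)
    for record in records:
        key = key_func(record)
        if key is not None:  # assumption: blank keys are never matched
            groups[key].append(record["Id"])
    # Only keys shared by two or more records represent matches.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical input records.
records = [
    {"Id": 1, "Email": " jane@example.com ", "Telephone": "555 0100"},
    {"Id": 2, "Email": "JANE@EXAMPLE.COM", "Telephone": ""},
    {"Id": 3, "Email": "", "Telephone": " 555 0100"},
    {"Id": 4, "Email": None, "Telephone": "555 0199"},
]

print(exact_groups(records, email_key))      # {'JANE@EXAMPLE.COM': [1, 2]}
print(exact_groups(records, telephone_key))  # {'555 0100': [1, 3]}
```

Records 1 and 2 match on email despite differences in case and surrounding spaces, while records 1 and 3 match on telephone alone. These are the kinds of matches that name & address keys would not necessarily find.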
Alternatively there are some normalized keys. Username, Domain, and TLD (Top Level Domain) are extracted from Email – but these are still in their raw form so you should still use the match functions, e.g. UPPER(TRIM(Username)).
TelAreaCode and TelLocalNumber are extracted from Telephone and are normalized.
[1] Exact keys were introduced with Syniti Match API version 1.0.3