Passing Data to matchIT Hub
See the Getting Started | Using the API | Loading Data for full details of how to load data into matchIT Hub.
The API doesn’t support quoted fields, so we recommend picking a delimiter that is guaranteed not to appear within the fields, such as 0x01 or another non-printable character. Because the delimiter is specified as the first character of the data, you can use a different delimiter with each call to AddData().
See the Configuration Guide | Appendix A – Column Types for a list of the keywords that can be used in the input columns definition. Pass the matchIT engine only the minimum set of fields that are actually required for producing match keys.
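The delimiter convention described above can be sketched as a small helper that assembles one record before it is handed to AddData(). This is a minimal sketch, not part of the matchIT Hub API: the name `buildRecord` is hypothetical, and the layout (delimiter first, then each field preceded by the delimiter) is an assumption based on the statement that the delimiter is the first character of the data.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a matchIT Hub function): builds one delimited
// record. Assumed layout: the delimiter comes first and then precedes
// every field, e.g. "|Bill|Smith" for the delimiter '|'.
std::string buildRecord(const std::vector<std::string>& fields,
                        char delimiter = '\x01') {
    std::string record;
    for (const std::string& field : fields) {
        // Quoted fields are not supported, so a delimiter inside a field
        // would shift the columns; reject such a field up front.
        if (field.find(delimiter) != std::string::npos) {
            throw std::invalid_argument("field contains the delimiter");
        }
        record += delimiter;
        record += field;
    }
    return record;
}
```

The resulting string's `c_str()` could then be passed to the `const char*` overload of AddData(). Rejecting fields that contain the delimiter is a design choice for this sketch; an application might instead pick a different delimiter for that call.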
Encoding
matchIT Hub fully supports Unicode encoding. There are three overloaded versions of the AddData() method, which take UTF-8, ANSI and UTF-16 data respectively:
int AddData(int table, const utf8char_t* data, int timeoutInMS);
int AddData(int table, const char* data, int timeoutInMS);
int AddData(int table, const wchar_t* data, int timeoutInMS);
…just pass the appropriate data type and it will be handled accordingly.
For languages such as Chinese that don’t use the Latin1 character set, matchIT Hub transliterates the characters into Latin1 so that it can still generate phonetic keys for names, e.g. 续函 is transliterated as Xù Hán.
Outputting Results
See the Getting Started | Using the API | Retrieving Results for full details of how to fetch results from matchIT Hub.
One or more output types must be enabled. By default, only matching groups are output. Each output type is a delimited string, with the first character being the delimiter.
- MP - matching pairs: two matching records, plus the score for each level they match at;
- "|MP|record1|record2|levels|ind|fam|add|bus|cus"
- GP - grouped matching pairs: as per a matching pair, plus the group that contains this pair;
- "|GP|level|record1|record2|score|group|basescore"
- MG - matching groups: a single record, plus the group that contains it;
- "|MG|level|group|basescore|table|record"
- DE - deduped data: a copy of the input data minus all duplicate records (retaining the master/best record in each group);
- "|DE|level|data"
- DU - duplicate data: all duplicate records from the input data.
- "|DU|level|data"
See the Configuration Guide | Matching Settings for a detailed explanation of the above output types.
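Since each output string carries its own delimiter as the first character, a result can be split without knowing the delimiter in advance. The following is a minimal sketch; `splitOutput` is a hypothetical helper name, not a matchIT Hub function, and the sample values in the usage note are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits one output string into its fields.
// The first character is taken as the delimiter and the rest of the
// string is split on it, so fields[0] is the output type (MP, GP, ...).
std::vector<std::string> splitOutput(const std::string& line) {
    std::vector<std::string> fields;
    if (line.empty()) {
        return fields;
    }
    const char delimiter = line[0];
    std::size_t start = 1;
    while (true) {
        const std::size_t end = line.find(delimiter, start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}
```

For example, splitting an illustrative matching-group string such as "|MG|1|7|95|0|42" gives six fields, with "MG" first. For the DE and DU types, the trailing data portion may itself contain the delimiter; this sketch does not treat it specially.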
As an illustration of the control you can offer in your graphical user interface (GUI), your application could present the end user with output options such as:
- Output deduplicated records
- Unique records and a master record from each group of duplicates are kept
- Duplicates could be sent to a secondary output or discarded
- Grouped duplicates could be sent to a secondary output or discarded
The first two options can be implemented using the deduped data and matching groups output types. The third option can be implemented using the deduped data and duplicate data output types.
Basic User Settings
Ideally, core options are controlled via the GUI, so that end users don’t have to edit an XML configuration file for basic levels of functionality. However, we don’t recommend presenting advanced configuration options in the GUI: these options can be left at their defaults for the vast majority of use cases, and over-complicating the GUI detracts from ease of use. It’s also best to steer end users away from changing the matching weights, as tuning these successfully requires training and time to experiment methodically. In most cases (probably more than 90% of the time), your application should auto-generate the XML for greatest ease of use.
We recommend that the following basic settings be built into the GUI:
- Matching level
- Minimum threshold score
- Match keys.
For an explanation of how matchIT Hub’s matching engine works, see What Makes Us Special. It will help you understand the context of the matching advice in the following sections of this set of articles.
Matching Levels
There are several different standard levels of matching, as follows:
- Individual e.g. matching Bill Smith and William Smith at the same address.
- Family e.g. matching Bill Smith and Sheila Smith at the same address.
- Household (also known as Address) e.g. matching Bill Smith and Susan Jones at the same address.
- Company e.g. matching Bill Smith at J Sainsbury PLC and Susan Jones at Sainsburys at the same address.
- Name only e.g. matching Bill Smith and William Smith at the same or different addresses.
- Company only e.g. matching ABC Life Assurance and A.B.C. Life at the same or different addresses.
You can also set up custom matching rules, for example matching on email address or telephone number.
You can let the user decide which levels to match at, for example via a set of checkboxes in your GUI.
Alternatively, you can choose the levels automatically, based on whether the input layout contains people's names, organization names and so on, as follows:
Fields available | Appropriate matching levels
Name fields | Name Only Level
Address | Address Level
Address + Name fields | Individual Level (or Family Level)
Organization | Company Only Level
Address + Organization | Business Level
Address + Organization + Name fields | Individual Level
Or you could use the automatic selection of level to set default checkboxes that the user can then change.
See the Configuration Guide | Matching Settings for more guidelines on choosing matching levels.
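The automatic selection described in the table above can be sketched as a simple lookup. This is only an illustration: `InputFields` and `defaultMatchingLevel` are hypothetical names, the returned strings are the level names from the table rather than configuration keywords, and combinations the table doesn't list (such as names plus organization without an address) yield no default.

```cpp
#include <optional>
#include <string>

// Hypothetical description of what the input layout contains.
struct InputFields {
    bool name = false;
    bool address = false;
    bool organization = false;
};

// Returns the default level from the table above, or std::nullopt for
// combinations that the table does not cover.
std::optional<std::string> defaultMatchingLevel(const InputFields& f) {
    if (f.address && f.organization && f.name) {
        return std::string("Individual Level");
    }
    if (f.address && f.organization) {
        return std::string("Business Level");
    }
    if (f.address && f.name) {
        return std::string("Individual Level");  // Family Level is the alternative
    }
    if (f.address) {
        return std::string("Address Level");
    }
    if (f.organization && !f.name) {
        return std::string("Company Only Level");
    }
    if (f.name && !f.organization) {
        return std::string("Name Only Level");
    }
    return std::nullopt;  // not listed in the table
}
```

The result could be used to pre-tick the corresponding checkbox in the GUI, leaving the user free to change it.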
Minimum Threshold Scores
Minimum threshold score settings could be portrayed to the end user as: Loose, Medium or Tight. There will be a different minimum threshold score per level, when using the standard matching weights.
Matching level | Maximum score | Minimum threshold score
Individual Level | 120 | 80
Name Only Level | 60 | 30
Family Level | 120 | 80
Address Level | 60 | 40
Business Level | 120 | 80
Company Only Level | 60 | 30
Match Keys
Refer to the matchIT Hub Configuration Guide | Appendix B - Match Keys for a list of keywords that represent normalized (parsed, phoneticized, standardized) forms of the input data. These are ideal for use as match keys.
Additionally, any of the input column types can be used as match keys but, since this is the input data in its raw form, subtle differences in the data will prevent matches. The matchIT Hub Configuration Guide | Appendix C - Match Key Functions lists functions that can be applied to match keys; these offer some basic normalization, such as UPPER() and TRIM().
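To see why raw input values make poor match keys, the sketch below imitates the effect of upper-casing and trimming in plain C++. It is not matchIT Hub's implementation of UPPER() or TRIM(), whose exact behaviour is defined in the Configuration Guide; the functions `upper` and `trim` here are illustrative stand-ins that handle ASCII only.

```cpp
#include <cctype>
#include <string>

// Illustrative stand-in: strips leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* whitespace = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return "";  // empty or all whitespace
    }
    const std::string::size_type last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Illustrative stand-in: converts ASCII letters to upper case.
std::string upper(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}
```

With these, the raw values "smith " and " Smith" compare as different strings, while `upper(trim(...))` maps both to "SMITH", so the two records would share a key and become candidates for comparison.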
The selection of match keys depends on the data (residential or business, US, UK or international) and the volumes.
Loose keys are fine for low-volume data, but tight keys are required for high-volume data; otherwise performance will degrade significantly. “High volume” typically means more than about 5 million records, but this depends on how concentrated the data is, e.g. geographical concentration if address is part of the matching criteria.
You can set appropriate defaults for matching on name (individual or organization) and address from the following tables:
Medium Data Volume Match Keys
The US and other data with high-level postal codes
Matching level | Default Keys
Person level | PostOut + PhoneticLastName; PhoneticLastName + PhoneticStreet; AddressKey + Premise
Family level | Same as Person level
Address level | PostOut + PhoneticLastName[1]; PostOut + PhoneticStreet + Premise; AddressKey + Premise
Organization level | PostOut + PhoneticOrganizationName1; PhoneticOrganizationName1 + PhoneticStreet; AddressKey + Premise
The UK and other data with low-level postal codes
Matching level | Default Keys
Person level | PostOut + PhoneticLastName; PhoneticLastName + PhoneticStreet; Postcode
Family level | Same as Person level
Address level | PostOut + PhoneticLastName; Postcode + PhoneticStreet + Premise; Postcode
Organization level | PostOut + PhoneticOrganizationName1; PhoneticOrganizationName1 + PhoneticStreet; Postcode
High Data Volume Match Keys
The US and other data with high-level postal codes
Matching level | Default Keys
Person level | AddressKey + PhoneticLastName; PostOut + PhoneticStreet + Premise[2]; Postcode[3] + PhoneticLastName
Family level | Same as Person level
Address level | AddressKey + PhoneticLastName; PostOut + PhoneticStreet + Premise[2]; Postcode[3] + Premise[2][4]
Organization level | AddressKey + PhoneticOrganizationName1; Postcode + PhoneticOrganizationName1 + PhoneticOrganizationName2; Postcode + PhoneticStreet + Premise[2]
The UK and other data with low-level postal codes
Matching level | Default Keys
Person level | Postcode + PhoneticLastName; AddressKey + PhoneticLastName; Postcode + Premise[4]
Family level | Same as Person level
Address level | AddressKey + PhoneticLastName; Postcode + Premise[4]
Organization level | Postcode + PhoneticOrganizationName1; AddressKey + PhoneticOrganizationName1; Postcode + Premise[4]
Note that even when matching at address level, if the data includes names, we recommend that one of the match keys includes name (individual or organization): this is because the match key is used just to identify candidate pairs of records for comparison, not to determine whether the records match. By using name in conjunction with a smaller part of the address or postal code data, additional candidates can be identified for comparison.
[1] Omit this key if last name is not available
[2] Where Premise is non-blank
[3] Where Postcode is at street or lower level
[4] Or Delivery Point code/suffix if available
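The default keys in the tables above can be wired into the application as a lookup driven by data volume and postal code granularity. The sketch below covers only the Person level and is illustrative: the names `Volume`, `PostalCodes`, `volumeFor` and `defaultPersonKeys` are hypothetical, the 5 million cut-off is the rough guideline from the text rather than a hard limit, and the strings are the table entries as displayed, not configuration syntax.

```cpp
#include <string>
#include <vector>

enum class Volume { Medium, High };
// HighLevel: US-style postal codes; LowLevel: UK-style postal codes.
enum class PostalCodes { HighLevel, LowLevel };

// Picks the table to use from the record count. The default threshold
// follows the "about 5 million records" guideline and is adjustable.
Volume volumeFor(long long recordCount,
                 long long highVolumeThreshold = 5000000) {
    return recordCount > highVolumeThreshold ? Volume::High : Volume::Medium;
}

// Person-level default keys, copied from the four tables above.
std::vector<std::string> defaultPersonKeys(Volume volume, PostalCodes codes) {
    if (volume == Volume::Medium) {
        if (codes == PostalCodes::HighLevel) {
            return {"PostOut + PhoneticLastName",
                    "PhoneticLastName + PhoneticStreet",
                    "AddressKey + Premise"};
        }
        return {"PostOut + PhoneticLastName",
                "PhoneticLastName + PhoneticStreet",
                "Postcode"};
    }
    if (codes == PostalCodes::HighLevel) {
        return {"AddressKey + PhoneticLastName",
                "PostOut + PhoneticStreet + Premise",  // where Premise is non-blank
                "Postcode + PhoneticLastName"};        // Postcode at street level or lower
    }
    return {"Postcode + PhoneticLastName",
            "AddressKey + PhoneticLastName",
            "Postcode + Premise"};  // or Delivery Point code/suffix if available
}
```

The same shape extends to the Address and Organization levels by adding the corresponding rows; the Family level reuses the Person-level keys.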