H3.2 - Integration Guide - Basics – Software Support

Passing Data to matchIT Hub

See the Getting Started | Using the API | Loading Data for full details of how to load data into matchIT Hub.

The API doesn’t support quoted fields, so we recommend picking a delimiter that is guaranteed not to appear within the fields, such as 0x01 or another non-printable character. Because the delimiter is specified as the first character in the data, you can also use a different delimiter with each call to AddData().
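A minimal sketch of assembling one delimited record, assuming each field is preceded by the delimiter (so the delimiter is the first character of the data, as described above). buildRecord is a hypothetical helper, not part of the matchIT Hub API; it rejects any field that contains the delimiter because the API cannot quote it. Check the Loading Data section for the exact record layout before relying on this.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of the matchIT Hub API).
// Assumption: a record is the delimiter followed by a field, repeated,
// which mirrors the layout of the output strings shown in this article.
std::string buildRecord(char delimiter, const std::vector<std::string>& fields) {
    assert(!fields.empty());  // an empty record would carry no delimiter at all
    std::string record;
    for (const std::string& field : fields) {
        // No quoting is available, so a delimiter inside a field would
        // silently shift every following column.
        if (field.find(delimiter) != std::string::npos) {
            throw std::invalid_argument("field contains the delimiter character");
        }
        record += delimiter;
        record += field;
    }
    return record;
}
```

With a non-printable delimiter the call would look like buildRecord('\x01', fields), and the resulting string's c_str() could then be passed to the const char* overload of AddData().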

See the Configuration Guide | Appendix A – Column Types for a list of the key words that can be used in the input columns definition. Pass only the minimum set of fields that the matchIT engine actually requires for producing match keys.

Encoding

matchIT Hub fully supports Unicode encoding. There are three overloaded versions of the AddData() method, which take UTF-8, ANSI and UTF-16 data respectively:

int AddData(int table, const utf8char_t* data, int timeoutInMS);

int AddData(int table, const char* data, int timeoutInMS);

int AddData(int table, const wchar_t* data, int timeoutInMS);

Pass the appropriate data type and the matching overload will handle it accordingly.

For languages such as Chinese that don’t use the Latin1 character set, matchIT Hub transliterates the Chinese characters into Latin1 so it can still generate phonetic keys for Chinese names e.g. 续函 == Xù Hán.

Outputting Results

See the Getting Started | Using the API | Retrieving Results for full details of how to fetch results from matchIT Hub.

One or more output types must be enabled. By default, only matching groups are output. Each output type is a delimited string, with the first character being the delimiter.

MP - matching pairs: two matching records, plus the score for each level they match at;

"|MP|record1|record2|levels|ind|fam|add|bus|cus"

GP - grouped matching pairs: as per a matching pair, plus the group that contains this pair;

"|GP|level|record1|record2|score|group|basescore"

MG - matching groups: a single record, plus the group that contains it;

"|MG|level|group|basescore|table|record"

DE - deduped data: a copy of the input data minus all duplicate records (retaining the master/best record in each group);

"|DE|level|data"

DU - duplicate data: all duplicate records from the input data.

"|DU|level|data"

See the Configuration Guide | Matching Settings for a detailed explanation of the above output types.
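Because every output string starts with its own delimiter, a consumer can split it without knowing the delimiter in advance. The sketch below is a hypothetical helper (splitOutput is not an API function) and the sample values used with it are invented for illustration. The optional maxFields argument is there on the assumption that the trailing data field of DE and DU strings may itself contain the delimiter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits an output string such as "|MG|level|group|..."
// into its fields. The first character is taken as the delimiter.
// maxFields == 0 means "no limit"; otherwise the last field keeps the
// unsplit remainder of the line.
std::vector<std::string> splitOutput(const std::string& line, std::size_t maxFields = 0) {
    std::vector<std::string> fields;
    if (line.empty()) {
        return fields;
    }
    const char delimiter = line[0];
    std::size_t start = 1;
    while (true) {
        // Stop splitting once only one slot is left.
        if (maxFields != 0 && fields.size() + 1 == maxFields) {
            fields.push_back(line.substr(start));
            break;
        }
        const std::size_t end = line.find(delimiter, start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}
```

The first field returned is the output type (MP, GP, MG, DE or DU), which tells the caller how to interpret the remaining fields.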

As an illustration of the control you can offer in your user interface (GUI), your application could present the end user with output options such as:

Output deduplicated records: unique records and a master record from each group of duplicates are kept; duplicates can be sent to a secondary output or discarded.
Output unique records only: grouped duplicates can be sent to a secondary output or discarded.
Output all records with duplicates flagged.

The first two options can be implemented using the deduped data and matching groups output types; the third option can be implemented using the deduped data and duplicate data output types.

Basic User Settings

Ideally, core options are controlled via the GUI so that end users don’t have to edit an XML configuration file for basic functionality; most of the time (likely over 90% of cases) your application should generate the XML automatically for the greatest ease of use. However, we don’t recommend presenting advanced configuration options in the GUI: these can be left at their defaults for the vast majority of use cases, and over-complicating the GUI detracts from ease of use. It’s also best to steer end users away from changing the matching weights, as tuning them successfully requires training and time to experiment methodically.

We recommend that the following basic settings be built into the GUI:

Matching level
Minimum threshold score
Match keys.

What Makes Us Special explains how matchIT Hub’s matching engine works. Reading it will help you understand the context of the matching advice in the following sections of this set of articles.

Matching Levels

There are several different standard levels of matching, as follows:

Individual e.g. matching Bill Smith and William Smith at the same address.
Family e.g. matching Bill Smith and Sheila Smith at the same address.
Household (also known as Address) e.g. matching Bill Smith and Susan Jones at the same address.
Company e.g. matching Bill Smith at J Sainsbury PLC and Susan Jones at Sainsburys at the same address.
Name only e.g. matching Bill Smith and William Smith at the same or different addresses.
Company only e.g. matching ABC Life Assurance and A.B.C. Life at the same or different addresses.

You can also set up custom matching rules, for example to match on email address or telephone number.

You can let the user decide which level(s) to match at, for example via a set of checkboxes in your GUI.

Alternatively, you can choose the levels automatically, based on whether the input layout contains people's names, organization names and so on, as follows:

Fields available	Appropriate matching levels
Name fields	Name Only Level
Address	Address Level
Address + Name fields	Individual Level (or Family Level)
Organization	Company Only Level
Address + Organization	Business Level
Address + Organization + Name Fields	Individual Level

You could also use the automatic selection of level to set default checkboxes that the user can then change.
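The table above can be expressed as a small lookup. This is only a sketch: InputLayout and defaultLevel are hypothetical names, and the function returns an empty string for combinations the table does not cover (no usable fields, or names plus organization without an address) rather than guessing a level.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of which field groups the input layout contains.
struct InputLayout {
    bool hasName;
    bool hasAddress;
    bool hasOrganization;
};

// Returns the matching level suggested by the table, or "" when the
// combination is not listed there.
std::string defaultLevel(const InputLayout& in) {
    if (in.hasAddress && in.hasOrganization && in.hasName) return "Individual";
    if (in.hasAddress && in.hasOrganization) return "Business";
    if (in.hasAddress && in.hasName) return "Individual";  // Family is the alternative
    if (in.hasAddress) return "Address";
    if (in.hasOrganization && !in.hasName) return "Company Only";
    if (in.hasName && !in.hasOrganization) return "Name Only";
    return "";  // not covered by the table: let the user choose
}
```

The returned level can then be used to pre-tick the corresponding checkbox while still leaving the final choice to the user.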

See the Configuration Guide | Matching Settings for more guidelines on choosing matching levels.

Minimum Threshold Scores

Minimum threshold score settings could be presented to the end user as Loose, Medium or Tight. With the standard matching weights, each matching level has a different minimum threshold score:

Matching level	Maximum score	Minimum threshold score
Individual Level	120	80
Name Only Level	60	30
Family Level	120	80
Address Level	60	40
Business Level	120	80
Company Only Level	60	30
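One way to use these figures in a GUI is to keep them in a lookup and clamp whatever the user selects to the valid range for the chosen level. The sketch below assumes the table's minimum threshold is treated as the floor for a user-selectable score, which is a design choice rather than something matchIT Hub enforces; standardScores and clampThreshold are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

struct ScoreRange {
    int maxScore;
    int minThreshold;
};

// Values copied from the table above (standard matching weights).
const std::map<std::string, ScoreRange>& standardScores() {
    static const std::map<std::string, ScoreRange> table = {
        {"Individual", {120, 80}},
        {"Name Only", {60, 30}},
        {"Family", {120, 80}},
        {"Address", {60, 40}},
        {"Business", {120, 80}},
        {"Company Only", {60, 30}},
    };
    return table;
}

// Clamps a requested threshold into [minThreshold, maxScore] for the level.
// Throws std::out_of_range if the level name is unknown.
int clampThreshold(const std::string& level, int requested) {
    const ScoreRange& range = standardScores().at(level);
    return std::max(range.minThreshold, std::min(requested, range.maxScore));
}
```

A Loose, Medium or Tight selector could then map to three scores inside that range for each level; the specific scores are left to the integrator.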

Match Keys

Refer to the matchIT Hub Configuration Guide|Appendix B - Match Keys for a list of key words that represent normalized (parsed, phoneticized, standardized) forms of the input data. These are ideal for use as match keys.

Additionally, any of the input column types can be used as match keys but, since this is the input data in its raw form, subtle differences in the data will prevent matches. matchIT Hub Configuration Guide|Appendix C - Match Key Functions lists functions that can be applied to match keys; these offer some basic normalization, such as UPPER() and TRIM().
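To see why raw columns make fragile match keys, the sketch below compares two values before and after a simple upper-case and trim step. trimCopy, upperCopy and sameKey are local, ASCII-only stand-ins written for this illustration; they are not matchIT Hub's UPPER() and TRIM() implementations.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Removes leading and trailing spaces, tabs and line breaks.
std::string trimCopy(const std::string& s) {
    const char* whitespace = " \t\r\n";
    const std::size_t first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return "";  // empty or all whitespace
    }
    const std::size_t last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Upper-cases ASCII letters only.
std::string upperCopy(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

// Two raw values produce the same key only after normalization.
bool sameKey(const std::string& a, const std::string& b) {
    return upperCopy(trimCopy(a)) == upperCopy(trimCopy(b));
}
```

Here " smith" and "SMITH  " differ as raw strings but yield the same key once normalized, whereas "Smith" and "Smyth" still differ, which is where the phonetic keys in Appendix B come in.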

The selection of match keys depends on the data (residential or business, US, UK or international) and the volumes.

Loose keys are fine for low-volume data, but tight keys are required for high-volume data or performance will degrade significantly. “High volume” typically means more than about 5 million records, but it depends on how concentrated the data is, e.g. geographically if address is part of the matching criteria.

You can set appropriate defaults for matching on name (individual or organization) and address from the following tables:

Medium Data Volume Match Keys

The US and other data with high-level postal codes

Matching level	Default Keys
Person level	PostOut + PhoneticLastName
	PhoneticLastName + PhoneticStreet
	AddressKey + Premise
Family level	Same as Person level
Address level	PostOut + PhoneticLastName[1]
	PostOut + PhoneticStreet + Premise
	AddressKey + Premise
Organization level	PostOut + PhoneticOrganizationName1
	PhoneticOrganizationName1 + PhoneticStreet
	AddressKey + Premise

The UK and other data with low-level postal codes

Matching level	Default Keys
Person level	PostOut + PhoneticLastName
	PhoneticLastName + PhoneticStreet
	Postcode
Family level	Same as Person level
Address level	PostOut + PhoneticLastName
	Postcode + PhoneticStreet + Premise
	Postcode
Organization level	PostOut + PhoneticOrganizationName1
	PhoneticOrganizationName1 + PhoneticStreet
	Postcode

High Data Volume Match Keys

The US and other data with high-level postal codes

Matching level	Default Keys
Person level	AddressKey + PhoneticLastName
	PostOut + PhoneticStreet + Premise[2]
	Postcode[3] + PhoneticLastName
Family level	Same as Person level
Address level	AddressKey + PhoneticLastName
	PostOut + PhoneticStreet + Premise[2]
	Postcode[3] + Premise[2][4]
Organization level	AddressKey + PhoneticOrganizationName1
	Postcode + PhoneticOrganizationName1 + PhoneticOrganizationName2
	Postcode + PhoneticStreet + Premise[2]

The UK and other data with low-level postal codes

Matching level	Default Keys
Person level	Postcode + PhoneticLastName
	AddressKey + PhoneticLastName
	Postcode + Premise[4]
Family level	Same as Person level
Address level	AddressKey + PhoneticLastName
	Postcode + Premise[4]
Organization level	Postcode + PhoneticOrganizationName1
	AddressKey + PhoneticOrganizationName1
	Postcode + Premise[4]

Note that even when matching at address level, we recommend that one of the match keys includes a name (individual or organization) if the data contains names. The match key is used only to identify candidate pairs of records for comparison, not to determine whether the records match, so combining name with a smaller part of the address or postal code identifies additional candidates for comparison.
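The role of match keys as candidate selectors can be sketched as follows. This is a generic illustration of the idea, not matchIT Hub's internal algorithm: each record carries the values of several match keys, and any two records that share a value for the same key become a candidate pair that would then be scored. candidatePairs and the key values used with it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::size_t> RecordPair;

// keysPerRecord[r][k] is the value of match key k for record r.
// Returns every pair of record indexes that agree on at least one key.
std::set<RecordPair> candidatePairs(
        const std::vector<std::vector<std::string> >& keysPerRecord) {
    // Group record indexes by (key number, key value).
    std::map<std::pair<std::size_t, std::string>, std::vector<std::size_t> > blocks;
    for (std::size_t rec = 0; rec < keysPerRecord.size(); ++rec) {
        for (std::size_t k = 0; k < keysPerRecord[rec].size(); ++k) {
            const std::string& value = keysPerRecord[rec][k];
            if (!value.empty()) {  // blank keys never select candidates
                blocks[std::make_pair(k, value)].push_back(rec);
            }
        }
    }
    // Every pair inside a group is a candidate; the set removes repeats
    // when two records share more than one key.
    std::set<RecordPair> pairs;
    for (const auto& block : blocks) {
        const std::vector<std::size_t>& members = block.second;
        for (std::size_t i = 0; i < members.size(); ++i) {
            for (std::size_t j = i + 1; j < members.size(); ++j) {
                pairs.insert(std::make_pair(members[i], members[j]));
            }
        }
    }
    return pairs;
}
```

Adding a second key that includes name widens the set of candidates: records that disagree on the address-based key can still be brought together for comparison by the name-based one.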

[1] Omit this key if last name is not available

[2] Where Premise is non-blank

[3] Where Postcode is at street or lower level

[4] Or Delivery Point code/suffix if available
