<matching>
<keys>...</keys>
<levels>...</levels>
<outputs>...</outputs>
<advanced>...</advanced>
<maximumClusterSize>...</maximumClusterSize>
<allowBlankKeys>...</allowBlankKeys>
<dynamicKeys>...</dynamicKeys>
</matching>
Matching settings are required only if the processing mode has been set to Matching. If these have not been specified in the configuration, then these defaults will be used:
- Smart Settings will automatically select keys and matching level appropriate for the mapped input column types;
- matching groups will be output;
- only unique refs will be output.
Match Keys
<keys>
<key>PostOut + NameKey</key>
<key>PhoneticLastName + AddressKey</key>
</keys>
Match keys determine how records are clustered. Clusters group potentially matching records: when a new record is added to an existing cluster (containing one or more existing records), it is compared to each of those existing records.
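The clustering behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not matchIT Hub's implementation: the function name `candidate_pairs` and the sample key values are hypothetical, key values are assumed to be precomputed strings, and blank-key handling is ignored.

```python
from collections import defaultdict

def candidate_pairs(records: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Return the pairs of unique refs that would be compared.

    `records` maps a unique ref to its list of key values, one value per
    configured match key (hypothetical, precomputed strings).
    """
    clusters = defaultdict(list)  # (key index, key value) -> refs already in the cluster
    pairs = set()
    for ref, key_values in records.items():
        for key_index, value in enumerate(key_values):
            cluster = clusters[(key_index, value)]
            # The new record is compared to every record already in the cluster.
            for existing in cluster:
                pairs.add((existing, ref))
            cluster.append(ref)
    return pairs

# Hypothetical key values for two match keys.
records = {
    "A": ["SW1 SMITH", "SM0 HIGHST"],
    "B": ["SW1 SMITH", "XX"],
    "C": ["N1 JONES", "SM0 HIGHST"],
}
pairs = candidate_pairs(records)
```

In this sketch, records B and C are never compared with each other because they share no key value, which is why the choice of match keys matters.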
Functions can be applied to any key, and these can even be combined. For example, LEFT(Company,10) and UPPER(RTRIM(Address1)+PUNTRIM(Postcode)).
Refer to Appendix B - Match Keys and Appendix C - Match Key Functions for further details of the keys and functions that can be applied. In addition to the standard keys, fields from the added data can also be used within match keys, for example Address1 or Company; refer to Appendix A - Column Types for a list of input column types.
Optional attributes on the key tag are ruleSets (see Matching Rule Sets) and methods (see Matching Methods).
Furthermore, one or more exact match keys can be specified; for example:
<keys>
<exact>
<key>NormalizedName + NormalizedOrganization + Premise + PhoneticStreet + PhoneticTown + PostOut + PostIn</key>
<key>NormalizedName + UPPER(Address1 + Address2 + City + State + ZIP)</key>
<key skipFuzzy="false">Telephone + Email</key>
</exact>
<fuzzy>
<key>PostOut + NameKey</key>
<key>PhoneticLastName + AddressKey</key>
</fuzzy>
</keys>
Exact matching can be used to improve the performance of matchIT Hub, because a record that's added to an exact cluster is automatically considered a match with all records already in that cluster, without needing to compare those records with the new record.
Exact match keys can also be used to find matches that fuzzy clustering would not find, or to group records that do not need to be fuzzy compared.
With exact keys there is an option to skip fuzzy clustering. skipFuzzy="true" (default) means that only one representative record from each exact key cluster - the first record added - is considered for fuzzy clustering.
There are two usages for exact keys:
- Define exact keys that combine most (or all) significant fields, such that if two records match on an exact key they must be identical. In this case the default skipFuzzy="true" is recommended (unless Bridging Prevention is enabled) because if the records are identical there is no point in fuzzy comparing more than one of them.
- Define exact keys that represent one or two identifying fields, such as: email, telephone number, reference number, social security number (US), or national insurance number (UK). For example, using an exact key of NormalizedName + Email would allow an individual to be tracked even if their postal address had changed. For this usage it is recommended to set skipFuzzy="false": since other fields in the record may be different, it is worth fuzzy comparing all of the records to avoid missing matches.
Note that exact matching, if enabled, could significantly increase memory usage. Exact matching is best used on data that has a high level of duplication.
Optional Keys
Individual keys can be marked as 'optional' by enclosing them within square brackets; for example:
<keys>
<key>AddressKey + [NameKey]</key>
<key>PhoneticLastName + PhoneticStreet</key>
</keys>
Marking a key as optional means that all non-optional keys within the match key cannot be blank. Using a key of AddressKey + NameKey, records would be clustered and compared if they had the same values for both AddressKey and NameKey, including blanks (except where both AddressKey and NameKey are blank). Marking NameKey as optional means that AddressKey values cannot be blank, whilst NameKey values can be blank; this could be useful, for instance, when comparing records that had addresses but not always a person's name, such as when matching simultaneously at individual and address levels.
Note that if any optional key has been specified, then the keys within all other match keys cannot be blank; so with the second match key (PhoneticLastName + PhoneticStreet) records will only be clustered and compared if they have both the same values for PhoneticLastName and PhoneticStreet; records containing any blank values are never clustered and compared.
Optional keys can be used within fuzzy and exact match keys.
<allowBlankKeys>true</allowBlankKeys>
This setting is only applied if optional keys are not in use. If any optional key has been specified, using the square bracket notation, then this setting is ignored.
By default, blank key values are permitted within match keys. For example, a match key of PostOut + Name1 allows either PostOut or Name1 values to be blank (but not both). Disabling this setting ensures that records are only clustered and compared if all their key values aren't blank (for example, records with missing names and/or addresses would be skipped).
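The blank-key rules above (optional keys and allowBlankKeys) can be summarised as a small decision function. This is a sketch of the rules as described in this section, not the product's code; the function name `is_clustered` and its parameters are hypothetical.

```python
def is_clustered(values: dict[str, str], optional: set[str],
                 any_optional_in_config: bool, allow_blank_keys: bool = True) -> bool:
    """Decide whether a record takes part in clustering for one match key.

    values: key name -> key value for this record (hypothetical names).
    optional: key names written in [square brackets] within this match key.
    any_optional_in_config: True if any match key uses an optional key.
    """
    blanks = {name for name, value in values.items() if not value.strip()}
    if any_optional_in_config:
        # allowBlankKeys is ignored: only optional keys may be blank.
        return blanks <= optional
    if allow_blank_keys:
        # Blank values are permitted, but not for every key at once.
        return len(blanks) < len(values)
    # allowBlankKeys disabled: no key value may be blank.
    return not blanks
```

For example, with the match key AddressKey + [NameKey], a record with a blank NameKey is still clustered, whereas a record with a blank AddressKey is not.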
Matching Levels
<levels>
<individual minimumScore="80" enabled="true" />
<nameOnly minimumScore="40" enabled="" />
<family minimumScore="80" enabled="" />
<address minimumScore="55" enabled="" />
<business minimumScore="80" enabled="" />
<companyOnly minimumScore="40" enabled="" />
<custom minimumScore="80" enabled="" />
</levels>
One or more matching levels must be enabled.
Note that Individual and NameOnly are mutually exclusive (they can't both be enabled), as are Business and CompanyOnly. NameOnly actually uses the Individual level, but with all address and postcode weights set to 0. Similarly, CompanyOnly uses the Business level, but with all address and postcode weights set to 0.
Here's a guideline as to when the various matching levels should be used:
- Individual: For matching individual contacts, considering available data including name, address, and postcode;
- NameOnly: For only matching contacts' names;
- Family: For matching family members at a property, considering available data including last name, address, and postcode;
- Address: For matching properties only (addresses and postcodes; names are not used);
- Business: For matching individual companies, considering available data including company name, address, and postcode;
- CompanyOnly: For only matching companies' names;
- Custom: For any other type of matching that cannot be achieved with any of the above levels.
Note that the levels are highly flexible. For example, the Individual level can be configured to match company names too. Please contact Syniti support for further details.
Outputs
<outputs>
<types>
<matchingPairs enabled="" />
<groupedMatchingPairs enabled="" />
<matchingGroups enabled="true" />
<dedupedData enabled="" />
<duplicateData enabled="" />
<matchedData enabled="" />
</types>
<options>...</options>
</outputs>
One or more output types must be enabled. By default, only matching groups are output.
Each output type is a delimited string, with the first character being the delimiter (note that this will not always be a pipe character). A two-letter code identifies the type of each result:
- Matching Pairs: "|MP|record1|record2|levels|ind|fam|add|bus|cus(|compareResults)|key"
- Grouped Matching Pairs: "|GP|level|record1|record2|score|group(|baseScore)(|highestScore)|key"
- Matching Groups: "|MG|level|group|baseScore(|highestScore)(|duplicates)|table|record"
- Deduped Data: "|DE|level|data"
- Duplicate Data: "|DU|level|data"
- Matched Data: "|MD|level|group|baseScore(|highestScore)|duplicates|record"
- Matched Data (overlap): "|MD|level|bestMatch|score|record"
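Because the delimiter is whatever character appears first, a consumer of these results should read it from the line rather than hard-code a pipe. The following Python sketch shows one way to do this; the function name `split_result` is hypothetical, and the sketch assumes the delimiter does not also occur inside field values.

```python
def split_result(line: str) -> tuple[str, list[str]]:
    """Split one result line into its two-letter type code and the remaining fields."""
    if len(line) < 2:
        raise ValueError("result line is too short to hold a delimiter and a type code")
    delimiter = line[0]                 # the first character is the delimiter
    fields = line[1:].split(delimiter)  # assumes the delimiter never occurs inside a value
    return fields[0], fields[1:]

code, fields = split_result("|MP|AA034531|AA037947|5|116|0|56|0|0")
```

The same function works unchanged when the results use a different delimiter, for example ";DE;I;some data".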
Matching Pairs
Outputs matching pairs of records. This is the simplest type of matching output; results are available immediately because grouping is not performed.
Output format: "|MP|record1|record2|levels|ind|fam|add|bus|cus(|name1|name2|name3|orgName1|orgName2|orgName3|acronymMatch)|key"
- "record1" and "record2" refer to the two matching records. When overlapping two data sources, they refer to data from table 1 and 2 respectively. The two records can be output as unique refs only or can be a copy of the input data (see Output Options below).
- "levels" indicates which matching level(s) the two records match at. It's an integer, and is the sum of one or more of the following values (for example, if levels is 5 then the two records matched at both individual level and address level):
- 1: individual level;
- 2: family level;
- 4: address level;
- 8: business level;
- 16: custom level.
- "ind|fam|add|bus|cus" indicate the scores for each matching level. If outputComponentScores is enabled (see Output Options, below), then each total score (ind, fam, add, bus, cus) is immediately followed by the scores for all mapped components. This is disabled by default, in which case only the total score is listed for each matching level.
- "name1|name2|name3|orgName1|orgName2|orgName3|acronymMatch" indicate the results of comparing the pair of records. Only present if outputCompareResults is enabled. "name1|name2|name3" are the Name Matching Matrix indices for last names, first names, and middle names respectively. "orgName1|orgName2|orgName3" are the Organization Matching Matrix indices. "acronymMatch" is 1 if the company names were matched using acronym matching, 0 otherwise.
- "key" is a 0-based index of the match key used to find the match (e.g. if 3 match keys are defined, the index will be 0-2). Keys are numbered in the order they are defined in the configuration settings.
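The "levels" value is a sum of the flag values listed above, so it can be decoded with bitwise tests. A minimal Python sketch (the names `LEVEL_FLAGS` and `decode_levels` are hypothetical):

```python
# Flag values as listed for the "levels" field.
LEVEL_FLAGS = {1: "individual", 2: "family", 4: "address", 8: "business", 16: "custom"}

def decode_levels(levels: int) -> list[str]:
    """Return the names of the matching levels encoded in a "levels" value."""
    if not 0 <= levels <= 31:
        raise ValueError("levels must be between 0 and 31")
    return [name for flag, name in LEVEL_FLAGS.items() if levels & flag]

matched_levels = decode_levels(5)
```

With levels equal to 5, this yields the individual and address levels, as in the example given for the field.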
Grouped Matching Pairs
Outputs matching pairs of records. Each pair is grouped with other matching pairs.
Output format: "|GP|level|record1|record2|score|group(|baseScore)(|highestScore)|key"
- "level" indicates the matching level (one of I, F, A, B, or C).
- "record1" and "record2" refer to the two matching records. See Matching Pairs, above, for further details.
- "score" indicates, by default, the total score for the specified matching level. If outputComponentScores is enabled (see Output Options, below), then the total score is immediately followed by the scores for all mapped components.
- "group" indicates the unique ref of the group's master/best record.
- "baseScore" indicates the lowest score of each match within the group. Not present if outputBestMatchesOnly is enabled.
- "highestScore" indicates the highest score of each match within the group. Only present if outputHighestScores is enabled.
- "key" is a 0-based index of the match key used to find the match.
Output format (with outputHighestScores enabled):
"|GP|level|record1|record2|score|group|baseScore|highestScore|key"
Matching Groups
Outputs the records within each group.
Output format: "|MG|level|group|baseScore(|highestScore)(|duplicates)|table|record"
- "level" indicates the matching level (one of I, F, A, B, or C).
- "group" indicates the unique ref of the group's master/best record.
- "baseScore" indicates the lowest score of each match within the group.
- "highestScore" indicates the highest score of each match within the group. Only present if outputHighestScores is enabled.
- "duplicates" indicates the number of duplicates within the group (this is one less than the group size). Only present if outputDuplicatesCount is enabled.
- "table" indicates which table the grouped record originates from (1 or 2 for an overlap, otherwise 0).
- "record" refers to a matching record within the group. As per record1 and record2, each record can be output as a unique ref or can be a copy of the input data.
From version 2.0.3 the first record output in each group is the master/best record.
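The relationship between the group-level fields can be illustrated with a short Python sketch. It assumes the scores of the matching pairs within a group are already known; the function name `group_summary` and the sample scores are hypothetical.

```python
def group_summary(pair_scores: list[int], group_size: int) -> dict[str, int]:
    """Derive the group-level fields from the scores of the pairs in one group."""
    if not pair_scores or group_size < 2:
        raise ValueError("a group needs at least two records and one matching pair")
    return {
        "baseScore": min(pair_scores),     # lowest score of any match in the group
        "highestScore": max(pair_scores),  # highest score of any match in the group
        "duplicates": group_size - 1,      # one less than the group size
    }

summary = group_summary([85, 92, 78], group_size=3)
```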
Deduped Data
Outputs all unique records. If overlapping two tables, records are output from table 2 only.
Output format: "|DE|level|data"
- "level" indicates the matching level (one of I, F, A, B, or C).
- "data" is a copy of the input data.
Duplicate Data
Outputs all duplicate records. If overlapping two tables, records are output from table 2 only.
Output format: "|DU|level|data"
- "level" indicates the matching level (one of I, F, A, B, or C).
- "data" is a copy of the input data.
(Note that combining the Deduped and Duplicate Data output will result in a copy of the original input data. If overlapping two tables, this will be a copy of table 2.)
Matched Data
From version 2.0.3
Outputs a copy of the input data (or table 2 when overlapping) with matchrefs and basescores output for matching records.
Output format (internal dedupe): "|MD|level|group|baseScore(|highestScore)|duplicates|record"
- "level" indicates the matching level (one of I, F, A, B, or C).
- "group" indicates the unique ref of the group's master/best record.
- "baseScore" indicates the lowest score of each match within the group.
- "highestScore" indicates the highest score of each match within the group. Only present if outputHighestScores is enabled.
- "duplicates" indicates the number of duplicates within the group (this is one less than the group size).
- "record" is a copy of the input data.
Output format (overlap): "|MD|level|bestMatch|score|record"
- "level" indicates the matching level (one of I, F, A, B, or C).
- "bestMatch" indicates the unique ref of the best record.
- "score" indicates the score of the match.
- "record" is a copy of the input data.
Output Options
<outputs>
<options>
<outputHeader enabled="true" />
<outputUniqueRefsOnly enabled="true" />
<outputComponentScores enabled="false" />
<outputExactMatchScores enabled="false" />
<outputAllExactMatches enabled="false" />
<outputLargeClusters enabled="false" records="false" dumpAsErrors="false" />
<outputHighestScores enabled="false" />
<outputDuplicatesCount enabled="false" />
<outputCompareResults enabled="false" />
<outputBestMatchesOnly enabled="false" />
</options>
</outputs>
outputHeader: If enabled, a header naming each column is output with the results for each output type. Disable this setting to prevent the header from being output.
outputUniqueRefsOnly: If enabled (the default), then only unique refs are output. If disabled, then the output contains a copy of the input data, which can include the unique ref.
outputComponentScores: If enabled, then scores for mapped components are output for each matching level in addition to total scores. If disabled (the default), then only the total score for each matching level is output.
Mapped components are: name, organization, address, postcode, telephone, email, date of birth, website, and the nine custom fields. Component scores will only be written to the matchingPairs and groupedMatchingPairs outputs, and always in the order shown. For example, if the input data definition includes name, address, and postcode fields, then four scores will be output for each matching level: the total score for the level, plus name, address, and postcode scores.
Without component scores:
|MP|AA034531|AA037947|5|116|0|56|0|0
With component scores (each level's total score is followed by its name, address, and postcode scores):
|MP|AA034531|AA037947|5|116|60|36|20|0|0|0|0|56|0|36|20|0|0|0|0|0|0|0|0
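To make the layout of the component scores easier to see, the following Python sketch regroups the score fields of the example line by matching level. It is an illustration only: the function name `group_component_scores` is hypothetical, the list of mapped components must be supplied by the caller in the documented order, and any trailing fields (such as the key index) would need to be removed before calling it.

```python
LEVELS = ["ind", "fam", "add", "bus", "cus"]

def group_component_scores(score_fields: list[str], components: list[str]) -> dict[str, dict]:
    """Regroup a flat list of score fields into one dict per matching level."""
    per_level = 1 + len(components)  # total score plus one score per mapped component
    if len(score_fields) != per_level * len(LEVELS):
        raise ValueError("unexpected number of score fields")
    result = {}
    for i, level in enumerate(LEVELS):
        chunk = score_fields[i * per_level:(i + 1) * per_level]
        # Blank fields (for example for exact matches) are kept as None.
        values = [int(v) if v else None for v in chunk]
        result[level] = {"total": values[0], **dict(zip(components, values[1:]))}
    return result

line = "|MP|AA034531|AA037947|5|116|60|36|20|0|0|0|0|56|0|36|20|0|0|0|0|0|0|0|0"
fields = line[1:].split(line[0])
# fields[0:4] are the type code, record1, record2, and levels.
scores = group_component_scores(fields[4:], ["name", "address", "postcode"])
```

In this example the individual total of 116 is the sum of the name (60), address (36), and postcode (20) scores, and the address-level total of 56 is the sum of the address and postcode scores.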
outputExactMatchScores: If enabled, then a total score is output for exact matches that is the sum of the sure score setting for all mapped components plus one. Otherwise the score field is blank for exact matches. Regardless of this setting the component scores for exact matches are always blank.
outputAllExactMatches: When disabled (the default), matching pairs are only output if a record exactly matches the first record of a cluster. If enabled, then all matching pairs are output. For example, if there are three records in a cluster (A, B, C) then three matching pairs will be found and output: A+B, A+C, B+C. Disabling this setting would prevent B+C from being output (assuming A was the first record added to the cluster). This setting should be used with care: a group containing 200 exact matches would produce 199 matching pairs by default, but with the setting enabled there'd be 19900 matching pairs!
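The difference in output volume follows from simple counting: by default each record is paired only with the first record in its cluster, whereas with the setting enabled every pair of records is output. A small Python sketch of the arithmetic (the function name `exact_pair_counts` is hypothetical):

```python
from itertools import combinations

def exact_pair_counts(cluster_size: int) -> tuple[int, int]:
    """Return (pairs output by default, pairs output with outputAllExactMatches enabled)."""
    if cluster_size < 0:
        raise ValueError("cluster_size must not be negative")
    first_record_only = max(cluster_size - 1, 0)       # each later record paired with the first
    all_pairs = cluster_size * (cluster_size - 1) // 2  # every unordered pair of records
    return first_record_only, all_pairs

counts_for_200 = exact_pair_counts(200)
# Enumerating the pairs for the three-record example (A, B, C):
abc_pairs = list(combinations("ABC", 2))
```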
outputLargeClusters: Disabled by default. If enabled, large clusters are instead output as results with format "|LC|key|cluster|records", where:
- "key" indicates the zero-based index of the match key;
- "cluster" identifies the unique ref of the first record in the cluster;
- "records" indicates the total number of records in the cluster.
Additionally, all records in any large clusters can be output as results with format "|LR|cluster|record|compared" where:
- "key" indicates the zero-based index of the match key;
- "cluster" identifies the unique ref of the first record in the cluster;
- "record" identifies the unique refs of the records within the cluster;
- "compared" indicates whether the record was compared (1) to all the preceding records in the cluster, or whether it was excluded (0) because the cluster has grown too large.
When dumpAsErrors is enabled then the information contained in this output will be output as a series of errors. This setting is independent of whether outputLargeClusters is enabled or disabled. By default dumpAsErrors is disabled.
Refer to the maximumClusterSize setting for further details on large clusters.
outputHighestScores (from version 2.0.3): If enabled, highest scores are also output to the Grouped Matching Pairs and Matching Groups output types. This is the highest score achieved by any matching pair within each group. (Conversely, the base score is the lowest score achieved by the pairs within a group, and is always output.)
outputDuplicatesCount (from version 2.0.3): If enabled, the number of duplicates in each group is output to the Matching Groups output type. This is one less than the number of records in the group.
outputCompareResults (from version 2.0.3): If enabled, the matching matrices indices and acronym match flag are included in the Matching Pairs output type.
outputBestMatchesOnly (from version 2.0.4.12): If enabled, the Grouped Matching Pairs output type will only show the best match for each record, where best means the match with the highest match score. This is really only useful when overlapping two tables, where only the best match for each overlap table record is output.
Advanced
<advanced>
<postMatchingRules>
<rule condition="" action="" />
<rule condition="" action="" />
...
</postMatchingRules>
<bridgingPrevention>
<nameBridgingPrevention enabled="true" />
<prefixBridgingPrevention enabled="true" />
<companyBridgingPrevention enabled="true" />
<aggressiveSplitting enabled="false" />
<masterRecordIdentification enabled="true" />
</bridgingPrevention>
<associations>
<associate element1="" element2="" />
<associate element1="" element2="" />
...
</associations>
<ruleSets> ... </ruleSets>
<options>
<outputDeletedMatches enabled="false" />
<outputSubRecords enabled="false" />
</options>
<volume>Medium</volume>
<tightness>Medium</tightness>
<businessTightness>Standard</businessTightness>
<methods>...</methods>
<bestFitMethods enabled="true" />
</advanced>
Post-Matching Rules
Advanced Post-Matching rules are applied to matching pairs prior to grouping. They only apply to fuzzy compared matches. They are entirely optional, and none are defined by default. Each rule specifies a condition, using a SQL-like syntax, and an action that determines what happens when the condition is satisfied.
Rule actions are either "Keep" or "Delete". If any successful rule specifies a Keep action, then the match is kept. If any successful rule specifies a Delete action, then the match is deleted, but only if the match isn’t being kept. Keep actions are therefore a higher priority than Delete actions. Rules are applied to each level that a pair matches at, so a matching pair might be deleted at some levels and kept at others.
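The precedence of Keep over Delete can be expressed as a short function. This is a sketch of the behaviour described above for one matching pair at one level, not the product's implementation; the function name `resolve_actions` and the "unchanged" return value (used when no rule's condition was satisfied) are hypothetical.

```python
def resolve_actions(actions: list[str]) -> str:
    """Combine the actions of all rules whose conditions were satisfied.

    `actions` holds the action ("Keep" or "Delete") of each successful rule
    for one matching pair at one matching level.
    """
    normalized = {action.lower() for action in actions}
    unknown = normalized - {"keep", "delete"}
    if unknown:
        raise ValueError(f"unknown rule action(s): {sorted(unknown)}")
    if "keep" in normalized:
        return "keep"       # Keep takes priority over Delete
    if "delete" in normalized:
        return "delete"
    return "unchanged"      # no rule applied, so the match is left as it is

outcome = resolve_actions(["Delete", "Keep"])
```

Because rules are applied per level, a caller would evaluate this once for each level at which the pair matches.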
Rule conditions are logical expressions that result in a Boolean (true or false). An expression can be a function, such as “matches(city)”, or a logical operation, such as “AddressScore >= 30” or “City == ‘RALEIGH’”. Conditions can consist of a single logical expression or of multiple expressions combined using “and” or “or”, i.e.:
- logical_expression
- logical_expression and logical_expression
- logical_expression or logical_expression
Brackets around sub-expressions are permitted (and are encouraged to avoid ambiguity). Additionally, expressions can be negated using “not” (or “!”) – e.g. “NOT MATCHES(FirstName)” or “!MATCHES(FirstName)”.
For a complete list of functions refer to Appendix E - Advanced Post-Matching Rule Functions.
Logical operations can involve integers or strings. The allowed operands of integer logical operations are: score, integer_value, and the Length() function. I.e.
- score logical_operator integer_value
- Length(field) logical_operator integer_value
The allowed operands of string logical operations are: field, string_value, and the Record1() and Record2() functions. I.e.
- field logical_operator field
- field logical_operator string_value
- Record1(field)/Record2(field) logical_operator string_value/field
Where:
- logical_operator is one of:
- == (and =)
- !=
- LT (this means <)
- GT (this means >)
- LTE (this means <=)
- GTE (this means >=)
- integer_value is an immediate integer value.
- string_value is an immediate string value in single quotes.
- field is one of the field names specified in Appendix D - Normalization Output.
- score is one of:
- Score – i.e. total score for level
- NameScore
- AddressScore
- OrganizationScore
- PostcodeScore
- TelephoneScore
- EmailScore
- DateOfBirthScore
- WebsiteScore
- Custom1Score
- ...
- Custom9Score
Logical operations involving two field names, e.g. of the form "FirstNames==Lastname", are only allowed with the operators "==" and "!=". The example "FirstNames==Lastname" is true if the first names from record 1 match the last name from record 2, or the first names from record 2 match the last name from record 1. To compare fields within a single record, refer to the Record() function.
Logical operations involving one field name, e.g. "City=='NFA'" or "Length(Lastname) LT 3" are only true if the condition applies to both records. To test for a condition that applies to either record use the Record() function.
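The two evaluation rules above can be illustrated in Python. This is a sketch of the described semantics only: records are modelled as plain dictionaries, comparison is assumed to be exact string equality (the product may normalize values first), and the function names are hypothetical.

```python
def two_field_equal(rec1: dict, rec2: dict, field_a: str, field_b: str) -> bool:
    """Semantics of "field_a == field_b": the comparison is made across the two records."""
    return rec1[field_a] == rec2[field_b] or rec2[field_a] == rec1[field_b]

def one_field_equal(rec1: dict, rec2: dict, field: str, value: str) -> bool:
    """Semantics of "field == 'value'": the condition must hold for both records."""
    return rec1[field] == value and rec2[field] == value

# Hypothetical records in which the names are transposed in the second record.
rec1 = {"FirstNames": "JAMES", "Lastname": "MARTIN", "City": "NFA"}
rec2 = {"FirstNames": "MARTIN", "Lastname": "JAMES", "City": "RALEIGH"}

transposed = two_field_equal(rec1, rec2, "FirstNames", "Lastname")
both_nfa = one_field_equal(rec1, rec2, "City", "NFA")
```

Here the two-field test succeeds because the first names of one record equal the last name of the other, while the single-field test fails because only one of the two records has the city "NFA".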
Bridging Prevention
Bridging happens when we combine two matching pairs, by a common record, to form a group where the other two records are not a good match.
For example, consider the following two matching pairs.
Pair 1
- AR017900|MISS|SUSAN|DAVIS|
- AY272090|MS| |DAVIS|
Pair 2
- AK557942|MRS|KAREN|DAVIS|
- AY272090|MS| |DAVIS|
Each pair is a perfectly good match in itself. However, when we combine them into a group, based on the common record: AY272090, this creates a bridge between the two non-matching records:
Bridged group
- AR017900|MISS|SUSAN|DAVIS|
- AK557942|MRS|KAREN|DAVIS|
- AY272090|MS| |DAVIS|
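How the bridge arises can be reproduced with a generic grouping routine. The Python sketch below merges matching pairs that share a record (a simple union-find, used here purely for illustration and not claimed to be the product's grouping algorithm) and then lists the members of the group that were never matched directly. The function names are hypothetical.

```python
from itertools import combinations

def group_pairs(pairs: list[tuple[str, str]]) -> list[set[str]]:
    """Merge matching pairs that share a record into groups."""
    parent: dict[str, str] = {}

    def find(ref: str) -> str:
        parent.setdefault(ref, ref)
        while parent[ref] != ref:
            parent[ref] = parent[parent[ref]]  # path compression
            ref = parent[ref]
        return ref

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: dict[str, set[str]] = {}
    for ref in list(parent):
        groups.setdefault(find(ref), set()).add(ref)
    return list(groups.values())

def unmatched_members(group: set[str], pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the pairs of group members that were never matched directly."""
    matched = {frozenset(pair) for pair in pairs}
    return [pair for pair in combinations(sorted(group), 2) if frozenset(pair) not in matched]

pairs = [("AR017900", "AY272090"), ("AK557942", "AY272090")]
groups = group_pairs(pairs)
bridged = unmatched_members(groups[0], pairs)
```

The two pairs collapse into a single group of three records, and the only pair within it that was never matched directly is Susan's record with Karen's record.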
A number of situations can cause bridging – the thing they have in common is that some information is missing from a record. Causes of bridging include:
- Name bridging: A record is missing forename and matches to multiple records with different forenames (as in the example above);
- Prefix bridging: A record with “Ms” matches to a record with “Miss” and to a record with “Mrs” (where one of the forenames is empty or just an initial);
- Company name bridging: A record with a company name acronym or partial company name matches to multiple company names (e.g. IBM matches “International Business Machines” and “Injection Blow Moulding”).
Bridging prevention splits bridged groups into sub-groups. This affects the Grouped Matching Pairs, Matching Groups, Deduped Data, and Duplicate Data outputs, but does not affect the Matching Pairs output.
From version 2.0.3, bridging prevention will not break apart matches flagged as "keep" by post-matching rules.
Prefix bridging prevention only has an effect if name bridging prevention is also enabled. For example, with the following group, enabling name bridging prevention alone will leave the group intact, but enabling both name and prefix bridging prevention will remove record FY856127 from the group.
- MX123765|Miss|Karen|Lang|
- NQ591736|Ms|K|Lang|
- FY856127|Mrs|K|Lang|
- AK275623|Miss|K|Lang|
nameBridgingPrevention: If enabled (the default for single table dedupe), bridging caused by missing forenames will be prevented.
prefixBridgingPrevention: If enabled (the default for single table dedupe) and nameBridgingPrevention is also enabled, bridging caused by Ms will be prevented.
companyBridgingPrevention: If enabled (the default for single table dedupe), bridging caused by company name acronyms will be prevented.
aggressiveSplitting: If enabled, bridging records will be disassociated from all matching records. If disabled (the default), bridging records will remain matched to one sub-group of non-bridging records. With the example above, aggressive splitting will leave no matching groups intact. With default splitting, just one of the records (either Susan or Karen) will be removed from the group.
Note that (prior to version 2.0.3) when enabling bridging prevention it is important to disable the skipFuzzy option on any exact keys. This is because bridging prevention needs to know about all matching pairs in order to work properly.
Note that bridging prevention can only work in overlap mode if an internal dedupe has already been performed on table 2. From version 2.0.2 bridging prevention defaults to disabled for overlaps.
masterRecordIdentification (from version 2.0.3): If enabled (the default, and the fixed behavior prior to version 2.0.3), the master record in each group is chosen according to Master Priorities rules, then address length, then lowest UniqueRef. If disabled, the master record in each group is simply the record with the lowest UniqueRef.
Associations (Multiple Elements)
When a record contains Multiple Elements, column types are naturally associated into the element types: name, organization, address, telephone, mobile, fax, email, job, qualifications, dob, website as follows:
Element Type | Column Types |
name | FullName, Prefix, Forenames, MiddleNames, Initials, Surname, Suffix |
organization | Organization, Department |
address | Address1…Address9, City, State, ZIP, PostOut, Plus4, DeliveryPoint, Country |
telephone | Telephone |
mobile | Mobile |
fax | Fax |
job | JobTitle |
qualifications | Qualifications |
dob | DateOfBirth |
website | Website |
Sub-records are created representing the various combinations of the different instances of each element type. Columns of the same element type are only combined with each other if they are from the same occurrence, i.e. like this:
- FirstForenames with FirstSurname;
- SecondForenames with SecondSurname.
but not like this:
- FirstForenames with SecondSurname;
- SecondForenames with FirstSurname.
Whereas, the First columns from one element type (e.g. FirstFullName) are combined with the First, Second, Third etc columns from other element types (e.g. FirstTelephone, SecondTelephone).
This allows, for example, for one organization to have multiple addresses or one name to have multiple telephone numbers.
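The way instances of different element types combine when no associations are configured can be pictured as a Cartesian product. The Python sketch below is a simplification under that assumption: it treats each instance as an indivisible unit, does not model associations, and ignores any engine-specific restrictions. The function name `sub_records` is hypothetical.

```python
from itertools import product

def sub_records(instances: dict[str, list[str]]) -> list[dict[str, str]]:
    """Combine every instance of each element type with every instance of the others."""
    element_types = list(instances)
    return [
        dict(zip(element_types, combination))
        for combination in product(*(instances[t] for t in element_types))
    ]

# One name with two telephone numbers gives two sub-records.
combos = sub_records({
    "name": ["FirstName"],
    "telephone": ["FirstTelephone", "SecondTelephone"],
})
```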
However, where a record has a person’s home address and work address we would want to combine the organization name with the work address but not with the home address. To achieve this we need to configure Associations and associate organization with address.
<advanced>
<associations>
<associate element1="organization" element2="address" />
</associations>
</advanced>
Each <associate> configuration option prevents the mixing and matching of different instances of the two element types indicated.
Similarly, to support an employee’s job history, for example, we would want to combine:
- FirstOrganization, FirstJobTitle and FirstAddress;
- SecondOrganization, SecondJobTitle and SecondAddress.
But not:
- FirstOrganization, SecondJobTitle and ThirdAddress;
- SecondOrganization, FirstJobTitle and SecondAddress.
To achieve this we need to configure the following Associations.
<advanced>
<associations>
<associate element1="organization" element2="address" />
<associate element1="job" element2="organization" />
</associations>
</advanced>
Note: with the above two associations configured, there is an implied third association between job and address.
As a further example, for one residential address with two inhabitants we might use the following associations:
<advanced>
<associations>
<associate element1="name" element2="dob" />
<associate element1="name" element2="email" />
<associate element1="name" element2="mobile" />
<associate element1="address" element2="telephone" />
</associations>
</advanced>
Matching Rule Sets
From version 2.0.4.
Matching Rules consist of constraints, weights and matching matrices. There are five sets of Matching Rules, traditionally used one for each of the five matching levels: individualLevel, familyLevel, addressLevel, businessLevel, and customLevel. However, since matchIT Hub version 2.0.4, you can use these five sets of Matching Rules in more flexible ways than being tied to a particular level. For this reason, the following aliases are supported for the xml tags: firstRuleSet, secondRuleSet, thirdRuleSet, fourthRuleSet, fifthRuleSet.
This can be useful if, for example:
- you are running multiple overlaps with data from different sources and want to define different matching rules for each source,
- in lookup mode you want to use different matching rules depending on the search query data available.
Each rule set can be enabled or disabled and configured with: a name, level, minimumScore. Where the default settings are:
<ruleSets>
<ruleSet name="individual" level="individual" enabled="true" />
<ruleSet name="family" level="family" enabled="true" />
<ruleSet name="address" level="address" enabled="true" />
<ruleSet name="business" level="business" enabled="true" />
<ruleSet name="custom" level="custom" enabled="true" />
</ruleSets>
Say, for example, you have multiple overlap sources (that require different Matching Rules) you could specify something like:
<ruleSets>
<ruleSet name="enquiries" level="individual" minimumScore="70" enabled="true" />
<ruleSet name="support" level="individual" minimumScore="60" enabled="false" />
<ruleSet name="newCustomer" level="individual" enabled="false" />
</ruleSets>
Specifying a minimumScore for the ruleSet overrides the minimumScore for the level. The ruleSet attributes minimumScore and enabled can be changed between overlaps, i.e. at the same time as changing the layout of table 2.
Matching keys can be associated with rule sets by specifying a comma separated list of rule set names or numbers in the ruleSets attribute on key tags. E.g.
<keys>
<fuzzy>
<key ruleSets="enquiries,support,newCustomer">[Name1]+[PhoneticStreet]</key>
<key ruleSets="2">CustomField1</key> <!-- credit card hash -->
<key ruleSets="enquiries,support">CustomField2</key> <!-- account number -->
</fuzzy>
</keys>
Advanced Options
outputDeletedMatches: Outputs matches deleted by the Advanced Post-Matching Rules. This type of output will only occur if some Advanced Post-Matching Rules have been configured. As with matching pairs output the results are available immediately because grouping is not performed.
Output format: "|DP|record1|record2|levels|indscore|famscore|addscore|busscore|cusscore|indrule|famrule|addrule|busrule|cusrule"
- "record1" and "record2" refer to the two matching records. See Matching Pairs, above, for further details.
- "levels" indicates which matching level(s) the two records match at. See Matching Pairs, above, for further details.
- "indscore|famscore|addscore|busscore|cusscore" indicate the scores for each matching level. See Matching Pairs, above, for further details.
- "indrule|famrule|addrule|busrule|cusrule" indicate which rule caused the match to be deleted at each level.
Rules are identified by a rule number, which is simply the sequential position of the rule among the rules defined in the config.
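As a rough illustration of consuming this output, the sketch below splits a deleted-pair line into named fields in the order given by the output format above. The function name parse_deleted_pair and the sample line are hypothetical, all values are kept as strings, and it assumes the line has no trailing delimiter.

```python
# Field order after the "DP" tag, taken from the output format above.
DP_FIELDS = [
    "record1", "record2", "levels",
    "indscore", "famscore", "addscore", "busscore", "cusscore",
    "indrule", "famrule", "addrule", "busrule", "cusrule",
]


def parse_deleted_pair(line):
    """Split a |DP| output line into a dict of field name -> string value.

    Hypothetical helper: not part of matchIT Hub itself.
    """
    parts = line.strip().split("|")
    # The leading "|" produces an empty first element, followed by the "DP" tag.
    if len(parts) < 2 or parts[0] != "" or parts[1] != "DP":
        raise ValueError("not a deleted-pair (DP) line")
    values = parts[2:]
    if len(values) != len(DP_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(DP_FIELDS), len(values)))
    return dict(zip(DP_FIELDS, values))


# Hypothetical sample line; the record references, levels value and scores are made up.
sample = "|DP|1001|1002|I|85|0|0|0|0|2|0|0|0|0"
parsed = parse_deleted_pair(sample)
print(parsed["record1"], parsed["record2"], parsed["indrule"])
```

Here indrule would be the number of the post-matching rule that deleted the match at individual level.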
outputSubRecords: Outputs details of the sub-records created. This type of output is useful when mapping multiple elements as a way to check that any associations have been configured correctly.
Output format: "|SR|table|name|organization|address|telephone|mobile|fax|email|job|qualifications|dob|website"
- "table" indicates which table the sub record belongs to (1 or 2 for an overlap, otherwise 0).
- "name" indicates which instance of mapped name fields are used in this sub record.
- "organization" indicates which instance of mapped organization fields are used in this sub record.
- "address" indicates which instance of mapped address fields are used in this sub record.
- ...
- "website" indicates which instance of mapped website fields are used in this sub record.
For example, with the example associations above representing one residential address with two inhabitants (with dob, email & mobile associated with name and telephone associated with address), we would get the following sub-records:
- |SR|0|FirstName||FirstAddress||FirstMobile||FirstEmail|||FirstDateOfBirth|
- |SR|0|FirstName||FirstAddress|FirstTelephone|||FirstEmail|||FirstDateOfBirth|
- |SR|0|SecondName||FirstAddress||SecondMobile||SecondEmail|||SecondDateOfBirth|
- |SR|0|SecondName||FirstAddress|FirstTelephone|||SecondEmail|||SecondDateOfBirth|
Note: mobile and telephone can’t co-exist in the same sub-record because the underlying matching engine currently only supports one phone number at a time. Hence, this example results in four sub-records when you might expect only two.
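The reason for four sub-records rather than two can be sketched as a small expansion: each name is paired with each phone number in turn, because only one phone number fits in a sub-record. This is an illustrative model only, not Hub's actual implementation; the function sub_records and the dictionary layout are hypothetical.

```python
# Field order after the "SR" tag, taken from the output format above.
SR_FIELDS = [
    "table", "name", "organization", "address", "telephone", "mobile",
    "fax", "email", "job", "qualifications", "dob", "website",
]


def sub_records(people, address, table=0):
    """Build one |SR| line per (person, phone number) pair.

    Hypothetical model: dob, email and mobile belong to the person,
    telephone belongs to the address, and a sub-record holds one phone only.
    """
    lines = []
    for person in people:
        phones = (("mobile", person["mobile"]), ("telephone", address["telephone"]))
        for phone_field, phone_value in phones:
            rec = dict.fromkeys(SR_FIELDS, "")
            rec.update(
                table=str(table),
                name=person["name"],
                address=address["address"],
                email=person["email"],
                dob=person["dob"],
            )
            rec[phone_field] = phone_value
            lines.append("|SR|" + "|".join(rec[f] for f in SR_FIELDS))
    return lines


people = [
    {"name": "FirstName", "mobile": "FirstMobile",
     "email": "FirstEmail", "dob": "FirstDateOfBirth"},
    {"name": "SecondName", "mobile": "SecondMobile",
     "email": "SecondEmail", "dob": "SecondDateOfBirth"},
]
address = {"address": "FirstAddress", "telephone": "FirstTelephone"}

result = sub_records(people, address)
for line in result:
    print(line)
```

Two names multiplied by two phone numbers gives the four lines listed above.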
Volume
volume: Gives an indication of the number of records being processed. Options are "High" (more than 5 million) or "Medium" (less than 5 million). The volume setting is used by the Smart Settings feature when determining which keys to use.
From version 2.0.3 the default was changed from High to Medium.
Tightness
tightness: Matching tightness controls how strict the matching should be. Options are "Loose", "Medium", or "Tight". The tightness setting is used by the Smart Settings feature when determining the minimum score thresholds used.
From version 2.0.3 the default was changed from Loose to Medium.
The minimum score thresholds are as follows:
Matching level | UK Loose | UK Medium | UK Tight | US Loose | US Medium | US Tight |
Individual | 80 | 90 | 100 | 71 | 80 | 96 |
name Only | 30 | 35 | 40 | 25 | 40 | 60 |
Family | 80 | 90 | 100 | 71 | 80 | 96 |
Address | 45 | 50 | 55 | 40 | 50 | 55 |
Business | 80 | 90 | 100 | 71 | 80 | 96 |
company Only | 30 | 35 | 40 | 25 | 40 | 60 |
Custom | 80 | 90 | 100 | 71 | 80 | 96 |
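For quick reference in scripts, the table above can be expressed as a lookup. This is only a transcription of the documented values; the function name minimum_score and the data layout are hypothetical and are not part of the Hub API.

```python
# Column order of the thresholds table: (nationality, tightness).
COLUMNS = [
    ("UK", "Loose"), ("UK", "Medium"), ("UK", "Tight"),
    ("US", "Loose"), ("US", "Medium"), ("US", "Tight"),
]

# Rows transcribed from the thresholds table, keyed by matching level as printed.
THRESHOLDS = {
    "Individual":   [80, 90, 100, 71, 80, 96],
    "name Only":    [30, 35, 40, 25, 40, 60],
    "Family":       [80, 90, 100, 71, 80, 96],
    "Address":      [45, 50, 55, 40, 50, 55],
    "Business":     [80, 90, 100, 71, 80, 96],
    "company Only": [30, 35, 40, 25, 40, 60],
    "Custom":       [80, 90, 100, 71, 80, 96],
}


def minimum_score(level, nationality, tightness):
    """Return the documented minimum score for a level, nationality and tightness."""
    return THRESHOLDS[level][COLUMNS.index((nationality, tightness))]


print(minimum_score("Individual", "UK", "Medium"))
print(minimum_score("Address", "US", "Tight"))
```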
Business Tightness
From version 2.0.3.
businessTightness: Business tightness controls how strict the company name matching should be. Options are "Standard", "Tight", or "Legal". The business tightness setting is used by the Smart Settings feature to determine which organization matching matrices to use. Additionally, if businessTightness is "Tight" or "Legal", the advanced setting useEquivalentName is enabled and normalizationTruncation is set to 9; if businessTightness is "Legal", the advanced setting legalNormalization is also enabled (see Generate Organization settings).
Methods
From version 2.0.3.
methods: A comma-separated list of matching methods to use. If left blank (default), the Smart Settings feature will select matching methods automatically based on the input columns mapped.
Only use this setting if you want to override the automatic selection to limit the matching methods used.
E.g. <methods>NameAddress,ExactEmail</methods>
Best Fit Methods
From version 2.0.3.
bestFitMethods: Determines how Smart Settings selects matching methods when a matching level has been specified.
If enabled (default), Smart Settings always selects the most appropriate matching methods for the columns mapped. If disabled and a matching level is specified, Smart Settings chooses only the main matching method most appropriate for the user-specified level(s).
For example, with Name, Company, and Address components mapped, Smart Settings will, by default, select the NameCompanyAddress matching method. However, if Individual level is specified and bestFitMethods is disabled, the NameAddress matching method is selected.
Other Settings
<maximumClusterSize>200</maximumClusterSize>
maximumClusterSize: All processed data is added to clusters. When a record is added to an existing cluster, it's then compared to each record already in the cluster, provided the maximum cluster size has not been exceeded. If the cluster has reached the maximum size, then no more comparisons will be performed on that cluster and it will be logged as a large cluster.
Large clusters can have a significant impact on performance. Regularly hitting the maximum cluster size usually means that the keys should be tightened; if necessary, please contact Syniti support for advice.
Please note, however, that comparisons (and subsequent matches) are not necessarily missed when a record is added to a large cluster. If using multiple match keys, then the record might still be compared to other records in a different cluster. At the end of processing, records in large clusters can be output in order to post-process these records or to help identify the causes of large clusters.
From version 2.0.3, when volume is set to "High" the maximumClusterSize is increased to 800.
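The behaviour described above can be modelled roughly as follows. This is a simplified sketch, not Hub's implementation: the function cluster_records and its parameters are hypothetical, and it assumes that a record is still stored in a full cluster and that a cluster counts as full once it holds maximum_cluster_size records.

```python
from collections import defaultdict


def cluster_records(records, key_fn, compare, maximum_cluster_size=200):
    """Group records by key value and compare each new record to its cluster.

    Returns (matches, large_clusters). Hypothetical model of the documented
    behaviour: once a cluster is full, no more comparisons are made on it.
    """
    clusters = defaultdict(list)
    large_clusters = set()
    matches = []
    for record in records:
        key = key_fn(record)
        cluster = clusters[key]
        if len(cluster) >= maximum_cluster_size:
            # Cluster is full: skip comparisons and note it as a large cluster.
            large_clusters.add(key)
        else:
            for existing in cluster:
                if compare(existing, record):
                    matches.append((existing, record))
        cluster.append(record)  # assumption: the record is still stored
    return matches, large_clusters


# Four records sharing one key value, with a deliberately tiny maximum size.
records = ["a", "b", "c", "d"]
matches, large = cluster_records(
    records,
    key_fn=lambda r: "same-key",
    compare=lambda x, y: True,
    maximum_cluster_size=2,
)
print(matches, large)
```

With a second match key, the same records would also be placed in other clusters, which is why a full cluster does not necessarily mean a missed comparison.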
Smart Matching
Smart Matching is a combination of features that allows the automatic configuration of keys and other settings based on columns mapped, and the automatic selection of which keys to use, on a record-by-record basis, based on the columns populated.
Smart Settings
When ApplySettings/applySettings is called with a settings xml string that specifies the input column mappings, matchIT Hub will automatically select some settings that are appropriate to the input column types mapped - unless those settings are set explicitly.
The Smart Settings feature can configure:
- Match Keys;
- Matching Level;
- Weights;
- Constraints - e.g. for tight matching mustMatchSuffix is enabled;
- Post-Matching Rules - e.g. if Matching Level is individual, any matches with a namescore of zero are deleted;
- Maximum Cluster Size.
The Match Keys selected depend on: the columns mapped, the nationality, and the volume (High or Medium). If two inputs are configured, Smart Settings only considers those column types mapped in both.
To find out which settings were auto-configured, call GetSettings/getSettings. If you are not happy with any of the settings chosen automatically, you can simply override them with another call to ApplySettings/applySettings.
For example, for UK High volume, if you map the following input columns:
- |UniqueRef|Prefix|FirstNames|LastName|Organization|Address1|Address2|Address3|Address4|Address5|Postcode
...then Hub will set the following keys:
<keys>
<fuzzy>
<key methods="NameCompanyAddress">[AddressKey]+[Premise]</key>
<key methods="NameCompanyAddress">[Name1]+[PhoneticStreet]</key>
<key methods="NameCompanyAddress">[Name1]+[PostOut]</key>
<key methods="NameCompanyAddress">[OrgName1]+[PostOut]</key>
</fuzzy>
</keys>
From version 2.0.3, Smart Settings will also define an exact key consisting of all the mapped input fields.
Additionally, individual matching level will be enabled and the following post-matching rules configured:
<advanced>
<postMatchingRules>
<rule condition="level(IndividualLevel) and NameScore LT 24" action="delete" />
<rule condition="level(IndividualLevel) and !locationmatch()" action="delete" />
</postMatchingRules>
</advanced>
From version 2.0.3, Smart Settings disables the constraints (mustMatchLocation, mustMatchPremise, noOneEmptyPremise, mustMatchDirectional, mustMatchNumericStreetName, mustMatchBuilding, noOneEmptyBuilding) and instead defines an equivalent post-matching rule that also takes into account additional, non-address, mapped components (telephone, email, etc.) that have a significant weight.
Matching Methods
Smart Settings uses Matching Methods as a means of selecting keys.
Column types that are used as keys are grouped into component types (name, organization, address, postcode, telephone, email, date of birth, website, and custom components) as follows:
Component Type | Column Types |
name | FullName, Forenames, Surname |
company | Organization |
address | Address1…Address9, City, , Country |
postcode | Postcode, ZIP, PostOut, PostIn, Plus4 |
phone | Telephone, Mobile |
dob | DateOfBirth |
website | Website |
custom1-9 | Custom1-9 |
The available component types then determine which Matching Methods can be used:
Component Types | Matching Method |
---|---|
Main Matching Methods | |
name, company, address & postcode | NameCompanyAddress |
name, company & address | NameCompanyAddress |
name, company & postcode | NameCompanyZip |
name, address & postcode | NameAddress |
name & address | NameAddress |
name & postcode | NameZip |
name & state | NameState (From version 2.0.3) |
name | NameOnly |
company, address & postcode | CompanyAddress |
company & address | CompanyAddress |
company & postcode | CompanyZip |
company & state | CompanyState (From version 2.0.3) |
company | CompanyOnly |
address & postcode | AddressOnly |
address | AddressOnly |
postcode | ZipOnly |
Additional Matching Methods | |
name & phone | NamePhone |
name & email | NameEmail |
name & date of birth | NameDob |
company & phone | CompanyPhone |
website, address & postcode | WebsiteAddress |
website & address | WebsiteAddress |
website | ExactWebsite |
phone | ExactPhone |
email | ExactEmail |
custom1-9 | ExactCustom1-9 |
The most appropriate Matching Methods are selected by working down the above table from the top: a method is selected only if it uses components not already included in a Matching Method that has already been selected.
When Hub automatically selects keys, it indicates the Matching Methods each key is associated with as a comma-separated list in the key's methods attribute. In the example above, all the keys are associated with the NameCompanyAddress Matching Method.
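One way to read the selection rule is as a greedy pass over the table. The sketch below is an interpretation rather than Hub's actual code: it uses an abridged copy of the table (the state and custom rows are omitted), and the names METHOD_TABLE and select_methods are hypothetical.

```python
# (required component types, matching method), in table order (abridged).
METHOD_TABLE = [
    ({"name", "company", "address", "postcode"}, "NameCompanyAddress"),
    ({"name", "company", "address"}, "NameCompanyAddress"),
    ({"name", "company", "postcode"}, "NameCompanyZip"),
    ({"name", "address", "postcode"}, "NameAddress"),
    ({"name", "address"}, "NameAddress"),
    ({"name", "postcode"}, "NameZip"),
    ({"name"}, "NameOnly"),
    ({"company", "address", "postcode"}, "CompanyAddress"),
    ({"company", "address"}, "CompanyAddress"),
    ({"company", "postcode"}, "CompanyZip"),
    ({"company"}, "CompanyOnly"),
    ({"address", "postcode"}, "AddressOnly"),
    ({"address"}, "AddressOnly"),
    ({"postcode"}, "ZipOnly"),
    ({"name", "phone"}, "NamePhone"),
    ({"name", "email"}, "NameEmail"),
    ({"name", "dob"}, "NameDob"),
    ({"company", "phone"}, "CompanyPhone"),
    ({"website", "address", "postcode"}, "WebsiteAddress"),
    ({"website", "address"}, "WebsiteAddress"),
    ({"website"}, "ExactWebsite"),
    ({"phone"}, "ExactPhone"),
    ({"email"}, "ExactEmail"),
]


def select_methods(mapped_components):
    """Greedily pick methods, top of table first.

    A row is usable if all of its components are mapped, and it is selected
    only if it brings in at least one component not yet covered.
    """
    mapped = set(mapped_components)
    covered = set()
    selected = []
    for required, method in METHOD_TABLE:
        if not required <= mapped:
            continue  # some required component is not mapped
        if required <= covered:
            continue  # nothing new: all components are already covered
        selected.append(method)
        covered |= required
    return selected


print(select_methods({"name", "company", "address", "postcode"}))
print(select_methods({"name", "address", "postcode", "email", "phone", "dob"}))
```

Under this reading, the first call selects only NameCompanyAddress, and the second selects NameAddress, NamePhone, NameEmail and NameDob.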
Dynamic Keys
<dynamicKeys>false</dynamicKeys>
dynamicKeys: In lookup mode (and overlap mode) enabling this option instructs Hub to dynamically choose which keys to use (from those defined), on a record-by-record basis, depending on which input columns are populated.
Enabling dynamicKeys in overlap mode forces allowBlankKeys to be disabled, as the two options are not compatible.
Enabling dynamicKeys makes Smart Settings define more keys so there are more keys available to choose from.
For example, for UK High volume, if you map the following input columns:
- |UniqueRef|FirstNames|LastName|Address1|Address2|Address3|Postcode|Email|Telephone|DOB
...then with dynamicKeys disabled, Hub will configure the keys for the most appropriate Matching Methods:
<keys>
<fuzzy>
<key methods="NameAddress">AddressKey+Name1</key>
<key methods="NameEmail">NameKey+Trim(Upper(Email))</key>
<key methods="NameEmail">NormalizedName</key>
<key methods="NameAddress">PostOut+PostIn+Name1</key>
<key methods="NameAddress">PostOut+PostIn+Premise</key>
<key methods="NamePhone">Puntrim(Telephone)+NameKey</key>
<key methods="NameDob">Trim(DateOfBirth)+NameKey</key>
</fuzzy>
</keys>
...however, with dynamicKeys enabled, Hub will configure the keys for all available Matching Methods:
<keys>
<exact>
<key methods="ExactPhone">Puntrim(Telephone)</key>
<key methods="ExactEmail">Trim(Upper(Email))</key>
</exact>
<fuzzy>
<key methods="NameAddress">AddressKey+Name1</key>
<key methods="NameOnly">Name1 + Name2</key>
<key methods="NameEmail">NameKey+Trim(Upper(Email))</key>
<key methods="NameEmail">NormalizedName</key>
<key methods="AddressOnly">PhoneticTown + PhoneticStreet + Premise</key>
<key methods="NameZip,ZipOnly">PostOut + PostIn</key>
<key methods="NameAddress,NameZip">PostOut+PostIn+Name1</key>
<key methods="NameAddress,AddressOnly">PostOut+PostIn+Premise</key>
<key methods="NameOnly">Puntrim(FullName)</key>
<key methods="NamePhone">Puntrim(Telephone)+NameKey</key>
<key methods="NameDob">Trim(DateOfBirth)+NameKey</key>
</fuzzy>
</keys>
Hub then analyses which columns are populated in each table 2 record and selects the most appropriate Matching Methods, and hence keys, to use.
If you want to configure the keys manually and use dynamicKeys, you will have to specify the methods for each key. It is therefore recommended to use Smart Settings in conjunction with dynamicKeys.
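To illustrate why each key needs its methods attribute, the sketch below filters a keys block down to the keys associated with a given set of Matching Methods. It uses Python's standard xml.etree.ElementTree module on an abridged copy of the keys shown above; the function keys_for_methods is hypothetical, and how Hub itself chooses the methods for a record is not modelled here.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the dynamicKeys example above.
KEYS_XML = """<keys>
<exact>
<key methods="ExactPhone">Puntrim(Telephone)</key>
<key methods="ExactEmail">Trim(Upper(Email))</key>
</exact>
<fuzzy>
<key methods="NameAddress">AddressKey+Name1</key>
<key methods="NameZip,ZipOnly">PostOut + PostIn</key>
<key methods="NameAddress,NameZip">PostOut+PostIn+Name1</key>
<key methods="NamePhone">Puntrim(Telephone)+NameKey</key>
</fuzzy>
</keys>"""


def keys_for_methods(keys_xml, selected_methods):
    """Return the key expressions associated with any of the selected methods."""
    wanted = set(selected_methods)
    root = ET.fromstring(keys_xml)
    result = []
    for key in root.iter("key"):
        # The methods attribute is a comma-separated list of method names.
        methods = {m.strip() for m in key.get("methods", "").split(",") if m.strip()}
        if methods & wanted:
            result.append((key.text or "").strip())
    return result


# E.g. a record where, hypothetically, only NameZip was judged appropriate.
print(keys_for_methods(KEYS_XML, ["NameZip"]))
```

A key without a methods attribute would never be returned by this filter, which mirrors the need to specify methods on manually configured keys.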