IMPORTANT: If you make changes to the file structure below, the file will not receive changes to names and words in future updates. It is your sole responsibility to migrate additions or changes into future versions of this file.
The word lookup tables are used by the core matching component. These tables control:
- the matching equivalent of words e.g. Tony = Anthony
- the gender of forenames e.g. John = Male, Susan = Female, Chris = Either
- casing rules e.g. PO Box, IBM, 360Science
- expansion/contraction of abbreviations and correction of typing errors e.g. Svcs = Services, Finacial = Financial
- attributing type to these and other words e.g. Mr = Prefix, Ltd = Business, FL = State, The = Noise.
Usage
Location: <settings><advanced>
<advanced>
  <datPath>C:\Program Files\matchIT Hub\datfiles\US</datPath>
</advanced>
Prior to version 3.1
datPath: Specifies the path to the folder containing the word lookup files to use.
This folder contains: NAMES.DAT, NAMES2.DAT, SURNAMES.DAT and TOWNS.DAT.
Read more details on the names and words files here.
From version 3.1
datPath: Specifies the path to custom word lookup files used to modify the default base data. This method is preferred for clients wanting to keep track of what they changed and to get any changes Syniti may have included with updates.
This folder may contain:
- address-custom.xml
- name-custom.xml
- organization-custom.xml
- misc-custom.xml
Alternatively, datPath can point to a folder containing a copy of the default base word lookup files, which then replace the default base data. If you do not want updates to the names and words to affect you, copy the files included with the install into another folder, edit the copy, and point datPath at that folder in your config. This ensures your changes are not lost if you reinstall or update. However, if you do not use the latest versions of these files supplied with new releases of the software, some aspects of the software may not function correctly and new features may not be available.
This folder may contain:
- address.xml
- name.xml
- organization.xml
- misc.xml
Base files and custom files can be present in the same folder. In that case the base files load first, and the custom files are then applied as additions to and/or deletions from the base data. A copy of the base files is included with the install, and a copy of the custom files is attached to this article.
These files contain region-specific blocks of words, e.g. address-custom.xml has:
- Country-agnostic address words.
- Countries and country codes.
- UK-specific address words, post towns, and counties.
- US-specific address words and states.
- Canadian-specific address words and provinces and territories.
- Australian-specific address words and states.
Words are grouped in categories ("street", "forename", etc). The casing of words is ignored for matching, but the case specified is how the word is output. Multiple words in any category can be grouped together by enclosing them in <group>...</group>. The words in a group will be considered identical (for example, Anthony and Tony, Ltd and Limited, Apartment and Apt).
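As a minimal sketch of the grouping rule above, the fragment below groups two sub-premise designators so that they match each other. The flat and flats element names come from the address.xml category table; the entries are illustrative only and may already exist in the base data.

```xml
<address>
  <common>
    <flats>
      <!-- words inside a group are considered identical for matching -->
      <group>
        <flat>Apartment</flat>
        <flat>Apt</flat>
      </group>
      <!-- an ungrouped word; it is output with the casing written here -->
      <flat>Suite</flat>
    </flats>
  </common>
</address>
```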
address.xml
<address>
  <!--country-agnostic words used by all countries-->
  <common>
    <streets> <!--collection-->
      <group>
        <street>Road</street> <!--category-->
        <street>Rd</street>
      </group>
      <street>Rue</street>
      ...
    </streets>
    ...
  </common>
  <!--specify countries using ISO 3166-1 alpha-2 codes-->
  <country value="US">
    ...
  </country>
  ...
</address>
address.xml contains the following categories:
Category | Collection | Description |
street | streets | Road name designator, such as "Rd" or "Street" |
numericStreet | numericStreets | Numeric street designator, such as "Rte" or "Highway" |
building | buildings | Building name designator, such as "Farmhouse" or "Hall" |
flat | flats | Secondary/sub-premise designator, such as "Suite" or "Apartment" |
premise | premises | Numeric building designator, such as "Block", or "Building" |
floor | floors | Floor designator, such as "Floor" or "Flr" |
box | boxes | Box designator, such as "PO Box" or "Postfach" |
direction | directions | Street directional, such as "North" or "SW" |
country | countries | The country name and variants, such as "United States" or "Afghanistan" |
code | countries | The country code, such as "US" or "USA" |
region | regions | Administrative region, e.g. UK counties such as "Kent" or "Glos", US states such as "Iowa" or "NC", or Canadian provinces |
posttown | posttowns | UK posttowns |
town | towns | Towns and cities other than posttowns |
localCountry | localCountries | Local country, such as "Scotland" or "Wales" |
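To illustrate how a category from the table sits inside a country-specific block, here is a hedged sketch that groups a US state name with its abbreviation. The region and regions element names come from the table above; whether the base data groups these two entries in exactly this way is an assumption.

```xml
<address>
  <!-- country blocks are keyed by ISO 3166-1 alpha-2 code -->
  <country value="US">
    <regions>
      <group>
        <region>Iowa</region>
        <region>IA</region>
      </group>
    </regions>
  </country>
</address>
```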
name.xml
<name>
  <!--country-agnostic words used by all countries-->
  <common>
    <male>
      <prefixes> <!--collection-->
        <group>
          <prefix salutation="S">Mr</prefix> <!--category-->
          <prefix salutation="S">Mister</prefix>
        </group>
        <prefix>Master</prefix>
        ...
      </prefixes>
      ...
    </male>
    <female>
      ...
    </female>
    <either>
      ...
    </either>
    ...
  </common>
  <!--specify countries using ISO 3166-1 alpha-2 codes-->
  <country value="US">
    ...
  </country>
  ...
</name>
name.xml contains the following categories:
Category | Collection | Description |
prefix | prefixes | Prefix, such as "Mr" or "Captain" |
firstName | firstNames | Forename, such as "Adam" or "Abigail" |
surname | surnames | Surname, such as "Smith" or "Jones" |
suffix | suffixes | Suffix, such as "Jr" or "Senior" |
qualification | qualifications | Qualification word, such as "PhD" or "ARICS" |
Each prefix entry must have a salutation type associated with it. The following list shows the salutation types, along with an example of the type of salutation that will be generated:
Type | Rule | Example |
S | Dear Prefix Surname | Dear Mr Smith |
C | Dear Prefix Surname | Dear Mr Smith |
FS | Dear Prefix Forename Surname | Dear Mr John Smith |
FF | Dear Forename | Dear John |
F | Dear Prefix Forename | Dear Sir John |
B | Dear Prefix | Dear Sir |
T | Prefix | My Lord |
Salutation type C is different from type S in that it is treated as a name even if it is found in address lines 1 or 2 with Scan Address Lines for Names set. This means that if the option is switched on and e.g. MR has salutation type C, then Mr J Smith would be identified as a name in address line 1 or 2, whereas if MR has salutation type S, then it would not be identified.
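The difference between types S and C could be expressed in a custom file roughly as follows. This is a sketch only: the salutation attribute and element names are taken from the name.xml sample above, but whether an entry in name-custom.xml overrides the salutation type of an existing base entry in exactly this way is an assumption.

```xml
<name>
  <common>
    <male>
      <prefixes>
        <!-- type C: "Mr J Smith" in address line 1 or 2 is identified as a name
             when Scan Address Lines for Names is set -->
        <prefix salutation="C">Mr</prefix>
      </prefixes>
    </male>
  </common>
</name>
```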
organization.xml
<organization>
  <!--country-agnostic words used by all countries-->
  <common>
    <types> <!--collection-->
      <group>
        <type>Ltd</type> <!--category-->
        <type to="Ltd">Limited</type>
      </group>
      <type>Holdings</type>
      ...
    </types>
    ...
  </common>
  <!--specify countries using ISO 3166-1 alpha-2 codes-->
  <country value="US">
    ...
  </country>
  ...
</organization>
organization.xml contains the following categories:
Category | Collection | Description |
word | words | Words indicative of a business name, such as "Printers" or "Antiquities" |
type | types | Business type, such as "Ltd" or "GmbH" |
name | names | Business name, such as "General Motors" or "Fedex" |
job | jobs | Job title word, such as "Manager" |
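The other collections in this table follow the same pattern as types. As a hedged sketch, a job title word and an abbreviation of it might be grouped like this; the job and jobs element names come from the table, while the "Mgr" entry is purely illustrative.

```xml
<organization>
  <common>
    <jobs>
      <group>
        <job>Manager</job>
        <job>Mgr</job> <!-- illustrative abbreviation, not taken from the base data -->
      </group>
    </jobs>
  </common>
</organization>
```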
misc.xml
<misc>
  <!--country-agnostic words used by all countries-->
  <common>
    <exclusions> <!--collection-->
      <group>
        <exclusion>Deceased</exclusion> <!--category-->
        <exclusion>Decsd</exclusion>
      </group>
      <exclusion>Addressee</exclusion>
      ...
    </exclusions>
    ...
  </common>
  <!--specify countries using ISO 3166-1 alpha-2 codes-->
  <country value="US">
    ...
  </country>
  ...
</misc>
misc.xml contains the following categories:
Category | Collection | Description |
exclusion | exclusions | Exclusion word, such as "Deceased" or "Moved" |
noise | noises | Noise word (i.e. ignored when generating keys or address matching), such as “The” or “House” |
special | specials | Special casing word, i.e. a word that is cased unusually but doesn't fall into any of the above categories, such as "PhotoMe" |
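For illustration, the noise and special categories could appear in misc.xml along these lines. The element names come from the table above and the words are the table's own examples; the exact layout is an assumption modeled on the exclusions sample.

```xml
<misc>
  <common>
    <noises>
      <!-- ignored when generating keys or matching addresses -->
      <noise>The</noise>
    </noises>
    <specials>
      <!-- output with exactly this casing -->
      <special>PhotoMe</special>
    </specials>
  </common>
</misc>
```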
If you are editing the custom XML files, use the action and match attributes described below to indicate how each entry adds to, modifies, or deletes from the base data.
Attributes
The attributes "action" and "match" indicate how a customization modifies the built-in data. Both small- and large-scale customizations can be made: for example, individual entries can be added or entire country-specific blocks disabled.
Attribute | Description |
action="modify" | can be applied to any node; always implied if not specified; modify the original node. |
action="replace" | can be applied to any node; delete the original node and its children, and replace with customized data. |
match="true" | always implied if not specified. |
match="false" | can only be applied to a group of two words; prevent the grouped words from matching by deleting them from any group in which the two words appear together. |
match="nothing" | can only be applied to a word that isn't in a group; prevent the word from matching anything. |
match="delete" | can be applied to any node; delete from the base data. |
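As a sketch of a large-scale customization, the fragment below would disable an entire country-specific block in address-custom.xml. It relies on the statement above that match="delete" can be applied to any node; the choice of Australia (AU) is only an example.

```xml
<address>
  <!-- assumption: match="delete" on a country node removes that
       country's block from the base data -->
  <country value="AU" match="delete" />
</address>
```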
Examples
To prevent "Andy" from being considered a variant of "Andrew":
<name>
  <common>
    <male>
      <firstNames>
        <group match="false">
          <firstName>Andrew</firstName>
          <firstName>Andy</firstName>
        </group>
        ...
To prevent "Andy" from being considered a variant of any male name:
<name>
  <common>
    <male>
      <firstNames>
        <firstName match="nothing">Andy</firstName>
        ...
To delete all either-gender qualifications:
<name>
  <common>
    <either>
      <qualifications match="delete" />
      ...
To replace the default list of company names with one of your own:
<organization>
  <common>
    <names action="replace">
      <name>Syniti</name>
      <name>helpIT</name>
      <name>360Science</name>
      ...
    </names>