Previous Article | matchIT SQL Index | Next Article |
So if you’ve set up a find matches or find overlap, you’ve seen that the keys were chosen for you depending on your matching level.
For starters, you can select a key up top, then view the drop down below
You’ll notice it lists both key fields and input fields
The key fields all start with an mk*, whereas the input fields refer back to the fields you mapped back in generate keys,
90% of our clients are fine using the default keys, maybe with a minor tweak such as adding a telephone or email to the keys
So why don’t we take a couple records and see what kind of keys they generate.
First we’re going to start with the name related fields
They consist of
mkNormalizedName – Normalised Name
mkName1 – phonetic last name
mkName2 – phonetic first name
mkName3 – phonetic middle name
mkNameKey – phonetic last name plus initial
mksuffix - Suffix
mkGender – Gender
So lets say we have two records that we’re matching
Record A
Fullname : MISS L FOLKES
Company: CUTTING EDGE DIRECT
Add1: PHOENIX MILL LONDON ROAD
Add2: FAR THRUPP
Add3: STROUD
Add4: GLOUCS
Postcode: GL5 2UB
Record B:
Firstname: Linda
Lastname: Foulks
Company: Cutting Edge Direct Ltd
Add1: Phoenix Mill
Add2: London Road
Add3: Stroud
Add4: Gloucs
Zip: GL5 2UB
These are obviously the same person, but its not so easy to line up considering the concatenated data in Add1 in the first address and the additional Far Thrupp element, which is why we have the keys.
Mknormalizedname is the normalised version of the name, so even though its first/last and fullname in another, it will be more consistent here. When we’re matching on name we put the emphasis on the last name, then the first name, then the middle name. we also will convert it to its nickname.
This field is important because when we do the scoring on name, this gives you insight into how the software sees the name.
So MISS L FOLKES becomes
FOLKES,L,
And
Firstname: Linda
Lastname: Foulks
Becomes
FOULKS,LYN,
mkName1 -
This is the phonetic of the last name – we follow the same order as the normalised name, that’s why name1 is the last name, because its what we are putting the most emphasis on.
Both have the same phonetic of
fylk
we use our own proprietary algorithm to generate these phonetic keys, the main thing to point out here is the y represents a vowel sound
mkName2 –
For the L, we’re just carrying through the first initial
For Linda we end up with a phonetic first name of lym
For the L, we’re just carrying through the first initial
mkName3 –
This is the phonetic of the middle name, in this case though, neither record has a middle name
mkNameKey – phonetic last name plus initial
in this case, the name key is the phonetic of the last name, and the first initial
its fylkL for both records
mksuffix –
if we detect a suffix of some sort, that’s going to go here, commonly Jr, Sr, 2nd, 3rd, etc
mkGender – Gender
This is the inferred gender, we use our names and words lexicon that has a list of common first names and their inferred gender, so we know for sure that Linda is female, and the MISS L FOLKES, the Miss is a common female indicator, so we will return one of four codes
M = Male
F = Female
E = Either (Sam smith – could be short for either Samantha or Samuel)
X = conflicting ( ie: Mr Christina smith )
So that covers the name fields
Now for company related fields,
Lets go through the list
mkNormalizedOrganization
mkOrgname1
mkorgname2
mkorgname3
mkorganizationkey
mkNormalizedOrganization
This field is important because when we do the scoring on name, this gives you insight into how the software sees the name.
Let's say we have two companies
The Palmer Air Charters service company
PAC inc
With the first example, by default we see ‘The’ as a Noise type word, and strip it out, we see ‘service’ and ‘company’ as business type words and strip those out as well. You can manipulate what we ignore or don’t ignore by reviewing the changes suggested in the company only matching video.
With companies we look at them left to right, as opposed to names where its last, first, middle.
So the first example becomes PALMER,AIR,CHARTERS
In the second example, we just recognise inc. as a business word, and strip that out, seeing the PAC in all caps, by default we assume it’s an acronym, so we simply end up with
PAC,, with blanks for the 2nd and third words
mkOrgname1
in this case for the first example, it’s the phonetic of palmer
pymy
for the second example, since we recognised it as an acronym, we just pass through the PAC
mkorgname2
y
in this case the phonetic of Air – which is basically one giant vowel sound
blank for the second example, because we’re ignoring ‘inc.’
mkorgname3
in this case, the phonetic of Charters
Fyty
And just blank for the third example
mkorganizationkey
a concatenation of the first 5 characters of mkorgname1 + first 5 of mkorgname2
This is commonly confused with the normalised organization, if you want your matches to be much tighter, you can use mknormalizedorganization instead or just mkorgname1 + mkorgname2 separately,
Address related fields
mkPremise
mkphoneticstreet
mkphonetictown
mkAddresskey
mkflatno
mkpostout
mkpostin
so lets take these addresses
A)
Add1: PHOENIX MILL LONDON ROAD
Add2: FAR THRUPP
Add3: STROUD
Add4: GLOUCS
postcode: GL5 2UB
B)
Add1: Phoenix Mill
Add2: London Road
Add3: Stroud
Add4: Gloucs
Postcode: GL5 2UB
mkpremise
There is no premise in this case but the house number isn’t always in the same spot so we look at the address lines as a whole to identify it
mkPhoneticstreet
Similar to the premise number, the street doesn’t need to be in the first address line to be identified. In this case, it has used the first two words 'Phoenix Mill' because it has identified the Road element and thinks the whole 'Phoenix Mill London Road' is the street, words like road/street/circle are generally noise words. In the second example, because the elements are split across two input fields, it identifies 'London' as the street. Noise words are identified because they are held in names.dat which by default doesn't contain 'Mill'. You can of course add it but since this is a potential Surname, it might be quite dangerous to do so. If you do add words to names.dat, please ensure you perform additional testing of the results to ensure there are no unforeseen consequences.
mkphonetictown
There is a file in C:\matchIT SQL\config\dataFiles\uk called Towns.dat which contains all the post towns in the UK, in this case we’ve identified Stroud because its in the list in Town.dat.
Mkaddresskey
By default, the phonetic street and town can both be up to 8 characters long, if you want to pare that town and be a bit looser, but still want to use a combination of both, then the address key is a concatenation of both, starting the first 4 characters of the phonetic town, then the first 4 characters of the phonetic street.
Mkflatno
A flat, or apartment number
Generally, if there’s some kind of numeric, or alpha numeric number that’s following the street
This one key is not very commonly used except in exact keys, more clients prefer changing settings like ‘must match premise’ and ‘no one empty premise’ to keep different apartments in the same building separate.
Mkpostout
For a UK Postcode “KT22 8DY”, the matchIT API populates mkpostout with “KT22”. Which may also be referred to as the ‘outward part’
For US addresses, the mkpostout is the first five digits of the zip code
So with a zip of 28401-4273 , the 28401 would be in the mkpostout
Mkpostin
For a UK Postcode “KT22 8DY”, the matchIT API populates mkpostin with “8DY”, or the inward part of the full valid postcode
For US addresses, the mkpostin is the last digits of the 9 digit zip code, if just a 5 digit zip the mkpostin would be blank.
So with a zip of 28401-4273 , the 4273 would be in the mkpostin
Optional fields
mkTelAreaCode
mkTelLocalNumber
mkFaxAreaCode
mkFaxLocalNumber
These won’t be available unless you mapped a telephone or fax when you first set up your generatekeys task. You can ensure these fields are populated by checking the advanced settings under generate keys.
In the case of UK phone numbers
For “01372 225 904” or “01372225904” the matchIT
API populates this with “01372”.
For “01372 225 904” or “01372225904” the matchIT
API populates this with “225904”.
In the case of US phone and fax numbers
For 914 – 600-7243
The 914 would be the area code part
The 6007243 would be the local number part
New Company Fields added with 2.3.1:
For consistency with the Real Time Processing for SQL Server deployment new keys were added, in the output tab of generatekeys if you click advanced, then you can pick additional key columns to output.
Lets start with company fields - for this exercise we'll still use
The Palmer Air Charters service company
PAC inc
mkNormalizedOrg1
Unlike the phonetic of the company(mkOrgName1) - this is instead the normalized version - think of this as a parsed out version of mkNormalizedOrganization
So for 'The Palmer Air Charters service company' , it will be 'Palmer' - as we're ignoring 'The' as a noise word so 'Palmer is the first significant word
For 'PAC inc' it will be 'PAC' as we recognized that as an acronym
mkNormalizedOrg2
This is the Second significant word from mkNormalizedOrganization
So for 'The Palmer Air Charters service company' , it will be 'Air' - as we're ignoring 'The' as a noise word so 'Air' is the Second significant word
For 'PAC inc' it will be blank as we recognize 'inc' as a noise word
mkNormalizedOrg3
This is the Third significant word from mkNormalizedOrganization
So for 'The Palmer Air Charters service company' , it will be 'Charters' - as we're ignoring 'The' as a noise word so 'Charters' is the Third significant word
For 'PAC inc' it will be blank as there was just the initial acronym
mkOrgAcronym
This is the acronym we've derived from the first 2 or 3 significant words in the company name
For both examples the acronym will be 'PAC' as we carried through the first letters from 'Palmer Air Charters' and recognised PAC as an acronym already in the other record, this may make it easier to line up some outliers when the address isn't as reliable because of vanity names for cities, which would throw off the addresskey that is generated.
If the company is a single word, an acronym will not be generated for it as a single letter, if a company has more than 3 significant words, an acronym is only generated off the first three.
mkAlphabeticOrganizationWordA
This is the first alphabetic Organization word- this may allow you to identify transpositions in company names when the order of the words isn't as consistent for some examples.
for our 'The Palmer Air Charters service company' example it actually is Charters, as air wasn't considered significant as it was less than 4 characters, for 'PAC inc' it was blank as the acronym isn't considered when populating this field.
mkAlphabeticOrganizationWordZ
This is the Last alphabetic Organization word- this may allow you to identify transpositions in company names when the order of the words isn't as consistent for some examples.
For our 'The Palmer Air Charters service company' example it actually is Palmer, as Service wasn't considered significant as it was a Noise type word lookup in the names and words, for 'PAC inc' it was blank as the acronym isn't considered when populating this field as it also falls under the 4 character minimum.
New Name Fields added with 2.3.1:
mkFirstNameIni
The initials of the normalised first name
For 'Mr. W. R Dayton Jr' , as well as for
Firstname: Bill
Lastname: Deighton
it would be 'W' , as we normalise Bill to Will then take the initial of the normalised version.
mkNormalizedFirstName
This is the normalized first name
For 'Mr. W. R Dayton Jr' , it would be 'W'
For
Firstname: Bill
Lastname: Deighton
It would be 'Will' , as Bill normalised to Will
mkNormalizedMiddleName
normalised second name, or mid
For 'Mr. W. R Dayton Jr' , it would be 'R'
For
Firstname: Bill
Lastname: Deighton
it would be blank
mkNormalizedLastName
normalised last name
or 'Mr. W. R Dayton Jr' , it would be 'Dayton'
For
Firstname: Bill
Lastname: Deighton
it should be 'Deighton'
In this case those examples don't line up, unlike the mkName1, as its using the last name as is instead of a phonetic version.
New Other Fields added with 2.3.1:
mkEmail
email address with casing standardised
mkEmailUser
user name portion of the email
mkEmailDomain
domain portion of the email
Previous Article | matchIT SQL Index | Next Article |