Previous Article | matchIT SQL Index | Next Article |
Why did this score 84 on individual
Example 1:
Record | A |
Fullname: | MISS L FOLKES |
Company: | CUTTING EDGE DIRECT |
Add1: | PHOENIX MILL LONDON ROAD |
Add2: | FAR THRUPP |
Add3: | STROUD |
Add3: | GLOUCS |
Postcode: | GL5 2UB |
Example 2:
Record | B |
First Name: | Linda |
Last name: | Foulks |
Company: | Cutting Edge Direct Ltd |
Add1: | Phoenix Mill |
Add2: | London Road |
Add3: | Stroud |
Add4: | Gloucs |
Postcode: | GL5 2UB |
And we’re trying to match them on name or company.
The first question we should be asking is actually if they're even being compared in the first place. Because remember our matching works as a two step process, we line it up on the keys, then score it.
So we have 3 default keys
key |
A |
B |
mkname1 + mkpostout |
fylk + GL5 |
fylk + GL5 |
mkname1 + mkphoneticstreet |
fylk + fymyksmy |
fylk + lymtym |
mkpostout+ mkpostin |
GL5 + 2UB |
GL5 + 2UB |
Because the records have the same sounding last name and the same outbound section of the postcode, we able to find it as a potential match on this key.
But because they have the same sounding last name and but slightly different street names, we weren't able to find it on the 2nd key
Finally because we had the same postcode on both records, we also able to identify the possible match on the third key (although because we found it on the 1st, we’re not going to compare it twice)
So we know its being compared, often clients may wonder why something isn’t scoring, but the actual issue is that it wasn’t being compared in the first place, so that should be the first thing you check.
So now that we know its being compared, let's find out what it's going to score.
By default, the score is a cumulative score, we don’t work on percentages, we try to look at it more like a human would, instead of a machine.
There are three main components to the score
Name
Address
Postcode
1) Name Scoring
With the Name, we want to look at the mknormalizedname, as mentioned previously, we look at it as last/first/middle.
So we have:
Fullname : MISS L FOLKES
Firstname: Linda
Lastname: Foulks
Which normalised to:
FOLKES,L,
And
FOULKS,LYN,
There’s a matching matrix xml that we have with many pre determined decisions, so when we look at this from left to right we see this pattern:
The last name sounds approx
The first initial is L which is the first letter of Lindal, so we assume that’s equal.
And neither has a middle initial, so this is both empty
So
Last = Sounds Approx
First = Equal
Middle = Both Empty
If we look at the matrix stored by default in
C:\matchIT SQL\config\matchingMatrices\individualLevel\namematchingmatrix.xml
If we navigate through it, we see the pattern it follows, and the associated score. So in this case, the possible score is equal to 25 points.
Whereas if we had a Will Dayton and Bill Dayton, it would be Equal, Equal, both empty, and score sure, which is 60 points
Some advanced clients will replace the sure/likely/possible entries in the matrix with decimal values, to get more granularity in their results
2) Address
When we score on the address, we look at the address lines as a whole, we don't explicitly match address1 to address1, address2 to address2, etc...
Record A:
Add1: PHOENIX MILL LONDON ROAD
Add2: FAR THRUPP
Add3: STROUD
Add4: GLOUCS
Record B:
Add1: Phoenix Mill
Add2: London Road
Add3: Stroud
Add4: Gloucs
We use our own proprietary algorithm that looks across these columns as a whole.
In this case, because Phoenix Mill and London Road are contained in both addresses and the town and county's are the same, it scores 29. If less than 50% of it was right, we’d end up just throwing it out and not scoring at all.
Address scores range from 15-30 by default in the UK
Or 20-40 by default in the US
3) Postcode
In this case both records have the same postcode
If you look at the weights under postcode, you’ll see that the score for a sure match is 30. You can access these weights from findmatches or find overlap tasks, each matching level has its own set of weights.
Sure:
A sure score is when you have two records with the same postcode
Likely:
UB7 7PQ matching
UR7 7PQ
Where only postin is the same = likely = 20
Possible:
M2 8HG matching
M2 = possible = 15
4) The Cumulative Score
So names are based on the matrix, address is based on an algorithm that looks at the address lines as a whole, and the postcode has its own separate rules.
Once we go through all 3 we add up the score
Name = 25
Address = 29
Postcode = 30
Total = 84
To get more insight into why your matches scored what they did, ensure you’re breaking out the component scores, its an option in your findmatches/findoverlap task when you’re showing advanced options.
Previous Article | matchIT SQL Index | Next Article |