Previous Article | matchIT SQL Index | Next Article |
Find Overlap is for matching two separate data sources. So you would be setting up two generate keys, pointing to your two separate tables.
First, let's go over the difference between an overlap vs. an internal dedupe (the findmatches/groupmatches tasks) as this is a common point of confusion.
When you’re matching internally (findmatches), you have these 3 records for example
1 |
GILL LAUGHTON |
GRETCH LTD |
165 CLOCK TOWER ROAD |
ISLEWORTH |
MIDDX |
TW7 6DT |
2 |
MRS J LAWTON |
GRETCH GROUP |
CLOCK TOWER RD |
ISLEWORTH |
MIDDX |
TW7 6DT |
3 |
JILL LAWTON |
GRETECH LTD |
CLOCK TOWER ROAD |
ISLEWORTH |
MIDDLESEX |
TW7 6DT |
These 3 records may not link directly, but if we know 1 matches 2, and 2 matches 3, then we can determine that all 3 go together. And when we link them we give a base score – which is the weakest link in the set. When you’re doing an internal dedupe its all about gaining that single customer view.
Now that is good for an internal dedupe.
Although when overlapping (findoverlap) it’s a different approach, as opposed to linking records together and choosing the weakest link as the baseline, you want to instead choose the best match
So instead let's say we had two tables
Table1 lets say is your existing customer table, and has 3 separate records (this may indicate that it should be deduped, but that may be a separate project)
Miss G LAUGHTON |
GRETCH LTD |
165 CLOCK TOWER ROAD |
ISLEWORTH |
|
TW7 6DT |
MRS J LAWTON |
GRETCH GROUP |
CLOCK TOWER RD |
ISLEWORTH |
MIDDX |
TW7 6DT |
JILL LAWTON |
GRETECH LTD |
CLOCK TOWER ROAD |
ISLEWORTH |
MIDDLESEX |
TW7 6DT |
And we have table 2 – lets say this is a feed of new sales
GILL LAUGHTON |
GRETCH GROUP LTD |
165 CLOCK TOWER ROAD |
ISLEWORTH |
TW7 6DT |
John Smith |
DARCY CONTROLS |
WINCHESTER STREET |
BASINGSTOKE, RG21 7EY |
|
Smith, John | D'ARCEY CONTROLS | 34 WINCHESTER STREET | BASINGSTOKE | RG21 7EY |
Now, we can only link that incoming record to one of our three existing customers.
So instead what we’ll do is findoverlap , it will compare the 3 customer records against our incoming table. In this case – we may have found 2 or 3 separate matches, but in this case – it’s about picking the best match, that way we don’t make a new customer record when we shouldn’t.
But we still have the issue that we’re attempting to link 3 new sales, but only caught 1. So John Smith placed 2 orders, but really should only have 1 account. So, in this case, you should take the records that were not matched in your overlap, and possibly run them through a dedupe internally as well. That way we only create one new customer account, instead of 2.
You shouldn’t rely on just overlap, or just findmatches necessarily. Ideally a combination of both will get you the best end result. It's just a matter of figuring out the order that works best for you.
We do recommend before running an overlap, that at least one of the tables would ideally be deduped internally. Otherwise, you can end up with many to many relationships
Previous Article | matchIT SQL Index | Next Article |