Exact matching isn’t always necessary but is normally suggested as it’s an easy way of improving performance in most cases.
When should I incorporate exact matching?
- You have a high duplication rate – exact matching is a way of getting ‘easy’ matches out of the way
- more granularity in results – exact matching provides an extra tier of scoring, so even if a match scored 130 (the highest score with our defaults) – an exact match would be an even better match than that, instead of being mixed in
- you want to match on a single field, such as an exact email match, or an account number or invoice that would normally not qualify as a match from a fuzzy perspective
Why shouldn’t I run an exact match
- exact matching selects all the columns and their data into memory, this may be slow depending how large your table is and how much tempdb space is allocated on your SQL instance
- you have lots of blank fields or junk in your data, in which case the exact matching will skip over most matching due to the columns not being ‘optional’ by default, or may over report matches that normally would be skipped, we expect the data to be well populated by default.
As a note, anything found by an exact match is removed for processing for the fuzzy match, so if you're too aggressive with the exact match by say doing an exact match on email, you could just end up introducing false positives if your data isn't clean. We normally recommend a single exact match key, that encompasses all the significant data points.
Let's add exact matching to a process we set up previously.
1. Exact matching should be run between generate keys and findmatches – there are two tasks that would need to be added
2. Assuming you're editing the package from the previous article, the first thing you'll need to do is delete the connector between generatekeys and findmatches
3. Drag and drop the tasks, connect them in the order like below. You can highlight and arrange using the format menu above, you make them the same width then center them, although if you want to rename the tasks you should do that before.
4. Open up findexactmatches – notice it reads in the task from the previous source
5. If you want to use the default key, you can just save and skip to step 12,
but if your goal is more granularity, I would suggest altering the existing key
Notice up top that a single key is listed, unlike the fuzzy keys where its only one or two fields, this concatenates a majority of the key fields.
Anything that starts out with an mk is a matchkey field, and has already been standardized by mSQL. You can think of the list of columns within a single key as ‘AND’ statements, whereas having multiple keys up top are like “OR” statements
6. In this case we’ll stick with the single key, but we’re going to alter the default key to use your input data instead (all those columns you mapped during generatekeys)
Go through and alter the default key to use a majority of you input fields, when you click on the key up top, it lists the individual fields on the bottom, you can then replace those mkfields with your input fields.
NOTE: IF YOU SEE TOWN/REGION POSTCODE – those are equivalent to CITY/STATE/ZIP
If you’re doing business level matching, you only need to map organization and any address lines and town/region/postcode
If you’re doing individual level matching – you only need to match fullname (and/or prefix/first/initials/last/suffix ) and any address lines/town/region postcode.
7. Once you’ve chosen all your keys, delete any extra ‘mk’ fields, then apply the PUNTRIM function to all the keys
this helps with simple things like
123 MAIN ST APT 1-A
123 main st. apt1a
which would be missed by standard SQL query
8. Your input data isn’t standardized which is why we suggest applying the PUNTRIM function, whereas an ‘mk’ field has already been standardized by mSQL and shouldn't need a function applied to it.
9. You can also set the ‘optional’ selection to yes or no, this is basically stating whether it's optional that the field is populated, so in this case I would consider address2 optional – as that isn’t always filled in because not every record has a suite, but fields like the name and zip are not optional, as those are required to count as a match.
10. If you wanted to do an exact email match or additional exact match keys, you could also add that – but we’d suggest you QA your results if that’s the case, or at least not choosing yes on the ‘optional section’, having a match of email only may cause false positives if you had say two different people with an info@360Science.com , so you'd either stick to the one exact match key and more easily identify that during fuzzy matching(reccomended), or make the key more concise by adding name in addition to email as part of the exact key.
11. Save the task
12. Open up groupexacmatches, then save it, no changes needed
13. Open up findmatches – change the source to group exact matches (this isn’t needed if you are starting a new find matches), also ensure that ‘exclude exact matches’ is ticked
14. Open up groupmatches
set Merge Exact matches to ‘all matches’, save (always set it to all if you're not using the default exact key)
Exact matching is now incorporated into your process.