Previous Article | matchIT SQL Index | Next Article |
Why should I run an exact overlap match?
- You have a large number of records that are exactly the same – matches found by using exact matching improves performance because these records are not compared like potential matches found using fuzzy matching
- more granularity in results – exact matching provides an extra tier of scoring, so even if a match scored 120 (the highest score with our defaults) – an exact match would be an even better match than that, instead of being mixed in
- you want to match on a single field, such as an exact email match, or an account number or invoice
Why shouldn’t I run an exact match
- exact matching selects all the columns and their data into memory, this may be slow depending how large your table is and how much tempdb space is allocated on your SQL instance . Or if there’s a large difference in table QTY – such as matching a 120 million record file against an incoming feed of only 20,000 records, you’re likely better going straight to fuzzy matching (find overlap).
- you have lots of blank fields or junk in your data, in which case the exact matching will skip over most matching due to the columns not being ‘optional’ by default, or may over report matches that normally would be skipped, we expect the data to be well populated by default.
You can use the default exact key in that case, or a mixture of source and key fields, although that would only be suggested if you have a high duplication rate.
Exact matching isn’t always necessary but is normally suggested as it’s an easy way of improving performance in most cases.
Ideally at least one of the tables are deduped internally first.
Let's add exact matching to a process we set up previously.
1) Exact matching should be run between the generate keys(x2) and findoverlap – there are two tasks that would need to be added
Findexactoverlap
Groupexactoverlap
If you're adding exact matching to an overlap process that was already set up, then please follow all steps below. Otherwise if you're building the package from the first time then just ignore steps 2 and 10.
2) Delete the connector between generatekeys and findmatches if you're editing an existing task
3) Drag and drop the tasks and connect them, it should be like below
you can align/rearrange using the format menu above if you highlight the bottom 4 tasks, make them the same width then center them if you want to make it look nicer
4) Now open up the exact overlap task
By default, we’re going to choose the larger table as the source, and the smaller table as the overlap table.
If your tables are empty or the same qty, you may need to choose the main and overlap data source yourself, or you can switch the main and overlap at this point, once you choose it here, it will then flow down through the other tasks.
5) When running an exact overlap by default we’re only going to have one to one or one to many relationships written out. This is different behavior from the findoverlap/findmatches tasks as those will write out many to many relationships, depending on your process, this may be something you should be aware of.
6) Now let's look at the keys by clicking next or clicking keys up top or next on the bottom.
When you click the key up top, that lists the selected key in the bottom half.
Think of the list of keys within a single key as an AND statement, all those fields have to line up exactly for it to count as a match. It’s a simple yes/no when it comes to exact matching.
Whereas if you were to click up top to add a key, that would be like an OR statement.
7) If your source data is different between the two tables such as with our example tables, then you’re likely best off sticking with the default key for an overlap. However, if you have company data without contact details then you should remove the mknormalizedname and mkgender key because this key field will always be blank and if both records are blank for any of the keys listed, they will be prevented from matching as exact matches. Likewise, if you only have individual names in your data, you should remove mknormalizedorganization. In both cases above, we also suggest making the mkflatno optional. Making a key field optional means that you allow records where this field is blank in both records to still match, note however that if only one record is blank, they will still be prevented from matching. Flat numbers are often empty in UK addresses so if you don't make this optional, many otherwise exact matching records might be missed.
You can use a combination of key fields (anything that starts with an MK*) and source data, but that will be less efficient as we’ll have to select across multiple tables.
8) Once you have your key set up, then you can save the task.
9) Now simply open up the groupexactoverlap and save it. It automatically reads in from the previous task
10) If you added exact overlap to an existing task, then we suggest checking the findoverlap and setting it to read in from the groupexactoverlap – also ensure that ‘exclude exact matches’ is ticked, otherwise any efficiency gains will be lost.
otherwise continue to set up the findoverlap/groupoverlap as covered previously if this is a new one.
Previous Article | matchIT SQL Index | Next Article |