Previous Article | matchIT SQL Index | Next Article |
So you’ve loaded your data into a staging database – in this case we’re using matchIT SQL demo data – you start out looking like this.
And then you run a package similar to this
Afterwords – your database looks like this, where some extra tables have appeared.
Let's go through each task step by step, and go over what these tables are.
1) Generate Keys
Generate keys is the first one that’s run, it creates two tables, by default they re-use whatever your main table’s name is, but you can rename these tables in the generate keys task, or alter the automatic table suffix’s we use by editing the template.xml.
The keys table and output table are pretty straight forward, if you read previous articles, the basic thing to understand is that the keys are our way of leveling the playing field, and the output table is useful if you want to parse the names for example.
These will get overwritten every time you run a project, although to save on processing time if your table stays the same then you can uncheck the overwrite existing keys for the task where the table is mostly static. Then it will only drop and recreate the keys table if the counts are different between the source and keys table. Its common for an overlap process to have it unchecked on your larger dataset, but leave checked for the smaller incoming dataset.
2) findexactoverlap
this produces a single table, this will get overwritten every time you rerun the package.
If you’re using the default it will be something like
*MainTablename*_exactOverlapRefs
it has 4 columns, the ID, record1, record2, and the key
The ID is just an autonumber, record1 refers to the id of a record from your main datasource, record2 refers to the id of the record from the overlap datasource, the key refers to key number it was first found on.
Please note the exact overlap will only report 1 to 1 or one to many relationships by default
3) groupexactoverlap – this produces two tables
*mainTableName*_exactOverlap
*overlaptablename*__exact__*guid*
The first table is just where we query all the results and build some overlap groups; this table isn’t meant to be used by the end user.
The second table with the __exact__ in the middle is a working table as well, but this table is significant in that it's telling the next steps which records were already found as matches. We append the guid to the end of the table name to distinguish it, as this table could otherwise be shared when you’re running multiple processes.
0 = use for fuzzy matching (findoverlap)
1 = found as exact match, ignore when doing fuzzy matching
4) findoverlap – this produces two tables
dbo.example1_fuzzyOverlapRefs
largeOverlapClusters_d80abc63-34a9-4878-b343-f71872cddd63
We see it starts out very similar to the exactoverlaprefs with an id and record1 and record2, and a key column at the end.
But unlike the exact overlap, which is a simple yes/no match, with the fuzzy matching we grade the score
We provide you with scores for any matching levels you have chosen, as well as if you’ve chosen to break out component scores which would then be displayed here as well.
The other columns are currently not used by the software.
Column | Description |
ID: | Record ID for each matching pair |
Record1: | This is the ID from the source data |
Record2: | This is the ID from the overlap data. This record belongs to the second datasource specified in the task. |
Level: | The Level column indicates the matching level(s) at which a match was found. If it contains a 1 then the two records match at the Individual level; if 2, then Family level; if 4, then Household level; and if 8, then Business level. Multiple levels are indicated by summing values – for example, 9 would indicate a match at both Individual and Business levels (1+8), and 15 a match at all four levels (1+2+4+8). By default, the Level column is followed by the total score for the four matching levels. These columns are fully configurable within a configuration file. Component scores (for name, organisation, address, etc.) can also be output for any level(s) |
IndividualScore: | Individual level total match score |
FamilyScore: | Family level total match score |
HouseholdScore: | Household level total match score |
BusinessScore: | Business level total match score |
These following columns relate to either master record identification or bridging prevention and shouldn’t be used for any other purpose; they are subject to change in future versions of matchIT SQL | |
MatchFlags: | The MatchFlags column is only used when Bridging Prevention is enabled (see GroupMatches). |
MasterPriority1: | Used for Master Record Identification. |
MasterPriority2: | Used for Master Record Identification. |
AddressLength1: | Used for Master Record Identification. |
AddressLength2: | Used for Master Record Identification. |
Key: | Indicates through which match key (as specified in your configuration file) this matching pair was found. |
The largeOverlapClusters_*GUID* table
This table lists the clusters that contain too many records (i.e. the Maximum Cluster Size has been exceeded). Processing the cluster will therefore be skipped to avoid the stored procedure potentially requiring a significant amount of processing time.
This ideally is blank, if you find matches are being missed and are sure that the keys line up, the first thing you should check is this table, if it isn’t, it might indicate that perhaps you should adjust your match keys or increase the large clusters limit.
5) groupoverlap
Your groupoverlap likely looks something like this
*tablename*_fuzzyoverlap
You’ll have the default output table – where we combine the exact pairs and the fuzzy pairs for you. If you’re a DBA, you may be fine with just this table, or skipping groupoverlap altogether and using just the two pairs tables, it gives you all the matching pairs and you can do whatever post processing you require.
Otherwise we use this table to build any optional output tables, and hopefully save you a little bit of SQL work.
Optional outputs:
*tablename*_overlapPairs - this is a side by side of your two sources.
We start out with the score, which will range from 80 to 120 by default, if the score is NULL, that means its an exact match – an exact match is likely better than any fuzzy match you get, this is commonly overlooked.
Anything with an _1 in the label is from your main datasource, anything with an _2 is from the overlap datasource.
Because the fuzzy matching can allow for many to many relationships, you may need to do your own post processing to choose the best match when there’s more than one.
*tablename*_overlapGroups - at this time, we'd suggest not including this in your output to minimise confusion with the internal dedupe where the matching_groups is useful for providing more clarity for your results.
*tablename*_overlapDuplicates - this is a list of all the records from your overlap data source that link to your main datasource, if all you needed was a list of records that match then here you are, if you’re more advanced in SQL and want to build this on your own, you could just do some select statements off the previous tables.
*tablename*_overlapDeduped - this is a list of all the records from your overlap data source that didn’t link to your main datasource, if all you needed was a list of records that didn’t match then here you are, if you’re more advanced in SQL and want to build this on your own, you could just do some select statements off the previous tables.
6) log table and reports
A) log table - When you’re trying to monitor processes to identify a choke point, or check the status from a remote server, this is great at giving you insight into what’s going on. You can modify the log table name in the template.xml if you want separate log tables for separate processes.
B) reports tables – If you find the reports tables annoying, you can turn them off by disabling them in the template, feel free to contact support if you need help with that. If you followed previous recommendations then you don’t have them in the first place.
They provide items such as gender breakdown, data extraction statistics, # of duplicates, breakdown of processing time by key, # of matches etc.
all of this information can also be extracted by running your own queries against the various matchIT results or log tables.
Previous Article | matchIT SQL Index | Next Article |