Previous Article | matchIT SQL Index | Next Article |
Once you are happy that all of the necessary changes have been carried out, you are ready to execute the sql script FindMatches.sql in SQL Server Management studio.
As you will be able to see from the script, and as is shown in the code snippet in the previous overview article, this script runs 4 stored procedures. First the msp_BulkGenerateKeys procedure will run, creating the keys table called by whatever name it was given in the specified data source in the XML. Secondly, the same procedure will generate the keys for the data and populate the keys table. Thirdly, the msp_FindMatches procedure will run and produce 2 tables – ‘large_clusters’ and ‘matches’ in the specified data source. The former table stores logs of any instances where a cluster of matches has exceeded the default maximum cluster size (set in the <generalSettings> section of the XML). The latter table stores records of matches that occur, their associated scores at each matching level as well as a column containing an integer indicating which levels the matches passed at (with regards to the minimum scores set in the <generalSettings> section). The value in the last column is made up of the sum of values which correspond to the levels at which the match passed at, which are as follows
- Individual = 1
- Family = 2
- Household = 4
- Business = 8
- Custom = 16
So for example, if a match contained a value of 9 in this column, that would mean that it passed at both individual and business level (1 + 8).
Any errors that occur with procedures will appear in the output pane in SQL Server Management Studio. From the error message you will be able to track down what the error applies to, and modify the configuration / sql script accordingly.
Lastly, the msp_GroupMatches will create a results table where anything that matched will be presented in a more sensible manner so its understandable by an end user, stacking duplicate records and their master on top of each other and carrying through the original data as well as a matchref column, which indicates the duplicates.
Once you have got this working and have a good idea about the configuration, you can have a look at running the other .sql scripts that are available in the demo folder, namely DedupeExample.sql and FindOverlap.sql. You will see in each script from the lines that start with EXEC what stored procedures are used. Note that in the case of overlapping in these two scripts, both data sources in the configuration are used, and you have the freedom to specify which data source to use as the main file and which one to use as the overlap.
Depending on your situation, you may wish to amend the sql scripts provided or write your own sql script from scratch to process your data. An overview of all stored procedures available is available in the next section which starts here.
Previous Article | matchIT SQL Index | Next Article |