|Previous Article||matchIT Desktop Index||Next Article|
Single File Deduplication
This section of the guide takes you through the process of deduplicating a single file. matchIT works on a copy of your data, which is in DBF format. To create this copy, you need to Import your data into matchIT.
The easiest and quickest way of importing data into matchIT is by using the Setup Wizard. The wizard can be initiated from the single file button in the Wizard Pane.
The wizard builds a database inside matchIT, into which your data is copied.
To begin with, you are asked which file you wish to import from. The Setup Wizard initially defaults to looking in the matchIT\Import directory, but a different drive or directory can be selected. The Setup Wizard helps to import any of the common text file types as well as native Access, Excel and ODBC data sources.
If a file is not displayed, you can select All Files from the drop down (*.*). Alternatively, you can select DBF Tables, Access Databases or Excel Worksheets from the Files of Type drop down list, the appropriate files are then displayed. To use ODBC you must set up an ODBC connection first.
To let you follow through a worked example, we have included a file called EXAMPLE1.TXT in this directory. This is a file of test data, using fictional data made up for the purpose of demonstrating matchIT. This section of the guide assumes that you are using that file. You can, of course, use your own data instead; this guide will be most relevant to data that contains similar information to the example data, but you should be able to interpret these instructions as appropriate.
First, select the Single File button on the left hand side of the matchIT user interface. Then use the file selector button to open the file selector window.
Either highlight EXAMPLE1.TXT and click the Open button, or double click this file, then select Continue.
Single File Wizard Operation
First, the Setup Wizard will determine the type of data in each field. Here you have two choices:
This is the default option. The wizard will automatically attempt to determine the data type. (See Automatic Import Wizard)
Choose this option if you want to manually specify what the various data items represent. This is usually appropriate if your data file is in a Fixed-Width format, unusual in its layout or content, or if the Setup Wizard has already failed to determine the data type. (See the Manual Import Wizard section in the Going Further with matchIT guide.)
Automatic Import Wizard
The Field Layout window is displayed, immediately following your selection of an import file. You should always check that the Setup Wizard has correctly identified the contents of each name or address field, using the scroll bars to view more fields and records (the Continue button is not enabled initially, as the field names must be reviewed first).
If the Setup Wizard has not understood what is in a column correctly, right click on the field name (i.e. the column heading) and then select Rename Field. The appropriate field name can then be chosen from the drop down list. For all the name and address fields in a data set, you should use the names that are shown in the field name drop-down list. matchIT refers to these field names as part of its processing, so all the address lines should simply be labeled ADDRESS LINE - matchIT will number them, and compare across all the address lines when checking for duplication.
If a data file, with a unique reference number for each record in the file, is being used - you can right click on the column heading for this field and label it Unique_Ref - this will enable you to export reference numbers for matching pairs to a file. Enabling you to remove duplicate records from your source database. The data in the field must be genuinely unique within the input file, or else you will receive an error message after import.
If you have data items in your own data files that are not relevant to deduplication or mailing, you can enter your own field name by clicking on the Rename Field button. First, check that the top dropdown list (Change field labeled) shows the field name that you want to change. You can then type the appropriate field name in the Enter New Field Name box.
For Comma or Tab Delimited input files, the Setup Wizard uses a default width of 40 characters, which it increases if it thinks necessary. However, you should change this if the actual maximum width of that field is greater or significantly less than 40. In EXAMPLE1.TXT, you could change the width of TELEPHONE to 20 or 25 characters, or leave it at 40 characters if you are not worried about saving disk space. The Setup Wizard uses a sample size from the input file, which is the first 1,000 records (by default). To see the longest string of data in a field, you can scroll to the bottom of the record list to view a "record" highlighted in pink, which shows the longest example of each field in the file sample.
If the input file is a fixed width text file, you can use the Split Field or Combine Fields buttons to correct the Setup Wizard (if it has not been able to work out where one field ends and the next begins). When splitting fields, click in the data window at the point where the field should end, before you select Split Field. You will then have to use the Rename Field button to correct the field name.
You can use the Add Field button to create extra fields which matchIT can populate later on e.g. if your input name is of the form ”Mr J Smith” or ”John Smith” all in one field, you can add fields for Prefix, Forenames and Surname, which matchIT will populate automatically on Import with the appropriate components of the name.
When you have reviewed the field names given to your data, select Continue. The next dialog box displayed allows you to tell the Setup Wizard what to do with the data, after it has been imported.
The next dialog box shown allows you to tell the Setup Wizard what needs to be done to the data after importation.
Select options as follows:
- matchIT needs to Create match keys if you are going to find duplicates, so this option will be selected by default, but you can deselect it if you do not want to dedupe this file or compare it with another file.
- Click on Generate Additional Fields to open a window offering further features which matchIT can apply to your data during the import process. The options available cover creating salutations, generating quality score fields, generating PAF description fields (if you have addressIT) and options to split and combine name elements.
- matchIT Pro also allows you to "proper case" the name and address i.e. convert data intelligently to upper and lower case, so tick the Correctly case name and address option.
- You should decide on which level of deduplication your data needs, using the Select matching level section. Because this data has a COMPANY field, matchIT assumes this is business data, so the options are: dedupe down to one record per Contact, one record per Business or one record per Address. With consumer data the options are: dedupe down to one record per Individual, one record per Family or one record per Household. Leave the setting as ’Contact’ for this exercise.
- Options allows you to change the many options that matchIT utilizes, such as the default salutation, whether to automatically exclude records that have suspect data in them (e.g. if someone has entered "New address needed" in the address lines) and a host of lower level options.
- The file selector button will let you overwrite the default file name and destination. Enter a name of WORKFILE: this new file will be saved in DBF format.
For this example file, after the options have been set, select Continue. When you become more familiar with matchIT, you can specify different matching criteria from the default settings - see Going Further with matchIT.
Now matchIT will start processing. This involves:
- Importing the original data
- Generating the key fields for finding duplicates
- Enhancing the data as requested e.g. salutations, casing, relocating Postcode and country data to fixed fields
- Locating the duplicates within the file
- Generating first level reports.
This stage should be very quick on the first example file, but is dependent upon file size, hardware, and software configurations. matchIT’s performance is benchmarked at a rate of several million records per hour.
There are many quick and simple things that you can do to speed things up. Some of them apply just to specific areas, such as finding matches - if you want to know more about tuning specific areas of matchIT, please contact us by submitting a ticket.
The central feedback window will keep you informed of progress.
The next screen displayed is the Matching Results dialog:
To look at the duplicates matchIT has found, choose Verify Matches. The first pair of duplicates is displayed in the upper right corner, shown side by side:
This screen allows you to display each pair of potential duplicates in turn. The matching score is shown in the upper right corner, as well as to the left of each set in the list view. The matches are shown with the least likely dupes (lowest match scores) first, as you may only want to review the lower scoring pairs.
You can decide interactively which of the pair to retain, cut and paste between records, or simply verify whether the duplicates are candidates for flagging. Clearly, you don’t have to wade through all the pairs of duplicates in this way, but Verify Matches is a good way of establishing the correct threshold above which it is safe to globally flag the duplicate records.
Some of the fields in the pairs of records are color-coded - this is to highlight where matchIT thinks the main differences lie between the two records. Those that are marked in red show fields which are clearly different in content, and those highlighted in yellow show where one field’s contents are contained within the corresponding field in the other record.
The first pair displayed has a score of 80. This is because pairs of records that have scored less than 80 are not thought by matchIT to be duplicates. NB: The match score is not a percentage, but a grade to help separate out true matches from false. If there is a ’gray area’ for the duplicates that have been found in a given file, with the default Matching Weights those duplicates will normally be in a score band of 80-85. Jack Whitson and Mr J Watson, both at the same address and zip would be shown as a match scoring 80, when they could be different people. However, this gives you the chance to go for marketing ’overkill’ and flag one of these records, rather than risk sending someone two communications, one of which is wrongly addressed.
If any pair shown is not a true match, select the False Match button to unflag the match.
This button will flag the right hand record in the pairs view. The flagged record is then grayed out and the icon changes from Flag Record to Restore Record, which allows you to change your mind.
Use the Quick Find button to quickly find duplicate records that contain a certain piece of information.
Use the forward and backward arrow buttons to scroll through the pairs of duplicates.
There is also a Next Score button, to help you decide on the threshold score from which to flag duplicates globally.
If you right click on either record, you will see additional options for copying or combining data from the two records. Select Done when you have looked at the duplicates from each of the score bands.
Now select View Matches from the Matching Results dialog box. The following dialog is displayed:
This dialog is used to indicate which matches you want to print (or preview) and in what order. Change the Report grouping to Sets from the default Pairs, but leave the Report Format as Business with Destination set to Preview. Click on Continue.
A report is displayed listing all matching records, showing the unique references, name, address etc. You can click on any part of the report to zoom in or out and use the red arrows at the top of the screen to move through the pages of the report.
On the print preview toolbar, click:
to print the report on the default printer and close it, or
to close page preview without printing.
You can customize any of matchIT’s reports via the Edit, Output Layouts menu e.g. to show additional data items, or to change the formatting of the report. Please see Going Further with matchIT for further information.
Once back at the Matching Results dialog, select Flag Matches from the options at the bottom of the window. You are prompted to specify a score, equal to or above which it is safe to flag duplicates - so leaving the minimum score at 80 will flag all duplicates found. Once your flagging score is set, select the Flag button. A dialog will then be displayed showing the results of the flagging step.
The records flagged will be excluded from any deduped file output, but will be included when you want to export information about matching records. You can always undo the results of the Flagging Step by unflagging ”flagged” records either individually via Verify Matches or globally via Database Utilities.
With a bit more experience using matchIT you will see how it is possible to flag all the closest duplicates automatically, and come back to deal with the less obvious ones interactively - perhaps after getting advice on how to decide which of a given pair to flag, from the client or department whose data it is you are processing. You can also use the Intelligent Data Merge option, which will ensure that any data present in the record being flagged, but missing from the record being kept, is copied over. This option is subject to rules that you can specify in the Setup, Matching Setup menu.
At this stage in the exercise, we have imported our data, found the duplicate entries and flagged them. Now we want to output a cleansed list. If this file is to be used for a ”live” job, you should now select the Reports/QA Dashboard button from the screen above, to examine summary information about the file, ”drill down” to view any suspect records, view data in different orders, output ranges or samples of the data.
If you choose Records in Matched Sets, matchIT will export unique reference numbers for matching pairs to a file. This will enable you to remove duplicate records from your source database and reassign ”orphan” records, using a program external to matchIT. Choosing Flagged Records outputs only data for the records marked as ”duplicates”.
For this exercise, choose to output the Deduped File. This displays the Produce Output dialog which has tabs, showing Main Options, Mail Sortation Options (only enabled when the sortIT module is activated and a mail sortation has been run) and Campaign History (only enabled in matchIT Campaign).
Output File Layout
matchIT allows you to save output layouts for future reuse. To create a new layout, select New in the Output File Layout section.
The resultant field picker (see below) shows you all fields that are available in the currently selected file (WORKFILE.DBF) in the left hand pane, and all the fields that have been selected for output in the right hand pane.
You will notice that matchIT uses more international labels now for the name fields i.e. FULLNAME becomes ADDRESSEE, LASTNAME becomes SURNAME, FIRSTNAMES becomes FORENAMES and TITLE becomes PREFIX. We could output the ADDRESSEE field from the original data, but when we asked matchIT to generate a SALUTATION in the Setup Wizard, we also generated a CONTACT field. Since this is a standardized name field that has been designed as the first line of the addressed item, this field should be the first that you select. It is one of the last fields in the database, so scroll down the available fields list until you see it, and double click the field name to move it over to the right hand pane. We will also select the SALUTATION field near the bottom of the list, because that is appropriate for the start of a personally addressed letter. Next select the COMPANY field, all the address lines and the ZIP field from the top of the available list.
Choose the Save Layout button and name it WORKFILE.OPL. Click Done to return to the Main Options dialog.
Although matchIT defaults to saving the file in its original format, we can change the output file format here. Drop down the Output Format list if you want to select a different format from COMMA - this is a comma delimited file without a header record, the same as the input file. If you want a header record to be inserted in the output file, to label each field, select CSV near the bottom of the drop down list. Other common options are Tab delimited, SDF (fixed width text), DBF and Excel (which is limited to 65,000 records). Microsoft Word can link to a comma delimited file as a data source for mail merge, so COMMA or CSV is a good choice for our exercise.
Next, click on the file selector button to the right of the Destination File Name to choose a different destination directory and name, if you wish and select Generate Output.
matchIT now generates the output file and (when finished) displays a message box that shows how many records have been output.
Now you have finished the whole process! Of course, this is a simple example, but matchIT is extremely flexible and sophisticated if you need it to be. As you become more experienced, you can fine tune matchIT to find all the duplicates in any data, no matter how badly structured or keyed. You can also find the common entries in multiple files (introduced in Exercise 2), automate frequent or complex jobs using the Job Script function of matchIT Pro and output address labels or mail merge pages direct to the printer.
|Previous Article||matchIT Desktop Index||Next Article|