mHub Usage Process Flow
Data can be fed into the engine from any source, including database tables and text files. mHub is cross-platform and can be used in Windows, Unix and Linux environments; it provides outstanding processing performance and scales to handle extremely large datasets.
mHub uses our leading-edge proprietary matching technology to identify genuine fuzzy and phonetic matches that other matching tools miss. It is designed to be integrated into third-party applications and offers both development flexibility and ease of implementation. mHub is configured through XML, and only a small amount of code is required to run the engine; example code in several programming languages is provided with the product.
Essentially, the calling application is responsible for retrieving data from the source database and passing it to mHub, with mHub passing back matching references according to the matching configuration specified by the application. The caller can then write the matching references back to the source database.
A matchIT Hub engine exposes only a small number of methods, each named for what it does, so that the software is as intuitive and easy to use as possible. Additionally, the interfaces for the different editions (C++, C, C#, Java) are identical.
After instantiating an engine, the application uses ApplySettings() to pass in settings contained within an XML-formatted string. The application can then repeatedly read data from its database table and pass it into the engine using AddData(). To read results, it calls GetNextResult() as and when required (either once all data has been passed to the engine, or periodically to free up the output buffer).
The simple process flow below describes the typical method that an application needs to follow to carry out a single dataset or overlap de-duplication. In the case of an overlap, multiple database or file connections are opened.
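The flow described above (configure, feed records, drain results, write references back) can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's API: StubEngine is a hypothetical stand-in that borrows only the three method names from the text, its signatures and its exact-email matching rule are invented for the sketch, and the contacts table is made up. Python is used purely for compactness.

```python
import sqlite3


class StubEngine:
    """Hypothetical stand-in for an engine; NOT the real mHub API.

    Only the three method names come from the text. The signatures and the
    exact-email matching rule are assumptions made for this sketch.
    """

    def __init__(self):
        self._first_ref_by_key = {}   # normalized email -> first ref seen
        self._results = []            # buffered "ref1|ref2" matching pairs

    def ApplySettings(self, xml_text):
        self.settings = xml_text      # a real engine would parse this XML

    def AddData(self, ref, email):
        key = email.strip().lower()
        if key in self._first_ref_by_key:
            self._results.append(f"{self._first_ref_by_key[key]}|{ref}")
        else:
            self._first_ref_by_key[key] = ref

    def GetNextResult(self):
        # Returns None once the output buffer is empty (assumed convention).
        return self._results.pop(0) if self._results else None


def find_matching_pairs(conn, engine):
    engine.ApplySettings("<settings/>")          # placeholder XML only
    pairs = []
    rows = conn.execute("SELECT ref, email FROM contacts ORDER BY ref").fetchall()
    for ref, email in rows:
        engine.AddData(str(ref), email)          # feed one record
        while (result := engine.GetNextResult()) is not None:
            pairs.append(result)                 # drain the output buffer
    return pairs


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (ref INTEGER PRIMARY KEY, email TEXT, match_ref TEXT)")
conn.executemany(
    "INSERT INTO contacts (ref, email) VALUES (?, ?)",
    [(1000, "johns@example.com"), (2000, "JohnS@example.com "), (3000, "other@example.com")],
)

pairs = find_matching_pairs(conn, StubEngine())

# Write the matching references back to the source table.
for pair in pairs:
    first, second = pair.split("|")
    conn.execute(
        "UPDATE contacts SET match_ref = ? WHERE ref IN (?, ?)",
        (first, int(first), int(second)),
    )

print(pairs)
```

Draining the buffer after every AddData() call is one of the two options the text mentions; reading all results only after the last record has been added would work equally well in this sketch.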
mHub is configured via the ApplySettings() method, which takes an XML string containing one or more settings. The XML can be read from a static file or from a dynamic file created at runtime, or can even be a string hardcoded within the application’s source code (although this is not recommended).
Note that a settings string or settings file can include values for all settings, for several settings, or for just a single setting. The application could therefore split settings into separate reusable XML files, and execute a sequence such as this:
engine.ApplySettings( standardSettings );
engine.ApplySettings( dataSource1 );
engine.ApplySettings( individualLowVolumeKeys );
engine.ApplySettings( matchITAPISettings );
Each argument holds the contents of a different XML file, which provides both flexibility and reusability. The specific settings that can be configured are outside the scope of this document, but they relate to the following considerations:
- Match keys – which fields should be used to group together records that are similar for comparison by the matching process.
- Match scoring weights and settings – customize the component match scores that individual mapped fields should achieve.
- Matching constraints (e.g. must match on house/apartment/suite number, must match on gender, etc.).
Full details of the configuration settings will be available in the mHub Application Programmer’s Guide.
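As a rough illustration of splitting settings across reusable XML files, the sketch below reads each file, rejects malformed XML before it reaches the engine, and applies the fragments in order. RecordingEngine is a hypothetical stand-in that only records what ApplySettings() receives, and the element names inside the XML are placeholders rather than real mHub settings. Python is used purely for compactness.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


class RecordingEngine:
    """Hypothetical stand-in; it only records what ApplySettings() receives."""

    def __init__(self):
        self.applied = []

    def ApplySettings(self, xml_text):
        self.applied.append(xml_text)


def apply_settings_files(engine, paths):
    """Read each XML file, fail early on malformed XML, then apply in order."""
    for path in paths:
        xml_text = Path(path).read_text(encoding="utf-8")
        ET.fromstring(xml_text)          # raises ET.ParseError if not well-formed
        engine.ApplySettings(xml_text)


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder fragments; the element names are NOT real mHub settings.
    fragments = {
        "standardSettings.xml": "<settings><placeholderA>1</placeholderA></settings>",
        "dataSource1.xml": "<settings><placeholderB>x</placeholderB></settings>",
    }
    paths = []
    for name, body in fragments.items():
        path = Path(tmp) / name
        path.write_text(body, encoding="utf-8")
        paths.append(path)

    engine = RecordingEngine()
    apply_settings_files(engine, paths)

print(len(engine.applied))
```

Checking well-formedness in the application is a design choice of this sketch; how the engine itself reports invalid settings is not covered here.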
Data is fed into the matchIT Hub engine from any data source (or from two sources when records must be matched across two files; we call this “overlap”), and matching records are identified by mHub’s clustering and comparison algorithms.
mHub provides output to the calling application using one or more of the following formats (this is user-configurable):
- Matching pairs of records.
- Grouped matching pairs of records – matching pairs of records with overall matching group IDs for records that are in the same matching set.
- Groups of matching records – instead of outputting matching pairs (e.g. output types 1 and 2: “record1|record2|score” and “record1|record2|score|matchref|basescore” respectively), this outputs the records in each group (i.e. simply “record|matchref|basescore”).
- Deduped records – output of source data with records identified as duplicates removed.
- Duplicate records – records that mHub has identified as being duplicates, excluding the master record from the matching set.
All output types except matching pairs require that all input data has been fed into mHub before any data can be output.
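To see how pair output relates to group output, and why grouping has to wait for all input, the sketch below merges “record1|record2|score” pairs into groups with a small union-find. This is not mHub's internal algorithm; the function, the scores, and the choice of the smallest reference as group key are assumptions made for illustration. Python is used purely for compactness.

```python
def group_pairs(pair_lines):
    """Merge 'record1|record2|score' pairs into groups keyed by smallest ref."""
    parent = {}

    def find(ref):
        parent.setdefault(ref, ref)
        while parent[ref] != ref:
            parent[ref] = parent[parent[ref]]   # path halving
            ref = parent[ref]
        return ref

    for line in pair_lines:
        rec1, rec2, _score = line.split("|")
        root1, root2 = find(rec1), find(rec2)
        if root1 != root2:
            # Keep the smaller reference as the representative of the group.
            parent[max(root1, root2)] = min(root1, root2)

    groups = {}
    for ref in parent:
        groups.setdefault(find(ref), []).append(ref)
    return {root: sorted(members) for root, members in groups.items()}


# Illustrative pairs only; the scores are made up.
pairs = ["1000|2000|95", "3000|4000|90", "2000|3000|88"]

groups_before = group_pairs(pairs[:2])   # two separate groups so far
groups_after = group_pairs(pairs)        # the third pair joins them

print(groups_before)
print(groups_after)
```

The third pair links two groups that looked independent until then, which is why group membership cannot be finalized while input is still arriving.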
The ‘Deduped’ and ‘Duplicate’ output types do not contain any metadata such as scores, match refs, or base scores; they’re purely for outputting an unmodified copy of the original input data that can subsequently be written into other tables. For example, the deduped records and the duplicate records could be output to two empty tables; when processing has completed, three tables will then exist: one containing the original data, one the deduped data, and one the duplicate records.
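The three-table outcome can be sketched with an in-memory SQLite database. The table names, the sample rows, and the duplicate_refs set are hypothetical; in a real run the rows for the two target tables would come from the engine's deduped and duplicate outputs rather than from a hand-written set. Python is used purely for compactness.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source     (ref INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE deduped    (ref INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE duplicates (ref INTEGER PRIMARY KEY, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1000, "JOHN SMITH"), (2000, "Mr J Smith"), (3000, "JANE DOE")],
)

# Hypothetical: refs reported as duplicates (non-master records).
duplicate_refs = {2000}

# Copy each unmodified source row into exactly one of the two empty tables.
for ref, name in conn.execute("SELECT ref, name FROM source").fetchall():
    target = "duplicates" if ref in duplicate_refs else "deduped"
    conn.execute(f"INSERT INTO {target} VALUES (?, ?)", (ref, name))

counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("source", "deduped", "duplicates")
}
print(counts)
```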
For all these output types ‘record’ means, at a minimum, that the record’s unique ref is output. Optionally, the record’s original data can also be output (this is user-configurable).
Here are two examples of what can be output as matching pairs of records:
- “1000|2000|1|100|0|0|0|0”
- “1000|JOHN SMITH|johns@360Science.com|2000|Mr J Smith|johns@360Science.com|1|100|0|0|0|0”
The first example is matching results plus unique references only (1000 and 2000); the second example is matching results plus unique references plus a copy of the original input data for these records (assuming that only names and email addresses were supplied). Note that in the above examples, the 1 indicates that the records match at individual level, and 100 is the total individual-level score.
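A calling application could split such a line back into its parts along the following lines. The layout assumed here (each record is its unique ref followed by a fixed number of data fields, with all remaining fields being matching results) is inferred from the single example above rather than taken from a specification, and parse_pair is a hypothetical helper. A plain split on “|” would also break if the input data itself contained that character. Python is used purely for compactness.

```python
def parse_pair(line, data_fields_per_record):
    """Split one matching-pair line into two records plus trailing results.

    Assumed layout: ref1, data1..., ref2, data2..., result fields...
    """
    fields = line.split("|")
    width = 1 + data_fields_per_record          # ref + its data fields
    first = fields[:width]
    second = fields[width:2 * width]
    return {
        "ref1": first[0],
        "data1": first[1:],
        "ref2": second[0],
        "data2": second[1:],
        "results": fields[2 * width:],
    }


line = "1000|JOHN SMITH|johns@360Science.com|2000|Mr J Smith|johns@360Science.com|1|100|0|0|0|0"
parsed = parse_pair(line, data_fields_per_record=2)

print(parsed["ref1"], parsed["ref2"], parsed["results"])
```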
Full details of the matching output will be available in the mHub Application Programmer’s Guide.
The product includes sample code that provides a simple demonstration of how to create a matchIT Hub engine, how it can be configured, how to pass data into it, and how to retrieve the output.
We provide sample code in C++, C# (.NET), and Java. The C# and Java samples illustrate how to read from and write to database tables using ADO.NET and JDBC respectively.
The sample demonstrates normalization, single table matching (internal dedup of a single data source), and overlap matching (an overlap of two data sources).
Deployment of an application that uses mHub is very straightforward. Such applications only need to redistribute and install the relevant mHub binaries (.DLL, .SO, .JAR files); nothing further is required. The binaries must be available to the application at runtime. On Windows, mHub does not use COM, which eliminates registration requirements and cloud deployment issues.