1.3.1 Processing Mode
<mode>Matching</mode>
This is a compulsory setting. The processing mode must be configured before any data can be added to the matchIT Hub engine.
Currently supported processing modes:
- Matching - for finding matching records within a single data source, or for finding the overlap of two data sources (i.e. records that appear in both data sources).
- Lookup - primarily for online duplicate prevention applications, in which individual records are looked up in a reference data source.
- Normalization - for outputting a 'normalized' (cased, standardized, parsed, extracted, modified) copy of the input data.
- Grouping - for grouping the Matching Pairs output of previous Matching mode run(s).
- UKAddressVerification - for verifying UK addresses against Royal Mail PAF.
(Additional processing modes will be added in future releases of the product.)
1.3.2 Data Settings
<data>
<input />
...
<options>...</options>
</data>
Data settings specify the layout of the data to be added to matchIT Hub and options for pre-processing the incoming data.
Input Columns
<data>
<input columns="|UniqueRef|FullName|Address1|Address2|Address3|Postcode" />
</data>
This is a compulsory setting. The input data definition must be configured before any data can be added to the matchIT Hub engine.
The input data definition is a delimited string, containing one or more column types. The delimiter is indicated by the first character of the string; this must be a non-alphanumeric character. Refer to Appendix A for a list of all available column types. Each type (except for Other) can appear only once in the definition.
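The rules above can be illustrated with a minimal Python sketch. It is not the engine's own parser: parse_columns is a hypothetical helper, it treats column names as case-sensitive (an assumption), and it does not check names against the list in Appendix A.

```python
def parse_columns(definition: str) -> list[str]:
    """Split a column definition such as "|UniqueRef|FullName" into names."""
    if not definition:
        raise ValueError("definition is empty")
    # The first character of the string is the delimiter.
    delimiter = definition[0]
    # The delimiter must be a non-alphanumeric character.
    if delimiter.isalnum():
        raise ValueError(f"delimiter {delimiter!r} must be non-alphanumeric")
    names = definition[1:].split(delimiter)
    if any(name == "" for name in names):
        raise ValueError("empty column name in definition")
    # Every type except Other may appear only once.
    seen = set()
    for name in names:
        if name != "Other" and name in seen:
            raise ValueError(f"column type {name!r} appears more than once")
        seen.add(name)
    return names


if __name__ == "__main__":
    print(parse_columns("|UniqueRef|FullName|Address1|Address2|Address3|Postcode"))
    # A different delimiter, with Other repeated.
    print(parse_columns(",UniqueRef,Other,Other"))
```

Because the delimiter is read from the string itself, the same definition can be written with any non-alphanumeric character that suits the data.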
When overlapping two data sources that have different column layouts, give each data source its own input element, identified by a table attribute:
<data>
<input table="1" columns="|UniqueRef|FullName|Address1|Address2|Postcode" />
<input table="2" columns="|UniqueRef|FirstNames|LastName|Address1|Address2|Postcode" />
</data>
Optionally, each data source can be tagged with a descriptive name by including a 'name' attribute. The name will be written to the statistics XML that's output by the GetStats/getStats engine method. For example:
<data>
<input table="1" columns="..." name="Master Customer Table" />
<input table="2" columns="..." name="Incoming Feed" />
</data>
Multiple Elements
When a record contains more than one element of the same kind, the additional names, organizations, addresses (including postcodes), telephones, faxes, emails, and jobs can be mapped in the data source definitions by prefixing column names with Second, Third, Fourth, or Fifth.
For example:
- Map two names using FullName and SecondFullName.
- Map two addresses using Address1-9 and SecondAddress1-9.
- Map three postcodes using Postcode, SecondPostcode, and ThirdPostcode.
- Map five emails using Email, SecondEmail, ThirdEmail, FourthEmail, and FifthEmail.
A prefix of First is permitted, but is not necessary. FirstEmail is the same as Email. Elements can be mapped in any order. For example, this is a valid data source definition:
"|SecondEmail|ThirdCompany|Company|Email|SecondCompany"
For further details of matching records that contain multiple elements, see Associations.
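As a rough illustration of the prefix convention, the Python sketch below splits a column name into its position and base type. The helper split_ordinal and the BASE_TYPES set are hypothetical: the set is a small subset of names taken from this article, not the full list in Appendix A. Checking the remainder against known base types avoids misreading a column such as FirstNames as First plus Names.

```python
ORDINALS = {"First": 1, "Second": 2, "Third": 3, "Fourth": 4, "Fifth": 5}

# Illustrative subset only; the real list of column types is in Appendix A.
BASE_TYPES = {"FullName", "FirstNames", "LastName", "Company", "Email", "Postcode"}


def split_ordinal(column: str) -> tuple[int, str]:
    """Return (position, base type) for a column name such as SecondEmail."""
    for prefix, position in ORDINALS.items():
        remainder = column[len(prefix):]
        # Only treat the prefix as an ordinal if what follows is a known type.
        if column.startswith(prefix) and remainder in BASE_TYPES:
            return position, remainder
    # No ordinal prefix: the column is the first element of its type.
    return 1, column


if __name__ == "__main__":
    definition = "|SecondEmail|ThirdCompany|Company|Email|SecondCompany"
    for name in definition[1:].split(definition[0]):
        print(name, split_ordinal(name))
```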
Options
<data>
<options>
<trimAllData>false</trimAllData>
<verifyInputColumns>true</verifyInputColumns>
<textQualifier>false</textQualifier>
<abortOnSchemaError>false</abortOnSchemaError>
<abortOnDataError>false</abortOnDataError>
</options>
</data>
trimAllData: When enabled, leading and trailing whitespace is trimmed from all input data. Because added data is otherwise stored unmodified, enabling this setting can help reduce memory usage.
verifyInputColumns: All added data is parsed according to the configured input data definition, and processing is aborted if any data is encountered that doesn't conform to the definition. Disabling this setting allows all data to be processed, whether or not it conforms to the definition. Disable it only when necessary, for example when it's not possible to easily correct the data before it's added.
textQualifier (from version 2.0.4.18): The fields of an input data record may contain embedded delimiters, in which case the field is wrapped in double quote characters. If textQualifier is false, the record parser does not check for quotes or embedded delimiters, so the delimiter must be a character that does not appear in the data.
abortOnSchemaError (from version 2.0.3): When enabled (and always, prior to version 2.0.3), the engine goes into an aborted state if an input data record doesn't match the defined column layout. When disabled (the default), the malformed record is simply logged and ignored.
abortOnDataError (from version 2.0.4.24): When enabled (and always, prior to version 2.0.4.24), the engine goes into an aborted state if an input data record contains invalid UTF-8. When disabled (the default), the record containing invalid data is simply logged and ignored.
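The difference between the two textQualifier values can be sketched in Python with the standard csv module. This is an approximation under stated assumptions, not the engine's parser: split_record is a hypothetical helper, the delimiter is passed in explicitly, and the csv module's handling of doubled quotes inside a quoted field is not something this article confirms for matchIT Hub.

```python
import csv


def split_record(record: str, delimiter: str, text_qualifier: bool) -> list[str]:
    """Split one delimited record into fields."""
    if text_qualifier:
        # Quoted fields may contain the delimiter; the quotes are removed.
        return next(csv.reader([record], delimiter=delimiter, quotechar='"'))
    # No quote handling: every delimiter starts a new field.
    return record.split(delimiter)


if __name__ == "__main__":
    record = '1|"Smith | Sons Ltd"|AB1 2CD'
    print(split_record(record, "|", True))
    print(split_record(record, "|", False))
```

With the qualifier honored, the quoted company name stays in one field. Without it, the embedded delimiter splits the name in two and the record gains an extra field.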
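Where records arrive as raw bytes, a caller may prefer to screen out invalid UTF-8 before adding data rather than rely on the engine to abort or log. The Python sketch below shows one way to do that; is_valid_utf8 and partition_records are hypothetical helpers and are not part of the matchIT Hub API.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def partition_records(records: list[bytes]) -> tuple[list[bytes], list[bytes]]:
    """Separate records into (valid, rejected) by UTF-8 validity."""
    valid = [r for r in records if is_valid_utf8(r)]
    rejected = [r for r in records if not is_valid_utf8(r)]
    return valid, rejected


if __name__ == "__main__":
    good = "1|Zoë Smith|AB1 2CD".encode("utf-8")
    bad = b"2|Zo\xeb"  # Latin-1 encoded byte, not valid UTF-8 here
    print(partition_records([good, bad]))
```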