H1.6 - Configuration Guide - Normalization Settings

<normalization>
  <outputs [columns="..."]>
    <replacements>...</replacements>
    ...
    <options>
      <outputHeader enabled="true" />
      <outputInputData enabled="false" />
      <allowMultipleInstances enabled="false" />
      <assignUniqueRef enabled="false" seed="1000000000" />
    </options>
  </outputs>
</normalization>

Normalization settings are required only if the processing mode has been set to Normalization. If these have not been specified in the configuration, then these defaults will be used:

the engine automatically determines which fields to output;
a header will be output with the results;
a copy of the input data will not be output;
no replacements will be made.
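
As a rough illustration, the defaults listed above should correspond to a configuration along the following lines. This is a sketch assembled from the template at the top of this page, and it assumes that spelling out the defaults behaves the same as omitting the normalization settings altogether.

```xml
<normalization>
  <!-- No columns attribute: the engine determines which fields to output -->
  <outputs>
    <!-- No <replacements> elements: output data is left unmodified -->
    <options>
      <!-- A header is output with the results -->
      <outputHeader enabled="true" />
      <!-- A copy of the input data is not output -->
      <outputInputData enabled="false" />
    </options>
  </outputs>
</normalization>
```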

Output Fields

If the output fields haven't been specified (i.e. <outputs> is used without a columns attribute), then matchIT Hub will automatically determine which fields to output. In this case, a header row can be output with the results to indicate which fields are present; this can be used to dynamically create a destination table in a database before importing the normalized data.

Alternatively, to choose the output fields explicitly, set the columns attribute to a delimited string that lists them - for example, <outputs columns="|FullName|Prefix|FirstNames|LastName|InputFullName" />. Notice that a copy of the input data can be output alongside the normalized fields (here, InputFullName).
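
A minimal sketch of how an explicit field list might sit within the full normalization block, reusing the field names from the example above. The nesting follows the template at the top of this page; disabling the header here is purely illustrative and not a recommendation.

```xml
<normalization>
  <!-- Explicit, delimited list of output fields -->
  <outputs columns="|FullName|Prefix|FirstNames|LastName|InputFullName">
    <options>
      <!-- Illustrative only: suppress the header row -->
      <outputHeader enabled="false" />
    </options>
  </outputs>
</normalization>
```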

Note that input data can be passed through to the outputs without performing any normalization routines (casing, standardization, parsing), by assigning them as numbered Other fields in the data input definitions (i.e. mapping input fields as Other1, Other2, etc.). Up to 32 inputs can be marked using Other.

Refer to Appendix D for a list of all possible output fields.

Replacements

In addition, data can be modified as it's output. This can be achieved via simple replacements; for example replacing "Apartment" in address lines with "Apt", or replacing "USD" with "$" in a custom field. Any number of replacements can be defined.

All output data is parsed into substrings before any replacements are processed. Substrings are delineated by whitespace and punctuation characters, and by separating numbers from words.

Example 1.
Data in CustomField1 is modified as it's output. Instances of USD are replaced with a dollar sign ($), and instances of GBP are replaced with a pound sign (£). A value of "USD40" would be treated as two substrings ("USD" and "40") and would therefore be output as "$40" with the replacements defined below.

<replacements column="CustomField1" >
<replacement from="USD" to="$" />
<replacement from="GBP" to="£" />
</replacements>

Example 2.
Data in both the FullName and Salutation fields is similarly modified. Instances of "Mr" are replaced with "Herr" (e.g. "Mr J Smith" becomes "Herr J Smith").

<replacements columns="|FullName|Salutation" >
<replacement from="Mr" to="Herr" />
</replacements>

A substring must match fully so that, for example, "Mrs" is not replaced with "Herrs" because of the "Mr" replacement defined above.

Example 3.
All ampersands in the four specified fields are replaced with "and". For example "Mr & Mrs J Smith" becomes "Mr and Mrs J Smith". (Notice how the ampersand is specified using the XML predefined entity "&amp;".)

<replacements columns="|Prefix|FullName|Contact|Salutation" >
<replacement from="&amp;" to="and" />
</replacements>
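
Since any number of replacements can be defined, the three examples might be combined in a single configuration roughly as follows. This is a sketch only: it assumes that several replacements blocks may appear side by side inside outputs, ahead of options, as the template at the top of this page suggests.

```xml
<normalization>
  <outputs>
    <!-- Example 1: currency codes in a custom field -->
    <replacements columns="|CustomField1">
      <replacement from="USD" to="$" />
      <replacement from="GBP" to="£" />
    </replacements>
    <!-- Example 2: only the whole substring "Mr" is replaced, so "Mrs" is untouched -->
    <replacements columns="|FullName|Salutation">
      <replacement from="Mr" to="Herr" />
    </replacements>
    <!-- Example 3: the ampersand must be written as the entity &amp; -->
    <replacements columns="|Prefix|FullName|Contact|Salutation">
      <replacement from="&amp;" to="and" />
    </replacements>
    <options>
      <outputHeader enabled="true" />
    </options>
  </outputs>
</normalization>
```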

Options

outputHeader: If enabled, then a header will be output with the results to indicate the type of data for each column. Disable this setting to prevent the header from being output.

outputInputData: If enabled, then a copy of all input data will be appended to the output fields. This is only applicable if the output fields haven't been specified.
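
For instance, a configuration that leaves field selection to the engine and appends the input data might look like the sketch below. The columns attribute is deliberately omitted, because the option only applies when the output fields haven't been specified.

```xml
<normalization>
  <!-- No columns attribute, so outputInputData is applicable -->
  <outputs>
    <options>
      <outputHeader enabled="true" />
      <outputInputData enabled="true" />
    </options>
  </outputs>
</normalization>
```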

allowMultipleInstances: If enabled, then the output fields can contain multiple instances of any particular output field. By default, including FullName twice in the output fields, for example, would invalidate the configuration.
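
As a sketch, the following configuration lists FullName twice. Based on the description above, it would be expected to be rejected unless allowMultipleInstances is enabled as shown; the choice of fields is illustrative.

```xml
<normalization>
  <!-- FullName appears twice in the delimited list -->
  <outputs columns="|FullName|LastName|FullName">
    <options>
      <!-- Without this option, the duplicate field would invalidate the configuration -->
      <allowMultipleInstances enabled="true" />
    </options>
  </outputs>
</normalization>
```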

assignUniqueRef: [from v2.0.4.29] If enabled, then the UniqueRef on each record will be set from an increasing sequence of integers starting at the value of seed. This setting is disabled by default, and the default seed is 1000000000. If your input table contains a UniqueRef column, then its values will be overwritten. If you require the UniqueRef to be assigned in the order that records are added to Hub, then you must set the thread count to 1. Please note: the maximum value that can be assigned as a UniqueRef is 9,223,372,036,854,775,807. Processing will fail if your seed plus dataset size exceeds this number.
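
A minimal sketch of enabling assignUniqueRef with a custom seed. The seed value and the record count in the comments are illustrative, and the arithmetic assumes that the sequence increases by 1 for each record, which this page does not state explicitly.

```xml
<normalization>
  <outputs>
    <options>
      <!-- Illustrative seed: with 250,000 input records and increments of 1,
           UniqueRef values would run from 5000000 to 5249999 -->
      <assignUniqueRef enabled="true" seed="5000000" />
      <!-- Keep seed plus dataset size at or below 9,223,372,036,854,775,807,
           otherwise processing will fail -->
    </options>
  </outputs>
</normalization>
```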
