The <dataSources> node is where the mapping of your data takes place. The <dataSources> node can contain any number of <dataSource> sub nodes, within which you define your sources that you want to operate on. Each <datasource> node must have an ‘id’ attribute specified as it is used when calling procedures to specify which data source you want the procedure to work on.
The first sub node within the <dataSource> node is the <connectionString> node, which contains a connection string to the database. By default, as mentioned in ‘Quick Start with Example Data’, it has blank credentials and points to a database called ‘matchIT_SQL_demo’, which is the sample database created by mSQL for use with the sample SQL scripts. You need to amend the connection string as necessary to point to your data.
Following the <dataSources> node is the <tables> node. This node contains all the definitions of the tables within your database that contain the data to be worked with. A single table is defined in a <table> node. The <table> node can contain any of the following attributes:
Attribute |
Description |
Name |
The name of the table in the database (mandatory for all tables). |
uniqueRef |
The name of the unique reference column in the table (mandatory for all tables). |
join |
The name of the table that the table in question joins to (the main table of data will not have this attribute). |
joinType |
The type of join to use between the table in question and the table it is joining to – either INNER, LEFT or RIGHT. Uses LEFT by default. |
joinColumn |
The name of the column in the join table that the unique ref column of the table in question joins with (mandatory for any table definition with a join attribute). |
isKeysTable |
Indicates that the table in question is the keys table definition (can only have one keys table definition per data source, and it must join to the main table of data) |
isOutputTable |
Indicates that the table in question is the output table which will contain cleaned and normalized data after key generation (can only have one output table defined per data source) |
tableHints |
Here you can specify a comma delimited list of Table Hints that will be used in SELECT queries for the table in question, for example ‘NOEXPAND’ if the table you have defined is an indexed view and you want the query to reference the view rather than its base tables. |
Within a <table> node, it is possible to have multiple <conditionalColumn> sub nodes that define columns in the table in question that must meet a condition for a record to be included (for example, a ‘deleted’ flag column). A <conditionalColumn> node must have the following attributes defined:
Attribute |
Description |
name |
The name of the column in the table. |
isEqualTo |
Specifies whether the value should or should not be equal to the value. |
Value |
The value of the flag column (linked to the isEqualTo attribute as mentioned above). |
isIntegerType |
Specifies whether the column is an integer type or not. |
One thing to note about the <tables> section is that SQL server supports views, so it is possible to use the name of a view for the ‘name’ attribute of a table node.
Using a Sample from your datasource
The next part of the datasource definition is Sampling. This allows you to specify a specific sample of data to use during your processing rather than the complete datasource. This can be useful if you are testing your scripts and refining matching settings.
There are several sampling options available:
- Percentage – uses a percentage from your datasource. For example, if you select 10%, then 1 in 10 records from your datasource would be used.
- NinM - uses a sample based on N in every M records of the source data.
- Range - Uses a sample over the specified range of the specified field.
- RandomN - Uses a sample of N randomly selected records from the source data.
- MaxFromTop - uses a sample of the top N records from the source data. If this option is applied with any of the previous options, then the limit setting is used as a restriction on the maximum number of records that will be used in the sample.
When sampling is enabled, only a sample of your data will be used by the mSQL stored procedures; remember to de-activate this when you are ready to process your complete datasource!
Field Mappings
The last node within the <dataSource> node is the <fieldMappings> node - It contains <fieldMapping> sub nodes that define how fields in your data map to the mSQL record object. The ‘matchITField’ attributes of these nodes are pre-set and must not be changed – The attribute that you will need to modify is the ‘databaseField’. Simply put the values of the ‘databaseField’ nodes to the names of the fields in your data that best match the corresponding ‘matchITField’. Any matchITFields that do not have an equivalent in your database should have their databaseField attributes left blank.
Note that it is also possible to specify custom fields that do not have a specific mSQL record equivalent, that you wish to match on, to any of the 9 generic custom fields that are available in the mSQL record object. By default, a there is a node included in the xml on installation with its matchITField attribute set to ‘CustomField1’ – You can in fact have 9 nodes in total with their matchITField attribute values ranging from CustomField1 to CustomField9. A database field such as ‘National Insurance Number’ could be mapped to one of these fields.
If you open a configuration file that has been edited and saved through the Web UI (described below) you will also notice that there is an <addressing> node within each data source that contains the settings used when running the addressing stored procedure. The output contained within this node varies depending on what addressing API is being used, and is best edited through the UI. For the sake of getting the demo scripts to work with your own data however, you do not need to be concerned with this node and whether or not it exists, as the addressing procedure will not be used.