C# & Java batch examples
The C# & Java batch clients are similar to the HubTest examples for the Hub API.
Usage
BatchClient /testconfig=<testConfigFile> /settings=<settingsFile> /license=<licenseFile> [/stats=<statsFile>] [switches]

Where:
  <testConfigFile>  The name of a BatchClient XML configuration file (see below for details).
  <settingsFile>    The name of a matchIT Hub XML settings file (see the Configuration Guide for details).
  <licenseFile>     The name of the file containing the activation code.
  <statsFile>       The name of a file to write statistics to (XML format).

Switches:
  /help or /?       Display usage.

e.g. BatchClient /testconfig=config.xml /settings=settings.xml /stats=stats.xml /license=activation.txt
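The /name=value form of the arguments can be illustrated with a small Java sketch. SwitchParser is a hypothetical helper written for this page, not BatchClient's actual argument handling; it assumes switch names are matched exactly as typed and treats testconfig, settings and license as required, following the usage line above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: illustrates the /name=value convention only.
final class SwitchParser {
    // Switches shown without brackets in the usage line.
    private static final String[] REQUIRED = {"testconfig", "settings", "license"};

    /** Splits "/name=value" arguments into a map; bare switches such as "/help" map to "". */
    static Map<String, String> parse(String[] args) {
        Map<String, String> switches = new LinkedHashMap<>();
        for (String arg : args) {
            if (arg.length() < 2 || arg.charAt(0) != '/') {
                throw new IllegalArgumentException("Not a switch: " + arg);
            }
            int eq = arg.indexOf('=');
            String name = eq < 0 ? arg.substring(1) : arg.substring(1, eq);
            String value = eq < 0 ? "" : arg.substring(eq + 1);
            switches.put(name, value);
        }
        return switches;
    }

    /** Returns the required switch names that are absent from the parsed map. */
    static List<String> missingRequired(Map<String, String> switches) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!switches.containsKey(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Calling SwitchParser.parse on the example command line above yields four entries (testconfig, settings, stats, license), and missingRequired returns an empty list for it.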
Configuration
This configuration file is specific to BatchClient and gives database connection details and other options. There are four sections: service, log, input, & output.
<?xml version="1.0" encoding="utf-8" ?>
<config>
  <service>...</service>
  <log>...</log>
  <input>...</input>
  <output>...</output>
</config>
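As a rough way to check which sections a configuration file actually contains, the following Java sketch lists the child elements of <config> using the JDK's DOM parser. ConfigSections is a hypothetical helper, not part of BatchClient, and it makes no assumption about which sections are mandatory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the top-level sections of a BatchClient-style config.
final class ConfigSections {
    /** Returns the element names directly under the root element, in document order. */
    static List<String> sectionNames(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Malformed XML (for example an unescaped '&') ends up here.
            throw new IllegalArgumentException("Configuration is not well-formed XML", e);
        }
        List<String> names = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }
}
```

Because the check relies on a standard XML parser, it also rejects files that are not well-formed, which can help catch editing mistakes before a job is run.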
The <service> section configures general settings for the service. For example:
<service>
  <host>http://localhost:8080</host>
  <engine destroy="true">0</engine>
  <pollInterval>2000</pollInterval>
  <uploadBlockRecordLimit>100000</uploadBlockRecordLimit>
  <compressionMinSize>1048576</compressionMinSize>
  <downloadBlockRecordLimit>10000</downloadBlockRecordLimit>
</service>
- <host> - configures the protocol, host and port of the matchIT Hub Service. The default is http://localhost:8080.
- <engine> - configures which engine instance to use and whether or not to destroy it when finished. Setting the engine instance id to 0 creates a new engine instance. You can specify an engine name instead of an id; if no engine instance exists with the given name, a new one is created.
- <pollInterval> - when waiting for processing to finish or results to be ready, specifies how many milliseconds to sleep between calls querying the state of the service. The default is 10,000ms (i.e. 10 seconds).
- <uploadBlockRecordLimit> - specifies the number of records for the ProxyEngine to collect - via the AddData() method - before uploading to the service. Specify 0 to wait and upload all the records in one go. The default is 100,000.
- <compressionMinSize> - when sending blocks of records to the service, blocks over the compression minimum size (bytes) will be compressed using gzip. Specify 0 (default) to disable compression.
- <downloadBlockRecordLimit> - specifies the maximum number of records to be returned at a time by the ProxyEngine OpenDownloadStream() method - which calls the REST API results method. Specify 0 (default) to download all the results in one go.
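The arithmetic implied by uploadBlockRecordLimit and compressionMinSize can be sketched in Java as follows. BlockMath is a hypothetical helper, not ProxyEngine code. Two points are assumptions rather than documented behaviour: an empty input is treated as zero uploads, and "over" the minimum size is read as strictly greater than.

```java
// Hypothetical helper: illustrates the documented block settings only.
final class BlockMath {
    /** Number of uploads needed for recordCount records; a limit of 0 means one upload. */
    static long uploadCount(long recordCount, long uploadBlockRecordLimit) {
        if (recordCount < 0 || uploadBlockRecordLimit < 0) {
            throw new IllegalArgumentException("Counts and limits must not be negative");
        }
        if (recordCount == 0) {
            return 0; // assumption: nothing to send
        }
        if (uploadBlockRecordLimit == 0) {
            return 1; // 0 means wait and upload everything in one go
        }
        // Ceiling division: the final block may be partially filled.
        return (recordCount + uploadBlockRecordLimit - 1) / uploadBlockRecordLimit;
    }

    /** True when a block of blockBytes would be gzip-compressed; 0 disables compression. */
    static boolean isCompressed(long blockBytes, long compressionMinSize) {
        return compressionMinSize > 0 && blockBytes > compressionMinSize;
    }
}
```

For instance, with the default limit of 100,000 a table of 250,000 records would be sent in three uploads, while a limit of 0 sends it in one.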
The <log> section configures logging. See Logging for full details.
The <input> & <output> sections configure the data sources and targets. The following scenarios outline some example configurations:
1) Input from and output to database tables
- Specify <connectionString> & <table> in the <dataSource> to read input from database tables.
- Specify <connectionString> & <table> in the <output> to write results to database tables.
With this configuration BatchClient reads data from database tables and streams it to the Hub Service in chunks, then streams back the results and writes them to database tables.
The format of the connectionString depends on the database and the language (C#/Java).
The C# BatchClient supports MS SQL, and connection strings take the form:
<connectionString>Data Source=HOST; Initial Catalog=DATABASE; Integrated Security=SSPI;</connectionString>
<connectionString>Data Source=HOST; Initial Catalog=DATABASE; User Id=USERNAME;Password=PASSWORD;</connectionString>
The Java BatchClient supports MySQL, MS SQL, & Oracle (you will need to download the appropriate drivers), and connection strings take the form:
<connectionString>jdbc:mysql://HOST/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
<connectionString>jdbc:sqlserver://HOST;DatabaseName=DATABASE;IntegratedSecurity=true;</connectionString>
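Because the configuration file is XML, an ampersand inside a JDBC URL has to be written as &amp; for the file to be well-formed. The Java sketch below shows one way to produce a correctly escaped <connectionString> element. ConnectionStringXml and its method names are hypothetical, and the sketch assumes the user name and password contain no characters that would need URL encoding.

```java
// Hypothetical helper: builds an XML-safe <connectionString> element for MySQL.
final class ConnectionStringXml {
    /** Escapes the characters that cannot appear verbatim in XML element text. */
    static String escapeText(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds the MySQL URL in the form shown above and wraps it in the element. */
    static String mySqlElement(String host, String database, String user, String password) {
        String url = "jdbc:mysql://" + host + "/" + database
                + "?user=" + user + "&password=" + password;
        return "<connectionString>" + escapeText(url) + "</connectionString>";
    }
}
```

An XML parser reading the resulting element turns &amp; back into a plain ampersand, so the JDBC driver still receives the usual URL.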
The following example (with Java BatchClient connection strings) loads data from a MySQL database and writes results to an MS SQL database.
This configuration expects the Hub Service to be listening on localhost:8080 (the default).
<?xml version="1.0" encoding="utf-8" ?>
<config>
  <service>
    <host>localhost:8080</host>
    <!-- Set engine to 0 to create a new instance -->
    <engine destroy="true">0</engine>
  </service>
  <log>
    <filename>debuglog.txt</filename>
    <severity>Debug</severity>
  </log>
  <input>
    <!-- Define one data source for single table Matching -->
    <dataSource>
      <connectionString>jdbc:mysql://localhost/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
      <table>TABLE</table>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>
    <!-- Define two data sources for Overlap Matching -->
    <!-- <dataSource> </dataSource> -->
  </input>
  <!-- Output database in which to create tables for MatchingPairs, GroupedMatchingPairs,
       MatchingGroups, DedupedData, DuplicateData. Only the output types enabled in the
       matchIT Hub configuration file will be output. -->
  <output>
    <connectionString>jdbc:sqlserver://localhost;DatabaseName=DATABASE;IntegratedSecurity=true;</connectionString>
  </output>
</config>
2) Input from and output to delimited files (by Service)
- Specify <filename> & <delimiter> in the <dataSource> to read input from delimited files.
- Specify <filename> & <delimiter> in the <output> to write results to delimited files.
With this configuration BatchClient sends the file names to the Hub Service, which reads and writes the files directly. The Hub Service must have read/write access to the files specified. Any non-alphanumeric character can be used as the delimiter for the input file; '\t' is interpreted as tab. If no delimiter is specified, each record must begin with the delimiter used in that record (which can be different for each record). Each record in the output file begins with the delimiter used in that record.
<?xml version="1.0" encoding="utf-8" ?>
<config>
  <input>
    <!-- Define one data source for single table Matching -->
    <dataSource>
      <filename>DRIVE:\PATH\FILENAME.txt</filename>
      <delimiter>\t</delimiter>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>
  </input>
  <!-- Output file to write the results. -->
  <output>
    <filename>DRIVE:\PATH\FILENAME.txt</filename>
  </output>
</config>
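The delimiter rules described for delimited files can be sketched in Java as follows. DelimitedRecord is a hypothetical helper, not the Hub Service's own reader; it ignores any quoting or escaping rules the service may apply and simply shows how a record that begins with its own delimiter can be split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: illustrates the delimiter conventions only.
final class DelimitedRecord {
    /** Maps the configured delimiter text to a character; the two characters \t mean tab. */
    static char delimiterFromConfig(String configured) {
        if (configured.equals("\\t")) {
            return '\t';
        }
        if (configured.length() != 1 || Character.isLetterOrDigit(configured.charAt(0))) {
            throw new IllegalArgumentException("Delimiter must be one non-alphanumeric character");
        }
        return configured.charAt(0);
    }

    /** Splits a record on a known delimiter, keeping empty trailing fields. */
    static List<String> split(String record, char delimiter) {
        return Arrays.asList(record.split(Pattern.quote(String.valueOf(delimiter)), -1));
    }

    /** Splits a record whose first character is the delimiter used in that record. */
    static List<String> splitSelfDelimited(String record) {
        if (record.isEmpty()) {
            throw new IllegalArgumentException("Empty record");
        }
        return split(record.substring(1), record.charAt(0));
    }
}
```

With this sketch, the records |1|Mr|John|Smith and ,2|Ms,Jane,Jones can sit in the same file: the first is split on the pipe into four fields, the second on the comma into three.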
3) Input from and output to delimited files (by Java BatchClient)
The Java version of BatchClient supports a "local" attribute on the filename tags, e.g. <filename local="false">DRIVE:\PATH\FILENAME.txt</filename>. It defaults to true, meaning that the named file is local to the machine running the Hub Service (see scenario 2 above). If set to false, the file is not accessible from the Hub Service, so instead of sending the filename BatchClient streams the data in chunks.
With this configuration BatchClient reads data from delimited files and streams it to the Hub Service in chunks, then streams back the results and writes them to a delimited file. Any non-alphanumeric character can be used as the delimiter for the input file; '\t' is interpreted as tab. If no delimiter is specified, each record must begin with the delimiter used in that record (which can be different for each record). Each record in the output file begins with the delimiter used in that record.
<?xml version="1.0" encoding="utf-8" ?>
<config>
  <input>
    <!-- Define one data source for single table Matching -->
    <dataSource>
      <filename local="false">DRIVE:\PATH\FILENAME.txt</filename>
      <delimiter>\t</delimiter>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>
  </input>
  <!-- Output file to write the results. -->
  <output>
    <filename local="false" split="false">DRIVE:\PATH\FILENAME.txt</filename>
  </output>
</config>
The split option defaults to false, in which case all results are written to a single output file. Set split to true to write each output type (Matching Pairs, Matching Groups, etc.) to a separate file. The files are named by adding the output type to the filename, e.g. FILENAMEMP.txt, FILENAMEMG.txt, etc. Only the output types enabled in the matchIT Hub configuration file will be output.
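The naming pattern for split output files can be sketched in Java as below. SplitOutputNames is a hypothetical helper, not BatchClient code. Only the MP and MG suffixes appear in this article, so the suffix is passed in rather than enumerated, and appending the suffix to a name that has no extension is an assumption.

```java
// Hypothetical helper: derives a split output filename from the configured one.
final class SplitOutputNames {
    /** Inserts the output-type suffix before the extension: FILENAME.txt + MP gives FILENAMEMP.txt. */
    static String withSuffix(String filename, String typeSuffix) {
        int lastSeparator = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
        int dot = filename.lastIndexOf('.');
        if (dot <= lastSeparator + 1) {
            // No extension in the final path segment (assumption: just append).
            return filename + typeSuffix;
        }
        return filename.substring(0, dot) + typeSuffix + filename.substring(dot);
    }
}
```

Only the last path segment is inspected for an extension, so a dot in a directory name does not affect where the suffix is placed.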
4) Mixing datasources
Scenario 1 above demonstrates input and output to and from database tables. Scenarios 2 & 3 above demonstrate input and output to and from delimited files. You can, of course, configure each dataSource and output to use a different type of data store. For example:
<?xml version="1.0" encoding="utf-8" ?>
<config>
  <input>
    <!-- Universe data is held in a database table. -->
    <dataSource>
      <connectionString>jdbc:mysql://localhost/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
      <table>TABLE</table>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>
    <!-- Overlap data is in a delimited file. -->
    <dataSource>
      <filename local="false">DRIVE:\PATH\FILENAME.txt</filename>
      <delimiter>\t</delimiter>
      <columns>UniqueRef,Prefix,Forenames,Surname,Address1,Address2,Address3,Address4,Address5,Postcode</columns>
    </dataSource>
  </input>
  <!-- Output file to write the results. -->
  <output>
    <filename local="false" split="false">DRIVE:\PATH\FILENAME.txt</filename>
  </output>
</config>
5) Running multiple overlaps against a master table that is only loaded once
To run an overlap specify two dataSources in the <input> section:
<input>
  <dataSource>
    <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
    <connectionString>jdbc:mysql://localhost/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
    <table>TABLE</table>
  </dataSource>
  <dataSource>
    <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
    <filename>DRIVE:\PATH\FILENAME.txt</filename>
    <delimiter>\t</delimiter>
  </dataSource>
</input>
To just load data into table 1 in preparation for running overlaps (or lookups), define the columns for table 2 but don't specify either filename & delimiter or connectionString & table:
<input>
  <dataSource>
    <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
    <connectionString>jdbc:mysql://localhost/DATABASE?user=USER&amp;password=PASSWORD</connectionString>
    <table>TABLE</table>
  </dataSource>
  <dataSource>
    <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
  </dataSource>
</input>
To run subsequent overlaps against a previously loaded table 1, define the columns for table 1 but don't specify either filename & delimiter or connectionString & table:
<input>
  <dataSource>
    <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
  </dataSource>
  <dataSource>
    <columns>UniqueRef,Forenames,Surname,Address1,City,State,Zip</columns>
    <filename>DRIVE:\PATH\FILENAME.txt</filename>
    <delimiter>\t</delimiter>
  </dataSource>
</input>
When running the first overlap (or initial load), in the <service> section set:
<engine destroy="false">0</engine>
The zero means it should create a new engine instance as usual, but the destroy="false" option tells it to leave that instance running with the table 1 data still loaded.
Run subsequent overlaps by specifying the engine number created in the first job, e.g.
<engine destroy="false">1</engine>