HubSpark.jar is the Syniti Match API for Spark package (com.matchIT.Hub.spark) and contains the following classes. You only need this jar if you are building your own applications; the sample apps already include it.
Syniti Match API for Spark API Classes
The following classes provide a high-level interface to the Syniti Match API functionality.
DedupeConfiguration | Base class for configuration options for the Syniti Match API for Spark and its sample applications. |
HubSchema | Provides a schema for each stage of processing, based on configuration settings. |
HubSpark | Base class for HubSparkDataFrame and HubSparkRDD. |
HubSparkDataFrame | High-level functions for deduping data held in DataFrames. |
HubSparkRDD | High-level functions for deduping data held in RDDs. |
HubStats | Used to collect statistics across the various stages and partitions of a job. |
DedupeConfiguration
Base class for parsing XML configuration options for the Syniti Match API for Spark and its sample applications.
The constructor is passed the name of the tag, within <config>, in which to find the configuration options. The default is "dedupeSpark", i.e. <config><dedupeSpark>...</dedupeSpark></config>
DedupeConfiguration(String appTag)
The XML settings are in the following format:
<?xml version="1.0" encoding="utf-8" ?>
<config>
<dedupeSpark>
<licenceFile>./activation.txt</licenceFile>
<delimiter>\t</delimiter>
<logLevel>error</logLevel>
<warehouseLocation>/user/hive/warehouse</warehouseLocation>
<groupingAlgorithm>hub</groupingAlgorithm>
<schema>|id|full_name|last_name|addr1|addr2|city|state|zip</schema>
<idField>0</idField>
<maxIterations>4</maxIterations>
</dedupeSpark>
</config>
licenceFile | A file containing the product activation code. |
delimiter | The delimiter used in the input file and used in the delimited Strings in the RDDs. |
logLevel | Minimum severity level of errors to log. See org.apache.log4j.Level. |
warehouseLocation | Location of the Hive warehouse. Only required if using Hive. |
groupingAlgorithm | The grouping algorithm to use to group matching pairs together. |
schema | Field names to use in a DataFrame. This setting is optional; if present, it overrides the field names in the header row. This is useful if you need to rename a unique ref field to 'id' in order to use graphX for grouping. |
idField | Field number to use as the unique ref field in an RDD. This setting is required when using Kwartile for grouping, in order to join the pairs of unique refs back to the source data. |
maxIterations | The maximum number of iterations to allow when using the Kwartile grouping algorithm. |
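Example usage (a minimal sketch; how the XML settings file itself is located and loaded by the application, for example via a command-line argument in the sample apps, is not shown here):
// Read the options from the <dedupeSpark> tag within <config>
DedupeConfiguration config = new DedupeConfiguration("dedupeSpark");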
HubSpark
Base class for high-level functions for deduping data held in DataFrames (Dataset<Row>) or RDDs (JavaRDD<String>, where each String is a delimited record).
The class’ public methods are:
void init(String appName, DedupeConfiguration config) | Creates a SparkSession based on the supplied config details and application name. Creates instances of HubSchema and HubStats. |
void close() | Prints stats and closes the SparkSession. |
SparkSession getSparkSession() | Returns the SparkSession. |
JavaSparkContext getSparkContext() | Returns the JavaSparkContext. |
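Example lifecycle (a sketch; it assumes HubSparkDataFrame exposes a no-argument constructor, that config was built as shown under DedupeConfiguration, and the application name is illustrative):
HubSparkDataFrame hub = new HubSparkDataFrame();
hub.init("dedupeSample", config);            // creates the SparkSession, HubSchema and HubStats
SparkSession spark = hub.getSparkSession();
// ... load data and run matching/grouping ...
hub.close();                                 // prints stats and closes the SparkSession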
HubSparkDataFrame
The class’ public methods are:
Dataset<Row> matching(Dataset<Row> mainInput) | Perform internal matching on a dataset and return a dataset of matching pairs. |
Dataset<Row> matching(Dataset<Row> mainInput, Dataset<Row> overlapInput) | Perform overlap matching on two datasets and return a dataset of matching pairs. |
Dataset<Row> grouping(Dataset<Row> pairs) | Perform Grouping (using Syniti Match API's Grouping mode) on a dataset of matching pairs. |
Dataset<Row> groupingGraph(Dataset<Row> input, Dataset<Row> pairs) | Perform Grouping (using graphX's connected components algorithm) on a dataset of matching pairs. The input DataFrame must have a unique ref column named 'id', and the pairs DataFrame must have unique ref columns named 'src' and 'dst' (HubSchema takes care of naming the matching pairs columns). |
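Example usage (a sketch; hub is an initialised HubSparkDataFrame and input is a Dataset<Row> of source records, both names illustrative):
Dataset<Row> pairs = hub.matching(input);        // internal matching -> matching pairs
Dataset<Row> groups = hub.grouping(pairs);       // group the pairs using Hub's Grouping mode

// Or group with graphX connected components; input must have a unique ref column named 'id'
Dataset<Row> graphGroups = hub.groupingGraph(input, pairs);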
HubSparkRDD
The class’ public methods are:
JavaRDD<String> matching(JavaRDD<String> mainInput) | Perform internal matching on an RDD and return an RDD of matching pairs. |
JavaRDD<String> matching(JavaRDD<String> mainInput, JavaRDD<String> overlapInput) | Perform overlap matching on two RDDs and return an RDD of matching pairs. |
JavaRDD<String> grouping(JavaRDD<String> pairs) | Perform Grouping (using Syniti Match API's Grouping mode) on an RDD of matching pairs. |
JavaRDD<String> groupingKwartile(JavaRDD<String> input, JavaRDD<String> pairs) | Perform Grouping (using Kwartile's connected components algorithm) on an RDD of matching pairs. The unique ref column number in the input RDD is specified by the idField configuration setting. |
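Example usage (a sketch; hub is an initialised HubSparkRDD and records is a JavaRDD<String> of delimited records, both names illustrative):
JavaRDD<String> pairs = hub.matching(records);       // internal matching -> matching pairs
JavaRDD<String> groups = hub.grouping(pairs);        // group using Hub's Grouping mode

// Or group with Kwartile's connected components; idField identifies the unique ref column
JavaRDD<String> kwartileGroups = hub.groupingKwartile(records, pairs);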
HubSchema
Provides a schema for each stage of processing, based on configuration settings. Helper class that populates org.apache.spark.sql.types.StructType schema structures for each data processing class' inputs and outputs.
The constructor is passed an activation code and the same Hub settings xml used by the data processing classes.
HubSchema(String activationCode, String hubSettings)
The class’ public methods are:
StructType getInputSchema(String columns) | Generates an input schema given a delimited list of input columns (columns must start with the delimiter used). |
StructType getKeyGenerationOutputSchema(int table) | Generates Key Generation output schema for the given table. |
StructType getPairMatchingOutputSchema() | Generates the Pair Matching output schema (i.e. matching pairs). |
StructType getGroupingOutputSchema() | Generates the Grouping output schema. |
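Example usage (a sketch; activationCode, hubSettings, the SparkSession variable spark and the file path are assumptions for illustration):
HubSchema hubSchema = new HubSchema(activationCode, hubSettings);
// The column list starts with the delimiter it uses, as in the <schema> config example above
StructType inputSchema = hubSchema.getInputSchema("|id|full_name|last_name|addr1|addr2|city|state|zip");

Dataset<Row> input = spark.read()
        .option("delimiter", "\t")
        .option("header", "true")
        .schema(inputSchema)
        .csv("input.txt");    // illustrative path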
HubStats
Used to collect statistics across the various stages and partitions of a job. Create one instance of this class and pass it to all the data processing tasks. When processing is finished, call HubStats::print() to display the total statistics.
The constructor is passed the JavaSparkContext and the number of exact and fuzzy keys in the configuration.
HubStats(JavaSparkContext context, int numExactKeys, int numFuzzyKeys)
The class’ public methods are:
void add(String statsXml) | Adds the figures from the given Hub statistics XML to the accumulators. |
void addClusters(long newClusters, long newLargeClusters) | Adds the given values to the accumulators for the number of clusters and large clusters. |
void print() | Prints the stats to System.out. |
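Example usage (a sketch; sparkContext is the JavaSparkContext returned by HubSpark::getSparkContext(), and the key counts are illustrative and must match your Hub settings):
HubStats stats = new HubStats(sparkContext, 2, 3);
// ... pass stats to each data processing task ...
stats.print();    // prints the totals to System.out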
Data Processing Classes
The following low-level classes implement Spark functions used in transformations. Each has two versions: a “String” version for working with JavaRDD<String>, where each String is a delimited record, and a “Row” version for working with JavaRDD<Row>/Dataset<Row>.
GroupingRow | Data processing class that performs Grouping of matching pairs. |
GroupingString | Data processing class that performs Grouping of matching pairs. |
GroupMatchingString | Data processing class that performs GroupMatching to post-process records already grouped. |
KeyedToKeyValuesRow | Data processing class that converts Keyed records into {key, value} pairs. |
KeyedToKeyValuesString | Data processing class that converts Keyed records into {key, value} pairs. |
KeyGenerationRow | Data processing class that performs Key Generation. |
KeyGenerationString | Data processing class that performs Key Generation. |
PairMatchingRow | Data processing class that performs Pair Matching. |
PairMatchingString | Data processing class that performs Pair Matching. |
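Putting the “String” classes together, the low-level internal matching pipeline looks roughly like this (a sketch; rows, activationCode, hubSettings, delimiter and stats are assumed to be set up as described elsewhere in this document):
// Key Generation: append key values to each delimited record
JavaRDD<String> keyed = rows.mapPartitions(
        new KeyGenerationString(0, activationCode, hubSettings, delimiter, stats));

// Expand each keyed record into {key, value} pairs and group candidates by key
JavaPairRDD<String, String> keys = keyed.flatMapToPair(
        new KeyedToKeyValuesString(delimiter));
JavaPairRDD<String, Iterable<String>> clusters = keys.groupByKey();

// Pair Matching within each cluster, then Grouping of the matching pairs
JavaRDD<String> pairs = clusters.mapPartitions(
        new PairMatchingString(false, activationCode, hubSettings, delimiter, stats));
JavaRDD<String> groups = pairs.mapPartitions(
        new GroupingString(activationCode, hubSettings, delimiter, stats));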
Grouping
Groups matching pairs.
The constructors take: activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
GroupingString(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
GroupingRow(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
GroupingRow
Implements FlatMapFunction<Iterator<Row>, Row> for use with JavaRDD::mapPartitions().
JavaRDD<Row> groups = allPairs.mapPartitions(
new GroupingRow(activationCode,
hubSettings,
delimiter,
stats));
GroupingString
Implements FlatMapFunction<Iterator<String>, String> for use with JavaRDD::mapPartitions().
JavaRDD<String> groups = allPairs.mapPartitions(
new GroupingString(activationCode,
hubSettings,
delimiter,
stats));
GroupMatching
Re-processes groups of matching records. This is for use when matching pairs have been grouped by some means other than Hub's Grouping mode - for example, Kwartile's Map/Reduce implementation of connected components. The point of reprocessing using Hub's GroupMatching mode is to apply the Bridging Prevention and Master Record Identification functionality, and to apply scores, etc.
The constructors take: activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
GroupMatchingString(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
GroupMatchingString
Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<String>>>, String> for use with JavaPairRDD::mapPartitions().
JavaRDD<String> groups = grouped.mapPartitions(
new GroupMatchingString(activationCode,
hubSettings,
delimiter,
stats));
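The grouped JavaPairRDD in the example above might be prepared along these lines (a sketch; records is an illustrative JavaRDD<String>, the group id is assumed to be held in a field of each record, and extractGroupId is a hypothetical helper that reads it):
// Pair each record with its group id (e.g. the connected-component id from Kwartile),
// then collect the members of each group together
JavaPairRDD<String, String> byGroup = records.mapToPair(
        r -> new Tuple2<>(extractGroupId(r), r));
JavaPairRDD<String, Iterable<String>> grouped = byGroup.groupByKey();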
KeyedToKeyValues
Applied to the output of KeyGeneration, generates a {key, value} pair for each key. The output of KeyGeneration has all the key values appended in a new field. This task converts that input into {key, value} pairs, turning the JavaRDD into a JavaPairRDD.
KeyedToKeyValuesRow
The constructor takes the StructType schema used in the Row records.
KeyedToKeyValuesRow(StructType schema)
Implements PairFlatMapFunction<Row, String, Row> for use with JavaRDD::flatMapToPair().
JavaPairRDD<String, Row> keys = keyed.javaRDD().flatMapToPair(
new KeyedToKeyValuesRow(keyGenOutputSchema));
KeyedToKeyValuesString
The constructor takes the delimiter used in the String records.
KeyedToKeyValuesString(String delimiter)
Implements PairFlatMapFunction<String, String, String> for use with JavaRDD::flatMapToPair().
JavaPairRDD<String, String> keys = keyed.flatMapToPair(
new KeyedToKeyValuesString(delimiter));
KeyGeneration
Appends all key values to each record using Hub in a new “Key Generation” mode.
The constructors take: a table number (0, 1, or 2), activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
KeyGenerationString(int table,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
KeyGenerationRow(int table,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
KeyGenerationRow
Implements MapPartitionsFunction<Row, Row> for use with Dataset::mapPartitions().
Dataset<Row> keyed = rowsDF.mapPartitions(
new KeyGenerationRow(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats),
encoder);
Implements FlatMapFunction<Iterator<Row>, Row> for use with JavaRDD::mapPartitions().
JavaRDD<Row> keyed = rowsDF.javaRDD().mapPartitions(
new KeyGenerationRow(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats));
KeyGenerationString
Implements FlatMapFunction<Iterator<String>, String> for use with JavaRDD::mapPartitions().
Example usage:
JavaRDD<String> keyed = rows.mapPartitions(
new KeyGenerationString(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats));
PairMatching
Applied to the output of KeyedToKeyValues grouped by key, compares every record in each group with every other record in the group (whilst avoiding duplicate comparisons). Sends pairs of records to Hub in Pair Matching mode. Outputs matching pairs.
The constructors take: a flag to indicate if overlap matching, activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
PairMatchingString(boolean overlap,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
PairMatchingRow(boolean overlap,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
PairMatchingRow
Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<Row>>>, Row> for use with JavaPairRDD<String, Iterable<Row>>::mapPartitions().
JavaRDD<Row> pairs = clusters.mapPartitions(
new PairMatchingRow(overlap,
activationCode,
hubSettings,
delimiter,
stats));
PairMatchingString
Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<String>>>, String> for use with JavaPairRDD<String, Iterable<String>>::mapPartitions().
JavaRDD<String> pairs = clusters.mapPartitions(
new PairMatchingString(overlap,
activationCode,
hubSettings,
delimiter,
stats));