HubSpark.jar is the Syniti Match API for Spark package (com.matchIT.Hub.spark) and contains the following classes. You only need this jar if you are building your own applications; the sample apps already include it.
Syniti Match API for Spark API Classes
The following classes provide a high-level interface to the Syniti Match API functionality.
DedupeConfiguration | Base class for configuration options for the Syniti Match API for Spark and its sample applications. |
HubSchema | Provides a schema for each stage of processing, based on configuration settings. |
HubSpark | Base class for HubSparkDataFrame and HubSparkRDD. |
HubSparkDataFrame | High-level functions for deduping data held in DataFrames. |
HubSparkRDD | High-level functions for deduping data held in RDDs. |
HubStats | Used to collect statistics across the various stages and partitions of a job. |
DedupeConfiguration
Base class for parsing XML configuration options for the Syniti Match API for Spark and its sample applications.
The constructor is passed the name of the tag, within <config>, in which to find the configuration options. The default is "dedupeSpark", i.e. <config><dedupeSpark>...</dedupeSpark></config>
DedupeConfiguration(String appTag)
The XML settings are in the following format:
<?xml version="1.0" encoding="utf-8" ?>
<config>
<dedupeSpark>
<licenceFile>./activation.txt</licenceFile>
<delimiter>\t</delimiter>
<logLevel>error</logLevel>
<warehouseLocation>/user/hive/warehouse</warehouseLocation>
<groupingAlgorithm>hub</groupingAlgorithm>
<schema>|id|full_name|last_name|addr1|addr2|city|state|zip</schema>
<idField>0</idField>
<maxIterations>4</maxIterations>
</dedupeSpark>
</config>
licenceFile | A file containing the product activation code. |
delimiter | The delimiter used in the input file and used in the delimited Strings in the RDDs. |
logLevel | Minimum severity level of errors to log. See org.apache.log4j.Level. |
warehouseLocation | Location of the Hive warehouse. Only required if using Hive. |
groupingAlgorithm | The grouping algorithm to use to group matching pairs together. |
schema | Field names to use in a DataFrame. This setting is optional; if present, it overrides the field names in the header row. This is useful if you need to rename a unique ref field to 'id' in order to use graphX for grouping. |
idField | Field number to use as the unique ref field in an RDD. This setting is required when using Kwartile for grouping, in order to join the pairs of unique refs back to the source data. |
maxIterations | The maximum number of iterations to allow when using the Kwartile grouping algorithm. |
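Example usage (a minimal sketch; how the XML settings file itself is located and loaded by the application, for example via a command-line argument in the sample apps, is not shown here):
// Read the options from the <dedupeSpark> tag within <config>
DedupeConfiguration config = new DedupeConfiguration("dedupeSpark");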
HubSpark
Base class for high-level functions for deduping data held in DataFrames (Dataset<Row>) or RDDs (JavaRDD<String>, where each String is a delimited record).
The class’ public methods are:
void init(String appName, DedupeConfiguration config) | Creates a SparkSession based on the supplied config details and application name. Creates instances of HubSchema and HubStats. |
void close() | Prints stats and closes the SparkSession. |
SparkSession getSparkSession() | Returns the SparkSession. |
JavaSparkContext getSparkContext() | Returns the JavaSparkContext. |
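Example lifecycle (a sketch; it assumes HubSparkDataFrame exposes a no-argument constructor, that config was built as shown under DedupeConfiguration, and the application name is illustrative):
HubSparkDataFrame hub = new HubSparkDataFrame();
hub.init("dedupeSample", config);            // creates the SparkSession, HubSchema and HubStats
SparkSession spark = hub.getSparkSession();
// ... load data and run matching/grouping ...
hub.close();                                 // prints stats and closes the SparkSession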
HubSparkDataFrame
The class’ public methods are:
Dataset<Row> matching(Dataset<Row> mainInput) | Perform internal matching on a dataset and return a dataset of matching pairs. |
Dataset<Row> matching(Dataset<Row> mainInput, Dataset<Row> overlapInput) | Perform overlap matching on two datasets and return a dataset of matching pairs. |
Dataset<Row> grouping(Dataset<Row> pairs) | Perform Grouping (using Syniti Match API's Grouping mode) on a dataset of matching pairs. |
Dataset<Row> groupingGraph(Dataset<Row> input, Dataset<Row> pairs) | Perform Grouping (using graphX's connected components algorithm) on a dataset of matching pairs. The input DataFrame must have a unique ref column named 'id', and the pairs DataFrame must have unique ref columns named 'src' and 'dst' (HubSchema takes care of naming the matching pairs columns). |
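Example usage (a sketch; hub is an initialised HubSparkDataFrame and input is a Dataset<Row> of source records, both names illustrative):
Dataset<Row> pairs = hub.matching(input);        // internal matching -> matching pairs
Dataset<Row> groups = hub.grouping(pairs);       // group the pairs using Hub's Grouping mode

// Or group with graphX connected components; input must have a unique ref column named 'id'
Dataset<Row> graphGroups = hub.groupingGraph(input, pairs);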
HubSparkRDD
The class’ public methods are:
JavaRDD<String> matching(JavaRDD<String> mainInput) | Perform internal matching on an RDD and return an RDD of matching pairs. |
JavaRDD<String> matching(JavaRDD<String> mainInput, JavaRDD<String> overlapInput) | Perform overlap matching on two RDDs and return an RDD of matching pairs. |
JavaRDD<String> grouping(JavaRDD<String> pairs) | Perform Grouping (using Syniti Match API's Grouping mode) on an RDD of matching pairs. |
JavaRDD<String> groupingKwartile(JavaRDD<String> input, JavaRDD<String> pairs) | Perform Grouping (using Kwartile's connected components algorithm) on an RDD of matching pairs. The unique ref column number in the input RDD is specified by the idField configuration setting. |
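Example usage (a sketch; hub is an initialised HubSparkRDD and records is a JavaRDD<String> of delimited records, both names illustrative):
JavaRDD<String> pairs = hub.matching(records);       // internal matching -> matching pairs
JavaRDD<String> groups = hub.grouping(pairs);        // group using Hub's Grouping mode

// Or group with Kwartile's connected components; idField identifies the unique ref column
JavaRDD<String> kwartileGroups = hub.groupingKwartile(records, pairs);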
HubSchema
Provides a schema for each stage of processing, based on configuration settings. Helper class that populates org.apache.spark.sql.types.StructType schema structures for each data processing class' inputs and outputs.
The constructor is passed an activation code and the same Hub settings xml used by the data processing classes.
HubSchema(String activationCode, String hubSettings)
The class’ public methods are:
StructType getInputSchema(String columns) | Generates an input schema given a delimited list of input columns (columns must start with the delimiter used). |
StructType getKeyGenerationOutputSchema(int table) | Generates Key Generation output schema for the given table. |
StructType getPairMatchingOutputSchema() | Generates the Pair Matching output schema (i.e. matching pairs). |
StructType getGroupingOutputSchema() | Generates the Grouping output schema. |
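Example usage (a sketch; activationCode, hubSettings, the SparkSession variable spark and the file path are assumptions for illustration):
HubSchema hubSchema = new HubSchema(activationCode, hubSettings);
// The column list starts with the delimiter it uses, as in the <schema> config example above
StructType inputSchema = hubSchema.getInputSchema("|id|full_name|last_name|addr1|addr2|city|state|zip");

Dataset<Row> input = spark.read()
        .option("delimiter", "\t")
        .option("header", "true")
        .schema(inputSchema)
        .csv("input.txt");    // illustrative path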
HubStats
Used to collect statistics across the various stages and partitions of a job. Create one instance of this class and pass it to all the data processing tasks. When processing is finished, call HubStats::print() to display the total statistics.
The constructor is passed the JavaSparkContext and the number of exact and fuzzy keys in the configuration.
HubStats(JavaSparkContext context, int numExactKeys, int numFuzzyKeys)
The class’ public methods are:
void add(String statsXml) | Adds the figures from the given Hub statistics XML to the accumulators. |
void addClusters(long newClusters, long newLargeClusters) | Adds the given values to the accumulators for the number of clusters and large clusters. |
void print() | Prints the stats to System.out. |
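Example usage (a sketch; sparkContext is the JavaSparkContext returned by HubSpark::getSparkContext(), and the key counts are illustrative and must match your Hub settings):
HubStats stats = new HubStats(sparkContext, 2, 3);
// ... pass stats to each data processing task ...
stats.print();    // prints the totals to System.out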
Data Processing Classes
The following low-level classes implement Spark functions used in transformations. Each has two versions: a “String” version for working with JavaRDD<String>, where each String is a delimited record, and a “Row” version for working with JavaRDD<Row>/Dataset<Row>.
GroupingRow | Data processing class that performs Grouping of matching pairs. |
GroupingString | Data processing class that performs Grouping of matching pairs. |
GroupMatchingString | Data processing class that performs GroupMatching to post-process records already grouped. |
KeyedToKeyValuesRow | Data processing class that converts Keyed records into {key, value} pairs. |
KeyedToKeyValuesString | Data processing class that converts Keyed records into {key, value} pairs. |
KeyGenerationRow | Data processing class that performs Key Generation. |
KeyGenerationString | Data processing class that performs Key Generation. |
PairMatchingRow | Data processing class that performs Pair Matching. |
PairMatchingString | Data processing class that performs Pair Matching. |
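Putting the “String” classes together, the low-level internal matching pipeline looks roughly like this (a sketch; rows, activationCode, hubSettings, delimiter and stats are assumed to be set up as described elsewhere in this document):
// Key Generation: append key values to each delimited record
JavaRDD<String> keyed = rows.mapPartitions(
        new KeyGenerationString(0, activationCode, hubSettings, delimiter, stats));

// Expand each keyed record into {key, value} pairs and group candidates by key
JavaPairRDD<String, String> keys = keyed.flatMapToPair(
        new KeyedToKeyValuesString(delimiter));
JavaPairRDD<String, Iterable<String>> clusters = keys.groupByKey();

// Pair Matching within each cluster, then Grouping of the matching pairs
JavaRDD<String> pairs = clusters.mapPartitions(
        new PairMatchingString(false, activationCode, hubSettings, delimiter, stats));
JavaRDD<String> groups = pairs.mapPartitions(
        new GroupingString(activationCode, hubSettings, delimiter, stats));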
Grouping
Groups matching pairs.
The constructors take: activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
GroupingString(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
GroupingRow(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
GroupingRow
Implements FlatMapFunction<Iterator<Row>, Row> for use with JavaRDD::mapPartitions().
JavaRDD<Row> groups = allPairs.mapPartitions(
new GroupingRow(activationCode,
hubSettings,
delimiter,
stats));
GroupingString
Implements FlatMapFunction<Iterator<String>, String> for use with JavaRDD::mapPartitions().
JavaRDD<String> groups = allPairs.mapPartitions(
new GroupingString(activationCode,
hubSettings,
delimiter,
stats));
GroupMatching
Re-processes groups of matching records. This is for use when matching pairs have been grouped by some means other than Hub's Grouping mode - for example, Kwartile's Map/Reduce implementation of connected components. The point of reprocessing using Hub's GroupMatching mode is to apply the Bridging Prevention and Master Record Identification functionality, and to apply scores, etc.
The constructors take: activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
GroupMatchingString(String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
GroupMatchingString
Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<String>>>, String> for use with JavaPairRDD::mapPartitions().
JavaRDD<String> groups = grouped.mapPartitions(
new GroupMatchingString(activationCode,
hubSettings,
delimiter,
stats));
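The grouped JavaPairRDD in the example above might be prepared along these lines (a sketch; records is an illustrative JavaRDD<String>, the group id is assumed to be held in a field of each record, and extractGroupId is a hypothetical helper that reads it):
// Pair each record with its group id (e.g. the connected-component id from Kwartile),
// then collect the members of each group together
JavaPairRDD<String, String> byGroup = records.mapToPair(
        r -> new Tuple2<>(extractGroupId(r), r));
JavaPairRDD<String, Iterable<String>> grouped = byGroup.groupByKey();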
KeyedToKeyValues
Applied to the output of KeyGeneration, generates a {key, value} pair for each key. The output of KeyGeneration has all the key values appended in a new field. This task converts that input into {key, value} pairs, turning the JavaRDD into a JavaPairRDD.
KeyedToKeyValuesRow
The constructor takes the StructType schema used in the Row records.
KeyedToKeyValuesRow(StructType schema)
Implements PairFlatMapFunction<Row, String, Row> for use with JavaRDD::flatMapToPair().
JavaPairRDD<String, Row> keys = keyed.javaRDD().flatMapToPair(
new KeyedToKeyValuesRow(keyGenOutputSchema));
KeyedToKeyValuesString
The constructor takes the delimiter used in the String records.
KeyedToKeyValuesString(String delimiter)
Implements PairFlatMapFunction<String, String, String> for use with JavaRDD::flatMapToPair().
JavaPairRDD<String, String> keys = keyed.flatMapToPair(
new KeyedToKeyValuesString(delimiter));
KeyGeneration
Appends all key values to each record using Hub in a new “Key Generation” mode.
The constructors take: a table number (0, 1, or 2), activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
KeyGenerationString(int table,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
KeyGenerationRow(int table,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
KeyGenerationRow
Implements MapPartitionsFunction<Row, Row> for use with Dataset::mapPartitions().
Dataset<Row> keyed = rowsDF.mapPartitions(
new KeyGenerationRow(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats),
encoder);
Implements FlatMapFunction<Iterator<Row>, Row> for use with JavaRDD::mapPartitions().
JavaRDD<Row> keyed = rowsDF.javaRDD().mapPartitions(
new KeyGenerationRow(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats));
KeyGenerationString
Implements FlatMapFunction<Iterator<String>, String> for use with JavaRDD::mapPartitions().
Example usage:
JavaRDD<String> keyed = rows.mapPartitions(
new KeyGenerationString(overlap ? 1 : 0,
activationCode,
hubSettings,
delimiter,
stats));
PairMatching
Applied to the output of KeyedToKeyValues grouped by key, compares every record in each group with every other record in the group (whilst avoiding duplicate comparisons). Sends pairs of records to Hub in Pair Matching mode. Outputs matching pairs.
The constructors take: a flag to indicate if overlap matching, activation code, Hub xml settings, a delimiter to use when constructing the delimited records passed to Hub, and an instance of HubStats.
PairMatchingString(boolean overlap,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
PairMatchingRow(boolean overlap,
String activationCode,
String hubSettings,
String delimiter,
HubStats stats)
PairMatchingRow
Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<Row>>>, Row> for use with JavaPairRDD<String, Iterable<Row>>::mapPartitions().
JavaRDD<Row> pairs = clusters.mapPartitions(
new PairMatchingRow(overlap,
activationCode,
hubSettings,
delimiter,
stats));
PairMatchingString
Implements FlatMapFunction<Iterator<Tuple2<String, Iterable<String>>>, String> for use with JavaPairRDD<String, Iterable<String>>::mapPartitions().
JavaRDD<String> pairs = clusters.mapPartitions(
new PairMatchingString(overlap,
activationCode,
hubSettings,
delimiter,
stats));