H6.1 - matchIT Hub for Spark - Introduction

Updated: May 06, 2022 20:16

Apache Spark™ is a fast, general-purpose, open-source cluster-computing engine for large-scale data processing.

The matchIT Hub for Spark product provides transformation steps that add the deduplication functionality of matchIT Hub to Spark. This allows you to find matches in and across any combination of databases supported by Spark.

Prerequisites

You can run the sample apps out of the box, but you may wish to tailor them to your specific requirements. To do so, you will need Java and Maven.

Java

Download and install Java from http://java.com/en/download/manual.jsp.
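To confirm that the installation is usable for rebuilding the samples, a small check such as the one below can help. It is only a sketch: the class name JavaCheck is hypothetical and not part of matchIT Hub for Spark. It relies on the standard java.specification.version system property and on javax.tools.ToolProvider, which returns no compiler when only a runtime (JRE) is installed. Maven needs that compiler to build the sample applications.

```java
import javax.tools.ToolProvider;

// Hypothetical helper, not part of the matchIT Hub for Spark distribution.
class JavaCheck {
    // Converts a "java.specification.version" value to a major version number.
    // Java 8 and earlier report "1.8", "1.7", ...; Java 9 and later report "9", "11", "17", ...
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        if (parts[0].equals("1") && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return Integer.parseInt(parts[0]);
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Java major version: " + majorVersion(spec));

        // getSystemJavaCompiler() returns null when no compiler is available,
        // which is typically the case on a runtime-only (JRE) installation.
        boolean hasCompiler = ToolProvider.getSystemJavaCompiler() != null;
        System.out.println("Compiler available: " + hasCompiler);
    }
}
```

If the second line reports false, installing a full JDK rather than a runtime-only package is the likely fix.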

Maven

The easiest way to rebuild the sample applications is to use Maven with the supplied pom.xml files.

Download and install Maven from https://maven.apache.org.

Spark

Spark can be run on Windows, but Linux is the platform of choice. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Spark can access data in HDFS, Cassandra, HBase, Hive, Tachyon (now named Alluxio), and any Hadoop data source.
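In practice, the deployment options above are selected through the master URL given to Spark, for example with the --master option of spark-submit. The sketch below builds the standard URL forms for each cluster manager. The class name SparkMasterUrls and the host names are hypothetical, and the ports shown are only Spark's documented defaults, so adjust them to your own cluster.

```java
// Hypothetical helper illustrating Spark master URL formats.
// It is not part of Spark or of matchIT Hub for Spark.
class SparkMasterUrls {
    // Run locally, using as many worker threads as there are cores.
    static String local() {
        return "local[*]";
    }

    // Spark standalone cluster; the master listens on port 7077 by default.
    static String standalone(String host, int port) {
        return "spark://" + host + ":" + port;
    }

    // Apache Mesos cluster; the Mesos master listens on port 5050 by default.
    static String mesos(String host, int port) {
        return "mesos://" + host + ":" + port;
    }

    // Hadoop YARN; the cluster location is read from the Hadoop configuration
    // (HADOOP_CONF_DIR or YARN_CONF_DIR), so no host is needed in the URL.
    static String yarn() {
        return "yarn";
    }

    public static void main(String[] args) {
        // Host names below are placeholders.
        System.out.println(local());
        System.out.println(standalone("spark-master.example.com", 7077));
        System.out.println(mesos("mesos-master.example.com", 5050));
        System.out.println(yarn());
    }
}
```

Keeping the master URL out of the application code and passing it at submit time lets the same job run unchanged on a local machine and on a cluster.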

We recommend using AWS EMR to run jobs, as Spark clusters can be spun up easily on an ad hoc basis. The matchIT Hub for Spark installation includes sample scripts for submitting jobs to EMR.
