The sample applications demonstrate deduplication of CSV data loaded as a JavaRDD and of database tables loaded via JDBC into Datasets.
Each sample application folder contains:
| Item | Description |
| --- | --- |
| src | Folder containing the sample source code. |
| build.sh | Script to build the application using Maven and the pom.xml file. |
| <app>-jar-with-dependencies.jar | Pre-built executable jar. |
| pom.xml | Maven build configuration. |
| readme | Text file with an overview of the application. |
| run.sh | Example script to run the application. |
| sampleconfig.xml | Example configuration file. |
Additionally, the DedupeTextFile folder contains an example1.txt input file.
You don’t need to build the sample apps, because pre-built binaries are included. The build scripts are provided in case you want to modify the source to tailor the applications.
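For orientation, the idea of deduplicating CSV rows can be sketched in plain Java. This is only an illustrative sketch and not the sample applications' code: the samples work on Spark JavaRDDs and Datasets and take their settings from sampleconfig.xml, which is not shown here. The class name `CsvDedupeSketch`, its methods, and the sample rows are hypothetical, and the matching rule (exact match after trimming and lower-casing each field) is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of the sample applications.
class CsvDedupeSketch {

    // Builds a comparison key by trimming and lower-casing each field.
    // Assumption: fields never contain quoted commas, so a plain split is enough.
    static String normalize(String csvLine) {
        String[] fields = csvLine.split(",", -1);
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                key.append(',');
            }
            key.append(fields[i].trim().toLowerCase(Locale.ROOT));
        }
        return key.toString();
    }

    // Keeps the first line seen for each normalized key, preserving input order.
    static List<String> dedupe(List<String> csvLines) {
        Map<String, String> firstByKey = new LinkedHashMap<>();
        for (String line : csvLines) {
            firstByKey.putIfAbsent(normalize(line), line);
        }
        return new ArrayList<>(firstByKey.values());
    }

    public static void main(String[] args) {
        // Made-up rows; the second differs from the first only in case and spacing.
        List<String> rows = Arrays.asList(
                "John Smith,Boston",
                "john smith , boston",
                "Jane Doe,Austin");
        for (String row : dedupe(rows)) {
            System.out.println(row);
        }
    }
}
```

With the made-up rows above, the second row collapses into the first because the two differ only in case and whitespace. A real deduplication job would typically need more tolerant matching than this exact-key comparison.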