Syniti Match API is currently available for these operating systems:
- Windows (XP, Vista, 7, 10, 2008, 2012);
- Linux (requires a distribution built on GLIBC 2.26 or newer, e.g. Ubuntu 20.04, Red Hat 8.4, SUSE 15.3, Amazon Linux 2).
Syniti Match API is fully multithreaded and highly scalable. By default it uses all available processor cores, but it can be restricted to fewer if necessary. The more cores Syniti Match API runs on, the faster it can process data.
Clients typically provision a VM with enough cores to process their data volume in a timely manner. These are the configurations we commonly see (a sizing sketch follows this list):
- 500 million to 1 billion+ records: a 64-core machine*
- 200 million+ records: a 32-core machine
- 80 million+ records: a 16-core machine
- 30 million+ records: an 8-core machine
- we suggest a minimum of 4 cores; even with 4 cores you can process millions of records an hour
*If you're looking to process a billion records in minutes instead of hours, we suggest distributed processing with Hub for Apache Spark.
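The tiers above can be expressed as a simple lookup. The sketch below is purely illustrative; the tier boundaries come from the list above, but the function name and structure are our own and are not part of the Syniti Match API.

```python
# Illustrative sizing helper: tier boundaries taken from the list above.
# This is our own sketch, not part of the Syniti Match API.
def recommended_cores(record_count: int) -> int:
    """Return the suggested core count for a given record volume."""
    if record_count >= 500_000_000:
        return 64   # 500 million to 1 billion+ records
    if record_count >= 200_000_000:
        return 32
    if record_count >= 80_000_000:
        return 16
    if record_count >= 30_000_000:
        return 8
    return 4        # suggested minimum; still processes millions of records an hour


print(recommended_cores(250_000_000))  # -> 32
```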
Syniti Match API runs entirely in-memory. As the volume of data increases, memory requirements also increase. We highly recommend running Syniti Match API on a machine with enough memory to process the data entirely in RAM, without resorting to disk storage.
As a rough guideline:
- a machine with 8 GB of RAM should comfortably process 15 million rows;
- a machine with 16 GB of RAM should comfortably process 30 million rows;
- a machine with 32 GB of RAM should comfortably process 60 million rows;
- a machine with 48 GB of RAM should comfortably process 80 million rows.
If overlapping two sources of data, use their summed row counts with these guidelines (for example, overlapping 100 million records against 20 million would require 80 GB of RAM); a rough estimator along these lines is sketched after the list of caveats below.
Note that these figures are highly dependent on factors such as:
- the average size of each row (these figures assume an average row size of 150 bytes);
- which match keys are used (refer to the Configuration Guide for details on match keys);
- the amount of duplication in the data.
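As a rough way to apply these figures, the sketch below interpolates between the guideline points above and scales for average row size. It is an illustrative estimate we have written for this guide, not part of the Syniti Match API, and actual usage also depends on the match keys used and the amount of duplication.

```python
# Rough RAM estimator built from the guideline figures above.
# Assumes the stated average row size of 150 bytes; pass a different
# avg_row_bytes to scale for wider or narrower rows.

# Documented guideline points: (rows processed comfortably, GB of RAM)
GUIDELINE = [
    (15_000_000, 8),
    (30_000_000, 16),
    (60_000_000, 32),
    (80_000_000, 48),
]


def estimated_ram_gb(total_rows: int, avg_row_bytes: int = 150) -> float:
    """Estimate the RAM needed to match total_rows rows in memory.

    For an overlap of two sources, pass the sum of their row counts.
    """
    # Normalise to the 150-byte average row size the guideline assumes.
    rows = total_rows * (avg_row_bytes / 150)

    prev_rows, prev_gb = 0, 0
    for max_rows, gb in GUIDELINE:
        if rows <= max_rows:
            # Linear interpolation between the surrounding guideline points.
            fraction = (rows - prev_rows) / (max_rows - prev_rows)
            return prev_gb + fraction * (gb - prev_gb)
        prev_rows, prev_gb = max_rows, gb

    # Beyond the last documented point, extrapolate at the final ratio.
    return rows * (GUIDELINE[-1][1] / GUIDELINE[-1][0])


# Example: overlapping a 100 million row source against a 20 million row source.
print(round(estimated_ram_gb(100_000_000 + 20_000_000)))  # ~72 GB; the guide rounds this case up to 80 GB
```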
Normalization: Note that when an engine is configured for normalization, a row of data added to the engine is discarded immediately after it's processed and output; it is otherwise not retained in RAM. The above RAM requirements are therefore not applicable, and memory usage is minimal.
Due to its in-memory architecture, Syniti Match API is many times faster than any other specialist data matching solution. It scales automatically across multiple processors, efficiently processing very high-volume data. Performance depends principally on hardware and match rate (duplication or overlap rate), but examples include:
- Finds overlap of 100,000 records against 50 million preloaded records in 11 seconds (uses 13 GB RAM, 20% match rate)*
- Matches 1 million records in 12 seconds (uses 500 MB RAM, 11% match rate)*
- Matches 50 million records in 52 minutes (uses 15 GB RAM, 12% match rate)*
- Matches 1 billion records in 15 minutes (using Apache Spark, 10% match rate)†
* Using a 10-core hyper-threaded Windows PC with 64 GB RAM
† Using a cluster of 20 machines on AWS, each with 192 GB RAM, 48 cores, and a 3.1 GHz Intel Xeon® Platinum 8175 CPU. Time to start up the machine cluster is an additional 6 minutes.