Fast Batch Processing
In batch processing, a processing job is launched at regular intervals to mine a pre-existing data set for interesting information. Batch processing is often used for tasks such as ETL (extract-transform-load, populating data warehouses), data mining, predictive analytics or transaction processing.
The workhorses in this area are filtering, joining, sorting, grouping and aggregating.
Relational databases were the traditional means of conducting such processing. The RDBMS approach has been displaced by tools like Hadoop or Spark, as they allow the computation to scale out and decouple storage from processing.
Batch processing made fast
Hazelcast Jet employs many performance optimisations that speed up batch processing by up to 15 times compared to Spark or Flink. Hadoop is outperformed by orders of magnitude.
See the complete benchmark
This performance gain is achieved by a combination of the DAG computation model, in-memory processing, data locality, partition mapping affinity, SP/SC queues and green threads. These key design decisions are explained on the performance page.
Hadoop and Spark clusters are heavyweight ecosystems that are complex to deploy and manage. Moreover, selecting the right modules and maintaining their dependencies and versions makes both development and operations a challenge.
In comparison, Jet is a single 10 MB Java library with zero dependencies. It starts fast, scales automatically and handles failures itself, without any additional infrastructure.
This makes it natural to embed Jet into an application to build data processing microservices. Each Jet processing job can easily be launched within its own cluster to maximize service isolation, and because Jet can be fully embedded, it is easier to build and maintain next-generation systems.
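As a minimal sketch of the embedded style (imports and surrounding class omitted, as in the word-count example on this page; the job-submission line is illustrative):

```java
// Start an embedded Jet node inside the application's JVM; nodes on the
// same network discover each other and form a cluster automatically.
JetInstance jet = Jet.newJetInstance();
try {
    // submit jobs here, e.g. jet.newJob(pipeline).join();
} finally {
    Jet.shutdownAll(); // shuts down all Jet instances started in this JVM
}
```

Because the node lives inside the application process, there is no separate cluster to provision: deploying the service deploys the processing engine with it.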
Batch unified with Stream
In batch processing, the complete dataset is assembled and available before a job is submitted. Although Hazelcast Jet is built on top of a streaming core, it is a great tool for building batch processing applications: to Jet, a batch dataset is simply a stream that ends once all the data has been processed.
As a result, the same programming interface is used for both batch and stream processing, making the transition to streaming straightforward.
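To illustrate, the same pipeline shape can be fed from a finite or an infinite source; only the source stage changes. The map-journal source name (`Sources.mapJournal`, `START_FROM_OLDEST`) is an assumption here and may differ between Jet versions; imports are omitted as in the other examples:

```java
// Batch: a Map is a finite source, so the job completes on its own.
Pipeline batch = Pipeline.create();
batch.drawFrom(Sources.<Long, String>map("lines"))
     .drainTo(Sinks.map("copy"));

// Stream: a map journal is an infinite source, so the job runs until
// cancelled. The intermediate stages would stay exactly the same.
Pipeline stream = Pipeline.create();
stream.drawFrom(Sources.<Long, String>mapJournal("lines", START_FROM_OLDEST))
      .drainTo(Sinks.map("copy"));
```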
Connect to your existing world
The source and sink adapters of Hazelcast Jet allow it to be plugged into an existing data processing pipeline.
Hazelcast Jet comes with pre-built connectors for the Hazelcast IMDG distributed Map, Cache and List, for HDFS and for local data files (e.g. CSV or logs).
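A sketch of combining two pre-built connectors, assuming a line-oriented file source named `Sources.files`; the directory path and sink name are illustrative, and imports are omitted as in the other examples:

```java
// Read lines from local files, keep the error lines, and write them
// into a Hazelcast IMDG distributed List.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.files("/var/log/app"))
 .filter(line -> line.contains("ERROR"))
 .drainTo(Sinks.list("errors"));
jet.newJob(p).join();
```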
When a Jet cluster is co-located with a Hazelcast IMDG or HDFS cluster, Jet makes use of data locality: every node reads only from its local partitions, so the cluster as a whole ingests the data efficiently.
You can also create your own connectors for integration with databases or enterprise applications.
A choice of three APIs
Our Pipeline API is a general-purpose, declarative API which provides developers with tools to compose distributed, concurrent batch computations from building blocks such as mappers, reducers, filters, aggregators and joiners. It is simple and easy to understand, yet powerful.
Here is the classic Word Count expressed in Pipeline API:
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Long, String>map("lines"))
 .flatMap(e -> traverseArray(delimiter.split(e.getValue().toLowerCase())))
 .filter(word -> !word.isEmpty())
 .groupBy(wholeItem(), counting())
 .drainTo(Sinks.map("counts"));
jet.newJob(p).join();
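The `delimiter` referenced above is not defined in the snippet; a plausible definition (an assumption, shown here as plain Java) is a pre-compiled regex matching runs of non-word characters:

```java
import java.util.regex.Pattern;

public class WordSplit {
    // Hypothetical definition of the `delimiter` used in the pipeline:
    // splits text on runs of non-word characters.
    static final Pattern delimiter = Pattern.compile("\\W+");

    public static void main(String[] args) {
        String[] words = delimiter.split("To be, or not to be".toLowerCase());
        System.out.println(String.join(",", words)); // prints to,be,or,not,to,be
    }
}
```

The pipeline's `filter` stage then removes any empty tokens the split may produce, e.g. when a line starts with punctuation.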
A distributed form of the Java 8 Stream API is also available in Hazelcast Jet. It is ideal for simple needs, and you still get all the scale and performance.
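A sketch of the distributed Stream API, assuming the Jet map proxy (here called `IStreamMap`) exposes a `stream()` whose operations run distributed across the cluster; the exact proxy type and names vary by version, and imports are omitted as in the other examples:

```java
// Count the non-empty lines of a distributed Map using familiar
// java.util.stream-style operations, executed cluster-wide.
IStreamMap<String, String> lines = jet.getMap("lines");
long nonEmpty = lines.stream()
                     .filter(e -> !e.getValue().isEmpty())
                     .count();
```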
For expert users, we also have the Core API, an edge- and vertex-level API. It is used for fine-grained control or to build your own DSL.
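A minimal Core API sketch, assuming the `DAG`, `Vertex`, `Edge` and `Processors` factory types of a typical Jet release; imports are omitted as in the other examples, and the vertex names are illustrative:

```java
// Two-vertex DAG: map each string to upper case, then drop empty strings.
// Processors.mapP/filterP are convenience processor factories; an Edge
// defines how items flow between the two vertices.
DAG dag = new DAG();
Vertex upper = dag.newVertex("upper", Processors.mapP((String s) -> s.toUpperCase()));
Vertex nonEmpty = dag.newVertex("non-empty", Processors.filterP((String s) -> !s.isEmpty()));
dag.edge(Edge.between(upper, nonEmpty));
```

At this level you control the DAG topology directly, including how edges route data between processors, which is what the Pipeline API generates for you behind the scenes.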