Fast Batch Processing
In batch processing, a processing job is launched regularly against an input data set. Batch processing is often used for tasks such as ETL (extract-transform-load) for populating data warehouses, data mining, and analytics.
The workhorse operations in this area are filtering, joining, sorting, grouping and aggregating.
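To make these operations concrete, here is a minimal sketch that combines all five on a tiny in-memory data set. It uses only the standard `java.util.stream` library, not the Jet API, and the `Order` and `Customer` classes, the `BatchOps` class name and the sample data are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BatchOps {
    // Hypothetical input types, used only for this illustration.
    static final class Order {
        final int customerId;
        final long amount;
        Order(int customerId, long amount) { this.customerId = customerId; this.amount = amount; }
    }

    static final class Customer {
        final int id;
        final String country;
        Customer(int id, String country) { this.id = id; this.country = country; }
    }

    /** Total order amount per country, counting only orders of at least minAmount. */
    static Map<String, Long> totalByCountry(List<Order> orders, List<Customer> customers, long minAmount) {
        // Build the lookup side of the join: customer id -> country.
        Map<Integer, String> countryById = customers.stream()
                .collect(Collectors.toMap(c -> c.id, c -> c.country));

        return orders.stream()
                .filter(o -> o.amount >= minAmount)                    // filtering
                .filter(o -> countryById.containsKey(o.customerId))    // inner join: drop unmatched orders
                .collect(Collectors.groupingBy(
                        o -> countryById.get(o.customerId),            // joining + grouping by country
                        TreeMap::new,                                  // sorting by key
                        Collectors.summingLong(o -> o.amount)));       // aggregating
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order(1, 50), new Order(2, 5), new Order(1, 30), new Order(3, 40), new Order(9, 100));
        List<Customer> customers = Arrays.asList(
                new Customer(1, "DE"), new Customer(2, "US"), new Customer(3, "AT"));
        System.out.println(totalByCountry(orders, customers, 10)); // {AT=40, DE=80}
    }
}
```

A distributed engine runs the same logical steps, but spreads the data and the work across cluster members.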
Traditionally, specialist ETL tools operating against relational databases were used. For large data sets, this approach has been replaced by tools such as Hadoop and Spark, which allow parallel computation against distributed storage.
Batch processing made fast
Hazelcast Jet speeds up batch processing by up to 15 times compared to that of Spark or Flink and outperforms Hadoop by orders of magnitude.
The following benchmarks show batch performance against a data set held in HDFS.
See the complete benchmark.
This performance gain is achieved by a combination of a DAG computation model, in-memory processing, data locality, partition mapping affinity, single-producer/single-consumer (SP/SC) queues and green threads. These key design decisions are explained on the performance page.
Hadoop and Spark clusters have heavyweight ecosystems that are complex to deploy and manage. Selecting the right modules and maintaining their dependencies and versions makes both development and operations a challenge.
In comparison, Jet is a single 10 MB Java library with zero dependencies. It starts quickly, scales automatically and handles failures itself without any additional infrastructure.
Hazelcast Jet can be fully embedded into an application to build data processing microservices, making it easier for manufacturers to build and maintain next-generation systems. Each Jet processing job can be launched within its own cluster to maximize service isolation.
Batch unified with Stream
In batch processing, the complete data set is assembled and available before a job is submitted for processing. As a third-generation big data engine, Hazelcast Jet treats batch as a specific type of stream processing, one with a finite source and no windows.
As a result, the same programming interface is used for both batch and stream processing, which makes the transition towards streaming straightforward.
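The idea of writing a transformation once and applying it to both finite and unbounded sources can be sketched with the standard `java.util.stream` library. This is an analogy only, not the Jet API; the class name `UnifiedPipeline` and the method `evenSquares` are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UnifiedPipeline {
    // The transformation is defined once and does not care whether the source is finite.
    static Stream<Integer> evenSquares(Stream<Integer> source) {
        return source.filter(n -> n % 2 == 0).map(n -> n * n);
    }

    public static void main(String[] args) {
        // Batch-style input: a finite source that is fully known up front.
        List<Integer> batch = evenSquares(Stream.of(1, 2, 3, 4))
                .collect(Collectors.toList());

        // Stream-style input: an unbounded source; only the first three results
        // are taken so that this demo terminates.
        List<Integer> firstThree = evenSquares(Stream.iterate(1, n -> n + 1))
                .limit(3)
                .collect(Collectors.toList());

        System.out.println(batch);      // [4, 16]
        System.out.println(firstThree); // [4, 16, 36]
    }
}
```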
Connect into your existing world
The source and sink adapters of Hazelcast Jet allow it to be plugged into an existing data processing pipeline.
When a Jet cluster is co-located with a Hazelcast IMDG or HDFS cluster, it makes use of data locality: every Jet node reads only from its own local partitions, which keeps reads efficient.
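The local-read rule can be illustrated with a small sketch in plain Java. It is not how Jet's connectors are implemented: the partition table, node names and the `LocalReads` class are hypothetical, and a real connector would obtain partition ownership from the storage cluster itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalReads {
    /**
     * Given a partition table (partition id -> owning node), returns for each node
     * the partitions it reads: exactly the ones it owns, so no data crosses the network.
     */
    static Map<String, List<Integer>> localAssignments(Map<Integer, String> partitionOwner) {
        Map<String, List<Integer>> byNode = new TreeMap<>();
        // Iterate in partition order so that each node's list is sorted.
        for (Map.Entry<Integer, String> e : new TreeMap<>(partitionOwner).entrySet()) {
            byNode.computeIfAbsent(e.getValue(), node -> new ArrayList<>()).add(e.getKey());
        }
        return byNode;
    }

    public static void main(String[] args) {
        // Hypothetical partition table for a two-node cluster.
        Map<Integer, String> owners = new HashMap<>();
        owners.put(0, "node-a");
        owners.put(1, "node-b");
        owners.put(2, "node-a");
        owners.put(3, "node-b");
        System.out.println(localAssignments(owners)); // {node-a=[0, 2], node-b=[1, 3]}
    }
}
```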
You can also create your own connectors for integration with databases or enterprise applications.
A choice of three APIs
Our Pipeline API is a general-purpose, declarative API that gives developers the tools to compose distributed, concurrent batch computations from building blocks such as mappers, reducers, filters, aggregators and joiners. It is simple and easy to understand, yet powerful.
Here is the classic Word Count expressed in the Pipeline API:
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<String>list("text"))
 .flatMap(word -> traverseArray(word.toLowerCase().split("\\W+")))
 .filter(word -> !word.isEmpty())
 .groupingKey(wholeItem())
 .aggregate(counting())
 .drainTo(Sinks.map("counts"));
A distributed form of the Java 8 Stream API is also available in Hazelcast Jet. It is ideal for simple needs, and you still get all the scale and performance.
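For comparison, this is what Word Count looks like in the standard, single-JVM `java.util.stream` API. It is a sketch of the programming style only: it does not use Jet's distributed variant, and the class name `StreamWordCount` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StreamWordCount {
    /** Counts lower-cased words across all lines; the result is sorted by word. */
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // Split each line into words on runs of non-word characters.
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                // split() can yield an empty first token when a line starts with punctuation.
                .filter(word -> !word.isEmpty())
                // Group by the word itself and count the occurrences.
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Hello world", "hello Jet")));
        // {hello=2, jet=1, world=1}
    }
}
```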
For expert users, we also have a Core API, which works at the level of vertices and edges. It is used for fine-grained control or to build your own DSL.