Hazelcast Jet Features
Engineered for Performance
The Performance of Jet
Distributed DAG Execution
Hazelcast Jet uses directed acyclic graphs (DAG) to model data processing tasks – Jet Jobs. The Jet Job is composed of vertices — units of parallel processing such as data source readers, joiners, sorters, aggregators, filters, mappers or output writers. Those are connected by edges representing the data flow.
Hazelcast Jet provides an ultra low-latency and high-throughput distributed DAG execution.
Low Latency End-to-end
Hazelcast Jet is built on top of a one-record-per-time streaming core (also known as continuous operators). That refers to processing incoming records just after they are ingested as opposed to accumulating the records into micro-batches before submitting.
This fast processing can be boosted by using embedded Hazelcast IMDG. IMDG provides elastic in-memory storage and it is a great tool for publishing the results of the computation or as a cache for data sets to be used during the computation. Extremely low end-to-end latencies can be achieved this way.
As data streams are unbounded and there is the possibility of infinite sequences of records, a tool is required to group individual records to a finite frames in order to run the computation. Hazelcast Jet windows provides a tool to define the grouping.
Types of windows supported by Jet:
- Fixed/tumbling – The stream is divided into same-length, non-overlapping chunks. Each event belongs to exactly one window.
- Sliding – Windows have fixed length, but are separated by a sliding interval.
- Session – Windows have various sizes and are defined based on session identifiers contained in the records. Session is closed after a period of inactivity (timeout).
Hazelcast Jet allows you to classify the records in a data stream based on the time stamp embedded in each record — the Event-time. Event-time processing is a natural requirement as users are mostly interested in handling the data based on the time that the event originated (the Event-time). Event-time processing is a first-class citizen in Jet.
For handling late events, there is a set of policies to determine whether the event is “still on time” or “late”, discarding the latter.
Handling Back Pressure
In the streaming system, it is necessary to control the flow of messages. The consumer cannot be flooded by more messages than it can process in a fixed amount of time. On the other hand, the processors should be left idle wasting resources.
Hazelcast Jet comes with a mechanism to handle back pressure. Every consumer keeps signaling to all the upstream producers how much free capacity it has. This information is naturally propagated upstream to keep the system balanced.
Connected to Message Broker
Hazelcast Jet benefits from message brokers for ingesting data streams and is able to work as a data processor connected to a message broker in the data pipeline.
Jet comes with a Kafka connector for reading from and writing to Kafka topics.
Batch Processing and ETL
Batches Disguised as Streams
Although Jet is based on a streaming core, it is often used on top of bounded, finite data sets (often referred as batch tasks). Jet treats such data sets as if it were a stream that suddenly ends. Batches are processed just as streams in the same one-record-per-time streaming core.
Connectors for ETL
Hazelcast Jet includes connectors for Hadoop Distributed File System (HDFS), local data files (e.g., CSV or logs), and Hazelcast IMDG. Those should be combined in one pipeline to take advantage of the benefits of each, e.g., reading large data sets from HDFS and using distributed in-memory caches of Hazelcast IMDG to enrich processed records.
Processing Hadoop Data
Hadoop Distributed File System (HDFS) is the common tool used for building data warehouses and data lakes. Hazelcast Jet can use HDFS as either data source or destination. If Jet and HDFS clusters are co-located, then Jet benefits from the data locality and processes the data from the same node without sending them over the network.
Taking Advantage of Hazelcast IMDG
Hazelcast Jet is the distributed data processing tool that takes advantage of being integrated with the Hazelcast In-Memory Data Grid.
Hazelcast Jet Embeds Hazelcast IMDG
The complete Hazelcast IMDG is embedded in Jet. So all the services of IMDG are available to your Jet Jobs. As IMDG is the embedded, support structure for Jet, IMDG is fully controlled by Jet (start, shutdown, scaling etc.)
Benefits of embedded Hazelcast IMDG:
- Sharing the processing state among Jet Jobs.
- Caching intermediate processing results.
- Enriching processed events; cache remote data, e.g., fact tables from a database, on Jet nodes.
- Running advanced data processing tasks on top of Hazelcast data structures.
- Improving development processes by making start up of a Jet cluster simple and fast.
Distributing Jet Processed Data with Hazelcast IMDG
Jet Jobs take advantage of the Hazelcast IMDG connector by allowing reading and writing records to/from a remote Hazelcast IMDG instance.
Use a remote Hazelcast IMDG cluster for:
- Distributing data across IMap, ICache and IList structures.
- Sharing state or intermediate results among more Jet clusters.
- Isolating the processing cluster (Jet) from operational data storage cluster (IMDG).
- Publishing intermediate results, e.g., to show real-time processing stats on a dashboard.
The Hazelcast Way
Hazelcast Jet was built by the same community as the Hazelcast In-Memory Data Grid. Jet and IMDG share what the Hazelcast community experienced as the best.
You will get the value from Hazelcast Jet in less than 15 minutes.
Add one dependency to your Maven project and start building — Jet is a small JAR with no dependencies. Use the familiar java.util.stream API to keep all the complexity under the hood. You can still scale Jet to perform well on complex, distributed deployments later.
Lightweight and Embeddable
Hazelcast Jet is lightweight. It starts fast, scales and handles failures autonomously, and communicates with the data processing pipeline using asynchronous messages.
Also, Jet is not a server, it’s a library. It’s natural to embed it in your application to build a data processing microservice. Due to it’s light weight, each Job can be easily launched on its own cluster to maximize service isolation. This is in contrast to heavy-weight data processing servers where the cluster, once started, hosts multiple tasks and tenants.
Discovery and Cloud Deployment
Hazelcast Jet cluster members (also called nodes) automatically join together to form a cluster. Discovery finds other Jet instances based on filters and provides their corresponding IP addresses. Cluster setup is simple, not complicated.
The cloud discovery plugins can be used for easy setup and operations within any cloud environment.
Get professional support from the same people who built the software.
Open Source with Apache 2 License
Hazelcast Jet is open source and available under the Apache 2 license.
Hazelcast Jet is able to detect changes in the cluster topology (network failures and splits, node failures, exceptions) and reacts with graceful termination. The Job can be re-initiated or canceled.
Java 8 Stream API
java.util.stream API is a well-known and popular API in the Java community. It supports functional-style operations on streams of elements.
Hazelcast Jet shifts java.util.stream to a distributed world – the processing is distributed across the Jet cluster and parallelized. If j.u.s is used on top of Hazelcast distributed data structures, the data locality is utilized.
Jet adds support for java.util.stream API to Hazelcast IMDG collections.
The Core API
With the Hazelcast Jet Core API, you get the power of distributed DAGs.
Implementing the processors (DAG nodes) yourself is more work than using a high-level API, however it gives you a really powerful tool. Moreover, there are numerous ready-made implementations in Jet’s library.
Sources and Sinks
Hazelcast Jet comes with readers and writers for Hazelcast distributed data structures: IMap, ICache and IList.
Jet can use structures on the embedded Hazelcast IMDG cluster and make use of data locality — each processor will access the data locally.
Batch and streaming file readers and writers can be used for either reading static files or watching a directory for changes.
Socket readers and writers can read and write to simple text based sockets.
Kafka Streamer is used to consume items from one or more Apache Kafka topics. Kafka Writer is used as an Apache Kafka sink.
Hazelcast Jet can use HDFS as either data source or destination. If Jet and HDFS clusters are co-located, then Jet benefits from the data locality and processes the data from the same node without sending them over the network.
Log Writer is a sink which logs all items at the INFO level.
Ready for Cloud Deployment
Run Hazelcast Jet in the Cloud
Cloud developers can easily drop Hazelcast Jet into their applications. Jet can work in many cloud environments and can be easily extended via Cloud Discovery Plugins to more.
Docker for Hazelcast Jet
Hazelcast Jet includes container deployment options for Docker.
Hazelcast Jet for Pivotal Cloud Foundry
Cloud Foundry users are now able to provision and scale Hazelcast Jet clusters– adding and removing nodes seamlessly as required. After downloading the Hazelcast Jet for PCF tile from the Pivotal Network website, the PaaS Operator simply uploads the tile into Pivotal Ops Manager, enters any required configuration into the GUI, and clicks ‘Install’. After installation, the PaaS Operator need have no further interaction with the Tile.