Hazelcast Jet Features
Engineered for Performance
The Performance of Jet
Distributed DAG Execution
Hazelcast Jet uses directed acyclic graphs (DAGs) to model data processing tasks – Jet Jobs. The Jet Job is composed of vertices — units of parallel processing such as data source readers, joiners, sorters, aggregators, filters, mappers or output writers. Those are connected by edges representing the data flow.
Hazelcast Jet provides an ultra low-latency and high-throughput distributed DAG execution.
Low Latency End-to-end
Hazelcast Jet is built on top of the one-record-per-time streaming core (also known as continuous operators). That refers to processing incoming records just after they are ingested as opposed to accumulating the records to micro-batches before submitting.
The fast processing can be boosted by using embedded Hazelcast IMDG. IMDG provides elastic in-memory storage and it is a great tool for publishing the results of the computation or as a cache for datasets to be used during the computation. Extremely low end-to-end latencies can be achieved this way.
As data streams are unbounded and possibly infinite sequences of records, a tool is required to group individual records to a finite frames in order to run the computation. Jet windows provides a tool to define the grouping.
Types of windows supported by Jet:
- Fixed/tumbling – the stream is divided into same-length, non-overlapping chunks. Each event belongs to exactly one window.
- Sliding – windows have fixed length, but are separated by a sliding interval.
- Session – windows have various sizes and are defined basing on session identifiers contained in the records. Session is closed after a period of inactivity (timeout).
Jet allows you to classify the records in a data stream based on the timestamp embedded in each record — the Event time. Event-time processing is a natural requirement as users are mostly interested in handling the data based on the time the event originated (the event time). Event-time processing is a first-class citizen in Jet.
For handling late events, there is a set of policies to determine whether the event is “still on time” or “late”, discarding the latter.
Handling Back Pressure
In the streaming system, it is necessary to control the flow of messages. The consumer cannot be flooded by more messages than it cannot process in a time. On the other hand, the processors cannot stay idle and waste resources.
Hazelcast Jet comes with a mechanism to handle back pressure. Every consumer keeps signalling to all the upstream producers on how much free capacity it has. This information is naturally propagated upstream to keep the system balanced.
Connected to Message Broker
Hazelcast Jet benefits from message brokers for ingesting data streams and is able to work as a data processor connected to message broker in the data pipeline.
Jet comes with a Kafka connector for reading from and writing to the Kafka topics.
Batch processing and ETL
Batches Disguised as Streams
Although Jet is based on a streaming core, it is supposed to be used on top of bounded, finite datasets (often referred as batch tasks). Jet considers such a dataset as a stream that suddenly ends. Therefore, batches are streams in disguise for Jet.
Connectors for ETL
Jet includes connectors for Hadoop Distributed File System (HDFS), for local data files (e.g. CSV or logs) and for Hazelcast IMDG. Those should be combined in one pipeline to take advantage of the benefits of each: e.g. reading large datasets from HDFS and using distributed in-memory caches of Hazelcast IMDG to enrich processed records.
Processing Hadoop data
Hadoop Distributed File System (HDFS) is the common tool used for building data warehouses and data lakes. Jet can use HDFS as either data source or destination. If Jet and HDFS clusters are co-located, then Jet benefits from the data locality and processes the data from the same node without sending them over the wire.
Taking Advantage of Hazelcast IMDG
Hazelcast Jet is the distributed data processing tool that takes full benefit of being integrated with Hazelcast In-Memory Data Grid .
Hazelcast Jet Embeds Hazelcast IMDG
The complete Hazelcast IMDG is embedded in Jet. So all the services of IMDG are available to your Jet jobs. As the IMDG is the embedded, supportive structure for Jet, the IMDG is fully controlled by Jet (start, shutdown, scaling etc.)
Embedded in-memory data grid suits well for:
- Sharing the processing state among Jet Jobs.
- Caching intermediate processing results.
- Enriching processed events. Cache remote data (e.g. fact tables from database) on Jet nodes.
- Running advanced data processing tasks on top of Hazelcast data structures.
- Development purposes. Since starting the Jet cluster is so simple and fast.
Distributing Jet Processed Data with Hazelcast IMDG
The Jet Job can take benefit of Hazelcast IMDG connector allowing reading and writing records from/to remote Hazelcast IMDG instance.
Use remote Hazelcast IMDG cluster for:
- Sharing state or intermediate results among more Jet clusters
- Isolate the processing cluster (Jet) from operational data storage cluster (IMDG)
- If you need more control over your Hazelcast IMDG cluster. Because the embedded IMDG is managed by Jet.
- Publish intermediate results (e.g. to show real-time processing stats on dashboard)
The Hazelcast Way
Jet was built by the same community as the Hazelcast In-Memory Data Grid. Jet and IMDG share what the Hazelcast community experienced as the best.
You will get the value from Jet in less than 15 minutes.
Add one dependency to your Maven project and start building — Jet is a small jar with no dependencies. Use the familiar java.util.stream API to keep all the complexity under the hood. You can still scale Jet to perform well on complex, distributed deployments later, as you get introduced to Jet.
Lightweight and Embeddable
Jet is lightweight. It starts fast, it scales and handles failures itself and communicates with the data processing pipeline using the asynchronous messages.
Also, Jet is not a server, it’s a library. It’s natural to embed it to your application to build the data processing microservice. Due to it’s light weight, each Job can be easily launched on its own cluster to maximise service isolation. This is in contrast with heavy-weight data processing servers where the cluster, once started, hosts multiple tasks and tenants.
Discovery and Cloud Deployment
Jet cluster members (also called nodes) automatically join together to form a cluster. Discovery finds other Jet instances based on filters and provides their corresponding IP addresses. No complicated cluster setup is necessary.
The cloud discovery plugins should be used to allow the easy setup and operations within many cloud environments.
Get a professional support from the same people who built the software.
Open Source with Apache 2 License
Hazelcast Jet is open source and available with Apache 2 license.
Jet is able to detect changes in the cluster topology (network failures and splits, node failures, exceptions) and reacts with graceful termination. The Job could be re-initiated or cancelled.
The finer mechanisms of Jet fault-tolerance including snapshots and processing guarantees for stream processing are subject of intensive development.
Java 8 Stream API
java.util.stream API is well-known and popular API in the Java community. It supports functional-style operations on streams of elements.
Jet shifts java.util.stream to a distributed world – the processing is distributed across the Jet cluster and parallelized. If j.u.s is used on top of Hazelcast’s distributed data structures, the data locality is utilized.
Jet adds support for java.util.stream API to Hazelcast IMDG collections.
The Core API
With Jet Core API, you get all the power of distributed DAGs.
Implementing the processors (DAG nodes) yourself is more verbose compared to high-level API, however it gives you really powerful tool.
Sources and Sinks
Jet comes with readers and writers for Hazelcast distributed data structures: IMap, ICache and IList.
Jet can use sturctures on the embedded Hazelcast IMDG cluster and make use of data locality — each processor will access the data locally.
Batch and streaming file reader and writers can be used for either reading static files or watching a directory for changes.
Socket readers and writers can read and write to simple text based sockets.
Kafka Streamer is used to consume items from one or more Apache Kafka topics. Kafka Writer is used as Apache Kafka sink.
Jet can use HDFS as either data source or destination. If Jet and HDFS clusters are co-located, then Jet benefits from the data locality and processes the data from the same node without sending them over the wire.
Log Writer is a sink which logs all items at the INFO level.
Ready for Cloud Deployments
Run Jet in the Cloud
Cloud developers can easily drop Hazelcast Jet into their applications. Hazelcast Jet can work in many cloud environments and can easily be extended via Cloud Discovery Plugins to more.
Docker for Jet
Hazelcast Jet includes container deployment options for Docker.
Jet for Pivotal Cloud Foundry
Cloud Foundry users are now able to provision and scale new Hazelcast Jet clusters, and add and remove nodes seamlessly as required. After downloading the Hazelcast tile from PivNet, the PaaS Operator simply uploads the tile into Pivotal Ops Manager, enters any required configuration into the GUI, and clicks ‘Install’. After installation, the PaaS Operator need have no further interaction with the Tile.