ETL and Data Ingestion
ETL is a traditional concept dating back to the 1970s. What keeps it relevant today is the scale of modern data: engineering and product teams have to load and pre-process data from a variety of sources to ensure outstanding performance of the applications they build.
How can Hazelcast Jet help you build high-performance ETL pipelines?
Jet provides the infrastructure necessary for building and running real-time ETL, so that you can focus on the business logic of the pipeline:
- A Java API for declaratively defining your data pipelines.
- Connectors for extracting data from sources and loading it into sinks (see Connectors).
- A runtime for executing the data pipelines. It deals with fault tolerance, flow control, parallel execution and clustering to scale your pipelines.
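Putting the three pieces together, a minimal program defines a pipeline with the Java API, uses connectors as its source and sink, and hands the pipeline to the runtime for execution. The sketch below uses the classic Jet API; the map names are placeholders:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class SubmitJob {
    public static void main(String[] args) {
        // Start (or join) an embedded Jet node; in production you would
        // typically connect to an existing cluster instead.
        JetInstance jet = Jet.newJetInstance();

        // A trivial pipeline: copy entries from one IMap to another.
        Pipeline p = Pipeline.create();
        p.drawFrom(Sources.map("source-map"))
         .drainTo(Sinks.map("sink-map"));

        // Submit the pipeline as a job and wait for it to complete.
        jet.newJob(p).join();
        Jet.shutdownAll();
    }
}
```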
Loading data into IMDG
While Jet can move data between a variety of systems (see source and sink connectors), it excels at loading data into Hazelcast IMDG (in-memory data grid) and at offloading data from an IMDG.
IMDGs are often used as operational storage or as a distributed cache. Hazelcast Jet is a convenient tool for keeping in-memory caches hot by performing real-time ETL.
A typical example is continuously loading an event stream from Kafka into the IMDG, essentially creating a materialized view on top of the stream that allows super-fast querying.
Learn more about loading data into IMDG using Jet.
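The Kafka-to-IMDG scenario can be sketched as follows, assuming the hazelcast-jet-kafka module is on the classpath; the broker address, topic name and map name are placeholders, and exact Pipeline API details vary across Jet versions:

```java
import java.util.Properties;

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;

public class KafkaToImdg {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Continuously drain a Kafka topic into an IMap, keeping the
        // in-memory materialized view of the stream up to date.
        Pipeline p = Pipeline.create();
        p.drawFrom(KafkaSources.<String, String>kafka(props, "events"))
         .drainTo(Sinks.map("events-view"));

        JetInstance jet = Jet.newJetInstance();
        jet.newJob(p);  // a streaming job runs until it is cancelled
    }
}
```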
Declarative Java API
Jet was built for developers by developers. Its main programming interface is therefore a Java DSL called the Pipeline API. The Pipeline API lets you declaratively define a data processing pipeline by composing operations such as map, filter and aggregate on a stream of records.
The Pipeline API is similar to java.util.stream, but it was designed with distributed, continuous (stream) processing as a first-class citizen.
With ETL pipelines expressed as Java code, engineers can use the tooling they like such as an IDE, Git or Maven.
This pipeline extracts lines of text from a distributed K-V storage (map), counts occurrences of each word and loads results into another distributed map.
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Long, String>map(BOOK_LINES))
 .flatMap(e -> traverseArray(delimiter.split(e.getValue().toLowerCase())))
 .filter(word -> !word.isEmpty())
 .groupingKey(wholeItem())
 .aggregate(counting())
 .drainTo(Sinks.map(COUNTS));
See full code sample on GitHub.
Connectors for Extracting and Loading Data
Jet comes with a variety of connectors for streaming data into a Jet pipeline and for storing results in sink systems such as:
- Hazelcast IMDG
- Plain Files
- TCP Socket
Is your favorite connector missing? There is a convenience API so you can easily build a custom connector.
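As a sketch of that convenience API, the SourceBuilder below defines a custom batch source that emits the lines of a local file (the source name and file path are placeholders; SourceBuilder is available in Jet 0.7 and later):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.SourceBuilder;

public class CustomSource {
    // A custom source built from three functions: create a stateful
    // context, fill the buffer with items, and clean up at the end.
    static BatchSource<String> fileLines() {
        return SourceBuilder
                .batch("file-lines", ctx -> new BufferedReader(new FileReader("input.txt")))
                .<String>fillBufferFn((reader, buf) -> {
                    String line = reader.readLine();
                    if (line != null) {
                        buf.add(line);
                    } else {
                        buf.close();  // signal that the source is exhausted
                    }
                })
                .destroyFn(BufferedReader::close)
                .build();
    }
}
```

The resulting source can be passed to `drawFrom` just like any built-in connector.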
Running the Data Pipelines
The heart of Hazelcast Jet is a high-performance execution engine. Once deployed, Jet runs the steps of the data pipeline concurrently, making use of all CPU cores available. Partitioned data are processed in parallel.
Jet processes the data continuously as it arrives (as opposed to batch or micro-batch execution) to achieve millisecond latencies.
This architecture allows you to process hundreds of thousands of records per second with millisecond latencies using a single Jet node.
Learn more about the performance secret sauce.
Fault-Tolerant and Scalable Operations
ETL jobs often have to meet strict SLAs that don't allow simply restarting the job when something fails.
Hazelcast Jet uses checkpointing to avoid restarting from the beginning. Checkpoints are taken regularly and saved in multiple replicas for resilience. If a failure occurs, the ETL job is rewound to the last checkpoint, so the job is delayed by just a few seconds (depending on the checkpointing interval) instead of restarting from scratch.
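Checkpointing is enabled per job through its configuration. A minimal sketch, assuming a pipeline and a JetInstance already exist; the 10-second interval is an illustrative value:

```java
import com.hazelcast.jet.config.JobConfig;
import com.hazelcast.jet.config.ProcessingGuarantee;

// Take an exactly-once snapshot (checkpoint) every 10 seconds;
// on failure the job is rewound to the last snapshot.
JobConfig config = new JobConfig()
        .setProcessingGuarantee(ProcessingGuarantee.EXACTLY_ONCE)
        .setSnapshotIntervalMillis(10_000);

// jet.newJob(pipeline, config);  // submit with fault tolerance enabled
```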
The Jet cluster is elastic allowing dynamic scaling to deal with load spikes. New nodes can be added to the running cluster with zero downtime to increase the processing throughput linearly.
Learn more about how Jet makes your computation elastic.