
ETL and Data Ingestion

ETL is a traditional concept dating back to the 1970s. What keeps it relevant today is the current scale of data: engineering and product teams have to load and pre-process data from a variety of sources to ensure outstanding performance of the applications they build.

How can Hazelcast Jet help you build high-performance ETL pipelines?

Jet provides the infrastructure necessary for building and running real-time ETL, so that you can focus on the business logic of the pipeline:

  • A Java API for declaratively defining your data pipelines.
  • Connectors for extracting data from sources and loading it into sinks (see Connectors).
  • A runtime for executing the data pipelines. It deals with fault tolerance, flow control, parallel execution and clustering to scale your pipelines.
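Putting the pieces above together takes only a few lines of code. The following is a minimal sketch using the classic Jet API (`Jet.newJetInstance()`, `drawFrom`/`drainTo`); the map names are illustrative:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class HelloJet {
    public static void main(String[] args) {
        // Start (or join) an embedded Jet cluster member
        JetInstance jet = Jet.newJetInstance();

        // Define a trivial pipeline: copy one distributed map into another
        Pipeline p = Pipeline.create();
        p.drawFrom(Sources.map("input"))
         .drainTo(Sinks.map("output"));

        // Submit the job to the cluster and wait for it to complete
        jet.newJob(p).join();
        Jet.shutdownAll();
    }
}
```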

Loading data into IMDG

While Jet can be used to move data between a variety of systems (see source and sink connectors), it is outstanding at loading data into Hazelcast IMDG (in-memory data grid) and at offloading data from an IMDG.

IMDGs are often used as operational storage or as a distributed cache. Hazelcast Jet is a very convenient tool for keeping in-memory caches hot by performing real-time ETL.

A typical example is continuously loading an event stream from Kafka into the IMDG, in effect creating a materialized view on top of the stream that allows super-fast querying.
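Such a pipeline can be sketched in a few lines. This sketch assumes Jet's Kafka connector (`KafkaSources.kafka`); the topic and map names are illustrative, and `props` would hold the usual Kafka consumer properties:

```java
import java.util.Properties;

import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;

public class KafkaToImdg {
    public static Pipeline build(Properties props) {
        Pipeline p = Pipeline.create();
        // Continuously draw key-value events from the "trades" topic
        p.drawFrom(KafkaSources.<String, String>kafka(props, "trades"))
         .withoutTimestamps()
         // Keep the latest value per key in an IMap -- a materialized
         // view of the stream that can be queried with IMDG predicates
         .drainTo(Sinks.map("trades-view"));
        return p;
    }
}
```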

Learn more about loading data into IMDG using Jet.

Declarative Java API

Jet was built for developers by developers. Therefore, its main programming interface is a Java DSL called the Pipeline API. The Pipeline API allows you to declaratively define a data processing pipeline by composing operations on a stream of records, such as:

  • filtering
  • transforming
  • aggregating
  • joining
  • enriching

The Pipeline API is similar to java.util.stream; however, it was designed with distributed, continuous (stream) processing as a first-class citizen.

With ETL pipelines expressed as Java code, engineers can use the tooling they like, such as an IDE, Git or Maven.

This pipeline extracts lines of text from a distributed key-value store (a map), counts the occurrences of each word and loads the results into another distributed map.

Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Long, String>map(BOOK_LINES))
 .flatMap(e -> traverseArray(delimiter.split(e.getValue().toLowerCase())))
 .filter(word -> !word.isEmpty())
 .groupingKey(wholeItem())
 .aggregate(counting())
 .drainTo(Sinks.map(COUNTS));

See full code sample on GitHub.
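For comparison, the same word count over a local collection with plain java.util.stream looks like this; Jet's Pipeline API distributes the equivalent steps (flatMap, filter, grouped aggregation) across a cluster:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LocalWordCount {
    // Split on runs of non-word characters
    static final Pattern DELIMITER = Pattern.compile("\\W+");

    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // lines -> lowercase words
                .flatMap(line -> Arrays.stream(DELIMITER.split(line.toLowerCase())))
                .filter(word -> !word.isEmpty())
                // group by the word itself and count occurrences
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```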

Connectors for Extracting and Loading Data

Jet comes with a variety of connectors for streaming data into the Jet pipeline and storing the results in a sink system, such as:

  • Hazelcast IMDG
  • JMS
  • JDBC
  • Kafka
  • HDFS
  • Plain Files
  • TCP Socket

Is your favorite connector missing? There is a convenience API so that you can easily build a custom connector.
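The convenience API in question is Jet's SourceBuilder (with a matching SinkBuilder for the other direction). As a sketch, a custom batch source that emits the lines of a local file could look like this; the source name and file path are illustrative:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.SourceBuilder;

public class CustomSources {
    // A batch source that emits each line of the given file as one item
    public static BatchSource<String> fileLines(String path) {
        return SourceBuilder
                .batch("file-lines", ctx -> Files.newBufferedReader(Paths.get(path)))
                .<String>fillBufferFn((reader, buf) -> {
                    String line = reader.readLine();
                    if (line != null) {
                        buf.add(line);   // emit one item; Jet calls us again for more
                    } else {
                        buf.close();     // no more data -> complete the source
                    }
                })
                .destroyFn(BufferedReader::close)
                .build();
    }
}
```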

Running the Data Pipelines

The heart of Hazelcast Jet is a high-performance execution engine. Once deployed, Jet runs the steps of the data pipeline concurrently, making use of all available CPU cores. Partitioned data is processed in parallel.

Jet processes data continuously as it arrives (as opposed to batch or micro-batch execution) to achieve millisecond latencies.

This architecture allows you to process hundreds of thousands of records per second with millisecond latencies on a single Jet node.

Learn more about the performance secret sauce.

Fault-Tolerant and Scalable Operations

ETL jobs often have to meet strict SLAs that do not allow simply restarting the job if something fails.

Hazelcast Jet uses checkpointing to avoid restarting from the beginning. Checkpoints are taken at regular intervals and saved in multiple replicas for resilience. If a failure occurs, the ETL job is rewound to the last checkpoint, so the job is delayed by just a few seconds (depending on the checkpointing interval) instead of restarting from scratch.
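Checkpointing is configured per job through JobConfig. A minimal sketch, assuming the classic Jet API; the 10-second interval is illustrative:

```java
import com.hazelcast.jet.config.JobConfig;
import com.hazelcast.jet.config.ProcessingGuarantee;

public class FaultTolerantJob {
    public static JobConfig checkpointedConfig() {
        JobConfig config = new JobConfig();
        // Take a distributed snapshot (checkpoint) every 10 seconds
        config.setSnapshotIntervalMillis(10_000);
        // On failure, rewind to the last checkpoint and process each record exactly once
        config.setProcessingGuarantee(ProcessingGuarantee.EXACTLY_ONCE);
        return config;
    }
}
```

The config is then passed at submission time, e.g. `jet.newJob(pipeline, config)`.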

The Jet cluster is elastic, allowing dynamic scaling to deal with load spikes. New nodes can be added to a running cluster with zero downtime to increase processing throughput linearly.

Learn more about how Jet makes your computation elastic.
