Real-Time Analytics Using Kappa Architecture

Joydeep Bhattacharjee
Feb 18, 2022


Real-Time Analytics — Kappa Architecture Using Kafka, Druid and Imply Pivot.

Introduction

Kappa architecture is a streaming-first architectural pattern: data arriving from various sources (for example events, IoT devices, batch feeds and telemetry) in real time and at high velocity is streamed into a messaging system such as Apache Kafka. A stream processing engine (such as Apache Spark, Apache Flink or Kafka Streams) reads data from the messaging system, transforms it, and publishes the enriched data back to the messaging system, making it available for real-time ingestion and analytics. The data is also distributed to a serving layer such as a cloud data lake, cloud data warehouse, NoSQL database, operational intelligence or alerting system to support self-service real-time analytics, machine learning (ML), reporting, dashboarding, predictive and preventive maintenance, and alerting use cases.

High Level Architecture

Conceptual Understanding

Data as a river.

Real-Time Analytics Analogy

Think of the data flow as an unbounded river that will eventually drain into a lake, while we try to do analytics on the running river with a fishing net. Data is flowing at a very fast rate, and we need insight into it at sub-second latency for the dimensions we filter out (the fish), much as stochastic sketching summarizes a stream on the fly.

Data flowing like a river, while we try to gain insight with a fishing net before it drains into the lake

Why Real-Time Analytics?

We are in the midst of a shift in business intelligence as companies modernize their data infrastructure to meet real-time demands. Many companies still operate on historical data that is analyzed in batches, meaning they cannot get instant insights. This can significantly affect their ability to compete, understand customer trends, and address market changes in a timely fashion. Real-time analytics, also called hot analytics, is essential for improving the customer experience and getting immediate insight into business data so that you can query, process and make decisions immediately. Traditional analytics is batch based, and it takes hours or at times days to generate any insight from the data. While traditional analytics has its own use cases, the Lambda architecture pattern tries to bridge the gap between hot and cold analytics in a single pipeline. This white paper focuses purely on hot analytics, using the Kappa architecture with Apache Kafka, Kafka Streams, Apache Druid and Imply Pivot.

The Rise Of Streams

With the advent of high-performing stream processing systems like Apache Kafka, enterprise adoption of messaging systems is increasing exponentially, and Kafka has become a de facto standard for streaming high-velocity data emitted by IoT devices, events, logs, telemetry and more. A Kappa architecture implementation decouples the source and consumption layers using a messaging system like Apache Kafka. This allows both source and consuming applications to evolve independently over time, with better resilience to change and downtime. In many use cases the raw, unstructured data is produced to a messaging system like Kafka, and a stream processor such as Kafka Streams or Apache Spark ingests the data in real time. A transformation pipeline built on the stream processor enriches and transforms the data and filters out the meaningful dimensions, and the resulting data is put back into the messaging system as an unbounded stream. In some cases aggregations and roll-ups can also be performed to minimize the volume.
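
As a rough sketch of such a pipeline, the Kafka Streams snippet below reads a raw event stream, drops records that are not relevant for analytics, applies a placeholder transformation and republishes the result. The topic names (raw-events, enriched-events) and the filtering condition are hypothetical and not taken from the application described later.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RawEventEnricher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "raw-event-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read the raw, unbounded stream of JSON events (topic name is hypothetical).
        KStream<String, String> raw = builder.stream("raw-events");

        raw
            // Drop events that are irrelevant for real-time analytics.
            .filter((key, json) -> json != null && json.contains("\"eventType\""))
            // Placeholder transform; a real pipeline would parse the JSON and keep
            // only the dimensions needed downstream.
            .mapValues(String::trim)
            // Publish the enriched stream back to Kafka for downstream ingestion.
            .to("enriched-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```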

OLAP / Time-Series NoSQL DB for Real-Time Analytics

A new breed of NoSQL database is needed to solve this problem: something that combines the best of real-time streaming analytics and multidimensional OLAP with the scale-out storage and computing principles of Hadoop, to deliver ad hoc, search and time-based analytics against live data. It also needs to ingest data at low latency and provide fast query response times. The queries include ad hoc pivots, filters and group-by operations, time-based functions, and text-based search. Relational database management systems (RDBMS), NoSQL key/value and columnar stores, data warehouses and big data stores such as Hadoop or Elasticsearch are unable to provide all of these types of queries in a single platform.

Current DB spectrum (High Level)

Enter The World of Apache DRUID

Apache Druid was invented to address the lack of a data store optimized for real-time analytics. Druid is a new kind of analytics database that enables real-time, ad hoc exploration of massive amounts of live and historical data at any scale, delivering sub-second response times against hundreds of petabytes of data and hundreds of millions of events per second. Druid's architecture is unique: it is a search engine, a time-series database and a columnar database merged into one. Druid fully indexes all data, similar to a search engine, but instead of keeping the data in search indexes it stores it in immutable columnar segments, like a columnar database. The column storage format enables fast aggregations, and in addition inverted indexes are created for string values (as in a search engine). These inverted indexes allow the engine to quickly prune out unnecessary data for a query so that it scans exactly what it needs to complete the query.

https://www.youtube.com/watch?v=QjsfX1Bl7hE

Druid Architecture (High Level)

Druid ingests data as a time series. The primary job of ingestion is to shard the data into segments that can be distributed and replicated among the Historical nodes in the cluster. Additionally, we can filter and transform the data at ingestion time. Druid then columnizes, encodes, indexes and compresses the data for fast aggregation and slicing/dicing to support real-time analytics.

Example: let's say we capture product sales and customer satisfaction data and want to find the maximum satisfaction per product per 5-minute aggregation:
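
In the architecture described here this roll-up is configured in Druid itself at ingestion time (see the Configure Schema step in the ingestion walkthrough later). Purely as an illustration of what "maximum satisfaction per product per 5 minutes" means, the same aggregation can be sketched with a Kafka Streams windowed reduce; the topic names and the Long satisfaction score below are hypothetical.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class SatisfactionRollupSketch {
    public static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();

        // key = productId, value = satisfaction score (topic and serdes are hypothetical).
        KStream<String, Long> satisfaction = builder.stream(
            "product-satisfaction", Consumed.with(Serdes.String(), Serdes.Long()));

        satisfaction
            .groupByKey()
            // 5-minute tumbling windows, mirroring the 5-minute roll-up granularity.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            // Keep only the maximum satisfaction seen per product per window.
            .reduce(Long::max, Materialized.with(Serdes.String(), Serdes.Long()))
            // Flatten the windowed key into "productId@windowStart" for the output topic.
            .toStream((windowedKey, max) ->
                windowedKey.key() + "@" + windowedKey.window().start())
            .to("product-satisfaction-5m-max",
                Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}
```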

Druid System Architecture

Master Server

Manages data availability and ingestion. It is responsible for starting new ingestion jobs and coordinating availability of data on the Data servers. Within a Master server, functionality is split between two processes: the Coordinator and the Overlord.

Query Server

Handles queries from external clients, routing them to Data servers or other Query servers. Within a Query server, functionality is split between two processes: the Broker and the Router.

Data Server

Executes ingestion workloads and stores all queryable data. Within a Data server, functionality is split between two processes: the Historical and the Middle Manager.

Example of roll-ups at ingestion time:

Druid Data Loader Console:

Database Evolution

Kappa Architecture in a Real-World Application:

This application is used by providers, partners and health plan sponsors to manage patient health insurance and claim submission.

This architecture provides real-time insight into data in motion, using the Kappa streaming architecture to support real-time analytics. The three major architectural components proposed are (1) decoupling of event data from the application using streams, (2) use of a stream processor for ETL (enrichment) and (3) real-time ingestion of the data and analytics using Imply Druid/Pivot.

Shown below are the technologies and tool sets used for the real-time analytics pipeline, ingestion and visualization in this Kappa architecture implementation. The first three actions are generated in the application plane. All remaining stages are asynchronous and can resume processing from where they left off using offsets, which enables independent maintenance without impacting the others.

Streaming

Application transactional events are streamed using the native Java Kafka producer, with producer batching adjusted based on time. This increases the efficiency of the producer and also helps with message compression. Note: longer batching increases efficiency but adds latency. An idempotent producer is preferred, along with a replication factor of three. These are design-level considerations, and the values need to be adjusted based on performance metrics. In short, application events are sent to Kafka in real time from the application plane.
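
A minimal sketch of such a producer configuration is shown below, assuming a hypothetical events topic and broker address; the batching, compression and idempotence settings are the ones discussed above and would be tuned against real performance metrics. The replication factor of three is a topic-level setting on the brokers, not a producer property.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Time-based batching: wait up to 50 ms to fill a batch
        // (longer = more efficient, but higher end-to-end latency).
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // Compress whole batches to reduce network and storage footprint.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // Idempotent producer: avoids duplicate writes on retry; requires acks=all.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical transactional event keyed by user_id.
            producer.send(new ProducerRecord<>("app-transaction-events",
                "user-123", "{\"eventType\":\"ELIGIBILITY\",\"user_id\":\"user-123\"}"));
            producer.flush();
        }
    }
}
```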

Enrichment:

The application's native event JSON data is deeply nested and has high cardinality, and these additional dimensions are not needed for real-time analytics. Furthermore, different application events are written to different topics; for example, security events (login/logout, lockout, etc.) and transactional events (eligibility, claims, etc.) are emitted separately. Hence we need a stream processor (Kafka Streams) to enrich the data by joining the two streams on user_id. The enriched and transformed data is written back to a Kafka topic for real-time ingestion by Apache Druid.
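
A minimal sketch of that join is shown below, assuming both streams are already keyed by user_id and that the topic names, window size and JSON merge are hypothetical placeholders rather than the application's actual implementation.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class EventJoinSketch {
    public static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Both topics are assumed to be keyed by user_id (topic names are hypothetical).
        KStream<String, String> security = builder.stream(
            "security-events", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> transactions = builder.stream(
            "transaction-events", Consumed.with(Serdes.String(), Serdes.String()));

        // Join transactional events with security events seen for the same user_id
        // within a 10-minute window, producing one enriched JSON value per match.
        KStream<String, String> enriched = transactions.join(
            security,
            (txnJson, secJson) ->
                "{\"transaction\":" + txnJson + ",\"security\":" + secJson + "}",
            JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)),
            StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        // Publish the enriched stream for real-time ingestion by Druid.
        enriched.to("enriched-user-events",
            Produced.with(Serdes.String(), Serdes.String()));

        return builder;
    }
}
```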

Ingestion:

Apache Druid is capable of ingesting data in real time and provides a Data Loader UI to configure ingestion from various sources, including Kafka. The series of steps for Kafka is shown below:

  1. Connect: This step connects Druid to a Kafka topic using the bootstrap servers and a topic name.
  2. Parse & Flatten Data: This step parses the data (JSON) and, if configured, flattens it. Druid requires flat data, so this step normalizes the JSON into a row/column orientation.
  3. Parse Time: Druid partitions data based on the primary time column; a default time attribute is generated if one is missing. For transactional events, the audit timestamp was used.
  4. Transform: This step can perform simple transformations of the data, for example converting Celsius to Fahrenheit using the expression language.
  5. Filter: This step can filter data based on matching criteria and can be used for rudimentary ETL.
  6. Configure Schema: Default primitive data types are assigned automatically, and roll-ups can be set up for metric columns. This saves space because it is done at ingestion time, which is unique to Druid.
  7. Partition: Configure how Druid will partition the data; this maps directly to segments. This system used a segment granularity of one day. If the data arrives at high velocity and in large volume, reducing the granularity will partition the data into smaller segments distributed across the Historical nodes.
  8. Tune: Fine-tune how Druid will ingest data: offsets, poll timeout, maximum rows in memory, replicas, etc.
  9. Publish: Configure how the indexed data is published as it reaches Druid, and manage parse error logging, etc.
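
The Data Loader walkthrough above ultimately produces a JSON Kafka supervisor spec, which the console also lets you inspect and copy. As a rough sketch, assuming that spec has been saved to a local file and that a Druid router is reachable on localhost:8888 and proxies to the Overlord, the same ingestion could be started programmatically against the supervisor endpoint:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class SubmitSupervisorSpec {
    public static void main(String[] args) throws Exception {
        // The Kafka supervisor spec copied from the Data Loader console
        // (the file name is a placeholder).
        String spec = Files.readString(Path.of("kafka-supervisor-spec.json"));

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8888/druid/indexer/v1/supervisor"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(spec))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // Druid replies with the supervisor id on success.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```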

Visualization using Imply Pivot:

Imply Pivot integrates easily with Apache Druid and provides intuitive data exploration using data cubes, filters, dashboards, reporting and events. It abstracts the Druid API layer and removes the need to learn a Druid query language (native queries in JSON, or Druid SQL).
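
For comparison, querying Druid directly without Pivot means using one of those query languages. A minimal sketch, assuming a Druid router on localhost:8888 and a hypothetical datasource named enriched_user_events with product and satisfaction columns, posts Druid SQL to the /druid/v2/sql endpoint:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidSqlQuerySketch {
    public static void main(String[] args) throws Exception {
        // Druid SQL wrapped in the JSON body expected by the SQL API.
        // The datasource and column names are hypothetical.
        String body = "{\"query\": \"SELECT TIME_FLOOR(__time, 'PT5M') AS bucket, "
            + "product, MAX(satisfaction) AS max_satisfaction "
            + "FROM enriched_user_events "
            + "GROUP BY 1, 2 ORDER BY 1 DESC LIMIT 100\"}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8888/druid/v2/sql"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // The default response is a JSON array of result rows.
        System.out.println(response.body());
    }
}
```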

Imply Pivot Dashboard

Conclusion:

Any application that generates a huge volume of transactional or metrics data and needs real-time insight into it can benefit immensely from this architecture. Using real-time analytics we can adjust prices (dynamic pricing) or offer new discounts and deals in real time, which helps drive business value.

For example, let's say we see an exponential upward trend in purchases of the new iPhone; we may want to offer discounts on its accessories in real time using an ML recommendation engine.

Netflix uses a similar architecture to provide real-time recommendations. Remember, the need here is to act immediately, before these trends disappear.

In this fast-paced world, we cannot rely solely on traditional big data analytics (usually batched) for use cases where we need to react immediately at sub-second latency. This white paper has outlined an architecture for building the speed and batch layers in one cohesive platform using Apache Kafka and Apache Druid.

References:

Building a Real-Time Analytics Stack with Apache Kafka and Apache Druid (https://www.youtube.com/watch?v=APSdu8wJBVk)

Apache Druid (part-1) https://anskarl.github.io/post/2019/druid-part-1/
