Real-Time Analytics — System Guide

Section 1

TL;DR — Real-Time Analytics in Plain English

What real-time analytics actually IS — fresh-data OLAP queries answered in under a second, even when the underlying dataset is updating millions of times per second and thousands of users are querying simultaneously
Why OLTP databases (Postgres, MySQL) and batch OLAP warehouses (Snowflake, BigQuery) both fail at this problem — and the specific, mathematical reason each one breaks
The three database families and where each one lives on the latency-vs-freshness spectrum
Why columnar storage is the single most important building block — the I/O physics that make it 10–100× faster for analytical scans
How Apache Druid, Apache Pinot, and ClickHouse slice data into segments by time so a query touching "last 24 hours" only reads 1% of the total data
The four hard requirements that must be satisfied simultaneously: freshness, sub-second query latency, ad-hoc OLAP queries, and high concurrency — and why doing all four at once is genuinely hard

Real-time analytics is the discipline of answering complex analytical questions — "how many ad impressions per campaign in the last 15 minutes, broken down by country and device type?" — in under a second, over a dataset that is being updated millions of times per second, while thousands of concurrent users are asking different questions simultaneously. That four-way combination — fresh data + ad-hoc queries + sub-second latency + high concurrency, all at once — is what makes this hard. OLTP databases (Postgres, MySQL) are optimized for single-row lookups, not column scans. Batch OLAP warehouses (Snowflake, BigQuery) are optimized for big scans, but not for sub-second latency under high concurrency. Real-time OLAP engines (Apache Druid, Apache Pinot, ClickHouse, StarRocks, Apache Doris) are built for exactly this gap: columnar storage + indexed dimensions + time-based segment partitioning = sub-second queries over billions of rows with fresh data.

Traditional analytics is a 9 AM email of yesterday's numbers. Batch OLAP (Snowflake, BigQuery) improved on that — you can now run the same query in 12 seconds instead of 8 minutes, but freshness is still measured in minutes-to-hours. Real-time analytics takes the next leap: answering analytical queries in under one second over data that is seconds or milliseconds old. The hard part is not the speed alone — it is the simultaneous combination. Fresh data requires continuous ingestion pipelines. Sub-second latency requires columnar storage, inverted indexes, and aggressive pre-computation. Ad-hoc queries (you don't know the question in advance) mean you can't just cache one fixed answer. High concurrency (500–5,000 dashboard users at the same time) means the engine must serve many queries in parallel without melting. Getting all four at once is what specialized real-time OLAP engines are built for.

Think of databases as a spectrum of trade-offs between write speed, query latency, and scan throughput. OLTP databases (Postgres, MySQL, Oracle) optimize for fast single-row reads and writes with ACID transactions — perfect for "look up row 42" and "update row 42 in a transaction", terrible for "scan 2 billion rows and group-by campaign ID". Batch OLAP databases (Snowflake, BigQuery, Amazon Redshift) optimize for scanning huge datasets column-by-column, can process trillions of rows per query, but run queries in seconds-to-minutes and cost money per scan. Real-time OLAP engines (Apache Druid, Apache Pinot, ClickHouse, StarRocks, Apache Doris) sit in a specific sweet spot: columnar storage like batch OLAP, but with time-partitioned segments, inverted indexes on dimensions, and pre-aggregated rollups — enabling sub-second query latency while ingesting fresh data in real time.

You already know Kafka (events flow in) and stream processing (Flink/Kafka Streams compute over those events). Real-time analytics is the next layer: the query interface that lets humans ask questions about what those events mean right now. Events land in Kafka → stream processors optionally enrich/clean them → real-time OLAP engine ingests the stream directly (Druid and Pinot have native Kafka consumers) → dashboards and APIs query the engine at sub-second speed. The engines are also typically coupled with a batch layer for historical data: old segments (months ago) are stored cheaply in object storage (S3, GCS), while recent hot segments stay in fast local storage for low-latency access. This hot/cold architecture is why real-time OLAP can be cheap at scale despite serving very fast queries.

Real-time analytics solves a four-way problem: fresh data, ad-hoc OLAP queries, sub-second latency, and high concurrency — simultaneously. OLTP databases can't scan billions of rows. Batch OLAP warehouses can't answer in milliseconds under high concurrency. Real-time OLAP engines (Druid, Pinot, ClickHouse, StarRocks, Doris) solve all four via columnar storage, time-partitioned segments, and indexed dimensions, sitting downstream of Kafka in the modern data stack.

Section 2

Why You Need This — The Live Dashboard Problem

Before diving into how real-time analytics works, let's prove — concretely — that the tools most engineers reach for first (Postgres, then Snowflake) hit hard walls at dashboard scale. Understanding exactly where each tool breaks makes every design decision in the rest of the page obvious.

The Story: An Ad-Tech Dashboard That Brought Down the Database

Imagine an ad-tech company. They run display advertising campaigns — banner ads, video ads — for thousands of advertisers. Each advertiser wants a live dashboard: impressions served, clicks, click-through rate, spend, broken down by country, device type, and ad creative, updated every 30 seconds.

The engineering team starts with what they know: Postgres. They have an impressions table with six months of data — roughly 200 billion rows. Each dashboard page load fires a query like:

-- "How many impressions and clicks per campaign in the last 24 hours, by country?" SELECT campaign_id, country, COUNT(*) AS impressions, SUM(clicked) AS clicks FROM impressions WHERE ts >= NOW() - INTERVAL '24 hours' GROUP BY campaign_id, country ORDER BY impressions DESC;

On Postgres, that query takes 8 minutes. Why? Postgres is a row-oriented database — the impressions table stores each row's columns together. To compute the sum of "clicked" across 200 billion rows, Postgres must read every column of every row to get to the "clicked" column, even though it only needs two columns out of the fifteen in the table. That's roughly 15× more data read from disk than actually needed.

The Postgres wall: 8-minute query on 200B rows, single connection. The dashboard needs a 30-second refresh cycle with 500 simultaneous users. Running 500 × 8-minute queries in parallel would require more RAM than the machine has, and the database collapses under the connection count alone before a single query finishes.

Migration 1: Snowflake — Faster, But a New Problem Emerges

The team migrates to Snowflake, a batch OLAP warehouse. Snowflake is columnar — it stores each column separately, so the same query now reads only the columns it needs. The same query drops from 8 minutes to 12 seconds. That's a 40× improvement for batch reporting. Problem solved?

Not quite. Two new problems appear when you use Snowflake for a live dashboard:

Latency: 12 seconds feels acceptable in a BI report. But a dashboard that refreshes every 30 seconds and takes 12 seconds per query means the UI is showing stale numbers more than a third of the time. For live operations — campaign managers watching a new ad launch in real time — 12 seconds feels broken.
Cost explosion: Snowflake charges per compute-second of warehouse usage. 500 concurrent users × 12-second queries × 30-second refresh cycle = a warehouse that runs almost continuously, full throttle. The monthly Snowflake bill for 500 live dashboard users can reach tens of thousands of dollars for this query volume — an order of magnitude more than a purpose-built OLAP engine would cost for the same workload.

Migration 2: Apache Druid — The Right Tool

The team migrates to Apache Druid. Same data, same query. Results: ~200 ms. The same 500 users query simultaneously — Druid fans the query out across segment-level parallelism and answers everyone within the 30-second refresh window with capacity to spare. At this query volume, the compute cost is a small fraction of the Snowflake bill.

How does Druid do this? Three mechanisms work together, all explained in depth later in this page:

Columnar storage: like Snowflake, reads only the columns it needs — 10–50× less data than row-oriented Postgres.
Time-partitioned segments: 200 billion rows split into hourly segments. A "last 24 hours" filter only reads the 24 most recent hour-segments — roughly 1% of total storage. Druid skips the other 99%.
Inverted indexes on dimensions: each segment has a bitmap index on country, device_type, campaign_id. Filtering to "country = US" means reading a compressed bitmap, not scanning rows.

The chart maps out the fundamental trade-off landscape. OLTP databases (top-left) handle small-scale transactional queries well but collapse the moment you ask them to scan billions of rows at high concurrency — they simply aren't built for it. Batch OLAP warehouses (middle) are much better at large scans but live in a latency band of seconds-to-minutes and become expensive when hundreds of users are running concurrent queries. Real-time OLAP engines (bottom-right) occupy the sweet spot that nobody else covers: sub-second query latency and high concurrency and fresh data.

Why These Numbers Matter in Practice

There's a well-known study by Google on page load time and user abandonment — the numbers shift with device and era, but the principle is consistent: users start abandoning interactive UIs when response time crosses about one second. The same psychological threshold applies to dashboards. A live ops dashboard that takes 12 seconds per query doesn't feel live — it feels sluggish and users stop trusting it. A dashboard that responds in 200 ms feels instant and operators stay engaged with it.

The real cost of 12-second queries. It's not just user experience. Operations teams watching a live ad campaign, a product launch, or an infrastructure incident need to make decisions in real time. A 12-second query latency means you're seeing data from 12 seconds ago — compounded by the refresh interval. At 30-second refresh with 12-second queries, the freshest data you ever see is already 12 seconds stale. At 200 ms queries on a 30-second refresh cycle, the stale window is under a second. That's the difference between "live analytics" and "very fast batch."

Postgres collapses at analytical scale because row-oriented storage forces full-row reads even for single-column aggregations. Snowflake solves the scan problem with columnar storage but introduces latency and cost problems at high query concurrency. Real-time OLAP engines (Druid, Pinot, ClickHouse) add time-partitioned segments and inverted indexes on top of columnar storage to serve sub-second latency at high concurrency — migrating the ad-tech dashboard from 8 minutes → 12 seconds → ~200 ms at a fraction of the Snowflake cost.

Section 3

Mental Model — The Three Database Families

Every database you'll encounter optimizes for a specific workload. The mistake most engineers make is choosing a database they already know — often a relational database — and then fighting its limits when the workload changes. The right mental model: there are three fundamentally different optimization targets, and each leads to a different family of databases with different internal designs.

The Library Analogy

Imagine three different kinds of libraries. The first is a public library — it's great for finding one specific book quickly (the librarian knows exactly which shelf it's on), but if you asked "sum up every book's page count across 50,000 books", it would take all day because you'd have to open every single book. The second is a university research archive — it's organized by topic, optimized for reading huge volumes of material from one subject area, but retrieving a specific page from a specific book takes longer because the organization favors bulk access over quick point lookups. The third is a live news ticker — everything published in the last 24 hours is instantly accessible, sorted by time, and you can query it in real time while new stories are still coming in.

OLTP databases are the public library. Batch OLAP warehouses are the research archive. Real-time OLAP engines are the live news ticker. Each design is correct — for its intended use case.

The three-column chart makes the gaps concrete. OLTP databases have instant freshness (writes are immediately visible) but terrible analytical scan latency because row-oriented storage forces reading every column to find the one you need. Batch OLAP warehouses solve the scan problem with columnar storage but trade off freshness (data loads run in batches) and concurrency (every concurrent query spins up compute, which costs money). Real-time OLAP engines are the only family that simultaneously hits all four requirements: columnar storage for scan efficiency, indexing for filter performance, time-partitioned segments for freshness and locality, and a query routing architecture that scales concurrency via parallelism rather than by throwing more compute at each individual query.

The One Trade-off Real-Time OLAP Makes

Nothing comes free. Real-time OLAP engines make one significant trade-off to achieve their performance: they are poor at writes and updates. An OLTP database can update a single row in microseconds. Real-time OLAP engines store data in immutable segments — once a segment is written and indexed, you can't cheaply update individual rows. Corrections require compaction jobs that rewrite segments. This is fine for event-based workloads (ad impressions, clickstreams, sensor readings, application logs — data that arrives and is never modified), and it's a serious problem for transactional workloads (bank balances, inventory counts — data that must be updated in place).

The real-time OLAP use case in one sentence: High-volume, append-only or near-append-only event streams where humans or systems need to ask arbitrary aggregation questions with sub-second latency, simultaneously, at scale.

The three database families each optimize for a different target: OLTP for point reads/writes with ACID guarantees, Batch OLAP for large infrequent scans over historical data, and Real-Time OLAP for sub-second analytical queries over fresh, continuously ingested event data. Real-Time OLAP achieves its performance via columnar storage, dimension indexes, and time-partitioned segments — at the cost of poor in-place update performance, making it ideal for append-heavy event streams (clickstreams, impressions, sensor data) rather than transactional records.

Section 4

Core Concepts — The Real-Time Analytics Vocabulary

Real-time analytics comes with a vocabulary that can feel overwhelming at first — segment, rollup, HyperLogLog, deep storage, broker, historical node. Most of these terms are just names for ideas you can reason about from first principles. This section gives you the plain English warmup for each term before the jargon label, so you're never reading a sentence and thinking "I know all the words except the important one."

The map organizes terms into four clusters: Storage Layer (how data is laid out on disk), Query Layer (the software components that receive and route queries), Ingestion Layer (how data gets into the engine), and Approximation Layer (how to trade a tiny bit of accuracy for massive performance gains on high-cardinality queries). Let's walk through the most important terms in each cluster.

Storage Layer Terms

Columnar storage — Instead of storing each row's columns together ("row A: col1, col2, col3; row B: col1, col2, col3…"), each column's values are stored together ("col1: all values; col2: all values…"). This means an analytical query that reads only two columns out of fifteen only loads 2/15ths of the data. Section 5 gives the full physics explanation.
Segment — Think of a segment as a single chapter in a book. The book (your entire dataset) is divided into chapters (segments), each covering a specific time range. A segment is an immutable file package containing columnar data + indexes. Most real-time OLAP engines (Druid, Pinot) use segments as their fundamental storage unit. ClickHouse calls them "parts" but the concept is identical.
Time-based partitioning — Data is organized by time — each segment covers a specific time window (hourly, daily, or custom). WHY does this matter? Almost all analytical queries include a time filter ("last 7 days", "this month", "yesterday"). If data is partitioned by time, the engine can completely skip segments outside the query's time range without reading a single byte from them.
Deep storage — The long-term home for segments that have aged out of the hot tier. Typically S3, GCS, or Azure Blob Storage. Old segments are archived here and only pulled into fast local storage when queried.
Hot-cold tier — Recent data lives on fast local disks (hot tier), old data lives on cheap object storage (cold tier). WHY this matters: most dashboard queries ask about recent data ("last 24 hours", "this week"). Hot-cold tiering keeps the data you query most on fast storage and moves old data to storage that costs 20–50× less per GB.

Query Layer Terms

Broker — The front door for queries. When your application fires a SQL query, it hits the broker first. The broker looks at the time range and filters, figures out which segments are relevant, fans the query out to the nodes holding those segments in parallel, and merges the partial results. In Druid this role is literally called "Broker"; in Pinot it's also called a "Broker"; in ClickHouse there is no named broker — any node can route queries.
Historical node — The worker nodes that actually hold segments and execute scans. In Druid's architecture: Broker routes, Historical nodes execute. The broker is a thin router; the historical nodes do the heavy computational lifting.
Pre-aggregation / rollup — At ingestion time, if many events share the same values for key dimensions (same campaign_id, same country, same hour), the engine can merge them into one row with a count. A billion impressions might compact to a million pre-aggregated rows. At query time, GROUP BY reads a million rows instead of a billion — 1000× less data. This is called a rollup. The trade-off: you can't recover original individual rows after rollup, so you need to decide which granularity to preserve.
Materialized view — A pre-computed query result stored physically. Instead of running "impressions per campaign per hour" from scratch on every request, the engine pre-computes it and stores it as a physical table. Queries read the materialized view (fast) instead of the raw data (slow). Updated incrementally as new data arrives.

Ingestion Layer Terms

Freshness lag — The time between "this event happened" and "this event is visible in dashboard queries." Composed of three parts: how long before the event is emitted to Kafka, how long the stream processing pipeline takes to enrich it, and how long the OLAP engine takes to index and serve the new segment. Minimizing freshness lag is the core engineering challenge in real-time analytics ingestion pipelines.
Ingestion latency vs query latency — Two different things that are easy to confuse. Ingestion latency is how long data takes to enter and become queryable in the engine. Query latency is how long a query takes to execute. You can have low query latency (200 ms queries) but high ingestion latency (data is 2 minutes old) — or vice versa. Both matter for different reasons.

Approximation Layer Terms

Cardinality — How many distinct values a column has. "Country" has ~200 distinct values — low cardinality, easy to group by. "User ID" might have hundreds of millions of distinct values — high cardinality, very expensive to compute exact counts like "how many unique users saw this ad."
HyperLogLog — A clever algorithm that estimates "how many distinct values are there?" using a tiny fixed-size summary (a few kilobytes), with ~1-2% error. WHY this matters: exact distinct-count queries ("how many unique users?") require storing every unique value seen so far — which is prohibitively expensive at high cardinality. HyperLogLog achieves nearly the same answer with a summary that's thousands of times smaller, enabling sub-second distinct-count queries over billions of rows.
Data sketch / theta sketch — A general term for compact probabilistic summaries. HyperLogLog is a sketch for distinct counts. A theta sketch enables set operations on approximate counts — "how many users saw campaign A and campaign B?" — enabling audience overlap analysis at scale. Apache DataSketches is the library most engines use.

The real-time analytics vocabulary organizes into four clusters: storage layer terms (columnar storage, segment, time-based partitioning, deep storage, hot-cold tier), query layer terms (broker, historical node, pre-aggregation/rollup, materialized view), ingestion layer terms (freshness lag, ingestion vs query latency), and approximation layer terms (cardinality, HyperLogLog, data sketch). Knowing the WHY behind each term — not just the name — makes every architectural decision in the rest of the page immediately legible.

Section 5

Columnar Storage — Why It's Foundational

If you understand only one concept from this entire page, make it columnar storage. Every speed advantage that real-time OLAP engines have over Postgres — and most of the advantage they have over Snowflake — traces back to this single design decision about how bytes are physically arranged on disk.

Row-Oriented vs. Columnar: The Physical Byte Layout

Imagine a table with four columns: user_id, campaign_id, country, and clicked. It has five rows. In a row-oriented database (Postgres, MySQL), each row is stored together as a contiguous block of bytes:

Notice the red "clicked" cells. They are scattered throughout the file — every four adjacent cells belong to the same row, so to read just the "clicked" column you must seek to each row's location and extract its "clicked" byte. With 200 billion rows, the I/O is dominated by reading three columns you don't need for every one column you do.

Now look at the same data in columnar storage:

In columnar storage, each column is a separate file. To compute SUM(clicked), the engine opens exactly one file — clicked.col — reads its five bytes, and sums them. The user_id, campaign_id, and country files are never opened. With 200 billion rows and 15 columns, this means reading 1/15th of the data that a row-oriented query would read. On a table where each row is 200 bytes and you only need a 4-byte integer column, you read 4/200 = 2% of the bytes. That's the 50× I/O reduction that makes columnar scanning fast.

Compression: Columnar Data Compresses Far Better Than Row Data

The second big advantage is compression. Look at the country.col column above: it contains the values US, DE, US, US, GB. Notice they repeat. In fact, analytical data tends to have lots of repetition within a column — many rows have the same country, the same campaign ID, the same device type. When similar values are stored together, compression algorithms like run-length encoding (RLE) can represent them far more compactly.

In a row-oriented layout, each row's bytes are different — user ID, campaign ID, country, timestamp, and metric values are all interleaved. There's little repetition in the byte sequence, so compression barely helps. In columnar layout, a column of 10 million rows where 60% of values are "US" might compress from 20 MB to under 1 MB — a 20× compression ratio just from the locality of similar values.

Typical compression ratios in real-time OLAP engines: A raw event stream that would occupy 100 GB in a Postgres table often compresses to 5–20 GB in a real-time OLAP engine (columnar storage + dictionary encoding + RLE). Lower storage cost and more data fits in RAM for caching — both directly improve query performance. The exact ratio depends heavily on data schema and cardinality of columns.

SIMD: CPU-Level Acceleration for Dense Arrays

There's a third advantage that often goes unmentioned: SIMD (Single Instruction, Multiple Data). Modern CPUs can operate on 4, 8, or 16 integers simultaneously in a single instruction. This is called vectorization. For this to work efficiently, the data must be a dense, contiguous array of the same type.

Columnar storage produces exactly this: the clicked.col file is a dense array of integers. The CPU can sum 8 values per instruction, making the column scan up to 8× faster than equivalent row-oriented code. In row-oriented storage, the "clicked" values are scattered across rows, so the CPU can't vectorize the scan — each value requires a separate load, offset, and operation.

The Trade-off: Columnar Storage Is Slow for Writes

Nothing comes free. Columnar storage has one significant weakness: it is slow for individual row writes and updates. In a row-oriented database, inserting a new row means appending one contiguous block of bytes. In columnar storage, a new row means appending to N different column files — one write per column — and potentially invalidating existing indexes. For a table with 50 columns, a single row insert touches 50 column files.

Real-time OLAP engines solve this by batching writes: instead of inserting rows one at a time, events are accumulated in a small in-memory buffer, then flushed as a complete segment to disk. The segment is immutable once written — no row-level updates. Corrections require rewriting whole segments via background compaction jobs. This is exactly the right trade-off for event-based workloads where data arrives fast, in bulk, and is never modified after the fact.

Columnar storage is the foundation of real-time OLAP performance. It stores each column as a separate file, enabling queries to read only the columns they need (10–50× less I/O than row-oriented), achieving high compression ratios because similar values in a column pack tightly (5–20× typical compression), and enabling SIMD vectorized CPU operations on dense arrays. The trade-off is slow row-level writes, which real-time OLAP engines handle by batching inserts into immutable segments rather than supporting individual row updates.

Section 6

Segments and Time-Based Partitioning — How Druid and Pinot Slice the Data

Columnar storage explains why real-time OLAP can scan data efficiently. But it doesn't explain how a system with 200 billion rows can return a query in 200 milliseconds — even columnar storage has limits, and scanning 200 billion rows column-by-column takes seconds, not milliseconds. The second key mechanism is time-based segment partitioning: slicing the data into time-bounded segments so that most queries only touch a tiny fraction of the total data.

What a Segment Is

Think of a book that's been printed in daily volumes. Volume 1 covers January 1st. Volume 2 covers January 2nd. Each volume is a self-contained unit — it has its own index, its own page numbering, its own table of contents. To answer "what happened on January 15th?", you pull Volume 15 off the shelf and never open the other 364 volumes.

A segment in Apache Druid or Apache Pinot is exactly this. It is a self-contained, immutable file package containing:

Columnar data files — one file per column, containing the column's values for events in this time bucket
Bitmap (inverted) indexes — for each dimension column (country, campaign_id, device_type), a compact bitmap that maps "country=US" to a set of row positions in this segment. Looking up "US" events skips rows without touching any data files.
Metadata — the time range this segment covers, the datasource name, the version, min/max values per column (used for further pruning)

Segments are typically 100 MB to 1 GB in size. Each segment covers a specific time window — hourly segments are common for high-volume streams (one segment = one hour of events), daily segments for lower-volume data. A six-month dataset with hourly segments has 4,380 segments (6 months × 30 days × 24 hours). ClickHouse calls its equivalent unit a "part" — same concept, different name.

Segment Pruning: Skipping 99% of the Data

The key insight is segment pruning. When your query includes a time filter — and almost every analytical query does — the broker doesn't need to scan all 4,380 segments. It looks at each segment's metadata and skips any segment whose time range doesn't overlap with the query's time window.

A query for "last 24 hours" in a dataset with 6 months of hourly segments touches exactly 24 segments out of 4,380. That's 0.5% of the data — and with columnar storage inside each of those 24 segments reading only the needed columns, the total I/O might be less than 1% of what a full table scan would require. This is why 200-millisecond queries on 200-billion-row datasets are possible: the engine physically reads a tiny fraction of the stored data.

The diagram makes the math visual. A six-month dataset with hourly segments has roughly 4,380 segments total. A "last 24 hours" query causes the broker to examine each segment's time metadata, skip the 4,356 segments outside the window, and dispatch scan tasks to only the 24 segments within the window. Even if each of those 24 segments is 500 MB on disk, the total I/O is 12 GB — and with columnar storage reading only the needed columns, the actual bytes read might be under 1 GB. That is physically why sub-second queries on 200-billion-row datasets are possible.

Query Routing: How the Broker Dispatches to Segments

The broker is the coordinator that makes segment pruning work at query time. Here's the step-by-step flow for a dashboard query:

The diagram shows the five-step flow. The broker receives the query, consults its in-memory segment metadata index to determine which 24 of 4,380 segments overlap "last 24 hours," and fans the query out to the three historical nodes that hold those 24 segments (typically 8 segments per node, distributed by time range). Each historical node scans its assigned segments in parallel — wall-clock time is bounded by the slowest node, not the sum of all nodes. The broker collects three partial aggregation results and merges them into the final answer. The entire round trip takes around 200 milliseconds: ~5 ms broker overhead, ~140 ms for the parallel segment scans, ~10 ms for the merge.

Why Immutability Is a Feature, Not a Limitation

Once a segment is written and indexed, it is never modified. This might sound like a limitation — what if the data is wrong? — but immutability is what makes the entire architecture reliable and scalable:

Replication without coordination — A segment can be copied to 3 historical nodes for fault tolerance. Since it's immutable, all three copies are always identical. No locking, no conflict resolution, no distributed write protocols.
Replay and backfill — If data pipeline logic changes, old segments can be discarded and reprocessed from the original Kafka event log to produce new segments. The Kafka log is the source of truth; segments are derived data that can always be rebuilt.
Compaction — Small segments from streaming ingestion (e.g., one per 15-minute batch) are periodically merged by background compaction jobs into larger, more efficient segments. Compaction produces a new segment and atomically replaces the old ones — no downtime, no query interruption.
Cache coherence — Historical nodes cache segments in memory. Since a segment never changes, a cached segment is always valid. Cache invalidation is only needed when a new segment version is compacted to replace an old one.

ClickHouse "parts" are the same concept. ClickHouse calls its immutable columnar storage units "data parts" rather than "segments," and uses a merge-tree storage engine (MergeTree) that continuously merges small parts into larger ones in the background — the same compaction model. The semantics are identical: immutable columnar parts, time-ordered, pruned by metadata, merged for efficiency. When you read "segment" (Druid/Pinot) or "part" (ClickHouse), they mean the same thing.

Segments are immutable, self-contained file packages covering a specific time window, containing columnar data files, bitmap indexes on dimensions, and metadata. Time-based partitioning enables segment pruning: a query for "last 24 hours" in a 6-month dataset with hourly segments touches only 24 of ~4,380 segments — reading under 1% of total stored data. The broker routes queries by consulting segment metadata, fans sub-queries to historical nodes in parallel, and merges partial results. Immutability enables replication without coordination, replay from the Kafka log, background compaction, and zero-cost cache coherence.

Section 7

Indexing in Real-Time OLAP — Beyond Just B-Trees

An index is a helper data structure that lets a database find rows without scanning the whole table — a shortcut, basically. The classic shortcut you've seen in Postgres or MySQL is the B-tree index: a tree of sorted keys that finds a single row in a few hops. B-trees are excellent for point queries like "give me the row where user_id = 12345." But real-time OLAP workloads look nothing like that. A real OLAP query sounds like: "how many ad impressions happened in the last hour, grouped by country and device type, filtered to campaigns where the advertiser is in the retail category?" That query touches hundreds of millions of rows, filters across multiple columns simultaneously, and needs to finish in under 200 ms.

B-trees don't help here. For a B-tree to accelerate a filter, the filter must align with the index key — and when you're filtering on three columns simultaneously, a single B-tree helps with one and the engine falls back to a scan for the rest. Real-time OLAP engines instead build a portfolio of specialized indexes, each designed for a specific shape of analytical query. Understanding each one tells you why these engines are fast in ways a traditional RDBMS simply cannot be.

Index Type 1: Bitmap Index — One Bit Per Row Per Value

Imagine you have a column called country with five possible values: US, UK, DE, FR, JP. A bitmap index for this column creates five separate bit arrays — one per distinct value. Each bit array has one bit per row. If row 7 has country = US, then bit 7 of the US bitmap is 1 and bits 7 of all other country bitmaps are 0.

The power comes when you combine filters. Say you want: country = US AND device_type = mobile. You fetch the US bitmap and the mobile bitmap, then do a single bitwise AND operation. On modern CPUs, a single instruction processes 64 bits at a time (or 256–512 bits with AVX). For 1 billion rows, each bitmap is 1 billion bits = 125 MB. A bitwise AND across 125 MB happens in milliseconds. That's how Druid answers multi-dimension filters on 1B rows in under 100 ms.

The diagram shows the key insight: a bitmap index trades storage (a few hundred MB per low-cardinality column) for the ability to reduce a multi-filter query to a few bitwise operations. The math is what makes it fast — modern CPUs execute bitwise AND on 64 bits per clock cycle, and with SIMD (AVX-512) that's 512 bits per cycle. Filtering 1 billion rows with bitmap AND takes single-digit milliseconds. The critical limitation: bitmap indexes only work well for low-cardinality columns (a small number of distinct values). A column like user_id with 100 million distinct values would need 100 million separate bitmaps — absurd. For high-cardinality columns, you need a different approach.

Index Type 2: Inverted Index — Value → List of Rows

An inverted index flips the relationship: instead of one bit per row per value, it stores a mapping from each value to the set of rows containing that value. Think of it like the index at the back of a textbook — "see page 42, 87, 203" — except the "pages" are row IDs.

Apache Pinot uses inverted indexes for medium-cardinality dimension columns (think hundreds to tens of thousands of distinct values). When a query filters on campaign_id = 'abc123', Pinot looks up 'abc123' in the inverted index and gets back a compact list of row IDs — no scan needed. Combined with Pinot's star-tree index (see below), this gives Pinot some of the lowest per-query latencies of any engine at sub-100 ms for user-facing dashboards.

The inverted index is essentially the same data structure that makes Google Search fast — a word-to-document mapping. In OLAP, it's a value-to-row-ID mapping. The query "where campaign_id = 'camp_abc'" fetches the posting list for 'camp_abc' in O(1) time, then reads only those specific rows from the columnar data. Compared to scanning the entire column, this can skip 99% of rows for high-selectivity filters.

Index Type 3: Forward Index — Dictionary Encoding

The forward index isn't a filter accelerator — it's a storage optimizer. Both Druid and Pinot store string columns with dictionary encoding: each distinct value is assigned a small integer ID, and the column stores the integers rather than the strings. "United States" might be stored as integer 3, "Germany" as 7, and so on. Since the dictionary typically has thousands of entries while the column has billions of rows, you're storing 4-byte integers instead of 13-byte strings — a 3× storage reduction for typical country/city strings, even more for longer strings like campaign names.

Dictionary encoding also makes bitmap and inverted indexes cheaper: instead of comparing variable-length strings, comparisons operate on fixed-width integers, which CPUs are extremely efficient at processing.

Index Type 4: Star-Tree Index (Pinot) — Pre-Computed Multi-Dimensional Rollups

The star-tree index is Pinot's secret weapon for the most demanding aggregation queries. The idea: instead of computing GROUP BY country, device_type, campaign_id at query time by scanning billions of rows, pre-compute all combinations of those dimensions at ingest time and store the results in a tree structure.

Think of it as a materialized OLAP cube embedded inside the index. The tree has one node per combination of dimension values. A query for "total impressions by country" climbs to the country-level nodes and reads pre-aggregated counts — no row scanning at all. For dashboards with known dimension sets (which is most production use cases), this can reduce query latency from hundreds of milliseconds to single-digit milliseconds.

The star-tree index is what makes Pinot the top choice for user-facing dashboards — it trades index build time and storage for near-instant query answers on known aggregation patterns. LinkedIn uses it to power analytics dashboards that return in under 100 ms even at millions of rows per second ingest. The trade-off: star-tree indexes must be defined at table creation time for specific dimension sets. If your query asks for a combination the star-tree wasn't built for, it falls back to a raw columnar scan.

Index Type 5: Primary Key (Sparse) Index — ClickHouse MergeTree

ClickHouse takes a different approach. Its sparse primary index stores data physically sorted by the primary key (e.g., (event_date, campaign_id)) and maintains one index entry per ~8,192 rows (a "granule"). A query filtering on the primary key identifies which granules to read, then ClickHouse reads only those granules — skipping the rest entirely. Because data is sorted, related rows are physically contiguous on disk, maximizing both compression ratios and cache utilization.

Why ClickHouse compresses so well. When you sort data by (date, campaign_id), rows from the same campaign on the same day are next to each other in storage. Adjacent rows with similar values compress extraordinarily well — LZ4 or ZSTD can reduce a sorted column of country codes from 4 bytes/row to under 0.1 bytes/row (40×+ compression). This is why ClickHouse benchmarks often show 10–40× storage reduction compared to raw data, and why compressed reads are faster than uncompressed reads (smaller data = fewer disk reads).

Choosing the Right Index — Quick Guide

Use: Bitmap index (Druid default, ClickHouse BitmapIndex)

Best for: gender, device_type, status codes, boolean flags, country (200 countries). Each distinct value gets its own bitmap. Multi-filter queries become bitwise AND/OR operations — incredibly fast. Storage is one bit per row per distinct value.

Why NOT for high-cardinality: 100M user_ids × 125 MB per bitmap = 12.5 TB of indexes. Absurd.

Use: Inverted index (Pinot, ClickHouse InvertedIndex)

Best for: campaign_id, ad_group_id, product SKUs, city names. Maps each value to a posting list of row IDs. Lookups are O(1) dictionary lookup + O(k) posting-list read. Most efficient when query selectivity is high (the filter matches a small fraction of rows).

Use: Range index / sorted primary key (ClickHouse MergeTree, Druid range index)

Best for: user_id, session_id, order_id, IP addresses. Bitmap and inverted indexes don't scale to millions of distinct values. Instead, sort the data by the high-cardinality column and use range filters — the physical sort order allows binary search to find the right block, then a scan of only the matching range.

Druid also supports Bloom filters on segment metadata to quickly skip entire segments that can't contain a given value.

Use: Star-tree index (Pinot only)

Best for: dashboards where the GROUP BY dimensions are known in advance (country + device_type + hour, for example). Pre-computes all rollup combinations at ingest time. Queries hit pre-aggregated nodes instead of raw rows — often 100–1000× faster than a columnar scan for the same query. Must be configured explicitly per table with the chosen dimensions.

Real-time OLAP engines use specialized indexes — not B-trees — because analytical queries filter across multiple dimensions simultaneously. Bitmap indexes handle low-cardinality columns via bitwise operations (microseconds for billion-row filters). Inverted indexes map values to row-ID posting lists for medium-cardinality dimensions. Dictionary/forward encoding compresses storage 3–40×. Star-tree indexes (Pinot) pre-aggregate known dimension combinations at ingest time, enabling single-digit-millisecond aggregation queries. ClickHouse's sparse primary index over sorted data achieves both fast range queries and exceptional compression. Choosing the right index requires knowing your column cardinality and query patterns.

Section 8

Pre-Aggregation & Rollups — Trade Storage for Query Speed

Even with the best columnar storage and indexes, scanning a billion rows per query has a cost. At some point, raw-row scanning doesn't scale — not because the hardware is too slow, but because you have thousands of simultaneous users, each triggering a billion-row scan. If each scan takes 200 ms and you have 2,000 concurrent users, the math works out: you need enough hardware to run 2,000 scans in parallel. That's expensive.

The alternative is pre-aggregation: instead of computing the answer to "how many impressions per campaign per hour" at query time, you compute it once at ingest time and store the result. When a query comes in, it reads the pre-aggregated summary — milliseconds, not a billion-row scan. This is the same insight behind database materialized views, but done in real time as events arrive.

Rollup Tables — Collapsing Billions of Rows to Thousands

A rollup at ingest time works like this: as events arrive, the engine groups them by their dimension columns (hour, country, campaign_id) and accumulates metric columns (count, sum_revenue, sum_clicks). When you flush the segment to disk, each unique combination of dimension values becomes a single row with the accumulated metrics.

Concrete example: an ad-tech platform receives 1 billion click events per day. Without rollup, that's 1B rows/day in storage. With hourly rollup by (campaign_id, country): there are 10,000 campaigns × 200 countries × 24 hours = 48 million combinations in theory, but in practice each campaign is only active in a few countries, so the actual row count is around 2–5 million rows/day. That's a 200–500× reduction in storage and a 200–500× speedup for any query that aligns with the rollup dimensions.

The before/after contrast is stark. Raw events give you perfect fidelity — you can answer any question. But at 1 billion rows per day, storing three years of history means over a trillion rows. Rollup collapses that to ~1 billion rows for the same history, and queries run 300× faster because they're scanning fewer rows. The loss: you can no longer drill down to an individual event. This trade-off is why the decision to roll up must be made at table design time, not after the fact.

Materialized Views — On-the-Fly Rollup for ClickHouse and StarRocks

ClickHouse and StarRocks take a different approach: rather than rolling up at ingest time (before the data hits storage), they support materialized views that maintain rollup summaries alongside the raw data. When a new event arrives in the base table, ClickHouse simultaneously writes the raw row AND updates the appropriate aggregation row in the materialized view — all in the same write path.

This is the best of both worlds: you keep full raw event history for drill-down analysis AND have pre-computed rollups for fast dashboard queries. The cost: roughly double the storage (raw + aggregated). ClickHouse's SummingMergeTree and AggregatingMergeTree engines are specifically designed for this — they automatically merge partial aggregate states during background compaction, so intermediate states never grow unbounded.

-- ClickHouse: materialized view that maintains hourly impression rollups -- Base table receives raw events CREATE TABLE impressions_raw ( ts DateTime, campaign_id String, country LowCardinality(String), clicked UInt8, revenue Float32 ) ENGINE = MergeTree() ORDER BY (toStartOfHour(ts), campaign_id); -- Materialized view rolls up to hourly granularity automatically CREATE MATERIALIZED VIEW impressions_hourly_mv ENGINE = AggregatingMergeTree() ORDER BY (hour, campaign_id, country) AS SELECT toStartOfHour(ts) AS hour, campaign_id, country, countState() AS impressions_state, -- partial agg state sumState(revenue) AS revenue_state, sumState(clicked) AS clicks_state FROM impressions_raw GROUP BY hour, campaign_id, country; -- Query the rollup — fast, reads millions of rows, not billions SELECT hour, country, countMerge(impressions_state) AS impressions, sumMerge(revenue_state) AS revenue FROM impressions_hourly_mv WHERE hour >= toStartOfHour(now()) - INTERVAL 24 HOUR GROUP BY hour, country ORDER BY hour;

The key insight in this code: countState() and sumState() store partial aggregate states (binary blobs that can be merged later) rather than final values. When you query with countMerge() and sumMerge(), ClickHouse combines those partial states across all rows that share the same key. This allows ClickHouse to merge segments in the background without losing accuracy — if two segments both have rows for (hour=10, campaign_id=abc, country=US), the merge combines their partial states rather than double-counting.

Approximate Queries — Trading 1% Accuracy for 10,000× Less Memory

Some metrics are genuinely expensive to compute exactly, even with rollups. The canonical example: unique users. "How many distinct users saw this campaign in the last 30 days?" This requires knowing which user IDs are unique — and a naive set of all user IDs for a 1B-event stream can consume 100 GB of memory for an exact count.

The solution: HyperLogLog (HLL). The algorithm uses a clever hashing trick: instead of storing every unique user ID, it stores a small summary structure (2–16 KB) that lets you estimate the count of distinct IDs with ~1–2% relative error. For a billion-row stream, the exact count takes 100 GB; HLL gives 99% accuracy with 16 KB. For a dashboard showing "1.2M unique users" (which rounds to the nearest thousand anyway), a 1% error is invisible.

The HLL diagram reveals the key property that makes it practical for OLAP: mergeability. You can pre-compute an HLL sketch per segment (say, one per hour of data) and merge multiple sketches together at query time. The merged sketch still fits in 1.5 KB and still gives ~1% error. This means an HLL "unique users" metric can be rolled up just like a sum or count — stored per-segment, merged at query time. Druid, Pinot, and ClickHouse all support HLL (Druid calls them HLLSketchBuild, Pinot uses ThetaSketch, ClickHouse has built-in uniqHLL12()).

Beyond HLL, two other approximate data structures are common in real-time OLAP: Theta sketches for set intersection/difference queries ("users who saw ad A but NOT ad B"), and T-digest for percentile estimation (p95/p99 query latency over billions of observations).

Pre-aggregation trades raw-event granularity for query speed. Rollup ingestion (Druid, Pinot) collapses 1B rows/day to millions by grouping on dimension columns at ingest time — 200–500× storage reduction and proportional query speedup, but individual events are gone forever. Materialized views (ClickHouse, StarRocks) maintain both raw data and rollups simultaneously using partial aggregate states that merge safely during background compaction. For metrics like unique users where exact computation is prohibitive (100 GB+), probabilistic data structures — HyperLogLog (1–2% error, 16 KB), Theta sketches (set operations), T-digest (percentiles) — give near-exact answers in kilobytes.

Section 9

Ingestion — Streaming, Batch, and the Lambda Heritage

All of the query speed in the world is worthless if data doesn't get into the engine quickly and reliably. Ingestion is the process of moving events from their source (usually Kafka or S3) into queryable segments inside the OLAP engine. Real-time OLAP engines support two very different ingestion paths — streaming and batch — and most production deployments use both simultaneously. Understanding why both exist, and how they interact, is essential to designing a reliable real-time analytics system.

Streaming Ingestion — Seconds from Event to Query

In streaming ingestion, the OLAP engine runs a continuous consumer that reads from Kafka topic partitions. As events arrive, the engine accumulates them into an in-memory buffer (Druid calls this a "real-time task"; Pinot calls it a "consuming segment"). Every few minutes (configurable — typically 5–30 minutes), the engine seals the current buffer into a segment on disk, makes it queryable, and starts a fresh buffer.

The result: events are typically visible to queries within 1–10 seconds of being produced to Kafka. This is the "real-time" in real-time analytics — for a campaign manager watching a new ad launch, they see impression counts updating every few seconds, not every hour.

Batch Ingestion — Efficient Historical Loading

Streaming ingestion handles the current firehose, but what about history? If you're launching a new real-time analytics system on data that's been collecting in S3 for two years, you can't wait for the streaming path to process that backlog — it would take months. That's where batch ingestion comes in.

Batch ingestion reads from large files in object storage (S3, GCS, HDFS), processes them in parallel (Druid can spin up hundreds of parallel ingestion tasks), and produces optimally-sized segments. Because batch ingestion processes files that already exist (rather than a real-time stream), it can sort the data by time, apply rollup more efficiently, and build indexes more effectively — producing better-compressed, better-indexed segments than streaming ingestion can. Batch is also how you handle data corrections: if an upstream system sends a corrected version of yesterday's data, you batch-ingest the correction and the bad segments are replaced.

The dual-path architecture is production-critical for reliability. Streaming ingestion can fail mid-segment (the Kafka consumer crashes, a segment build fails), and you need the ability to replay from Kafka offsets. Batch ingestion handles this naturally — you just re-run the job against the same S3 files. Having both paths also lets you bootstrap a new table: stream from now, backfill with batch, and the system stitches the two together seamlessly.

Segment Compaction — Merging Small Streaming Segments

Streaming ingestion creates many small segments — each buffer flush produces a new segment, so if you flush every 10 minutes you generate 6 segments per hour. Over time, the cluster accumulates thousands of tiny segments, and query performance degrades because each segment requires separate I/O. The fix: segment compaction.

Druid's Coordinator and Pinot's Minion both run compaction jobs that merge small recent segments into larger optimally-sized ones (typically 300 MB–1 GB each). Compaction happens in the background and doesn't block queries — the engine serves queries from the old small segments until the merged segment is ready, then atomically swaps them. The merged segment is better compressed (because the sorter has access to more data, finding better compression opportunities) and faster to query (fewer files = fewer I/O seeks).

Schema Evolution — Adding Columns Without Rebuilding History

Production systems evolve. You add a new tracking dimension — say, "ab_test_variant" — six months after the table was created. With traditional databases, adding a column to a 1B-row table means an ALTER TABLE that touches every row. Real-time OLAP engines handle this more gracefully: new columns are added to the schema, and from that moment forward new segments include the column. Old segments that pre-date the column simply return NULL for that column. No backfill required — the engine treats a missing column in an old segment as implicitly null. If you need historical values, you batch-ingest a corrected version of old data with the new column populated.

Real-time OLAP ingestion has two complementary paths: streaming (Kafka consumer → segments in 1–10 seconds, handles current data) and batch (S3/HDFS → parallel tasks → optimized segments, handles history and corrections). Most production systems use both. Streaming produces many small segments; background compaction merges them into fewer, larger, better-compressed segments without blocking queries. Schema evolution is handled gracefully — new columns appear in future segments while old segments return NULL, no full-table rebuild required.

Section 10

Query Engine Architecture — Broker, Coordinator, Historical

When a user fires a query against an Apache Druid or Pinot cluster, that query doesn't hit a single machine. It fans out across a cluster of nodes, each responsible for a subset of the segments, and the results are merged before being returned. Understanding this architecture tells you why real-time OLAP engines can answer billion-row queries in under 200 ms — they decompose the problem into thousands of sub-problems that run in parallel across hundreds of cores.

Druid's Node Topology

Druid splits its work across five different kinds of servers, each with one focused job. Think of it like a restaurant kitchen: somebody takes orders, somebody cooks, somebody plates, somebody manages the storeroom, somebody schedules deliveries. Below are the five "stations" and what each does:

Broker — the query entry point. Every client query hits a Broker first. The Broker understands the segment timeline (which segments cover which time ranges) and knows which Historical and Real-Time nodes hold which segments. It rewrites the query into sub-queries, fans them out in parallel, collects partial results, merges them, and returns the final answer. Brokers are stateless and horizontally scalable — add more Brokers to handle more concurrent queries.
Historical — the workhorse for segments older than the streaming window. Historical nodes download segments from deep storage (S3/HDFS/GCS) to local disk, keep them in memory-mapped files, and serve queries against them. A Historical node might hold 50–500 segments, each representing a few hours of data. Multiple Historicals answer the Broker's sub-queries in parallel — this is where segment-level parallelism lives.
Middle Manager / Real-Time Tasks — handles streaming ingestion. Middle Managers spin up "tasks" that each consume one or more Kafka topic partitions, build segments in memory, and periodically publish them to deep storage. Once published, a Historical node downloads the segment and takes over query responsibility.
Coordinator — the cluster's control plane. The Coordinator watches which segments exist (via metadata in MySQL/PostgreSQL), decides which Historical nodes should hold which segments (load balancing, replication), and orchestrates segment movement. It doesn't touch the query path.
Overlord — manages ingestion tasks. Works with Middle Managers to start/stop/monitor streaming and batch ingestion tasks. Think of it as a job scheduler specifically for the data-in side.

The topology diagram shows the fundamental principle: the Broker is a traffic director, not a compute node. It holds no segment data. Its job is to know "which segments touch the time range this query needs, and which Historical nodes hold those segments?" It then fires sub-queries to all relevant nodes simultaneously. If a query touches 600 segments spread across 30 Historical nodes, all 30 nodes work in parallel — the wall-clock time is determined by the slowest single node, not the sum of all nodes. That's why adding more Historical nodes directly reduces query latency for large queries.

Tail Latency and Speculative Execution

The fanout model has a well-known weakness: tail latency amplification. If a query fans out to 200 segment-level sub-queries, the query's total latency equals the slowest sub-query. Even if 199 sub-queries finish in 50 ms, one slow sub-query (perhaps because the segment is cold on disk and needs a page-fault load) can push total latency to 500 ms. At scale, this slow outlier is nearly guaranteed to appear on every query.

The mitigation is speculative execution: when a sub-query takes longer than expected (say, the 80th percentile threshold), the Broker sends a duplicate request to a replica of the same segment on a different node. Whichever responds first wins; the other is cancelled. This trades a small amount of extra load for a large reduction in tail latency — the p99 can drop from 500 ms to 100 ms with speculative execution enabled.

Why Pinot often beats Druid at ultra-low latency. Pinot was designed from day one for user-facing dashboards — scenarios where a live page must return in under 100 ms for thousands of simultaneous users. Pinot adds two key optimizations on top of the basic fanout: (1) the Star-Tree index pre-computes common aggregations so many sub-queries don't need to scan at all, and (2) Pinot supports "realtime-offline join" (querying both the real-time consuming segments and the historical segments as a single logical table seamlessly). Druid added similar optimizations later. For latency below 100 ms at high concurrency, Pinot's architecture tends to win. For query flexibility and multi-tenant workloads, Druid's architecture tends to win.

Real-time OLAP query engines use a scatter-gather architecture: a stateless Broker receives queries and fans them out as sub-queries to Historical nodes in parallel, one sub-query per segment group. Segment-level parallelism means a 1B-row query across 600 segments runs as 600 simultaneous scans. Tail latency amplification (the slowest sub-query sets the total latency) is mitigated by speculative execution — sending duplicate requests when a sub-query is slow. Pinot optimizes further with star-tree pre-aggregation and seamless real-time/historical joins for sub-100 ms user-facing queries.

Section 11

The Big 5 Real-Time OLAP Engines — Comparison

Five engines dominate real-time OLAP in production. They all share columnar storage, time partitioning, and ingestion from Kafka — so the surface-level story looks similar. But they make very different design choices for their target workloads, and picking the wrong one for your use case costs months of migration. Here's a precise breakdown of each, what they're optimized for, and when to choose them.

Engine	Origin	License	Primary Strength	Notable Weakness	Choose When
Apache Druid	MetaMarkets 2011, ASF 2018, top-level 2019	Apache 2.0	Mature, multi-tenant, flexible time-series	Complex ops; joins are expensive	Multi-tenant SaaS analytics, ad-tech, event exploration
Apache Pinot	LinkedIn 2013, open-sourced 2015, ASF Incubator 2018, top-level 2021	Apache 2.0	Lowest latency, star-tree index, user-facing	Complex ad-hoc joins; operational complexity	User-facing dashboards requiring <100 ms at high concurrency
ClickHouse	Yandex ~2009, production 2012, open-sourced 2016	Apache 2.0	Fastest raw scan throughput; simple ops; huge community	Native streaming ingest less mature; real-time latency ~seconds	High-throughput analytics where simplicity > ultra-low latency
StarRocks	Forked from Doris 2020, open-sourced	Apache 2.0 (re-licensed from Elastic License 2.0 in 2023)	Strong join support; lakehouse integration (S3 tables)	Younger community; fewer managed cloud options	Lakehouse analytics, complex joins on real-time + historical data
Apache Doris	Baidu Palo, open-sourced 2017, ASF Incubator 2018, top-level 2022	Apache 2.0	Full MPP SQL, broad analytical coverage	Lower global adoption outside Chinese tech ecosystem	General real-time analytics; strong in Chinese tech companies

Engine Deep Dives

Apache Druid — The Battle-Tested Generalist

Druid is the elder statesman of real-time OLAP. It was built in 2011 at MetaMarkets to power ad-tech analytics dashboards — exactly the workload that modern real-time analytics engines are designed for. Open-sourced in 2012, donated to the Apache Software Foundation in 2018, and graduated as an Apache top-level project in 2019. The core architecture (Lambda-heritage streaming + historical tiers, Broker scatter-gather, segment-level bitmap indexes) has proven itself at companies including Airbnb, Netflix, and many large ad platforms.

Why Druid is still the default for multi-tenant analytics: its query isolation model lets many tenants share a cluster without one runaway query starving others. The Broker intelligently routes queries to minimize cross-segment reads, and the Coordinator's segment placement policy keeps tenant data co-located to reduce cross-node fanout. The built-in native query API alongside modern SQL (via Calcite) lets advanced users squeeze out extra performance when needed.

Druid's weaknesses are real: join support is historically poor (Druid is designed for denormalized fact tables, not normalized star schemas). Operational complexity is high — you manage Broker, Historical, Middle Manager, Coordinator, Overlord, plus ZooKeeper and a metadata database. Managed Druid (via Imply.io or Apache Druid on Kubernetes) reduces this burden but adds cost.

Apache Pinot — The Latency Champion

Pinot was built at LinkedIn in 2013 to power live analytics for LinkedIn's feed — a context where millions of users simultaneously see analytics about their posts, connections, and ad performance. That origin shapes everything about Pinot's design: sub-100 ms latency at thousands of concurrent queries is the north star, not raw throughput or query flexibility.

Pinot's star-tree index is its most distinctive capability — the ability to pre-aggregate known dimension combinations at ingest time and resolve those aggregations with near-zero compute at query time. For dashboards where you know the GROUP BY dimensions (which is most production dashboards), this transforms 200 ms queries into 5–20 ms queries. LinkedIn runs Pinot at very high QPS across many clusters, with member-facing products like "Who's Viewed Your Profile" answering thousands of queries per second at tens-of-milliseconds latencies.

Pinot's weakness: anything that requires joining two large tables is a challenge. Pinot supports lookup joins (one small dimension table joined to a large fact table), but complex multi-table joins at scale require pushing join logic to the application layer or preprocessing the data before ingestion. If your workload requires SQL joins across multiple large real-time tables, StarRocks or Druid (with a broadcast-join workaround) is a better choice.

ClickHouse — The Raw Speed Specialist

ClickHouse was built inside Yandex starting around 2009, primarily to power Yandex.Metrica, one of the world's largest web analytics platforms. ClickHouse went into production at Yandex in 2012, was open-sourced under Apache 2.0 in 2016, and was spun out as ClickHouse Inc. in September 2021. It has since become one of the fastest-growing real-time OLAP engines by community adoption, largely because it is genuinely one of the fastest columnar scanners available and one of the easiest to operate.

Where Druid and Pinot have complex multi-node topologies, ClickHouse runs as a small cluster (or even a single node for moderate scale) with fewer moving parts. The MergeTree storage engine, sparse primary key index, and vectorized query execution are architectural choices that maximize raw CPU throughput — ClickHouse benchmarks frequently show it outperforming Druid and Pinot on pure scan-and-aggregate workloads, though the exact margin depends heavily on schema, indexes, and query pattern.

ClickHouse also has the richest SQL dialect of the five engines — hundreds of built-in functions, window functions, full JOIN support, arrays as first-class types, and a particularly capable time-series function library. This makes it the easiest to adopt for teams coming from traditional SQL databases.

ClickHouse's trade-off: native streaming ingestion (from Kafka) is less mature than Druid and Pinot. The MaterializedView approach works well but requires more manual configuration. Real-time data freshness typically lags by several seconds to a minute rather than 1–3 seconds. For workloads that need true sub-second freshness at very high Kafka ingest rates, Druid or Pinot is more battle-tested. For workloads that prioritize query speed and operational simplicity over ultra-low ingest latency, ClickHouse wins almost every time.

StarRocks — The Join-Capable Lakehouse Engine

StarRocks (formerly DorisDB) was forked from Apache Doris by the original Doris engineering team in 2020, with the goal of building a more performant MPP engine — short for Massively Parallel Processing, an architecture where a query is split into pieces that run simultaneously on many nodes and the results are stitched back together. The key differentiator: StarRocks treats joins as a first-class operation and has invested heavily in multi-table query optimization — colocated joins (data for joined tables stored on the same nodes), broadcast joins for small dimension tables, and runtime-adaptive join strategies.

StarRocks also has the best lakehouse integration story of the five: it can query S3 Parquet files directly (via external catalogs) without ingesting them into its own storage, making it a natural fit for architectures where data lives in a data lakehouse (Apache Iceberg, Delta Lake, Apache Hudi) and you want real-time analytics without full data duplication.

When to choose StarRocks: your analytics require joining multiple large real-time tables (e.g., joining impressions + clicks + user_profile + campaign in a single query), or you're building a lakehouse stack where the same data serves both batch processing (Spark) and real-time dashboards. The community is growing rapidly, particularly in Asia, and ClickHouse's simpler operational story makes it a stronger choice for pure analytics-only workloads.

Apache Doris — The Broad SQL OLAP Platform

Apache Doris (originally Baidu Palo) was open-sourced in 2017, donated to the Apache Software Foundation Incubator in July 2018, and graduated as a top-level Apache project in June 2022. It's a widely-used real-time OLAP engine in the Chinese internet ecosystem — Baidu, TikTok, Tencent, NetEase, and hundreds of others run Doris at scale. Like StarRocks, it's a full MPP SQL engine that supports complex joins, materialized views, and direct Kafka ingestion.

Doris has been broadening its scope to cover more of the analytics stack: it supports both real-time streaming ingestion and batch loading, has a growing ecosystem of integrations with Flink and Spark, and is working toward lakehouse compatibility. The SQL compatibility is strong — Doris is largely MySQL-compatible, which lowers the learning curve for teams already familiar with MySQL.

When to choose Doris: you're building in an ecosystem where Doris has existing support and tooling (common in Chinese tech), or you want a single SQL platform that covers real-time analytics and broader OLAP workloads without managing separate engines. For teams outside this ecosystem, ClickHouse or StarRocks will have larger communities and more available documentation in English.

The positioning chart reveals the fundamental trade-off: lower latency correlates with less join flexibility. Pinot achieves the lowest latency by pre-computing aggregations (star-tree) and restricting joins to simple lookup patterns. ClickHouse sits in the pragmatic middle — not the absolute fastest, but the best combination of query speed, join capability, and operational simplicity. StarRocks and Doris sacrifice some latency for much richer SQL join semantics, making them right for complex analytical queries but less suited for user-facing dashboards where every millisecond counts.

The decision tree distills the positioning chart into actionable choices. For most teams starting from scratch on a real-time analytics problem, ClickHouse is the pragmatic default: easiest to set up, fastest raw scans, excellent SQL, growing managed offerings (ClickHouse Cloud). Move to Pinot if you need sub-100 ms user-facing dashboards with known aggregation patterns. Move to Druid for multi-tenant SaaS analytics with isolation requirements. Move to StarRocks or Doris when complex joins across multiple real-time tables are non-negotiable.

Five engines dominate real-time OLAP, each with a distinct target. Apache Pinot optimizes for sub-100 ms user-facing queries using star-tree pre-aggregation — LinkedIn's choice for live feed analytics. Apache Druid is the mature multi-tenant generalist, strong for ad-tech and event exploration but complex to operate. ClickHouse delivers the fastest raw scan throughput with the simplest operations and best SQL — the right default for most new projects. StarRocks and Apache Doris add MPP join semantics and lakehouse integration for analytics requiring complex multi-table joins. For most teams, ClickHouse is the starting point; migrate toward Pinot only when sub-100 ms p99 becomes a hard requirement.

Section 12

Storage Tiers — Hot, Warm, Cold, and Deep

Real-time analytics data has a freshness decay curve. Yesterday's data is queried constantly; data from three months ago is queried occasionally; data from two years ago might be queried once a quarter for a compliance report. Storing all of this at the same cost and speed is wasteful — you'd either overspend on fast storage for old data nobody queries, or you'd suffer slow queries on recent data because you put everything on cheap storage. The solution: storage tiering, where data migrates from fast expensive storage to slow cheap storage as it ages.

The Three Tiers (Plus Deep Storage)

Most production real-time OLAP deployments use three tiers, with a fourth "deep storage" layer as the archive:

Hot tier — NVMe SSD, last few hours to days. The most recently-ingested segments live here. All data is in memory-mapped files on local NVMe SSDs. Queries hit segments that are largely already in the OS page cache, so effective read bandwidth is often in the tens of GB/s. Query latency for hot-tier data is in the 10–100 ms range. Cost: highest (NVMe SSDs on server-grade hardware in the cloud run ~$0.08–0.20/GB/month depending on cloud provider and instance type).
Warm tier — HDD or cheaper SSD, last few weeks to months. Segments that have aged out of the hot tier migrate here. Queries still complete at acceptable speed (1–5 seconds for large scans) because the data is local. Some deployments skip the warm tier entirely and go directly from hot to cold, accepting that "older" queries will be slower. The warm tier makes sense when you have a significant volume of medium-age data that is queried regularly but not in the sub-second SLA.
Cold tier (Deep Storage) — S3/GCS/HDFS, anything older than the warm window. All segments are uploaded to object storage as they are created — Druid and Pinot do this immediately on segment publish. The deep storage copy is the ground truth. Historical nodes cache a local copy; if a Historical node dies, it recovers by re-downloading from deep storage. Some modern engine versions (Druid 0.22+, Pinot tiered storage) support querying deep storage directly with a local cache, enabling you to answer queries over cold data without downloading the full segment — the engine fetches only the columns and pages needed for a specific query.

The pyramid shape is intentional — hot data is small in volume but high in query frequency, while cold data is vast in volume but rarely queried. The cost inversion is stark: NVMe SSD in the cloud costs roughly 6–7× more per GB than S3 object storage. For a 10 TB analytical dataset where 95% of queries touch only the last 3 days (300 GB), tiering means you pay NVMe rates only on 300 GB and S3 rates on 9.7 TB — a massive cost saving with no query-speed penalty for the 95% case.

Retention Policy — What to Keep and How Long

Storage tiers must be paired with a retention policy: how long do you keep raw data at each tier before discarding or archiving? A typical policy looks like this:

The retention timeline shows a layered strategy. Raw event data — which provides the highest granularity but the highest storage cost — is kept only for 30–90 days (enough for operational investigations and drill-downs). Hourly rollups (which are 200–500× smaller than raw data) are kept for 1–2 years, covering most trend analysis and year-over-year comparisons. Daily rollups (another 24× smaller than hourly) are kept indefinitely or for compliance periods (7+ years in regulated industries). This multi-layer approach means the total storage cost is dominated by the cheap cold-tier rollup data, not by the expensive hot-tier raw data.

The Cost Calculation — Why Tiering Makes Real-Time Analytics Economically Viable

Concrete math illustrates why storage tiering is essential at scale. Suppose you have 100 TB of raw event data generated per year. Without tiering, storing all of it on NVMe at $0.15/GB/month = $15,000/month just for storage. With tiering:

Hot tier (last 3 days of raw): ~820 GB × $0.15 = $123/month
Warm tier (90 days of raw + 2yr hourly rollups): 24 TB raw + 1 TB rollup × $0.04 = ~$1,000/month
Cold/Deep tier (3yr daily rollups + full raw archive): 300 GB rollup + 100 TB archive × $0.023 = ~$2,300/month

Total with tiering: ~$3,400/month versus $15,000/month flat NVMe — a 4.4× cost reduction. And query performance for the 95% of queries that touch recent data is actually better, because hot-tier data is entirely on fast NVMe instead of being diluted across 100 TB of mixed storage.

Deep storage as the recovery backbone. Druid and Pinot upload every segment to deep storage (S3/GCS) immediately on publish — before it's even served by a Historical node. This means if a Historical node fails and its local disk is lost, the cluster automatically reloads segments from deep storage on replacement nodes. No data is lost. The segment in deep storage is the source of truth; local disk is just a cache. This architecture gives real-time OLAP clusters surprisingly strong durability guarantees despite running stateful workloads.

Storage tiering aligns data cost with query frequency. Hot tier (NVMe SSD, last 24–72 hours) handles sub-100 ms queries at high cost. Warm tier (HDD/cheaper SSD, last 30–90 days) handles acceptable-latency queries at medium cost. Cold/deep storage (S3/GCS, everything older) serves as the archive and recovery backbone — every segment is uploaded here immediately, enabling node failure recovery without data loss. Tiering combined with a layered retention policy (raw events 30–90 days, hourly rollups 1–2 years, daily rollups indefinitely) reduces total storage cost by 4–5× compared to flat NVMe storage, while keeping recent-data queries fast.

Section 13

Common Use Cases — Where Real-Time OLAP Shines

Real-time OLAP is not a universal tool. It shines brightest in a specific constellation of conditions: fresh data measured in seconds, analytical queries with multiple group-by dimensions, high concurrency, and results that users or systems act on immediately. The six use cases below hit that constellation. Understanding why each fits helps you recognize the pattern when you encounter it in the wild.

The diagram above shows how LinkedIn's "Who viewed your profile" feature works under the hood. Profile view events flow from the browser into Kafka. A stream processor enriches them with viewer metadata; a fast path ingests directly into Pinot for real-time segment writes. Pinot serves the dashboard API in under 100ms — a query like "give me all viewers of member_id=42 in the last 30 days, grouped by viewer company" executes directly on time-partitioned columnar segments. End-to-end, an event from a profile view is queryable in roughly 2–5 seconds. That is the "real-time" promise.

The matrix above distills the key dimensions that drive engine selection: concurrency pressure, freshness tolerance, query latency SLO, and cardinality. User-facing dashboards (Netflix, LinkedIn) sit at the hardest corner — very high cardinality, thousands of concurrent users, sub-100ms SLO. Internal BI tools are comparatively relaxed: fewer users, a 1-second SLO, and medium cardinality — which opens the door to StarRocks or ClickHouse with richer SQL join support.

User-Facing Analytics Dashboards

When a product feature shows analytics to end users — not internal analysts, but real customers — the bar is brutal. Netflix's content performance UI, LinkedIn's "Who viewed your profile", Stripe's transaction analytics dashboard: these serve thousands to tens of thousands of concurrent users, each firing slightly different queries (different date ranges, different dimension filters). The latency SLO is sub-100ms because users sitting at a browser notice any pause over 200ms. Cardinality is extremely high because user_id, session_id, and content_id are all in the hundreds of millions. Apache Pinot was purpose-built for this workload at LinkedIn — its broker tier routes queries to only the segments touched by the time filter, keeping latency predictable regardless of total data size.

The key insight: The user-facing constraint forces you into real-time OLAP. Batch OLAP (Snowflake, BigQuery) cannot serve 10,000 concurrent users at sub-100ms because each query spins up compute resources — the startup overhead alone exceeds the SLO.

Anomaly Detection and Real-Time Monitoring

Observability platforms ingest enormous volumes of metric, log, and trace data — CPU, memory, request rate, error rate, latency — tagged by host, service, region, and datacenter. An engineer investigating an incident types a query like "show me p99 latency grouped by service and region for the last 15 minutes" and expects results in under a second. The engine under the hood is real-time OLAP — ClickHouse powers many modern observability stacks (used at Cloudflare, eBay, and others for log analytics), and Druid is used at several large observability deployments. (Honeycomb is a notable exception: they built a custom columnar engine called Retriever rather than adopting an off-the-shelf OLAP engine.) The freshness requirement here is tighter than dashboards — an anomaly you detect 5 minutes late may already have cascaded into a full outage. Ingestion lag must be under 30 seconds, ideally under 5 seconds.

Ad-Tech Reporting — DSPs, Publishers, and Real-Time Bidding Analytics

Ad-tech is where real-time OLAP was battle-tested first. A DSP needs second-level reporting: "how many impressions and clicks did campaign 9812 get in the last 60 seconds, broken down by country and ad creative?" Advertisers pause poorly-performing creatives; publishers need live floor-price adjustments. Apache Pinot powers LinkedIn's advertising analytics and several other ad networks. Apache Druid powers ad analytics at many programmatic advertising companies. The distinguishing characteristic of ad-tech: the cardinality of campaign_id × creative_id × publisher_id is astronomical, and rollup at ingestion time (collapsing individual impression events into 1-second buckets per campaign/country/device) is almost always done to keep segment sizes manageable.

Real-Time Business Intelligence — Operational Dashboards

Internal operations teams — logistics, customer support, supply chain, fraud — need dashboards where the data is measured in minutes, not hours. "How many orders are stuck in 'payment pending' right now, grouped by payment gateway?" is a query that drives action: someone escalates to the gateway vendor. The concurrency here is lower (50–500 users vs. 10,000), the cardinality is medium, and the latency SLO is relaxed to around 1 second. This opens the door to StarRocks or ClickHouse with richer join support, because internal BI queries tend to involve multiple tables (orders JOIN payments JOIN customers) that would be painful to fully denormalize. The data freshness requirement — data visible within 10–30 seconds of the event — is still tighter than Snowflake can offer but relaxed compared to user-facing dashboards.

IoT and Telemetry Analytics

An industrial IoT deployment might have a million sensors emitting temperature, pressure, and vibration readings every 5 seconds — that is 200 million events per second at peak. Queries look like "what is the average temperature reading for sensors in Zone 3 of Plant B that exceeded 85°C in the last 10 minutes?" The dimensions are device_id (high cardinality), zone, plant, sensor_type, and event_time. The time dimension is always the dominant filter — most queries are "last N minutes" — which is exactly what time-partitioned segments were designed for. Apache Druid's native time-based partitioning makes it a natural fit. ClickHouse's aggressive compression on repetitive telemetry values also works well here. The main gotcha: device_id cardinality can reach hundreds of millions, making HLL sketches mandatory for any "how many unique devices reported in the last hour" query.

A/B Testing and Experimentation Guardrails

When you launch a new feature to 1% of users, you need real-time guardrails: "is the checkout conversion rate for experiment group B statistically significantly below the control group right now?" Catching a bad experiment early — within minutes of launch, not hours — prevents wide exposure to a buggy or harmful feature. The query pattern is: group by experiment_id and variant, compute conversion rate (events / session starts), check if the difference exceeds the guardrail threshold. The data model is relatively simple (experiment_id, variant, user_id, event_type, timestamp), the concurrency is low (experimentation pipelines run scheduled checks), and freshness needs to be within 30 seconds. Apache Druid is commonly used here because its rollup can pre-aggregate per-experiment-per-variant counts at ingestion, making the guardrail queries trivially fast.

Real-time OLAP is the right tool when you need sub-second analytical queries over fresh data under concurrency. The six canonical use cases — user-facing dashboards, anomaly detection, ad-tech reporting, operational BI, IoT telemetry, and A/B testing guardrails — all share the same core constraint: data that is seconds old, queries with multiple dimensions, and latency SLOs that batch OLAP warehouses cannot meet. Engine selection within real-time OLAP depends on which corner of the matrix you sit in: user-facing (Pinot, Druid), operational BI with joins (StarRocks, ClickHouse), or high-ingest telemetry (Druid, ClickHouse).

Section 14

Data Modeling for Real-Time OLAP — Star, Snowflake, Wide-Table

In traditional data warehousing (Snowflake, BigQuery, Redshift), you learn to normalize your tables into star or snowflake schemas. That advice fails hard in real-time OLAP. The reason is simple: joins are expensive, and real-time OLAP engines are optimized for single-table, segment-parallel scans. Understanding the three modeling patterns — and when each breaks — is one of the most important skills in this space.

The diagram above compares the three schemas side-by-side. A star schema has one central fact table and 3–5 dimension tables joined by foreign key. A snowflake schema normalizes the dimension tables further (dimensions have sub-dimensions). A wide table collapses everything into one massive table with 200–500 columns. In batch OLAP, star schemas are optimal because the query engine can plan efficient hash joins. In real-time OLAP, every join requires cross-segment communication that destroys query latency — so the wide table wins despite its storage cost.

Why Wide Tables Win in Real-Time OLAP

Imagine your fact table has 50 billion rows. A star schema query needs to join it against dim_campaign (10 million rows) and dim_user (500 million rows). In Druid, that means the broker must coordinate across hundreds of segment servers, shipping hundreds of millions of dimension-table rows to each server for the join, or shuffling fact-table data to co-locate with dimension data. Either way, the latency blows up from 50ms to 30+ seconds — worse than Snowflake.

The wide table takes the opposite approach: at ingestion time, you flatten every dimension attribute into the fact row. When campaign_id appears in an event, you look up campaign_name, advertiser_id, and campaign_type at ingest time and write all of them directly into the fact row. At query time, there is no join — the query engine scans a single table's segments, filters by any combination of dimensions, and returns results. This trades storage (roughly 5–10× larger than a normalized star schema for the same data) for sub-second query performance.

Who handles joins better? ClickHouse and StarRocks (Apache Doris) have significantly better join support than Druid or Pinot. ClickHouse can broadcast small dimension tables to all nodes for fast lookups; StarRocks has first-class MPP shuffle joins. If your use case genuinely requires joins between large fact tables, prefer StarRocks or ClickHouse over Druid or Pinot. But even then, pre-denormalizing where possible reduces query latency by an order of magnitude.

Dimension Refresh and Slowly Changing Dimensions

A user upgrades from the free plan to the premium plan. A campaign gets a new name. A country gets renamed. These are dimension changes — facts about a user, campaign, or place that change occasionally over time, not on every event. The data-warehousing world has a name for these: Slowly Changing Dimensions (SCDs). The interesting question is: when a user's segment changes from FREE to PREMIUM, what should happen to past events that already mentioned them as FREE? Two approaches exist, and only one of them works well in real-time OLAP.

The SCD diagram above shows the critical insight about slowly changing dimensions. When a user moves from segment "FREE" to segment "PREMIUM", what does that mean for historical impressions? The wide-table pattern gives you a clean answer: each event row contains the user's segment at the time of the event. Impression A happened when the user was FREE — it will always show segment=FREE. Impression B happened after the upgrade — it shows PREMIUM. This is the correct approach for most analytical questions ("what was the conversion rate among FREE users last month?") and requires zero retroactive updates to stored segments.

The alternative — retroactively updating all historical rows when a dimension changes — requires rebuilding Druid or Pinot segments, which is expensive and can cause query inconsistency during the rebuild window. Pinot supports upsert tables for mutable data (introduced around v0.7 and stabilized through subsequent releases), which is valuable for use cases where you genuinely need to update facts (e.g., order status changes). Druid added SQL-style REPLACE semantics through its multi-stage query (MSQ) engine in v25 and later. But for dimension changes, the as-of snapshot is almost always the right choice.

Data modeling for real-time OLAP inverts the wisdom of batch OLAP warehousing: denormalize aggressively into wide tables (200–500 columns) to eliminate joins and maximize per-segment parallelism. Star and snowflake schemas that work beautifully in Snowflake or BigQuery become performance disasters in Druid and Pinot, where every join requires cross-segment coordination. Capture slowly changing dimensions as "as-of" snapshots baked into each event row at ingestion time, rather than retroactively updating historical segments.

Section 15

The Cardinality Problem — Why It Determines Everything

Before you add a dimension column to your real-time OLAP schema, ask one question: how many distinct values can this column hold? Mathematicians and database folks call this count the column's cardinality. A "country" column has cardinality ~200 (one per country in the world); a "user_id" column has cardinality in the hundreds of millions. That single number determines whether your bitmap indexes work, whether rollup compresses your data, whether your queries run in 50ms or 30 seconds, and whether your cluster costs $5,000 or $50,000 per month. Getting cardinality wrong is the single most common cause of real-time OLAP performance disasters in production.

The chart above shows the non-linear relationship between cardinality and query latency. Columns with 10–1,000 distinct values (country, device_type, status) sit in the safe zone: bitmap indexes compress to a fraction of a bit per row, and rollup dramatically reduces row count. As cardinality climbs into the millions (user_id, session_id), bitmap indexes balloon in size — a bitmap index for 100 million user_ids requires a 100-million-bit vector per segment just to represent one column — and rollup cannot compress because every row has a unique combination of dimensions. Latency climbs from milliseconds to seconds to tens of seconds.

Why Bitmap Indexes Break at High Cardinality

A bitmap index for a column works like this: for each distinct value, the engine stores a bitmask with one bit per row. If you have 200 rows and the column "country" has 5 distinct values, you get 5 bitmasks of 200 bits each — trivially small. A query filtering WHERE country = 'US' loads one 200-bit bitmask and ANDs it with other filter bitmasks in microseconds.

Now imagine "user_id" with 100 million distinct values across 1 billion rows. The engine would need 100 million bitmasks of 1 billion bits each — that's roughly 12 petabytes of index data for a single column. Even with roaring bitmaps (a compressed bitmap format), the index for high-cardinality columns becomes larger than the data itself, slower to load, and provides zero filtering benefit (nearly every row has a unique value, so no bitmap narrows the candidate set meaningfully).

Solutions for High-Cardinality Dimensions

The diagram above shows two techniques that work in practice. The bucket-then-aggregate pattern replaces a high-cardinality user_id (100 million distinct values) with a low-cardinality bucket column (user_id modulo 1,000 = 1,000 distinct values). Per-bucket event counts can then be queried in milliseconds using bitmap indexes, and the application re-aggregates bucket results. The HLL sketch approach — used when you need a distinct count (how many unique users saw this ad?) — stores a 12KB probabilistic summary instead of a full set of user IDs. HLL provides a count-distinct estimate accurate to about 2% using fixed memory regardless of the actual count, making "how many unique users this week?" a single fast scan over pre-computed sketch columns.

The cardinality audit you must do before building. Before designing your real-time OLAP schema, list every dimension column and its cardinality. For every column above 1 million distinct values: (1) can you bucket it? (2) can you use HLL for count-distinct? (3) do you actually need this column as a dimension, or is it only needed as a filter on a specific known value (in which case, consider a lookup approach instead of indexing)? A misconfigured high-cardinality column that gets bitmap-indexed can increase query latency by 100–600× — a disaster you won't discover until production.

Cardinality is the silent killer of real-time OLAP performance. Columns with fewer than 1 million distinct values (country, device, status, zip code) are safe for bitmap indexes and rollup compression. Columns with millions or billions of distinct values (user_id, session_id, request_id) explode bitmap index sizes and defeat rollup compression, turning 50ms queries into 30-second timeouts. The solutions — HLL sketches for distinct count, bucket columns for group-by analysis, materialized aggregates for common patterns — all require cardinality-aware thinking at schema design time, not as a retrofit.

Section 16

Joins in Real-Time OLAP — Why Druid Hates Them and ClickHouse Tolerates Them

Joins are the backbone of relational analytics. In traditional SQL, you join your orders table to your customers table to get "orders with customer name" — obvious and cheap. In real-time OLAP with billions of rows distributed across hundreds of segment servers, a join requires either broadcasting the smaller table to every server or shuffling rows of both tables across the network to co-locate matching rows. Both options are expensive. Understanding which engine handles which join type — and when to avoid joins entirely — is critical for building fast real-time OLAP systems.

The diagram above shows the two join strategies used in distributed systems. In a broadcast join, the smaller table (a dimension table like dim_campaign with 10 million rows) is sent to every segment server. Each server then performs a local join against its slice of the fact table without any cross-server communication. This is fast as long as the broadcast table fits in each server's memory — typically fine for dimension tables under 100MB. In a shuffle join, rows from both tables are redistributed across nodes by their join key hash — all rows with order_id=42 end up on the same node, from both fact tables. This enables large-to-large joins at the cost of massive network I/O, and is only practical in engines designed for it (StarRocks, Apache Doris).

Engine-by-Engine Join Support

Druid was designed from day one as a single-table scan engine. Joins were explicitly not supported until v0.18, which added lookup-based left joins — essentially the broadcast join pattern where a small dimension table is pre-loaded into every historical node. Druid v25+ added the multi-stage query (MSQ) engine, which enables SQL joins by coordinating across workers in a map-reduce style. MSQ joins are significantly slower than native single-table scans — use them only when denormalization is genuinely impossible. The production recommendation for Druid remains: denormalize at ingestion, avoid joins. If you need complex joins in Druid today, you will be fighting the grain of the engine.

Druid join gotcha: Even lookup-based joins slow Druid queries significantly when the lookup table is large. Keep lookups under 10 million rows; for anything larger, denormalize.

Pinot's join story is similar to Druid's. The core engine is optimized for single-table queries, and the lookup join pattern (small dimension tables broadcast to all server nodes) works well for dimensions under ~100 million rows. Pinot v0.11+ introduced the multi-stage query engine (MSE) which adds proper distributed join support — fact-to-fact joins become possible at the cost of increased latency and compute. For user-facing dashboards with strict sub-100ms SLOs, Pinot's best practice is wide-table denormalization + lookup joins for small dimensions only.

ClickHouse has full SQL join support and handles broadcast joins well — when you join a large fact table against a small dimension table (up to a few hundred million rows), ClickHouse broadcasts the small table to all query nodes and performs local hash joins efficiently. Large-to-large joins (both tables with billions of rows) are slower and may require query-level tuning. ClickHouse's JOIN types include hash join (default, good for small right-side table), sort-merge join, and full join; you select via SETTINGS join_algorithm. For ClickHouse use cases, the recommendation is: broadcast join for small dimensions (fast), avoid large-to-large joins or pre-denormalize.

-- ClickHouse: force broadcast join for small dimension table SELECT e.campaign_id, c.campaign_name, count() AS impressions FROM events e INNER JOIN campaigns c ON e.campaign_id = c.id -- campaigns is small: broadcast works WHERE e.event_time >= now() - INTERVAL 1 HOUR GROUP BY e.campaign_id, c.campaign_name SETTINGS join_algorithm = 'hash' -- hash join = broadcast small table

The query above shows ClickHouse joining a large events table against a small campaigns dimension. The join_algorithm = 'hash' setting tells ClickHouse to load the campaigns table entirely into memory on each node and perform a local hash join — the broadcast pattern. This is fast (typically under 200ms even on billions of events rows) as long as the campaigns table fits in memory per node.

StarRocks (and its sibling Apache Doris) are the real-time OLAP engines with the best join support, because they were designed as distributed MPP (massively parallel processing) databases from the start — more like a fast distributed relational database than a purpose-built OLAP engine. StarRocks supports broadcast joins, shuffle joins (for large-to-large), bucket shuffle joins (for co-located tables), and colocate joins (when both tables are partitioned by the same key on the same set of nodes — zero network shuffle required). For use cases that genuinely need complex multi-table joins at real-time speeds, StarRocks or Doris is the right choice.

The decision heuristic: If your workload is single-table GROUP BY + filter (classic OLAP), use Druid or Pinot. If you need complex multi-table joins at real-time speeds, use StarRocks or ClickHouse. The tradeoff is that Druid/Pinot have better support for ultra-high-concurrency user-facing dashboards; StarRocks/ClickHouse have better SQL flexibility.

When to Denormalize vs. When to Join

The decision matrix above distills the join strategy to a simple lookup by situation. The consistent theme: every row of the matrix where Druid or Pinot appear recommends either denormalization or small lookup joins. Only when you have genuinely large-to-large join requirements — and you are willing to accept higher latency (typically 500ms–2s instead of 50ms–100ms) — does StarRocks or Doris enter the picture.

Joins are the Achilles heel of real-time OLAP. Druid and Pinot were designed for single-table scans and fight you at every join — use them only with denormalized wide tables or small broadcast lookups. ClickHouse handles broadcast joins well but struggles with large-to-large. StarRocks and Doris have first-class distributed join support via MPP shuffle, at the cost of higher latency than single-table Pinot/Druid. The practical rule: if your latency SLO is under 200ms, denormalize everything. If your queries genuinely require complex joins and you can tolerate 500ms+, StarRocks is your engine.

Section 17

Monitoring and Operating Real-Time OLAP

Running a real-time OLAP cluster in production is not a "set it and forget it" operation. The engines are complex distributed systems with many moving parts — broker routing, segment servers, ingestion pipelines, deep storage tiering. Knowing which metrics to watch, what thresholds to alert on, and how to diagnose latency regressions before users notice is the operational skill that separates a 99.9% uptime cluster from a firefighting nightmare.

The dashboard above shows the six metrics that matter most for real-time OLAP operations. Each panel has an alert threshold baked in — let's walk through the reasoning behind each one.

The Seven Operational Metrics (and Why Each One)

Query latency p50/p95/p99. This is the primary user-facing SLO. p50 tells you what the median user experiences. p99 tells you the worst 1% — for a user-facing dashboard with 10,000 users, the worst 1% is 100 users getting a bad experience every minute. Alert on growth rate, not just absolute values: if p99 was 50ms yesterday and is now 200ms, that trend needs investigation before it reaches 2,000ms. A sudden p99 jump usually means either a new query pattern, a bad segment (corrupted or oversized), a GC pause on a historical node, or concurrency saturation.

Ingestion lag. The time from an event being written to Kafka to the event being queryable in the OLAP engine. For real-time dashboards, this needs to be under 5 seconds; for anomaly detection, under 30 seconds. A lag spike to 60+ seconds usually means the ingestion pipeline hit a bottleneck: the realtime ingestion nodes are falling behind Kafka, the segment commit pipeline is slow, or a Kafka partition imbalance is starving some ingestion tasks. DLQ depth is a related metric: events that fail schema validation or size checks land in the DLQ; a growing DLQ means silent data loss.

Query error rate. A healthy cluster should have near-zero query errors (less than 0.1%). When error rate climbs above 1%, it is almost always one of: OOM (out of memory) errors on segment servers caused by queries that scan too much data, query timeouts (queries that exceed the configured timeout and are killed), or broker-to-server connectivity failures. An OOM cascade is particularly dangerous: if one server runs OOM and restarts, its segments must be re-loaded from deep storage or re-balanced from other servers, increasing load on remaining servers, potentially causing more OOMs.

Concurrent queries. Every real-time OLAP cluster has a maximum concurrent query limit configured per node (in Druid, druid.server.http.numThreads; in ClickHouse, max_concurrent_queries). When the in-flight query count approaches this limit, new queries start queueing. Queueing causes latency to grow exponentially — a queue of 10 at 100ms per query means the 10th queued query waits 1,000ms before it even starts executing. Alert when concurrency exceeds 80% of the configured maximum for more than 30 seconds.

Segment count and average segment size. Segments that are too small (under 50MB) hurt performance because the query coordinator must fan out to more servers and merge more partial results. Segments that are too large (over 1GB) hurt because a single slow disk read blocks more of a query. The sweet spot for Druid is 300–600MB per segment. If you see thousands of tiny segments (caused by high-frequency realtime commits that weren't compacted), run a compaction job.

Storage tier hit ratio. If a query touches segments in the hot tier (local SSD or memory), latency is 5–20ms per segment. If it falls to the warm or cold tier (HDD or object storage), latency is 50ms–2,000ms per segment. Target 95%+ of queries served fully from the hot tier. A drop below 90% means your hot tier is under-sized for the query access pattern — either grow the hot tier or age out data more aggressively.

Broker routing time. The broker is the entry point for all queries. If the broker is slow (routing time p99 above 10ms), it becomes a single-point bottleneck regardless of how fast the segment servers are. Broker slowness is usually caused by: too many segments to route against (high segment count), slow ZooKeeper/metadata queries, or insufficient broker CPU/memory. Scale brokers horizontally if routing time grows under load.

The latency vs. concurrency curve above shows the characteristic "hockey stick" shape of queued systems. While in-flight queries are within the cluster's thread pool capacity, latency stays flat — each query gets immediate service. Once queries start queueing (beyond 80% of max concurrent), every new query must wait for an earlier query to finish before starting. Latency grows exponentially. The critical insight: you must alert at 80% concurrent capacity, not when latency is already bad. By the time p99 latency is visibly degraded, you're already in the exponential part of the curve and user experience has suffered.

Operating a real-time OLAP cluster requires watching seven key metrics: query latency percentiles, ingestion lag, query error rate, concurrent query count, segment count and size, hot tier hit ratio, and broker routing time. Alert on growth rate trends and capacity thresholds rather than absolute values — the hockey-stick concurrency curve means that by the time latency looks bad, hundreds of users are already experiencing timeouts. Use native tooling (Druid Console, Pinot Console, ClickHouse system tables) and integrate with Datadog or Prometheus for cluster-wide visibility.

Section 18

Practice Exercises — Real-Time OLAP Reasoning

These five exercises are designed to build the reasoning muscles you need for real-time OLAP in interviews and on the job. Each one maps to a concept from S13–S17. Work through them yourself before reading the solution — the struggle is where the learning happens.

Exercise 1 — Dimension Index Selection

You are designing the schema for a real-time OLAP table for an event stream. Each event has four dimension columns:

country — 200 distinct values (US, GB, DE, …)
user_id — 10 million distinct values
device_type — 50 distinct values (mobile, desktop, tablet, …)
event_time — continuous timestamp

For each dimension, state (a) whether to use a bitmap index, an inverted index, or no index, and (b) why. Also flag any dimension that poses a cardinality risk.

Think about what cardinality level each column falls into (low/medium/high). Recall that bitmap indexes excel at low cardinality, inverted indexes at medium cardinality, and high cardinality breaks both. event_time is a special case — what kind of partitioning does it need?

Solution:

country (200 values) → bitmap index. Low cardinality, exactly the sweet spot for bitmaps. Each bitmap is 200 bits × row count — trivially small. Rollup can collapse millions of events per country per hour into pre-aggregated counts. Queries filtering WHERE country = 'US' execute in microseconds using the bitmap.
device_type (50 values) → bitmap index. Even lower cardinality than country. Same reasoning. Excellent bitmap compression and rollup potential.
user_id (10 million values) → NO bitmap index. Flag as cardinality risk. 10 million distinct values puts this firmly in the danger zone. A bitmap index would require 10 million bitmasks — exploding index size. Instead, if you need "events per user", use a sketch column (HLL for count-distinct) or materialize per-user aggregates. Never add user_id as an indexed dimension unless absolutely necessary. If you need to filter by a specific user_id (e.g., "show me all events for user 42"), use a range/inverted index with a specific equality filter, not a bitmap.
event_time → time-based segment partitioning, not a bitmap. event_time is continuous; bitmaps don't apply. Instead, the engine partitions data into time-granular segments (e.g., hourly Druid segments, daily Pinot segments). Queries with a time range filter automatically prune irrelevant segments — this is the primary performance mechanism for time-series queries.

Exercise 2 — Rollup Design for High-Volume Streams

Your system receives 1 billion events per day. Each event has columns: country (200 values), campaign_id (50,000 values), event_type (click or impression — 2 values), and event_time. You want to rollup events into 1-hour buckets to answer queries like "how many clicks per campaign per country per hour?"

Calculate: (a) What is the maximum number of unique rows in the rollup table per day? (b) What compression ratio does rollup provide over raw events? (c) What dimensions are safe to include in the rollup key?

The rollup table has one row per unique combination of (country, campaign_id, event_type, hour). Compute the Cartesian product of all distinct values × hours per day. Compare that to 1 billion raw events. For part (c), think about which dimensions have bounded cardinality.

Solution:

(a) Maximum rollup rows per day: The rollup key is (country, campaign_id, event_type, event_hour). Distinct combinations: 200 countries × 50,000 campaign IDs × 2 event types × 24 hours = 480 million unique rows maximum. In practice, not every campaign runs in every country every hour, so actual rollup rows will be much less — typically 5–30% of the theoretical maximum, meaning roughly 24–144 million rows per day in the rollup table.

(b) Compression ratio: Raw events = 1 billion rows per day. Rollup rows (theoretical max) = 480 million, practical = ~50 million. Compression ratio = 1,000,000,000 / 50,000,000 = ~20× reduction in row count. This translates directly to faster queries and lower storage cost. The more events per dimension combination, the better rollup compresses (a campaign with 100,000 events per hour becomes 2 rows — one for clicks, one for impressions).

(c) Safe dimensions for rollup key: country (200), campaign_id (50K), event_type (2), event_hour — all are bounded cardinality and fine. What you MUST NOT add to the rollup key: user_id (10M+), session_id, or request_id. Adding user_id to the rollup key would mean one row per user per campaign per hour — that's 1 billion users × 50K campaigns = 50 trillion possible rows, defeating the entire purpose of rollup. If you need per-user analysis, keep raw events in a separate high-cardinality table and use HLL sketches for distinct-user counts in the rollup.

Exercise 3 — Engine Selection for Three Scenarios

Choose between Apache Druid, Apache Pinot, ClickHouse, and StarRocks for each of the following scenarios. State your choice and the two most important reasons for it.

Scenario A: A user-facing dashboard feature for a social platform. Expected: 10,000 concurrent users, query latency SLO of <100ms, data freshness of <5 seconds, single-table GROUP BY queries.
Scenario B: An internal business intelligence tool for operations teams. Expected: 200 concurrent users, queries involve 3-table joins (orders + inventory + suppliers), latency SLO of <2 seconds, daily refreshed dimension tables.
Scenario C: Infrastructure metrics and alerting. Expected: 50 million metrics per minute, 30-day retention with hourly rollups after 1 day, complex time-series queries by host/service/region, query latency <500ms.

For A, what are the defining constraints? (User-facing + high concurrency + strict SLO.) For B, what feature matters most? (Multi-table joins.) For C, high ingest + time-series queries + tiered retention — which engine was originally designed for this pattern?

Solution:

Scenario A → Apache Pinot. Reasons: (1) Pinot was built by LinkedIn specifically for high-concurrency user-facing dashboards — its broker tier, pre-computed indexes, and realtime segments are optimized for exactly the 10,000 concurrent users × sub-100ms pattern. (2) Single-table GROUP BY queries with high concurrency is Pinot's core sweet spot; no joins means no penalty for Pinot's join limitations. Druid would also be acceptable but Pinot has slightly better story for user-facing latency SLOs.

Scenario B → StarRocks (or ClickHouse). Reasons: (1) Three-table joins are required — Druid and Pinot handle this poorly. StarRocks has first-class MPP distributed joins and handles multi-table SQL naturally. (2) 200 concurrent users and a 2-second SLO is relaxed enough for StarRocks's join overhead. ClickHouse is also acceptable if the tables are medium-sized and broadcast join strategy works (small dimension tables). Avoid Druid or Pinot here — the 3-table join requirement alone disqualifies them.

Scenario C → Apache Druid. Reasons: (1) Druid was designed for exactly this pattern — time-series metric data with tiered retention (hot segments in SSD for recent data, warm/cold segments in S3 for historical data) is a first-class Druid feature. The segment compaction pipeline handles rolling hourly rollups automatically. (2) Time-range queries by host/service/region with aggressive segment pruning is Druid's original design target. ClickHouse is a reasonable alternative (it also handles time-series and tiering), but Druid's native Kafka ingestion and rollup at ingestion time are the cleanest fit for this ingest volume.

Exercise 4 — Diagnosing a Latency Regression

A Druid cluster's p99 query latency has doubled from 120ms to 240ms over the past two hours. The cluster is serving a product analytics dashboard. No code was deployed. What are the three most likely causes, and what would you check first for each?

Think about the monitoring metrics from S17. What changed in the past two hours — was it concurrency? Ingestion patterns? Segment characteristics? A latency doubling with no deployment usually points to a resource-level change, not a code bug. What happens to Druid when many small segments accumulate? What happens when a node runs low on memory?

Solution — the three most likely causes:

Cause 1: Segment count explosion (most common). Druid accumulates small realtime segments as data is ingested. If compaction is not running, segment count grows continuously. More segments = more fan-out from broker to historical nodes = more inter-process communication overhead per query. What to check: Druid Console → Segments tab → sort by time. If you see thousands of segments under 50MB each, compaction is lagging. Check the compaction task queue. Fix: trigger a compaction task for the affected datasource.

Cause 2: Concurrency saturation. If dashboard usage grew in the past two hours (peak business hours, a new feature rollout), the cluster may be hitting its concurrent query limit. Queries queue and wait for free threads, adding latency. What to check: Druid metrics → concurrent queries metric. Compare current concurrency vs. two hours ago. If concurrency grew significantly and is now above 70-80% of max_concurrent_queries, the cluster is queuing. Fix: scale out historical/broker nodes or reduce per-query resource limits to allow more concurrency.

Cause 3: Hot tier fill-up causing cold segment reads. If ingestion volume increased (e.g., a batch of historical data was loaded or a new high-volume event type was added), the hot tier (local SSD) may have filled up, forcing recent data to spill to warm/cold tiers. Queries that should hit SSD are now hitting HDD or object storage, causing 10-100× higher segment scan latency. What to check: Hot tier hit ratio metric. If it dropped from 97% to 85%, this is the cause. Fix: increase hot tier capacity, or adjust tiering policy to age out older segments faster.

Exercise 5 — Designing a Tiered Storage Architecture

You are operating a real-time metrics platform with the following retention requirements:

Last 24 hours: Full raw events at 1-second granularity, all dimension columns, sub-100ms query latency required.
Days 2–30: Hourly rollups (aggregate per host/service/region per hour), query latency under 500ms.
Months 2–12: Daily rollups (aggregate per service per day), query latency under 2 seconds.

Design the tiering strategy: (a) which storage tier for each retention window, (b) how to implement the rollup transitions, and (c) what the estimated storage reduction is at each tier compared to raw events.

Map the three retention windows to hot (local SSD/memory), warm (HDD or fast object storage), and cold (cheap object storage). For rollup transitions, think about what happens to Druid segments over time — what process moves them from hot to warm? For storage estimation, think about how many raw 1-second events collapse into one hourly rollup row for a given host+service+region combination.

Solution:

(a) Tier mapping:

Last 24 hours → Hot tier (local NVMe SSD or memory-mapped). Sub-100ms SLO requires segments to be resident in fast local storage. This is the most expensive tier — size it to hold 24 hours of raw events. For a 50M events/minute stream, 24 hours = ~72 billion events × ~100 bytes/event ≈ 7.2TB of raw data; after columnar compression and rollup at ingestion, typically 300GB–1.5TB on disk.
Days 2–30 → Warm tier (local HDD or EBS/persistent disk near the cluster). Hourly rollups are much smaller than raw events; 500ms SLO is achievable from local HDD or fast persistent block storage. Do not use object storage (S3/GCS) for this tier — the object-storage access latency (10–100ms per segment fetch) would blow the 500ms SLO for complex queries.
Months 2–12 → Cold tier (object storage: S3, GCS, or Azure Blob). Daily rollups accessed infrequently; 2-second SLO is achievable from object storage for simple queries. Cold tier storage is 10–100× cheaper than SSD. This is where the economics of real-time OLAP systems become attractive at scale.

(b) Rollup transitions: In Druid, configure segment tiering rules in the Coordinator: "segments older than 24 hours move to tier=warm, segments older than 30 days move to tier=cold". Run a separate compaction task that processes day-old segments and produces hourly-aggregated segments (one row per host/service/region/hour replacing potentially millions of raw rows). A second compaction task processes 30-day-old segments and produces daily-aggregated segments. In Pinot, use the retention manager plus a separate offline table for pre-aggregated historical data ingested via batch job.

(c) Storage estimation per tier: Assuming 100 hosts × 50 services × 10 regions = 50,000 unique host/service/region combinations. Raw events: ~50M events/minute → 72 billion events/day. Hourly rollup: 50,000 combinations × 24 hours = 1.2 million rows/day — a 60,000× compression ratio vs. raw. Daily rollup: 50,000 combinations × 1 = 50,000 rows/day — a 1.44 million× compression ratio vs. raw. The cold tier is essentially free to operate compared to the hot tier despite covering 11 months of history.

The five exercises above cover the full reasoning arc for real-time OLAP: index selection by cardinality, rollup design with compression math, engine selection by use case constraints, operational diagnosis of latency regressions, and tiered storage architecture. Working through them hands-on builds the intuition to tackle novel real-time analytics problems from first principles — mapping requirements to the right engine, schema, and operational strategy without memorizing rules.

Section 19

Bug Studies — When Real-Time Analytics Goes Wrong

Real bugs teach more than theory ever will. These four production incidents are reconstructed from real patterns that engineering teams hit repeatedly. None of them required exotic hardware failures or freak events — each one was a straightforward engineering mistake that looked reasonable at design time and only revealed its cost at scale.

Bug #1 — Adding session_id as a Dimension Destroyed the Bitmap Index

Incident: An analytics team added session_id as a dimension column to enable "per-session funnel analysis". Within 48 hours of deploying to production, segment sizes ballooned from 200 MB to 8 GB per segment. Dashboard query latency jumped from 100 ms to 30 seconds. The ingestion pipeline fell behind by 4 hours before the team realised what had happened.

What Went Wrong

Real-time OLAP engines like Druid and Pinot use bitmap indexes on dimension columns to accelerate filtering and grouping. A bitmap index has one bit-vector per distinct value — for a country column with 200 countries that is perfectly fine (200 bit-vectors, trivially small). But session_id is a UUID: every session is unique. In a data set with 1 billion events per day, that means 1 billion distinct values, and therefore 1 billion bit-vectors. The engine has to materialise an index entry for every single session, and almost every bit-vector is zero everywhere except one row. The index becomes larger than the actual data and provides zero compression benefit.

The math: a bitmap index over a column with cardinality C in a segment of N rows costs roughly C × N / 8 bytes. At C = 1 billion and N = 100 million rows per segment, that is ~12 TB per segment for the index alone — obviously impossible. Even at realistic cardinalities (C = 1 million), the index balloons the segment from 200 MB to 8 GB.

The SVG shows the core problem: a bitmap index on country (200 values) creates 200 compact, compressible bit-vectors — excellent. A bitmap index on session_id (1 billion unique values) creates 1 billion bit-vectors, each with only one bit set — no compression is possible and the index dwarfs the data itself.

-- DANGEROUS: session_id is UUID — cardinality = number of events -- Every dimension column gets a bitmap index automatically CREATE TABLE impressions ( ts TIMESTAMP, campaign_id VARCHAR(64), -- OK, ~10K campaigns country VARCHAR(8), -- OK, ~200 countries session_id VARCHAR(36), -- 💥 1B+ distinct values per day clicked TINYINT ) -- Result: segment size 200MB → 8GB, query latency 100ms → 30s

-- Move session_id OUT of dimensions, into a metric or exclude it -- If you need funnel analysis, pre-aggregate at session level upstream CREATE TABLE impressions ( ts TIMESTAMP, campaign_id VARCHAR(64), -- dimension ✓ country VARCHAR(8), -- dimension ✓ -- session_id removed from dimensions — use a metric instead sessions BIGINT, -- pre-aggregated: distinct session count clicked TINYINT ) -- Segment size back to 200MB, query latency back to 100ms ✓

Lesson: Every dimension column in a real-time OLAP schema gets a bitmap index — automatically, with no warning. Before adding any dimension, ask: "How many distinct values will this have?" If the answer is more than ~100,000, either pre-aggregate it upstream (turn it into a metric), hash it into buckets, or exclude it from dimensions entirely.

Bug #2 — Dashboard Showed 2-Hour-Stale Data and Nobody Noticed for a Week

Incident: During a major marketing campaign launch, inbound event volume tripled. The real-time ingestion pipeline quietly fell behind. The OLAP engine kept serving dashboards — from data that was 2 hours stale. Product managers made campaign decisions based on "real-time" data that was reflecting state from before the campaign launched. The lag was discovered only when a user noticed that click counts from an email were not appearing.

What Went Wrong

The ingestion pipeline consumed from Kafka and loaded data into the engine. Kafka topics have a consumer lag metric that accurately reports how far behind a consumer is. The team had not set up any alert on this metric. The OLAP engine continued to answer queries — it just answered them with whatever data it had already ingested. It did not signal "my data is 2 hours old". From the dashboard user's perspective, charts still refreshed every 30 seconds; the numbers just quietly stopped moving.

This is a silent failure mode. Unlike a database crash or a 500 error, data staleness is invisible unless you explicitly monitor for it. A dashboard that shows stale data looks exactly like a healthy dashboard — until someone checks the timestamps.

The chart shows both offsets tracking together during normal load. When the campaign launches (red dashed line), the Kafka offset spikes because event production triples. The ingestion pipeline can't keep up and the two lines diverge — the gap is the staleness that dashboards silently serve to users.

# No ingestion lag monitoring — silent failure # Druid/Pinot supervisor config had no lag threshold alert # No Kafka consumer group lag alert in the monitoring stack # All dashboards showed "Last updated: just now" # because the UI used the browser clock, not the data timestamp

# Prometheus alert on Kafka consumer group lag - alert: IngestionLagTooHigh expr: kafka_consumer_group_lag{group="druid-realtime"} > 100000 for: 5m labels: severity: warning annotations: summary: "Druid ingestion is lagging — dashboards may show stale data" description: "Lag is {{ $value }} events behind" # Also: show data freshness timestamp on every dashboard panel # so users can see "Data as of: 14:22 (2 min ago)" instead of trusting a clock

Lesson: Real-time analytics is only real-time if you can prove it is. Monitor Kafka consumer group lag and alert when it exceeds a threshold meaningful for your SLA. Display a "data as of" timestamp on every dashboard panel — let users see exactly how fresh the data is rather than assuming. Silence in real-time pipelines is almost always stale data, not success.

Bug #3 — Removing Rollup for "Better Drill-Down" Made Storage 50× Bigger and Queries 20× Slower

Incident: A product manager requested drill-down to individual event level on dashboards. An engineer disabled rollup/pre-aggregation and switched to storing raw events. Storage for 90 days of data grew from 400 GB to 20 TB. Dashboard p99 query latency went from 800 ms to 18 seconds. After two months, a data analysis confirmed that nobody had ever used the drill-down feature.

What Went Wrong

Rollup is one of the most powerful features of real-time OLAP engines. Instead of storing one row per event (1 billion events/day = 1 billion rows), you pre-aggregate at ingestion time: group events by their dimension values within a time bucket (e.g., 1 minute), sum the metrics, store one row per group. For a dataset with 1 billion events/day and 10,000 unique dimension combinations, rollup might reduce 1 billion rows to 1 million rows — a 1000× reduction in storage and a 1000× reduction in query scan work.

The trade-off is that once you roll up, individual event identity is gone. You cannot ask "show me the exact sequence of clicks for user X" — only aggregated totals. For 99% of dashboard queries, that is exactly what you want. The rare need for event-level drill-down should be served by a separate raw event log (cheap S3-backed storage), not by removing rollup from the real-time OLAP layer.

-- Raw event table — 1 row per event — 1 billion rows/day -- Druid/Pinot schema with rollup DISABLED { "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "NONE", -- 💥 no time-bucketing = raw rows "rollup": false -- 💥 disabled } } -- Storage: 20 TB for 90 days -- Query: SELECT SUM(clicks) ... GROUP BY campaign, country → 18s p99

-- Rollup enabled: 1 row per (campaign, country, minute) { "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "MINUTE", -- ✓ 1-minute time buckets "rollup": true -- ✓ pre-aggregate at ingest } } -- 1B events/day × ~10K dimension combos → ~10M rows/day -- Storage: 400 GB for 90 days (50× smaller) -- Query: SUM(clicks) now scans 10M pre-agg rows → 800ms p99 ✓ -- For event-level drill-down: read from cheap S3 raw log, not OLAP

Lesson: Never disable rollup to enable "drill-down" in a real-time OLAP engine. Keep rollup on and route event-level drill-down queries to a cheap raw event log (Parquet on S3, BigQuery coldline). Before any schema change that removes a performance feature, estimate the storage and query cost impact. And measure whether the new capability is actually used before paying the ongoing operational cost.

Bug #4 — Hot Tier Sized for Average Traffic; Flash Sale Sent p99 from 80 ms to 8,000 ms

Incident: A retail analytics platform used a tiered storage architecture: last 3 days in hot tier (local NVMe SSD), 4-90 days in warm tier (S3-backed). The system was sized and load-tested at average query traffic. A holiday flash sale doubled dashboard traffic overnight. Hot-tier historical nodes ran out of memory. Segments that should have been served from SSD were evicted and the engine fell through to the S3-backed warm tier. Dashboard p99 latency went from 80 ms to 8,000 ms. The on-call team didn't have a runbook for hot-tier eviction.

What Went Wrong

Real-time OLAP engines like Druid and Pinot use tiered storage where the hot tier keeps frequently-queried segments in local fast storage. Hot-tier nodes have a finite amount of memory and disk. When traffic spikes and more segments are needed simultaneously than the hot tier can hold, the engine evicts the least-recently-used segments to S3. Queries that hit evicted segments now have to read from S3 — which can be 100× slower than local NVMe for small random reads.

The bug has two components: (a) the hot tier was not sized for peak traffic, only average traffic; (b) there was no alert when hot-tier eviction was happening, so the team did not know to add capacity until p99 had already spiked. Hot-tier eviction is one of the most common causes of surprise latency spikes in production OLAP deployments.

The diagram shows the normal (green) path — broker routes queries to the hot-tier NVMe nodes and returns in 80 ms. During the flash sale (orange box), hot-tier nodes run out of memory, evict segments to S3, and the next query for those segments falls through to the warm tier — a 100× latency increase from a storage tier change, not from any query change.

# No alert on hot-tier eviction rate # System sized for avg_qps, not peak_qps # No runbook for "add hot-tier capacity fast" # Hot tier node memory: 64 GB # Normal working set: 48 GB (fits) # Flash-sale working set: 110 GB (eviction inevitable)

# 1. Size hot tier for peak, not average (2× headroom minimum) hot_tier_nodes: 6 # was 3 — doubled for peak segment_cache_size: 80GB # per node # 2. Alert on eviction rate - alert: HotTierEvictionHigh expr: druid_historical_segment_evictions_total > 50 for: 2m annotations: summary: "Hot-tier segments being evicted — p99 latency will spike" # 3. Pre-warm segments before expected traffic events # Druid: force-load segments via coordinator API before flash sale POST /druid/coordinator/v1/rules/{dataSource} {"type": "loadForever", "tieredReplicants": {"hot": 2}}

Lesson: Size hot-tier storage and memory for peak traffic, not average traffic — at least 2× headroom. Set up alerts on hot-tier eviction rate so you know before latency spikes, not after. For predictable traffic events (product launches, sales), pre-warm segments by loading them to the hot tier before the event starts. Hot-tier eviction is silent and immediate — by the time dashboards are slow, users are already frustrated.

These four bugs share a pattern: the failures were predictable from first principles (cardinality math, Kafka lag visibility, rollup trade-offs, storage tier sizing) but invisible at design time because the scale was not yet present to reveal them. The defences are: (1) audit every dimension column for cardinality before deploy, (2) monitor ingestion lag explicitly and display data freshness timestamps, (3) keep rollup enabled and route event-level drill-down to a separate raw store, (4) size the hot tier for peak load and alert on eviction.

Section 20

Real-World Architectures — How Big Companies Do Real-Time Analytics

The best way to understand real-time OLAP engine choices is to see the specific problems that drove each company to build or adopt a particular engine. These case studies show the architecture — not promotional numbers, but the engineering rationale and the data flow.

This is the canonical architecture shared by almost every large real-time analytics deployment. Events flow from source systems into Kafka, optionally through a stream processor for enrichment or windowing, then directly into the OLAP engine. Druid and Pinot have native Kafka consumers built in, so many deployments skip the separate stream processor entirely. Dashboards and internal APIs query the OLAP engine directly at sub-second speed.

LinkedIn — Apache Pinot ("Who Viewed Your Profile")

LinkedIn created Apache Pinot to solve a very specific problem: user-facing analytics. "Who Viewed Your Profile" and similar features need to show each user their own real-time stats — profile views, post impressions, follower growth — updated within seconds, while hundreds of millions of users are checking simultaneously. This is fundamentally different from internal analytics dashboards: every query is for a different user's data, concurrency is extreme, and latency must be imperceptibly fast (sub-100 ms) or the feature feels broken.

LinkedIn built Pinot in-house because no existing system handled the combination of user-facing query latency (under 100 ms), high concurrency (tens of thousands of queries per second from end-user traffic), and fresh data (seconds old). They open-sourced Pinot in 2015. The system feeds Kafka events through Pinot's real-time ingestion, which partitions data by user ID so each query only hits the shards relevant to that user — a design that dramatically reduces fan-out compared to a global scatter-gather approach.

Architecture highlight: Pinot uses a strategy. A query for user 12345's profile views only hits the one or two segment servers that hold user 12345's data, not every server in the cluster. This is what makes sub-100 ms latency achievable even at high concurrency.

Netflix — Apache Druid for Content + Operational Analytics

Netflix adopted Apache Druid for two distinct use cases: content performance analytics (how are shows performing by region, device type, time of day?) and operational analytics (are CDN nodes healthy? What are error rates by region and content type?). The operational analytics case is particularly interesting — it requires querying over real-time system telemetry at sub-second speed, which is more "observability" than traditional analytics.

Netflix runs Druid at multi-petabyte scale with what's called a Lambda architecture — the idea of running two parallel pipelines for the same data, one fast-and-approximate (the streaming path) and one slow-but-accurate (the batch path), then combining their answers so users see fresh numbers and trustworthy historical totals. Concretely: a real-time path (Kafka → Druid real-time ingestion) for fresh-but-approximate data, and a batch path (batch jobs → Druid historical segments) for accurate-but-lagged data. Druid merges the two paths transparently at query time. Netflix also built Mantis, their internal stream processing platform, which pre-processes events before they enter Druid — filtering, aggregating, and enriching at the stream layer to reduce the raw data volume that Druid has to index.

Uber — Pinot for Rider and Driver Real-Time Metrics

Uber uses Apache Pinot for internal real-time dashboards that track rider demand, driver supply, surge pricing effectiveness, and trip completion rates — all by city, time-of-day, and vehicle type, updated in near-real-time. The key challenge at Uber is geographic cardinality: the system needs to slice metrics by city (500+ cities), neighbourhood, and time simultaneously, which creates a large dimension space.

Uber built uReplicator, a custom Kafka-to-Kafka replication system that stabilises Pinot's Kafka consumption at scale. Standard Kafka consumer group rebalancing caused ingestion gaps at Uber's partition counts and throughput levels. uReplicator acts as a middleware that provides a stable, predictable feed to Pinot's real-time nodes.

Airbnb — Druid for Guest and Host Insights + A/B Testing

Airbnb uses Apache Druid to power several analytics products: host dashboard metrics (how many views did your listing get this week? what's your acceptance rate?), search ranking and relevance analytics, and A/B experiment result dashboards. The A/B testing use case is particularly demanding: Airbnb runs hundreds of experiments simultaneously, and experiment analysts need to see metric results broken down by experiment variant, user segment, and time period — updated fresh enough to make ship/no-ship decisions quickly.

Airbnb chose Druid over Pinot for this use case because their query patterns lean more toward multi-dimensional exploration (analysts exploring data, not users checking their personal stats) and Druid's mature operator tooling suited their data engineering team. They use Druid's virtual columns feature extensively to avoid re-ingesting data every time a new computed dimension is needed for an experiment.

ClickHouse — Product Analytics and Observability at Many Companies

ClickHouse has become the de-facto standard for high-throughput log analytics and observability stacks at scale — used at companies including Cloudflare (HTTP log analytics), eBay, Uber, and many others for internal product analytics (feature usage tracking, message send rates, API call volumes) and observability (log aggregation, error rate dashboards). Teams typically choose ClickHouse over Druid/Pinot for a few consistent reasons: simpler operational model (no separate real-time/historical node split, no external deep storage required for basic deployments), richer SQL support, and significantly easier schema management (Druid and Pinot require schema changes with ingestion restart; ClickHouse's ALTER TABLE ADD COLUMN is immediate).

A log analytics pipeline is a good example of ClickHouse's strength: structured log events (JSON) are parsed at write time into typed columns, then ClickHouse's MergeTree storage engine compresses them aggressively. Engineers query logs using SQL with full-text LIKE searches on the message column and aggregate over structured fields — a pattern that is awkward in Druid/Pinot but natural in ClickHouse.

The table summarises the engine-to-use-case mapping seen across these companies. Pinot wins when user-facing sub-100 ms latency is the primary requirement. Druid wins when multi-tenant internal dashboards with Lambda architecture maturity are needed. ClickHouse wins when simplicity, flexibility, and rich SQL matter more than extreme concurrency. StarRocks/Doris wins when join performance against lakehouse formats (Iceberg, Delta) is required. All four have managed cloud offerings that eliminate most operational complexity.

The case studies in this section all converge on the same data flow (Kafka → optional stream processor → OLAP engine → dashboard API) but choose different engines based on their primary constraint: Pinot for user-facing sub-100 ms latency (LinkedIn, Uber), Druid for multi-tenant Lambda architectures (Netflix, Airbnb), and ClickHouse for simplicity and SQL flexibility across many observability and product-analytics deployments. Several companies also contributed back — Pinot (LinkedIn), uReplicator (Uber), Mantis (Netflix) — which is how open-source real-time OLAP matured as a category.

Section 21

Common Misconceptions — What People Get Wrong About Real-Time Analytics

Real-time OLAP is one of those topics where confident-sounding incorrect beliefs are common — partly because the category is relatively new, partly because the engine names (Druid, Pinot, ClickHouse) are less familiar than Postgres or Redis. These are the most common wrong turns engineers take when approaching this space for the first time.

Real-time OLAP engines and data warehouses (Snowflake, BigQuery, Amazon Redshift) solve different parts of the analytics problem. They are not competitors — they are complements.

Data warehouses excel at long-running complex analytical queries: historical trend analysis, year-over-year comparisons, complex multi-table joins, ad-hoc exploration by data analysts. These queries can take seconds or minutes and that is acceptable — they run infrequently, are not user-facing, and scan enormous amounts of data.

Real-time OLAP engines excel at a narrow but critical requirement: sub-second queries over fresh data under high concurrency. They are optimised for the live dashboard case, not the "run a six-table join over 3 years of transaction history" case.

The modern data stack typically uses both: Kafka → real-time OLAP engine for live dashboards → also → batch ETL → data warehouse for deep historical analysis. Removing the warehouse to save cost is a common and painful mistake — the real-time OLAP engine will gradually accumulate historical data, its operational cost will rise, and it will never perform as well on the batch analytical queries the warehouse was handling.

Both are distributed real-time OLAP engines with time-based segment partitioning, bitmap indexes, and Kafka ingestion. But their design centres are different enough to matter at production scale.

Apache Pinot was designed from the start for user-facing analytics at LinkedIn — where p99 latency must be under 100 ms even at tens of thousands of queries per second. Pinot's broker routing is partition-aware (routes to only relevant segment servers), its indexing includes sorted indexes and range indexes beyond simple bitmaps, and its latency SLOs at the design level are stricter than Druid's.

Apache Druid was designed for multi-tenant internal analytics with a Lambda architecture — supporting both a real-time path and a batch historical path that merge seamlessly. Druid's historical node system is more mature for managing petabytes of historical segments, its coordinator is more battle-tested for large-scale multi-tenant cluster management, and its operator community is larger and more experienced.

Choosing between them: if you need user-facing sub-100 ms latency, Pinot is the stronger choice. If you need mature multi-tenant cluster management and Lambda architecture support, Druid is the stronger choice. For most new deployments, ClickHouse is worth considering first because it avoids the operational complexity of both.

ClickHouse is simpler to operate than Druid or Pinot and is an excellent choice for a large fraction of real-time analytics use cases. But "good enough" has limits.

ClickHouse uses a scatter-gather model at the server level. At very high query concurrency (500+ simultaneous queries) over a very large cluster, ClickHouse's per-query CPU cost per server starts adding up. Druid and Pinot's segment-level fan-out parallelism (a single query hits many small segment servers, each returning a tiny partial aggregate) scales more gracefully for this specific extreme-concurrency pattern.

ClickHouse also does not have Druid/Pinot's tiered storage model built in by default (though ClickHouse Cloud and recent versions add cold storage support). For deployments where hot/warm/cold tier management over multiple petabytes is a first-class requirement, Druid/Pinot have more mature solutions.

That said: if your use case is internal analytics dashboards, product metrics, log analysis, or operational dashboards — and your query concurrency is in the hundreds rather than thousands — ClickHouse is almost certainly the right choice. It has richer SQL, simpler operations, and faster development iteration than either Druid or Pinot.

Both engines have improved join support in recent versions — Druid added hash joins in 0.20, Pinot added broadcast join support. But "improved" does not mean "fast at scale".

Joins in Druid and Pinot work well when one side of the join is small enough to broadcast to all segment servers (typically under a few hundred MB). A lookup join — "join this event table to a dimension table of 10,000 campaigns" — works acceptably because the dimension table fits in memory everywhere.

What remains slow is joining two large fact tables. There is no efficient distributed shuffle join in either Druid or Pinot that approaches the performance of a native columnar scan on a pre-denormalised wide table. The correct long-term strategy is: denormalise at ingestion time. If you need to join campaign metadata to impression events, do it in the stream processor (Flink, Kafka Streams) before the events land in the OLAP engine. Do not rely on join performance in the OLAP query layer for production dashboards.

In OLTP databases (Postgres, MySQL), adding an index on a frequently-filtered column is almost always a win. Real-time OLAP engines are different: every index you add costs storage at ingestion time and slows down write throughput.

Druid and Pinot build bitmap indexes automatically for dimension columns and range indexes for metric columns with range predicates. Adding additional index types (inverted, sorted, text) accelerates specific query patterns but increases segment size. At scale, a 30% increase in segment size means 30% more data to read from S3, 30% more hot-tier memory needed, and 30% more ingestion CPU.

The discipline: start with the default indexes. Measure which queries are slow. Only add an index when you can measure the query improvement vs. the ingestion + storage cost. Many teams add every index type "just in case" and then wonder why segments are 3× the size they expected.

This is partially true and highly engine-dependent. ClickHouse handles both well — it has no architectural split between "real-time" and "historical" nodes. You can store years of data and query it at sub-second speed (for manageable table sizes) without any tier-management overhead.

Druid and Pinot are more complex here. Both support storing historical segments in deep storage (S3/HDFS) and loading them on demand, but queries that span large historical ranges must load many segments from object storage — which is much slower than querying hot-tier data. Complex batch analytical queries ("year-over-year trend for all campaigns from 2019–2024") may time out or be prohibitively slow on Druid/Pinot.

The practical answer: use the right tool for each job. Real-time OLAP engine for sub-second fresh dashboards (last 90 days hot/warm). Data warehouse for deep historical analysis. The operational cost of keeping 5 years of data in Druid at hot-tier performance levels vastly exceeds the cost of a warehouse query that takes 20 seconds.

HyperLogLog (HLL) is one of the most useful tools in real-time analytics — it estimates distinct counts (unique users, unique sessions, unique IPs) in constant memory regardless of the actual cardinality. Instead of storing every distinct value, an HLL sketch stores a compact probabilistic summary. The trade-off is accuracy: HLL has a standard error rate of roughly 1–2%.

For most analytics use cases, 1–2% error is invisible: "you had 4,917,200 unique visitors" versus "4,900,000 unique visitors" — neither is actionable at a different level than the other. But some use cases genuinely require exact counts:

Compliance and billing: "How many users accessed feature X?" for licensing audits or usage-based billing must be exact — an off-by-1% error at 10 million users is 100,000 users you either over- or under-charged.
Fraud detection: Counting distinct credit cards used in a session where the threshold is "flag if more than 3" — a 1% error on a small cardinality is a 100% wrong answer.
Legal reporting: GDPR data access counts, SOX audit trails — exact records required.

If your business has any of these requirements, use exact COUNT(DISTINCT) with a raw event store, not HLL on a pre-aggregated rollup. ClickHouse supports exact uniq() alongside HLL uniqHLL12() — pick deliberately based on the accuracy requirement of each specific metric.

The seven misconceptions break into three themes: tool scoping (real-time OLAP complements, not replaces, data warehouses; ClickHouse is not always "good enough" at extreme concurrency), engine differences (Druid and Pinot have meaningfully different design centres), and feature limits (join performance is still constrained; indexes have a cost; HLL is approximate). Understanding these prevents expensive architectural mistakes that are hard to undo once data is flowing at scale.

Section 22

Operational Playbook — Pick, Onboard, and Run a Real-Time OLAP Engine

Reading about real-time OLAP engines is the easy part. Choosing one, deploying it safely, and keeping it healthy in production is where most teams struggle. This five-stage playbook covers the decisions and tasks in the order you actually encounter them.

Stage 1 — Pick the Right Engine

The engine choice is the single most important and least reversible decision in this stack. Migrating from one engine to another at production scale is a multi-month engineering project. Get this right before you write the first ingestion job.

Use this decision framework:

Sub-100 ms user-facing queries at high concurrency (user dashboards, SaaS product analytics)? → Apache Pinot or managed StarTree. Pinot's partition-aware routing is purpose-built for this.
Multi-tenant internal analytics with Lambda architecture and mature cluster management? → Apache Druid or managed Imply. Druid's historical node management is the most battle-tested.
Simplicity + flexibility + rich SQL + moderate concurrency (under ~300 QPS)? → ClickHouse or ClickHouse Cloud. Fastest path from schema design to production query.
Join-heavy workloads over Iceberg/Delta/Hudi lakehouse tables? → StarRocks or Apache Doris. Both have native lakehouse catalog connectors and efficient distributed join execution.
API-first product where you want ClickHouse power without operating ClickHouse? → Tinybird (ClickHouse-as-an-API managed service).

Managed vs. self-hosted: Unless you have dedicated data infrastructure engineers and a specific reason to self-host, start with the managed offering. The operational complexity of running a multi-node Druid or Pinot cluster (deep storage configuration, coordinator tuning, segment compaction, ZooKeeper management) is significant. Managed services (Imply, StarTree, ClickHouse Cloud) remove most of this overhead at a cost premium that is usually smaller than the engineering time they save.

Stage 2 — Onboard: Schema, Pipeline, and Rollup Configuration

After choosing an engine, schema design is where most onboarding mistakes happen. Get this wrong and you pay for it forever (or until you rebuild the entire data set).

Step A — Design your schema as a wide table. Resist the normalisation instinct. Join campaign metadata, user properties, and event attributes into one wide row at ingestion time (in the stream processor). Real-time OLAP engines are designed for wide tables: 20–50 columns, with most of them being dimension columns that can be filtered and grouped.

Step B — Classify every column before writing the schema. For each column ask: "Will this be filtered/grouped in WHERE or GROUP BY clauses?" → dimension column. "Will this be summed/averaged/counted?" → metric column. "Is this a unique identifier per event (UUID, session_id)?" → NOT a dimension — exclude from indexing or treat as a metric.

Step C — Configure rollup early. Rollup is much harder to add retroactively than to configure from the start. Decide your rollup granularity (1 minute is usually a good default), confirm which metrics are additive (SUM), and which require approximate data structures (distinct counts → HLL). If in doubt, enable rollup at 1-minute granularity and keep a raw event log in a separate cheap store for drill-down access.

# Anti-pattern: UUID dimensions, no rollup, raw events datasource: impressions granularitySpec: rollup: false # storing every raw event queryGranularity: NONE # no time-bucketing dimensionsSpec: dimensions: - name: session_id # 💥 UUID — 1B+ distinct values type: string - name: user_agent # 💥 high cardinality — thousands of values type: string - name: country type: string

# Best practice: wide table, rollup on, cardinality-checked dimensions datasource: impressions granularitySpec: rollup: true queryGranularity: MINUTE # 1-minute rollup buckets segmentGranularity: DAY dimensionsSpec: dimensions: - name: campaign_id # ✓ ~10K distinct values type: string - name: country # ✓ ~200 values type: string - name: device_type # ✓ 5 values: mobile/desktop/tablet/tv/bot type: string - name: ad_format # ✓ ~20 values type: string # session_id removed — tracked via HLL sketch below metricsSpec: - name: impressions type: count - name: clicks type: longSum fieldName: clicked - name: unique_sessions_hll # ✓ approximate distinct session count type: HLLSketchBuild fieldName: session_id

Stage 3 — Test: Load, Cardinality, and Chaos

Real-time OLAP systems have failure modes that do not appear under synthetic test loads. Test with data that matches production in cardinality, not just volume.

Load test with production-like cardinality. Generate test data that has the same distribution of dimension values as production — same number of unique campaign IDs, same country distribution, same device type ratios. A load test with 10 synthetic campaign IDs will pass even with a fatally high-cardinality schema, because 10 IDs is a perfectly acceptable bitmap index. Only production-like cardinality will reveal the problems from Section 19.

Chaos test query broker failures. Kill a query broker node while load is running. The remaining brokers should absorb traffic; queries should return errors or reroute, not cause a cluster-wide stall. If your broker layer is under-provisioned, a single broker death under load can cause query queue buildup that takes minutes to drain.

Measure p99 latency under concurrency, not just average. Most dashboard SLOs care about p99 — the experience of the 1% of users whose queries take longest. A 200 ms average with a 4,000 ms p99 is a worse user experience than a 400 ms average with a 600 ms p99. Always report p99 (and p999 for user-facing products) in load test results.

Stage 4 — Monitor: The Five Signals That Matter

A real-time OLAP system in production has more moving parts than a typical OLTP database. The five signals below will catch 90% of production incidents before users notice:

Query p99 latency — the primary SLO metric. Alert when p99 exceeds your threshold for 2+ minutes. Causes: hot-tier eviction, cardinality explosion, broker under-provisioning.
Kafka consumer group lag — the staleness signal. Alert when lag exceeds 100,000 events (tune to your throughput). Causes: ingestion pipeline crash, schema validation failures silently dropping events, temporary throughput spike.
Segment-size distribution — a leading indicator of schema problems. If any segment grows 3× the expected size, a high-cardinality dimension was likely added. Catch this before it cascades to storage exhaustion.
Storage tier hit ratio — what fraction of query bytes are served from hot (fast) vs warm/cold (slow) storage. A sudden drop in hot-tier hit ratio predicts a p99 spike. Alert before the spike, not after.
Error rate — query errors (broker returns 500) and ingestion errors (schema validation failures, Kafka deserialization errors). A low but non-zero ingestion error rate is often data loss in disguise.

Stage 5 — Optimize: Cardinality, Materialized Views, and Tier Management

Optimization in real-time OLAP is a continuous activity, not a one-time task. The four levers you will use repeatedly:

Cardinality control. Periodically audit dimension cardinality in production. Run SELECT COUNT(DISTINCT dim) FROM table for each dimension column. Any column with cardinality over 100,000 that was not intentionally designed that way is a candidate for removal or pre-aggregation. Cardinality grows silently as data accumulates — a column that had 5,000 distinct values at launch can have 500,000 after a year.

Materialized views for hot queries. If 80% of your dashboard traffic queries the same set of dimensions and time ranges, create a materialized view (or a separate rollup datasource in Druid/Pinot) that pre-aggregates exactly those dimensions. The view is maintained at ingestion time at near-zero marginal cost, and queries against it are 10–100× faster than scanning the base table.

Rollup granularity tuning. As data ages, you can increase rollup granularity — roll up 1-minute data into 1-hour data for segments older than 30 days. This is called "compaction" in Druid. It dramatically reduces segment count and storage cost for historical data while keeping the data queryable.

Tier-down stale data. Configure storage tier policies so that segments older than your "hot" window automatically move to warm storage (larger, slower disk or object storage). Do not keep data in the hot tier indefinitely — it crowds out the recent data that actually needs fast access. Set automatic tier-down policies and validate that old queries (last 30–90 days) still complete within an acceptable latency budget from the warm tier.

-- Run this monthly to catch cardinality growth before it hurts -- Druid: use INFORMATION_SCHEMA or native segment metadata query SELECT "segment_id", "column_name", "cardinality" FROM sys.segments CROSS JOIN UNNEST(schema) AS t(column_name, cardinality) WHERE "dataSource" = 'impressions' AND column_name NOT IN ('__time', 'impressions', 'clicks') AND cardinality > 100000 ORDER BY cardinality DESC; -- Any row here is a potential index explosion — investigate and fix

{ "dataSource": "impressions", "taskPriority": 25, "inputSegmentSizeBytes": 419430400, "maxRowsPerSegment": 5000000, "granularitySpec": { "segmentGranularity": "DAY", "queryGranularity": "HOUR", "rollup": true }, "skipOffsetFromLatest": "P30D" } // Compacts segments older than 30 days, rolling up from MINUTE → HOUR // Reduces segment count by ~60× for historical data // Keeps recent data at MINUTE granularity for drill-down

The five-stage playbook — Pick, Onboard, Test, Monitor, Optimize — maps the sequence of decisions and tasks in the order a team actually encounters them. Picking the wrong engine is the least recoverable mistake; use the decision framework in Stage 1 seriously. Schema design (Stage 2) and cardinality control are the most common sources of production incidents. Monitoring the five key signals (Stage 4) catches most problems before users notice. Compaction and materialized views (Stage 5) are the primary levers for ongoing cost and latency optimization.

Section 23

Cheat Sheet & Glossary

Quick-reference cards for review and interview prep. Pin this section — it condenses the whole page into one-liners and clear definitions.

Core Rules — One Line Each

Fresh data + sub-second queries + ad-hoc OLAP + high concurrency — all four at once. That combination is what makes it hard.

Columnar stores only the columns a query touches; row stores read every column to reach any single value. For aggregation over 2 of 50 columns, columnar reads 4% of the data row storage would.

An immutable, time-bucketed chunk of data (typically one day) stored as columnar files with embedded bitmap indexes. The unit of replication, tiering, and parallelism in Druid/Pinot.

Wide table = join dimension data at write time, store everything in one wide row. Query scans one table. Star schema = normalised, requires joins at query time. For real-time OLAP, wide table wins every time.

HyperLogLog estimates COUNT(DISTINCT) in constant memory with ~1–2% error. Perfect for "daily active users" dashboards. Wrong for billing, compliance, or fraud where exact counts are required.

User-facing analytics requiring sub-100 ms p99 at high concurrency (LinkedIn-style "Who Viewed Your Profile", SaaS product dashboards). Partition-aware routing is the differentiator.

Multi-tenant internal analytics with Lambda architecture — mature historical node management, battle-tested coordinator, strong operator community.

Simplicity + flexibility + rich SQL for internal analytics, logs, and observability at moderate concurrency. Fastest path from schema to production query.

Every dimension column gets a bitmap index. A UUID dimension with 1 billion distinct values makes segments 40× larger and indexes useless. Audit cardinality before every schema deploy.

Glossary

Columnar storage: Stores each column's values together on disk rather than each row. Enables vectorised SIMD operations and compression per column type — critical for fast analytical scans.
Segment: An immutable, time-bucketed file group containing columnar data + indexes for a specific time range. Druid and Pinot distribute query work across segments in parallel.
Rollup: Pre-aggregating many raw events into fewer rows at ingestion time — grouping by dimension values within time buckets and summing metrics. Reduces storage and query scan work by 100–1,000×.
Materialized view: A pre-computed query result stored as a physical table, maintained automatically as new data arrives. Eliminates repeated base-table scans for common query patterns.
HyperLogLog (HLL): A probabilistic data structure for estimating COUNT(DISTINCT) in O(1) memory with ~1–2% standard error. Used in pre-aggregated schemas where exact distinct counts would be prohibitively expensive to store.
T-Digest: A data sketch for approximate percentile estimation (p50, p99, p999) that merges accurately at the tails where precision matters most. Used for latency percentile tracking in pre-aggregated schemas.
Broker: The query-routing node in Druid and Pinot. Receives SQL queries, routes sub-queries to the relevant historical/real-time nodes, merges partial results, and returns the final answer to the client.
Historical node: A Druid node that serves immutable historical segments loaded from deep storage. Responsible for query serving on older data. Contrasted with real-time nodes that serve fresh, actively-ingesting segments.
Deep storage: The durable, cheap object store (S3, GCS, HDFS) where Druid and Pinot persist all segment files. Deep storage is the source of truth; hot/warm nodes are just a cache on top of it.
Hot / warm / cold tier: Storage hierarchy: hot = recent segments on local NVMe SSD (fast, expensive); warm = less-recent segments on larger HDD or object storage (medium); cold = old segments on cheap S3 Glacier (slow, cheap). Tier-down policies move segments automatically as they age.
Denormalize: Joining dimension data (campaign metadata, user attributes) into the fact table at ingestion time to avoid joins at query time. The primary schema strategy for real-time OLAP — creates a "wide table".
Star schema: A normalised data warehouse schema with a central fact table joined to dimension tables (campaign, user, geography). Efficient for storage; inefficient for real-time OLAP queries that must join at query time.
Wide table: A denormalised fact table where dimension data has been merged in at ingestion time. Enables single-table scans without joins — the preferred schema pattern for real-time OLAP engines.
Broadcast join: A join strategy where the smaller table is copied to every node in the cluster so each node can join locally without network shuffle. Works for dimension tables under ~500 MB; breaks for large fact-to-fact joins.
Shuffle join: A distributed join where both tables are re-partitioned by join key across the cluster. Handles large tables but requires network transfer of data — expensive and slow in real-time OLAP engines; more suitable for batch processing frameworks like Spark.
Cardinality: The number of distinct values in a column. Low-cardinality columns (country: ~200) are ideal for bitmap indexes. High-cardinality columns (session_id: billions) destroy bitmap index efficiency and must be handled as metrics or excluded from indexing.

The cheat sheet distills the nine most important one-liners from the page — the mental models that matter most for both interviews and production decisions. The glossary covers 16 terms you will encounter in engine documentation, monitoring dashboards, and architecture discussions. Together they serve as a quick-review reference before an interview or when onboarding a teammate to the stack.

Real-Time Analytics — Sub-Second Insight From Live Data

TL;DR — Real-Time Analytics in Plain English

Why You Need This — The Live Dashboard Problem

The Story: An Ad-Tech Dashboard That Brought Down the Database

Migration 1: Snowflake — Faster, But a New Problem Emerges

Migration 2: Apache Druid — The Right Tool

Why These Numbers Matter in Practice

Mental Model — The Three Database Families

The Library Analogy

The One Trade-off Real-Time OLAP Makes

Core Concepts — The Real-Time Analytics Vocabulary

Storage Layer Terms

Query Layer Terms

Ingestion Layer Terms

Approximation Layer Terms

Columnar Storage — Why It's Foundational

Row-Oriented vs. Columnar: The Physical Byte Layout

Compression: Columnar Data Compresses Far Better Than Row Data

SIMD: CPU-Level Acceleration for Dense Arrays

The Trade-off: Columnar Storage Is Slow for Writes

Segments and Time-Based Partitioning — How Druid and Pinot Slice the Data

What a Segment Is

Segment Pruning: Skipping 99% of the Data

Query Routing: How the Broker Dispatches to Segments

Why Immutability Is a Feature, Not a Limitation

Indexing in Real-Time OLAP — Beyond Just B-Trees

Index Type 1: Bitmap Index — One Bit Per Row Per Value

Index Type 2: Inverted Index — Value → List of Rows

Index Type 3: Forward Index — Dictionary Encoding

Index Type 4: Star-Tree Index (Pinot) — Pre-Computed Multi-Dimensional Rollups

Index Type 5: Primary Key (Sparse) Index — ClickHouse MergeTree

Choosing the Right Index — Quick Guide

Pre-Aggregation & Rollups — Trade Storage for Query Speed

Rollup Tables — Collapsing Billions of Rows to Thousands

Materialized Views — On-the-Fly Rollup for ClickHouse and StarRocks

Approximate Queries — Trading 1% Accuracy for 10,000× Less Memory

Ingestion — Streaming, Batch, and the Lambda Heritage

Streaming Ingestion — Seconds from Event to Query

Batch Ingestion — Efficient Historical Loading

Segment Compaction — Merging Small Streaming Segments

Schema Evolution — Adding Columns Without Rebuilding History

Query Engine Architecture — Broker, Coordinator, Historical

Druid's Node Topology

Tail Latency and Speculative Execution

The Big 5 Real-Time OLAP Engines — Comparison

Engine Deep Dives

Apache Druid — The Battle-Tested Generalist

Apache Pinot — The Latency Champion

ClickHouse — The Raw Speed Specialist

StarRocks — The Join-Capable Lakehouse Engine

Apache Doris — The Broad SQL OLAP Platform

Storage Tiers — Hot, Warm, Cold, and Deep

The Three Tiers (Plus Deep Storage)

Retention Policy — What to Keep and How Long

The Cost Calculation — Why Tiering Makes Real-Time Analytics Economically Viable

Common Use Cases — Where Real-Time OLAP Shines

Data Modeling for Real-Time OLAP — Star, Snowflake, Wide-Table

Why Wide Tables Win in Real-Time OLAP

Dimension Refresh and Slowly Changing Dimensions

The Cardinality Problem — Why It Determines Everything

Why Bitmap Indexes Break at High Cardinality

Solutions for High-Cardinality Dimensions

Joins in Real-Time OLAP — Why Druid Hates Them and ClickHouse Tolerates Them

Engine-by-Engine Join Support

When to Denormalize vs. When to Join

Monitoring and Operating Real-Time OLAP

The Seven Operational Metrics (and Why Each One)

Practice Exercises — Real-Time OLAP Reasoning

Bug Studies — When Real-Time Analytics Goes Wrong

What Went Wrong

What Went Wrong

What Went Wrong

What Went Wrong

Real-World Architectures — How Big Companies Do Real-Time Analytics

Common Misconceptions — What People Get Wrong About Real-Time Analytics

Operational Playbook — Pick, Onboard, and Run a Real-Time OLAP Engine

Cheat Sheet & Glossary

Core Rules — One Line Each

Glossary

Related Topics — What to Study Next