Stream Processing — System Guide

Section 1

TL;DR — Stream Processing in Plain English

What stream processing actually IS — computing over infinite, unbounded data where you can never wait for the data to "finish"
Why batch processing breaks down the moment you need answers in seconds, not hours — and the exact math explaining the latency gap
The four hard problems that make streaming genuinely tricky: time skew, windowing, stateful aggregations, and exactly-once semantics under retries
How a stream processing program is really a dataflow graph — a chain of operators that runs forever, partitioned across workers
How Apache Flink, Kafka Streams, Spark Structured Streaming, and Apache Beam each make different bets on that latency-vs-correctness spectrum
Why event time and processing time diverge — and why that divergence is the root cause of the hardest bugs in streaming systems
What a watermark is and why it's how the system decides "we've seen enough, close the window"

Stream processing is the discipline of computing over data that never ends. There is no "load the file, run the query, print the results" loop — there is only a continuous river of events flowing in, and your job is to extract aggregates, detect patterns, and produce outputs while the river is still running. The four hard problems are: time (events arrive late and out of order, so which clock do you trust?), windowing (how do you carve infinity into computable chunks?), state (aggregations need memory across events — what happens when the worker crashes?), and exactly-once (retries under failures mean the same event might arrive twice — how do you count it only once?). Get those four right and you can build fraud detectors that fire in two seconds, dashboards that update live, and ML feature pipelines that never go stale.

In batch processing, you wait for data to accumulate, then run a job over it. The data has a clear start and end: last night's orders, last month's logs, last year's transactions. Stream processing throws that model away. Click events, sensor readings, payment authorisations, log lines — these never stop arriving. There is no "end of data". You must produce continuous results — counts, averages, anomaly alerts — while the data is still flowing, with latency in the milliseconds to seconds range rather than hours. The fundamental shift is from bounded data (finite, fits in a batch job) to unbounded data (infinite, grows forever, can't be loaded all at once).

Moving from batch to streaming isn't just "run the batch job more often." Four entirely new problems appear. First, time skew: an event's timestamp (when it happened) and the moment you see it (wall clock on the processor) differ, because mobile phones go offline, networks delay packets, and replay jobs re-deliver old events. Second, windowing: to compute "purchases per minute" you must slice the infinite stream into finite time buckets. Third, stateful aggregation: "count per user" means the operator must remember what it has seen so far, which requires durable, crash-safe state stores. Fourth, exactly-once semantics: when a worker crashes and restarts, it will re-process some events; you need to ensure your outputs count each event exactly once despite the retries. All four problems are hard. Solving all four together is what distinguishes Apache Flink from a weekend side project.

You already know Kafka — the distributed log where events land first. Stream processing is the compute layer that sits on top of Kafka (or Kinesis, or Pulsar). Events flow out of Kafka, into a stream processor, and the processor writes results to Kafka, a database, a dashboard, or an alerting system. Think of Kafka as the highway and stream processing as the toll booths doing real work on every passing car — continuously, at highway speed, forever. The major frameworks today are Apache Flink (stateful, event-time-native, production workhorse), Kafka Streams (a Java library embedded in your service, no separate cluster), Spark Structured Streaming (micro-batch model with DataFrame API), and Apache Beam (a unified programming model that runs on Flink, Spark, or Dataflow as a backend).

Stream processing computes over unbounded, never-ending data, producing results continuously in milliseconds rather than waiting hours for a batch job. The four core hard problems — time skew, windowing, stateful crash-safe aggregations, and exactly-once semantics — separate real streaming systems from naive polling loops. Stream processors sit between Kafka (the log) and your databases or dashboards, forming the compute backbone of real-time data pipelines.

Section 2

Why You Need This — When Batch Isn't Fast Enough

Before we build anything, let's prove that batch processing — the tool almost every team starts with — has a fundamental latency floor it cannot cross. Understanding why that floor exists makes every streaming design decision obvious.

The Story: A Fraud Detector That Fires at 9 AM

Imagine an e-commerce company. They process thousands of orders per hour. They have a fraud detection model — it looks at order value, shipping address, IP geolocation, and purchase history to flag suspicious orders. The data team built it as a Hadoop batch job. Every night at midnight the job runs, scanning all orders from the previous day. By 9 AM the next morning, the fraud report lands in the operations team's inbox.

The problem: A fraudulent order placed at 8 PM is not flagged until 9 AM the next morning — 13 hours later. The item shipped at 10 PM. The fraudster received the package at 7 AM. The fraud report lands 2 hours after delivery. The company absorbed the full loss of the transaction, plus the cost of dispute resolution. The batch job worked perfectly — it just ran 13 hours too late.

This isn't a software bug. It's an architectural ceiling. Batch processing has a formula for its latency that you can't escape:

Batch latency = data-collection-window + processing-time
The data-collection-window is how long you wait before starting the job (nightly = up to 24 hours, hourly = up to 1 hour). Processing-time is how long the job takes to run (minutes to hours for large datasets). You cannot reduce the total latency below the collection window without running the job more frequently — and at some frequency, "running the batch more often" becomes streaming.
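
To make the formula concrete, here is a minimal plain-Java sketch of the worst case: an event that lands just after a job starts waits one full collection window plus the job's run time. The class name, method name, and processing times are illustrative assumptions, not measurements from any real system.

```java
import java.time.Duration;

// Hypothetical helper illustrating: batch latency = collection window + processing time.
final class BatchLatency {
    // Worst case: the event just missed a run, so it waits a full collection
    // window for the next job to start, then the whole processing time.
    static Duration worstCase(Duration collectionWindow, Duration processingTime) {
        return collectionWindow.plus(processingTime);
    }

    public static void main(String[] args) {
        // Processing times below are assumed values for illustration only.
        System.out.println("nightly:     " + worstCase(Duration.ofHours(24), Duration.ofHours(2)));
        System.out.println("hourly:      " + worstCase(Duration.ofHours(1), Duration.ofMinutes(10)));
        System.out.println("micro-batch: " + worstCase(Duration.ofSeconds(30), Duration.ofSeconds(5)));
    }
}
```

Shrinking the processing time helps only at the margin; the collection window dominates until it is removed altogether.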

The Three Generations of Latency Compression

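The industry didn't jump from nightly batch to millisecond streaming overnight. It happened across three generations, each compressing latency by orders of magnitude: nightly batch (hours), micro-batch (seconds to minutes), and true streaming (milliseconds).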

The diagram shows the key insight: in nightly batch processing, the dominant cost is the data-collection window — the time you spend waiting before the job even starts. Micro-batch (Spark Streaming running 30-second to 2-minute batches) reduces this dramatically but doesn't eliminate it. True streaming removes the collection window entirely — each event is processed the moment it arrives. The processing time per event is tiny (single-digit milliseconds for modern hardware), so the total latency collapses to the time it takes for one event to flow through the pipeline.

The Fraud Detector Rewired

Back to our e-commerce company. They migrate the fraud detector to Apache Flink. The same ML model logic, now expressed as a stream processing job reading from a Kafka topic. Here's what changes:

Before (batch): Order placed at 8:00 PM → batch job starts at midnight → fraud report lands at 9:00 AM, 13 hours later. Package already delivered.
After (streaming): Order placed at 8:00:00 PM → Flink processes the event at 8:00:00.300 PM → fraud score exceeds threshold → payment declined at 8:00:00.500 PM. 500 milliseconds total. Transaction never completes.

The ML model didn't change. The business logic didn't change. Only the execution model changed, from "accumulate and batch" to "process each event as it arrives". In this example, that 500ms is the end-to-end latency of the streaming approach: the time for a Kafka consumer to poll the event, route it through Flink's operator graph, and write the result back.

The batch model is always a latency trade-off. Nightly batch isn't "wrong"; it's perfectly correct for use cases where freshness doesn't matter. Running monthly payroll? Batch. Generating weekly executive reports? Batch. But for any use case where humans or systems need to respond to an event while it's still actionable (fraud detection, live dashboards, inventory alerts, recommendation freshness), streaming is the only architecture that keeps the door open.

Batch processing has an irreducible latency floor equal to its data-collection window, the time you wait before the job starts. Nightly jobs create windows of up to 24 hours; micro-batch compresses this to 30 seconds to 2 minutes; true streaming removes the collection window entirely, dropping total latency to the millisecond range. The e-commerce fraud case illustrates the concrete business consequence: 13 hours of batch latency means fraud is detected after the goods have shipped, while 500ms of streaming latency means the transaction never clears.

Section 3

Mental Model — The Dataflow Graph

You don't write a stream processing program the way you write a for-loop. There is no "start here, end here". Instead, you describe a graph of operations — boxes connected by arrows — and the runtime keeps that graph running forever, processing each event as it flows through. Understanding this graph mental model is the single most important conceptual step in stream processing.

The Assembly Line Analogy

Imagine a car factory assembly line. Cars (events) enter at one end. They pass through a series of stations: paint station, engine installation station, quality check station, packaging station. Each station does one specific job. Stations run in parallel — car #10 is being painted while car #9 gets its engine installed. The line never stops; it runs 24/7 as long as cars keep arriving.

A stream processing program is exactly this assembly line. Events enter at the left (from Kafka, Kinesis, or a file). They pass through a series of operators — filter this event, transform that field, group by user ID, count over a 1-minute window, write to the database. The operator chain is called a dataflow graph. You write the graph once; the runtime keeps it executing forever.

The Four Kinds of Operators

Every dataflow graph is built from four types of operators. Each type has a specific role:

Source operators — the entry points. They read events from an external system and inject them into the graph. Examples: Kafka consumer reading from a topic, a Kinesis shard reader, a file watcher. One source per input stream.
Transform operators — stateless per-event work. They receive one event, do some computation, and emit zero or more events. Examples: map (convert one event to another format), filter (drop events that don't match a condition), flatMap (split one event into many). These are the simplest operators — no memory required across events.
Stateful operators — aggregations and joins that need to remember past events. Examples: count per user (must keep a running count per user ID), sum of order values per 1-minute window (must accumulate over time), join two streams (must buffer one side waiting for the other). These require a state store — memory that persists across events and survives crashes.
Sink operators — the exit points. They take processed results and write them somewhere: Kafka topic, PostgreSQL table, Elasticsearch index, Prometheus metrics endpoint, dashboard WebSocket. One sink per output destination.

A Concrete Example: Count Purchases Per User Per Minute

Let's make this tangible. Say you want to count how many purchases each user makes every minute — the kind of real-time signal a fraud detector or recommendation engine needs. The dataflow graph has exactly five steps:

The diagram shows the five operators for our purchase-count pipeline. Events flow left to right. The Kafka source emits a raw purchase event. The filter operator drops cancelled orders (stateless, just a predicate on each event). The keyBy(userId) operator partitions the stream so all events for the same user flow to the same worker; this is essential because the aggregation needs to see all of a user's events. The tumbling window groups one minute of events per user, then the count aggregation fires. The result, for example "user u1 made 7 purchases between 12:00 and 12:01", is written to a Kafka output topic for downstream systems to consume.
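
The five steps can be followed by hand in a plain-Java sketch. This is not Flink code: it runs over a finite list, the Purchase class and its field names are hypothetical, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of source -> filter -> keyBy -> 1-minute tumbling count -> sink.
final class PurchaseCountSketch {
    static final long MINUTE_MS = 60_000L;

    // Hypothetical event type; field names are assumptions for illustration.
    static final class Purchase {
        final String userId;
        final long eventTimeMs;
        final boolean cancelled;

        Purchase(String userId, long eventTimeMs, boolean cancelled) {
            this.userId = userId;
            this.eventTimeMs = eventTimeMs;
            this.cancelled = cancelled;
        }
    }

    // Returns counts keyed by "userId@windowStartMs".
    static Map<String, Integer> countPerUserPerMinute(List<Purchase> source) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Purchase p : source) {                      // source: one event at a time
            if (p.cancelled) continue;                   // filter: stateless predicate
            long windowStart = p.eventTimeMs - (p.eventTimeMs % MINUTE_MS); // tumbling bucket
            String key = p.userId + "@" + windowStart;   // keyBy(userId) combined with the window
            counts.merge(key, 1, Integer::sum);          // stateful count per (user, window)
        }
        return counts;                                   // sink: hand results to the caller
    }

    public static void main(String[] args) {
        List<Purchase> events = Arrays.asList(
            new Purchase("u1", 5_000L, false),
            new Purchase("u1", 42_000L, false),
            new Purchase("u2", 50_000L, true),    // cancelled, dropped by the filter
            new Purchase("u1", 61_000L, false));  // falls in the next minute
        System.out.println(countPerUserPerMinute(events)); // prints {u1@0=2, u1@60000=1}
    }
}
```

A real stream processor differs in one important way: it cannot wait for the loop to finish, so it needs a signal (the watermark) to decide when each bucket is complete.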

Parallelism: The Same Graph, Running Across Workers

The runtime doesn't run just one copy of this graph. It runs N copies in parallel, typically one per input partition. If our Kafka input topic has 64 partitions and the job's parallelism is set to match, Flink runs 64 parallel copies of the filter+keyBy+window+count chain. Each copy handles a subset of events. You express the graph once, and the runtime fans it out. This is how stream processors handle millions of events per second: not by making one machine faster, but by spreading the workload across hundreds of identical operator instances.
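
The routing rule behind keyBy can be sketched in a few lines. This is a simplified illustration with hypothetical names; real frameworks use their own hashing schemes (Flink, for instance, hashes keys into key groups), but the property that matters is the same: a given key always maps to the same parallel instance.

```java
// Sketch of key-based routing: same key, same worker, every time.
final class KeyRouting {
    static int workerFor(String key, int parallelism) {
        // floorMod keeps the result in [0, parallelism) even for negative hash codes.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        for (String user : new String[] {"u1", "u2", "u1"}) {
            System.out.println(user + " -> worker " + workerFor(user, 64));
        }
    }
}
```

Because the mapping is deterministic, each worker can keep the running state for its own keys locally without coordinating with the others.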

Graph vs code. When you write a Flink or Kafka Streams program, you're not writing imperative "do this then do that" code. You're declaring a graph. The framework compiles your declarations into an execution plan (the actual dataflow DAG), schedules operators onto workers, manages parallelism, handles restarts, and keeps the graph running forever. Your code describes what to do; the framework handles how and when to do it.

A stream processing program is a dataflow graph — a directed chain of operators connected by event streams. Source operators pull from Kafka or Kinesis. Stateless transform operators (map, filter) do per-event work with no memory. Stateful operators (windows, counts, joins) need durable state stores that survive crashes. Sink operators write results downstream. The runtime deploys this graph in parallel across workers, one copy per partition, enabling throughput that scales horizontally just by adding machines.

Section 4

Core Concepts — The Stream Processing Vocabulary

Stream processing has precise terminology. Before you can read the Flink docs, debug a stateful job, or design a windowed aggregation, you need these terms locked in. Each term below is introduced with a plain English sentence first — the jargon name comes after, once you already understand the idea.

How to read this section: Plain English first, then the technical term.

The concept map shows how the major ideas connect. At the centre is the dataflow graph — everything else either feeds into it (unbounded data, sources, sinks) or is a property of how it behaves (windows, event time, watermarks, state, exactly-once). The red arrows highlight the time-related cluster — this is where most stream processing complexity lives, which is why Sections 5 and 6 give it dedicated treatment.

Term Glossary — Plain English First

A single event that carries data about one thing that happened — a click, a purchase, a sensor reading, a log line — is called a record or event. We use both terms interchangeably.

Data that has a fixed, known end — last month's transactions, a CSV file — is called bounded data. Data that never stops arriving — live click streams, IoT sensors, trading feeds — is called unbounded data. Stream processing is the discipline of computing over unbounded data.

The time when an event actually happened in the real world — encoded in the event's timestamp field — is called event time. The time when your processing system sees the event — the wall clock at your operator — is called processing time. They diverge. This divergence is the root of most hard problems in streaming — we'll explore it in depth in Section 5.

A heartbeat signal the system emits saying "we believe event time has reached T" is called a watermark. Watermarks let windowed operators decide "it's safe to close the 12:00–13:00 bucket and emit the count". Without watermarks, a window would never know when all its events had arrived — it would wait forever.

Carving the infinite stream into finite, computable time buckets is called windowing. The four window shapes are: tumbling (fixed non-overlapping buckets), hopping (overlapping, advance by step), sliding (re-evaluated for every arriving event), and session (gap-based, closes after inactivity). We cover all four in Section 7.

When an operator needs to remember data across many events, like "how many purchases has user #1234 made today?", it uses keyed state (state partitioned per key) or operator state (state scoped to one parallel operator instance rather than to a key). State stores are the memory that makes stateful aggregations possible, and the reason you need crash recovery.

A snapshot of the entire state of all operators at a point in time, saved to durable storage, is called a checkpoint. Checkpoints enable crash recovery: if a worker dies, the job restarts from the last checkpoint and replays only the events that arrived after it. A manually triggered checkpoint you can restart from intentionally is called a savepoint.
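
A toy model makes the recovery loop easier to see. The sketch below is plain Java with hypothetical names; it assumes a replayable input (like a Kafka partition) and stores the input offset together with the state, which is what lets the job resume without losing or double-counting events in its own state.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of checkpoint-based recovery: snapshot = operator state + input offset.
final class CheckpointSketch {
    static final class Checkpoint {
        final Map<String, Integer> counts;
        final int nextOffset;

        Checkpoint(Map<String, Integer> counts, int nextOffset) {
            this.counts = new HashMap<>(counts); // copy, so later updates don't leak in
            this.nextOffset = nextOffset;
        }
    }

    Map<String, Integer> counts = new HashMap<>();
    int nextOffset = 0;

    // Consumes log entries from the current offset up to (not including) upToExclusive.
    void process(List<String> log, int upToExclusive) {
        while (nextOffset < upToExclusive) {
            counts.merge(log.get(nextOffset), 1, Integer::sum);
            nextOffset++;
        }
    }

    Checkpoint snapshot() {
        return new Checkpoint(counts, nextOffset);
    }

    static CheckpointSketch restore(Checkpoint cp) {
        CheckpointSketch job = new CheckpointSketch();
        job.counts = new HashMap<>(cp.counts);
        job.nextOffset = cp.nextOffset;
        return job;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("u1", "u2", "u1", "u1", "u2"); // replayable input
        CheckpointSketch job = new CheckpointSketch();
        job.process(log, 3);
        Checkpoint cp = job.snapshot();  // state {u1=2, u2=1} at offset 3
        job.process(log, 4);             // work done after the checkpoint...
        job = restore(cp);               // ...is discarded by the simulated crash
        job.process(log, log.size());    // replay from offset 3 to the end
        System.out.println("u1=" + job.counts.get("u1") + ", u2=" + job.counts.get("u2"));
    }
}
```

Note that the event at offset 3 is processed twice overall (once before the crash, once after), yet the restored state counts it once. Anything the job wrote to an external sink before the crash is a separate matter, which is where exactly-once semantics comes in.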

When each event is counted exactly once, even if the system retries after a crash, that property is called exactly-once semantics. The easier (and more common) guarantee, where events may be processed more than once after a crash but are never lost, is called at-least-once semantics. Whether exactly-once matters depends on whether your sink is idempotent.
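
One common way to get exactly-once results on top of at-least-once delivery is an idempotent sink that remembers which event IDs it has already applied. The sketch below assumes every event carries a unique ID; the class and method names are hypothetical, and a production version would need the seen-ID set to be durable and bounded.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: at-least-once delivery + idempotent sink = each event counted once.
final class IdempotentSink {
    private final Set<String> seenEventIds = new HashSet<>();
    private long total = 0;

    // Applies the event only the first time its ID is seen; returns true if applied.
    boolean write(String eventId, long amount) {
        if (!seenEventIds.add(eventId)) {
            return false; // duplicate caused by a retry or replay: ignore it
        }
        total += amount;
        return true;
    }

    long total() {
        return total;
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.write("e1", 10);
        sink.write("e2", 5);
        sink.write("e1", 10); // redelivered after a simulated crash
        System.out.println(sink.total()); // prints 15
    }
}
```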

When a downstream operator can't keep up and the upstream operator needs to slow down, the system applies backpressure. This prevents out-of-memory crashes when load spikes temporarily overwhelm a bottleneck operator. Flink propagates backpressure automatically through its bounded network buffers, and Kafka Streams sidesteps the problem by pulling records from Kafka only as fast as it can process them; in both cases you don't implement it yourself.
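
The mechanism underneath backpressure is usually a bounded buffer between operators. The sketch below uses a standard-library ArrayBlockingQueue to show the idea; the class and method names are hypothetical, and the consumer is assumed to be completely stalled so the effect is visible.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Sketch: a bounded buffer is what forces a fast producer to slow down.
final class BackpressureSketch {
    // Offers `events` records to a buffer nobody is draining; returns how many fit.
    static int acceptedBeforeFull(int capacity, int events) {
        ArrayBlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < events; i++) {
            if (!buffer.offer(i)) {
                break; // buffer full: this is the point where the producer must pause
            }
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(acceptedBeforeFull(4, 10)); // prints 4
    }
}
```

A real producer would typically call the blocking put() instead of offer(), so it simply waits until the consumer frees a slot rather than growing an unbounded queue in memory.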

Stream processing has a precise vocabulary that every practitioner needs. The core axis is: events (unbounded data flowing in), operators (transform/stateful/source/sink forming a dataflow graph), time (event time vs processing time, managed via watermarks), windowing (tumbling/hopping/sliding/session carve infinity into computable buckets), state (keyed stores persisted via checkpoints for crash recovery), and correctness (exactly-once vs at-least-once, with backpressure managing flow under load).

Section 5

Event Time vs Processing Time — Why Time Is the Hardest Problem

Here's a question that sounds simple: "How many purchases happened between 2 PM and 3 PM yesterday?" If you're running batch queries on a database, this is trivial — filter by timestamp, count rows, done. In stream processing, this question reveals a deep problem that affects the correctness of almost every real-time computation. The problem is that there are two different clocks in a streaming system, and they disagree.

Clock 1: Event Time — When It Actually Happened

Every event carries a timestamp — the moment the event occurred in the real world. A purchase event has a purchasedAt field; a click event has a clickedAt field; an IoT sensor reading has a measuredAt field. This timestamp is set by the source system — the mobile app, the payment processor, the sensor firmware. It represents ground truth: the event happened at exactly this time.

This clock is called event time. When you want to answer "what happened between 2 PM and 3 PM?", event time is the right clock to use. It doesn't matter when your processor saw the event — it matters when the purchase occurred.

Clock 2: Processing Time — When Your System Saw It

Your stream processing operator has a different clock — the wall clock on the machine running the operator. When an event arrives in the operator's input buffer, that wall clock says "5:03 PM". This is called processing time. It's the time the system actually observed the event, not when the event happened.

For many simple use cases, processing time is perfectly fine. If you want "how many events are we seeing right now?" (a throughput metric), processing time is correct — you don't care when the events happened, you care about current arrival rate. But the moment you ask "what happened in a specific real-world time window?", processing time gives you wrong answers.

The diagram shows five events (A through E) plotted on two axes: their event time (top, amber axis) and their actual arrival at the processor (bottom, blue axis). Events A, D, and E arrive quickly, with only a second or two of network delay, so the two clocks stay close. Events B and C are the problem: their event time says they happened at 2:01 PM, but they arrive at the processor at 2:04–2:05 PM, three to four minutes late. Why? Possible reasons: a mobile phone went offline and buffered the events, a network partition delayed delivery, or a consumer process crashed and replayed a Kafka partition. In all these cases, the real-world truth is that these events happened at 2:01 PM, but the processor saw them minutes after it had already moved on.

Why This Breaks Naive Stream Processing

Suppose you're using processing time for your 2:00–3:00 PM window. Your system opens the window when the wall clock reaches 2:00 PM and closes it when the wall clock reaches 3:00 PM. Events B and C arrive at 2:04 and 2:05 PM processing time — they fall inside the window and get counted. This seems fine.

But now try to answer: "How many purchases happened between 2:01 PM and 2:02 PM?" B and C both happened in that minute. With processing-time windows, however, they are counted in the minutes when they arrived, not the minute when they occurred. Your answer is wrong: the 2:01–2:02 PM count is missing B and C, and later windows are inflated by them.

The core problem: A processing-time window for "2:01–2:02 PM" closes at wall-clock 2:02 PM. Events B and C have event time 2:01 PM but arrive at wall-clock 2:04–2:05 PM, so they're assigned to whichever processing-time windows are open when they arrive. Any per-minute report built on processing time is lying about when things actually happened.
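
The mismatch is easy to reproduce. The sketch below buckets the same five events per minute twice, once by each clock. The exact second offsets are assumptions chosen to match the narrative (B and C happen during 2:01 PM and arrive at about 2:04 and 2:05 PM); the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: the same events land in different 1-minute buckets depending on the clock.
final class TwoClocks {
    // Counts events per 1-minute bucket; bucket index = seconds / 60.
    static Map<Long, Integer> perMinute(long[] timesSec) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : timesSec) {
            counts.merge(t / 60, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Seconds after 2:00:00 PM for events A, B, C, D, E (assumed values).
        long[] eventTime      = {30, 70, 80, 130, 200};
        long[] processingTime = {31, 245, 310, 131, 202};
        System.out.println("event time:      " + perMinute(eventTime));      // {0=1, 1=2, 2=1, 3=1}
        System.out.println("processing time: " + perMinute(processingTime)); // {0=1, 2=1, 3=1, 4=1, 5=1}
    }
}
```

Bucket 1 (the 2:01 PM minute) holds two events by event time and none by processing time, while buckets 4 and 5 gain events that did not happen in those minutes.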

The Real-World Scenario: Mobile Offline-Then-Resync

Here's the canonical example. A user opens your mobile shopping app at 2:50 PM. The phone is on spotty cellular. The user makes a purchase. The app sends the event — but the network drops the packet. The app retries every 30 seconds. At 5:00 PM, the user moves into WiFi range. The app reconnects and successfully delivers the purchase event.

This diagram illustrates the gap clearly. The purchase event has event time 2:50 PM; that's when the user tapped "buy". The processor doesn't see it until 5:00 PM when the phone reconnects to WiFi. A naive processing-time system puts this event in a "5:00 PM" window. An event-time system, if its watermark hasn't already closed the 2:50 PM window, correctly assigns the event to the 2:50 PM bucket. That gap of more than two hours is extreme, but smaller gaps (seconds to minutes) happen constantly in production systems due to network jitter, retry queues, and consumer group rebalancing.

When to Use Each Clock

Neither clock is universally "better" — each answers a different question:

Use event time when: You're computing "what happened in a specific real-world time period?" — purchase counts per hour for a business report, sensor readings per minute for an analytics dashboard, fraud score based on purchase history in the last 10 minutes of activity. The business question refers to real-world time → use event time.

Use processing time when: You're measuring properties of the system itself — throughput (how many events per second is our pipeline processing right now?), latency (how old are the events when they arrive?), queue depth (how far behind is this consumer?). You care about the pipeline's behaviour, not the events' real-world semantics → use processing time.

Two clocks exist in every streaming system: event time (when the event actually happened, encoded in the event itself) and processing time (the wall clock when the processor sees the event). They diverge because networks delay packets, mobile devices go offline and batch-deliver events, and replayed partitions re-deliver old events. Processing-time windows produce wrong answers for real-world time queries because late-arriving events land in the wrong bucket. Event-time windows solve this but require knowing when it's safe to close a window — that decision is made by watermarks, which Section 6 explains.

Section 6

Watermarks — How to Decide "We've Seen Enough"

Section 5 ended with a problem: if you're doing event-time windowing, how do you know when all the events for a given time window have arrived? You can't wait forever — the stream never ends. But if you close the window too early, late events that should have been in that window are lost. Watermarks are the answer to this dilemma. They are the mechanism stream processors use to reason about time under uncertainty.

The Core Idea: A Progress Signal

Imagine you're counting all purchases between 2:00 PM and 3:00 PM. Events for that window are still trickling in. You've seen events with timestamps up to 2:58 PM. You want to emit the count — but you're not sure if more 2:xx PM events are still in transit.

A watermark is a special signal you inject into the event stream that says: "I am confident that all events with event time earlier than T have already been seen." When the windowed operator sees a watermark at T = 3:05 PM, it knows: no more events with timestamp earlier than 3:05 PM should arrive (or if they do, they're considered "late"). It's safe to close the 2:00–3:00 PM window and emit the count.

Formally: a watermark with timestamp T is an assertion that all events with event time < T have been delivered to this operator. Every windowed operator waits for a watermark that passes the window's end time before it fires and emits results.

The diagram shows the watermark flow. Events 1, 2, and 3 carry event times of 2:55–2:59 PM and arrive at the processor around 3:00–3:01 PM processing time. The watermark advances as events arrive: it's generated by subtracting a configured "allowed lateness" from the highest observed event time, so it trails the newest event by that margin. Once later events have pushed the watermark to T = 3:01 PM (which passes the window boundary of 3:00 PM), the window closes and emits its count of 3 events. The late event (L), which has event time 2:50 PM but arrives at 3:05 PM processing time, arrives after the watermark has already passed, so it cannot be added to the now-closed window. Instead, it's routed to a late-data side output stream for special handling.

How Watermarks Are Generated

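Watermarks don't appear by magic; someone has to generate them. The most common strategy in Apache Flink is called bounded out-of-orderness: you pick a fixed delay (say, 10 seconds) and set the watermark to max_seen_event_time - 10s. In plain English: "I'm assuming no event will arrive more than 10 seconds late." Kafka Streams has no explicit watermark signal; it tracks the highest timestamp seen so far (stream time) and applies a per-window grace period to similar effect. A note on naming: this guide calls the delay "allowed lateness", while Flink's API calls it the maximum out-of-orderness and reserves the name "allowed lateness" for a separate window-level setting described later in this section.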

Watermark = max observed event time − max allowed lateness
Example: you've seen events with event time up to 2:59:50 PM. Your allowed lateness is 30 seconds. Watermark = 2:59:50 PM − 30s = 2:59:20 PM. This tells operators: "All events with event time earlier than 2:59:20 PM should have arrived by now."
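
The formula can be sketched as a small plain-Java generator. This is not Flink's API, just the arithmetic; the class name is hypothetical and timestamps are assumed to be milliseconds since midnight.

```java
// Sketch of a bounded-out-of-orderness watermark generator (plain Java, not Flink's API).
final class BoundedOutOfOrdernessWatermark {
    private final long delayMs;
    private long maxEventTimeMs = Long.MIN_VALUE; // sentinel: no events seen yet

    BoundedOutOfOrdernessWatermark(long delayMs) {
        this.delayMs = delayMs;
    }

    // Out-of-order events never lower the maximum, so the watermark never moves backwards.
    void onEvent(long eventTimeMs) {
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
    }

    // watermark = max observed event time - configured delay
    long currentWatermark() {
        if (maxEventTimeMs == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing observed, so nothing can be asserted
        }
        return maxEventTimeMs - delayMs;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessWatermark wm = new BoundedOutOfOrdernessWatermark(30_000L);
        wm.onEvent(53_990_000L); // 2:59:50 PM as milliseconds since midnight
        wm.onEvent(53_900_000L); // an older, out-of-order event: watermark is unaffected
        System.out.println(wm.currentWatermark()); // prints 53960000, i.e. 2:59:20 PM
    }
}
```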

The Latency–Completeness Trade-off

The allowed lateness setting is the most important tuning knob in event-time windowing. It controls a fundamental trade-off:

The trade-off is real and unavoidable. A tight watermark (say, 2-second allowed lateness) means your window results arrive quickly after the window boundary closes — but any event that was delayed more than 2 seconds is considered "late" and either dropped or sent to a side output stream for separate processing. A loose watermark (say, 60-second allowed lateness) captures nearly all late events — but your window results are delayed by 60 seconds, which may be unacceptable for a fraud detection alert.

In practice, most production teams use between 5 and 30 seconds of allowed lateness, based on the 99th-percentile event delay they observe in their system. You measure it empirically (histogram of event-time-to-processing-time gaps), then set the allowed lateness to cover that percentile comfortably.
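
Measuring that percentile takes only a few lines. The sketch below uses the nearest-rank method on a sample of observed delays (processing time minus event time); the class name and the sample values are assumptions for illustration.

```java
import java.util.Arrays;

// Sketch: pick an allowed lateness from the observed delay distribution.
final class LatenessTuning {
    // Nearest-rank percentile; pct is a whole number from 1 to 100.
    static long percentile(long[] delaysMs, int pct) {
        long[] sorted = delaysMs.clone();
        Arrays.sort(sorted);
        int rank = (pct * sorted.length + 99) / 100; // integer ceil(pct * n / 100), 1-based
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Assumed sample: delays of 100 ms, 200 ms, ..., 10000 ms.
        long[] delaysMs = new long[100];
        for (int i = 0; i < delaysMs.length; i++) {
            delaysMs[i] = (i + 1) * 100L;
        }
        System.out.println("p99 delay (ms): " + percentile(delaysMs, 99)); // prints 9900
    }
}
```

With this sample, an allowed lateness of about 10 seconds would cover the 99th percentile with a small margin.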

Watermarks Are Heuristics — Late Data Still Happens

Here's the hard truth: watermarks are a bet, not a guarantee. You're saying "I believe no event will arrive more than 30 seconds late." Sometimes you're wrong. A phone might be offline for 2 hours. A batch replay job might emit events with timestamps from yesterday. In those cases, events arrive after their window's watermark has already passed.

Stream processors handle true late data in three ways:

Drop it: The simplest option. Late events that arrive after the watermark are silently discarded. Acceptable when late events are rare and the count error is small relative to total volume.
Side output stream: Route late events to a separate Kafka topic or output stream for special handling — manual review, reprocessing, or inclusion in the next window. Apache Flink calls this a side output.
Allowed lateness on the window: Some frameworks let you keep a window "open" for extra time beyond the watermark, accepting late events and issuing updated results. Flink calls this "allowed lateness" on the window — you can configure a window to accept events up to 5 minutes past its watermark, emitting a corrected result each time a late event arrives. The trade-off: you hold window state in memory for longer.
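
The three options reduce to one decision per arriving event: compare the current watermark with the window's end and with the end plus the extra window-level lateness. The sketch below is a simplification with hypothetical names; real frameworks differ in the exact boundary conditions, and "side output" here stands in for either dropping the event or routing it elsewhere.

```java
// Sketch: what happens to an event, given the watermark at the moment it arrives.
final class LateDataPolicy {
    enum Outcome { ON_TIME, LATE_UPDATE, SIDE_OUTPUT }

    static Outcome classify(long windowEndMs, long watermarkMs, long windowLatenessMs) {
        if (watermarkMs < windowEndMs) {
            return Outcome.ON_TIME;      // window has not fired yet: just add the event
        }
        if (watermarkMs < windowEndMs + windowLatenessMs) {
            return Outcome.LATE_UPDATE;  // window already fired but state is kept: emit a corrected result
        }
        return Outcome.SIDE_OUTPUT;      // window state is gone: drop or route to the late-data stream
    }

    public static void main(String[] args) {
        long windowEnd = 3_600_000L;  // window ends 1 hour into the stream
        long lateness = 300_000L;     // keep window state for 5 extra minutes
        System.out.println(classify(windowEnd, 3_500_000L, lateness)); // ON_TIME
        System.out.println(classify(windowEnd, 3_700_000L, lateness)); // LATE_UPDATE
        System.out.println(classify(windowEnd, 4_000_000L, lateness)); // SIDE_OUTPUT
    }
}
```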

Watermarks are to stream processing roughly what SQL transactions are to databases: a mechanism you configure once and then rely on. You don't think "will this transaction fail partway through?" on every query. Similarly, once you understand watermarks, you stop asking "did I miss any late events?" on every window and start relying on your watermark configuration. The analogy has a limit: a watermark is a heuristic, not a guarantee, so you still need to tune the allowed lateness to your system's observed delay distribution and instrument alerts for when the late-data side output starts filling up.

A watermark is a progress signal in the event stream asserting that all events with event time earlier than T have been seen. When a windowed operator receives a watermark that passes its window's end boundary, the window closes and emits results. Watermarks are generated by subtracting a configured allowed-lateness duration from the highest observed event time. Tight watermarks mean low output latency but more late drops; loose watermarks mean higher completeness but delayed output. Late events that slip past the watermark are handled via drop, side output, or window allowed-lateness — each a distinct trade-off between simplicity, memory, and correctness.

Section 7

Windowing — Carving Infinity Into Computable Pieces

Here is the central problem with infinite data: you can never compute "the total" because there is no total yet — more data is always coming. To produce any aggregate — count, sum, average — you have to first answer the question: aggregate over WHAT? The answer is a window: a bounded slice of the infinite stream that you can actually finish computing.

Think of it like a bus driver counting passengers. You can't count "total passengers ever" without waiting forever. But you can count "passengers per 10-minute shift" — that's a tumbling window. Or "passengers in the last 10 minutes, updated every 2 minutes" — a hopping window. Or "all passengers in one continuous ride from boarding to exit" — a session window. Each window type answers a different business question, and choosing the wrong one produces subtly wrong results that are very hard to debug.

There are four standard window types. Let's build an intuition for each before looking at the math.

What the diagram shows: all four window types on the same event-time axis (0–9 minutes). Tumbling windows are strict boxes, and each event belongs to exactly one window. Hopping windows overlap, so the same event can count in several windows. Sliding windows recompute fresh for every arriving event. Session windows have no fixed size; they grow as long as events keep arriving, and they close when there's a gap longer than the timeout.

Tumbling Windows — Fixed Buckets, No Overlap

A tumbling window is the simplest window: fixed size, non-overlapping, no gaps. Every event falls into exactly one bucket. "Count purchases every 5 minutes" means windows [0:00, 5:00), [5:00, 10:00), [10:00, 15:00), and so on forever, where each window includes its start and excludes its end. Each window produces exactly one output record when it closes.

The math is clean: a stream of N events over T minutes produces at most ⌈T/window_size⌉ output records per key, one per window that received at least one event (Flink does not fire empty windows). If your window is 5 minutes wide and you run for an hour, you get at most 12 output records per key.

Use tumbling windows when each bucket must be independent and non-overlapping — billing periods, SLA reports, hourly dashboards. The "no overlap" property means you can't accidentally double-count an event in two windows.
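The bucket arithmetic can be sketched in a few lines of plain Java. This assumes non-negative millisecond timestamps and windows aligned to zero; the class and method names are hypothetical:

```java
final class TumblingSketch {
    /** Start of the window [start, start + size) that contains the timestamp. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    /** Number of windows needed to cover a run of the given duration: ceil(T / size). */
    static long windowCount(long durationMs, long sizeMs) {
        return (durationMs + sizeMs - 1) / sizeMs;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60_000L;
        // An event at minute 7 lands in the window that starts at minute 5.
        System.out.println(windowStart(7 * 60_000L, fiveMinutes)); // prints 300000
        // One hour of 5-minute windows.
        System.out.println(windowCount(60 * 60_000L, fiveMinutes)); // prints 12
    }
}
```

Because the start is a pure function of the timestamp, every event maps to exactly one window, which is the "no double-counting" property described above.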

// Apache Flink, tumbling window: count orders per user every 5 minutes
DataStream<Order> orders = env.addSource(kafkaSource);

orders
    .keyBy(order -> order.userId)                          // partition by user; each user gets its own state
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))  // fixed 5-min bucket, event time
    .aggregate(new CountAggregator())                      // one result per (user, window) pair
    .print();

// TumblingEventTimeWindows means: use the event's own timestamp field, not wall-clock.
// If you wrote TumblingProcessingTimeWindows, you'd get wall-clock buckets:
// fine for monitoring, wrong for historical replays where events arrive out of order.

The code walkthrough: keyBy(order -> order.userId) creates K parallel window-streams, one per user. Think of it as splitting the single infinite stream into K sub-streams and then windowing each independently. This is keyed windowing, and it's how you get "5 purchases per user per 5-minute window" rather than "5 purchases total per 5-minute window". The aggregator's add() runs once per event as it arrives; the result is emitted once per window when the window closes, producing one output per (userId, window) pair.

Hopping Windows — Overlapping Slides

What if you want a 10-minute rolling count that updates every minute — not once every ten minutes? That means windows have to overlap: a new one starts before the previous one ends. This shape is called a hopping window (sometimes called a "sliding window" in older frameworks — the naming is inconsistent, which is a constant source of confusion). It has two parameters: a size and a hop interval. The window size is how wide each window is; the hop interval is how often a new window starts.

Example: size = 10 minutes, hop = 1 minute. Every minute, a new 10-minute window opens. Each event falls into up to size/hop windows simultaneously (here, up to 10). This means one event contributes to up to 10 output records, one for each window it participates in. That's called the fanout: more overlap = more computation = more cost.

Use hopping windows when you want a "trailing average" that updates frequently — "average CPU over the last 5 minutes, refreshed every 30 seconds" or "7-day rolling average of sales, updated daily". The overlap is the feature: you get smooth, continuously updated aggregates rather than step-function jumps.
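To make the fanout concrete, here is a small plain-Java sketch that lists every window an event belongs to. The names are hypothetical, windows are assumed to be aligned to zero, and timestamps are in milliseconds:

```java
import java.util.ArrayList;
import java.util.List;

final class HoppingSketch {
    /** Start times of every window [start, start + size) that contains the timestamp. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long hopMs) {
        List<Long> starts = new ArrayList<>();
        // The most recent window to have opened at or before the event.
        long lastStart = timestampMs - (timestampMs % hopMs);
        // Walk backwards one hop at a time while the window still covers the event.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= hopMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Event at 12m30s, size = 10 min, hop = 1 min.
        List<Long> starts = windowStarts(12 * minute + 30_000L, 10 * minute, minute);
        System.out.println(starts.size() + " windows, newest starts at " + starts.get(0));
    }
}
```

Setting the hop equal to the size makes the loop run exactly once, which is the tumbling case.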

// Hopping window: 10-minute window, new window every 1 minute
orders
    .keyBy(order -> order.userId)
    // Flink calls this "SlidingEventTimeWindows"; it is what we call hopping.
    // size=10min, slide=1min → a new 10-min window starts every 1 minute.
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
    .aggregate(new SumAggregator())
    .print();

What this code does: every minute, a fresh 10-minute window opens. Each event lands in up to ten of these overlapping windows simultaneously — so every minute you emit a result covering "the last 10 minutes ending now." Confusingly, Flink's API name SlidingEventTimeWindows matches what most of the literature now calls a hopping window. Read the two parameters as "(size, hop)" and ignore the name.

Sliding Windows — Per-Event Recomputation

A true sliding window (distinct from hopping) recomputes the aggregate fresh on every event arrival. Every new event triggers a new window result that covers [now - window_size, now]. This gives you the most up-to-date possible result, but at the cost of computing a new result for every single event — which is expensive at high throughput.

True sliding windows are relatively rare in practice because the per-event recomputation cost is prohibitive at scale. Most use cases that look like sliding windows are better served by hopping windows with a small hop interval. The distinction matters for interviews: if asked "how do you compute a trailing 5-minute average that updates in real time?", the correct answer is a hopping window (e.g., size=5min, hop=1sec), not a true per-event sliding window. By the fanout rule above, that hop puts each event in 300 windows, so pick the largest hop that still meets the freshness requirement.

Session Windows — The Gap-Based Window

A session window has no fixed size at all. Instead, it's defined by an inactivity gap: "group all events that arrive within 30 minutes of each other; when there's a 30-minute gap with no events, close the window." This is how you detect a user's browsing session, a driver's continuous trip, or a machine's active work period.

The tricky part: session window sizes are completely unpredictable. A session could be 2 events over 10 seconds or 300 events over 45 minutes. This makes session windows more expensive to manage in state — the framework can't pre-allocate fixed buckets; it has to dynamically open, extend, merge, and close windows per-key as events arrive.
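The open/extend/merge behaviour can be sketched for a single key in plain Java. This is a simplified model under stated assumptions, not Flink's implementation: every event opens a tentative session [t, t + gap), and any sessions that overlap it are merged into one. All names are hypothetical and time units are arbitrary (minutes in the usage below):

```java
import java.util.ArrayList;
import java.util.List;

final class SessionSketch {
    /** One session covering [start, end), where end = last event time + gap. */
    static final class Session {
        final long start;
        final long end;
        Session(long start, long end) { this.start = start; this.end = end; }
    }

    private final long gap;
    private final List<Session> sessions = new ArrayList<>();

    SessionSketch(long gap) { this.gap = gap; }

    /** Adds an event: opens a tentative session and merges it with every session it overlaps. */
    void add(long eventTime) {
        long start = eventTime;
        long end = eventTime + gap;
        List<Session> kept = new ArrayList<>();
        for (Session s : sessions) {
            if (s.start < end && start < s.end) {   // overlap: absorb the existing session
                start = Math.min(start, s.start);
                end = Math.max(end, s.end);
            } else {
                kept.add(s);                        // untouched session stays as it is
            }
        }
        kept.add(new Session(start, end));
        sessions.clear();
        sessions.addAll(kept);
    }

    List<Session> sessions() { return new ArrayList<>(sessions); }

    public static void main(String[] args) {
        SessionSketch s = new SessionSketch(30);
        s.add(0); s.add(50);   // two separate sessions
        s.add(25);             // out-of-order event bridges them
        System.out.println("sessions after bridge event: " + s.sessions().size());
    }
}
```

The merge step is why session state is costlier than fixed buckets: an out-of-order event can retroactively fuse two sessions that looked finished.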

What this SVG shows: top row — two clusters of events separated by a 38-minute gap. Since the gap exceeds the 30-minute timeout, the first cluster closes as Session 1, and the second cluster opens a new Session 2. Bottom row — the session merge scenario: two "tentative" sessions that end up merging into one when a new event arrives within the gap of both. Flink handles this automatically, but it requires the framework to store and merge window state dynamically.

What this SVG shows: keyBy(userId) splits the single mixed event stream into K parallel sub-streams — one per user. Each sub-stream runs its own independent window with its own isolated state. User A's window closes and produces output completely independently of User B's window. This parallelism is why Flink can process millions of concurrent user sessions without a single shared bottleneck — each key is its own little stream.

Window type decision rule: Ask "does each event belong to exactly one bucket?" → tumbling. "Do I want a rolling result that updates more often than the window size?" → hopping. "Do I want to group events by natural activity clusters with no fixed time?" → session. "Do I need a fresh result for every single event?" → true sliding (warn your ops team first).

Windowing solves the fundamental problem of aggregating over an infinite stream by slicing it into finite, computable pieces. Tumbling windows are disjoint and produce one result per bucket. Hopping windows overlap (fanout = size/hop) and give smoothly-updating rolling aggregates. Session windows are gap-based with no fixed size and are used for natural activity grouping. Keyed windowing via keyBy() creates K independent parallel window streams, enabling per-user or per-entity aggregations at full parallelism.

Section 8

Stateful Operations — Joins, Aggregations, and the State Store

Most useful things you want to do with a stream require memory. Counting purchases per user requires the operator to remember how many purchases it has seen so far for each user. Joining click events with user profiles requires looking up a profile table. Detecting "three failed logins in 10 minutes" requires storing the last three login events per user. Anything beyond "print each event as it arrives" is a stateful operation.

State in stream processing is different from state in a normal application. The big difference: state must survive worker crashes. If the node holding "User A has made 4,987 purchases so far" crashes and forgets everything, you can't just restart — User A's counter resets to zero and you get wrong answers. So state has to be durable and crash-safe by default. This requirement immediately rules out a simple HashMap in memory as the only state storage option — you need something that can checkpoint and recover.

Where State Lives: In-Memory vs RocksDB

Flink (and Kafka Streams) offer two main state backends:

Keep everything in RAM (HashMapStateBackend) — state lives as plain Java objects in the JVM heap. Blazingly fast reads and writes — nanosecond latency. Checkpoints snapshot the entire state to remote storage (S3/HDFS) periodically. The downside: if your state is large (millions of user counters) you need enormous RAM, and full checkpoints are slow because you must serialize everything.
Put it on local disk in a fast key-value store (EmbeddedRocksDBStateBackend): state lives in RocksDB, an embedded key-value store on local disk that organises data as an LSM tree (the same write-friendly structure used by Cassandra and many NoSQL engines). State larger than RAM is fine because RocksDB pages in what it needs. Slightly slower than in-memory (microseconds vs nanoseconds), but supports terabyte-scale state on commodity hardware. HashMapStateBackend is Flink's default, but RocksDB is the recommended choice for production jobs with large or long-window state. It is also the backend that supports incremental checkpoints: only the changed SST files (the small on-disk segments RocksDB writes) go to S3, not the whole state.

What this SVG shows: the two state backends side by side. In-memory (HashMapStateBackend) stores state as plain Java objects in the JVM heap — fast, but RAM-bounded. RocksDB stores state on local SSD — slightly slower, but supports terabyte-scale state and uses incremental checkpoints that only write changed data to S3, keeping checkpoint overhead low even with massive state.

Aggregations — Count, Sum, and Custom Accumulators

The most common stateful operation is an aggregation: count events per key per window, sum transaction amounts per merchant, compute the 99th percentile latency per service. Flink's DataStream API provides built-in sum, min, max, minBy, and maxBy on keyed and windowed streams, the Table API and SQL add COUNT, AVG, FIRST_VALUE, and LAST_VALUE, and a generic AggregateFunction interface covers custom logic.

Here is a hand-written accumulator that computes a running average per user. Watch for three pieces — a start state (the empty accumulator), an update rule (how each new event changes that state), and an output rule (how the accumulated state becomes a final number) — because every custom aggregation you ever write will have exactly these three pieces:

// Custom accumulator: count + sum to compute running average per user
import org.apache.flink.api.common.functions.AggregateFunction;

public class AvgAggregator implements AggregateFunction<Order, long[], Double> {

    @Override
    public long[] createAccumulator() {
        return new long[]{0L, 0L};   // [count, totalAmount]
    }

    @Override
    public long[] add(Order order, long[] acc) {
        acc[0]++;                    // increment count
        acc[1] += order.amount;      // add to total (amount assumed to be an integral value, e.g. cents)
        return acc;
    }

    @Override
    public Double getResult(long[] acc) {
        return acc[0] == 0 ? 0.0 : (double) acc[1] / acc[0];   // avoid divide-by-zero
    }

    @Override
    public long[] merge(long[] a, long[] b) {
        return new long[]{a[0] + b[0], a[1] + b[1]};           // needed for session windows
    }
}

The merge() method is only needed for session windows — when two session window fragments merge (as shown in the SVG above), Flink needs to combine their accumulators. For tumbling and hopping windows, merge is never called.
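A short plain-Java sketch shows why the accumulator carries count and sum rather than a finished average: merging two fragments then gives the same answer as processing all events together. The class below mirrors the accumulator logic without any Flink dependency, and its name is hypothetical:

```java
final class AvgAccumulatorSketch {
    static long[] create() {
        return new long[]{0L, 0L};   // [count, totalAmount]
    }

    static long[] add(long amount, long[] acc) {
        acc[0]++;
        acc[1] += amount;
        return acc;
    }

    static long[] merge(long[] a, long[] b) {
        return new long[]{a[0] + b[0], a[1] + b[1]};
    }

    static double result(long[] acc) {
        return acc[0] == 0 ? 0.0 : (double) acc[1] / acc[0];
    }

    public static void main(String[] args) {
        long[] fragmentA = add(20, add(10, create()));   // average 15
        long[] fragmentB = add(60, create());            // average 60
        // Merging the accumulators gives 90 / 3 = 30, the true average.
        System.out.println(result(merge(fragmentA, fragmentB)));   // prints 30.0
    }
}
```

Averaging the two fragment averages instead would give (15 + 60) / 2 = 37.5, which is wrong because the fragments hold different numbers of events.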

Joins — Stream-Stream vs Stream-Table

Joining two streams together is one of the most powerful and most misunderstood operations in stream processing. There are two fundamentally different kinds of join, and picking the wrong one produces either incorrect results or a system that eats all your RAM.

Stream-Stream Join: Both sides are live, infinite streams. An order event on the left needs to match a payment confirmation event on the right, where the two events happened within 5 minutes of each other. The key question: how long do you wait for a match? If the order arrives at t=0 and the payment confirmation arrives at t=4m59s, they should match. But if the payment never arrives, you need to eventually stop waiting and drop the buffered order. Stream-stream joins use a time bound to limit how long each side waits for a match from the other side. Events that fall outside the bound are unmatched and either discarded or sent to a side output.
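The buffering and waiting can be sketched in plain Java. This is an illustrative model only: both sides are buffered per key, each arrival is matched against the other side's buffer, and a watermark evicts events that can no longer match. The class, method names, and the "key:orderTs/paymentTs" output format are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StreamStreamJoinSketch {
    private final long boundMs;
    private final Map<String, List<Long>> orders = new HashMap<>();    // key -> buffered order times
    private final Map<String, List<Long>> payments = new HashMap<>();  // key -> buffered payment times

    StreamStreamJoinSketch(long boundMs) { this.boundMs = boundMs; }

    /** Buffers the order and returns one "key:orderTs/paymentTs" entry per match. */
    List<String> onOrder(String key, long ts) {
        orders.computeIfAbsent(key, k -> new ArrayList<>()).add(ts);
        List<String> matches = new ArrayList<>();
        for (long payTs : payments.getOrDefault(key, new ArrayList<>())) {
            if (Math.abs(payTs - ts) <= boundMs) matches.add(key + ":" + ts + "/" + payTs);
        }
        return matches;
    }

    /** Buffers the payment and returns one "key:orderTs/paymentTs" entry per match. */
    List<String> onPayment(String key, long ts) {
        payments.computeIfAbsent(key, k -> new ArrayList<>()).add(ts);
        List<String> matches = new ArrayList<>();
        for (long orderTs : orders.getOrDefault(key, new ArrayList<>())) {
            if (Math.abs(ts - orderTs) <= boundMs) matches.add(key + ":" + orderTs + "/" + ts);
        }
        return matches;
    }

    /** Evicts buffered events that no on-time event can match any more. */
    void onWatermark(long watermarkMs) {
        evict(orders, watermarkMs);
        evict(payments, watermarkMs);
    }

    private void evict(Map<String, List<Long>> side, long watermarkMs) {
        for (List<Long> buffered : side.values()) {
            buffered.removeIf(ts -> ts + boundMs < watermarkMs);
        }
        side.values().removeIf(List::isEmpty);
    }

    int bufferedEvents() {
        int n = 0;
        for (List<Long> l : orders.values()) n += l.size();
        for (List<Long> l : payments.values()) n += l.size();
        return n;
    }

    public static void main(String[] args) {
        StreamStreamJoinSketch join = new StreamStreamJoinSketch(5 * 60_000L);
        join.onOrder("o1", 0L);
        System.out.println(join.onPayment("o1", 299_000L));   // payment at 4m59s matches
    }
}
```

The eviction step is what keeps join state bounded; without it both buffers grow for as long as the job runs.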

Stream-Table Join: One side is a live stream of events; the other side is a continuously-updated table (in Kafka Streams, this is called a KTable; in Flink it's a connect() with a BroadcastState or a Table API lookup). Every incoming event on the stream side looks up the latest value in the table at the time the event arrives. This is how you enrich click events with user profiles: the click is the stream, the user profile database is the table. No time window needed — you always use the most recent profile.
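By contrast, the stream-table case reduces to a point lookup against the latest value per key. A minimal plain-Java sketch, with hypothetical names and an in-memory map standing in for the table:

```java
import java.util.HashMap;
import java.util.Map;

final class StreamTableJoinSketch {
    private final Map<String, String> profileTable = new HashMap<>();   // userId -> latest profile

    /** Table side: each update overwrites the previous value for the key. */
    void onProfileUpdate(String userId, String profile) {
        profileTable.put(userId, profile);
    }

    /** Stream side: enrich the click with whatever the table holds right now. */
    String onClick(String userId, String url) {
        String profile = profileTable.getOrDefault(userId, "unknown");
        return userId + " (" + profile + ") clicked " + url;
    }

    public static void main(String[] args) {
        StreamTableJoinSketch join = new StreamTableJoinSketch();
        join.onProfileUpdate("u1", "free");
        System.out.println(join.onClick("u1", "/home"));
        join.onProfileUpdate("u1", "premium");
        System.out.println(join.onClick("u1", "/pricing"));
    }
}
```

The stream side buffers nothing, and the table side holds one entry per key regardless of how many updates it has seen.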

What this SVG shows: the two join architectures side by side. In a stream-stream join, both streams buffer incoming events in state until they find a match within the time window — late events can miss their match entirely. In a stream-table join, the KTable compacts to the latest value per key, and every incoming stream event does a point lookup — no buffering required on the stream side, and the table side only keeps the latest value rather than all historical events.

The Unbounded State Problem

State can grow without bound if you're not careful. "Count per user" over a stream that's been running for 3 years means you now have state for every user who ever existed — tens of millions of entries that are mostly stale. Your RocksDB state store gradually expands until the worker runs out of disk.

The fix is state TTL (time-to-live): any state entry not updated within a configured period is automatically expired and cleaned up. In Flink:

// State TTL: automatically expire user state after 7 days of inactivity
StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.days(7))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)              // reset TTL on each write
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();

ValueStateDescriptor<Long> descriptor =
    new ValueStateDescriptor<>("userCount", Long.class);
descriptor.enableTimeToLive(ttlConfig);   // attach TTL to this state descriptor

// Obtain the state handle inside a rich function, typically in open()
ValueState<Long> userCount = getRuntimeContext().getState(descriptor);

// Now any user who goes inactive for 7 days has their state automatically evicted

What this code does: the StateTtlConfig attaches an expiry rule to a piece of state. OnCreateAndWrite means "reset the 7-day timer every time you write" — so an active user never expires. NeverReturnExpired means once the timer is up, the state behaves as if it never existed, even if it hasn't been physically cleaned up yet. Wire this config onto every ValueStateDescriptor you create; otherwise the state for that descriptor grows forever.
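The two semantics described above (the timer resets on write; expired entries read as absent before they are physically removed) can be modelled in plain Java. This is an illustrative sketch with hypothetical names, not how Flink stores TTL metadata:

```java
import java.util.HashMap;
import java.util.Map;

final class TtlStateSketch {
    private static final class Entry {
        long value;
        long lastWriteMs;
    }

    private final long ttlMs;
    private final Map<String, Entry> state = new HashMap<>();

    TtlStateSketch(long ttlMs) { this.ttlMs = ttlMs; }

    /** Every write resets the expiry timer (the OnCreateAndWrite behaviour). */
    void put(String key, long value, long nowMs) {
        Entry e = state.computeIfAbsent(key, k -> new Entry());
        e.value = value;
        e.lastWriteMs = nowMs;
    }

    /** Expired entries read as absent even if still stored (the NeverReturnExpired behaviour). */
    Long get(String key, long nowMs) {
        Entry e = state.get(key);
        if (e == null || nowMs - e.lastWriteMs >= ttlMs) return null;
        return e.value;
    }

    /** Physical cleanup pass; returns how many entries were removed. */
    int cleanUp(long nowMs) {
        int before = state.size();
        state.values().removeIf(e -> nowMs - e.lastWriteMs >= ttlMs);
        return before - state.size();
    }

    int physicalSize() { return state.size(); }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        TtlStateSketch s = new TtlStateSketch(7 * day);
        s.put("userA", 5, 0);
        System.out.println("day 6: " + s.get("userA", 6 * day));
        System.out.println("day 8: " + s.get("userA", 8 * day));
    }
}
```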

State TTL is the most commonly forgotten configuration in production streaming jobs. Forgetting TTL on long-running jobs is how teams discover at 3 AM that their Flink cluster's RocksDB state has grown to 4 TB and the job is crashing due to disk exhaustion.

Stateful stream operations (aggregations, joins) require durable, crash-safe state stores. RocksDB-backed state (recommended in Flink for large/long-window state, and required for incremental checkpoints) supports terabyte-scale state; in-memory HashMapStateBackend is Flink's default and is faster but RAM-bounded. Stream-stream joins match events from two live streams within a time window; stream-table joins enrich stream events with the latest value from a compacted KTable. Always configure state TTL to prevent unbounded state growth in long-running jobs.

Section 9

Checkpoints, Savepoints, and Recovery

Here is an uncomfortable truth about stream processing: workers crash. Machines fail. Kubernetes evicts pods. The JVM runs out of memory. In a batch job, you just re-run the job from scratch. In a streaming job that has been running for days and accumulated gigabytes of state, "restart from scratch" means you lose everything — every counter, every partial aggregation, every in-progress session window. You'd have to replay every event ever produced just to rebuild the state.

The solution is checkpointing: periodically snapshot the complete state of every operator in the DAG to durable remote storage (S3, HDFS, Azure Blob). When a crash happens, restart from the most recent checkpoint and replay only the events that arrived after that checkpoint. This is how streaming systems achieve fault tolerance without losing data or correctness.

How Checkpoints Work — The Barrier Mechanism

The tricky part is making the snapshot consistent. If Operator A snapshots its state at T=100, and Operator B snapshots its state at T=110, but B has already processed events that A hasn't snapshotted yet — the combined state is inconsistent. You'd recover to an impossible mixed state.

Flink solves this with a simple trick: send a special marker message into the stream alongside the data, and have every operator snapshot itself the moment the marker passes through. Because the marker travels in the same order as the data, every operator captures "the state right after the same set of events" — even though they snapshot at different wall-clock moments. These markers are called checkpoint barriers, and the underlying idea is a classic 1985 result called the Chandy–Lamport distributed snapshot algorithm. Here's how it plays out:

The JobManager (the Flink coordinator) triggers a checkpoint by telling every source operator to inject a special "barrier" marker into its output stream. Nothing is written to Kafka itself.
Each source snapshots its state at that point (e.g., the current Kafka offset per partition) and emits the barrier downstream, where it flows through the DAG in order with the data.
An operator with two input channels waits until it has received barriers from all input channels (barrier alignment) before taking its own snapshot. Events that arrive on a channel whose barrier has already been seen are buffered until the remaining barriers arrive, because they belong to the next checkpoint.
Once all operators have snapshotted and reported success to the JobManager, the checkpoint is complete. The JobManager records the checkpoint ID and its location in S3.
On crash, all operators reload state from the latest complete checkpoint and tell their Kafka consumers to seek back to the recorded offset. Events between the checkpoint and the crash are replayed.
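The alignment step in the list above can be sketched as a toy operator in plain Java. It keeps a running sum as its state, holds back post-barrier events per channel, and snapshots only when every channel has delivered its barrier. All names are hypothetical and the model ignores networking and asynchronous snapshot upload:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class AlignmentSketch {
    private final boolean[] barrierSeen;                      // one flag per input channel
    private final Deque<Long> buffered = new ArrayDeque<>();  // events held back during alignment
    private final List<Long> snapshots = new ArrayList<>();   // stand-in for durable storage
    private long sum = 0;                                     // the operator's state

    AlignmentSketch(int channels) { barrierSeen = new boolean[channels]; }

    void onEvent(int channel, long value) {
        if (barrierSeen[channel]) {
            buffered.add(value);   // arrived after this channel's barrier: belongs to the next checkpoint
        } else {
            sum += value;          // pre-barrier data is processed normally
        }
    }

    void onBarrier(int channel) {
        barrierSeen[channel] = true;
        for (boolean seen : barrierSeen) {
            if (!seen) return;     // still waiting for another channel's barrier
        }
        snapshots.add(sum);        // state now reflects exactly the pre-barrier events
        Arrays.fill(barrierSeen, false);
        while (!buffered.isEmpty()) {
            sum += buffered.poll();   // resume with the held-back events
        }
    }

    long sum() { return sum; }

    List<Long> snapshots() { return new ArrayList<>(snapshots); }

    public static void main(String[] args) {
        AlignmentSketch op = new AlignmentSketch(2);
        op.onEvent(0, 1);
        op.onBarrier(0);
        op.onEvent(0, 10);   // held back: channel 0 is already past its barrier
        op.onEvent(1, 2);
        op.onBarrier(1);
        System.out.println("snapshot = " + op.snapshots() + ", live sum = " + op.sum());
    }
}
```

If the operator had processed the value 10 immediately, the snapshot would include an event that the upstream snapshot on that channel does not account for.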

What this SVG shows: a three-operator DAG with a barrier flowing through it. The key step is the aggregator waiting for barriers from both input channels before taking its snapshot — this is "barrier alignment" and is what makes the snapshot globally consistent. Without alignment, two operators could snapshot at different logical moments and produce a checkpoint that represents an impossible state.

What this SVG shows: the crash-and-recovery timeline. A checkpoint completes at T=60s and is saved to S3. The job continues normally until T=120s when a worker crashes. On restart, the system loads the T=60s snapshot and replays 60 seconds of Kafka events. Those 60 seconds of events are processed again — which means any output produced during that 60-second window might be produced twice unless your sink is idempotent. This is why exactly-once semantics (Section 10) is a separate, harder problem.
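A toy replay in plain Java makes the duplicate-output effect visible. The event log, the "event#count" sink format, and all names are hypothetical; the checkpoint is modelled as a saved (offset, count) pair:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class ReplaySketch {
    /** Processes log[from, to), appends "event#runningCount" to the sink, returns the new count. */
    static long process(List<String> log, int from, int to, long count, List<String> sink) {
        for (int i = from; i < to; i++) {
            count++;
            sink.add(log.get(i) + "#" + count);
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("e0", "e1", "e2", "e3", "e4", "e5");
        List<String> sink = new ArrayList<>();

        long count = process(log, 0, 2, 0, sink);
        int checkpointOffset = 2;            // checkpoint: source offset ...
        long checkpointCount = count;        // ... plus operator state

        process(log, 2, 5, count, sink);     // more work, then the worker crashes

        // Recovery: restore state and offset from the checkpoint, then replay.
        long recovered = process(log, checkpointOffset, 6, checkpointCount, sink);

        System.out.println("final count = " + recovered);
        System.out.println("sink records = " + sink.size()
                + ", distinct = " + new HashSet<>(sink).size());
    }
}
```

The restored counter ends at the correct value, but the append-only sink holds the replayed records twice, which is the gap a transactional or idempotent sink has to close.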

Savepoints — Checkpoints You Control

A savepoint is a user-triggered checkpoint. Automatic checkpoints are triggered periodically by the framework, which also deletes old ones; a savepoint is triggered manually via the Flink CLI or REST API and is retained until you delete it.

Savepoints are used for: (a) planned job upgrades: take a savepoint while stopping the old job, deploy the new version, restore from the savepoint. The new job picks up exactly where the old one left off. (b) A/B testing: fork two jobs from the same savepoint. (c) Disaster recovery testing: verify you can actually restore from a saved state before you need to in production.

# Trigger a savepoint for job abc123def456, store it in S3
flink savepoint abc123def456 s3://my-bucket/flink/savepoints/

# Stop the job gracefully and take a final savepoint in one step
flink stop --savepointPath s3://my-bucket/flink/savepoints/ abc123def456

# Start a new job from a savepoint
flink run --fromSavepoint s3://my-bucket/flink/savepoints/savepoint-abc123-xyz \
    my-streaming-job.jar

# Flink checks that the savepoint is compatible with the new job's operator topology
# (same operator IDs and state types). Changing an operator's ID = its saved state no longer maps.

Operator UIDs are effectively mandatory for savepoint compatibility. Always annotate your stateful operators with .uid("my-operator") in Flink. Without UIDs, Flink autogenerates them from the structure of the job graph, so a topology change (even adding an operator before an existing one) can leave Flink unable to map savepoint state back to its operators. With explicit UIDs, you can restructure your job and still restore from an old savepoint.

Checkpoint interval tuning: shorter intervals = less data to replay on crash, but more S3 write overhead. A 60-second checkpoint interval means at most 60 seconds of replay on crash. A 10-minute interval means up to 10 minutes of replay. For most production jobs, 60–120 seconds is the standard. For jobs with very large state (TB-class), incremental checkpoints reduce the overhead enough that 60 seconds is still practical.

Checkpoints are periodic, automatically-triggered consistent snapshots of all operator state, stored in S3/HDFS. Barrier propagation through the DAG (based on the Chandy-Lamport algorithm) ensures snapshots are globally consistent. On crash, the job restores from the latest complete checkpoint and replays events from that point, so the amount replayed is at most about one checkpoint interval. Savepoints are user-triggered checkpoints used for planned upgrades, migrations, and DR testing. Always assign explicit UIDs to operators to maintain savepoint compatibility across job changes.

Section 10

Exactly-Once Semantics — The Hardest Property to Achieve

Here is the problem. After a crash, the system replays events from the last checkpoint. Those events have already been processed — the output was already written to the sink before the crash. When you replay them, you process them again. Now the sink has seen each event twice. If your sink is a counter in PostgreSQL, you've incremented it twice. If your sink is a Kafka topic, you've published each record twice. Downstream consumers receive duplicate data and produce wrong aggregates.

This is the at-least-once guarantee: every event is processed at least once, possibly more times due to replay. Most streaming pipelines run at-least-once by default. It's much easier to implement because you don't need any coordination between the checkpoint and the sink — just write, and on replay, write again.

Exactly-once means every event contributes to the output exactly once, even across crashes and replays. Achieving this requires three things working together: a replayable source, a checkpointed state store, and an idempotent or transactional sink. If any of the three is missing, you fall back to at-least-once.

The Three Preconditions for Exactly-Once

Replayable source (Kafka offsets — yes; raw TCP — no) — The source must be able to replay from a specific position. Kafka gives you exactly this via consumer offsets — "replay from offset 42810 on partition 3." A raw TCP socket does not: once a byte is consumed, it's gone. This is why nearly all production exactly-once pipelines use Kafka (or Kinesis, or Pulsar) as the source — not because of speed, but because of replayability.
State stores participate in checkpoints — All operator state (counts, sums, join buffers) must be snapshotted as part of the checkpoint. When the job restores from the checkpoint, all state is exactly as it was at checkpoint time. Flink does this automatically for all built-in state types. Custom state that lives outside Flink's state management (e.g., a Redis call you make in your map operator) is not included in checkpoints — you'd need to handle idempotency yourself.
Idempotent or transactional sink — The sink must not double-count replayed events. Two approaches: (a) Idempotent writes — writing the same event twice produces the same result as writing it once. A key-value store where writes are PUT key → value is idempotent; an incrementing counter is not. Elasticsearch upserts (document ID as idempotency key) are idempotent. (b) Two-phase commit (2PC) — the sink buffers writes since the last checkpoint, and only commits them when the checkpoint completes. If the job crashes before the checkpoint completes, the buffered writes are discarded — the replay will re-produce them and they'll be committed with the next successful checkpoint.
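The difference between an idempotent write and a non-idempotent one is easy to see in a toy sink. In this plain-Java sketch (hypothetical names, in-memory maps standing in for external stores), the same result record is delivered twice, as it would be on replay:

```java
import java.util.HashMap;
import java.util.Map;

final class SinkIdempotencySketch {
    final Map<String, Long> upsertStore = new HashMap<>();    // PUT key -> value
    final Map<String, Long> counterStore = new HashMap<>();   // value += delta

    /** Idempotent: writing the same (key, value) again leaves the store unchanged. */
    void upsert(String key, long value) {
        upsertStore.put(key, value);
    }

    /** Not idempotent: every delivery changes the stored value. */
    void increment(String key, long delta) {
        counterStore.merge(key, delta, Long::sum);
    }

    public static void main(String[] args) {
        SinkIdempotencySketch sink = new SinkIdempotencySketch();
        String key = "userA@12:00";   // (user, window) identifies the result
        for (int delivery = 0; delivery < 2; delivery++) {   // original write + replay
            sink.upsert(key, 5);
            sink.increment(key, 5);
        }
        System.out.println("upsert store:  " + sink.upsertStore.get(key));    // prints 5
        System.out.println("counter store: " + sink.counterStore.get(key));   // prints 10
    }
}
```

Keying the write by (user, window) is what makes the upsert safe: a replay regenerates the same key and overwrites it with the same value.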

How Flink Achieves Exactly-Once: Two-Phase Commit at the Sink

The trick is to make the sink's writes invisible until the checkpoint succeeds. The sink stages its writes inside an open database/Kafka transaction. If the checkpoint finishes, the transaction commits and downstream consumers see the writes. If the job crashes, the transaction is rolled back and the writes never existed — so the replay can safely re-do them. This "stage first, commit only after everyone else is also safe" dance is the classic two-phase commit (2PC) protocol from distributed databases, and Flink wraps it in a class called TwoPhaseCommitSinkFunction. Here's what happens during a checkpoint cycle:

Pre-commit phase: As events flow in, the sink writes them to a pending transaction (in Kafka: to a producer transaction that hasn't been committed yet; in JDBC: to a temp table or a transaction not yet committed). These writes are invisible to readers.
Barrier arrives: When the checkpoint barrier reaches the sink operator, the sink calls preCommit() — it flushes any buffered writes to the pending transaction but does not yet commit.
Snapshot: The sink snapshots its transaction handle (e.g., the Kafka producer transaction ID) into the checkpoint state.
Checkpoint completes: Once all operators have successfully snapshotted, the JobManager notifies the sink. The sink calls commit() — the Kafka producer transaction commits, records become visible.
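On crash before commit: The checkpoint didn't complete, so there is no record of the transaction. On restart, the sink calls abort() on any open transactions and starts fresh. The replayed events will be written into a new transaction that commits with the next successful checkpoint. If the crash instead happens after the checkpoint completed but before commit() ran, the transaction handle is part of the restored checkpoint, so the sink commits that transaction during recovery instead of aborting it.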
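The cycle above can be modelled with a toy sink in plain Java. This is a sketch of the protocol's shape only, not Flink's TwoPhaseCommitSinkFunction: lists stand in for transactions, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

final class TwoPhaseCommitSketch {
    private final List<String> committed = new ArrayList<>();   // visible to readers
    private List<String> pending = new ArrayList<>();           // open transaction
    private List<String> preCommitted = null;                   // sealed, awaiting checkpoint completion

    /** Normal processing: writes go into the open transaction and stay invisible. */
    void write(String record) {
        pending.add(record);
    }

    /** Barrier reached the sink: seal the current transaction and open a new one. */
    void preCommit() {
        preCommitted = pending;
        pending = new ArrayList<>();
    }

    /** Checkpoint completed everywhere: make the sealed transaction visible. */
    void commit() {
        if (preCommitted != null) {
            committed.addAll(preCommitted);
            preCommitted = null;
        }
    }

    /** Crash recovery without a completed checkpoint: discard everything uncommitted. */
    void abort() {
        preCommitted = null;
        pending = new ArrayList<>();
    }

    List<String> visible() { return new ArrayList<>(committed); }

    public static void main(String[] args) {
        TwoPhaseCommitSketch sink = new TwoPhaseCommitSketch();
        sink.write("a"); sink.write("b");
        sink.preCommit(); sink.commit();   // checkpoint 1 succeeds
        sink.write("c");
        sink.preCommit(); sink.abort();    // crash before checkpoint 2 completes
        sink.write("c");                   // replay after restore
        sink.preCommit(); sink.commit();   // checkpoint 2 succeeds on the retry
        System.out.println(sink.visible());   // prints [a, b, c]
    }
}
```

The replayed record "c" becomes visible once, because its first attempt never left the aborted transaction.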

What this SVG shows: the happy path (top) and crash path (bottom) through the two-phase commit cycle. On the happy path, events are buffered in a pending Kafka transaction, preCommit() flushes on the barrier, the transaction handle is snapshotted, and commit() fires when the checkpoint succeeds — making records visible. On the crash path, the job crashes after preCommit() but before the checkpoint completes — the open transaction is aborted, the job restores from the previous checkpoint, replays events, and the new transaction commits with the next successful checkpoint.

What this SVG shows: the honest picture of what exactly-once actually guarantees. Inside the Flink pipeline — source offsets, operator state, and a Kafka sink with 2PC — every event is counted exactly once. Outside the pipeline boundary — an email sent from a map operator, a payment API called from a flatMap, a non-idempotent database increment — exactly-once does NOT help. Those external calls need application-level idempotency: a unique event ID you check before processing, or a transaction ID that deduplicates at the receiver.
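For side effects outside the pipeline boundary, a unique event ID checked before acting is the usual remedy. A minimal plain-Java sketch, assuming each event carries a stable ID that survives replay; the class and its in-memory set are hypothetical stand-ins for a durable store at the receiver:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupingSenderSketch {
    private final Set<String> processedEventIds = new HashSet<>();   // durable store in a real system
    private final List<String> sent = new ArrayList<>();             // stand-in for the external effect

    /** Performs the side effect at most once per event ID; returns false for a replayed duplicate. */
    boolean send(String eventId, String message) {
        if (!processedEventIds.add(eventId)) {
            return false;   // already handled: skip the side effect
        }
        sent.add(message);
        return true;
    }

    int sentCount() { return sent.size(); }

    public static void main(String[] args) {
        DedupingSenderSketch sender = new DedupingSenderSketch();
        System.out.println(sender.send("evt-1", "Your order shipped"));   // prints true
        System.out.println(sender.send("evt-1", "Your order shipped"));   // prints false (replay)
        System.out.println(sender.sentCount());                           // prints 1
    }
}
```

In a real system the ID check and the side effect would need to be atomic at the receiver, otherwise a crash between the two reintroduces the duplicate.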

"Exactly-once" is often oversold. What frameworks mean by exactly-once is: within the checkpoint boundary, each event changes internal state exactly once and produces exactly one output to a participating sink. The moment your operator calls an external service (REST API, SMTP server, payment processor), that call is outside the checkpoint boundary and will be retried on replay. Design for idempotency first; rely on exactly-once as a secondary safety net.

Exactly-once semantics requires three cooperating pieces: a replayable source (Kafka offsets), checkpoint-participating state, and an idempotent or transactional sink. Flink implements exactly-once for Kafka sinks via two-phase commit — events write to a pending transaction, the transaction commits only when the checkpoint succeeds, and the transaction aborts on crash so replayed events start a fresh transaction. Exactly-once covers internal pipeline state; external side effects (API calls, emails, non-idempotent writes) require application-level idempotency keys.

Section 11

Backpressure, Throughput, and Latency Trade-offs

A streaming pipeline is only as fast as its slowest operator. If you have a DAG where the source produces 1 million events per second, an enrichment operator processes 800k per second, and a database-write sink handles only 200k per second — what happens to the other 800k events? They pile up somewhere. If that "somewhere" is an unbounded buffer, you'll run out of memory in minutes. If that "somewhere" is nowhere (events are dropped), you get data loss. The real answer is backpressure: the slow downstream operator signals to the upstream that it's overwhelmed, and upstream slows down to match.

Think of it like a highway on-ramp meter. When the highway (downstream) is congested, the meter light (backpressure signal) slows cars onto the highway (upstream producers). The cars don't disappear — they just wait at the meter. The total throughput is the same as the bottleneck; the backpressure mechanism just makes the pipeline degrade gracefully instead of crashing.

How Flink Implements Backpressure: Credit-Based Flow Control

Flink uses a credit-based flow control mechanism between operators. Each operator maintains a set of network buffers for receiving data. When an upstream operator wants to send data to downstream, it first checks how many "credits" (available buffers) the downstream has. If the downstream has no available buffers (it's processing at full capacity), the upstream pauses — it stops consuming from its input until credits become available.

This credit-based approach is elegant because it propagates backpressure all the way back through the DAG automatically. If the sink is slow, the aggregator behind it runs out of output buffers, which means it can't accept new input from the map operator, which means the source slows down reading from Kafka. The source's consumer group falls behind Kafka — events stay in Kafka rather than piling up in the pipeline. Kafka acts as the "pressure relief valve" for the whole system.
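A toy model of one sender-receiver link shows how credits cap the send rate at the receiver's processing rate. This is an illustrative plain-Java sketch with hypothetical names; the real mechanism runs over the network with asynchronous credit announcements:

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class CreditChannelSketch {
    private final Deque<Integer> receiverBuffers = new ArrayDeque<>();
    private int credits;   // free receiver buffers, as known to the sender

    CreditChannelSketch(int buffers) { this.credits = buffers; }

    /** Sender side: succeeds only if the receiver has a free buffer. */
    boolean trySend(int event) {
        if (credits == 0) return false;   // backpressure: the sender has to wait
        credits--;
        receiverBuffers.add(event);
        return true;
    }

    /** Receiver side: process one buffered event and grant the freed buffer back as a credit. */
    boolean processOne() {
        if (receiverBuffers.isEmpty()) return false;
        receiverBuffers.poll();
        credits++;
        return true;
    }

    int inFlight() { return receiverBuffers.size(); }

    /** Runs a tick-based simulation; returns {sent, processed, maxInFlight}. */
    static int[] simulate(int ticks, int offeredPerTick, int processedPerTick, int buffers) {
        CreditChannelSketch ch = new CreditChannelSketch(buffers);
        int sent = 0, processed = 0, maxInFlight = 0;
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < offeredPerTick; i++) {
                if (ch.trySend(sent)) sent++;
            }
            maxInFlight = Math.max(maxInFlight, ch.inFlight());
            for (int i = 0; i < processedPerTick; i++) {
                if (ch.processOne()) processed++;
            }
        }
        return new int[]{sent, processed, maxInFlight};
    }

    public static void main(String[] args) {
        // Sender offers 5 events per tick, receiver handles 1 per tick, 2 buffers.
        int[] r = simulate(10, 5, 1, 2);
        System.out.println("sent=" + r[0] + " processed=" + r[1] + " maxInFlight=" + r[2]);
    }
}
```

The sender's effective rate settles at the receiver's rate, and the number of buffered events never exceeds the buffer count, so nothing is dropped and memory stays bounded.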

What this SVG shows: a four-stage pipeline where the DB sink is the bottleneck at 200K events/sec. Credit-based backpressure propagates the constraint backwards through the DAG — each operator sees "no credits available from downstream" and slows its own output. The entire pipeline stabilises at 200K/sec. No events are lost — the excess events stay in Kafka, visible as consumer lag. This is the correct behaviour: consumer lag is a warning sign, not data loss.

Throughput vs Latency: The Inevitable Trade-off

Throughput and latency are in tension in every streaming system. High throughput requires batching: instead of processing each event one-at-a-time, the framework batches multiple events into a single network transfer or a single RocksDB write. Batching amortises the per-event overhead — one network round-trip for 1000 events instead of 1000 round-trips. But batching means events sit in a buffer waiting for the batch to fill up, which adds latency.
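A back-of-the-envelope model makes the trade-off concrete. The sketch below assumes each batch pays a fixed overhead plus a per-event cost, and that a batch waits until it is full; the cost figures (100 µs per batch, 1 µs per event, 100,000 events/sec arriving) are illustrative assumptions, not measurements:

```java
final class BatchingModelSketch {
    /** Maximum events/sec when a batch costs overheadUs + batchSize * perEventUs microseconds. */
    static double throughputPerSec(int batchSize, double overheadUs, double perEventUs) {
        return batchSize / (overheadUs + batchSize * perEventUs) * 1_000_000.0;
    }

    /** Time in ms the first event of a batch waits for the batch to fill. */
    static double fillLatencyMs(int batchSize, double arrivalsPerSec) {
        return batchSize / arrivalsPerSec * 1000.0;
    }

    public static void main(String[] args) {
        for (int batch : new int[]{1, 100, 1000, 2000}) {
            System.out.printf("batch=%d  throughput=%.0f/s  fill latency=%.2f ms%n",
                    batch,
                    throughputPerSec(batch, 100.0, 1.0),
                    fillLatencyMs(batch, 100_000.0));
        }
    }
}
```

Under these assumptions, going from a batch of 1,000 to 2,000 raises the throughput ceiling by under 5 percent while doubling the fill latency, which is the plateau-then-knee shape the trade-off produces.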

What this SVG shows: the throughput-latency curve as buffer size increases. Throughput rises quickly and then plateaus: doubling the buffer from 1000 to 2000 events barely increases throughput, but it doubles buffering latency. The "optimal knee" is the point where throughput is near its plateau but latency is still low. In Flink, the relevant knobs are the network buffer timeout (100 ms by default) and the network buffer size (32 KB by default); the defaults aim for this sweet spot.

Tuning Knobs for Production Pipelines

Parallelism per operator: the most important tuning knob. If the DB sink is the bottleneck at 200K/sec with parallelism=2, increasing to parallelism=8 can raise throughput towards 800K/sec, assuming the DB can handle it and the work scales linearly. Profile which operator is the bottleneck using Flink's Web UI or backpressure metrics, then increase that operator's parallelism.
RocksDB block cache size — For RocksDB state, the block cache is the in-memory read cache for the most frequently accessed state. Too small = constant disk reads = slow state access = backpressure. For a job with millions of active keys, 1–4 GB of block cache per task manager is typical. Configure via state.backend.rocksdb.block.cache-size.
Async I/O for external lookups — If your map operator calls an external database or REST API synchronously, each event waits for the I/O to complete before processing the next event. With 10ms average external lookup latency, a single-threaded operator can process at most 100 events/sec. Flink's AsyncFunction lets you keep 100+ requests in flight per operator, so the same 10ms lookup supports up to roughly 10,000 events/sec per operator instead of 100 (throughput ≈ requests in flight ÷ latency).
Network buffer count and timeout — taskmanager.network.memory.buffers-per-channel controls how many exclusive network buffers each channel gets. taskmanager.network.netty.sendReceiveBufferSize controls the socket buffer size per TCP connection. The buffer flush timeout (execution.buffer-timeout, default 100ms) is the maximum time an event waits in a network buffer before being sent downstream, which makes it the primary latency knob for sub-100ms pipelines.
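
The async I/O arithmetic in the list above follows from Little's law: sustainable throughput is roughly the number of requests in flight divided by the latency of each request. The helper below is a hypothetical back-of-envelope calculator, not part of any Flink API.

```java
/** Little's-law estimate: throughput = requests in flight / latency per request. */
final class AsyncLookupThroughputSketch {

    /** Upper bound on events per second for one operator instance. */
    static double maxEventsPerSec(int inFlightRequests, double lookupLatencyMillis) {
        if (inFlightRequests < 1 || lookupLatencyMillis <= 0) {
            throw new IllegalArgumentException("need at least one request and a positive latency");
        }
        return inFlightRequests * 1000.0 / lookupLatencyMillis;
    }

    public static void main(String[] args) {
        System.out.println("sync,  10ms lookup: " + maxEventsPerSec(1, 10.0) + " events/sec");
        System.out.println("async, 100 in flight: " + maxEventsPerSec(100, 10.0) + " events/sec");
    }
}
```

The bound ignores the capacity of the external system itself; in practice the lookup service's own limits usually cap the gain well before the operator does.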

Kafka Streams backpressure is different. Kafka Streams is a library inside your application — it doesn't have Flink-style network buffers between operators. Instead, Kafka itself is the backpressure mechanism: if a Kafka Streams app falls behind, the consumer lag grows in Kafka. There's no explicit credit signalling; the processing loop simply reads a batch, processes it, commits offsets, reads the next batch. The "backpressure" is implicit — slow processing = slower Kafka poll = higher consumer lag. Operational monitoring therefore focuses on consumer lag (measured in records behind the log end, or as the equivalent time lag) rather than buffer occupancy.

Backpressure prevents fast upstream operators from overwhelming slow downstream ones — in Flink via credit-based flow control where operators signal how many network buffers they have available, causing upstream to pause when downstream is full. The whole pipeline runs at the bottleneck operator's rate; excess events accumulate in Kafka as consumer lag. Throughput and latency trade off via buffer size — larger buffers increase throughput (batching amortises per-event overhead) but increase latency. The primary tuning knobs are parallelism per operator, RocksDB block cache size, async I/O for external lookups, and network buffer flush timeout.

Section 12

Framework Comparison — Flink vs Kafka Streams vs Spark Structured Streaming vs Beam

Four major options dominate the stream processing landscape, and they make completely different architectural bets. Choosing the wrong one forces painful migrations. Understanding each option's core model — not just its feature list — is what lets you make the right call for your situation.

The key insight before the comparison: each framework's model determines what it's good at. Flink's event-at-a-time model makes it great at sub-second latency and complex stateful logic, but it requires a separate cluster to run. Kafka Streams' library model makes it trivially deployable but tightly couples you to Kafka. Spark Structured Streaming's micro-batch model gives great throughput and batch/stream unification but sacrifices sub-second latency. Beam's abstraction model lets you write once and run on any backend, but adds an indirection layer that can obscure performance.

What this SVG shows: the four frameworks positioned on a latency vs state-complexity chart. Flink occupies the low-latency, high-state-complexity corner — it's the choice when you need millisecond responses and complex stateful logic. Kafka Streams sits in the low-to-medium latency, simpler-deployment zone — when you're already in a Kafka-first microservices world and don't want to run a separate cluster. Spark Structured Streaming occupies the higher-latency, batch-stream-unification zone — when you already run Spark for batch jobs and want to reuse the same code for streaming. Beam's dashed rectangle overlaps the other frameworks' regions because Beam is not a runtime: it's a programming model that executes on a runner such as Flink, Spark, or Dataflow (there is no Kafka Streams runner), so its position on the chart depends on the runner you pick.

The Four Frameworks in Depth

Apache Flink — The Production Stateful Streaming Workhorse

Model: True event-at-a-time streaming. Each event is processed individually as it arrives — not batched. The operator network runs continuously, with state persisted in RocksDB on each worker node.

Runtime: Separate cluster — a JobManager (coordinator) and one or more TaskManagers (workers). Deployable on Kubernetes, YARN, or standalone. This is Flink's biggest operational cost compared to Kafka Streams — you're running an extra cluster.

Latency: Sub-second, typically 1–10ms end-to-end for simple pipelines. Complex stateful pipelines (joins, session windows) add a few milliseconds of state access overhead. Checkpointing adds a brief pause (usually <5ms with incremental checkpoints on fast storage).

State model: First-class. Keyed state (ValueState, ListState, MapState, AggregatingState) stored in RocksDB. Supports both event-time and processing-time. State TTL. Exactly-once via two-phase commit sinks. Savepoints for upgrades.

Exactly-once: Yes — Kafka sources + Kafka/JDBC sinks participate in checkpoint-based two-phase commit.

Languages: Java (primary), Scala, Python (PyFlink). SQL via Flink SQL / Table API.

When to choose Flink: You need sub-second latency + complex stateful logic (session windows, stream-stream joins, ML feature engineering). You're processing at scale where Kafka Streams' library model becomes hard to operationalise. Used by Netflix (anomaly detection), Uber (real-time analytics), Stripe (fraud prevention), Alibaba (ecommerce analytics).

Kafka Streams — The Zero-Cluster Library

Model: Event-at-a-time (same model as Flink) but implemented as a Java library that runs inside your existing application process. There is no separate stream processing cluster. Your Spring Boot service IS the stream processor.

Runtime: Embedded in your application. You add a dependency, write a Topology, and call streams.start(). Kafka itself handles parallelism and failover — consumer group rebalancing reassigns partitions across multiple instances of your service. Scale by deploying more pods.

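Latency: Sub-second — similar to Flink for simple operations. Kafka Streams processes records in the batches returned by each consumer poll. The poll.ms setting (default 100ms) is the maximum time a poll blocks when no data is available, not a fixed delay, so under steady traffic end-to-end latency is typically tens of milliseconds plus processing time.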

State model: RocksDB-backed state stores (key-value, window, and session stores) that back KTables and aggregations. Each store is backed by an internal Kafka changelog topic; on restart, any missing local state is rebuilt from the changelog (which can take minutes for large state, unlike Flink's checkpoint restore). State can also be queried in place by other services via Kafka Streams' Interactive Queries API.

Exactly-once: Yes — via Kafka transactions (processing.guarantee=exactly_once_v2). Requires Kafka broker ≥2.5. Slightly lower throughput than at-least-once due to transaction overhead.

Languages: JVM only. Java is the primary API; Kotlin works through Java interop and Scala through the official kafka-streams-scala wrapper. Kafka Streams DSL (high-level) or Processor API (low-level).

When to choose Kafka Streams: Your event bus is Kafka, your services are JVM-based, and you don't want the operational overhead of a separate Flink cluster. Best for enrichment, filtering, joining with KTables, and lightweight aggregations within a microservices architecture. Not ideal if your state exceeds a few GB (the changelog rebuild on restart is painful at scale).

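// Kafka Streams DSL: join the order stream with the customer KTable, then count orders per product.
// Order, Customer, EnrichedOrder, orderSerde, customerSerde, enrichedOrderSerde and props
// are assumed to be defined elsewhere in the application.
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();
KTable<String, Customer> customers = builder.table("customers-topic");   // keyed by customer ID
KStream<String, Order> orders = builder.stream("orders-topic");          // must share that key

orders
    .join(customers,                                                     // stream-table join
          (order, customer) -> new EnrichedOrder(order, customer),
          Joined.with(Serdes.String(), orderSerde, customerSerde))
    .groupBy((k, enriched) -> enriched.productId,                        // rekey by product (repartitions)
             Grouped.with(Serdes.String(), enrichedOrderSerde))
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count()                                                             // count per product per 5-min window
    .toStream()
    // The key is now Windowed<String>; flatten it so a plain String serde can write it.
    .map((windowed, count) ->
        KeyValue.pair(windowed.key() + "@" + windowed.window().start(), count))
    .to("product-order-counts", Produced.with(Serdes.String(), Serdes.Long()));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();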

Walkthrough: the builder declares two inputs — a KTable (a continuously updated lookup table keyed by customer ID) and a KStream (the live flow of orders). The .join() enriches each order with the latest customer record. The .groupBy() re-keys by product ID, then a 5-minute tumbling window counts orders per product, and the result is written back to a Kafka topic. The whole "topology" runs inside your service process — there is no separate cluster to babysit.

Spark Structured Streaming — The Batch-Stream Unified API

Model: Micro-batch — Spark collects incoming events into small batches (configurable trigger interval, minimum ~100ms) and processes each batch as a mini-DataFrame batch job. True event-at-a-time processing is not the model; you always process a micro-batch.

Runtime: Separate Spark cluster (Spark driver + executors). Deployable on Databricks, EMR, GKE, or standalone. The same cluster that runs your batch Spark jobs also runs Spark Structured Streaming jobs — which is the killer feature if you already have a Spark cluster and team.

Latency: 100ms to a few seconds. The micro-batch interval is the minimum latency — you can't get sub-100ms because you're waiting for a micro-batch to accumulate. For most analytics use cases (dashboards, hourly reports, near-realtime ML feature pipelines), this is completely fine.

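State model: Stateful operations supported via mapGroupsWithState / flatMapGroupsWithState (arbitrary state) or built-in windowed aggregations. By default, state lives in executor memory and is versioned to the checkpoint location (HDFS/S3); Spark 3.2+ also ships a RocksDB state store provider for state that doesn't fit on the heap. Fault tolerance comes from a write-ahead log of offsets plus those state checkpoints. State is generally less flexible than Flink's: there is no declarative per-state TTL, so expiry is handled through watermarks on built-in aggregations or GroupStateTimeout in the arbitrary-state APIs.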

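Exactly-once: Yes — the micro-batch model provides natural exactly-once via batch atomicity. Each micro-batch either fully commits or fully rolls back. End-to-end exactly-once additionally needs a replayable source (Kafka qualifies) and an idempotent or transactional sink. Spark's built-in Kafka sink is at-least-once, so duplicates are possible there unless downstream consumers deduplicate.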
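
One common way to get an idempotent sink on top of micro-batches is to record the last committed batch ID alongside the data and skip any batch that is replayed after a failure. The sketch below is a plain-Java toy of that idea; the class name, the in-memory map, and the method signatures are assumptions for illustration and are not Spark APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sink that makes replayed micro-batches harmless by remembering the last committed batch ID. */
final class IdempotentBatchSinkSketch {
    private final Map<String, Long> totals = new HashMap<>();
    private long lastCommittedBatchId = -1;

    /** Applies the batch as one unit unless it was already committed; returns true if applied. */
    synchronized boolean writeBatch(long batchId, Map<String, Long> increments) {
        if (batchId <= lastCommittedBatchId) {
            return false;                                   // replay after a failure: already applied
        }
        Map<String, Long> staged = new HashMap<>(totals);   // stage changes so the batch is all-or-nothing
        for (Map.Entry<String, Long> e : increments.entrySet()) {
            staged.merge(e.getKey(), e.getValue(), Long::sum);
        }
        totals.clear();
        totals.putAll(staged);
        lastCommittedBatchId = batchId;  // a real store would commit data and batch ID in one transaction
        return true;
    }

    synchronized long total(String key) {
        return totals.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        IdempotentBatchSinkSketch sink = new IdempotentBatchSinkSketch();
        Map<String, Long> batch = new HashMap<>();
        batch.put("orders", 3L);
        System.out.println("first write applied:  " + sink.writeBatch(0, batch));
        System.out.println("replayed write applied: " + sink.writeBatch(0, batch));
        System.out.println("orders total: " + sink.total("orders"));
    }
}
```

The design choice that matters is that the batch ID and the data change together: if they are committed separately, a crash between the two reintroduces duplicates.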

Languages: Python (PySpark), Scala, Java, SQL. The DataFrame API means you can write streaming jobs in the same Python/SQL you use for batch — a huge productivity win if your team is Python-first.

When to choose Spark Structured Streaming: You already run Spark for batch jobs and want to reuse the same API, same cluster, same team skills. Your latency requirement is 100ms–1s and not sub-100ms. You want Python or SQL as your primary language. You do a lot of joins between batch datasets and streaming data (broadcast joins work especially well in Spark).

Apache Beam — The Write-Once, Run-Anywhere Abstraction

Model: A programming model and SDK, not a runtime. You write a Beam pipeline using the Beam SDK (Python, Java, Go, SQL). Beam compiles your pipeline to run on any supported runner: Apache Flink, Apache Spark, Google Cloud Dataflow (managed), or the Direct runner (for local testing). The Beam model unifies batch and streaming into one API — a batch job is just a streaming job with a bounded source.

Runtime: The runner you choose. Using Beam doesn't save you from choosing a runtime — it defers the choice. A Beam pipeline on the Flink runner runs in a Flink cluster. A Beam pipeline on Dataflow runs in Google's managed infrastructure.

Latency / throughput: Completely determined by the runner. Beam on Flink = Flink's latency. Beam on Spark = Spark's latency. Beam itself adds a modest overhead (the translation layer), but in practice it's negligible compared to I/O and state access.

Exactly-once: Supported on runners that support it. Beam on Flink inherits Flink's two-phase commit. Beam on Dataflow inherits Dataflow's exactly-once guarantees.

Languages: Java, Python, Go, SQL. Python support via the Python SDK + cross-language transforms.

When to choose Beam: You want to write once and be able to switch runners (e.g., start on Dataflow for managed simplicity, migrate to Flink for cost control). You use Google Cloud and want Dataflow's managed autoscaling without writing Flink jobs directly. You need a unified batch+stream programming model where the same pipeline code runs in both modes.

The real cost of Beam: The abstraction layer means you can't always use runner-specific features (e.g., Flink's advanced state APIs, Spark's broadcast joins). Debugging is harder — your code runs on Flink, but your mental model is Beam, and mapping errors between the two is non-trivial. Beam is the right bet if portability matters; if you're 100% committed to one runner, writing native Flink or Spark code gives more control.

At a Glance: 4-Way Comparison Table

Property	Apache Flink	Kafka Streams	Spark Structured Streaming	Apache Beam
Processing model	Event-at-a-time (true streaming)	Event-at-a-time (library)	Micro-batch (100ms–1s batches)	Programming model (runner decides)
Runtime	Separate Flink cluster	Library inside your app (no cluster)	Separate Spark cluster	Depends on runner (Flink/Spark/Dataflow)
Latency	1–10ms	10–50ms	100ms – a few seconds	Depends on runner
State model	RocksDB / in-memory; keyed + operator state; TTL; savepoints	RocksDB state stores behind KTables; rebuilt from Kafka changelog	In-memory by default, RocksDB provider since Spark 3.2; checkpointed; less flexible than Flink	Inherits runner's state model
Exactly-once	Yes (2PC sinks + checkpoints)	Yes (Kafka transactions)	Yes (micro-batch atomicity + idempotent or transactional sink)	Runner-dependent
Languages	Java, Scala, Python, SQL	Java, Kotlin, Scala (JVM only)	Python, Scala, Java, SQL	Java, Python, Go, SQL
Best for	Complex stateful ops, sub-second latency, large-scale	Kafka-first microservices, no extra cluster wanted	Already-Spark teams, Python/SQL, batch+stream unification	Runner portability, Google Cloud Dataflow, write-once
Used by	Netflix, Uber, Stripe, Alibaba	Confluent customers, microservice-heavy orgs	Databricks users, data-warehouse-centric orgs	Google Cloud users, teams needing multi-cloud portability

What this SVG shows: a decision tree for picking a streaming framework. Start at the top: if you're Kafka-native and want no extra cluster, Kafka Streams. If you already have Spark or need Python/SQL as your primary language, Spark Structured Streaming. If you need to deploy on multiple cloud runtimes or want Dataflow, Beam. If none of those apply — you need sub-second latency, complex stateful ops, and don't mind running a separate cluster — Apache Flink.
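
The decision tree above can be written down as a small function, which makes the order of the questions explicit. The enum, method, and parameter names below are hypothetical; the branch order simply mirrors the description.

```java
/** The framework decision tree as code; the questions are asked in the order the tree asks them. */
final class FrameworkChooserSketch {

    enum Framework { KAFKA_STREAMS, SPARK_STRUCTURED_STREAMING, BEAM, FLINK }

    static Framework choose(boolean kafkaNativeAndNoExtraCluster,
                            boolean hasSparkOrNeedsPythonSql,
                            boolean needsRunnerPortabilityOrDataflow) {
        if (kafkaNativeAndNoExtraCluster) {
            return Framework.KAFKA_STREAMS;
        }
        if (hasSparkOrNeedsPythonSql) {
            return Framework.SPARK_STRUCTURED_STREAMING;
        }
        if (needsRunnerPortabilityOrDataflow) {
            return Framework.BEAM;
        }
        // None of the above: sub-second latency, complex state, separate cluster acceptable.
        return Framework.FLINK;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, false, false));
    }
}
```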

The most common correct choice is Kafka Streams for microservices and Flink for dedicated analytics pipelines. At companies that have grown past a certain scale, it's common to run both: Kafka Streams handles per-service stream enrichment (each team owns their own topology, embedded in their service), while Flink handles cross-service analytics where you need complex joins across multiple Kafka topics and sub-second latency dashboards. Spark Structured Streaming enters when you have a Databricks-heavy data platform. Beam enters when your platform team wants to abstract over cloud providers.

The four major stream processing frameworks serve different architectural contexts. Flink is the choice for complex stateful pipelines with sub-second latency requirements, running as a separate cluster. Kafka Streams is a JVM library embedded in your microservice, zero extra cluster, tightly Kafka-coupled. Spark Structured Streaming uses a micro-batch model (100ms–1s latency) and shines when you already run Spark and want batch-stream API unification. Apache Beam is a programming model, not a runtime — it defers the engine choice to a runner (Flink, Spark, or Dataflow), trading maximum runner-specific control for portability.

Section 13

Common Stream Processing Use Cases — Where This Actually Lives in Production

Knowing how stream processing works is only half the job. Knowing which problem calls for it — and how to wire the pieces together for that specific domain — is what makes the difference between a design that ships and one that stays on a whiteboard. The six use cases below are the ones you will encounter in real production systems. Each one has a different reason for needing streaming, a different state model, and a different set of traps for the unwary.

The diagram above shows the fundamental pattern behind fraud detection: a transaction event arrives in Kafka, a Flink job enriches it asynchronously from Redis (user-profile features like historical velocity, typical spend, location), runs an ML scoring function, and emits a decision — all before the payment settles. The async Redis enrichment is critical: a synchronous database call in the hot path costs ~20 ms of operator time per event, whereas async I/O overlaps many lookups so the amortised cost drops to ~0.5 ms per event without blocking the operator thread.

(a) Real-Time Fraud Detection — Score Transactions Before They Settle

Why streaming? Payment settlement windows are 1–10 seconds. A batch job running every hour is 360–3,600× too slow. The fraud decision must happen in-flight or not at all.

Architecture pattern: Kafka topic receives payment events → Flink job keys by user_id → async enrichment from Redis feature store (velocity counts, typical merchant categories, device fingerprint) → ML scoring (usually a pre-loaded ONNX model in the operator) → decision emitted back to Kafka. A stream-table join is used here: the "stream" side is the live transaction events; the "table" side is the user-profile Redis store, which is itself updated by a separate stream reading account activity events.

State model: Each operator maintains a per-user rolling window of recent transactions (last 60 seconds, last 5 minutes, last hour) as keyed state. State TTL is set to 2 hours to cover the longest velocity window. Checkpoint interval is 30 seconds because state recovery must be fast — a crash that causes a 90-second gap in fraud scoring is unacceptable.
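
A per-user rolling window of this kind can be sketched with a deque of timestamps per key. This is a plain-Java toy, not Flink keyed state: the class and method names are hypothetical, events are assumed to arrive in timestamp order per user, and eviction at the one-hour mark stands in for state TTL.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Toy per-user velocity state: recent transaction timestamps, evicted past the longest window. */
final class VelocityStateSketch {
    static final long LONGEST_WINDOW_MS = 60 * 60 * 1000L; // one hour

    private final Map<String, Deque<Long>> txTimesByUser = new HashMap<>();

    /** Records a transaction and evicts entries that fall outside the longest window. */
    void record(String userId, long eventTimeMs) {
        Deque<Long> times = txTimesByUser.computeIfAbsent(userId, k -> new ArrayDeque<>());
        times.addLast(eventTimeMs);
        while (!times.isEmpty() && times.peekFirst() <= eventTimeMs - LONGEST_WINDOW_MS) {
            times.removeFirst();
        }
    }

    /** Counts the user's transactions with timestamps in (nowMs - windowMs, nowMs]. */
    int countWithin(String userId, long nowMs, long windowMs) {
        Deque<Long> times = txTimesByUser.get(userId);
        if (times == null) {
            return 0;
        }
        int count = 0;
        Iterator<Long> newestFirst = times.descendingIterator();
        while (newestFirst.hasNext()) {
            long t = newestFirst.next();
            if (t <= nowMs - windowMs) {
                break;              // everything older is outside the window too
            }
            if (t <= nowMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        VelocityStateSketch state = new VelocityStateSketch();
        state.record("user-1", 0L);
        state.record("user-1", 290_000L);
        state.record("user-1", 300_000L);
        System.out.println("last 60s: " + state.countWithin("user-1", 300_000L, 60_000L));
    }
}
```

One deque serves all three velocity windows (60 seconds, 5 minutes, 1 hour), so the state only needs to be sized for the longest one.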

The trap everyone falls into: Synchronous external lookups inside the operator. If you call redisClient.get(userId) synchronously, you block the operator thread while waiting for a network round trip. One slow Redis node can cascade into a full pipeline stall. In Flink, always use async I/O (AsyncFunction) so the operator thread can process other events while waiting for the enrichment response. Kafka Streams has no built-in async operator, so the usual alternative there is to bring the profile data into Kafka and join against a KTable or GlobalKTable instead of calling Redis per event.

The business value here is in the two-second window before settlement. Every millisecond of latency reduction is money: blocking a fraudulent $500 transaction costs nothing; missing it costs $500 plus chargeback fees plus operational overhead. Latency is literally revenue.

(b) Real-Time Recommendations — Keep Feature Vectors Current as Events Arrive

Why streaming? User preferences change in minutes, not days. A user who just watched three thriller movies is in a different mood than they were yesterday. A recommendation engine trained on yesterday's batch data misses this signal entirely.

Architecture pattern: Event stream (clicks, watches, purchases, ratings) → Flink job computes rolling user-feature vectors (genre preferences, recency-weighted engagement scores, collaborative filtering signals) → writes updated feature vectors to a low-latency feature store (Redis, Feast, DynamoDB) → recommendation service reads from the feature store at query time. The stream processor never serves recommendations directly — it keeps the feature store fresh. The serving layer is stateless HTTP.

State model: Exponentially-weighted moving averages work well here. New signals are given high weight; signals from last week decay. This avoids unbounded state growth — you don't store every click ever, just the current weighted vector. State size stays constant per user regardless of how long they've been a customer.
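
A minimal sketch of such a constant-size feature vector follows. It decays per event rather than by elapsed time, which is a simplifying assumption; the class name and the choice of alpha are hypothetical.

```java
/** Fixed-size user feature vector updated as an exponentially-weighted moving average. */
final class EwmaFeatureSketch {
    private final double[] vector;
    private final double alpha;   // weight of the newest signal, in (0, 1]

    EwmaFeatureSketch(int dimensions, double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.vector = new double[dimensions];
        this.alpha = alpha;
    }

    /** Blends a new signal into the vector; older signals decay by (1 - alpha) on each update. */
    void update(double[] signal) {
        if (signal.length != vector.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        for (int i = 0; i < vector.length; i++) {
            vector[i] = alpha * signal[i] + (1.0 - alpha) * vector[i];
        }
    }

    double[] snapshot() {
        return vector.clone();
    }

    public static void main(String[] args) {
        EwmaFeatureSketch prefs = new EwmaFeatureSketch(2, 0.5); // [thriller, comedy]
        prefs.update(new double[] {1.0, 0.0});
        prefs.update(new double[] {1.0, 0.0});
        prefs.update(new double[] {0.0, 1.0});
        double[] v = prefs.snapshot();
        System.out.println("thriller=" + v[0] + " comedy=" + v[1]);
    }
}
```

State per user is one array of doubles regardless of how many events have been seen, which is the property the paragraph above relies on.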

The trap: Writing to the feature store on every single event. If a user generates 50 events per session, you'd do 50 Redis writes per user per session. Instead, use a tumbling window of 10–30 seconds to batch feature updates — one write per window per active user. Latency penalty is negligible for recommendation quality; write amplification reduction is dramatic.

(c) Real-Time Observability and Alerting

Why streaming? An alert that fires 20 minutes after a production incident is nearly useless. Observability requires sub-minute aggregation to catch P95 latency spikes, error rate surges, and CPU saturation before they become full outages.

Architecture pattern: Application agents emit structured log lines and metrics into Kafka → a stream processor (Flink, or purpose-built pipelines inside tools like Datadog Agent) computes per-service, per-endpoint aggregates (request count, error count, P50/P95/P99 latency) over 1-minute and 5-minute tumbling windows → results land in a time-series database (InfluxDB, Prometheus remote write, Datadog metrics API) → dashboards and alert rules query from there.

Scale note: A service emitting 10,000 requests/second generates 10,000 log lines/second, or 600,000 per minute. Computing a per-minute P95 exactly means buffering and sorting all 600,000 values for every service and endpoint, which does not scale across a fleet, so you need a streaming reducer. The standard approach is reservoir sampling or a t-digest, both of which compute approximate percentiles in bounded memory.
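
Reservoir sampling is the simpler of the two to sketch. The example below uses the classic Algorithm R with a fixed-size sample and a nearest-rank percentile over that sample; the class name, the capacity of 1,000, and the seed are assumptions for illustration. A t-digest would give tighter tail estimates but needs a library.

```java
import java.util.Arrays;
import java.util.Random;

/** Approximate percentiles in constant memory via reservoir sampling (Algorithm R). */
final class ReservoirPercentileSketch {
    private final double[] reservoir;
    private final Random random;
    private long seen = 0;

    ReservoirPercentileSketch(int capacity, long seed) {
        this.reservoir = new double[capacity];
        this.random = new Random(seed);
    }

    /** After n values, each one is in the reservoir with probability capacity / n. */
    void add(double value) {
        seen++;
        if (seen <= reservoir.length) {
            reservoir[(int) (seen - 1)] = value;             // fill phase
        } else {
            long slot = (long) (random.nextDouble() * seen); // uniform in [0, seen)
            if (slot < reservoir.length) {
                reservoir[(int) slot] = value;               // replace a random resident
            }
        }
    }

    int sampleSize() {
        return (int) Math.min(seen, reservoir.length);
    }

    /** Nearest-rank percentile (0 < p <= 100) computed over the sample only. */
    double percentile(double p) {
        int n = sampleSize();
        if (n == 0) {
            throw new IllegalStateException("no samples yet");
        }
        double[] sorted = Arrays.copyOf(reservoir, n);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * n);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        ReservoirPercentileSketch sketch = new ReservoirPercentileSketch(1000, 42L);
        for (int latency = 1; latency <= 600_000; latency++) {
            sketch.add(latency);   // one minute of values at 10,000/sec
        }
        System.out.println("approx P95 = " + sketch.percentile(95.0));
    }
}
```

Memory stays at 1,000 doubles no matter how many values pass through; the price is sampling error, which shrinks as the reservoir grows.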

The trap: Using the wall clock at the processor (processing-time windows) instead of the event's own timestamp (event-time windows) for SLO calculations. If a service sends metrics with a 30-second delay (buffering), a processing-time window shows a false dip followed by a spike. Event-time windows with a 60-second watermark give you the correct picture at the cost of a 60-second output delay — still well within SLO alert requirements.
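
The difference between the two clocks comes down to which timestamp is used to pick the window. The sketch below assigns a tumbling window from a timestamp and applies a simplified watermark rule (close the window once the highest event time seen, minus the allowed delay, reaches the window end). The names are hypothetical and the closing rule is a simplification of what real engines do.

```java
/** Tumbling-window assignment by timestamp, plus a simplified watermark check. */
final class EventTimeWindowSketch {

    /** Start of the tumbling window that contains the given timestamp. */
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    /** True once the watermark (max event time seen minus the delay) has reached the window end. */
    static boolean windowClosed(long windowStartMs, long windowSizeMs,
                                long maxEventTimeSeenMs, long watermarkDelayMs) {
        long watermark = maxEventTimeSeenMs - watermarkDelayMs;
        return watermark >= windowStartMs + windowSizeMs;
    }

    public static void main(String[] args) {
        long eventTimeMs = 45_000L;    // when the request actually happened
        long arrivalTimeMs = 75_000L;  // when the processor saw it, 30 seconds later
        long oneMinute = 60_000L;
        System.out.println("event-time window starts at      " + windowStart(eventTimeMs, oneMinute));
        System.out.println("processing-time window starts at " + windowStart(arrivalTimeMs, oneMinute));
    }
}
```

The same metric lands in different one-minute buckets depending on the clock, which is how a 30-second reporting delay turns into a false dip followed by a spike under processing-time windows.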

The CDC diagram shows the most powerful use of stream processing for data integration. Debezium reads PostgreSQL's write-ahead log (WAL) — the same log PostgreSQL uses for crash recovery — and emits every INSERT, UPDATE, and DELETE as a structured event to Kafka. The Flink job consumes those events and materialises them into multiple derived stores simultaneously: Elasticsearch for full-text search, Redis for fast cache lookups, and a data warehouse for analytics. Every downstream store is updated within seconds of the source write. No nightly ETL job. No stale search indexes. No cache that's hours out of date.

(d) CDC-Driven Data Integration — Keep All Derived Stores in Sync

Why streaming? The alternative — polling the source database every N minutes for changes — is slow, expensive (full table scans), and often forbidden on production databases. CDC reads the WAL, which is already being written for crash recovery. It adds very little load on the source (mainly a replication slot and some extra WAL retention) and delivers every change in commit order.

Architecture pattern: Kafka Connect + Debezium connector reads the Postgres WAL → emits change events to Kafka → a Flink job (or simply Kafka Connect sink connectors) materialises them into derived stores. Each derived store gets its own consumer group, consuming from the same Kafka topic independently, so a slow Elasticsearch indexing job cannot block Redis cache updates.

The ordering guarantee: CDC events for the same primary key come in WAL order. This means a row that is INSERTed then UPDATEd twice then DELETEd will appear in that exact order in Kafka. The Flink job must handle "upsert" semantics — write or overwrite on INSERT/UPDATE, delete on DELETE. If you accidentally process events out of order (e.g., by parallelising without keying on primary key), you can apply an UPDATE before its INSERT and corrupt the derived store.
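
The upsert rule and the damage done by reordering can both be shown with a map standing in for the derived store. The Change class and Op enum below are hypothetical stand-ins for a Debezium event, reduced to the three fields the rule needs.

```java
import java.util.HashMap;
import java.util.Map;

/** Applies CDC change events to a derived key-value store with upsert semantics. */
final class CdcUpsertSketch {

    enum Op { INSERT, UPDATE, DELETE }

    /** Minimal stand-in for a change event: operation, primary key, and new row image. */
    static final class Change {
        final Op op;
        final String primaryKey;
        final String row;   // ignored for DELETE
        Change(Op op, String primaryKey, String row) {
            this.op = op;
            this.primaryKey = primaryKey;
            this.row = row;
        }
    }

    /** INSERT and UPDATE both overwrite; DELETE removes. */
    static void apply(Map<String, String> store, Change change) {
        if (change.op == Op.DELETE) {
            store.remove(change.primaryKey);
        } else {
            store.put(change.primaryKey, change.row);
        }
    }

    public static void main(String[] args) {
        Map<String, String> inOrder = new HashMap<>();
        apply(inOrder, new Change(Op.INSERT, "42", "v1"));
        apply(inOrder, new Change(Op.UPDATE, "42", "v2"));
        apply(inOrder, new Change(Op.DELETE, "42", null));
        System.out.println("in order:     " + inOrder);

        Map<String, String> reordered = new HashMap<>();
        apply(reordered, new Change(Op.INSERT, "42", "v1"));
        apply(reordered, new Change(Op.DELETE, "42", null));
        apply(reordered, new Change(Op.UPDATE, "42", "v2")); // arrives late: resurrects the row
        System.out.println("out of order: " + reordered);
    }
}
```

In the reordered case the deleted row reappears in the derived store, which is why all events for one primary key need to flow through the same partition and the same operator instance.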

The trap: Not handling schema evolution. When a developer adds a column to the source table, the Debezium event schema changes. If your Flink job has a hard-coded deserialization schema, it will crash. Use a Schema Registry (Confluent, AWS Glue) with forward-compatible Avro or Protobuf schemas so the stream processor can handle old and new schemas simultaneously during rolling deployments.

(e) Real-Time Inventory and Pricing — Consistency Across Regions in Seconds

Why streaming? A retailer with a warehouse management system and an e-commerce front end needs stock counts to be consistent within seconds — not minutes, not hours. "Add to cart" for the last item must not succeed for two customers simultaneously. Stream processing keeps a running tally of reservations and releases, decrementing available stock in real time.

Architecture pattern: Inventory events (reserved, released, fulfilled, received) → Kafka → Flink job with per-SKU keyed state maintaining available count → sink to a low-latency store readable by the checkout service. Pricing events (promotion start, promotion end, competitor price match) flow through a parallel pipeline that writes to the same store with a different key prefix.

The challenge: Exactly-once guarantees matter acutely here. A miscount caused by duplicated inventory events could oversell stock (a duplicated release) or hide stock that is actually available (a duplicated reservation). The Flink + Kafka exactly-once transaction protocol (described in Section 10) is the correct solution — the checkpoint barrier + Kafka transactional producer ensures each inventory event affects the count exactly once even across failures.

(f) IoT and Sensor Analytics — Aggregating Millions of Device Readings

Why streaming? A factory floor with 50,000 temperature sensors each emitting a reading every second generates 50,000 events/second. You cannot store all of these in a queryable OLTP database — storage and write throughput would be prohibitive. Stream processing aggregates them (average, min, max, anomaly detection) in real time, storing only the aggregated results and high-value anomaly events.

Architecture pattern: MQTT broker or Kafka for device ingestion → Flink with per-device keyed state and tumbling windows → anomaly detection (compare current reading against a rolling baseline, flag if deviation > 3σ) → alert routing to on-call teams + archival of raw events to cold storage (S3) for later analysis.
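
The 3σ check in that pipeline can be kept in constant memory per device with Welford's online mean and variance. The sketch below uses a cumulative baseline rather than a rolling one and folds every reading, anomalous or not, into the baseline; both are simplifying assumptions, and the class name is hypothetical.

```java
/** Per-device 3-sigma anomaly check using Welford's online mean and variance. */
final class SigmaAnomalySketch {
    private long count;
    private double mean;
    private double sumSquaredDiffs;   // running sum of squared deviations from the mean

    /** Tests the reading against the baseline so far, then folds it into the baseline. */
    boolean isAnomalyThenUpdate(double reading) {
        boolean anomaly = false;
        if (count >= 2) {
            double stdDev = Math.sqrt(sumSquaredDiffs / (count - 1));
            anomaly = stdDev > 0.0 && Math.abs(reading - mean) > 3.0 * stdDev;
        }
        count++;
        double delta = reading - mean;
        mean += delta / count;
        sumSquaredDiffs += delta * (reading - mean);
        return anomaly;
    }

    public static void main(String[] args) {
        SigmaAnomalySketch sensor = new SigmaAnomalySketch();
        double[] readings = {20.0, 20.5, 19.5, 20.2, 19.8, 35.0};
        for (double r : readings) {
            System.out.println(r + " anomalous? " + sensor.isAnomalyThenUpdate(r));
        }
    }
}
```

A production version would more likely exclude flagged readings from the baseline or decay old readings, so one fault does not widen the tolerance for the next.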

Session windows shine here: A sensor that goes offline and comes back should produce a new "session" of readings, not be merged with readings from two hours ago. Session windows with a 5-minute inactivity gap naturally handle device reconnects and offline periods without any special-casing in application code.
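
Session assignment is easy to see on a sorted list of timestamps: a new session starts whenever the gap since the previous reading reaches the inactivity threshold. The helper below is a plain-Java sketch with hypothetical names; a streaming engine does the same thing incrementally by merging windows as readings arrive.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits one device's sorted reading timestamps into sessions by inactivity gap. */
final class SessionWindowSketch {

    /** Returns {firstTimestamp, lastTimestamp} pairs, one per session. */
    static List<long[]> sessions(long[] sortedTimestampsMs, long gapMs) {
        List<long[]> result = new ArrayList<>();
        if (sortedTimestampsMs.length == 0) {
            return result;
        }
        long start = sortedTimestampsMs[0];
        long last = sortedTimestampsMs[0];
        for (int i = 1; i < sortedTimestampsMs.length; i++) {
            long current = sortedTimestampsMs[i];
            if (current - last >= gapMs) {          // silence reached the gap: close the session
                result.add(new long[] {start, last});
                start = current;
            }
            last = current;
        }
        result.add(new long[] {start, last});
        return result;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        long[] readings = {0L, 60_000L, 120_000L, 7_320_000L, 7_380_000L}; // device offline ~2 hours
        for (long[] s : sessions(readings, fiveMinutes)) {
            System.out.println("session from " + s[0] + " to " + s[1]);
        }
    }
}
```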

The scale trap: Flink partitions state by key. If you key by device_id and have one million devices, you have one million distinct keys, which is fine because they hash evenly across workers. But if you key by factory_id with only 20 factories, at most 20 parallel subtasks ever receive data regardless of task manager count (often fewer, because several keys can hash to the same subtask). This low-cardinality "hot key" problem makes 20 a hard parallelism ceiling. Always key at the finest granularity the business logic allows.

Stream processing powers six canonical production patterns: fraud detection (scoring before settlement), recommendations (fresh feature vectors), observability (sub-minute aggregations), CDC data integration (derived stores always in sync), real-time inventory (exactly-once stock counting), and IoT analytics (aggregating device readings at scale). Each has a distinct state model, keying strategy, and traps — knowing which pattern fits which domain requirement is the core skill of a streaming architect.

Section 14

Lambda vs Kappa vs Streaming-Only — The Big Architecture Decision

Before you write a single line of stream processing code, you have to answer a bigger question: what architecture will your data pipeline follow? Three schools of thought have emerged over the past fifteen years, each representing a different bet on where streaming technology was headed at the time. Understanding all three — and why the industry has been converging toward the third — is essential for any system design conversation at staff or principal level.

The Lambda architecture diagram shows the original big data design pattern introduced by Nathan Marz around 2011. Data flows into two parallel pipelines that run simultaneously: a batch layer reprocesses all historical data on a schedule (every few hours) to produce correct, accurate results; a speed layer processes the most recent data with low latency to fill in the freshness gap. At query time, results from both layers are merged. The architecture solves the freshness problem but introduces a maintenance nightmare: every piece of business logic must be written and maintained in two separate systems.

The Kappa architecture, articulated by Jay Kreps (Kafka co-creator) around 2014, makes a simple but powerful observation: if your event log is durable and replayable — which Kafka is — then you don't need a separate batch layer. Reprocessing is just running the same streaming job against an old Kafka offset. One codebase. One test suite. One execution engine. When a bug ships, you deploy a fixed version of the streaming job, point it at offset 0 (or the offset from N days ago), let it catch up to real time, then switch traffic and shut down the old version.

When to Use Each

Lambda Architecture — still relevant when: (1) you have existing Hadoop/Spark batch infrastructure that cannot be replaced; (2) your correctness requirements demand a batch layer to periodically "fix" results that the streaming layer only approximates; (3) you have a regulatory requirement for a full historical recomputation path separate from real-time. The practical reality: most teams running Lambda are doing so because they started with batch and bolted on streaming later, not because it's the best architecture for a greenfield system.

Kappa Architecture — the right choice when: (1) you're building greenfield with Kafka already as your event backbone; (2) your business logic can be expressed entirely as streaming computations (most modern ML feature pipelines can); (3) you can afford Kafka retention long enough to replay the historical window you need for reprocessing. The minimum viable Kafka retention for Kappa is typically 7–30 days — enough to replay through most bug-fix cycles. For longer histories, supplement with an object-store archive (S3 + Flink batch source for replay beyond Kafka retention). Compacted topics also help, but they keep only the latest value per key, so they can rebuild current state without being able to replay every historical event.

Streaming-Only (Modern) — the emerging pattern for 2024+: Kafka as the durable log + Flink for stateful stream processing + object-store (S3/GCS) for long-term event archive + Iceberg/Delta Lake for batch-style historical queries over the archive. You get Lambda's correctness and historical depth without Lambda's dual-codebase problem, because the archive is queryable directly without a separate Hadoop cluster. Since Flink 1.12, the DataStream API can run the same program in either streaming or batch execution mode, so the "unified" model that Beam promised is a practical reality in Flink.

Lambda (2011) ran parallel batch+streaming pipelines to combine correctness with low latency, but required maintaining two codebases. Kappa (2014) eliminated the batch layer by treating Kafka's replayable log as the reprocessing mechanism — one codebase, one engine. The modern streaming-only pattern adds object-store archival for history beyond Kafka's retention window, achieving full Lambda correctness with Kappa simplicity. For greenfield systems, start with Kappa; only use Lambda if you inherit a Hadoop ecosystem you cannot replace.

Section 15

Reprocessing and Replay — How to Fix a Bug in Historical Data

Here is one of the most underrated properties of a well-designed streaming system: when your code has a bug, you can go back in time and reprocess history as if the bug never existed. This is not magic. It works because Kafka (and similar durable logs) retains every event for a configurable retention period. If that retention is 30 days, every event from the past 30 days still exists at its original offset. A fixed version of your streaming job can replay from offset 0 and produce corrected output — exactly as if the correct code had been running all along.

This property is why the "Kafka as the system of record" philosophy is so powerful. In a traditional architecture, if a batch ETL job writes corrupt data into a database, fixing it requires either manual patches or re-running against the source system (which may not keep history). In a Kafka-based architecture, the source of truth is the immutable log. The database is just a derived view. Wrong derived view? Rebuild it from the log.

Pattern A: Side-by-Side Reprocessing (Zero-Downtime Bug Fix)

The safest reprocessing pattern runs the fixed job alongside the production job simultaneously, writing to a separate output topic or table. Once the replay job catches up to real time and results are validated, you perform a clean cutover — update the consumer configuration to point at the new output, shut down the old job. Zero downtime. Full validation window before switching.

The diagram shows the side-by-side pattern in action. The buggy v1 job keeps running, so production continues uninterrupted. The fixed v2 job starts from a historical offset (say, the point in time when the bug was introduced) and replays at full speed. Kafka allows replay to run faster than real time because there is no waiting for new events: the processor reads retained data as fast as the brokers, the network, and the job itself allow. In practice, a replay job processes 5–15× faster than the rate at which those events originally arrived, so a 30-day backlog might catch up in roughly 2–6 days of wall-clock time (a little longer once you count the new events that arrive during the replay). Once v2 catches up and its output is validated, consumers switch and v1 is decommissioned.

Pattern B: Bulk Replay (Simpler, Requires Downtime Window)

If you can afford a brief maintenance window (say, a few hours overnight), bulk replay is simpler. You stop the production job, taking a savepoint first so you can roll back, deploy the fixed code, point the job at the offset where the bug was introduced, and let it replay to the latest offset. Once it's caught up, resume normal operation. No dual output topics. No consumer switching. Just a clean reprocess.

The tradeoff: downtime during replay. For most internal analytics pipelines, this is acceptable. For fraud detection or inventory systems with SLAs in the seconds range, side-by-side is the only option.

The Math: How Long Does Replay Take?

Here's the calculation every engineer should be able to do before committing to Kappa:

Replay duration estimate
events_to_replay = event_rate × retention_seconds
replay_duration_seconds = events_to_replay / replay_throughput

Example: 1 million events/sec ingestion rate, 7-day Kafka retention (604,800 seconds), replay throughput of 5 million events/sec (5× real-time speed):
events_to_replay = 1,000,000 × 604,800 = 6.048 × 10^11 events
replay_duration = 6.048 × 10^11 / 5,000,000 = 120,960 seconds ≈ 33.6 hours of replay
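The estimate above can be checked with a few lines of Java. This is a minimal sketch, not a capacity-planning tool: the class and method names are hypothetical, and the second method adds one assumption that the formula above leaves out, namely that new events keep arriving at the ingestion rate while the replay runs.

```java
final class ReplayEstimate {
    /** Seconds needed to re-read a fixed backlog at the given replay throughput. */
    static double backlogSeconds(long eventsPerSec, long retentionSeconds, long replayEventsPerSec) {
        double eventsToReplay = (double) eventsPerSec * retentionSeconds;
        return eventsToReplay / replayEventsPerSec;
    }

    /** Seconds to reach real time when live events keep arriving during the replay (assumption). */
    static double catchUpSeconds(long eventsPerSec, long retentionSeconds, long replayEventsPerSec) {
        if (replayEventsPerSec <= eventsPerSec) {
            return Double.POSITIVE_INFINITY; // replay is no faster than ingest: it never catches up
        }
        double eventsToReplay = (double) eventsPerSec * retentionSeconds;
        return eventsToReplay / (replayEventsPerSec - eventsPerSec);
    }

    public static void main(String[] args) {
        long rate = 1_000_000L;             // events/sec ingested
        long retention = 7L * 24 * 3600;    // 7 days = 604,800 seconds
        long replay = 5_000_000L;           // events/sec during replay
        System.out.println("backlog only (hours):     " + backlogSeconds(rate, retention, replay) / 3600);
        System.out.println("with live ingest (hours): " + catchUpSeconds(rate, retention, replay) / 3600);
    }
}
```

With the example numbers, the backlog alone takes 120,960 seconds (33.6 hours); counting live traffic, catching up takes 151,200 seconds (42 hours), because only 4 million events/sec of the replay throughput goes toward shrinking the gap.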

Critical constraints: replay throughput must exceed the live ingestion rate, or the replay never catches up; and the replay must finish before the events it still needs age out of Kafka's retention window. Urgency depends on the bug. If it only affects edge cases, you have more time. If it corrupts every event, every hour of replay is another hour in which consumers are still reading output that diverges from business truth.

Schema Evolution During Replay

One frequently overlooked problem: when you replay events from 30 days ago, some of those events were written with an older schema. A field that didn't exist 30 days ago will be absent from old events. A Flink job deserializing those events must handle both old and new schema versions simultaneously. The solution is a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) with backward-compatible schema evolution, which in practice means every added field carries a default value, so the deserializer can read both the old Avro schema (version 1) and the new one (version 2) without crashing. Never write a stream processor that hardcodes a single schema version if replay is part of your operational model.

Reprocessing lets you retroactively fix bugs in historical data by replaying Kafka events with corrected code. The two patterns are side-by-side replay (zero downtime, dual output topics, cutover on validation) and bulk replay (simpler, requires maintenance window). The math determines feasibility: at 1M events/sec and 7-day retention, 33+ hours of replay is required at 5× speed. Schema evolution support in your deserializer is mandatory for replay to work across schema changes.

Section 16

Monitoring Stream Processors — The Metrics That Actually Catch Incidents

A streaming job that is silently dropping late events, slowly growing its state into an OOM, or falling behind real time looks perfectly healthy from the outside — no errors thrown, no crashes, CPU and memory within bounds. The metrics that catch these silent failures are specific to streaming and different from the CPU/memory/request-rate dashboard you'd use for a web service. Here are the seven metrics that actually matter, and what each one tells you before it becomes a 3 AM incident.

The dashboard above shows the seven metrics in a typical Flink or Kafka Streams monitoring setup. Panels ①–③ are the highest-priority SLO metrics — they directly measure whether the pipeline is meeting its business commitments. Panels ④–⑥ are operational health metrics that predict future failures before they occur. Panel ⑦ is the sneaky one: state size growth looks benign until it causes an OOM crash at 3 AM.

Each Metric Explained — What It Tells You and Why

① End-to-End Latency — The User-Facing SLO

This is the time from when an event occurred (event-time timestamp in the message) to when the output record lands in the sink (database write, Kafka output topic, dashboard update). It is the metric that directly answers "are we meeting our latency SLA?" A fraud detector promising sub-200ms decisions should alert at 150ms — not at 200ms when it's already breached.

How to measure it: subtract the event-time header from the sink write timestamp. Emit this as a histogram per operator stage so you can identify which stage contributes the most latency. Flink exposes this via its metrics system; export to Prometheus and visualise in Grafana. Alert on P95, not average — average latency can be fine while P95 is breaching the SLO.
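To see why the alert belongs on P95 rather than the average, here is a small sketch with made-up latency samples. The class name, the nearest-rank percentile method, and the sample values are all illustrative assumptions; a real job would read these from its metrics histogram rather than an array.

```java
import java.util.Arrays;

final class LatencyPercentile {
    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        long[] latenciesMs = new long[100];
        Arrays.fill(latenciesMs, 0, 90, 40L);    // 90 events land in 40 ms
        Arrays.fill(latenciesMs, 90, 100, 400L); // 10 events land in 400 ms
        System.out.println("mean ms: " + mean(latenciesMs));
        System.out.println("p95 ms:  " + percentile(latenciesMs, 95));
    }
}
```

In this sample the mean is 76 ms, comfortably under a 200 ms SLO, while P95 is 400 ms: one event in ten is already breaching.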

② Watermark Lag — The Silent Data-Dropping Indicator

The watermark lag is how far behind real time the current watermark is. A watermark at T − 2 minutes means the stream processor believes "all events with timestamp before T − 2 minutes have arrived." Two things can go wrong. If the lag grows well beyond your watermark advance heuristic (say, 5 minutes), something is holding the watermark back and window results are being delayed. If the heuristic is smaller than the real lateness of your events, events beyond that gap are silently dropped: they arrive after their window has already been closed and emitted.

A growing watermark lag usually means one of three things: (1) one or more Kafka partitions are stalled (no new events, so the watermark for that partition doesn't advance); (2) a source has very high event-time skew (mobile devices going offline); (3) a slow consumer isn't reading fast enough for the watermark to advance. Flink's web UI shows watermarks per operator subtask, so check for a single stalled source subtask dragging the global watermark down.

③ Kafka Consumer Lag (Processing Lag) — Backpressure Made Visible

Consumer lag is the difference between the latest offset in the Kafka partition and the offset the stream processor has committed. A lag near zero means the job is keeping up with the event rate. A growing lag means the processor is slower than the producer — backpressure is building. If lag grows without bound, the job will eventually fall so far behind that a restart triggers a replay spanning hours of backlogged events, causing an extended recovery period.

Alert on lag growth rate, not absolute lag. Some lag is normal during job startup. Continuously growing lag at a rate of >10,000 events/second for more than 5 minutes warrants immediate investigation. Common causes: a new operator added to the pipeline with insufficient parallelism, a cold cache miss storm at startup (async I/O waiting for Redis to warm), or a temporarily slow sink (database write latency spike).
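The "growth rate, not absolute lag" rule can be sketched as follows. The class name, the sampling interval, and the requirement that lag rises on every sample are illustrative assumptions, not a specific monitoring product's alert definition.

```java
final class LagGrowthAlert {
    /** Average lag growth in events/second between the first and last sample. */
    static double growthPerSecond(long[] lagSamples, long sampleIntervalSeconds) {
        if (lagSamples.length < 2) {
            return 0.0;
        }
        long delta = lagSamples[lagSamples.length - 1] - lagSamples[0];
        return (double) delta / ((lagSamples.length - 1) * sampleIntervalSeconds);
    }

    /** Fires only when lag rose on every sample and the average growth exceeded the threshold. */
    static boolean shouldAlert(long[] lagSamples, long sampleIntervalSeconds, double thresholdPerSecond) {
        for (int i = 1; i < lagSamples.length; i++) {
            if (lagSamples[i] <= lagSamples[i - 1]) {
                return false; // lag shrank or held steady at some point: not continuous growth
            }
        }
        return growthPerSecond(lagSamples, sampleIntervalSeconds) > thresholdPerSecond;
    }

    public static void main(String[] args) {
        // Six samples taken 60 seconds apart cover a 5-minute observation window.
        long[] growing = {1_000_000, 1_720_000, 2_440_000, 3_160_000, 3_880_000, 4_600_000};
        long[] startupDrain = {5_000_000, 4_000_000, 3_000_000, 2_000_000, 1_000_000, 0};
        System.out.println("growing lag alerts:   " + shouldAlert(growing, 60, 10_000));
        System.out.println("startup drain alerts: " + shouldAlert(startupDrain, 60, 10_000));
    }
}
```

The first series grows by 12,000 events/second for five minutes and triggers the alert; the second has a much larger absolute lag at the start but is draining, so it stays quiet.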

④ Checkpoint Duration — The OOM Canary

Checkpoint duration measures how long it takes to snapshot operator state to durable storage (typically HDFS or S3). A healthy job checkpoints in seconds. A job whose checkpoint duration keeps climbing (10, then 20, then 60 seconds) has growing state, and growing state is the most common precursor to an out-of-memory crash.

The mechanism: Flink's checkpoint barrier must pass through every operator before the snapshot is complete, and each operator must persist its state before acknowledging. If one operator has large RocksDB state (hundreds of GB), the checkpoint waits for that operator to finish uploading. For full checkpoints, duration is roughly proportional to state size (incremental RocksDB checkpoints scale with the state changed since the last checkpoint, but a rising trend points the same way). Growing duration = growing state. Growing state = no TTL configured, or TTL too long, or key cardinality growing unboundedly (e.g., keying by user-agent string instead of user-id).

⑤ Restart Count — The Reliability Canary

Every time a Flink job fails and recovers (because a task manager died from OOM, a network partition or long JVM GC pause caused a heartbeat timeout, or operator code threw a NullPointerException), the restart counter increments. A well-tuned streaming job should have zero restarts in normal operation. Even one restart per hour means the job is spending time recovering from checkpoints rather than processing events, and the root cause is not yet fixed.

Restarts are not catastrophic by themselves — Flink's exactly-once semantics mean state is consistent after recovery. But they're a signal that something is wrong, and the wrong thing will eventually become a longer outage. Common causes: memory pressure (increase task manager heap or switch to RocksDB state backend), operator NullPointerExceptions from malformed input events (add null-guard logic), or JVM GC pauses exceeding the heartbeat timeout (tune G1GC or switch to ZGC for large heaps).

⑦ State Size — The Slow-Growing OOM Risk

State size measures the total bytes of keyed state held by each operator. After a job reaches steady state, state size should plateau — new keys being added should roughly equal old keys expiring due to TTL. If state size grows linearly without bound, the job will eventually exhaust heap or RocksDB disk space and crash.

The three causes of unbounded state growth: (1) No TTL configured — the operator accumulates state for every key it has ever seen, including keys that will never appear again (deactivated users, cancelled sessions). (2) Key cardinality explosion — keying by a high-cardinality string like device user-agent or IP address rather than a bounded identifier like user ID. (3) Window state not expiring — a bug in window trigger configuration that prevents state from being garbage-collected after the window closes.

Tools for streaming metrics: Flink Web UI (built-in, real-time task manager view), Flink + Prometheus (via Flink's Prometheus metrics reporter), Kafka Streams JMX metrics + Prometheus, Confluent Control Center (Kafka lag + throughput dashboards), Datadog Data Streams Monitoring (end-to-end SLO + anomaly detection), Honeycomb (event-level traces through the streaming pipeline for deep latency attribution).

Seven metrics separate a well-monitored streaming job from a silently failing one: end-to-end latency (SLO compliance), watermark lag (silent drop detection), consumer lag (backpressure), checkpoint duration (state growth canary), restart count (reliability), records in/out (throughput per stage), and state size (OOM risk). Alert on P95 latency, watermark lag exceeding the watermark heuristic, and any restart count above zero. Growing checkpoint duration is the early warning for an OOM that will materialize weeks later.

Section 17

Pitfalls and Anti-Patterns — The 7 Most Expensive Streaming Mistakes

Every streaming team eventually hits these. The first time is a production incident. The second time is a postmortem that references the first postmortem. The goal of this section is to make sure your first encounter with each of them happens here, not in production.

The mistake: You build a "revenue per hour" aggregation using a tumbling window on processing time (the wall clock when events arrive at the processor). Everything looks fine in testing. Then you deploy to production, and mobile app users who were offline for 6 hours sync their sessions. Their events arrive all at once, 6 hours after they happened. Processing-time windows bucket all of them into the current hour's window, not the hour they actually happened.

Why it's bad: Your revenue-per-hour chart shows a spike at 2 PM (when the offline devices synced) and numbers that are too low for 8 AM (when the transactions actually occurred). Business decisions made on this data (pricing, promotion timing, inventory adjustments) are based on corrupted time attribution. This is not a display bug; it's a correctness bug in aggregated business data.

The fix: Use event time for all business-logic aggregations. Set a watermark strategy that matches your expected event skew (e.g., 30-minute watermark for mobile apps that may sync after extended offline periods). Accept that results will be delayed by up to 30 minutes from processing time — but they will be correct. The latency cost of event-time correctness is almost always worth it for financial and business-metric pipelines.

Never use processing time for any window that produces numbers used in business reporting, billing, or financial reconciliation. Use it only for operational metrics where approximate attribution is acceptable (e.g., "is the system overloaded right now?" — that's inherently a processing-time question).
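The attribution error is easy to reproduce outside any streaming framework. The sketch below is plain Java with hypothetical names (TimeAttribution, Sale) and made-up timestamps; it buckets the same two sales once by event time and once by arrival time.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TimeAttribution {
    /** A sale with the time it happened and the time it reached the processor. */
    static final class Sale {
        final Instant eventTime;
        final Instant arrivalTime;
        final long cents;

        Sale(String eventTime, String arrivalTime, long cents) {
            this.eventTime = Instant.parse(eventTime);
            this.arrivalTime = Instant.parse(arrivalTime);
            this.cents = cents;
        }
    }

    /** Sums revenue per UTC hour, bucketing by event time or by arrival (processing) time. */
    static Map<Integer, Long> revenueByHour(List<Sale> sales, boolean useEventTime) {
        Map<Integer, Long> totals = new TreeMap<>();
        for (Sale s : sales) {
            Instant t = useEventTime ? s.eventTime : s.arrivalTime;
            int hour = t.atZone(ZoneOffset.UTC).getHour();
            totals.merge(hour, s.cents, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
            new Sale("2024-03-01T08:15:00Z", "2024-03-01T14:02:00Z", 5_000),  // device was offline ~6 hours
            new Sale("2024-03-01T14:01:00Z", "2024-03-01T14:01:01Z", 2_000)); // device was online
        System.out.println("event time:      " + revenueByHour(sales, true));
        System.out.println("processing time: " + revenueByHour(sales, false));
    }
}
```

Bucketed by event time, the 08:00 hour gets 5,000 and the 14:00 hour gets 2,000. Bucketed by processing time, the 08:00 hour is empty and 14:00 shows 7,000, which is exactly the false afternoon spike described above.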

The mistake: You add a keyed aggregation — "count page views per user" — without configuring a state time-to-live (TTL) — the rule that says "delete a key's state if it hasn't been touched in N hours." Every user who ever visits the site gets an entry in the operator's state store. Active users are fine. But inactive users — accounts created two years ago, never seen since — accumulate in state indefinitely. After six months, the state store holds entries for 50 million users, most of whom haven't been active in months.

Why it's bad: State stores live in heap memory (for small state) or in RocksDB on local disk. Either way, unbounded growth eventually exhausts the resource. The crash is not immediate — it sneaks up over weeks and months as state accumulates. When it finally crashes, recovery requires reading back all that state from the checkpoint, which is slow. And the root cause is a missing TTL configuration, not a hardware failure or load spike.

The fix: Configure state TTL for every stateful operator. In Flink:

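    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;

    // Expire a key's state 24 hours after it was created or last written
    StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(24))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

    ValueStateDescriptor<Long> stateDescriptor =
        new ValueStateDescriptor<>("pageviewCount", Long.class);
    stateDescriptor.enableTimeToLive(ttlConfig);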

The TTL should match the business meaning: "count per day" → TTL of 25 hours (slightly more than a day to handle boundary events). "user session state" → TTL of 30 minutes after last activity (session timeout). "fraud velocity window" → TTL equal to the longest velocity window plus a 20% buffer.

The mistake: You key your stream by country_code because your business analysts want "revenue per country." There are 200 countries. Your Flink cluster has 200 task slots. You set parallelism to 200 and think you've fully utilised the cluster. But 200 is also the ceiling: every event for a given key goes to the same subtask, so regardless of how many task slots you add beyond 200, the additional slots sit idle. And in practice, 80% of your traffic comes from the United States, so one subtask is processing 80% of all events.

Why it's bad: First, you've capped your maximum possible parallelism at 200 (the key cardinality). Adding more hardware doesn't help. Second, "US" is a hot key that carries orders of magnitude more events than "Liechtenstein". The subtask that owns the hot key determines end-to-end latency because it's the slowest one. Backpressure from the hot key propagates upstream.

The fix: Key at a finer granularity than the aggregation. Key by user_id (cardinality in the millions), then aggregate into per-country totals in a second operator. The first operator achieves full parallelism. The second operator, keyed by country, handles only the rollup and is much lower throughput. Alternatively, use a pre-aggregation combiner pattern: partial aggregates computed in parallel, then merged in a single reduce step.
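The two-stage shape can be illustrated without a cluster. The following sketch is plain Java, not Flink code: the Order class, the worker count, and the hash routing are stand-ins for keyBy(user_id) followed by a per-country rollup, and it ignores windowing entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TwoStageAggregation {
    static final class Order {
        final String userId;
        final String country;
        final long cents;

        Order(String userId, String country, long cents) {
            this.userId = userId;
            this.country = country;
            this.cents = cents;
        }
    }

    /** Stage 1: route by user ID so load spreads across workers; each worker keeps partial per-country sums. */
    static List<Map<String, Long>> partialSums(List<Order> orders, int workers) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            partials.add(new HashMap<>());
        }
        for (Order o : orders) {
            int worker = Math.floorMod(o.userId.hashCode(), workers);
            partials.get(worker).merge(o.country, o.cents, Long::sum);
        }
        return partials;
    }

    /** Stage 2: merge the small partial maps into final per-country totals. */
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((country, sum) -> totals.merge(country, sum, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
            new Order("u1", "US", 100), new Order("u2", "US", 200),
            new Order("u3", "US", 300), new Order("u4", "LI", 50));
        System.out.println(merge(partialSums(orders, 4)));
    }
}
```

The second stage sees at most one record per country per worker instead of one record per order, which is why it can run at low parallelism even when one country dominates the traffic.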

The mistake: Your fraud detection pipeline needs to enrich each transaction with user history from a Postgres database. You add a MapFunction that calls jdbcConnection.query("SELECT * FROM user_history WHERE user_id = ?", event.getUserId()). It works in testing (small traffic, fast DB). In production, the DB has 50ms average query latency. Your Flink operator thread blocks for 50ms per event, so each task slot handles at most 20 events/second. With 10,000 events/second arriving and one blocking call per event, you need 500 task slots just to keep up. You had 20, which together process 400 events/second. Consumer lag grows at 9,600 events/second.

Why it's bad: Blocking I/O inside a streaming operator serialises all processing through the blocking call. The network round trip — typically 0.5–50ms for a cache hit, 10–100ms for a DB query — becomes the throughput bottleneck. This is not a hardware problem; it's a threading model problem.

The fix: Use Flink's AsyncFunction API, which lets the operator issue multiple lookups concurrently, continuing to process other events while waiting for responses:

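    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class AsyncEnrichmentFunction extends RichAsyncFunction<Transaction, EnrichedTransaction> {
        @Override
        public void asyncInvoke(Transaction tx, ResultFuture<EnrichedTransaction> resultFuture) {
            // Non-blocking: sends request, returns immediately
            CompletableFuture<UserProfile> profileFuture = redisAsyncClient.get(tx.getUserId());
            profileFuture.whenComplete((profile, error) -> {
                if (error != null) {
                    resultFuture.completeExceptionally(error); // surface failures instead of waiting for the timeout
                } else {
                    resultFuture.complete(Collections.singleton(new EnrichedTransaction(tx, profile)));
                }
            });
        }
    }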

With 100 concurrent async requests in flight, a 50ms lookup latency allows a single task slot to process ~2,000 events/second — 100× better than synchronous. Use Redis (sub-millisecond lookups) over Postgres (10–100ms) for hot-path enrichment wherever possible.
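The throughput figures for blocking and async lookups both follow from the same relation: sustained throughput equals requests in flight divided by per-request latency. A minimal sketch, with hypothetical class and method names:

```java
final class LookupThroughput {
    /** Events/second one task slot can sustain: in-flight requests divided by latency per request. */
    static double eventsPerSecond(int concurrentRequests, double latencyMs) {
        return concurrentRequests * 1000.0 / latencyMs;
    }

    /** Task slots needed to keep up with the incoming rate, rounded up. */
    static long slotsNeeded(double incomingPerSecond, int concurrentRequests, double latencyMs) {
        return (long) Math.ceil(incomingPerSecond / eventsPerSecond(concurrentRequests, latencyMs));
    }

    public static void main(String[] args) {
        System.out.println("blocking, per slot: " + eventsPerSecond(1, 50));      // one call at a time
        System.out.println("async, per slot:    " + eventsPerSecond(100, 50));    // 100 calls in flight
        System.out.println("slots, blocking:    " + slotsNeeded(10_000, 1, 50));
        System.out.println("slots, async:       " + slotsNeeded(10_000, 100, 50));
    }
}
```

At 50 ms per lookup a blocking slot sustains 20 events/second and an async slot with 100 requests in flight sustains 2,000, so a 10,000 events/second stream needs 500 blocking slots or 5 async ones. This assumes the lookup service itself can absorb the extra concurrency.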

The mistake: You implement a payment confirmation pipeline. Flink guarantees exactly-once processing — events are counted exactly once, state is consistent after failures. You have a SinkFunction that calls a third-party payment API to confirm each transaction. You believe exactly-once means each transaction is confirmed exactly once. A task manager crashes. Flink recovers from checkpoint. The sink function is called again for events that were already processed before the crash. The payment API receives duplicate confirmation calls. The customer is charged twice.

Why it's bad: Flink's exactly-once guarantee covers operator state (restored consistently from checkpoints) and transactional sinks such as Kafka output topics (which participate in a two-phase commit protocol tied to checkpointing). External systems that do not participate in this protocol (REST APIs, JDBC writes outside Flink's JDBC sink, email sends, SMS notifications) can receive duplicate calls when operators replay after recovery. Exactly-once is a property of the stream processor's internal state and Kafka integration, not a global guarantee over all I/O.

The fix: All external side effects must be idempotent. Assign an idempotency key (a unique tag attached to each outbound call so the receiver can recognise and skip a repeat), typically the event's unique ID from the source system. The key must be identical on every retry and replay, so never mix in anything that changes between attempts, such as an attempt number or a timestamp. The external system deduplicates on this key. For payment APIs, this is usually sent as an Idempotency-Key header (Stripe and Adyen both support it). For database writes, use INSERT ... ON CONFLICT DO NOTHING or UPSERT semantics keyed on the event ID.

Exactly-once means Flink won't count the same event twice in its own state and won't write duplicate records to Kafka topics (with exactly-once sink configured). It does NOT mean your external HTTP calls, email sends, or third-party API writes happen exactly once. You are responsible for idempotency on external side effects — always.
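The deduplication contract can be sketched in a few lines. Everything here is hypothetical: FakePaymentApi is an in-memory stand-in for the external system, not a real payment client, and the "confirm-" key prefix is an arbitrary naming choice.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class IdempotentConfirmations {
    /** Stand-in for the external payment API: remembers every idempotency key it has processed. */
    static final class FakePaymentApi {
        private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();
        private final AtomicInteger charges = new AtomicInteger();

        /** Returns true if this call performed the charge, false if it was a recognised repeat. */
        boolean confirm(String idempotencyKey, long cents) {
            if (!seenKeys.add(idempotencyKey)) {
                return false; // same key seen before: skip the side effect
            }
            charges.incrementAndGet();
            return true;
        }

        int chargeCount() {
            return charges.get();
        }
    }

    /** Derived only from the event's own ID, so a replayed event always produces the same key. */
    static String idempotencyKey(String transactionId) {
        return "confirm-" + transactionId;
    }

    public static void main(String[] args) {
        FakePaymentApi api = new FakePaymentApi();
        String key = idempotencyKey("tx-42");
        api.confirm(key, 1_999); // first delivery
        api.confirm(key, 1_999); // replay after recovery from a checkpoint
        System.out.println("charges made: " + api.chargeCount());
    }
}
```

The sink calls the API twice, as it would after a replay, but only one charge is recorded because the receiver recognises the repeated key.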

The mistake: You set a watermark advance heuristic of 5 seconds, meaning "I expect events to arrive at most 5 seconds late." This feels generous. But your data includes events from IoT devices that buffer locally and send in bursts every 60 seconds, mobile app events from users on cellular networks with up to 2-minute delays, and server-side events that are fine. The 5-second watermark advances past the 60-second and 120-second late events. Their windows have already been emitted and closed. The events are silently dropped. No error is logged. No alert fires (Flink does count them in the window operator's numLateRecordsDropped metric, but that only helps if someone is watching it). The resulting aggregates undercount by 30%.

Why it's bad: Silent data loss is the worst kind — it doesn't produce an alert, a stack trace, or a dead-letter queue entry. The output just looks slightly wrong. You don't notice until an analyst compares the streaming dashboard against a batch reconciliation report and finds a 30% discrepancy. By then, days of data have been affected.

The fix: Measure your actual event skew distribution before setting the watermark. In Flink, enable late-data side outputs and count them:

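    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.OutputTag;

    // The trailing {} creates an anonymous subclass so Flink can recover the generic type
    OutputTag<SensorEvent> lateDataTag = new OutputTag<SensorEvent>("late-data"){};

    SingleOutputStreamOperator<Aggregate> result = stream
        .keyBy(SensorEvent::getDeviceId)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .sideOutputLateData(lateDataTag)
        .aggregate(new CountAggregate());

    // Monitor this stream's size: if non-zero, the watermark is too tight
    DataStream<SensorEvent> lateEvents = result.getSideOutput(lateDataTag);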

Set the watermark to the 99th percentile of your observed event skew, not the median. For IoT / mobile data, 2–5 minutes is often the correct heuristic. Accept the output delay; do not accept silent data loss.
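The effect of the bound can be simulated without Flink. The sketch below is a simplification under stated assumptions: the watermark is recomputed on every event (Flink emits it periodically), there is no allowed lateness, and the formula watermark = max timestamp − bound − 1 is modelled on a bounded-out-of-orderness strategy. The class name and the sample timestamps are hypothetical.

```java
final class WatermarkBoundSimulation {
    /**
     * Counts events that arrive after the watermark has already passed the end of their
     * tumbling window. Input is event-time timestamps (ms) listed in arrival order.
     */
    static int countDropped(long[] eventTimesInArrivalOrder, long windowMs, long boundMs) {
        long maxSeen = Long.MIN_VALUE;
        int dropped = 0;
        for (long ts : eventTimesInArrivalOrder) {
            // Watermark as it stood before this event arrived.
            long watermark = (maxSeen == Long.MIN_VALUE) ? Long.MIN_VALUE : maxSeen - boundMs - 1;
            long windowEnd = ts - Math.floorMod(ts, windowMs) + windowMs; // exclusive end
            if (watermark >= windowEnd - 1) {
                dropped++; // the window containing this event has already fired
            }
            maxSeen = Math.max(maxSeen, ts);
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Three on-time events, then one event from the first minute arriving about two minutes late.
        long[] arrivals = {10_000, 70_000, 130_000, 20_000};
        System.out.println("5 s bound drops:   " + countDropped(arrivals, 60_000, 5_000));
        System.out.println("120 s bound drops: " + countDropped(arrivals, 60_000, 120_000));
    }
}
```

With a 5-second bound the late event is dropped; with a 2-minute bound it is still accepted, at the cost of every window firing about two minutes later.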

The mistake: The streaming job has been running flawlessly in staging for two weeks. Checkpoints complete in 3 seconds. Everything looks great. You ship to production. Three weeks later, at 3 AM, a task manager runs out of memory and crashes. The job is redeployed and tries to restore from the latest checkpoint. Restore fails: the checkpoint is incompatible with the code now being deployed because a developer added a new field to the state schema without a migration strategy. The job cannot restart. State has to be discarded and the job resumes from the earliest available offset, reprocessing hours of events. Downstream consumers see duplicate events for the reprocessed period.

Why it's bad: Savepoint compatibility is something you can only discover by actually testing checkpoint-and-recover. The incompatibility is typically caused by: (1) adding a field to a state class that does not qualify as a Flink POJO (a plain Java object used to hold state; Flink can evolve POJO and Avro state automatically, but classes that fall back to Kryo serialization cannot be migrated); (2) changing the key serializer type; (3) changing the operator UID (Flink identifies operators by UID, so if the UID changes, Flink can't match checkpoint state to operators). None of these cause errors during normal operation. They only surface during recovery.

The fix: Make savepoint restore testing a mandatory part of your deployment pipeline. Before every production deployment:

Savepoint deployment checklist:
1. Take a manual savepoint of the current production job: flink savepoint <job_id>
2. Deploy the new code in a staging environment, pointing at the production savepoint
3. Verify the job starts successfully from the savepoint
4. Verify state is correctly restored (emit a few test events and check that accumulated state is present)
5. Only then deploy to production and switch

Also: assign stable operator UIDs using .uid("my-fraud-scorer") on every stateful operator. Without this, Flink generates UIDs from job topology position — add or remove any operator upstream and all UIDs shift, breaking all savepoints.

The seven most expensive streaming mistakes are: using processing time for event-time business logic (corrupted time attribution), unbounded state growth without TTL (eventual OOM), low-cardinality keys capping parallelism (hot-key bottleneck), synchronous external lookups blocking operator threads (100× throughput degradation), assuming exactly-once covers external API calls (duplicate charges), watermark too tight causing silent late-data drops (silent undercounting), and not testing savepoint restore before production (3 AM incompatibility disaster). All seven are preventable with the correct configuration; none produce an error until it is too late to prevent the damage.

Section 18

Practice Exercises — Reason About Streams Like a Senior Engineer

These exercises are not "recite the definition" drills. Each one asks you to apply the principles from Sections 13–17 to a novel scenario you haven't seen before — the same kind of open-ended reasoning you'd do in a system design interview or a production incident postmortem. Work through each exercise before expanding the solution.

Scenario: You're building an analytics dashboard that shows "unique active users in the last 60 minutes," updated every minute. Your events are user activity heartbeats emitted every 30 seconds by the front end. You need to display the count at 12:00, 12:01, 12:02, etc., where each count covers the most recent 60 minutes.

Which window type is correct? Tumbling (60-min), Hopping (size=60min, advance=1min), Sliding, or Session? Explain why the others are wrong.

Think about the relationship between window size and how frequently you want a new result. If you want a new result every minute but the window covers 60 minutes, how much do successive windows overlap?

Solution: A hopping window with size=60 minutes and hop=1 minute is correct.

A tumbling window (non-overlapping 60-minute windows) would give you one count per hour — but you want one count every minute. Wrong frequency.

A sliding window would give you a result every time a new event arrives — which is ~every 30 seconds per user, producing far too many output events (one per heartbeat, not one per minute). Also, "sliding" in most frameworks means you define a slide interval; what you're describing IS a hopping window with hop=1 min.

A session window closes when there's a gap in activity — but here you explicitly want a fixed 60-minute rolling window regardless of session gaps. Wrong semantics.

The hopping window with size=60min, hop=1min produces exactly 60 overlapping windows at any instant (each covering the past 60 minutes), with a new window result emitted every minute. This is precisely the "rolling last-hour count, updated every minute" semantic. Each event participates in 60 windows (one per minute it falls within), so the memory overhead is 60× that of a tumbling window — acceptable for 60× richer output.
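The claim that each event belongs to 60 windows can be verified by enumerating them. This is a sketch of the assignment arithmetic only, with a hypothetical class name; it assumes windows are aligned to multiples of the hop and cover the half-open interval [start, start + size).

```java
import java.util.ArrayList;
import java.util.List;

final class HoppingWindows {
    /** Start times (ms) of every window [start, start + size) that contains the timestamp. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long hopMs) {
        List<Long> starts = new ArrayList<>();
        long latestStart = timestampMs - Math.floorMod(timestampMs, hopMs);
        for (long start = latestStart; start > timestampMs - sizeMs; start -= hopMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long heartbeat = (12 * 3600 + 30) * 1000L; // 12:00:30, as ms since midnight
        long hour = 3_600_000L;
        long minute = 60_000L;
        System.out.println("hopping windows:  " + windowStarts(heartbeat, hour, minute).size());
        System.out.println("tumbling windows: " + windowStarts(heartbeat, hour, hour).size());
    }
}
```

A heartbeat at 12:00:30 falls into the 60 windows starting at 11:01, 11:02, and so on up to 12:00. With hop equal to size (a tumbling window) the same call returns exactly one window.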

Scenario: Your team's Flink job has been running for two weeks. This morning, the on-call engineer notices the watermark is 2 hours behind real time (lag = 120 minutes). The job was processing correctly yesterday (lag = 90 seconds). No new code was deployed overnight. What are the three most likely causes, ranked by probability?

Watermark advancement is determined by the minimum watermark across all source partitions. What could cause one or more Kafka partitions to stop advancing their watermark?

Solution — Three most likely causes:

1. One Kafka partition is stalled (no new events): highest probability. Flink's global watermark is the minimum watermark across all source operator instances. If a single Kafka partition receives no new messages for 2 hours (because a producer died, a service stopped routing to that partition, or a topic rebalance went wrong), that partition's watermark does not advance. The global watermark is capped at that stalled partition's last event. Check: flink list to find the job → open it in the Flink Web UI and compare the watermarks of the source subtasks → find the stalled one → investigate the producer for the partition it reads.

2. The job is reading a large batch of events with old event-time timestamps. Watermarks are derived from event timestamps and never move backward, but they also cannot advance while the events being read are old. If a data pipeline resumed after a 2-hour outage, or a batch job or data migration injected historical events with timestamps from 2 hours ago, the watermark trails real time by that amount until the backlog is drained. Less common but possible. Check: inspect the event-time timestamps of messages currently being read from the input Kafka topic, and compare against consumer lag.

3. A source subtask is idle and idleness detection is not configured. If a source subtask has no partitions assigned (for example, after a restart or a change in partition count leaves source parallelism higher than the number of partitions), or its partitions carry no data, it emits no watermark. Flink's idle source detection (configurable with withIdleness(Duration.ofMinutes(5))) should handle this, but if not configured, one idle source drags the watermark. Check: verify idle source marking is configured in the watermark strategy.
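All three causes come down to the same rule: the downstream watermark is the minimum over the inputs that are not marked idle. A minimal model of that rule, with a hypothetical class name and made-up watermark values expressed as seconds behind "now":

```java
import java.util.OptionalLong;

final class GlobalWatermark {
    /** Minimum watermark over all non-idle inputs; empty when every input is idle. */
    static OptionalLong combine(long[] watermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (idle[i]) {
                continue; // idle inputs do not hold the watermark back
            }
            anyActive = true;
            min = Math.min(min, watermarks[i]);
        }
        return anyActive ? OptionalLong.of(min) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        long now = 1_700_000_000L; // arbitrary "current time" in seconds
        long[] watermarks = {now - 90, now - 90, now - 7_200}; // third partition stalled 2 hours ago
        System.out.println("lag, nothing marked idle:  "
            + (now - combine(watermarks, new boolean[] {false, false, false}).getAsLong()));
        System.out.println("lag, stalled input idle:   "
            + (now - combine(watermarks, new boolean[] {false, false, true}).getAsLong()));
    }
}
```

With nothing marked idle, the one stalled input pins the lag at 7,200 seconds even though the other two are only 90 seconds behind. Once the stalled input is treated as idle, the lag returns to 90 seconds.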

Scenario: You're building a stream processor that tracks "number of purchases per user per calendar day" for a loyalty rewards system. The system uses UTC midnight as the day boundary. Events can arrive up to 90 minutes late (mobile devices offline, then syncing). Your Kafka retention is 30 days. What is the correct state TTL for the per-user per-day count, and why? What happens if TTL is too short? Too long?

Consider: when does a given "user+day" key definitely no longer need to accept new events? Factor in both the day boundary and the allowed lateness. Then add a safety buffer.

Solution:

The minimum safe state TTL is 25.5 hours (24 hours for the calendar day + 90 minutes for allowed lateness). Add a small safety buffer on top, so in practice configure roughly 28 hours.

Reasoning: state for "user U on day D" must survive until no more events with timestamp in day D can possibly arrive. The last event for day D arrives at most at end-of-day-D + 90 minutes = 25.5 hours after the start of day D. After that point, the state for (U, D) can be safely expired.

TTL too short (e.g., exactly 24 hours): State for the current day is expired before the 90-minute late-arrival window closes. Late events arriving 30–90 minutes after midnight are processed, find no state, start a fresh count from 1, and produce an undercount for day D. The missing count is silently lost — no error, just wrong loyalty point totals.

TTL too long (e.g., 30 days): State accumulates for 30 days × number_of_active_users. For 10 million active users, 30 days of per-user per-day state is 300 million state entries. If each entry is ~50 bytes (user ID + day key + count), that's 15 GB of state — potentially fine for RocksDB, but checkpoint size and duration will be large. The excess retention serves no functional purpose (no event with a 30-day delay is expected). Correct TTL avoids this waste.

Rule of thumb: TTL = window size + allowed lateness + 10% safety buffer. Not shorter. Not 10× longer.
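Applied to this exercise, the rule of thumb works out as follows. The helper below is only an illustration of the arithmetic; the class and method names are hypothetical.

```java
import java.time.Duration;

final class StateTtl {
    /** TTL = window size + allowed lateness, plus a fractional safety buffer on top. */
    static Duration recommended(Duration windowSize, Duration allowedLateness, double bufferFraction) {
        Duration minimum = windowSize.plus(allowedLateness);
        long bufferMillis = Math.round(minimum.toMillis() * bufferFraction);
        return minimum.plusMillis(bufferMillis);
    }

    public static void main(String[] args) {
        Duration ttl = recommended(Duration.ofHours(24), Duration.ofMinutes(90), 0.10);
        System.out.println("recommended TTL in minutes: " + ttl.toMinutes());
    }
}
```

A 24-hour window with 90 minutes of lateness gives a 1,530-minute minimum; the 10% buffer adds 153 minutes, for a TTL of 1,683 minutes (about 28 hours).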

Scenario: You want to join two Kafka topics: "ad_clicks" (user clicks an ad) and "conversions" (user makes a purchase). You want to produce a record when an ad click is followed by a purchase within 24 hours — to attribute the conversion to the ad. Why can't you just join on user_id without a time bound? What goes wrong, and how do you set the bound correctly?

Think about what "join" means for unbounded streams. If you join without a time bound, what state must the operator keep, and for how long?

Solution:

Without a time bound, a stream-stream join on user_id must buffer every unmatched click event until a matching conversion arrives — which might be never. If a user clicked an ad in January and never purchased, their click event must be kept in state indefinitely, waiting for a conversion that will never come. For millions of users, this is unbounded state growth. The operator runs out of memory or disk. This is why time-bounded joins are mandatory for stream-stream joins.

Setting the time bound correctly: The join window should be the maximum time delta that the business considers an "attribution window." Here, that's 24 hours. In Flink:
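- Join clicks and conversions where 0 ≤ conversion.eventTime − click.eventTime ≤ 24 hours
- Implemented as an interval join on streams keyed by user: clickStream.keyBy(c -> c.getUserId()).intervalJoin(conversionStream.keyBy(c -> c.getUserId())).between(Time.hours(0), Time.hours(24)), meaning the conversion can follow the click by up to 24 hours (getUserId is an assumed accessor). The bounds are relative to the left-hand stream, so between(Time.hours(-24), Time.hours(0)) with clicks on the left would instead match conversions that happened before the click.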

The state implication: click events are buffered for at most 24 hours waiting for a conversion. After 24 hours, they are expired from the join state. This bounds state to at most 24 hours' worth of unmatched click events — a constant upper bound, not an infinite one. For 1 million clicks/hour × 24 hours × 100 bytes/click = ~2.4 GB of state. Known, bounded, manageable.

Set the time bound too narrow (e.g., 1 hour) and you miss real attributions from users who browse then convert later. Set it too wide (e.g., 30 days) and state balloons. The correct bound is a business decision, not a technical one — but the technical consequence of the bound (state size) must feed back into that decision.

Scenario: Your team needs to build three separate streaming jobs. For each, pick the best framework from: Apache Flink, Kafka Streams, Spark Structured Streaming. Explain your choice.

Job A: Real-time fraud detector. Sub-100ms end-to-end latency required. Rich stateful logic: multi-feature velocity checks, async Redis enrichment, model scoring. 50,000 events/second peak.
Job B: Daily revenue aggregation refreshed every minute. Uses existing Spark cluster. Team is experienced with Spark DataFrame API. Results feed a BI dashboard queried by analysts via SQL.
Job C: A lightweight per-user session counter embedded inside a Java microservice. No separate cluster. Must deploy as a standard JAR. 5,000 events/second. Low ops overhead priority.

Match the framework to the deployment model and latency requirement. Kafka Streams is a library, not a cluster. Spark Structured Streaming is micro-batch. Flink is true streaming with the richest stateful operator API.

Solutions:

Job A → Apache Flink. Sub-100ms latency rules out Spark Structured Streaming (micro-batch model has inherent batch interval latency, typically 100ms–1s minimum). Flink's true event-at-a-time processing achieves <10ms operator latency with correct configuration. The async I/O API for Redis enrichment is native to Flink (AsyncFunction) and is the correct solution for low-latency enrichment. Flink's keyed state with RocksDB handles the stateful velocity checks efficiently at 50k events/sec.

Job B → Spark Structured Streaming. The team has an existing Spark cluster and DataFrame expertise — reusing both saves months of platform migration. A "refreshed every minute" SLA is a micro-batch model, which is Spark's natural execution model. Analysts querying via SQL is a Spark strength (Spark SQL, Delta Lake integration). The latency requirement (minute-level refresh) does not require true streaming. Operational continuity and DataFrame familiarity outweigh any theoretical performance advantages of Flink here.

Job C → Kafka Streams. Kafka Streams is a Java library that runs inside an existing JVM process — no separate cluster, no cluster management, no additional infrastructure cost. It deploys as a standard JAR. For 5,000 events/sec with simple session counting (a single keyed aggregation), Kafka Streams is more than sufficient and dramatically simpler to operate than standing up a Flink cluster. The "low ops overhead" constraint is the deciding factor. Note: Kafka Streams requires Kafka — if the microservice is already reading from Kafka (as most microservices do in event-driven architectures), this is zero additional infrastructure.
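The reasoning behind the three answers can be written down as a small decision function. This is only a sketch of the heuristics used in this exercise, with hypothetical names and an assumed 1-second cut-off for "needs true streaming"; real framework choices involve more factors than three inputs.

```java
enum Framework { FLINK, KAFKA_STREAMS, SPARK_STRUCTURED_STREAMING }

final class FrameworkChooser {
    /** Applies the exercise's rules in priority order: deployment model, latency, existing platform. */
    static Framework choose(long latencyBudgetMs, boolean mustEmbedInService, boolean hasSparkClusterAndSkills) {
        if (mustEmbedInService) {
            return Framework.KAFKA_STREAMS;              // library, no separate cluster
        }
        if (latencyBudgetMs < 1_000) {
            return Framework.FLINK;                      // micro-batch interval is too coarse
        }
        if (hasSparkClusterAndSkills) {
            return Framework.SPARK_STRUCTURED_STREAMING; // reuse platform and expertise
        }
        return Framework.FLINK;
    }

    public static void main(String[] args) {
        System.out.println("Job A: " + choose(100, false, false));
        System.out.println("Job B: " + choose(60_000, false, true));
        System.out.println("Job C: " + choose(1_000, true, false));
    }
}
```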

Five exercises covering the five reasoning skills every streaming engineer needs: picking the correct window type for a "rolling count updated every minute" requirement (hopping window, size=60, hop=1); diagnosing a 2-hour watermark lag (stalled Kafka partition is most likely); setting TTL for time-bounded per-user aggregations (window + allowed lateness + buffer); understanding why stream-stream joins require time bounds (unbounded state without them); and matching framework to job requirements (Flink for sub-100ms, Spark for existing infrastructure + micro-batch, Kafka Streams for embedded library deployment). If you worked through all five without peeking, you understand streaming at a level that lets you design systems, not just describe them.

Section 19

Bug Studies — When Stream Processors Go Wrong

Theory is clean. Production is messy. These four incidents are composite reconstructions of real failure modes — the kind of bugs that only manifest at scale, under load, or after a network hiccup at exactly the wrong moment. Each one teaches a principle that no amount of documentation drilling will ingrain as deeply as a war story.

Bug 1 — "Our Hourly Purchase Metric Was Lying for Three Days"

Incident: A mobile commerce platform used a streaming job to compute "purchases in the last hour" for a real-time dashboard. The metric was used by the ops team to detect dropped traffic. After a widespread mobile network outage lasting 90 minutes, the metric showed a dramatic spike — not a dip — for the next four hours, because offline clients retried queued events when connectivity returned. Events from three and four hours ago flooded in, all timestamped with processing time (wall clock on the processor), so the job counted them as "happening now." The ops team thought they had a traffic surge. They were actually watching a replay of history.

What Went Wrong

The engineer who wrote the job used System.currentTimeMillis() (processing time) to assign timestamps to events. This is the simplest choice, but it records "when the processor saw the event," not "when the event actually happened." When mobile clients caught up after the outage, hours' worth of queued events arrived in minutes. The processor stamped them all with the current wall clock, so the sliding-window aggregation counted them as current purchases. The dashboard lied for hours until the queue drained.

The diagram shows why the bug is invisible in normal operation. Most of the time events arrive within milliseconds of being created. Processing time ≈ event time, so the metric is correct. But the moment you have any buffering — mobile offline mode, a slow consumer, a batch replay — the gap explodes. Processing-time metrics look accurate right up until the moment they catastrophically mislead you.

    // BAD: stamping with wall-clock time on the processor
    DataStream<Purchase> purchases = source
        .map(raw -> {
            Purchase p = parse(raw);
            p.setTimestamp(System.currentTimeMillis()); // ← wall clock on THIS machine
            return p;
        })
        .assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<Purchase>forMonotonousTimestamps() // ← assumes events are always on time
                .withTimestampAssigner((p, t) -> p.getTimestamp()));

    // "purchases in last hour" window
    purchases
        .keyBy(Purchase::getUserId)
        .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(5)))
        .sum("amount")
        .print();

    // When offline clients flush 4-hour-old events, they all land in the "last hour" window.

    // GOOD: use the timestamp the CLIENT embedded in the event payload
    DataStream<Purchase> purchases = source
        .assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<RawEvent>forBoundedOutOfOrderness(Duration.ofMinutes(5)) // ← tolerate 5 min of disorder
                .withTimestampAssigner((event, t) -> event.getClientTimestampMs())) // ← client clock
        .map(raw -> parse(raw));

    // Same window, but now event time is correct
    purchases
        .keyBy(Purchase::getUserId)
        .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(5)))
        .allowedLateness(Time.minutes(10)) // ← handle stragglers gracefully
        .sum("amount")
        .print();

    // Replayed mobile events now carry their real event time, so they can no longer inflate "now".
    // Events that arrive later than watermark + allowed lateness are dropped from the window
    // result unless you route them to a side output (sideOutputLateData) and backfill them.

Lesson: Always embed a client-generated timestamp in the event payload before it reaches Kafka. Never rely on the moment of processing as a proxy for the moment of occurrence. If you don't control the producer, fall back to the Kafka record timestamp: CreateTime (set by the producer client when it sends the record) or LogAppendTime (set by the broker on ingestion). Neither is perfect, because a client that was offline still sends late, but both are much closer to the truth than the processor's wall clock.
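The difference between the two clocks is easy to reproduce without Flink. The sketch below is a simplified illustration with hypothetical names: it uses fixed one-hour buckets rather than the sliding window from the job above, and three hand-made events.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TimeSemanticsDemo {
    static final long HOUR_MS = 3_600_000L;

    /** A purchase with the client's event time and the time the processor saw it. */
    static final class Purchase {
        final long eventTimeMs;
        final long arrivalTimeMs;

        Purchase(long eventTimeMs, long arrivalTimeMs) {
            this.eventTimeMs = eventTimeMs;
            this.arrivalTimeMs = arrivalTimeMs;
        }
    }

    /** Counts purchases per one-hour bucket, keyed by hour index. */
    static Map<Long, Integer> countPerHour(List<Purchase> purchases, boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Purchase p : purchases) {
            long ts = useEventTime ? p.eventTimeMs : p.arrivalTimeMs;
            counts.merge(ts / HOUR_MS, 1, Integer::sum);
        }
        return counts;
    }

    /** Two purchases queued offline in hours 0 and 1 and flushed in hour 4, plus one live purchase. */
    static List<Purchase> sample() {
        return List.of(
            new Purchase(0 * HOUR_MS + 10, 4 * HOUR_MS + 1),
            new Purchase(1 * HOUR_MS + 10, 4 * HOUR_MS + 2),
            new Purchase(4 * HOUR_MS + 5, 4 * HOUR_MS + 6));
    }

    public static void main(String[] args) {
        System.out.println("processing time: " + countPerHour(sample(), false)); // {4=3}
        System.out.println("event time:      " + countPerHour(sample(), true));  // {0=1, 1=1, 4=1}
    }
}
```

With processing time, all three purchases pile into hour 4, which is the false spike the ops team saw. With event time, each purchase lands in the hour it actually happened.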

Bug 2 — "Our Stream-Table Join Ate 200GB of Memory"

Incident: A team enriched a clickstream with user profile data by joining the click events stream against a "user_profiles" Kafka topic (a changelog table). As the product grew, new users joined every day. Each new user record landed in the join's internal state store. After 18 months, the Flink job's state had grown to 200 GB. The job ran fine until a TaskManager was replaced during a cluster upgrade: the replacement could not allocate enough memory to restore the 200 GB of state and hit OOM. The job crashed, couldn't restore from checkpoint (the checkpoint was also 200 GB and the recovery timed out), and the team had to cold-start with 6 hours of event replay.

What Went Wrong

A stream-table join is seductive because it looks like a database join. But unlike a database, the stream processor keeps the entire join side in memory / state store across the job's lifetime. Every user who ever existed accumulates in the state store — including users who haven't clicked in two years. No TTL was configured. The state just grew. Indefinitely.

    // BAD: join with no TTL, so state accumulates forever
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // Kafka changelog → Table (every user ever seen is kept in state)
    tEnv.executeSql(
        "CREATE TABLE user_profiles (user_id STRING, tier STRING, region STRING, " +
        "PRIMARY KEY (user_id) NOT ENFORCED) " +
        "WITH ('connector'='kafka', 'topic'='user-profiles', ...)");

    tEnv.executeSql(
        "CREATE TABLE clicks (user_id STRING, page STRING, ts TIMESTAMP(3), ...) " +
        "WITH ('connector'='kafka', 'topic'='clicks', ...)");

    // This join keeps ALL user_profiles state in RocksDB forever
    tEnv.executeSql(
        "INSERT INTO enriched_clicks " +
        "SELECT c.user_id, c.page, u.tier, u.region, c.ts " +
        "FROM clicks c JOIN user_profiles FOR SYSTEM_TIME AS OF c.ts AS u " +
        "ON c.user_id = u.user_id");

    // After 18 months: 200 GB state. After cluster upgrade: OOM.

    // GOOD: configure idle state retention so stale entries are evicted
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // Tell Flink to evict state not accessed for 7 days
    tEnv.getConfig().setIdleStateRetention(Duration.ofDays(7));

    tEnv.executeSql(
        "CREATE TABLE user_profiles (user_id STRING, tier STRING, region STRING, " +
        "PRIMARY KEY (user_id) NOT ENFORCED) " +
        "WITH ('connector'='kafka', 'topic'='user-profiles', ...)");

    tEnv.executeSql(
        "CREATE TABLE clicks (user_id STRING, page STRING, ts TIMESTAMP(3), ...) " +
        "WITH ('connector'='kafka', 'topic'='clicks', ...)");

    // Same join; now state for users who haven't clicked in 7 days is cleaned up
    tEnv.executeSql(
        "INSERT INTO enriched_clicks " +
        "SELECT c.user_id, c.page, u.tier, u.region, c.ts " +
        "FROM clicks c JOIN user_profiles FOR SYSTEM_TIME AS OF c.ts AS u " +
        "ON c.user_id = u.user_id");

    // Also: set alerting on state-size growth rate, not just absolute size.

Lesson: Streaming state is not like a database table: nothing evicts rows for you unless you configure it. Every stateful operator must have an explicit TTL or idle-state retention policy. Set up alerting on state growth rate, not just the current absolute size. The 200 GB in this incident accumulated at only about 365 MB per day (200 GB over roughly 18 months), a rate that never trips an absolute-size alert on any single day but is fatal over 18 months.
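A growth-rate alert can be as simple as projecting when the state will hit capacity. The sketch below assumes linear growth and uses hypothetical names and thresholds; in practice the inputs would come from your checkpoint-size or state-size metrics.

```java
final class StateGrowthAlert {
    /** Days until state reaches capacity at the current (assumed linear) growth rate. */
    static long daysUntilFull(long currentBytes, long capacityBytes, long growthBytesPerDay) {
        if (growthBytesPerDay <= 0) {
            return Long.MAX_VALUE; // not growing: never fills
        }
        long remaining = capacityBytes - currentBytes;
        if (remaining <= 0) {
            return 0; // already at or over capacity
        }
        return remaining / growthBytesPerDay;
    }

    /** Alert when the projected time-to-full falls inside the planning horizon. */
    static boolean shouldAlert(long currentBytes, long capacityBytes, long growthBytesPerDay, long horizonDays) {
        return daysUntilFull(currentBytes, capacityBytes, growthBytesPerDay) <= horizonDays;
    }

    public static void main(String[] args) {
        long current = 20_000_000_000L;    // 20 GB today
        long capacity = 200_000_000_000L;  // 200 GB ceiling
        long growth = 365_000_000L;        // ~365 MB per day
        System.out.println(daysUntilFull(current, capacity, growth)); // 493
        System.out.println(shouldAlert(current, capacity, growth, 90));
        System.out.println(shouldAlert(current, capacity, growth, 500));
    }
}
```

An absolute-size alert at, say, 150 GB would stay silent for more than a year in this example, while the projection makes the trend visible from the first week.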

Bug 3 — "We Lost 99.7% of Late Events for Three Weeks (Silently)"

Incident: A data pipeline computed hourly metrics from a multi-source Kafka topic. One of the 12 source partitions was fed by a slow path, so its events consistently arrived about 30 seconds behind the others. The watermark was configured with only 1 second of out-of-orderness tolerance. The team assumed the slow partition would simply hold the watermark back, because Flink takes the minimum watermark across its inputs. In this job the watermark was effectively driven by the fast partitions, so it ran about 30 seconds ahead of the slow partition's events. Any slow-partition event that arrived after its window had closed was dropped as "too late." After investigation: 99.7% of windows closed before the slow partition's events for that window had arrived. Silent data loss, with no error, no dead-letter queue, and no alert, for three weeks.

What Went Wrong

The team set out-of-orderness to 1 second because "our events are nearly real-time." That's true for 11 of the 12 source partitions. The minimum-across-inputs rule only protects a slow partition while that partition contributes its own watermark. When it does, one slow source drags the watermark for everyone: windows fire late, but nothing is dropped. When it does not (for example, because the watermark generator runs on the already-merged stream, or because the slow partition has been marked idle), the watermark follows the fast sources and the slow source's events become late data. The 1-second tolerance was set for the happy-path case and ignored the tail behavior of the slow partition.

The diagram shows the crux of the problem: the watermark that closes the window is ahead of the slow partition. Events from the fast partitions that arrive with event-time between T70 and T100 are fine, but events from the slow partition that also have event-time in that range arrive after the window has already closed. They are silently dropped.

    // BAD: 1 second out-of-orderness for a source with heterogeneous partition lag
    WatermarkStrategy<Event> strategy = WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(1)) // ← assumes all partitions are fast
        .withTimestampAssigner((e, t) -> e.eventTimestamp());

    DataStream<Event> events = env
        .fromSource(kafkaSource, strategy, "Kafka Source");

    // No late-event routing: dropped events vanish silently. No metric, no DLQ.

    // GOOD: generous tolerance + dead-letter routing + lag monitoring
    WatermarkStrategy<Event> strategy = WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(60)) // ← covers slow partition tail
        .withTimestampAssigner((e, t) -> e.eventTimestamp())
        .withIdleness(Duration.ofSeconds(30)); // ← don't let an idle partition freeze watermark

    DataStream<Event> events = env
        .fromSource(kafkaSource, strategy, "Kafka Source");

    // Route genuinely late events to a side output for analysis
    OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

    SingleOutputStreamOperator<Metric> main = events
        .keyBy(Event::getKey)
        .window(TumblingEventTimeWindows.of(Time.hours(1)))
        .allowedLateness(Time.seconds(60))
        .sideOutputLateData(lateTag) // ← collect, don't silently discard
        .aggregate(new CountAgg());

    // Ship late events to a monitoring topic
    main.getSideOutput(lateTag)
        .addSink(lateEventSink); // alert if rate exceeds threshold

Lesson: Set allowed out-of-orderness based on the slowest source partition under load, not the average-case fast path. Always route late events to a side output or dead-letter queue with a metric. If your drop rate is zero, great — but you need the metric to know it's zero. Silent drops are the worst class of streaming bug because they accumulate undetected until someone notices a number doesn't add up.
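The effect of the tolerance setting can be reproduced in a few lines. The sketch below is a simplified model, not Flink's implementation: it assumes the watermark is computed over the merged arrival order as (highest timestamp seen − tolerance − 1 ms), uses tumbling windows with no allowed lateness, and updates the watermark on every event rather than periodically. All names are hypothetical.

```java
final class WatermarkDropDemo {
    /** Counts events that arrive after the watermark has already passed the end of their window. */
    static int countDropped(long[] eventTimesInArrivalOrder, long windowSizeMs, long toleranceMs) {
        long maxSeen = Long.MIN_VALUE;
        int dropped = 0;
        for (long ts : eventTimesInArrivalOrder) {
            // Watermark before this event is processed; "no watermark yet" until the first event.
            long watermark = (maxSeen == Long.MIN_VALUE) ? Long.MIN_VALUE : maxSeen - toleranceMs - 1;
            long windowEnd = (ts / windowSizeMs + 1) * windowSizeMs; // assumes ts >= 0
            if (windowEnd - 1 <= watermark) {
                dropped++; // the window has already fired and been cleaned up
            }
            maxSeen = Math.max(maxSeen, ts);
        }
        return dropped;
    }

    public static void main(String[] args) {
        // One-minute windows for brevity. Fast events at 50s and 61s, then a slow event from 31s.
        long[] arrivals = {50_000L, 61_000L, 31_000L};
        System.out.println(countDropped(arrivals, 60_000L, 1_000L));  // 1 (slow event is lost)
        System.out.println(countDropped(arrivals, 60_000L, 60_000L)); // 0 (tolerance covers the lag)
    }
}
```

With a 1-second tolerance, the fast event at 61 s pushes the watermark to 59,999 ms, which closes the first window before the slow event arrives. With a 60-second tolerance the watermark is still at 999 ms, so the slow event is counted.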

Bug 4 — "We Deduplicated Our Way Into Data Loss"

Incident: A payments team suspected their Flink job was emitting duplicate records under certain retry scenarios. In response, a backend engineer added deduplication logic in the consumer service: "if we've seen this event ID in the last 24 hours, ignore it." The Flink job already had exactly-once semantics enabled end-to-end. The consumer's dedup layer was now treating valid events that arrived slightly delayed — not duplicates, just late — as duplicates. Payment completion events were being silently dropped. Revenue went uncredited. The bug went undetected for nine days.

What Went Wrong

Exactly-once in a Flink pipeline means: within the Flink job, no event is counted twice. It does NOT mean the downstream consumer receives each message exactly once — that depends on the delivery guarantee of the sink (Kafka, database, etc.). The engineer treated "exactly-once in the pipeline" as "exactly-once delivery to the consumer," which is a different guarantee. The consumer's home-built dedup was also too aggressive — its 24-hour window was keyed on event ID, so any retry or delivery with the same event ID within 24 hours was dropped, even if the first attempt had been acknowledged by Flink but not yet consumed by the downstream service.

    // CONSUMER SIDE: the engineer added this "safety net"
    Set<String> seenEventIds = new HashSet<>(); // In practice: Redis with 24h TTL

    public void processEvent(PaymentEvent event) {
        if (seenEventIds.contains(event.getId())) {
            // "We've seen this, so it must be a duplicate from Flink retries"
            log.warn("Dropping duplicate event: {}", event.getId());
            return; // ← WRONG: this event may be brand-new, not a Flink retry
        }
        seenEventIds.add(event.getId()); // ← recorded BEFORE the credit succeeds
        creditPayment(event); // actual business logic
    }

    // Problem: Flink already guarantees exactly-once in the pipeline.
    // Legitimate delayed events look "duplicate" to this consumer logic.
    // The ID is also marked as seen before creditPayment succeeds, so a failed
    // credit that is redelivered is dropped as well.
    // Result: valid payment events are dropped silently.

    // CORRECT: understand what exactly-once ACTUALLY covers
    // Flink exactly-once = within the Flink job's state machine only.
    // The Kafka sink uses transactional writes (KafkaSink with DeliveryGuarantee.EXACTLY_ONCE).
    // Downstream consumer reads from Kafka with isolation.level=read_committed.
    // This gives exactly-once delivery FROM Flink TO the Kafka topic.

    // Consumer simply reads and processes, with no home-built dedup needed:
    public void processEvent(PaymentEvent event) {
        // Trust the pipeline. If you still need idempotency at the DB layer,
        // use an INSERT ... ON CONFLICT DO NOTHING or upsert, not application dedup.
        creditPayment(event);
    }

    // If you DO need consumer-side dedup (for an at-least-once source),
    // key it on a stable business idempotency key (for example the payment ID)
    // and record that key in the same transaction as the side effect, so a failed
    // attempt is never remembered as "seen" and a redelivery can still succeed.
    // If the dedup store has a TTL, size it to the source's maximum redelivery delay.

Lesson: Exactly-once is a property of the pipeline's internal state, not of the downstream delivery contract. Understand exactly what boundary the guarantee covers before layering additional dedup logic on top. Home-built dedup in consumers is dangerous — it must be idempotent itself, it must handle legitimate late arrivals, and it must not fight the framework's own guarantees. If you genuinely need dedup at the consumer, use database upserts (idempotent by nature) rather than an in-memory set.
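The upsert approach can be sketched with an in-memory map standing in for the database table. The names below are hypothetical, and a real implementation would rely on a unique constraint (for example INSERT ... ON CONFLICT DO NOTHING) rather than a ConcurrentHashMap; the point is that the "have we seen it" check and the credit are the same write.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotentLedger {
    // Stand-in for a table with a unique key on payment ID.
    private final Map<String, Long> creditsByPaymentId = new ConcurrentHashMap<>();

    /** Applies the credit once; returns false if this payment was already credited. */
    boolean credit(String paymentId, long amountCents) {
        return creditsByPaymentId.putIfAbsent(paymentId, amountCents) == null;
    }

    long totalCents() {
        long total = 0;
        for (long amount : creditsByPaymentId.values()) {
            total += amount;
        }
        return total;
    }

    public static void main(String[] args) {
        IdempotentLedger ledger = new IdempotentLedger();
        System.out.println(ledger.credit("pay-1", 500)); // first delivery is applied
        System.out.println(ledger.credit("pay-1", 500)); // redelivery is a no-op
        System.out.println(ledger.credit("pay-2", 250)); // a different payment is never mistaken for a duplicate
        System.out.println(ledger.totalCents());         // 750
    }
}
```

Because there is no separate "seen" set, there is no state in which an ID has been recorded but the payment has not been credited.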

The four bugs share a theme: the hardest streaming failures are the silent ones. Using processing time instead of event time looks fine until a network blip proves it wrong. State without TTLs looks fine until a cluster upgrade forces recovery. A tight watermark tolerance looks fine until one slow partition silently drops 99.7% of events. Consumer dedup fighting an exactly-once pipeline looks fine until a payment goes uncredited. Every streaming system needs metrics for late-event drop rate, state growth rate, and end-to-end latency — alert before the silent failure becomes a visible crisis.

Section 20

Real-World Architectures — How Big Companies Use Streaming

Stream processing is not an academic exercise. It runs the dashboards that show engineers what their systems are doing right now. It powers the fraud scores that fire before a payment completes. It feeds the recommendation models that update based on what users clicked three seconds ago. Here is how five companies actually built their streaming infrastructure — the architecture shape, the trade-offs they made, and why those choices were right for their context.

This generic shape — producers → broker → stream processor → multiple sinks — repeats at every company below. What differs is what runs at each box, how the state is managed, and what latency target drives the design. Understanding the shape means you can read any company's architecture blog post and immediately identify where their innovation actually lives.

Netflix Mantis — Operational Intelligence at Enormous Scale

Netflix's Mantis platform is a custom-built reactive stream processor (originally running on Apache Mesos with Netflix's own Fenzo scheduler, open-sourced in 2019) designed for what Netflix calls "operational intelligence" — real-time analysis of the system itself. When a CDN node starts dropping packets, when a client region suddenly spikes error rates, when a specific title starts buffering more than usual — Mantis detects these events and routes them to the right team within seconds. The scale is genuinely large: Netflix processes trillions of events per day across its observability pipeline, with Mantis covering the operational/observability use cases and a separate Flink-based platform (Keystone) handling analytics workloads.

The interesting architectural choice Netflix made is treating stream processing as a first-class engineering discipline, not a bolt-on analytics tool. Mantis jobs are deployed the same way microservices are deployed — versioned, rolled out with canaries, rolled back if a health check fails. Jobs have SLOs. The operational model for a stream processing job looks exactly like the operational model for an API service. This is why Mantis can run reliably at scale: the operational discipline comes from applying proven software engineering practices to a domain (streaming) that often gets treated as a "data team problem."

Key pattern: Treat stream processing jobs like production services — with versioning, canary deployments, SLOs, and on-call rotation. Don't let them live in "data team land" where operational practices are looser.

Uber's Real-Time Data Platform — Flink for Surge Pricing and Fraud

Uber's streaming infrastructure grew from a specific pain point: surge pricing. Surge pricing requires knowing, in real time, the ratio of supply (available drivers) to demand (rider requests) in each geographic cell. A batch job that runs every 10 minutes is useless — by the time the batch runs, the supply/demand ratio has changed. The latency requirement is under 30 seconds, ideally under 10. This forced Uber to build real stream processing rather than "frequent batch."

Uber settled on Apache Flink as their primary compute engine. The streaming jobs consume from their internal event bus (built on Kafka), compute geographic aggregations using Flink's keyed state, and write results to their operational data stores. The same infrastructure also runs fraud detection — a transaction arrives, the streaming job enriches it with recent account history (from state), computes a risk score, and writes the score to a cache that the payment authorization service reads before approving the charge. The fraud score must be available within the authorization window, which is typically under 200 milliseconds from the score computation request.

Key pattern: Uber separates "online" streaming (low-latency, serves real-time decisions) from "near-real-time" streaming (slightly higher latency, feeds dashboards and ML training). Different SLOs → different pipeline configs for the same Flink infrastructure.

Stripe Radar — Sub-100ms Fraud Scoring in the Payment Path

Stripe's fraud system, Radar, is interesting because the latency constraint is genuinely extreme: the fraud score must be computed and returned within the synchronous payment API call. A payment authorization that takes 300 ms instead of 100 ms is visible to users and costs conversion rate. This means the streaming pipeline that builds the features Radar uses must write its outputs to a low-latency read path (typically an in-memory cache or a key-value store optimized for sub-millisecond reads).

The streaming jobs at Stripe continuously update feature vectors: how many payments has this card made in the last 15 minutes? Has this email address appeared in a declined-payment stream? What is the velocity of charges from this IP? These feature vectors are written to a fast cache. When a new payment arrives, Radar reads from the cache (not from the streaming job), computes the fraud score using the pre-computed features, and returns the result synchronously. The streaming job and the scoring service are decoupled — the stream processor updates features asynchronously, the scorer reads pre-computed features synchronously.

Key pattern: When streaming outputs must feed synchronous API calls, decouple the write path (streaming job → cache update) from the read path (API call → cache read). The stream processor never sits in the synchronous call path.
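A minimal sketch of that decoupling, using an in-process map as a stand-in for the cache or key-value store. The class, method names, and the single velocity feature are illustrative assumptions, not a description of Stripe's implementation; time-windowing of the counter is omitted for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FeatureCacheDemo {
    // Shared low-latency store: written by the streaming job, read by the scorer.
    private final Map<String, Long> recentPaymentsByCard = new ConcurrentHashMap<>();

    /** Write path: called by the streaming job for each payment event, off the request path. */
    void onPaymentEvent(String cardId) {
        recentPaymentsByCard.merge(cardId, 1L, Long::sum);
    }

    /** Read path: called inside the synchronous API request; it only reads precomputed features. */
    boolean looksRisky(String cardId, long velocityThreshold) {
        return recentPaymentsByCard.getOrDefault(cardId, 0L) > velocityThreshold;
    }

    public static void main(String[] args) {
        FeatureCacheDemo cache = new FeatureCacheDemo();
        for (int i = 0; i < 6; i++) {
            cache.onPaymentEvent("card-1"); // the stream processor keeps the feature fresh
        }
        System.out.println(cache.looksRisky("card-1", 5)); // scorer reads the cached count
        System.out.println(cache.looksRisky("card-2", 5)); // unknown card defaults to zero
    }
}
```

The scorer's latency depends only on the cache read, so a slow or restarting streaming job makes features stale but never blocks a payment.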

LinkedIn Samza — The Original "Streaming as First-Class Citizen"

LinkedIn built Samza (now Apache Samza) because their use cases (real-time feed ranking updates, ML feature computation, observability) required stateful stream processing before Flink existed in its current form. Samza's key design choice was using Kafka both as the message bus and as the durable log for state. Samza tasks keep their state in local RocksDB instances, write every state change to a Kafka changelog topic, and replay that changelog to rebuild state after a failure. It is an elegant architecture that avoids depending on a separate durable store such as S3 or HDFS for state snapshots.

LinkedIn's feed uses Samza to compute real-time signals: how many connections liked this post in the last 10 minutes, is this post trending in a specific industry, has this author published multiple posts in a short window (potential spam). These signals update the ranking model in near-real time. The streaming job reads from Kafka topics that receive activity events (likes, comments, shares), maintains counts in local RocksDB state, and writes updated signal vectors back to Kafka topics that the feed ranking service consumes.

Key pattern: Samza collapses state backend and event bus into Kafka — elegant for organizations already deeply committed to Kafka. The trade-off is that recovery requires replaying potentially large Kafka topics, which can be slow after a major failure.

Pinterest Goku — Streaming Aggregation for Time-Series Metrics

Pinterest's Goku system handles a specific problem: collecting, aggregating, and serving billions of time-series metrics from their infrastructure. Every service emits metrics. Aggregating them (sum, count, percentile) across many services, hosts, and time windows requires streaming computation. Goku uses a streaming aggregation layer that pre-computes rollups (per-minute, per-5-minute, per-hour) so that dashboards can query pre-aggregated data rather than raw events. The streaming layer makes dashboards fast; without it, every dashboard query would need to scan billions of raw metric points.

The architecture is simpler than Stripe's or Uber's — no ML scoring, no geographic keyed state. It is essentially: emit metric → Kafka → streaming aggregation tier → aggregated rollup store (Goku) → dashboard reads from rollup store. The simplicity is deliberate: the team chose the right tool for the problem size rather than over-engineering for theoretical scale they didn't need.

Key pattern: Streaming aggregation for time-series metrics is one of the most ROI-positive uses of stream processing. It converts an otherwise O(events) query into an O(1) lookup, making real-time dashboards fast without a data warehouse.
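The rollup idea can be illustrated with a per-minute pre-aggregation in plain Java. This is a generic sketch with hypothetical names, not Goku's design: each incoming point updates a (sum, count) pair for its minute, so a dashboard query for that minute is a single map lookup rather than a scan over raw points.

```java
import java.util.HashMap;
import java.util.Map;

final class MinuteRollup {
    // minute index -> {sum, count}
    private final Map<Long, long[]> buckets = new HashMap<>();

    /** Streaming side: fold one raw metric point into its minute bucket. */
    void record(long timestampMs, long value) {
        long[] bucket = buckets.computeIfAbsent(timestampMs / 60_000L, k -> new long[2]);
        bucket[0] += value;
        bucket[1] += 1;
    }

    /** Query side: constant-time lookups on the pre-aggregated bucket. */
    long sum(long minute) {
        long[] bucket = buckets.get(minute);
        return bucket == null ? 0 : bucket[0];
    }

    long count(long minute) {
        long[] bucket = buckets.get(minute);
        return bucket == null ? 0 : bucket[1];
    }

    public static void main(String[] args) {
        MinuteRollup rollup = new MinuteRollup();
        rollup.record(1_000L, 40);   // minute 0
        rollup.record(59_000L, 60);  // minute 0
        rollup.record(61_000L, 5);   // minute 1
        System.out.println(rollup.sum(0) + " over " + rollup.count(0) + " points"); // 100 over 2 points
    }
}
```

Sums and counts compose into coarser rollups (5-minute, hourly) by addition; percentiles do not, and would need a mergeable sketch structure instead.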

Every company's streaming architecture follows the same fundamental shape — producers, broker, processor, sinks — but the innovation lives in the operational practices (Netflix), the latency contracts (Stripe, Uber), the state backend choices (LinkedIn), and the query shape (Pinterest). Understanding the generic shape means you can read any architecture blog post and immediately identify where the interesting trade-offs actually live, rather than being dazzled by the scale numbers.

Section 21

Common Misconceptions — Getting the Mental Model Right

Stream processing has a set of persistent misconceptions that trip up experienced engineers who are new to it. These aren't beginner mistakes — they're the kind of subtle wrong assumptions that lead to production incidents. Each one below is a real belief that real engineers have held in real code review discussions.

The wrong belief: Streaming is what you do when you want your batch job to run more often. It's just a batch job with a shorter schedule interval.

Why it's wrong: Batch and stream processing have fundamentally different mental models. In batch processing, data has a clear end — you load all the rows, process them, write results, done. The data is bounded. You can sort it. You can count it. You can take the max over all of it. In stream processing, data is unbounded — there is no last row, no end of file, no "I've seen everything." This changes every algorithm. You can't sort an infinite stream. You can't compute the exact median of an infinite stream without storing every element. Windowing (cutting infinity into finite chunks) only exists because you can't do unbounded aggregation. The constraint of unboundedness forces entirely different data structures, algorithms, and operational practices. "Faster batch" will get you killed by the first event that arrives 10 minutes late.

The right mental model: Streaming is about queries over infinite, ever-changing data. Batch is about queries over a fixed, complete dataset. The data model is fundamentally different — not just the schedule.

The wrong belief: If my Flink job has exactly-once semantics enabled, my downstream consumers will never see duplicate events, and I don't need to handle idempotency anywhere else.

Why it's wrong: Exactly-once in a stream processing framework is a guarantee about the pipeline's internal state. It means: given a failure and recovery, the pipeline's state (counters, aggregations, joined results) reflects exactly what you'd get if no failure had occurred. It does NOT guarantee that every write to an external system (a database, an HTTP endpoint, a non-transactional Kafka topic) happens exactly once. External side effects must be made idempotent separately. The Flink-to-Kafka exactly-once guarantee requires transactional writes to Kafka and that consumers read with isolation.level=read_committed. For any non-transactional sink (a REST API, a Redis INCR, an S3 file write), you must design for idempotency in the sink itself — exactly-once in the pipeline doesn't help you there.

The right mental model: Exactly-once covers the pipeline's state machine. Idempotency at external sinks is your job. Draw the boundary clearly: inside the pipeline = framework's responsibility; outside the pipeline = your responsibility.
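The boundary is easiest to see by replaying a write. The sketch below models two kinds of external sink in plain Java, with hypothetical names: one that increments a counter on every write, and one that upserts a row keyed by window start. The sample input assumes the job re-emits the second window's result after recovering from a failure.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SinkReplayDemo {
    /** Non-idempotent sink: every write adds to a running counter (similar to a Redis INCRBY). */
    static long applyAsIncrements(List<long[]> writes) {
        long counter = 0;
        for (long[] write : writes) {
            counter += write[1];
        }
        return counter;
    }

    /** Idempotent sink: every write overwrites the row for its window (an upsert). */
    static long applyAsUpserts(List<long[]> writes) {
        Map<Long, Long> rowsByWindowStart = new HashMap<>();
        for (long[] write : writes) {
            rowsByWindowStart.put(write[0], write[1]);
        }
        long total = 0;
        for (long value : rowsByWindowStart.values()) {
            total += value;
        }
        return total;
    }

    /** Each write is {windowStartSeconds, count}; the window starting at 60 is written twice. */
    static List<long[]> sampleWrites() {
        return List.of(new long[] {0, 10}, new long[] {60, 7}, new long[] {60, 7});
    }

    public static void main(String[] args) {
        System.out.println(applyAsIncrements(sampleWrites())); // 24: the replayed write is double counted
        System.out.println(applyAsUpserts(sampleWrites()));    // 17: the replay overwrites the same row
    }
}
```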

The wrong belief: Doubling the number of Flink TaskManagers or Kafka Streams instances will halve the processing time. More parallelism = more speed, always.

Why it's wrong: Parallelism in streaming is bounded by key cardinality for keyed operators. If you're computing per-user aggregations and you have 100,000 distinct active users, then parallelism beyond 100,000 provides no benefit, because there are only 100,000 distinct keys to distribute. In practice, key skew is the more common problem: a small number of "hot keys" (a viral post ID, a popular product ID) receive a disproportionate fraction of events. Increasing parallelism doesn't help the hot-key partition; it sits with a backlog while the other partitions are idle. Excessive parallelism also adds overhead: more network shuffles, more state shards to checkpoint, more coordination. For simple filter-and-enrich pipelines with no keyed state, source parallelism beyond the input partition count gives no benefit, because the extra source subtasks have no partition to read. Downstream stateless operators can still scale further, but only if you redistribute records (for example with rebalance()) and accept the shuffle cost.

The right mental model: Parallelism is bounded by key cardinality for keyed ops, by input partition count for source operators, and by throughput need for stateless operators. Profile before scaling. Key skew is usually the actual bottleneck.
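The hot-key effect can be demonstrated with a toy partitioner. The sketch below is a simplification with hypothetical names: it assigns keys with String.hashCode modulo the parallelism, whereas Flink hashes keys into key groups, but the conclusion is the same because all events for one key must go to one subtask.

```java
import java.util.HashMap;
import java.util.Map;

final class KeySkewDemo {
    /** Returns the event count handled by the busiest partition. */
    static long maxPartitionLoad(Map<String, Long> eventsPerKey, int parallelism) {
        long[] load = new long[parallelism];
        for (Map.Entry<String, Long> entry : eventsPerKey.entrySet()) {
            int partition = Math.floorMod(entry.getKey().hashCode(), parallelism);
            load[partition] += entry.getValue();
        }
        long max = 0;
        for (long partitionLoad : load) {
            max = Math.max(max, partitionLoad);
        }
        return max;
    }

    /** One viral key with 1,000,000 events plus 1,000 ordinary keys with 100 events each. */
    static Map<String, Long> sample() {
        Map<String, Long> eventsPerKey = new HashMap<>();
        eventsPerKey.put("viral-post", 1_000_000L);
        for (int i = 0; i < 1_000; i++) {
            eventsPerKey.put("post-" + i, 100L);
        }
        return eventsPerKey;
    }

    public static void main(String[] args) {
        for (int parallelism : new int[] {4, 16, 64}) {
            System.out.println(parallelism + " partitions, busiest handles "
                + maxPartitionLoad(sample(), parallelism) + " events");
        }
    }
}
```

However many partitions you add, the busiest one never handles fewer than the 1,000,000 events of the hot key, so end-to-end throughput barely moves. Fixing this usually means changing the key (for example salting it and aggregating in two stages) rather than raising parallelism.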

The wrong belief: Apache Flink is the "serious" framework for low-latency streaming. Kafka Streams is for simple use cases. If you need millisecond latency, use Flink.

Why it's wrong: Kafka Streams is a Java library that runs in-process with your application. There is no network hop to a separate cluster — events are consumed from Kafka, processed in the same JVM, and results are written back to Kafka or to local state. For simple operations (filtering, mapping, per-key aggregations without complex joins), this in-process model can be lower latency than Flink because Flink requires network shuffles between TaskManagers for operations that require data redistribution. Flink's strength is complex stateful operators, long-window aggregations, exactly-once end-to-end, and massive scale with thousands of parallel tasks. For a service that needs to enrich Kafka events with simple logic and write results back to Kafka — Kafka Streams running in your existing service process often wins on latency, cost, and operational simplicity.

The right mental model: Flink wins on: complex stateful ops, large state, high event volume, exactly-once end-to-end across many sink types. Kafka Streams wins on: simple ops, Kafka-native deployment, low operational overhead, and low added latency for simple in-process operations, since there is no hop to a separate processing cluster.

The wrong belief: Lambda architecture (dual batch + stream pipeline) is obsolete. Kappa architecture (stream only) has replaced it. If you're building new, use Kappa.

Why it's wrong: Kappa architecture requires very mature streaming infrastructure: the ability to replay events from storage with the same pipeline that handles live events, operationally stable reprocessing tooling, and a stream processor reliable enough that your "batch" results (reprocessing) are trusted as much as historical batch results. Many organizations don't have this maturity. Lambda architecture's dual-pipeline complexity is a genuine cost — but the benefit is that the batch path (Spark, Hadoop, BigQuery) can reprocess historical data with the full power of mature batch tooling while the streaming path provides real-time results. For organizations where the streaming team is small, Lambda may be the pragmatic choice because the batch half requires less operational expertise in stream processing. Lambda is not dead — it's appropriate for a specific maturity level.

The right mental model: Lambda = two pipelines, operational complexity, but proven batch fallback. Kappa = one pipeline, conceptual simplicity, but demands mature streaming ops. Choose based on your team's streaming operational maturity, not on what the latest conference talks recommend.

The wrong belief: Once a watermark reaches T, it means the system has received all events with event-time ≤ T. Windows can close with confidence that no late data will arrive.

Why it's wrong: Watermarks are heuristics, not guarantees. They are estimates of how far event time has progressed in the stream. A watermark at T means "we believe most events with event-time ≤ T have arrived, but we acknowledge some may still be in transit." The out-of-orderness bound you configure is a guess about the worst-case delay in your system. In practice, true late data does arrive after the watermark passes. That's exactly why frameworks provide allowedLateness and side outputs: to handle the data that arrives after the watermark closes the window. A system designed as if watermarks are guarantees will silently drop late events. A system designed knowing watermarks are heuristics will route late events to a side output and handle them gracefully.

The right mental model: Watermarks are estimates, not promises. Design for late data from the start: configure allowedLateness, use side outputs for genuine stragglers, and monitor your late-event rate as a production metric.
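
To make the "estimate, not promise" idea concrete, here is a small sketch of a tumbling-window counter. It assumes a bounded-out-of-orderness watermark (largest timestamp seen minus a fixed bound). The class name `TumblingCounter` and its fields are hypothetical, and the lateness check is a simplification of what real frameworks do; window firing is not modelled at all.

```python
from collections import defaultdict

class TumblingCounter:
    """Counts events per tumbling window; events that are too late go to a side output."""

    def __init__(self, size, max_out_of_orderness, allowed_lateness):
        self.size = size
        self.bound = max_out_of_orderness
        self.lateness = allowed_lateness
        self.max_ts = None
        self.counts = defaultdict(int)  # window start -> event count
        self.late = []                  # side output for stragglers

    @property
    def watermark(self):
        # Heuristic: assume nothing older than (max timestamp - bound) is still in flight.
        return float("-inf") if self.max_ts is None else self.max_ts - self.bound

    def process(self, ts):
        window_start = ts - ts % self.size
        window_end = window_start + self.size
        if window_end + self.lateness <= self.watermark:
            self.late.append(ts)        # window already closed for good: do not drop silently
        else:
            self.counts[window_start] += 1
        self.max_ts = ts if self.max_ts is None else max(self.max_ts, ts)

counter = TumblingCounter(size=60, max_out_of_orderness=5, allowed_lateness=10)
for ts in [10, 50, 62, 58, 130, 59, 20]:
    counter.process(ts)
print(dict(counter.counts), counter.late)
```

The event at 58 arrives out of order but is still counted in its window; the events at 59 and 20 arrive after the watermark has jumped to 125 and land in the side output, where they can be measured and handled instead of vanishing.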

The wrong belief: Stateful stream processing — joins, aggregations with crash-safe state, exactly-once semantics — is only for large companies with dedicated streaming engineering teams. Most teams should stick to stateless pipelines or simple batch jobs.

Why it's wrong: Modern frameworks have dramatically reduced the complexity of stateful streaming. Kafka Streams' DSL lets you write per-user aggregations with a few lines of code, and the framework handles state persistence, recovery from changelog topics, and rebalancing automatically. Flink's SQL and Table API let you write stateful streaming logic that looks like plain SQL. Apache Beam's unified model means you can write once and run on Flink, Spark, or Google Cloud Dataflow. The operational complexity is still real (you need to think about state TTLs, checkpoint intervals, and reprocessing strategy), but the programming model complexity is no longer a barrier. A small team with solid Kafka knowledge can successfully operate a Kafka Streams application for common use cases without a dedicated streaming platform team.

The right mental model: The programming model for stateful streaming is now approachable. The operational discipline (TTLs, checkpoints, reprocessing, monitoring) still requires investment. Don't avoid stateful streaming — invest in the operational practices instead.

The seven misconceptions share a root: streaming looks like something familiar (faster batch, a distributed system, a database) but has genuinely different properties that invalidate assumptions imported from those domains. Watermarks are heuristics, not guarantees; exactly-once covers the pipeline, not external sinks; and useful parallelism is bounded by key cardinality. Getting the mental model right means knowing exactly which assumption breaks in which failure mode, and that only comes from understanding the why behind each property.

Section 22

Operational Playbook — Run Stream Processing in Production

Knowing how stream processing works is one thing. Running it in production at 3 AM when a downstream sink is backed up, a slow partition is dragging your watermark, and your checkpoint is taking 4 minutes instead of 30 seconds — that's a different skill entirely. This playbook is five stages, from "picking a framework" to "optimizing the thing you've been running for six months."

Each stage has its own failure modes, and skipping a stage usually means paying for it in the next stage. Teams that skip "Test" discover their checkpoint recovery doesn't work during a 2 AM outage. Teams that skip "Monitor" don't know their state is growing until the OOM happens. The stages are ordered because each one creates the foundation for the next.

Stage 1 — Pick the Right Framework

The framework choice locks in your operational model for years. Get this right before writing a line of code.

Apache Flink — Choose when: sub-second latency is a hard requirement, you have complex stateful operations (long-window joins, pattern detection, aggregations over large state), you need end-to-end exactly-once to non-Kafka sinks, or you expect to scale to very high event volumes. Operational cost: you need a Flink cluster (or managed Flink on AWS/Confluent/GCP), which means learning JobManager/TaskManager topology, checkpoint configuration, and savepoint management. This is a real operational investment.
Kafka Streams — Choose when: your events are already in Kafka, your team is already Kafka-native, you want zero additional infrastructure (it's a Java library), and your operations are relatively simple (filter, enrich, aggregate per key). Operational cost: you manage scaling by deploying more instances of your service — the same way you'd scale any other microservice. Very approachable for a small team.
Spark Structured Streaming — Choose when: you already have a Spark data platform, your team knows DataFrames/SQL, and you're okay with micro-batch latency (typically 1-10 seconds). Strong for unified batch+stream processing. Weaker than Flink for sub-second latency or complex stateful ops.
Apache Beam — Choose when: you want portability across execution engines (run on Flink today, migrate to Google Dataflow tomorrow without rewriting). Good for teams in multi-cloud environments. The unified model adds some abstraction overhead — you may find the native Flink API more expressive for complex jobs.

Stage 2 — Onboard Safely

Before the first event flows through your production job, three foundations must be in place or you will regret it the first time something goes wrong.

Provision state backend and checkpoint storage first. Configure your Flink job to checkpoint to durable storage (S3, HDFS, or GCS) before going live. Out of the box, Flink keeps working state in TaskManager memory and checkpointing is disabled, so a TaskManager restart loses that state. For Kafka Streams, the state is stored in RocksDB on local disk and backed by changelog topics in Kafka: ensure the persistent volume is correctly configured and survives pod restarts in your Kubernetes setup, otherwise every restart pays for a full restore from the changelog.
Set up monitoring before traffic. Deploy metrics dashboards before the first event flows. Minimum metrics from day one: consumer lag (per partition), end-to-end processing latency, checkpoint duration, checkpoint size, and state store size. You want a week of "normal" baseline before you need those dashboards to diagnose a problem.
Design for restart-from-checkpoint from day one. Every job should be structured so that triggering a checkpoint recovery produces identical results to uninterrupted processing. If your job writes to an external database, ensure the writes are idempotent or transactional. If your job calls an external API, ensure the API is idempotent. Checkpoints are useless if recovery causes double-writes or inconsistent state.
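
The idempotency requirement can be illustrated with two toy sinks. This is a sketch under simple assumptions: each event carries a deterministic ID, and the classes `AppendSink` and `IdempotentSink` are hypothetical in-memory stand-ins for an external database table.

```python
class AppendSink:
    """Blind insert: every write adds a row, so replays create duplicates."""

    def __init__(self):
        self.rows = []

    def write(self, event_id, amount):
        self.rows.append((event_id, amount))

    def total(self):
        return sum(amount for _, amount in self.rows)

class IdempotentSink:
    """Upsert keyed by event ID: replaying an event overwrites its own row."""

    def __init__(self):
        self.rows = {}

    def write(self, event_id, amount):
        self.rows[event_id] = amount

    def total(self):
        return sum(self.rows.values())

events = [("e1", 10), ("e2", 20), ("e3", 30)]

def run_with_replay(sink):
    # First attempt writes e1 and e2, then "crashes" before a checkpoint completes.
    for event_id, amount in events[:2]:
        sink.write(event_id, amount)
    # Recovery restarts from the last checkpoint (offset 0 here) and replays everything.
    for event_id, amount in events:
        sink.write(event_id, amount)
    return sink.total()

print(run_with_replay(AppendSink()))      # 90: e1 and e2 were written twice
print(run_with_replay(IdempotentSink()))  # 60: the replay is harmless
```

In a real database the same effect usually comes from an upsert on a unique event-ID column, or from wrapping the writes in a transaction that commits together with the checkpoint.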

Stage 3 — Test Failure Modes (Not Just Happy Path)

Stream processing jobs have failure modes that unit tests and integration tests almost never cover. You must test these explicitly — in a staging environment that mirrors production — before you go live.

Kill a worker and verify checkpoint recovery. Terminate a TaskManager mid-processing while the job is under load. Time the recovery. Verify that the output exactly matches what you'd have gotten without the failure. If recovery takes more than 5 minutes, your checkpoint interval is too long or your state is too large.
Simulate late data. Replay historical events through your job with event timestamps from 30 minutes ago, 2 hours ago, 1 day ago. Verify that: (a) the late events are routed to the correct windows or to side outputs, (b) no late event causes incorrect output in "current" windows, and (c) your late-event metrics fire correctly.
Verify reprocessing works end-to-end. Start a copy of the job from a point 24 hours in the past, either by restoring from a savepoint taken at that time or by starting with empty state and source offsets reset to that time, and let it reprocess. Verify that the final output is identical to the output from the original processing run. If it's not, your job has non-determinism (e.g., it reads from a side database that has changed) and you need to address that before a real reprocessing scenario forces the issue.
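
The "kill a worker and compare outputs" test can be rehearsed on a toy model before trying it on a real cluster. The sketch below is an assumption-laden miniature: the `run` function, its parameters, and the list-backed source are all hypothetical, and a checkpoint is simply a pair of (next offset, state copy).

```python
from collections import Counter

def run(events, checkpoint_every, crash_at=None):
    """Count events per key; optionally crash once at offset crash_at and recover."""
    state = Counter()
    checkpoint = (0, Counter())  # (next offset to read, snapshot of state)
    offset = 0
    crashed = False
    while offset < len(events):
        if crash_at is not None and offset == crash_at and not crashed:
            crashed = True
            # In-memory state is lost: restore the snapshot and rewind the source.
            offset, snapshot = checkpoint
            state = Counter(snapshot)
            continue
        state[events[offset]] += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, Counter(state))
    return dict(state)

events = ["a", "b", "a", "c", "b", "a", "a"]
clean = run(events, checkpoint_every=3)
recovered = run(events, checkpoint_every=3, crash_at=5)
print(clean == recovered)
```

The two runs agree only because the snapshot and the source offset are captured together. Restoring the state without rewinding the offset, or the other way round, is exactly the kind of bug this test is meant to expose.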

Stage 4 — Monitor the Right Things

Most streaming monitoring setups alert on the wrong metrics — they alert on CPU and memory (which are lagging indicators) instead of the leading indicators that tell you about imminent problems before they become outages.

Consumer lag (per partition, not aggregate). A growing per-partition lag means that specific partition's consumer is falling behind. Aggregate lag can look fine while one partition accumulates a multi-hour backlog. Alert on the max-partition lag, not the sum.
Watermark lag. The difference between wall-clock time and the current watermark. A growing watermark lag means events are arriving more slowly than expected (source is slow, partition is idle, or network has a problem). Alert when watermark lag exceeds 2× your out-of-orderness bound.
Checkpoint duration and state size growth rate. A checkpoint that takes longer than 10% of your checkpoint interval will eventually cause issues. State that grows 5% per day doubles in about 20 days if the growth is linear, and in about 14 days if it compounds. Alert on growth rates, not just absolute thresholds.
Late-event drop rate. As discussed in the bug studies: if you're not measuring how many events are being dropped as "too late," you don't know if your watermark configuration is correct. This metric should be close to zero in normal operation. A sustained non-zero rate is a signal that your out-of-orderness bound is too tight.
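
The alert rules above reduce to a few lines of arithmetic. The following is an illustrative sketch only: the function names and the sample numbers are made up, and the thresholds (2× the out-of-orderness bound, 10% of the checkpoint interval) are the ones stated in this stage.

```python
import math

def max_partition_lag(lag_by_partition: dict[int, int]) -> int:
    """Alert on the worst partition, not on the sum or the average."""
    return max(lag_by_partition.values(), default=0)

def watermark_lag_alert(wall_clock_ms: int, watermark_ms: int, out_of_orderness_ms: int) -> bool:
    """True when the watermark trails wall-clock time by more than 2x the bound."""
    return (wall_clock_ms - watermark_ms) > 2 * out_of_orderness_ms

def checkpoint_alert(duration_s: float, interval_s: float) -> bool:
    """True when a checkpoint takes longer than 10% of the checkpoint interval."""
    return duration_s > 0.10 * interval_s

def days_to_double(daily_growth_rate: float) -> float:
    """Days until state size doubles if the daily growth compounds."""
    return math.log(2) / math.log(1 + daily_growth_rate)

# Hypothetical readings: three healthy partitions and one that is far behind.
lag = {0: 120, 1: 95, 2: 480_000, 3: 110}
print(max_partition_lag(lag))                       # the stuck partition dominates
print(watermark_lag_alert(100_000, 85_000, 5_000))  # 15 s of lag against a 5 s bound
print(checkpoint_alert(240, 60))                    # a 4-minute checkpoint on a 60 s interval
print(round(days_to_double(0.05), 1))               # compounding 5% per day
```

Compounding 5% daily growth doubles state in a little over 14 days, noticeably sooner than the 20 days a linear projection suggests.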

Stage 5 — Optimize for Throughput and Latency

Optimization happens after you've been running in production long enough to have real bottleneck data. Premature optimization in streaming is particularly wasteful — the bottleneck is almost never where you'd guess without profiling. Here are the optimizations that most frequently matter:

Async I/O for external lookups. If your job needs to look up data from an external database or API (e.g., enrich events with user data from Postgres), use Flink's AsyncDataStream. Kafka Streams has no built-in asynchronous operator, so there the usual approach is to bring the reference data into Kafka and join against a KTable or GlobalKTable. Synchronous external calls inside a map() operator are the single most common cause of low throughput in streaming jobs: the operator processes nothing else while it waits for each external call.
Tune RocksDB for your workload. Flink's default RocksDB configuration is conservative. For read-heavy jobs (lots of lookups into state), increase the block cache size. For write-heavy jobs (constant state updates), tune the write buffer size. Incorrect RocksDB tuning can cause 3-5× performance degradation versus optimal settings.
Enable incremental checkpoints. Full checkpoints copy the entire state on each checkpoint interval. Incremental checkpoints copy only the delta since the last checkpoint. For jobs with large state (10+ GB), enabling incremental checkpoints can reduce checkpoint duration from minutes to seconds.
Detect and mitigate key skew. If one key receives 1,000× more events than average (a viral post ID, a shared service account), that key's partition becomes a hotspot. Strategies: add a random salt suffix to hot keys before aggregation, then aggregate the salted results in a second pass; or detect hot keys and route them to a separate "hot key" pipeline with higher parallelism.
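
The salting strategy from the last item can be sketched as a two-stage count. This is a minimal single-process illustration, not a framework API: `salted_counts` and `merge_salted` are hypothetical helpers, the hot-key set is assumed to be known in advance, and each `(key, salt)` pair stands for a sub-key that would be routed to its own partition.

```python
import random
from collections import Counter

def salted_counts(events, hot_keys, num_salts, rng):
    """Stage 1: count per (key, salt). Hot keys are spread over num_salts sub-keys."""
    partial = Counter()
    for key in events:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += 1
    return partial

def merge_salted(partial):
    """Stage 2: strip the salt and sum the partial counts per original key."""
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return final

# Hypothetical workload: one viral key and two ordinary keys.
events = ["viral"] * 1_000 + ["a"] * 3 + ["b"] * 2
rng = random.Random(42)  # seeded only to make the example repeatable

partial = salted_counts(events, hot_keys={"viral"}, num_salts=8, rng=rng)
final = merge_salted(partial)

largest_shard = max(count for (key, _), count in partial.items() if key == "viral")
print(largest_shard, final["viral"])
```

The second stage sees at most `num_salts` partial results per hot key, so it stays cheap even though it runs on a single partition. Salting works for aggregations that can be merged (counts, sums, min/max); it does not apply directly to operations that need every event for a key in one place.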

The five-stage operational playbook — Pick, Onboard, Test, Monitor, Optimize — is the difference between a streaming job that fails at 3 AM six months after launch and one that runs reliably for years. The most important stages are the ones teams skip: Test (nobody chaos-tests their streaming job before go-live) and Monitor (most teams alert on the wrong metrics). Get the foundations right first, then optimize when you have real bottleneck data — not before.

Section 23

Cheat Sheet & Glossary

A quick-reference summary of the key rules and vocabulary from this entire page. Use this during review, before an interview, or when you need to quickly re-orient yourself after a few weeks away from streaming work.

Quick-Reference Rules

Always use event time (client timestamp) for business metrics. Processing time (when the processor saw it) distorts metrics during replays and network delays.

A watermark at T means "we estimate events up to T have mostly arrived" — not a guarantee. Always configure allowedLateness and route stragglers to a side output.

Tumbling = fixed, non-overlapping buckets. Hopping = fixed size, slides by step (overlapping). Session = gaps between events define boundaries — no fixed size.

A checkpoint is a consistent snapshot of ALL operator state, taken at the same logical point in the stream for every operator. Recovery from checkpoint rewinds the pipeline to a known-good state and replays from the checkpoint offset.

Exactly-once semantics cover the pipeline's state machine. External sinks (REST APIs, non-transactional databases) still need idempotency at the sink level.

When a downstream operator can't keep up, backpressure propagates upstream to slow ingestion. This is healthy — it prevents buffer overflow. Sustained backpressure signals a scale or logic bottleneck.

Flink: sub-second latency + complex stateful ops + large state + end-to-end exactly-once. If you need all four, Flink is the right tool.

Kafka Streams: Kafka-native, simple to moderate stateful ops, no extra cluster infrastructure wanted. Runs as a library in your existing service.

Lambda = two pipelines (batch + stream), operational complexity, mature batch fallback. Kappa = one streaming pipeline, conceptual simplicity, demands mature streaming ops. Choose based on team maturity.
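
The windowing rule in this list is easier to remember with the assignment logic written out. The functions below are a sketch with hypothetical names; timestamps are plain integers, windows are half-open `[start, end)` intervals, and the session rule (a new session starts when the silence is at least `gap`) is one reasonable convention rather than a statement about any specific framework.

```python
def tumbling_window(ts, size):
    """The single fixed-size bucket that contains ts."""
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, step):
    """Every window of length size, starting on a multiple of step, that contains ts."""
    windows = []
    start = ts - ts % step  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= step
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least gap of silence."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    # Report each session as (start, end, event count); it stays open for gap after the last event.
    return [(s[0], s[-1] + gap, len(s)) for s in sessions]

print(tumbling_window(125, size=60))
print(hopping_windows(125, size=60, step=30))
print(session_windows([1, 3, 4, 20, 22, 50], gap=10))
```

An event belongs to exactly one tumbling window, to `size / step` hopping windows when `size` is a multiple of `step`, and to a session whose boundaries are not known until the gap has elapsed.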

Glossary

Dataflow graph: The directed acyclic graph of operators (source → transform → sink) that defines a streaming job. Each node is a parallelizable compute step.
Operator: A single step in the dataflow graph. Can be a source (emits events), a transformation (map, filter, join, aggregate), or a sink (writes results). Operators run in parallel across multiple workers.
Source: The origin of a stream. Common sources: Kafka topic, Kinesis stream, a database CDC stream, a socket. The source operator reads from the external system and emits events into the dataflow graph.
Sink: The destination for processed events. Common sinks: Kafka topic, PostgreSQL, Redis, Elasticsearch, S3. For exactly-once delivery, sinks must be transactional or idempotent.
Watermark: A monotonically increasing timestamp that signals "we believe all events with event-time ≤ T have (mostly) arrived." Used to trigger window computations. A heuristic, not a guarantee. The global watermark is the minimum across all active source partitions.
Window: A finite, bounded slice of an infinite stream over which an aggregation is computed. Three types: tumbling (fixed, non-overlapping), hopping (fixed size, sliding step), session (gaps between events define boundaries).
Keyed state: State scoped to a specific key (e.g., user ID, session ID). Partitioned across workers by key hash. Each worker owns a subset of keys and maintains their state independently. The most common form of streaming state.
Checkpoint: A periodic, consistent snapshot of all operator state written to durable storage. Enables recovery from worker failure by restoring state and replaying events from the checkpoint's source offset. Created automatically on a configured interval.
Savepoint: A manually triggered snapshot used for intentional job restarts: upgrades, topology changes, A/B testing. Unlike automatic checkpoints (which Flink cleans up as newer checkpoints complete and, by default, when the job is cancelled), savepoints are explicitly managed and retained until manually deleted.
Exactly-once: A delivery semantic: despite failures and retries, each event affects the pipeline's state exactly once. Implemented via distributed snapshots (checkpoints) + transactional sink writes. Applies to the pipeline's internal state — not necessarily to all external side effects.
Backpressure: The mechanism by which a slow downstream operator signals upstream operators to slow down. Prevents buffer overflow and out-of-memory errors. In Flink, backpressure propagates through the network buffer pool. Sustained backpressure indicates a throughput imbalance that needs resolution.
Key skew: When the data distribution across keys is uneven — a small number of keys receive a disproportionate fraction of events. Causes: viral content IDs, shared service accounts, geographic hotspots. Creates hotspot partitions where one worker is overloaded while others are idle.
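
The watermark entry's last sentence, that the global watermark is the minimum across active source partitions, can be shown in a few lines. The helper `global_watermark` and the sample values are hypothetical; marking a partition idle is modelled simply as excluding it from the minimum.

```python
def global_watermark(partition_watermarks, idle=frozenset()):
    """Minimum watermark across partitions that are not marked idle."""
    active = [wm for partition, wm in partition_watermarks.items() if partition not in idle]
    return min(active) if active else None

# Hypothetical per-partition watermarks; partition 2 has stopped receiving events.
watermarks = {0: 1_000, 1: 995, 2: 400}

print(global_watermark(watermarks))            # held back by the stalled partition
print(global_watermark(watermarks, idle={2}))  # advances once partition 2 is marked idle
```

This is why a single quiet partition can stall every window in the job, and why idle-partition handling is worth configuring deliberately.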

The nine cheat-sheet rules distill the most important mental shifts: always use event time, treat watermarks as estimates, know exactly what exactly-once covers (and doesn't), and match your framework to the specific requirements. The glossary gives you precise language for discussing streaming systems in design reviews and interviews — use the technical terms correctly, backed by understanding of what they actually mean.
