TL;DR โ Stream Processing in Plain English
- What stream processing actually IS โ computing over infinite, unbounded data where you can never wait for the data to "finish"
- Why batch processing breaks down the moment you need answers in seconds, not hours โ and the exact math explaining the latency gap
- The four hard problems that make streaming genuinely tricky: time skew, windowing, stateful aggregations, and exactly-once semantics under retries
- How a stream processing program is really a dataflow graph โ a chain of operators that runs forever, partitioned across workers
- How Apache Flink, Kafka Streams, Spark Structured Streaming, and Apache Beam each make different bets on that latency-vs-correctness spectrum
- Why event time and processing time diverge โ and why that divergence is the root cause of the hardest bugs in streaming systems
- What a watermark is and why it's how the system decides "we've seen enough, close the window"
Stream processing is the discipline of computing over data that never ends. There is no "load the file, run the query, print the results" loop โ there is only a continuous river of events flowing in, and your job is to extract aggregates, detect patterns, and produce outputs while the river is still running. The four hard problems are: time (events arrive late and out of order, so which clock do you trust?), windowing (how do you carve infinity into computable chunks?), state (aggregations need memory across events โ what happens when the worker crashes?), and exactly-once (retries under failures mean the same event might arrive twice โ how do you count it only once?). Get those four right and you can build fraud detectors that fire in two seconds, dashboards that update live, and ML feature pipelines that never go stale.
In batch processing, you wait for data to accumulate, then run a job over it. The data has a clear start and end: last night's orders, last month's logs, last year's transactions. Stream processing throws that model away. Click events, sensor readings, payment authorisations, log lines โ these never stop arriving. There is no "end of data". You must produce continuous results โ counts, averages, anomaly alerts โ while the data is still flowing, with latency in the milliseconds to seconds range rather than hours. The fundamental shift is from bounded data (finite, fits in a batch job) to unbounded data (infinite, grows forever, can't be loaded all at once).
Moving from batch to streaming isn't just "run the batch job more often." Four entirely new problems appear. First, time skew: an event's timestamp (when it happened) and when you see it (wall clock on the processor) are different โ mobile phones go offline, networks delay packets, replays replay old events. Second, windowing: to compute "purchases per minute" you must slice the infinite stream into finite time buckets. Third, stateful aggregation: "count per user" means the operator must remember what it has seen so far โ which requires durable, crash-safe state stores. Fourth, exactly-once semantics: when a worker crashes and restarts, it will re-process some events; you need to ensure your outputs count each event exactly once despite the retries. All four problems are hard. Solving all four together is what distinguishes Apache Flink from a weekend side project.
You already know Kafka โ the distributed log where events land first. Stream processing is the compute layer that sits on top of Kafka (or Kinesis, or Pulsar). Events flow out of Kafka, into a stream processor, and the processor writes results to Kafka, a database, a dashboard, or an alerting system. Think of Kafka as the highway and stream processing as the toll booths doing real work on every passing car โ continuously, at highway speed, forever. The major frameworks today are Apache Flink (stateful, event-time-native, production workhorse), Kafka Streams (a Java library embedded in your service, no separate cluster), Spark Structured Streaming (micro-batch model with DataFrame API), and Apache Beam (a unified programming model that runs on Flink, Spark, or Dataflow as a backend).
Why You Need This โ When Batch Isn't Fast Enough
Before we build anything, let's prove that batch processing โ the tool almost every team starts with โ has a fundamental latency floor it cannot cross. Understanding why that floor exists makes every streaming design decision obvious.
The Story: A Fraud Detector That Fires at 9 AM
Imagine an e-commerce company. They process thousands of orders per hour. They have a fraud detection model โ it looks at order value, shipping address, IP geolocation, and purchase history to flag suspicious orders. The data team built it as a Hadoop batch job. Every night at midnight the job runs, scanning all orders from the previous day. By 9 AM the next morning, the fraud report lands in the operations team's inbox.
This isn't a software bug. It's an architectural ceiling. Batch processing has a formula for its latency that you can't escape:
The data-collection-window is how long you wait before starting the job (nightly = up to 24 hours, hourly = up to 1 hour). Processing-time is how long the job takes to run (minutes to hours for large datasets). You cannot reduce the total latency below the collection window without running the job more frequently โ and at some frequency, "running the batch more often" becomes streaming.
The Three Generations of Latency Compression
The industry didn't jump from nightly batch to millisecond streaming overnight. It happened in three steps, each compressing latency by roughly 100ร:
The diagram shows the key insight: in nightly batch processing, the dominant cost is the data-collection window โ the time you spend waiting before the job even starts. Micro-batch (Spark Streaming running 30-second to 2-minute batches) reduces this dramatically but doesn't eliminate it. True streaming removes the collection window entirely โ each event is processed the moment it arrives. The processing time per event is tiny (single-digit milliseconds for modern hardware), so the total latency collapses to the time it takes for one event to flow through the pipeline.
The Fraud Detector Rewired
Back to our e-commerce company. They migrate the fraud detector to Apache Flink. The same ML model logic, now expressed as a stream processing job reading from a Kafka topic. Here's what changes:
Before (batch): Order placed at 8:00 PM โ batch job starts midnight โ fraud detected 9:05 AM โ 13 hours later. Package already delivered.
After (streaming): Order placed at 8:00:00 PM โ Flink processes the event at 8:00:00.300 PM โ fraud score exceeds threshold โ payment declined at 8:00:00.500 PM. 500 milliseconds total. Transaction never completes.
The ML model didn't change. The business logic didn't change. Only the execution model changed โ from "accumulate and batch" to "process each event as it arrives". That 500ms is the fundamental latency floor of the streaming approach: the time for a Kafka consumer to poll the event, route it through Flink's operator graph, and write the result back.
Mental Model โ The Dataflow Graph
You don't write a stream processing program the way you write a for-loop. There is no "start here, end here". Instead, you describe a graph of operations โ boxes connected by arrows โ and the runtime keeps that graph running forever, processing each event as it flows through. Understanding this graph mental model is the single most important conceptual step in stream processing.
The Assembly Line Analogy
Imagine a car factory assembly line. Cars (events) enter at one end. They pass through a series of stations: paint station, engine installation station, quality check station, packaging station. Each station does one specific job. Stations run in parallel โ car #10 is being painted while car #9 gets its engine installed. The line never stops; it runs 24/7 as long as cars keep arriving.
A stream processing program is exactly this assembly line. Events enter at the left (from Kafka, Kinesis, or a file). They pass through a series of operators โ filter this event, transform that field, group by user ID, count over a 1-minute window, write to the database. The operator chain is called a dataflow graph. You write the graph once; the runtime keeps it executing forever.
The Four Kinds of Operators
Every dataflow graph is built from four types of operators. Each type has a specific role:
-
Source operators โ the entry points. They read events from an external system and inject them into the graph. Examples: Kafka consumer reading from a topic, a Kinesis shard reader, a file watcher. One source per input stream.
-
Transform operators โ stateless per-event work. They receive one event, do some computation, and emit zero or more events. Examples:
map(convert one event to another format),filter(drop events that don't match a condition),flatMap(split one event into many). These are the simplest operators โ no memory required across events. -
Stateful operators โ aggregations and joins that need to remember past events. Examples:
count per user(must keep a running count per user ID),sum of order values per 1-minute window(must accumulate over time),join two streams(must buffer one side waiting for the other). These require a state store โ memory that persists across events and survives crashes. -
Sink operators โ the exit points. They take processed results and write them somewhere: Kafka topic, PostgreSQL table, Elasticsearch index, Prometheus metrics endpoint, dashboard WebSocket. One sink per output destination.
A Concrete Example: Count Purchases Per User Per Minute
Let's make this tangible. Say you want to count how many purchases each user makes every minute โ the kind of real-time signal a fraud detector or recommendation engine needs. The dataflow graph has exactly five steps:
The diagram shows the five operators for our purchase-count pipeline. Events flow left to right. The Kafka source emits a raw purchase event. The filter operator drops cancelled orders (stateless โ just a predicate on each event). The keyBy(userId) operator partitions the stream so all events for the same user flow to the same worker โ this is essential because the aggregation needs to see all of a user's events. The tumbling window groups one minute of events per user, then the count aggregation fires. The result โ "user u1 made 7 purchases between 12:00 and 13:00" โ is written to a Kafka output topic for downstream systems to consume.
Parallelism: The Same Graph, Running Across Workers
The runtime doesn't run just one copy of this graph. It runs N copies in parallel โ one per partition. If our Kafka input topic has 64 partitions, Flink spins up 64 parallel copies of the filter+keyBy+window+count chain. Each copy handles a subset of events. The parallelism is automatic โ you express the graph once, and the runtime fans it out. This is how stream processors handle millions of events per second: not by making one machine faster, but by spreading the workload across hundreds of identical operator instances.
Core Concepts โ The Stream Processing Vocabulary
Stream processing has precise terminology. Before you can read the Flink docs, debug a stateful job, or design a windowed aggregation, you need these terms locked in. Each term below is introduced with a plain English sentence first โ the jargon name comes after, once you already understand the idea.
The concept map shows how the major ideas connect. At the centre is the dataflow graph โ everything else either feeds into it (unbounded data, sources, sinks) or is a property of how it behaves (windows, event time, watermarks, state, exactly-once). The red arrows highlight the time-related cluster โ this is where most stream processing complexity lives, which is why Sections 5 and 6 give it dedicated treatment.
Term Glossary โ Plain English First
A single event that carries data about one thing that happened โ a click, a purchase, a sensor reading, a log line โ is called a record or event. We use both terms interchangeably.
Data that has a fixed, known end โ last month's transactions, a CSV file โ is called bounded data. Data that never stops arriving โ live click streams, IoT sensors, trading feeds โ is called unbounded data. Stream processing is the discipline of computing over unbounded data.
The time when an event actually happened in the real world โ encoded in the event's timestamp field โ is called event time. The time when your processing system sees the event โ the wall clock at your operator โ is called processing time. They diverge. This divergence is the root of most hard problems in streaming โ we'll explore it in depth in Section 5.
A heartbeat signal the system emits saying "we believe event time has reached T" is called a watermark. Watermarks let windowed operators decide "it's safe to close the 12:00โ13:00 bucket and emit the count". Without watermarks, a window would never know when all its events had arrived โ it would wait forever.
Carving the infinite stream into finite, computable time buckets is called windowing. The four window shapes are: tumbling (fixed non-overlapping buckets), hopping (overlapping, advance by step), sliding, and session (gap-based, closes after inactivity). We cover all four in Section 7.
When an operator needs to remember data across many events โ like "how many purchases has user #1234 made today?" โ it uses keyed state (state partitioned per key) or operator state (state global to the operator). State stores are the memory that makes stateful aggregations possible โ and the reason you need crash recovery.
A snapshot of the entire state of all operators at a point in time, saved to durable storage, is called a checkpoint. Checkpoints enable crash recovery: if a worker dies, the job restarts from the last checkpoint and replays only the events that arrived after it. A manually triggered checkpoint you can restart from intentionally is called a savepoint.
When each event is counted exactly once, even if the system retries after a crash, that property is called exactly-once semantics. The easier (and more common) guarantee, where events may be processed more than once after a crash but are never lost, is called at-least-once semantics. Whether exactly-once matters depends on whether your sink is idempotent.
When a downstream operator can't keep up and the upstream operator needs to slow down, the system applies backpressure. This prevents out-of-memory crashes when load spikes temporarily overwhelm a bottleneck operator. Flink and Kafka Streams both handle backpressure automatically โ you don't implement it yourself.
Event Time vs Processing Time โ Why Time Is the Hardest Problem
Here's a question that sounds simple: "How many purchases happened between 2 PM and 3 PM yesterday?" If you're running batch queries on a database, this is trivial โ filter by timestamp, count rows, done. In stream processing, this question reveals a deep problem that affects the correctness of almost every real-time computation. The problem is that there are two different clocks in a streaming system, and they disagree.
Clock 1: Event Time โ When It Actually Happened
Every event carries a timestamp โ the moment the event occurred in the real world. A purchase event has a purchasedAt field; a click event has a clickedAt field; an IoT sensor reading has a measuredAt field. This timestamp is set by the source system โ the mobile app, the payment processor, the sensor firmware. It represents ground truth: the event happened at exactly this time.
This clock is called event time. When you want to answer "what happened between 2 PM and 3 PM?", event time is the right clock to use. It doesn't matter when your processor saw the event โ it matters when the purchase occurred.
Clock 2: Processing Time โ When Your System Saw It
Your stream processing operator has a different clock โ the wall clock on the machine running the operator. When an event arrives in the operator's input buffer, that wall clock says "5:03 PM". This is called processing time. It's the time the system actually observed the event, not when the event happened.
For many simple use cases, processing time is perfectly fine. If you want "how many events are we seeing right now?" (a throughput metric), processing time is correct โ you don't care when the events happened, you care about current arrival rate. But the moment you ask "what happened in a specific real-world time window?", processing time gives you wrong answers.
The diagram shows five events (A through E) plotted on two axes: their event time (top, amber axis) and their actual arrival at the processor (bottom, blue axis). Events A, D, and E arrive quickly โ only a second or two of network delay, so the two clocks stay close. Events B and C are the problem: their event time says they happened at 2:01 PM, but they arrive at the processor at 2:04โ2:05 PM โ over three minutes late. Why? Possible reasons: a mobile phone went offline and buffered the events, a network partition delayed delivery, or a consumer process crashed and replayed a Kafka partition. In all these cases, the real-world truth is that these events happened at 2:01 PM โ but the processor saw them three minutes after it had already moved on.
Why This Breaks Naive Stream Processing
Suppose you're using processing time for your 2:00โ3:00 PM window. Your system opens the window when the wall clock reaches 2:00 PM and closes it when the wall clock reaches 3:00 PM. Events B and C arrive at 2:04 and 2:05 PM processing time โ they fall inside the window and get counted. This seems fine.
But now try to answer: "How many purchases happened between 2:00 PM and 2:01 PM?" With processing-time windows, B and C are in a completely different 1-minute bucket than A even though B and C happened at 2:01 PM. Your answer is wrong โ B and C contribute to the wrong window.
The Real-World Scenario: Mobile Offline-Then-Resync
Here's the canonical example. A user opens your mobile shopping app at 2:50 PM. The phone is on spotty cellular. The user makes a purchase. The app sends the event โ but the network drops the packet. The app retries every 30 seconds. At 5:00 PM, the user moves into WiFi range. The app reconnects and successfully delivers the purchase event.
This diagram illustrates the gap clearly. The purchase event has event time 2:50 PM โ that's when the user tapped "buy". The processor doesn't see it until 5:00 PM when the phone reconnects to WiFi. A naive processing-time system puts this event in a "5:00 PM" window. An event-time system โ if its watermark hasn't already closed the 2:50 PM window โ correctly assigns the event to the 2:50 PM bucket. That two-hour gap is extreme, but smaller gaps (seconds to minutes) happen constantly in production systems due to network jitter, retry queues, and consumer group rebalancing.
When to Use Each Clock
Neither clock is universally "better" โ each answers a different question:
Watermarks โ How to Decide "We've Seen Enough"
Section 5 ended with a problem: if you're doing event-time windowing, how do you know when all the events for a given time window have arrived? You can't wait forever โ the stream never ends. But if you close the window too early, late events that should have been in that window are lost. Watermarks are the answer to this dilemma. They are the mechanism stream processors use to reason about time under uncertainty.
The Core Idea: A Progress Signal
Imagine you're counting all purchases between 2:00 PM and 3:00 PM. Events for that window are still trickling in. You've seen events with timestamps up to 2:58 PM. You want to emit the count โ but you're not sure if more 2:xx PM events are still in transit.
A watermark is a special signal you inject into the event stream that says: "I am confident that all events with event time earlier than T have already been seen." When the windowed operator sees a watermark at T = 3:05 PM, it knows: no more events with timestamp earlier than 3:05 PM should arrive (or if they do, they're considered "late"). It's safe to close the 2:00โ3:00 PM window and emit the count.
Formally: a watermark with timestamp T is an assertion that all events with event time < T have been delivered to this operator. Every windowed operator waits for a watermark that passes the window's end time before it fires and emits results.
The diagram shows the watermark flow. Events 1, 2, and 3 carry event times of 2:55โ2:59 PM and arrive at the processor around 3:00โ3:01 PM processing time. The watermark advances as events arrive โ it's generated by subtracting a configured "allowed lateness" from the highest observed event time. When the watermark reaches T = 3:01 PM (which passes the window boundary of 3:00 PM), the window closes and emits its count of 3 events. The late event (L), which has event time 2:50 PM but arrives at 3:05 PM processing time, arrives after the watermark has already passed โ so it cannot be added to the now-closed window. Instead, it's routed to a late-data side output stream for special handling.
How Watermarks Are Generated
Watermarks don't appear by magic โ someone has to generate them. The most common strategy in Apache Flink and Kafka Streams is called bounded out-of-orderness: you pick a fixed delay (say, 10 seconds) and set the watermark to max_seen_event_time - 10s. In plain English: "I'm assuming no event will arrive more than 10 seconds late."
Example: you've seen events with event time up to 2:59:50 PM. Your allowed lateness is 30 seconds. Watermark = 2:59:50 PM โ 30s = 2:59:20 PM. This tells operators: "All events with event time earlier than 2:59:20 PM should have arrived by now."
The LatencyโCompleteness Trade-off
The allowed lateness setting is the most important tuning knob in event-time windowing. It controls a fundamental trade-off:
The trade-off is real and unavoidable. A tight watermark (say, 2-second allowed lateness) means your window results arrive quickly after the window boundary closes โ but any event that was delayed more than 2 seconds is considered "late" and either dropped or sent to a side output stream for separate processing. A loose watermark (say, 60-second allowed lateness) captures nearly all late events โ but your window results are delayed by 60 seconds, which may be unacceptable for a fraud detection alert.
In practice, most production teams use between 5 and 30 seconds of allowed lateness, based on the 99th-percentile event delay they observe in their system. You measure it empirically (histogram of event-time-to-processing-time gaps), then set the allowed lateness to cover that percentile comfortably.
Watermarks Are Heuristics โ Late Data Still Happens
Here's the hard truth: watermarks are a bet, not a guarantee. You're saying "I believe no event will arrive more than 30 seconds late." Sometimes you're wrong. A phone might be offline for 2 hours. A batch replay job might emit events with timestamps from yesterday. In those cases, events arrive after their window's watermark has already passed.
Stream processors handle true late data in three ways:
-
Drop it: The simplest option. Late events that arrive after the watermark are silently discarded. Acceptable when late events are rare and the count error is small relative to total volume.
-
Side output stream: Route late events to a separate Kafka topic or output stream for special handling โ manual review, reprocessing, or inclusion in the next window. Apache Flink calls this a side output.
-
Allowed lateness on the window: Some frameworks let you keep a window "open" for extra time beyond the watermark, accepting late events and issuing updated results. Flink calls this "allowed lateness" on the window โ you can configure a window to accept events up to 5 minutes past its watermark, emitting a corrected result each time a late event arrives. The trade-off: you hold window state in memory for longer.
Windowing โ Carving Infinity Into Computable Pieces
Here is the central problem with infinite data: you can never compute "the total" because there is no total yet โ more data is always coming. To produce any aggregate โ count, sum, average โ you have to first answer the question: aggregate over WHAT? The answer is a window: a bounded slice of the infinite stream that you can actually finish computing.
Think of it like a bus driver counting passengers. You can't count "total passengers ever" without waiting forever. But you can count "passengers per 10-minute shift" โ that's a tumbling window. Or "passengers in the last 10 minutes, updated every 2 minutes" โ a hopping window. Or "all passengers in one continuous ride from boarding to exit" โ a session window. Each window type answers a different business question, and choosing the wrong one produces subtly wrong results that are very hard to debug.
There are four standard window types. Let's build an intuition for each before looking at the math.
What this SVG shows: all four window types on the same event-time axis (0โ9 minutes). Tumbling windows are strict boxes โ each event belongs to exactly one window. Hopping windows overlap, so the same event can count in several windows. Sliding windows recompute fresh for every arriving event. Session windows have no fixed size โ they grow as long as events keep arriving, and they close when there's a gap longer than the timeout.
Tumbling Windows โ Fixed Buckets, No Overlap
A tumbling window is the simplest window: fixed size, non-overlapping, no gaps. Every event falls into exactly one bucket. "Count purchases every 5 minutes" means windows [0:00โ4:59], [5:00โ9:59], [10:00โ14:59] โ and so on forever. Each window produces exactly one output record when it closes.
The math is clean: a stream of N events over T minutes produces exactly โT/window_sizeโ output records โ one per window. If your window is 5 minutes wide and you run for an hour, you get 12 output records.
Use tumbling windows when each bucket must be independent and non-overlapping โ billing periods, SLA reports, hourly dashboards. The "no overlap" property means you can't accidentally double-count an event in two windows.
The code walkthrough: keyBy(order -> order.userId) creates K parallel window-streams โ one per user. Think of it as splitting the single infinite stream into K sub-streams, one per user, and then windowing each independently. This is keyed windowing, and it's how you get "5 purchases per user per minute" rather than "5 purchases total per minute". The aggregate() call fires once per window when the window closes, producing one output per (userId, window) pair.
Hopping Windows โ Overlapping Slides
What if you want a 10-minute rolling count that updates every minute โ not once every ten minutes? That means windows have to overlap: a new one starts before the previous one ends. This shape is called a hopping window (sometimes called a "sliding window" in older frameworks โ the naming is inconsistent, which is a constant source of confusion). It has two parameters: a size and a hop interval. The window size is how wide each window is; the hop interval is how often a new window starts.
Example: size = 10 minutes, hop = 1 minute. Every minute, a new 10-minute window opens. Each event falls into up to size/hop windows simultaneously โ here, up to 10. This means one event produces up to 10 output records across all the windows it participates in. That's called the fanout โ more overlap = more computation = more cost.
Use hopping windows when you want a "trailing average" that updates frequently โ "average CPU over the last 5 minutes, refreshed every 30 seconds" or "7-day rolling average of sales, updated daily". The overlap is the feature: you get smooth, continuously updated aggregates rather than step-function jumps.
What this code does: every minute, a fresh 10-minute window opens. Each event lands in up to ten of these overlapping windows simultaneously โ so every minute you emit a result covering "the last 10 minutes ending now." Confusingly, Flink's API name SlidingEventTimeWindows matches what most of the literature now calls a hopping window. Read the two parameters as "(size, hop)" and ignore the name.
Sliding Windows โ Per-Event Recomputation
A true sliding window (distinct from hopping) recomputes the aggregate fresh on every event arrival. Every new event triggers a new window result that covers [now - window_size, now]. This gives you the most up-to-date possible result, but at the cost of computing a new result for every single event โ which is expensive at high throughput.
True sliding windows are relatively rare in practice because the per-event recomputation cost is prohibitive at scale. Most use cases that look like sliding windows are better served by hopping windows with a small hop interval. The distinction matters for interviews: if asked "how do you compute a trailing 5-minute average that updates in real time?", the correct answer is a hopping window (e.g., size=5min, hop=1sec) โ not a true per-event sliding window.
Session Windows โ The Gap-Based Window
A session window has no fixed size at all. Instead, it's defined by an inactivity gap: "group all events that arrive within 30 minutes of each other; when there's a 30-minute gap with no events, close the window." This is how you detect a user's browsing session, a driver's continuous trip, or a machine's active work period.
The tricky part: session window sizes are completely unpredictable. A session could be 2 events over 10 seconds or 300 events over 45 minutes. This makes session windows more expensive to manage in state โ the framework can't pre-allocate fixed buckets; it has to dynamically open, extend, merge, and close windows per-key as events arrive.
What this SVG shows: top row โ two clusters of events separated by a 38-minute gap. Since the gap exceeds the 30-minute timeout, the first cluster closes as Session 1, and the second cluster opens a new Session 2. Bottom row โ the session merge scenario: two "tentative" sessions that end up merging into one when a new event arrives within the gap of both. Flink handles this automatically, but it requires the framework to store and merge window state dynamically.
What this SVG shows: keyBy(userId) splits the single mixed event stream into K parallel sub-streams โ one per user. Each sub-stream runs its own independent window with its own isolated state. User A's window closes and produces output completely independently of User B's window. This parallelism is why Flink can process millions of concurrent user sessions without a single shared bottleneck โ each key is its own little stream.
keyBy() creates K independent parallel window streams, enabling per-user or per-entity aggregations at full parallelism.Stateful Operations โ Joins, Aggregations, and the State Store
Most useful things you want to do with a stream require memory. Counting purchases per user requires the operator to remember how many purchases it has seen so far for each user. Joining click events with user profiles requires looking up a profile table. Detecting "three failed logins in 10 minutes" requires storing the last three login events per user. Anything beyond "print each event as it arrives" is a stateful operation.
State in stream processing is different from state in a normal application. The big difference: state must survive worker crashes. If the node holding "User A has made 4,987 purchases so far" crashes and forgets everything, you can't just restart โ User A's counter resets to zero and you get wrong answers. So state has to be durable and crash-safe by default. This requirement immediately rules out a simple HashMap in memory as the only state storage option โ you need something that can checkpoint and recover.
Where State Lives: In-Memory vs RocksDB
Flink (and Kafka Streams) offer two main state backends:
-
Keep everything in RAM (HashMapStateBackend) โ state lives as plain Java objects in the JVM heap. Blazingly fast reads and writes โ nanosecond latency. Checkpoints snapshot the entire state to remote storage (S3/HDFS) periodically. The downside: if your state is large (millions of user counters) you need enormous RAM, and full checkpoints are slow because you must serialize everything.
-
Put it on local disk in a fast key-value store (EmbeddedRocksDBStateBackend) โ state lives in RocksDB, an embedded key-value store on local disk that organises data as an LSM tree (the same write-friendly structure used by Cassandra and many NoSQL engines). State larger than RAM is fine โ RocksDB pages in what it needs. Slightly slower than in-memory (microseconds vs nanoseconds), but supports terabyte-scale state on commodity hardware. HashMapStateBackend is Flink's default, but RocksDB is the recommended choice for production jobs with large or long-window state (and is required for incremental checkpoints). Incremental checkpoints are supported โ only the changed SST files (the small on-disk segments RocksDB writes) go to S3, not the whole state.
What this SVG shows: the two state backends side by side. In-memory (HashMapStateBackend) stores state as plain Java objects in the JVM heap โ fast, but RAM-bounded. RocksDB stores state on local SSD โ slightly slower, but supports terabyte-scale state and uses incremental checkpoints that only write changed data to S3, keeping checkpoint overhead low even with massive state.
Aggregations โ Count, Sum, and Custom Accumulators
The most common stateful operation is an aggregation: count events per key per window, sum transaction amounts per merchant, compute the 99th percentile latency per service. Flink provides built-in aggregations (count, sum, min, max, first, last) and a generic AggregateFunction interface for custom logic.
Here is a hand-written accumulator that computes a running average per user. Watch for three pieces โ a start state (the empty accumulator), an update rule (how each new event changes that state), and an output rule (how the accumulated state becomes a final number) โ because every custom aggregation you ever write will have exactly these three pieces:
The merge() method is only needed for session windows โ when two session window fragments merge (as shown in the SVG above), Flink needs to combine their accumulators. For tumbling and hopping windows, merge is never called.
Joins โ Stream-Stream vs Stream-Table
Joining two streams together is one of the most powerful and most misunderstood operations in stream processing. There are two fundamentally different kinds of join, and picking the wrong one produces either incorrect results or a system that eats all your RAM.
Stream-Stream Join: Both sides are live, infinite streams. An order event on the left needs to match a payment confirmation event on the right, where both events happened within the same 5-minute window. The key question: how long do you wait for a match? If the order arrives at t=0 and the payment confirmation arrives at t=4m59s, they should match. But if the payment never arrives, you need to eventually stop waiting and close that window. Stream-stream joins use a time window to bound how long each side waits for a match from the other side. Events that fall outside the window are unmatched and either discarded or sent to a side output.
Stream-Table Join: One side is a live stream of events; the other side is a continuously-updated table (in Kafka Streams, this is called a KTable; in Flink it's a connect() with a BroadcastState or a Table API lookup). Every incoming event on the stream side looks up the latest value in the table at the time the event arrives. This is how you enrich click events with user profiles: the click is the stream, the user profile database is the table. No time window needed โ you always use the most recent profile.
What this SVG shows: the two join architectures side by side. In a stream-stream join, both streams buffer incoming events in state until they find a match within the time window โ late events can miss their match entirely. In a stream-table join, the KTable compacts to the latest value per key, and every incoming stream event does a point lookup โ no buffering required on the stream side, and the table side only keeps the latest value rather than all historical events.
The Unbounded State Problem
State can grow without bound if you're not careful. "Count per user" over a stream that's been running for 3 years means you now have state for every user who ever existed โ tens of millions of entries that are mostly stale. Your RocksDB state store gradually expands until the worker runs out of disk.
The fix is state TTL (time-to-live): any state entry not updated within a configured period is automatically expired and cleaned up. In Flink:
What this code does: the StateTtlConfig attaches an expiry rule to a piece of state. OnCreateAndWrite means "reset the 7-day timer every time you write" โ so an active user never expires. NeverReturnExpired means once the timer is up, the state behaves as if it never existed, even if it hasn't been physically cleaned up yet. Wire this config onto every ValueStateDescriptor you create; otherwise the state for that descriptor grows forever.
Checkpoints, Savepoints, and Recovery
Here is an uncomfortable truth about stream processing: workers crash. Machines fail. Kubernetes evicts pods. The JVM runs out of memory. In a batch job, you just re-run the job from scratch. In a streaming job that has been running for days and accumulated gigabytes of state, "restart from scratch" means you lose everything โ every counter, every partial aggregation, every in-progress session window. You'd have to replay every event ever produced just to rebuild the state.
The solution is checkpointing: periodically snapshot the complete state of every operator in the DAG to durable remote storage (S3, HDFS, Azure Blob). When a crash happens, restart from the most recent checkpoint and replay only the events that arrived after that checkpoint. This is how streaming systems achieve fault tolerance without losing data or correctness.
How Checkpoints Work โ The Barrier Mechanism
The tricky part is making the snapshot consistent. If Operator A snapshots its state at T=100, and Operator B snapshots its state at T=110, but B has already processed events that A hasn't snapshotted yet โ the combined state is inconsistent. You'd recover to an impossible mixed state.
Flink solves this with a simple trick: send a special marker message into the stream alongside the data, and have every operator snapshot itself the moment the marker passes through. Because the marker travels in the same order as the data, every operator captures "the state right after the same set of events" โ even though they snapshot at different wall-clock moments. These markers are called checkpoint barriers, and the underlying idea is a classic 1985 result called the ChandyโLamport distributed snapshot algorithm. Here's how it plays out:
- The JobManager (the Flink coordinator) triggers a checkpoint. It injects a special "barrier" message into every Kafka partition at a specific offset.
- Each barrier flows through the DAG with the data. When a source operator receives a barrier, it snapshots its state (e.g., the current Kafka offset) and passes the barrier downstream.
- An operator with two input channels waits until it has received barriers from all input channels (barrier alignment) before taking its own snapshot. Events arriving after the barrier are buffered until the snapshot is complete.
- Once all operators have snapshotted and reported success to the JobManager, the checkpoint is complete. The JobManager records the checkpoint ID and its location in S3.
- On crash, all operators reload state from the latest complete checkpoint and tell their Kafka consumers to seek back to the recorded offset. Events between the checkpoint and the crash are replayed.
What this SVG shows: a three-operator DAG with a barrier flowing through it. The key step is the aggregator waiting for barriers from both input channels before taking its snapshot โ this is "barrier alignment" and is what makes the snapshot globally consistent. Without alignment, two operators could snapshot at different logical moments and produce a checkpoint that represents an impossible state.
What this SVG shows: the crash-and-recovery timeline. A checkpoint completes at T=60s and is saved to S3. The job continues normally until T=120s when a worker crashes. On restart, the system loads the T=60s snapshot and replays 60 seconds of Kafka events. Those 60 seconds of events are processed again โ which means any output produced during that 60-second window might be produced twice unless your sink is idempotent. This is why exactly-once semantics (Section 10) is a separate, harder problem.
Savepoints โ Checkpoints You Control
A savepoint is a user-triggered checkpoint. Unlike automatic checkpoints (which the framework triggers periodically and automatically deletes old ones), a savepoint is triggered manually via the Flink CLI or REST API and is retained indefinitely.
Savepoints are used for: (a) planned job upgrades โ stop the old job, take a savepoint, deploy the new version, restore from savepoint. The new job picks up exactly where the old one left off. (b) A/B testing โ fork two jobs from the same savepoint. (c) Disaster recovery testing โ verify you can actually restore from a saved state before you need to in production.
.uid("my-operator") in Flink. Without UIDs, Flink autogenerates them based on position in the DAG โ meaning any topology change (even adding an operator before an existing one) invalidates all savepoints. With explicit UIDs, you can restructure your job and still restore from an old savepoint.
Exactly-Once Semantics โ The Hardest Property to Achieve
Here is the problem. After a crash, the system replays events from the last checkpoint. Those events have already been processed โ the output was already written to the sink before the crash. When you replay them, you process them again. Now the sink has seen each event twice. If your sink is a counter in PostgreSQL, you've incremented it twice. If your sink is a Kafka topic, you've published each record twice. Downstream consumers receive duplicate data and produce wrong aggregates.
This is the at-least-once guarantee: every event is processed at least once, possibly more times due to replay. Most streaming pipelines run at-least-once by default. It's much easier to implement because you don't need any coordination between the checkpoint and the sink โ just write, and on replay, write again.
Exactly-once means every event contributes to the output exactly once, even across crashes and replays. Achieving this requires three things working together: a replayable source, a checkpointed state store, and an idempotent or transactional sink. If any of the three is missing, you fall back to at-least-once.
The Three Preconditions for Exactly-Once
-
Replayable source (Kafka offsets โ yes; raw TCP โ no) โ The source must be able to replay from a specific position. Kafka gives you exactly this via consumer offsets โ "replay from offset 42810 on partition 3." A raw TCP socket does not: once a byte is consumed, it's gone. This is why nearly all production exactly-once pipelines use Kafka (or Kinesis, or Pulsar) as the source โ not because of speed, but because of replayability.
-
State stores participate in checkpoints โ All operator state (counts, sums, join buffers) must be snapshotted as part of the checkpoint. When the job restores from the checkpoint, all state is exactly as it was at checkpoint time. Flink does this automatically for all built-in state types. Custom state that lives outside Flink's state management (e.g., a Redis call you make in your map operator) is not included in checkpoints โ you'd need to handle idempotency yourself.
-
Idempotent or transactional sink โ The sink must not double-count replayed events. Two approaches: (a) Idempotent writes โ writing the same event twice produces the same result as writing it once. A key-value store where writes are
PUT key โ valueis idempotent; an incrementing counter is not. Elasticsearch upserts (document ID as idempotency key) are idempotent. (b) Two-phase commit (2PC) โ the sink buffers writes since the last checkpoint, and only commits them when the checkpoint completes. If the job crashes before the checkpoint completes, the buffered writes are discarded โ the replay will re-produce them and they'll be committed with the next successful checkpoint.
How Flink Achieves Exactly-Once: Two-Phase Commit at the Sink
The trick is to make the sink's writes invisible until the checkpoint succeeds. The sink stages its writes inside an open database/Kafka transaction. If the checkpoint finishes, the transaction commits and downstream consumers see the writes. If the job crashes, the transaction is rolled back and the writes never existed โ so the replay can safely re-do them. This "stage first, commit only after everyone else is also safe" dance is the classic two-phase commit (2PC) protocol from distributed databases, and Flink wraps it in a class called TwoPhaseCommitSinkFunction. Here's what happens during a checkpoint cycle:
- Pre-commit phase: As events flow in, the sink writes them to a pending transaction (in Kafka: to a producer transaction that hasn't been committed yet; in JDBC: to a temp table or a transaction not yet flushed). These writes are invisible to readers.
- Barrier arrives: When the checkpoint barrier reaches the sink operator, the sink calls
preCommit()โ it flushes any buffered writes to the pending transaction but does not yet commit. - Snapshot: The sink snapshots its transaction handle (e.g., the Kafka producer transaction ID) into the checkpoint state.
- Checkpoint completes: Once all operators have successfully snapshotted, the JobManager notifies the sink. The sink calls
commit()โ the Kafka producer transaction commits, records become visible. - On crash before commit: The checkpoint didn't complete, so there is no record of the transaction. On restart, the sink calls
abort()on any open transactions and starts fresh. The replayed events will be written into a new transaction that commits with the next successful checkpoint.
What this SVG shows: the happy path (top) and crash path (bottom) through the two-phase commit cycle. On the happy path, events are buffered in a pending Kafka transaction, preCommit() flushes on the barrier, the transaction handle is snapshotted, and commit() fires when the checkpoint succeeds โ making records visible. On the crash path, the job crashes after preCommit() but before the checkpoint completes โ the open transaction is aborted, the job restores from the previous checkpoint, replays events, and the new transaction commits with the next successful checkpoint.
What this SVG shows: the honest picture of what exactly-once actually guarantees. Inside the Flink pipeline โ source offsets, operator state, and a Kafka sink with 2PC โ every event is counted exactly once. Outside the pipeline boundary โ an email sent from a map operator, a payment API called from a flatMap, a non-idempotent database increment โ exactly-once does NOT help. Those external calls need application-level idempotency: a unique event ID you check before processing, or a transaction ID that deduplicates at the receiver.
Backpressure, Throughput, and Latency Trade-offs
A streaming pipeline is only as fast as its slowest operator. If you have a DAG where the source produces 1 million events per second, an enrichment operator processes 800k per second, and a database-write sink handles only 200k per second โ what happens to the other 800k events? They pile up somewhere. If that "somewhere" is an unbounded buffer, you'll run out of memory in minutes. If that "somewhere" is nowhere (events are dropped), you get data loss. The real answer is backpressure: the slow downstream operator signals to the upstream that it's overwhelmed, and upstream slows down to match.
Think of it like a highway on-ramp meter. When the highway (downstream) is congested, the meter light (backpressure signal) slows cars onto the highway (upstream producers). The cars don't disappear โ they just wait at the meter. The total throughput is the same as the bottleneck; the backpressure mechanism just makes the pipeline degrade gracefully instead of crashing.
How Flink Implements Backpressure: Credit-Based Flow Control
Flink uses a credit-based flow control mechanism between operators. Each operator maintains a set of network buffers for receiving data. When an upstream operator wants to send data to downstream, it first checks how many "credits" (available buffers) the downstream has. If the downstream has no available buffers (it's processing at full capacity), the upstream pauses โ it stops consuming from its input until credits become available.
This credit-based approach is elegant because it propagates backpressure all the way back through the DAG automatically. If the sink is slow, the aggregator behind it runs out of output buffers, which means it can't accept new input from the map operator, which means the source slows down reading from Kafka. The source's consumer group falls behind Kafka โ events stay in Kafka rather than piling up in the pipeline. Kafka acts as the "pressure relief valve" for the whole system.
What this SVG shows: a four-stage pipeline where the DB sink is the bottleneck at 200K events/sec. Credit-based backpressure propagates the constraint backwards through the DAG โ each operator sees "no credits available from downstream" and slows its own output. The entire pipeline stabilises at 200K/sec. No events are lost โ the excess events stay in Kafka, visible as consumer lag. This is the correct behaviour: consumer lag is a warning sign, not data loss.
Throughput vs Latency: The Inevitable Trade-off
Throughput and latency are in tension in every streaming system. High throughput requires batching: instead of processing each event one-at-a-time, the framework batches multiple events into a single network transfer or a single RocksDB write. Batching amortises the per-event overhead โ one network round-trip for 1000 events instead of 1000 round-trips. But batching means events sit in a buffer waiting for the batch to fill up, which adds latency.
What this SVG shows: the throughput-latency curve as buffer size increases. Throughput rises quickly and then plateaus โ doubling the buffer from 1000 to 2000 events barely increases throughput, but it doubles buffering latency. The "optimal knee" is the point where throughput is near its plateau but latency is still low. In Flink, this corresponds to a network buffer timeout of 100โ200ms and buffer sizes of a few thousand events โ the defaults are tuned for this sweet spot.
Tuning Knobs for Production Pipelines
-
Parallelism per operator โ The most important tuning knob. If the DB sink is the bottleneck at 200K/sec with parallelism=2, increasing to parallelism=8 (assuming the DB can handle it) raises throughput to 800K/sec. Profile which operator is the bottleneck using Flink's Web UI or backpressure metrics, then increase that operator's parallelism.
-
RocksDB block cache size โ For RocksDB state, the block cache is the in-memory read cache for the most frequently accessed state. Too small = constant disk reads = slow state access = backpressure. For a job with millions of active keys, 1โ4 GB of block cache per task manager is typical. Configure via
state.backend.rocksdb.block.cache-size. -
Async I/O for external lookups โ If your map operator calls an external database or REST API synchronously, each event waits for the I/O to complete before processing the next event. With 10ms average external lookup latency, a single-threaded operator can process at most 100 events/sec. Flink's
AsyncFunctionallows you to pipeline 100+ concurrent requests per operator โ suddenly 100ms I/O gives you 1000 events/sec per operator instead of 100. -
Network buffer count and timeout โ
taskmanager.network.memory.num-buffers-per-channelcontrols how many network buffers each channel gets.taskmanager.network.netty.sendreceive-buffer-sizecontrols buffer size per TCP connection. The buffer flush timeout (execution.buffer-timeout, default 100ms) is the maximum time an event waits in a network buffer before being sent downstream โ the primary latency knob for sub-100ms pipelines.
Framework Comparison โ Flink vs Kafka Streams vs Spark Structured Streaming vs Beam
Four major options dominate the stream processing landscape, and they make completely different architectural bets. Choosing the wrong one forces painful migrations. Understanding each option's core model โ not just its feature list โ is what lets you make the right call for your situation.
The key insight before the comparison: each framework's model determines what it's good at. Flink's event-at-a-time model makes it great at sub-second latency and complex stateful logic, but it requires a separate cluster to run. Kafka Streams' library model makes it trivially deployable but tightly couples you to Kafka. Spark Structured Streaming's micro-batch model gives great throughput and batch/stream unification but sacrifices sub-second latency. Beam's abstraction model lets you write once and run on any backend, but adds an indirection layer that can obscure performance.
What this SVG shows: the four frameworks positioned on a latency vs state-complexity chart. Flink occupies the low-latency, high-state-complexity corner โ it's the choice when you need millisecond responses and complex stateful logic. Kafka Streams sits in the low-to-medium latency, simpler-deployment zone โ when you're already in a Kafka-first microservices world and don't want to run a separate cluster. Spark Structured Streaming occupies the higher-latency, batch-stream-unification zone โ when you already run Spark for batch jobs and want to reuse the same code for streaming. Beam's dashed rectangle spans the first three because it's not a runtime โ it's a programming model that compiles to any of them.
The Four Frameworks in Depth
Model: True event-at-a-time streaming. Each event is processed individually as it arrives โ not batched. The operator network runs continuously, with state persisted in RocksDB on each worker node.
Runtime: Separate cluster โ a JobManager (coordinator) and one or more TaskManagers (workers). Deployable on Kubernetes, YARN, or standalone. This is Flink's biggest operational cost compared to Kafka Streams โ you're running an extra cluster.
Latency: Sub-second, typically 1โ10ms end-to-end for simple pipelines. Complex stateful pipelines (joins, session windows) add a few milliseconds of state access overhead. Checkpointing adds a brief pause (usually <5ms with incremental checkpoints on fast storage).
State model: First-class. Keyed state (ValueState, ListState, MapState, AggregatingState) stored in RocksDB. Supports both event-time and processing-time. State TTL. Exactly-once via two-phase commit sinks. Savepoints for upgrades.
Exactly-once: Yes โ Kafka sources + Kafka/JDBC sinks participate in checkpoint-based two-phase commit.
Languages: Java (primary), Scala, Python (PyFlink). SQL via Flink SQL / Table API.
When to choose Flink: You need sub-second latency + complex stateful logic (session windows, stream-stream joins, ML feature engineering). You're processing at scale where Kafka Streams' library model becomes hard to operationalise. Used by Netflix (anomaly detection), Uber (real-time analytics), Stripe (fraud prevention), Alibaba (ecommerce analytics).
Model: Event-at-a-time (same model as Flink) but implemented as a Java library that runs inside your existing application process. There is no separate stream processing cluster. Your Spring Boot service IS the stream processor.
Runtime: Embedded in your application. You add a dependency, write a Topology, and call streams.start(). Kafka itself handles parallelism and failover โ consumer group rebalancing reassigns partitions across multiple instances of your service. Scale by deploying more pods.
Latency: Sub-second โ similar to Flink for simple operations. Kafka Streams processes records within a poll batch, so latency is roughly the Kafka consumer poll interval (default 100ms) plus processing time.
State model: RocksDB-backed KStore (key-value) and KTable (changelog). State backed by an internal Kafka changelog topic โ on restart, the state is rebuilt from the changelog (which can take minutes for large state, unlike Flink's checkpoint restore). State can also be materialised to interactive query servers via Kafka Streams' Interactive Queries API.
Exactly-once: Yes โ via Kafka transactions (processing.guarantee=exactly_once_v2). Requires Kafka broker โฅ2.5. Slightly lower throughput than at-least-once due to transaction overhead.
Languages: Java/Kotlin only (the library is JVM-only). Kafka Streams DSL (high-level) or Processor API (low-level).
When to choose Kafka Streams: Your event bus is Kafka, your services are JVM-based, and you don't want the operational overhead of a separate Flink cluster. Best for enrichment, filtering, joining with KTables, and lightweight aggregations within a microservices architecture. Not ideal if your state exceeds a few GB (the changelog rebuild on restart is painful at scale).
Walkthrough: the builder declares two inputs โ a KTable (a continuously updated lookup table keyed by customer ID) and a KStream (the live flow of orders). The .join() enriches each order with the latest customer record. The .groupBy() re-keys by product ID, then a 5-minute tumbling window counts orders per product, and the result is written back to a Kafka topic. The whole "topology" runs inside your service process โ there is no separate cluster to babysit.
Model: Micro-batch โ Spark collects incoming events into small batches (configurable trigger interval, minimum ~100ms) and processes each batch as a mini-DataFrame batch job. True event-at-a-time processing is not the model; you always process a micro-batch.
Runtime: Separate Spark cluster (Spark driver + executors). Deployable on Databricks, EMR, GKE, or standalone. The same cluster that runs your batch Spark jobs also runs Spark Structured Streaming jobs โ which is the killer feature if you already have a Spark cluster and team.
Latency: 100ms to a few seconds. The micro-batch interval is the minimum latency โ you can't get sub-100ms because you're waiting for a micro-batch to accumulate. For most analytics use cases (dashboards, hourly reports, near-realtime ML feature pipelines), this is completely fine.
State model: Stateful operations supported via mapGroupsWithState / flatMapGroupsWithState (arbitrary state) or built-in windowed aggregations. State stored in Spark's in-memory execution model. Fault tolerance via WAL (write-ahead log) and checkpointing to HDFS/S3. State is generally less flexible than Flink's โ no RocksDB, no TTL out of the box (requires custom logic).
Exactly-once: Yes โ the micro-batch model provides natural exactly-once via batch atomicity. Each micro-batch either fully commits or fully rolls back. Idempotent Kafka sources + transactional Kafka sinks give end-to-end exactly-once.
Languages: Python (PySpark), Scala, Java, SQL. The DataFrame API means you can write streaming jobs in the same Python/SQL you use for batch โ a huge productivity win if your team is Python-first.
When to choose Spark Structured Streaming: You already run Spark for batch jobs and want to reuse the same API, same cluster, same team skills. Your latency requirement is 100msโ1s and not sub-100ms. You want Python or SQL as your primary language. You do a lot of joins between batch datasets and streaming data (broadcast joins work especially well in Spark).
Model: A programming model and SDK, not a runtime. You write a Beam pipeline using the Beam SDK (Python, Java, Go, SQL). Beam compiles your pipeline to run on any supported runner: Apache Flink, Apache Spark, Google Cloud Dataflow (managed), or the Direct runner (for local testing). The Beam model unifies batch and streaming into one API โ a batch job is just a streaming job with a bounded source.
Runtime: The runner you choose. Using Beam doesn't save you from choosing a runtime โ it defers the choice. A Beam pipeline on the Flink runner runs in a Flink cluster. A Beam pipeline on Dataflow runs in Google's managed infrastructure.
Latency / throughput: Completely determined by the runner. Beam on Flink = Flink's latency. Beam on Spark = Spark's latency. Beam itself adds a modest overhead (the translation layer), but in practice it's negligible compared to I/O and state access.
Exactly-once: Supported on runners that support it. Beam on Flink inherits Flink's two-phase commit. Beam on Dataflow inherits Dataflow's exactly-once guarantees.
Languages: Java, Python, Go, SQL. Python support via the Python SDK + cross-language transforms.
When to choose Beam: You want to write once and be able to switch runners (e.g., start on Dataflow for managed simplicity, migrate to Flink for cost control). You use Google Cloud and want Dataflow's managed autoscaling without writing Flink jobs directly. You need a unified batch+stream programming model where the same pipeline code runs in both modes.
The real cost of Beam: The abstraction layer means you can't always use runner-specific features (e.g., Flink's advanced state APIs, Spark's broadcast joins). Debugging is harder โ your code runs on Flink, but your mental model is Beam, and mapping errors between the two is non-trivial. Beam is the right bet if portability matters; if you're 100% committed to one runner, writing native Flink or Spark code gives more control.
At a Glance: 4-Way Comparison Table
| Property | Apache Flink | Kafka Streams | Spark Structured Streaming | Apache Beam |
|---|---|---|---|---|
| Processing model | Event-at-a-time (true streaming) | Event-at-a-time (library) | Micro-batch (100msโ1s batches) | Programming model (runner decides) |
| Runtime | Separate Flink cluster | Library inside your app (no cluster) | Separate Spark cluster | Depends on runner (Flink/Spark/Dataflow) |
| Latency | 1โ10ms | 10โ50ms | 100ms โ a few seconds | Depends on runner |
| State model | RocksDB / in-memory; keyed + operator state; TTL; savepoints | RocksDB KStore/KTable; rebuilt from Kafka changelog | In-memory / WAL; less flexible than Flink | Inherits runner's state model |
| Exactly-once | Yes (2PC sinks + checkpoints) | Yes (Kafka transactions) | Yes (micro-batch atomicity) | Runner-dependent |
| Languages | Java, Scala, Python, SQL | Java, Kotlin (JVM only) | Python, Scala, Java, SQL | Java, Python, Go, SQL |
| Best for | Complex stateful ops, sub-second latency, large-scale | Kafka-first microservices, no extra cluster wanted | Already-Spark teams, Python/SQL, batch+stream unification | Runner portability, Google Cloud Dataflow, write-once |
| Used by | Netflix, Uber, Stripe, Alibaba | Confluent customers, microservice-heavy orgs | Databricks users, data-warehouse-centric orgs | Google Cloud users, teams needing multi-cloud portability |
What this SVG shows: a decision tree for picking a streaming framework. Start at the top: if you're Kafka-native and want no extra cluster, Kafka Streams. If you already have Spark or need Python/SQL as your primary language, Spark Structured Streaming. If you need to deploy on multiple cloud runtimes or want Dataflow, Beam. If none of those apply โ you need sub-second latency, complex stateful ops, and don't mind running a separate cluster โ Apache Flink.
Common Stream Processing Use Cases โ Where This Actually Lives in Production
Knowing how stream processing works is only half the job. Knowing which problem calls for it โ and how to wire the pieces together for that specific domain โ is what makes the difference between a design that ships and one that stays on a whiteboard. The six use cases below are the ones you will encounter in real production systems. Each one has a different reason for needing streaming, a different state model, and a different set of traps for the unwary.
The diagram above shows the fundamental pattern behind fraud detection: a transaction event arrives in Kafka, a Flink job enriches it asynchronously from Redis (user-profile features like historical velocity, typical spend, location), runs an ML scoring function, and emits a decision โ all before the payment settles. The async Redis enrichment is critical: a synchronous database call in the hot path adds ~20 ms per event; async I/O brings that to ~0.5 ms without blocking the operator thread pool.
Why streaming? Payment settlement windows are 1โ10 seconds. A batch job running every hour is 3,600ร too slow. The fraud decision must happen in-flight or not at all.
Architecture pattern: Kafka topic receives payment events โ Flink job keys by user_id โ async enrichment from Redis feature store (velocity counts, typical merchant categories, device fingerprint) โ ML scoring (usually a pre-loaded ONNX model in the operator) โ decision emitted back to Kafka. A stream-table join is used here: the "stream" side is the live transaction events; the "table" side is the user-profile Redis store, which is itself updated by a separate stream reading account activity events.
State model: Each operator maintains a per-user rolling window of recent transactions (last 60 seconds, last 5 minutes, last hour) as keyed state. State TTL is set to 2 hours to cover the longest velocity window. Checkpoint interval is 30 seconds because state recovery must be fast โ a crash that causes a 90-second gap in fraud scoring is unacceptable.
The trap everyone falls into: Synchronous external lookups inside the operator. If you call redisClient.get(userId) synchronously, you block the operator thread while waiting for a network round trip. One slow Redis node can cascade into a full pipeline stall. Always use async I/O (Flink's AsyncFunction, Kafka Streams' async client) so the operator thread can process other events while waiting for the enrichment response.
Why streaming? User preferences change in minutes, not days. A user who just watched three thriller movies is in a different mood than they were yesterday. A recommendation engine trained on yesterday's batch data misses this signal entirely.
Architecture pattern: Event stream (clicks, watches, purchases, ratings) โ Flink job computes rolling user-feature vectors (genre preferences, recency-weighted engagement scores, collaborative filtering signals) โ writes updated feature vectors to a low-latency feature store (Redis, Feast, DynamoDB) โ recommendation service reads from the feature store at query time. The stream processor never serves recommendations directly โ it keeps the feature store fresh. The serving layer is stateless HTTP.
State model: Exponentially-weighted moving averages work well here. New signals are given high weight; signals from last week decay. This avoids unbounded state growth โ you don't store every click ever, just the current weighted vector. State size stays constant per user regardless of how long they've been a customer.
The trap: Writing to the feature store on every single event. If a user generates 50 events per session, you'd do 50 Redis writes per user per session. Instead, use a tumbling window of 10โ30 seconds to batch feature updates โ one write per window per active user. Latency penalty is negligible for recommendation quality; write amplification reduction is dramatic.
Why streaming? An alert that fires 20 minutes after a production incident is nearly useless. Observability requires sub-minute aggregation to catch P95 latency spikes, error rate surges, and CPU saturation before they become full outages.
Architecture pattern: Application agents emit structured log lines and metrics into Kafka โ a stream processor (Flink, or purpose-built pipelines inside tools like Datadog Agent) computes per-service, per-endpoint aggregates (request count, error count, P50/P95/P99 latency) over 1-minute and 5-minute tumbling windows โ results land in a time-series database (InfluxDB, Prometheus remote write, Datadog metrics API) โ dashboards and alert rules query from there.
Scale note: A service emitting 10,000 requests/second generates 10,000 log lines/second. Aggregating that into a per-minute P95 requires a streaming reducer โ you cannot buffer a full minute of raw log lines in memory and sort them. The standard approach is a reservoir sampling or t-digest algorithm that computes approximate percentiles in constant memory.
The trap: Using the wall clock at the processor (processing-time windows) instead of the event's own timestamp (event-time windows) for SLO calculations. If a service sends metrics with a 30-second delay (buffering), a processing-time window shows a false dip followed by a spike. Event-time windows with a 60-second watermark give you the correct picture at the cost of a 60-second output delay โ still well within SLO alert requirements.
The CDC diagram shows the most powerful use of stream processing for data integration. Debezium reads PostgreSQL's write-ahead log (WAL) โ the same log PostgreSQL uses for crash recovery โ and emits every INSERT, UPDATE, and DELETE as a structured event to Kafka. The Flink job consumes those events and materialises them into multiple derived stores simultaneously: Elasticsearch for full-text search, Redis for fast cache lookups, and a data warehouse for analytics. Every downstream store is updated within seconds of the source write. No nightly ETL job. No stale search indexes. No cache that's hours out of date.
Why streaming? The alternative โ polling the source database every N minutes for changes โ is slow, expensive (full table scans), and often forbidden on production databases. CDC reads the WAL, which is already being written for crash recovery. It's zero-overhead on the source and delivers every change in order.
Architecture pattern: Kafka Connect + Debezium connector reads the Postgres WAL โ emits change events to Kafka โ a Flink job (or simply Kafka Connect sink connectors) materialises them into derived stores. Each derived store gets its own consumer group, consuming from the same Kafka topic independently, so a slow Elasticsearch indexing job cannot block Redis cache updates.
The ordering guarantee: CDC events for the same primary key come in WAL order. This means a row that is INSERTed then UPDATEd twice then DELETEd will appear in that exact order in Kafka. The Flink job must handle "upsert" semantics โ write or overwrite on INSERT/UPDATE, delete on DELETE. If you accidentally process events out of order (e.g., by parallelising without keying on primary key), you can apply an UPDATE before its INSERT and corrupt the derived store.
The trap: Not handling schema evolution. When a developer adds a column to the source table, the Debezium event schema changes. If your Flink job has a hard-coded deserialization schema, it will crash. Use a Schema Registry (Confluent, AWS Glue) with forward-compatible Avro or Protobuf schemas so the stream processor can handle old and new schemas simultaneously during rolling deployments.
Why streaming? A retailer with a warehouse management system and an e-commerce front end needs stock counts to be consistent within seconds โ not minutes, not hours. "Add to cart" for the last item must not succeed for two customers simultaneously. Stream processing keeps a running tally of reservations and releases, decrementing available stock in real time.
Architecture pattern: Inventory events (reserved, released, fulfilled, received) โ Kafka โ Flink job with per-SKU keyed state maintaining available count โ sink to a low-latency store readable by the checkout service. Pricing events (promotion start, promotion end, competitor price match) flow through a parallel pipeline that writes to the same store with a different key prefix.
The challenge: Exactly-once guarantees matter acutely here. An over-count due to duplicated inventory events could oversell stock. The Flink + Kafka exactly-once transaction protocol (described in S10) is the correct solution โ the checkpoint barrier + Kafka transactional producer ensures each inventory event affects the count exactly once even across failures.
Why streaming? A factory floor with 50,000 temperature sensors each emitting a reading every second generates 50,000 events/second. You cannot store all of these in a queryable OLTP database โ storage and write throughput would be prohibitive. Stream processing aggregates them (average, min, max, anomaly detection) in real time, storing only the aggregated results and high-value anomaly events.
Architecture pattern: MQTT broker or Kafka for device ingestion โ Flink with per-device keyed state and tumbling windows โ anomaly detection (compare current reading against a rolling baseline, flag if deviation > 3ฯ) โ alert routing to on-call teams + archival of raw events to cold storage (S3) for later analysis.
Session windows shine here: A sensor that goes offline and comes back should produce a new "session" of readings, not be merged with readings from two hours ago. Session windows with a 5-minute inactivity gap naturally handle device reconnects and offline periods without any special-casing in application code.
The scale trap: Flink partitions state by key. If you key by device_id and have one million devices, you have one million logical partitions โ that's fine, state is spread across workers. But if you key by factory_id with only 20 factories, you have 20 partitions regardless of task manager count. The resulting "hot key" problem means 20 operators handle all load, and 20 is a hard parallelism ceiling. Always key at the finest granularity the business logic allows.
Lambda vs Kappa vs Streaming-Only โ The Big Architecture Decision
Before you write a single line of stream processing code, you have to answer a bigger question: what architecture will your data pipeline follow? Three schools of thought have emerged over the past fifteen years, each representing a different bet on where streaming technology was headed at the time. Understanding all three โ and why the industry has been converging toward the third โ is essential for any system design conversation at staff or principal level.
The Lambda architecture diagram shows the original big data design pattern introduced by Nathan Marz around 2011. Data flows into two parallel pipelines that run simultaneously: a batch layer reprocesses all historical data on a schedule (every few hours) to produce correct, accurate results; a speed layer processes the most recent data with low latency to fill in the freshness gap. At query time, results from both layers are merged. The architecture solves the freshness problem but introduces a maintenance nightmare: every piece of business logic must be written and maintained in two separate systems.
The Kappa architecture, articulated by Jay Kreps (Kafka co-creator) around 2014, makes a simple but powerful observation: if your event log is durable and replayable โ which Kafka is โ then you don't need a separate batch layer. Reprocessing is just running the same streaming job against an old Kafka offset. One codebase. One test suite. One execution engine. When a bug ships, you deploy a fixed version of the streaming job, point it at offset 0 (or the offset from N days ago), let it catch up to real time, then switch traffic and shut down the old version.
When to Use Each
Reprocessing and Replay โ How to Fix a Bug in Historical Data
Here is one of the most underrated properties of a well-designed streaming system: when your code has a bug, you can go back in time and reprocess history as if the bug never existed. This is not magic. It works because Kafka (and similar durable logs) retains every event for a configurable retention period. If that retention is 30 days, every event from the past 30 days still exists at its original offset. A fixed version of your streaming job can replay from offset 0 and produce corrected output โ exactly as if the correct code had been running all along.
Pattern A: Side-by-Side Reprocessing (Zero-Downtime Bug Fix)
The safest reprocessing pattern runs the fixed job alongside the production job simultaneously, writing to a separate output topic or table. Once the replay job catches up to real time and results are validated, you perform a clean cutover โ update the consumer configuration to point at the new output, shut down the old job. Zero downtime. Full validation window before switching.
The diagram shows the side-by-side pattern in action. The buggy v1 job keeps running โ production continues uninterrupted. The fixed v2 job starts from a historical offset (say, the point in time when the bug was introduced) and replays at full speed. Kafka allows replay to run faster than real time because there is no waiting for new events โ the processor reads from disk at whatever speed the network allows. In practice, a replay job processes 5โ15ร faster than the event rate at which those events originally arrived, so a 30-day backlog might catch up in 2โ3 days of wall-clock time. Once v2 catches up and its output is validated, consumers switch and v1 is decommissioned.
Pattern B: Bulk Replay (Simpler, Requires Downtime Window)
If you can afford a brief maintenance window โ say, a few hours overnight โ bulk replay is simpler. You pause the production job, take a checkpoint, deploy the fixed code, point the job at the offset where the bug was introduced, and let it replay to the latest offset. Once it's caught up, resume normal operation. No dual output topics. No consumer switching. Just a clean reprocess.
The tradeoff: downtime during replay. For most internal analytics pipelines, this is acceptable. For fraud detection or inventory systems with SLAs in the seconds range, side-by-side is the only option.
The Math: How Long Does Replay Take?
Here's the calculation every engineer should be able to do before committing to Kappa:
events_to_replay = event_rate ร retention_secondsreplay_duration = events_to_replay / (replay_throughput)Example: 1 million events/sec ingestion rate, 7-day Kafka retention, replay throughput of 5 million events/sec (5ร real-time speed):
events = 1,000,000 ร 604,800 = 6.048 ร 10^11 eventsduration = 6.048 ร 10^11 / 5,000,000 = ~33.6 hours of replayCritical constraint: the replay must complete before the bug-fixed code's divergence from the buggy code compounds to a point where output is materially different between the two versions. If the bug only affects edge cases, you have more time. If it corrupts every event, replay must complete before the v2 job's output diverges from business truth.
Schema Evolution During Replay
One frequently overlooked problem: when you replay events from 30 days ago, some of those events were written with an older schema. A column that didn't exist 30 days ago will be absent from old events. A Flink job deserializing those events must handle both old and new schema versions simultaneously. The solution is a Schema Registry (Confluent Schema Registry, AWS Glue) with backward-compatible schema evolution, so the deserializer can read both the old Avro schema (version 1) and the new one (version 2) without crashing. Never write a stream processor that hardcodes a single schema version if replay is part of your operational model.
Monitoring Stream Processors โ The Metrics That Actually Catch Incidents
A streaming job that is silently dropping late events, slowly growing its state into an OOM, or falling behind real time looks perfectly healthy from the outside โ no errors thrown, no crashes, CPU and memory within bounds. The metrics that catch these silent failures are specific to streaming and different from the CPU/memory/request-rate dashboard you'd use for a web service. Here are the seven metrics that actually matter, and what each one tells you before it becomes a 3 AM incident.
The dashboard above shows the seven metrics in a typical Flink or Kafka Streams monitoring setup. Panels โ โโข are the highest-priority SLO metrics โ they directly measure whether the pipeline is meeting its business commitments. Panels โฃโโฅ are operational health metrics that predict future failures before they occur. Panel โฆ is the sneaky one: state size growth looks benign until it causes an OOM crash at 3 AM.
Each Metric Explained โ What It Tells You and Why
This is the time from when an event occurred (event-time timestamp in the message) to when the output record lands in the sink (database write, Kafka output topic, dashboard update). It is the metric that directly answers "are we meeting our latency SLA?" A fraud detector promising sub-200ms decisions should alert at 150ms โ not at 200ms when it's already breached.
How to measure it: subtract the event-time header from the sink write timestamp. Emit this as a histogram per operator stage so you can identify which stage contributes the most latency. Flink exposes this via its metrics system; export to Prometheus and visualise in Grafana. Alert on P95, not average โ average latency can be fine while P95 is breaching the SLO.
The watermark lag is how far behind real time the current watermark is. A watermark at T โ 2 minutes means the stream processor believes "all events with timestamp before Tโ2min have arrived." If that lag exceeds your watermark advance heuristic (say, 5 minutes), it means late events beyond that gap are being silently dropped โ they arrive after their window has already been closed and emitted.
A growing watermark lag usually means one of three things: (1) one or more Kafka partitions is stalled (no new events, so the watermark for that partition doesn't advance); (2) a source with very high event-time skew (mobile devices going offline); (3) a slow consumer that isn't reading fast enough to advance. Flink's web UI shows per-partition watermarks โ check for a single stalled partition dragging the global watermark down.
Consumer lag is the difference between the latest offset in the Kafka partition and the offset the stream processor has committed. A lag near zero means the job is keeping up with the event rate. A growing lag means the processor is slower than the producer โ backpressure is building. If lag grows without bound, the job will eventually fall so far behind that a restart triggers a replay spanning hours of backlogged events, causing an extended recovery period.
Alert on lag growth rate, not absolute lag. Some lag is normal during job startup. Continuously growing lag at a rate of >10,000 events/second for more than 5 minutes warrants immediate investigation. Common causes: a new operator added to the pipeline with insufficient parallelism, a cold cache miss storm at startup (async I/O waiting for Redis to warm), or a temporarily slow sink (database write latency spike).
Checkpoint duration measures how long it takes to snapshot operator state to durable storage (typically HDFS or S3). A healthy job checkpoints in seconds. A job whose checkpoints take 10, 20, 60 seconds has growing state โ and growing state is the most common precursor to an out-of-memory crash.
The mechanism: Flink's checkpoint barrier must pass through every operator before the snapshot is complete. If one operator has large RocksDB state (hundreds of GB), the barrier waits for that operator to finish flushing its state. Checkpoint duration is proportional to state size. Growing duration = growing state. Growing state = no TTL configured, or TTL too long, or key cardinality growing unboundedly (e.g., keying by user-agent string instead of user-id).
Every time a Flink task manager crashes and recovers (due to OOM, network partition, JVM GC pause, or NullPointerException in operator code), the restart counter increments. A well-tuned streaming job should have zero restarts per hour. Even one restart per hour means the job is spending time recovering from checkpoints rather than processing events, and the root cause is not yet fixed.
Restarts are not catastrophic by themselves โ Flink's exactly-once semantics mean state is consistent after recovery. But they're a signal that something is wrong, and the wrong thing will eventually become a longer outage. Common causes: memory pressure (increase task manager heap or switch to RocksDB state backend), operator NullPointerExceptions from malformed input events (add null-guard logic), or JVM GC pauses exceeding the heartbeat timeout (tune G1GC or switch to ZGC for large heaps).
State size measures the total bytes of keyed state held by each operator. After a job reaches steady state, state size should plateau โ new keys being added should roughly equal old keys expiring due to TTL. If state size grows linearly without bound, the job will eventually exhaust heap or RocksDB disk space and crash.
The three causes of unbounded state growth: (1) No TTL configured โ the operator accumulates state for every key it has ever seen, including keys that will never appear again (deactivated users, cancelled sessions). (2) Key cardinality explosion โ keying by a high-cardinality string like device user-agent or IP address rather than a bounded identifier like user ID. (3) Window state not expiring โ a bug in window trigger configuration that prevents state from being garbage-collected after the window closes.
Pitfalls and Anti-Patterns โ The 7 Most Expensive Streaming Mistakes
Every streaming team eventually hits these. The first time is a production incident. The second time is a postmortem that references the first postmortem. The goal of this section is to make sure you hit zero of them for the first time in production.
The mistake: You build a "revenue per day" aggregation using a tumbling window on processing time (the wall clock when events arrive at the processor). Everything looks fine in testing. Then you deploy to production, and mobile app users who were offline for 6 hours sync their sessions. Their events arrive all at once, 6 hours after they happened. Processing-time windows bucket all of them into the current hour's window โ not the hour they actually happened.
Why it's bad: Your revenue-per-day chart shows a spike at 2 PM (when the offline devices synced) and correct-but-lower numbers for 8 AM (when the transactions actually occurred). Business decisions made on this data โ pricing, promotion timing, inventory adjustments โ are based on corrupted time attribution. This is not a display bug; it's a correctness bug in aggregated business data.
The fix: Use event time for all business-logic aggregations. Set a watermark strategy that matches your expected event skew (e.g., 30-minute watermark for mobile apps that may sync after extended offline periods). Accept that results will be delayed by up to 30 minutes from processing time โ but they will be correct. The latency cost of event-time correctness is almost always worth it for financial and business-metric pipelines.
The mistake: You add a keyed aggregation โ "count page views per user" โ without configuring a state time-to-live (TTL) โ the rule that says "delete a key's state if it hasn't been touched in N hours." Every user who ever visits the site gets an entry in the operator's state store. Active users are fine. But inactive users โ accounts created two years ago, never seen since โ accumulate in state indefinitely. After six months, the state store holds entries for 50 million users, most of whom haven't been active in months.
Why it's bad: State stores live in heap memory (for small state) or in RocksDB on local disk. Either way, unbounded growth eventually exhausts the resource. The crash is not immediate โ it sneaks up over weeks and months as state accumulates. When it finally crashes, recovery requires reading back all that state from the checkpoint, which is slow. And the root cause is a missing TTL configuration, not a hardware failure or load spike.
The fix: Configure state TTL for every stateful operator. In Flink:
The TTL should match the business meaning: "count per day" โ TTL of 25 hours (slightly more than a day to handle boundary events). "user session state" โ TTL of 30 minutes after last activity (session timeout). "fraud velocity window" โ TTL equal to the longest velocity window plus a 20% buffer.
The mistake: You key your stream by country_code because your business analysts want "revenue per country." There are 200 countries. Your Flink job has 50 task slots. You set parallelism to 200 and think you've fully utilised the cluster. But Flink can only achieve 200-way parallelism โ regardless of how many task slots you add beyond 200, the additional slots sit idle. And in practice, 80% of your traffic comes from the United States, so one partition is processing 80% of all events.
Why it's bad: First, you've capped your maximum possible parallelism at 200 (the key cardinality). Adding more hardware doesn't help. Second, the "US" partition is a hot key that processes 80ร more events than the "Liechtenstein" partition. The hot partition determines end-to-end latency because it's the slowest operator. Backpressure from the hot key propagates upstream.
The fix: Key at a finer granularity than the aggregation. Key by user_id (cardinality in the millions), then aggregate into per-country totals in a second operator. The first operator achieves full parallelism. The second operator, keyed by country, handles only the rollup and is much lower throughput. Alternatively, use a pre-aggregation combiner pattern: partial aggregates computed in parallel, then merged in a single reduce step.
The mistake: Your fraud detection pipeline needs to enrich each transaction with user history from a Postgres database. You add a MapFunction that calls jdbcConnection.query("SELECT * FROM user_history WHERE user_id = ?", event.getUserId()). It works in testing (small traffic, fast DB). In production, the DB has 50ms average query latency. Your Flink operator thread blocks for 50ms per event. With 10,000 events/second per task slot and one blocking call per event, you need 500 task slots just to keep up. You had 20. Consumer lag grows at 9,980 events/second.
Why it's bad: Blocking I/O inside a streaming operator serialises all processing through the blocking call. The network round trip โ typically 0.5โ50ms for a cache hit, 10โ100ms for a DB query โ becomes the throughput bottleneck. This is not a hardware problem; it's a threading model problem.
The fix: Use Flink's AsyncFunction API, which lets the operator issue multiple lookups concurrently, continuing to process other events while waiting for responses:
With 100 concurrent async requests in flight, a 50ms lookup latency allows a single task slot to process ~2,000 events/second โ 100ร better than synchronous. Use Redis (sub-millisecond lookups) over Postgres (10โ100ms) for hot-path enrichment wherever possible.
The mistake: You implement a payment confirmation pipeline. Flink guarantees exactly-once processing โ events are counted exactly once, state is consistent after failures. You have a SinkFunction that calls a third-party payment API to confirm each transaction. You believe exactly-once means each transaction is confirmed exactly once. A task manager crashes. Flink recovers from checkpoint. The sink function is called again for events that were already processed before the crash. The payment API receives duplicate confirmation calls. The customer is charged twice.
Why it's bad: Flink's exactly-once guarantee covers state mutations and Kafka output topics โ both of which participate in the two-phase commit protocol during checkpointing. External systems that do not participate in this protocol (REST APIs, JDBC writes outside Flink's JDBC sink, email sends, SMS notifications) can receive duplicate calls when operators replay after recovery. Exactly-once is a property of the stream processor's internal state and Kafka integration, not a global guarantee over all I/O.
The fix: All external side effects must be idempotent. Assign an idempotency key โ a unique tag attached to each outbound call so the receiver can recognise and skip a repeat โ typically the event's unique ID from the source system, possibly combined with the processing attempt number. The external system deduplicates on this key. For payment APIs, this is called the idempotency-key header (Stripe, Adyen both support it). For database writes, use INSERT ... ON CONFLICT DO NOTHING or UPSERT semantics keyed on the event ID.
The mistake: You set a watermark advance heuristic of 5 seconds โ meaning "I expect events to arrive at most 5 seconds late." This feels generous. But your data includes events from IoT devices that buffer locally and send in bursts every 60 seconds, mobile app events from users on cellular networks with up to 2-minute delays, and server-side events that are fine. The 5-second watermark advances past the 60-second and 120-second late events. Their windows have already been emitted and closed. The events are silently dropped. No error is logged. No metric is incremented (unless you configured late-data metrics). The resulting aggregates undercount by 30%.
Why it's bad: Silent data loss is the worst kind โ it doesn't produce an alert, a stack trace, or a dead-letter queue entry. The output just looks slightly wrong. You don't notice until an analyst compares the streaming dashboard against a batch reconciliation report and finds a 30% discrepancy. By then, days of data have been affected.
The fix: Measure your actual event skew distribution before setting the watermark. In Flink, enable late-data side outputs and count them:
Set the watermark to the 99th percentile of your observed event skew, not the median. For IoT / mobile data, 2โ5 minutes is often the correct heuristic. Accept the output delay; do not accept silent data loss.
The mistake: The streaming job has been running flawlessly in staging for two weeks. Checkpoints complete in 3 seconds. Everything looks great. You ship to production. Three weeks later, at 3 AM, a task manager runs out of memory and crashes. Flink tries to recover from the latest checkpoint. Recovery fails: the savepoint is incompatible with the current code because a developer added a new field to the state schema without a migration strategy. The job cannot restart. State has to be discarded and the job resumes from the earliest available offset, reprocessing hours of events. Downstream consumers see duplicate events for the reprocessed period.
Why it's bad: Savepoint compatibility is something you can only discover by actually testing checkpoint-and-recover. The incompatibility is typically caused by: (1) adding a field to a Flink POJO state class (a plain Java object used to hold state) without registering a migration; (2) changing the key serializer type; (3) changing the operator UID (Flink identifies operators by UID โ if the UID changes, Flink can't match checkpoint state to operators). None of these cause errors during normal operation. They only surface during recovery.
The fix: Make savepoint restore testing a mandatory part of your deployment pipeline. Before every production deployment:
1. Take a manual savepoint of the current production job:
flink savepoint <job_id>2. Deploy the new code in a staging environment, pointing at the production savepoint
3. Verify the job starts successfully from the savepoint
4. Verify state is correctly restored (emit a few test events and check that accumulated state is present)
5. Only then deploy to production and switch
Also: assign stable operator UIDs using
.uid("my-fraud-scorer") on every stateful operator. Without this, Flink generates UIDs from job topology position โ add or remove any operator upstream and all UIDs shift, breaking all savepoints.
Practice Exercises โ Reason About Streams Like a Senior Engineer
These exercises are not "recite the definition" drills. Each one asks you to apply the principles from Sections 13โ17 to a novel scenario you haven't seen before โ the same kind of open-ended reasoning you'd do in a system design interview or a production incident postmortem. Work through each exercise before expanding the solution.
Scenario: You're building an analytics dashboard that shows "unique active users in the last 60 minutes," updated every minute. Your events are user activity heartbeats emitted every 30 seconds by the front end. You need to display the count at 12:00, 12:01, 12:02, etc., where each count covers the most recent 60 minutes.
Which window type is correct? Tumbling (60-min), Hopping (size=60min, advance=1min), Sliding, or Session? Explain why the others are wrong.
A tumbling window (non-overlapping 60-minute windows) would give you one count per hour โ but you want one count every minute. Wrong frequency.
A sliding window would give you a result every time a new event arrives โ which is ~every 30 seconds per user, producing far too many output events (one per heartbeat, not one per minute). Also, "sliding" in most frameworks means you define a slide interval; what you're describing IS a hopping window with hop=1 min.
A session window closes when there's a gap in activity โ but here you explicitly want a fixed 60-minute rolling window regardless of session gaps. Wrong semantics.
The hopping window with size=60min, hop=1min produces exactly 60 overlapping windows at any instant (each covering the past 60 minutes), with a new window result emitted every minute. This is precisely the "rolling last-hour count, updated every minute" semantic. Each event participates in 60 windows (one per minute it falls within), so the memory overhead is 60ร that of a tumbling window โ acceptable for 60ร richer output.
Scenario: Your team's Flink job has been running for two weeks. This morning, the on-call engineer notices the watermark is 2 hours behind real time (lag = 120 minutes). The job was processing correctly yesterday (lag = 90 seconds). No new code was deployed overnight. What are the three most likely causes, ranked by probability?
1. One Kafka partition is stalled (no new events) โ highest probability. Flink's global watermark is the minimum watermark across all source operator instances. If a single Kafka partition receives no new messages for 2 hours (because a producer died, a service stopped routing to that partition, or a topic rebalance went wrong), that partition's watermark does not advance. The global watermark is capped at that stalled partition's last event. Check:
flink list โ look at per-partition watermark in the Flink Web UI โ find the stalled partition โ investigate the producer for that partition.2. A large batch of events with old event-time timestamps arrived. A batch job or data migration injecting historical events with timestamps from 2 hours ago would push the effective event-time of the stream backward (or prevent it from advancing) if those events were dominating the source. Less common but possible if a data pipeline resumed after a 2-hour outage. Check: inspect the event-time timestamps of messages currently in the input Kafka topic.
3. A source instance is idle due to a rebalance or consumer group rebalance. If Kafka triggered a consumer group rebalance (e.g., due to a Flink task manager restart or an added/removed partition), one source operator instance might temporarily not be assigned any partitions โ meaning it emits no watermark. Flink's idle source detection (configurable with
withIdleness(Duration.ofMinutes(5))) should handle this, but if not configured, one idle source drags the watermark. Check: verify idle source marking is configured in the watermark strategy.
Scenario: You're building a stream processor that tracks "number of purchases per user per calendar day" for a loyalty rewards system. The system uses UTC midnight as the day boundary. Events can arrive up to 90 minutes late (mobile devices offline, then syncing). Your Kafka retention is 30 days. What is the correct state TTL for the per-user per-day count, and why? What happens if TTL is too short? Too long?
The correct state TTL is 25.5 hours (24 hours for the calendar day + 90 minutes for allowed lateness + a small safety buffer).
Reasoning: state for "user U on day D" must survive until no more events with timestamp in day D can possibly arrive. The last event for day D arrives at most at
end-of-day-D + 90 minutes = 25.5 hours after the start of day D. After that point, the state for (U, D) can be safely expired.TTL too short (e.g., exactly 24 hours): State for the current day is expired before the 90-minute late-arrival window closes. Late events arriving 30โ90 minutes after midnight are processed, find no state, start a fresh count from 1, and produce an undercount for day D. The missing count is silently lost โ no error, just wrong loyalty point totals.
TTL too long (e.g., 30 days): State accumulates for 30 days ร number_of_active_users. For 10 million active users, 30 days of per-user per-day state is 300 million state entries. If each entry is ~50 bytes (user ID + day key + count), that's 15 GB of state โ potentially fine for RocksDB, but checkpoint size and duration will be large. The excess retention serves no functional purpose (no event with a 30-day delay is expected). Correct TTL avoids this waste.
Rule of thumb: TTL = window size + allowed lateness + 10% safety buffer. Not shorter. Not 10ร longer.
Scenario: You want to join two Kafka topics: "ad_clicks" (user clicks an ad) and "conversions" (user makes a purchase). You want to produce a record when an ad click is followed by a purchase within 24 hours โ to attribute the conversion to the ad. Why can't you just join on user_id without a time bound? What goes wrong, and how do you set the bound correctly?
Without a time bound, a stream-stream join on
user_id must buffer every unmatched click event until a matching conversion arrives โ which might be never. If a user clicked an ad in January and never purchased, their click event must be kept in state indefinitely, waiting for a conversion that will never come. For millions of users, this is unbounded state growth. The operator runs out of memory or disk. This is why time-bounded joins are mandatory for stream-stream joins.Setting the time bound correctly: The join window should be the maximum time delta that the business considers an "attribution window." Here, that's 24 hours. In Flink:
- Join clicks and conversions where
|click.eventTime - conversion.eventTime| โค 24 hours- Implemented as an
IntervalJoin: clickStream.intervalJoin(conversionStream).between(Time.hours(-24), Time.hours(0)) โ meaning the click can precede the conversion by up to 24 hours.The state implication: click events are buffered for at most 24 hours waiting for a conversion. After 24 hours, they are expired from the join state. This bounds state to at most 24 hours' worth of unmatched click events โ a constant upper bound, not an infinite one. For 1 million clicks/hour ร 24 hours ร 100 bytes/click = ~2.4 GB of state. Known, bounded, manageable.
Set the time bound too narrow (e.g., 1 hour) and you miss real attributions from users who browse then convert later. Set it too wide (e.g., 30 days) and state balloons. The correct bound is a business decision, not a technical one โ but the technical consequence of the bound (state size) must feed back into that decision.
Scenario: Your team needs to build three separate streaming jobs. For each, pick the best framework from: Apache Flink, Kafka Streams, Spark Structured Streaming. Explain your choice.
- Job A: Real-time fraud detector. Sub-100ms end-to-end latency required. Rich stateful logic: multi-feature velocity checks, async Redis enrichment, model scoring. 50,000 events/second peak.
- Job B: Daily revenue aggregation refreshed every minute. Uses existing Spark cluster. Team is experienced with Spark DataFrame API. Results feed a BI dashboard queried by analysts via SQL.
- Job C: A lightweight per-user session counter embedded inside a Java microservice. No separate cluster. Must deploy as a standard JAR. 5,000 events/second. Low ops overhead priority.
Job A โ Apache Flink. Sub-100ms latency rules out Spark Structured Streaming (micro-batch model has inherent batch interval latency, typically 100msโ1s minimum). Flink's true event-at-a-time processing achieves <10ms operator latency with correct configuration. The async I/O API for Redis enrichment is native to Flink (
AsyncFunction) and is the correct solution for low-latency enrichment. Flink's keyed state with RocksDB handles the stateful velocity checks efficiently at 50k events/sec.Job B โ Spark Structured Streaming. The team has an existing Spark cluster and DataFrame expertise โ reusing both saves months of platform migration. A "refreshed every minute" SLA is a micro-batch model, which is Spark's natural execution model. Analysts querying via SQL is a Spark strength (Spark SQL, Delta Lake integration). The latency requirement (minute-level refresh) does not require true streaming. Operational continuity and DataFrame familiarity outweigh any theoretical performance advantages of Flink here.
Job C โ Kafka Streams. Kafka Streams is a Java library that runs inside an existing JVM process โ no separate cluster, no cluster management, no additional infrastructure cost. It deploys as a standard JAR. For 5,000 events/sec with simple session counting (a single keyed aggregation), Kafka Streams is more than sufficient and dramatically simpler to operate than standing up a Flink cluster. The "low ops overhead" constraint is the deciding factor. Note: Kafka Streams requires Kafka โ if the microservice is already reading from Kafka (as most microservices do in event-driven architectures), this is zero additional infrastructure.
Bug Studies โ When Stream Processors Go Wrong
Theory is clean. Production is messy. These four incidents are composite reconstructions of real failure modes โ the kind of bugs that only manifest at scale, under load, or after a network hiccup at exactly the wrong moment. Each one teaches a principle that no amount of documentation drilling will ingrain as deeply as a war story.
What Went Wrong
The engineer who wrote the job used System.currentTimeMillis() (processing time) to assign timestamps to events. This is the simplest choice โ but it means "when the processor saw the event," not "when the event actually happened." When mobile clients caught up after a network blip, three hours of queued events arrived in minutes. The processor happily stamped them all with the current wall clock. The sliding-window aggregation counted them as current purchases. The dashboard lied for hours until the queue drained.
The diagram shows why the bug is invisible in normal operation. Most of the time events arrive within milliseconds of being created. Processing time โ event time, so the metric is correct. But the moment you have any buffering โ mobile offline mode, a slow consumer, a batch replay โ the gap explodes. Processing-time metrics look accurate right up until the moment they catastrophically mislead you.
CreateTime metadata โ still not perfect, but orders of magnitude better than processor wall clock.
What Went Wrong
A stream-table join is seductive because it looks like a database join. But unlike a database, the stream processor keeps the entire join side in memory / state store across the job's lifetime. Every user who ever existed accumulates in the state store โ including users who haven't clicked in two years. No TTL was configured. The state just grew. Indefinitely.
What Went Wrong
The team set out-of-orderness to 1 second because "our events are nearly real-time." That's true for 11 of the 12 source partitions. But the global watermark is a minimum across all sources. One slow source drags the watermark for everyone. The 1-second tolerance was set for the happy-path case and ignored the tail behavior of any slow partition.
The diagram shows the crux of the problem: the global watermark is the minimum watermark across all live sources. A single slow partition tanks the global watermark for everyone. Events from the fast partitions that arrive with event-time between T70 and T100 are fine โ but events from the slow partition that also have event-time in that range arrive after the window has already closed. They are silently dropped.
What Went Wrong
Exactly-once in a Flink pipeline means: within the Flink job, no event is counted twice. It does NOT mean the downstream consumer receives each message exactly once โ that depends on the delivery guarantee of the sink (Kafka, database, etc.). The engineer treated "exactly-once in the pipeline" as "exactly-once delivery to the consumer," which is a different guarantee. The consumer's home-built dedup was also too aggressive โ its 24-hour window was keyed on event ID, so any retry or delivery with the same event ID within 24 hours was dropped, even if the first attempt had been acknowledged by Flink but not yet consumed by the downstream service.
Real-World Architectures โ How Big Companies Use Streaming
Stream processing is not an academic exercise. It runs the dashboards that show engineers what their systems are doing right now. It powers the fraud scores that fire before a payment completes. It feeds the recommendation models that update based on what users clicked three seconds ago. Here is how five companies actually built their streaming infrastructure โ the architecture shape, the trade-offs they made, and why those choices were right for their context.
This generic shape โ producers โ broker โ stream processor โ multiple sinks โ repeats at every company below. What differs is what runs at each box, how the state is managed, and what latency target drives the design. Understanding the shape means you can read any company's architecture blog post and immediately identify where their innovation actually lives.
Netflix's Mantis platform is a custom-built reactive stream processor (originally running on Apache Mesos with Netflix's own Fenzo scheduler, open-sourced in 2019) designed for what Netflix calls "operational intelligence" โ real-time analysis of the system itself. When a CDN node starts dropping packets, when a client region suddenly spikes error rates, when a specific title starts buffering more than usual โ Mantis detects these events and routes them to the right team within seconds. The scale is genuinely large: Netflix processes trillions of events per day across its observability pipeline, with Mantis covering the operational/observability use cases and a separate Flink-based platform (Keystone) handling analytics workloads.
The interesting architectural choice Netflix made is treating stream processing as a first-class engineering discipline, not a bolt-on analytics tool. Mantis jobs are deployed the same way microservices are deployed โ versioned, rolled out with canaries, rolled back if a health check fails. Jobs have SLOs. The operational model for a stream processing job looks exactly like the operational model for an API service. This is why Mantis can run reliably at scale: the operational discipline comes from applying proven software engineering practices to a domain (streaming) that often gets treated as a "data team problem."
Uber's streaming infrastructure grew from a specific pain point: surge pricing. Surge pricing requires knowing, in real time, the ratio of supply (available drivers) to demand (rider requests) in each geographic cell. A batch job that runs every 10 minutes is useless โ by the time the batch runs, the supply/demand ratio has changed. The latency requirement is under 30 seconds, ideally under 10. This forced Uber to build real stream processing rather than "frequent batch."
Uber settled on Apache Flink as their primary compute engine. The streaming jobs consume from their internal event bus (built on Kafka), compute geographic aggregations using Flink's keyed state, and write results to their operational data stores. The same infrastructure also runs fraud detection โ a transaction arrives, the streaming job enriches it with recent account history (from state), computes a risk score, and writes the score to a cache that the payment authorization service reads before approving the charge. The fraud score must be available within the authorization window, which is typically under 200 milliseconds from the score computation request.
Stripe's fraud system, Radar, is interesting because the latency constraint is genuinely extreme: the fraud score must be computed and returned within the synchronous payment API call. A payment authorization that takes 300 ms instead of 100 ms is visible to users and costs conversion rate. This means the streaming pipeline that builds the features Radar uses must write its outputs to a low-latency read path (typically an in-memory cache or a key-value store optimized for sub-millisecond reads).
The streaming jobs at Stripe continuously update feature vectors: how many payments has this card made in the last 15 minutes? Has this email address appeared in a declined-payment stream? What is the velocity of charges from this IP? These feature vectors are written to a fast cache. When a new payment arrives, Radar reads from the cache (not from the streaming job), computes the fraud score using the pre-computed features, and returns the result synchronously. The streaming job and the scoring service are decoupled โ the stream processor updates features asynchronously, the scorer reads pre-computed features synchronously.
LinkedIn built Apache Samza (now an Apache project) because their use cases โ real-time feed ranking updates, ML feature computation, observability โ required stateful stream processing before Flink existed in its current form. Samza's key design choice was using Kafka both as the message bus and as the durable log for state. Samza operators checkpoint their state to local RocksDB instances and can replay from Kafka to recover โ an elegant architecture that avoids the complexity of a separate state backend like S3 or HDFS.
LinkedIn's feed uses Samza to compute real-time signals: how many connections liked this post in the last 10 minutes, is this post trending in a specific industry, has this author published multiple posts in a short window (potential spam). These signals update the ranking model in near-real time. The streaming job reads from Kafka topics that receive activity events (likes, comments, shares), maintains counts in local RocksDB state, and writes updated signal vectors back to Kafka topics that the feed ranking service consumes.
Pinterest's Goku system handles a specific problem: collecting, aggregating, and serving billions of time-series metrics from their infrastructure. Every service emits metrics. Aggregating them (sum, count, percentile) across many services, hosts, and time windows requires streaming computation. Goku uses a streaming aggregation layer that pre-computes rollups (per-minute, per-5-minute, per-hour) so that dashboards can query pre-aggregated data rather than raw events. The streaming layer makes dashboards fast; without it, every dashboard query would need to scan billions of raw metric points.
The architecture is simpler than Stripe's or Uber's โ no ML scoring, no geographic keyed state. It is essentially: emit metric โ Kafka โ streaming aggregation tier โ aggregated rollup store (Goku) โ dashboard reads from rollup store. The simplicity is deliberate: the team chose the right tool for the problem size rather than over-engineering for theoretical scale they didn't need.
Common Misconceptions โ Getting the Mental Model Right
Stream processing has a set of persistent misconceptions that trip up experienced engineers who are new to it. These aren't beginner mistakes โ they're the kind of subtle wrong assumptions that lead to production incidents. Each one below is a real belief that real engineers have held in real code review discussions.
The wrong belief: Streaming is what you do when you want your batch job to run more often. It's just a batch job with a shorter schedule interval.
Why it's wrong: Batch and stream processing have fundamentally different mental models. In batch processing, data has a clear end โ you load all the rows, process them, write results, done. The data is bounded. You can sort it. You can count it. You can take the max over all of it. In stream processing, data is unbounded โ there is no last row, no end of file, no "I've seen everything." This changes every algorithm. You can't sort an infinite stream. You can't compute the exact median of an infinite stream without storing every element. Windowing (cutting infinity into finite chunks) only exists because you can't do unbounded aggregation. The constraint of unboundedness forces entirely different data structures, algorithms, and operational practices. "Faster batch" will get you killed by the first event that arrives 10 minutes late.
The wrong belief: If my Flink job has exactly-once semantics enabled, my downstream consumers will never see duplicate events, and I don't need to handle idempotency anywhere else.
Why it's wrong: Exactly-once in a stream processing framework is a guarantee about the pipeline's internal state. It means: given a failure and recovery, the pipeline's state (counters, aggregations, joined results) reflects exactly what you'd get if no failure had occurred. It does NOT guarantee that every write to an external system (a database, an HTTP endpoint, a non-transactional Kafka topic) happens exactly once. External side effects must be made idempotent separately. The Flink-to-Kafka exactly-once guarantee requires transactional writes to Kafka and that consumers read with isolation.level=read_committed. For any non-transactional sink (a REST API, a Redis INCR, an S3 file write), you must design for idempotency in the sink itself โ exactly-once in the pipeline doesn't help you there.
The wrong belief: Doubling the number of Flink TaskManagers or Kafka Streams instances will halve the processing time. More parallelism = more speed, always.
Why it's wrong: Parallelism in streaming is bounded by key cardinality for keyed operators. If you're computing per-user aggregations and you have 100,000 distinct active users, then parallelism beyond 100,000 provides no benefit โ there are only 100,000 distinct keys to distribute. In practice, key skew is the more common problem: a small number of "hot keys" (a viral post ID, a popular product ID) receive a disproportionate fraction of events. Increasing parallelism doesn't help the hot-key partition โ it just sits with a backlog while the other partitions are idle. Excessive parallelism also adds overhead: more network shuffles, more state shards to checkpoint, more coordination. For simple filter-and-enrich pipelines with no keyed state, adding parallelism past the input partition count gives zero benefit.
The wrong belief: Apache Flink is the "serious" framework for low-latency streaming. Kafka Streams is for simple use cases. If you need millisecond latency, use Flink.
Why it's wrong: Kafka Streams is a Java library that runs in-process with your application. There is no network hop to a separate cluster โ events are consumed from Kafka, processed in the same JVM, and results are written back to Kafka or to local state. For simple operations (filtering, mapping, per-key aggregations without complex joins), this in-process model can be lower latency than Flink because Flink requires network shuffles between TaskManagers for operations that require data redistribution. Flink's strength is complex stateful operators, long-window aggregations, exactly-once end-to-end, and massive scale with thousands of parallel tasks. For a service that needs to enrich Kafka events with simple logic and write results back to Kafka โ Kafka Streams running in your existing service process often wins on latency, cost, and operational simplicity.
The wrong belief: Lambda architecture (dual batch + stream pipeline) is obsolete. Kappa architecture (stream only) has replaced it. If you're building new, use Kappa.
Why it's wrong: Kappa architecture requires very mature streaming infrastructure: the ability to replay events from storage with the same pipeline that handles live events, operationally stable reprocessing tooling, and a stream processor reliable enough that your "batch" results (reprocessing) are trusted as much as historical batch results. Many organizations don't have this maturity. Lambda architecture's dual-pipeline complexity is a genuine cost โ but the benefit is that the batch path (Spark, Hadoop, BigQuery) can reprocess historical data with the full power of mature batch tooling while the streaming path provides real-time results. For organizations where the streaming team is small, Lambda may be the pragmatic choice because the batch half requires less operational expertise in stream processing. Lambda is not dead โ it's appropriate for a specific maturity level.
The wrong belief: Once a watermark reaches T, it means the system has received all events with event-time โค T. Windows can close with confidence that no late data will arrive.
Why it's wrong: Watermarks are heuristics, not guarantees. They are estimates of how far behind real-world time the stream is. A watermark at T means "we believe most events with event-time โค T have arrived, but we acknowledge some may still be in transit." The out-of-orderness bound you configure is a guess about the worst-case delay in your system. In practice, true late data does arrive after the watermark passes. That's exactly why frameworks provide allowedLateness and side outputs โ to handle the data that arrives after the watermark closes the window. A system designed as if watermarks are guarantees will silently drop late events. A system designed knowing watermarks are heuristics will route late events to a side output and handle them gracefully.
allowedLateness, use side outputs for genuine stragglers, and monitor your late-event rate as a production metric.
The wrong belief: Stateful stream processing โ joins, aggregations with crash-safe state, exactly-once semantics โ is only for large companies with dedicated streaming engineering teams. Most teams should stick to stateless pipelines or simple batch jobs.
Why it's wrong: Modern frameworks have dramatically reduced the complexity of stateful streaming. Kafka Streams' DSL lets you write per-user aggregations with a few lines of code โ the framework handles state persistence, checkpoint recovery, and rebalancing automatically. Flink's SQL and Table API let you write stateful streaming logic that looks like plain SQL. Apache Beam's unified model means you can write once and run on Flink, Spark, or Google Cloud Dataflow. The operational complexity is still real โ you need to think about state TTLs, checkpoint intervals, and reprocessing strategy โ but the programming model complexity is no longer a barrier. A small team with solid Kafka knowledge can successfully operate a Kafka Streams application for common use cases without a dedicated streaming platform team.
Operational Playbook โ Run Stream Processing in Production
Knowing how stream processing works is one thing. Running it in production at 3 AM when a downstream sink is backed up, a slow partition is dragging your watermark, and your checkpoint is taking 4 minutes instead of 30 seconds โ that's a different skill entirely. This playbook is five stages, from "picking a framework" to "optimizing the thing you've been running for six months."
Each stage has its own failure modes, and skipping a stage usually means paying for it in the next stage. Teams that skip "Test" discover their checkpoint recovery doesn't work during a 2 AM outage. Teams that skip "Monitor" don't know their state is growing until the OOM happens. The stages are ordered because each one creates the foundation for the next.
The framework choice locks in your operational model for years. Get this right before writing a line of code.
- Apache Flink โ Choose when: sub-second latency is a hard requirement, you have complex stateful operations (long-window joins, pattern detection, aggregations over large state), you need end-to-end exactly-once to non-Kafka sinks, or you expect to scale to very high event volumes. Operational cost: you need a Flink cluster (or managed Flink on AWS/Confluent/GCP), which means learning JobManager/TaskManager topology, checkpoint configuration, and savepoint management. This is a real operational investment.
- Kafka Streams โ Choose when: your events are already in Kafka, your team is already Kafka-native, you want zero additional infrastructure (it's a Java library), and your operations are relatively simple (filter, enrich, aggregate per key). Operational cost: you manage scaling by deploying more instances of your service โ the same way you'd scale any other microservice. Very approachable for a small team.
- Spark Structured Streaming โ Choose when: you already have a Spark data platform, your team knows DataFrames/SQL, and you're okay with micro-batch latency (typically 1-10 seconds). Strong for unified batch+stream processing. Weaker than Flink for sub-second latency or complex stateful ops.
- Apache Beam โ Choose when: you want portability across execution engines (run on Flink today, migrate to Google Dataflow tomorrow without rewriting). Good for teams in multi-cloud environments. The unified model adds some abstraction overhead โ you may find the native Flink API more expressive for complex jobs.
Before the first event flows through your production job, three foundations must be in place or you will regret it the first time something goes wrong.
- Provision state backend and checkpoint storage first. Configure your Flink job to checkpoint to durable storage (S3, HDFS, or GCS) before going live. The default in-memory state backend loses all state on a TaskManager restart. For Kafka Streams, the state is stored in RocksDB on local disk โ ensure the persistent volume is correctly configured and survives pod restarts in your Kubernetes setup.
- Set up monitoring before traffic. Deploy metrics dashboards before the first event flows. Minimum metrics from day one: consumer lag (per partition), end-to-end processing latency, checkpoint duration, checkpoint size, and state store size. You want a week of "normal" baseline before you need those dashboards to diagnose a problem.
- Design for restart-from-checkpoint from day one. Every job should be structured so that triggering a checkpoint recovery produces identical results to uninterrupted processing. If your job writes to an external database, ensure the writes are idempotent or transactional. If your job calls an external API, ensure the API is idempotent. Checkpoints are useless if recovery causes double-writes or inconsistent state.
Stream processing jobs have failure modes that unit tests and integration tests almost never cover. You must test these explicitly โ in a staging environment that mirrors production โ before you go live.
- Kill a worker and verify checkpoint recovery. Terminate a TaskManager mid-processing while the job is under load. Time the recovery. Verify that the output exactly matches what you'd have gotten without the failure. If recovery takes more than 5 minutes, your checkpoint interval is too long or your state is too large.
- Simulate late data. Replay historical events through your job with event timestamps from 30 minutes ago, 2 hours ago, 1 day ago. Verify that: (a) the late events are routed to the correct windows or to side outputs, (b) no late event causes incorrect output in "current" windows, and (c) your late-event metrics fire correctly.
- Verify reprocessing works end-to-end. Take a savepoint, reset offsets to a point 24 hours ago, restore from the savepoint, and let the job reprocess. Verify that the final output is identical to the output from the original processing run. If it's not โ your job has non-determinism (e.g., it reads from a side database that has changed) and you need to address that before a real reprocessing scenario forces the issue.
Most streaming monitoring setups alert on the wrong metrics โ they alert on CPU and memory (which are lagging indicators) instead of the leading indicators that tell you about imminent problems before they become outages.
- Consumer lag (per partition, not aggregate). A growing per-partition lag means that specific partition's consumer is falling behind. Aggregate lag can look fine while one partition accumulates a multi-hour backlog. Alert on the max-partition lag, not the sum.
- Watermark lag. The difference between wall-clock time and the current watermark. A growing watermark lag means events are arriving more slowly than expected (source is slow, partition is idle, or network has a problem). Alert when watermark lag exceeds 2ร your out-of-orderness bound.
- Checkpoint duration and state size growth rate. A checkpoint that takes longer than 10% of your checkpoint interval will eventually cause issues. State size that grows more than 5% per day will become a problem in 20 days. Alert on growth rates, not just absolute thresholds.
- Late-event drop rate. As discussed in the bug studies: if you're not measuring how many events are being dropped as "too late," you don't know if your watermark configuration is correct. This metric should be zero in normal operation. Any non-zero rate is a signal that your out-of-orderness bound is too tight.
Optimization happens after you've been running in production long enough to have real bottleneck data. Premature optimization in streaming is particularly wasteful โ the bottleneck is almost never where you'd guess without profiling. Here are the optimizations that most frequently matter:
-
Async I/O for external lookups. If your job needs to look up data from an external database or API (e.g., enrich events with user data from Postgres), use Flink's
AsyncDataStreamor Kafka Streams' async processor. Synchronous external calls inside amap()operator are the single most common cause of low throughput in streaming jobs โ the entire pipeline waits for each external call. - Tune RocksDB for your workload. Flink's default RocksDB configuration is conservative. For read-heavy jobs (lots of lookups into state), increase the block cache size. For write-heavy jobs (constant state updates), tune the write buffer size. Incorrect RocksDB tuning can cause 3-5ร performance degradation versus optimal settings.
- Enable incremental checkpoints. Full checkpoints copy the entire state on each checkpoint interval. Incremental checkpoints copy only the delta since the last checkpoint. For jobs with large state (10+ GB), enabling incremental checkpoints can reduce checkpoint duration from minutes to seconds.
- Detect and mitigate key skew. If one key receives 1,000ร more events than average (a viral post ID, a shared service account), that key's partition becomes a hotspot. Strategies: add a random salt suffix to hot keys before aggregation, then aggregate the salted results in a second pass; or detect hot keys and route them to a separate "hot key" pipeline with higher parallelism.
Cheat Sheet & Glossary
A quick-reference summary of the key rules and vocabulary from this entire page. Use this during review, before an interview, or when you need to quickly re-orient yourself after a few weeks away from streaming work.
Quick-Reference Rules
Glossary
- Dataflow graph
- The directed acyclic graph of operators (source โ transform โ sink) that defines a streaming job. Each node is a parallelizable compute step.
- Operator
- A single step in the dataflow graph. Can be a source (emits events), a transformation (map, filter, join, aggregate), or a sink (writes results). Operators run in parallel across multiple workers.
- Source
- The origin of a stream. Common sources: Kafka topic, Kinesis stream, a database CDC stream, a socket. The source operator reads from the external system and emits events into the dataflow graph.
- Sink
- The destination for processed events. Common sinks: Kafka topic, PostgreSQL, Redis, Elasticsearch, S3. For exactly-once delivery, sinks must be transactional or idempotent.
- Watermark
- A monotonically increasing timestamp that signals "we believe all events with event-time โค T have (mostly) arrived." Used to trigger window computations. A heuristic, not a guarantee. The global watermark is the minimum across all active source partitions.
- Window
- A finite, bounded slice of an infinite stream over which an aggregation is computed. Three types: tumbling (fixed, non-overlapping), hopping (fixed size, sliding step), session (gaps between events define boundaries).
- Keyed state
- State scoped to a specific key (e.g., user ID, session ID). Partitioned across workers by key hash. Each worker owns a subset of keys and maintains their state independently. The most common form of streaming state.
- Checkpoint
- A periodic, consistent snapshot of all operator state written to durable storage. Enables recovery from worker failure by restoring state and replaying events from the checkpoint's source offset. Created automatically on a configured interval.
- Savepoint
- A manually triggered checkpoint used for intentional job restarts โ upgrades, topology changes, A/B testing. Unlike automatic checkpoints (which Flink may delete after recovery), savepoints are explicitly managed and retained until manually deleted.
- Exactly-once
- A delivery semantic: despite failures and retries, each event affects the pipeline's state exactly once. Implemented via distributed snapshots (checkpoints) + transactional sink writes. Applies to the pipeline's internal state โ not necessarily to all external side effects.
- Backpressure
- The mechanism by which a slow downstream operator signals upstream operators to slow down. Prevents buffer overflow and out-of-memory errors. In Flink, backpressure propagates through the network buffer pool. Sustained backpressure indicates a throughput imbalance that needs resolution.
- Key skew
- When the data distribution across keys is uneven โ a small number of keys receive a disproportionate fraction of events. Causes: viral content IDs, shared service accounts, geographic hotspots. Creates hotspot partitions where one worker is overloaded while others are idle.