Kafka Deep Dive — System Guide

Section 1

TL;DR — Kafka in Plain English

Why Kafka is fundamentally different from a message queue — it's a distributed, append-only log that retains messages for days or weeks instead of deleting them on consume
The three-layer structure every Kafka deployment shares: topics (logical channels) → partitions (sharding units) → log files (physical storage on each broker)
How the three superpowers work — replay lets a consumer rewind history, partitions give linear throughput scaling, and replication via ISR gives durability without a single point of failure
What changed in Kafka 4.0: ZooKeeper is gone, replaced by KRaft (Kafka Raft), making Kafka a fully self-contained distributed system
How consumer groups coordinate so multiple consumers share a topic's partitions without stepping on each other's toes

Kafka is not a queue — it's a distributed, append-only log. Every message written to Kafka gets a permanent sequence number called an offset and lives on disk for as long as you configure retention (hours, days, forever). Any consumer can read from any offset at any time, independently of every other consumer. That one design decision — keep the data, don't delete it — unlocks replayability, horizontal throughput scaling via partitions, and strong durability via replication, all at once. If the Message Queues page showed you what a queue is, this page shows you what happens when you throw the queue assumption away entirely.

The core insight: a traditional queue deletes a message as soon as a consumer acknowledges it. Kafka does the opposite: it appends every message to the end of a log file and keeps it there. Each message gets a monotonically increasing number called an offset, like line numbers in a text file. A consumer says "give me message 4,712" and Kafka reads exactly that line. Multiple consumers can read the same log simultaneously, each tracking their own offset. When a new analytics team needs the last 30 days of order events, they don't ask the order service to re-send anything; they rewind to the earliest offset still inside the retention window and replay the log themselves. This is the property that makes Kafka indispensable for data pipelines, event sourcing, and ML training jobs.

The log abstraction gives you three things ordinary queues cannot deliver together. First, replayability: because messages aren't deleted on consume, any consumer, new or existing, can rewind to any retained point in history and reprocess. Second, parallelism via partitions: a single topic is split into N independent partition logs, spread across the brokers in the cluster. A consumer group can assign one consumer per partition, so you get up to N-way parallelism by increasing partition count. Third, durability via replication: each partition is copied to R brokers (typically R=3). One copy is the leader that handles all writes and, by default, all reads; the others are followers that pull data from the leader. Even if a broker dies, the remaining in-sync copies hold every acknowledged message, so nothing is lost.

Every Kafka cluster used to require a separate ZooKeeper ensemble to track which broker is the controller, store topic metadata, and manage leader elections. ZooKeeper was a second distributed system you had to operate alongside Kafka, which doubled the operational complexity. Starting in Kafka 2.8 (experimental) and Kafka 3.3 (production-ready), Kafka introduced KRaft (Kafka Raft): a built-in consensus protocol that handles metadata and leader elections entirely inside Kafka itself. In Kafka 4.0 (released 2025), ZooKeeper mode was removed entirely. A modern Kafka cluster is now a single, self-contained system. It is simpler to deploy and operate, and its metadata layer is designed to scale to far more partitions than the ~200k practical limit the old ZooKeeper bottleneck imposed, although every partition still carries broker-level overhead.

Kafka is an append-only, distributed log — not a queue. Messages are retained for a configurable period and addressed by offset, so any consumer can replay history independently. Topics are sharded into partitions for throughput, and each partition is replicated across brokers for durability. Kafka 4.0 is fully self-contained: ZooKeeper is gone, replaced by the built-in KRaft consensus protocol.

Section 2

Why You Need This — When Queues Aren't Enough

You've already learned what a message queue does. So why does Kafka exist at all? The answer is that traditional queues — RabbitMQ, SQS, ActiveMQ — solve the current delivery problem beautifully, but they're fundamentally built around one assumption: once a message is consumed, it's done. Delete it. Move on. That assumption works fine until you hit three problems that every growing engineering team hits eventually.

The Story: An E-Commerce Company Hits the Queue Wall

Imagine an e-commerce platform that started simple. Six months ago, the team wired up RabbitMQ for "OrderPlaced" events. The order service publishes to an exchange, and two consumers — an email service and an inventory service — each get a copy via separate queues. It works great. But then three things happen.

Problem 1 — Replay is impossible. The ML team wants to train a recommendation model on the last 30 days of order events. They ask the backend team: "Can we get those events?" The answer is no. RabbitMQ deleted every message the moment the email and inventory consumers acknowledged it. The data is gone. The ML team has to reconstruct 30 days of history from database logs — a painful, error-prone, week-long effort.

Problem 2 — Throughput hits a ceiling. The marketing team runs a flash sale. Order volume spikes 10×. There's one "OrderPlaced" queue, one exchange, and consumers chewing through messages as fast as they can. The queue grows faster than consumers drain it. You can add more consumer processes, but they all compete for the same queue — and the queue itself becomes a bottleneck. The broker serializes every enqueue and dequeue through a single logical channel. You can't parallelise across the single queue without routing logic hacks.

Problem 3 — Every new downstream consumer is expensive. Analytics, search re-indexing, fraud detection, and a real-time dashboard all want "OrderPlaced" events. With RabbitMQ pub/sub, each consumer gets its own queue bound to the exchange. That's fine for 2 consumers, but at 8 consumers you're managing 8 queues, 8 bindings, and 8 independent sets of consumers, each queue holding its own copy of every message routed from the exchange. If one consumer falls behind, its queue grows without bound, consuming broker memory. Each new team adds operational weight.

The Kafka Switch: What Changes

The team migrates to Kafka. They create one topic called order-events with 50 partitions. Here's what those three problems look like now.

Replay fixed: Kafka retains every event for the configured retention period (7 days by default, longer if you want). When the ML team needs 30 days of data, and retention on the topic is set to at least 30 days, they spin up a new consumer group starting at the earliest retained offset. They reprocess the entire log at their own pace without touching the production consumers at all. The data was always there; Kafka just held it.

Throughput ceiling gone: 50 partitions distribute across the cluster. A consumer group for the email service can run 50 consumer processes, each assigned one partition, for up to 50× parallel throughput. Need 100×? Create 100 partitions. Throughput scales with partition count, and partition count is a setting you choose per topic. It can be increased later but not decreased, and increasing it changes which partition a given key maps to.

New consumers are free: Every consumer group maintains its own offset pointer per partition. Adding the fraud detection team is a matter of declaring a new consumer group name. They start reading from offset 0 (or from "now") without affecting the email or inventory groups. No new queues. No new bindings. No additional broker memory per consumer. Just a new offset bookmark.

The fundamental difference. A traditional queue has one reader position per queue — once you consume message N, it's gone for every consumer. Kafka has N reader positions — one per consumer group per partition — all consuming the same underlying log. This is not an optimisation; it's a fundamentally different data structure. A queue is a conveyor belt that disposes of items. Kafka is a shared library where every book stays on the shelf.

The diagram above shows the core asymmetry. On the left, a traditional queue has one position that advances as the consumer reads — the data behind the cursor is gone. On the right, Kafka's log keeps every message at its offset forever (within your retention window). Three consumer groups — email, inventory, and the ML team — all read the same log, each with its own offset bookmark. The ML team rewound to offset 0 and is replaying history while email and inventory are already at the newest message (offset 7). None of them interfere with the others.

Traditional queues delete messages after consumption, which creates three hard limits: no replay, no parallelism beyond a single queue, and growing operational overhead per downstream consumer. Kafka solves all three by replacing the queue with an append-only log. Messages live at a fixed offset for as long as you configure, and every consumer group maintains its own offset bookmark — making new consumers free and replay trivial.

Section 3

Mental Model — The Distributed Append-Only Log

Before diving into Kafka's components, you need one crystal-clear mental model. Everything in Kafka — partitions, offsets, consumer groups, replication — follows logically from this single idea. Get this image locked in your head and the rest of the page will feel obvious.

The Notebook Analogy

Imagine a shared notebook in a busy office. Three rules govern how people use it: (1) you can only write at the end — nobody inserts pages in the middle or deletes existing entries; (2) every entry is automatically numbered sequentially — 1, 2, 3, … forever; (3) anyone can come back and read any entry by its number, at any time, as many times as they want.

That notebook is a log. And Kafka is a networked, distributed version of this notebook — fast enough to handle millions of entries per second, durable enough to survive machine failures, and big enough to store weeks or months of history.

The offset is everything. An offset is just an integer: the position of a message in the log. Offset 0 is the very first message. Offset 4,712 is the 4,713th message. A consumer says "I want everything from offset 4,712 onwards" and Kafka reads exactly that. The consumer stores its current offset (where it's up to) somewhere, by default in Kafka itself, in a special topic called __consumer_offsets. When the consumer restarts, it resumes from where it left off.
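The three notebook rules can be sketched as a toy in-memory log in Python. This is only an illustration of the idea: the AppendOnlyLog class and its method names are made up for this example and are not part of any Kafka client API.

```python
class AppendOnlyLog:
    """Toy single-partition log: append at the end, read by offset."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store the value at the end and return the offset assigned to it."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset, max_records=100):
        """Return (offset, value) pairs starting at `offset`; nothing is deleted."""
        window = self._records[offset:offset + max_records]
        return list(enumerate(window, start=offset))


log = AppendOnlyLog()
offsets = [log.append(event) for event in ["order-1", "order-2", "order-3"]]
first_read = log.read_from(1)
second_read = log.read_from(1)  # reading again returns the same records
```

Reading is a pure lookup by position, which is why the second read returns exactly what the first one did.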

Now Shard the Notebook — Partitions

One notebook gets full eventually. And one person writing in it at a time is a throughput limit. The solution: split the notebook into N parallel notebooks, called partitions. Each partition is its own independent log — its own offsets starting from 0, its own sequence of messages. A topic is just the name you give to a collection of related partitions.

Producers write to partitions in parallel. Consumers read from partitions in parallel. Because partitions are independent, you can distribute them across multiple machines — each machine (called a broker) handles a subset of partitions, so no single machine is the bottleneck. This is how Kafka scales linearly: double the partition count, roughly double the throughput.

Ordering insight: Messages within one partition are always delivered in offset order, guaranteed. But there is no ordering guarantee across different partitions. If ordering matters for a set of messages (e.g., all events for user #1234 must be processed in sequence), you must route those messages to the same partition using a message key. More on this in Section 6.

The diagram shows topic "order-events" split into three partitions spread across three brokers. The email-service consumer group (green) is reading partition 0 up to offset 6, while the analytics consumer group (amber) is only up to offset 2 on the same partition. They both read the same data from the same log — broker 1 doesn't know or care how many consumer groups are reading from it. Each group just tracks its own offset bookmark. Adding a new consumer group is literally free: just declare a group name and start reading.
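The "bookmark per group" idea can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: one partition, offsets kept in a local dictionary rather than in __consumer_offsets, and a hypothetical poll helper that is not the real consumer API.

```python
from collections import defaultdict

partition_log = [f"event-{i}" for i in range(7)]   # offsets 0..6 on one partition

# One committed offset per consumer group: the next offset that group will read.
committed = defaultdict(int)


def poll(group, max_records):
    """Return the next records for `group` and advance only that group's bookmark."""
    start = committed[group]
    batch = partition_log[start:start + max_records]
    committed[group] = start + len(batch)
    return batch


poll("email-service", 7)                    # email has read up to offset 6
poll("analytics", 3)                        # analytics has read up to offset 2
late_joiner = poll("fraud-detection", 2)    # a brand-new group starts at offset 0
```

Adding the third group changed nothing in the log and nothing in the other groups' bookmarks, which is the sense in which a new consumer group is free.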

Kafka's mental model is simple: a topic is a collection of append-only logs (partitions), where every message gets a permanent offset. Consumers are just bookmarks — they track their position in each partition and advance it as they read. Multiple consumer groups can read the same log simultaneously with zero interference. Distributing partitions across brokers gives you linear throughput scaling with no single bottleneck.

Section 4

Core Concepts — The Kafka Vocabulary

Kafka has a precise vocabulary. Before you can read the official docs, debug a cluster, or talk to a colleague about Kafka architecture, you need these terms locked in. Every term below is introduced with plain English first, then the Kafka name.

How to read this section: Plain English description comes first. The bold Kafka term follows.

The vocabulary map above shows a three-broker cluster hosting two topics ("orders" and "payments"), each with multiple partitions. Purple cells are leaders — they handle all reads and writes. Gray cells are replicas — they mirror the leader's data in case the leader dies. Notice that leaders are spread across brokers: this distributes I/O load evenly. Now let's define each term precisely.

Infrastructure Terms

Broker — A single Kafka server. A broker stores some partitions on disk, handles connections from producers and consumers assigned to those partitions, and participates in leader elections. A broker is just a process running on a machine — scale the cluster by adding more brokers.
Cluster — A group of brokers working together. A Kafka cluster appears as a single logical system from the outside, but internally it distributes data and load across all its brokers.
Controller — One broker in the cluster is elected the controller. It's the cluster's "manager" — it tracks which brokers are alive, decides which broker should be the leader of each partition, and handles broker failures. In Kafka 4.0 (KRaft mode), controllers form a Raft quorum to coordinate without ZooKeeper.
KRaft — Short for "Kafka Raft." The built-in consensus protocol that replaced ZooKeeper for storing cluster metadata and running leader elections. A KRaft controller quorum is just a subset of Kafka brokers that store metadata — no separate system to operate.

Data Model Terms

Topic — The logical name for a stream of related events. Think of a topic like a table name in a database — it's how you refer to a category of events ("order-events", "payment-updates", "user-activity"). Internally, a topic is just a collection of N partitions.
Partition — One ordered, immutable log within a topic. A partition is the fundamental unit of Kafka's parallelism and distribution — each partition lives on one broker (its leader), can be written to and read from independently, and has its own offset sequence starting from 0.
Offset — The position number of a message within a partition. An offset is just an integer (0, 1, 2, …) that permanently identifies a message's position. Offsets never change or repeat. A consumer stores its current offset to know where to resume after a restart.
Log Segment — A partition is physically stored as a set of files on disk called log segments. When a segment reaches a configured size limit (default: 1 GB) or time limit, Kafka closes it and starts a new one. Retention and compaction policies operate at the segment level — Kafka can delete old segments wholesale rather than seeking individual messages.
Message (Record) — The unit of data in Kafka. A Kafka record has a key (optional bytes — used to determine which partition the message goes to), a value (the payload bytes — your actual data), optional headers (metadata like correlation IDs or content-type), and a timestamp.

Replication Terms

Replica — A copy of a partition stored on a different broker. Each partition has 1 leader and R-1 follower replicas. The replication factor (RF) is how many total copies exist — RF=3 means 1 leader + 2 followers.
Leader — The one replica of a partition that handles all producer writes and (by default) all consumer reads. The leader is elected by the controller when the cluster starts or when a broker failure occurs.
In-Sync Replicas (ISR) — The set of replicas currently caught up to the leader. The ISR is the safety net: only replicas in the ISR are eligible to be elected as the new leader if the current leader dies. If the ISR shrinks to 1 (just the leader), there is no redundancy left: losing the leader then means either waiting for it to come back or losing data.

Producer / Consumer Terms

Producer — The component that writes messages to Kafka. A producer sends records to the leader replica of each target partition. It can choose which partition to use via a key, a custom partitioner, or leave it to Kafka's default partitioner, which spreads keyless records across partitions (sticky batching since Kafka 2.4, round-robin before that).
Consumer — The component that reads messages from Kafka. A consumer polls a broker for records, advancing its offset as it processes each one. Crucially, consuming a record does not delete it — the offset just moves forward in the consumer's own tracking.
Consumer Group — A named set of consumers that cooperate to read a topic. Kafka assigns each partition to exactly one consumer in the group at any moment, so each record is handled by one member of the group rather than by all of them. Delivery is at-least-once by default, so a record can be redelivered after a crash or rebalance. Multiple groups can read the same topic simultaneously, each at their own pace.
Exactly-Once Semantics (EOS) — A delivery guarantee where each message is processed exactly once end-to-end, with no duplicates even if a producer retries or a consumer crashes mid-batch. EOS is built on two pillars: the idempotent producer (retrying a duplicate send is a no-op) and Kafka transactions (atomic multi-partition writes committed or aborted as a unit).
Retention and Compaction — Two strategies for managing how long data stays in Kafka. Time-based retention deletes messages older than a threshold (e.g., 7 days). Log compaction keeps only the latest value per unique key — useful for maintaining current state (e.g., the latest profile per user_id) rather than a full history.
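The end state of log compaction, keeping only the latest value per key, can be sketched in Python as follows. This is a simplified model: the compact function is hypothetical, and real compaction runs in the background on closed log segments and also handles delete markers, which this sketch ignores.

```python
def compact(records):
    """Keep only the latest value per key, preserving each survivor's original offset."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)
    return sorted((offset, key, value) for key, (offset, value) in latest.items())


records = [
    ("user-1", "alice@old.example"),   # offset 0, superseded by offset 2
    ("user-2", "bob@example.com"),     # offset 1
    ("user-1", "alice@new.example"),   # offset 2
]
compacted = compact(records)
```

Surviving records keep their original offsets, so offsets stay stable identifiers even after older values for a key are removed.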

Ecosystem Terms

Kafka Connect — A framework for moving data between Kafka and external systems (databases, S3, Elasticsearch) using pre-built, reusable connectors. No custom producer or consumer code needed.
Kafka Streams — A Java library for building stateful stream processing applications on top of Kafka. Kafka Streams reads from Kafka topics, processes the stream (filter, join, aggregate, window), and writes results back to Kafka — all within a regular Java application, no separate processing cluster required.
Schema Registry — An external service (from Confluent) that stores the schemas for messages in Kafka topics. A Schema Registry ensures producers and consumers agree on the structure of a message — and allows schemas to evolve without breaking existing consumers.
MirrorMaker 2.0 — The tool for replicating Kafka topics across clusters or data centers. MirrorMaker 2 is built on Kafka Connect and handles offset translation, topic renaming, and bidirectional replication between clusters.

Kafka's vocabulary maps cleanly onto a physical picture: a cluster of brokers stores partition-replicas of topics. One replica per partition is the leader (handles I/O); others are followers (pull to stay in sync). The in-sync replica set (ISR) determines durability — only ISR members can be promoted to leader. Consumer groups maintain independent offset bookmarks, making replay and multi-team consumption effortless. Kafka 4.0 runs this entire system without ZooKeeper, using KRaft.

Section 5

The Anatomy of a Topic — Partitions, Replicas, Brokers

A topic is deceptively simple to define ("a named stream of events") but understanding what it looks like inside a running cluster is where Kafka finally clicks. Let's build the picture layer by layer — topic → partitions → replicas → brokers — and do the math that shows why each design choice was made.

Layer 1: Topic → Partitions (Sharding)

A topic is the user-facing name — "order-events", "user-clicks". Internally, Kafka immediately divides it into N independent sub-logs called partitions. Why? Because one log on one machine has a throughput ceiling set by that machine's disk and network. Splitting into N partitions lets N different brokers handle different slices of the write load in parallel. The producer sends different messages to different partitions simultaneously — true parallel writes with no coordination needed between partitions.

Layer 2: Partitions → Replicas (Durability)

Each partition is copied to R brokers; R is the replication factor. Typical production value: RF=3. Why 3? Combined with min.insync.replicas=2, it lets you lose one broker and still accept acks=all writes with two live copies. With RF=3 and the ISR shrinking to 2 (one broker died), you can still elect a new leader from the ISR: no data loss, no downtime. With RF=2, if the leader fails and the follower is even slightly behind, you face a hard choice between waiting for the follower or accepting potential data loss.

The math: a topic with 10 partitions × RF=3 means 30 partition-replicas spread across your brokers. If you have 5 brokers, each broker holds about 6 partition-replicas on average. Kafka tries to spread leaders evenly — roughly 2 leader partitions per broker — so write load is balanced.
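The arithmetic above can be checked with a small Python sketch. The replica_layout function is a hypothetical, simplified placement (plain round-robin, first replica is the leader); Kafka's real assignment also accounts for racks and a randomised starting broker.

```python
def replica_layout(partitions, replication_factor, brokers):
    """Assign replicas round-robin; the first replica of each partition is its leader."""
    assignment = {}
    for p in range(partitions):
        assignment[p] = [(p + i) % brokers for i in range(replication_factor)]
    return assignment


layout = replica_layout(partitions=10, replication_factor=3, brokers=5)

total_replicas = sum(len(replicas) for replicas in layout.values())
replicas_per_broker = [
    sum(b in replicas for replicas in layout.values()) for b in range(5)
]
leaders_per_broker = [
    sum(replicas[0] == b for replicas in layout.values()) for b in range(5)
]
```

With this even placement each of the 5 brokers ends up holding 6 of the 30 partition-replicas and leading 2 partitions.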

Layer 3: Leader vs Follower

Among the 3 replicas for any partition, exactly one is the leader. The other two are followers. Here's the critical rule: all producer writes go to the leader. Followers continuously fetch (pull) new data from the leader to stay in sync. By default, all consumer reads also go to the leader — though Kafka 2.4+ introduced "follower fetching" for consumers in the same rack as a follower, which can reduce cross-AZ network costs.

The leader tracks which followers are "caught up" — these form the ISR (in-sync replicas). A follower is considered in-sync if it has fetched all messages up to the leader's last acknowledged offset within a configurable lag window (default: 30 seconds, set by replica.lag.time.max.ms). If a follower falls behind — say a broker is under heavy load — it drops out of the ISR. Now the tricky case: if the leader dies and the ISR is empty (only the dead leader was in-sync), Kafka faces a forced choice — wait for an in-sync replica to come back (availability loss) or promote a behind-the-times replica that's missing some messages (data loss risk). That second option, promoting a non-ISR replica, is what Kafka calls an unclean leader election.

The ISR is your durability contract. When a producer sends a message with acks=all (or acks=-1), Kafka waits for all ISR members to confirm they've written the message before acknowledging the producer. This means as long as at least one ISR member survives, no acknowledged message is ever lost. The ISR size is the real-time indicator of your cluster's durability health.
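One way to picture the acks=all contract is through the high watermark: the point up to which every in-sync replica has the data. The Python sketch below is a toy model with made-up broker names and offsets, not broker code.

```python
def high_watermark(log_end_offsets, isr):
    """Everything below the returned offset is present on every in-sync replica."""
    return min(log_end_offsets[replica] for replica in isr)


# Log end offset (the next offset to be written) per replica; broker-1 is the leader.
log_end_offsets = {"broker-1": 10, "broker-2": 10, "broker-3": 8}

# broker-3 is still in the ISR but has only fetched offsets 0..7.
hw_full_isr = high_watermark(log_end_offsets, isr={"broker-1", "broker-2", "broker-3"})

# broker-3 lagged too long and was dropped from the ISR.
hw_shrunk_isr = high_watermark(log_end_offsets, isr={"broker-1", "broker-2"})
```

In the first case the records at offsets 8 and 9 are not yet acknowledged to an acks=all producer; once the lagging follower leaves the ISR, the remaining members already agree and the watermark moves to 10.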

The diagram shows three brokers each holding all three partitions — one as a leader and two as followers. This is the ideal distribution: leaders are spread evenly (one leader per broker per topic), so write load is shared equally. Follower arrows flow from leader to follower — followers pull from their leader. When Broker 3 fails, the partition-2 leader is gone, but Brokers 1 and 2 have in-sync followers — one is immediately elected as the new leader. Producers are transparently redirected to the new leader. From the producer's perspective, the failure was a brief pause of milliseconds to seconds, not a data loss event.

A Kafka topic is sharded into N partitions for throughput and each partition is replicated to R brokers for durability. One replica is the leader (handles all I/O); others are followers that pull data from the leader. The in-sync replica set (ISR) defines which followers are caught up and eligible to become the new leader on failure. With RF=3 and acks=all, you can survive a single broker failure with zero data loss. Leaders are spread across brokers to balance write load evenly.

Section 6

The Math of Partition-Based Parallelism — Why Kafka Scales

Partitions are Kafka's magic scaling knob. But the scaling is not infinite or free — there are hard limits and real trade-offs. This section explains exactly how the parallelism works, how to size partition counts, what the ordering trade-off is, and how the partition key determines which partition a message lands in.

The Fundamental Rule: Partitions = Max Parallelism

Kafka assigns each partition in a topic to at most one consumer per consumer group at a time. This is the rule that makes everything else fall out logically.

1 consumer, 4 partitions: the single consumer reads all 4 partitions sequentially (actually interleaved by polling). Single-threaded; doesn't scale.
4 consumers, 4 partitions: perfect parallelism. Each consumer owns one partition. All 4 consumers run simultaneously, each processing their partition's messages independently. This is maximum throughput for this topic.
6 consumers, 4 partitions: 4 consumers are active (one each), 2 consumers sit idle. Kafka won't assign a partition to two consumers in the same group. The extra consumers are hot standbys — if a consumer dies, Kafka rebalances and one of the idle consumers picks up the orphaned partition.

The ceiling: You can never get more parallelism from a consumer group than the number of partitions. If you need more throughput, increase partition count. Partition count is the only lever — adding more consumers beyond the partition count does nothing for throughput.
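The three cases above can be reproduced with a small Python sketch. The assign function is a simplified stand-in that deals partitions out round-robin; the assignment strategies shipped with Kafka differ in detail, but all respect the one-owner-per-partition rule.

```python
def assign(partitions, consumers):
    """Round-robin partitions over consumers; each partition gets exactly one owner."""
    owned = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owned[consumers[index % len(consumers)]].append(partition)
    return owned


partitions = [0, 1, 2, 3]
one = assign(partitions, ["c1"])
four = assign(partitions, ["c1", "c2", "c3", "c4"])
six = assign(partitions, ["c1", "c2", "c3", "c4", "c5", "c6"])

# Consumers that received no partition sit idle as standbys.
idle = [consumer for consumer, owned in six.items() if not owned]
```

With six consumers and four partitions, the last two consumers own nothing until a rebalance hands them an orphaned partition.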

How to Size Partition Count

Here's the question every Kafka engineer faces: "How many partitions should I create for this topic?" The honest answer is: start with an estimate, then adjust. Here's the reasoning framework.

Step 1 — Estimate throughput need. Suppose your topic needs to handle 200 MB/s peak write throughput.

Step 2 — Estimate throughput per partition. Per-partition throughput depends on message size, broker hardware, replication factor, and network; published benchmarks land roughly in the 10–100 MB/s range. For conservative planning, estimate 10–20 MB/s per partition under production load.

Step 3 — Divide. 200 MB/s ÷ 20 MB/s per partition = 10 partitions minimum; at the more cautious 10 MB/s figure it is 20. Round up for headroom. A common starting point for high-volume topics: 12, 24, or 48 partitions (multiples of common consumer counts).
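The sizing steps reduce to one division and a round-up, sketched here in Python. The min_partitions helper and the 1.5× headroom factor are illustrative assumptions; the 200 MB/s and 10–20 MB/s inputs are the planning numbers used above.

```python
import math


def min_partitions(peak_mb_per_s, per_partition_mb_per_s, headroom=1.0):
    """Smallest partition count that covers the peak, with an optional safety factor."""
    return math.ceil(peak_mb_per_s * headroom / per_partition_mb_per_s)


at_20_mb = min_partitions(200, 20)                      # upper planning figure
at_10_mb = min_partitions(200, 10)                      # lower planning figure
with_headroom = min_partitions(200, 20, headroom=1.5)   # assumed 50% headroom
```

The spread between the results shows how sensitive the answer is to the per-partition estimate, which is why measuring on your own hardware beats any rule of thumb.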

More partitions = more overhead. Partitions are not free. Each partition is a directory on disk with multiple log segment files. More partitions mean: (1) more open file handles per broker; (2) longer leader-election time if a broker fails (more partitions to reassign); (3) longer consumer group rebalance time (more partitions to redistribute); (4) more memory on the controller for metadata. The practical limit for most clusters is tens of thousands of partitions total — not millions. KRaft (Kafka 4.0) dramatically raises this ceiling compared to ZooKeeper, but the broker-level overhead per partition still exists.

Partition Keys and Ordering

When a producer sends a message to Kafka, it can optionally include a partition key: an arbitrary byte string attached to the message. The producer's partitioner hashes the key and takes the result modulo the partition count to decide which partition the message goes to.

The magic: all messages with the same key go to the same partition, as long as the topic's partition count does not change. And within a partition, messages are strictly ordered by offset. So if you use user_id as your key, all events for user #5,001 land in the same partition, in arrival order. Your consumer for that partition sees user #5,001's events in the order they were written, with no reordering needed.

Key design is ordering design. The key determines what ordering guarantees you get. Use user_id → per-user ordering. Use order_id → per-order ordering. Use null (no key) → records are spread across all partitions (sticky batching in Kafka 2.4+), with no ordering guarantee but maximum throughput distribution. Choosing the wrong key can create a hot partition: one partition receiving far more messages than others because a particular key is very common (e.g., using country code as a key when 70% of traffic is from one country).

The diagram shows four messages from a producer. Messages A and B both carry the key "user:1001" — both hash to partition-2. Message C carries "user:9999" — hashes to partition-0. Message D carries "order:ABC" — hashes to partition-1. A consumer reading partition-2 will always see message A before message B, because they arrived in that order and that order is preserved by the log. This gives you per-user, per-order, or per-entity ordering guarantees without any coordination between partitions.
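Key-to-partition mapping and the hot-partition effect can both be seen in a short Python sketch. Note the substitution: zlib.crc32 stands in for the hash here purely for illustration, so the partition numbers it produces will not match what a real Kafka producer computes. The partition_for helper and the 12-partition topic are also assumptions.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Hash the key bytes and take the result modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Same key and same partition count: always the same partition.
first = partition_for("user:1001", 12)
second = partition_for("user:1001", 12)

# Skewed keys create a hot partition: 70% of the traffic shares one key.
traffic = ["country:US"] * 70 + [f"country:{i}" for i in range(30)]
load = Counter(partition_for(key, 12) for key in traffic)
hottest_share = max(load.values()) / len(traffic)
```

However many partitions the topic has, the partition that receives the dominant key carries at least 70% of the load, so adding partitions does not fix a skewed key.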

Kafka's throughput scales by adding partitions. The rule is rigid: maximum consumer group parallelism equals partition count, and adding more consumers than partitions does nothing for throughput (though it adds failover capacity). Choose partition count based on your peak throughput requirement divided by per-partition capacity, then round up for headroom. Use a partition key (e.g., user_id) when you need ordering within a logical entity — same key always maps to the same partition, and same partition means ordered delivery. Beware hot partitions from skewed key distributions.

Section 7

Producer Deep Dive — How Messages Get Into Kafka

Before a single byte hits a Kafka broker, the producer has already done a surprising amount of work. It serialized your object into bytes, decided which partition to send it to, bundled it with other messages to reduce network overhead, optionally compressed the batch, and picked a durability level that trades latency for safety. Understanding each of those steps lets you tune Kafka for your specific workload — whether you need maximum throughput, minimum latency, or zero data loss.

Step 1 — Serialization: Objects to Bytes

Kafka is fundamentally a byte store. It has no idea whether you're sending JSON, Avro, Protobuf, or raw strings. The producer's first job is to take your key and value objects and convert them to byte arrays using a Serializer: a component that converts a Java (or other language) object into a byte array for network transmission. The matching Deserializer on the consumer side converts it back. You configure one serializer for the key and one for the value. A common mistake is serializing the same logical key differently in different producers (for example, StringSerializer in one and IntegerSerializer in another). Partitioning hashes the serialized bytes, so two encodings of the same key can silently land on different partitions and break per-key ordering.
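Why the choice of key serializer matters can be seen by encoding the same logical key two ways. The Python sketch below uses two hypothetical helpers that mimic a UTF-8 string encoding and a 4-byte big-endian integer encoding; they are illustrations, not the Kafka serializer classes themselves.

```python
import struct


def serialize_string(key):
    """Encode the key's text form as UTF-8 bytes."""
    return str(key).encode("utf-8")


def serialize_int32(key):
    """Encode the key as a 4-byte big-endian signed integer."""
    return struct.pack(">i", key)


as_string = serialize_string(42)   # two bytes: the characters '4' and '2'
as_int = serialize_int32(42)       # four bytes: 00 00 00 2a
```

The two byte arrays differ, so any hash computed over them differs too; producers that disagree on the key encoding will not agree on the partition.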

Step 2 — Partition Selection: Which Partition Gets This Message?

After serialization, the producer must decide which partition the message goes to. The rule is straightforward:

Key present: hash the serialized key bytes using murmur2, mod by the number of partitions. Same key → same partition, always. This guarantees ordering for all messages sharing a key (e.g., all events for user ID 42 land on partition 7).
No key (null): the default partitioner (the built-in partitioner used when no key is provided) uses sticky partitioning in Kafka 2.4+: it fills one batch for one partition, then switches. This maximises batch size compared to pure round-robin.
Custom partitioner: you can implement the Partitioner interface to route by business logic (e.g., VIP customers always land on partition 0 for priority processing).
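The sticky behaviour for keyless records can be sketched in Python. This is a toy model with assumptions stated up front: the StickyChooser class is made up, a batch is counted in records rather than bytes, and the next partition is picked at random from the others.

```python
import random


class StickyChooser:
    """Keyless records stick to one partition until its batch is full, then move on."""

    def __init__(self, num_partitions, records_per_batch, seed=0):
        self._rng = random.Random(seed)
        self._num_partitions = num_partitions
        self._records_per_batch = records_per_batch
        self._current = self._rng.randrange(num_partitions)
        self._in_batch = 0

    def next_partition(self):
        if self._in_batch == self._records_per_batch:
            # Batch is full: pick a different partition for the next batch.
            others = [p for p in range(self._num_partitions) if p != self._current]
            self._current = self._rng.choice(others)
            self._in_batch = 0
        self._in_batch += 1
        return self._current


chooser = StickyChooser(num_partitions=4, records_per_batch=3)
sequence = [chooser.next_partition() for _ in range(9)]
```

The resulting sequence comes in runs of three identical partition numbers, whereas pure round-robin would change partition on every record and leave each batch nearly empty.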

Why key-based partitioning matters for ordering. Kafka only guarantees message order within a single partition. If you need all events for a given user to arrive in order, all those events must go to the same partition — which means they must share the same key. If you send user events with no key, they scatter across partitions and ordering is lost. Key → partition affinity is not a nice-to-have; it's the mechanism by which Kafka provides ordering guarantees at all.

Step 3 — Batching: The Throughput Engine

Sending one message per network request would be cripplingly slow: the per-request overhead (TCP, Kafka protocol framing, broker processing) would dominate. Instead, the producer accumulates messages into a RecordBatch (a group of ProducerRecords destined for the same topic-partition, compressed and sent together in a single network request) before sending. Two config knobs control when the batch is dispatched:

batch.size (default 16 384 bytes): the maximum bytes the batch can hold. When full, the batch is sent immediately.
linger.ms (default 0 ms before Kafka 4.0, 5 ms from 4.0): how long to wait for the batch to fill before sending anyway. A value of 0 means send without waiting: low latency but small batches. Setting linger.ms=5 or linger.ms=20 allows the batch to accumulate more messages, dramatically increasing throughput at the cost of a few milliseconds of added latency.

The trade-off is real and tunable: linger.ms=0 gives minimum per-message latency; linger.ms=20 + large batch.size can multiply throughput 5–10× for high-volume producers.
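To make the interaction between batch.size and linger.ms concrete, here is a toy simulation. It is a sketch under simplifying assumptions: the class BatchingSketch is hypothetical, every record has the same size, there is a single partition, and a batch is flushed either when it cannot fit another record or when the next record arrives after the batch has waited at least linger.ms. The real producer is event-driven and more subtle.

```java
import java.util.ArrayList;
import java.util.List;

class BatchingSketch {
    // Returns the number of records in each dispatched batch, in send order.
    static List<Integer> simulate(long[] arrivalMs, int recordBytes,
                                  int batchSize, long lingerMs) {
        List<Integer> batches = new ArrayList<>();
        int count = 0;      // records in the currently open batch
        long openedAt = 0;  // arrival time of the first record in the open batch
        for (long t : arrivalMs) {
            // The open batch has waited at least linger.ms: flush it first.
            if (count > 0 && t - openedAt >= lingerMs) {
                batches.add(count);
                count = 0;
            }
            if (count == 0) {
                openedAt = t;
            }
            count++;
            // No room for one more record: the batch is full, send it now.
            if ((long) (count + 1) * recordBytes > batchSize) {
                batches.add(count);
                count = 0;
            }
        }
        if (count > 0) {
            batches.add(count); // flush whatever is left at the end
        }
        return batches;
    }
}
```

With eight 100-byte records arriving 5 ms apart and a 16 384-byte batch, linger 0 produces eight single-record batches in this model, while linger 20 produces two batches of four.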

Step 4 — Compression: Squeezing the Batch

Before sending, the producer can compress the entire batch. Compression happens at the batch level — not per-message — which means it benefits from redundancy across messages (e.g., many JSON records with identical field names compress extremely well together). Options:

none: no compression, lowest CPU, highest bandwidth.
gzip: best compression ratio, highest CPU cost. Good for archival or low-volume topics.
snappy: balanced, with reasonable compression and low CPU. Originally developed at Google.
lz4: fastest compression/decompression, moderate ratio. Best for latency-sensitive high-throughput topics.
zstd (Kafka 2.1+): best ratio-to-CPU trade-off overall; the modern default recommendation.

With the default topic setting compression.type=producer, batches are stored on the broker in the producer's compressed form and decompressed by the consumer; the broker does not recompress them for storage or delivery. This means compression saves disk space AND network bandwidth end-to-end.
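The benefit of batch-level compression is easy to see with the JDK's own gzip. The sketch below compares compressing 100 similar JSON records one at a time against compressing them as a single blob. The class BatchCompressionSketch and the record shape are made up for illustration, and it uses java.util.zip rather than Kafka's codecs, so only the relative sizes are meaningful.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class BatchCompressionSketch {
    // Size in bytes of the gzip-compressed form of the input.
    static int gzipSize(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // A hypothetical order event; field names repeat in every record.
    static String record(int i) {
        return "{\"orderId\":" + i + ",\"status\":\"CREATED\",\"currency\":\"USD\"}";
    }

    // Compress each record separately and add up the sizes.
    static int perMessageTotal(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += gzipSize(record(i).getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    // Concatenate all records and compress once, as a batch would be.
    static int wholeBatch(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(record(i));
        }
        return gzipSize(sb.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

Comparing perMessageTotal(100) with wholeBatch(100) shows the batch form coming out far smaller, because the repeated field names are encoded once and the per-stream gzip overhead is paid once.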

Step 5 — Acknowledgement Strategy: The Durability Dial

Once the batch is ready and compressed, the producer sends it to the partition leader and waits for an acknowledgement. The acks config is the most important durability knob in Kafka. Think of it as a dial from "fire and forget" to "I need proof this is safe":

This diagram shows the three acknowledgement levels side by side. With acks=0, the producer sends and immediately moves on — the broker might not have even written to disk yet. With acks=1, the leader writes to its log and sends an ack, but if the leader crashes before its followers replicate that batch, the message is gone. With acks=all (equivalently acks=-1), the write is only acknowledged after every in-sync replica has written it — so even if the leader dies immediately, the followers already have the data.

Step 6 — Idempotent Producer: Eliminating Duplicate Writes

Here's a subtle failure mode: with acks=all, the leader replicates the batch and sends an ack — but the network drops the ack before the producer receives it. The producer thinks the send failed and retries. The broker receives the same batch again and writes it a second time. You now have a duplicate. This is the at-least-once delivery problem.

Kafka 0.11 introduced the idempotent producer (a producer configured with enable.idempotence=true) to solve this. Setting enable.idempotence=true causes the broker to assign the producer a unique Producer ID (PID). Every message gets a monotonically increasing sequence number per partition. If the broker receives a message with a sequence number it has already written for that PID and partition, it discards the duplicate silently. The producer can retry freely: duplicates are impossible as long as the producer instance stays alive.
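The broker-side duplicate check can be modelled in a few lines. This is a simplified sketch, not the broker's implementation: the class IdempotentLogSketch is hypothetical, it remembers only the last sequence number per (PID, partition), and it ignores details such as rejecting out-of-order gaps.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentLogSketch {
    // Last sequence number written, keyed by "pid:partition".
    private final Map<String, Integer> lastSeq = new HashMap<>();
    private int recordsWritten = 0;

    // Returns true if the write was appended, false if it was a duplicate.
    boolean append(long pid, int partition, int seq) {
        String key = pid + ":" + partition;
        Integer last = lastSeq.get(key);
        if (last != null && seq <= last) {
            return false; // already written for this PID and partition
        }
        lastSeq.put(key, seq);
        recordsWritten++;
        return true;
    }

    int recordsWritten() {
        return recordsWritten;
    }
}
```

A retry of append(1, 0, 7) after a lost ack returns false and leaves the log unchanged, while the same sequence number from a different PID is accepted. That second case is why idempotence alone does not survive a producer restart.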

Idempotence is now the default. As of Kafka 3.0, the producer defaults are enable.idempotence=true and acks=all. You don't have to opt in, but you should know what it's doing for you. Idempotence requires acks=all: explicitly combining enable.idempotence=true with acks=0 or acks=1 is rejected with a ConfigException.

Step 7 — Transactional Sends: Atomic Writes Across Partitions

Idempotence protects against duplicates on a single producer instance, but what if you need to write to multiple partitions or topics atomically? For example: a stream-processing job reads from topic A, transforms, and writes to topic B. If it crashes halfway through writing a batch to topic B, some consumers will see partial results. Kafka's transaction API (KIP-98, Kafka 0.11) solves this by letting a producer write to multiple topic-partitions atomically: either all writes in the transaction are committed and visible to read_committed consumers, or none are. You assign the producer a stable transactional.id, call initTransactions() once, then beginTransaction(), send to multiple partitions, and finally commitTransaction(). If the producer crashes and restarts, the broker uses the transactional.id to detect and abort any in-flight transaction from the old instance.

This pipeline runs in the producer client library, with no broker involvement until the batch is dispatched in Step 5. The accumulator is an in-memory buffer that collects records into per-partition batches. A background sender thread drains those batches and fires them at the brokers. This two-thread design means your application thread calling producer.send() rarely blocks: it just drops the record into the accumulator's memory buffer and returns immediately. It blocks only when the buffer is full or metadata is unavailable, for at most max.block.ms.

Java Config Snippet

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

// Throughput focus: buffer for 20ms, large batches, fast compression
props.put(ProducerConfig.LINGER_MS_CONFIG, "20");
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "131072"); // 128 KB
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");
props.put(ProducerConfig.ACKS_CONFIG, "1"); // leader ack only

// Idempotence requires acks=all, so it is switched off explicitly here;
// enable.idempotence=true together with acks=1 throws a ConfigException
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "false");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

// Durability focus: wait for all ISR replicas
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.RETRIES_CONFIG, "10");
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5"); // safe with idempotence

// Transactional (for atomic multi-partition writes)
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-producer-1");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions(); // must be called once before beginTransaction()

In the throughput config, linger.ms=20 lets batches fill up: you trade 20ms of added latency for dramatically larger batches and higher throughput. In the durability config, acks=all combined with enable.idempotence=true and MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5 gives you duplicate-free, in-order writes per partition without sacrificing too much concurrency. Without idempotence, you'd have to set in-flight requests to 1 to avoid reordering on retry.

The Kafka producer is a multi-stage pipeline: serialize → partition-select → batch into accumulator → compress → dispatch with chosen acks level. linger.ms and batch.size are the throughput dial; acks=all + enable.idempotence=true is the durability dial. Idempotent producers use (PID, partition, sequence) to eliminate duplicate writes on retry, and the transactional API extends atomicity across multiple partitions.

Section 8

Consumer Deep Dive — Consumer Groups, Offsets, Rebalancing

On the producer side, complexity lives inside one machine. On the consumer side, complexity is distributed — multiple consumers coordinating to share partitions, tracking their own positions, and handling the inevitable fact that any consumer can crash at any moment. This section is the one most engineers get wrong in production. Understanding the heartbeat model, offset semantics, and rebalancing is the difference between a Kafka consumer that works and one that silently processes every message twice.

Consumer Groups: Dividing the Work

Imagine a topic with 12 partitions. You have one consumer reading all 12: it works, but that single process is a throughput bottleneck. The solution is a consumer group: a named set of consumer processes that collectively consume a topic. Kafka's rule is simple: each partition is assigned to exactly one consumer within a group at any given time, so consumption is parallelised across the group. With 12 partitions and 3 consumers in the group, each consumer gets 4 partitions. Scale to 12 consumers and each consumer gets exactly 1 partition. Add a 13th consumer and it sits idle because there are no unassigned partitions. Want more parallelism? Increase partition count.

Different consumer groups are completely independent. The email service group, the analytics group, and the fraud-detection group can all read the same topic. Each group has its own set of offset pointers. One group being behind doesn't affect another. This is why Kafka enables "fan-out for free" — no extra queues, no extra bindings, just a new group name.
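The arithmetic of partition assignment can be sketched with a simple round-robin split. This is an illustration only: GroupAssignmentSketch is a hypothetical class, and Kafka's real assignors (range, round-robin, sticky) use their own rules, but the counts per consumer come out the same for these examples.

```java
import java.util.ArrayList;
import java.util.List;

class GroupAssignmentSketch {
    // Partition p goes to consumer (p % consumers); result.get(c) lists
    // the partitions owned by consumer c.
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p);
        }
        return result;
    }
}
```

With assign(12, 3) each consumer ends up with four partitions, and with assign(12, 13) the last consumer's list is empty, which is the idle 13th consumer described above.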

The diagram shows that each partition is owned by exactly one consumer in the group (a consumer may own several partitions). Each consumer owns its assigned partitions exclusively: Consumer B cannot accidentally read from Partition 0, even in a busy system. The offset each consumer has committed is stored in Kafka's internal __consumer_offsets topic, not in the broker's memory, so even if the consumer crashes and a new instance takes over, it resumes from the last committed offset, not from the beginning of the log.

Offsets: Your Position in the Log

An offset is a monotonically increasing integer that uniquely identifies a record's position within a single Kafka partition. Consumer groups track their progress by committing offsets to the __consumer_offsets topic; by convention the committed value is the position of the next record to read, that is, the last successfully processed offset plus one. Kafka tracks two distinct offsets per consumer-partition pair:

Current position: the offset the consumer is about to fetch next (in memory, changes on every poll()).
Committed offset: the last offset the consumer has durably committed to __consumer_offsets. If the consumer crashes, it restarts from here.

The gap between committed offset and current position is the at-least-once window. If the consumer processes messages 100–199, then crashes before committing, it will reprocess 100–199 on restart. This is fine for idempotent processing. If you need exactly-once, use Kafka Streams, or a transactional producer that commits the consumed offsets inside its transaction (sendOffsetsToTransaction) with downstream consumers reading in read_committed mode.
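The at-least-once window is easy to reproduce with two counters. The sketch below is a toy model with a hypothetical class OffsetWindowSketch; it tracks only the committed offset and the in-memory position for one partition and counts how many records were processed in total.

```java
class OffsetWindowSketch {
    long committed;          // where a restarted consumer resumes
    long position;           // next offset the running consumer will fetch
    int processedCount = 0;  // total records processed, including repeats

    OffsetWindowSketch(long startOffset) {
        this.committed = startOffset;
        this.position = startOffset;
    }

    // Fetch and process n records; the position advances, the commit does not.
    void pollAndProcess(int n) {
        position += n;
        processedCount += n;
    }

    // Durably record the current position.
    void commit() {
        committed = position;
    }

    // A crash loses the in-memory position; the restart resumes from the commit.
    void crashAndRestart() {
        position = committed;
    }
}
```

Starting at offset 100, processing 100 records, crashing before the commit, and then processing 100 records again leaves processedCount at 200 for only 100 distinct records: every record in the window was handled twice.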

Auto-commit is dangerous for batch processing. The default enable.auto.commit=true commits offsets every auto.commit.interval.ms (default 5 seconds). The commit happens inside poll() and covers everything the previous poll() returned, regardless of whether your code actually finished processing those records. If your consumer reads 1,000 records, hands them to a worker thread or swallows an exception at record 800, and then calls poll() again, the offsets covering records 800–999 get committed and those records are never reprocessed. For any non-trivial consumer, use enable.auto.commit=false and commit explicitly after successful processing.

The Heartbeat Model: How Kafka Knows You're Alive

Kafka doesn't watch your consumer process directly. Instead, the consumer sends periodic heartbeats to the Group Coordinator: a broker that manages a consumer group by tracking membership, handling joins/leaves, and triggering rebalances. Each consumer group is assigned one coordinator broker based on a hash of the group name. The consumer also must call poll() regularly: the Kafka client sends heartbeats in a background thread, but if poll() isn't called within max.poll.interval.ms, the client declares itself dead and leaves the group.

Two config knobs matter here:

session.timeout.ms (default 45 000 ms): if the coordinator doesn't hear a heartbeat within this window, it considers the consumer dead and triggers a rebalance.
max.poll.interval.ms (default 300 000 ms = 5 min): the maximum time between two consecutive poll() calls. If your processing is so slow that you don't call poll() within 5 minutes, Kafka boots you from the group — even if your heartbeat thread is still healthy.

The classic consumer bug: slow processing + default max.poll.interval.ms. Your consumer reads a batch of 500 messages, each requiring a 2-second database write. 500 × 2s = 1 000 seconds. Kafka's default max.poll.interval.ms is 300 seconds. At second 300, the consumer is considered dead, its partitions are revoked, and a rebalance is triggered. Another consumer instance takes over, reads the same batch from the committed offset, and starts processing. Meanwhile, your original consumer finishes writing at second 1 000, tries to commit, and gets a CommitFailedException because it no longer owns those partitions. Result: every message in that batch processed twice. Fix: either increase max.poll.interval.ms, reduce max.poll.records so each batch is smaller, or move slow processing to async threads.
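The arithmetic behind this bug is worth turning into a quick check. The helper below is a sketch with a hypothetical class PollBudgetSketch; it assumes a fixed per-record processing time, and the headroom percentage is an arbitrary safety margin chosen by the caller rather than a Kafka setting.

```java
class PollBudgetSketch {
    // Worst-case time to process one poll() batch, in milliseconds.
    static long batchMillis(int maxPollRecords, long perRecordMillis) {
        return maxPollRecords * perRecordMillis;
    }

    // True if a full batch takes longer than max.poll.interval.ms.
    static boolean breaches(int maxPollRecords, long perRecordMillis,
                            long maxPollIntervalMs) {
        return batchMillis(maxPollRecords, perRecordMillis) > maxPollIntervalMs;
    }

    // Largest max.poll.records that uses at most headroomPercent of the interval.
    static int largestSafeBatch(long perRecordMillis, long maxPollIntervalMs,
                                int headroomPercent) {
        long budgetMs = maxPollIntervalMs * headroomPercent / 100;
        return (int) (budgetMs / perRecordMillis);
    }
}
```

For the example above, batchMillis(500, 2000) is 1 000 000 ms against a 300 000 ms limit, so the batch breaches. Using half the interval as the budget, largestSafeBatch(2000, 300000, 50) suggests a max.poll.records of 75.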

Rebalancing: Redistributing Partitions

Whenever a consumer joins or leaves the group, Kafka must redistribute partitions. This is called a rebalance: the process of redistributing partition assignments across consumers in a group, triggered by a consumer joining, leaving, or failing the heartbeat. The original (eager) rebalancing protocol is painful: all consumers revoke all partitions, rejoin the group, and wait for the coordinator to assign new partitions. During this stop-the-world pause, nothing is consumed.

The diagram makes the key insight visual: eager rebalancing stops everything even if only one consumer joins and only a few partitions need to move. Cooperative-sticky (introduced in KIP-429, Kafka 2.4, released 2019) fixes this by negotiating incrementally — consumers keep their stable partitions and only hand off the partitions that need to migrate. For groups with hundreds of partitions, this is the difference between a sub-second hiccup and a 30-second consumption pause that triggers downstream latency alerts.

Enable it with partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor in your consumer config.
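The difference between the two protocols comes down to how many partitions stop being consumed during a rebalance. The sketch below counts revocations for a given before/after ownership. It is a toy model: RebalanceSketch is a hypothetical class, array index i stands for partition i, and the target assignment is supplied by the caller rather than computed by an assignor.

```java
class RebalanceSketch {
    // Eager protocol: every partition is revoked before reassignment.
    static int eagerRevocations(String[] ownerBefore) {
        return ownerBefore.length;
    }

    // Cooperative protocol: only partitions whose owner changes are revoked.
    static int cooperativeRevocations(String[] ownerBefore, String[] ownerAfter) {
        int moved = 0;
        for (int p = 0; p < ownerBefore.length; p++) {
            if (!ownerBefore[p].equals(ownerAfter[p])) {
                moved++;
            }
        }
        return moved;
    }
}
```

If consumers A and B each own three of six partitions and a new consumer C takes one partition from each, the eager count is 6 and the cooperative count is 2: four partitions keep flowing throughout.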

Java Consumer Config Snippet

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

// Manual offset commit: never lose a message silently
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

// Heartbeat / liveness
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");    // 30 s
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000"); // 10 s (1/3 of session timeout)
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // 10 min (tune to your slowest batch)

// Reduce batch size to avoid max.poll.interval.ms breach
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");

// Cooperative rebalancing: incremental, no stop-the-world
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          CooperativeStickyAssignor.class.getName());

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("order-events"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // your logic here
    }
    consumer.commitSync(); // commit only after ALL records in batch processed
}

Key choices in this config: ENABLE_AUTO_COMMIT_CONFIG=false + commitSync() after the processing loop gives at-least-once semantics, because you never commit past a record you haven't processed. MAX_POLL_RECORDS_CONFIG=100 keeps batches small so processing completes well within MAX_POLL_INTERVAL_MS. The heartbeat interval is set to one-third of session timeout, a common rule of thumb that gives the broker two missed heartbeats before declaring the consumer dead, reducing false positives from transient network hiccups.

A consumer group assigns each partition to exactly one consumer, enabling parallelism equal to the partition count. Offsets are committed to __consumer_offsets; the gap between committed and current offset is the at-least-once window. The heartbeat+poll model has two independent liveness checks: session.timeout.ms for the heartbeat thread and max.poll.interval.ms for the application thread. Cooperative-sticky rebalancing (Kafka 2.4+, KIP-429) eliminates stop-the-world pauses by only revoking partitions that need to move.

Section 9

The ISR Replication Protocol — How Kafka Stays Durable

Kafka stores every partition on multiple brokers so that a single broker failure doesn't destroy data. But "stored on multiple brokers" is a vague promise. The real question is: at what point exactly is a message safe? The answer lives in a mechanism called the In-Sync Replica (ISR) set: the set of replicas for a partition that are fully caught up with the leader's log. Only ISR replicas are eligible to become the new leader. The leader maintains this list and removes replicas that fall too far behind. Understanding the ISR precisely is what separates engineers who configure Kafka correctly from those who lose data in production.

The Replication Flow: Leader Writes, Followers Pull

Every partition has exactly one leader replica at any moment. All writes go to the leader. The leader appends the message to its local log and then — this is the key part — waits for followers to catch up before acknowledging a committed write (when acks=all).

Followers don't get pushed data by the leader. They actively pull from the leader, just like a consumer pulls messages. Each follower sends a Fetch request to the leader, specifying the offset it already has. The leader responds with the new messages. The follower appends them and sends another Fetch. This continuous pull loop means the follower's lag is measurable in real time — the leader knows exactly how far behind each follower is.

The diagram introduces a critical concept: the High Watermark (HW), the offset up to which all ISR replicas have replicated. The leader maintains this offset, advancing it only when all ISR replicas have confirmed they've written up to that point. Consumers can only read records at or below the HW, which prevents a consumer from reading a record that might be lost if the leader crashes before followers replicate it. This is why acks=all + ISR gives a true durability guarantee: by the time a consumer reads a record, every ISR replica has it, so even if the leader dies that instant, a follower takes over with all the data the consumer already read.
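The high watermark is simply the slowest ISR member's progress. The sketch below makes one convention explicit as an assumption: each replica reports its log end offset (the offset of the next record it would write), the HW is the minimum of those, and records strictly below the HW count as committed. HighWatermarkSketch is a hypothetical class, not broker code.

```java
class HighWatermarkSketch {
    // isrLogEndOffsets: for every ISR member (leader included), the offset
    // of the next record that replica would write.
    static long highWatermark(long[] isrLogEndOffsets) {
        if (isrLogEndOffsets.length == 0) {
            throw new IllegalArgumentException("ISR must contain at least the leader");
        }
        long hw = Long.MAX_VALUE;
        for (long leo : isrLogEndOffsets) {
            hw = Math.min(hw, leo);
        }
        return hw;
    }

    // Under this convention, a record is readable once every ISR member has it.
    static boolean visibleToConsumer(long offset, long highWatermark) {
        return offset < highWatermark;
    }
}
```

With a leader at log end offset 120 and followers at 118 and 115, the HW is 115: offsets up to 114 are readable, while 115 to 119 exist on the leader but stay hidden until the slowest follower catches up.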

What "In-Sync" Means — and What Removes a Replica from ISR

A follower is considered "in sync" as long as it has caught up to the end of the leader's log at some point within the last replica.lag.time.max.ms milliseconds (default 30 000 ms). If a follower falls behind for longer than that (maybe it's on a slow machine, dealing with a GC pause, or cut off by a network partition), it's removed from the ISR list. From that moment it no longer holds up acks=all writes. The leader waits only for the remaining ISR members.

When the lagging follower catches up, it's added back to the ISR. This dynamic membership is what makes ISR elegant: the cluster automatically adapts to slow or failing replicas without stopping writes — it just temporarily reduces the durability guarantee until the replica recovers.

min.insync.replicas: The Hard Durability Floor

ISR shrinkage creates a danger: if all but one replica fall behind, the ISR shrinks to just the leader. Now acks=all only waits for the leader to write — but if the leader crashes before any follower replicates, all writes since the ISR shrunk are lost. The min.insync.replicas config prevents this by setting a floor: if the ISR drops below this number, writes with acks=all are rejected with NotEnoughReplicasException rather than accepted and potentially lost.

The recommended production configuration for critical topics is replication.factor=3 + min.insync.replicas=2 + acks=all on the producer. This tolerates one broker failure with no data loss. If a second broker fails and only one ISR replica remains, Kafka throws NotEnoughReplicasException to the producer — writes stop rather than accepting data that might be lost. In the language of distributed systems, Kafka deliberately picks consistency over availability when its safety floor is breached — what the CAP theorem calls CP behaviour (Consistency + Partition tolerance). Better to refuse new writes than to silently lose them.
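The accept-or-reject rule for a write can be written as a small decision function. This is a sketch with a hypothetical class MinIsrSketch; it models only the min.insync.replicas check and ignores everything else the broker validates.

```java
class MinIsrSketch {
    enum Outcome { ACCEPTED, REJECTED_NOT_ENOUGH_REPLICAS }

    // The floor applies only to producers that asked for acks=all (or -1).
    static Outcome produce(String acks, int isrSize, int minInsyncReplicas) {
        boolean requiresIsr = acks.equals("all") || acks.equals("-1");
        if (requiresIsr && isrSize < minInsyncReplicas) {
            return Outcome.REJECTED_NOT_ENOUGH_REPLICAS;
        }
        return Outcome.ACCEPTED;
    }
}
```

With replication.factor=3 and min.insync.replicas=2, produce("all", 2, 2) is still accepted after one broker failure, while produce("all", 1, 2) is rejected after a second. A producer using acks=1 is accepted in both cases, which is why the floor protects only writers that opt into acks=all.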

Leader Election and unclean.leader.election.enable

When the leader broker crashes, Kafka elects a new leader from the ISR. Because all ISR replicas are fully caught up (that's the definition of ISR), any of them can become leader without losing any committed messages. This is clean failover.

The danger arises if the ISR is empty: maybe all replicas fell behind and then the leader crashed. If unclean.leader.election.enable=true (the dangerous default before Kafka 0.11), Kafka can elect a lagging follower as leader. That follower is missing messages the old leader committed, and those messages are gone. Setting unclean.leader.election.enable=false (the default since Kafka 0.11) prevents this: if no ISR replica is available, the partition stays offline until a replica comes back. Data is preserved at the cost of temporary unavailability.

Kafka durability rests on the ISR — the set of replicas fully caught up with the leader. A write is committed when all ISR members confirm it; the high watermark advances accordingly, and consumers only read committed data. min.insync.replicas=2 with replication.factor=3 is the standard durability floor: one broker can fail safely; a second failure pauses writes rather than accepting data loss. unclean.leader.election.enable=false prevents an out-of-sync replica from becoming leader and destroying committed messages.

Section 10

KRaft — How Kafka Replaced ZooKeeper

For over a decade, every Kafka cluster came with a hidden dependency: Apache ZooKeeper, a distributed coordination service used by pre-4.0 Kafka to store cluster metadata, elect the controller broker, manage topic configurations, and handle partition leader elections (deprecated in Kafka 3.x, removed in Kafka 4.0). ZooKeeper is a separate distributed system: it runs its own processes, its own quorum, its own JVMs. Kafka leaned on it for all cluster metadata: which broker is the controller, which partitions have which leaders, what the topic configs are, who the ISR members are. If you wanted to run Kafka, you ran ZooKeeper first. If ZooKeeper had problems, Kafka had problems.

Why ZooKeeper Had to Go

ZooKeeper worked — but it created three operational headaches that got worse as Kafka clusters scaled:

Two systems to operate: separate monitoring, separate upgrades, separate on-call runbooks. When something went wrong, the first question was always "is it Kafka or ZooKeeper?"
Controller failover was slow: when the active Kafka controller broker died, the new controller had to reload all partition metadata from ZooKeeper. With hundreds of thousands of partitions, this could take minutes.
Partition count ceiling: every partition change (creation, leadership change, ISR change) was a ZooKeeper write. At roughly 200 000 partitions, ZooKeeper's write throughput became a real bottleneck. Kafka's practical limit was far below what the log-based architecture could theoretically support.

KRaft: Kafka's Own Consensus Layer

KIP-500 (proposed 2019) described replacing ZooKeeper with a Raft-based consensus protocol built into Kafka itself. The result is KRaft (Kafka Raft), Kafka's built-in metadata consensus protocol. Instead of outsourcing metadata to ZooKeeper, Kafka stores it in an internal topic called __cluster_metadata, replicated across a small set of dedicated controller nodes using the Raft consensus algorithm.

In pre-KRaft deployments, every metadata operation was a round-trip to ZooKeeper, a separate distributed system that Kafka engineers had to monitor, tune, and upgrade in lockstep. In KRaft, the controller nodes are Kafka server processes started with the process.roles=controller config. They replicate metadata in the __cluster_metadata topic using Raft. Brokers stay current by fetching new metadata records from the active controller's log, rather than reading from ZooKeeper.

Timeline

Kafka 2.8 (Apr 2021) — KRaft introduced as early access (KIP-500). Not production-ready.
Kafka 3.3 (Oct 2022) — KRaft declared production-ready for new clusters (KIP-833). ZooKeeper mode still supported.
Kafka 3.5 (Jun 2023) — ZooKeeper mode officially deprecated. Migration tooling from ZooKeeper to KRaft matured.
Kafka 4.0 (Mar 2025) — ZooKeeper mode removed entirely. KRaft is the only supported mode.

Controller failover that previously took minutes (due to ZooKeeper metadata reload) now completes in milliseconds, because the new controller node already has the full metadata log locally. The theoretical partition count ceiling rises from roughly 200k to millions, though operational concerns (rebalance time, memory) still apply in practice.

ZooKeeper was Kafka's external metadata store — a separate distributed system managing topic configs, partition assignments, and controller elections. KRaft (KIP-500) replaces it with a Raft-based consensus layer built directly into Kafka, storing metadata in the internal __cluster_metadata topic. ZooKeeper was removed in Kafka 4.0 (2025). The wins: single system to operate, controller failover in milliseconds instead of minutes, and support for vastly more partitions.

Section 11

Storage Internals — Log Segments, Index Files, Page Cache

Kafka's reputation for high throughput — handling millions of messages per second on commodity hardware — isn't magic. It comes from a few deliberate storage choices that align perfectly with how operating systems and disk hardware actually work. Once you understand the segment layout and the page cache trick, Kafka's performance characteristics stop being mysterious and start being predictable.

The Partition on Disk: A Directory of Segment Files

Each partition is stored as a directory on the broker's file system. Inside that directory is not one giant file but a series of log segments: size-bounded (or time-bounded) chunks of the append-only log. At any moment, one segment is the active segment receiving new writes. When it reaches segment.bytes or segment.ms, it is sealed (rolled) and a new active segment starts. All older segments are sealed (immutable).

A partition directory contains three file types per segment, named by the segment's base offset (the offset of the first message in that segment):

00000000000000000000.log — the raw message data, appended sequentially. This is the append-only log.
00000000000000000000.index: a sparse offset index that maps logical offset to byte position in the .log file. Every ~4 KB of log data gets one index entry (controlled by index.interval.bytes, default 4096). A binary search on the index finds the right byte position in O(log n) time.
00000000000000000000.timeindex — a timestamp index: maps message timestamp to offset. Used for time-based offset lookup (e.g., "give me messages from the last hour").

The index design is elegant: it doesn't need one entry per message (that would make the index enormous for high-throughput topics). A sparse index with one entry per ~4 KB means binary search narrows the .log file seek to within one OS page, and the subsequent linear scan is at most a few kilobytes. This is why Kafka can serve any arbitrary offset lookup in microseconds regardless of how many messages are in the log.
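The lookup can be sketched with a sorted map standing in for the index file. This is an illustration under assumptions: SparseIndexSketch is a hypothetical class, the entries are invented, and a TreeMap replaces the binary search Kafka performs over the on-disk index. The shape of the lookup is the same: find the greatest indexed offset that is not past the target, then scan forward from its byte position.

```java
import java.util.Map;
import java.util.TreeMap;

class SparseIndexSketch {
    // offset -> byte position in the .log file, one entry every few KB
    private final TreeMap<Long, Long> index = new TreeMap<>();

    void addEntry(long offset, long bytePosition) {
        index.put(offset, bytePosition);
    }

    // Byte position from which a linear scan for targetOffset should start.
    long scanStart(long targetOffset) {
        Map.Entry<Long, Long> entry = index.floorEntry(targetOffset);
        return entry == null ? 0L : entry.getValue();
    }
}
```

With entries (0 → 0), (50 → 4096) and (103 → 8192), a lookup for offset 75 starts scanning at byte 4096 and has at most one index interval of data to walk through before reaching the record.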

Page Cache: Why Kafka Doesn't Use Its Own Buffer

Most databases maintain their own buffer pool: a region of RAM they manage themselves, deciding which disk blocks to cache. Kafka does the opposite: it lets the OS page cache (the kernel's transparent disk I/O cache, which keeps recently read and written blocks in RAM automatically) do all caching. When Kafka appends to a log segment, it writes to the page cache. The OS flushes to disk asynchronously. When a consumer reads data that was recently written, the OS serves it directly from page cache, with no disk I/O at all.

Why is this better than a self-managed buffer? Three reasons: the OS page cache is backed by all RAM not claimed by the JVM heap or other processes (so a smaller heap leaves more room for cache), it persists across JVM restarts (a freshly restarted Kafka broker can still serve hot data from cache), and the OS has decades of optimisation in its page replacement algorithms.

Zero-Copy: From Disk to Socket Without Touching User Space

When a consumer fetches data, the naive path would be: disk to kernel buffer, kernel buffer to user-space Kafka JVM, JVM to kernel socket buffer, socket buffer to network. That's two extra kernel-userspace copies. Kafka avoids both with the Linux sendfile() system call, which transfers data directly from a file (or page cache) to a network socket without copying through user space. This zero-copy transfer moves data from page cache to the network socket via DMA, never touching the JVM heap at all, which dramatically reduces CPU and memory bus usage for consumer fetch responses.

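Note that zero-copy only applies when the broker can send the stored bytes unchanged. If the broker has to transform data on the way out, for example by down-converting the message format for an old consumer or by encrypting the connection with TLS, the bytes must pass through user space and the zero-copy benefit is lost. Separately, if a topic's compression.type differs from the producer's codec, the broker recompresses batches on the write path. Consistent compression settings across your pipeline avoid that extra work.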
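In Java, the usual way to ask the OS for this kind of transfer is FileChannel.transferTo. The sketch below shows the call shape using only the JDK. It is an illustration, not broker code: TransferToSketch is a hypothetical class, the "segment" is a temp file, and the destination is an in-memory channel instead of a socket, so no actual zero-copy happens here. With a socket channel as the target, the JDK can delegate to sendfile() on platforms that support it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TransferToSketch {
    // Writes the bytes to a temp "segment" file, then streams the file to a
    // destination channel with transferTo and returns what arrived.
    static byte[] sendSegment(byte[] segmentBytes) {
        try {
            Path segment = Files.createTempFile("segment", ".log");
            try {
                Files.write(segment, segmentBytes);
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                try (FileChannel in = FileChannel.open(segment, StandardOpenOption.READ);
                     WritableByteChannel out = Channels.newChannel(sink)) {
                    long position = 0;
                    long size = in.size();
                    // transferTo may move fewer bytes than requested, so loop.
                    while (position < size) {
                        position += in.transferTo(position, size - position, out);
                    }
                }
                return sink.toByteArray();
            } finally {
                Files.deleteIfExists(segment);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The application code never reads the file into its own byte array; it only names a source range and a destination, which is what lets the kernel take the shortcut when the destination is a socket.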

Retention and Log Compaction

Kafka offers two cleanup policies for sealed segments:

Time/size-based deletion (cleanup.policy=delete, default): sealed segments older than retention.ms (default 7 days) or beyond retention.bytes are deleted. The entire segment file is removed atomically.
Log compaction (cleanup.policy=compact): a background log-cleaner thread keeps only the latest message per key, removing older records with the same key. This makes a compacted topic a compact key-value store — always the latest value for every key, replayable from the start. Used heavily by Kafka Streams and event-sourcing systems.
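The effect of compaction on a log can be shown with a small array transform. This is a sketch of the outcome only: CompactionSketch is a hypothetical class, each record is a {key, value} pair in offset order, and details of the real log cleaner (tombstones, the uncompacted active segment, retained offsets) are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompactionSketch {
    // Keeps only the last record for each key, preserving log order.
    static String[][] compact(String[][] log) {
        // First pass: remember the index of the latest record per key.
        Map<String, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < log.length; i++) {
            lastIndex.put(log[i][0], i);
        }
        // Second pass: keep a record only if it is the latest for its key.
        List<String[]> kept = new ArrayList<>();
        for (int i = 0; i < log.length; i++) {
            if (lastIndex.get(log[i][0]) == i) {
                kept.add(log[i]);
            }
        }
        return kept.toArray(new String[0][]);
    }
}
```

Given the log (k1,a), (k2,b), (k1,c), (k3,d), (k2,e), the compacted result is (k1,c), (k3,d), (k2,e): one record per key, each holding the latest value, still in offset order.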

Each Kafka partition is a directory of segment files — .log (data), .index (sparse offset-to-byte-position map), and .timeindex (timestamp-to-offset). Segments roll when they hit segment.bytes or segment.ms. The OS page cache transparently caches hot segments in RAM, and the sendfile() zero-copy syscall moves data from page cache to network socket without a JVM heap copy. Retention deletes old sealed segments; log compaction keeps only the latest value per key for changelog-style topics.

Section 12

Exactly-Once Semantics — How Kafka Achieves It

Distributed systems have a classic problem: when a message send fails, did the broker receive it or not? The network is unreliable. If you retry, you might send it twice. If you don't retry, you might lose it. For years, Kafka gave you exactly two choices: at-most-once (don't retry, messages can be lost) or at-least-once (retry, messages can be duplicated). Then Kafka 0.11 (released June 2017) introduced a third option via KIP-98: exactly-once semantics (EOS), a delivery guarantee where each message is delivered and processed exactly once, with no losses and no duplicates. It rests on three building blocks: the idempotent producer, transactions, and read_committed consumers. It's the crown jewel feature, and understanding exactly how it works (and where it doesn't) will save you from deploying it incorrectly.

Building Block 1 — Idempotent Producer

The first building block you already met in S7: the idempotent producer. Each producer instance receives a Producer ID (PID) from the broker on startup. Every message batch the producer sends carries a sequence number — a monotonically increasing counter per partition. When the broker receives a batch, it checks: "Have I already written a batch with (PID, partition, sequence)?" If yes, it discards the duplicate. If no, it writes and records the sequence number.

The broker wrote seq=7 on the first send, the ack got lost, the producer retried with the same seq=7, and the broker silently discarded it. From the consumer's perspective, message seq=7 appears exactly once. No application code needed — this is automatic when enable.idempotence=true. The limitation: idempotence is scoped to a single producer instance's lifetime. If the producer process crashes and restarts, it gets a new PID. That's what transactions solve.
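The deduplication check can be sketched as a toy broker that remembers the last sequence number it accepted per (PID, partition). This is a simplified model, not Kafka's implementation: the class and method names are hypothetical, and the real broker also rejects sequence gaps, which the sketch ignores.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of broker-side deduplication.
class IdempotentBrokerSketch {
    // Last sequence number accepted for each "producerId:partition" pair.
    private final Map<String, Integer> lastSeq = new HashMap<>();
    private int batchesWritten = 0;

    // Returns true if the batch was appended, false if it was dropped as a duplicate.
    boolean append(long producerId, int partition, int sequence) {
        String key = producerId + ":" + partition;
        Integer last = lastSeq.get(key);
        if (last != null && sequence <= last) {
            return false; // already written: this is a retry
        }
        lastSeq.put(key, sequence);
        batchesWritten++;
        return true;
    }

    int batchesWritten() {
        return batchesWritten;
    }

    public static void main(String[] args) {
        IdempotentBrokerSketch broker = new IdempotentBrokerSketch();
        System.out.println(broker.append(1000L, 0, 7)); // first send: true
        System.out.println(broker.append(1000L, 0, 7)); // retry after a lost ack: false
        System.out.println(broker.append(2000L, 0, 7)); // restarted producer, new PID: true
    }
}
```

The third call shows the limitation described above: after a restart the producer has a new PID, so the broker has no way to recognize the batch as a duplicate.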

Building Block 2 — Transactions: Atomic Multi-Partition Writes

Imagine a Kafka Streams job: it reads from orders, transforms, and writes results to order-totals while also committing its consumer group offset to __consumer_offsets. If it crashes mid-write, consumers might see a partial update: results written to order-totals but the offset not committed, so the job re-reads and re-writes on restart. The transaction API makes the output write and the offset commit atomic: either both succeed or neither is visible.

The mechanism uses a Transaction Coordinator: one of the brokers, selected by hashing the transactional.id, that manages the state of a transaction. It tracks which partitions are included and whether the transaction is open, committed, or aborted, and it coordinates writes across brokers.

The COMMIT marker is the key mechanism. The producer's records are written to the partition logs immediately in step 3, but they're tagged as "in-transaction" — read_committed consumers buffer them and skip them until the COMMIT marker arrives in step 6. If the producer crashes before committing, the transaction coordinator eventually writes an ABORT marker and the records are permanently invisible to consumers, as if they were never written.

Building Block 3 — read_committed Consumer

Set isolation.level=read_committed on consumers that must only see committed data. With this config, the consumer never hands your application records that are part of an open (uncommitted) transaction. Transactional records are held back until the COMMIT or ABORT marker arrives, then either delivered (COMMIT) or discarded (ABORT). In practice the broker enforces the first half by serving read_committed fetches only up to the last stable offset (LSO), the first offset of the earliest still-open transaction, and the consumer filters out records from aborted transactions. The default isolation.level=read_uncommitted sees all records regardless of transaction state, which means a consumer could read data that later gets aborted (a dirty read).
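The difference between the two isolation levels can be sketched over a toy partition log. This is a model under stated assumptions, not client code: the Entry class, the string-typed marker kinds, and the method names are hypothetical, and every data record is assumed to be transactional.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a partition log with transaction markers.
class IsolationLevelSketch {
    // One log entry: a data record, or a marker written when a transaction ends.
    static final class Entry {
        final String kind;     // "DATA", "COMMIT" or "ABORT"
        final long producerId;
        final String value;    // null for markers

        Entry(String kind, long producerId, String value) {
            this.kind = kind;
            this.producerId = producerId;
            this.value = value;
        }
    }

    // read_uncommitted: every data record, whatever happened to its transaction.
    static List<String> readUncommitted(List<Entry> log) {
        List<String> out = new ArrayList<>();
        for (Entry e : log) {
            if (e.kind.equals("DATA")) out.add(e.value);
        }
        return out;
    }

    // read_committed: only committed records, and nothing at or past the last stable offset.
    static List<String> readCommitted(List<Entry> log) {
        Map<Long, List<Integer>> open = new HashMap<>(); // offsets of each producer's open txn
        boolean[] committed = new boolean[log.size()];
        for (int offset = 0; offset < log.size(); offset++) {
            Entry e = log.get(offset);
            if (e.kind.equals("DATA")) {
                open.computeIfAbsent(e.producerId, k -> new ArrayList<>()).add(offset);
            } else {
                List<Integer> offsets = open.remove(e.producerId);
                if (offsets != null && e.kind.equals("COMMIT")) {
                    for (int o : offsets) committed[o] = true;
                }
            }
        }
        int lastStableOffset = log.size();
        for (List<Integer> offsets : open.values()) {
            lastStableOffset = Math.min(lastStableOffset, offsets.get(0));
        }
        List<String> out = new ArrayList<>();
        for (int offset = 0; offset < lastStableOffset; offset++) {
            if (committed[offset]) out.add(log.get(offset).value);
        }
        return out;
    }

    static List<Entry> sampleLog() {
        return List.of(
            new Entry("DATA", 1, "order-1"),
            new Entry("DATA", 2, "order-2"),
            new Entry("COMMIT", 1, null),
            new Entry("ABORT", 2, null),
            new Entry("DATA", 3, "order-3")); // producer 3's transaction is still open
    }

    public static void main(String[] args) {
        System.out.println(readUncommitted(sampleLog())); // [order-1, order-2, order-3]
        System.out.println(readCommitted(sampleLog()));   // [order-1]
    }
}
```

In the sample log, order-2 was aborted and order-3 belongs to an open transaction, so a read_committed reader sees only order-1.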

Java Transactional Producer Snippet

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.ProducerFencedException;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Producer config
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "stream-processor-1"); // stable, unique per instance
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");             // required for transactions
    props.put(ProducerConfig.ACKS_CONFIG, "all");                            // required for transactions

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);
    producer.initTransactions(); // registers transactional.id with coordinator, gets epoch

    // Processing loop. key, result, currentOffsets and consumerGroupMetadata
    // come from the surrounding consume-transform code (not shown).
    try {
        producer.beginTransaction();

        // Write results to output topic
        producer.send(new ProducerRecord<>("order-totals", key, result));

        // Commit consumer offsets atomically WITH the output write.
        // Either both the new records AND the new offset commit are visible,
        // or neither is, which eliminates the "processed twice" problem on restart.
        producer.sendOffsetsToTransaction(currentOffsets, consumerGroupMetadata);

        producer.commitTransaction(); // COMMIT marker written to all involved partitions
    } catch (ProducerFencedException e) {
        // A newer instance with the same transactional.id is running: we are the zombie.
        // The broker incremented the epoch and rejected our writes. Must NOT retry.
        producer.close();
    } catch (KafkaException e) {
        producer.abortTransaction(); // writes ABORT marker, records become invisible
        // safe to retry the full batch
    }

The sendOffsetsToTransaction() call is the secret to end-to-end exactly-once in Kafka Streams: it writes the consumer group's offset commit as part of the same transaction as the output records. Either both the new output records AND the new committed offset are visible together, or neither is. If the job restarts, it re-reads from the pre-transaction offset and reprocesses, but the output of the previous attempt is invisible (aborted). No duplicates.

The fencing mechanism: the transactional.id is how Kafka handles zombie producers. When a new producer calls initTransactions() with the same transactional.id, the coordinator increments an epoch. Any write from the old instance (with the stale, lower epoch) is rejected with ProducerFencedException. This guarantees only one active transactional producer per transactional.id at any time.
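Epoch-based fencing can be sketched with a toy coordinator that keeps one counter per transactional.id. The names below are hypothetical and the model is deliberately minimal: a rejected write, returned here as false, stands in for the ProducerFencedException a real client would see.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of zombie fencing.
class FencingSketch {
    // Current epoch per transactional.id, as tracked by the toy coordinator.
    private final Map<String, Integer> epochs = new HashMap<>();

    // Models initTransactions(): bump the epoch and hand the new value to the caller.
    int initTransactions(String transactionalId) {
        return epochs.merge(transactionalId, 1, Integer::sum);
    }

    // Models a transactional write: only the holder of the current epoch may write.
    boolean write(String transactionalId, int producerEpoch) {
        Integer current = epochs.get(transactionalId);
        return current != null && producerEpoch == current;
    }

    public static void main(String[] args) {
        FencingSketch coordinator = new FencingSketch();
        int oldEpoch = coordinator.initTransactions("stream-processor-1");
        int newEpoch = coordinator.initTransactions("stream-processor-1"); // replacement starts
        System.out.println(coordinator.write("stream-processor-1", oldEpoch)); // zombie: false
        System.out.println(coordinator.write("stream-processor-1", newEpoch)); // active: true
    }
}
```

This is why the transactional.id must be stable across restarts of the same logical instance: a fresh random id would start its own epoch sequence and the old instance would never be fenced.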

Where EOS Works — and Where It Doesn't

EOS boundary: Kafka-to-Kafka only. Exactly-once semantics in Kafka is a guarantee scoped to Kafka-internal writes. A transaction covering "read from topic A, write to topic B, commit offset" is atomic — all within Kafka. The moment your processing involves an external system (a database write, an HTTP call, a file write), you are outside Kafka's transaction boundary. Kafka cannot atomically coordinate a Kafka commit with a Postgres COMMIT. For external sinks you need idempotent consumers (detect and discard duplicates on the sink side) or outbox patterns specific to that external system. Kafka Streams gives you exactly-once within Kafka for free; the last mile to an external database is still your responsibility.

Performance cost: transactional writes add overhead — the COMMIT marker write, coordination with the transaction coordinator, and extra Fetch buffering on the consumer side. Public benchmarks (Confluent and others) generally measure a single-digit-percent throughput reduction compared to non-transactional writes at high volumes — the idempotent producer overhead itself is essentially free, with most of the cost coming from the transaction coordinator round-trip on commit. For most workloads this is negligible compared to the correctness guarantee. For extreme throughput cases where occasional duplication is acceptable, stay at-least-once.

Kafka's exactly-once semantics (EOS), introduced in Kafka 0.11 via KIP-98, rests on three building blocks: (1) the idempotent producer uses (PID, partition, sequence) to deduplicate retries at the broker; (2) the transaction API makes writes to multiple partitions atomically visible via COMMIT/ABORT markers coordinated by the transaction coordinator; and (3) the isolation.level=read_committed consumer buffers transactional records until it sees the COMMIT marker. Together these enable exactly-once processing within Kafka pipelines. Across external systems, idempotent consumer logic is still required on the sink side.

Section 13

Log Compaction — Keep Only the Latest Value Per Key

So far, Kafka's retention model has been simple: messages older than a threshold (time or size) get deleted. That works great for event streams where history matters only for a window — application logs, metrics, click events. But what if what you care about is not history, but current state? If a user changes their email address five times, you don't need all five old addresses — you just need the one they have right now.

That's exactly what log compaction solves. Instead of throwing away old messages by age, compaction throws away superseded values for the same key. It keeps only the most recent message for every unique key, forever. The log becomes a snapshot of current state — a changelog you can replay from the beginning and always arrive at the same answer, no matter when you read it.

Where Compaction Is Used

Three patterns use compaction constantly in production:

Kafka Streams state store changelogs — when a Kafka Streams app builds a stateful aggregation (e.g., running totals per user), it writes its internal state to a compacted changelog topic. If the app crashes and restarts, it replays the changelog from offset 0 and rebuilds the state store. Without compaction, this replay would grow forever — compaction keeps only the latest value per state key, so replay stays fast regardless of how long the app has been running.
Database CDC (Change Data Capture) — a source connector streaming Postgres row changes writes one message per row change, keyed by primary key. After a row is updated 50 times, only the latest version matters for a consumer that wants current state.
Configuration topics — a service configuration topic keyed by service name contains the latest config for each service. Compaction ensures a new instance reading the topic always sees current config, not config from six months ago.

How Compaction Works Mechanically

Kafka splits a partition's log into two regions: the clean portion (already compacted — each key appears at most once) and the dirty portion (new writes, keys may repeat). A background thread called the log cleaner periodically reads the dirty portion, builds an in-memory hash map of key → latest-offset, then copies records into new segments, dropping any record whose key maps to a higher offset. The result replaces the dirty segments with a new compacted segment, and that segment moves into the clean portion.

The SVG shows the key insight: before compaction, keys A, B, and C each appear multiple times (yellow = superseded, green = latest). After compaction runs, only three records remain: one per key, the most recent write. The log shrinks permanently. A consumer that replays from offset 0 will read the latest value for each key in the compacted portion, as if those old writes never happened. The active segment is never compacted, so recent updates to a key can still appear more than once near the head of the log.
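The two-pass cleaning step can be sketched on an in-memory list. This is a simplified model with hypothetical names: it compacts the whole list at once and ignores segments, tombstones, and the cleaner's memory limits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the log cleaner's two passes.
class CompactionSketch {
    static final class Rec {
        final long offset;
        final String key;
        final String value;

        Rec(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + "=" + value + "@" + offset;
        }
    }

    // Input must be in offset order, as a partition log always is.
    static List<Rec> compact(List<Rec> log) {
        // Pass 1: build the key -> latest-offset map.
        Map<String, Long> latestOffset = new HashMap<>();
        for (Rec r : log) {
            latestOffset.put(r.key, r.offset);
        }
        // Pass 2: copy only the records that are still the latest for their key.
        List<Rec> clean = new ArrayList<>();
        for (Rec r : log) {
            if (latestOffset.get(r.key) == r.offset) {
                clean.add(r);
            }
        }
        return clean;
    }

    static List<Rec> sampleLog() {
        return List.of(
            new Rec(0, "A", "a1"), new Rec(1, "B", "b1"), new Rec(2, "A", "a2"),
            new Rec(3, "C", "c1"), new Rec(4, "B", "b2"), new Rec(5, "A", "a3"));
    }

    public static void main(String[] args) {
        System.out.println(compact(sampleLog())); // [C=c1@3, B=b2@4, A=a3@5]
    }
}
```

Surviving records keep their original offsets. Compaction leaves gaps in the offset sequence and never renumbers, which is why consumer offsets stay valid across a cleaning pass.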

Deleting a Key — The Tombstone

Here's a subtle but critical detail: if compaction always keeps the latest message per key, how do you delete a key entirely? You can't just stop writing to it — compaction would keep the last real value forever. The answer is a tombstone: a message with the key set normally but the value set to null. The log cleaner sees a tombstone and knows to remove all records for that key entirely. But it doesn't remove the tombstone immediately — it keeps it for a configurable window (`delete.retention.ms`, default 24 hours) so that downstream consumers have time to see the delete and remove the key from their own state. After that window, the tombstone itself disappears.

The tombstone flow shows why the two-phase delete is necessary. Step 1: producer writes a null value for the key. Step 2: the tombstone sits in the log long enough for all downstream consumers to see it and remove the key from their own local state (e.g., a KTable, a cache, a database). Step 3: after `delete.retention.ms` expires, the log cleaner discards the tombstone itself. Step 4: the key is fully gone. If the tombstone vanished immediately, a slow consumer might miss the delete and keep a stale record in their own store forever.
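The two-phase delete can be sketched by extending a toy cleaner with a retention check for null-valued records. The names are hypothetical, and the timing is simplified: the sketch measures the window from the record's own timestamp, whereas the real cleaner starts the clock around the time the tombstone is first compacted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of tombstone handling.
class TombstoneSketch {
    static final class Rec {
        final String key;
        final String value;     // null marks a tombstone
        final long timestampMs;

        Rec(String key, String value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    static List<Rec> compact(List<Rec> log, long nowMs, long deleteRetentionMs) {
        // Index of the latest record per key (log is in append order).
        Map<String, Integer> latestIndex = new HashMap<>();
        for (int i = 0; i < log.size(); i++) {
            latestIndex.put(log.get(i).key, i);
        }
        List<Rec> out = new ArrayList<>();
        for (int i = 0; i < log.size(); i++) {
            Rec r = log.get(i);
            if (latestIndex.get(r.key) != i) continue; // superseded, including by a tombstone
            boolean tombstone = r.value == null;
            boolean expired = nowMs - r.timestampMs >= deleteRetentionMs;
            if (tombstone && expired) continue;        // phase two: drop the tombstone itself
            out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        long oneDayMs = 86_400_000L; // the delete.retention.ms default
        List<Rec> log = List.of(
            new Rec("user-42", "alice@example.com", 0L),
            new Rec("user-42", null, 1_000L));
        System.out.println(compact(log, 2_000L, oneDayMs).size());            // 1: tombstone kept
        System.out.println(compact(log, 1_000L + oneDayMs, oneDayMs).size()); // 0: key fully gone
    }
}
```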

Compaction vs. deletion — pick based on what matters: history or current state. If you care about events over time (audit log, activity stream), use cleanup.policy=delete with a retention window. If you care about current state per key (user profiles, config, Kafka Streams changelog), use cleanup.policy=compact. If you want bounded history plus compaction (e.g., CDC with last-14-days window), use cleanup.policy=compact,delete.

Log compaction is an alternative retention model that keeps only the latest message per key, making the partition an always-current snapshot of state rather than a bounded time window of events. A background log cleaner thread deduplicates keys by removing superseded records. Deleting a key requires a tombstone — a message with a null value — which is retained for `delete.retention.ms` so slow consumers can see the delete before it disappears. Use `cleanup.policy=compact` for changelog topics, state stores, CDC, and configuration topics.

Section 14

Kafka Streams — Stream Processing on Top of Kafka

Kafka is great for moving data around. But what if you want to transform data as it flows — filter events, aggregate counts, join two streams together, compute running totals? You could consume from a topic, do the computation in your application, and write results to another topic. That works, but it's a lot of plumbing to manage. Kafka Streams is a Java library that turns that plumbing into a first-class programming model.

Here's the critical insight that confuses a lot of engineers first encountering it: Kafka Streams is not a separate cluster. It's not like Spark or Flink, which require their own cluster of worker nodes. Kafka Streams is a plain Java library — you import it into your application like any other dependency. Your application's JVM is the processing node. You can run it on Kubernetes, on EC2, as a plain JAR — anywhere Java runs. Kafka provides the coordination; your application provides the compute.

KStream and KTable — Two Mental Models

Kafka Streams gives you two ways to interpret a topic's data:

KStream — the "events" view. Every record in the topic is an independent event. Reading a stream of clicks means each click is a distinct thing that happened. Appending a new record adds to the stream; old records stay unchanged. Think: a river flowing past you — each drop of water is a separate event.
KTable — the "current state" view. Every record in the topic is an update to a key's current value. If the topic contains user profile updates keyed by user ID, the KTable represents the latest profile for each user. A new record for key U123 replaces the previous record for U123. Think: a spreadsheet where each row is one entity and the latest write wins. Under the hood, a KTable is backed by a compacted topic — exactly what we just learned.

The SVG makes the mental model concrete. A KStream keeps every record — if user U1 clicked three products, you see three records, which is correct for counting clicks. A KTable keeps only the latest record per key — if user U1 updated their email three times, only the final email address is current, which is correct for a user profile lookup. Same Kafka topic, two different interpretations of what "a new record" means.
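The two interpretations can be sketched as two folds over the same list of records. This is plain Java with hypothetical names, not the Kafka Streams API: the stream view counts every record, the table view lets the latest record per key win.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the KStream and KTable readings of one record list.
class StreamTableSketch {
    // KStream reading: every record is an event, so count them all per key.
    static Map<String, Integer> countEvents(List<String[]> records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] r : records) {
            counts.merge(r[0], 1, Integer::sum);
        }
        return counts;
    }

    // KTable reading: every record is an update, so the latest value per key wins.
    static Map<String, String> latestValues(List<String[]> records) {
        Map<String, String> table = new TreeMap<>();
        for (String[] r : records) {
            table.put(r[0], r[1]);
        }
        return table;
    }

    static List<String[]> sampleRecords() {
        return List.of(
            new String[] {"U1", "v1"},
            new String[] {"U1", "v2"},
            new String[] {"U2", "v1"},
            new String[] {"U1", "v3"});
    }

    public static void main(String[] args) {
        System.out.println(countEvents(sampleRecords()));  // {U1=3, U2=1}
        System.out.println(latestValues(sampleRecords())); // {U1=v3, U2=v1}
    }
}
```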

Stateful Operations — Where It Gets Interesting

Stateless operations (filter, map, flatMap) are simple — each record is processed independently. But stateful operations — windowed aggregations, stream-table joins, session windows — require Kafka Streams to remember something between records. Where does it store that state?

Kafka Streams uses RocksDB (an embedded key-value store, developed at Facebook) as a local state store on each application instance. Every stateful operation (e.g., "count events per user in the last 5 minutes") is backed by a RocksDB store on disk. But local disk alone isn't durable: if the instance is rescheduled onto another machine or its disk is lost, the RocksDB store is gone. So Kafka Streams continuously logs every write to the state store into a changelog topic in Kafka (a compacted topic). If the local state is missing on startup, the app replays the changelog from the beginning, rebuilds RocksDB, and is back to exactly the state it had before the crash. If the local store survived, it only needs to catch up from where it left off. You get the speed of local disk with the durability of Kafka.
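The store-plus-changelog pattern can be sketched with a HashMap standing in for RocksDB and a list standing in for the changelog topic. All names are hypothetical; the point is only that replaying the changelog in order reproduces the store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a state store backed by a changelog.
class ChangelogRestoreSketch {
    final Map<String, Long> store = new HashMap<>();                       // stands in for RocksDB
    final List<Map.Entry<String, Long>> changelog = new ArrayList<>();     // stands in for the topic

    // A stateful operation: bump a per-key counter and log the new value.
    void increment(String key) {
        long updated = store.merge(key, 1L, Long::sum);
        changelog.add(Map.entry(key, updated));
    }

    // Recovery: replay the changelog in order; later entries overwrite earlier ones.
    static Map<String, Long> restore(List<Map.Entry<String, Long>> changelog) {
        Map<String, Long> rebuilt = new HashMap<>();
        for (Map.Entry<String, Long> change : changelog) {
            rebuilt.put(change.getKey(), change.getValue());
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        ChangelogRestoreSketch app = new ChangelogRestoreSketch();
        app.increment("u1");
        app.increment("u1");
        app.increment("u2");
        System.out.println(restore(app.changelog).equals(app.store)); // true
    }
}
```

Because only the last entry per key matters during replay, compacting the changelog loses nothing the restore needs.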

A Real Topology: Fraud Detection

Here's a concrete example. A fraud detection service reads from the transactions topic (a KStream) and needs to join each transaction against the current user risk profile (from the user-profiles KTable, backed by a compacted changelog). If a high-risk user triggers a transaction above a threshold, the service writes an alert to the fraud-alerts topic.

The topology is: transactions KStream → stream-table join with user-profiles KTable → filter → fraud-alerts sink topic. The join processor enriches each transaction record with the user's latest risk score from the compacted KTable. The filter drops anything below the fraud threshold. This entire pipeline runs inside a Java application — no separate processing cluster, no YARN, no Spark submission scripts.

Code: A Kafka Streams Topology

Here's what the fraud detection topology looks like in code. Notice the key structural elements: build a StreamsBuilder, attach source topics, transform, and start the application.

    // Dependencies: org.apache.kafka:kafka-streams:3.7.x
    StreamsBuilder builder = new StreamsBuilder();

    // 1. Source stream: one record per transaction event
    KStream<String, Transaction> transactions =
        builder.stream("transactions", Consumed.with(Serdes.String(), transactionSerde));

    // 2. KTable: backed by compacted topic; always-current user profiles
    KTable<String, UserProfile> userProfiles =
        builder.table("user-profiles",
            Consumed.with(Serdes.String(), profileSerde),
            // RocksDB state store named "user-profiles-store"
            Materialized.as("user-profiles-store"));

    // 3. Join: enrich each transaction with the user's current risk profile
    KStream<String, EnrichedTransaction> enriched = transactions
        .join(userProfiles,
            (txn, profile) -> new EnrichedTransaction(txn, profile),
            Joined.with(Serdes.String(), transactionSerde, profileSerde));

    // 4. Filter: only high-risk, high-value transactions
    KStream<String, EnrichedTransaction> alerts = enriched
        .filter((userId, etxn) ->
            etxn.getUserRiskScore() > 80 && etxn.getAmountUsd() > 1_000.0);

    // 5. Sink: write alerts to output topic
    alerts.to("fraud-alerts", Produced.with(Serdes.String(), enrichedTxnSerde));

    // 6. Build and start the topology
    KafkaStreams streams = new KafkaStreams(builder.build(),
        new StreamsConfig(Map.of(
            StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detector-v1",
            StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"
        )));
    streams.start();

Walking through the key lines: builder.stream() subscribes to the transactions topic and treats it as a stream of independent events. builder.table() subscribes to the user-profiles topic and treats it as a snapshot of current state — the Materialized.as("user-profiles-store") names the local RocksDB store that backs it. The .join() call does a stream-table join: for each incoming transaction, Kafka Streams looks up the key in the KTable's local RocksDB store (no network hop — it's on the same JVM). StreamsConfig.APPLICATION_ID_CONFIG uniquely identifies this application for consumer group coordination and changelog topic naming. If this app has 5 instances running in parallel, Kafka automatically assigns partitions across all 5 — the topology scales horizontally by adding more instances.

Why Kafka Streams wins over a separate processing cluster for many use cases. Flink and Spark Streaming are excellent for complex, high-cardinality stateful processing at massive scale. But they add operational overhead: a cluster to manage, a deployment pipeline separate from your application, job submission, and a new failure domain. For the majority of "read topic A, enrich with data from topic B, write to topic C" pipelines, Kafka Streams lives inside your application, deploys with your application, and scales by deploying more instances of your application. No separate cluster. No separate ops team.

Kafka Streams is a Java library (not a separate cluster) for stream processing on top of Kafka. KStream interprets a topic as an append-only event log; KTable interprets it as a map of latest values per key. Stateful operations are backed by local RocksDB stores, with every state change logged to a compacted changelog topic for crash recovery. A stream-table join can enrich live events with current state without any external database calls — the state lives in local storage on the same JVM, rebuilt on startup by replaying the changelog.

Section 15

Kafka Connect — Move Data In and Out Without Writing Code

There's a recurring task in every data-driven engineering team: "we need to get data from X into Kafka (or from Kafka into Y)." Sometimes X is a Postgres database, sometimes it's S3, sometimes it's Elasticsearch. If you had to write a consumer or producer from scratch for every integration, you'd spend most of your time writing and maintaining plumbing code — not building product features.

Kafka Connect solves this with a framework and an ecosystem of pre-built connectors. Instead of code, you provide a JSON configuration file. Connect workers pick up the configuration, instantiate the appropriate connector, and handle the polling, batching, error handling, and offset tracking on your behalf. For the vast majority of "move data between X and Kafka" use cases, you write zero application code.

Source Connectors and Sink Connectors

Connectors come in two flavors:

Source connectors bring data into Kafka from an external system. The most important one is Debezium (open-source, maintained by Red Hat). Most databases keep a low-level change journal that records every write: Postgres calls it the WAL (Write-Ahead Log), MySQL calls it the binlog, and MongoDB exposes it through change streams (backed by its oplog). Debezium reads that journal directly and streams every insert, update, and delete into Kafka as an event. This whole technique, streaming a database's own change journal, is called CDC (Change Data Capture). Instead of polling a table with SELECT * WHERE updated_at > ? every minute (slow, misses deletes, heavy on the DB), Debezium hooks into the database's own change feed at the wire level.
Sink connectors push data out of Kafka into an external system. Common ones: an Elasticsearch sink (for full-text search indexing), an S3 sink (for storing event archives in object storage), a JDBC sink (for writing back to a relational database), a BigQuery sink (for analytics).

The architecture SVG shows how Connect sits between external systems and Kafka. Debezium (source connector) reads Postgres's WAL and writes CDC events to topic-per-table. S3 and Elasticsearch sink connectors consume those topics and write to their respective stores. Every connector has a `tasks.max` configuration — how many parallel task threads to run. The Connect workers themselves run as a distributed cluster, and tasks are rebalanced across workers automatically if one goes down.

The Debezium CDC Pipeline — A Walk-Through

The CDC pipeline: Postgres writes every row change to its WAL (this happens regardless of Debezium — it's Postgres's own durability mechanism). Debezium opens a replication slot on Postgres and reads that WAL stream, translating each row-level change into a structured event. Events land in a Kafka topic named after the table (e.g., db.public.orders), keyed by the table's primary key. From there, multiple downstream consumers subscribe independently: an S3 sink archives the raw events, a search service re-indexes changed records, a fraud detector reacts to suspicious patterns — all without placing any additional load on Postgres beyond the replication slot.

Deploying a Debezium Connector via REST

Kafka Connect exposes a REST API. You deploy a connector by POSTing a JSON config. Here's what a Debezium Postgres connector configuration looks like:

    {
      "name": "orders-debezium-source",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "${file:/opt/kafka/external.properties:db.password}",
        "database.dbname": "ecommerce",
        "database.server.name": "db",
        "table.include.list": "public.orders,public.users",
        "plugin.name": "pgoutput",
        "slot.name": "debezium_slot",
        "publication.name": "debezium_pub",
        "topic.prefix": "db",
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081"
      }
    }

Key fields to understand: connector.class selects the Debezium Postgres plugin. tasks.max=1 is intentional — a single Postgres replication slot can only be consumed by one thread; setting this higher would be ignored for Postgres CDC (unlike other connectors where parallelism helps). table.include.list filters which tables to stream. plugin.name=pgoutput uses Postgres's native logical replication protocol (available since Postgres 10, no extra plugins needed). The key.converter and value.converter lines tell Connect to serialize records as Avro and register schemas with the Schema Registry — the topic of the next section.

Kafka Connect is a framework and ecosystem of pre-built connectors that move data between Kafka and external systems without custom application code. Source connectors bring data into Kafka (Debezium for CDC via Postgres WAL / MySQL binlog / MongoDB change streams); sink connectors push data out (S3, Elasticsearch, JDBC, BigQuery). Connectors are deployed declaratively via JSON config over a REST API. Debezium is the de-facto standard CDC connector — it reads the database's own replication stream, adds zero query load, and captures inserts, updates, and deletes at the wire level.

Section 16

Schema Registry & Data Contracts — Preventing the Schema Drift Disaster

Imagine 50 producer services writing to a user-events topic, and 80 consumer services reading from it. One producer team adds a new field — referral_code — to the event. They forget to tell anyone. The next morning, 80 consumer services throw deserialization exceptions and crash. Your on-call engineer spends six hours tracking down which producer changed the schema. This isn't a hypothetical. It happens in every organization that grows past ~10 services sharing Kafka topics without contracts.

A Schema Registry solves this by acting as a central contract system for your messages. Before publishing, each producer registers a description of the message shape (a "schema") written in one of three supported formats: Avro and Protobuf are the two most common, JSON Schema is the third. The registry assigns a schema ID (a small integer). The producer prefixes every message with that ID (4 bytes, after a 1-byte magic byte). Consumers fetch the schema by ID and deserialize correctly, even if they're running an older version of the code. Crucially, the registry enforces compatibility rules at registration time, so a producer that tries to publish an incompatible schema change gets rejected before any consumer sees broken data.

The Three Compatibility Modes

Understanding compatibility rules is the key to getting Schema Registry right. Think about compatibility from the perspective of: who can still read this data after the change?

BACKWARD compatibility: new schema reads old data. Consumers upgraded to the new schema can read messages written with the previous schema. Safe changes: add optional fields with defaults, remove fields. Unsafe: change a field type, rename a field without an alias, add a required field without a default. This is the most common mode and the registry's default; you upgrade consumers first, then producers.
FORWARD compatibility: old schema reads new data. Consumers still running the previous schema can read messages written with the new schema. Safe: add fields, remove optional fields (those with defaults). This is the mode to use when you deploy new producers before updating all consumers, so old consumers must keep working.
FULL compatibility: both BACKWARD and FORWARD simultaneously. The most restrictive: only add or remove optional fields with defaults. Use this when you need zero downtime rolling upgrades where producers and consumers may be on different versions at the same time.

The flow: (1) the producer registers its schema with the registry before its first publish. The registry checks the compatibility rule — if the schema is BACKWARD compatible with the previous version, it returns a schema ID (e.g., 42). If not, it rejects the registration with a 409 error. (2) The producer receives the schema ID. (3) Every message sent to Kafka has a 5-byte header: magic byte 0 (Confluent wire format) + 4-byte schema ID. The actual payload is compact Avro-encoded binary — no field names embedded, just values. (4) A consumer reads a message, extracts the schema ID from the header, and looks it up in the registry (results are cached locally after the first fetch — no per-message round-trip). The consumer uses the fetched schema to deserialize the bytes correctly, even if the producer used a newer schema version than the consumer's local model.
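The 5-byte header from step (3) can be sketched with java.nio.ByteBuffer. The class and method names are hypothetical, and the payload is treated as opaque bytes: real serializers produce it with an Avro, Protobuf, or JSON Schema encoder.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of the magic-byte-plus-schema-ID framing.
class WireFormatSketch {
    private static final byte MAGIC_BYTE = 0;
    private static final int HEADER_SIZE = 5; // 1 magic byte + 4-byte schema ID

    // Producer side: prepend the header to an already-encoded payload.
    static byte[] frame(int schemaId, byte[] payload) {
        return ByteBuffer.allocate(HEADER_SIZE + payload.length)
            .put(MAGIC_BYTE)
            .putInt(schemaId) // ByteBuffer writes big-endian by default
            .put(payload)
            .array();
    }

    // Consumer side: read the schema ID so the schema can be looked up (and cached).
    static int schemaId(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getInt();
    }

    // Consumer side: the bytes after the header are the encoded record.
    static byte[] payload(byte[] message) {
        return Arrays.copyOfRange(message, HEADER_SIZE, message.length);
    }

    public static void main(String[] args) {
        byte[] message = frame(42, new byte[] {1, 2, 3});
        System.out.println(Arrays.toString(message)); // [0, 0, 0, 0, 42, 1, 2, 3]
        System.out.println(schemaId(message));        // 42
    }
}
```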

The compatibility table makes the trade-offs explicit. BACKWARD (green YES in the left column) is the most common default: consumers upgraded to the new schema can still read data written with the previous one, so you upgrade consumers first. FORWARD is the mirror image: consumers still on the previous schema can read data written with the new one, so producers can be upgraded first. FULL requires both simultaneously, which is the right choice for zero-downtime rolling deployments but restricts you to only adding or removing optional fields with defaults; any other change is rejected.
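The core of the three modes can be sketched with one question: can a reader on schema R decode data written with schema W? The model below is a deliberate simplification with hypothetical names. A schema is reduced to a map from field name to "has a default", and type changes, aliases, and nested records are ignored. It follows the registry's definitions: BACKWARD means the new schema can read data written with the previous one, FORWARD means the previous schema can read data written with the new one.

```java
import java.util.Map;

// Hypothetical, heavily simplified compatibility check.
class CompatibilitySketch {
    // A schema is modeled as: field name -> whether the field has a default value.
    static boolean canRead(Map<String, Boolean> reader, Map<String, Boolean> writer) {
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean missingFromData = !writer.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (missingFromData && !hasDefault) {
                return false; // the reader has nothing to fill this field with
            }
        }
        return true; // fields only the writer knows about are skipped by the reader
    }

    static boolean backward(Map<String, Boolean> newSchema, Map<String, Boolean> oldSchema) {
        return canRead(newSchema, oldSchema);
    }

    static boolean forward(Map<String, Boolean> newSchema, Map<String, Boolean> oldSchema) {
        return canRead(oldSchema, newSchema);
    }

    static boolean full(Map<String, Boolean> newSchema, Map<String, Boolean> oldSchema) {
        return backward(newSchema, oldSchema) && forward(newSchema, oldSchema);
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = Map.of("order_id", false, "amount_usd", false);
        Map<String, Boolean> optionalAdded =
            Map.of("order_id", false, "amount_usd", false, "referral_code", true);
        Map<String, Boolean> requiredAdded =
            Map.of("order_id", false, "amount_usd", false, "region", false);
        System.out.println(full(optionalAdded, v1));     // true
        System.out.println(backward(requiredAdded, v1)); // false
        System.out.println(forward(requiredAdded, v1));  // true
    }
}
```

Adding an optional field with a default passes both directions, which is why it is the standard safe change. Adding a required field fails BACKWARD because a reader on the new schema has no value for it when it meets old data.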

Checking Compatibility Before Deploying

The Schema Registry REST API lets you test compatibility before you push a new schema version. This is something you'd put into your CI pipeline:

    # Test if a new schema version is compatible BEFORE registering it
    # Returns {"is_compatible": true} or {"is_compatible": false, "messages": [...]}
    curl -X POST \
      -H "Content-Type: application/vnd.schemaregistry.v1+json" \
      --data '{ "schema": "{\"type\":\"record\",\"name\":\"OrderEvent\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"amount_usd\",\"type\":\"double\"},{\"name\":\"referral_code\",\"type\":[\"null\",\"string\"],\"default\":null}]}" }' \
      http://schema-registry:8081/compatibility/subjects/order-events-value/versions/latest

    # Register the schema (after compatibility check passes)
    curl -X POST \
      -H "Content-Type: application/vnd.schemaregistry.v1+json" \
      --data '{ "schema": "..." }' \
      http://schema-registry:8081/subjects/order-events-value/versions
    # Response: {"id":43}  <- new schema ID

The first call hits the /compatibility/ endpoint: it validates the new schema against the latest registered version without actually registering it. If your CI pipeline runs this check before merging a schema change, incompatible changes are caught before they reach production. The new field referral_code is defined as a union type ["null", "string"] with a default of null. That makes it optional with a default, which is BACKWARD compatible (a consumer on the new schema fills in the default when it reads older data) and also FORWARD compatible (a consumer on the old schema simply ignores the extra field).

Schema Registry implementations: Confluent Schema Registry (originally Apache 2.0; relicensed under the Confluent Community License around the 5.0 era, still source-available and free for most uses, just not OSI-OSS) is the most widely used. Apicurio Registry (Red Hat, fully Apache 2.0) is a popular open-source alternative that can be configured for compatibility with Confluent's API and wire format. AWS Glue Schema Registry is the managed option on AWS. All three support Avro, Protobuf, and JSON Schema. The 5-byte header described above is Confluent's wire format; Glue's serializers use their own header layout, so check the framing before mixing serializers from different registries on one topic.

Schema Registry prevents schema drift by acting as a central contract system for Kafka messages. Producers register schemas (Avro, Protobuf, or JSON Schema) and get back a schema ID, which they embed in a 5-byte header (one magic byte plus the 4-byte ID) at the start of every message. Consumers fetch schemas by ID for deserialization. Compatibility rules are enforced at registration time, rejecting breaking changes before any consumer sees broken data: BACKWARD (consumers on the new schema can read data written with the old one), FORWARD (consumers on the old schema can read data written with the new one), or FULL (both). A CI-integrated compatibility check endpoint lets teams catch schema incompatibilities before deployment.
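The header that carries the schema ID is small enough to parse by hand. The following is a minimal Java sketch of the Confluent-style framing (a zero magic byte followed by a 4-byte big-endian schema ID). The class and method names are hypothetical, and real serializers do this for you; the sketch only shows what is on the wire.

```java
import java.nio.ByteBuffer;

/** Sketch: read and write the 5-byte header (magic byte + 4-byte schema ID). */
final class WireHeader {
    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_SIZE = 5;

    /** Returns the schema ID, or throws if the header is missing or malformed. */
    static int schemaId(byte[] message) {
        if (message == null || message.length < HEADER_SIZE) {
            throw new IllegalArgumentException("message shorter than 5-byte header");
        }
        ByteBuffer buf = ByteBuffer.wrap(message); // ByteBuffer is big-endian by default
        byte magic = buf.get();
        if (magic != MAGIC_BYTE) {
            throw new IllegalArgumentException("unexpected magic byte: " + magic);
        }
        return buf.getInt();
    }

    /** Builds a framed message: header followed by the already-encoded payload. */
    static byte[] frame(int schemaId, byte[] payload) {
        return ByteBuffer.allocate(HEADER_SIZE + payload.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId)
                .put(payload)
                .array();
    }
}
```

A consumer would call schemaId() on the raw bytes, look the schema up in the registry (usually from a local cache), and decode the remaining bytes with it.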

Section 17

Pitfalls & Anti-Patterns — What Breaks Kafka Clusters in Production

Most Kafka problems don't come from misunderstanding the fundamentals — they come from misconfiguration. These are the seven mistakes that appear most often in production incidents, post-mortems, and support tickets. Each one has a clear cause, a clear consequence, and a clear fix.

Mistake: Creating a topic with a small partition count (e.g., 3) expecting to scale later by adding consumers.

Why it's bad: Kafka's consumer group model assigns at most one consumer per partition. If your topic has 3 partitions and you deploy 10 consumer instances, 7 of them sit idle with no partition assigned. You've paid for 10 pods and get 3× throughput. Adding more consumers does nothing until you increase partition count — and you cannot decrease partition count without recreating the topic.

How to spot it: Consumer lag grows despite high consumer CPU and memory headroom. Adding more consumer pods has no effect. kafka-consumer-groups.sh --describe shows many consumers with no assigned partitions.

Fix: Plan partition count upfront. A common rule: target the maximum parallelism you might need (e.g., if you might ever scale to 50 consumer pods, start with 50+ partitions). Use bytes-per-second math: if you expect 500 MB/s throughput and each partition handles ~50 MB/s, you need at least 10 partitions for throughput. You can increase partition count later with kafka-topics.sh --alter --partitions, but note: increasing partitions reshuffles key-to-partition mapping, which breaks ordering guarantees for keyed messages.
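To see why adding partitions disturbs keyed ordering, here is a minimal Java sketch. It uses String.hashCode() as a stand-in for the hash (Kafka's default partitioner applies murmur2 to the serialized key bytes), and the class and method names are hypothetical. The modulo step is the part that matters: the same key can land on a different partition once the partition count changes.

```java
import java.util.List;

/** Sketch: how hash(key) % partitionCount changes when the partition count grows. */
final class KeyRemapping {
    /** Stand-in hash; Kafka's default partitioner uses murmur2 over the serialized key. */
    static int partitionFor(String key, int partitionCount) {
        return (key.hashCode() & 0x7fffffff) % partitionCount;
    }

    /** Counts keys whose partition differs between the old and new partition counts. */
    static long movedKeys(List<String> keys, int oldCount, int newCount) {
        return keys.stream()
                .filter(k -> partitionFor(k, oldCount) != partitionFor(k, newCount))
                .count();
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c", "d", "e", "f");
        System.out.println(movedKeys(keys, 3, 6) + " of " + keys.size() + " keys moved");
    }
}
```

Any key that moves has its older messages on one partition and its newer messages on another, so a consumer can no longer rely on seeing that key's events in order.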

Mistake: Creating every topic with hundreds or thousands of partitions "just in case."

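Why it's bad: Each partition replica is a directory of log segment files on a broker. 1,000 topics × 100 partitions × 3 replicas = 300,000 partition replicas across the cluster, and every replica keeps several files open per segment (the log plus its offset and time indexes). Default per-process file descriptor limits on Linux are far lower than that (the soft limit is often just 1,024), so brokers fail with "Too many open files" unless the limit is raised well above the default. Beyond file descriptors: when a broker goes down, the controller must elect a new leader for every partition that broker led. With 100,000 partitions in the cluster, a single broker failure can trigger thousands to tens of thousands of leader elections. In pre-KRaft Kafka (ZooKeeper mode), these were processed largely one after another, and failover could take tens of minutes. Even with KRaft, very high partition counts slow down controller failover and increase metadata replication overhead.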

How to spot it: Cluster restarts take an unusually long time. Broker startup logs show "Loading metadata for N partitions" taking minutes. File descriptor errors in broker logs. Controller CPU spikes on broker failures.

Fix: Right-size partitions per topic. A well-run cluster with hundreds of topics and 20-50 partitions each scales to millions of messages/second without these symptoms. Resist the urge to pre-provision 10× partitions — just plan for the realistic maximum.

Mistake: Running acks=all on producers but leaving min.insync.replicas at the default of 1.

Why it's bad: acks=all means the broker waits for all ISR members to acknowledge before responding to the producer. But ISR can shrink — if the follower replicas fall behind (say, a slow disk, GC pause, or network hiccup), they drop out of the ISR. If ISR = {leader only} and min.insync.replicas=1, then acks=all is effectively satisfied by just the leader. The leader could crash immediately after acknowledging, before any follower received the data — and the message is lost. You set acks=all thinking you had durability; you actually had a false sense of safety.

How to spot it: ISR size dropping to 1 (check kafka.server:type=ReplicaManager,name=IsrShrinksPerSec metric). Data loss discovered only after a broker failure post-mortem.

Fix: For financial, audit, or irreplaceable event topics: set min.insync.replicas=2 (with replication.factor=3). This means ISR must contain at least 2 replicas before a produce succeeds. Combined with acks=all, you now have true multi-replica durability. Producers will get a NotEnoughReplicasException if ISR shrinks below 2 — which surfaces the problem immediately instead of silently losing data. Accept the availability trade-off: if 2 of 3 replicas are unavailable, the partition becomes read-only temporarily. For most durable-data use cases, this is correct behavior.
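The interaction between acks and min.insync.replicas can be captured in a few lines. The following is a simplified Java model, not broker code: the class name, method name, and the -1 "rejected" return convention are assumptions made for illustration.

```java
/** Sketch: how many replicas are guaranteed to hold a record when it is acknowledged. */
final class AckModel {
    /**
     * acks is 0, 1, or -1 (the numeric form of "all").
     * Returns the number of replicas guaranteed to hold the record at ack time,
     * or -1 if the broker rejects the produce request.
     */
    static int replicasAtAck(int acks, int isrSize, int minInsyncReplicas) {
        if (acks == 0) return 0; // fire-and-forget: nothing is guaranteed
        if (acks == 1) return 1; // leader only; min.insync.replicas is not consulted
        if (isrSize < minInsyncReplicas) return -1; // producer sees NotEnoughReplicasException
        return isrSize;          // every current ISR member has the record
    }
}
```

With the ISR shrunk to the leader alone, replicasAtAck(-1, 1, 1) returns 1, which is the false safety net described above. The same call with min.insync.replicas=2 returns -1, so the producer finds out immediately.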

Mistake: Setting unclean.leader.election.enable=true on topics that contain important data.

Why it's bad: Normally, a new leader can only be elected from ISR — replicas that are fully caught up. If all ISR members are dead and a lagging out-of-sync replica is the only broker alive, Kafka will refuse to elect it as leader by default. This is correct: the out-of-sync replica is missing some committed messages. Promoting it would cause those messages to disappear. With unclean.leader.election.enable=true, Kafka will promote the lagging replica rather than making the partition unavailable. You get availability at the cost of confirmed message loss — and the loss is permanent and silent. The messages that the dead ISR members acknowledged but the surviving replica never received are gone.

How to spot it: Consumer offsets jump backward after a leader change. Messages that were produced with acks=all are missing. Kafka broker logs show "Unclean leader election" events.

Fix: Leave unclean.leader.election.enable=false (default since Kafka 0.11.0). Accept that if all ISR replicas die, the partition is unavailable until at least one replica recovers — this is the safe behavior. Only enable unclean elections on explicitly throwaway or lossy topics (e.g., real-time metrics you're willing to lose during a disaster) and document that decision explicitly.

Mistake: Consumers that do slow processing (e.g., external HTTP calls, database writes, ML inference) inside the poll loop without tuning session timeout and max.poll.interval.ms.

Why it's bad: Consumers prove liveness in two ways. A background thread sends heartbeats to the group coordinator, and the application must call poll() at least once every max.poll.interval.ms (default: 5 minutes). If your consumer takes longer than that between two consecutive poll() calls, it is treated as failed and removed from the group, which triggers a rebalance. During an eager rebalance, all partitions are reassigned and every consumer in the group pauses consumption until the rebalance completes. If the slow processing consistently exceeds the interval, you enter a rebalance loop: the consumer is perpetually being kicked out and rejoining, never making forward progress, and creating churn that blocks the entire group.

How to spot it: Consumer lag grows but consumer CPU is busy. Kafka broker logs show frequent "LeaveGroup" and "JoinGroup" for the same consumer ID. kafka-consumer-groups.sh shows the consumer cycling through states.

Fix: Either (a) reduce processing time per message (batch DB writes, use async HTTP calls), (b) increase max.poll.interval.ms to match your worst-case processing time, or (c) decouple fetching from processing: poll messages into a bounded in-memory queue and process them in a separate thread pool, calling consumer.pause() on the assigned partitions while the queue is full. Option (c) is the most robust for genuinely slow processing, at the cost of more careful offset management.
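For options (a) and (b), it helps to work out how many records one poll() can safely return. The Java sketch below does that arithmetic. The class name, method names, and the safety factor are assumptions chosen for illustration, not Kafka settings.

```java
/** Sketch: size max.poll.records so one batch finishes inside max.poll.interval.ms. */
final class PollBudget {
    /**
     * Largest batch that still fits in the poll interval.
     * safetyFactor is the fraction of the interval processing may use (for example 0.5).
     */
    static int maxPollRecords(long maxPollIntervalMs, long worstCasePerRecordMs, double safetyFactor) {
        if (worstCasePerRecordMs <= 0) {
            throw new IllegalArgumentException("per-record time must be positive");
        }
        long budgetMs = (long) (maxPollIntervalMs * safetyFactor);
        return (int) Math.max(1, budgetMs / worstCasePerRecordMs);
    }

    /** Worst-case time to process one full batch sequentially. */
    static long batchTimeMs(int records, long worstCasePerRecordMs) {
        return records * worstCasePerRecordMs;
    }
}
```

With the default 5-minute interval and a 3-second worst case per record, maxPollRecords(300000, 3000, 0.5) allows 50 records per poll, whereas the default batch of 500 would need batchTimeMs(500, 3000) = 1,500,000 ms, or 25 minutes.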

Mistake: Running a non-idempotent producer (the default before Kafka 3.0) with retries > 0 and assuming at-least-once delivery is good enough, without realising duplicate processing has downstream costs.

Why it's bad: When a producer sends a batch and the network fails after the broker received the data but before the acknowledgment reaches the producer, the producer retries. Without idempotence, the broker has no way to know this is a retry — it appends the data again, creating a duplicate. In a financial ledger, an order processing pipeline, or any system where "wrote this message once" matters, duplicates silently cause real-world harm: double-charged customers, inflated counts, downstream deduplication complexity.

How to spot it: Downstream consumers see identical messages (same content, different offsets). Counting metrics show values slightly higher than expected. Only detectable by comparing expected vs. actual counts — not surfaced by any Kafka metric directly.

Fix: Set enable.idempotence=true on the producer. It has been the default since Kafka 3.0, together with acks=all, but explicitly setting a conflicting option such as acks=1 can turn it off, so it's worth being explicit. Idempotence assigns each producer instance a producer ID (PID) and a monotonically increasing sequence number per partition, and the broker uses them to discard retried batches it has already appended. This covers retries within a single producer session; a restarted producer gets a new PID. Combine with EOS transactions (covered in S12) for cross-partition atomicity. Cost: negligible for most workloads.
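The deduplication idea can be sketched in Java as follows. This is a deliberately simplified model with hypothetical names: it tracks only the last sequence number per (producer ID, partition), whereas a real broker remembers several recent batches and also rejects gaps in the sequence.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: broker-side duplicate detection keyed by (producer ID, partition). */
final class IdempotentLog {
    private final Map<String, Integer> lastSeq = new HashMap<>();
    private int appended = 0;

    /** Appends a batch unless its sequence number was already seen; returns true if appended. */
    boolean append(long producerId, int partition, int sequence) {
        String key = producerId + "-" + partition;
        Integer last = lastSeq.get(key);
        if (last != null && sequence <= last) {
            return false; // retry of a batch that is already in the log
        }
        lastSeq.put(key, sequence);
        appended++;
        return true;
    }

    /** Number of batches actually written to the log. */
    int size() {
        return appended;
    }
}
```

A retry after a lost acknowledgment reuses the original sequence number, so the second append() returns false and the log still contains the batch exactly once.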

Mistake: Writing a tool or monitoring script that reads from the internal __consumer_offsets topic directly to track consumer lag.

Why it's bad: __consumer_offsets is an internal compacted topic that stores consumer group offset commits and group metadata. Its encoding format is a private, version-dependent binary protocol — not a public API. It has changed between Kafka versions multiple times and will change again. Reading it requires reimplementing Kafka's internal offset codec. Every Kafka upgrade risks breaking your tool silently. More importantly, you're reinventing infrastructure that already exists: the kafka.consumer:type=consumer-fetch-manager-metrics,client-id=...,name=records-lag-max JMX metric, the AdminClient listConsumerGroupOffsets() API, and kafka-consumer-groups.sh --describe are the correct, stable, supported ways to track consumer lag.

How to spot it: Custom lag monitoring scripts that parse raw bytes from __consumer_offsets. Lag dashboards that break after a Kafka version upgrade.

Fix: Use the supported AdminClient API for programmatic offset tracking. Use Kafka-provided CLI tools for ad-hoc inspection. For production monitoring, use the consumer lag metrics exposed via JMX/Prometheus, or a maintained tool such as Burrow (LinkedIn's open-source Kafka consumer lag monitor). Burrow does consume __consumer_offsets itself, but keeping up with format changes is then its maintainers' job rather than yours.
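Once the offsets come from a supported API, the lag calculation itself is simple. The Java sketch below assumes two maps keyed by partition number have already been fetched (for example, committed offsets from AdminClient listConsumerGroupOffsets() and log-end offsets from the AdminClient listOffsets() call). The class and method names are hypothetical, and treating an uncommitted partition as starting from offset 0 is a simplification.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: per-partition lag = log-end offset minus committed offset. */
final class LagCalculator {
    static Map<Integer, Long> lag(Map<Integer, Long> committed, Map<Integer, Long> logEnd) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : logEnd.entrySet()) {
            // Simplification: a partition with no commit yet is counted from offset 0.
            long committedOffset = committed.getOrDefault(e.getKey(), 0L);
            result.put(e.getKey(), Math.max(0L, e.getValue() - committedOffset));
        }
        return result;
    }

    /** Sum of lag across all partitions of the topic. */
    static long totalLag(Map<Integer, Long> committed, Map<Integer, Long> logEnd) {
        return lag(committed, logEnd).values().stream().mapToLong(Long::longValue).sum();
    }
}
```

Because the inputs are plain offsets, this code keeps working across Kafka upgrades; only the supported client calls that fetch the offsets touch the cluster.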

The seven most common Kafka production pitfalls: under-partitioning prevents consumer scaling; over-partitioning causes file descriptor exhaustion and slow controller failover; omitting min.insync.replicas creates silent data loss behind a false safety net; enabling unclean leader election trades permanent data loss for availability; long consumer processing loops trigger rebalances that stall the entire group; non-idempotent producers create duplicates on retries; and reading __consumer_offsets directly builds fragile tools that break on every Kafka upgrade. Each has a clear fix that trades no significant performance for meaningful reliability.

Section 18

Practice Exercises — Build Your Intuition

Reading about Kafka is necessary but not sufficient. These five exercises force you to apply the concepts under constraints — the same way interview questions and on-call incidents will. Work through each one before expanding the solution. The goal isn't the right answer; it's building the reasoning path that gets you to a defensible design.

You're designing a user-events topic for a gaming platform. Current throughput is 50,000 msg/s. The team expects 10× growth over 18 months, reaching ~500,000 msg/s at peak. The average message size is 512 bytes. Your Kafka brokers can sustain approximately 50 MB/s write throughput per partition under production conditions (accounts for replication overhead, fsync, etc.).

Questions to answer:

How many partitions do you need at current throughput? At peak throughput?
What's the minimum partition count you should create the topic with today, and why?
What ordering constraint, if any, would affect your partition count decision?

Think about two separate constraints: throughput (bytes/second math) and consumer parallelism (how many consumer instances might you run at peak?). Also think about what happens to key-to-partition mapping when you increase partition count.

Solution:

Throughput math: at peak, 512 bytes × 500,000 msg/s = 256 MB/s. At 50 MB/s per partition: 256 / 50 = 5.12, so 6 partitions minimum for throughput. At today's 50,000 msg/s the figure is 25.6 MB/s, which fits in a single partition. That's surprisingly few: Kafka is efficient.

Consumer parallelism: If you're running a consumer group for downstream processing and might scale to 50 pods at peak load, you need at least 50 partitions — each pod gets one partition. Consumer parallelism, not throughput, usually drives partition count for most workloads.

Create with 100 partitions today. Why 100 and not 50? Because you cannot decrease partition count. If growth exceeds 10×, you'd need to increase, but increasing partitions changes which partition a given key hashes to, breaking per-key ordering guarantees. Creating with 2× your expected peak parallelism provides headroom for growth without a reshuffle. The overhead of the 50 extra partitions is negligible.

Ordering: If per-user event ordering matters (e.g., events from the same user must be processed in sequence), you must key messages by user ID. With 100 partitions, all events for a user hash to the same partition — ordering is preserved within that partition. Changing partition count later would scatter a user's events across different partitions, breaking ordering. Lock in your partition count early and don't change it for keyed topics.
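The sizing steps in this solution can be written down as a short Java sketch. The class and method names are hypothetical; the numbers in main() are the ones from the exercise (512-byte messages, 50 MB/s per partition, 50 consumer pods at peak, 2× headroom).

```java
/** Sketch: partition count from throughput and consumer-parallelism constraints. */
final class PartitionSizing {
    /** Partitions needed to carry the byte rate, rounded up. */
    static int forThroughput(long msgPerSec, int avgMsgBytes, long bytesPerSecPerPartition) {
        long bytesPerSec = msgPerSec * avgMsgBytes;
        return (int) ((bytesPerSec + bytesPerSecPerPartition - 1) / bytesPerSecPerPartition);
    }

    /** Takes the larger of the two constraints and applies a headroom multiplier. */
    static int recommended(int throughputPartitions, int peakConsumers, int headroomFactor) {
        return Math.max(throughputPartitions, peakConsumers) * headroomFactor;
    }

    public static void main(String[] args) {
        int current = forThroughput(50_000, 512, 50_000_000L);
        int peak = forThroughput(500_000, 512, 50_000_000L);
        System.out.println("current=" + current + " peak=" + peak
                + " recommended=" + recommended(peak, 50, 2));
    }
}
```

The throughput constraint yields 1 partition today and 6 at peak, so the consumer-parallelism constraint (50) dominates and the headroom factor brings the recommendation to 100.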

Your team maintains two Kafka topics:

financial-transactions — payment confirmations, refunds, fraud flags. Losing a single message would require manual reconciliation and could result in regulatory violations.
click-tracking — every button click on the web app. Used for analytics and A/B test metrics. 50,000 events/second. Losing 0.1% of events is acceptable. Latency on every produce call matters for user experience.

For each topic, choose: acks setting (0, 1, or all), min.insync.replicas, and replication.factor. Justify each choice.

The trade-off is: higher acks = more latency per produce call, more durability. Lower acks = faster, weaker guarantee. Map the business requirement to the technical knob.

Solution:

financial-transactions: acks=all, min.insync.replicas=2, replication.factor=3. Rationale: losing data has regulatory and financial consequences — you accept 2–5ms extra latency per produce call. With RF=3 and min.ISR=2, you tolerate one replica failure without impacting producers. The leader must wait for at least one follower to confirm before acknowledging — the message survives a single broker crash.

click-tracking: acks=1, min.insync.replicas=1 (or leave at default), replication.factor=2. Rationale: leader-only acknowledgment cuts produce latency significantly. At 50,000 msg/s, an extra 2ms round-trip per message would add up to 100 seconds of cumulative wait per second of real time if every send paid it individually. Batching and pipelining hide much of that, but the extra latency still reaches any caller that waits on the send. RF=2 saves storage and reduces replication overhead while still surviving single broker failures for most events. The 0.1% loss tolerance covers the window between leader crash and follower promotion. Lower replication factor also reduces cost for a high-volume, low-value topic.

A team reports the following behavior: their consumer application processes messages correctly, but the same messages appear to be re-processed every 30 minutes — orders get processed twice, emails sent twice, database records duplicated. The consumer uses manual offset commits. Describe what is likely happening and how to diagnose and fix it.

Think about what "re-processing" means in terms of offsets. What determines where a consumer starts reading after a restart or rebalance?

Solution:

Root cause candidates, in order of likelihood:

1. Offsets are not being committed (most likely): With manual commit (enable.auto.commit=false), offsets are only persisted if the application explicitly calls consumer.commitSync() or consumer.commitAsync(). If the application crashes, is restarted, or a rebalance triggers, the consumer resumes from the last committed offset. If commits aren't happening (bug in commit call path, exception before commit, commit called at the wrong time), the consumer restarts from wherever it last committed, possibly thousands of messages back. Every restart re-processes those messages.

Diagnosis: Run kafka-consumer-groups.sh --bootstrap-server kafka:9092 --group <group> --describe and watch the LAG column over 30-minute intervals. If LAG suddenly jumps up every ~30 minutes and then drains, that's a rebalance or restart triggering re-processing from an old committed offset. Also check: does the application restart on a schedule (e.g., a cron job, nightly deployment, container OOM kill every 30 min)?

2. Poll timeout causing a rebalance loop: If processing takes longer than max.poll.interval.ms, the consumer is evicted and rejoins, re-consuming the in-flight batch because its offsets were never committed. The 30-minute interval is suspicious: check whether max.poll.interval.ms has been raised to 1800000 (30 minutes; the default is 300000, i.e. 5 minutes).

Fix: Audit the commit call path: add logging around every commitSync/commitAsync call and confirm it's being reached. Commit offsets immediately after successful processing (and ideally after idempotent downstream writes). If rebalances are the cause, either reduce processing time or use a cooperative rebalance strategy (set partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor) to avoid full group pauses.
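The first root cause is easy to reproduce in a toy model. The Java sketch below is not a Kafka client; it is a hypothetical class that only tracks two numbers, the consumer's in-session position and its last committed offset, to show how many records are replayed after a restart or rebalance.

```java
/** Sketch: a consumer resumes from the last committed offset, not the last processed one. */
final class CommitModel {
    private long committed = 0; // next offset to read after a restart
    private long position = 0;  // next offset to read in the current session

    /** Processes n records; commits afterwards only if commitAfter is true. */
    void processBatch(int n, boolean commitAfter) {
        position += n;
        if (commitAfter) {
            committed = position;
        }
    }

    /** Simulates a restart or rebalance; returns how many records will be re-processed. */
    long restart() {
        long replayed = position - committed;
        position = committed;
        return replayed;
    }
}
```

If a code path skips the commit for 300 records, restart() reports 300 replayed records. That is the sudden jump in the LAG column described in the diagnosis above.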

A team runs a Postgres database with three tables: orders, order_items, and customers. They want to archive all row-level changes to S3 in Parquet format for analytics. Requirements: (a) capture inserts, updates, AND deletes; (b) data in S3 within 5 minutes of the change; (c) no additional load on the Postgres primary; (d) schema changes in Postgres should not break the pipeline.

Describe the full architecture: which connectors, how they're configured, what Kafka topics are created, and how requirement (d) is handled.

Think about each requirement separately. (a) points to CDC vs. polling. (b) constrains the S3 sink's flush interval. (c) rules out query-based approaches. (d) points to Schema Registry.

Solution:

Source: Debezium Postgres connector (not the JDBC source connector: query-based polling misses deletes). Configure with plugin.name=pgoutput, table.include.list=public.orders,public.order_items,public.customers. With a topic prefix of db (topic.prefix in Debezium 2.x, database.server.name in earlier versions), this creates three Kafka topics: db.public.orders, db.public.order_items, db.public.customers. Each message carries the before and after state of the row, plus the operation type (c=create, u=update, d=delete). Debezium reads the WAL through a logical replication slot, so it issues no polling queries against the primary (requirement c); the one exception is the optional initial snapshot. Set tasks.max=1: the Debezium Postgres connector reads its replication slot with a single task.

Schema handling (requirement d): Schema Registry with BACKWARD compatibility. Configure Debezium with Avro serializers pointing at the Schema Registry. When a column is added to orders, Debezium detects the schema change via the WAL, registers the new Avro schema (adding an optional field with default = backward compatible), and continues. The S3 sink reads schema by ID — it can handle both old and new schema records. Without Schema Registry, a column add would produce messages the S3 sink couldn't deserialize.

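Sink: Confluent S3 Sink Connector (or another Kafka Connect S3 sink, such as Aiven's open-source connector). Configure: s3.bucket.name=my-change-events, flush.size=10000 (maximum records per Parquet file), rotate.schedule.interval.ms=300000 (commit files at least every 5 minutes of wall-clock time, requirement b; rotate.interval.ms alone is driven by record timestamps and will not flush a quiet topic), storage.class=io.confluent.connect.s3.storage.S3Storage, format.class=io.confluent.connect.s3.format.parquet.ParquetFormat. For date-partitioned paths such as s3://my-change-events/topics/{topic}/year={year}/month={month}/day={day}/, which Athena and Glue can query as partitions, also set partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner with a matching path.format. tasks.max=3 (enough for one task per topic when each topic has a single partition; tasks are assigned topic partitions and spread across Connect workers).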

Topology summary: Postgres → (WAL) → Debezium source connector → Kafka (3 topics, Avro + Schema Registry) → S3 sink connector → S3 Parquet files (5-min SLA). Total custom code: 0 lines. Two JSON config files deployed via Connect REST API.

Your team needs to build a real-time enrichment pipeline: for each incoming transaction event (from transactions topic, ~200 msg/s), join against the current user profile (from user-profiles topic, ~1 update/second across 10M users) to produce an enriched event. The enrichment code is straightforward Java. The team runs Java microservices on Kubernetes. No existing Spark infrastructure exists.

Should you use Kafka Streams or Apache Spark Streaming? Justify your choice, and list the three most important factors that would push you toward the other option.

Think about: deployment complexity, state store size, existing infrastructure, operational overhead of running a separate cluster vs. embedding a library.

Solution — Use Kafka Streams for this workload.

Why Kafka Streams wins here:

1. No separate cluster. The team already runs Java services on Kubernetes. Kafka Streams is a library dependency — you add it to your existing service, deploy it the same way you deploy everything else, and monitor it the same way. Spark requires a separate Spark cluster (YARN, Kubernetes-native Spark, or Databricks), a job submission pipeline, and a separate operational team or at minimum a separate deployment lifecycle. For a team without existing Spark infrastructure, the setup cost alone is measured in weeks.

2. Stream-table join is a native primitive. The KStream-KTable join is exactly what this workload requires. The user-profiles KTable is backed by local RocksDB, meaning join lookups are sub-millisecond local disk reads, with no network call to a database or distributed state store. The join does require both topics to be keyed by user ID and co-partitioned (same number of partitions). The state store for 10M user profiles is roughly 10M × (key + profile size); assuming 500 bytes to 1 KB per profile, that is about 5–10 GB of RocksDB, which fits comfortably on a Kubernetes pod with persistent storage.

3. Operational simplicity at this scale. 200 msg/s is not a scale problem. Kafka Streams handles millions of messages per second on modest hardware. The overhead of running a distributed Spark cluster for 200 msg/s is pure waste.

Three factors that would push you toward Spark Streaming instead:

Existing Spark infrastructure and expertise: If the team already has Spark clusters, data engineers who know Spark, and existing pipelines — adding one more Spark job is lower cost than introducing a new technology (Kafka Streams).
Complex analytics or ML features: Spark offers SQL, the DataFrame/Dataset API, and native integration with MLlib. Kafka Streams supports tumbling, hopping, sliding, and session windows, but for feature engineering pipelines feeding ML models, or for aggregations that mix streaming and batch data, Spark's DataFrame/Dataset API is more ergonomic than the Kafka Streams DSL.
Non-Java languages: Kafka Streams is a Java/Scala library. If your team works primarily in Python or Go, Spark (with PySpark) or Apache Flink (with Python API) are more natural fits. Kafka Streams has no first-class Python support.

These five exercises cover the core Kafka design decisions: partition count sizing (throughput math vs. consumer parallelism), durability configuration trade-offs (acks vs. min.ISR vs. latency), consumer re-processing diagnosis (offset commit bugs and rebalance loops), CDC pipeline architecture with Debezium and Schema Registry, and the Kafka Streams vs. Spark Streaming choice. Work these end-to-end — the reasoning chain matters more than the numeric answers, because production problems always have context that changes the numbers but not the reasoning framework.

Section 19

Bug Studies — When Kafka Goes Wrong in Production

Reading the theory tells you how Kafka is supposed to work. Reading production incidents tells you how it actually fails. Every bug below was real (the names are changed, the consequences were not). Each one traces a single misconfigured value or misunderstood guarantee all the way to data loss, silent corruption, or six hours of runaway rebalancing. Understanding these incidents at the mechanism level, not just "set this flag", is what separates engineers who use Kafka from engineers who run Kafka.

Bug A — Silent Data Loss from Unclean Leader Election

Incident. A financial services team had a Kafka cluster with three brokers. During a brief network partition, the partition leader (Broker 1) was isolated. Broker 3 — which had fallen 20 minutes behind in replication and was no longer in the ISR — was elected as the new leader. When Broker 1 came back, Kafka truncated its log to match Broker 3's shorter log. Twenty minutes of committed messages — customer transactions — vanished silently. The flag responsible: unclean.leader.election.enable=true, which was the default in older Kafka versions.

What Went Wrong

Kafka tracks which followers are fully caught up with the leader. That set is the In-Sync Replica (ISR) set, the only replicas Kafka considers safe to promote on failover. When unclean.leader.election.enable=true and no ISR member is available, Kafka will elect any surviving replica as leader, even one that is badly behind, rather than wait for an ISR member to come back. The word "unclean" is not marketing: it means the new leader's log is shorter than the old leader's, so Kafka must truncate the old leader's log when it rejoins. Those truncated messages are gone. Producers had received acks=all acknowledgements from the old leader for some of those messages before the network partition. But "all" meant "all replicas in the ISR at that moment", and Broker 3 had already left the ISR. The ack was technically correct, but the durability window was narrower than the team assumed.

The lagged Broker 3 becomes leader (unclean election). When Broker 1 rejoins, its log is truncated to match Broker 3's shorter log. Messages in offsets 7200–9999 disappear permanently.

# server.properties: the dangerous combination (unclean election was the default before Kafka 0.11)
# Allows out-of-ISR replicas to lead
unclean.leader.election.enable=true
# Only 1 replica needs to be in sync for acks=all
min.insync.replicas=1

# Result: acks=all gives a false sense of safety.
# A lagged replica CAN become leader, and Kafka WILL truncate the old leader's log on rejoin.

# server.properties: production-safe settings
# Only ISR members can become leader
unclean.leader.election.enable=false
# For RF=3, at least 2 replicas must be in sync
min.insync.replicas=2

# Producer must also use acks=all (wait for all ISR replicas to acknowledge).
# Now, if fewer than 2 replicas are in sync, the producer gets an exception
# instead of a silent ack that later gets rolled back. Fail loudly > fail silently.

Lesson. In Kafka, durability is a product of three settings working together: acks=all (producer), min.insync.replicas=N (topic/broker), and unclean.leader.election.enable=false (broker). Any one of them set carelessly breaks the chain. The combination is a contract: data is durable only if all three are set correctly and remain set.
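The availability-versus-loss trade-off in this incident can be expressed as a small Java model. It is a sketch with hypothetical names, not Kafka's election code. The offsets in the usage note are the ones from the incident (old leader's log ends at offset 10000, the lagging replica's at 7200).

```java
/** Sketch: availability vs. data loss when a lagging replica is promoted. */
final class UncleanElectionLoss {
    /** A partition has a leader if an ISR member is alive, or if unclean election is allowed. */
    static boolean partitionAvailable(int liveIsrReplicas, int liveOutOfSyncReplicas,
                                      boolean uncleanElectionEnabled) {
        return liveIsrReplicas > 0 || (uncleanElectionEnabled && liveOutOfSyncReplicas > 0);
    }

    /**
     * Both arguments are log-end offsets (the next offset to be written).
     * Offsets in [newLeaderLogEnd, oldLeaderLogEnd) exist only on the old leader
     * and are truncated when it rejoins as a follower.
     */
    static long lostMessages(long oldLeaderLogEnd, long newLeaderLogEnd) {
        return Math.max(0, oldLeaderLogEnd - newLeaderLogEnd);
    }
}
```

With unclean election enabled, partitionAvailable(0, 1, true) is true and lostMessages(10000, 7200) is 2800, covering offsets 7200 through 9999. With it disabled, the partition stays offline until an ISR member returns, and nothing is truncated.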

Bug B — Endless Rebalance Loop from a Slow Database Write

Incident. A team's consumer read messages from a user-events topic and wrote each one to a legacy Oracle database. The DB write was slow — sometimes 8 seconds per message. Kafka's consumer default max.poll.interval.ms=300000 (5 minutes) seemed generous. But the consumer processed messages in large batches, and one batch hit a slow query that took 7 minutes. Kafka's group coordinator decided the consumer was dead and triggered a rebalance. The consumer finished its batch, called poll() again — and triggered another rebalance because it was now rejoining. The cycle repeated for 6 hours. The backlog grew by hundreds of thousands of messages. No errors were logged — the consumer thought it was working fine.

What Went Wrong

The consumer group protocol has a heartbeat mechanism: the consumer sends a heartbeat on a background thread, so the coordinator knows the consumer is alive. But there is a second clock that heartbeats cannot reset: max.poll.interval.ms. This is the maximum time the coordinator will wait between two successive poll() calls from the same consumer. If the consumer spends too long processing a batch — even while the heartbeat thread is humming along perfectly — the coordinator pronounces it dead and reassigns its partitions. The consumer is not dead; it's just slow. But Kafka doesn't know the difference, and it cannot wait forever. The result is a rebalance storm: partition → new consumer → new consumer processes → slow → rebalance → repeat.

Each time the consumer takes longer than max.poll.interval.ms, the coordinator triggers a rebalance. The consumer rejoins but the slow write repeats. No errors surface — just an ever-growing backlog.

// Each poll returns up to max.poll.records (default 500).
// Processing all 500 before the next poll() can exceed max.poll.interval.ms.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        slowDatabaseWrite(record); // may take seconds per record
    }
    // 500 records × 3 sec each = 25 minutes, far past the 5-min limit
}

// Option 1: shrink the batch so worst-case processing stays under the interval
props.put("max.poll.records", 20);          // 20 records × 8 s worst case = 160 s
props.put("max.poll.interval.ms", 300000);  // 5 min: leaves headroom over 160 s

// Option 2: async processing: hand off records and poll again immediately
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    submitToExecutor(records); // non-blocking; commit only after the executor confirms
}
// Async adds at-least-once complexity: manage offsets carefully, and pause()
// the consumer when the executor's queue is full so memory stays bounded.

// Option 3: use Kafka Streams
KStream<String, String> stream = builder.stream("user-events");
stream.foreach((key, value) -> slowDatabaseWrite(value));
// Streams manages polling, offset commits, and threading for you, but a blocking
// call inside foreach still counts against max.poll.interval.ms.

Lesson. Size max.poll.interval.ms to your actual worst-case processing time, not to "something large." Use max.poll.records to control how many records arrive per poll. If your processing is genuinely variable and slow, move to async processing or Kafka Streams — the plain consumer loop is not designed for long per-record latencies.

Bug C — Lost Order, Charged Customer: acks=1 and a Crashed Replica

Incident. An e-commerce team used acks=1 for their orders topic — the leader acknowledges as soon as it writes the message to its own log, before any follower replicates. A hardware failure killed the leader broker seconds after the ack was sent. The follower that took over had not yet pulled the last 3 messages. The payment service had already been told "order confirmed". The fulfillment service never received those 3 orders. Three customers were charged; three orders were never placed.

What Went Wrong

With acks=1, the broker's acknowledgement only guarantees the message reached the leader's local log buffer. If the leader crashes before any follower replicates that message, the new leader — elected from the ISR — will not have the message, and it is gone. The ISR is updated asynchronously: followers pull from the leader, and there is always a window where the leader is ahead of every follower. acks=1 puts the producer in that window intentionally, trading durability for lower latency. For financial events, that is not a trade worth making.

# Producer config: acks=1 (leader-only acknowledgement)
acks=1

# The leader acks immediately after writing to its own log.
# Followers haven't replicated yet. If the leader dies in this window:
#   - Ack was sent → producer thinks message is safe
#   - New leader (follower) doesn't have the message
#   - Message is gone

# Producer config: durable for financial events
# Wait for ALL in-sync replicas to acknowledge
acks=all
# Retry until delivery.timeout.ms expires (idempotence preserves ordering)
retries=2147483647
# Prevent duplicates on retry
enable.idempotence=true
delivery.timeout.ms=120000

# Topic config (replication.factor is fixed when the topic is created)
replication.factor=3
# At least 2 replicas must be in sync
min.insync.replicas=2

Lesson. Choose acks based on what a lost message costs your business, not on latency benchmarks. For financial events, audit logs, or anything that triggers a real-world side effect (charge, email, shipment), use acks=all + enable.idempotence=true + min.insync.replicas=2. The added latency (a few milliseconds for replica acknowledgement) is negligible compared to the cost of a missing transaction.
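The failure window can be reproduced in a few lines. What follows is a toy in-memory model, not Kafka client code: the Replica and Partition classes and the acked_but_lost helper are hypothetical names, and replication is assumed to complete instantly whenever acks=all is requested.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a partition log (hypothetical, in-memory)."""
    name: str
    log: list = field(default_factory=list)


class Partition:
    """Toy partition: one leader plus followers that copy from it."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers

    def replicate(self):
        # Followers pull everything they are missing from the leader.
        for follower in self.followers:
            follower.log = list(self.leader.log)

    def produce(self, msg, acks):
        self.leader.log.append(msg)
        if acks == "all":
            # The ack is withheld until the followers hold the record.
            self.replicate()
        # With acks=1 the ack goes out before any follower has the record.
        return "ack"

    def fail_leader(self):
        # The old leader is gone for good; promote the first follower.
        self.leader = self.followers.pop(0)


def acked_but_lost(acks):
    """Return the acknowledged messages the new leader does not have."""
    partition = Partition(Replica("b1"), [Replica("b2"), Replica("b3")])
    acked = [m for m in ("order-1", "order-2", "order-3")
             if partition.produce(m, acks) == "ack"]
    partition.fail_leader()
    return [m for m in acked if m not in partition.leader.log]


print(acked_but_lost("1"))    # ['order-1', 'order-2', 'order-3']
print(acked_but_lost("all"))  # []
```

The model exaggerates the window (no follower ever catches up under acks=1), but the shape of the failure is the same as in the incident: the producer holds acks for records that the surviving replica never saw.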

Bug D — 8 Consumers Sitting Idle: Partition Count vs. Consumer Count

Incident. A platform team created a payment-events topic with 4 partitions. When load increased, they scaled the consumer group to 12 instances, expecting throughput to scale with them. Instead, throughput stayed constant. Describing the consumer group showed 4 consumers actively processing (one per partition) and 8 consumers that were members of a Stable group but assigned 0 partitions: idle, still consuming CPU and memory, and confusing the on-call engineer.

What Went Wrong

In a Kafka consumer group, each partition is assigned to exactly one consumer at a time. If your consumer group has more members than the topic has partitions, the extra consumers sit idle. Kafka cannot give two consumers in the same group the same partition — that would break the ordering guarantee and cause duplicate processing. The rule is simple and absolute: parallelism = min(partition count, consumer count). Adding consumers beyond the partition count wastes resources and can actually slow things down slightly due to increased rebalance complexity.

With 4 partitions and 12 consumers in the same group, only 4 consumers receive work. The remaining 8 are fully idle — the partition count is the hard ceiling on intra-group parallelism.

# Topic has 4 partitions. Team scaled consumers to 12.
kafka-consumer-groups.sh --describe --group payment-processor \
  --bootstrap-server broker:9092

# Output (abridged to three columns):
# CONSUMER-ID   PARTITION   LAG
# consumer-1    0           4200
# consumer-2    1           4100
# consumer-3    2           4050
# consumer-4    3           3990
# consumer-5    -           -      ← idle
# consumer-6    -           -      ← idle
# ... (8 idle consumers)

# Fix: increase partition count to match desired parallelism
kafka-topics.sh --alter --topic payment-events \
  --partitions 12 \
  --bootstrap-server broker:9092

# Rule: parallelism = min(partition_count, consumer_count)
# For 12 consumers → need at least 12 partitions
# Over-partition slightly (e.g. 24 partitions) to allow future scaling
# without another partition-count increase (the count can't be decreased later)

Lesson. Partition count is the parallelism ceiling — set it at topic creation to the maximum parallelism you will ever need, because you can increase it but not decrease it (and increasing partitions disrupts key-based ordering). A common rule of thumb: over-provision by 2× your expected peak consumer count.
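The parallelism rule is easy to check with a small model. This is a sketch under simplifying assumptions: assign and summarize are hypothetical helpers, and the round-robin placement stands in for whichever assignor the group actually uses. What it preserves is the one constraint that matters here, namely that each partition has exactly one owner within the group.

```python
def assign(partitions, consumers):
    """Give each partition exactly one owner, cycling through the consumers."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


def summarize(num_partitions, num_consumers):
    """Return (active consumers, idle consumers) for one consumer group."""
    consumers = [f"consumer-{i + 1}" for i in range(num_consumers)]
    assignment = assign(list(range(num_partitions)), consumers)
    active = sum(1 for owned in assignment.values() if owned)
    return active, num_consumers - active


print(summarize(4, 12))   # (4, 8): the incident above
print(summarize(12, 12))  # (12, 0): after raising the partition count
print(summarize(24, 12))  # (12, 0): two partitions each, with room to grow
```

The active count is always min(partition count, consumer count), which is why adding a thirteenth consumer to a 12-partition topic changes nothing.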

Four real production failures, one pattern: a single misconfigured parameter breaks a guarantee you were silently relying on. unclean.leader.election.enable=true truncates committed logs; an undersized max.poll.interval.ms creates a rebalance storm; acks=1 loses acknowledged messages when the leader crashes before followers catch up; and a partition count below the consumer count leaves every surplus consumer idle (8 of 12 in the incident above). Each fix is one config line, but only if you understand the mechanism.

Section 20

Real-World Architectures — How Big Companies Use Kafka

Kafka's Wikipedia page describes what it is. Kafka's engineering blogs describe what it actually does under pressure. The five architectures below are drawn from public engineering posts — each shows a distinct usage pattern, and together they map out the full space of what Kafka enables at scale.

Five companies, five patterns: invention and activity tracking (LinkedIn), real-time operational backbone (Uber), analytics pipeline (Netflix), managed service (Confluent), and real-time OLAP (Apache Pinot).

LinkedIn — Where Kafka Was Born (7 Trillion Messages/Day)

Kafka was created at LinkedIn in 2010–2011 by Jay Kreps, Neha Narkhede, and Jun Rao to solve exactly the three problems described in Section 2: replay, throughput ceilings, and fan-out. LinkedIn's engineering blog (2019, 2022, and 2024 posts) describes a deployment now processing roughly 7 trillion messages per day; the 2019 post puts it at more than 100 clusters serving over 100,000 topics. The traffic includes member activity events (profile views, connection requests, feed clicks), search index updates, site monitoring, and ML feature pipelines. The key insight LinkedIn published is that Kafka became the single source of truth for derived data: instead of every downstream system querying the database, they all subscribe to Kafka topics and maintain their own derived views. This dramatically reduced cross-service coupling and allowed teams to iterate independently.

Pattern: Activity backbone + derived data. Every user interaction publishes to Kafka. Every downstream service (recommendation engine, search, analytics) is a consumer group reading its own slice of the log. Adding a new consumer (say, a new ML experiment) requires zero changes to the producer. This is the pub/sub at scale model that Kafka was designed for, and LinkedIn still runs it at the highest documented Kafka volume in the world.

Uber — The Central Nervous System for Ride Dispatch

Uber's engineering blog describes Kafka as the central nervous system of their real-time data infrastructure. Every GPS ping from a driver's phone, every ride request from a rider, every surge pricing calculation, and every ML feature for ETA prediction flows through Kafka. Uber's unique challenge is geo-distribution: they run Kafka clusters in multiple regions and need to replicate data between them for analytics and disaster recovery. They built uReplicator (in production at Uber from late 2015, open-sourced in 2016) specifically because the existing MirrorMaker tool at the time didn't handle their scale. The broader community later built MirrorMaker 2.0 (released with Kafka 2.4), which incorporated many of uReplicator's ideas and is now the standard for cross-cluster replication.

Pattern: Real-time operational backbone + cross-region replication. Kafka sits between every microservice. Producers are GPS collectors, booking services, and payment systems. Consumers are the dispatch engine, analytics pipeline, ML feature store, and fraud detection system. A cross-cluster replicator (uReplicator at Uber, MirrorMaker 2.0 in most newer deployments) copies topics between regional clusters, for example US, EU, and APAC. This pattern, Kafka as the central bus with geo-replication, is now common in any globally distributed real-time platform.

Netflix — Keystone Pipeline, Hundreds of Billions to Trillions of Events/Day

Netflix's Keystone pipeline (introduced in their 2016 tech blog post "Kafka Inside Keystone Pipeline", with later updates describing growth into the trillions of events per day) processes vast volumes of events: playback events (play, pause, seek, buffer), A/B test impressions, recommendation impressions, device telemetry, and service-level error tracking. Netflix uses Kafka as the ingestion layer for all of these, then uses Kafka Connect to sink data into Amazon S3 (for batch analytics), Elasticsearch (for operational search and alerting), and Druid (for real-time OLAP). The producers are instrumented client libraries inside the Netflix app on every device (TV, phone, browser). A single "play" action on your TV generates dozens of events that flow through Kafka within seconds of the interaction.

Pattern: Event ingestion → Kafka → multi-sink via Connect. Kafka does not compute anything — it is a durable buffer. All the interesting processing happens downstream: Spark jobs reading from S3, Elasticsearch queries powering the monitoring dashboards, and Druid powering the real-time metrics. Kafka's role is to absorb any spike in event volume (device spikes at 8pm on a Friday) and deliver data to each sink at the sink's own pace. This is the "shock absorber" pattern at its largest documented scale.

Confluent — Commercial Kafka and the Managed Service Story

Confluent was founded in 2014 by the three Kafka creators after they left LinkedIn. It is not a user of Kafka in the way LinkedIn or Netflix is. It is the most prominent commercial vendor behind Apache Kafka (the project itself is governed by the Apache Software Foundation) and the operator of Confluent Cloud, a fully managed Kafka service available on AWS, Azure, and GCP. Confluent engineers are among the largest contributors to Kafka's upstream code, including much of the KRaft work, and the company separately develops Schema Registry and ksqlDB, which are Confluent products rather than part of Apache Kafka. It also sells the enterprise support contracts that let many large companies run Kafka without building a deep internal Kafka team. If you are evaluating whether to self-host or use a managed service, Confluent Cloud and Amazon MSK (Managed Streaming for Apache Kafka, GA May 2019 after a 2018 re:Invent preview) are the two dominant options, covered in more depth in the Operational Playbook section below.

Pattern: Managed Kafka-as-a-Service. Confluent Cloud handles broker provisioning, upgrades, partition rebalancing, monitoring, and TLS/SASL security configuration. You pay per GB ingested and stored rather than per EC2 instance. The tradeoff: less control over broker configuration, higher per-unit cost at extreme volume, but dramatically reduced operational burden. Most teams choose managed until they're processing tens of TB per day, at which point self-hosting economics often tip.

Apache Pinot at Uber and LinkedIn — Real-Time OLAP on Top of Kafka

Both Uber and LinkedIn independently hit a problem Kafka itself cannot solve: fast ad-hoc analytical queries on streaming data. Kafka is a great buffer and log, but you cannot run a SQL query like "give me the P99 latency per city for the last 5 minutes" against a Kafka topic — Kafka has no query engine. Both companies built columnar OLAP systems that consume directly from Kafka and serve low-latency queries: Uber's AresDB (now largely superseded) and LinkedIn's Apache Pinot (open-sourced by LinkedIn in 2015, entered the Apache Incubator in 2018, and graduated to a top-level Apache project in August 2021). Pinot ingests directly from a Kafka topic via a real-time ingestion connector, stores data in columnar segments, and answers OLAP queries in milliseconds on data that is only seconds old. Uber adopted Pinot as well, and it became the shared real-time analytics engine for both companies.

Pattern: Kafka → real-time OLAP. Kafka provides the streaming ingest; Pinot provides the query layer. This is the Lambda/Kappa architecture's "speed layer" implemented without custom stream processing code — you just point Pinot at a Kafka topic and write SQL. The pattern is increasingly common for operational dashboards, A/B test monitoring, and fraud detection surfaces where you need both freshness (seconds) and query flexibility (arbitrary SQL filters).

Five patterns mapped to companies and technology choices. Notice that Kafka itself appears in every row — the patterns differ in what sits beside and downstream of Kafka, not in the core log.

Kafka's real-world usage falls into five patterns: activity backbone (LinkedIn), operational nervous system with geo-replication (Uber), shock-absorber ingestion pipeline (Netflix), managed service for teams without Kafka ops expertise (Confluent/MSK), and real-time OLAP layer (Apache Pinot). The common thread is Kafka as the immutable, replayable source of truth — everything else is downstream.

Section 21

Common Misconceptions — Mental-Model Errors

Pitfalls (Section 17) are operational mistakes you make while running Kafka. Misconceptions are the wrong mental models you carry into that work. They're more dangerous because they make the pitfalls feel like surprises rather than logical consequences of how the system works. Each of the seven misconceptions below is a belief that was false from day one.

The belief: Kafka is just a fancier RabbitMQ — you produce a message, a consumer reads it, the message is deleted.

The reality: Kafka is an append-only log — every message gets written to the end of a file and stays there. Messages are never deleted on consume; they are retained for a configurable period (days, weeks, forever with tiered storage). Consumers track their own position — a number called an offset — independently. A message can be read by 50 consumer groups simultaneously, each at a different position in the log. When a consumer group is deleted, the messages remain. When a new team wants historical data, they rewind to offset 0 — no re-publishing required. The queue mental model actively misleads you because you'll under-specify retention, over-delete topics, and fail to use replay when you need it most.

The test: If Kafka were a queue, independent consumer groups could not exist, because the first consumer to read a message would remove it for everyone else. The fact that groups are independent, and that a message persists after one group reads it, is the proof that Kafka is a log, not a queue.
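The difference between the two mental models fits in a short sketch. The Log class below is a hypothetical in-memory stand-in for a single partition, not a Kafka API: reading never removes a record, and each group keeps its own position.

```python
class Log:
    """Toy append-only log with one read position per consumer group."""

    def __init__(self):
        self.records = []
        self.positions = {}  # group name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        start = self.positions.get(group, 0)
        batch = self.records[start:start + max_records]
        self.positions[group] = start + len(batch)  # records stay in the log
        return batch

    def seek(self, group, offset):
        self.positions[group] = offset


log = Log()
for order in ("order-1", "order-2", "order-3"):
    log.append(order)

billing = log.poll("billing")      # billing reads all three records
analytics = log.poll("analytics")  # analytics reads the same three records
caught_up = log.poll("billing")    # nothing new for billing
log.seek("billing", 0)             # rewind
replayed = log.poll("billing")     # billing replays from offset 0

print(billing == analytics == replayed, caught_up, len(log.records))
```

A queue would need a destructive read in poll; here the only state that changes on read is the group's own position.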

The belief: Throughput is linearly proportional to partition count. To double throughput, double partitions.

The reality: Partitions do increase parallelism, up to a point. But one special node, the active controller (elected through the KRaft quorum), has to track every partition's leader, ISR, and metadata. Beyond roughly 200,000 partitions per cluster (a rough empirical ceiling documented for ZooKeeper-based clusters, before KRaft), the controller becomes a bottleneck: leader election during broker restarts slows dramatically, metadata propagation increases end-to-end latency, and JVM garbage collection on the controller becomes visible in P99 latency. KRaft improved this ceiling significantly, but it remains finite. Additionally, each partition adds to the file descriptor count on the brokers that host it and to the memory used by Kafka's internal bookkeeping (the leader epoch cache, which records every leadership change so failovers can detect stale replicas).

The practical rule: Size partitions to your maximum expected parallelism + a 2× buffer. A topic that needs 10 consumers in production should have 20 partitions, not 10,000. You can always add consumer instances later, up to the partition count; you can't undo partition sprawl without deleting and recreating the topic.

The belief: RF=3 gives me 3 copies, so I can afford to lose 2 brokers before data is at risk.

The reality: RF=3 means 3 brokers each hold a copy of the partition, but this is only useful if min.insync.replicas is set correctly. The default value is 1. acks=all waits for every replica currently in the ISR, and with min.insync.replicas=1 the ISR is allowed to shrink to the leader alone, so a write can be acknowledged with a single copy while the other two replicas are down, lagged, or on fire. That defeats the entire purpose of replication. The correct production setting is min.insync.replicas=2 with RF=3: you can lose one broker (not two) and keep accepting writes with the durability guarantee intact. Lose 2 brokers with this setting and producers get a NotEnoughReplicasException, which is the correct behavior: Kafka refuses to accept writes it can't guarantee, rather than silently accepting data it might lose.
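The interaction between ISR size and min.insync.replicas reduces to one comparison. A minimal sketch, assuming RF=3 and that every surviving replica is in sync; acks_all_write is a hypothetical helper, not a Kafka API.

```python
def acks_all_write(isr_size, min_insync_replicas):
    """Outcome of an acks=all produce request for a given ISR size."""
    if isr_size < min_insync_replicas:
        return "rejected (NotEnoughReplicas)"
    copies = "copy" if isr_size == 1 else "copies"
    return f"accepted with {isr_size} {copies}"


REPLICATION_FACTOR = 3
for min_isr in (1, 2):
    for brokers_down in (0, 1, 2):
        isr_size = REPLICATION_FACTOR - brokers_down
        print(f"min.insync.replicas={min_isr}, brokers down={brokers_down}: "
              f"{acks_all_write(isr_size, min_isr)}")
```

The row to look at is min.insync.replicas=1 with two brokers down: the write is accepted with a single copy, which is exactly the situation the setting exists to prevent.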

The belief: Kafka preserves the order in which messages were produced — message 1 is always consumed before message 2.

The reality: Kafka guarantees ordering within a single partition, not across the topic. If your topic has 12 partitions and you produce messages A, B, C in sequence, there is no guarantee A, B, C land on the same partition unless you set an explicit partition key. Messages with the same key always go to the same partition as long as the partition count does not change (Kafka uses a hash of the key to select the partition), so key-based ordering is guaranteed. With no key, the default partitioner spreads records across partitions (round-robin in older clients, batch-level "sticky" assignment since Kafka 2.4): A might land on partition 0, B on partition 3, C on partition 7, and three separate consumers read them in completely different orders. For use cases where ordering matters (financial transactions per account, events per user session), you must set a partition key (account ID, session ID) and tolerate that all events for a key go to one partition, limiting parallelism for that key to one consumer at a time.
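Key-based placement can be sketched as a stable hash modulo the partition count. Note the assumption: CRC32 is used here only as a convenient stand-in available in the Python standard library, whereas the Java client's default partitioner uses a murmur2 hash, so the partition numbers below will not match a real cluster. partition_for is a hypothetical helper.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("acct-17", "debit"), ("acct-42", "credit"), ("acct-17", "refund")]
placed = [(key, event, partition_for(key, 12)) for key, event in events]
for row in placed:
    print(row)  # both acct-17 events carry the same partition number

# Growing the topic from 12 to 24 partitions remaps a share of the keys,
# which is why adding partitions disrupts per-key ordering.
moved = sum(
    1 for i in range(1000)
    if partition_for(f"k{i}", 12) != partition_for(f"k{i}", 24)
)
print(f"{moved} of 1000 sample keys changed partition")
```

Same key, same partition, same consumer: that chain is the whole ordering guarantee, and it holds only while the partition count stays fixed.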

The belief: If I turn on Kafka's "process each message exactly once" guarantee — known as exactly-once semantics (EOS) — my entire pipeline (including writes to PostgreSQL, S3, or an email service) becomes exactly-once.

The reality: Kafka's EOS guarantee covers only Kafka-to-Kafka flows within a single cluster: a Kafka Streams application reading from topic A and writing to topic B can guarantee exactly-once. The moment you write to a system outside Kafka (a database, an API call, an email), you are back in at-least-once territory unless that external system supports idempotent writes. Writing to PostgreSQL with a unique constraint on a transaction ID effectively gives you exactly-once semantics through the idempotency of the upsert — but that guarantee comes from PostgreSQL, not from Kafka. Kafka EOS is not magic fairy dust you sprinkle on your pipeline; it is a specific guarantee about offset commits and producer transactions within Kafka's log.
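The idempotent-sink idea can be shown with any database that enforces a unique key. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL; the payments table, the txn_id column, and apply_payment are hypothetical names. In PostgreSQL the equivalent statement would use ON CONFLICT DO NOTHING rather than INSERT OR IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (txn_id TEXT PRIMARY KEY, amount_cents INTEGER)"
)


def apply_payment(txn_id, amount_cents):
    """Insert once; a redelivered record with the same txn_id is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO payments (txn_id, amount_cents) VALUES (?, ?)",
            (txn_id, amount_cents),
        )


# At-least-once delivery: txn-2 arrives twice, as it would if the consumer
# crashed after writing to the database but before committing its offset.
delivered = [("txn-1", 500), ("txn-2", 1250), ("txn-2", 1250), ("txn-3", 80)]
for txn_id, amount in delivered:
    apply_payment(txn_id, amount)

rows, total = conn.execute(
    "SELECT COUNT(*), SUM(amount_cents) FROM payments"
).fetchone()
print(rows, total)  # 3 1830
```

Four deliveries produce three rows. The deduplication happens entirely in the database's primary-key constraint; Kafka's delivery guarantee here is still at-least-once.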

The belief: Kafka needs a ZooKeeper cluster to function — you always need to run both.

The reality: Kafka now has its own built-in consensus layer, KRaft (short for Kafka Raft), for electing the active controller and storing cluster metadata. It was introduced experimentally in Kafka 2.8 (2021), became production-ready in Kafka 3.3 (2022), and in Kafka 4.0 (released March 2025) ZooKeeper mode was completely removed. Kafka 4.0 and later require KRaft; there is no ZooKeeper option. If you see tutorials or blog posts from before 2022 describing ZooKeeper setup steps, they are outdated. A modern Kafka cluster is a single self-contained distributed system: a small set of controller nodes (dedicated in production, or sharing a process with a broker in small test clusters) forms the KRaft quorum (the "controller quorum") and manages metadata. You do not need to operate a separate ZooKeeper ensemble, monitor ZooKeeper separately, or deal with the infamous ZooKeeper session timeout race conditions. This is an unambiguous improvement, and if you're starting a new Kafka deployment, KRaft is the only option.

The belief: Kafka Streams is a separate service I need to deploy alongside Kafka, like a Flink or Spark cluster.

The reality: Kafka Streams is a Java library: a jar you add to your application, not a separate service. It runs entirely inside your application process. There is no separate "Streams cluster" to provision, no dedicated YARN or Kubernetes scheduler to manage, no separate JVM fleet. Your application reads from a Kafka topic, processes records using the Streams DSL (filter, map, join, aggregate), and writes results back to another Kafka topic. State stores (for windowed aggregations) live on local disk inside your application instances and are backed up to changelog topics in Kafka, so a replacement instance can rebuild them. The tradeoff is that you must manage the scaling of your application instances yourself, but since each instance is a plain JVM process, Kubernetes horizontal pod autoscaling works just fine. The absence of an external cluster is a massive operational simplicity win for teams that would otherwise need to maintain both Kafka and a separate stream processing framework.

Seven mental-model corrections: Kafka is a log, not a queue; more partitions help only to a point; RF=3 protects you only when min.insync.replicas is set to match, and even then you can lose one broker, not two; ordering is per-partition only; EOS applies within Kafka, not to external systems; ZooKeeper is gone in 4.0; and Kafka Streams is a library, not a cluster. Correcting these gives you a model that predicts Kafka's behavior accurately instead of producing unpleasant surprises.

Section 22

Operational Playbook — Run Kafka in Production

You have the mental model, the bugs, and the misconceptions. Now here is the five-stage playbook for running Kafka in production — from the first decision (managed or self-hosted?) through day-two optimization. This is not exhaustive, but it covers the decisions that bite teams most often.

The five stages of the Kafka operational lifecycle. Stage 1 (Pick) is a one-time decision; stages 4 and 5 are ongoing loops.

Stage 1 — Pick: Managed vs. Self-Hosted

The first decision determines your entire operational posture. Two managed options dominate the market:

Confluent Cloud: fully managed, multi-cloud (AWS/Azure/GCP), includes Schema Registry, ksqlDB, and Kafka Connect as managed services. Priced per GB ingested + CKU (Confluent Unit for Kafka, the capacity unit for dedicated clusters). Supports Kafka 3.x and 4.x. Best when you want zero Kafka ops overhead and are willing to pay a premium for convenience.
Amazon MSK (Managed Streaming for Apache Kafka) — managed within your AWS account, brokers run inside your VPC. You control broker instance types, storage, and network settings. Less abstracted than Confluent Cloud — you still manage some broker config — but cheaper at moderate scale and tightly integrated with IAM, CloudWatch, and other AWS services.

Self-host when: you're processing many tens of TB per day (managed cost exceeds self-host ops cost), you need broker-level configuration access (custom JVM flags, custom log directories, exact partition placement), or you have compliance requirements that forbid third-party-managed infrastructure.

Default recommendation: start managed (Confluent Cloud or MSK), move to self-hosted if and when cost or control requirements demand it. Most teams never need to self-host.

Stage 2 — Onboard: Cluster and Topic Configuration

# Broker configuration: production defaults for a 3-broker cluster
num.partitions=12                       # default for new topics (overridable per topic)
default.replication.factor=3            # every topic gets 3 replicas by default
min.insync.replicas=2                   # requires at least 2 in-sync replicas for acks=all
unclean.leader.election.enable=false    # never elect an out-of-ISR replica as leader
log.retention.hours=168                 # 7-day retention (adjust per topic)
log.retention.bytes=-1                  # no size-based limit (retention is controlled by time)
auto.create.topics.enable=false         # prevent accidental topic creation; use IaC

# Rack awareness: multi-AZ deployment
broker.rack=us-east-1a                  # set to the AZ of each broker
# Optional: lets consumers fetch from the nearest replica (follower fetching)
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Monitoring
# Confluent Platform reporter; plain Apache Kafka exposes the same metrics over JMX
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter

Rack awareness is critical for multi-AZ deployments: Kafka uses the broker.rack label to ensure each partition's replicas are spread across AZs. Without it, all three replicas might land on brokers in the same AZ, and a single AZ failure takes down all copies.
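A quick way to reason about placement is to ask how many replicas remain if the most heavily used AZ disappears. The sketch below is a hypothetical checker, not part of Kafka: BROKER_RACK is an assumed broker-to-AZ mapping that mirrors the broker.rack setting, and replicas_left_after_az_loss is an illustrative helper.

```python
from collections import Counter

# Hypothetical broker id -> AZ mapping, mirroring each broker's broker.rack.
BROKER_RACK = {
    1: "us-east-1a", 2: "us-east-1a", 3: "us-east-1a",
    4: "us-east-1b", 5: "us-east-1c",
}


def replicas_left_after_az_loss(replicas, broker_rack=BROKER_RACK):
    """Replicas that survive if the AZ holding the most of them goes down."""
    per_az = Counter(broker_rack[broker] for broker in replicas)
    return len(replicas) - max(per_az.values())


print(replicas_left_after_az_loss([1, 2, 3]))  # 0: every copy sat in us-east-1a
print(replicas_left_after_az_loss([1, 4, 5]))  # 2: still meets min.insync.replicas=2
```

With RF=3 and min.insync.replicas=2, a result below 2 means a single AZ outage would make the partition unwritable, and a result of 0 means it would make the data unavailable.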

Stage 3 — Test: Chaos and Exactly-Once Verification

Never go to production without running these four chaos tests. They expose the configuration mistakes in Stage 2 before users feel them.

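Kill one broker. Verify ISR shrinks and expands correctly. Verify the under-replicated partitions metric (target = 0) returns to 0 once the restarted broker has caught up. Note that replica.lag.time.max.ms controls how long a lagging follower may stay in the ISR, not how fast a restarted broker recovers; catch-up time depends on how much data it missed.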
Kill two brokers simultaneously (with RF=3 and min.insync.replicas=2). Verify producers receive NotEnoughReplicasException rather than silently succeeding. Verify data written before the failure is intact when brokers recover.
Simulate a network partition. Isolate the leader from followers using firewall rules. Verify that, with unclean.leader.election.enable=false, no out-of-sync replica is ever elected leader: either an in-sync follower takes over cleanly or the partition stays unavailable until the network heals (or a manual override is issued). Verify no acknowledged data is lost when the partition heals.
Verify exactly-once for stream processors. Produce a known set of messages (with idempotent deduplication keys), run your Streams application, kill it mid-processing, restart it. Verify each message appears in the output topic exactly once.
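The last test needs a checker for its final step. A minimal sketch, assuming you have already collected the deduplication keys you produced and the keys read back from the output topic into two lists; check_exactly_once is a hypothetical helper and does not talk to Kafka itself.

```python
from collections import Counter


def check_exactly_once(expected_ids, output_ids):
    """Compare produced ids with ids read back from the output topic."""
    counts = Counter(output_ids)
    expected = set(expected_ids)
    seen = set(counts)
    return {
        "missing": sorted(expected - seen),      # lost records
        "duplicated": sorted(k for k, n in counts.items() if n > 1),
        "unexpected": sorted(seen - expected),   # ids that were never produced
    }


produced = ["m-1", "m-2", "m-3", "m-4"]
after_restart = ["m-1", "m-2", "m-2", "m-4"]  # simulated faulty run
print(check_exactly_once(produced, after_restart))
print(check_exactly_once(produced, produced))
```

The run passes only when all three lists are empty; a non-empty "duplicated" list after the kill-and-restart step usually points at a processor that is not actually running with exactly-once enabled.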

Stage 4 — Monitor: The Five Metrics That Matter

Every Kafka broker exposes hundreds of internal counters and gauges through a standard Java monitoring interface called JMX (Java Management Extensions). Pair a JMX exporter with Prometheus and Grafana — or use your managed service's built-in dashboard — and alert on these five metrics:

Under-replicated partitions (URP). Should always be 0. Any non-zero value means at least one partition has fewer in-sync replicas than its replication factor. Alert immediately — a second broker failure now risks data loss.
Consumer group lag. The difference, measured in messages, between the latest offset in a partition and the last offset committed by a consumer group. A growing lag means consumers can't keep up. Alert when lag exceeds your SLA (e.g., a backlog worth more than 30 seconds of processing for a real-time pipeline).
Active controller count. Should always be exactly 1. Zero means no controller (catastrophic). More than one means a split-brain scenario (also catastrophic). This metric rarely deviates, but when it does, it's a P0.
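Request latency P99. Specifically Produce: RequestTotalTimeMs P99 and FetchConsumer: RequestTotalTimeMs P99. Sudden spikes indicate disk I/O saturation, network congestion, or GC pauses. As a starting point, target a produce P99 under 10ms on a healthy cluster, then alert on deviation from your own baseline. Fetch times include the broker's long-poll wait (up to fetch.max.wait.ms) whenever a consumer is caught up, so judge them against that setting rather than against a fixed number.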
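Disk fill rate. Kafka brokers need free disk to write log segments, and Kafka has no built-in safety margin: when a log directory fills up, writes to it fail and the broker takes that directory offline. Alert well before that point (around 85% usage is a common threshold). Track the trend, not just the current value; a retention misconfiguration can fill a disk in hours at high ingestion rates.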
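Two of these metrics are simple arithmetic once you have the raw numbers. The sketch below assumes the offsets and disk figures have already been scraped from your monitoring system; consumer_lag and hours_until_full are hypothetical helpers, and every number is illustrative.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: log_end_offsets[partition] - committed_offsets[partition]
        for partition in log_end_offsets
    }


def hours_until_full(capacity_gb, used_gb, fill_rate_gb_per_hour):
    """Project time to a full disk from the current fill trend."""
    if fill_rate_gb_per_hour <= 0:
        return float("inf")  # disk usage is flat or shrinking
    return (capacity_gb - used_gb) / fill_rate_gb_per_hour


log_end = {0: 10_500, 1: 9_800, 2: 10_050}
committed = {0: 6_300, 1: 5_700, 2: 6_000}
lag = consumer_lag(log_end, committed)
print(lag, sum(lag.values()))           # {0: 4200, 1: 4100, 2: 4050} 12350
print(hours_until_full(2000, 1400, 50)) # 12.0
```

Alerting on the projected hours rather than on the current percentage is what catches the retention-misconfiguration case: a disk at 70% that will be full in 12 hours is more urgent than one sitting steadily at 85%.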

Stage 5 — Optimize: Partitions, Retention, and Tiered Storage

# Topic-level optimization examples

# High-throughput, low-latency topic (payment events)
partitions=48                  # parallelism headroom: 2× expected consumer count
replication.factor=3
min.insync.replicas=2
retention.ms=604800000         # 7 days: enough for replay + incident investigation

# Audit log topic: long retention, low throughput
partitions=4                   # low parallelism needed
replication.factor=3
retention.ms=-1                # retain forever (compliance)

# Use tiered storage (KIP-405) to move old segments to object storage
remote.storage.enable=true     # topic-level switch; brokers also need remote.log.storage.system.enable=true
local.retention.ms=604800000   # keep 7 days local, the rest in S3/GCS (broker-wide default: log.local.retention.ms)

# KIP-405 Tiered Storage: early access from Kafka 3.6, GA in Kafka 3.9+
# Dramatically reduces broker disk requirements for long-retention topics
# Old segments are offloaded to S3/Azure Blob/GCS; consumers fetch seamlessly

Tiered Storage (KIP-405): As of Kafka 3.9 (released late 2024), Tiered Storage is generally available. Closed log segments are copied to object storage (S3, GCS, Azure Blob, through a remote storage plugin), and the local copies are deleted once they are older than the local retention setting (local.retention.ms per topic, log.local.retention.ms as the broker default), so only recent segments stay on local broker disk for low-latency reads. A consumer reading old offsets transparently fetches from object storage. This dramatically reduces broker disk requirements for audit, compliance, and ML-training topics that need multi-year retention. The tradeoff: reads from object storage are slower (10–100ms vs. <1ms from local disk), which is acceptable for batch or historical replay but not for real-time consumers.
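The disk saving is easiest to see as arithmetic. The figures below are assumptions chosen for illustration (100 GB/day of ingest, RF=3, one year of total retention, 7 days kept locally), and the sketch assumes the remote tier stores one logical copy of each segment, leaving redundancy to the object store. local_disk_gb is a hypothetical helper.

```python
def local_disk_gb(ingest_gb_per_day, replication_factor, days_on_broker):
    """Broker disk needed across the cluster for one topic."""
    return ingest_gb_per_day * replication_factor * days_on_broker


INGEST_GB_PER_DAY = 100   # assumed topic volume before replication
REPLICATION_FACTOR = 3
TOTAL_RETENTION_DAYS = 365
LOCAL_RETENTION_DAYS = 7

without_tiering = local_disk_gb(INGEST_GB_PER_DAY, REPLICATION_FACTOR, TOTAL_RETENTION_DAYS)
with_tiering = local_disk_gb(INGEST_GB_PER_DAY, REPLICATION_FACTOR, LOCAL_RETENTION_DAYS)
remote_gb = INGEST_GB_PER_DAY * TOTAL_RETENTION_DAYS  # upper bound in object storage

print(without_tiering)  # 109500 GB of broker disk
print(with_tiering)     # 2100 GB of broker disk
print(remote_gb)        # 36500 GB in object storage
```

Under these assumptions broker disk drops by roughly a factor of 50, and the bulk of the data moves to storage that is typically much cheaper per GB.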

The five operational stages: choose managed (Confluent Cloud or MSK) unless you have specific reasons to self-host; configure RF=3 + min.insync.replicas=2 + rack awareness at onboarding; chaos-test before production; alert on under-replicated partitions, consumer lag, controller count, P99 latency, and disk fill rate; and optimize with tiered storage for long-retention topics once your cluster is stable.

Section 23

Cheat Sheet & Glossary — The 30-Second Recap

Bookmark this section. When you need to refresh before an interview or before reviewing a Kafka config, this is your entry point back into the full page.

Core Concepts at a Glance

Messages persist after consume. Retention is configured by time or bytes, not by delivery status.

Each partition is an independent ordered log. A consumer group assigns one consumer per partition. Max parallelism = partition count.

Each partition is copied to RF brokers (RF=3 is the usual production choice; Kafka's own default is 1). One is the leader; the rest are followers.

The In-Sync Replica set tracks which followers are fully caught up. Only ISR members can become leader (when unclean election is disabled).

Three settings work together: producer acks=all, topic min.insync.replicas=2, broker unclean.leader.election.enable=false.

enable.idempotence=true assigns a sequence number to every message. The broker deduplicates retries automatically.

Kafka is now self-contained. The controller quorum uses Raft consensus internally. ZooKeeper mode is removed.

Log compaction retains the most recent message per key indefinitely. A tombstone (null value) deletes the key. Used for changelog/state topics.

A Java library, not a cluster. Runs inside your app. Handles windowed joins, aggregations, and exactly-once output to Kafka topics.
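Of the cards above, log compaction is the one whose behavior is least obvious from a one-line summary, so here is a toy model of it. The compact function is a hypothetical in-memory sketch: it keeps the latest record per key, preserves original offsets, and drops keys whose latest record is a tombstone. A real broker compacts in the background and keeps tombstones around for a retention period before removing them.

```python
def compact(log):
    """Keep the latest record per key; drop keys whose latest value is None."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None           # None models a tombstone
    ]
    return sorted(survivors)           # offsets keep their original order


changelog = [
    ("user-1", "a@x.com"),   # offset 0
    ("user-2", "b@x.com"),   # offset 1
    ("user-1", "a@y.com"),   # offset 2: newer value for user-1
    ("user-2", None),        # offset 3: tombstone for user-2
    ("user-3", "c@x.com"),   # offset 4
]
print(compact(changelog))
# [(2, 'user-1', 'a@y.com'), (4, 'user-3', 'c@x.com')]
```

The surviving offsets (2 and 4) are not renumbered, which is why consumers of a compacted topic see gaps in the offset sequence.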

Glossary

Broker: A single Kafka server process. It stores log segments for the partitions assigned to it and serves produce/fetch requests.
Controller: The broker (or KRaft quorum) that manages cluster metadata: partition leaders, ISR membership, broker liveness. Exactly one controller is active at a time.
ISR (In-Sync Replica): The set of replicas (leader included) that are caught up to the leader within replica.lag.time.max.ms. Only ISR members count toward acks=all acknowledgements, and only they can become the next leader when unclean election is disabled.
Partition: An ordered, append-only sub-sequence of a topic. The unit of parallelism and storage assignment in Kafka. Messages within a partition have a unique, monotonically increasing offset.
Replica: A copy of a partition stored on a broker. Replication factor 3 means 3 replicas: one leader + two followers.
Leader: The single replica that handles all produce requests for a partition and, by default, all consumer fetches. Elected from the ISR. Followers replicate by fetching new records from it.
Follower: A non-leader replica that pulls writes from the leader. If in the ISR, it can become the next leader if the current leader fails.
Offset: A monotonically increasing integer that identifies a message within a partition. Like a line number. Consumers track their position by committing their current offset.
Consumer Group: A named collection of consumer processes that together consume a topic's partitions. Each partition is assigned to exactly one consumer in the group. Multiple groups can read the same topic independently.
KRaft: Kafka Raft — the built-in consensus protocol that replaced ZooKeeper for metadata management. The controller quorum uses KRaft to elect leaders, store topic metadata, and coordinate broker membership.
KIP (Kafka Improvement Proposal): The formal RFC process for Kafka changes. KIP-405 introduced Tiered Storage; KIP-500 introduced KRaft. Numbers are searchable in the Apache Kafka wiki.
Compaction: A log retention mode where Kafka keeps only the most recent message per key, indefinitely. Used for topics that represent the current state of keyed entities (changelog, config, user profiles).
Tombstone: A message with a null value published to a compacted topic. Signals that the key should be deleted from the compacted log. Compaction first removes the key's older values; the tombstone itself is removed once delete.retention.ms has passed, after which the key disappears entirely.
EOS (Exactly-Once Semantics): The guarantee that a message is processed exactly once — not zero times (lost) and not more than once (duplicated). In Kafka, EOS applies within Kafka-to-Kafka flows (e.g., Kafka Streams reading from topic A and writing to topic B).
MirrorMaker 2.0: The official Kafka tool for cross-cluster replication. Built on Kafka Connect. Replicates topics, consumer group offsets, and ACLs between clusters in different regions or data centers.
Kafka Connect: A framework for moving data between Kafka and external systems (databases, S3, Elasticsearch) using pre-built source and sink connectors. Runs as a scalable, fault-tolerant cluster of workers.
Kafka Streams: A Java library for stateful stream processing. Runs inside your application process — no external cluster required. Supports filter, map, join, windowed aggregation, and exactly-once output to Kafka.
Schema Registry: A service (commonly Confluent Schema Registry) that stores and validates Avro/JSON/Protobuf schemas for Kafka topics. Ensures producers and consumers agree on message format. Prevents schema drift from breaking downstream consumers.

Nine cheat-sheet cards capture the core Kafka guarantees and design choices at a glance. The 18-term glossary is the vocabulary you need to read Kafka documentation, architecture discussions, and incident postmortems fluently. Both are designed for fast review — scan before an interview or design session, then use the relevant sections above for depth.


Retention and Log Compaction

Exactly-Once Semantics — How Kafka Achieves It

Building Block 1 — Idempotent Producer

Building Block 2 — Transactions: Atomic Multi-Partition Writes

Building Block 3 — read_committed Consumer

Java Transactional Producer Snippet

Where EOS Works — and Where It Doesn't

Log Compaction — Keep Only the Latest Value Per Key

Where Compaction Is Used

How Compaction Works Mechanically

Deleting a Key — The Tombstone

Kafka Streams — Stream Processing on Top of Kafka

KStream and KTable — Two Mental Models

Stateful Operations — Where It Gets Interesting

A Real Topology: Fraud Detection

Code: A Kafka Streams Topology

Kafka Connect — Move Data In and Out Without Writing Code

Source Connectors and Sink Connectors

The Debezium CDC Pipeline — A Walk-Through

Deploying a Debezium Connector via REST

Schema Registry & Data Contracts — Preventing the Schema Drift Disaster

The Three Compatibility Modes

Checking Compatibility Before Deploying

Pitfalls & Anti-Patterns — What Breaks Kafka Clusters in Production

Practice Exercises — Build Your Intuition

Bug Studies — When Kafka Goes Wrong in Production

What Went Wrong

What Went Wrong

What Went Wrong

What Went Wrong