Publish/Subscribe (Pub/Sub)

Real World	What it represents	In pub/sub code
Newspaper company	The thing that creates and sends events	`publisher` / producer
The edition / story	The data being communicated	`message` / event payload
"Business News" section	The category of events	`topic` / channel
Distribution system (post office)	The intermediary that handles routing	`broker`
Individual subscriber	A service that cares about the events	`subscriber` / consumer
Subscription sign-up	Registering interest in a topic	`subscribe("topic.name")`

Evolution — How Messaging Got Here

Five eras from hand-rolled glue scripts to the event-driven substrate powering modern platforms.

Pub/sub didn't arrive fully formed. It evolved through five distinct eras, each one solving problems the previous era created. Understanding this history isn't trivia — it explains why every design decision in Kafka, RabbitMQ, and Google Pub/Sub looks the way it does. The quirks are answers to specific pain points, and once you know the pain, the answers make sense.

Each dot on that timeline is a step forward — and each step was forced by something that broke. Let's walk through all five.

Before standardised messaging libraries existed, teams who needed async communication built it themselves. The classic approach: one service writes a row to a database table called something like pending_jobs. Another service polls that table every few seconds, picks up rows, processes them, and deletes them. Job done. This is a queue — just a terrible one.

The why behind the polling: services running on different machines couldn't share memory or call each other reliably over the network. A database table was the only shared state both processes could access safely. It worked well enough for low-volume batch jobs in the 1990s. The problem arrived with scale.

The pain points that defined this era: Database polling hammers your DB with SELECT queries every few seconds even when there's nothing to process — wasted CPU and I/O. Rows held for long processing block other readers. When two consumers try to claim the same row, you need careful locking. And there's no fan-out: each row goes to exactly one consumer, so broadcasting an event to multiple services means writing multiple rows manually. By the late 90s, larger systems were groaning under this weight.

What changed and why: The core insight was that a database is not a message broker. A message broker is optimised for temporal delivery — get a message from A to B as fast as possible, then forget it. A relational database is optimised for persistent queries, transactions, and long-lived storage. Forcing one to do the other's job creates friction everywhere. This realisation drove the next era.

The Java Message Service specification (JMS 1.0, released in 1998) was the first serious attempt to standardise messaging in enterprise software. The idea was appealing: define a common API, and any compliant messaging provider — IBM MQ, TIBCO, BEA, ActiveMQ — can plug in underneath. Write your code once against the JMS API, swap providers if you want.

In practice, this worked only partially. JMS standardised the API — Session.createProducer(), MessageConsumer.receive() — but left the underlying wire protocol completely up to each vendor. IBM MQ used a proprietary binary protocol. TIBCO used a different one. ActiveMQ used OpenWire. If Service A ran on an IBM MQ broker and Service B ran on ActiveMQ, they couldn't talk without a bridge component. You were essentially vendor-locked to a single broker vendor per deployment.

JMS did standardise two important abstractions that survive today: queues (point-to-point — one producer, one consumer, each message consumed exactly once) and topics (pub/sub — one producer, many consumers, each subscriber gets their own copy). Every modern messaging system uses some form of this split, even if the names differ.

What changed and why: The vendor lock-in was a strategic problem. If you bet on TIBCO and they raised prices or went under, migrating was a multi-year project. What the industry needed wasn't just a common API — it needed a common wire protocol so different brokers could interoperate. That drove the next era.

The Advanced Message Queuing Protocol (AMQP)An open standard wire-level protocol for message-oriented middleware. Unlike JMS which only standardises the API, AMQP standardises the actual bytes on the wire — meaning any AMQP client can talk to any AMQP broker regardless of language or vendor. was born from frustration with vendor lock-in. JPMorgan Chase, Red Hat, Cisco, and others collaborated to define a common binary protocol that specified not just the API calls but the exact bytes on the wire. Any client library that speaks AMQP can talk to any AMQP broker. The era of proprietary silos was over.

RabbitMQ, first released in 2007, became the landmark implementation. It introduced a routing model that was genuinely more flexible than anything before it: the exchange + binding model. Publishers don't send messages directly to queues. They send to an exchange — a named routing component — which applies a binding rule to decide which queues receive the message.

Exchange Type	Routing Rule	Maps to What
`direct`	Route by exact routing key match	Point-to-point queue
`fanout`	Copy to ALL bound queues, ignore routing key	Pure pub/sub broadcast
`topic`	Match routing key patterns (wildcards `*` and `#`)	Topic-filtered pub/sub
`headers`	Match on message header attributes	Content-based routing

A fanout exchange is pure pub/sub: every queue bound to it gets a copy of every message, regardless of routing key. This is the model that directly inspired Google Pub/Sub's subscription model and many others that followed.

What changed and why: AMQP solved the interoperability problem but introduced a new one — the broker became the centre of the universe. Every message had to flow through the broker's exchange routing logic. At enormous scale (millions of messages per second), that single routing layer became a bottleneck. The broker also had limited memory: messages lived in RAM and maybe on disk, but once consumed they were gone. There was no concept of a subscriber "replaying" past messages. That gap became the motivation for the next era.

As cloud infrastructure matured, the big providers built managed pub/sub services that took the operational burden off engineering teams entirely. You didn't have to provision, patch, or scale your own broker — the cloud provider did it.

Google Cloud Pub/Sub (public beta March 2015, GA later that year) introduced a durable, at-least-once delivery model with configurable retention and a clean separation of topics (where publishers write) and subscriptions (where consumers read). Multiple subscriptions can attach to one topic, each maintaining its own independent cursor — so adding a new consumer doesn't affect existing ones at all. This addressed one of RabbitMQ's pain points: binding queues to exchanges was manual plumbing. In GCP Pub/Sub, you just create a new subscription and it starts receiving all future messages automatically.

AWS SNS (Simple Notification Service) took a different angle: fan-out to heterogeneous endpoints. One SNS topic can fan out to SQS queues, Lambda functions, HTTP endpoints, email addresses, and SMS simultaneously. The same order.placed event can trigger a Lambda that processes it in real time, land in an SQS queue for a worker that processes it asynchronously, and send an SMS alert to an on-call engineer — all from a single publish. This pattern became one of the most widely deployed event-driven architectures on AWS.

Redis Pub/Sub filled a different niche: ultra-low-latency, in-process fan-out where durability doesn't matter. Redis Pub/Sub has no message storage — if a subscriber isn't connected at the moment a message is published, that message is lost. But for use cases like real-time presence updates, live chat fan-out within a single data centre, or cache invalidation signals, that trade-off is fine. The gain is sub-millisecond delivery with zero broker overhead beyond what Redis already provides.

What changed and why: Cloud-native pub/sub made the pattern accessible to teams that couldn't or wouldn't operate their own broker infrastructure. But all these systems shared one limitation: they were fundamentally "fire and forget" from the broker's perspective — once a message was delivered and acknowledged, it was gone. You couldn't rewind to last Tuesday's events to replay them through a new service. The 2008 financial crisis and subsequent regulatory requirements (audit trails, reproducible computation) made replay a hard requirement in financial and analytics workloads. That gap drove Kafka.

Apache Kafka was created at LinkedIn and open-sourced in 2011. The key insight behind it was simple but profound: instead of treating messages as ephemeral items to be consumed and deleted, treat them as entries in an immutable, append-only log. A log keeps everything. New consumers can subscribe to the beginning of the log, not just the tail. Old consumers can replay from any point. The data is the log; the log is the truth.

This one design decision — log as the substrate — changed what was possible:

Replay: Deploy a new analytics service today and back-fill from 90 days of events. No special export needed — the data is already there.
Multiple consumer groups: Each consumer group maintains its own offset pointer in the log. They're completely independent. You can have a real-time processing group at offset 10,000,000 and a batch analytics group at offset 5,000,000 simultaneously, neither affecting the other.
Partitioned parallelism: Topics are split into partitions, each an independent log shard. Multiple consumers in a group can read different partitions in parallel, scaling throughput linearly with partition count.
Durability as a side effect: The log is just files on disk. Replication across brokers makes it fault-tolerant. Retention policies (time-based or size-based) control how far back you can go.

The convergence: Kafka didn't replace pub/sub — it expanded what pub/sub could do. Classic pub/sub (RabbitMQ, SNS) is great when you want fire-and-forget fan-out with low latency and no need to replay. Log-based pub/sub (Kafka) is great when the event history itself is valuable, when you need multiple independent consumers with different paces, or when you want to build stream processing pipelines. Modern architectures often use both: Redis Pub/Sub for sub-second in-memory fan-out, Kafka as the durable event backbone, SNS for cross-system notifications.

Era	Representative Tool	Key Innovation	Key Limitation
1 — Late 90s	DB polling tables	Async decoupling without a separate broker	Hammers DB, no fan-out, terrible throughput
2 — Early 2000s	JMS / ActiveMQ	Standardised API, queues + topics	Proprietary wire protocols, vendor lock-in
3 — ~2007	AMQP / RabbitMQ	Open wire protocol, flexible exchange routing	No message replay, broker memory limits
4 — 2010–2015	GCP Pub/Sub, SNS, Redis	Managed, no ops, heterogeneous fan-out	Fire-and-forget only, no replay
5 — 2011+	Apache Kafka	Immutable log, replay, partitioned parallelism	Higher operational complexity, not "instant forget"

Messaging evolved through five eras: (1) hand-rolled DB polling tables — simple but slow; (2) JMS standardised the API but not the wire, causing vendor lock-in; (3) AMQP/RabbitMQ standardised the wire protocol and introduced flexible exchange routing; (4) cloud-native services (GCP Pub/Sub, SNS, Redis) eliminated ops burden and enabled heterogeneous fan-out; (5) Kafka's log-based model added replay, independent consumer groups, and durable event history. Each step fixed the previous era's deepest pain point.

Internals — How a Broker Actually Routes a Message

Open the broker's hood: topic registry, subscriber lists, push vs pull, ack/redelivery, ordering, and partitioning — all in plain English first.

From the outside, a broker looks like a simple relay: messages go in on one end, messages come out the other. But inside, a broker is making a surprising number of decisions per message — which subscribers should receive it, how to deliver it, what to do if delivery fails, how to maintain ordering, and how to handle a subscriber that's falling behind. Understanding these internals is what separates engineers who use pub/sub from engineers who can debug it when it breaks at 3am.

The Topic Registry — The Broker's Routing Table

At its core, a broker maintains a topic registry — a map that says "topic X has subscribers A, B, and C." When a publisher sends a message tagged with a topic name, the broker looks up that topic in the registry and gets back the list of subscribers who should receive it. That lookup is the entire routing algorithm for basic pub/sub.

Why is this stored in a registry rather than derived on the fly? Because subscribers register interest ahead of time. When a service starts up, it makes a subscribe("order.placed") call to the broker. The broker records that subscription. Later, when a message arrives for order.placed, the broker already has the full subscriber list ready — no need to discover subscribers at delivery time. This pre-registration is what makes fan-out fast and what enables the broker to buffer messages for offline subscribers.

The diagram shows the five-step path a message takes through the broker. Step 1: the publisher sends the message. Step 2: the broker persists it and sends an acknowledgement back to the publisher — from this point on, the broker owns the message. Step 3: the broker looks up the topic in its registry to find the subscriber list. Step 4: the fan-out engine creates one copy of the message per subscriber and places each copy in that subscriber's delivery queue. Step 5: the broker delivers each copy and waits for an acknowledgement from each subscriber. Only when a subscriber acks does the broker remove the message from that subscriber's queue. No ack within the timeout window means redelivery.

Push vs Pull — Two Ways to Get Messages Out

Once a message is sitting in a subscriber's delivery queue, the broker needs to get it to the subscriber's code. There are two fundamentally different approaches to this, and they trade off throughput, latency, and back-pressure control.

Think of push vs pull like a waiter vs a cafeteria line. Push is the waiter bringing food to your table as soon as it's ready — you didn't ask, it just arrives. Pull is the cafeteria: you go get food when you're ready for it. Push feels faster but can overwhelm you if the kitchen is faster than you can eat. Pull gives you control but requires you to keep asking "is there more?"

	Push Delivery	Pull Delivery
How it works	Broker calls the subscriber's endpoint (HTTP callback, WebSocket, long-poll callback) when a message arrives	Subscriber sends a request to the broker asking "give me up to N messages"
Latency	Lowest — message arrives at subscriber as soon as it hits the broker	Slightly higher — subscriber must poll; messages sit until the next poll cycle
Back-pressure	Hard — if broker pushes faster than subscriber can process, subscriber buffers overflow	Natural — subscriber controls the rate by how often and how much it asks for
Example systems	GCP Pub/Sub (push subscriptions), AWS SNS → Lambda, WebSockets	GCP Pub/Sub (pull subscriptions), Apache Kafka, SQS
Best for	Low-volume, latency-sensitive workloads where the subscriber can keep up	High-throughput workloads where the consumer controls its own pace

Kafka uses pure pull. There's a deep reason for this: Kafka serves millions of messages per second across thousands of partitions. If the broker tried to push to every consumer simultaneously, it would need to track each consumer's processing capacity and throttle accordingly — enormous coordination overhead. With pull, each consumer group polls at its own pace. The broker just waits. A slow consumer has no impact on fast consumers, because each holds its own independent offset pointer.

Acknowledgements and Redelivery — The At-Least-Once Problem

Here's a scenario that causes subtle production bugs: a subscriber receives a message, starts processing it, crashes halfway through, and never sends an ack. What should the broker do?

The answer in virtually every production pub/sub system: redeliver. The broker has a timeout (e.g., 30 seconds). If no ack arrives within that window, the broker assumes the message was lost and puts it back in the subscriber's queue. The subscriber restarts, receives the message again, and processes it successfully. This is called at-least-once deliveryA delivery guarantee where every message is delivered to every subscriber at least once, but possibly more than once if a failure happens between processing and acknowledgement. Contrast with at-most-once (may be lost, never duplicated) and exactly-once (never lost, never duplicated — requires transactions)..

At-least-once means your handler WILL run more than once for the same message. This isn't a bug in the broker — it's an intentional safety trade-off. The alternative (at-most-once) would mean some messages could be silently dropped if a subscriber crashes. For most production systems, a duplicate is better than a missing event. But your subscriber code must be written to handle duplicates safely — a property called idempotency. An idempotent handler produces the same result whether it runs once or ten times for the same message. This is one of the most important design rules in pub/sub systems.

Ordering — When Does It Matter and When Can You Get It?

Ordering is one of the trickiest guarantees in pub/sub. The short answer: most pub/sub systems give you ordering within a partition or queue, but not across them. Here's why.

If a topic is served by a single broker thread in a single queue, messages arrive and are delivered in the order they were published — FIFO. Simple. But real systems fan out across multiple broker nodes and multiple subscriber instances for throughput. The moment you introduce parallelism, strict global ordering breaks — message A might be processed by subscriber instance 1 and message B by instance 2, and instance 2 might finish first even though B was published after A.

The standard solution is partition-key ordering: messages that must be ordered relative to each other share the same partition key. For example, all events for user #5001 carry the key user:5001. They all hash to the same partition, which is assigned to exactly one consumer in the consumer group. Within that partition, they're FIFO. Events for user #9999 are in a different partition and can process in parallel without interfering. You get per-entity ordering without sacrificing cross-entity throughput.

Partitioning — The Throughput Multiplier

A single broker thread handling a single topic queue will eventually hit a ceiling — the throughput of one CPU core reading from one disk. PartitioningSplitting a topic into multiple independent shards (partitions), each handled by a different broker node and consumed by a different subscriber instance. Partitions are the mechanism by which pub/sub systems scale throughput horizontally. solves this by splitting a topic into N independent sub-streams. Each partition is a completely independent queue on a different broker node. Publishers hash messages to partitions by key. Subscriber instances in a consumer group are each assigned a subset of partitions — so with 12 partitions and 4 subscriber instances, each instance handles 3 partitions in parallel.

The rule that governs scaling: maximum parallelism = number of partitions. You can add subscriber instances beyond partition count, but the extra instances sit idle with no partitions to read from. They serve only as hot standbys — ready to take over if an active instance dies and triggers a rebalance.

Inside a broker, messages flow through five stages: receive and persist, topic registry lookup, fan-out copy creation, per-subscriber queue delivery, and ack/redelivery management. Push delivery minimises latency; pull delivery gives subscribers natural back-pressure control — Kafka uses pull exclusively. At-least-once delivery is the default safety guarantee: brokers redeliver any message that isn't acked within the timeout, so subscriber handlers must be idempotent. Ordering is guaranteed only within a partition; use consistent partition keys to get per-entity ordering. Partitioning is the horizontal scaling mechanism — throughput scales with partition count, not subscriber count.

When To Use Pub/Sub

Decision framework, use-cases, and when pub/sub is the wrong tool — with a branching flowchart to wire it into your instincts.

Pub/sub solves a specific class of problems very well and handles others poorly. The mistake most engineers make is reaching for it by default because it sounds "scalable." Before picking up a broker, you need to be able to answer two questions: does this situation actually have multiple independent consumers? And does the publisher genuinely not need a response? If both answers are yes, pub/sub belongs in your design. If either is no, you're likely better served by something simpler.

Walk through those three questions for any design decision and you'll catch the majority of bad pub/sub uses before they're built. The pattern below shows both sides of the coin — the situations where pub/sub shines and where it creates more problems than it solves.

The three-question heuristic: If you can't name at least two independent services that need this event, skip the broker. If the caller needs the result before returning a response, use a direct call. If you're writing to a single consumer for ordering reasons, a queue (not pub/sub) is the right shape.

Use pub/sub when one event genuinely needs to trigger multiple independent downstream reactions, when you want producer-consumer decoupling, or when you need a durable event log. Avoid it when you need a synchronous reply, when there's only one consumer, when you need transactional rollback semantics, or when your handler cannot be made idempotent. The three-question flowchart — multiple consumers? needs reply? needs global ordering? — catches most design mistakes before they're built.

Pub/Sub vs Alternatives

Direct API calls, message queues, event streaming, webhooks, and the Observer pattern — how pub/sub compares and when to pick each one.

Pub/sub is one tool in a family of communication patterns. Choosing the wrong one doesn't cause immediate explosions — it causes slow architectural drift toward a system that's hard to change and harder to debug. The comparison below focuses on the shape of each pattern — who knows about whom, what happens on failure, and what you can do that you can't do with the others.

The visual above shows the five patterns at a glance — their topology tells you most of what you need to know. Now let's go deeper on each comparison.

The core distinction: pub/sub is for events that happened (past tense, notification, no response needed). Direct API calls are for commands that need a result (future tense, query, response required). If your operation is "send order placed notification," pub/sub is right. If it's "calculate the tax for this cart and return the result," a direct call is right.

This is the most commonly confused pairing. A message queue is a work distribution tool — you have a pool of workers and you want each job processed by exactly one of them. Pub/sub is a broadcast tool — you have an event and you want every interested party to know about it. They solve different problems. Kafka is interesting because it can do both: within a consumer group, it acts like a queue (each partition assigned to one consumer). Across consumer groups, it acts like pub/sub (each group gets all messages independently).

Think of classic pub/sub (Redis Pub/Sub, SNS) as a phone call: you need to be listening when the message comes in. Event streaming (Kafka) is a voicemail inbox: messages are stored, you listen whenever you're ready, and you can re-listen as many times as you want. If your use case requires replaying past events or catching up after downtime, you need event streaming. If you need sub-millisecond delivery and don't care about replay, classic pub/sub wins.

Webhooks are pub/sub's HTTP-native cousin, commonly used at API boundaries between different companies (Stripe calls your webhook when a payment succeeds; GitHub calls your webhook when a commit is pushed). Inside a single system, webhooks are the wrong choice because you don't get broker-managed retry, delivery guarantees, or fan-out without significant plumbing. At API boundaries where the subscriber is an external service you don't control, webhooks are standard and appropriate.

The Observer pattern (a Gang of Four behavioural pattern) is pub/sub's conceptual ancestor, but they live in different worlds. Observer is within a single process — a button click notifies listeners in the same browser tab. Pub/sub is across processes — an order service on Machine A notifies subscribers on Machines B, C, and D. The idea is identical; the implementation, durability, and scale are completely different.

Pub/sub is one tool in a family. Use a direct API call when you need a synchronous reply. Use a point-to-point queue when distributing work across competing workers. Use event streaming (Kafka) when you need replay, event history, or consumer groups at different offsets. Use webhooks at API boundaries with external systems. Use the Observer pattern for in-process event notification within a single codebase. Pub/sub sits squarely in the "one event, many independent async reactions" space — that's the shape it's designed for.

Real Companies — Pub/Sub at Scale

Five real systems that use pub/sub as a load-bearing architectural component — and the specific reasons each one reached for it.

Abstract patterns become concrete when you see them carrying real production load. The five examples below aren't just "they use Kafka" — each one illustrates a specific pub/sub property (fan-out, decoupling, replay, heterogeneous consumers, or low-latency delivery) that was the deciding factor in the architecture.

Numbers note: Where specific throughput figures are not verifiable from primary public sources, descriptions use approximate language. The architectural decisions are drawn from published engineering blog posts and conference talks.

When a user sends a message in a Slack channel with 500 members, Slack needs to push that message to every active member's WebSocket connection in near real-time. The naive approach — have the message handler look up all 500 member sessions and push to each one directly — doesn't scale when you have millions of concurrent users spread across thousands of servers.

Slack's approach (described in their engineering blog) uses an internal pub/sub system layered on top of stateful Channel Servers and Gateway Servers. When a user sends a message, the Gateway Server holding their connection routes it to the appropriate Channel Server (chosen by consistent hashing on channel ID). The Channel Server then fans the message out to every Gateway Server worldwide that has subscribed clients in that channel. Each Gateway Server pushes over its own open WebSocket connections. Internally, Slack uses Kafka for durable buffering and Redis for in-flight, fast-access job data — but the fan-out primitive itself is their own pub/sub abstraction over these stores.

Why pub/sub was the right shape here: The publisher (the server handling the sending user) has no knowledge of which other servers have active connections for channel members — that's constantly changing as users connect and disconnect. Pub/sub lets the publisher broadcast once; each server self-selects based on what it has subscribed to. An in-memory, low-durability fan-out primitive works well here because the cross-server hop is within Slack's data centres where latency is sub-millisecond, and if a server briefly misses a message the client fetches history from a separate persistent store on reconnect.

The trade-off acknowledged: An in-memory pub/sub layer typically has no message persistence — if a server is temporarily disconnected and misses a published message, that message is gone from the in-memory broker. Slack handles this with a separate history storage layer (backed by Kafka and a chat-history datastore) that clients can query to catch up. The pub/sub layer handles the "right now" fan-out; durable storage handles the "what did I miss" case. These are separate concerns with separate tools.

An Uber trip involves dozens of state transitions: driver goes online, driver goes nearby, driver accepts trip, driver arrives, trip starts, waypoints update, trip ends, payment processes. Multiple services need to react to each of these transitions: the mobile app needs to update the map, the pricing service needs to update the fare estimate, the safety system needs to log the event, the analytics pipeline needs the data.

This is a textbook pub/sub use case. Each state transition is published as an event to a topic (e.g., trip.driver.arrived). Services that care — map rendering, fare calculation, safety logging, analytics — subscribe to the events they need. The dispatch system that generates these events doesn't call each downstream service directly. It publishes and walks away.

Why this matters at Uber's scale: Uber processes millions of trips simultaneously, with driver location updates arriving multiple times per second per trip. At that rate, any direct-call architecture between services would create a cascading failure risk — one slow downstream service would back up the dispatch system. With pub/sub, each subscriber operates at its own pace. The analytics pipeline can fall behind during a traffic spike and catch up when load drops, with no impact on the real-time map updates that riders see. The concerns are genuinely independent, and pub/sub enforces that independence architecturally.

Uber's engineering team has published extensively on their use of Apache Kafka as the backbone for these event pipelines, with Kafka's log retention enabling the analytics and data science teams to replay event history for model training and back-testing.

LinkedIn created Apache Kafka specifically to solve their own messaging problem at scale. Before Kafka, LinkedIn had a fragmented landscape: ad hoc point-to-point integrations between systems, each with its own data pipeline. Every time a new data consumer needed activity data (for the news feed, for recommendations, for analytics), a new pipeline was built from scratch to connect to the source. The complexity grew combinatorially.

Kafka changed the shape of that problem. Instead of connecting each consumer to each producer directly, every producer publishes to Kafka topics. Every consumer reads from Kafka topics. The number of integrations drops from O(producers × consumers) to O(producers + consumers). LinkedIn uses Kafka for activity stream data (page views, searches, profile visits), infrastructure metrics, audit logs, and as the backbone for their data warehouse ingestion pipeline.

Why the log model specifically: LinkedIn's data science teams needed to train recommendation models on historical activity data. With a traditional pub/sub system (fire-and-forget), historical events would be gone once consumed. Kafka's log retention means the machine learning team can spin up a new model training job and replay months of activity data from the beginning of the log — no special export, no batch dump, just a new consumer group starting at offset 0. The event history is a permanent asset, not a transient stream.

This is the pub/sub property LinkedIn needed most: not just fan-out, but replayable fan-out where new consumers can join the conversation retroactively.

Google Cloud Spanner — Google's globally distributed relational database — exposes a change streams feature that lets you subscribe to all inserts, updates, and deletes on a table in near real time. The typical production pattern (and the one Google provides a managed Dataflow template for) is to run a Dataflow pipeline that uses the Apache Beam SpannerIO connector to read change records from Spanner and forward them to a GCP Pub/Sub topic. Your applications then subscribe to that Pub/Sub topic and receive the stream of changes as they happen.

This is a compelling demonstration of pub/sub's "producer doesn't know its consumers" property. The Spanner-to-Pub/Sub pipeline doesn't know whether you're using the change feed to sync a downstream read replica, to trigger a Cloud Function, to feed an analytics pipeline, or to implement event sourcing. It just publishes changes. The GCP Pub/Sub layer fans those changes out to however many subscribers you've set up, each processing at their own pace with their own independent subscription cursor.

Why GCP Pub/Sub specifically: GCP Pub/Sub's subscription model means each consumer gets its own independent delivery queue and ack state. If your analytics pipeline falls behind, that doesn't affect your cache-warming subscriber at all. And because GCP Pub/Sub retains unacked messages for a default of seven days (configurable up to 31 days), a subscriber that goes offline for a few hours can catch up when it recovers — without any special handling upstream. The durability and independent-cursor model is precisely what makes this pattern reliable.

One of the most widely deployed pub/sub patterns in AWS architectures is the SNS → SQS fan-out. The pattern works like this: you create one SNS topic for an event type (e.g., order-placed). You create multiple SQS queues — one per consumer service (email service queue, inventory service queue, analytics queue). You subscribe each SQS queue to the SNS topic. When a publisher sends a message to the SNS topic, SNS fans it out to all subscribed SQS queues simultaneously. Each consumer service reads from its own queue independently.

This is the canonical pattern for decoupling services in AWS. Why does SNS sit in front of SQS instead of publishing directly to SQS? Because SNS handles the fan-out. If you published directly to SQS, you'd have to make one publish call per consumer queue, and you'd need to know about all consumer queues. With SNS in front, you publish once and SNS does the fan-out. New consumer teams just subscribe their queue to the SNS topic — the publisher never changes.

SNS can also fan out to Lambda functions, HTTP endpoints, email addresses, and SMS simultaneously from the same topic. A payment failure event can simultaneously trigger a Lambda for automated retry logic, drop a message in an SQS queue for the fraud analysis worker, and send an SMS to the on-call engineer — all from one SNS publish call.

The pattern in one sentence: SNS is the broadcaster; SQS is the durable buffer. SNS gives you fan-out without storage; SQS gives you storage without fan-out. Together they give you both.

Five real systems, five different pub/sub properties. Slack uses an internal pub/sub layer (Channel Servers fanning out to Gateway Servers) for sub-millisecond in-region delivery, backed by Kafka for durability. Uber uses Kafka-backed pub/sub to decouple dispatch events from the many services that react to them, enabling independent scaling and failure isolation. LinkedIn built Kafka specifically to replace O(N²) direct integrations with an O(N) log-based architecture, with replay as a first-class capability. GCP Pub/Sub is the standard delivery layer for Spanner change streams (typically via a Dataflow pipeline), separating change producers completely from downstream consumers (pipelines, functions, replicas). AWS SNS+SQS is the standard fan-out pattern for AWS architectures, where SNS handles broadcast and SQS provides per-consumer durability.

Production Bug Case Studies

Three bugs that show up repeatedly in real pub/sub systems — and the thinking you need to prevent and diagnose them.

Pub/sub systems fail in predictable ways. The same three failure modes appear in almost every large-scale deployment. Understanding each one — what caused it, what the symptoms looked like, and how to fix it — is worth more than any theoretical knowledge about brokers. These are the bugs that page on-call engineers at 3am.

Incident Summary: A downstream analytics service begins processing messages 10× slower than usual (a database query it depends on acquired a slow lock). The broker's delivery queue for that subscriber grows. Memory usage on the broker climbs. Other topics on the same broker begin experiencing delivery latency. Eventually the broker starts refusing new publishes and the entire event pipeline stalls — even for topics the slow subscriber has nothing to do with.

What Went Wrong

The root cause is a mismatch between publish rate and consume rate. When a subscriber processes messages slower than they arrive, the broker must buffer the undelivered messages somewhere. In most brokers, that buffer is bounded — either by memory limits or disk quotas. When the buffer fills, the broker has two bad choices: drop new messages (losing data) or apply back-pressure to publishers (slowing down the entire system).

What makes this particularly nasty is the blast radius. The analytics service subscribed to order.placed is slow. But because all topics live on the same broker cluster and share the same memory pool, the broker's slowdown affects every other publisher and subscriber on the cluster — including the email service and inventory service that are completely healthy and working fine. One slow subscriber degrades the entire messaging infrastructure.

# analytics_consumer.py — the slow subscriber # No timeout, no concurrency, processes one message at a time # This is the code that causes the cascade import broker_client def handle_order_placed(msg): # This function occasionally takes 10+ seconds when the # data warehouse is under load — nobody noticed in dev result = slow_data_warehouse_query(msg["orderId"]) write_to_analytics_store(result) # ACK sent only AFTER all the work is done. # During a DW slowdown: 1 msg takes 10s = 6 msg/min # But publisher sends 600 msg/min → queue grows 594/min broker_client.ack(msg) # Single-threaded subscription — one slow message blocks all others broker_client.subscribe("order.placed", handle_order_placed) broker_client.start() # blocking loop, single thread

# analytics_consumer.py — fixed version # Three changes: concurrency, timeout guard, separate slow path import broker_client import asyncio from concurrent.futures import ThreadPoolExecutor executor = ThreadPoolExecutor(max_workers=20) # bounded concurrency def handle_order_placed(msg): try: # 1. ACK immediately (or set a short visibility timeout) # The message is safe in the DLQ if processing fails broker_client.ack(msg) # 2. Do slow work AFTER acking, non-blocking # If DW is slow, this backs up in our local executor — # NOT in the broker's queue executor.submit(process_analytics_async, msg) except Exception as e: # 3. On failure, message goes to dead-letter queue # for manual inspection, not silently dropped broker_client.nack(msg, dead_letter=True) log.error(f"analytics processing failed: {e}") def process_analytics_async(msg): # Runs in thread pool — DW slowness only fills our local pool # not the broker queue. Bounded by max_workers=20. result = slow_data_warehouse_query(msg["orderId"]) write_to_analytics_store(result) # ALSO: set a max inflight / consumer quota on the broker side # so this subscriber can never accumulate >10k unacked messages # broker_client.set_max_inflight(10_000) broker_client.subscribe("order.placed", handle_order_placed) broker_client.start()

Lesson: A slow subscriber is a broker resource leak. The fix has two layers: (1) decouple the ack from the slow processing so the broker's queue drains at message-receive speed, not processing speed; (2) bound your local concurrency so the slow work backs up in your own thread pool, not the broker's memory. Add a max_inflight or consumer quota at the broker level as a safety net.

How to Spot: Watch broker queue depth per subscriber. If one subscriber's queue grows monotonically while others are flat, that subscriber is the bottleneck. Set alerts on "consumer lag" (Kafka) or "approximate number of messages not visible" (SQS) per subscription. A steadily growing lag that doesn't recover is the early warning signal — intervene before it fills broker memory.

Incident Summary: A payment service subscribes to checkout.completed events and charges the customer's card. One evening, a network blip between the broker and the payment service causes an ack to be lost. The broker redelivers the message 30 seconds later. The payment handler runs again for the same checkout event. The customer is charged twice. The money is real; the duplicate charge triggers a wave of support tickets.

What Went Wrong

This is the classic at-least-once delivery collision. The broker guarantees every message will be delivered at least once — meaning it may deliver the same message multiple times if an ack is lost or the subscriber crashes between processing and acknowledging. This is the correct, safe default: the broker would rather send twice than not at all.

The bug is not in the broker. The bug is in the handler: it performs a side effect (charging a card) without checking whether that side effect has already been performed for this specific message. A handler that produces the same outcome whether it runs once or ten times is called idempotentFrom mathematics: a function is idempotent if applying it multiple times produces the same result as applying it once. In distributed systems, an idempotent operation is one that can be safely retried without changing the outcome beyond the first application.. The payment handler was not idempotent, so redelivery became a double charge.

// payment-handler.js — non-idempotent, dangerous with at-least-once async function handleCheckoutCompleted(msg) { const { checkoutId, customerId, amount } = msg; // BUG: no deduplication check. // If this message is delivered twice (at-least-once), we charge twice. await chargeCard(customerId, amount); // BUG: ack AFTER the charge, but if network drops the ack, // broker redelivers and we charge again. await broker.ack(msg); console.log(`Charged customer ${customerId} for checkout ${checkoutId}`); } broker.subscribe("checkout.completed", handleCheckoutCompleted);

// payment-handler.js — idempotent via deduplication table const processedCheckouts = new Map(); // In production: Redis or DB table async function handleCheckoutCompleted(msg) { const { checkoutId, customerId, amount } = msg; // FIX 1: Idempotency check — have we processed this checkout before? // Use checkoutId as the idempotency key (stable across redeliveries) const alreadyProcessed = await db.query( `SELECT 1 FROM processed_events WHERE event_id = ?`, [checkoutId] ); if (alreadyProcessed) { // Duplicate delivery — safe to ack and skip console.log(`Skipping duplicate for checkout ${checkoutId}`); await broker.ack(msg); return; } // FIX 2: Mark as processed AND charge in the same DB transaction // This is atomic: either both happen or neither does await db.transaction(async (tx) => { await tx.query( `INSERT INTO processed_events (event_id, processed_at) VALUES (?, NOW())`, [checkoutId] ); await chargeCard(customerId, amount, { idempotencyKey: checkoutId }); // Most payment APIs (Stripe, etc.) accept an idempotency key directly }); await broker.ack(msg); console.log(`Charged customer ${customerId} for checkout ${checkoutId}`); } broker.subscribe("checkout.completed", handleCheckoutCompleted);

Lesson: At-least-once delivery is a feature, not a bug. The broker is doing the right thing by redelivering unacked messages — it's protecting you from silent data loss. The contract is: you must make your handler idempotent. Use a stable message ID (the broker provides one, or embed one in your event schema) as an idempotency key. Check before acting, or use a transaction that atomically records "processed" and performs the side effect. Most payment APIs (Stripe, Adyen) support idempotency keys natively for exactly this reason.

How to Spot: Duplicate record creation, double charges, duplicate emails, or duplicate database rows that share the same business key are all symptoms of a non-idempotent handler meeting at-least-once delivery. Monitor for event processing counts per message ID. If you see any message ID with a processing count greater than 1, your handler is not idempotent. Add an idempotency table to your schema as standard practice in any system that consumes from a pub/sub broker.

Incident Summary: A team builds a notification system where each user gets their own personal topic: user.notifications.{userId}. With 2 million users, the broker is managing 2 million topics. Each topic has metadata (name, subscriber list, configuration) stored in memory on the broker. The broker's metadata memory climbs past its heap limit. It starts failing to create new topics. Then existing topic lookups begin timing out. Then the broker crashes under the metadata load.

What Went Wrong

Topics are not free. Every topic is a named entry in the broker's metadata store, and that metadata lives in memory for fast routing. Each entry includes the topic name, the subscriber list, configuration (retention, partition count, etc.), and offset tracking information. Multiply that by millions of topics and you've turned your broker into an in-memory metadata store that's growing without bound.

The underlying design mistake is treating pub/sub as a mailbox system. Pub/sub is designed for channels where messages of a particular type flow — all order events, all payment events. It's not designed for per-entity personal inboxes. That's what a push notification service or a dedicated inbox system is for.

# notification_publisher.py — WRONG: per-user topics class NotificationService: def send_notification(self, user_id: int, message: str): # BUG: creates a unique topic for EVERY user # With 2M users, the broker will have 2M topics in its metadata store topic_name = f"user.notifications.{user_id}" # Broker creates this topic if it doesn't exist yet # Each new user registration → +1 topic permanently in broker metadata broker.publish(topic_name, { "userId": user_id, "message": message, "timestamp": time.time() }) def subscribe_for_user(self, user_id: int, handler): # Each WebSocket connection creates a new topic subscription # Every subscription is metadata on the broker topic_name = f"user.notifications.{user_id}" broker.subscribe(topic_name, handler)

# notification_publisher.py — CORRECT: shared topic with routing in message class NotificationService: TOPIC = "user.notifications" # ONE topic, forever, no matter how many users def send_notification(self, user_id: int, message: str): # Publish to the shared topic with userId as a message attribute # Broker routes to the right subscriber group; workers filter by userId broker.publish(self.TOPIC, { "userId": user_id, "message": message, "timestamp": time.time() }, attributes={"userId": str(user_id)}) # Some brokers (GCP Pub/Sub, SNS) support server-side filtering # on message attributes — so workers only receive their subset def subscribe_worker(self, handler): # ONE subscription shared across all user IDs # Worker pool reads from this single topic and filters in application code broker.subscribe(self.TOPIC, handler) # notification_worker.py def handle_notification(msg): # If your broker doesn't support server-side attribute filtering, # filter here in the application layer — cheap, O(1) check target_user_id = int(msg["userId"]) # Look up WebSocket connection for this user (from connection registry) ws = connection_registry.get(target_user_id) if ws and ws.is_connected(): ws.send(msg["message"]) # If no active connection: drop or store in inbox DB for later retrieval

Lesson: Topics represent event types, not entity identities. A topic called user.notifications is correct — it describes what kind of event it carries. A topic called user.notifications.5001 is wrong — it conflates a channel name with a business entity ID. The routing logic (who gets which message) belongs in message attributes, subscriber filters, or application-level code — not in topic names. As a rule of thumb: if your topic name contains a user ID, order ID, or any runtime-generated key, you have a topic explosion waiting to happen.

How to Spot: Monitor total topic count on your broker over time. If the count is growing proportionally to your user base or record count, you have the per-entity topic anti-pattern. In Kafka, watch for ZooKeeper or KRaft metadata memory growth correlated with topic creation rate. In RabbitMQ, watch queue count in the management UI. Alert when total topic/queue count grows faster than a constant baseline — it should be roughly stable after your application is deployed, not growing with data volume.

Three bugs that show up repeatedly in production pub/sub systems. Bug 1: slow subscribers cause broker memory pressure that cascades to unrelated topics — fix with concurrent consumers, early ack, and consumer lag monitoring. Bug 2: at-least-once delivery meets non-idempotent handlers and creates duplicate side effects — fix by using the message ID as an idempotency key and recording processed events before (or atomically with) the side effect. Bug 3: per-entity topic naming causes broker metadata explosion — fix by using one topic per event type and routing by message attributes or application-level filtering.

Pitfalls & Anti-Patterns

Five mistakes that show up in real pub/sub incidents — what broke, why it broke, and how to fix it before it breaks yours.

Most pub/sub bugs aren't subtle. They're the same five mistakes, repeated across teams, discovered at 2am after a production incident. Each one has a clear cause, a clear consequence, and a clear fix. Learn them here so you don't have to learn them on-call.

The mistake: When building a notification system, it feels natural to give each user their own topic — notifications.user.4821, notifications.user.4822, and so on. Publishers address users directly. Nice and tidy, right?

Why it's bad: Brokers aren't free to host topics. Every topic carries metadata overhead — in Kafka, a topic with 3 partitions and 3 replicas means 9 log segments on disk, 9 entries in the metadata log, 9 file handles per broker. Multiply by a million users and you've given the controller millions of metadata entries to manage. Cluster startup, leader election, and rebalances all slow to a crawl. Brokers OOM. This is called topic proliferation, and it's the fastest way to kill a Kafka cluster.

Fix: Use a single shared topic (e.g., notifications) and embed the user ID as a field in the message payload. Subscribers filter by user ID on their side, or you shard by a small number of partitions keyed on user ID hash. One topic replaces a million. Metadata overhead drops from catastrophic to negligible.

// BAD — dynamic topic per user, millions of topics created at runtime async function sendNotification(userId, message) { const topicName = `notifications.user.${userId}`; // 💀 1 topic per user await broker.createTopicIfNotExists(topicName); // broker metadata grows unbounded await broker.publish(topicName, message); }

// GOOD — single topic, userId lives in the message payload async function sendNotification(userId, message) { await broker.publish("notifications", { // one stable topic userId, // recipient is a field, not a topic name ...message }); } // Subscriber filters on userId — no topic explosion broker.subscribe("notifications", (msg) => { if (msg.userId === currentUserId) handleNotification(msg); });

The mistake: Using a high-cardinality but heavily skewed key for partitioning — for example, routing all messages from your most popular creator to the same partition in a social media feed. Or routing by country when 80% of your users are in one country.

Why it's bad: Partitions are the unit of parallelism in log-based brokers. If one partition receives ten times the throughput of the others, the consumer assigned to that partition is overwhelmed while every other consumer sits mostly idle. You've wasted your horizontal scaling investment. Worse, a lagging hot partition backs up the entire topic if consumers checkpoint offsets linearly — downstream services see growing lag that no amount of consumer scaling can fix, because you can't add more consumers than partitions.

Fix: Add a random salt suffix to the partition key (e.g., creatorId + "-" + random(0,4)) to spread traffic across partitions. If ordering per entity matters, use a scatter-gather consumer that reads from all partitions and re-merges by sequence number. Alternatively, split the hot entity into its own dedicated topic with a higher partition count.

// BAD — key is highly skewed (big creator gets millions of events) broker.publish("feed-events", event, { partitionKey: event.creatorId // 💀 celebrity creatorId → always same partition });

// GOOD — salt the key to distribute load across partitions const salt = Math.floor(Math.random() * 4); // spread across 4 "virtual shards" broker.publish("feed-events", event, { partitionKey: `${event.creatorId}-${salt}` // hot key split across multiple partitions }); // Consumers fan-in and re-order by event.sequenceNumber if strict order needed

The mistake: Using pub/sub where you actually need a direct response — for example, using a message topic to send a "validate payment" request and a reply topic to get the answer back, building your own request-reply correlation over a broker that was never designed for it.

Why it's bad: Pub/sub is asynchronous by design. The publisher fires a message and moves on — it does not block waiting for a reply. If you bolt synchronous semantics on top (matching reply messages by a correlation ID, setting timeouts, handling "no reply received"), you've rebuilt the worst of both worlds: you have the latency overhead of a broker hop plus the blocking wait of a synchronous call, but with none of the simplicity of direct HTTP/gRPC. The correlation ID approach also creates subtle bugs: replies can arrive out of order, messages can be lost with no caller-side error, and every caller has to manage its own timeout/retry loop.

Fix: Use direct HTTP or gRPC for anything that needs a synchronous response. Pub/sub is right for events that trigger independent reactions — not for "ask a question and wait for the answer." If you genuinely need async request-reply at scale, look at RPC frameworks with async support (gRPC streaming, NATS request-reply which has built-in support) rather than cobbling it together on top of a pub/sub broker.

// BAD — using pub/sub for synchronous-style request-reply async function validatePayment(orderId) { const correlationId = uuid(); await broker.publish("payment.validate.request", { orderId, correlationId }); // now block-wait for a reply on "payment.validate.reply"... for up to 5s return waitForReply("payment.validate.reply", correlationId, 5000); // fragile, DIY }

// GOOD — direct HTTP/gRPC call for synchronous request-reply async function validatePayment(orderId) { // One hop. Timeout built into the HTTP client. Error surfaced immediately. const response = await fetch(`${PAYMENT_SERVICE_URL}/validate`, { method: "POST", body: JSON.stringify({ orderId }), signal: AbortSignal.timeout(3000) // 3s timeout, no DIY correlation needed }); return response.json(); }

The mistake: A subscriber processes messages one by one. A malformed message arrives — maybe a required field is null, maybe the schema changed, maybe an external API it calls is down. The subscriber throws an exception, the broker redelivers the message, the subscriber throws again, the broker redelivers again. Loop. The subscriber processes nothing else while it spins on this one poison message.

Why it's bad: This is called a poison message attack — usually accidental. The message is valid enough to be accepted by the broker but invalid enough to crash your processing logic every time. Without a dead-letter queue (DLQ), the only options are: wait forever (consumer stuck), manually delete the message (requires operator intervention at 3am), or silently drop it (data loss). The entire subscription is blocked. Healthy messages queue up behind the one bad one. Consumer lag grows. SLAs miss.

Fix: Configure a dead-letter queue (DLQ) on every subscription that processes user data. Set a maximum retry count (typically 3–5). After the last retry fails, the broker parks the message in the DLQ instead of redelivering it. Your main consumer unblocks immediately. The poison message sits in the DLQ for human inspection or automated alerting. Most managed brokers (AWS SQS, Google Pub/Sub, Azure Service Bus) support DLQs natively — it's a single configuration line, not a system redesign.

// BAD — no DLQ, no retry limit; one bad message stalls the subscriber forever broker.subscribe("order.placed", async (msg) => { await processOrder(msg); // if this throws, broker redelivers forever → infinite loop });

// GOOD — configure DLQ + max retries at the subscription level const subscription = broker.subscribe("order.placed", { maxRetries: 3, // give up after 3 failures deadLetterTopic: "order.placed.dlq", // park poison messages here retryBackoff: "exponential" // wait longer between each retry }, async (msg) => { await processOrder(msg); // on repeated failure → auto-moved to DLQ, consumer unblocks }); // Separately, monitor the DLQ and alert when messages arrive broker.subscribe("order.placed.dlq", (msg) => { alertOncall(`Poison message in order.placed.dlq: ${msg.id}`); });

The mistake: A subscriber catches a transient error (a downstream service is briefly slow, a database connection times out) and immediately retries — zero delay. Processing fails again. Retry again. And again. Dozens of subscriber instances doing this simultaneously flood the struggling downstream service with a wall of retries at the exact moment it most needs breathing room to recover.

Why it's bad: This is called a retry storm. The downstream service was about to recover — maybe it just needed 2 seconds to finish a GC pause or reconnect a pool — but the storm of retries prevents it from recovering at all, because every retry arrives before it can finish handling the last one. A transient 2-second blip turns into a cascading failure that lasts minutes. Every retry also consumes the broker's throughput budget, starving normal messages during the storm.

Fix: Always retry with exponential backoff plus jitter. Exponential backoff means each retry waits twice as long as the previous one (1s, 2s, 4s, 8s…). Jitter means adding a small random offset to each wait time so that all subscriber instances don't retry at the exact same millisecond. Together they transform a thundering herd into a gentle trickle, giving the downstream service the breathing room it needs to recover.

// BAD — zero-delay retry hammers a struggling service broker.subscribe("email.send", async (msg) => { try { await emailProvider.send(msg); } catch (err) { // Instant retry — if emailProvider is stressed, this makes it worse await emailProvider.send(msg); // 💀 thundering herd } });

// GOOD — exponential backoff + jitter function backoffMs(attempt) { const base = Math.min(1000 * Math.pow(2, attempt), 30000); // cap at 30s const jitter = Math.random() * 500; // ±500ms jitter return base + jitter; } broker.subscribe("email.send", async (msg) => { for (let attempt = 0; attempt < 4; attempt++) { try { await emailProvider.send(msg); return; // success — done } catch (err) { if (attempt < 3) { await sleep(backoffMs(attempt)); // wait before retry: 1s, 2s, 4s } else { throw err; // exhaust retries → DLQ handles it } } } });

Creating one topic per user (or per entity) causes broker metadata explosion — use a shared topic with user ID in the payload instead.
Hot partitions form when partition keys are skewed — salt the key to spread load, or dedicate a separate topic to high-traffic entities.
Pub/sub is fire-and-forget, not request-reply — use HTTP/gRPC directly when you need a synchronous response.
Without a dead-letter queue, one poison message blocks a subscriber forever — configure max retries and a DLQ on every subscription that touches production data.
Instant retries create retry storms that prevent recovery — always use exponential backoff with jitter.

Testing Pub/Sub Systems

How to verify publishers, subscribers, routing, and fault tolerance — without needing a live broker in every test.

Testing a pub/sub system is different from testing a plain HTTP service. There's a broker in the middle, delivery is asynchronous, and failures are probabilistic. The trick is to test each layer in isolation — mock the broker for unit tests, use a real (but local) broker for integration tests, and inject faults deliberately to verify that your resilience logic actually works.

The Four Testing Layers

Layer 1 — Unit-Testing Publishers (Mock the Broker)

A publisher's job is simple: call broker.publish(topicName, payload) with the right arguments. You don't need a real broker to test this. Inject a mock broker, trigger the publishing action, and assert that publish was called with the expected topic and a correctly shaped payload. Fast, deterministic, no infrastructure needed.

Layer 2 — Unit-Testing Subscribers (Replay a Fixture)

A subscriber's job is also simple: receive a message and do something with it (write to a DB, send an email, update state). You don't need a live broker for this either. Just call the handler function directly with a hand-crafted fixture message. Test the happy path, test malformed input (it should fail gracefully, not blow up the process), and test idempotency by calling the handler twice with the same message ID — the side effect should happen only once.

Layer 3 — Integration-Testing Topic Routing

Once individual units are covered, spin up a real (but ephemeral) broker in your CI pipeline. Testcontainers makes this easy — it starts a Docker container for Kafka, RabbitMQ, or whatever broker you use, runs your test, and tears it down. Publish a real message, subscribe, and verify end-to-end that routing, schema serialization, and subscription filters all work together correctly.

Layer 4 — Fault Injection

The most important tests — and the most skipped. Kill the broker mid-test and verify that your subscriber retries and the DLQ triggers after max retries. Replay the same message twice and verify your handler is idempotent. Publish a deliberately malformed message and verify it ends up in the DLQ without blocking healthy messages. These tests are the only way to prove your resilience config actually works before production proves it the hard way.

The idempotency test is non-negotiable. Pub/sub brokers guarantee at-least-once delivery — your subscriber will receive the same message more than once during broker restarts or consumer rebalances. If your handler isn't idempotent (safe to call multiple times with the same message), you will produce duplicate side effects in production. Test it explicitly: call the handler with the same messageId twice and assert the side effect happened exactly once.

Sample Test: Publisher + Subscriber Unit Tests

// Layer 1: publisher unit test — mock the broker, no infrastructure needed import { OrderService } from "./order-service.js"; import { MockBroker } from "./test-helpers/mock-broker.js"; describe("OrderService.placeOrder", () => { it("publishes order.placed with correct payload", async () => { const mockBroker = new MockBroker(); const service = new OrderService(mockBroker); await service.placeOrder({ orderId: "ord-99", amount: 42.00 }); // Assert broker.publish was called with the right topic and payload shape expect(mockBroker.published).toHaveLength(1); expect(mockBroker.published[0].topic).toBe("order.placed"); expect(mockBroker.published[0].payload.orderId).toBe("ord-99"); expect(mockBroker.published[0].payload.amount).toBe(42.00); }); });

// Layer 2: subscriber unit test — inject fixture message, no broker needed import { handleOrderPlaced } from "./order-handler.js"; import { db } from "./db.js"; // real or mock DB describe("handleOrderPlaced", () => { it("creates fulfillment record on valid message", async () => { const fixture = { messageId: "msg-1", orderId: "ord-99", amount: 42.00 }; await handleOrderPlaced(fixture); const fulfillment = await db.find("fulfillments", { orderId: "ord-99" }); expect(fulfillment).not.toBeNull(); }); it("is idempotent — reprocessing the same messageId has no double-effect", async () => { const fixture = { messageId: "msg-1", orderId: "ord-99", amount: 42.00 }; await handleOrderPlaced(fixture); // first delivery await handleOrderPlaced(fixture); // redelivery (at-least-once semantics) const count = await db.count("fulfillments", { orderId: "ord-99" }); expect(count).toBe(1); // must be exactly 1, not 2 }); });

// Layer 3: full integration test with a real Kafka (via Testcontainers) import { KafkaContainer } from "@testcontainers/kafka"; import { Kafka } from "kafkajs"; let kafkaContainer; let kafka; beforeAll(async () => { kafkaContainer = await new KafkaContainer("confluentinc/cp-kafka:7.5.0").start(); kafka = new Kafka({ brokers: [kafkaContainer.getBootstrapServers()] }); }, 60_000); // allow 60s for Docker pull afterAll(() => kafkaContainer.stop()); it("message published to order.placed reaches email subscriber", async () => { const producer = kafka.producer(); const consumer = kafka.consumer({ groupId: "test-email" }); await producer.connect(); await consumer.connect(); await consumer.subscribe({ topic: "order.placed", fromBeginning: true }); const received = []; consumer.run({ eachMessage: async ({ message }) => { received.push(JSON.parse(message.value.toString())); }}); await producer.send({ topic: "order.placed", messages: [{ value: JSON.stringify({ orderId: "ord-99", amount: 42.00 }) }] }); // Wait up to 5s for the consumer to receive the message await waitFor(() => received.length === 1, 5000); expect(received[0].orderId).toBe("ord-99"); });

Test publishers by injecting a mock broker and asserting the right topic and payload shape — no infrastructure needed.
Test subscribers by calling the handler directly with a fixture message — verify the happy path, malformed input, and idempotency explicitly.
Integration-test topic routing using a real ephemeral broker via Testcontainers in CI.
Fault injection tests (kill broker, replay message, send poison msg) are the only way to prove resilience config works before production does.
Idempotency is mandatory to test because brokers deliver at-least-once — expect duplicates.

Observability

What to monitor on a pub/sub system in production — the five signals that tell you everything, and the dashboard that shows them at a glance.

When a pub/sub system misbehaves in production, it rarely announces itself loudly. It's usually a slow creep: consumer lag grows by 1,000 messages a minute, fan-out latency drifts from 50ms to 800ms, and by the time users notice something is wrong, the queue depth is two hours behind. Good observability means you catch these drifts in minutes, not after user complaints arrive.

The Five Production Signals

What Each Signal Tells You

① Publish rate — messages published per second. A sudden drop means publishers are crashing or blocking on the broker. A sudden spike means a runaway publisher or a replay event. Baseline this metric so anomalies are obvious.

② Fan-out latency — time from when a publisher sends a message to when a subscriber first receives it. In a healthy system this is low (single-digit milliseconds for in-memory brokers, tens of milliseconds for durable log-based ones). A rising p99 means the broker is under load, the network is congested, or a subscriber is holding acknowledgements too long.

③ Consumer lag — how many messages are in the topic ahead of what the subscriber has processed. This is the most important signal. Zero lag means the subscriber is keeping up in real time. Rising lag means the subscriber is slower than the publisher — scale the consumer, fix a slow handler, or check for a poison message causing retries. If lag is growing faster than you can scale, it's time to investigate the handler logic, not just add more pods.

④ Broker queue depth — total messages retained in the broker across all topics. A slowly growing depth is normal (brokers retain messages for replay). A rapidly growing depth with stable publish rate but growing consumer lag confirms the consumer is the bottleneck. A depth that stays constant while lag grows means the broker is deleting (expiring) messages before the consumer can read them — a retention period problem.

⑤ Dead-letter rate — percentage of messages that exhaust retries and land in the DLQ. Near-zero is healthy. Any non-zero spike deserves investigation: what changed? A schema change? A downstream dependency going down? A new code deployment that introduced a regression? The DLQ rate is your early warning system for systematic subscriber failures.

The 4 Golden Signals for Pub/Sub (adapted from SRE)

Throughput — publish rate (messages/sec). Is the broker receiving traffic as expected?
Latency — fan-out latency p99 (ms). How long does it take a message to reach subscribers?
Consumer Lag — messages behind. Is the system processing events in near-real-time or falling behind?
Error Rate — DLQ rate (% of messages). Are subscribers failing systematically?

If these four are green, your pub/sub system is healthy. If any one of them degrades, you have a direction for diagnosis. Queue depth is a useful fifth signal for capacity planning, not for alerting.

Monitor five signals: publish rate (throughput), fan-out latency (delivery speed), consumer lag (backlog), broker queue depth (capacity), and dead-letter rate (error rate).
Consumer lag is the most critical signal — rising lag is always a symptom of something wrong and deserves an immediate alert.
DLQ rate is your early warning system for systematic handler failures — any non-zero spike warrants investigation.
The 4 golden signals for pub/sub are throughput, latency, consumer lag, and error rate. Keep these green and everything else follows.

Capacity Planning — How Much Work Is the Broker Actually Doing?

Fan-out factor, partition strategy, and the worked math behind sizing a pub/sub cluster for real traffic.

When you put a pub/sub broker in a system design interview — or in production — the interviewer's next question is usually "how does it scale?" The answer lives in one formula: broker work = publish rate × fan-out factor. Once you understand that formula, you can size topics, plan partitions, and predict where the bottleneck will be before it bites you.

The Fan-Out Formula

A broker doesn't just pass messages through — it multiplies them. When 1 publisher sends 1 message to a topic with 10 subscribers, the broker must deliver 10 copies. That amplification is called fan-out. The total work the broker does is:

Broker write throughput = publishers × publish rate (msg/s) × average message size
Broker read (fan-out) throughput = broker write throughput × subscriber count per topic
Total broker I/O = write throughput + read throughput

Why does this matter? Because in a classic pub/sub system (RabbitMQ, Google Pub/Sub) the broker carries all of this I/O on its own bus. In a log-based system (Kafka), the broker stores the log once and subscribers read independently — so write I/O stays flat while reads scale with subscriber count, but from local disk rather than multiplied RAM copies. Knowing this difference shapes your partitioning strategy completely.

Worked Example — The Full Math

Let's make it concrete. Suppose you're designing a pub/sub layer for a gaming leaderboard update system:

Publishers: 10,000 game servers, each publishing score updates at 100 msg/s
Topic: score.update
Subscribers: 50 (one leaderboard shard per region, plus analytics, plus fraud detection)
Message size: 512 bytes

Metric	Calculation	Result
Total publish rate	10,000 × 100 msg/s	1,000,000 msg/s
Inbound broker bandwidth	1,000,000 × 512 B	~512 MB/s write
Fan-out deliveries/s	1,000,000 × 50 subscribers	50,000,000 deliveries/s
Outbound broker bandwidth	50,000,000 × 512 B	~25 GB/s read
Minimum partitions (50 MB/s per partition)	512 MB/s ÷ 50 MB/s	11 partitions (round up to 16 or 32)

The 25 GB/s number is the alarm bell. No single broker handles 25 GB/s of outbound I/O. This is where your architecture changes: move from a traditional broker (which copies messages to each subscriber) to a log-based system like Kafka, where the broker writes once and subscribers read from a shared log independently. The outbound I/O problem becomes 50 independent sequential reads, not 50 copies.

Partition Strategy in Plain English

A partition is just a slice of a topic — a separate ordered log that a different broker (or different CPU thread) can handle independently. Splitting a topic into more partitions does two things: it spreads the write load across multiple brokers, and it lets multiple subscribers read in parallel (one subscriber thread per partition). This is why Kafka's throughput scales nearly linearly with partition count, up to the hardware limit of your cluster.

The catch: once you set partition count, you usually can't decrease it. More important — if you're routing messages by key (e.g., all events for user ID 42 go to the same partition to preserve order), changing partition count re-hashes keys to different partitions, breaking per-key ordering. Decide on partition count early and over-provision by 2× to leave room for growth without a reshuffle.

Quick sizing rule: Start with partitions = max(desired_parallelism × 2, throughput_MB_s ÷ 50). Most workloads are parallelism-bound, not throughput-bound. If you expect 20 consumer instances at peak, start with 40 partitions — not 20 — so you can add 20 more consumers later without restructuring the topic.

Broker work = publish rate × fan-out factor × message size. Fan-out is the multiplier that kills single-broker deployments — at high subscriber counts, move to log-based brokers (Kafka) where each subscriber reads independently rather than receiving N copies. Partition count sets the ceiling on consumer parallelism; over-provision by 2× at topic creation time because you cannot decrease partition count later without re-hashing keys and breaking per-key ordering.

Interview Q&A — Eight Questions That Actually Get Asked

Not just definitions — the WHY chains and trade-off reasoning interviewers are actually testing for.

These are the questions that come up in every system design interview once you mention pub/sub or an event-driven architecture. Knowing the textbook answer isn't enough — interviewers want to hear the reasoning chain that produced the answer. That's what each response below models.

Easy Q1 — What's the difference between pub/sub and a message queue?

Think first: What does each one guarantee about who receives a message and how many times?

A message queue is a pipe between exactly one producer and one consumer group — each message is consumed by one worker, then deleted. Think of it like a work queue: ten jobs go in, ten workers each grab one job. Once a job is taken, it's gone. The model exists to distribute load: you want each item processed exactly once by whichever worker is free.

Pub/sub is different in one fundamental way: every subscriber gets a copy. When you publish to a topic, all subscribers receive the same message independently. Nobody competes. It's not a work queue — it's a broadcast. The model exists to decouple producers from consumers: the publisher doesn't need to know who cares about the event.

When to use which: Use a queue when you need load distribution — many workers sharing the effort of processing a stream of tasks. Use pub/sub when you need event fan-out — one event triggering multiple independent reactions. Many systems use both: SNS (pub/sub) fans an event out to multiple SQS queues (work queues), where workers pick up tasks in parallel.

Medium Q2 — When would you NOT use pub/sub?

Think first: What does pub/sub fundamentally give up that a direct call provides?

Pub/sub gives up two things: synchronous response and simple observability. That immediately rules it out for several patterns:

Request/reply flows — If Service A needs an answer from Service B before it can proceed (e.g., "check if this user's credit limit allows this purchase"), pub/sub adds unnecessary complexity. A direct HTTP or gRPC call is the right tool. Layering a reply-to-topic pub/sub handshake on top just recreates synchronous communication with more moving parts.
Strict sequential processing with a single consumer — If you need guaranteed in-order processing with exactly-one consumer, a simple queue or a database-backed task table is often simpler and cheaper than a partitioned pub/sub system with careful key routing.
Very low volume, simple integrations — If two services communicate at 1 msg/minute and neither expects to add more consumers, the operational overhead of running a broker outweighs the decoupling benefit. A direct database row or a simple webhook works fine.
When you need exactly-once processing and can't tolerate deduplication logic — At-least-once delivery means duplicate handling is your problem. If your downstream consumer is not idempotent and the cost of implementing idempotency is high, consider whether a transactional outbox or a two-phase write model fits better.

The heuristic: If the caller needs a result before it can continue, use synchronous RPC. If the caller just needs to record that something happened and move on, pub/sub is usually right.

Medium Q3 — How does at-least-once delivery actually work?

Think first: What's the moment of danger — when can a message be lost, and how does the system recover?

The danger moment is the gap between the broker delivering a message and the subscriber acknowledging it. Here's the exact flow:

Broker delivers message M to subscriber S.
Broker starts a timer (the ack deadlineThe maximum time a subscriber has to acknowledge a message before the broker considers delivery failed and re-delivers. In Google Cloud Pub/Sub this is configurable, default 10 seconds. In Kafka it's controlled by max.poll.interval.ms for consumer groups.).
Subscriber processes M and sends ACK.
Broker marks M as delivered and moves on.

The "at-least-once" case: if step 3 never happens (subscriber crashes mid-processing, or the network drops the ACK), the timer expires and the broker re-delivers M. The subscriber might have processed M just fine — the ACK just never arrived. So M gets processed a second time. The message was delivered at least once, possibly more.

The implication: At-least-once delivery puts a contract on the consumer — your processing logic must be idempotent (processing the same message twice produces the same result as processing it once). Common patterns: check-and-skip by message ID before processing, use upserts instead of inserts in the database, or rely on natural idempotency (setting a flag is idempotent; incrementing a counter is not).

Medium Q4 — How do you handle a slow subscriber?

Think first: If one subscriber is slow, what happens to the broker's buffer and to the other subscribers?

A slow subscriber is one of the most common production pub/sub problems, and the right answer depends on the broker model.

In a traditional broker (RabbitMQ, GCP Pub/Sub): The broker buffers unacknowledged messages. If a subscriber is chronically slow, the buffer grows until it hits memory/storage limits. The broker may start dropping messages or exerting backpressure on publishers. The right fix is usually to scale out the subscriber — add more instances to increase processing throughput. If the slowness is spikey (not chronic), increasing the ack deadline buys time for processing without triggering re-delivery.

In Kafka: The broker doesn't care how fast subscribers are — each consumer group maintains its own offset independently. A slow consumer group just accumulates lag (distance between its current offset and the latest offset). The broker keeps the data until the retention period expires. Fix options: add consumer instances (up to the partition count), reduce message processing time, or increase max.poll.records to batch more work per poll. Monitor lag — a growing lag that never shrinks means you need more consumers or faster processing, not more time.

Hard Q5 — Pub/sub vs event sourcing — what's the difference?

Think first: One is a communication pattern; the other is a storage pattern. Do they overlap?

Pub/sub is about delivering messages to interested parties. The broker routes events from publishers to subscribers. The events are ephemeral — once delivered and acknowledged, they're gone (in most brokers). The model answers the question "how do services notify each other?"

Event sourcing is about storing state as a sequence of events. Instead of saving the current state of an entity ("Order #42 is SHIPPED"), you save every state change as an event ("OrderCreated", "OrderPaid", "OrderShipped"). The current state is always derivable by replaying those events from the beginning. The model answers the question "how should I persist state in my application?"

They overlap when you use Kafka: Kafka's log-based storage makes it usable as both an event sourcing store (events are retained indefinitely, replayable) and a pub/sub backbone (consumer groups subscribe to topics and receive events). But the concepts are independent — you can have pub/sub without event sourcing (SNS/SQS with fire-and-forget events) and event sourcing without pub/sub (a single-service event store with no broadcast).

The key difference in one sentence: Pub/sub is about routing; event sourcing is about persistence and history. Use both together when you want durable, replayable, broadly distributed events — which is why Kafka + event sourcing is a common pairing in CQRS architectures.

Hard Q6 — How do you scale a pub/sub broker when one topic is a hotspot?

Think first: What are the dimensions you can scale independently — and which ones are coupled?

The three levers work together but have different constraints. Partition count is the hard ceiling on consumer parallelism — you can't parallelize beyond the number of partitions regardless of how many consumer instances you add. Broker node count determines how many partition leaders can be distributed across the cluster — if all hot partitions live on one broker, adding more partitions doesn't help until you rebalance leaders. Consumer instances are the easiest to scale (just add more pods/workers) but are bounded by partition count.

The typical sequence for a hotspot: (1) Check if the topic is key-skewed — if 90% of messages share one key, all go to one partition. Fix by changing the partition key or adding a salt. (2) Increase partition count (accepting that key-to-partition mapping changes). (3) Add broker nodes and trigger a preferred replica election to redistribute leaders. (4) Scale consumers to match the new partition count.

Medium Q7 — What's a dead-letter queue and when do you need one?

Think first: What happens to a message that can never be successfully processed? Where should it go?

A dead-letter queue (DLQ) — sometimes called a dead-letter topic — is a special destination for messages that have failed delivery or processing too many times. Instead of retrying forever (burning CPU and blocking the topic) or silently dropping the message (losing data), the broker routes poison messages to the DLQ for human inspection or automated recovery.

When a message ends up in the DLQ: (1) The subscriber returned a negative acknowledgement (NACK) or failed to ack within the deadline N times. (2) The message exceeded a maximum retry count you configured. (3) The message is malformed and can't be deserialized at all.

When you need one: Any time your subscriber can plausibly receive a message it can't process — because of schema changes, downstream service outages, or data it doesn't expect. Without a DLQ, you have two bad choices: retry forever (rebalance storm, infinite loops, blocked consumers) or drop messages silently. The DLQ is the third option: park the broken message safely, alert an operator, and let the rest of the topic continue flowing. A DLQ is not optional in any production pub/sub system — it's hygiene.

What to do with DLQ messages: Monitor DLQ depth as an alert. Investigate the root cause, fix the subscriber or the message, and replay from the DLQ back to the original topic once the fix is deployed.

Hard Q8 — How do you guarantee ordered delivery in pub/sub?

Think first: Order within what scope? Globally? Per-entity? What breaks ordering and what preserves it?

Pub/sub systems don't guarantee global ordering across all messages — that would require a single serial bottleneck and kill throughput. What they do guarantee, with the right configuration, is per-key ordering: all messages with the same partition key arrive at the same partition in the order they were published. Within one partition, one consumer reads sequentially.

The three rules that preserve ordering: (1) use a meaningful partition key (e.g., entity ID), (2) assign exactly one consumer instance per partition, (3) process messages synchronously within each consumer — don't parallelize processing inside a single consumer using threads, because that re-introduces race conditions the partition eliminated. If you need global ordering across all messages, you're essentially serializing everything — use a single partition, accept the throughput limit, and consider whether you actually need global order or just per-entity order.

The eight questions cover the full interview spectrum: queue vs. pub/sub (fan-out vs. load distribution), when not to use pub/sub (synchronous reply, very low volume), at-least-once delivery mechanics (ack deadline + idempotency requirement), slow subscriber handling (lag monitoring, partition scaling, backpressure), pub/sub vs. event sourcing (routing vs. persistence), broker scaling (partitions → brokers → consumers, key skew detection), dead-letter queues (poison message parking, not optional in production), and ordered delivery (per-key via partition key, not global).

Practice Exercises — Build Your Pub/Sub Intuition

Four exercises that force you to apply the concepts under constraints — the same way interviews and on-call incidents do.

Reading about pub/sub builds vocabulary. Working through design problems builds intuition. Each exercise below has a constraint that forces a specific decision. Try to reason through it before expanding the hint or solution — the reasoning path matters more than the final answer.

You're designing a pub/sub system for a social feed. Users post status updates. Every post is published to a status.posted topic. The system has:

2,000 active publishers (users posting), each averaging 1 post per second at peak
Average post message size: 2 KB
Subscribers to the topic: 4 (notifications service, feed-builder service, analytics service, spam-detection service)
Your single broker node can sustain 200 MB/s of total I/O

Questions: (a) What is the broker's inbound bandwidth at peak? (b) What is the outbound bandwidth due to fan-out? (c) Is the single broker sufficient? (d) What's the most impactful change to reduce broker load if it's not?

Calculate inbound first (publishers × rate × size), then multiply by subscriber count for outbound. Total I/O = inbound + outbound.

Solution:

(a) Inbound: 2,000 publishers × 1 msg/s × 2 KB = 4,000 KB/s = ~4 MB/s inbound. That's tiny.

(b) Outbound (fan-out): 4 MB/s × 4 subscribers = 16 MB/s outbound.

(c) Total broker I/O: 4 + 16 = 20 MB/s. The 200 MB/s single broker is fine — you're at 10% utilization. No scaling needed at this load.

(d) If load grows 20× (40,000 publishers), total I/O becomes 400 MB/s — over the limit. The most impactful change is moving to a log-based broker (Kafka) where subscribers read from a shared log rather than receiving pushed copies. Inbound stays at 80 MB/s; outbound becomes 4 independent sequential reads (~80 MB/s each), but from local disk — not multiplied in-memory copies. Alternatively, reduce fan-out by routing some subscribers to read from a downstream database updated by a single consumer, rather than all four consuming from the topic directly.

Your company runs two pub/sub topics:

payment.completed — triggers refund eligibility checks, receipt emails, and fraud analysis. Losing a message means a customer is never notified, a refund is missed, or fraud is never caught.
ui.button.clicked — raw analytics events from the web UI. Used to build usage heatmaps. The team has explicitly said losing 0.5% of events is acceptable for cost reasons. Volume: 100,000 msg/s.

For each topic, choose: delivery guarantee (at-most-once / at-least-once / exactly-once), and justify the trade-off. Also — for payment.completed, what contract does your subscriber need to uphold regardless of which guarantee you choose?

At-most-once = fast, can lose messages. At-least-once = safe but may duplicate. Exactly-once = safest, highest cost. Match the business cost of loss vs. duplicate to the mechanism.

Solution:

payment.completed → At-least-once delivery. Losing any payment event has real financial and legal consequences. At-least-once guarantees the message will be delivered even if there's a crash or network blip, at the cost of possible duplicates. Exactly-once is theoretically ideal but adds significant broker and consumer complexity — most production systems achieve the practical equivalent by using at-least-once + idempotent consumers. The cost of building idempotency is far lower than the cost of lost payment events.

Subscriber contract: The subscriber MUST be idempotent — if it receives the same payment.completed event twice, it must not send two receipt emails or apply two refund eligibility checks. Common pattern: track processed event IDs in a database; on receive, check "have I seen this ID?" before processing.

ui.button.clicked → At-most-once delivery. The team explicitly accepts 0.5% loss. At-most-once is the cheapest option: the broker fires the message and doesn't retry on failure. This avoids ack overhead, retry queues, and deduplication logic — the savings compound at 100,000 msg/s. Using at-least-once here would add meaningful cost (ack round-trips, retry storage) for a benefit the team has decided isn't worth paying for.

You own a Kafka consumer group called inventory-updater that reads from the order.placed topic (32 partitions). The topic receives 50,000 msg/s at peak. Your consumer group has 32 instances running. You check the consumer group lag dashboard Monday morning and find:

Total lag: 2,400,000 messages and climbing
All 32 partitions have roughly equal lag (~75,000 each)
No consumer instances are idle — all 32 are assigned a partition
No errors in consumer logs — messages are processing without exceptions

What is most likely happening? List at least three hypotheses and the command or metric you'd check to confirm each. Then give the most likely fix.

If lag is growing uniformly across all partitions and there are no errors, the consumer is processing but not fast enough. Think about what determines processing throughput vs. what could be throttling it.

Solution — three hypotheses:

H1 (Most likely): Processing throughput is just too low. 32 consumers × throughput_per_consumer < 50,000 msg/s. If each consumer handles 1,200 msg/s and the topic receives 50,000 msg/s, you're 14,000 msg/s short. Check: kafka-consumer-groups.sh --describe and watch lag change over 60 seconds. If lag grows at 14k/s, your processing rate is 36k/s against 50k/s. Fix: profile the processing code — is there a slow downstream DB write? Can you batch-write to the inventory database instead of one row per message? Batching 100 messages per DB write is 100× faster than individual writes.

H2: A downstream dependency is slow or rate-limited. The inventory database has a connection pool limit or a slow query plan. Consumers are processing but waiting on DB acks. Check: DB query latency metrics, connection pool saturation. Fix: add indexes, tune pool size, switch to async batch writes with a local buffer.

H3: Offset commits are happening too frequently. If commits happen after every single message (commitSync() per message), each commit is a round-trip to the broker. At 50k msg/s with 32 consumers, that's ~1,560 round-trips/s per consumer — expensive. Check: look at commit frequency in logs. Fix: commit after processing a batch (every 500–1,000 messages), not after every single message.

Most likely fix: Profile the inventory DB write. Switch from per-message INSERT to batched UPSERT with a 500ms flush window. This alone typically 10–50× throughput on DB-bound consumers.

You're designing the pub/sub layer for a ride-sharing platform. Requirements:

Events: ride.requested, ride.accepted, ride.completed, ride.cancelled
Three regions: US-East, EU-West, AP-Southeast. Each region has riders and drivers.
Real-time systems (matching engine, driver app push) must react within 500ms and need local latency — cross-region round-trips are unacceptable.
Analytics and compliance audit trails must see ALL events from ALL regions.
A ride that starts in US-East must always be processed by US-East systems — do not route a US-East ride event to the EU broker for processing.

Design the topic layout, broker placement, replication strategy, and subscription topology. Identify the two hardest trade-offs in your design.

Think about: where should topics live? What is replicated vs. processed locally? What is the difference between "events flow to a global analytics topic" vs. "events are processed globally"?

Solution:

Broker placement: Deploy independent broker clusters in each region — US-East, EU-West, AP-Southeast. Each cluster hosts regional topics (us-east.ride.requested, eu-west.ride.requested, etc.). Real-time systems (matching engine, driver push) subscribe only to their local regional cluster. This guarantees sub-500ms local delivery — no cross-ocean hop.

Analytics aggregation: Deploy a global Kafka cluster (or use a managed service like Confluent Cloud or GCP Pub/Sub with global topics). Use Kafka MirrorMaker 2 (or equivalent) to replicate all regional topics into a unified global cluster: us-east.ride.requested, eu-west.ride.requested, and ap-se.ride.requested all mirror into global.ride.requested. Analytics and compliance consumers subscribe to the global cluster — they see all events from all regions, with some replication lag (typically 1–5 seconds). This satisfies the audit requirement without routing real-time traffic globally.

Topic layout per region: {region}.ride.{event_type} naming. Partition by ride ID — this ensures all events for one ride (requested → accepted → completed) land in the same partition, preserving ride-level ordering for the matching engine.

Trade-off 1 — Replication lag vs. consistency: The analytics cluster sees events 1–5 seconds behind real-time. If compliance queries the global cluster during that window, it may not yet see the most recent event for a ride. This is typically acceptable for audit use cases but must be documented and factored into any real-time dashboard SLAs.

Trade-off 2 — Operational complexity vs. simplicity: Three regional clusters plus a global cluster means four Kafka deployments to maintain, monitor, and upgrade. The alternative — a single global cluster — is simpler operationally but forces all real-time traffic to route to wherever the leader partition lives, adding cross-region latency that violates the 500ms requirement. The multi-cluster model is correct for this use case but requires investment in a cross-cluster replication pipeline and unified monitoring.

The four exercises build capacity planning instincts (fan-out math, I/O ceilings), delivery guarantee selection (matching business cost of loss vs. duplicate to the mechanism), consumer lag diagnosis (throughput vs. downstream bottleneck vs. commit overhead), and multi-region pub/sub topology design (local clusters for latency-sensitive real-time work, MirrorMaker replication for global analytics). The reasoning framework — not the numbers — transfers to any pub/sub system you encounter.

Cheat Sheet — Pub/Sub in 30 Seconds

Eight quick-reference cards covering every concept you need to recall fast — before an interview, during a design review, or when reviewing a broker config.

Bookmark this. When you need a fast refresh before an interview or while reviewing a pull request that touches event-driven code, this section is your 90-second re-entry point into the full page.

A service that emits events to a topic. It does not know — and does not care — who receives the event. Fire-and-forget is the contract. WHY: this ignorance is the source of all the decoupling benefits.

A service that listens on a topic and reacts to every message delivered to it. Multiple subscribers on the same topic each get their own copy. WHY: fan-out is the point — one event, many independent reactions.

A named, durable destination that the broker manages. Publishers write to it; subscribers read from it. Topics are the contract — name them as past-tense facts (order.placed, not createOrder). WHY: past-tense names signal that the event already happened and the publisher doesn't control what happens next.

The middleman. It receives messages from publishers, stores them (at least briefly), and delivers copies to all subscribers. It is the single component that makes pub/sub work — remove it and you're back to direct calls. WHY: the broker absorbs the fan-out cost so publishers and subscribers don't have to know about each other.

At-most-once: fast, can lose messages — use for low-stakes analytics. At-least-once: safe but may duplicate — requires idempotent consumers, use for payments and state changes. Exactly-once: no loss, no duplicates — expensive, use only when both loss AND duplication are unacceptable. WHY: each guarantee trades latency and complexity against data safety.

How many subscribers receive each published message. Broker outbound I/O = publish rate × message size × fan-out factor. This is WHY a log-based broker (Kafka) beats a push broker at high fan-out: subscribers read from a shared log rather than receiving N pushed copies.

Where messages go when a subscriber fails to process them after N retries. Without a DLQ you have two bad choices: retry forever (floods the topic) or drop silently (lose data). The DLQ parks broken messages safely for manual inspection and replay once the bug is fixed. Not optional in production.

Global ordering across all messages is impractical — it serializes everything and kills throughput. Per-key ordering is the practical standard: all messages with the same partition key land in the same partition, in publish order. One consumer per partition preserves that order end-to-end.

Eight cards: publisher (fire-and-forget emitter), subscriber (fan-out receiver), topic (named durable channel), broker (the decoupling middleman), ack semantics (at-most / at-least / exactly-once trade-offs), fan-out factor (the broker I/O multiplier), DLQ (poison message safety net, non-optional), and ordering (per-key via partition key, not global).

Glossary — Plain-English Definitions

Every pub/sub term defined in plain English — the vocabulary you need to read broker documentation and architecture discussions without getting lost.

These are the terms you'll encounter across all pub/sub systems — Kafka, RabbitMQ, Google Cloud Pub/Sub, AWS SNS/SQS, Redis Pub/Sub. The concepts are identical; only the brand names change.

Publisher (Producer): A service or process that creates and sends messages to a topic. The publisher doesn't know how many subscribers exist or what they do with the message. It just publishes and moves on.
Subscriber (Consumer): A service or process that listens to a topic and receives messages from it. In push-based systems, the broker delivers messages to the subscriber. In pull-based systems (like Kafka), the subscriber asks the broker for new messages on its own schedule.
Broker: The server (or cluster of servers) that sits between publishers and subscribers. It receives messages, stores them temporarily or permanently, and routes copies to every subscriber registered on the topic. Think of it as the post office for your events.
Topic: A named channel inside the broker. Publishers write to topics; subscribers read from topics. A topic is the only thing publishers and subscribers agree on — they don't need to know about each other, only about the topic name.
Message / Event: The unit of data sent through the system. Usually a small JSON or binary blob with a payload (the data) and optionally headers (metadata like timestamp, event type, or source service).
Subscription: A registered intent to receive messages from a topic. When a subscriber subscribes, the broker starts routing copies of new topic messages to it. In some systems (Google Cloud Pub/Sub), subscriptions are named resources with their own configuration (ack deadline, retention, DLQ settings).
Acknowledgement (ACK): A signal from the subscriber back to the broker saying "I received this message and processed it successfully — you can stop tracking it." Without an ACK, the broker assumes the message was not processed and will re-deliver it (in at-least-once systems).
Negative Acknowledgement (NACK): A signal from the subscriber saying "I got this message but I can't process it right now — please re-deliver it." The broker will re-queue the message for another delivery attempt.
Dead-Letter Queue (DLQ): A special topic or queue where the broker parks messages that have failed delivery or processing too many times. Prevents a single bad message from blocking the whole topic. Production requirement: every topic that matters needs a DLQ configured.
Fan-Out: The multiplication effect where one published message is delivered to N subscribers. If you publish 1 message to a topic with 10 subscribers, the broker does 10 deliveries. Fan-out factor is the multiplier that determines broker outbound I/O.
Consumer Group: A named set of subscriber instances that share the work of processing a topic's messages. In Kafka, each partition is assigned to exactly one consumer in the group — this distributes the processing load. Multiple consumer groups can independently consume the same topic, each getting a full copy of every message. WHY: consumer groups let you do load balancing (within the group) and fan-out (across groups) at the same time.
Partition: A slice of a topic — an independent ordered sequence of messages that can be stored on a different broker node. More partitions means more parallelism for both writing and reading. One consumer (per consumer group) is assigned to each partition, so partition count is the ceiling on consumer parallelism.
Offset: A number that identifies a specific message's position within a partition. Like a page number in a book. Subscribers track their current offset to know where they left off. If a subscriber restarts, it resumes from the last committed offset — this is how replay and crash recovery work.
Retention Period: How long the broker keeps messages before deleting them. In traditional brokers (RabbitMQ, SQS) messages are deleted after acknowledgement. In log-based brokers (Kafka) messages are retained for a configured duration (e.g., 7 days) regardless of consumption — enabling replay and late consumers.
Idempotency: A property of a processing operation: doing it twice produces the same result as doing it once. At-least-once delivery means duplicate messages are possible — your subscriber must be idempotent to handle them safely. Common pattern: deduplicate by message ID in the database before applying the change.

Fifteen terms: publisher, subscriber, broker, topic, message/event, subscription, ACK, NACK, DLQ, fan-out, consumer group, partition, offset, retention period, and idempotency. These are the vocabulary building blocks you need to read any pub/sub documentation or architecture discussion without reaching for a search engine.

Mini-Project — Build a Notification Fan-Out Service

Start with a single in-process broker, then grow it to partitioned topics, then multi-region — the same evolution path real systems follow.

The best way to truly understand pub/sub is to build a scaled-down version yourself. This mini-project takes you from a zero-dependency in-process broker to a design that would work at the scale of a real Slack-like notification system. Each stage adds exactly one new concept — and explains WHY that concept became necessary.

The Problem

You're building a notification service for a team chat app. When a user sends a message in a channel, several things need to happen: every member of the channel must receive a push notification, the "unread count" badge must update, the activity feed must log the event, and a search index must be updated. The user who sent the message shouldn't have to wait for any of this — they just need their message to appear in the chat.

Stage 1 — In-Process EventEmitter (Node.js)

Start here: zero infrastructure, zero configuration. The built-in EventEmitter in Node.js implements the pub/sub pattern natively. One line of code to publish, one line per subscriber. This is the right starting point for a new feature — prove the fan-out logic works before paying the operational cost of a real broker.

// Stage 1: a thin wrapper around Node's EventEmitter // Acts as the "broker" — all publishers and subscribers use this single instance const { EventEmitter } = require("events"); const broker = new EventEmitter(); // Increase max listener count so Node doesn't warn when you add many subscribers broker.setMaxListeners(50); module.exports = broker;

// Stage 2: swap in Google Cloud Pub/Sub — minimal code change const { PubSub } = require("@google-cloud/pubsub"); const pubsub = new PubSub({ projectId: "my-project" }); const topic = pubsub.topic("message.sent"); async function handleSendMessage(req, res) { const { channelId, userId, text } = req.body; const message = await db.messages.insert({ channelId, userId, text }); // Publish is now durable — broker persists the message // Even if subscribers are down, the message won't be lost const messageId = await topic.publishMessage({ data: Buffer.from(JSON.stringify({ messageId: message.id, channelId, userId, text })), attributes: { source: "chat-api", version: "1" }, }); console.log(`Published as GCP message ${messageId}`); res.json({ ok: true, messageId: message.id }); } // Subscriber — runs as a separate Cloud Run service const subscription = pubsub.subscription("push-notifications-sub"); subscription.on("message", async (msg) => { const event = JSON.parse(msg.data.toString()); const members = await db.channels.getMembers(event.channelId); await pushService.sendToMany(members.filter(m => m.id !== event.userId), { body: event.text }); msg.ack(); // Tell the broker: processed successfully, stop tracking }); // If processing fails, msg.nack() — broker will re-deliver subscription.on("error", (err) => console.error("Subscription error:", err));

Stage 3 Evolution Notes

Stage 3 adds multi-region. Each region (US-East, EU-West) gets its own Kafka cluster. The chat API in each region publishes only to its local cluster — no cross-region write latency. Notification subscribers (push, unread, search) also run locally, so the whole real-time loop stays within ~20ms round-trip.

A separate MirrorMaker 2 process replicates all regional topics into a global analytics cluster. Analytics and compliance consumers subscribe there, accepting 1–5 seconds of replication lag in exchange for a complete view of all events across every region. The real-time path never touches the global cluster.

Build order for a new project: Stage 1 (EventEmitter) until you have more than one service or need durability. Then Stage 2 (managed broker like Google Pub/Sub or Amazon SNS+SQS) when you have multiple services or need messages to survive a restart. Only go to Stage 3 when you have users in multiple geographies and cross-region latency is measurable. Skipping stages is premature complexity.

Three-stage evolution: Stage 1 (Node.js EventEmitter, zero infra, prove the fan-out design) → Stage 2 (managed broker like GCP Pub/Sub, durable delivery, independent scaling, DLQ) → Stage 3 (regional clusters for <20ms local delivery, MirrorMaker replication for global analytics). Code change from Stage 1 to Stage 2 is minimal — you're swapping the broker, not the design. Build only the stage you actually need right now.

Migration Path — From HTTP Webhooks to a Pub/Sub Backbone

A four-step guide for teams moving from a tightly-coupled webhook system to an event-driven pub/sub architecture — with honest risk assessment at each step.

Many systems start with webhooks: Service A calls Service B's endpoint directly when something happens. This works. Then a third service needs the same events. Then a fourth. Then Service B goes down and Service A starts returning 500s to users. The moment you recognize this smell, it's time to migrate to pub/sub — but migrating a live system takes care.

Golden rule for migrations: Never do a big-bang cutover. Each step below runs the old system and new system in parallel for a period. Only cut over after verifying the new path works correctly.

Step 1 — Introduce the Broker Alongside Existing Webhooks (Risk: Low)

What you do: Deploy a broker (start with a managed service — Google Pub/Sub, AWS SNS, or RabbitMQ). When Service A fires its existing webhook call to Service B, also publish the same event to a new topic on the broker. Do NOT remove the webhook yet. Both paths are live simultaneously.

Why this is safe: You haven't changed the existing code paths at all. The webhook still fires. The broker publish is additive — if it fails, the webhook already succeeded. You're giving yourself a window to verify the broker is receiving all events correctly before anyone depends on it.

Risk: Double-publishing means both paths must be maintained while you migrate. Avoid letting this parallel state last more than 2–4 weeks — teams forget and the webhook never gets removed.

Verification: Compare event counts on the broker topic vs. webhook delivery count over 48 hours. They should match. Set up a DLQ on the broker topic and monitor it for zero depth during this period.

Step 2 — Migrate Subscribers One by One (Risk: Medium)

What you do: Pick the lowest-stakes downstream consumer (not the most critical — start with analytics or a non-critical notification). Update it to subscribe to the broker topic instead of listening for the webhook. Keep the webhook delivery to this consumer disabled but not deleted. Observe the consumer for one week under real production traffic.

Why one by one: If something breaks, you know exactly which consumer caused it. A big-bang migration where all consumers switch simultaneously makes debugging impossible. The incremental approach turns a high-risk migration into a series of low-risk steps.

Risk: The subscriber now depends on the broker. If the broker has a delivery delay or outage, this consumer stops receiving events. This is acceptable for a low-stakes consumer but not yet for the most critical ones. Test your DLQ and retry policy before moving critical consumers.

Idempotency check: Before migrating any subscriber that modifies state (database writes, financial operations), confirm it is idempotent — at-least-once delivery means it will receive duplicates eventually. Add deduplication logic now if it isn't already present.

Step 3 — Cut Over Critical Consumers and Remove Webhooks (Risk: High — do this carefully)

What you do: After all non-critical consumers are successfully running on the broker, migrate the critical consumers (payments, core state changes). Once a consumer has been running on the broker for at least one week in production with zero issues, remove the webhook delivery path to that consumer. Removing the webhook is what actually simplifies the architecture.

Why the wait: One week of production traffic exercises edge cases that staging never sees. A shadow period that's too short misses monthly batch jobs, weekend traffic patterns, and retry storms from unrelated outages.

Risk: Removing the webhook to a critical consumer is irreversible in the short term. If the broker has an undiscovered bug (schema mismatch, delivery ordering issue), you have no fallback. Mitigate by keeping the webhook code path commented-out (not deleted) for 30 days post-cutover — you can re-enable it in minutes if you need to.

Rollback plan: Maintain the ability to re-enable direct webhook delivery within 10 minutes for any critical consumer. Document this rollback procedure before you start Step 3.

Step 4 — Remove Publisher-Side Webhook Code and Harden (Risk: Low)

What you do: Once all consumers have been on the broker for 30+ days, remove the webhook publishing code from Service A. Update monitoring to track broker topic depth, consumer group lag, and DLQ depth as your primary health signals (replacing the old HTTP webhook success rate). Set up alerts: page on DLQ depth > 0 for critical topics; alert (not page) on consumer lag growing for more than 5 minutes.

Why monitoring changes matter: Your old observability was HTTP response codes on webhook calls — easy to monitor. Pub/sub failure modes are different: the publisher succeeds (200 OK from the broker), but the subscriber might be slow, down, or processing incorrectly. Consumer lag and DLQ depth are now your primary health signals. If you don't update your dashboards and alerts, you'll miss subscriber failures entirely.

Hardening checklist: (1) DLQ configured on every topic. (2) Retry policy tested (NACK → exponential backoff → DLQ after N attempts). (3) Consumer lag alert set. (4) Runbook written for "DLQ has messages — what do I do?" (5) Load test at 3× expected peak before decommissioning any parallel systems.

Four steps: (1) Add broker alongside existing webhooks — additive, no risk, verify event parity. (2) Migrate consumers one-by-one starting with low-stakes ones — incremental, isolates failures, confirm idempotency. (3) Cut over critical consumers and remove webhooks — highest risk step, keep rollback plan ready, 30-day shadow before deleting code. (4) Remove publisher webhook code and update monitoring — consumer lag + DLQ depth replace HTTP success rate as your health signals.

Source	Author / Org	Why It's Worth Reading
Designing Data-Intensive Applications — Chapter 11: Stream Processing	Martin Kleppmann	The single best explanation of how logs, streams, and pub/sub systems relate to each other — covers replication, ordering, and the difference between databases and message logs with unusual clarity. If you read one book chapter on this topic, make it this one.
"You Cannot Have Exactly-Once Delivery"	Tyler Treat (bravenewgeek.com)	A short, precise essay that explains why true exactly-once delivery is fundamentally impossible in a distributed system — and what "exactly-once" actually means in practice (exactly-once processing, not delivery). Corrects one of the most persistent myths in pub/sub. Free on the web.
The Log: What every software engineer should know about real-time data's unifying abstraction	Jay Kreps (LinkedIn Engineering Blog)	The foundational blog post that explains why a distributed append-only log is the right abstraction for pub/sub at scale. Written by one of Kafka's creators. Explains the unifying idea behind Kafka, database replication, and event sourcing in one readable essay. Free online.
Google Cloud Pub/Sub Documentation — "Choosing a subscription type"	Google	A practical, well-organized reference for push vs. pull subscriptions, ordering keys, exactly-once delivery configuration, and DLQ setup in a real managed pub/sub service. Useful as a reference when you're designing or implementing against GCP Pub/Sub specifically. Free at cloud.google.com/pubsub/docs.
RabbitMQ Tutorials — "Publish/Subscribe"	RabbitMQ (rabbitmq.com)	The official hands-on tutorial series that walks through exchanges, queues, and binding keys with working code in multiple languages. The best first-principles introduction if you want to understand how a traditional broker (not log-based) implements pub/sub routing. All tutorials are free and runnable in under 30 minutes.
"Building a Scalable Pub/Sub System" — Confluent Blog Series	Confluent Engineering	A multi-part series on production pub/sub patterns using Kafka: schema evolution, consumer group management, DLQ patterns, and multi-region replication. Goes beyond Kafka-specific details into general principles that apply to any broker. Free at confluent.io/blog.
AWS SNS + SQS Fan-Out Pattern Documentation	Amazon Web Services	The AWS architecture reference for the SNS-to-SQS fan-out pattern — where one SNS topic fans out to multiple SQS queues, each with its own consumers. The most commonly deployed pub/sub architecture in AWS environments. Includes worked configuration examples. Free at docs.aws.amazon.com.

Publish/Subscribe (Pub/Sub)

TL;DR

Why You Should Care — The Problem It Solves

Before Pub/Sub: The Direct-Call Tangle

After Pub/Sub: One Publisher, Many Listeners

Real-World Analogies

The Mental Model — How Pub/Sub Actually Works

Component 1: The Publisher

Component 2: The Broker and Topics

Component 3: The Subscriber

How the Message Flows — Step by Step

Minimal Working Example — Pub/Sub in Code

Code Walkthrough — What Each Piece Does

Junior vs Senior — How You Think About Pub/Sub

How a Junior Thinks

Problems

How a Senior Thinks

What Changed — and Why

Bottom Line — Synchronous vs Pub/Sub

Senior Solution: Isolate failures with pub/sub

Senior Solution: Open/Closed via subscriptions

Senior Solution: Return after the broker acknowledges

The Topic Registry — The Broker's Routing Table

Push vs Pull — Two Ways to Get Messages Out

Acknowledgements and Redelivery — The At-Least-Once Problem

Ordering — When Does It Matter and When Can You Get It?

Partitioning — The Throughput Multiplier

What Went Wrong

What Went Wrong

What Went Wrong

The Four Testing Layers

Layer 1 — Unit-Testing Publishers (Mock the Broker)

Layer 2 — Unit-Testing Subscribers (Replay a Fixture)

Layer 3 — Integration-Testing Topic Routing

Layer 4 — Fault Injection

Sample Test: Publisher + Subscriber Unit Tests

The Five Production Signals

What Each Signal Tells You

The Fan-Out Formula

Worked Example — The Full Math

Partition Strategy in Plain English

The Problem

Stage 1 — In-Process EventEmitter (Node.js)

Stage 3 Evolution Notes

Step 1 — Introduce the Broker Alongside Existing Webhooks (Risk: Low)

Step 2 — Migrate Subscribers One by One (Risk: Medium)

Step 3 — Cut Over Critical Consumers and Remove Webhooks (Risk: High — do this carefully)

Step 4 — Remove Publisher-Side Webhook Code and Harden (Risk: Low)