The pattern that lets services talk to each other without ever knowing who's listening β and why it's the backbone of every scalable event-driven system you'll ever build.
What "publish/subscribe" actually means in plain English β and why it exists in the first place
The three roles every pub/sub system has: publishers, a broker, and subscribers β and why the broker is the magic ingredient
The fundamental difference between pub/sub and a direct API call (and when each one is the right choice)
How topics work as named channels that decouple producers from consumers completely
Where pub/sub shows up in real systems: Kafka, Google Pub/Sub, AWS SNS, RabbitMQ fanout, Redis Pub/Sub, and browser EventEmitter all use this exact pattern
The three things you gain (decoupling, scalability, resilience) and the two things you trade away (guaranteed order, instant feedback)
Pub/Sub is the art of shouting into a named channel and letting whoever cares listen β without the shouter ever knowing who's on the other end. It's why your order service can fire "order placed!" and have the email team, the inventory team, and the analytics team all react independently, without the order service ever needing to know those teams exist.
The one-liner: Pub/Sub is a messaging pattern where senders (publishers) post messages to a named channel (topic), and receivers (subscribers) listen on that channel β with a broker in the middle doing all the routing. Nobody talks to anyone directly. The broker is the post office.
What: Pub/Sub (short for Publish/SubscribeA messaging pattern where message senders (publishers) and receivers (subscribers) are completely decoupled. Publishers don't know who reads their messages; subscribers don't know who writes them. A broker in the middle routes everything.) is a communication pattern for distributed systems. Instead of Service A calling Service B directly, A publishes a message to a topic. The broker delivers that message to every service that has subscribed to that topic. A and B never talk to each other β only to the broker.
When: Use pub/sub when one event should trigger multiple independent reactions β like "user signed up" kicking off a welcome email, a Slack notification, and an analytics event simultaneously. Also when you want to decouple services so they can be deployed, scaled, and changed independently. Avoid it when you need an immediate response (synchronous request/reply is better) or when strict message ordering is non-negotiable.
In real systems: You'll see pub/sub under many brand names β Apache KafkaA distributed append-only log that implements pub/sub at massive scale. Topics are split into partitions for parallelism, and messages are retained for a configurable period so consumers can replay history. (topics + consumer groups), Google Cloud Pub/SubA fully managed pub/sub service from Google. Publishers write to topics; subscribers pull or push from subscriptions. Guarantees at-least-once delivery., AWS SNS (fan-out to queues/lambdas/email/SMS), RabbitMQ fanout exchanges, Redis Pub/Sub for lightweight in-memory messaging, and even the browser's EventEmitter / addEventListener. The underlying idea is identical in all of them.
The core trade-off: You gain decoupling (services don't know about each other), scalability (the broker scales fan-out, not your services), and resilience (the broker buffers messages if a subscriber is down). You trade away synchronous responses (you can't wait for a reply) and simple debugging (tracing a message through a broker is harder than tracing a direct function call).
Quick Example (Node-style):
// Publisher β doesn't know who's listening
broker.publish("order.placed", { orderId: "42", amount: 99.00 });
// Subscriber A β email service
broker.subscribe("order.placed", (msg) => sendConfirmationEmail(msg));
// Subscriber B β inventory service
broker.subscribe("order.placed", (msg) => reserveStock(msg));
// Both get called. Publisher wrote once, two services reacted.
// Neither knows the other exists.Pub/Sub decouples message senders from receivers by routing everything through a named channel (topic) managed by a broker. Publishers fire-and-forget; subscribers listen and react. The broker handles fan-out, buffering, and delivery. You gain decoupling and scalability at the cost of synchronous feedback and simpler debugging.
Section 2
Why You Should Care β The Problem It Solves
Let's start with a story about a team that didn't use pub/sub β and paid for it.
Before Pub/Sub: The Direct-Call Tangle
Imagine a small startup building an e-commerce platform. When a customer places an order, the order service needs to do four things: send a confirmation email, update the inventory count, log the event for analytics, and notify the shipping service to prepare a label.
The obvious approach? Have the order service call each of those services directly β one HTTP request to the email service, another to inventory, another to analytics, another to shipping. It works. The code ships. Life is good.
Then things start going wrong.
Problem 1: The shipping service goes down. Now your entire order endpoint returns a 500 error β even though the email sent fine and inventory updated correctly. One unrelated service failure broke everything, because the order service was blocked waiting for each reply.
Problem 2: The analytics team wants the same events. They ask the order service team to add one more HTTP call β to their new data warehouse. The order service now has five dependencies. Then a sixth. Every new team that needs order data goes straight to the order service and asks for a direct integration. The order service becomes the most tightly coupled, most brittle thing in the whole system.
Problem 3: Latency compounds. Six HTTP calls in sequence adds up fast. If each downstream service takes 50ms to reply, the customer waits 300ms extra just for side-effects that don't need to happen before they see "order confirmed." The user experience suffers because of coupling that serves nobody.
The diagram above shows the mess that grows naturally from direct calls. Every arrow is a hard dependency. One service goes down and the whole chain breaks. A new team arrives and the order service has to change again.
After Pub/Sub: One Publisher, Many Listeners
Now the team refactors. The order service does exactly one thing: publish a message to a topic called order.placed. It doesn't call anyone directly. It doesn't wait for anyone to reply. It shouts the news into the channel and returns immediately to the customer with "Order confirmed!"
Every other service β email, inventory, analytics, shipping β listens to the order.placed topic independently. When the message arrives, they each do their thing at their own pace, in their own process, without affecting each other.
The difference is dramatic. The order service now has exactly one dependency: the broker. Not four, not ten β one. New teams can subscribe to the same topic without touching the order service at all. The shipping service can go down for an hour; the broker buffers its messages, and when shipping comes back up, it catches up at its own pace. The customer got their confirmation immediately because the order service didn't wait for anything after publishing.
The core insight: Pub/Sub doesn't just make things faster β it changes the shape of your dependencies. Direct calls create a web: every sender knows about every receiver. Pub/Sub creates a star: everyone knows about the broker, and nobody knows about anyone else. When you change that shape, you change what can break, what can scale, and what you can modify without fear.
Direct service-to-service calls create fragile webs where one failure cascades, every new consumer requires code changes in the producer, and latency compounds. Pub/Sub replaces that web with a star: the order service publishes once to a topic, and each downstream service subscribes independently. New consumers are free; downstream failures are isolated; the producer returns to the user immediately without waiting.
Section 3
Real-World Analogies
Abstract patterns click when you can picture them in real life first. Here's the analogy that makes pub/sub stick β and then two more for different angles on the same idea.
Think about how a newspaper works. The publisher (the newspaper company) writes one edition of the paper. Thousands of subscribers each get their own copy delivered to their door. The publisher has no idea who any individual subscriber is β they just print the paper and let the distribution system handle delivery. And crucially, subscribers don't call the newspaper office to request their copy β they signed up once, and the copies just arrive whenever a new edition is published.
Notice what's missing: no direct relationship between publisher and subscriber. The publisher doesn't call each subscriber. The subscribers don't poll the publisher asking "is there news yet?" There's a distribution system β an intermediary β that handles the routing. In pub/sub systems, that intermediary is called the broker.
Real World
What it represents
In pub/sub code
Newspaper company
The thing that creates and sends events
publisher / producer
The edition / story
The data being communicated
message / event payload
"Business News" section
The category of events
topic / channel
Distribution system (post office)
The intermediary that handles routing
broker
Individual subscriber
A service that cares about the events
subscriber / consumer
Subscription sign-up
Registering interest in a topic
subscribe("topic.name")
The SVG above captures the key relationships. The newspaper company (publisher) on the left sends one message to the broker. The broker manages a topic called morning-edition. Three subscribers β home, office, library β each get their own independent copy. The publisher never spoke to the subscribers directly. The subscribers never polled the publisher. The broker handled all the routing, and each subscriber received the message at its own pace.
The "aha" moment: The publisher wrote the message once. Three subscribers consumed it independently. This is called fan-out β one message becomes N deliveries β and it's the reason pub/sub is so economical. Without it, the publisher would have to make three separate HTTP calls, know all three addresses, and handle all three failure modes. Fan-out moves that complexity to the broker where it belongs.
FM Radio Broadcast β A radio station transmits on a fixed frequency (the "topic"). Every car, home stereo, and phone tuned to that frequency receives the signal simultaneously β without the station knowing how many listeners there are. You can add a million new listeners and the broadcaster does zero extra work. That's pub/sub's fan-out in the physical world. The key difference from the newspaper analogy: radio has no queue β if you miss the broadcast, it's gone. Many pub/sub systems work this way too (Redis Pub/Sub, for example), while others (Kafka, SQS) store messages so late-arriving subscribers can catch up.
Push Notification Feed β When a Twitter/X account you follow posts a tweet, you get a notification β and so do your 50,000 fellow followers. The tweet author didn't manually DM each of you. There's an infrastructure layer that tracks "who follows @someaccount" (the subscriber list) and fans out the notification to everyone on that list. You can unfollow at any time (unsubscribe), and new followers opt in without the author doing anything (subscribe at will). This maps directly to how subscribe() and unsubscribe() work in code: consumers register interest dynamically, and the broker maintains the routing table.
Emergency Alert System β When an emergency is declared, one central authority broadcasts an alert on a specific channel (weather alerts, evacuation orders). Every TV station, radio station, phone, and siren system subscribed to that emergency channel activates at once. The authorities don't call each TV station β they publish to one channel and let each receiver handle its response independently. This maps to a crucial real-world use case: large-scale fan-out where the publisher can't possibly enumerate all consumers in advance. You subscribe ahead of time; the publisher never needs an updated list.
The newspaper analogy captures pub/sub perfectly: one publisher, one broker (distribution system), many independent subscribers β none of whom the publisher knows about. The publisher sends once; the broker fans out to everyone who subscribed. Radio adds the "no memory" angle (miss the broadcast = miss the message), while push notifications add the "dynamic subscription" angle. All three maps to the same underlying pub/sub mechanics.
Section 4
The Mental Model β How Pub/Sub Actually Works
Time to go from analogy to architecture. Pub/Sub has three moving parts in every implementation, from a tiny in-memory EventEmitter to a globe-spanning Kafka cluster. You need these three concepts locked in your head before any of the details make sense.
Three components, every time: A publisher that emits messages without knowing who reads them, a broker that manages topics and routes messages, and subscribers that receive messages without knowing who sent them. The broker is the piece that makes the other two independent.
Component 1: The Publisher
The publisherAlso called a "producer" in many systems (e.g. Kafka uses "producer"). It's the service that creates and sends messages. It has no knowledge of which services will receive those messages β only the broker knows that. is any piece of code that creates a message and sends it to a topic. The publisher's job is simple: produce the message and hand it to the broker. After that, the publisher's work is done β it doesn't wait for consumers to finish, and it doesn't track whether anyone received the message.
Why is this "fire and forget"? Because requiring the publisher to wait for all subscribers would re-introduce the same coupling problem we started with. If the publisher had to wait for three subscribers to acknowledge, the publisher is now blocked on all three. The whole point of pub/sub is that the publisher's throughput is independent of what subscribers do with the message.
Component 2: The Broker and Topics
The brokerThe intermediary server (or cluster of servers) that receives messages from publishers, stores them temporarily, and routes them to subscribers. Examples: Kafka brokers, RabbitMQ exchanges + queues, Google Cloud Pub/Sub service, Redis server. is the heart of the system. It does three things: it accepts messages from publishers, it stores them temporarily (or durably, depending on configuration), and it routes them to the right subscribers.
The broker organises messages by topicA named channel within the broker. Publishers send to a specific topic; subscribers listen on a specific topic. Think of it as the subject line of the message β e.g., "order.placed", "user.signed-up", "payment.failed". Each topic is independent from others. β a named channel like order.placed or user.signed-up. Topics are the addressing scheme: a publisher says "here's a message for the order.placed topic," and the broker delivers it to every subscriber registered for that topic. Topics let you have dozens of completely independent message streams flowing through the same broker without them interfering with each other.
Component 3: The Subscriber
The subscriberAlso called a "consumer" in many systems. It's the service that registers interest in a topic and receives messages when they arrive. Subscribers don't know who published the messages β they only interact with the broker. registers its interest in a topic (a subscribe() call), and from that point on, the broker delivers matching messages to it. The subscriber processes each message independently β it can take as long as it needs, retry failed processing, or even restart and catch up from where it left off (in systems like Kafka that retain messages).
Why is it important that the subscriber doesn't know who published? Because it means you can add, remove, or change publishers without touching subscriber code. The subscriber's contract is with the message format and the topic name β not with any specific service. This is called loose couplingA design principle where components interact through well-defined interfaces and have minimal knowledge of each other's internals. Loose coupling makes systems easier to change, scale, and test independently., and it's one of the most valuable properties in large distributed systems.
PubSub-Architecture
The diagram captures the full picture. Three publishers on the left (Order, User, Payment services) each send to a specific topic β they have no idea who's listening. The broker in the centre manages three topics and buffers messages in each. Five subscribers on the right each register interest in one or more topics and receive messages automatically. Notice that the Email Service subscribes to bothorder.placed and user.signed-up β a subscriber can listen to multiple topics simultaneously. And the Analytics Service listens to all three, aggregating events from every corner of the system without any of the publishers knowing.
How the Message Flows β Step by Step
The sequence diagram above shows the timing. In step β , the Order Service publishes the message and immediately gets back an acknowledgment from the broker (step β‘) β the message is safely stored. At this point the publisher is done. It goes back to the customer with "order confirmed." Steps β’ and β£ happen independently and concurrently after that: the broker delivers to both Email and Inventory simultaneously, and each processes at its own pace. The publisher is completely disconnected from those steps. It doesn't know they happened, and it certainly didn't wait for them.
Key Insight β Why the broker's ACK is what matters: The publisher gets a "message accepted" acknowledgment from the broker β not from the subscribers. This is what makes fire-and-forget possible. The broker takes responsibility for delivery from that point on. If a subscriber is temporarily down, the broker will keep the message and retry delivery when the subscriber comes back (in durable systems like Kafka or SQS). The publisher never needs to know.
Pub/Sub has three roles: publishers emit messages to topics (fire-and-forget), the broker manages topics and routes messages to registered subscribers, and subscribers process messages independently. The key insight is that publishers and subscribers are fully decoupled through the broker β each only knows about the broker, not about each other. Messages are delivered concurrently to all subscribers of a topic, and the publisher returns to the caller immediately after the broker acknowledges receipt.
Section 5
Minimal Working Example β Pub/Sub in Code
Before you see Kafka with its brokers, partitions, and consumer groups, let's build pub/sub from scratch β the smallest possible version that still captures all three core ideas. Once you understand this 30-line version, every real system (Kafka, Google Pub/Sub, SNS) will feel like a feature-rich version of the same concept.
What we're building: A tiny in-memory event broker. Publishers call publish(topic, message). Subscribers call subscribe(topic, handler). The broker routes messages to the right handlers. That's it. Three moving parts, ~30 lines of code.
The flow is simple: during setup, subscribers register their handler functions with the broker for specific topics. At runtime, when a publisher calls publish(), the broker looks up the topic in its subscriber map and calls every registered handler with the message. No direct coupling between publisher and subscribers β they communicate only through the broker's topic map.
This version builds the broker from scratch so you can see every moving part. It's not production code β it's a learning tool. The entire pub/sub mechanism fits in about 25 lines.
// The Broker β the piece that makes pub/sub work
class EventBroker {
constructor() {
// Core data structure: a map from topic name β array of handler functions
// When you "subscribe", you push a handler into the array for that topic.
// When you "publish", you call every handler in that array.
this.subscribers = new Map(); // { "order.placed": [fn1, fn2, ...] }
}
// Subscribe: register a callback for a topic
// Any function can subscribe β the broker doesn't care what it does
subscribe(topic, handler) {
if (!this.subscribers.has(topic)) {
this.subscribers.set(topic, []); // first subscriber for this topic
}
this.subscribers.get(topic).push(handler);
console.log(`[Broker] handler registered for "${topic}"`);
}
// Unsubscribe: remove a handler (e.g. when a service shuts down)
unsubscribe(topic, handler) {
const handlers = this.subscribers.get(topic) || [];
this.subscribers.set(topic, handlers.filter(h => h !== handler));
}
// Publish: send a message to a topic β fan-out to ALL subscribers
// Returns immediately β doesn't wait for handlers to finish
publish(topic, message) {
const handlers = this.subscribers.get(topic) || [];
console.log(`[Broker] publishing to "${topic}" β ${handlers.length} subscriber(s)`);
// Each handler is called independently
// In a real system, these would run in separate processes/services
handlers.forEach(handler => {
try {
handler(message);
} catch (err) {
// Real systems would put this in a dead-letter queue
console.error(`[Broker] handler error on "${topic}":`, err.message);
}
});
}
}
module.exports = { EventBroker };const { EventBroker } = require('./EventBroker');
const broker = new EventBroker();
// βββ SUBSCRIBERS register BEFORE any publisher runs ββββββββββββββββββββββββββ
// Email Service β cares about "order.placed"
broker.subscribe("order.placed", (msg) => {
console.log(`[Email Service] Sending confirmation to order #${msg.orderId}`);
// sendConfirmationEmail(msg.customerEmail, msg.orderId) ...
});
// Inventory Service β also cares about "order.placed"
broker.subscribe("order.placed", (msg) => {
console.log(`[Inventory] Reserving ${msg.quantity}x "${msg.item}"`);
// reserveStock(msg.item, msg.quantity) ...
});
// Analytics β subscribes to ALL event types
broker.subscribe("order.placed", (msg) => {
console.log(`[Analytics] Recording order event: $${msg.amount}`);
});
broker.subscribe("user.signed-up", (msg) => {
console.log(`[Analytics] New user: ${msg.userId}`);
});
// βββ PUBLISHER fires and forgets βββββββββββββββββββββββββββββββββββββββββββββ
// Order Service β publishes and returns to the caller immediately.
// It has zero knowledge of Email, Inventory, or Analytics services.
function handleOrderPlaced(orderData) {
broker.publish("order.placed", {
orderId: orderData.id,
item: orderData.item,
quantity: orderData.qty,
amount: orderData.price,
});
return { status: 201, message: "Order accepted" }; // returns immediately!
}
// βββ Trigger an order ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
const result = handleOrderPlaced({ id: "ORD-42", item: "Widget", qty: 3, price: 79.99 });
console.log(`[Order Service] Responded: ${result.status} ${result.message}`);
Once you understand the in-memory broker, it's easy to see how Redis Pub/Sub maps to the same idea. Redis is a fast in-memory data store that includes a built-in pub/sub system. Publishers call PUBLISH channel message, subscribers call SUBSCRIBE channel. The concepts are identical β only the transport changes.
// Redis Pub/Sub β Publisher (runs in the Order Service process)
const redis = require('ioredis');
const pub = new redis(); // connects to Redis broker
async function handleOrderPlaced(order) {
const message = JSON.stringify({
orderId: order.id,
item: order.item,
quantity: order.qty,
amount: order.price,
timestamp: new Date().toISOString(),
});
// PUBLISH channel message β fire and forget
// Redis routes this to every subscriber on "order.placed"
await pub.publish("order.placed", message);
// Publisher returns immediately β does NOT wait for subscribers
return { status: 201, message: "Order accepted" };
}
module.exports = { handleOrderPlaced };// Redis Pub/Sub β Subscriber (runs in the Email Service process)
const redis = require('ioredis');
const sub = new redis(); // separate connection for subscribing
// Register interest in the "order.placed" channel
sub.subscribe("order.placed", (err) => {
if (err) throw err;
console.log("[Email Service] Listening on order.placed");
});
// Handler fires every time Redis delivers a message to this subscriber
sub.on("message", (channel, rawMessage) => {
if (channel === "order.placed") {
const order = JSON.parse(rawMessage);
console.log(`[Email Service] Sending confirmation for #${order.orderId}`);
// sendEmail(order.customerEmail, order.orderId) ...
}
});
// βββ Important Redis Pub/Sub caveat βββββββββββββββββββββββββββββββββββββββββ
// Redis Pub/Sub has NO message persistence. If this subscriber process
// is down when the publisher fires, that message is lost forever.
// For durable delivery (retry-on-failure), use Redis Streams, Kafka, or SQS instead.
// For volatile notifications (e.g. live dashboard updates), Redis Pub/Sub is perfect.// Redis Pub/Sub β Subscriber (runs in the Inventory Service process)
const redis = require('ioredis');
const sub = new redis();
sub.subscribe("order.placed", (err) => {
if (err) throw err;
console.log("[Inventory] Listening on order.placed");
});
sub.on("message", (channel, rawMessage) => {
if (channel === "order.placed") {
const order = JSON.parse(rawMessage);
console.log(`[Inventory] Reserving ${order.quantity}x "${order.item}"`);
// reserveStock(order.item, order.quantity) ...
}
});
The structure is identical to the in-memory version: publisher calls publish(), broker (Redis server) routes to subscribers, subscribers receive and handle independently. The only real difference is that this works across multiple processes and machines β each file runs in its own service, connected to the same Redis server over the network.
Β·Β·Β·
Code Walkthrough β What Each Piece Does
The broker's subscriber map (Map<topic, handler[]>) is the heart of the whole thing. Every pub/sub system β from a 30-line in-memory broker to a multi-datacenter Kafka cluster β has an equivalent of this map somewhere. It's the answer to the question: "when a message arrives on topic X, who should get it?" The broker's only job is to maintain this mapping and execute it correctly.
The subscribe(topic, handler) method does two things: ensures a list exists for the topic (creating it on first use), then appends the handler function to that list. That's it. Subscription is O(1) β just a map lookup and a push. Notice that the handler is stored as a reference to a function β the broker doesn't know or care what the function does. It could send an email, update a database, or trigger a machine learning job. The broker is blissfully ignorant.
The publish(topic, message) method is where the fan-out happens. It looks up the topic in the subscriber map, then calls each handler in sequence (or in parallel in async systems). The publisher calls publish() and gets back immediately β it doesn't await the handlers. In the Redis version, the publisher awaits only the acknowledgment from the Redis server, not from the downstream services.
The error handling in publish() is subtle but important. If one handler throws, the broker catches it and continues to the next handler β one subscriber failing doesn't prevent other subscribers from receiving the message. In production systems, a failed handler typically sends the message to a dead-letter queueA special fallback queue that holds messages that couldn't be processed successfully. Instead of losing failed messages, the broker moves them here for manual inspection or retry. Common in Kafka (using a separate DLT topic), SQS, and RabbitMQ. for later inspection and retry.
Redis Pub/Sub vs in-memory broker β the concepts are identical, but Redis crosses process boundaries. Each subscriber runs in its own service on its own server. The Redis server is the shared broker. This means messages survive process restarts on the publisher side β but Redis Pub/Sub has no message persistence, so if a subscriber is down, it misses the message entirely. For durable delivery, you need Kafka, SQS, or Redis Streams (which we cover in later sections).
The in-memory broker shows pub/sub in its purest form: a map from topic to handler list, a subscribe() that registers handlers, and a publish() that fan-outs to all of them. Redis Pub/Sub applies the same idea across the network β publisher and subscribers run in separate processes, connected through the Redis server as broker. The core pattern never changes: publish fire-and-forgets, broker routes, subscribers handle independently. The differences between systems are mostly about durability, ordering, and scale.
Section 6
Junior vs Senior β How You Think About Pub/Sub
The Scenario
You're building the backend for a ride-hailing app. When a ride is completed, the system needs to: (1) charge the customer's payment card, (2) calculate and add the driver's earnings, (3) send a receipt email, (4) update the trip in the analytics database, and (5) prompt both driver and rider to leave a review. Your manager asks: "How do you wire this up?"
How a Junior Thinks
The natural first instinct is to put everything in one place. The ride service knows all the things that need to happen, so why not just call them in order? It keeps the logic visible and easy to reason about.
If the analytics database is having a slow night and throws a timeout on step 4, the customer gets a 500 error β even though payment was charged, the driver got their earnings, and the email was sent. All that work is wasted. The customer will call support confused about a failed ride that they were definitely charged for.
Problem 2: The ride service knows too much
The ride service now imports and depends on five other services. When the review prompt feature changes (maybe the team moves to in-app push instead of SMS), the ride service code has to be updated. When a new "loyalty points" feature is added, someone goes into the ride service and adds a sixth await. The ride service becomes the place where everyone's requirements collide.
Problem 3: Latency is the sum of all steps
Steps 2β5 are completely independent β there's no reason the analytics write needs to happen before the review prompt. But because they're sequential awaits, the total latency is the sum of all five operations. Even if you parallelise them with Promise.all, you're still blocked waiting for the slowest one before you can return. And you still break if any of them fail.
How a Senior Thinks
A senior engineer looks at this scenario and immediately asks: "Which of these five things must happen before the customer sees their response?" Usually, only one: payment charging. Everything else β earnings, email, analytics, review prompt β can happen asynchronously without the user waiting for them. Once you identify that distinction, the architecture becomes obvious: publish one event, let each concern subscribe and handle itself.
// Senior approach: publish one event and return immediately
// The ride service has exactly ONE responsibility: record the ride outcome.
// It delegates everything else by emitting an event.
async function onRideCompleted(rideId, customerId, driverId, fare) {
// The ONLY synchronous call: write the final ride record to our own DB
await rideDb.markCompleted(rideId, { customerId, driverId, fare });
// Publish one event β all downstream reactions happen asynchronously
// The ride service doesn't know (or care) who's listening
await broker.publish("ride.completed", {
rideId,
customerId,
driverId,
fare,
completedAt: new Date().toISOString(),
});
// Return to the customer immediately β no waiting for email, analytics, etc.
return { status: "completed", rideId };
}
// The ride service has ZERO imports of paymentService, emailService, analyticsDb, etc.
// Adding a new downstream action = add a new subscriber, zero changes here.// Payment service subscribes to ride.completed
// Completely independent β can be deployed and scaled on its own
broker.subscribe("ride.completed", async (event) => {
const { rideId, customerId, fare } = event;
// Charge the customer
const charge = await stripe.charges.create({
amount: Math.round(fare * 100), // Stripe works in cents
currency: "usd",
customer: customerId,
metadata: { rideId },
});
console.log(`[Payment] Charged $${fare} for ride ${rideId}: ${charge.id}`);
// Once charge succeeds, emit ANOTHER event β for things that depend on payment
// (e.g. driver earnings and receipts need the charge ID)
await broker.publish("payment.captured", {
rideId,
customerId,
chargeId: charge.id,
amount: fare,
});
});// Analytics service subscribes to ride.completed
// Completely independent β if analytics is slow or down, the ride still completes
broker.subscribe("ride.completed", async (event) => {
const { rideId, customerId, driverId, fare, completedAt } = event;
await analyticsDb.insert("trips", {
trip_id: rideId,
rider_id: customerId,
driver_id: driverId,
fare_usd: fare,
completed_at: completedAt,
recorded_at: new Date().toISOString(),
});
console.log(`[Analytics] Trip ${rideId} recorded`);
// If this fails, the ride.completed message goes to a dead-letter topic
// for retry β it doesn't affect payment or email at all
});// Review prompt service β completely decoupled
// When the team decides to change from SMS to push notifications,
// ONLY this file needs to change. RideService.js is untouched.
broker.subscribe("ride.completed", async (event) => {
const { rideId, customerId, driverId } = event;
// Prompt rider and driver simultaneously (independent of each other)
await Promise.all([
notificationService.send(customerId, {
type: "review_prompt",
rideId,
message: "How was your ride? Leave a quick rating.",
}),
notificationService.send(driverId, {
type: "review_prompt",
rideId,
message: "How was your passenger? Leave a quick rating.",
}),
]);
console.log(`[Reviews] Prompts sent for ride ${rideId}`);
});
What Changed β and Why
Failures are isolated
If the analytics database goes down, the analytics subscriber fails β and its message goes to the dead-letter queue for retry. The customer's ride completes, payment processes, and the email arrives on time. Nothing else breaks because nothing else depends on analytics completing first. Each subscriber owns its own failure mode.
Adding features is additive, not modifying
The product team wants to add loyalty points for completed rides. In the junior approach, someone adds a sixth await to the ride service β touching a critical, well-tested function. In the senior approach, someone creates a new LoyaltySubscriber.js file that subscribes to ride.completed. The ride service is never touched. This is the Open/Closed Principle in action: open for extension (new subscribers), closed for modification (existing publishers don't change).
The ride service is simple, tested, and stable
The ride service went from knowing about five external services to knowing about one (the broker). Its unit tests went from needing five mocks to needing one. When a new team is onboarded, they don't need to understand analytics database schemas or Stripe API calls to understand what happens when a ride completes. The core business logic is clean and legible.
Bottom Line β Synchronous vs Pub/Sub
The graduation moment: You've crossed from junior to senior thinking on pub/sub the day you stop asking "how do I call all the things that need to happen?" and start asking "what event just occurred β and who should decide independently what to do about it?" That mental shift β from procedure to event β is what event-driven architecture is built on.
Junior instinct is to chain all the side-effects sequentially in the triggering service β which works until any downstream service is slow or down. Senior thinking separates the event (ride completed) from the reactions (charge payment, update analytics, send email), publishing the event once and letting each concern subscribe independently. The producer gains simplicity, failures isolate cleanly, and new features are added by creating new subscribers rather than modifying the producer.
Senior Solution: Isolate failures with pub/sub
The root problem is synchronous coupling: each step in the chain must succeed for the next one to run. The senior fix is to decouple the ride completion from the side-effects. Publish one ride.completed event; let analytics subscribe and handle itself. If analytics fails, the broker sends the message to a dead-letter topic for later retry. The customer's core experience β ride recorded, payment charged, receipt sent β is unaffected.
// If THIS subscriber fails, only THIS subscriber retries.
// RideService, email, and review prompts are unaffected.
broker.subscribe("ride.completed", async (event) => {
try {
await analyticsDb.recordTrip(event.rideId, event.fare);
} catch (err) {
// In production: broker routes to dead-letter topic for retry
throw err; // tell broker this message was not processed
}
});
Senior Solution: Open/Closed via subscriptions
Every time a new team needs ride events, they add their logic to the ride service β breaking the Open/Closed Principle. The senior fix: the ride service publishes to a topic and never changes again. New teams add a new subscriber file. The ride service codebase never sees a pull request from the loyalty team, the insurance team, or the data science team. Their code is entirely in their own repositories.
Senior Solution: Return after the broker acknowledges
In the pub/sub approach, the latency the user feels is only the time to write the ride record to your own database plus the time for the broker to accept the message β typically a few milliseconds. All the downstream work (email, analytics, review prompts) happens completely in parallel, in separate processes, after the user has already received their confirmation. The user's experience is fast regardless of how slow or backlogged any downstream service is.
Evolution β How Messaging Got Here
Five eras from hand-rolled glue scripts to the event-driven substrate powering modern platforms.
Pub/sub didn't arrive fully formed. It evolved through five distinct eras, each one solving problems the previous era created. Understanding this history isn't trivia β it explains why every design decision in Kafka, RabbitMQ, and Google Pub/Sub looks the way it does. The quirks are answers to specific pain points, and once you know the pain, the answers make sense.
Each dot on that timeline is a step forward β and each step was forced by something that broke. Let's walk through all five.
Before standardised messaging libraries existed, teams who needed async communication built it themselves. The classic approach: one service writes a row to a database table called something like pending_jobs. Another service polls that table every few seconds, picks up rows, processes them, and deletes them. Job done. This is a queue β just a terrible one.
The why behind the polling: services running on different machines couldn't share memory or call each other reliably over the network. A database table was the only shared state both processes could access safely. It worked well enough for low-volume batch jobs in the 1990s. The problem arrived with scale.
The pain points that defined this era: Database polling hammers your DB with SELECT queries every few seconds even when there's nothing to process β wasted CPU and I/O. Rows held for long processing block other readers. When two consumers try to claim the same row, you need careful locking. And there's no fan-out: each row goes to exactly one consumer, so broadcasting an event to multiple services means writing multiple rows manually. By the late 90s, larger systems were groaning under this weight.
What changed and why: The core insight was that a database is not a message broker. A message broker is optimised for temporal delivery β get a message from A to B as fast as possible, then forget it. A relational database is optimised for persistent queries, transactions, and long-lived storage. Forcing one to do the other's job creates friction everywhere. This realisation drove the next era.
The Java Message Service specification (JMS 1.0, released in 1998) was the first serious attempt to standardise messaging in enterprise software. The idea was appealing: define a common API, and any compliant messaging provider β IBM MQ, TIBCO, BEA, ActiveMQ β can plug in underneath. Write your code once against the JMS API, swap providers if you want.
In practice, this worked only partially. JMS standardised the API β Session.createProducer(), MessageConsumer.receive() β but left the underlying wire protocol completely up to each vendor. IBM MQ used a proprietary binary protocol. TIBCO used a different one. ActiveMQ used OpenWire. If Service A ran on an IBM MQ broker and Service B ran on ActiveMQ, they couldn't talk without a bridge component. You were essentially vendor-locked to a single broker vendor per deployment.
JMS did standardise two important abstractions that survive today: queues (point-to-point β one producer, one consumer, each message consumed exactly once) and topics (pub/sub β one producer, many consumers, each subscriber gets their own copy). Every modern messaging system uses some form of this split, even if the names differ.
What changed and why: The vendor lock-in was a strategic problem. If you bet on TIBCO and they raised prices or went under, migrating was a multi-year project. What the industry needed wasn't just a common API β it needed a common wire protocol so different brokers could interoperate. That drove the next era.
The Advanced Message Queuing Protocol (AMQP)An open standard wire-level protocol for message-oriented middleware. Unlike JMS which only standardises the API, AMQP standardises the actual bytes on the wire β meaning any AMQP client can talk to any AMQP broker regardless of language or vendor. was born from frustration with vendor lock-in. JPMorgan Chase, Red Hat, Cisco, and others collaborated to define a common binary protocol that specified not just the API calls but the exact bytes on the wire. Any client library that speaks AMQP can talk to any AMQP broker. The era of proprietary silos was over.
RabbitMQ, first released in 2007, became the landmark implementation. It introduced a routing model that was genuinely more flexible than anything before it: the exchange + binding model. Publishers don't send messages directly to queues. They send to an exchange β a named routing component β which applies a binding rule to decide which queues receive the message.
Exchange Type
Routing Rule
Maps to What
direct
Route by exact routing key match
Point-to-point queue
fanout
Copy to ALL bound queues, ignore routing key
Pure pub/sub broadcast
topic
Match routing key patterns (wildcards * and #)
Topic-filtered pub/sub
headers
Match on message header attributes
Content-based routing
A fanout exchange is pure pub/sub: every queue bound to it gets a copy of every message, regardless of routing key. This is the model that directly inspired Google Pub/Sub's subscription model and many others that followed.
What changed and why: AMQP solved the interoperability problem but introduced a new one β the broker became the centre of the universe. Every message had to flow through the broker's exchange routing logic. At enormous scale (millions of messages per second), that single routing layer became a bottleneck. The broker also had limited memory: messages lived in RAM and maybe on disk, but once consumed they were gone. There was no concept of a subscriber "replaying" past messages. That gap became the motivation for the next era.
As cloud infrastructure matured, the big providers built managed pub/sub services that took the operational burden off engineering teams entirely. You didn't have to provision, patch, or scale your own broker β the cloud provider did it.
Google Cloud Pub/Sub (public beta March 2015, GA later that year) introduced a durable, at-least-once delivery model with configurable retention and a clean separation of topics (where publishers write) and subscriptions (where consumers read). Multiple subscriptions can attach to one topic, each maintaining its own independent cursor β so adding a new consumer doesn't affect existing ones at all. This addressed one of RabbitMQ's pain points: binding queues to exchanges was manual plumbing. In GCP Pub/Sub, you just create a new subscription and it starts receiving all future messages automatically.
AWS SNS (Simple Notification Service) took a different angle: fan-out to heterogeneous endpoints. One SNS topic can fan out to SQS queues, Lambda functions, HTTP endpoints, email addresses, and SMS simultaneously. The same order.placed event can trigger a Lambda that processes it in real time, land in an SQS queue for a worker that processes it asynchronously, and send an SMS alert to an on-call engineer β all from a single publish. This pattern became one of the most widely deployed event-driven architectures on AWS.
Redis Pub/Sub filled a different niche: ultra-low-latency, in-process fan-out where durability doesn't matter. Redis Pub/Sub has no message storage β if a subscriber isn't connected at the moment a message is published, that message is lost. But for use cases like real-time presence updates, live chat fan-out within a single data centre, or cache invalidation signals, that trade-off is fine. The gain is sub-millisecond delivery with zero broker overhead beyond what Redis already provides.
What changed and why: Cloud-native pub/sub made the pattern accessible to teams that couldn't or wouldn't operate their own broker infrastructure. But all these systems shared one limitation: they were fundamentally "fire and forget" from the broker's perspective β once a message was delivered and acknowledged, it was gone. You couldn't rewind to last Tuesday's events to replay them through a new service. The 2008 financial crisis and subsequent regulatory requirements (audit trails, reproducible computation) made replay a hard requirement in financial and analytics workloads. That gap drove Kafka.
Apache Kafka was created at LinkedIn and open-sourced in 2011. The key insight behind it was simple but profound: instead of treating messages as ephemeral items to be consumed and deleted, treat them as entries in an immutable, append-only log. A log keeps everything. New consumers can subscribe to the beginning of the log, not just the tail. Old consumers can replay from any point. The data is the log; the log is the truth.
This one design decision β log as the substrate β changed what was possible:
Replay: Deploy a new analytics service today and back-fill from 90 days of events. No special export needed β the data is already there.
Multiple consumer groups: Each consumer group maintains its own offset pointer in the log. They're completely independent. You can have a real-time processing group at offset 10,000,000 and a batch analytics group at offset 5,000,000 simultaneously, neither affecting the other.
Partitioned parallelism: Topics are split into partitions, each an independent log shard. Multiple consumers in a group can read different partitions in parallel, scaling throughput linearly with partition count.
Durability as a side effect: The log is just files on disk. Replication across brokers makes it fault-tolerant. Retention policies (time-based or size-based) control how far back you can go.
The convergence: Kafka didn't replace pub/sub β it expanded what pub/sub could do. Classic pub/sub (RabbitMQ, SNS) is great when you want fire-and-forget fan-out with low latency and no need to replay. Log-based pub/sub (Kafka) is great when the event history itself is valuable, when you need multiple independent consumers with different paces, or when you want to build stream processing pipelines. Modern architectures often use both: Redis Pub/Sub for sub-second in-memory fan-out, Kafka as the durable event backbone, SNS for cross-system notifications.
Era
Representative Tool
Key Innovation
Key Limitation
1 β Late 90s
DB polling tables
Async decoupling without a separate broker
Hammers DB, no fan-out, terrible throughput
2 β Early 2000s
JMS / ActiveMQ
Standardised API, queues + topics
Proprietary wire protocols, vendor lock-in
3 β ~2007
AMQP / RabbitMQ
Open wire protocol, flexible exchange routing
No message replay, broker memory limits
4 β 2010β2015
GCP Pub/Sub, SNS, Redis
Managed, no ops, heterogeneous fan-out
Fire-and-forget only, no replay
5 β 2011+
Apache Kafka
Immutable log, replay, partitioned parallelism
Higher operational complexity, not "instant forget"
Messaging evolved through five eras: (1) hand-rolled DB polling tables β simple but slow; (2) JMS standardised the API but not the wire, causing vendor lock-in; (3) AMQP/RabbitMQ standardised the wire protocol and introduced flexible exchange routing; (4) cloud-native services (GCP Pub/Sub, SNS, Redis) eliminated ops burden and enabled heterogeneous fan-out; (5) Kafka's log-based model added replay, independent consumer groups, and durable event history. Each step fixed the previous era's deepest pain point.
Internals β How a Broker Actually Routes a Message
Open the broker's hood: topic registry, subscriber lists, push vs pull, ack/redelivery, ordering, and partitioning β all in plain English first.
From the outside, a broker looks like a simple relay: messages go in on one end, messages come out the other. But inside, a broker is making a surprising number of decisions per message β which subscribers should receive it, how to deliver it, what to do if delivery fails, how to maintain ordering, and how to handle a subscriber that's falling behind. Understanding these internals is what separates engineers who use pub/sub from engineers who can debug it when it breaks at 3am.
The Topic Registry β The Broker's Routing Table
At its core, a broker maintains a topic registry β a map that says "topic X has subscribers A, B, and C." When a publisher sends a message tagged with a topic name, the broker looks up that topic in the registry and gets back the list of subscribers who should receive it. That lookup is the entire routing algorithm for basic pub/sub.
Why is this stored in a registry rather than derived on the fly? Because subscribers register interest ahead of time. When a service starts up, it makes a subscribe("order.placed") call to the broker. The broker records that subscription. Later, when a message arrives for order.placed, the broker already has the full subscriber list ready β no need to discover subscribers at delivery time. This pre-registration is what makes fan-out fast and what enables the broker to buffer messages for offline subscribers.
The diagram shows the five-step path a message takes through the broker. Step 1: the publisher sends the message. Step 2: the broker persists it and sends an acknowledgement back to the publisher β from this point on, the broker owns the message. Step 3: the broker looks up the topic in its registry to find the subscriber list. Step 4: the fan-out engine creates one copy of the message per subscriber and places each copy in that subscriber's delivery queue. Step 5: the broker delivers each copy and waits for an acknowledgement from each subscriber. Only when a subscriber acks does the broker remove the message from that subscriber's queue. No ack within the timeout window means redelivery.
Push vs Pull β Two Ways to Get Messages Out
Once a message is sitting in a subscriber's delivery queue, the broker needs to get it to the subscriber's code. There are two fundamentally different approaches to this, and they trade off throughput, latency, and back-pressure control.
Think of push vs pull like a waiter vs a cafeteria line. Push is the waiter bringing food to your table as soon as it's ready β you didn't ask, it just arrives. Pull is the cafeteria: you go get food when you're ready for it. Push feels faster but can overwhelm you if the kitchen is faster than you can eat. Pull gives you control but requires you to keep asking "is there more?"
Push Delivery
Pull Delivery
How it works
Broker calls the subscriber's endpoint (HTTP callback, WebSocket, long-poll callback) when a message arrives
Subscriber sends a request to the broker asking "give me up to N messages"
Latency
Lowest β message arrives at subscriber as soon as it hits the broker
Slightly higher β subscriber must poll; messages sit until the next poll cycle
Back-pressure
Hard β if broker pushes faster than subscriber can process, subscriber buffers overflow
Natural β subscriber controls the rate by how often and how much it asks for
Low-volume, latency-sensitive workloads where the subscriber can keep up
High-throughput workloads where the consumer controls its own pace
Kafka uses pure pull. There's a deep reason for this: Kafka serves millions of messages per second across thousands of partitions. If the broker tried to push to every consumer simultaneously, it would need to track each consumer's processing capacity and throttle accordingly β enormous coordination overhead. With pull, each consumer group polls at its own pace. The broker just waits. A slow consumer has no impact on fast consumers, because each holds its own independent offset pointer.
Acknowledgements and Redelivery β The At-Least-Once Problem
Here's a scenario that causes subtle production bugs: a subscriber receives a message, starts processing it, crashes halfway through, and never sends an ack. What should the broker do?
The answer in virtually every production pub/sub system: redeliver. The broker has a timeout (e.g., 30 seconds). If no ack arrives within that window, the broker assumes the message was lost and puts it back in the subscriber's queue. The subscriber restarts, receives the message again, and processes it successfully. This is called at-least-once deliveryA delivery guarantee where every message is delivered to every subscriber at least once, but possibly more than once if a failure happens between processing and acknowledgement. Contrast with at-most-once (may be lost, never duplicated) and exactly-once (never lost, never duplicated β requires transactions)..
At-least-once means your handler WILL run more than once for the same message. This isn't a bug in the broker β it's an intentional safety trade-off. The alternative (at-most-once) would mean some messages could be silently dropped if a subscriber crashes. For most production systems, a duplicate is better than a missing event. But your subscriber code must be written to handle duplicates safely β a property called idempotency. An idempotent handler produces the same result whether it runs once or ten times for the same message. This is one of the most important design rules in pub/sub systems.
Ordering β When Does It Matter and When Can You Get It?
Ordering is one of the trickiest guarantees in pub/sub. The short answer: most pub/sub systems give you ordering within a partition or queue, but not across them. Here's why.
If a topic is served by a single broker thread in a single queue, messages arrive and are delivered in the order they were published β FIFO. Simple. But real systems fan out across multiple broker nodes and multiple subscriber instances for throughput. The moment you introduce parallelism, strict global ordering breaks β message A might be processed by subscriber instance 1 and message B by instance 2, and instance 2 might finish first even though B was published after A.
The standard solution is partition-key ordering: messages that must be ordered relative to each other share the same partition key. For example, all events for user #5001 carry the key user:5001. They all hash to the same partition, which is assigned to exactly one consumer in the consumer group. Within that partition, they're FIFO. Events for user #9999 are in a different partition and can process in parallel without interfering. You get per-entity ordering without sacrificing cross-entity throughput.
Partitioning β The Throughput Multiplier
A single broker thread handling a single topic queue will eventually hit a ceiling β the throughput of one CPU core reading from one disk. PartitioningSplitting a topic into multiple independent shards (partitions), each handled by a different broker node and consumed by a different subscriber instance. Partitions are the mechanism by which pub/sub systems scale throughput horizontally. solves this by splitting a topic into N independent sub-streams. Each partition is a completely independent queue on a different broker node. Publishers hash messages to partitions by key. Subscriber instances in a consumer group are each assigned a subset of partitions β so with 12 partitions and 4 subscriber instances, each instance handles 3 partitions in parallel.
The rule that governs scaling: maximum parallelism = number of partitions. You can add subscriber instances beyond partition count, but the extra instances sit idle with no partitions to read from. They serve only as hot standbys β ready to take over if an active instance dies and triggers a rebalance.
Inside a broker, messages flow through five stages: receive and persist, topic registry lookup, fan-out copy creation, per-subscriber queue delivery, and ack/redelivery management. Push delivery minimises latency; pull delivery gives subscribers natural back-pressure control β Kafka uses pull exclusively. At-least-once delivery is the default safety guarantee: brokers redeliver any message that isn't acked within the timeout, so subscriber handlers must be idempotent. Ordering is guaranteed only within a partition; use consistent partition keys to get per-entity ordering. Partitioning is the horizontal scaling mechanism β throughput scales with partition count, not subscriber count.
When To Use Pub/Sub
Decision framework, use-cases, and when pub/sub is the wrong tool β with a branching flowchart to wire it into your instincts.
Pub/sub solves a specific class of problems very well and handles others poorly. The mistake most engineers make is reaching for it by default because it sounds "scalable." Before picking up a broker, you need to be able to answer two questions: does this situation actually have multiple independent consumers? And does the publisher genuinely not need a response? If both answers are yes, pub/sub belongs in your design. If either is no, you're likely better served by something simpler.
Walk through those three questions for any design decision and you'll catch the majority of bad pub/sub uses before they're built. The pattern below shows both sides of the coin β the situations where pub/sub shines and where it creates more problems than it solves.
The three-question heuristic: If you can't name at least two independent services that need this event, skip the broker. If the caller needs the result before returning a response, use a direct call. If you're writing to a single consumer for ordering reasons, a queue (not pub/sub) is the right shape.
Use pub/sub when one event genuinely needs to trigger multiple independent downstream reactions, when you want producer-consumer decoupling, or when you need a durable event log. Avoid it when you need a synchronous reply, when there's only one consumer, when you need transactional rollback semantics, or when your handler cannot be made idempotent. The three-question flowchart β multiple consumers? needs reply? needs global ordering? β catches most design mistakes before they're built.
Pub/Sub vs Alternatives
Direct API calls, message queues, event streaming, webhooks, and the Observer pattern β how pub/sub compares and when to pick each one.
Pub/sub is one tool in a family of communication patterns. Choosing the wrong one doesn't cause immediate explosions β it causes slow architectural drift toward a system that's hard to change and harder to debug. The comparison below focuses on the shape of each pattern β who knows about whom, what happens on failure, and what you can do that you can't do with the others.
The visual above shows the five patterns at a glance β their topology tells you most of what you need to know. Now let's go deeper on each comparison.
The core distinction: pub/sub is for events that happened (past tense, notification, no response needed). Direct API calls are for commands that need a result (future tense, query, response required). If your operation is "send order placed notification," pub/sub is right. If it's "calculate the tax for this cart and return the result," a direct call is right.
This is the most commonly confused pairing. A message queue is a work distribution tool β you have a pool of workers and you want each job processed by exactly one of them. Pub/sub is a broadcast tool β you have an event and you want every interested party to know about it. They solve different problems. Kafka is interesting because it can do both: within a consumer group, it acts like a queue (each partition assigned to one consumer). Across consumer groups, it acts like pub/sub (each group gets all messages independently).
Think of classic pub/sub (Redis Pub/Sub, SNS) as a phone call: you need to be listening when the message comes in. Event streaming (Kafka) is a voicemail inbox: messages are stored, you listen whenever you're ready, and you can re-listen as many times as you want. If your use case requires replaying past events or catching up after downtime, you need event streaming. If you need sub-millisecond delivery and don't care about replay, classic pub/sub wins.
Webhooks are pub/sub's HTTP-native cousin, commonly used at API boundaries between different companies (Stripe calls your webhook when a payment succeeds; GitHub calls your webhook when a commit is pushed). Inside a single system, webhooks are the wrong choice because you don't get broker-managed retry, delivery guarantees, or fan-out without significant plumbing. At API boundaries where the subscriber is an external service you don't control, webhooks are standard and appropriate.
The Observer pattern (a Gang of Four behavioural pattern) is pub/sub's conceptual ancestor, but they live in different worlds. Observer is within a single process β a button click notifies listeners in the same browser tab. Pub/sub is across processes β an order service on Machine A notifies subscribers on Machines B, C, and D. The idea is identical; the implementation, durability, and scale are completely different.
Pub/sub is one tool in a family. Use a direct API call when you need a synchronous reply. Use a point-to-point queue when distributing work across competing workers. Use event streaming (Kafka) when you need replay, event history, or consumer groups at different offsets. Use webhooks at API boundaries with external systems. Use the Observer pattern for in-process event notification within a single codebase. Pub/sub sits squarely in the "one event, many independent async reactions" space β that's the shape it's designed for.
Real Companies β Pub/Sub at Scale
Five real systems that use pub/sub as a load-bearing architectural component β and the specific reasons each one reached for it.
Abstract patterns become concrete when you see them carrying real production load. The five examples below aren't just "they use Kafka" β each one illustrates a specific pub/sub property (fan-out, decoupling, replay, heterogeneous consumers, or low-latency delivery) that was the deciding factor in the architecture.
Numbers note: Where specific throughput figures are not verifiable from primary public sources, descriptions use approximate language. The architectural decisions are drawn from published engineering blog posts and conference talks.
When a user sends a message in a Slack channel with 500 members, Slack needs to push that message to every active member's WebSocket connection in near real-time. The naive approach β have the message handler look up all 500 member sessions and push to each one directly β doesn't scale when you have millions of concurrent users spread across thousands of servers.
Slack's approach (described in their engineering blog) uses an internal pub/sub system layered on top of stateful Channel Servers and Gateway Servers. When a user sends a message, the Gateway Server holding their connection routes it to the appropriate Channel Server (chosen by consistent hashing on channel ID). The Channel Server then fans the message out to every Gateway Server worldwide that has subscribed clients in that channel. Each Gateway Server pushes over its own open WebSocket connections. Internally, Slack uses Kafka for durable buffering and Redis for in-flight, fast-access job data β but the fan-out primitive itself is their own pub/sub abstraction over these stores.
Why pub/sub was the right shape here: The publisher (the server handling the sending user) has no knowledge of which other servers have active connections for channel members β that's constantly changing as users connect and disconnect. Pub/sub lets the publisher broadcast once; each server self-selects based on what it has subscribed to. An in-memory, low-durability fan-out primitive works well here because the cross-server hop is within Slack's data centres where latency is sub-millisecond, and if a server briefly misses a message the client fetches history from a separate persistent store on reconnect.
The trade-off acknowledged: An in-memory pub/sub layer typically has no message persistence β if a server is temporarily disconnected and misses a published message, that message is gone from the in-memory broker. Slack handles this with a separate history storage layer (backed by Kafka and a chat-history datastore) that clients can query to catch up. The pub/sub layer handles the "right now" fan-out; durable storage handles the "what did I miss" case. These are separate concerns with separate tools.
An Uber trip involves dozens of state transitions: driver goes online, driver goes nearby, driver accepts trip, driver arrives, trip starts, waypoints update, trip ends, payment processes. Multiple services need to react to each of these transitions: the mobile app needs to update the map, the pricing service needs to update the fare estimate, the safety system needs to log the event, the analytics pipeline needs the data.
This is a textbook pub/sub use case. Each state transition is published as an event to a topic (e.g., trip.driver.arrived). Services that care β map rendering, fare calculation, safety logging, analytics β subscribe to the events they need. The dispatch system that generates these events doesn't call each downstream service directly. It publishes and walks away.
Why this matters at Uber's scale: Uber processes millions of trips simultaneously, with driver location updates arriving multiple times per second per trip. At that rate, any direct-call architecture between services would create a cascading failure risk β one slow downstream service would back up the dispatch system. With pub/sub, each subscriber operates at its own pace. The analytics pipeline can fall behind during a traffic spike and catch up when load drops, with no impact on the real-time map updates that riders see. The concerns are genuinely independent, and pub/sub enforces that independence architecturally.
Uber's engineering team has published extensively on their use of Apache Kafka as the backbone for these event pipelines, with Kafka's log retention enabling the analytics and data science teams to replay event history for model training and back-testing.
LinkedIn created Apache Kafka specifically to solve their own messaging problem at scale. Before Kafka, LinkedIn had a fragmented landscape: ad hoc point-to-point integrations between systems, each with its own data pipeline. Every time a new data consumer needed activity data (for the news feed, for recommendations, for analytics), a new pipeline was built from scratch to connect to the source. The complexity grew combinatorially.
Kafka changed the shape of that problem. Instead of connecting each consumer to each producer directly, every producer publishes to Kafka topics. Every consumer reads from Kafka topics. The number of integrations drops from O(producers Γ consumers) to O(producers + consumers). LinkedIn uses Kafka for activity stream data (page views, searches, profile visits), infrastructure metrics, audit logs, and as the backbone for their data warehouse ingestion pipeline.
Why the log model specifically: LinkedIn's data science teams needed to train recommendation models on historical activity data. With a traditional pub/sub system (fire-and-forget), historical events would be gone once consumed. Kafka's log retention means the machine learning team can spin up a new model training job and replay months of activity data from the beginning of the log β no special export, no batch dump, just a new consumer group starting at offset 0. The event history is a permanent asset, not a transient stream.
This is the pub/sub property LinkedIn needed most: not just fan-out, but replayable fan-out where new consumers can join the conversation retroactively.
Google Cloud Spanner β Google's globally distributed relational database β exposes a change streams feature that lets you subscribe to all inserts, updates, and deletes on a table in near real time. The typical production pattern (and the one Google provides a managed Dataflow template for) is to run a Dataflow pipeline that uses the Apache Beam SpannerIO connector to read change records from Spanner and forward them to a GCP Pub/Sub topic. Your applications then subscribe to that Pub/Sub topic and receive the stream of changes as they happen.
This is a compelling demonstration of pub/sub's "producer doesn't know its consumers" property. The Spanner-to-Pub/Sub pipeline doesn't know whether you're using the change feed to sync a downstream read replica, to trigger a Cloud Function, to feed an analytics pipeline, or to implement event sourcing. It just publishes changes. The GCP Pub/Sub layer fans those changes out to however many subscribers you've set up, each processing at their own pace with their own independent subscription cursor.
Why GCP Pub/Sub specifically: GCP Pub/Sub's subscription model means each consumer gets its own independent delivery queue and ack state. If your analytics pipeline falls behind, that doesn't affect your cache-warming subscriber at all. And because GCP Pub/Sub retains unacked messages for a default of seven days (configurable up to 31 days), a subscriber that goes offline for a few hours can catch up when it recovers β without any special handling upstream. The durability and independent-cursor model is precisely what makes this pattern reliable.
One of the most widely deployed pub/sub patterns in AWS architectures is the SNS β SQS fan-out. The pattern works like this: you create one SNS topic for an event type (e.g., order-placed). You create multiple SQS queues β one per consumer service (email service queue, inventory service queue, analytics queue). You subscribe each SQS queue to the SNS topic. When a publisher sends a message to the SNS topic, SNS fans it out to all subscribed SQS queues simultaneously. Each consumer service reads from its own queue independently.
This is the canonical pattern for decoupling services in AWS. Why does SNS sit in front of SQS instead of publishing directly to SQS? Because SNS handles the fan-out. If you published directly to SQS, you'd have to make one publish call per consumer queue, and you'd need to know about all consumer queues. With SNS in front, you publish once and SNS does the fan-out. New consumer teams just subscribe their queue to the SNS topic β the publisher never changes.
SNS can also fan out to Lambda functions, HTTP endpoints, email addresses, and SMS simultaneously from the same topic. A payment failure event can simultaneously trigger a Lambda for automated retry logic, drop a message in an SQS queue for the fraud analysis worker, and send an SMS to the on-call engineer β all from one SNS publish call.
The pattern in one sentence: SNS is the broadcaster; SQS is the durable buffer. SNS gives you fan-out without storage; SQS gives you storage without fan-out. Together they give you both.
Five real systems, five different pub/sub properties. Slack uses an internal pub/sub layer (Channel Servers fanning out to Gateway Servers) for sub-millisecond in-region delivery, backed by Kafka for durability. Uber uses Kafka-backed pub/sub to decouple dispatch events from the many services that react to them, enabling independent scaling and failure isolation. LinkedIn built Kafka specifically to replace O(NΒ²) direct integrations with an O(N) log-based architecture, with replay as a first-class capability. GCP Pub/Sub is the standard delivery layer for Spanner change streams (typically via a Dataflow pipeline), separating change producers completely from downstream consumers (pipelines, functions, replicas). AWS SNS+SQS is the standard fan-out pattern for AWS architectures, where SNS handles broadcast and SQS provides per-consumer durability.
Production Bug Case Studies
Three bugs that show up repeatedly in real pub/sub systems β and the thinking you need to prevent and diagnose them.
Pub/sub systems fail in predictable ways. The same three failure modes appear in almost every large-scale deployment. Understanding each one β what caused it, what the symptoms looked like, and how to fix it β is worth more than any theoretical knowledge about brokers. These are the bugs that page on-call engineers at 3am.
Incident Summary: A downstream analytics service begins processing messages 10Γ slower than usual (a database query it depends on acquired a slow lock). The broker's delivery queue for that subscriber grows. Memory usage on the broker climbs. Other topics on the same broker begin experiencing delivery latency. Eventually the broker starts refusing new publishes and the entire event pipeline stalls β even for topics the slow subscriber has nothing to do with.
What Went Wrong
The root cause is a mismatch between publish rate and consume rate. When a subscriber processes messages slower than they arrive, the broker must buffer the undelivered messages somewhere. In most brokers, that buffer is bounded β either by memory limits or disk quotas. When the buffer fills, the broker has two bad choices: drop new messages (losing data) or apply back-pressure to publishers (slowing down the entire system).
What makes this particularly nasty is the blast radius. The analytics service subscribed to order.placed is slow. But because all topics live on the same broker cluster and share the same memory pool, the broker's slowdown affects every other publisher and subscriber on the cluster β including the email service and inventory service that are completely healthy and working fine. One slow subscriber degrades the entire messaging infrastructure.
# analytics_consumer.py β the slow subscriber
# No timeout, no concurrency, processes one message at a time
# This is the code that causes the cascade
import broker_client
def handle_order_placed(msg):
# This function occasionally takes 10+ seconds when the
# data warehouse is under load β nobody noticed in dev
result = slow_data_warehouse_query(msg["orderId"])
write_to_analytics_store(result)
# ACK sent only AFTER all the work is done.
# During a DW slowdown: 1 msg takes 10s = 6 msg/min
# But publisher sends 600 msg/min β queue grows 594/min
broker_client.ack(msg)
# Single-threaded subscription β one slow message blocks all others
broker_client.subscribe("order.placed", handle_order_placed)
broker_client.start() # blocking loop, single thread# analytics_consumer.py β fixed version
# Three changes: concurrency, timeout guard, separate slow path
import broker_client
import asyncio
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=20) # bounded concurrency
def handle_order_placed(msg):
try:
# 1. ACK immediately (or set a short visibility timeout)
# The message is safe in the DLQ if processing fails
broker_client.ack(msg)
# 2. Do slow work AFTER acking, non-blocking
# If DW is slow, this backs up in our local executor β
# NOT in the broker's queue
executor.submit(process_analytics_async, msg)
except Exception as e:
# 3. On failure, message goes to dead-letter queue
# for manual inspection, not silently dropped
broker_client.nack(msg, dead_letter=True)
log.error(f"analytics processing failed: {e}")
def process_analytics_async(msg):
# Runs in thread pool β DW slowness only fills our local pool
# not the broker queue. Bounded by max_workers=20.
result = slow_data_warehouse_query(msg["orderId"])
write_to_analytics_store(result)
# ALSO: set a max inflight / consumer quota on the broker side
# so this subscriber can never accumulate >10k unacked messages
# broker_client.set_max_inflight(10_000)
broker_client.subscribe("order.placed", handle_order_placed)
broker_client.start()Lesson: A slow subscriber is a broker resource leak. The fix has two layers: (1) decouple the ack from the slow processing so the broker's queue drains at message-receive speed, not processing speed; (2) bound your local concurrency so the slow work backs up in your own thread pool, not the broker's memory. Add a max_inflight or consumer quota at the broker level as a safety net.
How to Spot: Watch broker queue depth per subscriber. If one subscriber's queue grows monotonically while others are flat, that subscriber is the bottleneck. Set alerts on "consumer lag" (Kafka) or "approximate number of messages not visible" (SQS) per subscription. A steadily growing lag that doesn't recover is the early warning signal β intervene before it fills broker memory.
Incident Summary: A payment service subscribes to checkout.completed events and charges the customer's card. One evening, a network blip between the broker and the payment service causes an ack to be lost. The broker redelivers the message 30 seconds later. The payment handler runs again for the same checkout event. The customer is charged twice. The money is real; the duplicate charge triggers a wave of support tickets.
What Went Wrong
This is the classic at-least-once delivery collision. The broker guarantees every message will be delivered at least once β meaning it may deliver the same message multiple times if an ack is lost or the subscriber crashes between processing and acknowledging. This is the correct, safe default: the broker would rather send twice than not at all.
The bug is not in the broker. The bug is in the handler: it performs a side effect (charging a card) without checking whether that side effect has already been performed for this specific message. A handler that produces the same outcome whether it runs once or ten times is called idempotentFrom mathematics: a function is idempotent if applying it multiple times produces the same result as applying it once. In distributed systems, an idempotent operation is one that can be safely retried without changing the outcome beyond the first application.. The payment handler was not idempotent, so redelivery became a double charge.
// payment-handler.js β non-idempotent, dangerous with at-least-once
async function handleCheckoutCompleted(msg) {
const { checkoutId, customerId, amount } = msg;
// BUG: no deduplication check.
// If this message is delivered twice (at-least-once), we charge twice.
await chargeCard(customerId, amount);
// BUG: ack AFTER the charge, but if network drops the ack,
// broker redelivers and we charge again.
await broker.ack(msg);
console.log(`Charged customer ${customerId} for checkout ${checkoutId}`);
}
broker.subscribe("checkout.completed", handleCheckoutCompleted);// payment-handler.js β idempotent via deduplication table
const processedCheckouts = new Map(); // In production: Redis or DB table
async function handleCheckoutCompleted(msg) {
const { checkoutId, customerId, amount } = msg;
// FIX 1: Idempotency check β have we processed this checkout before?
// Use checkoutId as the idempotency key (stable across redeliveries)
const alreadyProcessed = await db.query(
`SELECT 1 FROM processed_events WHERE event_id = ?`,
[checkoutId]
);
if (alreadyProcessed) {
// Duplicate delivery β safe to ack and skip
console.log(`Skipping duplicate for checkout ${checkoutId}`);
await broker.ack(msg);
return;
}
// FIX 2: Mark as processed AND charge in the same DB transaction
// This is atomic: either both happen or neither does
await db.transaction(async (tx) => {
await tx.query(
`INSERT INTO processed_events (event_id, processed_at) VALUES (?, NOW())`,
[checkoutId]
);
await chargeCard(customerId, amount, { idempotencyKey: checkoutId });
// Most payment APIs (Stripe, etc.) accept an idempotency key directly
});
await broker.ack(msg);
console.log(`Charged customer ${customerId} for checkout ${checkoutId}`);
}
broker.subscribe("checkout.completed", handleCheckoutCompleted);Lesson: At-least-once delivery is a feature, not a bug. The broker is doing the right thing by redelivering unacked messages β it's protecting you from silent data loss. The contract is: you must make your handler idempotent. Use a stable message ID (the broker provides one, or embed one in your event schema) as an idempotency key. Check before acting, or use a transaction that atomically records "processed" and performs the side effect. Most payment APIs (Stripe, Adyen) support idempotency keys natively for exactly this reason.
How to Spot: Duplicate record creation, double charges, duplicate emails, or duplicate database rows that share the same business key are all symptoms of a non-idempotent handler meeting at-least-once delivery. Monitor for event processing counts per message ID. If you see any message ID with a processing count greater than 1, your handler is not idempotent. Add an idempotency table to your schema as standard practice in any system that consumes from a pub/sub broker.
Incident Summary: A team builds a notification system where each user gets their own personal topic: user.notifications.{userId}. With 2 million users, the broker is managing 2 million topics. Each topic has metadata (name, subscriber list, configuration) stored in memory on the broker. The broker's metadata memory climbs past its heap limit. It starts failing to create new topics. Then existing topic lookups begin timing out. Then the broker crashes under the metadata load.
What Went Wrong
Topics are not free. Every topic is a named entry in the broker's metadata store, and that metadata lives in memory for fast routing. Each entry includes the topic name, the subscriber list, configuration (retention, partition count, etc.), and offset tracking information. Multiply that by millions of topics and you've turned your broker into an in-memory metadata store that's growing without bound.
The underlying design mistake is treating pub/sub as a mailbox system. Pub/sub is designed for channels where messages of a particular type flow β all order events, all payment events. It's not designed for per-entity personal inboxes. That's what a push notification service or a dedicated inbox system is for.
# notification_publisher.py β WRONG: per-user topics
class NotificationService:
def send_notification(self, user_id: int, message: str):
# BUG: creates a unique topic for EVERY user
# With 2M users, the broker will have 2M topics in its metadata store
topic_name = f"user.notifications.{user_id}"
# Broker creates this topic if it doesn't exist yet
# Each new user registration β +1 topic permanently in broker metadata
broker.publish(topic_name, {
"userId": user_id,
"message": message,
"timestamp": time.time()
})
def subscribe_for_user(self, user_id: int, handler):
# Each WebSocket connection creates a new topic subscription
# Every subscription is metadata on the broker
topic_name = f"user.notifications.{user_id}"
broker.subscribe(topic_name, handler)# notification_publisher.py β CORRECT: shared topic with routing in message
class NotificationService:
TOPIC = "user.notifications" # ONE topic, forever, no matter how many users
def send_notification(self, user_id: int, message: str):
# Publish to the shared topic with userId as a message attribute
# Broker routes to the right subscriber group; workers filter by userId
broker.publish(self.TOPIC, {
"userId": user_id,
"message": message,
"timestamp": time.time()
}, attributes={"userId": str(user_id)})
# Some brokers (GCP Pub/Sub, SNS) support server-side filtering
# on message attributes β so workers only receive their subset
def subscribe_worker(self, handler):
# ONE subscription shared across all user IDs
# Worker pool reads from this single topic and filters in application code
broker.subscribe(self.TOPIC, handler)
# notification_worker.py
def handle_notification(msg):
# If your broker doesn't support server-side attribute filtering,
# filter here in the application layer β cheap, O(1) check
target_user_id = int(msg["userId"])
# Look up WebSocket connection for this user (from connection registry)
ws = connection_registry.get(target_user_id)
if ws and ws.is_connected():
ws.send(msg["message"])
# If no active connection: drop or store in inbox DB for later retrievalLesson: Topics represent event types, not entity identities. A topic called user.notifications is correct β it describes what kind of event it carries. A topic called user.notifications.5001 is wrong β it conflates a channel name with a business entity ID. The routing logic (who gets which message) belongs in message attributes, subscriber filters, or application-level code β not in topic names. As a rule of thumb: if your topic name contains a user ID, order ID, or any runtime-generated key, you have a topic explosion waiting to happen.
How to Spot: Monitor total topic count on your broker over time. If the count is growing proportionally to your user base or record count, you have the per-entity topic anti-pattern. In Kafka, watch for ZooKeeper or KRaft metadata memory growth correlated with topic creation rate. In RabbitMQ, watch queue count in the management UI. Alert when total topic/queue count grows faster than a constant baseline β it should be roughly stable after your application is deployed, not growing with data volume.
Three bugs that show up repeatedly in production pub/sub systems. Bug 1: slow subscribers cause broker memory pressure that cascades to unrelated topics β fix with concurrent consumers, early ack, and consumer lag monitoring. Bug 2: at-least-once delivery meets non-idempotent handlers and creates duplicate side effects β fix by using the message ID as an idempotency key and recording processed events before (or atomically with) the side effect. Bug 3: per-entity topic naming causes broker metadata explosion β fix by using one topic per event type and routing by message attributes or application-level filtering.
Pitfalls & Anti-Patterns
Five mistakes that show up in real pub/sub incidents β what broke, why it broke, and how to fix it before it breaks yours.
Most pub/sub bugs aren't subtle. They're the same five mistakes, repeated across teams, discovered at 2am after a production incident. Each one has a clear cause, a clear consequence, and a clear fix. Learn them here so you don't have to learn them on-call.
The mistake: When building a notification system, it feels natural to give each user their own topic β notifications.user.4821, notifications.user.4822, and so on. Publishers address users directly. Nice and tidy, right?
Why it's bad: Brokers aren't free to host topics. Every topic carries metadata overhead β in Kafka, a topic with 3 partitions and 3 replicas means 9 log segments on disk, 9 entries in the metadata log, 9 file handles per broker. Multiply by a million users and you've given the controller millions of metadata entries to manage. Cluster startup, leader election, and rebalances all slow to a crawl. Brokers OOM. This is called topic proliferation, and it's the fastest way to kill a Kafka cluster.
Fix: Use a single shared topic (e.g., notifications) and embed the user ID as a field in the message payload. Subscribers filter by user ID on their side, or you shard by a small number of partitions keyed on user ID hash. One topic replaces a million. Metadata overhead drops from catastrophic to negligible.
// BAD β dynamic topic per user, millions of topics created at runtime
async function sendNotification(userId, message) {
const topicName = `notifications.user.${userId}`; // π 1 topic per user
await broker.createTopicIfNotExists(topicName); // broker metadata grows unbounded
await broker.publish(topicName, message);
}
// GOOD β single topic, userId lives in the message payload
async function sendNotification(userId, message) {
await broker.publish("notifications", { // one stable topic
userId, // recipient is a field, not a topic name
...message
});
}
// Subscriber filters on userId β no topic explosion
broker.subscribe("notifications", (msg) => {
if (msg.userId === currentUserId) handleNotification(msg);
});
The mistake: Using a high-cardinality but heavily skewed key for partitioning β for example, routing all messages from your most popular creator to the same partition in a social media feed. Or routing by country when 80% of your users are in one country.
Why it's bad: Partitions are the unit of parallelism in log-based brokers. If one partition receives ten times the throughput of the others, the consumer assigned to that partition is overwhelmed while every other consumer sits mostly idle. You've wasted your horizontal scaling investment. Worse, a lagging hot partition backs up the entire topic if consumers checkpoint offsets linearly β downstream services see growing lag that no amount of consumer scaling can fix, because you can't add more consumers than partitions.
Fix: Add a random salt suffix to the partition key (e.g., creatorId + "-" + random(0,4)) to spread traffic across partitions. If ordering per entity matters, use a scatter-gather consumer that reads from all partitions and re-merges by sequence number. Alternatively, split the hot entity into its own dedicated topic with a higher partition count.
// BAD β key is highly skewed (big creator gets millions of events)
broker.publish("feed-events", event, {
partitionKey: event.creatorId // π celebrity creatorId β always same partition
});
// GOOD β salt the key to distribute load across partitions
const salt = Math.floor(Math.random() * 4); // spread across 4 "virtual shards"
broker.publish("feed-events", event, {
partitionKey: `${event.creatorId}-${salt}` // hot key split across multiple partitions
});
// Consumers fan-in and re-order by event.sequenceNumber if strict order needed
The mistake: Using pub/sub where you actually need a direct response β for example, using a message topic to send a "validate payment" request and a reply topic to get the answer back, building your own request-reply correlation over a broker that was never designed for it.
Why it's bad: Pub/sub is asynchronous by design. The publisher fires a message and moves on β it does not block waiting for a reply. If you bolt synchronous semantics on top (matching reply messages by a correlation ID, setting timeouts, handling "no reply received"), you've rebuilt the worst of both worlds: you have the latency overhead of a broker hop plus the blocking wait of a synchronous call, but with none of the simplicity of direct HTTP/gRPC. The correlation ID approach also creates subtle bugs: replies can arrive out of order, messages can be lost with no caller-side error, and every caller has to manage its own timeout/retry loop.
Fix: Use direct HTTP or gRPC for anything that needs a synchronous response. Pub/sub is right for events that trigger independent reactions β not for "ask a question and wait for the answer." If you genuinely need async request-reply at scale, look at RPC frameworks with async support (gRPC streaming, NATS request-reply which has built-in support) rather than cobbling it together on top of a pub/sub broker.
// BAD β using pub/sub for synchronous-style request-reply
async function validatePayment(orderId) {
const correlationId = uuid();
await broker.publish("payment.validate.request", { orderId, correlationId });
// now block-wait for a reply on "payment.validate.reply"... for up to 5s
return waitForReply("payment.validate.reply", correlationId, 5000); // fragile, DIY
}
// GOOD β direct HTTP/gRPC call for synchronous request-reply
async function validatePayment(orderId) {
// One hop. Timeout built into the HTTP client. Error surfaced immediately.
const response = await fetch(`${PAYMENT_SERVICE_URL}/validate`, {
method: "POST",
body: JSON.stringify({ orderId }),
signal: AbortSignal.timeout(3000) // 3s timeout, no DIY correlation needed
});
return response.json();
}
The mistake: A subscriber processes messages one by one. A malformed message arrives β maybe a required field is null, maybe the schema changed, maybe an external API it calls is down. The subscriber throws an exception, the broker redelivers the message, the subscriber throws again, the broker redelivers again. Loop. The subscriber processes nothing else while it spins on this one poison message.
Why it's bad: This is called a poison message attack β usually accidental. The message is valid enough to be accepted by the broker but invalid enough to crash your processing logic every time. Without a dead-letter queue (DLQ), the only options are: wait forever (consumer stuck), manually delete the message (requires operator intervention at 3am), or silently drop it (data loss). The entire subscription is blocked. Healthy messages queue up behind the one bad one. Consumer lag grows. SLAs miss.
Fix: Configure a dead-letter queue (DLQ) on every subscription that processes user data. Set a maximum retry count (typically 3β5). After the last retry fails, the broker parks the message in the DLQ instead of redelivering it. Your main consumer unblocks immediately. The poison message sits in the DLQ for human inspection or automated alerting. Most managed brokers (AWS SQS, Google Pub/Sub, Azure Service Bus) support DLQs natively β it's a single configuration line, not a system redesign.
// BAD β no DLQ, no retry limit; one bad message stalls the subscriber forever
broker.subscribe("order.placed", async (msg) => {
await processOrder(msg); // if this throws, broker redelivers forever β infinite loop
});
// GOOD β configure DLQ + max retries at the subscription level
const subscription = broker.subscribe("order.placed", {
maxRetries: 3, // give up after 3 failures
deadLetterTopic: "order.placed.dlq", // park poison messages here
retryBackoff: "exponential" // wait longer between each retry
}, async (msg) => {
await processOrder(msg);
// on repeated failure β auto-moved to DLQ, consumer unblocks
});
// Separately, monitor the DLQ and alert when messages arrive
broker.subscribe("order.placed.dlq", (msg) => {
alertOncall(`Poison message in order.placed.dlq: ${msg.id}`);
});
The mistake: A subscriber catches a transient error (a downstream service is briefly slow, a database connection times out) and immediately retries β zero delay. Processing fails again. Retry again. And again. Dozens of subscriber instances doing this simultaneously flood the struggling downstream service with a wall of retries at the exact moment it most needs breathing room to recover.
Why it's bad: This is called a retry storm. The downstream service was about to recover β maybe it just needed 2 seconds to finish a GC pause or reconnect a pool β but the storm of retries prevents it from recovering at all, because every retry arrives before it can finish handling the last one. A transient 2-second blip turns into a cascading failure that lasts minutes. Every retry also consumes the broker's throughput budget, starving normal messages during the storm.
Fix: Always retry with exponential backoff plus jitter. Exponential backoff means each retry waits twice as long as the previous one (1s, 2s, 4s, 8sβ¦). Jitter means adding a small random offset to each wait time so that all subscriber instances don't retry at the exact same millisecond. Together they transform a thundering herd into a gentle trickle, giving the downstream service the breathing room it needs to recover.
// BAD β zero-delay retry hammers a struggling service
broker.subscribe("email.send", async (msg) => {
try {
await emailProvider.send(msg);
} catch (err) {
// Instant retry β if emailProvider is stressed, this makes it worse
await emailProvider.send(msg); // π thundering herd
}
});
// GOOD β exponential backoff + jitter
function backoffMs(attempt) {
const base = Math.min(1000 * Math.pow(2, attempt), 30000); // cap at 30s
const jitter = Math.random() * 500; // Β±500ms jitter
return base + jitter;
}
broker.subscribe("email.send", async (msg) => {
for (let attempt = 0; attempt < 4; attempt++) {
try {
await emailProvider.send(msg);
return; // success β done
} catch (err) {
if (attempt < 3) {
await sleep(backoffMs(attempt)); // wait before retry: 1s, 2s, 4s
} else {
throw err; // exhaust retries β DLQ handles it
}
}
}
});
Creating one topic per user (or per entity) causes broker metadata explosion β use a shared topic with user ID in the payload instead.
Hot partitions form when partition keys are skewed β salt the key to spread load, or dedicate a separate topic to high-traffic entities.
Pub/sub is fire-and-forget, not request-reply β use HTTP/gRPC directly when you need a synchronous response.
Without a dead-letter queue, one poison message blocks a subscriber forever β configure max retries and a DLQ on every subscription that touches production data.
Instant retries create retry storms that prevent recovery β always use exponential backoff with jitter.
Testing Pub/Sub Systems
How to verify publishers, subscribers, routing, and fault tolerance β without needing a live broker in every test.
Testing a pub/sub system is different from testing a plain HTTP service. There's a broker in the middle, delivery is asynchronous, and failures are probabilistic. The trick is to test each layer in isolation β mock the broker for unit tests, use a real (but local) broker for integration tests, and inject faults deliberately to verify that your resilience logic actually works.
The Four Testing Layers
Layer 1 β Unit-Testing Publishers (Mock the Broker)
A publisher's job is simple: call broker.publish(topicName, payload) with the right arguments. You don't need a real broker to test this. Inject a mock broker, trigger the publishing action, and assert that publish was called with the expected topic and a correctly shaped payload. Fast, deterministic, no infrastructure needed.
Layer 2 β Unit-Testing Subscribers (Replay a Fixture)
A subscriber's job is also simple: receive a message and do something with it (write to a DB, send an email, update state). You don't need a live broker for this either. Just call the handler function directly with a hand-crafted fixture message. Test the happy path, test malformed input (it should fail gracefully, not blow up the process), and test idempotency by calling the handler twice with the same message ID β the side effect should happen only once.
Layer 3 β Integration-Testing Topic Routing
Once individual units are covered, spin up a real (but ephemeral) broker in your CI pipeline. Testcontainers makes this easy β it starts a Docker container for Kafka, RabbitMQ, or whatever broker you use, runs your test, and tears it down. Publish a real message, subscribe, and verify end-to-end that routing, schema serialization, and subscription filters all work together correctly.
Layer 4 β Fault Injection
The most important tests β and the most skipped. Kill the broker mid-test and verify that your subscriber retries and the DLQ triggers after max retries. Replay the same message twice and verify your handler is idempotent. Publish a deliberately malformed message and verify it ends up in the DLQ without blocking healthy messages. These tests are the only way to prove your resilience config actually works before production proves it the hard way.
The idempotency test is non-negotiable. Pub/sub brokers guarantee at-least-once delivery β your subscriber will receive the same message more than once during broker restarts or consumer rebalances. If your handler isn't idempotent (safe to call multiple times with the same message), you will produce duplicate side effects in production. Test it explicitly: call the handler with the same messageId twice and assert the side effect happened exactly once.
Sample Test: Publisher + Subscriber Unit Tests
// Layer 1: publisher unit test β mock the broker, no infrastructure needed
import { OrderService } from "./order-service.js";
import { MockBroker } from "./test-helpers/mock-broker.js";
describe("OrderService.placeOrder", () => {
it("publishes order.placed with correct payload", async () => {
const mockBroker = new MockBroker();
const service = new OrderService(mockBroker);
await service.placeOrder({ orderId: "ord-99", amount: 42.00 });
// Assert broker.publish was called with the right topic and payload shape
expect(mockBroker.published).toHaveLength(1);
expect(mockBroker.published[0].topic).toBe("order.placed");
expect(mockBroker.published[0].payload.orderId).toBe("ord-99");
expect(mockBroker.published[0].payload.amount).toBe(42.00);
});
});
// Layer 2: subscriber unit test β inject fixture message, no broker needed
import { handleOrderPlaced } from "./order-handler.js";
import { db } from "./db.js"; // real or mock DB
describe("handleOrderPlaced", () => {
it("creates fulfillment record on valid message", async () => {
const fixture = { messageId: "msg-1", orderId: "ord-99", amount: 42.00 };
await handleOrderPlaced(fixture);
const fulfillment = await db.find("fulfillments", { orderId: "ord-99" });
expect(fulfillment).not.toBeNull();
});
it("is idempotent β reprocessing the same messageId has no double-effect", async () => {
const fixture = { messageId: "msg-1", orderId: "ord-99", amount: 42.00 };
await handleOrderPlaced(fixture); // first delivery
await handleOrderPlaced(fixture); // redelivery (at-least-once semantics)
const count = await db.count("fulfillments", { orderId: "ord-99" });
expect(count).toBe(1); // must be exactly 1, not 2
});
});
// Layer 3: full integration test with a real Kafka (via Testcontainers)
import { KafkaContainer } from "@testcontainers/kafka";
import { Kafka } from "kafkajs";
let kafkaContainer;
let kafka;
beforeAll(async () => {
kafkaContainer = await new KafkaContainer("confluentinc/cp-kafka:7.5.0").start();
kafka = new Kafka({ brokers: [kafkaContainer.getBootstrapServers()] });
}, 60_000); // allow 60s for Docker pull
afterAll(() => kafkaContainer.stop());
it("message published to order.placed reaches email subscriber", async () => {
const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: "test-email" });
await producer.connect();
await consumer.connect();
await consumer.subscribe({ topic: "order.placed", fromBeginning: true });
const received = [];
consumer.run({ eachMessage: async ({ message }) => {
received.push(JSON.parse(message.value.toString()));
}});
await producer.send({
topic: "order.placed",
messages: [{ value: JSON.stringify({ orderId: "ord-99", amount: 42.00 }) }]
});
// Wait up to 5s for the consumer to receive the message
await waitFor(() => received.length === 1, 5000);
expect(received[0].orderId).toBe("ord-99");
});
Test publishers by injecting a mock broker and asserting the right topic and payload shape β no infrastructure needed.
Test subscribers by calling the handler directly with a fixture message β verify the happy path, malformed input, and idempotency explicitly.
Integration-test topic routing using a real ephemeral broker via Testcontainers in CI.
Fault injection tests (kill broker, replay message, send poison msg) are the only way to prove resilience config works before production does.
Idempotency is mandatory to test because brokers deliver at-least-once β expect duplicates.
Observability
What to monitor on a pub/sub system in production β the five signals that tell you everything, and the dashboard that shows them at a glance.
When a pub/sub system misbehaves in production, it rarely announces itself loudly. It's usually a slow creep: consumer lag grows by 1,000 messages a minute, fan-out latency drifts from 50ms to 800ms, and by the time users notice something is wrong, the queue depth is two hours behind. Good observability means you catch these drifts in minutes, not after user complaints arrive.
The Five Production Signals
What Each Signal Tells You
β Publish rate β messages published per second. A sudden drop means publishers are crashing or blocking on the broker. A sudden spike means a runaway publisher or a replay event. Baseline this metric so anomalies are obvious.
β‘ Fan-out latency β time from when a publisher sends a message to when a subscriber first receives it. In a healthy system this is low (single-digit milliseconds for in-memory brokers, tens of milliseconds for durable log-based ones). A rising p99 means the broker is under load, the network is congested, or a subscriber is holding acknowledgements too long.
β’ Consumer lag β how many messages are in the topic ahead of what the subscriber has processed. This is the most important signal. Zero lag means the subscriber is keeping up in real time. Rising lag means the subscriber is slower than the publisher β scale the consumer, fix a slow handler, or check for a poison message causing retries. If lag is growing faster than you can scale, it's time to investigate the handler logic, not just add more pods.
β£ Broker queue depth β total messages retained in the broker across all topics. A slowly growing depth is normal (brokers retain messages for replay). A rapidly growing depth with stable publish rate but growing consumer lag confirms the consumer is the bottleneck. A depth that stays constant while lag grows means the broker is deleting (expiring) messages before the consumer can read them β a retention period problem.
β€ Dead-letter rate β percentage of messages that exhaust retries and land in the DLQ. Near-zero is healthy. Any non-zero spike deserves investigation: what changed? A schema change? A downstream dependency going down? A new code deployment that introduced a regression? The DLQ rate is your early warning system for systematic subscriber failures.
The 4 Golden Signals for Pub/Sub (adapted from SRE)
Throughput β publish rate (messages/sec). Is the broker receiving traffic as expected?
Latency β fan-out latency p99 (ms). How long does it take a message to reach subscribers?
Consumer Lag β messages behind. Is the system processing events in near-real-time or falling behind?
Error Rate β DLQ rate (% of messages). Are subscribers failing systematically?
If these four are green, your pub/sub system is healthy. If any one of them degrades, you have a direction for diagnosis. Queue depth is a useful fifth signal for capacity planning, not for alerting.
Monitor five signals: publish rate (throughput), fan-out latency (delivery speed), consumer lag (backlog), broker queue depth (capacity), and dead-letter rate (error rate).
Consumer lag is the most critical signal β rising lag is always a symptom of something wrong and deserves an immediate alert.
DLQ rate is your early warning system for systematic handler failures β any non-zero spike warrants investigation.
The 4 golden signals for pub/sub are throughput, latency, consumer lag, and error rate. Keep these green and everything else follows.
Capacity Planning β How Much Work Is the Broker Actually Doing?
Fan-out factor, partition strategy, and the worked math behind sizing a pub/sub cluster for real traffic.
When you put a pub/sub broker in a system design interview β or in production β the interviewer's next question is usually "how does it scale?" The answer lives in one formula: broker work = publish rate Γ fan-out factor. Once you understand that formula, you can size topics, plan partitions, and predict where the bottleneck will be before it bites you.
The Fan-Out Formula
A broker doesn't just pass messages through β it multiplies them. When 1 publisher sends 1 message to a topic with 10 subscribers, the broker must deliver 10 copies. That amplification is called fan-out. The total work the broker does is:
Why does this matter? Because in a classic pub/sub system (RabbitMQ, Google Pub/Sub) the broker carries all of this I/O on its own bus. In a log-based system (Kafka), the broker stores the log once and subscribers read independently β so write I/O stays flat while reads scale with subscriber count, but from local disk rather than multiplied RAM copies. Knowing this difference shapes your partitioning strategy completely.
Worked Example β The Full Math
Let's make it concrete. Suppose you're designing a pub/sub layer for a gaming leaderboard update system:
Publishers: 10,000 game servers, each publishing score updates at 100 msg/s
Topic:score.update
Subscribers: 50 (one leaderboard shard per region, plus analytics, plus fraud detection)
Message size: 512 bytes
Metric
Calculation
Result
Total publish rate
10,000 Γ 100 msg/s
1,000,000 msg/s
Inbound broker bandwidth
1,000,000 Γ 512 B
~512 MB/s write
Fan-out deliveries/s
1,000,000 Γ 50 subscribers
50,000,000 deliveries/s
Outbound broker bandwidth
50,000,000 Γ 512 B
~25 GB/s read
Minimum partitions (50 MB/s per partition)
512 MB/s Γ· 50 MB/s
11 partitions (round up to 16 or 32)
The 25 GB/s number is the alarm bell. No single broker handles 25 GB/s of outbound I/O. This is where your architecture changes: move from a traditional broker (which copies messages to each subscriber) to a log-based system like KafkaIn Kafka, the broker stores the log once on disk. Each subscriber (consumer group) maintains its own offset and reads independently from the log. The broker's I/O is dominated by write throughput + sequential disk reads β not N copies of each message in memory. This makes fan-out much cheaper at scale., where the broker writes once and subscribers read from a shared log independently. The outbound I/O problem becomes 50 independent sequential reads, not 50 copies.
Partition Strategy in Plain English
A partition is just a slice of a topic β a separate ordered log that a different broker (or different CPU thread) can handle independently. Splitting a topic into more partitions does two things: it spreads the write load across multiple brokers, and it lets multiple subscribers read in parallel (one subscriber thread per partition). This is why Kafka's throughput scales nearly linearly with partition count, up to the hardware limit of your cluster.
The catch: once you set partition count, you usually can't decrease it. More important β if you're routing messages by key (e.g., all events for user ID 42 go to the same partition to preserve order), changing partition count re-hashes keys to different partitions, breaking per-key ordering. Decide on partition count early and over-provision by 2Γ to leave room for growth without a reshuffle.
Quick sizing rule: Start with partitions = max(desired_parallelism Γ 2, throughput_MB_s Γ· 50). Most workloads are parallelism-bound, not throughput-bound. If you expect 20 consumer instances at peak, start with 40 partitions β not 20 β so you can add 20 more consumers later without restructuring the topic.
Broker work = publish rate Γ fan-out factor Γ message size. Fan-out is the multiplier that kills single-broker deployments β at high subscriber counts, move to log-based brokers (Kafka) where each subscriber reads independently rather than receiving N copies. Partition count sets the ceiling on consumer parallelism; over-provision by 2Γ at topic creation time because you cannot decrease partition count later without re-hashing keys and breaking per-key ordering.
Interview Q&A β Eight Questions That Actually Get Asked
Not just definitions β the WHY chains and trade-off reasoning interviewers are actually testing for.
These are the questions that come up in every system design interview once you mention pub/sub or an event-driven architecture. Knowing the textbook answer isn't enough β interviewers want to hear the reasoning chain that produced the answer. That's what each response below models.
EasyQ1 β What's the difference between pub/sub and a message queue?
Think first: What does each one guarantee about who receives a message and how many times?
A message queue is a pipe between exactly one producer and one consumer group β each message is consumed by one worker, then deleted. Think of it like a work queue: ten jobs go in, ten workers each grab one job. Once a job is taken, it's gone. The model exists to distribute load: you want each item processed exactly once by whichever worker is free.
Pub/sub is different in one fundamental way: every subscriber gets a copy. When you publish to a topic, all subscribers receive the same message independently. Nobody competes. It's not a work queue β it's a broadcast. The model exists to decouple producers from consumers: the publisher doesn't need to know who cares about the event.
When to use which: Use a queue when you need load distribution β many workers sharing the effort of processing a stream of tasks. Use pub/sub when you need event fan-out β one event triggering multiple independent reactions. Many systems use both: SNS (pub/sub) fans an event out to multiple SQS queues (work queues), where workers pick up tasks in parallel.
MediumQ2 β When would you NOT use pub/sub?
Think first: What does pub/sub fundamentally give up that a direct call provides?
Pub/sub gives up two things: synchronous response and simple observability. That immediately rules it out for several patterns:
Request/reply flows β If Service A needs an answer from Service B before it can proceed (e.g., "check if this user's credit limit allows this purchase"), pub/sub adds unnecessary complexity. A direct HTTP or gRPC call is the right tool. Layering a reply-to-topic pub/sub handshake on top just recreates synchronous communication with more moving parts.
Strict sequential processing with a single consumer β If you need guaranteed in-order processing with exactly-one consumer, a simple queue or a database-backed task table is often simpler and cheaper than a partitioned pub/sub system with careful key routing.
Very low volume, simple integrations β If two services communicate at 1 msg/minute and neither expects to add more consumers, the operational overhead of running a broker outweighs the decoupling benefit. A direct database row or a simple webhook works fine.
When you need exactly-once processing and can't tolerate deduplication logic β At-least-once delivery means duplicate handling is your problem. If your downstream consumer is not idempotent and the cost of implementing idempotency is high, consider whether a transactional outbox or a two-phase write model fits better.
The heuristic: If the caller needs a result before it can continue, use synchronous RPC. If the caller just needs to record that something happened and move on, pub/sub is usually right.
MediumQ3 β How does at-least-once delivery actually work?
Think first: What's the moment of danger β when can a message be lost, and how does the system recover?
The danger moment is the gap between the broker delivering a message and the subscriber acknowledging it. Here's the exact flow:
Broker delivers message M to subscriber S.
Broker starts a timer (the ack deadlineThe maximum time a subscriber has to acknowledge a message before the broker considers delivery failed and re-delivers. In Google Cloud Pub/Sub this is configurable, default 10 seconds. In Kafka it's controlled by max.poll.interval.ms for consumer groups.).
Subscriber processes M and sends ACK.
Broker marks M as delivered and moves on.
The "at-least-once" case: if step 3 never happens (subscriber crashes mid-processing, or the network drops the ACK), the timer expires and the broker re-delivers M. The subscriber might have processed M just fine β the ACK just never arrived. So M gets processed a second time. The message was delivered at least once, possibly more.
The implication: At-least-once delivery puts a contract on the consumer β your processing logic must be idempotent (processing the same message twice produces the same result as processing it once). Common patterns: check-and-skip by message ID before processing, use upserts instead of inserts in the database, or rely on natural idempotency (setting a flag is idempotent; incrementing a counter is not).
MediumQ4 β How do you handle a slow subscriber?
Think first: If one subscriber is slow, what happens to the broker's buffer and to the other subscribers?
A slow subscriber is one of the most common production pub/sub problems, and the right answer depends on the broker model.
In a traditional broker (RabbitMQ, GCP Pub/Sub): The broker buffers unacknowledged messages. If a subscriber is chronically slow, the buffer grows until it hits memory/storage limits. The broker may start dropping messages or exerting backpressure on publishers. The right fix is usually to scale out the subscriber β add more instances to increase processing throughput. If the slowness is spikey (not chronic), increasing the ack deadline buys time for processing without triggering re-delivery.
In Kafka: The broker doesn't care how fast subscribers are β each consumer group maintains its own offset independently. A slow consumer group just accumulates lag (distance between its current offset and the latest offset). The broker keeps the data until the retention period expires. Fix options: add consumer instances (up to the partition count), reduce message processing time, or increase max.poll.records to batch more work per poll. Monitor lag β a growing lag that never shrinks means you need more consumers or faster processing, not more time.
HardQ5 β Pub/sub vs event sourcing β what's the difference?
Think first: One is a communication pattern; the other is a storage pattern. Do they overlap?
Pub/sub is about delivering messages to interested parties. The broker routes events from publishers to subscribers. The events are ephemeral β once delivered and acknowledged, they're gone (in most brokers). The model answers the question "how do services notify each other?"
Event sourcing is about storing state as a sequence of events. Instead of saving the current state of an entity ("Order #42 is SHIPPED"), you save every state change as an event ("OrderCreated", "OrderPaid", "OrderShipped"). The current state is always derivable by replaying those events from the beginning. The model answers the question "how should I persist state in my application?"
They overlap when you use Kafka: Kafka's log-based storage makes it usable as both an event sourcing store (events are retained indefinitely, replayable) and a pub/sub backbone (consumer groups subscribe to topics and receive events). But the concepts are independent β you can have pub/sub without event sourcing (SNS/SQS with fire-and-forget events) and event sourcing without pub/sub (a single-service event store with no broadcast).
The key difference in one sentence: Pub/sub is about routing; event sourcing is about persistence and history. Use both together when you want durable, replayable, broadly distributed events β which is why Kafka + event sourcing is a common pairing in CQRS architectures.
HardQ6 β How do you scale a pub/sub broker when one topic is a hotspot?
Think first: What are the dimensions you can scale independently β and which ones are coupled?
The three levers work together but have different constraints. Partition count is the hard ceiling on consumer parallelism β you can't parallelize beyond the number of partitions regardless of how many consumer instances you add. Broker node count determines how many partition leaders can be distributed across the cluster β if all hot partitions live on one broker, adding more partitions doesn't help until you rebalance leaders. Consumer instances are the easiest to scale (just add more pods/workers) but are bounded by partition count.
The typical sequence for a hotspot: (1) Check if the topic is key-skewed β if 90% of messages share one key, all go to one partition. Fix by changing the partition key or adding a salt. (2) Increase partition count (accepting that key-to-partition mapping changes). (3) Add broker nodes and trigger a preferred replica election to redistribute leaders. (4) Scale consumers to match the new partition count.
MediumQ7 β What's a dead-letter queue and when do you need one?
Think first: What happens to a message that can never be successfully processed? Where should it go?
A dead-letter queue (DLQ) β sometimes called a dead-letter topic β is a special destination for messages that have failed delivery or processing too many times. Instead of retrying forever (burning CPU and blocking the topic) or silently dropping the message (losing data), the broker routes poison messages to the DLQ for human inspection or automated recovery.
When a message ends up in the DLQ: (1) The subscriber returned a negative acknowledgement (NACK) or failed to ack within the deadline N times. (2) The message exceeded a maximum retry count you configured. (3) The message is malformed and can't be deserialized at all.
When you need one: Any time your subscriber can plausibly receive a message it can't process β because of schema changes, downstream service outages, or data it doesn't expect. Without a DLQ, you have two bad choices: retry forever (rebalance storm, infinite loops, blocked consumers) or drop messages silently. The DLQ is the third option: park the broken message safely, alert an operator, and let the rest of the topic continue flowing. A DLQ is not optional in any production pub/sub system β it's hygiene.
What to do with DLQ messages: Monitor DLQ depth as an alert. Investigate the root cause, fix the subscriber or the message, and replay from the DLQ back to the original topic once the fix is deployed.
HardQ8 β How do you guarantee ordered delivery in pub/sub?
Think first: Order within what scope? Globally? Per-entity? What breaks ordering and what preserves it?
Pub/sub systems don't guarantee global ordering across all messages β that would require a single serial bottleneck and kill throughput. What they do guarantee, with the right configuration, is per-key ordering: all messages with the same partition key arrive at the same partition in the order they were published. Within one partition, one consumer reads sequentially.
The three rules that preserve ordering: (1) use a meaningful partition key (e.g., entity ID), (2) assign exactly one consumer instance per partition, (3) process messages synchronously within each consumer β don't parallelize processing inside a single consumer using threads, because that re-introduces race conditions the partition eliminated. If you need global ordering across all messages, you're essentially serializing everything β use a single partition, accept the throughput limit, and consider whether you actually need global order or just per-entity order.
The eight questions cover the full interview spectrum: queue vs. pub/sub (fan-out vs. load distribution), when not to use pub/sub (synchronous reply, very low volume), at-least-once delivery mechanics (ack deadline + idempotency requirement), slow subscriber handling (lag monitoring, partition scaling, backpressure), pub/sub vs. event sourcing (routing vs. persistence), broker scaling (partitions β brokers β consumers, key skew detection), dead-letter queues (poison message parking, not optional in production), and ordered delivery (per-key via partition key, not global).
Practice Exercises β Build Your Pub/Sub Intuition
Four exercises that force you to apply the concepts under constraints β the same way interviews and on-call incidents do.
Reading about pub/sub builds vocabulary. Working through design problems builds intuition. Each exercise below has a constraint that forces a specific decision. Try to reason through it before expanding the hint or solution β the reasoning path matters more than the final answer.
You're designing a pub/sub system for a social feed. Users post status updates. Every post is published to a status.posted topic. The system has:
2,000 active publishers (users posting), each averaging 1 post per second at peak
Average post message size: 2 KB
Subscribers to the topic: 4 (notifications service, feed-builder service, analytics service, spam-detection service)
Your single broker node can sustain 200 MB/s of total I/O
Questions: (a) What is the broker's inbound bandwidth at peak? (b) What is the outbound bandwidth due to fan-out? (c) Is the single broker sufficient? (d) What's the most impactful change to reduce broker load if it's not?
Calculate inbound first (publishers Γ rate Γ size), then multiply by subscriber count for outbound. Total I/O = inbound + outbound.Solution:
(c) Total broker I/O: 4 + 16 = 20 MB/s. The 200 MB/s single broker is fine β you're at 10% utilization. No scaling needed at this load.
(d) If load grows 20Γ (40,000 publishers), total I/O becomes 400 MB/s β over the limit. The most impactful change is moving to a log-based broker (Kafka) where subscribers read from a shared log rather than receiving pushed copies. Inbound stays at 80 MB/s; outbound becomes 4 independent sequential reads (~80 MB/s each), but from local disk β not multiplied in-memory copies. Alternatively, reduce fan-out by routing some subscribers to read from a downstream database updated by a single consumer, rather than all four consuming from the topic directly.
Your company runs two pub/sub topics:
payment.completed β triggers refund eligibility checks, receipt emails, and fraud analysis. Losing a message means a customer is never notified, a refund is missed, or fraud is never caught.
ui.button.clicked β raw analytics events from the web UI. Used to build usage heatmaps. The team has explicitly said losing 0.5% of events is acceptable for cost reasons. Volume: 100,000 msg/s.
For each topic, choose: delivery guarantee (at-most-once / at-least-once / exactly-once), and justify the trade-off. Also β for payment.completed, what contract does your subscriber need to uphold regardless of which guarantee you choose?
At-most-once = fast, can lose messages. At-least-once = safe but may duplicate. Exactly-once = safest, highest cost. Match the business cost of loss vs. duplicate to the mechanism.Solution:
payment.completed β At-least-once delivery. Losing any payment event has real financial and legal consequences. At-least-once guarantees the message will be delivered even if there's a crash or network blip, at the cost of possible duplicates. Exactly-once is theoretically ideal but adds significant broker and consumer complexity β most production systems achieve the practical equivalent by using at-least-once + idempotent consumers. The cost of building idempotency is far lower than the cost of lost payment events.
Subscriber contract: The subscriber MUST be idempotent β if it receives the same payment.completed event twice, it must not send two receipt emails or apply two refund eligibility checks. Common pattern: track processed event IDs in a database; on receive, check "have I seen this ID?" before processing.
ui.button.clicked β At-most-once delivery. The team explicitly accepts 0.5% loss. At-most-once is the cheapest option: the broker fires the message and doesn't retry on failure. This avoids ack overhead, retry queues, and deduplication logic β the savings compound at 100,000 msg/s. Using at-least-once here would add meaningful cost (ack round-trips, retry storage) for a benefit the team has decided isn't worth paying for.
You own a Kafka consumer group called inventory-updater that reads from the order.placed topic (32 partitions). The topic receives 50,000 msg/s at peak. Your consumer group has 32 instances running. You check the consumer group lag dashboard Monday morning and find:
Total lag: 2,400,000 messages and climbing
All 32 partitions have roughly equal lag (~75,000 each)
No consumer instances are idle β all 32 are assigned a partition
No errors in consumer logs β messages are processing without exceptions
What is most likely happening? List at least three hypotheses and the command or metric you'd check to confirm each. Then give the most likely fix.
If lag is growing uniformly across all partitions and there are no errors, the consumer is processing but not fast enough. Think about what determines processing throughput vs. what could be throttling it.Solution β three hypotheses:
H1 (Most likely): Processing throughput is just too low. 32 consumers Γ throughput_per_consumer < 50,000 msg/s. If each consumer handles 1,200 msg/s and the topic receives 50,000 msg/s, you're 14,000 msg/s short. Check: kafka-consumer-groups.sh --describe and watch lag change over 60 seconds. If lag grows at 14k/s, your processing rate is 36k/s against 50k/s. Fix: profile the processing code β is there a slow downstream DB write? Can you batch-write to the inventory database instead of one row per message? Batching 100 messages per DB write is 100Γ faster than individual writes.
H2: A downstream dependency is slow or rate-limited. The inventory database has a connection pool limit or a slow query plan. Consumers are processing but waiting on DB acks. Check: DB query latency metrics, connection pool saturation. Fix: add indexes, tune pool size, switch to async batch writes with a local buffer.
H3: Offset commits are happening too frequently. If commits happen after every single message (commitSync() per message), each commit is a round-trip to the broker. At 50k msg/s with 32 consumers, that's ~1,560 round-trips/s per consumer β expensive. Check: look at commit frequency in logs. Fix: commit after processing a batch (every 500β1,000 messages), not after every single message.
Most likely fix: Profile the inventory DB write. Switch from per-message INSERT to batched UPSERT with a 500ms flush window. This alone typically 10β50Γ throughput on DB-bound consumers.
You're designing the pub/sub layer for a ride-sharing platform. Requirements:
Three regions: US-East, EU-West, AP-Southeast. Each region has riders and drivers.
Real-time systems (matching engine, driver app push) must react within 500ms and need local latency β cross-region round-trips are unacceptable.
Analytics and compliance audit trails must see ALL events from ALL regions.
A ride that starts in US-East must always be processed by US-East systems β do not route a US-East ride event to the EU broker for processing.
Design the topic layout, broker placement, replication strategy, and subscription topology. Identify the two hardest trade-offs in your design.
Think about: where should topics live? What is replicated vs. processed locally? What is the difference between "events flow to a global analytics topic" vs. "events are processed globally"?Solution:
Broker placement: Deploy independent broker clusters in each region β US-East, EU-West, AP-Southeast. Each cluster hosts regional topics (us-east.ride.requested, eu-west.ride.requested, etc.). Real-time systems (matching engine, driver push) subscribe only to their local regional cluster. This guarantees sub-500ms local delivery β no cross-ocean hop.
Analytics aggregation: Deploy a global Kafka cluster (or use a managed service like Confluent Cloud or GCP Pub/Sub with global topics). Use Kafka MirrorMaker 2 (or equivalent) to replicate all regional topics into a unified global cluster: us-east.ride.requested, eu-west.ride.requested, and ap-se.ride.requested all mirror into global.ride.requested. Analytics and compliance consumers subscribe to the global cluster β they see all events from all regions, with some replication lag (typically 1β5 seconds). This satisfies the audit requirement without routing real-time traffic globally.
Topic layout per region:{region}.ride.{event_type} naming. Partition by ride ID β this ensures all events for one ride (requested β accepted β completed) land in the same partition, preserving ride-level ordering for the matching engine.
Trade-off 1 β Replication lag vs. consistency: The analytics cluster sees events 1β5 seconds behind real-time. If compliance queries the global cluster during that window, it may not yet see the most recent event for a ride. This is typically acceptable for audit use cases but must be documented and factored into any real-time dashboard SLAs.
Trade-off 2 β Operational complexity vs. simplicity: Three regional clusters plus a global cluster means four Kafka deployments to maintain, monitor, and upgrade. The alternative β a single global cluster β is simpler operationally but forces all real-time traffic to route to wherever the leader partition lives, adding cross-region latency that violates the 500ms requirement. The multi-cluster model is correct for this use case but requires investment in a cross-cluster replication pipeline and unified monitoring.
The four exercises build capacity planning instincts (fan-out math, I/O ceilings), delivery guarantee selection (matching business cost of loss vs. duplicate to the mechanism), consumer lag diagnosis (throughput vs. downstream bottleneck vs. commit overhead), and multi-region pub/sub topology design (local clusters for latency-sensitive real-time work, MirrorMaker replication for global analytics). The reasoning framework β not the numbers β transfers to any pub/sub system you encounter.
Cheat Sheet β Pub/Sub in 30 Seconds
Eight quick-reference cards covering every concept you need to recall fast β before an interview, during a design review, or when reviewing a broker config.
Bookmark this. When you need a fast refresh before an interview or while reviewing a pull request that touches event-driven code, this section is your 90-second re-entry point into the full page.
A service that emits events to a topic. It does not know β and does not care β who receives the event. Fire-and-forget is the contract. WHY: this ignorance is the source of all the decoupling benefits.A service that listens on a topic and reacts to every message delivered to it. Multiple subscribers on the same topic each get their own copy. WHY: fan-out is the point β one event, many independent reactions.A named, durable destination that the broker manages. Publishers write to it; subscribers read from it. Topics are the contract β name them as past-tense facts (order.placed, not createOrder). WHY: past-tense names signal that the event already happened and the publisher doesn't control what happens next.The middleman. It receives messages from publishers, stores them (at least briefly), and delivers copies to all subscribers. It is the single component that makes pub/sub work β remove it and you're back to direct calls. WHY: the broker absorbs the fan-out cost so publishers and subscribers don't have to know about each other.At-most-once: fast, can lose messages β use for low-stakes analytics. At-least-once: safe but may duplicate β requires idempotent consumers, use for payments and state changes. Exactly-once: no loss, no duplicates β expensive, use only when both loss AND duplication are unacceptable. WHY: each guarantee trades latency and complexity against data safety.How many subscribers receive each published message. Broker outbound I/O = publish rate Γ message size Γ fan-out factor. This is WHY a log-based broker (Kafka) beats a push broker at high fan-out: subscribers read from a shared log rather than receiving N pushed copies.Where messages go when a subscriber fails to process them after N retries. Without a DLQ you have two bad choices: retry forever (floods the topic) or drop silently (lose data). The DLQ parks broken messages safely for manual inspection and replay once the bug is fixed. Not optional in production.Global ordering across all messages is impractical β it serializes everything and kills throughput. Per-key ordering is the practical standard: all messages with the same partition key land in the same partition, in publish order. One consumer per partition preserves that order end-to-end.
Eight cards: publisher (fire-and-forget emitter), subscriber (fan-out receiver), topic (named durable channel), broker (the decoupling middleman), ack semantics (at-most / at-least / exactly-once trade-offs), fan-out factor (the broker I/O multiplier), DLQ (poison message safety net, non-optional), and ordering (per-key via partition key, not global).
Glossary β Plain-English Definitions
Every pub/sub term defined in plain English β the vocabulary you need to read broker documentation and architecture discussions without getting lost.
These are the terms you'll encounter across all pub/sub systems β Kafka, RabbitMQ, Google Cloud Pub/Sub, AWS SNS/SQS, Redis Pub/Sub. The concepts are identical; only the brand names change.
Publisher (Producer)
A service or process that creates and sends messages to a topic. The publisher doesn't know how many subscribers exist or what they do with the message. It just publishes and moves on.
Subscriber (Consumer)
A service or process that listens to a topic and receives messages from it. In push-based systems, the broker delivers messages to the subscriber. In pull-based systems (like Kafka), the subscriber asks the broker for new messages on its own schedule.
Broker
The server (or cluster of servers) that sits between publishers and subscribers. It receives messages, stores them temporarily or permanently, and routes copies to every subscriber registered on the topic. Think of it as the post office for your events.
Topic
A named channel inside the broker. Publishers write to topics; subscribers read from topics. A topic is the only thing publishers and subscribers agree on β they don't need to know about each other, only about the topic name.
Message / Event
The unit of data sent through the system. Usually a small JSON or binary blob with a payload (the data) and optionally headers (metadata like timestamp, event type, or source service).
Subscription
A registered intent to receive messages from a topic. When a subscriber subscribes, the broker starts routing copies of new topic messages to it. In some systems (Google Cloud Pub/Sub), subscriptions are named resources with their own configuration (ack deadline, retention, DLQ settings).
Acknowledgement (ACK)
A signal from the subscriber back to the broker saying "I received this message and processed it successfully β you can stop tracking it." Without an ACK, the broker assumes the message was not processed and will re-deliver it (in at-least-once systems).
Negative Acknowledgement (NACK)
A signal from the subscriber saying "I got this message but I can't process it right now β please re-deliver it." The broker will re-queue the message for another delivery attempt.
Dead-Letter Queue (DLQ)
A special topic or queue where the broker parks messages that have failed delivery or processing too many times. Prevents a single bad message from blocking the whole topic. Production requirement: every topic that matters needs a DLQ configured.
Fan-Out
The multiplication effect where one published message is delivered to N subscribers. If you publish 1 message to a topic with 10 subscribers, the broker does 10 deliveries. Fan-out factor is the multiplier that determines broker outbound I/O.
Consumer Group
A named set of subscriber instances that share the work of processing a topic's messages. In Kafka, each partition is assigned to exactly one consumer in the group β this distributes the processing load. Multiple consumer groups can independently consume the same topic, each getting a full copy of every message. WHY: consumer groups let you do load balancing (within the group) and fan-out (across groups) at the same time.
Partition
A slice of a topic β an independent ordered sequence of messages that can be stored on a different broker node. More partitions means more parallelism for both writing and reading. One consumer (per consumer group) is assigned to each partition, so partition count is the ceiling on consumer parallelism.
Offset
A number that identifies a specific message's position within a partition. Like a page number in a book. Subscribers track their current offset to know where they left off. If a subscriber restarts, it resumes from the last committed offset β this is how replay and crash recovery work.
Retention Period
How long the broker keeps messages before deleting them. In traditional brokers (RabbitMQ, SQS) messages are deleted after acknowledgement. In log-based brokers (Kafka) messages are retained for a configured duration (e.g., 7 days) regardless of consumption β enabling replay and late consumers.
Idempotency
A property of a processing operation: doing it twice produces the same result as doing it once. At-least-once delivery means duplicate messages are possible β your subscriber must be idempotent to handle them safely. Common pattern: deduplicate by message ID in the database before applying the change.
Fifteen terms: publisher, subscriber, broker, topic, message/event, subscription, ACK, NACK, DLQ, fan-out, consumer group, partition, offset, retention period, and idempotency. These are the vocabulary building blocks you need to read any pub/sub documentation or architecture discussion without reaching for a search engine.
Mini-Project β Build a Notification Fan-Out Service
Start with a single in-process broker, then grow it to partitioned topics, then multi-region β the same evolution path real systems follow.
The best way to truly understand pub/sub is to build a scaled-down version yourself. This mini-project takes you from a zero-dependency in-process broker to a design that would work at the scale of a real Slack-like notification system. Each stage adds exactly one new concept β and explains WHY that concept became necessary.
The Problem
You're building a notification service for a team chat app. When a user sends a message in a channel, several things need to happen: every member of the channel must receive a push notification, the "unread count" badge must update, the activity feed must log the event, and a search index must be updated. The user who sent the message shouldn't have to wait for any of this β they just need their message to appear in the chat.
Stage 1 β In-Process EventEmitter (Node.js)
Start here: zero infrastructure, zero configuration. The built-in EventEmitter in Node.js implements the pub/sub pattern natively. One line of code to publish, one line per subscriber. This is the right starting point for a new feature β prove the fan-out logic works before paying the operational cost of a real broker.
// Stage 1: a thin wrapper around Node's EventEmitter
// Acts as the "broker" β all publishers and subscribers use this single instance
const { EventEmitter } = require("events");
const broker = new EventEmitter();
// Increase max listener count so Node doesn't warn when you add many subscribers
broker.setMaxListeners(50);
module.exports = broker;const broker = require("./broker");
// Called when a user sends a message in a channel
async function handleSendMessage(req, res) {
const { channelId, userId, text } = req.body;
// 1. Persist the message to the database (sync β user waits for this)
const message = await db.messages.insert({ channelId, userId, text });
// 2. Publish the event β fire-and-forget; subscribers run async
broker.emit("message.sent", { messageId: message.id, channelId, userId, text });
// 3. Return immediately β user doesn't wait for notifications
res.json({ ok: true, messageId: message.id });
}
module.exports = { handleSendMessage };const broker = require("./broker");
// Subscriber A β push notifications
broker.on("message.sent", async ({ channelId, userId, text }) => {
const members = await db.channels.getMembers(channelId);
const recipients = members.filter(m => m.id !== userId);
await pushService.sendToMany(recipients, { body: text });
});
// Subscriber B β unread count
broker.on("message.sent", async ({ channelId, userId }) => {
// Increment unread count for every member except the sender
await db.unreads.incrementForChannel(channelId, excludeUserId: userId);
});
// Subscriber C β search index
broker.on("message.sent", async ({ messageId, channelId, text }) => {
await searchClient.index({ id: messageId, channelId, text });
});
// Subscriber D β activity feed
broker.on("message.sent", async ({ channelId, userId }) => {
await activityFeed.record({ type: "message", channelId, userId });
});
// All four fire concurrently β no one waits for anyone else.// Stage 2: swap in Google Cloud Pub/Sub β minimal code change
const { PubSub } = require("@google-cloud/pubsub");
const pubsub = new PubSub({ projectId: "my-project" });
const topic = pubsub.topic("message.sent");
async function handleSendMessage(req, res) {
const { channelId, userId, text } = req.body;
const message = await db.messages.insert({ channelId, userId, text });
// Publish is now durable β broker persists the message
// Even if subscribers are down, the message won't be lost
const messageId = await topic.publishMessage({
data: Buffer.from(JSON.stringify({ messageId: message.id, channelId, userId, text })),
attributes: { source: "chat-api", version: "1" },
});
console.log(`Published as GCP message ${messageId}`);
res.json({ ok: true, messageId: message.id });
}
// Subscriber β runs as a separate Cloud Run service
const subscription = pubsub.subscription("push-notifications-sub");
subscription.on("message", async (msg) => {
const event = JSON.parse(msg.data.toString());
const members = await db.channels.getMembers(event.channelId);
await pushService.sendToMany(members.filter(m => m.id !== event.userId), { body: event.text });
msg.ack(); // Tell the broker: processed successfully, stop tracking
});
// If processing fails, msg.nack() β broker will re-deliver
subscription.on("error", (err) => console.error("Subscription error:", err));
Stage 3 Evolution Notes
Stage 3 adds multi-region. Each region (US-East, EU-West) gets its own Kafka cluster. The chat API in each region publishes only to its local cluster β no cross-region write latency. Notification subscribers (push, unread, search) also run locally, so the whole real-time loop stays within ~20ms round-trip.
A separate MirrorMaker 2 process replicates all regional topics into a global analytics cluster. Analytics and compliance consumers subscribe there, accepting 1β5 seconds of replication lag in exchange for a complete view of all events across every region. The real-time path never touches the global cluster.
Build order for a new project: Stage 1 (EventEmitter) until you have more than one service or need durability. Then Stage 2 (managed broker like Google Pub/Sub or Amazon SNS+SQS) when you have multiple services or need messages to survive a restart. Only go to Stage 3 when you have users in multiple geographies and cross-region latency is measurable. Skipping stages is premature complexity.
Three-stage evolution: Stage 1 (Node.js EventEmitter, zero infra, prove the fan-out design) β Stage 2 (managed broker like GCP Pub/Sub, durable delivery, independent scaling, DLQ) β Stage 3 (regional clusters for <20ms local delivery, MirrorMaker replication for global analytics). Code change from Stage 1 to Stage 2 is minimal β you're swapping the broker, not the design. Build only the stage you actually need right now.
Migration Path β From HTTP Webhooks to a Pub/Sub Backbone
A four-step guide for teams moving from a tightly-coupled webhook system to an event-driven pub/sub architecture β with honest risk assessment at each step.
Many systems start with webhooks: Service A calls Service B's endpoint directly when something happens. This works. Then a third service needs the same events. Then a fourth. Then Service B goes down and Service A starts returning 500s to users. The moment you recognize this smell, it's time to migrate to pub/sub β but migrating a live system takes care.
Golden rule for migrations: Never do a big-bang cutover. Each step below runs the old system and new system in parallel for a period. Only cut over after verifying the new path works correctly.
What you do: Deploy a broker (start with a managed service β Google Pub/Sub, AWS SNS, or RabbitMQ). When Service A fires its existing webhook call to Service B, also publish the same event to a new topic on the broker. Do NOT remove the webhook yet. Both paths are live simultaneously.
Why this is safe: You haven't changed the existing code paths at all. The webhook still fires. The broker publish is additive β if it fails, the webhook already succeeded. You're giving yourself a window to verify the broker is receiving all events correctly before anyone depends on it.
Risk: Double-publishing means both paths must be maintained while you migrate. Avoid letting this parallel state last more than 2β4 weeks β teams forget and the webhook never gets removed.
Verification: Compare event counts on the broker topic vs. webhook delivery count over 48 hours. They should match. Set up a DLQ on the broker topic and monitor it for zero depth during this period.
Step 2 β Migrate Subscribers One by One (Risk: Medium)
What you do: Pick the lowest-stakes downstream consumer (not the most critical β start with analytics or a non-critical notification). Update it to subscribe to the broker topic instead of listening for the webhook. Keep the webhook delivery to this consumer disabled but not deleted. Observe the consumer for one week under real production traffic.
Why one by one: If something breaks, you know exactly which consumer caused it. A big-bang migration where all consumers switch simultaneously makes debugging impossible. The incremental approach turns a high-risk migration into a series of low-risk steps.
Risk: The subscriber now depends on the broker. If the broker has a delivery delay or outage, this consumer stops receiving events. This is acceptable for a low-stakes consumer but not yet for the most critical ones. Test your DLQ and retry policy before moving critical consumers.
Idempotency check: Before migrating any subscriber that modifies state (database writes, financial operations), confirm it is idempotent β at-least-once delivery means it will receive duplicates eventually. Add deduplication logic now if it isn't already present.
Step 3 β Cut Over Critical Consumers and Remove Webhooks (Risk: High β do this carefully)
What you do: After all non-critical consumers are successfully running on the broker, migrate the critical consumers (payments, core state changes). Once a consumer has been running on the broker for at least one week in production with zero issues, remove the webhook delivery path to that consumer. Removing the webhook is what actually simplifies the architecture.
Why the wait: One week of production traffic exercises edge cases that staging never sees. A shadow period that's too short misses monthly batch jobs, weekend traffic patterns, and retry storms from unrelated outages.
Risk: Removing the webhook to a critical consumer is irreversible in the short term. If the broker has an undiscovered bug (schema mismatch, delivery ordering issue), you have no fallback. Mitigate by keeping the webhook code path commented-out (not deleted) for 30 days post-cutover β you can re-enable it in minutes if you need to.
Rollback plan: Maintain the ability to re-enable direct webhook delivery within 10 minutes for any critical consumer. Document this rollback procedure before you start Step 3.
What you do: Once all consumers have been on the broker for 30+ days, remove the webhook publishing code from Service A. Update monitoring to track broker topic depth, consumer group lag, and DLQ depth as your primary health signals (replacing the old HTTP webhook success rate). Set up alerts: page on DLQ depth > 0 for critical topics; alert (not page) on consumer lag growing for more than 5 minutes.
Why monitoring changes matter: Your old observability was HTTP response codes on webhook calls β easy to monitor. Pub/sub failure modes are different: the publisher succeeds (200 OK from the broker), but the subscriber might be slow, down, or processing incorrectly. Consumer lag and DLQ depth are now your primary health signals. If you don't update your dashboards and alerts, you'll miss subscriber failures entirely.
Hardening checklist: (1) DLQ configured on every topic. (2) Retry policy tested (NACK β exponential backoff β DLQ after N attempts). (3) Consumer lag alert set. (4) Runbook written for "DLQ has messages β what do I do?" (5) Load test at 3Γ expected peak before decommissioning any parallel systems.
Four steps: (1) Add broker alongside existing webhooks β additive, no risk, verify event parity. (2) Migrate consumers one-by-one starting with low-stakes ones β incremental, isolates failures, confirm idempotency. (3) Cut over critical consumers and remove webhooks β highest risk step, keep rollback plan ready, 30-day shadow before deleting code. (4) Remove publisher webhook code and update monitoring β consumer lag + DLQ depth replace HTTP success rate as your health signals.
Further Reading β Seven Sources Worth Your Time
Carefully selected references β each one teaches something this page doesn't have room to cover in depth.
These references go deeper on specific sub-topics. They're listed in order of approachability β start from the top if you're still building mental models, from the bottom if you want to go deep on production systems.
The single best explanation of how logs, streams, and pub/sub systems relate to each other β covers replication, ordering, and the difference between databases and message logs with unusual clarity. If you read one book chapter on this topic, make it this one.
"You Cannot Have Exactly-Once Delivery"
Tyler Treat (bravenewgeek.com)
A short, precise essay that explains why true exactly-once delivery is fundamentally impossible in a distributed system β and what "exactly-once" actually means in practice (exactly-once processing, not delivery). Corrects one of the most persistent myths in pub/sub. Free on the web.
The Log: What every software engineer should know about real-time data's unifying abstraction
Jay Kreps (LinkedIn Engineering Blog)
The foundational blog post that explains why a distributed append-only log is the right abstraction for pub/sub at scale. Written by one of Kafka's creators. Explains the unifying idea behind Kafka, database replication, and event sourcing in one readable essay. Free online.
Google Cloud Pub/Sub Documentation β "Choosing a subscription type"
Google
A practical, well-organized reference for push vs. pull subscriptions, ordering keys, exactly-once delivery configuration, and DLQ setup in a real managed pub/sub service. Useful as a reference when you're designing or implementing against GCP Pub/Sub specifically. Free at cloud.google.com/pubsub/docs.
RabbitMQ Tutorials β "Publish/Subscribe"
RabbitMQ (rabbitmq.com)
The official hands-on tutorial series that walks through exchanges, queues, and binding keys with working code in multiple languages. The best first-principles introduction if you want to understand how a traditional broker (not log-based) implements pub/sub routing. All tutorials are free and runnable in under 30 minutes.
"Building a Scalable Pub/Sub System" β Confluent Blog Series
Confluent Engineering
A multi-part series on production pub/sub patterns using Kafka: schema evolution, consumer group management, DLQ patterns, and multi-region replication. Goes beyond Kafka-specific details into general principles that apply to any broker. Free at confluent.io/blog.
AWS SNS + SQS Fan-Out Pattern Documentation
Amazon Web Services
The AWS architecture reference for the SNS-to-SQS fan-out pattern β where one SNS topic fans out to multiple SQS queues, each with its own consumers. The most commonly deployed pub/sub architecture in AWS environments. Includes worked configuration examples. Free at docs.aws.amazon.com.
Suggested reading order: Tyler Treat's essay (15 min, fixes a common misconception) β Jay Kreps' "The Log" (30 min, builds the big picture) β Kleppmann Chapter 11 (2 hours, goes deep). Then pick the platform-specific docs (Google, RabbitMQ, or AWS) for whichever broker you're actually building on.
Seven references: Kleppmann Ch. 11 (the mental model), Tyler Treat's exactly-once essay (fixes a myth), Jay Kreps' "The Log" (the foundational unifying idea), Google Cloud Pub/Sub docs (practical managed service reference), RabbitMQ tutorials (hands-on traditional broker), Confluent blog series (production patterns), and AWS SNS+SQS docs (the most common cloud fan-out pattern). Together they take you from first principles to production-ready design decisions.
Related Topics β What to Study Next
Six natural next steps in the HLD learning path β each one builds directly on what you just learned about pub/sub.
Pub/sub doesn't exist in isolation. Each of these six topics connects directly to something covered on this page β they're ordered from "most directly related" to "broadens your horizons."
You've covered the full pub/sub mental model. You now understand the core pattern (publisher β broker β subscriber), the delivery guarantees and their trade-offs, how fan-out scales (and where it breaks), the six most common failure modes, eight interview questions with full reasoning chains, capacity planning math, and a concrete mini-project plus migration playbook. The next natural step in the learning path is Apache Kafka β the broker that implements these ideas at the scale of real high-throughput systems, and the one most likely to appear in system design interviews.
Six related topics in reading order: Message Queues (the complement to pub/sub fan-out), Apache Kafka (pub/sub at high throughput), Webhooks (the direct-call pattern pub/sub replaces), Real-Time Systems (browser delivery layer that pairs with a pub/sub backend), Caching Strategies (pub/sub as a cache invalidation mechanism), and CDN (pub/sub-driven cache purge at the edge). The logical next stop is Apache Kafka.