TL;DR — The Hospital That Never Closes
- Why building systems that survive failure is more important than building systems that never fail
- The three pillars of reliability: redundancy, failover, and monitoring
- How "the nines" of availability translate to real downtime minutes per year
- Why human error — not hardware — causes the majority of outages
Reliability is the art of keeping your system running even when pieces of it break.
Think about a hospital. A hospital never closes. Not on holidays, not during a power outage, not when a doctor calls in sick. Why? Because hospitals are designed around a simple truth: things will go wrong. The power will fail. Doctors will get sick. Equipment will malfunction. The hospital doesn't try to prevent every possible failure (that's impossible). Instead, it makes sure that when something fails, there's always a backup ready to take over.
The main power goes out? A backup generator (a secondary power source that starts automatically when the main supply fails; in software, the analogue is a standby database taking over when the primary crashes) kicks in within 10 seconds. The lead surgeon is in an accident? An on-call surgeon arrives within 30 minutes. The MRI machine breaks? There are two more down the hall. Every single critical system has a Plan B. Many have a Plan C. Some have a Plan D.
That's reliability in software. You don't build systems that never fail — that's a fantasy. You build systems where failure doesn't matter because something else picks up the slack before anyone even notices. Netflix engineers intentionally crash their own servers during business hours using a tool they built called Chaos Monkey, which randomly kills production servers during working hours. Why? Because if a random server dying causes problems, they want to find out at 2 PM on a Tuesday — not 3 AM on a Saturday. It forces every team to build services that survive failures. If a random server dies and users see an error, that's a bug in the architecture — not bad luck.
The core insight is deceptively simple: things will fail. Disks die. Networks split. Developers push bad code at 5 PM on a Friday. The question isn't "will something go wrong?" — it always will. The question is: will your users notice?
A reliable system isn't one where nothing breaks. It's one where things break all the time and nobody cares — because the system handles it automatically. The generator starts. The backup takes over. The traffic reroutes. By the time a human even looks at the dashboard, the system has already healed itself.
What: Reliability means your system keeps working correctly even when things go wrong — hardware failures, software bugs, human mistakes, network issues. Not "nothing ever fails," but "failure doesn't reach the user."
When: Every production system needs reliability thinking from day one. The cost of an outage grows exponentially with your user count. A 5-minute outage at 100 users is nothing. At 10 million users, it's front-page news.
Key Principle: Design for failure, not against it. Assume every component will fail, and build the system so it doesn't matter when they do. The three pillars: redundancy (have backups), failover (switch automatically), monitoring (detect instantly).
The Scenario — Your Server Just Died at 3 AM
It's 3:17 AM on a Saturday. Your phone vibrates on the nightstand. Then again. Then again. You grab it, squinting at the screen.
You stumble to your laptop and SSH into the production server. The e-commerce site is completely down. Checkout is broken. The shopping cart returns 500 errors. Order confirmations aren't sending. Every minute the site is down, the company loses roughly $4,000 in revenue. It's Black Friday weekend.
You check the database server. Disk is at 100%. The application logs have been writing to /var/log/app/ without any log rotation configured: nothing was archiving and deleting old log files before they filled the disk. (Most Linux systems ship logrotate for exactly this, but someone has to actually configure it.) Three months of debug logs have eaten the entire 50 GB disk. PostgreSQL can't write its WAL files (the Write-Ahead Log, its crash-recovery mechanism: every change is recorded in the WAL before the data itself, then replayed after a crash to recover). With a full disk it can't append WAL entries, so it refuses every write operation. One tiny oversight — forgetting to set up logrotate — brought down the entire platform.
You frantically delete old logs (find /var/log/app/ -mtime +30 -delete), restart PostgreSQL, and watch the site slowly come back to life. Total downtime: 47 minutes. Revenue lost: ~$188,000. Sleep lost: all of it. Reputation damage: immeasurable.
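A few lines of configuration would have prevented the whole incident. Here is a hypothetical logrotate stanza for the /var/log/app/ directory above (the file name, retention, and options are illustrative, not taken from the incident):

```conf
# /etc/logrotate.d/app (hypothetical): rotate the app logs daily, keep two weeks
/var/log/app/*.log {
    daily            # rotate once a day
    rotate 14        # keep 14 archived logs, delete older ones
    compress         # gzip rotated logs to save space
    delaycompress    # leave yesterday's log uncompressed for easy tailing
    missingok        # don't error if a log file is absent
    notifempty       # skip rotation when the file is empty
    copytruncate     # truncate in place so the app keeps its open file handle
}
```

Ten lines of config versus 47 minutes of downtime: this is the cheapest reliability investment in the whole chapter.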
Now contrast that with Netflix. Netflix runs on thousands of servers across multiple AWS regions (Amazon's data centers in different parts of the world: US East, EU West, Asia Pacific, and so on). Each region is completely independent, with its own power, network, and cooling, so if one region has an outage, the others keep running. Every day, their Chaos Monkey tool randomly kills server instances during business hours. Not in staging. Not in a test environment. In production. Serving real customers watching real shows.
And nobody notices. Not a single stream buffers. Not a single recommendation fails to load. The system detects the dead server within seconds, reroutes traffic to healthy servers, and spins up a replacement. By the time a Netflix engineer glances at the dashboard, the incident is already resolved — automatically.
The difference between the 3 AM horror story and the Netflix non-event isn't luck. It isn't budget (though Netflix spends more, the core patterns are free). It's reliability engineering — designing every piece of the system to fail gracefully, detect failures instantly, and recover automatically.
Look at the 3 AM incident again. The root cause was a full disk — no log rotation. But think deeper: what reliability mechanisms were missing? List at least three things that should have caught this problem before it became an outage.
Think about monitoring (what should have alerted before 100%?), redundancy (what if the database had a replica?), and automation (what if log rotation was already configured?).
What Fails — The Four Horsemen of System Failure
Before you can build reliable systems, you need to understand what actually goes wrong. Failures don't come from one place — they come from four distinct categories, each with its own frequency, severity, and fix. Think of them as the four horsemen of your system's apocalypse. Let's meet them.
Horseman #1: Hardware Failures
Physical things break. Hard drives have moving parts that wear out. RAM chips get bit flips from cosmic rays (yes, really): a single bit in memory spontaneously changes from 0 to 1 or vice versa. Google's research measured roughly one bit flip per GB of RAM per year, which is why servers use ECC (error-correcting) memory that detects and fixes single-bit flips automatically. Power supplies overheat. Fans stop spinning. Server motherboards just... die one day.
How often does this happen? More than you'd think. Google published a famous study on disk failures: across a fleet of 100,000+ drives, the Annual Failure Rate (AFR, the percentage of drives that die in a given year) was about 2-4% per year, rising sharply (typically doubling or tripling) after 3 years as drives age. That means if you have 100 servers with one drive each, expect 2-4 dead drives this year. At Google's scale of millions of drives, that's thousands of disk failures per day.
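The AFR numbers above are easy to sanity-check with back-of-the-envelope arithmetic. A toy calculation (the 3% rate is a mid-range assumption from the study's 2-4% band, and the 20-million-drive fleet size is purely hypothetical):

```python
def expected_annual_failures(num_drives: int, afr: float) -> float:
    """Expected drive failures per year: fleet size times annual failure rate."""
    return num_drives * afr

# 100 drives at a mid-range 3% AFR: about 3 dead drives a year
print(round(expected_annual_failures(100, 0.03)))
# At a hypothetical 20 million drives, dead disks become a daily routine:
print(round(expected_annual_failures(20_000_000, 0.03) / 365))  # roughly 1,600 a day
```

The takeaway: at any real scale, disk failure is not an "if," it's a scheduled event.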
The good news? Hardware failures are the most predictable type. Drives give warnings (SMART data shows increasing error counts before failure). Servers often degrade slowly rather than dying instantly. And the fix is straightforward: have spares. Use RAID for disks (a Redundant Array of Independent Disks spreads data across multiple drives so your data survives a dead drive: RAID-1 mirrors everything to two drives, while RAID-5/6 uses parity math to recover from 1-2 drive failures), ECC memory for RAM (error-correcting RAM that fixes single-bit errors automatically; standard in servers, about 10-20% more expensive, and it prevents silent data corruption), and redundant power supplies for servers.
Horseman #2: Software Bugs
Software fails in sneakier ways than hardware. A memory leak (a program that keeps allocating memory but never frees it) doesn't crash your server immediately: it slowly eats RAM over days until the OOM killer (the Linux kernel's out-of-memory mechanism, which forcibly kills the process using the most RAM) strikes. Your app works fine for 72 hours, then Linux murders it. A race condition (a bug that only happens when two operations run at exactly the same time and interfere, like two people both seeing "1 seat available" on a flight and both clicking "buy") doesn't show up in testing — it shows up at 2x traffic when two requests hit the same database row at the exact same millisecond. A bad configuration file doesn't cause errors until the server restarts and reads the broken config.
Software bugs are particularly dangerous because they're often latent. The bug was introduced last Tuesday. It didn't cause any problems until Saturday night when traffic hit a specific pattern. Now you're debugging a week-old change in the middle of the night, and nobody remembers what changed.
Horseman #3: Human Error (The Big One)
Here's the uncomfortable truth that every reliability report confirms: humans cause more outages than hardware, software, and network issues combined. Study after study puts human error at 60-80% of all outages.
A developer pushes to production instead of staging. A DBA runs DELETE FROM users without a WHERE clause. Someone fat-fingers a firewall rule (the rules that control which network traffic is allowed in and out of a server) and blocks all incoming traffic: one wrong rule, like accidentally closing port 443 (HTTPS), and your entire website becomes unreachable even though the server is running perfectly fine. An ops engineer rolls back to the wrong version. A config change meant for one server gets applied to all 200 servers.
This is why modern reliability engineering focuses so heavily on guard rails around human actions: code reviews for config changes, canary deployments that test on 1% of traffic first, automated rollbacks when error rates spike, and "dry run" modes that show what a command would do before actually doing it.
Horseman #4: Network Issues
Networks fail in the most confusing ways. A network partition means Server A can talk to Server B, but not Server C — even though B and C can still talk to each other. (A partition splits the cluster into groups that can each hear their own members but not the other group; this is the "P" in the CAP theorem.) Packet loss makes things slow and unreliable without fully breaking them: at 0.1% loss you barely notice, at 1% SSH sessions feel laggy, at 5% things start timing out, and at 10% the system is basically unusable. A DNS failure (the Domain Name System is the internet's phone book, converting names like "google.com" to IP addresses) means names can't be resolved to addresses, so nothing can find anything else; it's one of the few single points of failure that can take down the entire internet for a region.
Network issues are the rarest of the four horsemen (roughly 5-10% of outages), but they're often the most devastating in scope. A disk failure kills one server. A bad deployment kills one service. A network partition can split your entire cluster in half and cause data inconsistency that takes days to clean up.
The scariest network failure is a BGP hijack. BGP (Border Gateway Protocol) is how internet routers decide where to send traffic; a hijack happens when someone (intentionally or accidentally) announces bad routing information, so traffic meant for your servers goes somewhere else entirely, a bit like someone changing the road signs to reroute all your customers to a competitor's store. This has happened to major companies including Google, Facebook, and Amazon.
Your team had 5 production incidents last month. If the industry average holds (60-80% human error), how many of those 5 were likely caused by a person, not a machine? What does that tell you about where to invest your reliability budget?
3-4 out of 5. This means investing in better deployment tooling, automated testing, and "dry run" commands will prevent more outages than buying better hardware.
The Foundation — Redundancy (Having a Backup for Your Backup)
Now that you know what fails, let's talk about the most fundamental defense: have more than one of everything that matters. This is called redundancy (having duplicate components so that if one fails, another takes over), and it's the oldest trick in the reliability playbook. Think of the spare tire in your car: you hope you never need it, but when you do, it saves you from being stranded on the highway.
Think about airplanes. A Boeing 777 has two engines — it can fly perfectly well on just one. It has three independent hydraulic systems (the high-pressure fluid circuits that move the control surfaces: flaps, rudder, ailerons; each system has its own reservoir, pumps, and tubing). If two fail completely, the pilot can still control the aircraft with the third. It has four independent flight computers cross-checking each other. The landing gear has a manual backup release. Even the cockpit windshield has a heated spare layer in case the outer layer cracks. Every critical system has a backup. Some have three backups.
Why? Because the consequence of failure is catastrophic. You can't pull over at 35,000 feet. Software systems use the exact same thinking: the more critical the component, the more copies you need.
Levels of Redundancy
Engineers use a simple naming system for how much redundancy a system has:
N+1 redundancy means you have one more than you need. If you need 3 servers to handle your traffic, you run 4. If one dies, you still have exactly enough. This is the minimum level of redundancy for any production system. It protects against one failure, but not two simultaneous failures.
N+2 redundancy means two spares. Needed when failures can be correlated — like when a power outage kills two servers on the same rack. Or when you need to take one server offline for maintenance while still being protected against a random failure on another.
2N redundancy means you have double everything. If you need 4 servers, you run 8. This is expensive, but it's what hospitals use for life-critical systems and what financial trading platforms use. When every second of downtime costs millions, the cost of extra hardware is a rounding error.
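The naming scheme is just arithmetic on top of your base capacity. A tiny sketch (the function name and scheme strings are mine, not a standard API):

```python
def servers_to_run(needed: int, scheme: str) -> int:
    """How many servers to provision for `needed` capacity under a redundancy scheme."""
    if scheme == "N+1":   # survive one failure
        return needed + 1
    if scheme == "N+2":   # survive two failures, or one failure during maintenance
        return needed + 2
    if scheme == "2N":    # survive losing half the fleet
        return needed * 2
    raise ValueError(f"unknown scheme: {scheme}")

# Need 4 servers' worth of traffic:
for scheme in ("N+1", "N+2", "2N"):
    print(scheme, servers_to_run(4, scheme))   # 5, 6, and 8 servers respectively
```

The real design question isn't the arithmetic — it's which failures in your system can happen at the same time, because correlated failures are what push you from N+1 toward N+2 and 2N.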
Active-Active vs. Active-Passive
Having backups is only half the story. The other half is: what are those backups doing while the primary is healthy? There are two approaches.
In active-passive redundancy, one server (the active) handles all the work while the backup (the passive) just sits there waiting. It's powered on, it's synchronized with the primary, but it doesn't serve any traffic. It's like that on-call surgeon — at home, awake, ready, but not operating. When the primary fails, the passive takes over. The upside: simple, safe, no complicated coordination. The downside: you're paying full price for a server that does nothing most of the time.
In active-active redundancy, all copies serve traffic simultaneously. There's no idle backup — everyone's working all the time. If one fails, the others absorb the extra load. Like a hospital with 3 MRI machines all running scans: if one breaks, patients shift to the other two with slightly longer wait times, but nobody goes unscanned. The upside: no wasted capacity, better performance normally. The downside: more complex to coordinate, and you need to ensure all servers have consistent data.
Which should you use? It depends on the component. For stateless things (web servers, API servers that don't store data locally), active-active is almost always better — you get more throughput and instant failover. For stateful things (databases, message queues, anything with data you can't afford to lose), active-passive is safer because there's only one source of truth for writes.
You're designing an e-commerce system with these components: (1) Product catalog API, (2) Payment processing, (3) User review service, (4) Email notification sender. What redundancy level would you choose for each, and why?
Think about what happens if each one goes down. Can users still buy things? Losing reviews is annoying; losing payments is catastrophic. Emails can be retried later.
Failover — The Backup Takes Over
Having a backup is step one. Making that backup actually take over smoothly — without losing data, without confusing users, without breaking things worse — that's failover. It sounds simple ("just switch!"), but the devil is in the details: How fast do you detect the failure? How do you avoid losing data during the switch? How do you prevent both servers from thinking they're the primary? That's where most of the complexity lives.
Think of it like this: you're in a meeting with a client. Your sales lead suddenly gets sick and has to leave. Having a second sales person in the building is redundancy. Getting them up to speed, into the meeting, and smoothly continuing the conversation without the client noticing — that's failover. The transition is the hard part, not the spare capacity.
Active-Passive Failover
The most common pattern. You have a primary server handling all traffic, and a standby replica that mirrors everything the primary does. The standby watches the primary like a hawk, constantly asking "are you still alive?" When the primary stops responding, the standby says "okay, I'm in charge now" and starts accepting traffic.
This sounds simple, but three big questions make it complicated:
1. How fast can you detect the failure? The standby pings the primary every few seconds, listening for a heartbeat: a periodic "I'm still alive" signal. If it misses 3-5 heartbeats, it assumes the primary is dead. (The interval is a trade-off: too frequent wastes bandwidth, too infrequent means slow detection; a beat every 1-5 seconds is typical.) With a 5-second interval and 3 missed beats, that's 15 seconds just to detect the failure.
2. How long does promotion take? The standby needs to finish replaying any pending data, open network ports, register itself with the load balancer, and start accepting connections. For a database, this might mean replaying the WAL, the sequential write-ahead log of every change: the standby continuously receives WAL records from the primary and must finish replaying all of them before it can accept writes. That takes anywhere from 5 to 60 seconds depending on how much data is in the pipeline.
3. How long to redirect traffic? The load balancer or DNS needs to stop sending traffic to the dead server and start sending it to the new primary. If you're using a load balancer with health checks, this can be near-instant. If you're relying on DNS changes, it could take minutes due to TTL caching: the Time-To-Live tells resolvers how long to cache an IP address before asking again, so after a change, some clients won't see the new IP until their cached entry expires. (Lower TTLs mean faster failover but more DNS queries and more cost; this is why most failover uses load balancers instead of DNS.)
Add it all up: detect (15s) + promote (30s) + redirect (5s) = ~50 seconds of downtime in a typical active-passive database failover. That's the best case. In practice, 1-5 minutes is common.
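The detection step can be sketched as a simple missed-beat counter. This is a toy monitor, not any particular HA tool's API; the 5-second interval and 3-beat threshold come straight from the arithmetic above:

```python
import time

class HeartbeatMonitor:
    """Declares the primary dead after `max_missed` consecutive missed beats."""

    def __init__(self, interval: float = 5.0, max_missed: int = 3):
        self.interval = interval          # seconds between expected beats
        self.max_missed = max_missed      # beats to miss before declaring death
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        """Record a heartbeat arriving from the primary."""
        self.last_beat = time.monotonic()

    def is_dead(self, now=None) -> bool:
        """True once `max_missed` full intervals have passed with no beat."""
        if now is None:
            now = time.monotonic()
        return (now - self.last_beat) > self.interval * self.max_missed

monitor = HeartbeatMonitor()              # 5s interval * 3 beats = 15s grace window
print(monitor.is_dead(monitor.last_beat + 10))   # False: still inside the window
print(monitor.is_dead(monitor.last_beat + 16))   # True: 15s exceeded, promote the standby
```

Notice that the threshold is doing double duty: it must be long enough to ride out a brief network blip (or you risk split-brain, below) yet short enough that real failures are caught quickly.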
The Split-Brain Problem
Here's the scariest failure mode in distributed systems. Imagine your primary database is running in Data Center A. Your standby is in Data Center B. The network link between them goes down — but both servers are still running fine.
The standby can't reach the primary. After 15 seconds of missed heartbeats, it concludes: "The primary must be dead. I'm promoting myself to primary." But the primary isn't dead. It's still running, still accepting writes from users who can reach Data Center A. Now you have two servers that both think they're the primary. Both are accepting writes. Both are making changes to the data. And those changes conflict with each other.
This is called split-brain: two servers in a cluster both believing they're the primary and making conflicting changes, like two pilots grabbing the controls and steering in different directions. It can cause permanent data corruption. Customer A updates their address on Server 1. Customer A's order ships from Server 2 using the old address. A payment is processed on Server 1 but the inventory is decremented on Server 2. The data becomes inconsistent in ways that are extremely difficult to untangle.
Active-Active Failover: Simpler (With Caveats)
Active-active failover is mechanically simpler: both servers already handle traffic, so there's no "promotion" step. If one dies, the load balancer simply stops sending it requests. The remaining server absorbs the extra load. Total failover time: however long it takes the load balancer to detect the failure — usually 5-10 seconds.
The catch is that active-active only works cleanly for stateless services or read-only workloads. For anything that writes data, you need a strategy for handling writes that arrived at both servers simultaneously. This is why most database failover is active-passive — databases are inherently stateful, and split-brain data corruption is worse than a few seconds of downtime.
Your PostgreSQL primary database is in US-East. Your standby replica is in US-West. The network link between them goes down for 30 seconds. What happens with active-passive failover? What if the standby promotes itself? What happens when the link comes back?
If the standby promotes during a temporary network blip, you get split-brain for 30 seconds. Both accept writes. When the link comes back, those conflicting writes need manual reconciliation. This is why heartbeat timeouts should be long enough to ride out brief network issues — but short enough to detect real failures.
Health Checks & Heartbeats — Detecting Failures Before Users Do
You can have the best backups in the world, but they're useless if you don't know something is broken. You can't fix what you can't see. That's where health checks and heartbeats come in — they're the monitoring systems that detect failures before your users call you about it.
Think of it this way: a health check is like a doctor asking "are you okay?" — someone else checks on you. A heartbeat is like wearing a heart monitor — you broadcast your status continuously, and if the beats stop, the monitor knows you're either dead or unreachable without having to ask. Both accomplish the same goal (knowing if something is alive), but they work in opposite directions.
Three Types of Health Checks
Not all "are you alive?" questions are equal. There are three levels, and using the wrong one can actually make things worse.
Liveness checks answer the simplest question: "Is the process running?" Can I reach the server? Does it respond at all? This is like checking if a patient has a pulse — it doesn't tell you if they're healthy, just that they're not dead. A server might be running but completely broken (stuck in an infinite loop, for example). A liveness check would still say "yes, it's alive."
Readiness checks go deeper: "Can this server actually handle traffic right now?" It's running, sure, but is it still starting up? Is it overloaded? Did it lose its database connection? A readiness check is like asking a surgeon "are you available for surgery?" — they might be alive and well but in the middle of another operation. A server that fails readiness checks stays in the pool but stops receiving new traffic until it recovers.
Deep health checks (sometimes called dependency checks) test everything: "Can this server do its actual job?" It checks the database connection, the cache, external APIs, disk space, memory — all the things the server needs to function. This is like a full physical exam. If the database is down, the health check reports unhealthy even though the server itself is fine.
The Health Check Cascade Trap
Deep health checks are the most useful but also the most dangerous. Here's why: imagine your health check endpoint queries the database to verify the connection is alive. Normally this takes 5ms. But one day the database is overloaded and responding slowly — 2 seconds per query instead of 5ms.
Now your health check takes 2 seconds. Your load balancer has a 1-second timeout for health checks. The health check times out. The load balancer concludes: "This server is unhealthy." It removes the server from the pool. But the server is fine — it's the database that's slow. Now the remaining servers get more traffic, which means more database queries, which makes the database even slower, which makes more health checks fail, which removes more servers... and suddenly your entire fleet is marked unhealthy because of a single slow database.
This is called a cascading failure — one failure triggering another, which triggers another, in a chain reaction, like dominos falling — and it turns a minor database slowdown into a complete outage of every server.
Heartbeat Patterns
Health checks are "pull" — someone asks "are you alive?" Heartbeats are "push" — the server announces "I'm alive!" without being asked. There are three main patterns for how servers communicate their status to each other.
Push heartbeats are the simplest: each server sends a periodic "I'm alive" signal to a central monitor. If the monitor stops receiving beats from a server, it marks it as dead. This works well for small clusters but has a weakness: the monitor itself is a single point of failure, a component whose failure breaks the whole detection scheme. If the monitor goes down, nobody detects failures. It's like having only one smoke detector for an entire building: if the detector breaks, no fire gets detected.
Pull health checks work the other way: a monitor (often the load balancer) periodically asks each server "are you alive?" This is how most load balancers work — they send HTTP requests to a /health endpoint every 5-10 seconds. If a server doesn't respond within the timeout (usually 2-5 seconds), the load balancer removes it from the rotation after 2-3 failures.
Gossip protocols are the most sophisticated. There's no central monitor. Instead, every server randomly picks a few peers and shares what it knows about the cluster. "Hey, Server A sent me a heartbeat 2 seconds ago, so it's alive. Server D hasn't been heard from in 30 seconds — I think it's dead." Over time, this gossip propagates to every node. The big advantage: no single point of failure. If any node goes down, the rest of the cluster still knows about it. The trade-off: slower detection (information takes time to gossip through the cluster) and more complex implementation.
Practical Design: What Should Your Health Check Do?
Here's a concrete example. Imagine you have a web API that reads from a database and caches results in Redis. Your health check strategy should look like this:
Liveness endpoint (/health/live): Return 200 immediately. Don't check anything. The only thing this tells the orchestrator is "the process hasn't crashed." If this fails, the process needs to be restarted — not just removed from the load balancer pool, but actually killed and restarted.
Readiness endpoint (/health/ready): Check that the app has finished startup, database connection pool has at least one available connection, and the server isn't overloaded (e.g., pending request count below a threshold). This should respond in under 50ms. The load balancer uses this to decide whether to send traffic.
Deep health endpoint (/health/deep): Run a test query against the database (SELECT 1), ping Redis, check disk space, verify external API connectivity. This might take 200-500ms. Use it for monitoring dashboards and alerting, but never for load balancer health checks (because of the cascade problem).
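The three endpoints can be sketched as plain functions, framework-agnostic (the parameter names and the stand-in dependency probes are mine; in a real service you'd wire them to your HTTP framework, connection pool, and actual SELECT 1 / Redis PING calls):

```python
import shutil

def health_live() -> tuple:
    """Liveness: return 200 immediately, check nothing. Failing this means restart."""
    return (200, "alive")

def health_ready(started: bool, pool_available: int, pending: int,
                 max_pending: int = 100) -> tuple:
    """Readiness: cheap local checks only; should answer in well under 50ms.
    The load balancer uses this to decide whether to send traffic."""
    if started and pool_available > 0 and pending < max_pending:
        return (200, "ready")
    return (503, "not ready")

def health_deep(db_ping, cache_ping, min_free_bytes: int = 1 << 30) -> tuple:
    """Deep check: real dependency probes. For dashboards and alerting only;
    never wire this into the load balancer (cascade risk)."""
    failures = []
    if not db_ping():                        # stand-in for a SELECT 1 round trip
        failures.append("database")
    if not cache_ping():                     # stand-in for a Redis PING
        failures.append("cache")
    if shutil.disk_usage("/").free < min_free_bytes:
        failures.append("disk")
    return (200, "healthy") if not failures else (503, ",".join(failures))

print(health_ready(started=True, pool_available=4, pending=12))          # (200, 'ready')
print(health_deep(lambda: True, lambda: False, min_free_bytes=0))        # (503, 'cache')
```

The key design choice: each level answers a different question ("restart me?", "route to me?", "is the whole stack okay?"), so each feeds a different consumer.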
Your health check endpoint queries the database with SELECT 1 and your load balancer has a 3-second timeout. The database starts responding slowly (2.5 seconds per query). What happens? Now imagine the database gets even slower (4 seconds). What changes?
Retries & Exponential Backoff — Handling Temporary Failures
Not every failure is permanent. A server might restart in 3 seconds. A network switch might flap for half a second. A queue might be temporarily full but drain in moments. These are called transient failures (temporary failures that resolve themselves without any intervention, like a busy phone line that works fine if you call back in a minute), and they're the most common kind of failure in distributed systems. The right response isn't to give up — it's to try again.
But how you try again matters enormously. Get it wrong and you'll turn a small hiccup into a full-blown outage.
The Naive Approach: Retry Immediately in a Loop
Imagine a popular API goes down for 2 seconds. It serves 10,000 clients. Every client notices the failure at roughly the same time and immediately retries. So instead of handling 10,000 normal requests per second, the API comes back to life and is instantly hit with 10,000 retry requests on top of the normal 10,000 new requests. That's 20,000 requests — double the normal load — slamming into a server that was already struggling. It goes down again. Everyone retries again. And again. This self-reinforcing storm is called the thundering herd problem (a huge number of clients all retrying at the same instant, hitting the recovering server with a burst far worse than the original load), and it can keep a system down for minutes or hours even though the original failure lasted only seconds.
The Smart Approach: Exponential Backoff with Jitter
Instead of retrying immediately, you wait a little while. And if that retry also fails, you wait longer. Each failed attempt doubles the wait time: 1 second, then 2 seconds, then 4, then 8, then 16... This is exponential backoffA retry strategy where the delay between attempts grows exponentially — typically doubling each time. This gives the failing system progressively more breathing room to recover.. It gives the struggling server progressively more breathing room to recover.
But there's still a problem. If all 10,000 clients use the same backoff schedule, they'll all retry at 1 second, then all at 3 seconds, then all at 7 seconds — synchronized waves of traffic. The fix is jitterA random variation added to the retry delay. Instead of retrying at exactly 4 seconds, one client retries at 3.7s, another at 4.3s, another at 4.1s. This spreads the retries out over time. — add some randomness to each delay. Instead of everyone retrying at exactly 4 seconds, one client retries at 3.2s, another at 4.8s, another at 5.1s. The retries spread out like a gentle wave instead of crashing like a tsunami.
The formula:
delay = min(max_delay, base × 2^attempt) + random_jitter
where:
base = the initial delay (often 1 second)
attempt = which retry this is (0, 1, 2, 3...)
random_jitter = a random value between 0 and the current delay (spreads retries out)
max_delay = a cap so you don't wait forever (often 30-60 seconds)
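A minimal Python sketch of that calculation. The function name and defaults are illustrative, not from any particular library:

```python
import random

def backoff_delay(attempt, base=1.0, max_delay=30.0):
    """Seconds to wait before retry number `attempt` (0-indexed).

    The exponential part doubles each attempt and is capped at
    max_delay; the jitter is a random value between 0 and the
    current (capped) delay, so clients spread out instead of
    retrying in synchronized waves.
    """
    exponential = min(max_delay, base * (2 ** attempt))
    jitter = random.uniform(0, exponential)
    return exponential + jitter
```

In a retry loop you'd call `time.sleep(backoff_delay(attempt))` between attempts. There are several jitter variants (full jitter, equal jitter); the choice matters less than having some randomness at all.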
When to Retry — and When NOT To
Not every error deserves a retry. The key question is: could this request succeed if I try again without changing anything? A 503 (Service Unavailable) or a connection timeout? Probably yes — the server might be back in a moment. A 400 (Bad Request) or 404 (Not Found)? Absolutely not — if your request was malformed the first time, it's still malformed the second time. Retrying would just waste resources.
Idempotency: The Safety Net for Retries
Here's a critical concept: retrying is only safe if the operation is idempotentAn operation where doing it once has the exact same result as doing it twice (or ten times). GET a web page? Always safe to retry. Transfer $100? Dangerous — you might transfer $200 if you retry. — meaning doing it twice produces the same result as doing it once. If you ask "what's the current price of this item?" ten times, you get the same answer ten times. Safe. If you ask "charge this credit card $50" ten times, you charge $500. Very not safe.
GET requests are naturally idempotent — reading data doesn't change anything. PUT and DELETE are idempotent by design — "set the price to $20" always results in the price being $20 regardless of how many times you say it. POST is the dangerous one — "create a new order" run twice means two orders. For POST operations, you need an idempotency keyA unique identifier (often a UUID) that the client sends with the request. The server checks: "Have I already processed a request with this key?" If yes, it returns the previous result instead of processing it again. — a unique ID sent with the request so the server can detect and ignore duplicates.
Circuit Breakers — Stop the Bleeding
Your house has circuit breakers in the electrical panel. If too much current flows through a wire — maybe a short circuit, maybe too many appliances on one outlet — the breaker trips and cuts the power to that circuit. It's not a fix. The broken toaster is still broken. But the breaker stops the broken toaster from causing an electrical fire that burns down the whole house.
Software circuit breakers work the exact same way. When a service you depend on is failing, the circuit breaker stops sending requests to it. Not to fix the broken service, but to protect everything else from being dragged down with it.
Why You Need This: The Cascade Problem
Picture this: your app (Service A) calls a payment service (Service B) on every checkout. Service B gets slow — maybe a database issue — and starts taking 30 seconds per request instead of 200ms. What happens?
Every thread in Service A that calls Service B is now stuck waiting for 30 seconds. You have 100 threads. Within minutes, all 100 threads are blocked, waiting for a service that isn't responding. Now Service A can't handle any requests — not even the ones that don't need Service B. Your entire checkout flow, your homepage, your search — everything is frozen.
And it gets worse. If Service C depends on Service A? Now Service C is also stuck waiting. One slow database in one service has taken out three services. This is called a cascading failureWhen one service's failure causes another service to fail, which causes another to fail, and so on — like dominoes toppling. A single point of failure ripples through the entire system., and it's one of the most common causes of total system outages.
The Three States of a Circuit Breaker
A circuit breaker sits between your service and the service you're calling. It watches every request and tracks how many succeed and how many fail. It has three states:
CLOSED (normal operation): Requests flow through normally. The breaker is monitoring success/failure rates in the background. Think of it like a closed electrical circuit — current flows freely.
OPEN (broken — fast-fail mode): Too many requests have failed (say, 50% failure rate in the last 30 seconds). The breaker "trips." Now, instead of sending requests to the broken service and waiting 30 seconds for a timeout, it immediately returns an error. No waiting, no blocked threads. Your service stays healthy and can serve a fallback response (like "payments temporarily unavailable") instead of going completely unresponsive.
HALF-OPEN (testing — is it fixed yet?): After a cooldown period (say, 60 seconds), the breaker lets one test request through. If it succeeds, the breaker closes and normal traffic resumes. If it fails, the breaker opens again and waits another cooldown period. This is the "try flipping the breaker back on" moment.
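Putting the three states together, a toy circuit breaker might look like this. As simplifying assumptions, it counts consecutive failures rather than a failure rate, and it takes an injectable clock so it can be tested; production libraries (resilience4j, pybreaker, and friends) handle the details for you:

```python
import time

class CircuitBreaker:
    """Sketch of a CLOSED / OPEN / HALF_OPEN circuit breaker."""

    def __init__(self, failure_threshold=5, cooldown=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"   # let one probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe in HALF_OPEN re-opens immediately.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

The key behavior is in the OPEN branch: callers get an instant error instead of a 30-second timeout, so their threads stay free to serve everything else.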
Circuit Breakers vs Retries
Retries and circuit breakers are complementary, not competing. Think of it this way: retries handle hiccups (a single request failed, try again). Circuit breakers handle meltdowns (the whole service is down, stop trying). In practice, you use both: retry a few times for transient errors, but if the circuit breaker detects sustained failure, it stops all retries and fails fast.
Graceful Degradation — Keep Working, Just Not Perfectly
When something breaks, you have two choices. Option one: the whole system crashes and users see an error page. Option two: the system keeps running, but with reduced functionality — maybe recommendations are missing, maybe search results are slightly stale, maybe images don't load. Which would you rather have?
That's graceful degradationA design strategy where a system continues to provide core functionality (even if reduced) when some component fails, rather than failing completely. — the art of breaking less badly. It means planning ahead so that when (not if) something fails, the system drops non-essential features while keeping the critical path alive.
Real-World Examples
Netflix's recommendation engine crashes? Instead of showing a blank screen, they show the "Top 10 in your country" — a static, pre-computed list that doesn't need the recommendation service at all. Users might not get personalized picks, but they can still browse and watch.
Amazon's search gets overloaded during a Prime Day sale? They serve cached search results from 5 minutes ago instead of real-time results. The prices might be slightly out of date, but customers can still find products and shop. Much better than showing "Search Unavailable."
Twitter can't load images? They show text-only tweets. The experience is worse, but the core functionality — reading and posting tweets — still works.
Degradation Tiers
Smart teams plan their degradation as a series of tiers — like dialing down a dimmer switch instead of flipping the lights off all at once. Each tier sacrifices a little more functionality to keep the core experience alive under increasing pressure.
Feature Flags for Degradation
You don't want to figure out what to turn off during an outage at 2 AM. That's panic-mode engineering. Instead, pre-wire feature flagsConfiguration switches (often in a central config service like LaunchDarkly or AWS AppConfig) that let you enable or disable features instantly without deploying new code. Think of them as light switches for features. — "kill switches" that let you disable specific features with a single toggle. Before an incident happens, you've already decided: "If the recommendation service goes down, flip this flag and show trending content instead."
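A minimal sketch of such a kill switch. The flag store here is a plain dict for illustration; a real system reads flags from a config service so they flip without a deploy:

```python
# Hypothetical flag store -- in production this would be backed by a
# config service (LaunchDarkly, AWS AppConfig, etc.).
FLAGS = {"personalized_recommendations": True}

# Pre-computed, dependency-free fallback content.
TRENDING = ["Top 10 in your country"]

def get_recommendations(user_id, fetch_personalized):
    """Return personalized picks, degrading to trending content
    when the flag is off or the recommendation service fails."""
    if not FLAGS["personalized_recommendations"]:
        return TRENDING                 # kill switch flipped
    try:
        return fetch_personalized(user_id)
    except Exception:
        return TRENDING                 # service down: same fallback
```

Note the same fallback serves two paths: a deliberate toggle during an incident, and an automatic catch when the dependency throws.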
Load Shedding: Triaging Under Pressure
When your system is overwhelmed, you can't serve everyone equally — trying to do so means everyone gets a terrible experience. Load sheddingThe practice of intentionally dropping some requests during overload so that the remaining requests can be served properly. Like an emergency room prioritizing critical patients over minor injuries. is the practice of intentionally rejecting some requests so that the rest can be served properly. It's like an emergency room during a mass-casualty event — you triage. Critical patients (checkout, payments) get served first. Less urgent cases (browsing, recommendations) get told to come back later.
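A toy load-shedding gate. The 80% threshold and the critical set are made-up values for illustration; real systems often shed by priority tiers and measured queue depth:

```python
# Request types that must keep working even under extreme load.
CRITICAL = {"checkout", "payment"}

def admit(request_type, current_load, capacity):
    """Decide whether to serve a request right now.

    Below 80% of capacity, everything is served. Above that,
    only critical request types get through; the rest are
    rejected quickly (e.g., with a 503 + Retry-After).
    """
    if current_load < 0.8 * capacity:
        return True
    return request_type in CRITICAL
```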
Blast Radius — Limiting the Damage
When a bomb goes off, the damage depends on how close you are. A firecracker breaks a window. A stick of dynamite collapses a room. A missile levels a building. The area of destruction is called the blast radiusIn infrastructure, the blast radius is the scope of impact when something fails. A single-server failure has a small blast radius. A regional failure has a massive one. The goal is to architect systems so that any single failure has the smallest possible blast radius. — and in software, it's the single most important question you can ask about any failure: how much of my system is affected?
If one server crashes behind a load balancer with 10 servers, the blast radius is 10% — 9 out of 10 servers still handle traffic. If an entire availability zoneAn availability zone (AZ) is an isolated data center (or cluster of data centers) within a cloud region. AWS, for example, has regions like us-east-1 with 3-6 AZs each. AZs within a region have independent power and networking but are connected by low-latency links. goes down and you're running in 3 AZs, the blast radius is 33%. If an entire region goes down and you're only in one region? The blast radius is 100%. You're completely offline.
The goal of blast radius engineering is simple: make sure no single failure can take down everything.
Strategies to Shrink the Blast Radius
Old wooden ships had one big hull. A single hole sank the whole ship. Modern ships are divided into watertight compartmentsSealed sections of a ship's hull. If one compartment floods, the watertight doors contain the water to that section, keeping the rest of the ship afloat. The Titanic had 16 compartments — but 5 flooded simultaneously, which exceeded its design limit. (bulkheads). A hole in one compartment floods only that section — the rest of the ship stays afloat.
In software, bulkheads mean isolating resources for different functions. Give your checkout service its own thread pool, database connection pool, and circuit breaker — separate from your search service. If search goes haywire and exhausts its thread pool, checkout is completely unaffected because it has its own isolated resources.
Instead of one big system serving all users, you divide users into independent cellsA self-contained unit of infrastructure that serves a subset of users. Each cell has its own servers, databases, and caches — completely independent from other cells. If one cell fails, only the users assigned to that cell are affected.. Each cell has its own servers, its own database, its own cache — completely independent. User IDs 1-100,000 go to Cell A. User IDs 100,001-200,000 go to Cell B. If Cell A's database crashes, only 100,000 users are affected. The other 900,000 never notice.
AWS uses this architecture internally. Their control plane is divided into cells, so a bug that crashes one cell doesn't cascade to all cells. It limits the blast radius by design.
Regular sharding assigns Customer A to Shard 1 and Customer B to Shard 1. If Shard 1 goes down, both lose service. Shuffle shardingA technique where each customer is assigned to a random subset of resources (e.g., 2 out of 8 shards). The probability of two customers sharing ALL the same resources becomes very small — so one customer's bad behavior is unlikely to affect another. assigns each customer to a random subset of resources. Customer A gets Shards 1 and 4. Customer B gets Shards 2 and 7. The chance of two customers sharing ALL the same shards is tiny. Even if one customer sends a massive traffic spike that takes out their shards, most other customers are unaffected because they're on different shard combinations.
With 8 shards and each customer using 2, there are 28 possible shard combinations. The odds of two random customers sharing the same pair? About 3.6%. That's powerful isolation without the cost of giving each customer dedicated infrastructure.
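You can verify that combinatorics in a few lines (C(8,2) = 28 pairs, so two random customers collide on both shards with probability 1/28):

```python
from itertools import combinations

shards = range(8)
pairs = list(combinations(shards, 2))   # every possible 2-shard assignment

print(len(pairs))                        # 28
print(round(1 / len(pairs) * 100, 1))    # 3.6 (% chance of full overlap)
```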
SLA, SLO, SLI — Measuring Reliability with Numbers
You can't improve what you can't measure. Saying "our system should be reliable" is like saying "I want to be healthier" — it sounds nice but means nothing concrete. Reliability needs numbers. Specific, measurable, time-bound numbers. That's where three related but distinct terms come in: SLI, SLO, and SLA.
Think of it as a pyramid. At the bottom, you have raw measurements. In the middle, you have targets. At the top, you have a legally binding contract with financial consequences.
SLI — Service Level Indicator (The Measurement)
An SLIService Level Indicator — a concrete, quantitative metric that measures one aspect of your service's reliability. Examples: "percentage of requests that returned a 2xx status code" or "percentage of requests completed in under 200ms." is the raw metric — the number you actually measure. It answers: "How is this specific thing performing right now?"
Examples: "What percentage of HTTP requests got a successful response (2xx) in the last 5 minutes?" Or: "What percentage of page loads completed in under 2 seconds?" Or: "What percentage of database queries finished within 50ms?" These are all SLIs. They're objective, measurable facts about your system's behavior.
SLO — Service Level Objective (The Target)
An SLOService Level Objective — a target value (or range) for an SLI. It's an internal goal your team sets. Example: "99.9% of requests should return a successful response" or "p99 latency should be under 500ms." is the target you set for an SLI. It's your internal goal — the line in the sand that says "below this, we're not meeting our own standards." You pick the SLI (request success rate), and you set a target (99.9%).
SLOs are internal — they're for your engineering team, not your customers. They drive decisions: "Our SLO is 99.9% success rate. We're currently at 99.7%. We should probably hold off on deploying that risky new feature until we fix the existing reliability issues."
SLA — Service Level Agreement (The Contract)
An SLAService Level Agreement — a formal contract between a service provider and its customers that specifies what level of service is guaranteed, and what compensation (usually credits) the customer receives if the provider fails to meet it. is the contract — usually between you and your customers — that says: "We promise at least this level of service. If we fail, here's what we'll give you in return." It's a business and legal commitment with financial consequences.
AWS S3, for example, promises 99.9% availability. If they drop below that in a billing month, customers get a 10% service credit. Below 99.0%? 25% credit. Below 95%? 100% credit. That's real money on the line, which is why SLAs are always set lower than SLOs — your internal target (SLO) should be stricter than your customer promise (SLA) so you have a safety margin.
The Error Budget — Your License to Break Things
Here's where it gets clever. If your SLO is 99.9% availability, that means you're "allowed" to be unavailable 0.1% of the time. That's your error budgetThe amount of allowable unreliability built into your SLO. If your SLO is 99.9%, your error budget is 0.1%. You "spend" this budget on deployments, experiments, maintenance, and unexpected failures. When the budget is exhausted, you freeze changes. — and it's not a bad thing. It's a feature. That 0.1% budget is what allows you to deploy new features (which might briefly cause errors), run experiments, and perform maintenance.
The math is straightforward:
Total minutes in 30 days: 30 × 24 × 60 = 43,200 minutes
Error budget: 0.1% × 43,200 = 43.2 minutes of downtime allowed per month
For 1 million requests per day:
Error budget: 0.1% × 1,000,000 = 1,000 failed requests allowed per day
When your error budget is consumed: freeze deployments. Focus entirely on reliability until the budget resets.
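The budget math above, as a quick calculator:

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime minutes over a window, given an availability SLO."""
    total_minutes = days * 24 * 60          # 43,200 for 30 days
    return (1 - slo) * total_minutes

def error_budget_requests(slo, daily_requests):
    """Allowed failed requests per day, given a success-rate SLO."""
    return (1 - slo) * daily_requests

print(round(error_budget_minutes(0.999), 1))          # 43.2
print(round(error_budget_requests(0.999, 1_000_000)))  # 1000
```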
Common SLOs for Different Service Types
| Service Type | Typical SLO | Allowed Downtime/month |
|---|---|---|
| Internal tools | 99.0% | ~7 hours 12 min |
</gr-replace>
| Standard web app | 99.9% | ~43 min |
| E-commerce checkout | 99.95% | ~22 min |
| Payment processing | 99.99% | ~4.3 min |
| DNS / Auth | 99.999% | ~26 sec |
Latency SLOs get their own targets, usually stated as percentiles:

| Service Type | p50 Target | p99 Target |
|---|---|---|
| API endpoint | < 100ms | < 500ms |
| Web page load | < 1s | < 3s |
| Search results | < 200ms | < 1s |
| Database query | < 10ms | < 100ms |
| Real-time messaging | < 50ms | < 200ms |
p50 = the median (50th percentile). p99 = the worst-case for 99% of requests. The p99 matters more than the average — 1% of your users having a terrible experience at scale is still thousands of angry people.
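A small demonstration of why the p99 matters more than the average, using the simple nearest-rank percentile method (real monitoring systems use histograms or sketches rather than sorting raw samples):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p% * n)."""
    s = sorted(samples)
    rank = math.ceil(p / 100 * len(s))
    return s[rank - 1]

# 98 fast requests, 2 terrible ones.
latencies = [100] * 98 + [5000] * 2

print(sum(latencies) / len(latencies))   # 198.0 -- the average hides it
print(percentile(latencies, 99))         # 5000  -- the p99 exposes it
```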
Disaster Recovery — When Everything Goes Wrong
Everything we've talked about so far — retries, circuit breakers, graceful degradation — handles normal failures. A server crashes, a network blips, a service slows down. These happen daily and your system should shrug them off automatically.
But sometimes the failure isn't normal. An entire data center loses power. A region goes offline because an undersea cable gets cut. Ransomware encrypts your production database and all your replicas (because the replicas were on the same network). A well-meaning engineer runs a migration script against production instead of staging. These are disastersCatastrophic failures that exceed normal fault-tolerance mechanisms — events like full data center outages, regional failures, data corruption, ransomware attacks, or large-scale human errors that destroy critical data. — events that exceed the normal "one server died" fault tolerance built into your daily operations.
Disaster recovery (DR) is your plan for surviving these events. And it comes down to two numbers that every engineer should know.
RPO and RTO — The Two Numbers That Define Your DR
RPO — Recovery Point ObjectiveHow much data can you afford to lose when disaster strikes? RPO = 0 means zero data loss (requires synchronous replication). RPO = 1 hour means you can tolerate losing up to 1 hour of data (hourly backups are sufficient).: How much data can you afford to lose? If your RPO is 1 hour, you need backups (or replicas) that are at most 1 hour old. If disaster strikes, you'll lose up to 1 hour of recent data — and that's acceptable for your business. RPO = 0 means zero data loss, which requires synchronous replication (every write is confirmed on a remote copy before the system acknowledges it).
RTO — Recovery Time ObjectiveHow quickly do you need to be back online after a disaster? RTO = 4 hours means you can afford to be completely down for up to 4 hours. RTO near 0 means instant failover to another site.: How quickly do you need to be back online? If your RTO is 4 hours, you can afford to be completely down for up to 4 hours while you restore from backups and spin up new infrastructure. If your RTO is near zero, you need a hot standby site that can take over instantly.
The relationship is simple: lower RPO and RTO = more expensive. Zero data loss + instant failover requires fully synchronized infrastructure in multiple locations, running 24/7. That costs 2-3x more than a single-site setup. Higher RPO and RTO = cheaper but riskier.
The Four DR Strategies
There are four standard approaches to disaster recovery, ranked from cheapest (and slowest to recover) to most expensive (and fastest). Your choice depends on your RPO, RTO, and budget.
Strategy 1: Backup and Restore
You take regular backups (database dumps, filesystem snapshots) and store them somewhere safe — typically a different region. When disaster strikes, you spin up new infrastructure and restore from the latest backup. It's like keeping a copy of your house's blueprints and furniture photos in a safe deposit box. If the house burns down, you can rebuild it — but it takes a while.
RTO: Hours to days (depending on data size and infrastructure complexity)
RPO: Last backup time (if you back up daily, you could lose up to 24 hours of data)
Cost: Very low — you only pay for backup storage
Best for: Non-critical systems, development/staging environments, cost-constrained startups
Strategy 2: Pilot Light
Your database is continuously replicated to a DR site (another region), but the application servers and other infrastructure exist only as launch templates — they're not running. Think of it like a gas pilot light on a water heater: the flame is always on (the database replica), so when you need heat (full recovery), it ignites quickly instead of starting from scratch.
RTO: 10-30 minutes (time to spin up app servers and update DNS)
RPO: Near-zero for database (continuous replication), but application state may have gaps
Cost: Moderate — you pay for the replicated database 24/7, but compute is only provisioned during recovery
Best for: Business-critical apps that can tolerate 15-30 minutes of downtime
Strategy 3: Warm Standby
A scaled-down but fully functional copy of your production environment runs in the DR site at all times. Everything works — it's just running at maybe 10-20% of production capacity. When disaster strikes, you scale it up to full capacity and redirect traffic. Since everything is already running, the switchover takes minutes, not tens of minutes.
RTO: Minutes (just scale up and switch DNS)
RPO: Near-zero (continuous replication)
Cost: High — you're running a second environment 24/7, even if it's smaller
Best for: Revenue-critical services where every minute of downtime costs significant money
Strategy 4: Multi-Site Active-Active
Both sites are fully active, serving real production traffic simultaneously. Users in the US hit the US site, users in Europe hit the EU site. If one goes down, the other absorbs all traffic. There's no "failover" because both sites are already active — you just stop routing to the dead one.
RTO: Near-zero (seconds — just a DNS/routing change)
RPO: Near-zero (bidirectional replication, though conflict resolution adds complexity)
Cost: Very high — 2x or more the infrastructure cost of a single site, plus the engineering complexity of multi-region data consistency
Best for: Global services where any downtime is unacceptable (payments, messaging, critical infrastructure)
The Backup Testing Rule
Here's a truth that bites companies every year: a backup you haven't tested is not a backup. It's a hope. Backups can silently fail — corrupted files, missing tables, wrong permissions, incompatible formats. You only discover these problems when you actually try to restore, which is exactly the worst possible time to discover them.
The fix: schedule regular restore drills. Once a quarter, take your latest backup, restore it to a fresh environment, and verify that the data is complete and the application works. If you can't restore successfully in a calm test environment, you certainly won't succeed during a real disaster at 3 AM with your hands shaking.
Real-World Incidents — Famous Outages and What We Learned
Theory is great, but nothing drives the lessons home like real disasters that cost real companies real money. Every incident below follows the same depressing pattern: the failure was preventable, the safeguard existed but wasn't tested, or the blast radius wasn't limited. Let's walk through five of the most famous outages in modern tech history and see exactly what went wrong — and what you should do differently.
AWS S3 Outage (February 2017)
While debugging S3's billing system, an engineer ran a routine maintenance command with a mistyped parameter and removed far more servers than intended, taking down core S3 subsystems in the us-east-1 region.
The cascade was brutal. S3 is the backbone of the internet's static content. When S3 went down, so did Slack, Trello, Quora, IFTTT, parts of Docker Hub, and thousands of websites that serve images and files from S3. Even Amazon's own status dashboard was hosted on S3, so the page that was supposed to tell everyone "we're having problems" was itself down. The irony was not lost on anyone.
The outage lasted nearly 5 hours. Analysts estimated that S&P 500 companies alone lost roughly $150 million. And all because one person mistyped a parameter in a maintenance command with no guardrails.
GitLab Database Incident (January 2017)
Late at night, while troubleshooting a replication problem under pressure, an engineer ran rm -rf on what they thought was the staging database directory. It was the production database. 300 GB of data — gone in seconds.
Then the real horror began. GitLab had five different backup methods configured. Five! And not a single one was working properly. The daily database dumps hadn't been running because of a configuration error. The replication to a secondary had been turned off. The automated snapshots were failing silently. The point-in-time recovery via WAL archiving had a misconfigured path. The final safety net — a nightly backup to a remote server — was the only one still partially functional, but it was 6 hours stale.
GitLab ended up losing 6 hours of production data affecting around 5,000 projects, 5,000 comments, and 700 merge requests. They live-streamed their recovery on YouTube — one of the most transparent post-mortems in tech history.
Facebook BGP Outage (October 2021)
During routine maintenance, a command accidentally withdrew the BGP routes that tell the rest of the internet where to find Facebook's servers. Facebook, Instagram, and WhatsApp disappeared from the internet for roughly six hours.
Here's the part that made this outage legendary: Facebook's internal tools for diagnosing and fixing network problems also ran on Facebook's network. The engineers who needed to fix the BGP routes couldn't reach the systems that manage BGP routes. They couldn't even badge into the data centers at first because the door access system relied on Facebook's network. Engineers literally had to be dispatched to data centers to gain physical access to the routers and manually reconfigure them.
Cloudflare Regex Outage (July 2019)
A new WAF (web application firewall) rule contained a regular expression prone to catastrophic backtracking, which drove CPU usage to 100% on every machine that evaluated it.
Cloudflare operates in over 200 cities across 100+ countries. Every edge server runs the same WAF rules. So when that regex was deployed globally, it hit every server at the same time. Millions of websites behind Cloudflare — including major services — returned 502 errors for about 27 minutes. The fix was conceptually simple (revert the bad rule), but because the servers were pegged at 100% CPU, even deploying the revert was slow.
Knight Capital (August 2012)
A deployment pushed new trading code to only seven of the firm's eight servers. The eighth still ran old code behind a repurposed feature flag, and when the flag was switched on, that dead code came back to life and began firing orders into the market.
In 45 minutes, the old code bought and sold stocks worth billions of dollars, racking up $440 million in losses. The firm had no automated kill switch. By the time humans realized what was happening and manually stopped the system, the damage was irreversible. Knight Capital went from a healthy company to bankrupt practically overnight. They were acquired for pennies on the dollar.
Chaos Engineering — Break It On Purpose
Here's a question that sounds crazy until you think about it: what if you broke your own system on purpose?
Think about fire drills. Nobody wants a fire in their building. But every school, every office, every hospital runs fire drills regularly. Why? Because discovering that the emergency exit is locked during an actual fire is the worst possible time to learn that. You want to find that problem on a calm Tuesday afternoon when everyone's awake and the fire department is on standby.
That's exactly the philosophy behind chaos engineeringThe practice of deliberately injecting failures into a system — killing servers, slowing networks, filling disks — to verify that the system handles them gracefully. Pioneered by Netflix in 2011 with their famous "Chaos Monkey" tool.. You will have failures. The hard drives in your servers will die. The network between your data centers will hiccup. A dependency you rely on will go down. The question is: do you discover your weaknesses at 3 AM during a real outage, or at 2 PM on Tuesday when your entire team is ready, the coffee is fresh, and the incident response channel is already open?
Netflix pioneered this approach with a tool called Chaos MonkeyA tool built by Netflix that randomly terminates virtual machine instances in production during business hours. The philosophy: if a random server dying can break your service, you need to know about it now — not during a real outage. Netflix later expanded this into the "Simian Army" with tools like Chaos Gorilla (kills an entire availability zone) and Latency Monkey (adds artificial network delays).. During business hours, Chaos Monkey randomly kills production servers. Not staging. Not test environments. Production. If your service can't handle a random server dying at 2 PM when everyone's watching, it definitely can't handle it at 3 AM when nobody is. Netflix's reasoning was simple: they wanted to make individual server failures so routine that their systems handled them automatically without any human intervention.
The Chaos Engineering Process
Chaos engineering isn't just randomly pulling cables and hoping for the best. It's a disciplined, scientific process with five steps:
Step 1: Define steady state. Before you break anything, measure what "normal" looks like. What's your p99 latency? What's your error rate? How much CPU are your servers using? You need a baseline so you can tell whether the experiment caused a problem.
Step 2: Hypothesize. Make a specific prediction: "If we kill 2 of our 6 API servers, the load balancer will redistribute traffic to the remaining 4, latency will increase by no more than 50ms, and zero requests will fail." Write this down. A vague "it should be fine" is not a hypothesis.
Step 3: Inject the failure. Actually do it. Kill the servers, introduce network latency, fill a disk to 95%, whatever your experiment calls for.
Step 4: Observe. Did reality match your hypothesis? Did latency stay within bounds? Did the load balancer actually reroute traffic? Did the alerts fire? Look at your dashboards and compare actual behavior to predicted behavior.
Step 5: Fix the gaps. If your hypothesis was wrong — if latency spiked to 2 seconds instead of 50ms — that's a success. You found a weakness at 2 PM on Tuesday instead of 3 AM during Black Friday. Now fix it.
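The five steps can be simulated end to end. Here the "service" is a fake function and the injected fault is a hard-coded latency bump, purely to illustrate the baseline, hypothesis, inject, observe loop:

```python
import random
import statistics

def service(latency_ms=100, jitter=10):
    """Stand-in for a real dependency call (pure simulation)."""
    return latency_ms + random.uniform(-jitter, jitter)

def run_experiment():
    # Step 1: define steady state -- measure normal latency.
    baseline = statistics.mean(service() for _ in range(1000))
    # Step 2: hypothesize, in writing, before breaking anything.
    hypothesis_max = baseline + 50   # "latency rises by no more than 50ms"
    # Step 3: inject the failure -- degrade the dependency.
    degraded = statistics.mean(service(latency_ms=120) for _ in range(1000))
    # Step 4: observe -- did reality match the hypothesis?
    within_bounds = degraded <= hypothesis_max
    # Step 5: fix the gaps (here we just report the verdict).
    return baseline, degraded, within_bounds
```

In a real experiment, steps 1 and 4 come from your monitoring dashboards and step 3 from a fault-injection tool; the structure is the same.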
Types of Chaos Experiments
Common experiments, roughly in order of increasing blast radius: kill a single process or server (does the load balancer reroute?), add network latency between services (do your timeouts and circuit breakers behave?), fill a disk or exhaust memory (do you degrade gracefully or crash?), make a dependency return errors (does the fallback kick in?), and take out an entire availability zone (does multi-AZ failover actually work?).
Game Days
A Game Day is a scheduled event where your entire team practices responding to failures. Think of it as a fire drill for your infrastructure. You pick a scenario — "the primary database goes down" or "we lose an entire availability zone" — and actually simulate it, usually in production or a production-like environment. The team follows their runbooks, uses their dashboards, and communicates through their incident channels. Every stumble — a missing runbook, a broken alert, a confusing dashboard — gets documented and fixed.
Companies like Google, Amazon, and Netflix run Game Days regularly. At some companies they're even surprise events — the SRE team injects a failure without warning the on-call team, to test real-world response times and procedures.
Idempotency — Safe to Retry
You tap "Pay Now" on your phone. The screen spins for 10 seconds... then shows a timeout error. Did the payment go through, or didn't it? You have no idea. If you tap "Pay Now" again and it did already go through, you just got charged twice. If it didn't go through and you don't retry, your order never processes. This is the fundamental problem that idempotencyA fancy word from mathematics that means "doing something multiple times has the same effect as doing it once." In APIs, an idempotent operation can be safely retried without causing duplicate side effects — like double charges or duplicate orders. solves.
The word comes from math: an operation is idempotent if doing it once produces the same result as doing it twice, or ten times, or a hundred times. Setting your thermostat to 72 degrees is idempotent — whether you press the button once or mash it 50 times, the temperature is still 72. Incrementing a counter is not idempotent — doing it 50 times leaves the counter at 50 instead of 1.
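A minimal sketch of the difference, using the thermostat and counter examples above (the `set_thermostat` and `add_ten` helpers are illustrative, not a real API):

```python
def set_thermostat(state, temp):
    """Idempotent: an absolute set. Repeating it changes nothing after the first time."""
    state["temp"] = temp

def add_ten(balance):
    """Not idempotent: every repeat compounds the effect."""
    balance["amount"] += 10

thermostat = {}
for _ in range(50):
    set_thermostat(thermostat, 72)    # mash the button 50 times
assert thermostat["temp"] == 72       # same result as pressing it once

account = {"amount": 100}
add_ten(account)                      # the real request
add_ten(account)                      # an accidental retry
assert account["amount"] == 120       # charged twice: the duplicate did damage
```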
Which Operations Are Naturally Idempotent?
Some operations are safe to repeat by their very nature. Others need special engineering to make them safe:
Naturally idempotent:
- Reading data (GET) — asking "what's my balance?" ten times always returns the same balance. No side effects.
- Absolute set (PUT with full replacement) — "set the user's name to Alice" is the same whether you do it once or ten times. The name is still Alice.
- Delete by ID — "delete order #4521" is a no-op if order #4521 is already deleted. The end state is the same: order #4521 doesn't exist.
Not idempotent without extra engineering:
- Creating a resource (POST) — "create a new order" twice creates two orders. Now you have a duplicate.
- Increment operations — "add $10 to the balance" twice adds $20. The counter went up by double.
- Append operations — "add item to cart" twice gives you two of the same item.
- Send notifications — "send confirmation email" twice sends two emails. Your user gets annoyed.
The Problem: Network Uncertainty
When a request times out, the client can't tell which of three things happened: the request never reached the server, the server processed it but the response was lost, or the server is still working on it. You're stuck. Without idempotency, both options are bad — retry and risk a double charge, or don't retry and risk a lost order.
The Solution: Idempotency Keys
The fix is elegant. Before the client sends the request, it generates a unique ID — called an idempotency keyA unique identifier (usually a UUID like 550e8400-e29b-41d4-a716-446655440000) that the client attaches to every mutating request. The server uses this key to detect duplicate requests: "I've already processed this key, so I'll return the cached result instead of processing it again." — and attaches it to the request. The server checks: "Have I seen this key before?" If yes, it returns the cached result from the first time without processing the operation again. If no, it processes the operation and stores the result keyed by that ID.
Now the client can safely retry as many times as it wants. First attempt goes through? The retries are no-ops that return the same result. First attempt didn't go through? The retry processes it for the first time. Either way, the operation happens exactly once.
Implementation: The Database Approach
The most common approach is dead simple. You add a column to your database with a unique constraint on the idempotency key. When a request comes in, you try to insert a row with that key. If the insert succeeds (the key is new), you process the operation and store the result. If the insert fails because of a unique constraint violation, you know this is a duplicate — just look up the stored result and return it.
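Here's a sketch of that pattern, using SQLite as a stand-in for your database; the `payments` table and `charge` function are hypothetical:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE payments (
    idempotency_key TEXT PRIMARY KEY,   -- the unique constraint does the work
    result TEXT
)""")

def charge(key, amount):
    """Process a payment at most once per idempotency key."""
    try:
        # Try to claim the key; a duplicate violates the PRIMARY KEY constraint.
        db.execute("INSERT INTO payments VALUES (?, ?)", (key, f"charged ${amount}"))
        db.commit()
        # ...actually call the payment processor here, then store its result...
        return f"charged ${amount}"
    except sqlite3.IntegrityError:
        # Seen this key before: return the stored result, charge nothing.
        return db.execute("SELECT result FROM payments WHERE idempotency_key = ?",
                          (key,)).fetchone()[0]

key = str(uuid.uuid4())   # the client generates the key BEFORE sending
first = charge(key, 30)
retry = charge(key, 30)   # timeout? retry safely with the same key
assert first == retry == "charged $30"
assert db.execute("SELECT COUNT(*) FROM payments").fetchone()[0] == 1   # one charge
```

In production you'd typically insert a "pending" row first, process the operation, then store the result — so a crash mid-operation doesn't cache a half-finished state.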
Distributed Consensus — Agreeing When Networks Split
When all your data lives on one database server, "truth" is easy — whatever that server says is the truth. But the moment you replicate that data to multiple servers (for redundancy, speed, or both), you create a hard problem: what happens when those servers disagree?
Imagine you have three database servers holding account balances. A user transfers $20 from checking to savings. Server A processes the transfer. But before Server A can tell Servers B and C about it, the network cable between them gets unplugged (or more realistically, a router has a firmware bug). Now Server A says the checking balance is $80, but Servers B and C still say $100. Which is correct? If a user checks their balance on Server B, they see $100 — and might try to spend money that's already been transferred. This is the split-brain problemWhen servers in a cluster can't communicate with each other and each "half" starts acting independently, potentially making conflicting decisions. Like two managers both thinking they're in charge because they can't reach each other. It's triggered by a "network partition" — a break in connectivity between nodes., and it's one of the most fundamental challenges in distributed systems.
The CAP Theorem — You Can't Have It All
In 2000, computer scientist Eric Brewer proposed something called the CAP theoremBrewer's theorem states that a distributed system can provide at most two of three guarantees simultaneously: Consistency (every read returns the latest write), Availability (every request gets a response), and Partition tolerance (the system works despite network failures). Since network partitions are inevitable, the real choice is between consistency and availability.. It says that when your network splits (and it will — this is not optional), you have to choose between two guarantees:
Consistency (C) means every read returns the most recent write. If Alice just transferred $20, anyone who checks the balance — on any server — sees the updated amount. No stale data, ever.
Availability (A) means every request gets a response. Even if some servers are unreachable, the system keeps answering queries. It might return slightly stale data, but it never says "sorry, I can't help you right now."
Partition Tolerance (P) means the system keeps working even when the network between nodes is broken. Since real networks do partition (routers fail, cables get cut, clouds have outages), P is not optional in any distributed system. That leaves you choosing between C and A.
Banks and financial systems typically choose CP — they'd rather temporarily refuse requests than give you a wrong balance that lets you overdraw. Social media and shopping carts typically choose AP — they'd rather show you a slightly stale news feed than show nothing at all.
Raft — A Consensus Algorithm You Can Actually Understand
Okay, so servers need to agree on truth. But how do they agree, especially when some of them can't communicate? That's the job of consensus algorithmsProtocols that allow a group of servers to agree on a value (like "the balance is $80") even when some servers are slow, unreachable, or have crashed. The most famous ones are Paxos (notoriously hard to understand), Raft (designed to be understandable), and ZAB (used by ZooKeeper)..
The first famous algorithm, PaxosInvented by Leslie Lamport in 1989. It's mathematically proven correct and extremely powerful, but so hard to understand that Lamport himself said most people find the description "greek to them" (he wrote the paper as a story about a Greek island). Most production implementations use Raft instead because it's far easier to reason about., is correct but notoriously difficult to understand — even experts struggle with it. So in 2014, Diego Ongaro and John Ousterhout created RaftA consensus algorithm specifically designed to be understandable. It breaks the problem into three sub-problems: leader election, log replication, and safety. Used in etcd (Kubernetes), CockroachDB, TiKV, and Consul., an algorithm that solves the same problem but was designed from the ground up to be understandable. Here's how it works:
The basic idea: one server is elected the leader. All writes go through the leader. The leader copies each write to the other servers (called followers). Once a majority of servers confirm they have the write, it's considered committed and safe. If the leader dies, the remaining servers hold an election and pick a new leader.
Why Odd Numbers and Quorums Matter
You might wonder: why do clusters always use odd numbers of nodes — 3, 5, 7 — never 2, 4, 6? It's about the quorumThe minimum number of nodes that must agree before a decision is considered valid. For a cluster of N nodes, the quorum is (N/2) + 1. For 3 nodes, the quorum is 2. For 5 nodes, it's 3. This ensures that any two quorums always overlap by at least one node — preventing conflicting decisions., or the minimum number of servers that must agree. A quorum is a simple majority: more than half. With 3 nodes, you need 2 to agree. With 5 nodes, you need 3. This guarantees that any two groups who think they have a quorum overlap by at least one server, preventing two groups from making conflicting decisions.
With an even number — say 4 nodes — a network partition could split them evenly (2 vs 2). Neither side has a majority, so neither can make progress. With 5 nodes split 3 vs 2, the group of 3 has a quorum and can keep operating. You get the same fault tolerance with 4 nodes as with 3 (both tolerate 1 failure), so the 4th node is just wasted money.
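The arithmetic is simple enough to sketch:

```python
def quorum(n):
    """Minimum nodes that must agree: a strict majority of n."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many nodes can die while a quorum can still form."""
    return n - quorum(n)

for n in range(2, 8):
    print(f"{n} nodes: quorum = {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")

# 3 and 4 nodes both tolerate exactly one failure - the 4th node buys nothing.
assert tolerated_failures(3) == tolerated_failures(4) == 1
assert tolerated_failures(5) == 2
```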
Monitoring & Alerting — Your Early Warning System
You can build the most resilient system in the world — redundant servers, automatic failover, circuit breakers, retries — and it's all worthless if you don't know something is broken until a user tweets about it. MonitoringThe practice of collecting and analyzing metrics, logs, and traces from your system in real-time. Good monitoring lets you spot problems before users do — and gives you the data to diagnose them quickly when they happen. is the nervous system of your infrastructure. Without it, you're flying blind.
The Four Golden Signals
Google's Site Reliability EngineeringSRE is Google's approach to operations — treating infrastructure as a software problem. Google's SRE book (free online) is considered the bible of modern operations. SRE teams write code to automate away operational toil, and they defined the "four golden signals" that every service should monitor. (SRE) team distilled decades of experience into four metrics that every service should track. If you monitor nothing else, monitor these:
Latency — how long it takes to serve a request. Track successful and failed requests separately: a flood of fast errors can make your latency numbers look better while everything burns.
Traffic — how much demand the system is handling, such as requests per second.
Errors — the rate of requests that fail, whether explicitly (a 500), implicitly (a 200 with the wrong content), or by policy (slower than your latency target).
Saturation — how "full" the service is: memory, CPU, I/O, and connection pools approaching their limits.
Why Averages Lie — Use Percentiles
This is one of the most important lessons in monitoring, and most beginners get it wrong. Suppose 99% of your requests take 10 milliseconds (fast!) and 1% take 10,000 milliseconds (10 full seconds — terrible). What's the average? About 109ms. That sounds fine. But it masks a horrible reality: 1 out of every 100 users is waiting 10 seconds for a page to load. If you have 10,000 requests per minute, that's 100 miserable users every minute.
This is why experienced engineers track percentilesIf you sort all request times from fastest to slowest, the p50 is the median (half are faster, half slower), p95 is the value where 95% of requests are faster, and p99 is where 99% are faster. The p99 captures the worst 1% of user experience — the "tail latency" that averages hide completely. instead of averages. The p50 (median) tells you what a typical user experiences. The p95 tells you what 1 in 20 users experiences. The p99 tells you what 1 in 100 users experiences. Your SLA should be based on p99, not on the average.
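A quick sketch with Python's standard library reproduces the example above — 99% of requests at 10ms, 1% at 10 seconds:

```python
import statistics

# 99% of requests at 10ms, 1% at 10 seconds
latencies = [10.0] * 990 + [10_000.0] * 10

mean = statistics.mean(latencies)
q = statistics.quantiles(latencies, n=100)   # 99 cut points: q[0]=p1 ... q[98]=p99
p50, p95, p99 = q[49], q[94], q[98]

print(f"mean={mean:.1f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
assert mean < 150      # the average says everything is fine...
assert p99 > 1_000     # ...the p99 says 1 in 100 users is suffering
```

The mean lands near 110ms while the p50 and p95 sit at 10ms and the p99 exposes the 10-second tail — exactly the story the average hides.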
Alert Fatigue — The Boy Who Cried Wolf
There's a trap that every team falls into at some point: they set up alerts for everything. CPU above 50%? Alert. A single 500 error? Alert. Disk at 60%? Alert. Within a week, the on-call engineer is getting 200 alerts per day — and they start ignoring all of them. That's alert fatigueWhen engineers receive so many alerts that they start ignoring them — including the real ones. Studies show that after a few weeks of high alert volume, response times to critical alerts increase dramatically. The solution: only alert on conditions that require immediate human action., and it's more dangerous than having no alerts at all. With no alerts, at least you know you're blind. With alert fatigue, you think you're watching but you're actually ignoring the signals.
The rule is simple: every alert must be actionable. If an alert fires and the correct response is "do nothing and wait," that alert shouldn't exist. If the on-call engineer's first reaction is "I can ignore this," the alert is broken — fix it or delete it.
The Observability Stack
Modern systems need three types of telemetry working together. Engineers call this the "three pillars of observability":
Metrics tell you what is happening — numbers like CPU usage, request counts, and error rates. They're cheap to store and fast to query. When you see a spike on your dashboard, metrics tell you "error rate jumped to 5% at 2:47 PM."
Logs tell you why it happened — detailed event records with context. When metrics say "errors spiked," you dig into logs to find "ERROR: database connection pool exhausted, max_connections=50, active=50, waiting=347." Now you know the root cause.
Traces tell you where the problem is — they follow a single request as it travels through your microservices. A trace might show that an API call took 900ms total, and 890ms of that was one slow database query. Without tracing, in a system with 20 microservices, finding which service is slow is like finding a needle in a haystack.
What to Monitor — A Practical Checklist
Application:
- Latency — p50, p95, p99 response times (per endpoint if possible)
- Error rate — percentage of 5xx responses; alert if > 1%
- Requests per second — watch for sudden drops (outage) or spikes (DDoS, viral moment)
- Active connections — are you approaching your server's connection limit?
- Thread pool saturation — how many request-handling threads are in use vs available
Database:
- Query latency — p95 of SELECT and INSERT/UPDATE separately
- Connection pool usage — active vs max connections; alert at 80%
- Replication lag — how far behind are read replicas? Alert if > 5 seconds
- Disk I/O — read/write IOPS, disk queue depth
- Slow queries — queries taking > 1 second (log and review weekly)
- Table/index size — prevent "disk full" surprises
Cache (Redis):
- Hit rate — percentage of requests served from cache; below 80% means your caching strategy is off
- Memory usage — how close to max memory; evictions start hurting when cache is full
- Eviction rate — keys being evicted per second; spikes mean your cache is too small
- Connection count — are clients exhausting the max connections?
- Latency — Redis should respond in < 1ms; anything higher suggests a problem
Message queue:
- Queue depth — messages waiting to be processed; growing = consumers can't keep up
- Consumer lag — how far behind are consumers from the latest message?
- Processing time — how long each message takes to process
- Dead letter queue size — messages that failed processing; should be near zero
- Throughput — messages produced vs consumed per second
Common Mistakes — Reliability Traps Everyone Falls Into
You've now got a solid understanding of how reliable systems work. But knowing the theory and avoiding the traps in practice are two very different things. These seven mistakes look obvious in hindsight, but they trip up experienced teams every year — in real incidents, post-mortems, and system design interviews. Learn them here so you don't learn them at 3 AM during an outage.
Mistake #1: Never Testing Your Backups
What goes wrong: The team sets up automated daily backups and checks the box. Months later, a catastrophic failure happens. They try to restore — and discover the backups have been silently failing for weeks, or the restore process has never been documented, or the backup format is incompatible with the current schema.
Why it's dangerous: This is exactly what happened to GitLab in January 2017A GitLab engineer accidentally deleted a production database directory. When they tried to restore, they discovered that 5 out of 5 backup methods had issues — some hadn't run in days, others produced empty files, and the most recent backup was 6 hours old. They lost data for 5,000 projects.. They had five different backup systems. None of them worked when it mattered. Backups that have never been restored are not backups — they are comforting lies.
How to avoid it: Schedule monthly restore drills. Automate a "backup validation" job that restores to a scratch database and runs a row-count check. If the restore fails or counts don't match, page someone immediately — not in a week, immediately. The rule is simple: an untested backup is no backup at all.
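A toy version of that validation job, using SQLite in place of a real database (a row-count check is the bare minimum — real drills should also verify schema and spot-check data):

```python
import sqlite3

# A stand-in "production" database with some rows
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER)")
prod.executemany("INSERT INTO orders (total) VALUES (?)", [(i,) for i in range(100)])
prod.commit()

# Take a "backup" (in production: pg_dump, WAL archiving, volume snapshots...)
backup = sqlite3.connect(":memory:")
prod.backup(backup)

def validate_backup(source, restored):
    """The drill: restore into a scratch database and compare row counts."""
    def count(conn):
        return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    return count(source) == count(restored)

# If this ever fails, page someone immediately - the backup is a comforting lie.
assert validate_backup(prod, backup)
print("restore drill passed: row counts match")
```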
Mistake #2: Testing Only the Happy Path
What goes wrong: Your integration tests verify that the checkout flow works when the payment API responds in 200ms, the inventory database is up, and the email service sends confirmations. All green. Then in production, Redis goes down for 3 seconds. The payment API returns a 500. The network drops packets for 2 seconds. Your "fully tested" service crashes spectacularly.
Why it's dangerous: Happy-path tests prove your system works when nothing fails. But things always fail. If you haven't tested what happens when Redis times out, when an API returns garbage, when the network is slow — you have no idea how your system behaves under the conditions that actually cause outages.
How to avoid it: For every dependency, write failure tests: what happens when it's slow (inject 5-second latency)? When it returns errors? When it's completely unreachable? Tools like ToxiproxyAn open-source tool by Shopify that lets you simulate network conditions like latency, timeouts, and connection resets in your test environment. let you inject network failures in tests. If you can't answer "what happens when X is down?" for every dependency — you haven't tested enough.
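Here's a sketch of one such failure test: a hypothetical `slow_inventory_lookup` dependency with injected latency, and a client that must time out and degrade instead of hanging:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def slow_inventory_lookup(sku):
    """Hypothetical dependency with 2 seconds of injected latency."""
    time.sleep(2)
    return {"sku": sku, "stock": 3}

def check_stock(sku, timeout=0.2):
    """Client under test: it must time out and degrade, not hang the checkout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(slow_inventory_lookup, sku).result(timeout=timeout)
    except FutureTimeout:
        return {"sku": sku, "stock": "unknown"}   # degraded mode: reconcile later
    finally:
        pool.shutdown(wait=False)                 # don't block on the slow call

result = check_stock("A-42")
assert result["stock"] == "unknown"               # proves the failure path runs
```

The point of the test is the assertion on the last line: it fails if the client ever starts waiting out the full injected delay instead of degrading.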
Mistake #3: Overlooking Hidden Single Points of Failure
What goes wrong: The team adds redundancy to the obvious things — two app servers, a database replica, multi-zone deployment. They feel safe. Then their single DNS provider has an outage, and the entire site is unreachable. Or the one config server that all services depend on crashes, and nothing can start. Or the single load balancer that fronts everything goes down.
Why it's dangerous: People add redundancy to the things they think about — servers, databases, application code. But they forget the "glue" infrastructure: DNS, load balancers, config servers, certificate authorities, the CI/CD pipeline, and even the one engineer who knows how the deployment works. These hidden SPOFsSingle Point of Failure — any component whose failure brings down the entire system. SPOFs are dangerous because they negate all the redundancy you've built around them. are the ones that actually cause outages.
How to avoid it: Run a "what if this dies?" exercise. Walk through every component in your architecture — including DNS, load balancers, config stores, monitoring, and the deployment pipeline. For each one, ask: "If this single thing disappears right now, what breaks?" Draw the dependency graph and look for any node that, if removed, disconnects everything.
Mistake #4: Alerting on Everything
What goes wrong: The team adds alerts for everything — CPU above 60%, any 500 error, disk above 70%, response time above 200ms. The Slack channel gets 500 alerts per day. The team starts ignoring them. Then a real outage happens, and the critical alert is buried under noise. Nobody notices for 45 minutes.
Why it's dangerous: Alert fatigue is one of the most common causes of delayed incident response. When every alert is "critical," none of them are. The human brain can't sustain vigilance across hundreds of notifications — it starts filtering them out as background noise. This is the boy who cried wolfA classic parable: if you raise the alarm too often for non-emergencies, people stop responding — and when a real emergency comes, nobody pays attention. problem applied to infrastructure.
How to avoid it: Use a tiered alert system. Page-worthy (wake someone up): only for customer-facing impact — error rate above 5%, complete service down, data loss risk. Ticket-worthy (fix during business hours): elevated latency, disk above 85%, a single node down but redundancy is covering it. Dashboard-only (just watch it): CPU trends, cache hit ratios, queue depths. If your on-call engineer gets more than 2-3 pages per week, your thresholds are wrong.
Mistake #5: No Fallbacks for Non-Critical Dependencies
What goes wrong: The recommendation engine goes down. Instead of showing generic popular items, the entire product page crashes with a 500 error. Or the search service is slow, so the whole website hangs waiting for it. The team never defined what should happen when a non-critical dependency fails.
Why it's dangerous: In a system with 10 dependencies, the probability that all of them are healthy at any given moment is surprisingly low. If your site requires 100% of its dependencies to be up, your effective availability is the product of all their individual availabilities. Ten services at 99.9% each gives you 0.999¹⁰ ≈ 99.0% — that's almost 4 days of downtime per year.
How to avoid it: For every dependency, define the fallback. Recommendations down? Show popular items. Search slow? Return cached results. Email service offline? Queue the email and send later. Payment API unreachable? Show "try again in a few minutes" instead of a cryptic 500. The key insight: decide these fallbacks before the outage, not during it.
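The recommendations example above can be sketched in a few lines; the `get_recommendations` service and `POPULAR_ITEMS` list are hypothetical:

```python
POPULAR_ITEMS = ["headphones", "charger", "notebook"]   # precomputed, always available

def get_recommendations(user_id):
    """Hypothetical recommendation service - simulating an outage."""
    raise ConnectionError("recommendation engine unreachable")

def product_page_widgets(user_id):
    """The fallback was decided before the outage: popular items, never a 500."""
    try:
        return get_recommendations(user_id)
    except ConnectionError:
        return POPULAR_ITEMS

assert product_page_widgets(42) == POPULAR_ITEMS   # degraded, but the page renders
```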
Mistake #6: Immediate Retries That Become Retry Storms
What goes wrong: A downstream service gets slow under load. Your service retries immediately. So do the other 50 services that depend on it. The downstream service was handling 1,000 requests per second, now it's getting 3,000 (original load plus retries). It collapses completely. Now nothing works.
Why it's dangerous: Immediate retries turn a partial failure into a total failure. The downstream service was struggling but alive — the flood of retries killed it. This is called the thundering herdWhen many clients simultaneously retry or reconnect to a recovering service, overwhelming it before it can recover. Like a herd of buffalo all running at a narrow gate at once. problem, and it's one of the most common ways that small incidents become big outages.
How to avoid it: Always use exponential backoff with jitterExponential backoff means waiting longer between each retry (1s, 2s, 4s, 8s...). Jitter adds a random delay so all clients don't retry at the same instant. Together, they spread the retry load over time.. First retry after 1 second (plus random 0-500ms). Second retry after 2 seconds (plus random). Third after 4 seconds. Cap at 30 seconds. And set a maximum retry count — after 3-5 attempts, stop and fail gracefully. A circuit breaker (from Section 6) automates this even better.
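The retry schedule described above can be sketched in a few lines:

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=30.0, jitter=0.5):
    """Yield the wait before each retry: exponential, capped, plus random jitter."""
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)    # 1, 2, 4, 8, 16... capped at 30
        yield delay + random.uniform(0, jitter)  # jitter spreads clients apart

delays = list(backoff_delays())
print([f"{d:.2f}s" for d in delays])

# Each delay sits on the exponential curve, at most `jitter` above it.
for attempt, d in enumerate(delays):
    expected = min(30.0, 2.0 ** attempt)
    assert expected <= d <= expected + 0.5
```

Because the jitter is random, fifty clients that all failed at the same instant will retry at fifty slightly different moments — which is exactly what keeps the herd from arriving at the gate together.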
Mistake #7: Redundancy That Shares a Fate
What goes wrong: All your services run in a single availability zoneA physically distinct data center or section of a data center within a cloud region. Availability zones have independent power, cooling, and networking — so a failure in one zone shouldn't affect others.. Or all your microservices share the same Kubernetes cluster. Or all your databases are on the same physical host. When that one zone, cluster, or host goes down — everything goes down together.
Why it's dangerous: Shared fate means your redundancy is an illusion. Having 5 replicas sounds great — until you realize they're all on the same physical rack, sharing the same power supply and network switch. One hardware failure, and all 5 replicas die simultaneously. This happened in several major cloud outages where "redundant" services all lived in the same blast radius.
How to avoid it: Spread your infrastructure across independent failure domainsA failure domain is a group of resources that share a common point of failure. Examples: a single rack (shared power), a single availability zone (shared data center), a single region (shared geography).. Run across at least 2 availability zones (3 is better). Place database replicas in different zones. Use anti-affinity rules to prevent Kubernetes from scheduling critical pods on the same node. The rule: if two components can fail from the same cause, they're not truly redundant.
Interview Playbook — Nail Reliability Questions
Reliability questions show up in almost every system design interview — either as the main topic ("design a highly available system") or as a follow-up ("what happens when this component fails?"). The best candidates don't wait for the interviewer to ask about failures. They bring it up proactively, showing they think about real-world production from the start.
Here's a five-step framework that works for any reliability question. Memorize the steps, not the answers — because the specific system changes, but the thinking process is always the same.
Let's see this framework in action. Here's how a strong candidate might think through a classic interview question: "Design a service with 99.99% availability."
Step 1 — Failures: "99.99% means only 52 minutes of downtime per year. So I need to think about every failure mode: server crashes, database failover, network partitions, bad deployments, and even cloud provider outages. Any one of these could eat my entire error budget in a single incident."
Step 2 — Redundancy: "I'll run at least 3 app servers across 2 availability zones behind an active-active load balancer. The database needs a hot standby with streaming replication — automated failover in under 30 seconds. I'll use a multi-zone Redis cluster for caching so one zone going down doesn't kill cache."
Step 3 — Detection: "Health checks every 10 seconds from the load balancer. An external uptime monitor pinging from 3 regions. Error rate and latency dashboards with alerts at 1% error rate — not 5%, because at 99.99% I can't afford to wait."
Step 4 — Recovery: "Automated failover for the database — no human in the loop for the first response. Blue-green deployments so I can roll back a bad deploy in under a minute. Canary releases that test new code on 5% of traffic first. And a runbook for the on-call engineer covering the top 10 failure scenarios."
Step 5 — SLOs: "I'll define SLIs: request success rate and P99 latency. The SLO is 99.99% success rate and P99 under 500ms. That gives me an error budget of about 3,000 failed requests per month on a service handling 1 million requests daily. If I'm burning budget too fast, I freeze deployments and investigate."
Now let's walk through the most common reliability interview questions and how to approach each one.
Question 1: "How would you design a highly available system?"
What the interviewer wants: They want to see you think about redundancy at every layer — not just "add more servers." A strong answer covers: load balancing, database replication and failover, multi-zone deployment, health checks, and graceful degradation.
Framework to use: Walk through the architecture layer by layer — DNS (use multiple providers or a managed service), load balancer (active-passive pair), app tier (stateless servers, at least N+1 capacity), data tier (primary + hot standby, read replicas), and caching tier (clustered Redis). For each layer, state the failure mode and the mitigation.
Key phrase to use: "I want to eliminate single points of failure at every layer, and make sure each layer can lose one component without user impact."
Question 2: "What happens when this component fails?"
What the interviewer wants: They're testing whether you've thought about failure propagation. Don't just say "the backup takes over." Explain the full chain: detection (how do you know it failed?), impact (what do users see during the gap?), recovery (automatic or manual?), and data implications (any data loss?).
Framework to use: "When X fails, here's the timeline: detection takes Y seconds via health checks. During the detection window, requests to X will timeout — but our circuit breaker trips after 3 failures and returns a fallback response. Failover to the standby takes Z seconds. Total user-visible impact is... And we lose at most N seconds of uncommitted data due to replication lag."
Key phrase to use: "Let me walk through the blast radius of that failure."
Question 3: "What happens when your database primary fails?"
What the interviewer wants: This tests your understanding of replication, data consistency, and the practical mechanics of failover. A weak answer says "the replica takes over." A strong answer discusses replication lag, in-flight transactions, split-brain risk, and client reconnection.
Framework to use: "The primary fails. The monitoring system detects it within 10 seconds (missed heartbeats). The orchestrator (like Patroni or RDS Multi-AZ) promotes the replica to primary. There's a window of replication lag — say 200ms of committed transactions that haven't replicated yet. Those might be lost. In-flight transactions get connection errors and need to retry. The application's connection pool detects the DNS change and reconnects. Total switchover: 15-30 seconds."
Key phrase to use: "The key trade-off is between recovery time and data loss — I need to define our RPO and RTO based on business requirements."
Question 4: "How would you design a disaster recovery plan?"
What the interviewer wants: They want to see you think big — beyond single-component failures to entire region outages. This is about RPO, RTO, data replication strategies, and how you test the plan.
Framework to use: "I'd classify our data into tiers. Tier 1 (user data, transactions): synchronous replication to another region, RPO near zero, RTO under 5 minutes. Tier 2 (analytics, logs): asynchronous replication, RPO of 1 hour is acceptable, RTO of 30 minutes. Tier 3 (derived data, caches): no replication needed, can be rebuilt. For the actual failover, I'd use DNS-based routing with health checks — if the primary region fails, traffic routes to the secondary within 60 seconds. And critically — we test this quarterly with a planned regional failover drill."
Key phrase to use: "A disaster recovery plan that hasn't been tested is just a disaster recovery hope."
Practice Exercises — Build Your Reliability Intuition
Reading about reliability is step one. But real understanding comes from working through problems yourself — figuring out where failures hide, calculating error budgets, and designing fallback strategies. Try each exercise before peeking at the hints. The struggle is where the learning happens.
Your e-commerce checkout flow has three dependencies: a payment API (processes credit cards), an inventory database (checks stock levels), and an email service (sends order confirmations). The payment API averages 500ms response time. The inventory DB responds in 20ms. The email service takes 2 seconds.
Questions: (a) Which dependencies need circuit breakers? (b) Which ones can fail gracefully without blocking checkout? (c) Design the degradation strategy — what does the user see when each dependency is down?
(a) Circuit breakers: The payment API and email service both need circuit breakers. The payment API is slow (500ms) and external — if it starts timing out at 10 seconds, it'll block your checkout threads. The email service is even slower (2 seconds) and is a fire-and-forget operation. The inventory DB is fast (20ms) and internal — a simple timeout + retry is sufficient, but a circuit breaker doesn't hurt.
(b) Graceful failure: The email service is the easy one — queue the confirmation email and send it later. The user doesn't need the email to complete checkout. The inventory DB can degrade to an optimistic strategy — accept the order and reconcile stock later (most e-commerce sites do this during flash sales). The payment API is the one that cannot degrade — you need payment confirmation to complete a purchase. If it's down, show "Payment processing is temporarily unavailable, please try again in a few minutes."
(c) Degradation plan: Payment API down → show friendly retry message, offer to save the cart. Inventory DB down → accept orders optimistically, flag for manual review. Email service down → queue emails, show "confirmation email coming soon" in the UI. The checkout still completes for 2 out of 3 failure modes.
Your API service has an SLOService Level Objective — the target reliability you promise internally. For example, "99.95% of requests will succeed." It's stricter than your SLA (external promise to customers) to give you a safety margin. of 99.95% success rate. It handles 2 million requests per day.
Questions: (a) How many failed requests per day does your error budget allow? (b) How many per month (30 days)? (c) A bad deployment causes 1,500 errors in 10 minutes before you roll back. What percentage of your monthly error budget did that single incident consume?
(a) Daily budget: 100% - 99.95% = 0.05% allowed failures. 2,000,000 × 0.0005 = 1,000 errors per day.
(b) Monthly budget: 1,000 × 30 = 30,000 errors per month.
(c) Incident impact: 1,500 errors out of a 30,000 monthly budget = 1,500 / 30,000 = 5% of your monthly error budget burned in 10 minutes. That's significant but survivable. If this happened 19 more times in the same month, you'd blow your entire budget and need to freeze deployments. This is exactly why error budgets matter — they turn abstract reliability targets into concrete deployment decisions.
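The error-budget arithmetic from this exercise (and the downtime-per-year figures behind "the nines") is easy to check in code:

```python
def error_budget(slo, requests):
    """How many failed requests the SLO allows out of `requests` total."""
    return requests * (1 - slo)

def downtime_minutes_per_year(slo):
    """What 'the nines' mean in wall-clock time."""
    return 365 * 24 * 60 * (1 - slo)

daily = round(error_budget(0.9995, 2_000_000))
assert daily == 1000                      # exercise (a)
assert daily * 30 == 30_000               # exercise (b)
assert 1_500 / (daily * 30) == 0.05       # exercise (c): 5% of the budget burned

for slo in (0.999, 0.9999):
    print(f"{slo:.2%} -> {downtime_minutes_per_year(slo):.1f} minutes of downtime/year")
```

The loop at the end prints roughly 525.6 minutes per year for 99.9% and 52.6 for 99.99% — which is where the "52 minutes per year" figure in the interview example comes from.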
Your PostgreSQL primary fails unexpectedly. You have a streaming replica (a replica that receives a continuous stream of WAL (Write-Ahead Log) records from the primary; it's usually a few hundred milliseconds behind, close to real-time but not exact) with approximately 200ms of replication lag. At the moment of failure, the primary was processing 500 transactions per second.
Questions: (a) Walk through the failover process step by step. (b) How many transactions might be lost? (c) What happens to in-flight transactions that were mid-commit? (d) How do application servers discover the new primary?
(a) Failover steps: (1) The monitoring system (e.g., Patroni) detects missed heartbeats from the primary — typically 3 missed checks at 5-second intervals = 15 seconds to detect. (2) Patroni confirms the primary is truly dead (not just a network blip). (3) The replica is promoted to primary — it replays any remaining WAL and starts accepting writes. (4) DNS or connection pooler (PgBouncer) is updated to point to the new primary. (5) Application connections are reset and reconnect. Total time: 15-30 seconds.
(b) Data loss: With 200ms replication lag and 500 TPS, roughly 500 × 0.2 = ~100 transactions were committed on the primary but not yet replicated. These are lost. This is the RPO (Recovery Point Objective: the maximum acceptable amount of data loss measured in time; an RPO of 200ms means you accept losing up to 200ms of data) trade-off with asynchronous replication.
(c) In-flight transactions: Any transaction that was mid-commit (sent to primary but not yet acknowledged) will get a connection error. The application must retry these. If the application is idempotent (which it should be!), retrying against the new primary is safe. If not, you risk duplicate operations — this is why idempotency matters.
(d) Discovery: Option 1: DNS-based — update the DNS record for the DB hostname; apps reconnect on next attempt (TTL matters — keep it low, 5-10 seconds). Option 2: Connection pooler (PgBouncer/HAProxy) handles routing transparently — the app doesn't even know the primary changed. Option 3: Client-side library with cluster awareness (like Patroni's REST API).
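The retry logic from (c) can be sketched as a small wrapper. The shape of `execute` is hypothetical; the essential property is that it must be idempotent (for example an upsert keyed on a client-generated idempotency key), so replaying it against the newly promoted primary is safe even if the old primary committed it before dying:

```python
import time

def commit_with_retry(execute, attempts=5, backoff=0.5):
    """Retry an idempotent write across a failover, with exponential
    backoff long enough to wait out a 15-30 second promotion."""
    last_error = None
    for attempt in range(attempts):
        try:
            return execute()  # routed to whichever node is currently primary
        except ConnectionError as exc:
            last_error = exc
            time.sleep(backoff * 2 ** attempt)  # 0.5s, 1s, 2s, 4s, ...
    raise last_error
```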
You run a microservice called order-service that depends on: a PostgreSQL database (stores orders), a Redis cache (caches product prices), and two external APIs — payment-api (processes payments) and shipping-api (calculates shipping rates).
Questions: (a) Define a liveness check (a health check that answers "Is this process alive and not stuck?"; if liveness fails, the orchestrator kills and restarts the container) and a readiness check (a health check that answers "Is this service ready to handle traffic?"; if readiness fails, the load balancer stops sending requests to it, but doesn't kill it). (b) If Redis is down but everything else is healthy, should the service be marked "not ready"? Why or why not? (c) What timeout should each health check use?
(a) Liveness: A simple check that the process is running and not deadlocked. Ping an internal endpoint like /health/live that returns 200 if the event loop is responding. Do NOT check dependencies here — if the database is down, the process is still alive. Liveness failures trigger a restart, and restarting won't fix a database outage.
Readiness: Check that the service can actually handle requests. /health/ready should verify: (1) database connection pool has available connections, (2) Redis is reachable, (3) the service has completed startup initialization. External APIs (payment, shipping) should NOT be part of readiness — they're checked per-request with circuit breakers.
(b) Redis down: It depends on your degradation strategy. If the service can fall back to reading prices directly from the database (slower but functional), then Redis being down should NOT make the service "not ready" — it should still serve traffic, just slower. If Redis is absolutely required (e.g., it holds session data with no fallback), then yes, mark it not ready. The answer reveals whether you've thought about graceful degradation.
(c) Timeouts: Liveness: 1-2 seconds max (it's just checking the process). Readiness: 3-5 seconds (includes a DB connection check). Make these significantly shorter than your Kubernetes probe intervals — if the check itself times out, it counts as a failure. A common mistake is setting health check timeouts equal to the probe interval, causing false positives.
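The split in (a)-(b) can be condensed into two handler sketches. The dependency objects and methods (`db_pool.has_available_connection`, `cache.ping`, `cache.has_db_fallback`) are hypothetical, and a real service would return these codes from HTTP endpoints; what matters is what each check does and does not inspect:

```python
def liveness():
    """/health/live: the process is up and responding. Deliberately checks
    no dependencies -- restarting the pod won't fix a down database."""
    return 200

def readiness(db_pool, cache, started):
    """/health/ready: can we serve traffic right now? External APIs
    (payment, shipping) are excluded; they're guarded per-request
    by circuit breakers instead."""
    if not started:
        return 503  # still initializing
    if not db_pool.has_available_connection(timeout=3.0):
        return 503  # no DB, no service
    if not cache.ping(timeout=1.0) and not cache.has_db_fallback:
        return 503  # Redis down AND no slower fallback path
    return 200
```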
You run a global SaaS platform across three regions: US-East (primary), EU-West, and AP-Southeast. US-East hosts the primary database and the control plane. Each region handles local traffic. You have 50,000 active users across all regions. Design the disaster recovery plan.
Questions: (a) Define RPO (Recovery Point Objective: how much data you can afford to lose, measured in time; an RPO of 0 means zero data loss via synchronous replication, while an RPO of 1 hour means you accept losing up to 1 hour of data) and RTO (Recovery Time Objective: how quickly the service must be restored after a failure; an RTO of 5 minutes means users should be back online within 5 minutes of the outage) for three tiers of data. (b) Explain what happens when US-East goes completely offline — minute by minute. (c) How do EU-West and AP-Southeast users continue working? (d) How do you test this plan without causing an actual outage?
(a) Data tiers:
Tier 1 (user accounts, transactions, orders): RPO < 1 second (synchronous or near-synchronous replication to EU-West), RTO < 5 minutes. This data is irreplaceable.
Tier 2 (analytics, audit logs, activity history): RPO < 1 hour (async replication is fine), RTO < 30 minutes. Important but not urgent.
Tier 3 (caches, search indexes, derived data): RPO = N/A (can be rebuilt), RTO < 2 hours. Rebuild from Tier 1 data after failover.
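One way to keep a tiering scheme like this honest is to encode it as data rather than prose, so runbooks and alerts can read it programmatically. This is a hypothetical policy table, with values in seconds and `None` meaning "rebuildable, no RPO":

```python
# Hypothetical DR policy table encoding the three tiers above.
DR_POLICY = {
    "tier1": {"data": "accounts, transactions, orders",
              "rpo_s": 1,    "rto_s": 5 * 60,   "replication": "sync"},
    "tier2": {"data": "analytics, audit logs, activity history",
              "rpo_s": 3600, "rto_s": 30 * 60,  "replication": "async"},
    "tier3": {"data": "caches, search indexes, derived data",
              "rpo_s": None, "rto_s": 2 * 3600, "replication": "rebuild"},
}

def strictest_rpo(policy):
    """The platform-wide RPO is driven by the strictest tier."""
    return min(p["rpo_s"] for p in policy.values() if p["rpo_s"] is not None)
```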
(b) US-East fails — timeline: Minute 0: US-East goes dark. External monitors (Pingdom, Route53 health checks) detect failure within 30-60 seconds. Minute 1: DNS failover triggers — US traffic routes to EU-West. EU-West's replica is promoted to primary. Minute 2-3: Connection pools reset, applications reconnect to the new primary. Some US users see 30-60 seconds of errors during the switchover. Minute 5: Service is restored for all regions. Minute 10-30: Tier 2 data catches up via async replication replay. Hour 1-2: Tier 3 data (search indexes, caches) is rebuilt.
(c) EU/AP users: EU-West users experience a brief blip (2-3 seconds) as their local region absorbs the extra load from US traffic and the database promotion happens. AP-Southeast users route to EU-West for write operations (higher latency — 200-300ms instead of 50ms) but reads are served from their local read replica. Both regions remain operational throughout.
(d) Testing: Run quarterly DR drills: (1) Announce a maintenance window. (2) Simulate US-East failure by updating DNS to stop routing traffic there. (3) Verify EU-West handles the promotion and all traffic. (4) Measure actual RTO and RPO. (5) Fail back to US-East and verify data consistency. Netflix and Google do this routinely. If you can't test the plan, you can't trust the plan.
Cheat Sheet — Reliability at a Glance
Quick-reference cards for every major reliability concept. Pin this section for your next system design interview or post-mortem review.
Connected Topics — Where to Go Next
Reliability doesn't exist in a vacuum. Every concept you've learned here connects to a deeper topic. Pick the ones that matter most for your next interview or your current system, and go deeper.