Core Principles

Reliability

Netflix serves 250 million subscribers because when a server dies, nobody notices. Your app serves 500 users and one database hiccup brings it all down. This page explains how real systems stay alive — with real incident stories, real numbers, and the exact patterns that prevent 3 AM wake-up calls.

Section 1

TL;DR — The Hospital That Never Closes

  • Why building systems that survive failure is more important than building systems that never fail
  • The three pillars of reliability: redundancy, failover, and monitoring
  • How "the nines" of availability translate to real downtime minutes per year
  • Why human error — not hardware — causes the majority of outages

Reliability is the art of keeping your system running even when pieces of it break.

Think about a hospital. A hospital never closes. Not on holidays, not during a power outage, not when a doctor calls in sick. Why? Because hospitals are designed around a simple truth: things will go wrong. The power will fail. Doctors will get sick. Equipment will malfunction. The hospital doesn't try to prevent every possible failure (that's impossible). Instead, it makes sure that when something fails, there's always a backup ready to take over.

The main power goes out? A backup generator (a secondary power source that kicks in automatically when the main power fails; in software terms, a standby database that takes over when the primary crashes) starts within 10 seconds. The lead surgeon is in an accident? An on-call surgeon arrives within 30 minutes. The MRI machine breaks? There are two more down the hall. Every single critical system has a Plan B. Many have a Plan C. Some have a Plan D.

That's reliability in software. You don't build systems that never fail — that's a fantasy. You build systems where failure doesn't matter because something else picks up the slack before anyone even notices. Netflix engineers intentionally crash their own servers during business hours using a tool called Chaos Monkey, which randomly kills production servers during working hours. Why? Because if a random server dying causes problems, they want to find out at 2 PM on a Tuesday, not at 3 AM on a Saturday; it forces every team to build services that survive failures. If a random server dies and users see an error, that's a bug in the architecture — not bad luck.

[Diagram: The Hospital Model = The Reliability Model. Every critical system has a backup, and every backup has a trigger: the main power grid (primary database) fails over to a diesel generator (standby replica) that starts in 10 seconds; the lead surgeon (primary server) is covered by an on-call surgeon (failover server) who arrives in 30 minutes; MRI machine #1 (service instance) is backed by machines #2 and #3 (redundant instances) that are already running. The pattern: 1. Assume it WILL fail. 2. Have a backup ready. 3. Detect failure fast. 4. Switch automatically. 5. Users never notice. Hospital = reliable system, patients = users: they never see the generator switch, and your users should never see the failover.]

The core insight is deceptively simple: things will fail. Disks die. Networks split. Developers push bad code at 5 PM on a Friday. The question isn't "will something go wrong?" — it always will. The question is: will your users notice?

A reliable system isn't one where nothing breaks. It's one where things break all the time and nobody cares — because the system handles it automatically. The generator starts. The backup takes over. The traffic reroutes. By the time a human even looks at the dashboard, the system has already healed itself.

What: Reliability means your system keeps working correctly even when things go wrong — hardware failures, software bugs, human mistakes, network issues. Not "nothing ever fails," but "failure doesn't reach the user."

When: Every production system needs reliability thinking from day one. The cost of an outage grows exponentially with your user count. A 5-minute outage at 100 users is nothing. At 10 million users, it's front-page news.

Key Principle: Design for failure, not against it. Assume every component will fail, and build the system so it doesn't matter when they do. The three pillars: redundancy (have backups), failover (switch automatically), monitoring (detect instantly).

Reliability is about building systems that survive failure, not systems that never fail. Like a hospital with backup generators, on-call staff, and redundant equipment, a reliable software system has backups for every critical component and switches to them automatically before users notice. The three pillars: redundancy, failover, and monitoring.
Section 2

The Scenario — Your Server Just Died at 3 AM

It's 3:17 AM on a Saturday. Your phone vibrates on the nightstand. Then again. Then again. You grab it, squinting at the screen:

[PagerDuty] Production database unreachable. Service: orders-api. Status: DOWN. Duration: 4 minutes and counting. Escalation: on-call engineer (you).

You stumble to your laptop and SSH into the production server. The e-commerce site is completely down. Checkout is broken. The shopping cart returns 500 errors. Order confirmations aren't sending. Every minute the site is down, the company loses roughly $4,000 in revenue. It's Black Friday weekend.

You check the database server. Disk is at 100%. The application logs have been writing to /var/log/app/ without any log rotation configured (log rotation automatically archives and deletes old log files so they don't fill up the disk; most Linux systems use logrotate, but someone has to actually configure it). Three months of debug logs have eaten the entire 50 GB disk. PostgreSQL can't write its WAL files (the Write-Ahead Log is PostgreSQL's crash recovery mechanism: before changing any data, it writes the change to the WAL first, and with a full disk it can't write WAL entries), so it's refusing every write operation. One tiny oversight — forgetting to set up logrotate — brought down the entire platform.

You frantically delete old logs (find /var/log/app/ -mtime +30 -delete), restart PostgreSQL, and watch the site slowly come back to life. Total downtime: 47 minutes. Revenue lost: ~$188,000. Sleep lost: all of it. Reputation damage: immeasurable.

[Diagram: Anatomy of a 3 AM Outage. 3:17 AM, alert fires: PagerDuty wakes you; finding the laptop, SSH, and a slow VPN waste 8 minutes. 3:25 AM, diagnose: app logs blank, DB won't connect, df -h shows the disk 100% full; 15 minutes of diagnosing. 3:40 AM, fix: delete old logs, restart PostgreSQL, verify the site is back; 24 minutes to recover. 4:04 AM, recovery: the site is live again, logrotate gets set up, a disk space alert is added. The damage: 47 minutes of total downtime, ~$188,000 in lost revenue, one angry executive email. What would have prevented it? A $0 cron job (logrotate) plus a $0 disk space alert at 80%; total cost, 10 minutes of setup.]
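The "$0 disk space alert at 80%" mentioned above can be a few lines of standard-library Python run from cron; a minimal sketch (the path, threshold, and delivery mechanism are illustrative — in practice you would wire the message into email, Slack, or PagerDuty):

```python
import shutil

def disk_usage_percent(path="/"):
    """Return the percentage of disk space used at the given path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def check_disk(path="/", threshold=80.0):
    """Return an alert message if usage crosses the threshold, else None."""
    pct = disk_usage_percent(path)
    if pct >= threshold:
        return f"ALERT: {path} is {pct:.1f}% full (threshold {threshold}%)"
    return None

if __name__ == "__main__":
    # Run from cron every few minutes; here we just print the alert.
    message = check_disk("/", threshold=80.0)
    if message:
        print(message)
```

Ten minutes of setup like this would have caught the disk at 80% days before it hit 100% and took PostgreSQL down.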

Now contrast that with Netflix. Netflix runs on thousands of servers across multiple AWS regions (Amazon's data centers in different parts of the world: US East, EU West, Asia Pacific, and so on; each region is completely independent, with its own power, network, and cooling, so if one region has an outage, the others keep running). Every day, their Chaos Monkey tool randomly kills server instances during business hours. Not in staging. Not in a test environment. In production. Serving real customers watching real shows.

And nobody notices. Not a single stream buffers. Not a single recommendation fails to load. The system detects the dead server within seconds, reroutes traffic to healthy servers, and spins up a replacement. By the time a Netflix engineer glances at the dashboard, the incident is already resolved — automatically.

The difference between the 3 AM horror story and the Netflix non-event isn't luck. It isn't budget (though Netflix spends more, the core patterns are free). It's reliability engineering — designing every piece of the system to fail gracefully, detect failures instantly, and recover automatically.

Industry research consistently shows that unplanned downtime costs Fortune 500 companies an average of $100,000 to $500,000+ per hour. AWS itself reports roughly 7 hours of unplanned downtime per year across its services. But the real cost isn't just revenue — it's trust. Users who experience one outage are 3x more likely to switch to a competitor. Reliability isn't a feature. It's THE feature.
Think First

Look at the 3 AM incident again. The root cause was a full disk — no log rotation. But think deeper: what reliability mechanisms were missing? List at least three things that should have caught this problem before it became an outage.

Think about monitoring (what should have alerted before 100%?), redundancy (what if the database had a replica?), and automation (what if log rotation was already configured?).
A single overlooked configuration — missing log rotation — can bring down an entire production system at 3 AM. Meanwhile, Netflix intentionally kills servers and nobody notices. The difference is reliability engineering: monitoring that catches problems early, redundancy that provides backups, and automation that recovers without human intervention.
Section 3

What Fails — The Four Horsemen of System Failure

Before you can build reliable systems, you need to understand what actually goes wrong. Failures don't come from one place — they come from four distinct categories, each with its own frequency, severity, and fix. Think of them as the four horsemen of your system's apocalypse. Let's meet them.

[Diagram: The Four Horsemen of System Failure. 1. Hardware: disks die (AFR 2-4%/year), RAM corrupts (bit flips), servers crash, fans fail, PSUs pop; ~10-20% of outages; predictable with monitoring. 2. Software: memory leaks, race conditions, config errors, null pointers, dependency failures, version drift; ~15-25% of outages; often latent, triggering later. 3. Human error: wrong deploys, bad config pushes, accidental deletions, typos, skipped steps, wrong server; 60-80% of outages, the #1 cause by far. 4. Network: partitions, packet loss, DNS failures, BGP hijacks, cable cuts, switch failures; ~5-10% of outages; rare but devastating in scope.]

Horseman #1: Hardware Failures

Physical things break. Hard drives have moving parts that wear out. RAM chips get bit flips from cosmic rays (yes, really): a single bit in memory spontaneously changes from 0 to 1 or vice versa; Google's research found roughly one bit flip per GB of RAM per year, and ECC (Error-Correcting Code) memory detects and fixes single-bit flips automatically. Power supplies overheat. Fans stop spinning. Server motherboards just... die one day.

How often does this happen? More than you'd think. Google published a famous study on disk failures: across a fleet of 100,000+ drives, the Annual Failure Rate (AFR, the percentage of drives that die in a given year) was about 2-4%, increasing sharply after 3 years as drives age. An AFR of 2% means that if you have 100 servers with one drive each, expect about 2 drives to die this year; at 4%, expect 4. At Google's scale of millions of drives, that's hundreds of disk failures every single day.
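The AFR arithmetic works out as follows; a small sketch, assuming drive failures are independent (real failures are often correlated: same batch, same rack, same power event):

```python
def expected_failures(n_drives, afr):
    """Expected number of drive failures in one year at the given AFR."""
    return n_drives * afr

def prob_at_least_one_failure(n_drives, afr):
    """Probability that at least one of n independent drives fails this year."""
    return 1 - (1 - afr) ** n_drives

# 100 drives at 3% AFR: expect ~3 failures this year,
# and roughly a 95% chance that at least one drive dies.
print(expected_failures(100, 0.03))
print(prob_at_least_one_failure(100, 0.03))
```

The second number is the one that surprises people: even a modest fleet almost certainly loses a drive this year, which is why "have spares" is not optional.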

The good news? Hardware failures are the most predictable type. Drives give warnings (SMART data shows increasing error counts before failure). Servers often degrade slowly rather than dying instantly. And the fix is straightforward: have spares. Use RAID for disks (Redundant Array of Independent Disks: data is spread across multiple drives so that if one dies, your data is still intact on the others; RAID-1 mirrors everything to two drives, while RAID-5/6 uses parity math to recover from 1-2 drive failures), ECC memory for RAM (which costs about 10-20% more, is standard in servers but rare in consumer PCs, and prevents silent data corruption), and redundant power supplies for servers.
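The RAID-1 idea (mirror every write so a single dead drive loses nothing) can be shown in miniature. This is a toy illustration, not real RAID; two in-memory dicts stand in for physical drives:

```python
class MirroredStore:
    """A toy RAID-1: every write goes to both replicas, reads survive one loss."""

    def __init__(self):
        self.drives = [dict(), dict()]  # two replica "drives"

    def write(self, key, value):
        for drive in self.drives:
            drive[key] = value          # mirror the write to every drive

    def read(self, key):
        for drive in self.drives:
            if key in drive:            # first drive that still has the data wins
                return drive[key]
        raise KeyError(key)             # data gone only if ALL drives are gone

    def fail_drive(self, index):
        self.drives[index].clear()      # simulate a dead drive losing its data

store = MirroredStore()
store.write("invoice-42", "paid")
store.fail_drive(0)                     # one drive dies...
print(store.read("invoice-42"))         # ...the data is still intact: "paid"
```

Real RAID works at the block level and rebuilds the failed drive onto a spare, but the core invariant is the same: the system is only lost when every copy is lost.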

Horseman #2: Software Bugs

Software fails in sneakier ways than hardware. A memory leak (the program keeps allocating memory but never frees it) doesn't crash your server immediately — it slowly eats RAM over days until the OOM killer strikes (the Linux kernel's Out-Of-Memory killer forcibly terminates the process using the most memory when the system runs out of RAM; your app works fine for 72 hours, then Linux murders it). A race condition (a bug that only happens when two operations happen at exactly the same time and interfere with each other) doesn't show up in testing — it shows up at 2x traffic when two requests hit the same database row at the exact same millisecond, like two people both seeing "1 seat available" on a flight, both clicking "buy," and the airline selling the same seat twice. A bad configuration file doesn't cause errors until the server restarts and reads the broken config.

Software bugs are particularly dangerous because they're often latent. The bug was introduced last Tuesday. It didn't cause any problems until Saturday night when traffic hit a specific pattern. Now you're debugging a week-old change in the middle of the night, and nobody remembers what changed.

Horseman #3: Human Error (The Big One)

Here's the uncomfortable truth that every reliability report confirms: humans cause more outages than hardware, software, and network issues combined. Study after study puts human error at 60-80% of all outages.

A developer pushes to production instead of staging. A DBA runs DELETE FROM users without a WHERE clause. Someone fat-fingers a firewall rule (the rules that control which network traffic is allowed in and out of a server; one wrong rule, like accidentally blocking port 443, and your entire website becomes unreachable even though the server is running perfectly fine) and blocks all incoming traffic. An ops engineer rolls back to the wrong version. A config change meant for one server gets applied to all 200 servers.

Google, Amazon, and Microsoft have all publicly stated that the majority of their outages are caused by human error, not technical failure. Amazon's famous S3 outage in 2017 — which took down half the internet — was caused by a typo in a command. One engineer typed the wrong parameter, removed too many servers from an index subsystem, and the cascading failure took hours to recover from. The fix wasn't better hardware. It was better tooling that prevents humans from making dangerous mistakes.

This is why modern reliability engineering focuses so heavily on guard rails around human actions: code reviews for config changes, canary deployments that test on 1% of traffic first, automated rollbacks when error rates spike, and "dry run" modes that show what a command would do before actually doing it.
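One of those guard rails, the "dry run" mode, can be as simple as separating "compute the plan" from "execute the plan," so the destructive step requires an explicit human opt-in. A minimal sketch (the function names are hypothetical, not from any particular tool):

```python
import os

def plan_log_cleanup(files):
    """Dry run: return the list of files that WOULD be deleted, touching nothing."""
    return sorted(files)

def execute_cleanup(plan, confirm=False):
    """Delete the planned files, but only if the caller explicitly confirms."""
    if not confirm:
        raise RuntimeError("Refusing to delete without confirm=True. "
                           "Review the plan from plan_log_cleanup() first.")
    for path in plan:
        os.remove(path)

# Usage: always look before you leap.
plan = plan_log_cleanup(["/var/log/app/debug.1", "/var/log/app/debug.2"])
print("Would delete:", plan)
# execute_cleanup(plan, confirm=True)  # run only after a human reviews the plan
```

The same shape appears in real tooling (Terraform's plan/apply split, `rsync --dry-run`): the dangerous path exists, but a human can no longer stumble into it by accident.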

Horseman #4: Network Issues

Networks fail in the most confusing ways. A network partition (the "P" in the CAP theorem) means Server A can talk to Server B, but not Server C — even though B and C can still talk to each other; it's like two groups of friends who can hear each other within each group but not across groups. Packet loss (data packets dropped in transit by congested routers, bad cables, or overloaded switches) makes things slow and unreliable without fully breaking them: at 0.1% loss you barely notice, at 1% SSH sessions feel laggy, at 5% things start timing out, and at 10% the system is basically unusable. A DNS failure (the Domain Name System is the internet's phone book, converting names like "google.com" to IP addresses) means names can't be resolved to addresses, so nothing can find anything else; DNS is one of the few single points of failure that can take down the entire internet for a region.

Network issues are the rarest of the four horsemen (roughly 5-10% of outages), but they're often the most devastating in scope. A disk failure kills one server. A bad deployment kills one service. A network partition can split your entire cluster in half and cause data inconsistency that takes days to clean up.

The scariest network failure is a BGP hijack. BGP (Border Gateway Protocol) is how internet routers decide where to send traffic; a hijack happens when someone (intentionally or accidentally) announces bad routing information, so traffic meant for your servers goes somewhere else entirely — like someone changing the road signs to reroute all your customers to a competitor's store. This has happened to major companies including Google, Facebook, and Amazon.

Think First

Your team had 5 production incidents last month. If the industry average holds (60-80% human error), how many of those 5 were likely caused by a person, not a machine? What does that tell you about where to invest your reliability budget?

3-4 out of 5. This means investing in better deployment tooling, automated testing, and "dry run" commands will prevent more outages than buying better hardware.
System failures come from four sources: hardware (predictable, 10-20%), software bugs (sneaky and latent, 15-25%), human error (the #1 cause at 60-80% of all outages), and network issues (rare but devastating, 5-10%). The biggest bang for your reliability buck comes from protecting against human mistakes — better tooling, code reviews for config changes, canary deployments, and automated rollbacks.
Section 4

The Foundation — Redundancy (Having a Backup for Your Backup)

Now that you know what fails, let's talk about the most fundamental defense: have more than one of everything that matters. This is called redundancy — duplicate components so that if one fails, the other takes over, like carrying a spare tire in your car: you hope you never need it, but when you do, it saves you from being stranded on the highway. It's the oldest trick in the reliability playbook.

Think about airplanes. A Boeing 777 has two engines — it can fly perfectly well on just one. It has three independent hydraulic systems (the high-pressure fluid systems that move the control surfaces: flaps, rudder, ailerons; each has its own fluid reservoir, pumps, and tubing) — if two fail completely, the pilot can still control the aircraft with the third. It has four independent flight computers cross-checking each other. The landing gear has a manual backup release. Even the cockpit windshield has a heated spare layer in case the outer layer cracks. Every critical system has a backup. Some have three backups.

Why? Because the consequence of failure is catastrophic. You can't pull over at 35,000 feet. Software systems use the exact same thinking: the more critical the component, the more copies you need.

Levels of Redundancy

Engineers use a simple naming system for how much redundancy a system has:

N+1 redundancy means you have one more than you need. If you need 3 servers to handle your traffic, you run 4. If one dies, you still have exactly enough. This is the minimum level of redundancy for any production system. It protects against one failure, but not two simultaneous failures.

N+2 redundancy means two spares. Needed when failures can be correlated — like when a power outage kills two servers on the same rack. Or when you need to take one server offline for maintenance while still being protected against a random failure on another.

2N redundancy means you have double everything. If you need 4 servers, you run 8. This is expensive, but it's what hospitals use for life-critical systems and what financial trading platforms use. When every second of downtime costs millions, the cost of extra hardware is a rounding error.

[Diagram: Redundancy Levels — more copies = more reliability. No redundancy (1 server, $100/mo, ~99% availability): one failure means total outage. N+1 (one spare, $200/mo, ~99.9%): survives 1 failure. N+2 (two spares, $300/mo, ~99.99%): survives 2 failures. 2N (double everything, $400/mo): survives half the fleet failing.]

The nines of availability — each extra 9 costs roughly 10x more:

  • 99%: 87.6 hours of downtime per year (3.65 days). Hobby projects.
  • 99.9% ("three nines"): 8.76 hours/year, 43.8 min/month. Most SaaS products.
  • 99.99% ("four nines"): 52.6 min/year, 4.38 min/month. E-commerce, banking.
  • 99.999% ("five nines"): 5.26 min/year, 26.3 sec/month. Telecom, 911 systems.
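The availability numbers above can be derived from two one-line formulas; a minimal sketch, assuming replica failures are independent (correlated failures, like a shared power feed, make the real numbers worse):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability):
    """Convert an availability fraction (e.g. 0.999) into minutes of downtime/year."""
    return (1 - availability) * MINUTES_PER_YEAR

def redundant_availability(single, copies):
    """Availability of `copies` independent replicas where any one suffices.

    The system is down only if ALL copies are down at the same time.
    """
    return 1 - (1 - single) ** copies

# Three nines = ~8.76 hours of downtime per year.
print(downtime_minutes_per_year(0.999) / 60)

# Two independent 99% servers together behave like a ~99.99% system.
print(redundant_availability(0.99, 2))
```

This is why redundancy is such a good deal: pairing two mediocre 99% servers buys roughly the same availability as one heroic four-nines server, at a fraction of the engineering cost.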

Active-Active vs. Active-Passive

Having backups is only half the story. The other half is: what are those backups doing while the primary is healthy? There are two approaches.

In active-passive redundancy, the backup just sits there waiting. It's powered on, it's synchronized with the primary, but it doesn't serve any traffic. It's like that on-call surgeon — at home, awake, ready, but not operating. When the primary fails, the passive takes over. The upside: simple, safe, no complicated coordination. The downside: you're paying full price for a server that does nothing most of the time.

In active-active redundancy, all copies serve traffic simultaneously. There's no idle backup — everyone's working all the time. If one fails, the others absorb the extra load. Like a hospital with 3 MRI machines all running scans: if one breaks, patients shift to the other two with slightly longer wait times, but nobody goes unscanned. The upside: no wasted capacity, better performance normally. The downside: more complex to coordinate, and you need to ensure all servers have consistent data.

[Diagram: Active-passive vs. active-active. Active-passive: all traffic goes to the primary while a synced standby sits idle, waiting. Pros: simple and safe, no data conflicts. Cons: wasted capacity and a switchover delay. Use for databases and critical stateful services where data consistency matters most. Active-active: traffic is split, with servers A and B each handling 50% and syncing with each other. Pros: no wasted capacity, better normal performance. Cons: data sync complexity and conflict resolution. Use for stateless web servers, CDN nodes, read-heavy APIs, DNS servers.]

Which should you use? It depends on the component. For stateless things (web servers, API servers that don't store data locally), active-active is almost always better — you get more throughput and instant failover. For stateful things (databases, message queues, anything with data you can't afford to lose), active-passive is safer because there's only one source of truth for writes.

Not everything needs the same level of redundancy. Your login service (blocks all users if down)? N+2, active-active. Your recommendation engine (nice to have, not critical)? N+1 is probably fine. Your analytics pipeline (can process data later)? Maybe no redundancy at all — just retry when it comes back. Match the redundancy level to the business impact of failure, not some blanket "everything must be highly available" rule. Over-engineering reliability wastes money. Under-engineering it wastes trust.
Think First

You're designing an e-commerce system with these components: (1) Product catalog API, (2) Payment processing, (3) User review service, (4) Email notification sender. What redundancy level would you choose for each, and why?

Think about what happens if each one goes down. Can users still buy things? Losing reviews is annoying; losing payments is catastrophic. Emails can be retried later.
Redundancy means having backups for every critical component. N+1 gives you one spare (survives one failure), N+2 gives two spares (survives two), and 2N doubles everything. Active-passive keeps a standby idle for simple failover; active-active splits traffic across all copies for better efficiency. Each extra "nine" of availability costs roughly 10x more — match your redundancy level to the business impact of each component's failure.
Section 5

Failover — The Backup Takes Over

Having a backup is step one. Making that backup actually take over smoothly — without losing data, without confusing users, without breaking things worse — that's failover. It sounds simple ("just switch!"), but the devil is in the details: how fast do you detect the failure? How do you avoid data loss during the switch? How do you prevent both servers from thinking they're the primary? This is where most of the complexity lives.

Think of it like this: you're in a meeting with a client. Your sales lead suddenly gets sick and has to leave. Having a second sales person in the building is redundancy. Getting them up to speed, into the meeting, and smoothly continuing the conversation without the client noticing — that's failover. The transition is the hard part, not the spare capacity.

Active-Passive Failover

The most common pattern. You have a primary server handling all traffic, and a standby replica that mirrors everything the primary does. The standby watches the primary like a hawk, constantly asking "are you still alive?" When the primary stops responding, the standby says "okay, I'm in charge now" and starts accepting traffic.

This sounds simple, but three big questions make it complicated:

1. How fast can you detect the failure? The standby pings the primary every few seconds with a heartbeat (a periodic "I'm still alive" signal between servers; too frequent wastes bandwidth, too infrequent means slow detection, so typical settings are an interval of 1-5 seconds with the primary declared dead after 3-5 missed beats). With a 5-second interval and 3 missed beats, that's 15 seconds just to detect the failure.

2. How long does promotion take? The standby needs to finish replaying any pending data, open network ports, register itself with the load balancer, and start accepting connections. For a database, this might mean replaying the WAL (the standby continuously receives Write-Ahead Log records from the primary and replays them to stay in sync; during failover it must finish replaying all pending records before it can accept writes) — anywhere from 5 to 60 seconds depending on how much data is in the pipeline.

3. How long to redirect traffic? The load balancer or DNS needs to stop sending traffic to the dead server and start sending it to the new primary. If you're using a load balancer with health checks, this can be near-instant. If you're relying on DNS changes, it could take minutes because of TTL caching (Time-To-Live is how long DNS resolvers cache an IP address before asking again; a TTL of 60 seconds means some clients won't see the new IP for up to 60 seconds after a change, and while lower TTLs mean faster failover, they also mean more DNS queries and more cost). This is why most failover uses load balancers instead of DNS.

Add it all up: detect (15s) + promote (30s) + redirect (5s) = ~50 seconds of downtime in a typical active-passive database failover. That's the best case. In practice, 1-5 minutes is common.
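That detect + promote + redirect budget is easy to model; a sketch using the illustrative defaults from the text:

```python
def detection_time(heartbeat_interval_s, missed_beats):
    """Worst-case time to declare the primary dead."""
    return heartbeat_interval_s * missed_beats

def failover_downtime(heartbeat_interval_s=5, missed_beats=3,
                      promote_s=30, redirect_s=5):
    """Total user-visible downtime for one active-passive failover."""
    return (detection_time(heartbeat_interval_s, missed_beats)
            + promote_s + redirect_s)

# The example from the text: 15s detect + 30s promote + 5s redirect = 50s.
print(failover_downtime())                     # 50

# Tuning knob: a 1-second heartbeat cuts detection to 3s (total 38s),
# but makes false positives from brief network blips more likely.
print(failover_downtime(heartbeat_interval_s=1))  # 38
```

The model makes the trade-off explicit: every knob that shortens downtime (shorter heartbeats, fewer missed beats before declaring death) also raises the risk of failing over on a transient blip, which is exactly the split-brain setup described next.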

[Diagram: Active-Passive Failover, step by step. 1. Normal: primary serves traffic, standby syncs data, heartbeats OK, users happy. 2. Detect (5-30s): primary goes silent, heartbeats miss, standby asks "is it dead?", users start seeing errors. 3. Promote (5-60s): standby replays WAL, opens its write port, becomes the new primary; users still see errors. 4. Redirect (1-30s): the load balancer updates its backend list (and DNS updates if needed), traffic flows to the new primary, users are back to normal. Total downtime = detect + promote + redirect; typically 30 seconds to 5 minutes, with a goal of under 30 seconds. Active-active failover is much faster: both servers already serve traffic, so there is no promote step; detection + redirect only, typically under 10 seconds total.]

The Split-Brain Problem

Here's the scariest failure mode in distributed systems. Imagine your primary database is running in Data Center A. Your standby is in Data Center B. The network link between them goes down — but both servers are still running fine.

The standby can't reach the primary. After 15 seconds of missed heartbeats, it concludes: "The primary must be dead. I'm promoting myself to primary." But the primary isn't dead. It's still running, still accepting writes from users who can reach Data Center A. Now you have two servers that both think they're the primary. Both are accepting writes. Both are making changes to the data. And those changes conflict with each other.

This is called split-brain (two servers in a cluster both believe they are the primary and start making conflicting changes, like two pilots both grabbing the controls and steering in different directions), and it can cause permanent data corruption. Customer A updates their address on Server 1. Customer A's order ships from Server 2 using the old address. A payment is processed on Server 1 but the inventory is decremented on Server 2. The data becomes inconsistent in ways that are extremely difficult to untangle.

[Diagram: The Split-Brain Problem: when the network splits, both sides think the other is dead. In Data Center A the primary says "I'm still primary!" and keeps accepting user writes (address = "123 New St", balance = $450). Across the partition in Data Center B, the promoted standby says "The primary is dead!" and accepts its own writes (address = "456 Old Ave", balance = $380). Data conflict: which version is correct? Both were valid writes. This can corrupt data permanently, and recovering requires manual reconciliation.]

Split-brain is one of the most feared failure modes in distributed systems. The standard prevention is called STONITH — "Shoot The Other Node In The Head." When the standby decides to promote itself, it first sends a hard kill signal (power off, not shutdown) to the old primary to make absolutely sure it's not still accepting writes. Another approach is quorum-based fencing: a server can only be primary if it can reach a majority of nodes in the cluster. If neither side has a majority after a network split, both sides stop accepting writes. Better to be unavailable than to corrupt data.
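The quorum rule reduces to a single comparison; a minimal sketch (a real implementation sits inside the cluster's membership protocol, but the decision itself is this simple):

```python
def may_act_as_primary(reachable_nodes, cluster_size):
    """Quorum-based fencing: act as primary only with a strict majority.

    reachable_nodes counts this node plus every peer it can currently reach.
    After a clean 50/50 network split, NEITHER side has a majority, so both
    stop accepting writes: unavailable, but never split-brain.
    """
    return reachable_nodes > cluster_size // 2

# 5-node cluster splits 3 / 2: only the 3-node side keeps writing.
print(may_act_as_primary(3, 5))  # True
print(may_act_as_primary(2, 5))  # False

# 4-node cluster splits 2 / 2: both sides halt. This is why odd cluster
# sizes are the norm: an even split can strand the whole cluster.
print(may_act_as_primary(2, 4))  # False
```

Note the asymmetry in the guarantee: a majority can exist on at most one side of any partition, so two simultaneous primaries are impossible, at the cost of sometimes having zero.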

Active-Active Failover: Simpler (With Caveats)

Active-active failover is mechanically simpler: both servers already handle traffic, so there's no "promotion" step. If one dies, the load balancer simply stops sending it requests. The remaining server absorbs the extra load. Total failover time: however long it takes the load balancer to detect the failure — usually 5-10 seconds.

The catch is that active-active only works cleanly for stateless services or read-only workloads. For anything that writes data, you need a strategy for handling writes that arrived at both servers simultaneously. This is why most database failover is active-passive — databases are inherently stateful, and split-brain data corruption is worse than a few seconds of downtime.

Think First

Your PostgreSQL primary database is in US-East. Your standby replica is in US-West. The network link between them goes down for 30 seconds. What happens with active-passive failover? What if the standby promotes itself? What happens when the link comes back?

If the standby promotes during a temporary network blip, you get split-brain for 30 seconds. Both accept writes. When the link comes back, those conflicting writes need manual reconciliation. This is why heartbeat timeouts should be long enough to ride out brief network issues — but short enough to detect real failures.
Failover is the process of switching from a failed primary to its backup. Active-passive failover takes 30 seconds to 5 minutes (detect + promote + redirect). Active-active is faster (~10 seconds) but only works cleanly for stateless services. The biggest danger is split-brain — when both servers think they're primary and accept conflicting writes. Prevention uses STONITH (kill the old primary) or quorum-based fencing (need a majority to be primary).
Section 6

Health Checks & Heartbeats — Detecting Failures Before Users Do

You can have the best backups in the world, but they're useless if you don't know something is broken. You can't fix what you can't see. That's where health checks and heartbeats come in — they're the monitoring systems that detect failures before your users call to report them.

Think of it this way: a health check is like a doctor asking "are you okay?" — someone else checks on you. A heartbeatA periodic signal a server sends to say "I'm still alive." Like a pulse — if it stops, something is wrong. The monitor doesn't need to ask; the server proactively sends beats. If the beats stop, the monitor knows the server is either dead or unreachable. is like wearing a heart monitor — you broadcast your status continuously. Both accomplish the same goal (knowing if something is alive), but they work in opposite directions.

Three Types of Health Checks

Not all "are you alive?" questions are equal. There are three levels, and using the wrong one can actually make things worse.

Liveness checks answer the simplest question: "Is the process running?" Can I reach the server? Does it respond at all? This is like checking if a patient has a pulse — it doesn't tell you if they're healthy, just that they're not dead. A server might be running but completely broken (stuck in an infinite loop, for example). A liveness check would still say "yes, it's alive."

Readiness checks go deeper: "Can this server actually handle traffic right now?" It's running, sure, but is it still starting up? Is it overloaded? Did it lose its database connection? A readiness check is like asking a surgeon "are you available for surgery?" — they might be alive and well but in the middle of another operation. A server that fails readiness checks stays in the pool but stops receiving new traffic until it recovers.

Deep health checks (sometimes called dependency checks) test everything: "Can this server do its actual job?" It checks the database connection, the cache, external APIs, disk space, memory — all the things the server needs to function. This is like a full physical exam. If the database is down, the health check reports unhealthy even though the server itself is fine.

Diagram: Three Levels of Health Checks — each level checks deeper, but takes longer and costs more.
Liveness ("Is it alive?"): GET /health → 200 OK. Checks: process running, port open, responds to ping. Speed: < 1ms. Use: restart if dead. Risk: can say "alive" when broken.
Readiness ("Can it serve traffic?"): GET /health/ready → 200 OK or 503 Not Ready. Checks: startup complete, not overloaded, warmed up. Speed: 1-10ms. Use: stop sending NEW traffic. Risk: may oscillate under load.
Deep health ("Can it do its actual job?"): GET /health/deep → checks DB, cache, disk, APIs. Checks: DB query works, cache reachable, disk > 10%. Speed: 50-500ms. Use: full dependency check. Risk: can cascade failures!

The Health Check Cascade Trap

Deep health checks are the most useful but also the most dangerous. Here's why: imagine your health check endpoint queries the database to verify the connection is alive. Normally this takes 5ms. But one day the database is overloaded and responding slowly — 2 seconds per query instead of 5ms.

Now your health check takes 2 seconds. Your load balancer has a 1-second timeout for health checks. The health check times out. The load balancer concludes: "This server is unhealthy." It removes the server from the pool. But the server is fine — it's the database that's slow. Now the remaining servers get more traffic, which means more database queries, which makes the database even slower, which makes more health checks fail, which removes more servers... and suddenly your entire fleet is marked unhealthy because of a single slow database.

This is called a cascading failureWhen one failure causes another, which causes another, in a chain reaction. Like dominos falling — the first one tips over and takes down all the others. In software, a slow database makes health checks fail, which removes servers, which overloads remaining servers, which makes them fail too., and it turns a minor database slowdown into a complete outage of every server.

Your health check should be faster than your timeout. If your load balancer's health check timeout is 3 seconds, your health check endpoint must respond in under 3 seconds — always, even when dependencies are slow. The solution: use liveness checks for the load balancer (fast, no dependencies), and use deep checks for monitoring dashboards (slower, more thorough, but doesn't trigger removal). Never let a slow dependency take down healthy servers through health check cascading.

Heartbeat Patterns

Health checks are "pull" — someone asks "are you alive?" Heartbeats are "push" — the server announces "I'm alive!" without being asked. There are three main patterns for how servers communicate their status to each other.

Diagram: Three Heartbeat Patterns.
Push ("I'm alive!"): each server periodically sends a beat to a central monitor. Pros: low overhead on the monitor; the server controls the frequency. Cons: the monitor is a SPOF; a network issue looks like a death. Used by: Consul, etcd watch.
Pull ("Are you alive?"): the monitor polls each server — "Are you there?" "Yes!" Pros: the monitor controls the timing; catches unresponsive servers. Cons: the monitor is a SPOF; scales poorly (N polls). Used by: load balancers, Nagios.
Gossip ("Did you hear?"): every server periodically tells a random peer what it knows about cluster health. Pros: no single point of failure; scales to huge clusters. Cons: slower detection (gossip lag); more complex to implement. Used by: Cassandra, Consul, SWIM.

Push heartbeats are the simplest: each server sends a periodic "I'm alive" signal to a central monitor. If the monitor stops receiving beats from a server, it marks it as dead. This works well for small clusters but has a weakness — the monitor itself is a single point of failureA component whose failure takes down the entire system. If your monitoring server dies, nobody detects failures in the actual servers. It's like having only one smoke detector for an entire building — if the detector breaks, no fire gets detected.. If the monitor goes down, nobody detects failures.

Pull health checks work the other way: a monitor (often the load balancer) periodically asks each server "are you alive?" This is how most load balancers work — they send HTTP requests to a /health endpoint every 5-10 seconds. If a server doesn't respond within the timeout (usually 2-5 seconds), the load balancer removes it from the rotation after 2-3 failures.
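The load balancer's bookkeeping for pull checks boils down to a failure counter per server. A minimal sketch (the class and field names are ours):

```python
class PullHealthChecker:
    """Tracks poll results per server: a server leaves the rotation after
    N consecutive failed checks and rejoins on the next successful one."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = {}        # server -> consecutive failure count
        self.in_rotation = set()  # servers currently receiving traffic

    def add_server(self, server):
        self.failures[server] = 0
        self.in_rotation.add(server)

    def record_check(self, server, ok):
        if ok:
            self.failures[server] = 0
            self.in_rotation.add(server)
        else:
            self.failures[server] += 1
            if self.failures[server] >= self.failure_threshold:
                self.in_rotation.discard(server)
```

Requiring several consecutive failures (rather than one) is what keeps a single dropped packet from ejecting a healthy server.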

Gossip protocols are the most sophisticated. There's no central monitor. Instead, every server randomly picks a few peers and shares what it knows about the cluster. "Hey, Server A sent me a heartbeat 2 seconds ago, so it's alive. Server D hasn't been heard from in 30 seconds — I think it's dead." Over time, this gossip propagates to every node. The big advantage: no single point of failure. If any node goes down, the rest of the cluster still knows about it. The trade-off: slower detection (information takes time to gossip through the cluster) and more complex implementation.

Practical Design: What Should Your Health Check Do?

Here's a concrete example. Imagine you have a web API that reads from a database and caches results in Redis. Your health check strategy should look like this:

Liveness endpoint (/health/live): Return 200 immediately. Don't check anything. The only thing this tells the orchestrator is "the process hasn't crashed." If this fails, the process needs to be restarted — not just removed from the load balancer pool, but actually killed and restarted.

Readiness endpoint (/health/ready): Check that the app has finished startup, database connection pool has at least one available connection, and the server isn't overloaded (e.g., pending request count below a threshold). This should respond in under 50ms. The load balancer uses this to decide whether to send traffic.

Deep health endpoint (/health/deep): Run a test query against the database (SELECT 1), ping Redis, check disk space, verify external API connectivity. This might take 200-500ms. Use it for monitoring dashboards and alerting, but never for load balancer health checks (because of the cascade problem).
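The three endpoints above can be sketched as plain handler functions, independent of any web framework. The `state` dict and the `db_ping`/`cache_ping` callables are placeholders for whatever your app actually tracks:

```python
import shutil

def health_live():
    # Liveness: if this handler runs at all, the process is up.
    # No dependency checks; failing here should mean "restart me."
    return 200, "alive"

def health_ready(state):
    # Readiness: can THIS instance take traffic right now?
    if not state["startup_complete"]:
        return 503, "still starting"
    if state["free_db_connections"] < 1:
        return 503, "no free database connections"
    if state["pending_requests"] > state["max_pending"]:
        return 503, "overloaded"
    return 200, "ready"

def health_deep(db_ping, cache_ping, min_disk_fraction=0.10):
    # Deep check: exercise real dependencies (e.g. db_ping runs SELECT 1,
    # cache_ping sends a Redis PING). For dashboards and alerting only;
    # never wire this into the load balancer (cascade trap).
    problems = []
    if not db_ping():
        problems.append("database")
    if not cache_ping():
        problems.append("cache")
    usage = shutil.disk_usage("/")
    if usage.free / usage.total < min_disk_fraction:
        problems.append("disk")
    return (200, "ok") if not problems else (503, ",".join(problems))
```

Note the asymmetry: the liveness handler touches nothing, while the deep handler touches everything but is only read by monitoring.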

Kubernetes popularized the three-probe model: livenessProbe (restart if failing), readinessProbe (stop sending traffic if failing), and startupProbe (give slow-starting apps time before checking liveness). This separation is brilliant because it distinguishes between "the process is broken and needs a restart" versus "the process is temporarily unable to serve traffic." You don't want to restart a server just because the database is slow — that would make things worse. You want to stop sending it traffic until the database recovers.
Think First

Your health check endpoint queries the database with SELECT 1 and your load balancer has a 3-second timeout. The database starts responding slowly (2.5 seconds per query). What happens? Now imagine the database gets even slower (4 seconds). What changes?

At 2.5s, health checks pass (just barely under the 3s timeout), but they're consuming a database connection for 2.5 seconds each. At 4s, health checks time out, the load balancer removes the server — even though the server itself is healthy. If all servers get removed, you have a full outage caused by a slow (not dead) database. This is the cascade trap.
Health checks detect failures before users do. There are three levels: liveness (is the process running?), readiness (can it serve traffic?), and deep checks (can it do its job?). Heartbeats work in three patterns: push (server announces "I'm alive"), pull (monitor asks "are you alive?"), and gossip (peers share knowledge about the cluster). The biggest pitfall is the cascade trap — deep health checks that depend on a slow database can cause the load balancer to remove perfectly healthy servers, turning a minor slowdown into a complete outage.
Section 7

Retries & Exponential Backoff — Handling Temporary Failures

Not every failure is permanent. A server might restart in 3 seconds. A network switch might flap for half a second. A queue might be temporarily full but drain in moments. These are called transient failuresA temporary failure that resolves itself without any intervention. Think of it like a busy phone line — call back in a minute and it'll probably work fine., and they're the most common kind of failure in distributed systems. The right response isn't to give up — it's to try again.

But how you try again matters enormously. Get it wrong and you'll turn a small hiccup into a full-blown outage.

The Naive Approach: Retry Immediately in a Loop

Imagine a popular API goes down for 2 seconds. It serves 10,000 clients. Every client notices the failure at roughly the same time and immediately retries. So instead of handling 10,000 normal requests per second, the API comes back to life and is instantly hit with 10,000 retry requests on top of the normal 10,000 new requests. That's 20,000 requests — double the normal load — slamming into a server that was already struggling. It goes down again. Everyone retries again. And again. This self-reinforcing storm is called the thundering herdWhen a large number of clients all retry at the same instant, overwhelming the recovering server with a burst of traffic that's far worse than the original load. problem, and it can keep a system down for minutes or hours even though the original failure lasted only seconds.

The Smart Approach: Exponential Backoff with Jitter

Instead of retrying immediately, you wait a little while. And if that retry also fails, you wait longer. Each failed attempt doubles the wait time: 1 second, then 2 seconds, then 4, then 8, then 16... This is exponential backoffA retry strategy where the delay between attempts grows exponentially — typically doubling each time. This gives the failing system progressively more breathing room to recover.. It gives the struggling server progressively more breathing room to recover.

But there's still a problem. If all 10,000 clients use the same backoff schedule, they'll all retry at 1 second, then all at 3 seconds, then all at 7 seconds — synchronized waves of traffic. The fix is jitterA random variation added to the retry delay. Instead of retrying at exactly 4 seconds, one client retries at 3.7s, another at 4.3s, another at 4.1s. This spreads the retries out over time. — add some randomness to each delay. Instead of everyone retrying at exactly 4 seconds, one client retries at 3.2s, another at 4.8s, another at 5.1s. The retries spread out like a gentle wave instead of crashing like a tsunami.

The formula:

delay = min(base × 2^attempt + random_jitter, max_delay)

base = initial delay (often 1 second)
attempt = which retry this is (0, 1, 2, 3...)
random_jitter = a random value between 0 and the current delay (spreads retries out)
max_delay = a cap so you don't wait forever (often 30-60 seconds)
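A direct translation of the formula into Python, with `random.uniform` supplying the jitter:

```python
import random

def backoff_delay(attempt, base=1.0, max_delay=30.0):
    """Delay in seconds before retry number `attempt` (0-based):
    exponential growth, capped at max_delay, plus jitter in [0, delay]."""
    delay = min(base * (2 ** attempt), max_delay)
    return delay + random.uniform(0, delay)
```

With the defaults, attempt 0 sleeps somewhere in 1-2s, attempt 2 in 4-8s, and by attempt 5 the exponential term is capped at 30s, so the sleep lands in 30-60s.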
Diagram: Naive Retry vs Exponential Backoff + Jitter. Naive (retry immediately): after the failure, ALL clients retry at once, then all retry again — the server never recovers (thundering herd). Smart (exponential backoff + jitter): retries arrive spread out at ~1s, ~2-4s, ~4-12s — the server has room to recover.

When to Retry — and When NOT To

Not every error deserves a retry. The key question is: could this request succeed if I try again without changing anything? A 503 (Service Unavailable) or a connection timeout? Probably yes — the server might be back in a moment. A 400 (Bad Request) or 404 (Not Found)? Absolutely not — if your request was malformed the first time, it's still malformed the second time. Retrying would just waste resources.

Diagram: Should I Retry This Error? For a 4xx / logic error (400 Bad Request, 401 Unauthorized, 404 Not Found, 422 Validation Error): DON'T retry — fix the request. For a 5xx / timeout, check idempotency: if the operation is idempotent (GET, PUT, DELETE), RETRY with backoff + jitter; if not (POST create, charge payment), RETRY CAREFULLY — you need an idempotency key or deduplication.

Idempotency: The Safety Net for Retries

Here's a critical concept: retrying is only safe if the operation is idempotentAn operation where doing it once has the exact same result as doing it twice (or ten times). GET a web page? Always safe to retry. Transfer $100? Dangerous — you might transfer $200 if you retry. — meaning doing it twice produces the same result as doing it once. If you ask "what's the current price of this item?" ten times, you get the same answer ten times. Safe. If you ask "charge this credit card $50" ten times, you charge $500. Very not safe.

GET requests are naturally idempotent — reading data doesn't change anything. PUT and DELETE are idempotent by design — "set the price to $20" always results in the price being $20 regardless of how many times you say it. POST is the dangerous one — "create a new order" run twice means two orders. For POST operations, you need an idempotency keyA unique identifier (often a UUID) that the client sends with the request. The server checks: "Have I already processed a request with this key?" If yes, it returns the previous result instead of processing it again. — a unique ID sent with the request so the server can detect and ignore duplicates.
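Here's a sketch of the server side of idempotency-key handling (the class and response shape are illustrative, not a real payment API):

```python
class PaymentServer:
    """Deduplicates requests by idempotency key: a retried charge with the
    same key returns the stored response instead of charging again."""

    def __init__(self):
        self.seen = {}       # idempotency key -> stored response
        self.charges = []    # actual side effects (charges performed)

    def charge(self, idempotency_key, card, amount):
        if idempotency_key in self.seen:
            # Duplicate (e.g. a client retry after a timeout): replay
            # the original response, perform no new charge.
            return self.seen[idempotency_key]
        self.charges.append((card, amount))
        response = {"status": "charged", "amount": amount}
        self.seen[idempotency_key] = response
        return response
```

The client generates the key once per logical operation (often a UUID) and resends the same key on every retry; only the first arrival has a side effect.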

```python
import random
import time

def call_with_retry(request, max_retries=5):
    base_delay = 1   # seconds
    max_delay = 30   # seconds
    for attempt in range(max_retries):
        response = send(request)
        if response.is_success():
            return response                      # It worked!
        if response.status in [400, 401, 403, 404, 422]:
            raise PermanentError(response)       # Don't retry client errors
        # It's a 5xx or timeout — worth retrying
        delay = min(base_delay * (2 ** attempt), max_delay)
        jitter = random.uniform(0, delay)        # Random 0 to delay
        time.sleep(delay + jitter)
    raise MaxRetriesExceeded()                   # Give up after all attempts
```

Retrying a payment charge, order creation, or message send without idempotency protection can cause double charges, duplicate orders, or spam messages. Always attach a unique idempotency key to non-idempotent requests so the server can detect and discard duplicates.

Retries handle transient failures, but naive immediate retries cause thundering herds that keep failing servers down. Exponential backoff (doubling wait times) plus jitter (random variation) spreads retries over time, giving the server room to recover. Only retry 5xx/timeout errors — never 4xx client errors. And always ensure operations are idempotent (or use idempotency keys) before retrying, or you risk duplicate charges and corrupted data.
Section 8

Circuit Breakers — Stop the Bleeding

Your house has circuit breakers in the electrical panel. If too much current flows through a wire — maybe a short circuit, maybe too many appliances on one outlet — the breaker trips and cuts the power to that circuit. It's not a fix. The broken toaster is still broken. But the breaker stops the broken toaster from causing an electrical fire that burns down the whole house.

Software circuit breakers work the exact same way. When a service you depend on is failing, the circuit breaker stops sending requests to it. Not to fix the broken service, but to protect everything else from being dragged down with it.

Why You Need This: The Cascade Problem

Picture this: your app (Service A) calls a payment service (Service B) on every checkout. Service B gets slow — maybe a database issue — and starts taking 30 seconds per request instead of 200ms. What happens?

Every thread in Service A that calls Service B is now stuck waiting for 30 seconds. You have 100 threads. Within minutes, all 100 threads are blocked, waiting for a service that isn't responding. Now Service A can't handle any requests — not even the ones that don't need Service B. Your entire checkout flow, your homepage, your search — everything is frozen.

And it gets worse. If Service C depends on Service A? Now Service C is also stuck waiting. One slow database in one service has taken out three services. This is called a cascading failureWhen one service's failure causes another service to fail, which causes another to fail, and so on — like dominoes toppling. A single point of failure ripples through the entire system., and it's one of the most common causes of total system outages.

Diagram: Cascading Failure — one slow service takes down everything. Service B (payment) is the root cause: its database is overloaded and responses take 30s per request. Service A blocks all 100/100 of its threads waiting on B and goes down; Service C blocks all 100/100 of its threads waiting on A and goes down. Users see errors. One slow database in Service B took out THREE services — this is exactly what circuit breakers prevent.

The Three States of a Circuit Breaker

A circuit breaker sits between your service and the service you're calling. It watches every request and tracks how many succeed and how many fail. It has three states:

CLOSED (normal operation): Requests flow through normally. The breaker is monitoring success/failure rates in the background. Think of it like a closed electrical circuit — current flows freely.

OPEN (broken — fast-fail mode): Too many requests have failed (say, 50% failure rate in the last 30 seconds). The breaker "trips." Now, instead of sending requests to the broken service and waiting 30 seconds for a timeout, it immediately returns an error. No waiting, no blocked threads. Your service stays healthy and can serve a fallback response (like "payments temporarily unavailable") instead of going completely unresponsive.

HALF-OPEN (testing — is it fixed yet?): After a cooldown period (say, 60 seconds), the breaker lets one test request through. If it succeeds, the breaker closes and normal traffic resumes. If it fails, the breaker opens again and waits another cooldown period. This is the "try flipping the breaker back on" moment.
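All three states fit in a small class. This is a sketch, not a production library (real implementations such as resilience4j or pybreaker add failure-rate windows, metrics, and thread safety); the clock is injectable so the cooldown can be exercised in tests:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: trips OPEN after
    `failure_threshold` consecutive failures, then after `cooldown`
    seconds goes HALF-OPEN and allows one probe request through."""

    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"     # one probe allowed
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Success: close the breaker and reset the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

The key property: while OPEN, `call` raises immediately instead of blocking a thread for 30 seconds on a dead dependency.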

Diagram: Circuit Breaker State Machine. CLOSED (requests flow normally; tracking success/failure) trips on 50% failures in 30s to OPEN (requests fail immediately; no calls to the broken service); after a 60s cooldown it moves to HALF-OPEN (testing with 1 request). Test request OK → back to CLOSED. Test failed → back to OPEN.

Circuit Breakers vs Retries

Retries and circuit breakers are complementary, not competing. Think of it this way: retries handle hiccups (a single request failed, try again). Circuit breakers handle meltdowns (the whole service is down, stop trying). In practice, you use both: retry a few times for transient errors, but if the circuit breaker detects sustained failure, it stops all retries and fails fast.

A circuit breaker doesn't fix the broken service. It protects EVERYTHING ELSE from being dragged down with it. The broken service still needs someone to fix it — the circuit breaker just buys time by preventing cascading failures while the team investigates.

Circuit breakers prevent cascading failures — when one service goes down, they stop it from taking out every other service that depends on it. Three states: CLOSED (normal), OPEN (fail fast, protect the system), HALF-OPEN (test if the problem is fixed). Retries handle transient hiccups; circuit breakers handle sustained outages. Use both together.
Section 9

Graceful Degradation — Keep Working, Just Not Perfectly

When something breaks, you have two choices. Option one: the whole system crashes and users see an error page. Option two: the system keeps running, but with reduced functionality — maybe recommendations are missing, maybe search results are slightly stale, maybe images don't load. Which would you rather have?

That's graceful degradationA design strategy where a system continues to provide core functionality (even if reduced) when some component fails, rather than failing completely. — the art of breaking less badly. It means planning ahead so that when (not if) something fails, the system drops non-essential features while keeping the critical path alive.

Real-World Examples

Netflix's recommendation engine crashes? Instead of showing a blank screen, they show the "Top 10 in your country" — a static, pre-computed list that doesn't need the recommendation service at all. Users might not get personalized picks, but they can still browse and watch.

Amazon's search gets overloaded during a Prime Day sale? They serve cached search results from 5 minutes ago instead of real-time results. The prices might be slightly out of date, but customers can still find products and shop. Much better than showing "Search Unavailable."

Twitter can't load images? They show text-only tweets. The experience is worse, but the core functionality — reading and posting tweets — still works.

Degradation Tiers

Smart teams plan their degradation as a series of tiers — like dialing down a dimmer switch instead of flipping the lights off all at once. Each tier sacrifices a little more functionality to keep the core experience alive under increasing pressure.

Diagram: Degradation Tiers — the dimmer switch (each tier down is a worse experience).
Tier 1: Full Service — everything works: recommendations, search, images, real-time data, analytics.
Tier 2: Reduced Features — disable recommendations, analytics, non-critical APIs. Core CRUD still works.
Tier 3: Read-Only Mode — users can view content but not post, buy, or modify. Serves cached data.
Tier 4: Static Fallback — pre-built static HTML pages served from a CDN. No dynamic content at all.
Tier 5: Maintenance Page — "We're working on it. Check back soon." The last resort.

Feature Flags for Degradation

You don't want to figure out what to turn off during an outage at 2 AM. That's panic-mode engineering. Instead, pre-wire feature flagsConfiguration switches (often in a central config service like LaunchDarkly or AWS AppConfig) that let you enable or disable features instantly without deploying new code. Think of them as light switches for features. — "kill switches" that let you disable specific features with a single toggle. Before an incident happens, you've already decided: "If the recommendation service goes down, flip this flag and show trending content instead."
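The pattern is just a config lookup wrapped around a risky call, with a dependency-free fallback behind it. A sketch (the flag name and functions are hypothetical; in practice the flag would live in a central config service):

```python
feature_flags = {"recommendations": True}   # flipped at runtime, no deploy

def trending_content():
    # Pre-computed, dependency-free fallback ("Top 10 in your country").
    return ["Top 10 in your country"]

def homepage_rows(user, fetch_recommendations):
    """Serve personalized rows when the flag is on and the service is up;
    otherwise degrade to the static fallback instead of erroring."""
    if feature_flags["recommendations"]:
        try:
            return fetch_recommendations(user)
        except Exception:
            pass   # degrade, don't crash the homepage
    return trending_content()
```

Flipping the flag off skips the recommendation service entirely, which is exactly what you want while it's melting down: no calls, no timeouts, no retries.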

Load Shedding: Triaging Under Pressure

When your system is overwhelmed, you can't serve everyone equally — trying to do so means everyone gets a terrible experience. Load sheddingThe practice of intentionally dropping some requests during overload so that the remaining requests can be served properly. Like an emergency room prioritizing critical patients over minor injuries. is the practice of intentionally rejecting some requests so that the rest can be served properly. It's like an emergency room during a mass-casualty event — you triage. Critical patients (checkout, payments) get served first. Less urgent cases (browsing, recommendations) get told to come back later.
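A priority gate can be as simple as ranking requests by criticality and cutting at capacity. A toy sketch (the priority ordering is illustrative):

```python
# Lower number = more critical. The ordering is illustrative.
PRIORITY = {"checkout": 0, "search": 1, "browse": 2, "recommendations": 3}

def shed(requests, capacity):
    """Admit the most critical requests up to `capacity`; the rest are
    shed (in an HTTP service they'd receive a 503)."""
    ranked = sorted(requests, key=lambda kind: PRIORITY[kind])
    return ranked[:capacity], ranked[capacity:]
```

A real shedder admits or rejects each request as it arrives (for example, with a token bucket per priority class) rather than sorting a batch; sorting here just makes the triage visible.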

Diagram: Load Shedding — priority-based request triage. An incoming flood of 10,000 req/s (2K checkouts, 3K searches, 3K browse, 2K recommendations) hits a priority gate with capacity for 5,000 req/s. Checkouts (2K) and searches (3K) pass — users can still buy things. Browse (3K) and recommendations (2K) are dropped with HTTP 503 — "Try again later."

During an outage is the worst time to figure out which features to disable. Create a runbook that lists every feature flag, what it controls, and when to flip it. Run "game day" exercises where you intentionally trigger each degradation tier to make sure it actually works.

Graceful degradation means breaking less badly — keeping core features alive while shedding non-critical ones. Plan degradation in tiers (full service → reduced features → read-only → static fallback → maintenance page), pre-wire feature flags as kill switches, and use load shedding to prioritize critical requests (checkout over browsing) when overwhelmed. The key: plan your degradation strategy BEFORE the incident, not during it.
Section 10

Blast Radius — Limiting the Damage

When a bomb goes off, the damage depends on how close you are. A firecracker breaks a window. A stick of dynamite collapses a room. A missile levels a building. The area of destruction is called the blast radiusIn infrastructure, the blast radius is the scope of impact when something fails. A single-server failure has a small blast radius. A regional failure has a massive one. The goal is to architect systems so that any single failure has the smallest possible blast radius. — and in software, it's the single most important question you can ask about any failure: how much of my system is affected?

If one server crashes behind a load balancer with 10 servers, the blast radius is 10% — 9 out of 10 servers still handle traffic. If an entire availability zoneAn availability zone (AZ) is an isolated data center (or cluster of data centers) within a cloud region. AWS, for example, has regions like us-east-1 with 3-6 AZs each. AZs within a region have independent power and networking but are connected by low-latency links. goes down and you're running in 3 AZs, the blast radius is 33%. If an entire region goes down and you're only in one region? The blast radius is 100%. You're completely offline.

The goal of blast radius engineering is simple: make sure no single failure can take down everything.

Diagram: Blast Radius — how much goes down? One server dies: 10% of traffic lost. One AZ dies: 33% of traffic lost. One region dies: 100% of traffic lost (if single-region). Minimum: 2+ AZs — standard for any production system; survives an AZ failure. Standard: 3 AZs — the industry default; lose 1 AZ and you still have 67% capacity. Critical: multi-region — for global apps; survives an entire region failure (2-10x more costly).

Strategies to Shrink the Blast Radius

Old wooden ships had one big hull. A single hole sank the whole ship. Modern ships are divided into watertight compartmentsSealed sections of a ship's hull. If one compartment floods, the watertight doors contain the water to that section, keeping the rest of the ship afloat. The Titanic had 16 compartments — but 5 flooded simultaneously, which exceeded its design limit. (bulkheads). A hole in one compartment floods only that section — the rest of the ship stays afloat.

In software, bulkheads mean isolating resources for different functions. Give your checkout service its own thread pool, database connection pool, and circuit breaker — separate from your search service. If search goes haywire and exhausts its thread pool, checkout is completely unaffected because it has its own isolated resources.
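In code, a bulkhead is often just a capped semaphore per dependency: when the cap is hit, calls are rejected immediately instead of queueing up and blocking threads. A minimal sketch:

```python
from threading import BoundedSemaphore

class Bulkhead:
    """Caps concurrent calls into one dependency. Excess calls fail fast
    rather than piling up and exhausting the shared thread pool."""

    def __init__(self, max_concurrent):
        self._slots = BoundedSemaphore(max_concurrent)

    def run(self, fn):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn()
        finally:
            self._slots.release()

# One pool per dependency: search exhausting its bulkhead leaves
# the checkout bulkhead (and its threads) completely untouched.
checkout_calls = Bulkhead(max_concurrent=50)
search_calls = Bulkhead(max_concurrent=20)
```

The fast rejection is the point: a full bulkhead returns an error in microseconds, which the caller can turn into a degraded response, instead of a blocked thread.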

Instead of one big system serving all users, you divide users into independent cellsA self-contained unit of infrastructure that serves a subset of users. Each cell has its own servers, databases, and caches — completely independent from other cells. If one cell fails, only the users assigned to that cell are affected.. Each cell has its own servers, its own database, its own cache — completely independent. User IDs 1-100,000 go to Cell A. User IDs 100,001-200,000 go to Cell B. If Cell A's database crashes, only 100,000 users are affected. The other 900,000 never notice.

AWS uses this architecture internally. Their control plane is divided into cells, so a bug that crashes one cell doesn't cascade to all cells. It limits the blast radius by design.

Regular sharding assigns Customer A to Shard 1 and Customer B to Shard 1. If Shard 1 goes down, both lose service. Shuffle shardingA technique where each customer is assigned to a random subset of resources (e.g., 2 out of 8 shards). The probability of two customers sharing ALL the same resources becomes very small — so one customer's bad behavior is unlikely to affect another. assigns each customer to a random subset of resources. Customer A gets Shards 1 and 4. Customer B gets Shards 2 and 7. The chance of two customers sharing ALL the same shards is tiny. Even if one customer sends a massive traffic spike that takes out their shards, most other customers are unaffected because they're on different shard combinations.

With 8 shards and each customer using 2, there are 28 possible shard combinations. The odds of two random customers sharing the same pair? About 3.6%. That's powerful isolation without the cost of giving each customer dedicated infrastructure.
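The arithmetic checks out in a couple of lines:

```python
from math import comb

shards = 8
shards_per_customer = 2

# Number of distinct 2-shard subsets a customer can be assigned:
combinations = comb(shards, shards_per_customer)   # C(8, 2) = 28

# Chance that a second random customer lands on the SAME pair,
# i.e. shares ALL of the first customer's failure points:
p_full_overlap = 1 / combinations                  # 1/28, about 3.6%
```

Scaling up makes it even better: with 16 shards and 4 per customer, there are C(16, 4) = 1820 combinations, so full overlap becomes vanishingly rare.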

Never put all your eggs in one availability zone. Minimum: 2 AZs. Standard: 3 AZs. Critical workloads: multi-region. Every major cloud provider (AWS, Azure, GCP) gives you at least 3 AZs per region — use them.

Blast radius is how much of your system goes down when something fails. A single server failure should affect 10% of traffic (if behind 10 servers), not 100%. Strategies to limit blast radius: bulkhead pattern (isolate resources per service), cell-based architecture (independent infrastructure per user group), and shuffle sharding (random resource assignment so customers rarely share all the same failure points). Rule of thumb: minimum 2 AZs, standard 3 AZs, critical workloads multi-region.
Section 11

SLA, SLO, SLI — Measuring Reliability with Numbers

You can't improve what you can't measure. Saying "our system should be reliable" is like saying "I want to be healthier" — it sounds nice but means nothing concrete. Reliability needs numbers. Specific, measurable, time-bound numbers. That's where three related but distinct terms come in: SLI, SLO, and SLA.

Think of it as a pyramid. At the bottom, you have raw measurements. In the middle, you have targets. At the top, you have a legally binding contract with financial consequences.

SLI — Service Level Indicator (The Measurement)

An SLIService Level Indicator — a concrete, quantitative metric that measures one aspect of your service's reliability. Examples: "percentage of requests that returned a 2xx status code" or "percentage of requests completed in under 200ms." is the raw metric — the number you actually measure. It answers: "How is this specific thing performing right now?"

Examples: "What percentage of HTTP requests got a successful response (2xx) in the last 5 minutes?" Or: "What percentage of page loads completed in under 2 seconds?" Or: "What percentage of database queries finished within 50ms?" These are all SLIs. They're objective, measurable facts about your system's behavior.

SLO — Service Level Objective (The Target)

An SLOService Level Objective — a target value (or range) for an SLI. It's an internal goal your team sets. Example: "99.9% of requests should return a successful response" or "p99 latency should be under 500ms." is the target you set for an SLI. It's your internal goal — the line in the sand that says "below this, we're not meeting our own standards." You pick the SLI (request success rate), and you set a target (99.9%).

SLOs are internal — they're for your engineering team, not your customers. They drive decisions: "Our SLO is 99.9% success rate. We're currently at 99.7%. We should probably hold off on deploying that risky new feature until we fix the existing reliability issues."

SLA — Service Level Agreement (The Contract)

An SLAService Level Agreement — a formal contract between a service provider and its customers that specifies what level of service is guaranteed, and what compensation (usually credits) the customer receives if the provider fails to meet it. is the contract — usually between you and your customers — that says: "We promise at least this level of service. If we fail, here's what we'll give you in return." It's a business and legal commitment with financial consequences.

AWS S3, for example, promises 99.9% availability. If they drop below that in a billing month, customers get a 10% service credit. Below 99.0%? 25% credit. Below 95%? 100% credit. That's real money on the line, which is why SLAs are always set lower than SLOs — your internal target (SLO) should be stricter than your customer promise (SLA) so you have a safety margin.
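Those credit tiers are easy to express in code. A sketch of the tiered structure described above (thresholds taken from the text; the function name is mine):

```python
def monthly_service_credit(availability_pct: float) -> int:
    """Service credit (%) owed for a billing month's measured availability."""
    if availability_pct < 95.0:
        return 100  # below 95%: full credit
    if availability_pct < 99.0:
        return 25   # below 99%: 25% credit
    if availability_pct < 99.9:
        return 10   # below the 99.9% SLA: 10% credit
    return 0        # SLA met: no credit

print(monthly_service_credit(99.95))  # 0
print(monthly_service_credit(99.5))   # 10
```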

[Diagram: SLI → SLO → SLA, the reliability pyramid. At the base, SLI, the actual measurement ("99.97% of requests succeeded", a raw metric from monitoring). In the middle, SLO, the internal target ("target: 99.95% success rate", an internal engineering goal). At the top, where the stakes are highest, SLA, the contract ("if below 99.9%, we credit 10%", a legal contract with money on the line).]

The Error Budget — Your License to Break Things

Here's where it gets clever. If your SLO is 99.9% availability, that means you're "allowed" to be unavailable 0.1% of the time. That's your error budgetThe amount of allowable unreliability built into your SLO. If your SLO is 99.9%, your error budget is 0.1%. You "spend" this budget on deployments, experiments, maintenance, and unexpected failures. When the budget is exhausted, you freeze changes. — and it's not a bad thing. It's a feature. That 0.1% budget is what allows you to deploy new features (which might briefly cause errors), run experiments, and perform maintenance.

The math is straightforward:

SLO: 99.9% over 30 days
Total minutes in 30 days: 30 × 24 × 60 = 43,200 minutes
Error budget: 0.1% × 43,200 = 43.2 minutes of downtime allowed per month

For 1 million requests per day:
Error budget: 0.1% × 1,000,000 = 1,000 failed requests allowed per day

When your error budget is consumed: freeze deployments. Focus entirely on reliability until the budget resets.
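The budget math above translates directly into a pair of helpers (a sketch; names are illustrative):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime for the window, e.g. slo=0.999 over 30 days."""
    return (1 - slo) * days * 24 * 60

def error_budget_requests(slo: float, requests: int) -> int:
    """Allowed failed requests out of a given request volume."""
    return int((1 - slo) * requests)

print(round(error_budget_minutes(0.999), 1))    # 43.2 minutes per month
print(error_budget_requests(0.999, 1_000_000))  # 1000 failures per day
```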
[Diagram: error budget over a month with a 99.9% SLO (budget: 43.2 minutes). Week 1: OK. A deploy incident burns 12 minutes. Week 3: OK. An outage burns 28 more minutes, leaving only 3.2. Budget nearly exhausted: freeze all deployments, focus on reliability; the budget resets next month.]

Common SLOs for Different Service Types

Service Type | Typical SLO | Allowed Downtime/month
Internal tools | 99.0% | ~7 hours 18 min
Standard web app | 99.9% | ~43 min
E-commerce checkout | 99.95% | ~22 min
Payment processing | 99.99% | ~4.3 min
DNS / Auth | 99.999% | ~26 sec

Service Type | p50 Target | p99 Target
API endpoint | < 100ms | < 500ms
Web page load | < 1s | < 3s
Search results | < 200ms | < 1s
Database query | < 10ms | < 100ms
Real-time messaging | < 50ms | < 200ms

p50 = the median (50th percentile). p99 = the worst-case for 99% of requests. The p99 matters more than the average — 1% of your users having a terrible experience at scale is still thousands of angry people.
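Here is why the p99 beats the average, in a dozen lines (a toy nearest-rank percentile; real monitoring systems use histograms or sketches):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# 100 requests: 95 take 100ms, 5 take a painful 5 seconds
latencies = [100] * 95 + [5000] * 5

print(sum(latencies) / len(latencies))  # 345.0, the average hides the pain
print(percentile(latencies, 50))        # 100, the median looks great
print(percentile(latencies, 99))        # 5000, the p99 exposes the slow tail
```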

Going from 99.9% to 99.99% availability costs roughly 10x more in engineering effort, infrastructure redundancy, and operational complexity. Going from 99.99% to 99.999% costs another 10x. At some point, the cost of that extra nine exceeds the value of the service itself. Pick the SLO that matches the actual business need — not the highest number you can dream up.

SLI is the raw metric (what you measure), SLO is the target (what you aim for), SLA is the contract (what you promise customers with money on the line). The error budget concept turns reliability into a spending decision: your SLO of 99.9% gives you 43 minutes of allowed downtime per month — use it for deployments and experiments, but freeze when it's spent. 100% reliability is impossible and infinitely expensive; pick the right number for your business.
Section 12

Disaster Recovery — When Everything Goes Wrong

Everything we've talked about so far — retries, circuit breakers, graceful degradation — handles normal failures. A server crashes, a network blips, a service slows down. These happen daily and your system should shrug them off automatically.

But sometimes the failure isn't normal. An entire data center loses power. A region goes offline because an undersea cable gets cut. Ransomware encrypts your production database and all your replicas (because the replicas were on the same network). A well-meaning engineer runs a migration script against production instead of staging. These are disastersCatastrophic failures that exceed normal fault-tolerance mechanisms — events like full data center outages, regional failures, data corruption, ransomware attacks, or large-scale human errors that destroy critical data. — events that exceed the normal "one server died" fault tolerance built into your daily operations.

Disaster recovery (DR) is your plan for surviving these events. And it comes down to two numbers that every engineer should know.

RPO and RTO — The Two Numbers That Define Your DR

RPORecovery Point ObjectiveHow much data can you afford to lose when disaster strikes? RPO = 0 means zero data loss (requires synchronous replication). RPO = 1 hour means you can tolerate losing up to 1 hour of data (hourly backups are sufficient).: How much data can you afford to lose? If your RPO is 1 hour, you need backups (or replicas) that are at most 1 hour old. If disaster strikes, you'll lose up to 1 hour of recent data — and that's acceptable for your business. RPO = 0 means zero data loss, which requires synchronous replication (every write is confirmed on a remote copy before the system acknowledges it).

RTORecovery Time ObjectiveHow quickly do you need to be back online after a disaster? RTO = 4 hours means you can afford to be completely down for up to 4 hours. RTO near 0 means instant failover to another site.: How quickly do you need to be back online? If your RTO is 4 hours, you can afford to be completely down for up to 4 hours while you restore from backups and spin up new infrastructure. If your RTO is near zero, you need a hot standby site that can take over instantly.

The relationship is simple: lower RPO and RTO = more expensive. Zero data loss + instant failover requires fully synchronized infrastructure in multiple locations, running 24/7. That costs 2-3x more than a single-site setup. Higher RPO and RTO = cheaper but riskier.

[Diagram: RPO and RTO on a timeline. Last backup → disaster (data center goes down) → service restored. RPO is the data-loss window, measured backward from the disaster to the last backup: how much data did we lose? RTO is the recovery time, measured forward from the disaster to restoration: how long were we offline?]

The Four DR Strategies

There are four standard approaches to disaster recovery, ranked from cheapest (and slowest to recover) to most expensive (and fastest). Your choice depends on your RPO, RTO, and budget.

[Diagram: the four DR strategies plotted by cost vs recovery speed. Backup & Restore (RTO: hours-days, $) → Pilot Light (RTO: 10-30 min, $$) → Warm Standby (RTO: minutes, $$$) → Active-Active (RTO: ~0, $$$$).]

Strategy 1: Backup & Restore

You take regular backups (database dumps, filesystem snapshots) and store them somewhere safe — typically a different region. When disaster strikes, you spin up new infrastructure and restore from the latest backup. It's like keeping a copy of your house's blueprints and furniture photos in a safe deposit box. If the house burns down, you can rebuild it — but it takes a while.

RTO: Hours to days (depending on data size and infrastructure complexity)

RPO: Last backup time (if you back up daily, you could lose up to 24 hours of data)

Cost: Very low — you only pay for backup storage

Best for: Non-critical systems, development/staging environments, cost-constrained startups

Strategy 2: Pilot Light

Your database is continuously replicated to a DR site (another region), but the application servers and other infrastructure exist only as launch templates — they're not running. Think of it like a gas pilot light on a water heater: the flame is always on (the database replica), so when you need heat (full recovery), it ignites quickly instead of starting from scratch.

RTO: 10-30 minutes (time to spin up app servers and update DNS)

RPO: Near-zero for database (continuous replication), but application state may have gaps

Cost: Moderate — you pay for the replicated database 24/7, but compute is only provisioned during recovery

Best for: Business-critical apps that can tolerate 15-30 minutes of downtime

Strategy 3: Warm Standby

A scaled-down but fully functional copy of your production environment runs in the DR site at all times. Everything works — it's just running at maybe 10-20% of production capacity. When disaster strikes, you scale it up to full capacity and redirect traffic. Since everything is already running, the switchover takes minutes, not tens of minutes.

RTO: Minutes (just scale up and switch DNS)

RPO: Near-zero (continuous replication)

Cost: High — you're running a second environment 24/7, even if it's smaller

Best for: Revenue-critical services where every minute of downtime costs significant money

Strategy 4: Active-Active

Both sites are fully active, serving real production traffic simultaneously. Users in the US hit the US site, users in Europe hit the EU site. If one goes down, the other absorbs all traffic. There's no "failover" because both sites are already active — you just stop routing to the dead one.

RTO: Near-zero (seconds — just a DNS/routing change)

RPO: Near-zero (bidirectional replication, though conflict resolution adds complexity)

Cost: Very high — 2x or more the infrastructure cost of a single site, plus the engineering complexity of multi-region data consistency

Best for: Global services where any downtime is unacceptable (payments, messaging, critical infrastructure)
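One way to internalize the trade-off is to map an RTO target to a strategy. A purely illustrative chooser (the minute thresholds echo the ranges above; they are not any industry standard):

```python
def dr_strategy_for(rto_minutes: float) -> str:
    """Cheapest of the four DR strategies that can meet a given RTO."""
    if rto_minutes < 1:
        return "active-active"    # near-zero RTO: both sites already live
    if rto_minutes <= 10:
        return "warm standby"     # minutes: scale up and switch DNS
    if rto_minutes <= 30:
        return "pilot light"      # 10-30 min: ignite the standby
    return "backup & restore"     # hours or more: rebuild from backups

print(dr_strategy_for(240))  # backup & restore
print(dr_strategy_for(0.5))  # active-active
```

Real decisions also weigh RPO and budget, but the shape of the trade-off is the same: the tighter the recovery target, the more standing infrastructure you pay for.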

The Backup Testing Rule

Here's a truth that bites companies every year: a backup you haven't tested is not a backup. It's a hope. Backups can silently fail — corrupted files, missing tables, wrong permissions, incompatible formats. You only discover these problems when you actually try to restore, which is exactly the worst possible time to discover them.

The fix: schedule regular restore drills. Once a quarter, take your latest backup, restore it to a fresh environment, and verify that the data is complete and the application works. If you can't restore successfully in a calm test environment, you certainly won't succeed during a real disaster at 3 AM with your hands shaking.
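Restore drills can be automated. A toy end-to-end drill using SQLite's standard-library backup API, with in-memory databases standing in for production, the backup copy, and the fresh restore target:

```python
import sqlite3

def restore_drill(prod: sqlite3.Connection, table: str) -> bool:
    """Back up prod, restore into a fresh DB, verify the data survived."""
    backup = sqlite3.connect(":memory:")
    prod.backup(backup)                  # take the "backup"

    restored = sqlite3.connect(":memory:")
    backup.backup(restored)              # "restore" it somewhere fresh

    want = prod.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    got = restored.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return got == want                   # did the data actually come back?

prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
prod.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(100)])
print(restore_drill(prod, "orders"))  # True
```

The point is the last two lines of the function: a drill that restores but never verifies is the same wishful thinking as an untested backup.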

In 2017, GitLab lost 6 hours of production data for over 5,000 projects. They had FIVE backup methods configured: regular database dumps, automated snapshots, replication, LVM snapshots, and Azure disk snapshots. When disaster struck, they discovered that none of them were working correctly. The database dumps were empty, replication was lagging, and snapshots hadn't been tested. The only thing that saved them was a lucky manual snapshot taken 6 hours earlier by an engineer doing unrelated work. Test your backups. Regularly.

Disaster recovery handles catastrophic failures beyond normal fault tolerance — full data center outages, ransomware, or accidental data destruction. Two key numbers: RPO (how much data you can lose) and RTO (how fast you need to recover). Four DR strategies, cheapest to most expensive: (1) Backup & Restore (RTO hours, cheapest), (2) Pilot Light (RTO 10-30 min, database always replicated), (3) Warm Standby (RTO minutes, scaled-down clone always running), (4) Active-Active (RTO near zero, full duplicate, 2x+ cost). And the golden rule: a backup you haven't tested is not a backup.
Section 13

Real-World Incidents — Famous Outages and What We Learned

Theory is great, but nothing drives the lessons home like real disasters that cost real companies real money. Every incident below follows the same depressing pattern: the failure was preventable, the safeguard existed but wasn't tested, or the blast radius wasn't limited. Let's walk through five of the most famous outages in modern tech history and see exactly what went wrong — and what you should do differently.

AWS S3 Outage (2017)

An engineer on the S3 billing team was debugging a slow process. They ran a command to remove a small number of servers from the S3 subsystem. But they typed the wrong number — and the command removed way more servers than intended. This took down the index subsystem (the part that knows where your files are stored) and the placement subsystem (the part that handles new storage requests). Without those two pieces, S3 couldn't read or write anything.

The cascade was brutal. S3 is the backbone of the internet's static content. When S3 went down, so did Slack, Trello, Quora, IFTTT, parts of Docker Hub, and thousands of websites that serve images and files from S3. Even Amazon's own status dashboard was hosted on S3, so the page that was supposed to tell everyone "we're having problems" was itself down. The irony was not lost on anyone.

The outage lasted nearly 5 hours. One analysis estimated that S&P 500 companies alone lost roughly $150 million. And all because one person mistyped a parameter in a maintenance command with no guardrails.

[Diagram: the S3 cascade. Engineer runs command (intended: remove a few servers; actual: removed too many) → index subsystem down ("where are the files?" unknown) → placement subsystem down (can't store new data) → S3 completely unavailable → Slack, Trello, Quora, IFTTT, Docker Hub, and the AWS status page (also hosted on S3!) all go dark.]

Three lessons: (1) Dangerous commands need safeguards — rate limits on how many servers can be removed at once, confirmation prompts for large-scale changes. (2) Limit the blast radius — never let one operation touch the entire fleet. Roll changes out progressively. (3) Don't host your status page on the same infrastructure it's supposed to report on.
GitLab Database Deletion (2017)

A GitLab engineer was troubleshooting a database replication issue late at night. Tired and frustrated, they ran rm -rf on what they thought was the staging database directory. It was the production database. 300 GB of data — gone in seconds.

Then the real horror began. GitLab had five different backup methods configured. Five! And not a single one was working properly. The daily database dumps hadn't been running because of a configuration error. The replication to a secondary had been turned off. The automated snapshots were failing silently. The point-in-time recovery via WAL archiving had a misconfigured path. The final safety net — a nightly backup to a remote server — was the only one still partially functional, but it was 6 hours stale.

GitLab ended up losing 6 hours of production data affecting around 5,000 projects, 5,000 comments, and 700 merge requests. They live-streamed their recovery on YouTube — one of the most transparent post-mortems in tech history.

Having backups is not the same as having working backups. If you don't regularly test your backups by actually restoring from them, you don't have backups — you have wishful thinking. Schedule monthly restore drills. Verify every backup method independently. And never, ever rely on a single recovery path.
Facebook BGP Outage (2021)

A routine maintenance job meant to assess the capacity of Facebook's backbone network accidentally withdrew all the BGPBorder Gateway Protocol — the system that routers use to figure out how to send traffic across the internet. When Facebook's BGP routes disappeared, the rest of the internet simply couldn't find Facebook's servers anymore. It's like removing your address from every GPS system on the planet. routes. In plain English: Facebook told the entire internet "we don't exist anymore." DNS servers couldn't find Facebook. Every single Facebook property — Instagram, WhatsApp, Messenger, Oculus — vanished from the internet.

Here's the part that made this outage legendary: Facebook's internal tools for diagnosing and fixing network problems also ran on Facebook's network. The engineers who needed to fix the BGP routes couldn't reach the systems that manage BGP routes. They couldn't even badge into the data centers at first because the door access system relied on Facebook's network. Engineers literally had to be dispatched to data centers to gain physical access to the routers and manually reconfigure them.

[Diagram: the recovery-tool trap. BGP routes withdrawn → Facebook disappears from the internet → 3.5 billion users can't reach Facebook, Instagram, or WhatsApp → the engineering team can't reach its internal fix tools (also on the Facebook network!) → engineers must physically go to data centers, where the badge system is also down and physical escorts to the routers are needed → roughly 6 hours of total downtime.]

Never let your recovery tools depend on the same infrastructure they're supposed to fix. This is called a circular dependencyWhen System A needs System B to work, but System B also needs System A. If either fails, you can't fix the other. It's the tech equivalent of locking your keys inside your car — and the locksmith's phone number is inside the car too. in your recovery path. Your out-of-band management tools — the ones you use to fix everything else — must run on completely separate infrastructure.
Cloudflare Regex Outage (2019)

An engineer deployed a new rule to Cloudflare's Web Application FirewallA WAF inspects incoming HTTP traffic and blocks malicious requests (SQL injection, XSS attacks, etc.) before they reach your application. Cloudflare's WAF protects millions of websites simultaneously — which means a bad rule affects millions of sites simultaneously. (WAF). The rule contained a regular expression that caused catastrophic backtrackingWhen a regex engine tries to match a pattern, it can sometimes get stuck exploring an exponentially growing number of possible matches. A poorly-written regex on a specific input can peg a CPU at 100% for minutes — a pattern called "ReDoS" (Regular Expression Denial of Service). — the CPU on every single Cloudflare edge server worldwide spiked to 100% and stayed there.

Cloudflare operates in over 200 cities across 100+ countries. Every edge server runs the same WAF rules. So when that regex was deployed globally, it hit every server at the same time. Millions of websites behind Cloudflare — including major services — returned 502 errors for about 27 minutes. The fix was conceptually simple (revert the bad rule), but because the servers were pegged at 100% CPU, even deploying the revert was slow.

Three takeaways: (1) Test configuration changes in a staging environment that mirrors production. (2) Use canary deployments — roll changes to 1% of servers first, watch the metrics, then expand. If the canary catches fire, you've only burned 1% instead of 100%. (3) Have a kill switch — a one-click "revert everything" button that doesn't require the system to be healthy to work.
Knight Capital (2012)

Knight Capital Group, a major trading firm, was deploying new software to 8 servers. An engineer forgot to deploy to one of them — server #8 still had old code from a feature that had been decommissioned years earlier. When the market opened, that old code activated and started executing millions of unintended stock trades at a speed only a computer can achieve.

In 45 minutes, the old code bought and sold stocks worth billions of dollars, racking up $440 million in losses. The firm had no automated kill switch. By the time humans realized what was happening and manually stopped the system, the damage was irreversible. Knight Capital went from a healthy company to bankrupt practically overnight. They were acquired for pennies on the dollar.

[Diagram: Knight Capital, 45 minutes to bankruptcy. 9:30 AM, market opens: 7 of 8 servers run the new code, server #8's OLD code activates. 9:31 AM: bad trades begin, millions of orders per minute, no automated stop. ~9:45 AM: engineers notice "something is very wrong" as stock prices swing wildly. 10:15 AM: manually stopped, 45 minutes too late. $440,000,000 lost. Company bankrupt; acquired for $3.75/share (previously $10+).]

Every deployment must be all-or-nothing across all servers. Dead code must be removed, not left dormant. And any system that can lose money at machine speed needs an automated kill switch — a "circuit breaker" that halts operations when behavior deviates from expected parameters. Humans are too slow to stop a computer from losing $440 million in 45 minutes.
Every major outage teaches the same lesson: the failure was preventable. The backup existed but wasn't tested. The safeguard existed but was bypassed. The monitoring existed but was ignored. The kill switch existed but nobody knew where it was. Reliability engineering isn't about building perfect systems — it's about building systems that fail in boring, predictable, recoverable ways.

The biggest outages in tech history share a pattern: human error meets missing safeguards. AWS S3 went down because one command had no blast-radius limit. GitLab had five backup methods and none worked. Facebook's recovery tools depended on the same network that was broken. Cloudflare pushed an untested regex to every server at once. Knight Capital lost $440M because dead code was never removed and there was no kill switch. The lesson: test your backups, limit your blast radius, use canary deployments, and keep recovery tools on separate infrastructure.
Section 14

Chaos Engineering — Break It On Purpose

Here's a question that sounds crazy until you think about it: what if you broke your own system on purpose?

Think about fire drills. Nobody wants a fire in their building. But every school, every office, every hospital runs fire drills regularly. Why? Because discovering that the emergency exit is locked during an actual fire is the worst possible time to learn that. You want to find that problem on a calm Tuesday afternoon when everyone's awake and the fire department is on standby.

That's exactly the philosophy behind chaos engineeringThe practice of deliberately injecting failures into a system — killing servers, slowing networks, filling disks — to verify that the system handles them gracefully. Pioneered by Netflix in 2011 with their famous "Chaos Monkey" tool.. You will have failures. The hard drives in your servers will die. The network between your data centers will hiccup. A dependency you rely on will go down. The question is: do you discover your weaknesses at 3 AM during a real outage, or at 2 PM on Tuesday when your entire team is ready, the coffee is fresh, and the incident response channel is already open?

Netflix pioneered this approach with a tool called Chaos MonkeyA tool built by Netflix that randomly terminates virtual machine instances in production during business hours. The philosophy: if a random server dying can break your service, you need to know about it now — not during a real outage. Netflix later expanded this into the "Simian Army" with tools like Chaos Gorilla (kills an entire availability zone) and Latency Monkey (adds artificial network delays).. During business hours, Chaos Monkey randomly kills production servers. Not staging. Not test environments. Production. If your service can't handle a random server dying at 2 PM when everyone's watching, it definitely can't handle it at 3 AM when nobody is. Netflix's reasoning was simple: they wanted to make individual server failures so routine that their systems handled them automatically without any human intervention.

The Chaos Engineering Process

Chaos engineering isn't just randomly pulling cables and hoping for the best. It's a disciplined, scientific process with five steps:

[Diagram: the chaos engineering cycle. 1. Define steady state ("normal" = p99 < 200ms, error rate < 0.1%) → 2. Hypothesize ("if the cache dies, the DB handles the load") → 3. Inject failure (kill the cache cluster in production) → 4. Observe behavior (did latency stay under 200ms? did errors spike?) → 5. Fix differences (hypothesis was wrong? fix the system) → repeat.]

Step 1: Define steady state. Before you break anything, measure what "normal" looks like. What's your p99 latency? What's your error rate? How much CPU are your servers using? You need a baseline so you can tell whether the experiment caused a problem.

Step 2: Hypothesize. Make a specific prediction: "If we kill 2 of our 6 API servers, the load balancer will redistribute traffic to the remaining 4, latency will increase by no more than 50ms, and zero requests will fail." Write this down. A vague "it should be fine" is not a hypothesis.

Step 3: Inject the failure. Actually do it. Kill the servers, introduce network latency, fill a disk to 95%, whatever your experiment calls for.

Step 4: Observe. Did reality match your hypothesis? Did latency stay within bounds? Did the load balancer actually reroute traffic? Did the alerts fire? Look at your dashboards and compare actual behavior to predicted behavior.

Step 5: Fix the gaps. If your hypothesis was wrong — if latency spiked to 2 seconds instead of 50ms — that's a success. You found a weakness at 2 PM on Tuesday instead of 3 AM during Black Friday. Now fix it.
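The five steps fit in a few lines of simulation. A toy experiment with in-process "servers" (not real infrastructure) that kills one replica and checks the hypothesis:

```python
import random

servers = {"api-1": True, "api-2": True, "api-3": True}  # True = healthy

def handle_request() -> bool:
    """The 'load balancer': a request succeeds if any healthy server remains."""
    return any(servers.values())

# 1. Steady state: requests succeed under normal conditions
assert handle_request()

# 2. Hypothesis: losing one of three servers causes zero failed requests
# 3. Inject the failure: kill a random server
victim = random.choice(list(servers))
servers[victim] = False

# 4. Observe: does reality match the hypothesis?
# 5. If this assertion fails, you found the weakness. Fix it.
print(handle_request())  # True, the hypothesis held
```

A real experiment would inject the failure into actual infrastructure and observe dashboards, but the scientific shape (baseline, prediction, injection, comparison) is exactly this.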

Types of Chaos Experiments

Server failures: kill a random instance; terminate a VM with no warning; kill the leader in a cluster; crash a container mid-request.

Network chaos: add 500ms latency; drop 10% of packets; partition two data centers; corrupt DNS responses.

Storage pressure: fill a disk to 95%; slow I/O by 10x; unmount a volume; corrupt a log file.

Resource exhaustion: exhaust all memory; peg CPU at 100%; exhaust file descriptors; fill the connection pool.

Dependency failures: kill the cache layer; make the auth service return 500s; disconnect the message queue; return stale data from the cache.

Time & clock drift: skew the system clock; expire all TLS certificates; set time 1 day in the future; randomize NTP responses.

Game Days

A Game Day is a scheduled event where your entire team practices responding to failures. Think of it as a fire drill for your infrastructure. You pick a scenario — "the primary database goes down" or "we lose an entire availability zone" — and actually simulate it, usually in production or a production-like environment. The team follows their runbooks, uses their dashboards, and communicates through their incident channels. Every stumble — a missing runbook, a broken alert, a confusing dashboard — gets documented and fixed.

Companies like Google, Amazon, and Netflix run Game Days regularly. At some companies they're even surprise events — the SRE team injects a failure without warning the on-call team, to test real-world response times and procedures.

Your first chaos experiment should be killing one server in staging, not nuking an entire region in production. Start with the tamest possible failure, verify your monitoring catches it, verify your system handles it, and then gradually increase the severity. You're building confidence, not creating outages.

Chaos engineering means deliberately breaking your system to find weaknesses before real failures do. The process is disciplined: define steady state, hypothesize what should happen, inject failure, observe actual behavior, and fix any gaps. Netflix's Chaos Monkey randomly kills production servers during business hours. Game Days are scheduled team exercises for practicing incident response. Start small in staging before escalating to production experiments.
Section 15

Idempotency — Safe to Retry

You tap "Pay Now" on your phone. The screen spins for 10 seconds... then shows a timeout error. Did the payment go through, or didn't it? You have no idea. If you tap "Pay Now" again and it did already go through, you just got charged twice. If it didn't go through and you don't retry, your order never processes. This is the fundamental problem that idempotencyA fancy word from mathematics that means "doing something multiple times has the same effect as doing it once." In APIs, an idempotent operation can be safely retried without causing duplicate side effects — like double charges or duplicate orders. solves.

The word comes from math: an operation is idempotent if doing it once produces the same result as doing it twice, or ten times, or a hundred times. Setting your thermostat to 72 degrees is idempotent — whether you press the button once or mash it 50 times, the temperature is still 72. Adding 1 to a counter is not idempotent — doing it 50 times leaves the counter at 50, not 1.

Which Operations Are Naturally Idempotent?

Some operations are safe to repeat by their very nature. Others need special engineering to make them safe.

Naturally idempotent (safe to retry):

Reading data (GET) — asking "what's my balance?" ten times always returns the same balance. No side effects.

Absolute set (PUT with full replacement) — "set the user's name to Alice" is the same whether you do it once or ten times. The name is still Alice.

Delete by ID — "delete order #4521" is a no-op if order #4521 is already deleted. The end state is the same: order #4521 doesn't exist.

NOT idempotent (retrying causes damage):

Creating a resource (POST) — "create a new order" twice creates two orders. Now you have a duplicate.

Increment operations — "add $10 to the balance" twice adds $20. The counter went up by double.

Append operations — "add item to cart" twice gives you two of the same item.

Send notifications — "send confirmation email" twice sends two emails. Your user gets annoyed.

The Problem: Network Uncertainty

[Diagram: the timeout problem. The client sends POST /charge for $49.99 to the payment server. Timeout! No response after 10s. Unknown state: did the server process the charge or not? Scenario A: the server DID process it — a retry means a DOUBLE CHARGE ($99.98); the user calls support angry, files a chargeback, leaves bad reviews. Scenario B: the server did NOT process it — no retry means the ORDER IS NEVER PLACED; the user thinks they paid, nothing ships, and they're also angry.]

You're stuck. Without idempotency, both options are bad — retry and risk a double charge, or don't retry and risk a lost order. The client cannot know which scenario happened because the response was lost in transit.

The Solution: Idempotency Keys

The fix is elegant. Before the client sends the request, it generates a unique ID — called an idempotency keyA unique identifier (usually a UUID like 550e8400-e29b-41d4-a716-446655440000) that the client attaches to every mutating request. The server uses this key to detect duplicate requests: "I've already processed this key, so I'll return the cached result instead of processing it again." — and attaches it to the request. The server checks: "Have I seen this key before?" If yes, it returns the cached result from the first time without processing the operation again. If no, it processes the operation and stores the result keyed by that ID.

Now the client can safely retry as many times as it wants. First attempt goes through? The retries are no-ops that return the same result. First attempt didn't go through? The retry processes it for the first time. Either way, the operation happens exactly once.
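The client side of this can be sketched in a few lines of Python. This is a hedged sketch, not a real SDK: the `client` object and its `post` method are hypothetical stand-ins for your HTTP layer. The crucial detail is that the key is generated once, outside the retry loop:

```python
import uuid

def charge_with_retry(client, amount, max_attempts=3):
    # Generate the idempotency key ONCE, before the first attempt.
    # Every retry must reuse the same key, or the server can't
    # recognize the retries as duplicates of the original request.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return client.post("/charge", amount=amount, idempotency_key=key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of retries — surface the failure to the caller
            # Otherwise retry: the server deduplicates on the key
```

If the first attempt actually succeeded but the response was lost in transit, the retry simply returns the server's cached result instead of charging again.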

[Diagram: Idempotency key flow. The client generates key = uuid() and sends POST /charge with the key attached. The server checks its idempotency store: first time seeing this key, so it processes the charge, stores the result, and returns 200 OK. The client times out waiting and retries with the same key. The server has seen the key, so it returns the cached result — 200 OK, same response, no double charge.]

Implementation: The Database Approach

The most common approach is dead simple. You add a column to your database with a unique constraint on the idempotency key. When a request comes in, you try to insert a row with that key. If the insert succeeds (the key is new), you process the operation and store the result. If the insert fails because of a unique constraint violation, you know this is a duplicate — just look up the stored result and return it.

function handlePayment(request):
    key = request.headers["Idempotency-Key"]
    // Try to claim the key first — the unique constraint makes this atomic,
    // so two concurrent requests with the same key can't both get through
    try:
        db.insert("INSERT INTO idempotency_store (key, created_at) VALUES (?, NOW())", key)
    catch UniqueConstraintViolation:
        // Duplicate request — return the result stored by the first attempt
        existing = db.query("SELECT result FROM idempotency_store WHERE key = ?", key)
        return existing.result
    // First time seeing this key — process the payment
    result = paymentProcessor.charge(request.amount, request.card)
    // Store the result for future duplicate detection
    db.update("UPDATE idempotency_store SET result = ? WHERE key = ?", result, key)
    return result

Every payment API, every order creation, every money-moving operation MUST be idempotent. This is non-negotiable. Stripe, PayPal, Square — every major payment processor supports idempotency keys. If your checkout flow doesn't use them, it's only a matter of time before a user gets double-charged and your support queue explodes.

Idempotency means "safe to retry" — doing an operation multiple times produces the same result as doing it once. This matters because network timeouts create uncertainty: did the request succeed or not? The solution is idempotency keys — unique IDs the client attaches to each request, allowing the server to detect duplicates and return cached results instead of re-processing. Every payment or money-moving API must be idempotent.
Section 16

Distributed Consensus — Agreeing When Networks Split

When all your data lives on one database server, "truth" is easy — whatever that server says is the truth. But the moment you replicate that data to multiple servers (for redundancy, speed, or both), you create a hard problem: what happens when those servers disagree?

Imagine you have three database servers holding account balances. A user transfers $20 from checking to savings. Server A processes the transfer. But before Server A can tell Servers B and C about it, the network cable between them gets unplugged (or more realistically, a router has a firmware bug). Now Server A says the checking balance is $80, but Servers B and C still say $100. Which is correct? If a user checks their balance on Server B, they see $100 — and might try to spend money that's already been transferred. This is the split-brain problemWhen servers in a cluster can't communicate with each other and each "half" starts acting independently, potentially making conflicting decisions. Like two managers both thinking they're in charge because they can't reach each other. Also called "network partition.", and it's one of the most fundamental challenges in distributed systems.

The CAP Theorem — You Can't Have It All

In 2000, computer scientist Eric Brewer proposed something called the CAP theoremBrewer's theorem states that a distributed system can provide at most two of three guarantees simultaneously: Consistency (every read returns the latest write), Availability (every request gets a response), and Partition tolerance (the system works despite network failures). Since network partitions are inevitable, the real choice is between consistency and availability.. It says that when your network splits (and it will — this is not optional), you have to choose between two guarantees:

[Diagram: The CAP theorem — pick two, but P is mandatory. Consistency (C): every read gets the latest write or an error — never stale data. Availability (A): every request gets a response, even if the data might be stale. Partition tolerance (P): the system works even when network links between nodes fail. Since networks WILL partition, the real choice is CP (PostgreSQL, MongoDB — "refuse requests if unsure") or AP (Cassandra, DynamoDB — "always respond, resolve conflicts later").]

Consistency (C) means every read returns the most recent write. If Alice just transferred $20, anyone who checks the balance — on any server — sees the updated amount. No stale data, ever.

Availability (A) means every request gets a response. Even if some servers are unreachable, the system keeps answering queries. It might return slightly stale data, but it never says "sorry, I can't help you right now."

Partition Tolerance (P) means the system keeps working even when the network between nodes is broken. Since real networks do partition (routers fail, cables get cut, clouds have outages), P is not optional in any distributed system. That leaves you choosing between C and A.

Banks and financial systems typically choose CP — they'd rather temporarily refuse requests than give you a wrong balance that lets you overdraw. Social media and shopping carts typically choose AP — they'd rather show you a slightly stale news feed than show nothing at all.

Raft — A Consensus Algorithm You Can Actually Understand

Okay, so servers need to agree on truth. But how do they agree, especially when some of them can't communicate? That's the job of consensus algorithmsProtocols that allow a group of servers to agree on a value (like "the balance is $80") even when some servers are slow, unreachable, or have crashed. The most famous ones are Paxos (notoriously hard to understand), Raft (designed to be understandable), and ZAB (used by ZooKeeper)..

The first famous algorithm, PaxosInvented by Leslie Lamport in 1989. It's mathematically proven correct and extremely powerful, but so hard to understand that Lamport himself said most people find the description "greek to them" (he wrote the paper as a story about a Greek island). Most production implementations use Raft instead because it's far easier to reason about., is correct but notoriously difficult to understand — even experts struggle with it. So in 2014, Diego Ongaro and John Ousterhout created RaftA consensus algorithm specifically designed to be understandable. It breaks the problem into three sub-problems: leader election, log replication, and safety. Used in etcd (Kubernetes), CockroachDB, TiKV, and Consul., an algorithm that solves the same problem but was designed from the ground up to be understandable. Here's how it works:

The basic idea: one server is elected the leader. All writes go through the leader. The leader copies each write to the other servers (called followers). Once a majority of servers confirm they have the write, it's considered committed and safe. If the leader dies, the remaining servers hold an election and pick a new leader.
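The commit rule can be sketched in a few lines of Python. This is a toy model under heavy simplification — real Raft also tracks terms, log indexes, and persistence — and the `Follower.append` method is a made-up stand-in for the AppendEntries RPC:

```python
def replicate_entry(entry, followers, cluster_size):
    # The leader already has the entry in its own log — that counts as one ack.
    acks = 1
    for follower in followers:
        if follower.append(entry):  # False if the follower is down or unreachable
            acks += 1
    # Committed once a MAJORITY of the whole cluster has the entry —
    # not just a majority of the nodes that happen to be reachable.
    return acks >= cluster_size // 2 + 1
```

Note that the quorum is computed against the full cluster size: in a 5-node cluster, the leader plus two followers is enough to commit, even if the other two nodes are unreachable.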

[Diagram: Raft — leader election and write replication. Phase 1, normal operation: Node A is the leader and replicates writes to followers B, C, and D; 3 of 4 nodes confirming = committed (quorum). Phase 2, the leader dies: Node C stops hearing from A ("I haven't heard from A... starting an election! Vote for me!"). Phase 3, election complete: C receives votes from B and D (a majority of the surviving nodes), becomes the new leader, and handles all writes.]

Why Odd Numbers and Quorums Matter

You might wonder: why do clusters always use odd numbers of nodes — 3, 5, 7 — never 2, 4, 6? It's about the quorumThe minimum number of nodes that must agree before a decision is considered valid. For a cluster of N nodes, the quorum is (N/2) + 1. For 3 nodes, the quorum is 2. For 5 nodes, it's 3. This ensures that any two quorums always overlap by at least one node — preventing conflicting decisions., or the minimum number of servers that must agree. A quorum is a simple majority: more than half. With 3 nodes, you need 2 to agree. With 5 nodes, you need 3. This guarantees that any two groups who think they have a quorum overlap by at least one server, preventing two groups from making conflicting decisions.

With an even number — say 4 nodes — a network partition could split them evenly (2 vs 2). Neither side has a majority, so neither can make progress. With 5 nodes split 3 vs 2, the group of 3 has a quorum and can keep operating. You get the same fault tolerance with 4 nodes as with 3 (both tolerate 1 failure), so the 4th node is just wasted money.
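The arithmetic is easy to check for yourself — a quick sketch:

```python
def quorum(n):
    # Minimum number of nodes that must agree: a strict majority.
    return n // 2 + 1

def tolerated_failures(n):
    # The cluster keeps working as long as a quorum of nodes survives.
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Notice that 3 and 4 nodes both tolerate exactly one failure — the fourth node buys you nothing.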

You don't need to implement Raft yourself. But you need to understand it to know why your database cluster needs 3 nodes (quorum), why writes are slower than reads in a replicated setup (the leader must wait for majority confirmation), and why a network partition can cause a brief period where writes are rejected (no quorum = no commits).

When data is replicated across multiple servers, those servers need to agree on truth — that's the consensus problem. The CAP theorem says during a network partition you must choose between consistency (refuse requests if unsure) and availability (always respond, maybe with stale data). Raft solves consensus by electing a leader, replicating writes to followers, and requiring a quorum (majority) to commit. Always use odd numbers of nodes (3, 5, 7) so quorums can form even during partitions.
Section 17

Monitoring & Alerting — Your Early Warning System

You can build the most resilient system in the world — redundant servers, automatic failover, circuit breakers, retries — and it's all worthless if you don't know something is broken until a user tweets about it. MonitoringThe practice of collecting and analyzing metrics, logs, and traces from your system in real-time. Good monitoring lets you spot problems before users do — and gives you the data to diagnose them quickly when they happen. is the nervous system of your infrastructure. Without it, you're flying blind.

The Four Golden Signals

Google's Site Reliability EngineeringSRE is Google's approach to operations — treating infrastructure as a software problem. Google's SRE book (free online) is considered the bible of modern operations. SRE teams write code to automate away operational toil, and they defined the "four golden signals" that every service should monitor. (SRE) team distilled decades of experience into four metrics that every service should track. If you monitor nothing else, monitor these:

The four golden signals, from Google SRE:

  • Latency — how long requests take to complete. Track p50, p95, p99 — NOT just the average — and separate successful from failed request latency. (Healthy: p99 < 200ms. Bad: p99 > 1s.)
  • Traffic — how much demand is hitting your system: requests per second (RPS) for web services, queries per second for databases.
  • Errors — what percentage of requests are failing: HTTP 5xx errors (your fault) vs 4xx (client fault), plus "soft" errors like wrong results and slow timeouts. (Healthy: 5xx < 0.1%. Page someone: 5xx > 1%.)
  • Saturation — how "full" your system is: CPU, memory, disk space, connection pool. Most services degrade well before hitting 100% CPU, so warn at around 70%.

Why Averages Lie — Use Percentiles

This is one of the most important lessons in monitoring, and most beginners get it wrong. Suppose 99% of your requests take 10 milliseconds (fast!) and 1% take 10,000 milliseconds (10 full seconds — terrible). What's the average? About 109ms. That sounds fine. But it masks a horrible reality: 1 out of every 100 users is waiting 10 seconds for a page to load. If you have 10,000 requests per minute, that's 100 miserable users every minute.

This is why experienced engineers track percentilesIf you sort all request times from fastest to slowest, the p50 is the median (half are faster, half slower), p95 is the value where 95% of requests are faster, and p99 is where 99% are faster. The p99 captures the worst 1% of user experience — the "tail latency" that averages hide completely. instead of averages. The p50 (median) tells you what a typical user experiences. The p95 tells you what 1 in 20 users experiences. The p99 tells you what 1 in 100 users experiences. Your SLA should be based on p99, not on the average.
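You can see the effect with a few lines of Python. This sketch uses the simple nearest-rank method and made-up latency numbers — real monitoring systems use streaming approximations, but the lesson is the same:

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: the smallest value such that at least
    # p% of all samples are less than or equal to it.
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[rank - 1]

# 97 fast requests plus a small slow tail (values in ms, made up for illustration)
latencies = [10] * 97 + [300, 2_000, 10_000]
avg = sum(latencies) / len(latencies)
print(f"avg={avg:.1f}ms  p50={percentile(latencies, 50)}ms  "
      f"p95={percentile(latencies, 95)}ms  p99={percentile(latencies, 99)}ms")
```

The average (~133ms) looks perfectly healthy, and even the p95 is a crisp 10ms — but the p99 exposes the 2-second tail that 1 in 100 users is actually living through.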

[Diagram: Why averages lie — the hidden tail. On a latency histogram from 0ms to 10s, 99% of requests cluster at 5–15ms (p50 = 10ms), while the remaining 1% sit at p99 = 10,000ms — that's 100 users per minute waiting 10 seconds. The average of 109ms "looks fine" and hides the tail completely.]

Alert Fatigue — The Boy Who Cried Wolf

There's a trap that every team falls into at some point: they set up alerts for everything. CPU above 50%? Alert. A single 500 error? Alert. Disk at 60%? Alert. Within a week, the on-call engineer is getting 200 alerts per day — and they start ignoring all of them. That's alert fatigueWhen engineers receive so many alerts that they start ignoring them — including the real ones. Studies show that after a few weeks of high alert volume, response times to critical alerts increase dramatically. The solution: only alert on conditions that require immediate human action., and it's more dangerous than having no alerts at all. With no alerts, at least you know you're blind. With alert fatigue, you think you're watching but you're actually ignoring the signals.

The rule is simple: every alert must be actionable. If an alert fires and the correct response is "do nothing and wait," that alert shouldn't exist. If the on-call engineer's first reaction is "I can ignore this," the alert is broken — fix it or delete it.

The Observability Stack

Modern systems need three types of telemetry working together. Engineers call this the "three pillars of observability":

[Diagram: The three pillars of observability. Metrics — numbers over time (CPU 45%, 2,400 requests/sec, p99 latency 180ms, error rate 0.02%); tools: Prometheus, Datadog, CloudWatch; answers "WHAT is happening?". Logs — event records with details (e.g. "ERROR: DB timeout after 30s, user_id=4521, retry_count=3, pool_size=50"); tools: ELK Stack, Loki, Splunk; answers "WHY did it happen?". Traces — a request's journey across services (API 12ms → Auth → DB query 890ms — slow! → Cache); tools: Jaeger, Zipkin, Tempo; answers "WHERE is the bottleneck?".]

Metrics tell you what is happening — numbers like CPU usage, request counts, and error rates. They're cheap to store and fast to query. When you see a spike on your dashboard, metrics tell you "error rate jumped to 5% at 2:47 PM."

Logs tell you why it happened — detailed event records with context. When metrics say "errors spiked," you dig into logs to find "ERROR: database connection pool exhausted, max_connections=50, active=50, waiting=347." Now you know the root cause.

Traces tell you where the problem is — they follow a single request as it travels through your microservices. A trace might show that an API call took 900ms total, and 890ms of that was one slow database query. Without tracing, in a system with 20 microservices, finding which service is slow is like finding a needle in a haystack.

What to Monitor — A Practical Checklist

Web servers and application:
  • Latency — p50, p95, p99 response times (per endpoint if possible)
  • Error rate — percentage of 5xx responses; alert if > 1%
  • Requests per second — watch for sudden drops (outage) or spikes (DDoS, viral moment)
  • Active connections — are you approaching your server's connection limit?
  • Thread pool saturation — how many request-handling threads are in use vs available

Database:
  • Query latency — p95 of SELECT and INSERT/UPDATE separately
  • Connection pool usage — active vs max connections; alert at 80%
  • Replication lag — how far behind are read replicas? Alert if > 5 seconds
  • Disk I/O — read/write IOPS, disk queue depth
  • Slow queries — queries taking > 1 second (log and review weekly)
  • Table/index size — prevent "disk full" surprises

Cache (e.g. Redis):
  • Hit rate — percentage of requests served from cache; below 80% means your caching strategy is off
  • Memory usage — how close to max memory; evictions start hurting when the cache is full
  • Eviction rate — keys being evicted per second; spikes mean your cache is too small
  • Connection count — are clients exhausting the max connections?
  • Latency — Redis should respond in < 1ms; anything higher suggests a problem

Message queue:
  • Queue depth — messages waiting to be processed; growing = consumers can't keep up
  • Consumer lag — how far behind are consumers from the latest message?
  • Processing time — how long each message takes to process
  • Dead letter queue size — messages that failed processing; should be near zero
  • Throughput — messages produced vs consumed per second
If an alert fires and the on-call engineer's first reaction is "I can ignore this," the alert is broken. Fix it or delete it. Every alert should pass the 3 AM test: if this wakes someone up at 3 AM, is it worth waking up for? If not, it should be a dashboard metric, not an alert.

Monitoring is your system's nervous system — without it, you're blind. Track the four golden signals: Latency (use p50/p95/p99, never just averages), Traffic (requests per second), Errors (5xx rate), and Saturation (CPU, memory, disk, connections). Averages hide tail latency: a 109ms average can mask 1% of users waiting 10 seconds. Build observability with three pillars: Metrics (what), Logs (why), and Traces (where). Fight alert fatigue by making every alert actionable — if the right response is "ignore it," delete the alert.
Section 18

Common Mistakes — Reliability Traps Everyone Falls Into

You've now got a solid understanding of how reliable systems work. But knowing the theory and avoiding the traps in practice are two very different things. These seven mistakes look obvious in hindsight, but they trip up experienced teams every year — in real incidents, post-mortems, and system design interviews. Learn them here so you don't learn them at 3 AM during an outage.

Mistake 1: Untested Backups

What goes wrong: The team sets up automated daily backups and checks the box. Months later, a catastrophic failure happens. They try to restore — and discover the backups have been silently failing for weeks, or the restore process has never been documented, or the backup format is incompatible with the current schema.

Why it's dangerous: This is exactly what happened to GitLab in January 2017A GitLab engineer accidentally deleted a production database directory. When they tried to restore, they discovered that 5 out of 5 backup methods had issues — some hadn't run in days, others produced empty files, and the most recent backup was 6 hours old. They lost data for 5,000 projects.. They had five different backup systems. None of them worked when it mattered. Backups that have never been restored are not backups — they are comforting lies.

How to avoid it: Schedule monthly restore drills. Automate a "backup validation" job that restores to a scratch database and runs a row-count check. If the restore fails or counts don't match, page someone immediately — not in a week, immediately. The rule is simple: an untested backup is no backup at all.

Mistake 2: Testing Only the Happy Path

What goes wrong: Your integration tests verify that the checkout flow works when the payment API responds in 200ms, the inventory database is up, and the email service sends confirmations. All green. Then in production, Redis goes down for 3 seconds. The payment API returns a 500. The network drops packets for 2 seconds. Your "fully tested" service crashes spectacularly.

Why it's dangerous: Happy-path tests prove your system works when nothing fails. But things always fail. If you haven't tested what happens when Redis times out, when an API returns garbage, when the network is slow — you have no idea how your system behaves under the conditions that actually cause outages.

How to avoid it: For every dependency, write failure tests: what happens when it's slow (inject 5-second latency)? When it returns errors? When it's completely unreachable? Tools like ToxiproxyAn open-source tool by Shopify that lets you simulate network conditions like latency, timeouts, and connection resets in your test environment. let you inject network failures in tests. If you can't answer "what happens when X is down?" for every dependency — you haven't tested enough.

Mistake 3: Hidden Single Points of Failure

What goes wrong: The team adds redundancy to the obvious things — two app servers, a database replica, multi-zone deployment. They feel safe. Then their single DNS provider has an outage, and the entire site is unreachable. Or the one config server that all services depend on crashes, and nothing can start. Or the single load balancer that fronts everything goes down.

Why it's dangerous: People add redundancy to the things they think about — servers, databases, application code. But they forget the "glue" infrastructure: DNS, load balancers, config servers, certificate authorities, the CI/CD pipeline, and even the one engineer who knows how the deployment works. These hidden SPOFsSingle Point of Failure — any component whose failure brings down the entire system. SPOFs are dangerous because they negate all the redundancy you've built around them. are the ones that actually cause outages.

How to avoid it: Run a "what if this dies?" exercise. Walk through every component in your architecture — including DNS, load balancers, config stores, monitoring, and the deployment pipeline. For each one, ask: "If this single thing disappears right now, what breaks?" Draw the dependency graph and look for any node that, if removed, disconnects everything.

[Diagram: Hidden SPOFs — the "glue" nobody made redundant. Users reach the site through 1 DNS provider (SPOF!) and 1 load balancer (SPOF!). Behind them, App Servers 1–3 are redundant, but all of them depend on 1 config server (SPOF!). The DB primary and replica are redundant. The redundant tiers don't help if any of the single glue components dies.]

Mistake 4: Alert Fatigue from Noisy Alerts

What goes wrong: The team adds alerts for everything — CPU above 60%, any 500 error, disk above 70%, response time above 200ms. The Slack channel gets 500 alerts per day. The team starts ignoring them. Then a real outage happens, and the critical alert is buried under noise. Nobody notices for 45 minutes.

Why it's dangerous: Alert fatigue is one of the most common causes of delayed incident response. When every alert is "critical," none of them are. The human brain can't sustain vigilance across hundreds of notifications — it starts filtering them out as background noise. This is the boy who cried wolfA classic parable: if you raise the alarm too often for non-emergencies, people stop responding — and when a real emergency comes, nobody pays attention. problem applied to infrastructure.

How to avoid it: Use a tiered alert system. Page-worthy (wake someone up): only for customer-facing impact — error rate above 5%, complete service down, data loss risk. Ticket-worthy (fix during business hours): elevated latency, disk above 85%, a single node down but redundancy is covering it. Dashboard-only (just watch it): CPU trends, cache hit ratios, queue depths. If your on-call engineer gets more than 2-3 pages per week, your thresholds are wrong.
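The triage logic can be made explicit in code. A minimal sketch — the thresholds mirror the examples above and are starting points to tune against your own SLOs, not universal constants:

```python
PAGE, TICKET, DASHBOARD = "page", "ticket", "dashboard"

def alert_tier(error_rate_pct, service_down, disk_used_pct, redundancy_holding):
    # Page-worthy: customer-facing impact — wake someone up.
    if service_down or error_rate_pct > 5:
        return PAGE
    # Ticket-worthy: real, but redundancy or headroom is covering for now.
    if disk_used_pct > 85 or not redundancy_holding:
        return TICKET
    # Everything else is a trend to watch, not an interruption.
    return DASHBOARD
```

Encoding the tiers as a function forces the team to agree on thresholds up front, instead of arguing about them at 3 AM.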

Mistake 5: No Graceful Degradation Plan

What goes wrong: The recommendation engine goes down. Instead of showing generic popular items, the entire product page crashes with a 500 error. Or the search service is slow, so the whole website hangs waiting for it. The team never defined what should happen when a non-critical dependency fails.

Why it's dangerous: In a system with 10 dependencies, the probability that all of them are healthy at any given moment is surprisingly low. If your site requires 100% of its dependencies to be up, your effective availability is the product of all their individual availabilities. Ten services at 99.9% each gives you 0.999^10 ≈ 99.0% — that's almost 4 days of downtime per year.
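The math behind that number is worth verifying yourself:

```python
def chained_availability(per_service, count):
    # If the system needs EVERY dependency up, availabilities multiply.
    return per_service ** count

overall = chained_availability(0.999, 10)
downtime_days = (1 - overall) * 365
print(f"overall={overall:.4f}  (~{downtime_days:.1f} days of downtime/year)")
```

Each dependency looked like "three nines," but chaining ten of them quietly dropped the whole system to two.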

How to avoid it: For every dependency, define the fallback. Recommendations down? Show popular items. Search slow? Return cached results. Email service offline? Queue the email and send later. Payment API unreachable? Show "try again in a few minutes" instead of a cryptic 500. The key insight: decide these fallbacks before the outage, not during it.
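One way to make those fallback decisions explicit is a small generic wrapper. A sketch — the recommendation functions below are hypothetical placeholders for your real service calls:

```python
def with_fallback(primary, fallback):
    # Call the primary dependency; if it fails, degrade gracefully
    # instead of letting the whole page crash with a 500.
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapped

# Hypothetical example: recommendations with a popular-items fallback
def personalized_recommendations(user_id):
    raise ConnectionError("recommendation service is down")

def popular_items(user_id):
    return ["item-1", "item-2", "item-3"]

recommendations = with_fallback(personalized_recommendations, popular_items)
print(recommendations(user_id=4521))  # degraded, but the page still renders
```

In production you'd pair this with a circuit breaker and log every fallback hit, so degraded mode is visible on a dashboard rather than silent.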

Mistake 6: Retrying Without Backoff

What goes wrong: A downstream service gets slow under load. Your service retries immediately. So do the other 50 services that depend on it. The downstream service was handling 1,000 requests per second, now it's getting 3,000 (original load plus retries). It collapses completely. Now nothing works.

Why it's dangerous: Immediate retries turn a partial failure into a total failure. The downstream service was struggling but alive — the flood of retries killed it. This is called the thundering herdWhen many clients simultaneously retry or reconnect to a recovering service, overwhelming it before it can recover. Like a herd of buffalo all running at a narrow gate at once. problem, and it's one of the most common ways that small incidents become big outages.

How to avoid it: Always use exponential backoff with jitterExponential backoff means waiting longer between each retry (1s, 2s, 4s, 8s...). Jitter adds a random delay so all clients don't retry at the same instant. Together, they spread the retry load over time.. First retry after 1 second (plus random 0-500ms). Second retry after 2 seconds (plus random). Third after 4 seconds. Cap at 30 seconds. And set a maximum retry count — after 3-5 attempts, stop and fail gracefully. A circuit breaker (from Section 6) automates this even better.
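That retry schedule translates directly into code. A minimal sketch of exponential backoff with jitter — the delay parameters match the 1s/2s/4s, 30-second-cap schedule described above:

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.5):
    # Exponential: 1s, 2s, 4s, 8s... capped at 30s, plus 0-500ms of
    # random jitter so all clients don't retry at the same instant.
    return min(cap, base * 2 ** attempt) + random.uniform(0, jitter)

def call_with_retries(operation, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller fail gracefully
            time.sleep(backoff_delay(attempt))
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment too, and the herd arrives in synchronized waves.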

Mistake 7: Shared Fate

What goes wrong: All your services run in a single availability zoneA physically distinct data center or section of a data center within a cloud region. Availability zones have independent power, cooling, and networking — so a failure in one zone shouldn't affect others.. Or all your microservices share the same Kubernetes cluster. Or all your databases are on the same physical host. When that one zone, cluster, or host goes down — everything goes down together.

Why it's dangerous: Shared fate means your redundancy is an illusion. Having 5 replicas sounds great — until you realize they're all on the same physical rack, sharing the same power supply and network switch. One hardware failure, and all 5 replicas die simultaneously. This happened in several major cloud outages where "redundant" services all lived in the same blast radius.

How to avoid it: Spread your infrastructure across independent failure domainsA failure domain is a group of resources that share a common point of failure. Examples: a single rack (shared power), a single availability zone (shared data center), a single region (shared geography).. Run across at least 2 availability zones (3 is better). Place database replicas in different zones. Use anti-affinity rules to prevent Kubernetes from scheduling critical pods on the same node. The rule: if two components can fail from the same cause, they're not truly redundant.

Seven reliability traps that even experienced teams fall into: untested backups (GitLab lost data from 5 broken backup methods), happy-path-only testing, hidden single points of failure in DNS/load balancers/config servers, alert fatigue from too many noisy alerts, no graceful degradation plan when dependencies fail, retrying without backoff causing thundering herds, and shared fate where "redundant" services share the same blast radius.
Section 19

Interview Playbook — Nail Reliability Questions

Reliability questions show up in almost every system design interview — either as the main topic ("design a highly available system") or as a follow-up ("what happens when this component fails?"). The best candidates don't wait for the interviewer to ask about failures. They bring it up proactively, showing they think about real-world production from the start.

The interviewer wants to hear you THINK about failures proactively. Don't wait for them to ask "what if X fails?" — bring it up yourself. Saying "before we move on, let me talk about what happens when the payment service is unreachable" signals senior-level thinking and separates you from candidates who only design for the happy path.

Here's a five-step framework that works for any reliability question. Memorize the steps, not the answers — because the specific system changes, but the thinking process is always the same.

1. Identify failures — "What can break?"
2. Add redundancy — "Eliminate SPOFs."
3. Define detection — "How do we know?"
4. Plan recovery — "How do we heal?"
5. Set SLOs — "What's good enough?"

Let's see this framework in action. Here's how a strong candidate might think through a classic interview question — "Design a service with 99.99% availability":

Step 1 — Failures: "99.99% means only 52 minutes of downtime per year. So I need to think about every failure mode: server crashes, database failover, network partitions, bad deployments, and even cloud provider outages. Any one of these could eat my entire error budget in a single incident."

Step 2 — Redundancy: "I'll run at least 3 app servers across 2 availability zones behind an active-active load balancer. The database needs a hot standby with streaming replication — automated failover in under 30 seconds. I'll use a multi-zone Redis cluster for caching so one zone going down doesn't kill cache."

Step 3 — Detection: "Health checks every 10 seconds from the load balancer. An external uptime monitor pinging from 3 regions. Error rate and latency dashboards with alerts at 1% error rate — not 5%, because at 99.99% I can't afford to wait."

Step 4 — Recovery: "Automated failover for the database — no human in the loop for the first response. Blue-green deployments so I can roll back a bad deploy in under a minute. Canary releases that test new code on 5% of traffic first. And a runbook for the on-call engineer covering the top 10 failure scenarios."

Step 5 — SLOs: "I'll define SLIs: request success rate and p99 latency. The SLO is 99.99% success rate and p99 under 500ms. That gives me an error budget of about 3,000 failed requests per month on a service handling 1 million requests daily. If I'm burning budget too fast, I freeze deployments and investigate."
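Both of the headline numbers in that walkthrough — the downtime allowance and the error budget — are quick to verify:

```python
def downtime_minutes_per_year(availability):
    # 99.99% availability allows about 52.6 minutes of downtime per year.
    return (1 - availability) * 365 * 24 * 60

def monthly_error_budget(daily_requests, slo):
    # Failed requests allowed in a 30-day month without violating the SLO.
    return daily_requests * 30 * (1 - slo)

print(f"{downtime_minutes_per_year(0.9999):.1f} min/year")
print(f"{monthly_error_budget(1_000_000, 0.9999):.0f} failed requests/month")
```

Being able to do this arithmetic out loud — nines to minutes, SLO to request budget — is exactly the kind of fluency interviewers listen for.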

Now let's walk through the most common reliability interview questions and how to approach each one.

"How would you design a highly available system?"

What the interviewer wants: They want to see you think about redundancy at every layer — not just "add more servers." A strong answer covers: load balancing, database replication and failover, multi-zone deployment, health checks, and graceful degradation.

Framework to use: Walk through the architecture layer by layer — DNS (use multiple providers or a managed service), load balancer (active-passive pair), app tier (stateless servers, at least N+1 capacity), data tier (primary + hot standby, read replicas), and caching tier (clustered Redis). For each layer, state the failure mode and the mitigation.

Key phrase to use: "I want to eliminate single points of failure at every layer, and make sure each layer can lose one component without user impact."

"What happens when component X fails?"

What the interviewer wants: They're testing whether you've thought about failure propagation. Don't just say "the backup takes over." Explain the full chain: detection (how do you know it failed?), impact (what do users see during the gap?), recovery (automatic or manual?), and data implications (any data loss?).

Framework to use: "When X fails, here's the timeline: detection takes Y seconds via health checks. During the detection window, requests to X will timeout — but our circuit breaker trips after 3 failures and returns a fallback response. Failover to the standby takes Z seconds. Total user-visible impact is... And we lose at most N seconds of uncommitted data due to replication lag."

Key phrase to use: "Let me walk through the blast radius of that failure."

"What happens when your database primary fails?"

What the interviewer wants: This tests your understanding of replication, data consistency, and the practical mechanics of failover. A weak answer says "the replica takes over." A strong answer discusses replication lag, in-flight transactions, split-brain risk, and client reconnection.

Framework to use: "The primary fails. The monitoring system detects it within 10 seconds (missed heartbeats). The orchestrator (like Patroni or RDS Multi-AZ) promotes the replica to primary. There's a window of replication lag — say 200ms of committed transactions that haven't replicated yet. Those might be lost. In-flight transactions get connection errors and need to retry. The application's connection pool detects the DNS change and reconnects. Total switchover: 15-30 seconds."

Key phrase to use: "The key trade-off is between recovery time and data loss — I need to define our RPO and RTO based on business requirements."

What the interviewer wants: They want to see you think big — beyond single-component failures to entire region outages. This is about RPO, RTO, data replication strategies, and how you test the plan.

Framework to use: "I'd classify our data into tiers. Tier 1 (user data, transactions): synchronous replication to another region, RPO near zero, RTO under 5 minutes. Tier 2 (analytics, logs): asynchronous replication, RPO of 1 hour is acceptable, RTO of 30 minutes. Tier 3 (derived data, caches): no replication needed, can be rebuilt. For the actual failover, I'd use DNS-based routing with health checks — if the primary region fails, traffic routes to the secondary within 60 seconds. And critically — we test this quarterly with a planned regional failover drill."

Key phrase to use: "A disaster recovery plan that hasn't been tested is just a disaster recovery hope."

Use these terms naturally (don't force them) and you'll signal deep understanding: failure domain — the group of things that fail together; blast radius — how far the damage spreads; error budget — how much failure you can afford; graceful degradation — serving partial results instead of crashing; circuit breaker — stopping cascading failures by failing fast; RPO/RTO — how much data and time you can lose.

Use the 5-step reliability framework in every interview: (1) identify failure modes, (2) add redundancy, (3) define detection, (4) plan recovery, (5) set SLOs. For each question, walk through the failure timeline — detection, user impact, automated recovery, data implications. Don't wait for the interviewer to ask "what if it fails?" — bring it up yourself to signal senior-level thinking.
Section 20

Practice Exercises — Build Your Reliability Intuition

Reading about reliability is step one. But real understanding comes from working through problems yourself — figuring out where failures hide, calculating error budgets, and designing fallback strategies. Try each exercise before peeking at the hints. The struggle is where the learning happens.

Your e-commerce checkout flow has three dependencies: a payment API (processes credit cards), an inventory database (checks stock levels), and an email service (sends order confirmations). The payment API averages 500ms response time. The inventory DB responds in 20ms. The email service takes 2 seconds.

Questions: (a) Which dependencies need circuit breakers? (b) Which ones can fail gracefully without blocking checkout? (c) Design the degradation strategy — what does the user see when each dependency is down?

(a) Circuit breakers: The payment API and email service both need circuit breakers. The payment API is slow (500ms) and external — if it starts timing out at 10 seconds, it'll block your checkout threads. The email service is even slower (2 seconds) and is a fire-and-forget operation. The inventory DB is fast (20ms) and internal — a simple timeout + retry is sufficient, but a circuit breaker doesn't hurt.

(b) Graceful failure: The email service is the easy one — queue the confirmation email and send it later. The user doesn't need the email to complete checkout. The inventory DB can degrade to an optimistic strategy — accept the order and reconcile stock later (most e-commerce sites do this during flash sales). The payment API is the one that cannot degrade — you need payment confirmation to complete a purchase. If it's down, show "Payment processing is temporarily unavailable, please try again in a few minutes."

(c) Degradation plan: Payment API down → show friendly retry message, offer to save the cart. Inventory DB down → accept orders optimistically, flag for manual review. Email service down → queue emails, show "confirmation email coming soon" in the UI. The checkout still completes for 2 out of 3 failure modes.
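The degradation plan above reduces to a small dispatch: each dependency failure maps to a fallback rather than a failed checkout. A sketch under the same assumptions — the function shape, queue, and return fields are hypothetical, and each dependency is passed in as a callable so the fallbacks are easy to see:

```python
import queue

email_queue = queue.Queue()  # confirmations to retry later if email is down

def checkout(order, payment_api, inventory_db, email_service):
    # Payment cannot degrade: without confirmation there is no purchase.
    try:
        receipt = payment_api(order)
    except Exception:
        return {"ok": False, "message": "Payment processing is temporarily "
                "unavailable, please try again in a few minutes."}

    # Inventory degrades optimistically: accept now, reconcile stock later.
    try:
        inventory_db(order)
        flagged = False
    except Exception:
        flagged = True  # flag order for manual stock reconciliation

    # Email is fire-and-forget: queue it if the service is down.
    try:
        email_service(order)
        email_note = "confirmation email sent"
    except Exception:
        email_queue.put(order)
        email_note = "confirmation email coming soon"

    return {"ok": True, "receipt": receipt,
            "needs_review": flagged, "email": email_note}
```

Note how the structure encodes the answer to (b): only the payment call sits on the critical path; the other two failures change fields in the response, not its success.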

Your API service has an SLOService Level Objective — the target reliability you promise internally. For example, "99.95% of requests will succeed." It's stricter than your SLA (external promise to customers) to give you a safety margin. of 99.95% success rate. It handles 2 million requests per day.

Questions: (a) How many failed requests per day does your error budget allow? (b) How many per month (30 days)? (c) A bad deployment causes 1,500 errors in 10 minutes before you roll back. What percentage of your monthly error budget did that single incident consume?

(a) Daily budget: 100% - 99.95% = 0.05% allowed failures. 2,000,000 × 0.0005 = 1,000 errors per day.

(b) Monthly budget: 1,000 × 30 = 30,000 errors per month.

(c) Incident impact: 1,500 errors out of a 30,000 monthly budget = 1,500 / 30,000 = 5% of your monthly error budget burned in 10 minutes. That's significant but survivable. If this happened 6 more times in the same month, you'd blow your entire budget and need to freeze deployments. This is exactly why error budgets matter — they turn abstract reliability targets into concrete deployment decisions.
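The arithmetic above generalizes to any SLO; a few lines make it reusable (the function name is illustrative):

```python
def error_budget(slo: float, requests_per_day: int, days: int = 30):
    """Allowed failed requests for a success-rate SLO, per day and per window."""
    daily = round(requests_per_day * (1.0 - slo))  # rounded to whole requests
    return daily, daily * days

daily, monthly = error_budget(slo=0.9995, requests_per_day=2_000_000)
print(daily, monthly)                                    # 1000 30000
print(f"{1500 / monthly:.0%} of monthly budget burned")  # 5% of monthly budget burned
```

Plugging in your own SLO here is a quick sanity check before committing to one: tighten it one more nine and watch how few errors a single bad deploy is allowed to produce.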

Your PostgreSQL primary fails unexpectedly. You have a streaming replicaA PostgreSQL replica that receives a continuous stream of WAL (Write-Ahead Log) records from the primary. It's usually a few hundred milliseconds behind — close to real-time but not exact. with approximately 200ms of replication lag. At the moment of failure, the primary was processing 500 transactions per second.

Questions: (a) Walk through the failover process step by step. (b) How many transactions might be lost? (c) What happens to in-flight transactions that were mid-commit? (d) How do application servers discover the new primary?

(a) Failover steps: (1) The monitoring system (e.g., Patroni) detects missed heartbeats from the primary — typically 3 missed checks at 5-second intervals = 15 seconds to detect. (2) Patroni confirms the primary is truly dead (not just a network blip). (3) The replica is promoted to primary — it replays any remaining WAL and starts accepting writes. (4) DNS or connection pooler (PgBouncer) is updated to point to the new primary. (5) Application connections are reset and reconnect. Total time: 15-30 seconds.

(b) Data loss: With 200ms replication lag and 500 TPS, roughly 500 × 0.2 = ~100 transactions that were committed on the primary but not yet replicated. These are lost. This is the RPORecovery Point Objective — the maximum acceptable amount of data loss measured in time. An RPO of 200ms means you accept losing up to 200ms of data. trade-off with asynchronous replication.

(c) In-flight transactions: Any transaction that was mid-commit (sent to primary but not yet acknowledged) will get a connection error. The application must retry these. If the application is idempotent (which it should be!), retrying against the new primary is safe. If not, you risk duplicate operations — this is why idempotency matters.

(d) Discovery: Option 1: DNS-based — update the DNS record for the DB hostname; apps reconnect on next attempt (TTL matters — keep it low, 5-10 seconds). Option 2: Connection pooler (PgBouncer/HAProxy) handles routing transparently — the app doesn't even know the primary changed. Option 3: Client-side library with cluster awareness (like Patroni's REST API).

You run a microservice called order-service that depends on: a PostgreSQL database (stores orders), a Redis cache (caches product prices), and two external APIs — payment-api (processes payments) and shipping-api (calculates shipping rates).

Questions: (a) Define a liveness checkA health check that answers "Is this process alive and not stuck?" If liveness fails, the orchestrator kills and restarts the container. and a readiness checkA health check that answers "Is this service ready to handle traffic?" If readiness fails, the load balancer stops sending requests to it — but doesn't kill it.. (b) If Redis is down but everything else is healthy, should the service be marked "not ready"? Why or why not? (c) What timeout should each health check use?

(a) Liveness: A simple check that the process is running and not deadlocked. Ping an internal endpoint like /health/live that returns 200 if the event loop is responding. Do NOT check dependencies here — if the database is down, the process is still alive. Liveness failures trigger a restart, and restarting won't fix a database outage.

Readiness: Check that the service can actually handle requests. /health/ready should verify: (1) database connection pool has available connections, (2) Redis is reachable, (3) the service has completed startup initialization. External APIs (payment, shipping) should NOT be part of readiness — they're checked per-request with circuit breakers.

(b) Redis down: It depends on your degradation strategy. If the service can fall back to reading prices directly from the database (slower but functional), then Redis being down should NOT make the service "not ready" — it should still serve traffic, just slower. If Redis is absolutely required (e.g., it holds session data with no fallback), then yes, mark it not ready. The answer reveals whether you've thought about graceful degradation.

(c) Timeouts: Liveness: 1-2 seconds max (it's just checking the process). Readiness: 3-5 seconds (includes a DB connection check). Make these significantly shorter than your Kubernetes probe intervals — if the check itself times out, it counts as a failure. A common mistake is setting health check timeouts equal to the probe interval, causing false positives.
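The split above can be shown as a framework-agnostic sketch: dependency probes are passed in as callables, and the functions return (status code, body) pairs that a real handler would serialize. Endpoint paths and the exclusion of external APIs follow the answer above; everything else is illustrative:

```python
def liveness() -> tuple[int, dict]:
    """/health/live: process is up and responding.
    No dependency checks — a DB outage must not trigger a restart loop."""
    return 200, {"status": "alive"}

def readiness(check_db, check_redis, startup_done: bool) -> tuple[int, dict]:
    """/health/ready: can we serve traffic right now?
    External APIs (payment, shipping) are deliberately excluded —
    they are guarded per-request by circuit breakers instead."""
    if not startup_done:
        return 503, {"status": "starting"}
    checks = {"database": check_db, "redis": check_redis}
    failed = [name for name, probe in checks.items() if not probe()]
    if failed:
        return 503, {"status": "not ready", "failed": failed}
    return 200, {"status": "ready"}
```

If your degradation strategy lets the service fall back to the database when Redis is down (part b), drop the Redis probe from `readiness` — the check should mirror what the service actually requires to serve a request.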

You run a global SaaS platform across three regions: US-East (primary), EU-West, and AP-Southeast. US-East hosts the primary database and the control plane. Each region handles local traffic. You have 50,000 active users across all regions. Design the disaster recovery plan.

Questions: (a) Define RPORecovery Point Objective — how much data you can afford to lose, measured in time. RPO of 0 means zero data loss (synchronous replication). RPO of 1 hour means you accept losing up to 1 hour of data. and RTORecovery Time Objective — how quickly the service must be restored after a failure. RTO of 5 minutes means users should be back online within 5 minutes of the outage. for three tiers of data. (b) Explain what happens when US-East goes completely offline — minute by minute. (c) How do EU-West and AP-Southeast users continue working? (d) How do you test this plan without causing an actual outage?

(a) Data tiers:

Tier 1 (user accounts, transactions, orders): RPO < 1 second (synchronous or near-synchronous replication to EU-West), RTO < 5 minutes. This data is irreplaceable.

Tier 2 (analytics, audit logs, activity history): RPO < 1 hour (async replication is fine), RTO < 30 minutes. Important but not urgent.

Tier 3 (caches, search indexes, derived data): RPO = N/A (can be rebuilt), RTO < 2 hours. Rebuild from Tier 1 data after failover.

(b) US-East fails — timeline: Minute 0: US-East goes dark. External monitors (Pingdom, Route53 health checks) detect failure within 30-60 seconds. Minute 1: DNS failover triggers — US traffic routes to EU-West. EU-West's replica is promoted to primary. Minute 2-3: Connection pools reset, applications reconnect to the new primary. Some US users see 30-60 seconds of errors during the switchover. Minute 5: Service is restored for all regions. Minute 10-30: Tier 2 data catches up via async replication replay. Hour 1-2: Tier 3 data (search indexes, caches) is rebuilt.

(c) EU/AP users: EU-West users experience a brief blip (2-3 seconds) as their local region absorbs the extra load from US traffic and the database promotion happens. AP-Southeast users route to EU-West for write operations (higher latency — 200-300ms instead of 50ms) but reads are served from their local read replica. Both regions remain operational throughout.

(d) Testing: Run quarterly DR drills: (1) Announce a maintenance window. (2) Simulate US-East failure by updating DNS to stop routing traffic there. (3) Verify EU-West handles the promotion and all traffic. (4) Measure actual RTO and RPO. (5) Fail back to US-East and verify data consistency. Netflix and Google do this routinely. If you can't test the plan, you can't trust the plan.

Five exercises from easy to hard: (1) design degradation strategies for checkout dependencies, (2) calculate error budgets from SLOs, (3) walk through a PostgreSQL failover including data loss math, (4) design liveness and readiness health checks for a microservice, (5) build a global disaster recovery plan with tiered RPO/RTO across three regions.
Section 21

Cheat Sheet — Reliability at a Glance

Quick-reference cards for every major reliability concept. Pin this section for your next system design interview or post-mortem review.

Redundancy — Running multiple copies of a component so that if one fails, another takes over. The simplest form: two servers instead of one. The key: redundant components must be in different failure domains (different racks, zones, or regions) or they'll fail together.

Failover — The process of automatically switching from a failed component to a healthy backup. Can be active-passive (standby waits idle) or active-active (all copies serve traffic). Active-active is more efficient but harder to build, especially for databases.

Health checks — Periodic probes that ask "are you alive and working?" Liveness checks: "is the process running?" (restart if no). Readiness checks: "can you handle traffic?" (stop routing if no). Never check external dependencies in liveness — a DB outage shouldn't trigger a restart loop.

Circuit breaker — A wrapper that monitors calls to a dependency and "trips open" after too many failures — returning a fallback immediately instead of waiting for timeouts. States: Closed (normal), Open (failing fast), Half-Open (testing recovery). Prevents one slow service from cascading into a system-wide outage.

Retries with backoff — When a request fails, try again — but not immediately. Use exponential backoff: wait 1s, 2s, 4s, 8s between retries. Add jitter (random delay) so all clients don't retry at the same instant. Cap retries at 3-5 attempts. Without backoff, retries cause thundering herds.

Graceful degradation — Serving reduced functionality instead of crashing when a dependency fails. Recommendations engine down? Show popular items. Search slow? Return cached results. The goal: the core experience works even when non-critical features don't. Define fallbacks before the outage.

Blast radius — How far the damage spreads when something fails. A single server crash has a small blast radius. A shared database going down has a huge one. Design to minimize blast radius — isolate failure domains, use bulkheads, separate critical and non-critical services.

SLI/SLO/SLA — SLI = what you measure (error rate, latency). SLO = your internal target (99.95% success). SLA = the contractual promise to customers (99.9% with penalties). Set SLO stricter than SLA to have a safety margin. Use SLIs to track reality against the target.

Error budget — The amount of unreliability your SLO allows. If SLO is 99.95%, your error budget is 0.05% of requests. When the budget is healthy, ship fast and take risks. When it's nearly burned, freeze deployments and focus on stability. It turns reliability into a measurable, tradeable resource.

Disaster recovery — The plan for surviving catastrophic failures — entire region outages, data center fires, or massive data corruption. Key metrics: RPO (max data you can lose) and RTO (max time to recover). Tier your data by importance. Test the plan quarterly or it's just a wish.

Chaos engineering — Intentionally injecting failures into production to find weaknesses before they find you. Netflix's Chaos Monkey randomly kills servers. Chaos Gorilla takes down availability zones. Start small (kill one pod), verify your system self-heals, then escalate. If it breaks, better to discover it on your terms.

Idempotency — An operation that produces the same result whether you run it once or ten times. Critical for retries — if a payment request times out and you retry, an idempotent API charges the customer once, not twice. Implement with idempotency keys: a unique ID per operation that the server deduplicates.
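The retry guidance above (exponential backoff of 1s, 2s, 4s, 8s, plus jitter, capped attempts) maps directly to code. A sketch where the delay schedule is computed up front and `sleep` is injectable so the logic is easy to verify — function names and defaults are illustrative:

```python
import random

def backoff_schedule(attempts=4, base=1.0, jitter=0.5):
    """Exponential backoff delays with jitter: base * 2^i plus a
    random offset so clients don't all retry at the same instant."""
    return [base * (2 ** i) + random.uniform(0, jitter) for i in range(attempts)]

def retry(fn, attempts=4, sleep=lambda seconds: None):
    # `sleep` is injectable for testing; pass time.sleep in production.
    last_error = None
    for delay in backoff_schedule(attempts):
        try:
            return fn()
        except Exception as e:
            last_error = e
            sleep(delay)  # wait 1s, 2s, 4s, 8s (plus jitter) between tries
    raise last_error
```

Pair this with the idempotency card: retrying is only safe when running the operation twice has the same effect as running it once.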
Twelve reliability concepts in quick-reference format: redundancy, failover, health checks, circuit breakers, retries with backoff, graceful degradation, blast radius, SLI/SLO/SLA, error budgets, disaster recovery, chaos engineering, and idempotency.
Section 22

Connected Topics — Where to Go Next

Reliability doesn't exist in a vacuum. Every concept you've learned here connects to a deeper topic. Pick the ones that matter most for your next interview or your current system, and go deeper.

Reliability connects to scalability, availability, performance, load balancers, replication, CAP theorem, distributed systems, message queues, monitoring, rate limiting, microservices, and back-of-envelope estimation. Each topic deepens one aspect of building systems that survive failure.