Core Principles

Availability

By widely cited estimates, Google loses around $500,000 for every minute of downtime and Amazon loses around $220,000. Your boss loses patience after 30 seconds. This page explains how real systems achieve 99.99% uptime — the math, the architecture, and the trade-offs that keep the internet running.

Section 1

TL;DR — The ATM That's Always Open

  • What "availability" actually means — and why it's different from reliability
  • How engineers measure uptime using "the nines" (99.9%, 99.99%, etc.)
  • Why the math behind availability is surprisingly simple but the engineering is brutally hard
  • The core strategies real companies use to keep systems running 24/7/365

Availability is the percentage of time your system is actually working when users need it.

Picture this: it's 11:45 PM on a Friday. You need cash for a taxi home. You walk up to the nearest ATM, slide your card in, and... it works. Money comes out. You go home. Boring story, right? That's the point. The best ATMs are the ones you never think about — they're just there, every single time.

Now imagine a different Friday. Same situation, except this time the ATM screen says "Out of Service." You walk three blocks to another ATM from a different bank. Also down. You end up asking a stranger for cash. Embarrassing, frustrating, and entirely preventable.

The first bank has high availability. The second bank has a serious problem. And here's the thing that matters: neither ATM was faster, had a better UI, or offered lower fees. The only difference was whether it worked when you needed it. That's availability in one sentence: being there when users show up.

The ATM Test: Which Bank Would You Trust?

  • BANK A — High Availability ("READY / CASH"): 99.99% uptime, ~52 minutes of downtime per year. Works at midnight ✓, works on holidays ✓, works during storms ✓. You never think about it. It just works. That's the goal. Result: customers trust this bank.
  • BANK B — Low Availability ("OUT OF SERVICE"): 95% uptime, ~18 days of downtime per year. Down at midnight ✗, down on weekends ✗, slow during peak ✗. You think about it constantly: "Will it work this time?" Result: customers switch banks.

Now here's the subtle but important distinction that trips up a lot of people: availability is not the same as reliability. Reliability means the system does the right thing every time it responds — it doesn't give wrong answers, doesn't corrupt data, doesn't charge your card twice. Availability means the system responds at all. You can be available but unreliable (the ATM is on, but dispenses the wrong amounts) or reliable but unavailable (the ATM always dispenses the correct amount, but it's offline half the time). Think of it this way: reliability means the system doesn't break — each individual component works correctly. Availability means the service stays up — even when individual components do break.

Imagine a system with 10 servers. One server crashes every single day. That's not very reliable at the component level — your servers are fragile. But if the other 9 servers keep handling requests seamlessly while the dead one gets replaced, users never see an error. The service is highly available even though the hardware is unreliable. This is exactly how Google, Netflix, and Amazon think about it: they assume everything will break and engineer the system so it doesn't matter when it does.

Reliability = each piece works correctly. Availability = the overall service stays up. You can have high availability with unreliable parts — that's the whole point of redundancy. The magic is in the architecture, not the components.

What: Availability is the fraction of time your system is operational and serving users. Usually expressed as a percentage — 99.9%, 99.99%, etc. Higher is better, but each extra "nine" costs exponentially more.

When: Every production system needs an availability target. The target depends on the business impact of downtime — a personal blog can tolerate hours of downtime, a payment system cannot tolerate minutes.

Key Principle: Availability is not about preventing failure — it's about surviving failure. Build systems where components can fail without the user ever knowing.

Availability measures whether your system is working when users need it. It's different from reliability: reliability means components don't break, availability means the service stays up even when they do. High availability comes from architecture (redundancy, failover) not from having perfect hardware. The goal: users should never think about whether your system is up — it should just be there.
Section 2

The Math — Understanding "The Nines"

When engineers talk about availability, they almost always use a shorthand: "the nines." Instead of saying "our system is up 99.99% of the time," they say "we have four nines." It sounds like jargon, but once you see the table, you'll use it too — because the difference between each "nine" is dramatic.

Let's start with the most important table in all of system design. Every number here should shock you a little:

The Nines Table — How Much Downtime Can You Afford?

Availability | Name | Downtime / Year | Downtime / Month | Cost to Achieve | Real-World Example
90% | One nine | 36.5 days | 72 hours | $ | Personal hobby project
99% | Two nines | 3.65 days | 7.2 hours | $$ | Internal tools, staging
99.9% | Three nines | 8.76 hours | 43.8 minutes | $$$ | SaaS apps, e-commerce
99.99% | Four nines ★ | 52.6 minutes | 4.38 minutes | $$$$ | AWS, Azure, most APIs
99.999% | Five nines | 5.26 minutes | 26.3 seconds | $$$$$ | Telecom, 911 systems

Each extra "nine" costs roughly 10x more, from cheap and easy at the top of the table to insanely expensive at the bottom. ★ Four nines is the gold standard for most production services. Five nines is reserved for life-critical systems.
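The downtime budgets above are easy to recompute yourself. A minimal sketch in plain Python, using a 365-day year (which is why the figures match the table):

```python
# Convert an availability target into an allowed-downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability_pct: float) -> float:
    """Seconds of allowed downtime per year at a given availability %."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for nines in (90.0, 99.0, 99.9, 99.99, 99.999):
    seconds = downtime_per_year(nines)
    print(f"{nines:>7}%  ->  {seconds / 3600:9.2f} hours/year  ({seconds / 60:9.1f} min)")
```

Run it and you'll see why four nines is so unforgiving: the entire annual budget at 99.99% is about 52.6 minutes.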

Look at the jump between 99% and 99.9%. Going from two nines to three nines means your allowed downtime drops from 3.6 days per year to 8.7 hours. That's a 10x reduction. And going from three nines to four nines? From 8.7 hours down to 52 minutes for the entire year. That's less time than a lunch break — spread across 12 months. One bad deploy that takes 20 minutes to roll back just ate a third of your annual budget.

Here's the gut punch: each additional nine doesn't just require a little more effort. It costs roughly 10 times more money and engineering effort than the previous one. Going from 99% to 99.9% might mean adding a load balancer and a second server. Going from 99.9% to 99.99% means multi-zone deployments, automated failover, chaos testing, on-call rotations, and sophisticated monitoring. Going from 99.99% to 99.999% is so expensive that only telecom companies and emergency services bother.

Each additional "nine" of availability costs roughly 10x more than the previous one. Going from 99% to 99.9% might cost $10K/year in infrastructure. Going from 99.9% to 99.99% might cost $100K/year. Going from 99.99% to 99.999% could easily cost $1M+/year. Always ask: "What's the business cost of downtime?" before choosing your target.

The Math That Actually Matters: Serial vs Parallel

Now for the part that changes how you think about architecture. When you connect components together, there are two ways they can be arranged, and the math is completely different for each:

Serial (both must work): Think of a chain. If ANY link breaks, the whole chain breaks. When two components are in series — meaning a request must pass through both of them — you multiply their availabilities. This always makes things worse. If your web server is 99.9% available and your database is 99.9% available, together they're 99.9% × 99.9% ≈ 99.8%. You just lost a "nine" by adding one component.

Parallel (either works): Think of redundancy. If EITHER path works, the system works. When two components are in parallel — meaning the request can go through either one — the formula flips. You calculate the chance that both fail simultaneously (which is tiny). If each server is 99.9% available, two in parallel give you: 1 - (0.001 x 0.001) = 1 - 0.000001 = 99.9999%. You just went from three nines to six nines by adding one server.
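The two rules are one-line formulas each; here they are as a small Python sketch so you can plug in your own numbers:

```python
def serial(*availabilities: float) -> float:
    """Components in series: every one must work, so multiply."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

def parallel(*availabilities: float) -> float:
    """Redundant components: the system fails only if ALL copies fail at once."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1 - a)
    return 1 - all_fail

print(serial(0.999, 0.999))    # ≈ 0.998   (lost a nine)
print(parallel(0.999, 0.999))  # ≈ 0.999999 (gained three nines)
```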

Serial vs Parallel: The Two Rules of Availability Math

  • SERIAL — both must work, like links in a chain. User → Web Server (99.9%) → Database (99.9%). A = A₁ × A₂ = 0.999 × 0.999 ≈ 99.8%. You LOST a nine! Series ALWAYS lowers availability: every new component in the chain makes the total worse. 10 components at 99.9% each ≈ 99.0%.
  • PARALLEL — either works, like backup roads to your house. User → Server A (99.9%) or Server B (99.9%). A = 1 − (1−A₁)(1−A₂) = 1 − (0.001)(0.001) = 99.9999%. You GAINED three nines! Parallel ALWAYS raises availability. Adding redundancy is the single most powerful HA technique.

Adding a second server doesn't double your availability — it squares the failure probability, doubling the number of nines. One server at 99.9% (three nines). Two servers in parallel: 99.9999% (six nines). That single extra server is the best money you'll ever spend on infrastructure.

Let's walk through the serial and parallel calculations step by step, because this math shows up in every system design interview:

Serial Calculation (the chain problem)

  • Write each component's availability as a decimal: web server 0.999, database 0.999.
  • Multiply them: 0.999 × 0.999 = 0.998001.
  • Convert back to a percentage: ≈ 99.8%. One extra component in the chain cost you a nine.

Parallel Calculation (the redundancy win)

  • Convert each server's availability to a failure probability: 1 − 0.999 = 0.001.
  • Multiply the failure probabilities, since the system only fails when both fail at once: 0.001 × 0.001 = 0.000001.
  • Subtract from 1: 1 − 0.000001 = 0.999999, or 99.9999%. Six nines from two three-nines servers.

This math has a profound implication for architecture: keep your serial chain short and your parallel redundancy deep. Every component you add in series (another microservice, another proxy, another middleware) hurts your availability. Every copy you add in parallel (another server, another replica, another availability zone) helps it.

Availability is measured in "nines" — each additional nine means 10x less downtime but roughly 10x more cost. The critical math: serial components (A x B) lower availability, while parallel components (1 - (1-A)(1-B)) raise it dramatically. Adding one redundant server can take you from three nines to six nines. This serial vs parallel math is the foundation of every high-availability architecture.
Section 3

Single Points of Failure — The Weakest Link

Imagine a Christmas tree with 100 lights wired in series. If one bulb burns out, the entire string goes dark. You're on your hands and knees testing each bulb, one by one, trying to find the dead one. Now imagine 100 lights wired in parallel — each bulb has its own connection to power. One dies? The other 99 stay lit. You barely notice.

A single point of failure (SPOF) is that one burned-out Christmas light on the series string. It's any component in your system where, if it fails, everything fails — and the goal of high-availability architecture is to eliminate every SPOF, or at least make each one redundant. Your system is only as available as its weakest, most singular component.

The tricky thing about SPOFs is that the obvious ones get caught early. Everyone knows you need more than one application server. But the hidden SPOFs — the ones that take down production at 3 AM — are the ones nobody thought about:

The most dangerous SPOFs are the ones nobody thinks about: a single DNS provider, a single CI/CD pipeline, a single config server, a single SSL certificate authority, or even a single engineer who's the only person who understands the payment system (the "bus factor" — what happens if they get hit by a bus?). If you haven't traced every request path end-to-end and asked "what if THIS dies?", you have SPOFs you don't know about.

How do you find SPOFs? There's a simple but powerful exercise: trace the full request path from the moment a user types your URL to the moment they see a response. At every single hop, ask: "Is there more than one of these? If this specific thing dies right now, does the user get an error?" If the answer is yes — that's a SPOF, and you need a plan.
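That tracing exercise is mechanical enough to script. Here's a toy audit, assuming you can write down each hop in the request path with its instance count (the inventory below is hypothetical):

```python
# Walk the request path and flag every hop with fewer than 2 instances.
request_path = [
    ("DNS provider",  1),
    ("Load balancer", 1),
    ("App server",    3),
    ("Database",      1),
    ("Cache node",    1),
]

spofs = [name for name, instances in request_path if instances < 2]
print(f"{len(spofs)} SPOFs found: {', '.join(spofs)}")
```

The real value isn't the script; it's forcing yourself to write the inventory down, including the hops outside the request path.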

BEFORE: Trace the Request Path — Find the SPOFs. Ask at every hop: "If this dies, does the user see an error?"

  • User → DNS (1 provider — SPOF!) → Load Balancer (1 instance — SPOF!) → App Servers (3 instances ✓) → Database (1 primary — SPOF!) → Cache (1 node — SPOF!)
  • Four of the five components in the path are SPOFs — this system is fragile even though the app tier has 3 servers.

AFTER: Every SPOF Gets a Backup

  • DNS: 2 providers ✓ (Route53 + Cloudflare)
  • Load Balancers: active-passive pair ✓ (virtual IP failover)
  • App Servers: 3 instances ✓ (stateless, any handles any request)
  • Database: primary + replica ✓ (auto-failover enabled)
  • Cache: 3-node cluster ✓ (Redis Sentinel)

Zero SPOFs — every component can lose one instance and the system keeps running. But don't forget the HIDDEN SPOFs outside the diagram: CI/CD pipeline, SSL certificate renewal, config server, monitoring system, the one engineer who knows the codebase. If any of these is singular, it's a SPOF — even though it's not in the request path.

Notice something important in that diagram: even after fixing all the obvious SPOFs in the request path, there are hidden SPOFs lurking outside of it. Your CI/CD pipeline. Your monitoring stack. Your SSL certificate — if it expires and nobody renews it, HTTPS breaks for everyone. Your configuration server — a centralized service (like Consul, etcd, or Spring Cloud Config) that stores settings for all your microservices; if it goes down, services can't read their config and may fail to start or behave unpredictably. Even the bus factor — the number of people who would need to be "hit by a bus" (leave the team) before a project is in serious trouble. If only one person knows how the billing service works, that person is a SPOF — for your organization, not just your software.

The fix for every SPOF follows the same pattern: redundancy. Two DNS providers instead of one. A pair of load balancers instead of a single one. A database primary with an automatic replica failover. A cache cluster instead of a cache node. Cross-training engineers so knowledge isn't trapped in one head. The next section dives deep into exactly how redundancy works.

A single point of failure (SPOF) is any component where one failure kills the whole system. Find SPOFs by tracing every request path end-to-end and asking "what if this dies?" The most dangerous SPOFs are hidden ones: DNS, CI/CD, config servers, SSL certs, and the "one engineer who knows everything." Fix every SPOF with the same strategy: redundancy.
Section 4

Redundancy — The Core Strategy

The fundamental answer to "how do I make this available?" is always the same: have more than one of it. That's redundancy. Every high-availability technique you'll ever learn — failover, replication, clustering, multi-zone deployments — is just redundancy wearing a different hat.

But "just add more copies" raises an immediate question: how do those copies work together? Do they all handle traffic at the same time, or does one sit idle waiting for the primary to die? There are two main approaches, and each has very different trade-offs.

Active-Active vs Active-Passive

Active-active means all copies serve real traffic simultaneously. If you have three servers, all three handle requests. When one dies, the other two absorb its load — they were already warmed up and handling traffic, so the transition is seamless. It's more efficient (no idle resources), but it's harder to build because you need to keep state synchronized across all copies.

Active-passive means one copy does all the work while the other sits idle, waiting. Like a backup goalkeeper who watches the match from the bench and jumps in only if the starter gets injured. It's simpler to set up, but you're paying for a server that does nothing most of the time. And there's always a brief delay when switching from primary to backup — the failover time, which ranges from a few seconds (automated failover) to several minutes (manual failover). During that window, the service may be degraded or unavailable.

Active-Active vs Active-Passive Redundancy

  • Active-Active — all copies serve traffic at the same time. User traffic splits across Server 1, Server 2, and Server 3 (33% each). If Server 2 dies, Servers 1 & 3 absorb its load (50% each). Zero downtime. Instant. Seamless. Pros: no wasted resources, instant failover. Cons: harder to build, state sync is complex.
  • Active-Passive — primary works, standby waits. The Primary (active) handles 100% of traffic while the Standby (passive) sits idle at 0%, continuously syncing. If the Primary dies, the Standby is promoted and becomes the new Primary — with a brief delay (seconds to minutes) during the switch. Pros: simple to set up, no state conflicts. Cons: wastes resources on an idle server, failover has downtime.

N+1 Redundancy

There's a practical middle ground that most teams use: N+1 redundancy. The idea is simple: if you need N servers to handle your normal traffic, deploy N+1. The extra server covers any single failure without overloading the remaining ones. It's the sweet spot between cost (you're not paying for a fully idle backup) and safety (you can survive one failure without degradation). Need 4 servers to handle peak traffic? Deploy 5. If one dies, the other 4 handle it without breaking a sweat.
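The N+1 sizing rule in code, as a quick sketch (the traffic numbers below are made up for illustration):

```python
import math

def fleet_size(peak_rps: float, rps_per_server: float) -> int:
    """N+1: enough servers for peak load, plus one spare for any single failure."""
    n = math.ceil(peak_rps / rps_per_server)
    return n + 1

# Hypothetical: 10,000 req/s at peak, each server handles 2,500 req/s.
# N = 4, so deploy 5. Lose any one server and the other 4 still cover peak.
print(fleet_size(10_000, 2_500))  # 5
```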

Geographic Redundancy

The ultimate form of redundancy isn't just having more servers — it's having them in different physical places. This protects against the kind of failures that take out an entire data center: power grid outages, natural disasters, network backbone cuts, or even someone accidentally pulling the wrong fiber optic cable.

Geographic Redundancy — Survive Anything Short of an Asteroid

  • Region 1 — US East (Virginia). Normal operation: serves all traffic, with three AZs for intra-region HA. AZ-1a (Building A, Power Grid X): App × 2, DB primary, Cache × 1. AZ-1b (Building B, Power Grid Y): App × 2, DB replica, Cache × 1. AZ-1c (Building C, Power Grid Z): App × 2, DB replica, Cache × 1.
  • Region 2 — US West (Oregon). Disaster recovery: a warm standby with App × 2 (scaled down), an async DB read replica, and Cache × 1.
  • AZ failure → the other 2 AZs handle it (seconds). Entire region failure → DNS failover to Region 2 (minutes).

The two levels of geographic redundancy are availability zones (AZs) and regions. AZs are separate physical buildings within the same cloud region, each with independent power, cooling, and networking — close enough (a few miles apart) for low-latency replication, but far enough that a fire or power outage at one won't affect the others; AWS, Azure, and GCP all offer multiple AZs per region. Regions are geographically distant locations — think US East (Virginia) vs US West (Oregon) vs EU (Ireland) — hundreds or thousands of miles apart. Spreading across AZs within one region protects against building-level failures (a single AZ going down). Spreading across regions protects against regional disasters — but cross-region replication has higher latency (tens of milliseconds vs sub-millisecond within a region) and adds complexity because data has to travel hundreds of miles.

Here's the trade-off table you'll need for design interviews:

Strategy | Cost | Complexity | Recovery Time | Best For
Active-Active | High (all resources active) | High (state sync) | Near-zero | Latency-sensitive, global users
Active-Passive | Medium (idle standby) | Medium | Seconds to minutes | Databases, stateful services
N+1 | Low (just 1 extra) | Low | Near-zero | Stateless app servers
Multi-AZ | Medium (2-3x networking) | Medium | Seconds | Production workloads (standard)
Multi-Region | Very high (2x everything) | Very high | Minutes | Life-critical, global, compliance
Most teams start with N+1 redundancy behind a load balancer in a single AZ. Then they expand to multi-AZ. Multi-region is usually only worth the cost and complexity for services where downtime costs more than the infrastructure — think payment processing, healthcare, or serving users on multiple continents.

Redundancy — having more than one of everything — is the core strategy for high availability. Active-active is efficient but complex; active-passive is simpler but wastes resources. N+1 is the pragmatic sweet spot for most teams. Geographic redundancy (multi-AZ, multi-region) protects against physical disasters. The right level depends on your downtime cost vs infrastructure cost.
Section 5

Load Balancing for High Availability

You've added redundant servers. Great. But how does traffic actually get to those servers? If you just point your domain at one server's IP address, then when that server dies, all traffic goes nowhere — even though your other servers are perfectly healthy. You need a traffic cop: a load balancer — a server (or service) that sits between users and your application servers, receiving every incoming request and forwarding it to one of the healthy backends. Think of it as a receptionist who directs visitors to whichever agent is available.

A load balancer does two things that matter for availability. First, it distributes traffic across multiple servers so no single server gets overwhelmed. Second — and this is the HA superpower — it detects when a server is dead and stops sending traffic to it. Users don't see a single error. The dead server just quietly disappears from the pool, and the healthy ones pick up the slack.

How does the load balancer know a server is dead? It sends periodic health checks — little "are you alive?" pings, typically a simple HTTP GET to an endpoint like /health every 5-10 seconds that should return 200 OK. If a server fails to respond to two or three consecutive checks, the LB marks it as unhealthy and routes around it. When the server comes back, it starts getting traffic again.
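The "N consecutive failures" logic is simple enough to sketch. This is a toy version, not any particular load balancer's implementation; the /health endpoint and the 3-strike threshold follow the conventions described above:

```python
import urllib.request

def probe(url: str, timeout: float = 2.0) -> bool:
    """One health probe: healthy iff GET returns 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

class HealthTracker:
    """Marks a backend unhealthy after `threshold` consecutive failed probes."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed in a probe result; returns True while the backend stays in the pool."""
        self.failures = 0 if probe_ok else self.failures + 1
        return self.failures < self.threshold
```

A real LB runs probe() against every backend every few seconds and feeds the results into per-backend trackers. Note that one success resets the failure count, so a flaky-but-alive server isn't ejected.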

Highly Available Load Balancing

  • User requests arrive at a Virtual IP (203.0.113.1), held by the Primary LB (active), which handles all traffic and sends health checks. The Standby LB (passive) watches the primary via a heartbeat, ready to take over.
  • Behind the LB: Server 1 (healthy ✓), Server 2 (healthy ✓), Server 3 (healthy ✓), Server 4 (failed health check — removed from pool).
  • Health check: GET /health every 5s; 3 failures = out of the pool.
  • The LB itself is redundant (active-passive pair). The VIP floats to whichever LB is alive — users only ever know the VIP address.

But wait — the load balancer itself is a component in the request path. If it dies, nothing gets routed. The LB is a SPOF unless you make it redundant too! The standard solution is an active-passive LB pair sharing a virtual IP (VIP) — an IP address that isn't tied to a physical machine. The primary handles all traffic while the standby monitors it with a heartbeat signal. Your DNS points to the VIP. When the primary LB dies, the standby claims the VIP and starts handling traffic — no DNS change needed, so failover takes just a second or two and users never notice the switch.

Algorithms and Availability

How the load balancer chooses which server gets each request also affects availability:

Round-robin: Requests go to servers in order — 1, 2, 3, 1, 2, 3. Simple and works well when all servers are identical. But if one server is slower, it builds up a backlog.

Least connections: Sends each new request to whichever server has the fewest active connections right now. Automatically adapts if one server is slow (it accumulates connections, so the LB sends fewer new ones). Better for mixed workloads.

Weighted: You assign weights to servers — a beefy server with 32 cores gets weight 4, a small one gets weight 1. The LB sends 4x more traffic to the big one. Essential when your fleet isn't homogeneous.
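Round-robin and least-connections are each a few lines of code. A minimal sketch of the two selection strategies (no weights, no health checks, just the picking logic):

```python
import itertools

class RoundRobin:
    """Hand out servers in a fixed cycle: 1, 2, 3, 1, 2, 3, ..."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Send each new request to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes so the server's count drops back down."""
        self.active[server] -= 1
```

Notice how least-connections self-corrects: a slow server holds its connections longer, its count stays high, and it naturally receives fewer new requests.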

Layer 4 vs Layer 7

Load balancers operate at different levels of the networking stack, and this matters for HA:

Layer 4 (transport) load balancers route based on TCP/UDP information — source IP, destination IP, port numbers — without reading the HTTP content at all. That makes them blindingly fast (millions of connections per second), but they can't make smart routing decisions: they don't know if a request is for /api/health or /api/heavy-report.

Layer 7 (application) load balancers read the full HTTP request — URL path, headers, cookies, even the body. That lets them make intelligent decisions: route /api/* to API servers and /static/* to a CDN, or send premium users to faster servers. They can even do content-based health checks ("did the response include valid JSON?"). They're slower than L4 because of the inspection overhead, but far more flexible.

Session Affinity (Sticky Sessions)

Here's a trade-off that directly impacts availability: session affinity, also called sticky sessions. If your application stores session data locally on each server (login state, shopping cart), the LB has to remember which server a user was sent to and keep sending them to the same one for the duration of their session. That's sticky sessions.

The problem? Sticky sessions are an availability anti-pattern. If the server your session is stuck to dies, you lose your session — you're logged out, your cart is gone, your form data disappears. The user has to start over. Compare that to stateless servers, where session data lives in a shared store like Redis that every server can access: any server can handle any user's request, so when a server dies, the LB sends the user to another one and they don't notice anything.
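The stateless pattern in miniature. This sketch uses an in-memory class as a stand-in for a shared store like Redis, just to show the shape of the idea (all names here are illustrative):

```python
class SharedStore:
    """In-memory stand-in for a shared session store (Redis in real life)."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

store = SharedStore()  # one store, reachable by every app server

def handle_request(server_name: str, session_id: str) -> dict:
    """Any server can load the same session: no stickiness required."""
    session = store.get(session_id) or {"cart": []}
    session["last_server"] = server_name
    store.set(session_id, session)
    return session

handle_request("server-1", "sess-42")      # first request lands on server 1
s = handle_request("server-2", "sess-42")  # server 1 dies; LB picks server 2
print(s["last_server"])                    # the session survived the switch
```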

Stateless services are the best friend of load balancers. If any server can handle any request, the LB has maximum freedom to route around failures. The moment you add sticky sessions, you create a dependency between users and specific servers — and that dependency becomes a mini-SPOF for each affected user.

Load balancers are the traffic cops of high availability — they distribute requests across healthy servers and automatically route around dead ones using health checks. But the LB itself is a SPOF, so it needs its own redundancy (active-passive pair with a virtual IP). Stateless servers give the LB maximum routing freedom. Sticky sessions hurt availability because users are tied to specific servers.
Section 6

Failover Strategies — When the Primary Goes Down

So you’ve built redundancy into every layer — multiple servers, replicated databases, backup load balancers. Great. But here’s the question that actually matters: when the primary dies, how fast does the backup take over? Because redundancy without fast failover is like having a spare tire locked in a shed 20 miles away. Technically you have a backup. Practically, you’re still stuck on the side of the road.

Failover is the process of detecting that a primary component has failed and switching traffic to a backup. Good failover happens automatically, in seconds, without users noticing; bad failover requires a human to SSH in at 3 AM and manually flip a switch. It sounds simple, but every step introduces delay — and those delays add up to real downtime your users feel.

The Failover Timeline — Where Every Second Goes

When something dies in production, the clock starts ticking immediately. But you don’t know it’s dead yet. That gap between "it actually died" and "we know it’s dead" is the detection phase, and it’s usually the longest part of failover. Here’s why:

Most health check systems work by pinging your server every few seconds. If the server doesn’t respond N times in a row, it’s declared dead. So if you check every 5 seconds and need 3 failures to confirm, that’s already 15 seconds before you even know there’s a problem. Then comes promotion (making the backup the new primary), then routing updates (telling all traffic to go to the new primary). Each phase has its own latency.
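Those phases add up in a simple way. A back-of-envelope sketch (the intervals and thresholds below are illustrative, not a standard):

```python
def worst_case_downtime(check_interval_s: float, failure_threshold: int,
                        promotion_s: float, reroute_s: float) -> float:
    """Total user-visible downtime = detection + promotion + rerouting."""
    detection = check_interval_s * failure_threshold
    return detection + promotion_s + reroute_s

# 5s checks, 3 strikes, 30s replica promotion, ~1s floating-IP reroute:
print(worst_case_downtime(5, 3, 30, 1))   # 46 seconds

# Same setup, but DNS failover with a 60s TTL instead:
print(worst_case_downtime(5, 3, 30, 60))  # 105 seconds
```

Changing only the reroute mechanism more than doubles the outage, which is why the rest of this section is about routing.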

Anatomy of a Failover — Where the Seconds Go

  • t = 0s: PRIMARY DIES. Users start seeing errors.
  • Detection (5–30 seconds): at t ≈ 15s, the failure is confirmed after 3 failed health checks.
  • Promotion (5–60 seconds): at t ≈ 45s, the replica is promoted — the backup is now primary.
  • Reroute (1 second to 5 minutes): at t ≈ 60s, traffic is flowing and users see recovery.

TOTAL DOWNTIME = Detect + Promote + Reroute. The reroute mechanism dominates the total:

  • Floating IP / VIP: reroute in ~1 second (the IP is instantly reassigned). Total: 15–60 seconds.
  • Load balancer: reroute in ~5 seconds (health check removes the dead target). Total: 15–90 seconds.
  • DNS failover: reroute in 30s–5 min (clients cache the old IP for the TTL). Total: 1–10 minutes.

Cold vs Warm vs Hot Failover

Not all backups are created equal. The faster you want recovery, the more you pay in ongoing costs. Think of it like car insurance: you can keep a wreck in the junkyard (cheap, slow to fix), a used car in a garage (moderate cost, moderate speed), or a running car with the engine idling in the driveway (expensive, instant). These three levels of readiness have names:

Cold vs Warm vs Hot Failover — more readiness = faster recovery = higher cost

  • COLD — server is OFF. Recovery means boot + install + configure + load data + start app. Recovery: 10–60 min. Cost: $. Best for dev/staging, batch jobs, disaster recovery backups.
  • WARM — server is IDLE. Running but not fully in sync; needs warm-up. Recovery: 30s–5 min. Cost: $$. Best for internal tools and async-replication read replicas.
  • HOT — server is ACTIVE. Fully synced via real-time replication, ready right now. Recovery: 5–30 seconds. Cost: $$$. Best for production databases, payment systems, auth.

DNS Failover vs Floating IP — The Routing Problem

Once the backup is promoted, you still need to tell the world to send traffic to it. There are two main approaches, and the speed difference is dramatic.

DNS failover changes your domain’s DNS record to point to the backup server’s IP address. The problem? DNS TTL — Time To Live, the duration for which DNS resolvers (and your browser) cache a lookup result before asking again. If the TTL is 300 seconds, then for up to 5 minutes after you change the record, some clients will still use the old (dead) IP. Even if you set the TTL to 60 seconds, some resolvers ignore it and cache for longer. You’re looking at 1–5 minutes where some users still can’t reach you — and a lower TTL means more DNS queries, so more cost and latency in normal operation.

A floating IP (also called a virtual IP or VIP) solves this by operating at a lower level. Instead of changing DNS, you reassign the IP address itself from one server to another. The DNS record stays the same and clients keep using the same IP — but that IP now routes to the backup server, so from the outside world’s perspective nothing changed. This happens in under a second on most cloud platforms: AWS Elastic IPs, DigitalOcean floating IPs, and GCP’s address reassignment all work this way.

DNS failover sounds simple, but TTL caching means some users will hit the dead server for minutes after failover — many ISP resolvers cache DNS results beyond the TTL you set. For production services that need fast failover, use floating IPs or a load balancer with health checks, not raw DNS failover. DNS is fine as a last-resort backup, but don’t rely on it as your primary failover mechanism.
Think First

Your database health check runs every 10 seconds with a failure threshold of 3. Replica promotion takes 30 seconds. You use DNS failover with a 60-second TTL. What’s the worst-case downtime from a primary database failure?

Detection: 10s × 3 = 30 seconds. Promotion: 30 seconds. DNS propagation: up to 60 seconds (or more if resolvers ignore TTL). Worst case: 30 + 30 + 60 = 2 minutes. With a floating IP instead of DNS, it drops to about 61 seconds. That’s the power of choosing the right routing mechanism.
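The same arithmetic applies to any failover budget. A tiny helper (illustrative, using the numbers from this exercise):

```python
def worst_case_downtime(check_interval_s, failure_threshold,
                        promotion_s, routing_s):
    """Worst-case downtime = detection + promotion + traffic rerouting.

    Detection: the health checker must observe `failure_threshold`
    consecutive failures, each `check_interval_s` apart.
    """
    detection = check_interval_s * failure_threshold
    return detection + promotion_s + routing_s

# Scenario from the exercise: 10s checks x 3 failures, 30s promotion
dns_failover = worst_case_downtime(10, 3, 30, routing_s=60)  # DNS, 60s TTL
floating_ip  = worst_case_downtime(10, 3, 30, routing_s=1)   # floating IP

print(dns_failover)  # 120 seconds = 2 minutes
print(floating_ip)   # 61 seconds
```

Notice that detection and promotion dominate the floating-IP case — once routing is near-instant, the next place to optimize is the health-check interval.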
Failover has three phases: detect the failure (5–30 seconds), promote the backup (5–60 seconds), and reroute traffic (instant to minutes). Cold failover is cheapest but slowest (minutes to hours). Warm failover runs an idle standby (seconds to minutes). Hot failover keeps a fully synced replica ready (seconds). For routing, floating IPs reroute in under a second; DNS failover can leave users stranded for minutes due to TTL caching.
Section 7

Database High Availability — The Hardest Part

Web servers are easy to make highly available. They’re stateless (a stateless server doesn’t remember anything between requests — each request is independent, so if you lose the server you lose nothing; just spin up a new one and it works exactly the same) — lose one, spin up another, no data lost. But databases? Databases are stateful. They hold your data: your users’ orders, their account balances, their messages. If you lose a database and don’t have a replica, that data might be gone forever. That’s why the database layer is almost always the hardest component to make highly available.

Primary-Replica Replication (The Foundation)

The most common database HA setup is beautifully simple in concept: one database handles all the writes — the primary (the main database server that accepts INSERT, UPDATE, and DELETE; there’s only one primary, to avoid write conflicts; also called "master" in older terminology, though the industry has largely moved to primary/replica naming) — and it continuously copies those changes to one or more replicas (read-only copies that receive a continuous stream of changes from the primary and apply them locally; replicas can handle read queries, spreading SELECT load across multiple servers; also called "secondaries"). Reads can go to any replica, spreading the load. If the primary crashes, one of the replicas gets promoted to become the new primary.

But the devil is in the details. When does the replica get the data? Right away, or eventually? This single question — synchronous versus asynchronous replication — defines the biggest tradeoff in database HA.

Primary-Replica Replication Application Servers WRITES READS PRIMARY Accepts all writes Single source of truth REPLICA 1 Read-only copy Handles read traffic REPLICA 2 Read-only copy Also handles reads replication stream REPLICATION LAG Replicas may be 0–N seconds behind the primary If PRIMARY dies → promote a REPLICA to become the new PRIMARY Any writes not yet replicated to that replica are LOST (async) or SAFE (sync)

Synchronous vs Asynchronous — The Core Tradeoff

With synchronous replication, when you write data to the primary, it doesn’t tell you "OK, done" until at least one replica has also confirmed it received the data. If the primary crashes one millisecond later, the replica has the data. Zero data loss. The price? Every single write is slower because it has to wait for the replica’s acknowledgment. If the replica is in the same data center, that adds maybe 1–2 milliseconds. If it’s across regions (say, Virginia to Oregon), that’s 50–80 milliseconds per write. At thousands of writes per second, that latency adds up fast.

With asynchronous replication, the primary says "done" as soon as it writes the data locally, before the replica confirms. The replication happens in the background, usually within milliseconds. Writes are fast. But if the primary crashes before the replica catches up, those last few writes are gone. This gap is called replication lag — the delay between when a write happens on the primary and when it appears on the replica. Lag is usually milliseconds under normal conditions, but can spike to seconds or minutes during heavy write loads or network congestion, which also means a replica can serve stale (outdated) data.

SYNCHRONOUS Slow writes, zero data loss App Primary Replica 1. INSERT INTO orders... 2. Primary writes locally 3. Send to replica 4. Replica confirms 5. OK — write confirmed If primary crashes NOW → replica has the data → zero loss ASYNCHRONOUS Fast writes, potential data loss App Primary Replica 1. INSERT INTO orders... 2. Primary writes locally 3. OK — done immediately! 4. Replicates later (background) If primary crashes BEFORE step 4 → those writes are LOST Most production databases use async replication (speed) with sync to ONE replica (safety) — called semi-synchronous.
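In PostgreSQL, for instance, this choice comes down to two settings in `postgresql.conf`. A sketch (the standby names `replica_a` and `replica_b` are placeholders):

```ini
# Wait for a standby to confirm before telling the client "OK"
synchronous_commit = on

# Semi-synchronous, quorum style: any ONE of the listed standbys
# must acknowledge each commit; the other replicates asynchronously
synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'

# For fully asynchronous replication instead, leave
# synchronous_standby_names empty: commits return after the local
# WAL write, and replicas catch up in the background.
```

The `ANY 1 (...)` form is what gives you the semi-synchronous setup described above: zero data loss as long as at least one standby is reachable, without paying the latency cost of waiting for every replica.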

Multi-Primary — Writes Everywhere (With a Price)

What if any node could accept writes? That’s multi-primary (also called multi-master) — a replication topology where two or more database nodes all accept writes and replicate their changes to each other. It sounds ideal — no single point of failure for writes, better write throughput. But it introduces a monster problem: write conflicts. If two users update the same row on two different primaries at the same millisecond, which write wins? You need a conflict resolution strategy, and every strategy has drawbacks. "Last write wins" loses data silently. Custom merge logic is complex and error-prone. This is why multi-primary is used far less than primary-replica in practice.

The Connection Pooling Trick

One underappreciated availability technique is connection pooling: a proxy sits between your application and the database, maintaining a pool of reusable database connections. Instead of each app server opening its own connections (which is slow and limited), the pooler multiplexes hundreds of app connections onto a smaller number of real database connections. Tools like PgBouncer (PostgreSQL) and ProxySQL (MySQL) do exactly this. During brief restarts or failovers (say, 5–10 seconds), the pooler holds your app’s connections open and queues their queries until the database is back. Your application code doesn’t even know the database briefly went away. It’s like having a personal assistant who picks up the phone and says "please hold" instead of disconnecting the call.
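A minimal PgBouncer configuration showing the idea (the host address and pool sizes here are made-up example values):

```ini
[databases]
; Apps connect to PgBouncer on port 6432; it forwards to the real primary.
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction    ; server connection held only for one transaction
max_client_conn = 1000     ; app-side connections PgBouncer will accept
default_pool_size = 20     ; real connections opened to PostgreSQL
```

With `pool_mode = transaction`, a brief primary restart mostly shows up as queries queuing for a few seconds rather than as connection errors in the application.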

| Approach | Setup Effort | Data Loss Risk | Failover Time | Complexity |
|---|---|---|---|---|
| Primary + async replica | Low | Seconds of writes | 30–120s | Low |
| Primary + sync replica | Low | Zero | 15–60s | Medium |
| Multi-primary | High | Conflicts possible | Near-zero | Very High |
| Shared storage (Aurora) | Medium | Zero | ~30s | Low (managed) |
Amazon Aurora separates the compute layer (query processing) from the storage layer (data on disk). Traditional PostgreSQL ties compute and storage together on the same server. When a traditional primary fails, the replica has to replay transaction logs and verify data integrity before it can serve writes — that takes 60–120 seconds. Aurora’s approach? Just start a new compute instance and point it at the same shared storage. No data to copy, no logs to replay. That’s why Aurora failover takes about 30 seconds versus PostgreSQL’s 60–120 seconds.

Databases are the hardest component to make HA because they’re stateful — you can’t just replace them without moving the data. Primary-replica replication is the foundation: writes to one primary, reads from replicas. Synchronous replication guarantees zero data loss but slows every write; asynchronous is faster but risks losing recent writes on failover. Multi-primary eliminates the write bottleneck but introduces painful conflict resolution. Connection poolers like PgBouncer can mask brief outages. Aurora achieves fast failover by separating compute from storage.
Section 8

Multi-AZ and Multi-Region — Surviving Data Center Failures

Everything we’ve talked about so far — redundant servers, database replicas, health checks — could all live in the same building. And if that building loses power, catches fire, or gets hit by a flood? Everything goes down together. Redundancy within a single data center protects you from individual machine failures. To survive an entire data center failure, you need to spread across multiple physical locations.

What’s an Availability Zone?

Cloud providers organize their infrastructure into Availability Zones (AZs). Think of an AZ as a separate building (or group of nearby buildings) within the same city. Each AZ has its own power supply, its own cooling, its own network connections. They’re connected to each other with high-speed, low-latency links (typically under 2 milliseconds), but physically separated enough that a localized disaster — building fire, power grid failure, cooling system meltdown — affects only one AZ. AWS typically has 3–6 AZs per region.

A region is a geographic area where a cloud provider operates — like Northern Virginia (us-east-1), Ireland (eu-west-1), or Singapore (ap-southeast-1). Each region is completely independent — separate power grids, separate internet connectivity, separate staff — and contains multiple AZs. Regions are separated by hundreds or thousands of miles, so a natural disaster that wipes out one region won’t touch another. Cross-region latency is typically 50–200 milliseconds.

Multi-AZ Architecture (Standard for Production) REGION: us-east-1 (N. Virginia) LOAD BALANCER (ALB) AZ-1a Web x2 API x2 DB PRIMARY Cache (Redis) AZ-1b Web x2 API x2 DB REPLICA Cache (Redis) AZ-1c Web x2 API x2 DB REPLICA Cache (Redis)

This is what a standard production setup looks like on AWS, Azure, or GCP. Your load balancer (like an AWS ALB) routes traffic across servers in multiple AZs. If AZ-1a has a power outage, the load balancer detects the health check failures within seconds and routes all traffic to AZ-1b and AZ-1c. Your users don’t even see a blip.

Cross-AZ latency is tiny — typically 0.5–2 milliseconds. So running your app across 3 AZs costs you almost nothing in performance. This is why multi-AZ is considered the minimum for any production service.

Multi-Region — The Big Leagues

Multi-AZ handles data center failures. But what about an entire region going offline? This is rare but it happens. AWS us-east-1 has had several major outages that took down huge chunks of the internet. When that happens, every service in that region — no matter how many AZs it spans — goes dark.

Multi-region means deploying your application in two or more geographic regions (say, us-east-1 and us-west-2). A global load balancer — such as AWS Route 53 (DNS-based), Cloudflare, or Azure Traffic Manager — routes each user to the nearest healthy region, using latency-based routing (send users to the region with the lowest ping) or geolocation routing (send European users to eu-west-1). If one region fails, the global LB redirects everyone to the surviving region.

Multi-Region Architecture GLOBAL LOAD BALANCER (Route 53 / Cloudflare) US Users EU Users REGION: us-east-1 (Virginia) App Servers (across 3 AZs) DB Primary Multi-AZ replicas Cache + Queue Latency to users: 20–40ms REGION: eu-west-1 (Ireland) App Servers (across 3 AZs) DB Replica Async replication Cache + Queue Latency to users: 20–40ms Cross-region replication 50–200ms latency

Sounds great, right? The catch is data. Replicating compute (app servers) across regions is easy — deploy the same container image in both regions and you’re done. But replicating data across regions means every write in Virginia needs to travel to Ireland (or wherever your second region is). That’s a round trip of 50–200 milliseconds. If you use synchronous replication, every single write waits that long. If you use async, you risk data loss during a region failure.

This is the fundamental challenge of multi-region: you can’t have fast writes AND zero data loss AND multiple regions all at the same time. You have to pick two. Most systems choose fast writes + multiple regions (async replication), accepting that a region failure might lose the last few seconds of writes. Only the most critical systems (banks, healthcare) pay the latency cost for synchronous cross-region replication.

Multi-AZ is the MINIMUM for any production service. It’s cheap (cross-AZ data transfer is pennies), fast (under 2ms latency), and protects against the most common type of infrastructure failure. Multi-region is required for global services or when a single region going down would be catastrophic to the business. It’s much harder and more expensive, but it’s the only protection against a full regional outage.
Think First

Your e-commerce site runs in us-east-1 across 3 AZs. You want to add a second region for disaster recovery. Your database gets 5,000 writes per second. If you use synchronous cross-region replication, what happens to write latency? At 5,000 writes/sec, is that acceptable?

Each write must wait for the cross-region round trip (~80ms Virginia to Oregon). That means each write takes at least 80ms instead of 1–2ms — a 40–80× latency penalty. At 5,000 writes/sec, Little’s law (in-flight requests = arrival rate × latency) says you’d have roughly 5,000 × 0.08 = 400 writes in flight at any moment, all holding database connections open. Most teams can’t afford that latency penalty, which is why async replication with "eventual consistency" is the standard for multi-region.
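Little’s law (average requests in flight = arrival rate × latency) makes the cost of synchronous cross-region writes concrete:

```python
def writes_in_flight(writes_per_sec, write_latency_ms):
    """Little's law: average number in flight = arrival rate x latency."""
    return writes_per_sec * write_latency_ms / 1000

# Same data center: ~1.5 ms per write
print(writes_in_flight(5000, 1.5))  # 7.5 concurrent writes
# Synchronous cross-region (Virginia -> Oregon): ~80 ms per write
print(writes_in_flight(5000, 80))   # 400.0 concurrent writes
```

Going from ~8 to ~400 concurrent writes means 50× more open connections, locks held 50× longer, and 50× more memory pinned per write path — which is usually what breaks first, well before raw CPU.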
Multi-AZ (2–3 Availability Zones within one region) is the minimum for production — it protects against single data center failures with negligible latency cost. Multi-region (2+ geographic regions) protects against an entire region going down but introduces the hard data replication problem: synchronous cross-region replication adds 50–200ms to every write, while asynchronous replication risks losing recent writes during a region failure. Most systems use multi-AZ as the default and add multi-region only when the business impact of a full regional outage justifies the complexity and cost.
Section 9

Graceful Degradation & Rate Limiting — Bending Without Breaking

Everything we’ve covered so far is about surviving failures — keeping things running when components die. But what about when nothing is broken, yet your system is drowning? A traffic spike. A viral tweet. Black Friday. Your servers are alive but can’t keep up with the load. You have two options: crash completely and serve errors to everyone, or deliberately reduce what you offer so most people still get a usable experience.

The second option has a name: graceful degradation — intentionally disabling non-critical features when the system is under stress so that core features keep working. Instead of crashing entirely, the system still works, just with fewer features — like a building that turns off decorative lighting during a power shortage to keep the elevators and emergency lights running. Instead of serving a blank error page, you serve a simpler version of your product. Not perfect, but functional. Netflix does this: under heavy load, they reduce video quality rather than stop streaming entirely. Amazon shows cached product pages when the recommendation engine is overloaded. Twitter shows "Something went wrong" for individual tweets but keeps the timeline loading.

Degradation Tiers — Your Emergency Plan

Smart teams plan their degradation strategy in advance, defining clear tiers from "everything’s fine" down to "we’re barely alive." Each tier disables less-critical features to free up resources for the core experience.

Graceful Degradation Tiers Each tier sacrifices non-critical features to keep the core alive TIER 1 Full Service All features active Real-time data Recommendations Analytics tracking Load: Normal TIER 2 Reduced Features Core features work Cached data shown Recommendations OFF Analytics OFF Load: High TIER 3 Read-Only Mode Browse products View static pages No checkout No account updates Load: Critical TIER 4 Static Fallback CDN serves cached HTML "We're experiencing issues" No dynamic features No backend needed Load: Overloaded THE KEY INSIGHT A partially working site is infinitely better than a completely dead site. Users can still browse. They remember you tried. They’ll come back. A blank error page? They go to a competitor.

Rate Limiting — Protecting the Many from the Few

Sometimes your system isn’t under stress from genuine traffic — it’s being hammered by one bad actor, a buggy client, or an accidental infinite loop in someone’s code. Rate limiting is your bouncer at the door: it caps how many requests a single client (user, IP address, or API key) can make in a given time window, so one misbehaving user can’t take down the service for everyone else. Requests beyond the limit receive a 429 "Too Many Requests" response.

The most popular rate limiting algorithm is the token bucket. Picture a bucket that holds, say, 100 tokens and refills at a steady rate of 10 tokens per second. Every request costs one token. If the bucket is empty, the request is rejected with a 429 Too Many Requests response. This allows short bursts (up to 100 rapid requests) while enforcing a sustained rate of 10 requests per second.
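Here is a minimal token bucket in Python — an illustrative sketch, not any particular library’s implementation; the injectable clock just makes it testable:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: bursts up to `capacity`, then a
    sustained rate of `refill_rate` tokens per second."""

    def __init__(self, capacity, refill_rate, now=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # bucket starts full
        self.now = now                 # injectable clock for testing
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # 200 OK
        return False       # 429 Too Many Requests

# 100-token bucket refilling at 10/sec, as in the example above
bucket = TokenBucket(capacity=100, refill_rate=10)
burst = sum(bucket.allow() for _ in range(150))
print(burst)  # 100 — the burst drains the bucket; the rest get 429
```

In production you would keep the bucket state in Redis (keyed by user or IP) so every app server enforces the same limit, but the refill arithmetic stays exactly this simple.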

Token Bucket Rate Limiter Refill: 10/sec BUCKET Capacity: 100 tokens Current: 47 tokens Incoming Requests Each costs 1 token tokens > 0? ALLOWED (200 OK) Request processed normally REJECTED (429) Bucket empty — try later Burst-friendly: 100 rapid requests allowed if tokens available, then 10/sec sustained rate

Load Shedding — Triaging at Maximum Capacity

Rate limiting protects against individual bad actors. Load shedding protects against legitimate overload: when the system is at maximum capacity, it deliberately drops low-priority requests so high-priority requests still get through — like a hospital triage nurse who sends minor injuries to the waiting room so the ER can focus on life-threatening cases. When your system is at 100% capacity, you start prioritizing. A checkout request is more important than a recommendation request. A health check is more important than an analytics ping. A payment confirmation is more important than a product page load.

Here’s a typical priority scheme:

  • P0 — payment confirmations, health checks: never shed
  • P1 — core user actions: checkout, login
  • P2 — nice-to-haves: search suggestions, recommendations, product page extras
  • P3 — analytics pings, logging: first to go

When load hits a threshold (say, CPU at 90%), start rejecting P3 requests. Still overloaded? Reject P2. The goal is simple: keep the cash register ringing even when the store is packed beyond capacity.
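A load shedder can be as simple as a lookup from current load to the lowest priority still accepted. An illustrative sketch (the request types and CPU thresholds are made-up examples):

```python
# Illustrative priority classes (P0 = never shed, P3 = first to go)
PRIORITY = {"payment": 0, "checkout": 1, "search": 2, "analytics": 3}

def max_allowed_priority(cpu_utilization):
    """Map current load to the lowest-priority class still accepted."""
    if cpu_utilization < 0.90:
        return 3   # normal load: accept everything
    if cpu_utilization < 0.95:
        return 2   # shed P3 (analytics)
    if cpu_utilization < 0.99:
        return 1   # shed P2 as well (search, recommendations)
    return 0       # emergency: payments and health checks only

def should_accept(request_type, cpu_utilization):
    return PRIORITY[request_type] <= max_allowed_priority(cpu_utilization)

print(should_accept("analytics", 0.92))  # False — P3 shed at 90%+ CPU
print(should_accept("checkout", 0.92))   # True — the register keeps ringing
```

The check runs per request at the edge (load balancer or API gateway), so shedding kicks in within seconds of load crossing a threshold — no deploy required.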

During an outage, your team will be stressed, sleep-deprived, and making decisions under pressure. That’s the worst possible time to figure out which features to disable. The right approach: define your degradation tiers in advance, wire them to feature flags (toggles that enable or disable features without deploying new code — think LaunchDarkly, Unleash, or even a simple database table of on/off switches; during an incident you flip the flag and the feature turns off instantly, with no deployment, no restart, no risk), test them regularly, and make sure anyone on the on-call team can activate them with a single click or command. At Netflix, degradation tiers are pre-planned and can be activated in under 60 seconds by the on-call engineer.
Think First

Your API gets 10,000 requests/sec normally. A viral campaign causes a spike to 50,000 requests/sec. Your servers can handle 20,000/sec at most. What’s your strategy? Rate limit? Degrade? Both?

Both. Rate limit individual clients to prevent abuse (maybe 100 req/sec per client). For the remaining legitimate flood: activate degradation tier 2 (disable recommendations, serve cached product pages), and shed P3 analytics traffic completely. This gets your effective load down from 50K to maybe 18K processed requests/sec — within capacity. The core shopping experience stays alive.
Graceful degradation keeps core features alive by shedding non-critical ones under load: full service → reduced features → read-only → static fallback. Rate limiting (token bucket is the most common algorithm) protects the system from individual abusers by rejecting excess requests with 429 responses. Load shedding triages at max capacity — drop analytics before search, drop search before checkout. The key: plan all degradation tiers BEFORE the incident and wire them to feature flags for instant activation.
Section 10

Real-World Availability Numbers — What Production Actually Looks Like

Let’s ground all this theory in real numbers. What do actual, honest-to-goodness production systems achieve? What do the biggest companies in the world promise their customers? And what does it actually cost when things go wrong?

What Real Services Promise (and Achieve)

| Service | Target Availability | Allowed Downtime/Year | What This Means |
|---|---|---|---|
| AWS S3 | 99.99% (designed for 99.999999999% durability) | 52 min | Your files are almost certainly safe, but the service might be briefly unreachable |
| Google Search | ~99.999% | ~5 min | The gold standard — Google invests billions to achieve this |
| Netflix | 99.99%+ (designed for failure) | <52 min | Chaos Monkey ensures every component survives random failures |
| Major banks | 99.95–99.99% | 52 min – 4.4 hrs | Regulatory requirements push for high availability; still have maintenance windows |
| Typical SaaS startup | 99.5–99.9% | 8.7 hrs – 43.8 hrs | Acceptable for most B2B products; customers tolerate occasional downtime |
| Telecom / 911 systems | 99.999% (five nines) | 5.3 min | Lives depend on it — regulated by government, no excuses |
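The downtime column is pure arithmetic: a year has 525,600 minutes, and the allowed downtime is whatever fraction the nines leave over. For example:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_per_year(availability_pct):
    """Allowed downtime in minutes/year for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year(nines):.1f} min/year")
# 99.0%   -> 5256.0 min/year  (~3.65 days)
# 99.9%   -> 525.6 min/year   (~8.8 hours)
# 99.99%  -> 52.6 min/year
# 99.999% -> 5.3 min/year
```

Each extra nine divides the budget by ten — which is exactly why each extra nine roughly multiplies the engineering cost by ten.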

The Cost of Downtime by Industry

Downtime isn’t just an engineering problem — it’s a business catastrophe measured in real dollars per minute. These numbers come from industry surveys and help explain why some companies invest tens of millions in availability infrastructure.

| Industry | Estimated Cost per Hour of Downtime | Why So High |
|---|---|---|
| Financial services | $5.6 million | Trades fail, regulatory fines, customer lawsuits |
| Healthcare | $636,000 | Patient care delayed, compliance violations, liability |
| Retail / e-commerce | $1.1 million | Lost sales, cart abandonment, customers go to competitors |
| Manufacturing | $260,000 | Production lines stop, supply chain disrupted |
| Media / entertainment | $90,000 | Ad revenue lost, subscriber churn, social media backlash |

Why 100% Is Impossible

If you take nothing else from this section, take this: 100% availability is a mathematical impossibility for any non-trivial system. Here’s why:

The Availability Spectrum: 95% (18 days down — hobby projects, side projects, demos) · 99% (3.65 days — internal tools, admin panels, CMS) · 99.9% (8.7 hrs — most SaaS products: Slack, Notion, Jira) · 99.99% (52 min — e-commerce, banking: AWS S3, Stripe) · 99.999% (5.3 min — telecom, 911, air traffic: Google Search, Visa). Each additional nine costs roughly 10× more in engineering effort, infrastructure, and ongoing operations. 100% is physically impossible.

Google, Amazon, and Microsoft — the three companies with arguably the most sophisticated infrastructure on the planet — all have outages every year. Google had 5 major outages in 2023. AWS us-east-1 went down for 7 hours in December 2021. Azure Active Directory had outages affecting millions. The goal is not to eliminate failure. It’s to make failures so brief and so well-handled that users barely notice. That’s what separates 99.99% from 99.9% — not fewer failures, but faster recovery.
Think First

Your SaaS product currently achieves about 99.5% availability (roughly 44 hours of downtime per year). Your CEO wants 99.99% (52 minutes per year). That’s a 50× reduction in downtime. What would you need to build?

You’d need: multi-AZ deployment (probably multi-region), hot database failover with synchronous replication, automated deployment rollback, comprehensive health checks, rate limiting, graceful degradation, 24/7 on-call rotation, and chaos testing. Budget estimate: 3–5 additional engineers dedicated to reliability + 2–3× your current infrastructure cost. Each additional nine isn’t a small improvement — it’s a fundamentally different architecture.
Real production availability ranges from 99.5% (SaaS startups, ~43 hours downtime/year) to 99.999% (telecom, ~5 minutes/year). Downtime costs vary wildly by industry: $5.6M/hour for finance, $1.1M/hour for e-commerce. 100% availability is physically impossible — hardware fails, networks partition, humans make mistakes, and even time itself (leap seconds) has caused outages. Each additional nine of availability costs roughly 10x more in engineering and infrastructure. The goal is not zero failures but invisible failures.
Section 11

Availability Patterns in the Wild

Theory is nice, but how do real companies actually stay up? Not with magic — with a handful of battle-tested patterns that show up again and again across Netflix, Amazon, Slack, and every company that can’t afford to go down. Let’s walk through the four most impactful patterns, starting with the simplest one that gives you the biggest bang for your buck.

Four Patterns That Keep the Internet Running Each pattern eliminates a different category of failure STATELESS WEB TIER S1 S2 S3 Redis (sessions) Any server can handle any request QUEUE DECOUPLING Producer Q Consumer Consumer dies? Messages wait in the queue safely No data lost CELL-BASED Cell A Users 1-1M Cell B Users 1M-2M Cell C DOWN Cell C fails → only those users affected Blast radius: 33% READ REPLICAS Primary (writes) R1 R2 R3 Reads fan out across replicas How They Work Together Stateless servers → LB routes around failures instantly Queue decoupling → async work survives consumer crashes Cell isolation → failures only affect a fraction of users Read replicas → reads keep working even if writes slow down Combined: each pattern covers a different failure mode

Pattern 1: Stateless Web Tier

This is the single most important availability pattern, and it’s deceptively simple: don’t store anything on the web server itself. No sessions, no local files, no in-memory caches that matter. Every request can go to any server, because no server is "special."

Why does this help availability? Because it gives your load balancer complete freedom: if one backend is unhealthy, the load balancer just stops sending traffic to it, no human intervention needed. If Server 3 crashes, the next request goes to Server 1 or Server 2, and the user never notices. There’s no "but that user’s session was on Server 3!" problem, because sessions live in an external store like Redis (an in-memory data store often used for sessions, caching, and real-time data — fast, with sub-millisecond reads, and able to replicate across multiple nodes for its own high availability).

When to use it: Every web application. Seriously. If your web servers are stateful, that should be the first thing you fix.

Trade-off: You need an external session store (Redis, DynamoDB, etc.), which is one more piece of infrastructure to manage. But that store can be replicated and clustered, so it’s far easier to make highly available than trying to keep sticky sessions working across dozens of servers.

Pattern 2: Queue-Based Decoupling

Imagine an online store that sends a confirmation email after every purchase. The naive approach: the checkout API calls the email service directly. If the email service is down, the checkout fails — even though the payment already went through. That’s terrible.

The fix: put a message queue between them — a buffer where the producer writes messages and the consumer reads them out, like a mailbox: the postal service doesn’t need you to be home. The checkout API writes a "send email" message to the queue and immediately responds to the user. The email service reads from the queue at its own pace. If the email service crashes, the messages just sit in the queue waiting. When it restarts, it picks up right where it left off. Zero messages lost, zero user impact on the checkout flow.

When to use it: Any work that doesn’t need to happen synchronously — emails, notifications, analytics, image processing, report generation.

Trade-off: The work becomes eventually consistent. The user might see "Order confirmed!" before the email goes out. For most async work, that’s fine. For something the user needs to see immediately, a queue adds unwanted delay.
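The checkout-email example can be sketched with an in-process queue (a stand-in for RabbitMQ, SQS, or Kafka; the function names here are invented for illustration):

```python
import queue

# In production this would be a durable broker; the point is the same:
# the producer never waits on the consumer being alive.
email_queue = queue.Queue()

def checkout(order_id):
    """Checkout succeeds immediately; the email is deferred."""
    # ... charge the card, record the order ...
    email_queue.put({"type": "confirmation", "order": order_id})
    return "Order confirmed!"

def email_worker():
    """Consumer: drains whatever accumulated while it was down."""
    sent = []
    while not email_queue.empty():
        sent.append(email_queue.get())
    return sent

# Three checkouts complete while the email service is "down"...
for oid in (101, 102, 103):
    checkout(oid)
# ...and when the worker comes back, nothing was lost.
print(len(email_worker()))  # 3
```

With a durable broker, the same property survives process crashes: messages are acknowledged only after the email actually goes out, so a worker dying mid-send just means the message is redelivered.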

Pattern 3: Cell-Based Architecture

Here’s a thought experiment: if your entire system runs on one stack (one set of servers, one database, one cache), and any of those break, every single user is affected. That’s a 100% blast radius — the scope of impact when something fails. A blast radius of 100% means every user is affected; a blast radius of 5% means only a small fraction of users notice. The goal is to make the blast radius as small as possible.

Cell-based architecture fixes this by dividing your users into independent "cells." Each cell is a completely self-contained stack — its own servers, its own database, its own cache. Users 1 through 1 million are in Cell A, users 1 million to 2 million are in Cell B, and so on. If Cell B crashes, only those users are affected. Cells A and C keep running as if nothing happened.

AWS uses this pattern internally. Slack uses it. When Slack says "some users are experiencing issues," that’s a cell failure — not a total outage.

When to use it: Large-scale systems (millions of users) where total outages are unacceptable. Overkill for small apps.

Trade-off: Significantly more infrastructure to manage. Each cell is essentially a mini-deployment of your entire system. Cross-cell communication (e.g., User A in Cell 1 messaging User B in Cell 2) adds complexity.

Pattern 4: Read Replicas

Most internet traffic is reads, not writes. Think about Twitter: for every tweet written, it’s read thousands of times. For every product listed on Amazon, it’s viewed millions of times. The ratio is often 90–99% reads vs. 1–10% writes.

Read replicas exploit this by keeping one primary database for writes and multiple read-only copies (replicas) for reads. Writes go to the primary — there’s only one, to avoid conflicting writes — which then replicates changes to the replicas. Reads are distributed across all the replicas.

How does this help availability? If the primary is slow (say, processing a big batch job), the replicas keep serving reads at full speed. If one replica dies, the others absorb the extra read traffic. Even during a failover on the primary, reads keep working.

When to use it: Any read-heavy workload (which is most web applications). Blog, e-commerce catalog, social media feeds, dashboards.

Trade-off: Replication lag — there’s a small delay (usually milliseconds, though under heavy load it can stretch to seconds) between a write hitting the primary and appearing on replicas. During that gap, a read from a replica might return stale data. For most applications, this is perfectly fine. For things like showing a user their own just-posted comment, you route that specific read to the primary.
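That routing rule (replicas for normal reads, primary for reads that must see the user's own write) fits in a few lines. The connection names below are placeholders, not a real driver API:

```python
import itertools

# Placeholder connection handles; in a real app these would be DB connections.
primary = "primary-db"
replicas = itertools.cycle(["replica-1", "replica-2"])

def route(query_type: str, read_own_write: bool = False) -> str:
    """Writes always go to the primary. Reads fan out across replicas,
    except reads that must see the user's own just-written data."""
    if query_type == "write" or read_own_write:
        return primary
    return next(replicas)  # round-robin across the read replicas

print(route("write"))                      # goes to the primary
print(route("read"))                       # goes to a replica
print(route("read", read_own_write=True))  # primary, to dodge replication lag
```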

Start with stateless web tier + queue-based decoupling. These two patterns alone will get you to 99.9% availability. Cell-based architecture and read replicas are for when you’re scaling beyond what a single deployment can handle.
Think First

You’re running an e-commerce site where the checkout flow (synchronous) is backed by the same database that powers the product catalog (read-heavy). Which of these four patterns would you apply first, and why?

The catalog is read-heavy (pattern 4 helps), the order confirmation email is async (pattern 2 helps), and making the web tier stateless (pattern 1) is always step one. In what order would you tackle them?
Four battle-tested patterns keep real companies running: (1) stateless web tier — any server handles any request, so the load balancer routes around failures instantly; (2) queue-based decoupling — async work survives consumer crashes because messages wait in the queue; (3) cell-based architecture — independent user "cells" limit blast radius; (4) read replicas — reads fan out across copies while writes go to one primary. Start with patterns 1 and 2 for the biggest availability gains.
Section 12

SLA, SLO, SLI — The Business Language of Availability

So far we’ve talked about availability from an engineering perspective — redundancy, failover, patterns. But availability is also a business concept. Your customers don’t care about your architecture. They care about one question: "Is this thing going to work when I need it?" And they want that promise in writing.

That’s where three acronyms come in. They sound similar, but they mean very different things, and confusing them is a classic interview mistake. Let’s start from the bottom and work up.

[Diagram: The SLI → SLO → SLA pyramid — measure, set a target, promise it to customers. At the base, the SLI is the actual measurement from monitoring ("98.7% of requests returned 200 OK in under 500ms this month"). Above it, the SLO is the internal engineering target ("we aim for a 99.95% success rate"), kept tighter than the SLA as a safety margin. At the top, the SLA is the external legal contract: "if we break this, we pay you" in credits or refunds. SLI feeds into SLO; SLO is always stricter than SLA; SLA is the public promise.]

SLI — Service Level Indicator (The Measurement)

An SLI is simply a number you measure. It’s the raw data — the thermometer reading, not the fever threshold. Common SLIs include:

The key: SLIs are objective, measurable facts. No opinions, no targets — just numbers from your monitoring system.

SLO — Service Level Objective (The Target)

An SLO takes an SLI and sets a target for it. It’s the engineering team saying: "We commit to keeping this number above a certain threshold." For example:

SLOs are internal targets. Your customers usually don’t see them. They’re how your engineering team holds itself accountable. And critically, SLOs should always be stricter than the SLA. If your SLA promises 99.9%, your SLO should be 99.95%. That gap is your safety margin — if you start missing your SLO, you have time to fix things before you breach the legal contract.

SLA — Service Level Agreement (The Legal Contract)

An SLA is the public, legally binding promise. It says: "If our service drops below this level, here’s what we owe you." That "owe you" part is real — AWS, Google Cloud, and Azure all publish SLAs with financial penalties. Drop below 99.99%? They credit 10% of your bill. Drop below 99%? 30% credit. It’s not charity — it’s a contract.

Not every service needs an SLA. Internal tools, beta products, and free-tier services often don’t have them. But any service customers are paying serious money for? They’ll want an SLA.

Error Budgets — The Bridge Between Engineering and Product

Here’s where this gets really clever. If your SLO is 99.95%, that means you’re allowed 0.05% failures. That’s not a target to hit — it’s a budget to spend. And like any budget, you can choose how to spend it.

Want to deploy a risky new feature? That might cause a few errors. Want to do database maintenance? That might cause a brief slowdown. Want to run a chaos engineering experiment? That might crash a server. All of these "cost" some of your error budget. As long as you don’t overspend, you’re fine.

Run the numbers: at 99.95%, out of 10 million monthly requests, 5,000 can fail and you’re still within your SLO. That sounds like a lot, but a single bad deployment could burn through thousands in minutes. A flaky database connection could drain 500 per day. Suddenly that budget feels very tight.
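The arithmetic is worth making concrete. A small budget calculator using the numbers from the text:

```python
def error_budget(monthly_requests: int, slo: float) -> int:
    """Failed requests allowed per month while still meeting the SLO."""
    # round() avoids floating-point edge cases like 4999.999...
    return round(monthly_requests * (1 - slo))

budget = error_budget(10_000_000, 0.9995)
print(budget)  # 5,000 allowed failures

# One bad deploy burning 1,800 errors consumes over a third of the month:
print(1_800 / budget)
```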

[Chart: Error budget burn-down — two very different months, plotted as errors remaining (5,000 down to 0) over days 1-30. The healthy month burns steadily and ends with budget left over for features. The bad month: one bad deploy burns 1,800 errors in 2 hours, the budget hits zero, and the team enters the danger zone — freeze deploys, stabilize only.]

Engineers want stability (fewer deploys, less risk). Product managers want features (more deploys, more experiments). The error budget decides who wins: if the budget is healthy, ship features freely. If it’s burned, freeze deployments and stabilize. No more arguments about "should we deploy this risky change?" — the math decides. Google popularized this approach in their SRE book, and it works because it turns a subjective debate into an objective number.
Think First

Your SLO is 99.95% and you serve 50 million requests per month. On Day 3, a bad config change causes 8,000 errors in one hour. How much of your monthly error budget did that one incident consume?

Monthly error budget = 50M × 0.0005 = 25,000 errors. The incident burned 8,000. That’s 32% of the entire month’s budget gone in one hour. You’d better be very careful for the remaining 27 days.
SLI is the measurement (what you actually observe), SLO is the target (what you aim for), and SLA is the legal contract (what you promise customers, with financial penalties). Error budgets turn reliability into a resource: your SLO of 99.95% means 0.05% of requests can fail — that’s your budget for deploying features, running experiments, and doing maintenance. Burn through the budget too fast and deploys get frozen until things stabilize.
Section 13

Common Mistakes — Availability Traps Everyone Falls Into

You can read every availability pattern in the book and still have your system go down — because the most common failures come from things people assume are fine but never actually verify. These are the traps. Every single one has taken down real production systems.

[Diagram: "We’re Highly Available!" — are you sure? An architecture with a redundant load-balancer pair and three app servers looks redundant, but hides three single points of failure: a single database (no replica, no failover), a single payment-gateway provider, and everything deployed in one region — us-east-1 goes down, total outage.]

The trap: You add 3 app servers behind a load balancer and feel great about your redundancy. But all 3 servers talk to a single database with no replica. The database crashes — and all 3 "highly available" app servers become useless, because they can’t read or write data.

Why it’s dangerous: People naturally think "more servers = more available." But the chain is only as strong as its weakest link. If your app servers are N+2 but your database is N+0 (no redundancy), your system availability is limited by the database’s uptime.

How to fix it: Add at least one hot standby replica with automatic failover. For critical databases, add read replicas too. Tools like PostgreSQL streaming replication, MySQL Group Replication, or managed services (RDS Multi-AZ, Cloud SQL HA) make this straightforward.

The trap: "Our servers never crash!" Great — but if the one load balancer that routes to those servers goes down, it doesn’t matter that the servers are healthy. Users still can’t reach them.

Why it’s dangerous: Reliability is about individual components not failing — a reliable component rarely breaks. Availability is about the system being accessible even when components do fail. You can have perfectly reliable components and still have terrible availability if there’s no redundancy, because everything eventually fails, and when a single point of failure finally does, there’s nothing to take over.

How to fix it: Map every component in your architecture and ask: "If this specific thing dies right now, can users still reach the site?" If the answer is "no" for any single component, that’s a SPOF that needs redundancy.

The trap: You set up a primary database and a replica. You configure automatic failover. You write it up in the architecture doc and move on. Six months later, the primary crashes for real — and the failover doesn’t work. Maybe the replication was silently broken. Maybe the failover script has a bug. Maybe the DNS TTL is cached and clients keep connecting to the dead primary.

Why it’s dangerous: Untested failover is barely better than no failover. It gives you false confidence — the worst kind of risk, because you stop preparing alternatives while believing you have a safety net.

How to fix it: Schedule regular failover drills. Actually kill the primary (in a controlled way, during business hours) and verify the replica takes over. Netflix runs Chaos Monkey continuously. You don’t need to be that aggressive, but a quarterly failover test should be the minimum for any critical system.

The trap: Your load balancer pins each user to a specific server (sticky sessions). User A always goes to Server 2. This works great for performance — until Server 2 dies. Now User A’s session is gone. They get logged out. Their shopping cart disappears. Their draft email is lost.

Why it’s dangerous: Sticky sessions turn every server into a mini SPOF for the users pinned to it. Instead of one server failure being invisible (because the load balancer routes around it), it directly affects a specific group of users.

How to fix it: Move to a stateless web tier (Pattern 1 from Section 11). Store sessions in Redis or another external store. If you must use sticky sessions for legacy reasons, configure the load balancer to fall back to a different server when the original is unhealthy — and accept that those users will lose their session state.

The trap: Your health check endpoint returns 200 OK if the process is running. The load balancer says "Server is healthy!" But the server’s database connection pool is exhausted. Or the disk is full. Or a dependent service is timing out. The server is "alive" but completely unable to serve real requests.

Why it’s dangerous: Shallow health checks give you a false green light. The load balancer keeps sending traffic to a server that can’t actually handle it, causing user-visible errors even though monitoring says everything is fine.

How to fix it: Implement deep health checks. Your health endpoint should verify it can actually connect to the database (run a simple query), check disk space, verify cache connectivity, and confirm critical dependent services are reachable. If any of those fail, return 503 so the load balancer routes traffic elsewhere.
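A deep health check is just a handler that exercises real dependencies before reporting healthy. A sketch, with the database and cache probes stubbed out as placeholders (real ones would run a trivial query and ping the cache):

```python
import shutil

def check_database() -> bool:
    return True  # placeholder: run e.g. "SELECT 1" against the real database

def check_cache() -> bool:
    return True  # placeholder: PING the cache and expect a PONG

def check_disk(min_free_fraction: float = 0.05) -> bool:
    """Fail if less than 5% of the disk is free."""
    usage = shutil.disk_usage("/")
    return usage.free / usage.total >= min_free_fraction

def health() -> tuple:
    """Return 200 only if every dependency is actually usable.
    A 503 tells the load balancer to route traffic elsewhere."""
    checks = {"db": check_database(), "cache": check_cache(), "disk": check_disk()}
    if all(checks.values()):
        return 200, "OK"
    failed = [name for name, ok in checks.items() if not ok]
    return 503, "FAILING: " + ", ".join(failed)

print(health())
```

Keep the probes cheap (a trivial query, not a full table scan): the load balancer calls this endpoint every few seconds.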

The trap: Your entire stack runs in us-east-1. You have redundant servers, replicated databases, the works. Then AWS us-east-1 has a major outage (which happens roughly once a year). Every user worldwide is affected, because there’s no other region to fail over to.

Why it’s dangerous: Cloud regions are not immune to outages. They have fires, network partitions, power failures, and bad deployments just like anything else. If your entire system lives in one region, a regional outage is a total outage for you.

How to fix it: For global services, deploy to at least two regions. Use DNS-based routing (Route 53, Cloudflare) to direct users to the nearest healthy region. Keep databases synchronized across regions using async replication. It’s expensive and complex, but if a single-region outage would cost you more than the multi-region infrastructure, the math is simple.

The trap: You’ve made your own infrastructure bulletproof. Three app servers, replicated database, multi-zone deployment. But your checkout flow depends on a single payment gateway (Stripe). Your login depends on a single auth provider (Auth0). Your email notifications depend on a single email service (SendGrid). Any of those goes down, and a critical feature stops working.

Why it’s dangerous: Your system’s availability is bounded by the least available external dependency. If Stripe has 99.95% uptime and you have 99.99% uptime, your checkout availability is capped at 99.95% — Stripe’s number, not yours.

How to fix it: For critical dependencies, have a fallback provider or a graceful degradation plan — keep operating with reduced functionality instead of failing completely (if the recommendation engine is down, show bestsellers instead of personalized picks; the user still gets something useful). Can’t process payments? Let users complete the order and retry the charge later. Auth provider down? Use cached token validation for a grace period. The key is identifying which external services are on the critical path and having a plan for each one.

Most availability failures don’t come from the thing you planned for. They come from the thing you forgot about. The database you didn’t replicate. The failover you didn’t test. The health check that lied. The external service you assumed would always be there. The only defense is to systematically ask "what if THIS specific thing dies?" for every component in your architecture — including the ones you didn’t build.
Think First

Draw your current system architecture (or an imaginary one). For each component, label it: redundant or SPOF. Don’t forget external services, DNS, the load balancer itself, and the deployment pipeline. How many hidden SPOFs did you find?

Most people find 2-3 they didn’t expect. Common surprises: the CI/CD pipeline (can’t deploy a fix if Jenkins is down), DNS (one provider), and monitoring itself (if your alerting system goes down, you don’t know anything else is down).
The most common availability traps: (1) database as a hidden SPOF behind "redundant" servers, (2) confusing component reliability with system availability, (3) never testing failover, (4) sticky sessions without fallback, (5) shallow health checks that miss real problems, (6) single-region deployment for global services, (7) forgetting that external dependencies cap your availability. Most outages come from the thing you forgot about, not the thing you planned for.
Section 14

Interview Playbook — Nail Availability Questions

Availability questions show up in almost every system design interview. The interviewer isn’t looking for memorized nines — they want to see a structured thinking process. Here’s a framework that works for any availability question, followed by the most common questions with approach strategies.

[Diagram: The 6-step availability framework — use this for any availability question in a system design interview. (1) Define: what nines do we need? Do the math. (2) Identify: where are the SPOFs? List every one. (3) Redundancy: add backups for each SPOF — N+1 or 2N? (4) Failover: how does the backup take over — auto or manual? (5) Monitor: how do you know it’s down? Alerts + SLIs. (6) Degrade: what breaks gracefully? The key insight: always lead with math. "99.99% means 52 minutes of downtime per year" shows you understand the real constraints; "we need 3 nines for this service" is generic and doesn’t show depth. Concrete numbers vs. buzzwords — interviewers notice.]

Now let’s apply this framework to the most common interview questions. For each one, notice how the answer starts with numbers and constraints before jumping into architecture.

Step 1 — Target: "E-commerce needs at least 99.99% during peak hours. That’s 52 minutes of downtime per year. Checkout is the most critical flow — if users can’t buy, we’re losing revenue directly. Browsing is important but less critical. Analytics can be eventually consistent."

Step 2 — SPOFs: "Let me map the critical path: DNS → CDN → Load Balancer → App Servers → Database → Payment Gateway. Each of these is a potential SPOF. I also have Redis for sessions, a queue for order processing, and an email service."

Step 3 — Redundancy: "DNS: use two providers (Route 53 + Cloudflare). LB: active-active pair across two AZs. App servers: stateless, N+2, spread across 3 AZs. Database: primary + hot standby with automatic failover, plus 2 read replicas. Redis: clustered, 3 nodes. Queue: managed service like SQS (inherently redundant)."

Step 4 — Failover: "Database failover is the trickiest. I’d use a managed service like RDS Multi-AZ for automatic failover in under 30 seconds. For app servers, the LB health checks handle it — unhealthy server gets removed from the pool in 10-20 seconds. For the payment gateway, I’d integrate a secondary provider (Stripe primary, Adyen fallback) and route to the fallback if the primary returns errors."

Step 5 — Monitoring: "SLIs: request success rate, P99 latency, checkout completion rate. Alert at 0.5% error rate — don’t wait for 5%. External synthetic monitoring from 3 regions to catch issues our internal monitoring might miss."

Step 6 — Degradation: "If the recommendation engine is down, show bestsellers. If search is slow, show cached results. If the payment gateway is down, let users save to cart and notify them when payments are back. The checkout flow is the last thing we degrade — everything else drops first."

Common Interview Questions

What the interviewer wants: They want to hear the math first, then the architecture. Most candidates jump straight to "add more servers" without quantifying what 99.99% actually means.

Your approach: Start by converting: "99.99% means 52 minutes of allowed downtime per year, or about 4 minutes per month. That’s extremely tight — a single slow deployment could eat the entire monthly budget." Then walk through the 6-step framework layer by layer. Emphasize automated failover (no human in the loop for initial response), multi-AZ deployment, and how you’d calculate the combined SLA of your architecture.
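The conversion the answer leads with is one line of arithmetic, worth being able to produce on the spot:

```python
def downtime_per_year(availability: float) -> float:
    """Allowed downtime in minutes per year for a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability)

for nines in (0.999, 0.9999, 0.99999):
    print(f"{nines} -> {downtime_per_year(nines):.1f} min/year")
```

Running it shows the cliff between three nines (about 526 minutes a year) and four nines (about 53 minutes a year), which is the gap the rest of the answer is built around.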

Key phrase: "With 52 minutes per year, I can’t afford manual intervention for common failures. Everything on the critical path needs automated detection and failover."

What the interviewer wants: They’re testing whether you think about the full timeline: detection, impact, recovery, and data implications. Not just "the replica takes over."

Your approach: Walk through the timeline. "The health check detects the failure within 10-15 seconds. During the detection window, write requests fail — but our app returns cached data for reads, so browsing still works. The managed database service promotes the replica within 30 seconds. There’s a window of replication lag — maybe 200ms of committed transactions that haven’t replicated. Those writes might be lost. In-flight transactions get connection errors and retry with idempotency keys. Total user-visible impact: about 30-45 seconds of degraded writes."
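The idempotency-key retry mentioned in that timeline can be sketched simply: one key is generated per logical charge and reused across retries, so a replay after a connection error cannot double-charge. The `charge()` function here is a hypothetical stand-in for a real payment API, and `_processed` stands in for the server's dedup store:

```python
import uuid

_processed = {}  # stand-in for the server-side idempotency store

def charge(amount: int, idempotency_key: str) -> str:
    """Hypothetical payment call: a replay with the same key returns the
    original result instead of charging twice."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = f"charged {amount}"
    _processed[idempotency_key] = result
    return result

def charge_with_retry(amount: int, attempts: int = 3) -> str:
    key = str(uuid.uuid4())  # ONE key for all attempts of this logical charge
    last_error = None
    for _ in range(attempts):
        try:
            return charge(amount, key)
        except ConnectionError as exc:  # e.g. the failover window
            last_error = exc
    raise last_error

print(charge_with_retry(100))
```

The crucial detail is generating the key before the first attempt, not per attempt; otherwise the retry looks like a brand-new charge.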

Key phrase: "Let me walk through the blast radius and the timeline of impact."

What the interviewer wants: This tests your understanding of multi-region architecture, DNS routing, and the hard problems of cross-region data synchronization.

Your approach: "First, detection: external health checks from multiple locations detect the region is unreachable. DNS routing (Route 53 health checks) stops sending traffic to the failed region within 60-120 seconds. Traffic shifts to the secondary region, which has been running warm with replicated data. The hard part is the database: we use async replication across regions, so there’s a replication lag of maybe 1-5 seconds. Writes during that lag window might be lost — the RPO is 5 seconds. For critical data (payments), we use synchronous replication to a local standby first, then async to the other region."

Key phrase: "The trade-off is between RPO (how much data can I lose) and write latency (synchronous cross-region replication adds 50-100ms to every write)."

What the interviewer wants: They want to see you distinguish SLI/SLO/SLA and know how to actually measure availability in practice — not just state the formula.

Your approach: "I define availability using SLIs — the percentage of requests that returned a successful response within an acceptable latency threshold, say 500ms. I set an SLO of 99.95% internally, which is stricter than the 99.9% SLA we promise customers. That gap gives us a safety margin. I calculate a monthly error budget: for 10M requests at 99.95%, that’s 5,000 allowed failures. I track burn rate on a dashboard — if we’re burning faster than linear, we investigate before we breach the SLO."

Key phrase: "The SLO is always tighter than the SLA. The gap between them is our safety margin."

Key Vocabulary to Use Naturally

Term | Plain English | Use In Interview
Availability Zone (AZ) | A separate data center building (same city, different power/network) | "I’d deploy across 3 AZs for zone-level fault tolerance"
Failover | The backup taking over when the primary dies | "Automated failover within 30 seconds for the database tier"
Error Budget | How many failures you’re "allowed" per month | "At 99.95%, our monthly error budget is 5,000 out of 10M requests"
Single Point of Failure | One thing that takes everything down if it breaks | "I need to eliminate SPOFs at every layer of the stack"
Blast Radius | How many users are affected when something breaks | "Cell-based architecture limits the blast radius to 5-10% of users"
Active-Active | All copies serve traffic simultaneously | "Active-active load balancing across two regions for instant failover"
Always lead with the MATH. Saying "99.99% means 52 minutes of downtime per year" shows the interviewer you understand the real implications, not just the buzzwords. Then connect it to architecture: "With only 4 minutes per month, I can’t afford manual failover — detection and recovery must be automated." That one sentence demonstrates more depth than 10 minutes of listing technologies.
Think First

You’re in an interview and get asked: "Your payment service processes 1 million transactions per day with a 99.95% SLO. Last month you had 3 incidents totaling 45 minutes of downtime. Did you meet your SLO?"

Monthly volume: 1M transactions/day × 30 days = 30M requests. Allowed failures: 30M × 0.0005 = 15,000. But the question gives downtime in minutes: 30 days = 43,200 minutes, and 99.95% allows 21.6 minutes. 45 minutes of downtime = SLO breached. The key: know whether "availability" is measured by requests or by time — and clarify with the interviewer.
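Both measures from the answer, side by side, using the numbers given in the question:

```python
# The same month measured two ways: by requests and by wall-clock time.
SLO = 0.9995
requests_per_day = 1_000_000
days = 30

# Request-based budget: how many failed requests fit in the month
allowed_failures = round(requests_per_day * days * (1 - SLO))
print(allowed_failures)

# Time-based budget: how many minutes of downtime fit in the month
minutes_in_month = days * 24 * 60  # 43,200
allowed_downtime = minutes_in_month * (1 - SLO)
print(round(allowed_downtime, 1))

downtime = 45  # minutes of downtime last month, from the question
print("SLO met" if downtime <= allowed_downtime else "SLO breached")
```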
Use the 6-step framework for any availability interview question: (1) define the target in nines and convert to real downtime, (2) identify SPOFs, (3) add redundancy, (4) design failover, (5) set up monitoring with SLIs, (6) plan graceful degradation. Always lead with concrete math ("52 minutes per year") rather than buzzwords. Walk through failure timelines step by step (detection → impact → recovery → data loss), and know the vocabulary: availability zone, error budget, blast radius, active-active, failover.
Section 15

Practice Exercises — Build Your Availability Intuition

Reading about availability is one thing. Doing the math yourself — and making real architectural decisions — is another. These five exercises go from napkin arithmetic to full-blown system design. Try each one on paper before opening the hints.

Your web app has 3 app servers behind a single load balancer (a device or service that distributes incoming network requests across multiple backend servers so no single server gets overwhelmed). Each app server has an individual availability of 99.9% (three nines).

Part A: Calculate the overall availability of just the app-server tier. (Hint: the servers are in parallel — the tier only fails if all three fail simultaneously.)

Part B: Now factor in the load balancer itself, which has 99.99% availability. The load balancer sits in series with the app-server tier — if the LB dies, no traffic reaches any server. What’s the combined system availability?

Part C: What’s the lesson here about single points of failure? (A SPOF is any component whose failure takes the entire system down. If only one load balancer exists and it dies, it doesn’t matter that you have 50 healthy app servers behind it — nobody can reach them.)

Part A — Parallel availability: Each server has a 0.1% chance of being down (failure probability = 0.001). All three fail simultaneously with probability 0.001 × 0.001 × 0.001 = 0.000000001. So the app tier availability is:

1 − (1 − 0.999)³ = 1 − 0.000000001 = 0.999999999

That’s 99.9999999% — essentially nine nines. The parallel setup made the app tier absurdly reliable.

Part B — Serial with load balancer: When components sit in series (one after another), you multiply their availabilities:

0.9999 × 0.999999999 ≈ 0.9999

The combined availability drops to roughly 99.99%. Your nine-nines app tier got dragged down to four nines by the load balancer.

Part C — The lesson: The weakest link in a serial chain determines overall availability. Your three redundant app servers gave you essentially perfect uptime, but a single load balancer capped everything at 99.99%. This is why production setups use redundant load balancers (active-passive or active-active pair with a virtual IP). Never let the component that’s supposed to protect you become the single point of failure.
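The two formulas from Parts A and B generalize to any stack. A small calculator reproducing the exercise's numbers:

```python
def parallel(*availabilities: float) -> float:
    """A redundant tier fails only if every copy fails at once."""
    failure = 1.0
    for a in availabilities:
        failure *= (1 - a)
    return 1 - failure

def serial(*availabilities: float) -> float:
    """A chain is up only when every link is up: multiply."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

app_tier = parallel(0.999, 0.999, 0.999)  # three redundant servers
system = serial(0.9999, app_tier)         # single load balancer in front
print(app_tier)  # essentially nine nines
print(system)    # dragged back down to roughly four nines
```

Chaining `serial()` over every layer of a design (DNS, LB, app tier, database) is also how you estimate the combined SLA of a whole architecture.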

Your team runs a web API with an SLO of 99.95% (Service Level Objective — the internal target your team sets for availability or performance; it’s stricter than the SLA you promise customers). You serve 5 million requests per day.

Part A: How many failed requests per day fit inside your error budget (the maximum number of errors your system is "allowed" before you breach your SLO, calculated as total requests × (1 − SLO))?

Part B: Your team just deployed a buggy release that caused 3,000 errors in 2 hours before someone rolled it back. How much of your daily error budget did that single incident consume?

Part C: What should the team do differently next time?

Part A — Daily error budget:

5,000,000 × (1 − 0.9995) = 5,000,000 × 0.0005 = 2,500 errors/day

You can afford 2,500 failed requests per day and still meet your SLO.

Part B — Budget consumed:

3,000 ÷ 2,500 = 1.2 = 120%

That single 2-hour incident burned through your entire daily error budget and then some. You’ve already breached the SLO for the day. If your SLO is tracked monthly, this one deploy consumed roughly 4% of your monthly budget (3,000 ÷ 75,000) in just 2 hours.

Part C — What to do differently:

  • Canary deployments — roll the new code to 5% of servers first. If error rate spikes, auto-rollback before the blast radius grows.
  • Automated rollback triggers — if the error rate exceeds 2× baseline within 10 minutes, the deploy system reverts automatically. Humans shouldn’t need to be awake for this.
  • Feature flags — wrap new functionality behind a flag so you can disable the buggy feature without rolling back the entire deployment.
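The automated-rollback trigger in that list reduces to a threshold comparison; everything around it (metrics collection, the actual revert) is real tooling this sketch only hints at, and the error rates below are illustrative:

```python
def should_rollback(current_error_rate: float,
                    baseline_error_rate: float,
                    threshold_multiplier: float = 2.0) -> bool:
    """Trip the rollback if errors exceed N x the pre-deploy baseline."""
    return current_error_rate > baseline_error_rate * threshold_multiplier

# Pre-deploy baseline: 0.05% errors. Ten minutes after the deploy: 0.4%.
if should_rollback(current_error_rate=0.004, baseline_error_rate=0.0005):
    print("error rate exceeds 2x baseline: reverting deploy")
```

In practice the check runs over a sliding window (say, 10 minutes) so a single noisy data point doesn't trigger a revert.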

Your e-commerce platform stores all product data, orders, and user accounts in a single PostgreSQL database. There’s no replication. If that server dies, the entire site goes down.

Design a high-availability database setup. In your design, address:

  • What replication type do you choose — synchronous or asynchronous? Why? (Replication copies data from the primary to one or more other servers; synchronous waits for confirmation before acknowledging a commit, asynchronous doesn’t.)
  • What failover strategy do you use — manual (a human runs a script) or automatic (a tool like Patroni detects the failure and promotes a replica)? Automatic is faster but riskier.
  • How do you handle connection pooling during failover? (A pooler like PgBouncer maintains a pool of reusable database connections that app servers share, reducing load on the database.)
  • What’s the expected failover time?

Replication type: Use synchronous replication to one standby and asynchronous to a second. Here’s the reasoning: synchronous means zero data loss (every committed transaction is confirmed on the standby before the primary says “committed”). The trade-off is slightly higher write latency (~1-5ms extra for local-region sync). For an e-commerce site where losing even one order is unacceptable, that trade-off is worth it. The async replica gives you a third copy for read scaling and disaster recovery without adding more write latency.

Failover strategy: Use automatic failover with a tool like Patroni (which uses a distributed consensus store like etcd to decide which replica becomes the new primary). Manual failover means a human has to wake up, assess the situation, and run commands — that’s 10-30 minutes. Patroni can detect failure and promote a replica in 10-30 seconds.

Connection pooling: Place PgBouncer between your app servers and the database. Configure it to point at the Patroni-managed virtual IP (or use Patroni’s REST API for endpoint discovery). During failover, PgBouncer’s connections to the old primary break. Once Patroni promotes the new primary and updates the endpoint, PgBouncer reconnects automatically. App servers see a brief pause (10-30 seconds) but don’t need any code changes.

Expected failover timeline:

  • Failure detection: 5-10 seconds (Patroni health checks every few seconds)
  • Consensus + promotion: 5-15 seconds (etcd elects new leader, Patroni promotes replica)
  • Connection re-establishment: 2-5 seconds (PgBouncer reconnects)
  • Total: 15-30 seconds of database unavailability

Compare that to your current setup (single server, no replication) where a disk failure means hours of downtime while you restore from backups — and you lose every transaction since the last backup.

You run a SaaS product deployed in a single AWS region (us-east-1, across 3 Availability Zones — essentially separate data centers within the region, connected by high-speed, low-latency links but with independent power and networking; if one AZ floods, the others keep running). A large enterprise customer asks: "What happens if the entire us-east-1 region goes down?"

Design a multi-region architecture. Address:

  • Which second region do you pick and why?
  • How do you replicate data across regions?
  • How do you handle writes — single-region writes or multi-region writes?
  • How does traffic routing work when a region fails?

Region choice: Pick us-west-2 (Oregon). It’s geographically distant from us-east-1 (Virginia) — different earthquake zones, power grids, and weather patterns. Cross-region latency is ~60-70ms, which is manageable for async replication. If your customer base is global, consider eu-west-1 (Ireland) instead.

Data replication: Use asynchronous replication across regions. Synchronous cross-region replication adds 60-70ms to every write — that’s brutal for user experience. Async means the secondary region is 1-5 seconds behind the primary, which is acceptable for most SaaS workloads. For the database, use Amazon Aurora Global Database (automatic async replication with <1s lag) or set up PostgreSQL logical replication.

Write strategy: Start with single-region writes (active-passive). All writes go to us-east-1 (primary). The secondary region (us-west-2) handles reads and stands ready for failover. Why not multi-region writes? Because conflict resolution is extremely hard. If two users edit the same record in two regions simultaneously, you need a conflict resolution strategy (last-writer-wins? merge? user-decides?). For most SaaS products, single-region writes are simpler and sufficient.

Traffic routing: Use Route 53 health checks + DNS failover. Route 53 pings your primary region’s health endpoint every 10 seconds. If 3 consecutive checks fail, it automatically updates DNS to point at the secondary region. User impact: 30-60 seconds of errors (health check detection) + DNS propagation (TTL-dependent, set it low — 60 seconds). Total: 1-3 minutes of degraded service during a full regional failure.
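The failover decision itself is a tiny state machine: flip to the secondary after N consecutive failed health checks, flip back after N consecutive passes. The sketch below is illustrative Python, not the actual Route 53 API; the class and method names are my own.

```python
class DnsFailoverMonitor:
    """Route 53-style failover logic: mark the primary unhealthy after
    `threshold` consecutive failed health checks, and healthy again
    only after the same number of consecutive passes (this hysteresis
    prevents flapping between endpoints)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.primary_healthy = True

    def record_check(self, passed: bool) -> str:
        """Record one health-check result; return which endpoint DNS
        should currently answer with."""
        if passed:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if self.consecutive_successes >= self.threshold:
                self.primary_healthy = True
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.consecutive_failures >= self.threshold:
                self.primary_healthy = False
        return "primary" if self.primary_healthy else "secondary"
```

With checks every 10 seconds and a threshold of 3, detection takes about 30 seconds, which matches the 30-60 second window described above.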

Key challenges:

  • Data lag — the secondary region is 1-5s behind. After failover, some very recent writes may be missing. For a SaaS app, this usually means a user might not see their last action. Communicate this clearly: “We guarantee RPO < 5 seconds.”
  • Session state — store sessions in a globally replicated store (DynamoDB Global Tables, or Redis with cross-region replication) so users don’t get logged out during failover.
  • Cost — running a full second region roughly doubles your infrastructure cost. An active-passive secondary (minimal compute, mainly data replication) costs about 30-50% more.

Design a globally available payment processing system that can survive an entire region failure with zero data loss. Payments are financial transactions — losing even one is unacceptable. The system processes payments from customers worldwide.

Address these questions:

  • What replication strategy guarantees zero data loss across regions?
  • How do you handle the latency implications of that strategy?
  • What’s the minimum number of regions, and where do you place them?
  • How does the system handle a region failure mid-transaction?

Replication for zero data loss: You need synchronous replication across regions. This is the only way to guarantee that every committed transaction exists in multiple regions before the user sees “payment successful.” Async replication has a lag window where data only exists in one region — if that region dies during the window, the transaction is lost. For payments, that’s not acceptable.

Latency implications: Synchronous cross-region replication adds the round-trip latency between regions to every write. That’s 60-80ms for US coast-to-coast, 80-120ms for US-to-Europe. A payment transaction that took 200ms now takes 280-320ms. For a payment — where users expect a brief wait anyway — this is acceptable. You’re not building a real-time game; you’re confirming a financial transaction.

To minimize this penalty:

  • Route payments to the nearest region — a user in London writes to eu-west-1, which synchronously replicates to us-east-1 (80ms). A user in New York writes to us-east-1, which replicates to us-west-2 (60ms).
  • Use consensus-based replication (like Raft or Paxos) so a write only needs acknowledgment from a majority of replicas, not all of them. With 3 regions, a write succeeds when 2 out of 3 acknowledge. This means the write only waits for the fastest secondary, not the slowest.

Minimum regions: 3 regions. With consensus-based replication, you need a majority to agree on every write. With 2 regions, losing one means you have 1 out of 2 — no majority, system halts. With 3 regions, losing one leaves 2 out of 3 — majority intact, system continues. Place them in: us-east-1 (Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore). This gives global coverage with no two regions on the same continent.
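The "wait for the fastest follower, not the slowest" claim can be made concrete with a small sketch (illustrative Python, not a real consensus implementation):

```python
def quorum_commit_latency(follower_latencies_ms, total_regions):
    """Latency until a majority quorum has acknowledged a write.

    With consensus replication the leader's own write counts as one
    vote, so it needs acks from (majority - 1) followers, and it only
    waits for the fastest of those, not for every region.
    """
    majority = total_regions // 2 + 1
    needed_acks = majority - 1  # leader already holds one vote
    return sorted(follower_latencies_ms)[needed_acks - 1]

# 3 regions, leader in us-east-1, followers acking in 60ms (us-west-2)
# and 80ms (eu-west-1): quorum is 2, so the write commits after the
# faster follower acks (60ms), not the slower one (80ms).
```

This is why adding a distant third region does not slow down writes between the two closer ones: the far region's ack is simply not on the commit path.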

Handling mid-transaction region failure:

  • If the region fails before the write is committed — the transaction was never confirmed to the user. The payment gateway retries, hitting a healthy region. No data loss, no duplicate charge, thanks to idempotency keys (a unique identifier attached to each payment request; if the same request is sent twice due to a retry, the system recognizes the duplicate key and returns the original result instead of processing the payment again).
  • If the region fails after the write is committed — the transaction already exists in at least 2 regions (majority quorum). The remaining regions continue serving without data loss.
  • If the region fails during the write (acknowledged by primary, not yet by quorum) — the consensus protocol handles this. Raft/Paxos guarantees that either the write is committed everywhere it needs to be, or it’s rolled back. No partial commits.
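The idempotency-key behavior from the first bullet can be sketched as follows. This is a minimal in-memory sketch with names of my own choosing; in production the key-to-result map would live in a replicated store, not a Python dict.

```python
class PaymentProcessor:
    """Deduplicates payment requests by idempotency key: a retried
    request with the same key replays the stored result instead of
    charging the customer a second time."""

    def __init__(self):
        self._results = {}       # idempotency_key -> stored charge result
        self.charges_executed = 0

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._results:
            # Duplicate (e.g. a gateway retry after failover): replay.
            return self._results[idempotency_key]
        self.charges_executed += 1
        result = {"status": "captured", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

The key property: the client can retry freely against any healthy region, because a retry is indistinguishable from reading the original result.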

Real-world examples: Google Spanner uses exactly this approach — synchronous replication across 5+ regions using Paxos consensus. CockroachDB and YugabyteDB are open-source databases that implement similar patterns. Stripe and major payment processors use multi-region synchronous setups for their core transaction ledgers.

Five exercises from simple math to hard architecture: (1) Parallel vs serial availability math — the LB is the bottleneck. (2) Error budget calculation — one bad deploy can burn 120% of your daily budget. (3) PostgreSQL HA design — Patroni + sync replication for 15-30s failover. (4) Multi-region SaaS — active-passive with Route 53 failover. (5) Global payments — synchronous replication across 3 regions with consensus quorum for zero data loss.
Section 16

Monitoring for Availability

Here’s a truth that bites teams over and over: you cannot maintain availability if you don’t know something is broken. Every technique we’ve covered — redundancy, failover, load balancing, multi-region — is useless if the system is quietly degrading and nobody notices until users start tweeting screenshots of error pages.

Monitoring is the nervous system of availability. It’s how your infrastructure talks to you. Let’s break it into layers.

The Four Golden Signals (Google SRE)

Google’s Site Reliability Engineering (SRE) team distilled decades of experience into four metrics that matter most. (SRE is the discipline that applies software engineering principles to infrastructure and operations: SREs build automation, set SLOs, manage error budgets, and treat reliability as a feature, not an afterthought.) If you only track four things, track these:

  • Latency: how long requests take to complete. Rising latency is often the first sign of overload or a failing dependency; a system that responds in 30 seconds is technically “up” but effectively down for users. Example alert threshold: p99 latency > 2s for 5 minutes.
  • Traffic: how many requests per second hit the system. A sudden drop means something is blocking users from reaching you (DNS failure, LB issue); a sudden spike means you’re about to hit capacity limits. Example alert threshold: traffic drops > 50% vs the previous hour.
  • Errors: the rate of failed requests (5xx responses, timeouts). The most direct availability signal: if 10% of requests fail, your availability is 90% — well below any reasonable SLO. Example alert threshold: error rate > 1% for 3 minutes.
  • Saturation: how “full” your resources are (CPU, memory, disk, connections). A server at 95% CPU isn’t down yet, but it’s one traffic spike away from falling over; saturation alerts give you time to act before an outage. Example alert threshold: CPU > 85% or disk > 90% for 10 minutes.
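The “error rate > 1% for 3 minutes” style of threshold can be expressed as a small sliding-window check. Illustrative Python with names of my own choosing; real systems express this as alert rules in a monitoring tool rather than application code.

```python
from collections import deque

class ErrorRateAlert:
    """Fire only when the per-minute error rate stays above a
    threshold for `sustained_minutes` consecutive minutes (e.g.
    >1% for 3 minutes), so a single noisy minute doesn't page anyone."""

    def __init__(self, threshold=0.01, sustained_minutes=3):
        self.threshold = threshold
        # deque(maxlen=N) keeps only the last N minutes' verdicts
        self.recent = deque(maxlen=sustained_minutes)

    def record_minute(self, total_requests, failed_requests) -> bool:
        rate = failed_requests / total_requests if total_requests else 0.0
        self.recent.append(rate > self.threshold)
        # Alert only if the window is full AND every minute breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

Requiring a sustained breach is what keeps transient blips on the dashboard instead of waking up the on-call engineer.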

Availability-Specific Metrics

Beyond the four golden signals, a few metrics are specifically tuned for availability monitoring: uptime over a rolling window (e.g., the last 30 days, measured against your SLO), replication lag (how far behind your standby is, which bounds the data you lose on failover), and failover count (how often automatic failovers fire; a rising count means something upstream is unstable even if users never noticed).

Monitoring Layers: Outside-In

The best monitoring strategy works from the outside in. Start where the user is and drill down toward the infrastructure. Here’s why: internal metrics can all look green while users are experiencing a complete outage — maybe a DNS change hasn’t propagated, or a CDN edge node is misconfigured, or a firewall rule is blocking traffic. If you only monitor from the inside, you’ll miss these.

[Diagram: Monitoring Layers — Outside In]
  • Layer 1: Synthetic probes (outside your network). External bots hit your endpoints every 30-60s from multiple cities.
  • Layer 2: Load balancer health checks (edge). The LB pings /health on each server every 5-10s and removes unhealthy targets.
  • Layer 3: Application metrics (inside). Request rate, error rate, latency histograms, thread pool usage.
  • Layer 4: Database & infrastructure (core). Replication lag, connection pool usage, disk I/O, CPU, memory.
When an outer layer detects a problem, drill down toward the inner layers.

Synthetic Monitoring vs Real User Monitoring

These two approaches answer different questions, and you need both:

Synthetic monitoring uses automated bots that hit your endpoints on a schedule from multiple geographic locations, simulating a user request and checking that the response is correct and fast — think of a robot that tries to use your site every 30 seconds from 10 different cities. It answers: “Is my system reachable right now from Virginia? From London? From Tokyo?” The beauty of synthetic monitoring is that it catches outages before real users do. If your probe from Virginia fails but your probe from Oregon succeeds, you instantly know the problem is regional — maybe an ISP routing issue or a CDN edge failure.

Real User Monitoring (RUM) uses JavaScript embedded in your pages to measure what actual users experience: page load time, error rates, time to interactive. Unlike synthetic probes, RUM captures real-world conditions (slow phones, bad WiFi, ad blockers, browser quirks). It answers: “What percentage of my real users saw errors today? How long did the page take to load for a user on a 3G connection in rural India?” RUM catches problems that synthetic monitoring misses — like a JavaScript error that only happens on Safari, or a performance issue that only affects users with ad blockers.

Side by side:

  • What it tests: synthetic sends simulated requests from known locations; RUM observes actual user sessions in the wild.
  • Best for: synthetic excels at detecting outages and measuring baseline performance; RUM at understanding real user experience and edge cases.
  • Coverage: synthetic is limited to its probe locations (10-20 cities); RUM covers every user, every device, every network.
  • Speed of detection: synthetic detects within 30-60 seconds (the probe interval); RUM depends on traffic volume, so low-traffic periods are blind spots.
  • Off-hours coverage: synthetic is excellent (probes run 24/7 regardless of traffic); RUM is poor at 3 AM when few users are active.
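Classifying probe results by location, as described above, might look like the following. This is an illustrative helper with names of my own choosing; real tools such as Pingdom or Datadog Synthetics apply similar location-aware logic internally.

```python
def classify_outage(probe_results):
    """Turn one round of synthetic-probe results into a verdict.

    `probe_results` maps probe location -> bool (True = check passed).
    All locations failing suggests a global outage; a subset failing
    suggests a regional problem (ISP routing, CDN edge, etc.).
    """
    failed = [loc for loc, ok in probe_results.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) == len(probe_results):
        return "global outage"
    return "regional outage: " + ", ".join(sorted(failed))
```

For example, Virginia failing while Oregon and London pass yields a regional verdict, which tells the on-call engineer where to start looking before any user reports arrive.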

Alerting Strategy: Page vs Dashboard

The biggest mistake teams make with monitoring is alerting on everything. If your on-call engineer gets paged for a 1% CPU spike, they’ll start ignoring alerts — and they’ll miss the one that actually matters. The rule is simple:

  • Page someone (wake them up) when the issue is actively threatening availability, or will be within minutes. Examples: error rate > 5%, all synthetic probes failing, replication lag > 60s, disk > 95%.
  • Dashboard only (look at it tomorrow) when the issue is interesting but not urgent and there is no user impact right now. Examples: CPU at 70%, one slow query, cache hit rate dropped 2%, a single health check blip.
Tools like Pingdom, UptimeRobot, and Datadog Synthetics make external probes easy to set up in under 10 minutes, so you can alert your team and start mitigating before a single customer support ticket comes in. The takeaway: you can’t maintain availability if you don’t know something is broken. Monitor the four golden signals (latency, traffic, errors, saturation) and availability-specific metrics (uptime rolling window, replication lag, failover count). Use synthetic probes for early detection and RUM for real user experience. Page on-call only for availability-threatening issues; dashboard everything else.
Section 17

Cheat Sheet — Availability at a Glance

Pin this section. When you’re in a system design interview or staring at a whiteboard, these cards have everything you need in one place.

  • The nines: 99% = 87.6 hours downtime/year. 99.9% = 8.76 hours. 99.99% = 52.6 minutes. 99.999% = 5.26 minutes. Each additional nine = 10× less downtime and roughly 10× more cost.
  • Serial availability: components in a chain (LB → App → DB) multiply. A_total = A1 × A2 × A3. The weakest link sets the ceiling: three components at 99.9% each give 99.7% overall.
  • Parallel availability: redundant components. A_total = 1 − (1−A1)(1−A2). Two servers at 99.9% each in parallel give 99.9999%. Redundancy is how you beat the serial math.
  • Single point of failure (SPOF): any component that takes the entire system down if it fails. The fix: add redundancy. Common SPOFs: single database, single load balancer, single DNS provider, single region.
  • Active-Active vs Active-Passive: active-active means all nodes serve traffic simultaneously (better utilization, handles more load, but harder to keep data consistent). Active-passive means the standby waits idle until the primary fails (simpler, but wastes capacity).
  • Failover types: cold = standby is off, start it up on failure (minutes). Warm = standby is running but not serving, quick switch (seconds to a minute). Hot = standby actively serves traffic, instant switch (sub-second). Hotter = faster recovery but higher cost.
  • Multi-AZ: minimum for any production workload. AZs are separate data centers in the same region with independent power and network. Same-region latency is <2ms, so no performance penalty for redundancy.
  • Multi-region: required for global or mission-critical systems. Protects against entire region failures (rare but devastating). Trade-offs: 60-120ms cross-region latency, data replication complexity, roughly 2× cost.
  • Error budget: total requests × (1 − SLO). If SLO is 99.95% and you serve 10M requests/month, your budget is 5,000 errors. Spend it on deploys and experiments. When it’s gone, freeze changes.
  • SLI / SLO / SLA: SLI (indicator) is the metric you measure (e.g., error rate). SLO (objective) is the target you set internally (e.g., 99.95%). SLA (agreement) is the legal promise to customers (e.g., 99.9%, with credits if breached). The SLO is always stricter than the SLA.
  • Graceful degradation: when under stress, serve a reduced experience instead of crashing entirely. Examples: serve cached product pages, disable recommendations, show static pricing. A degraded experience is infinitely better than an error page.
  • Load balancer HA: a single LB is a SPOF. Fix: run two LBs (active-passive or active-active) behind a virtual IP (VIP). If the active LB dies, the passive takes over the VIP in <5 seconds; users never see a different IP address.
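The card math above is easy to verify in a few lines (illustrative Python mirroring the formulas on the cards):

```python
def serial_availability(components):
    """Chain of components: multiply availabilities (weakest link wins)."""
    total = 1.0
    for a in components:
        total *= a
    return total

def parallel_availability(components):
    """Redundant components: 1 minus the product of failure probabilities."""
    failure = 1.0
    for a in components:
        failure *= (1.0 - a)
    return 1.0 - failure

def downtime_minutes_per_year(availability):
    return (1.0 - availability) * 365 * 24 * 60

def error_budget(total_requests, slo):
    return total_requests * (1.0 - slo)

# Three 99.9% components in series: ~99.7% overall.
# Two 99.9% servers in parallel: 99.9999%.
# 99.99% uptime: ~52.6 minutes of downtime per year.
# 99.95% SLO on 10M requests/month: a 5,000-error budget.
```

Running these against the card numbers is a good sanity check before an interview whiteboard session.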
Twelve quick-reference cards covering availability essentials: the nines, serial and parallel formulas, SPOFs, active-active vs active-passive, failover types (cold/warm/hot), multi-AZ, multi-region, error budgets, SLI/SLO/SLA, graceful degradation, and load balancer HA. Pin this for interviews.
Section 18

Connected Topics — Where to Go Next

Availability doesn’t live in a vacuum — it connects to nearly every other system design concept. The topics below go deeper on the tools and ideas referenced throughout this page. Pick whichever catches your interest or maps to your next interview prep goal.

Availability connects to reliability, scalability, performance, load balancers, CAP theorem, replication, caching, DNS, message queues, CDNs, monitoring, and back-of-envelope estimation. Each topic deepens one piece of the availability toolkit.