TL;DR — The Two Generals' Dilemma
- Why every distributed system must choose between correct answers and always answering
- The bank branch analogy that makes the entire CAP trade-off click
- Why network failures aren't a "maybe" — they're a guaranteed fact of distributed life
- How DynamoDB, Spanner, and MongoDB each made different choices (and why)
When the network breaks, you must choose: serve stale data or refuse to serve. You cannot guarantee both.
Imagine two bank branches in the same city. They share one customer account — a joint account with $1,000 in it. On a normal day, every time someone withdraws money at Branch A, the system instantly tells Branch B the new balance. Everything stays in sync. Life is good.
Now imagine the phone line between the branches goes dead. At that exact moment, a person walks into Branch A and withdraws $800. The balance should be $200. But Branch B doesn't know that — it still thinks the balance is $1,000. Seconds later, someone walks into Branch B and tries to withdraw $500.
Branch B has a choice, and only two options exist:
Option A — Stay available: Branch B checks its own records, sees $1,000, and hands over $500. The customer walks out happy. But now the real balance is negative $300. The bank just lost money because it served stale data. The system was available — in CAP terms, every request gets a response; the system never says "come back later," even if the answer is slightly outdated — but inconsistent: different parts of the system now disagree about the current state. Branch A says the balance is $200, Branch B says $500. Both are "correct" from their own perspective, but they can't both be right.
Option B — Stay consistent: Branch B says "Sorry, our systems are temporarily down. I can't process withdrawals right now." The customer is frustrated and walks out. But the bank's books are still accurate — nobody got money they shouldn't have. The system was consistent — in CAP terms, every read returns the most recent write; if Branch A accepted a withdrawal, Branch B must know about it before serving any requests — but unavailable. Unavailable doesn't mean the server crashed: it means the server deliberately refuses to answer because it can't guarantee correctness.
There is no Option C. Branch B cannot magically know about the $800 withdrawal — the phone line is dead. It can either guess (and risk being wrong) or refuse (and risk frustrating the customer). That's the entire CAP theorem in one story: proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002, it states that a distributed system can provide at most two of three guarantees — Consistency, Availability, and Partition tolerance.
This isn't some edge case that only happens at Google scale. If you have any data that lives on more than one machine — a database with a replica, a cache alongside a database, two microservices that both store user state — you're in CAP territory. And every distributed database you'll ever use has already made this choice for you. DynamoDB chose availability (it always answers, even if the data might be stale). Google Spanner chose consistency (it will delay your read until it's sure you're getting the latest value). MongoDB changed its answer between versions — older versions were more available, newer versions lean toward consistency.
What: The CAP theorem says that when a network partition happens in a distributed system, you must choose between consistency (every read gets the latest data) and availability (every request gets a response). You cannot have both during a partition.
When: Any time your data lives on more than one machine — replicated databases, distributed caches, microservices with local state. If it's distributed, CAP applies.
Key Principle: Partitions aren't optional — networks WILL break. So the real question isn't "pick 2 of 3" but "when the network breaks, do you want wrong answers or no answers?"
The Scenario — Your Database Is in Two Cities
Let's make this concrete. You run an e-commerce platform. Business is great — growing 20% month over month. Your single database in New York handles everything, but you're worried. What if the data center loses power? What if an earthquake hits? And your East Coast response times are great (12 ms), but your West Coast users are seeing 85 ms because every request crosses the entire country.
So you do what every scaling guide tells you: set up a second database in Los Angeles. Both databases serve reads and writes. They replicate data between each other over a fiber-optic cable. For a while, life is beautiful. NYC users hit the NYC database. LA users hit the LA database. Latency drops. You have disaster recovery. Your boss loves you.
Then one Tuesday at 2:47 PM, a construction crew in Nevada accidentally cuts the fiber-optic cable between your two data centers. The databases can no longer talk to each other.
A user in NYC adds an iPhone case to their cart. That write goes to DB-East. But DB-East can't send the update to DB-West — the cable is cut. Now a user in LA (same account — maybe the person's spouse on a shared account) opens the cart. DB-West looks up the cart and returns: empty. Two users, same account, two different realities.
The alternative: when the cable breaks, DB-West stops accepting any reads or writes until it can resync with DB-East. Now LA users see "Service Unavailable." No stale data. No contradictions. But also no service.
This isn't a bug. This isn't bad engineering. This is a fundamental mathematical truth about distributed systems. Eric Brewer proposed it as a conjecture in 2000, and Seth Gilbert and Nancy Lynch formally proved it in 2002. No amount of clever code, no billion-dollar infrastructure, no genius architect can escape it. If your data is on more than one machine, and the network between those machines can fail, you will face this choice.
The Three Properties — C, A, and P
Now that you've seen the problem, let's define each letter precisely. These definitions matter — CAP is one of the most misunderstood concepts in distributed systems because people use vague definitions and then argue about the wrong things.
Consistency (C): "Every Read Gets the Latest Write"
Think of a shared Google Doc. When your colleague types a sentence, you see it appear in real time. You never see an old version of the document. That's consistency — no matter which server handles your request, you always get the most recent data.
In formal terms: after a write completes successfully, every subsequent read (from any node) returns that written value or a newer one. There are no "stale" reads. Every node in the system agrees on the current state of the data, all the time.
Availability (A): "Every Request Gets a Response"
Think of a 24/7 convenience store. It might not have everything you want. The milk might be from yesterday. But the door is always open. You walk in, you get something. It never has a "closed" sign.
In formal terms: every request to a non-failing node returns a response — no errors, no timeouts. The system is always "open for business." The response might not contain the most up-to-date data, but it will give you an answer.
Notice the subtle but important nuance: availability doesn't promise correct answers — just an answer. A system can be available and return data that's 5 seconds old. That's still available. What it can't do is return a timeout (in HTTP terms, a 504 Gateway Timeout — the request was abandoned because the server took too long, which in CAP terms means the system was unavailable for that request) or an error saying "try again later."
Partition Tolerance (P): "The System Survives Network Splits"
A network partition is when the network between servers breaks, splitting them into groups that can't communicate. The servers themselves are fine — they're running, powered on, processing requests. They just can't talk to each other. It's like two offices where the phone line goes dead: both offices are open and staffed, but they can't coordinate.
Partition tolerance means the system keeps operating even when this happens. It doesn't crash, it doesn't shut down — it continues to function in some capacity even though some nodes can't communicate.
Here's the critical insight that changes how you think about CAP: partitions are not optional. In any real distributed system running on physical networks, network failures will happen. Switches fail. Cables get cut. Routers reboot. Cloud providers have network hiccups. Even within a single data center, network partitions occur regularly — large operators like Google have published data showing network failures are a routine occurrence across their fleets.
Since partitions will happen no matter what, P isn't really a choice — it's a requirement. A system that can't handle partitions simply isn't distributed. So the real question the CAP theorem asks is: when a partition happens, which do you give up — consistency or availability?
Why You Can't Have All Three — The Proof (In Plain English)
A lot of people hear "you can't have all three" and think it's just a rule of thumb — like "don't optimize prematurely." It's not. It's a mathematical proof, as certain as "you can't have a triangle with four sides." Let's walk through it step by step, no math required.
The Setup
Imagine the simplest possible distributed system: two nodes, Node A and Node B, connected by a network. They both store the same piece of data. Right now, both nodes say value = 42.
Step 1: Normal Operation (No Partition)
A client writes value = 99 to Node A. Node A updates its local copy, then sends the update to Node B over the network. Node B receives it and updates its copy too. Now both nodes have value = 99. If a client reads from either node, they get 99. All three properties are satisfied: the data is consistent (both nodes agree), the system is available (both nodes respond), and... well, there's no partition to tolerate. Life is good.
Step 2: A Partition Happens
The network between Node A and Node B breaks. They can't communicate. A client writes value = 99 to Node A. Node A updates its local copy to 99. It tries to send the update to Node B, but the message never arrives — the network is down. Node A has value = 99. Node B still has value = 42.
Step 3: A Read Arrives at Node B
A client sends a read request to Node B. Node B has to respond. And here's where the impossibility kicks in. Node B has exactly two options:
Option 1 — Return 42 (Choose Availability): Node B returns its local value, which is 42. The client gets a response, so the system is available. But the response is wrong — the real value is 99. Consistency is violated. Node B served stale data: data that was correct at some point in the past but has since been superseded by a newer write. It's not "corrupted" — it's just old, like a cached web page that hasn't been refreshed.
Option 2 — Refuse to respond (Choose Consistency): Node B says "I can't answer this request because I might have stale data." The client gets an error or timeout. Consistency is maintained (no wrong data was served), but availability is violated — the client didn't get a response.
Why There's No Option 3
Could Node B somehow return 99 — the correct value? No. Node B doesn't know 99 exists. The network is partitioned. The message from Node A never arrived. Node B has no way to know that the value changed. It's not a matter of being clever or having better algorithms — the information physically cannot reach Node B. You can't read a letter that was never delivered.
This is why CAP is a mathematical impossibility, not an engineering challenge. No amount of money, talent, or innovation can make Node B know about a write that it can't receive. The speed of light doesn't help. Quantum computing doesn't help. The information simply isn't there.
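The two-node argument above can be sketched in a few lines of Python. This is an illustrative toy, not a real replication protocol — the `Node` class and its `"AP"`/`"CP"` read modes are invented for this example:

```python
# Toy model of the two-node proof: Node B must either answer with a
# possibly-stale local value (AP) or refuse to answer at all (CP).

class Node:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.reachable = True  # can this node talk to its peer?

    def read(self, mode):
        if self.reachable:
            return self.value        # no partition: local copy is up to date
        if mode == "AP":
            return self.value        # availability: always answer, maybe stale
        raise TimeoutError(          # consistency: refuse rather than lie
            f"{self.name}: can't confirm latest value, try again later")

a, b = Node("A", 42), Node("B", 42)

# The partition happens, then a write lands on A only — the replication
# message to B is never delivered.
a.reachable = b.reachable = False
a.value = 99

print(b.read("AP"))                  # 42 — a response, but a stale one
try:
    b.read("CP")
except TimeoutError as err:
    print(err)                       # no response, but never a wrong one
```

Either way, Node B cannot produce 99: the information physically never reached it.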
CP vs AP — The Real Choice
By now you know that partitions aren't optional. The network will break at some point — undersea cables get damaged, routers misconfigure, cloud availability zones lose connectivity. Since you can't prevent partitions, the CAP theorem collapses into a single question: when the network breaks, do you want your system to be correct or responsive?
That gives you two real choices — CP and AP. Let's make each one concrete, so you can feel the difference instead of just reading about it.
You're building a ticket-booking system. Two users try to buy the last seat for a concert — one in New York, one in London. Right at that moment, the network between your data centers breaks. What happens if your system is CP? What happens if it's AP? Which is worse — selling the same seat twice, or telling both users "try again later"?
CP — Consistency + Partition Tolerance
A CP system says: "I would rather give you no answer than a wrong answer." When a node can't verify it has the most recent data — because the network partition means it can't talk to the other nodes — it simply refuses the request. It returns an error, or blocks until the partition heals. The system prefers silence over lies.
Why would you want this? Think about a bank account balance — the classic CP example. If the London server doesn't know about a $200 withdrawal that just happened in New York, and it shows "$700" when the real balance is "$500" — the customer might withdraw $600 thinking they have enough. Now you've allowed an overdraft based on stale data. That one wrong answer could cost the bank real money and violate financial regulations.
So the CP approach is: during a partition, the London server detects it can't confirm the latest balance, and it tells the customer "Service temporarily unavailable. Please try again." Frustrating? Yes. But nobody gets the wrong balance, and nobody makes financial decisions based on stale information.
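A minimal sketch of that CP behavior, assuming each side of a partition knows how many of the cluster's nodes it can still reach (the function name and error message are invented for illustration):

```python
# CP-style read guard: serve a read only when this side of the partition
# still holds a strict majority of the cluster, so it can rule out a
# newer write having been committed on the other side.

def cp_read(local_value, reachable_nodes, cluster_size):
    if reachable_nodes * 2 > cluster_size:   # quorum: strictly more than half
        return local_value
    raise RuntimeError("Service temporarily unavailable. Please try again.")

# 5-node cluster; this side reaches 3 nodes, so it has quorum and answers.
print(cp_read(500, reachable_nodes=3, cluster_size=5))

# The minority side (2 of 5 nodes) refuses the read instead of guessing:
try:
    cp_read(500, reachable_nodes=2, cluster_size=5)
except RuntimeError as err:
    print(err)
```

Requiring a strict majority is what prevents both sides of a partition from answering at once — at most one side can hold more than half the nodes.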
AP — Availability + Partition Tolerance
An AP system says: "I would rather give you a possibly-stale answer than no answer at all." Every node keeps serving requests with whatever data it has locally. If a write happened on the other side of the partition, this node doesn't know about it — but it still responds. After the partition heals, the system reconciles any differences between nodes.
Why would you want this? Think about DNS — the Domain Name System, the internet's phone book, which translates domain names (like google.com) into IP addresses (like 142.250.80.46). If a DNS server can't reach the authoritative server to check for updates, should it refuse to answer? Absolutely not. Telling your browser "I can't resolve google.com" would break the internet. Instead, the DNS server serves whatever cached IP address it has. It might be a few hours old, but it almost certainly still works. A slightly stale answer is infinitely better than no answer.
Same logic for a social media feed. If your Instagram feed shows 2,041 likes on a photo but the real count is 2,043 — does anyone care? Not even a little bit. But if Instagram said "Feed unavailable, try later" every time there was a network hiccup, people would delete the app.
CA — The Ghost Option
You'll see "CA" on many CAP diagrams, usually grayed out. CA means: consistent and available, but doesn't tolerate partitions. In practice, this only exists when there are no partitions — which means a single machine, or machines on a perfectly reliable network (which doesn't exist in production). A standalone PostgreSQL server is technically "CA" because it can't have a partition with itself. But a single server isn't distributed, and if it dies, everything dies. That's why CA is grayed out — it's not a real option for distributed systems.
| | CP (Consistency + Partition Tolerance) | AP (Availability + Partition Tolerance) |
|---|---|---|
| During a partition | Refuses requests it can't guarantee are fresh | Serves whatever data it has locally |
| User experience | Some users see errors or timeouts | All users get responses (possibly stale) |
| After partition heals | Instantly back to normal — data was never inconsistent | Nodes reconcile conflicts (last-writer-wins, CRDTs, etc.) |
| Best for | Banking, inventory, config stores, leader election | Social feeds, DNS, product catalogs, CDN caches |
| Real examples | ZooKeeper, etcd, HBase, Google Spanner | DynamoDB, Cassandra, DNS, CouchDB |
| Motto | "I don't know" is better than "I'm wrong" | "Something is better than nothing" |
Real-World Systems on the CAP Spectrum
Theory is nice, but where do the databases you actually use in production land on the CAP spectrum? Let's walk through the major ones — not just labeling them "CP" or "AP," but explaining why they made that choice based on what they were built to do.
Before reading each system's classification, try to guess: is it CP or AP? The trick is to think about what the system stores. If stale data in that system would cause real damage (wrong config, double-spending, split-brain), it's probably CP. If the system stores content that can tolerate brief staleness (posts, product listings, cached pages), it's probably AP.
CP Systems — Correctness Over Speed
ZooKeeper — Apache's coordination service for distributed applications, used by Kafka, Hadoop, HBase, and many others — helps distributed applications agree on things like "who is the leader?" and "what's the current cluster configuration?" It manages configuration, naming, distributed locks, and leader election. It uses a consensus protocol called ZAB (ZooKeeper Atomic Broadcast, similar to Raft and Paxos): all writes go through the leader, and a majority of followers must acknowledge before a write is committed.
Why CP? Imagine ZooKeeper serves stale data during a partition. A Kafka cluster asks "who is the leader for partition X?" and ZooKeeper answers with the old leader (who is actually dead). Now two brokers both think they're the leader — that's split-brain: two parts of a distributed system both believing they are the primary, both accepting writes independently and creating conflicting data. It's one of the most dangerous failure modes in distributed systems because it can corrupt data permanently. Stale coordination data doesn't just cause inconvenience — it can take down entire clusters.
During a partition, minority-side ZooKeeper nodes become read-only. They won't serve writes at all, and reads may be rejected too depending on configuration. The system would rather be partially down than risk giving outdated answers about who leads what.
etcd — a distributed key-value store written in Go — is the brain of Kubernetes: every pod definition, every service configuration, every secret lives in etcd. It uses the Raft consensus algorithm (designed to be easier to understand than Paxos, and also used by CockroachDB, TiKV, and Consul), which means writes require a majority of nodes to agree.
Why CP? Picture what happens if etcd serves stale data: Kubernetes reads an outdated pod spec and deploys the wrong version of your application to production. Or it reads an old secret and your app connects to the wrong database. Or it misses a node deletion and keeps scheduling pods on a machine that's already been decommissioned. Wrong config data doesn't just mean stale reads — it means production outages caused by acting on incorrect information.
When etcd loses quorum (can't reach a majority), the minority partition stops accepting writes and reads. Kubernetes on that side of the partition goes into a holding pattern — existing pods keep running, but no new scheduling happens. Painful, but safe.
HBase — Apache's distributed, column-oriented database modeled after Google's Bigtable, running on top of HDFS — is a wide-column store built for real-time read/write access to large datasets; Facebook's messaging system originally ran on it. Data is split into regions (ranges of rows), and each region is served by exactly one region server. This single-owner model is what makes HBase strongly consistent — there's no ambiguity about who owns the data. During a partition, if a client can't reach the region server that owns their row, the request fails — the system won't redirect to a stale copy.
Why CP? HBase is built for analytics workloads where you're computing aggregates, running reports, and making business decisions. If your report reads stale data, your business decision is based on incorrect numbers. For a system that powers data pipelines and financial analytics, stale reads aren't just annoying — they produce wrong conclusions.
Spanner is Google's globally distributed SQL database. It's technically CP — consistency is non-negotiable; Spanner provides external consistency, the strongest form, for global transactions. But here's the twist: Google made partitions so incredibly rare that Spanner achieves five-nines availability (99.999%). How? By investing billions in redundant network links, custom hardware, and TrueTime — a system of atomic clocks and GPS receivers in every data center that keeps clocks synchronized across the entire planet to within a few milliseconds of uncertainty. That tiny uncertainty window is what makes Spanner's consistency guarantees possible without the usual performance penalty.
Why is this special? Normal CP systems sacrifice availability during partitions. Spanner's approach is to make partitions almost nonexistent through sheer engineering investment. It doesn't solve CAP — it just makes the "P" so rare that the sacrifice is nearly invisible. Of course, this approach requires Google-scale infrastructure budgets.
AP Systems — Speed Over Correctness
DynamoDB — Amazon's fully managed NoSQL database, descended from the system originally designed for Amazon's shopping cart — was born from the philosophy that the cart should never be unavailable. The original Dynamo whitepaper (2007) explicitly states the design goal: "customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados."
Why AP? Think about what happens at Amazon during a flash sale. Millions of people are adding items to their carts. If the cart service returned "503 Service Unavailable" for even 30 seconds, Amazon would lose millions of dollars in sales. A shopping cart that occasionally shows an item you removed 2 seconds ago is a minor annoyance. A shopping cart that's down is a revenue catastrophe.
DynamoDB achieves this by being "always writable": any node can accept writes. During a partition, both sides keep working independently. When the partition heals, conflicts are resolved — typically with last-writer-wins, where the write with the most recent timestamp is kept. It's simple but can lose data: if two nodes accept conflicting writes during a partition, only the latest one survives. Applications can implement custom resolution logic instead.
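A sketch of last-writer-wins reconciliation after a partition heals. The timestamps and cart values are invented; real systems also need a deterministic tiebreak (such as a node ID) when timestamps are identical:

```python
# Each replica holds a (timestamp, value) pair for the same key.
# When the partition heals, the newest timestamp wins outright.

def lww_merge(a, b):
    return a if a[0] >= b[0] else b

# During the partition, both sides accepted a write to the same cart:
east = (1_700_000_005, ["iphone-case"])   # later write
west = (1_700_000_002, [])                # earlier, conflicting write

merged = lww_merge(east, west)
print(merged)   # east wins; west's concurrent write is silently discarded
```

Note what just happened: the west-side write is gone. That is the data-loss risk mentioned above — last-writer-wins is simple precisely because it throws away one side of every conflict.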
Cassandra was built at Facebook for their inbox search feature, where the priority was handling massive write volumes without any single point of failure. Every node in Cassandra is equal — there's no leader, no primary. Writes can go to any node. It now powers some of the largest clusters in the industry, at Apple (reportedly 400K+ nodes), Netflix, and Discord.
Why AP? Cassandra's sweet spot is workloads with relentless writes — logging, time-series data, IoT sensor readings, messaging. These are use cases where losing a few seconds of data during a partition is acceptable, but having the system stop accepting writes is not. If your IoT platform can't record sensor readings for 5 minutes, you've lost 5 minutes of monitoring data forever.
Cassandra offers tunable consistency — you choose per query how many replicas must acknowledge. CONSISTENCY ONE is pure AP (fastest, might be stale). CONSISTENCY QUORUM gets closer to CP (a majority must agree). CONSISTENCY ALL is basically CP (every replica must respond). Same database — in fact, the same table — with different trade-offs per query.
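The guarantee behind these levels follows from a simple overlap rule: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to meet at least one up-to-date replica whenever R + W > N. A sketch (the function name is invented for illustration):

```python
# Quorum overlap rule: a read and a prior write must intersect in at
# least one replica whenever read_acks + write_acks exceeds the replica
# count — so the read cannot miss the latest acknowledged write.

def overlaps(n_replicas, write_acks, read_acks):
    return write_acks + read_acks > n_replicas

N = 3
print(overlaps(N, write_acks=1, read_acks=1))  # False: ONE/ONE can miss the write
print(overlaps(N, write_acks=2, read_acks=2))  # True: QUORUM/QUORUM must overlap
print(overlaps(N, write_acks=3, read_acks=1))  # True: write to ALL, read from ONE
```

This is why QUORUM reads paired with QUORUM writes behave like CP, while ONE/ONE behaves like AP.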
DNS is the most aggressively AP system on the internet. When you type "google.com" in your browser, a DNS resolver looks up the IP address. If it can't reach the authoritative DNS server, it serves a cached record — potentially hours or even days old. DNS records carry a TTL (Time To Live), typically between 300 seconds and a day; within that window, resolvers answer from cache without checking the authoritative server, and many resolvers keep serving stale records even after the TTL expires if the authoritative server is unreachable.
Why AP? If DNS returned "I don't know" instead of serving a cached record, your browser literally couldn't find any website. The entire internet would break. A slightly stale IP address that still works is infinitely better than no IP address at all. DNS was designed for availability from day one — the whole caching and TTL system exists specifically so that DNS keeps working even when parts of the infrastructure are down.
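That "serve stale on failure" behavior can be sketched as a tiny cache. The class, the TTL handling, and the upstream callback are all invented for illustration — a real resolver is far more involved:

```python
import time

# A cache that prefers a stale answer over no answer, DNS-style.
class StaleOkCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # name -> (value, fetched_at)

    def resolve(self, name, upstream):
        entry = self.store.get(name)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]              # fresh hit: no upstream call needed
        try:
            value = upstream(name)       # ask the authoritative source
            self.store[name] = (value, time.time())
            return value
        except ConnectionError:
            if entry:
                return entry[0]          # AP choice: stale beats nothing
            raise                        # nothing cached, nothing to serve

# A TTL of 0 forces every lookup past the freshness check, so the demo
# exercises the stale-serving path when the upstream goes down.
cache = StaleOkCache(ttl_seconds=0)
cache.resolve("example.com", lambda name: "93.184.216.34")

def down(name):
    raise ConnectionError("authoritative server unreachable")

print(cache.resolve("example.com", down))   # stale cached answer, still served
```

The key line is the `except ConnectionError` branch: when the authoritative source is unreachable, the cache answers from whatever it has rather than propagating the failure.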
A CDN (Content Delivery Network) — a globally distributed network of servers run by providers like CloudFront, Cloudflare, Akamai, and Fastly — stores copies of your web content close to users, caching static files (images, CSS, JS) and often dynamic content too. When the origin server is unreachable (a partition), the CDN serves whatever cached version it has — even if it's stale. CloudFront calls this "origin failover." Cloudflare calls it "Always Online."
Why AP? Showing a user a product page from 10 minutes ago is better than showing a blank error page. The product description hasn't changed. The images haven't changed. Maybe the price is slightly off — but the page loads, the user can browse, and they might still buy something. A "503 Origin Unreachable" page, on the other hand, drives users straight to your competitor.
PACELC — Beyond CAP
Here's the dirty secret about CAP: it only tells you what happens during a partition. But full partitions are rare — maybe once or twice a year for most companies, sometimes never. Most of the time, your system is operating normally. So what trade-off are you making the other 99.9% of the time, when the network is perfectly healthy?
That's the question CAP completely ignores. And it's the question that actually matters for your users' everyday experience.
In 2010, Daniel Abadi — a database researcher at Yale — noticed this gap and proposed PACELC (pronounced "pass-elk"). The idea is simple but powerful: if there is a Partition, choose between Availability and Consistency; Else, choose between Latency and Consistency.
The "Else" half is the breakthrough. Even when the network is fine, you face a trade-off on every single request: do you replicate data synchronously (the write isn't confirmed until all replicas have stored it — consistent, but you wait for the slowest replica) or asynchronously (the write is confirmed as soon as the primary stores it, and replicas catch up in the background — fast, but a replica might lag behind by milliseconds to seconds)?
This matters because it affects every single request your system handles — not just the rare moments during a partition. When you're evaluating databases, the "ELC" half of PACELC tells you far more about everyday user experience than the "PAC" half ever will.
Your app has replicas in Virginia and Frankfurt. A user in Virginia writes data. You have two choices: (1) confirm the write after BOTH Virginia and Frankfurt store it — takes 90ms because of the transatlantic round trip, or (2) confirm after just Virginia stores it, let Frankfurt catch up asynchronously — takes 3ms. Neither option involves a partition. What are you trading?
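Plugging in the round-trip numbers from the scenario (3 ms local, 90 ms transatlantic — both illustrative figures) makes the trade-off concrete:

```python
# Write-confirmation latency under the two replication strategies.
LOCAL_RTT_MS = 3    # client <-> Virginia replica
CROSS_RTT_MS = 90   # Virginia <-> Frankfurt replica

def write_latency_ms(mode):
    if mode == "sync":               # EC: wait for every replica to store it
        return LOCAL_RTT_MS + CROSS_RTT_MS
    return LOCAL_RTT_MS              # EL: confirm locally, replicate later

print(write_latency_ms("sync"))      # 93 — never stale, always slow
print(write_latency_ms("async"))     # 3 — always fast, briefly stale
```

You're trading latency for consistency on every single write — no partition required.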
The Four PACELC Combinations
PACELC gives you four possible combinations based on what a system chooses during partitions (PA or PC) and what it chooses during normal operation (EL or EC):
PA/EL is the "maximize speed" quadrant. These systems are available during partitions and favor low latency during normal operation. DynamoDB, Cassandra, DNS — they're optimized to respond as fast as possible, all the time. This is the most common AP choice because if you've already decided stale data is acceptable during partitions, you probably also want the speed benefits during normal operation.
PC/EC is the "maximize correctness" quadrant. These systems refuse stale data during partitions and synchronize replicas during normal operation. Spanner, MongoDB (default config), etcd — they never serve data they can't vouch for. The price is higher latency on every request, because every read and write involves coordination across replicas.
PA/EC is an interesting middle ground. The system is available during partitions (accepts stale reads) but uses synchronous replication during normal operation. Azure Cosmos DB with session consistency can behave this way. It's less common because it's a somewhat contradictory stance: "I care about correctness normally, but I'll accept staleness in emergencies."
PC/EL is the rarest combination. Consistent during partitions but favoring low latency normally. Yahoo's PNUTS — a globally distributed database from around 2008 that gave each record a "master" region, so reads from the master were consistent while reads elsewhere hit local replicas for speed — was a notable example. It's hard to build because the mechanisms that provide low-latency reads (read from the nearest replica) inherently risk staleness, which conflicts with the consistency guarantee during partitions.
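The four quadrants can be summarized as a lookup, using the classifications from this section (real systems, and especially their configurations, blur these lines):

```python
# PACELC quadrants for the systems discussed above.
PACELC = {
    "DynamoDB":            ("PA", "EL"),
    "Cassandra":           ("PA", "EL"),
    "Cosmos DB (session)": ("PA", "EC"),
    "Spanner":             ("PC", "EC"),
    "etcd":                ("PC", "EC"),
    "PNUTS":               ("PC", "EL"),
}

def describe(system):
    p, e = PACELC[system]
    during = "stays available" if p == "PA" else "refuses stale answers"
    normally = "favors low latency" if e == "EL" else "favors consistency"
    return f"{system}: during a partition it {during}; normally it {normally}."

print(describe("Spanner"))
```

Reading a system's row this way forces the right question: not "is it CP or AP?" but "what does it do during a partition, and what does it do the rest of the time?"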
Consistency Models — Strong, Eventual, and Everything Between
When people say "consistency" in CAP, they usually mean it as a binary: either your data is consistent or it isn't. But in reality, consistency is a spectrum with at least five distinct levels, each offering a different trade-off between correctness and speed. Understanding these levels is crucial because you almost never need the strongest consistency for everything — and using the right level for each operation can make your system both faster and cheaper.
Think of it like shipping options. You can overnight everything — but why pay for overnight shipping on a book you'll read next month? Some data needs to arrive instantly and perfectly. Other data can take the scenic route.
You're building an e-commerce app. It has four types of data: (1) account balance, (2) product reviews, (3) "items in your cart" that you just added, and (4) the total number of likes on a product. Which of these absolutely MUST be perfectly up-to-date at all times? Which ones can tolerate being a few seconds — or even minutes — behind?
The Five Consistency Levels
Strong consistency (linearizability) — the gold standard. Every read returns the most recent write. Period. It's as if there's only one copy of the data in the entire world. If Alice writes a value and Bob reads it one nanosecond later (from a different continent), Bob sees Alice's write. This is the simplest model to reason about — but the most expensive, because every operation requires coordination across all replicas before responding.
Sequential consistency — slightly weaker. All operations appear in some global order, and that order respects each individual client's sequence. So if you write A then B, everyone sees A before B. But different clients might not see each other's operations in real-time — there could be a brief delay. Think of it like a news broadcast: the events are reported in order, but there's a slight delay from when they happen to when you see them.
Causal consistency — a practical middle ground. If operation A caused operation B (e.g., you post a photo, then someone comments on it), everyone is guaranteed to see A before B. But operations that are unrelated (two people posting photos at the same time, independently) can appear in different orders to different readers. This is often "good enough" because it preserves the logical relationships that matter.
Read-your-writes consistency — a very common guarantee. After you write something, you will always see your own write immediately. But other users might see the old value for a brief window. This is what most social media apps give you: you post a photo and see it immediately in your feed. Your friend across the country might not see it for a few seconds — but you always see your own posts right away.
Eventual consistency — the weakest useful guarantee. If no new writes happen, eventually all replicas will converge to the same value. But there's no guarantee on how long "eventually" takes — could be milliseconds, could be minutes. The only promise is that the system won't stay divergent forever. DNS is a classic example: after you update a DNS record, some servers serve the old IP for minutes or hours, but eventually every server on the internet has the new one.
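Of these levels, read-your-writes is the one most often hand-rolled in application code. One common pattern is a per-session version token: the client remembers the version of its last write, and reads fall back to the primary whenever the local replica hasn't caught up to that version. A toy sketch of that pattern, with all names illustrative:

```python
class Store:
    """Read-your-writes via a per-session version token (illustrative sketch)."""

    def __init__(self):
        self.primary = {}   # key -> (value, version): always up to date
        self.replica = {}   # key -> (value, version): lags behind the primary
        self.version = 0

    def write(self, session, key, value):
        self.version += 1
        self.primary[key] = (value, self.version)
        session["min_version"] = self.version        # client remembers its own write

    def replicate(self):
        self.replica = dict(self.primary)            # async catch-up, runs later

    def read(self, session, key):
        value, version = self.replica.get(key, (None, 0))
        if version >= session.get("min_version", 0):
            return value                             # replica fresh enough for this client
        return self.primary.get(key, (None, 0))[0]   # fall back to the primary
```

A fresh session (no token yet) happily reads the stale replica; the writer's own session is routed to the primary until replication catches up — exactly the "you see your post immediately, your friend sees it in a few seconds" behavior described above.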
The Trade-Off Ladder
Each step down the consistency ladder removes a coordination requirement, which directly translates to less waiting:
- Strong consistency requires every replica to agree before responding. If you have replicas in Virginia, Frankfurt, and Tokyo, every write waits for a round trip to Tokyo (~150ms). Every read might need to check with the leader.
- Sequential consistency removes the real-time requirement. Operations are globally ordered, but there's no guarantee that the ordering matches wall-clock time.
- Causal consistency only tracks dependencies between related operations. Unrelated operations skip coordination entirely — huge savings.
- Read-your-writes only coordinates for the originating client. Other clients' views can lag freely.
- Eventual consistency removes all coordination. Respond immediately with whatever data is local. Replicas sync in the background on their own schedule.
The causal consistency example illustrates why you don't always need strong consistency. In a social media app, it would be bizarre to see "Great photo!" before the photo itself appears. Causal consistency prevents that — it tracks that Bob's comment depends on Alice's photo. But it doesn't bother ordering two unrelated posts from different people — one user might see Post X before Post Y, and another user might see the reverse. That's fine. Nobody notices.
Strong consistency would guarantee everyone sees everything in the same order in real-time — but that would require coordinating every single post across every data center, adding hundreds of milliseconds to every write. For a social media feed, that's a terrible trade-off: much slower, for a guarantee nobody needs.
Here's a practical example of mixing consistency levels in one application:
| Data | Consistency Level | Why |
|---|---|---|
| Account balance | Strong | Wrong balance = overdraft, fraud, legal liability |
| Order status | Read-your-writes | You need to see your own order status update; others can lag |
| Comment threads | Causal | Replies must appear after the comment they reply to |
| Product reviews | Eventual | A 30-second delay on new reviews is invisible to shoppers |
| Like / view counts | Eventual | Nobody notices if the count is off by a few for 10 seconds |
| User's own profile edits | Read-your-writes | You change your bio, you expect to see the change immediately |
| Analytics dashboards | Eventual | Dashboards refresh every 60 seconds anyway — staleness is baked in |
Eventual Consistency Deep Dive — How Systems Actually Converge
"Eventual consistency" sounds terrifying if you've never seen it in action. It sounds like your data is floating around in chaos, and maybe someday the servers will figure it out. But here's the thing: most of the internet runs on eventual consistency right now, and it works great. When you update your DNS record, it takes minutes to hours to spread across the globe — that's eventual consistency. CDN caches serve stale versions of your website for seconds or minutes after you deploy — eventual consistency. Your social media feed doesn't show a friend's post the exact millisecond they publish it — eventual consistency. And nobody calls any of these systems "broken."
So how does it actually work? The core idea is simple: after some change happens (a write, a partition healing, a replication catch-up), all nodes exchange their differences and merge. Given enough time with no new writes, every replica will hold the exact same data. The interesting — and tricky — part is: what happens when two nodes received conflicting writes during a partition?
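Before tackling conflicts, the convergence loop itself is worth seeing. In the no-conflict case (each write lives under its own key), anti-entropy is just repeated pairwise merging. A toy gossip ring, purely illustrative:

```python
def anti_entropy_round(replicas):
    """One gossip round: each replica pulls everything its left neighbour has."""
    n = len(replicas)
    for i in range(n):
        # Merge = dict union; no two replicas wrote the same key, so no conflicts
        replicas[i].update(replicas[(i - 1) % n])

# Three replicas, each holding a write the others missed (say, during a partition)
replicas = [{"a": 1}, {"b": 2}, {"c": 3}]
for _ in range(len(replicas) - 1):   # n-1 rounds suffice to cover a ring of n
    anti_entropy_round(replicas)
# Every replica now holds {"a": 1, "b": 2, "c": 3} — they have converged
```

Real systems gossip with random peers and exchange only diffs (often via Merkle trees), but the shape is the same: given enough rounds with no new writes, all replicas end up identical.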
Systems answer that question in four main ways: last-writer-wins, vector clocks, CRDTs, and application-level resolution. Let's break these down one by one.
Last-Writer-Wins (LWW) is the simplest approach: when two nodes disagree, whichever write has the later timestamp wins and the other is silently thrown away. It's dead simple to implement, which is why DynamoDBAmazon's managed NoSQL database. Default conflict resolution is LWW. Designed to always accept writes, even during partitions. and CassandraA distributed NoSQL database originally from Facebook. Every node is equal (no leader). Uses timestamps for conflict resolution by default. both use it as their default. But there's a catch: it silently loses data. If two users edit the same profile at the same time on different nodes, one person's changes simply vanish. For a "likes" counter, that's fine. For user profile updates, it might not be.
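LWW fits in three lines, which is exactly why it's the default — and the sketch below also shows the silent loss. (A toy version; real systems attach the timestamp at write time, usually from the coordinator's clock.)

```python
def lww_merge(a, b):
    """Keep whichever (value, timestamp) pair carries the later timestamp."""
    return a if a[1] >= b[1] else b

# During a partition, two replicas accept different edits to the same profile
edit_on_a = ({"bio": "loves hiking"}, 1000)   # Alice's edit, timestamp 1000
edit_on_b = ({"avatar": "cat.png"}, 1001)     # Bob's edit, timestamp 1001

winner = lww_merge(edit_on_a, edit_on_b)
# winner is Bob's edit — Alice's bio change is silently discarded
```

Note that nothing in the merge signals that data was dropped: the loser simply never existed, as far as any later reader can tell.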
Vector clocks take a smarter approach. Instead of a single timestamp, each piece of data carries a "version history" — basically, a list of who changed it and how many times. When two versions arrive with incompatible histories (neither one is "newer" than the other), the system knows there's a genuine conflict. But it can't resolve it automatically — it flags it and hands it to either the application or the user. RiakA distributed key-value store that historically used vector clocks for conflict detection. It would return multiple conflicting versions ("siblings") and let the application merge them. famously used this approach, returning multiple conflicting versions ("siblings") and asking the app to merge them.
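The heart of a vector clock is the comparison: if one clock dominates the other, the writes are ordered and the newer one can safely win; if neither dominates, they are concurrent and the system flags a genuine conflict. A minimal comparison function (node names are illustrative):

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks (dicts of node -> counter)."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"    # b has seen everything a has, and more
    if b_le_a:
        return "b_before_a"
    return "concurrent"        # neither saw the other: genuine conflict

# One write built on the other: safely ordered
compare({"node1": 1}, {"node1": 2})
# Two independent writes during a partition: concurrent, must be resolved
compare({"node1": 2, "node2": 0}, {"node1": 1, "node2": 1})
```

"Concurrent" is the case LWW papers over: here the system knows there's a real conflict and can return both versions as siblings instead of guessing.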
CRDTs (Conflict-free Replicated Data Types) are the mathematical solution. These are special data structures — counters, sets, maps — designed so that any order of operations produces the same final result. A G-CounterA "grow-only counter" CRDT. Each node tracks its own increment count separately. The total is the sum of all node counts. Merging is just taking the max of each node's count. Mathematically impossible to conflict. (grow-only counter), for example, tracks each node's count separately. Merging means taking, for each node, the max of the two counts; the counter's value is the sum across all nodes. No matter what order you merge, you get the same number. It's mathematically impossible to conflict. The trade-off? CRDTs only work for specific data structures — you can't CRDT-ify arbitrary data.
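A G-Counter in a few lines: each replica increments only its own slot, merge takes the per-node max, and the value is the sum. A toy sketch:

```python
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}            # node_id -> increments seen from that node

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Per-node max is commutative, associative, and idempotent,
        # so merges in any order converge to the same state.
        for node, count in other.counts.items():
            if count > self.counts.get(node, 0):
                self.counts[node] = count

    def value(self):
        return sum(self.counts.values())

# Two replicas count likes independently during a partition...
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
# ...then merge in either order — both arrive at the same total, 5
a.merge(b)
b.merge(a)
```

Because each replica only ever touches its own slot, there is no write two replicas can disagree about — the conflict is designed out of existence.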
Application-level resolution is the "you decide" option. The database stores both conflicting versions and your application code contains the merge logic. Amazon's shopping cart is the textbook example: if two versions of a cart conflict, just take the union of all items. You might end up with a slightly larger cart than intended, but that's better than losing items. This is the most flexible strategy, but it means your application needs custom conflict resolution code for every data type that might conflict.
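The cart merge itself is one line of application code — a union that never drops an item. A sketch with carts modeled as sets of item IDs:

```python
def merge_carts(version_a, version_b):
    """Amazon-style resolution: when two cart versions conflict, keep everything."""
    return version_a | version_b

cart_us_east = {"hat", "book"}     # written in one region during the partition
cart_eu_west = {"hat", "shoes"}    # written concurrently in the other

merged = merge_carts(cart_us_east, cart_eu_west)
# merged == {"hat", "book", "shoes"} — possibly larger than intended, nothing lost
```

The known downside of a plain union: an item deleted on one replica can reappear after the merge, which is the classic argument for an observed-remove set instead of a bare set.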
The "Eventually" Problem — How Long Is Eventually?
When people hear "eventually consistent," the natural question is: "OK, but how long is 'eventually'?" The honest answer is: it depends on where the replicas are.
Within the same datacenter, replicas typically converge in 1–5 milliseconds — faster than a human eye blink (which takes about 150ms). Cross-region replication takes 50–200ms, limited by the speed of light through fiber optic cables. During an actual network partition, convergence is delayed until the partition heals — which could be minutes or even hours in the worst case. But that worst case is genuinely rare: most cloud providers measure partition events in minutes per year.
Common Misconceptions — What CAP Does NOT Say
CAP theorem might be the most misunderstood idea in all of computer science. It's only one sentence long, but somehow most engineers — even experienced ones — get it wrong. Blog posts, YouTube videos, and even some textbooks repeat the same myths. Understanding what CAP does not say is just as important as understanding what it does say. Let's kill the six biggest myths.
Myth #1: "CAP means you pick any 2 of the 3."
Why it's wrong: This is the myth that started it all, and it's the most damaging. The "pick 2" framing implies that Partition tolerance is optional — like you could just say "I'll take C and A, hold the P please." But you can't opt out of network partitions. They will happen. Cables get cut, routers fail, cloud availability zones lose connectivity. P is not a choice — it's a fact of physics.
Once you accept that P is mandatory, the "triangle" collapses into a simple line: during a partition, do you want Consistency (refuse to serve stale data) or Availability (serve whatever you have, even if stale)?
And it gets even more nuanced: many systems aren't locked into one mode. You can use CP behavior for payment transactions and AP behavior for the product catalog — in the same database, in the same application.
Myth #2: "Our distributed database is CA."
Why it's wrong: CA means "Consistent and Available, but not Partition Tolerant." The only system that truly qualifies is a single-node database — like a standalone PostgreSQL server with no replication. There's no network between nodes (because there's only one node), so partitions are impossible, and you get both C and A.
But the moment you add a second node — a replica, a standby, a shard — you have a distributed system, and network partitions become possible. Then you must choose. Every real production distributed system is either CP or AP (or switches between them depending on the operation). Anyone who tells you their distributed database is "CA" doesn't understand CAP.
Myth #3: "Your whole system must be either CP or AP."
Why it's wrong: This might be the most practically damaging misconception. Different data has different requirements. In an e-commerce app:
- Account balance during a transfer → CP (you cannot tolerate stale data here, or you risk double-spending)
- Product catalog browsing → AP (showing a price that's 2 seconds behind is fine; showing an error page is not)
- User session data → AP (if the session store is briefly inconsistent, the worst case is re-authentication)
- Inventory count during checkout → CP (you need to know the real count to avoid overselling)
Modern databases like Cassandra and DynamoDB let you set the consistency level per query. Use it. Don't force your entire system into one mode because you think CAP requires it.
Myth #4: "You can never have both consistency and availability."
Why it's wrong: You can have both — when there's no partition. And there's no partition the vast majority of the time. CAP only constrains your system during a network partition. During normal operation (which is 99.9%+ of the time for most cloud setups), your system happily provides strong consistency and high availability.
Think of it like a fire escape: you hope to never use it, but when you need it, you must choose — jump or wait. The choice only exists during the fire. On an ordinary Tuesday, the building offers both safety and convenient exits.
Myth #5: "Eventually consistent means the data is always stale."
Why it's wrong: "Eventually consistent" sounds like "always behind," but in practice, replication happens in milliseconds. DynamoDB's eventually consistent reads are typically less than 100ms behind the latest write. Cassandra replicates within the same datacenter in under 5ms. The "inconsistency window" is almost always shorter than the time it takes a human to blink.
The "eventually" in "eventually consistent" refers to the guarantee, not the typical experience. The guarantee says "at some point, all replicas will agree." The typical experience is "all replicas agree within a few milliseconds." It's like saying "your package will eventually arrive" when the delivery truck is already on your street.
Myth #6: "Strong consistency is slow."
Why it's wrong: Google SpannerGoogle's globally distributed database. Uses GPS-synchronized atomic clocks (TrueTime) to achieve strong consistency with low latency. Only available as a managed Google Cloud service. provides strong (linearizable) consistency with commit latencies around 10ms — globally. How? By throwing hardware at the problem: GPS receivers and atomic clocks in every datacenter (TrueTime) eliminate the need for expensive coordination rounds. Spanner is expensive, not slow.
CockroachDBAn open-source distributed SQL database inspired by Spanner. Uses Raft consensus for strong consistency. Works on commodity hardware without GPS clocks. achieves strong consistency with single-digit millisecond reads in the common case. The speed depends on how close your nodes are and how expensive your hardware is, not on whether you chose consistency.
Strong consistency adds latency compared to eventual consistency for the same hardware, sure. But "slower than the fastest option" doesn't mean "slow." It means "still fast, just not the absolute fastest possible."
Interview Playbook — Nail CAP Theorem Questions
CAP theorem comes up in almost every system design interview. The bad news: most candidates fumble it by reciting "pick 2 of 3" and stopping there. The good news: you now know more than most candidates. Here's how to turn that knowledge into an answer that impresses.
The 5-Step Framework
Whenever a CAP question lands, follow these five steps in order. This framework works whether the question is "explain CAP" or "classify this system" or "design something globally distributed":
Step 1: State the constraint. Partitions will happen, so the real choice is consistency vs availability during a partition.
Step 2: Analyze the data. What does each data type cost the business if it's briefly stale, and what does it cost if it's unavailable?
Step 3: Choose per data type, not per system. CP where stale data causes real harm, AP where an error page is worse than slightly old data.
Step 4: Explain recovery. How do conflicting writes merge when the partition heals, and where do you escalate to strong consistency?
Step 5: Mention PACELC. Name the latency-vs-consistency trade-off during normal operation, not just during partitions.
Common Interview Questions
Question 1: "Explain the CAP theorem."
The weak answer: "CAP says you can only pick 2 out of Consistency, Availability, and Partition tolerance."
The strong answer: "CAP theorem says that when a network partition happens in a distributed system, you must choose between consistency and availability. The 'pick 2 of 3' framing is misleading because partition tolerance isn't optional — network partitions happen whether you want them to or not. So the real choice is: during a partition, do you reject requests to stay consistent (CP), or do you keep serving possibly stale data to stay available (AP)? And when there's no partition — which is the vast majority of the time — you actually get all three."
Bonus points: Use the bank branch analogy. "Imagine two bank branches whose phone line goes down. A customer walks into Branch B and wants to withdraw. Should Branch B process the withdrawal based on its last-known balance (available, possibly wrong) or refuse until it can call Branch A (consistent, but the customer can't get their money)?"
Question 2: "Is your system CP or AP?"
The weak answer: "It's CP because we use PostgreSQL."
The strong answer: "It's not one or the other for the whole system. I'd make different trade-offs for different data. User authentication and payment processing should be CP — we can't risk serving stale auth tokens or processing payments based on outdated balances. But the product catalog, user feeds, and recommendation results can be AP — showing slightly stale data is vastly better than showing error pages to users who are trying to browse."
Bonus points: "And even within the CP operations, I'd use the weakest consistency that's safe. For example, reading a user profile for display can be eventually consistent, but reading it during a login check needs strong consistency. Cassandra lets me set this per-query with consistency levels."
Question 3: "Is DynamoDB CP or AP?"
The strong answer: "DynamoDB is AP by default — it prioritizes availability. Every eventually consistent read goes to a nearby replica and returns immediately, even if that replica is slightly behind the latest write. With global tables, writes are accepted in any region, and cross-region conflicts are resolved with last-writer-wins using timestamps."
"But DynamoDB also offers tunable consistency. You can add --consistent-read to any read request, which forces the read to go to the partition's leader and return the absolute latest value. This costs 2x the read capacity units and has higher latency, but gives you strong consistency for that specific read. So DynamoDB is AP by default but CP per-query when you need it."
Bonus points: "Under the PACELC model, DynamoDB is PA/EL — during partitions it favors Availability, and during normal operation (Else) it favors Latency by defaulting to eventually consistent reads. A consistent read flips that query to the EC side — consistency over latency — and during a partition it will fail rather than return stale data."
Question 4: "Can a system have all three: C, A, and P?"
The strong answer: "Technically yes — when there's no partition, every distributed system provides all three. The constraint only applies during a partition. And since partitions are rare (most cloud providers see them for minutes per year), your system operates with full C+A+P almost all the time."
"Google Spanner comes closest to 'all three in practice' by using atomic clocks and GPS receivers (TrueTime) to minimize the window where a partition would force a trade-off. It's still technically CP — during a real partition it would sacrifice availability — but Google's private network makes partitions so rare that the practical experience is near-perfect consistency and availability."
"But the important thing is: no system can guarantee all three during a partition. That's a mathematical impossibility, not an engineering limitation."
Applying CAP to a Design Scenario
The question: "Design the shopping cart for a global e-commerce site."
Step 1 — State the constraint: "A shopping cart is accessed from anywhere in the world, so we'll have replicas in multiple regions. That means network partitions between regions are possible. I need to decide: when a partition happens, should the cart be consistent or available?"
Step 2 — Analyze the data: "A shopping cart has two critical operations: adding/removing items (writes) and viewing the cart (reads). If a user adds an item and then views their cart, they expect to see that item. But the consequences of a brief inconsistency aren't catastrophic — unlike a bank account, a shopping cart showing one extra or one missing item for a few seconds won't cause financial harm."
Step 3 — Choose AP for the cart itself: "I'd choose availability here. During a partition, both regions should keep accepting cart writes. If a user adds a hat in US-East while the partition is active, and adds shoes in EU-West, both writes succeed locally. When the partition heals, I'd merge the carts using a union strategy — the merged cart contains all items from both versions. The worst case is a slightly larger cart than intended, which is better than showing an error page."
Step 4 — But CP for checkout: "However, when the user clicks 'Place Order,' I'd switch to strong consistency. I need the real inventory count and the real cart contents at that moment. If a partition prevents me from confirming inventory, I'd rather show 'please try again in a moment' than accidentally sell an out-of-stock item. So: AP for browsing and cart management, CP for the final checkout."
Step 5 — Mention PACELC: "Under PACELC, my cart service is PA/EL — during partitions it favors availability (keeps accepting writes), and during normal operation it favors low latency (reads hit the nearest replica). My checkout service is PC/EC — during partitions it favors consistency (blocks until it can verify inventory), and during normal operation it also favors consistency (reads the leader for accurate stock counts). This means checkout is always a bit slower, but it's always correct."
Practice Exercises — Build Your CAP Intuition
Reading about CAP is the easy part. The hard part is applying it to real decisions — "should this feature be CP or AP?" These exercises start simple and build toward the kind of reasoning that wins system design interviews. Try each one before peeking at the hint.
Exercise 1: Classify a Single-Node Database
A single PostgreSQLAn open-source relational database known for strong consistency and ACID compliance. One of the most popular choices for transactional workloads. database with no replication — just one server, one disk, one copy of every row.
Questions: (a) Is this system CP, AP, or CA? (b) Now you add a streaming replicaA read-only copy of the database that receives changes from the primary server in near-real-time. Used for read scaling and disaster recovery. with asynchronous replication. How does the CAP classification change? (c) What if you switch to synchronous replication?
(a) Technically CA. With a single node, there's no network link between database servers that could break — so "partition tolerance" is meaningless. You get both consistency and availability (as long as the one server is up). The catch: if that one server dies, you get neither C nor A. That's why single-node "CA" is a bit of a trick answer.
(b) Async replica = AP during a partition. If the network between primary and replica breaks, the primary keeps accepting writes (available) and the replica keeps serving reads — but the replica's data is stale (not consistent). You chose availability over consistency.
(c) Sync replica = CP during a partition. With synchronous replication, every write must be confirmed by the replica before the primary acknowledges it. If the network breaks, writes block — the system sacrifices availability to maintain consistency.
Exercise 2: CP or AP, Feature by Feature
You're designing a ride-sharing app like Uber. For each feature below, decide: should it be CP (correct data, even if sometimes slow or unavailable) or AP (always responsive, even if data is briefly stale)?
Classify these: (a) Driver location on the map, (b) Payment processing, (c) Ride history, (d) Surge pricing display, (e) Driver-rider matching.
(a) Driver location → AP. A driver dot that's 5 seconds behind real position is fine. Showing "no drivers available" because you couldn't reach a replica is not — riders leave.
(b) Payment processing → CP. Double-charging a rider or losing a payment is unacceptable. It's better to show "payment processing..." for a few extra seconds than to charge the wrong amount.
(c) Ride history → AP. If your "past rides" list is 30 seconds behind, nobody notices or cares. Eventual consistency is perfect here.
(d) Surge pricing display → AP. Showing a price that's 10 seconds old is better than showing no price at all. The final charge uses the CP payment system anyway.
(e) Driver-rider matching → CP. If two riders get matched to the same driver, one of them gets stranded. Correctness matters more than speed here — use a consensus-based lock or queue.
Exercise 3: When Last-Writer-Wins Loses Data
DynamoDBAmazon's managed NoSQL database. Uses last-writer-wins by default for conflict resolution — the write with the latest timestamp overwrites all others. uses last-writer-wins (LWW)A conflict resolution strategy where the write with the most recent timestamp simply overwrites any competing write. Simple but can silently lose data. for conflict resolution. Describe a concrete scenario where this causes data loss. Then propose a CRDTConflict-free Replicated Data Type — a data structure designed so that concurrent updates from different replicas can always be merged automatically without data loss. Examples: counters, sets, registers.-based solution that avoids the problem entirely.
The scenario: Alice and Bob share a wishlist. During a network partition, Alice adds "Headphones" on replica A (timestamp T1), and Bob adds "Keyboard" on replica B (timestamp T2, slightly later). When the partition heals, LWW keeps Bob's write (T2 > T1) and silently drops Alice's "Headphones." Alice never knows her item was lost.
Why LWW fails here: The two writes aren't conflicting — they're adding different items. But LWW treats the entire wishlist as one value and picks the "latest" one, destroying the other.
CRDT solution — G-Set (grow-only set): Instead of replacing the whole wishlist, model it as a set where both replicas can only add items. When the partition heals, merge = union of both sets. Result: {Headphones, Keyboard}. No data lost, no conflict to resolve. For deletions, upgrade to an OR-Set (observed-remove set) which tracks add/remove operations.
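An OR-Set sketch under the same wishlist scenario: each add is tagged uniquely, a remove tombstones only the tags it has observed, so a concurrent add on another replica survives the merge. Names and structure here are illustrative:

```python
from uuid import uuid4

class ORSet:
    """Observed-remove set: a remove only cancels adds it has actually seen."""

    def __init__(self):
        self.added = {}    # item -> set of unique tags from add operations
        self.removed = {}  # item -> set of tombstoned tags

    def add(self, item):
        self.added.setdefault(item, set()).add(uuid4().hex)

    def remove(self, item):
        # Tombstone every tag observed so far; tags added concurrently on
        # another replica are not observed here, so those adds win.
        self.removed.setdefault(item, set()).update(self.added.get(item, set()))

    def merge(self, other):
        for item, tags in other.added.items():
            self.added.setdefault(item, set()).update(tags)
        for item, tags in other.removed.items():
            self.removed.setdefault(item, set()).update(tags)

    def contains(self, item):
        live = self.added.get(item, set()) - self.removed.get(item, set())
        return bool(live)

# During the partition: Alice adds on replica A, Bob adds on replica B
a, b = ORSet(), ORSet()
a.add("Headphones")
b.add("Keyboard")
a.merge(b)
b.merge(a)
# Both replicas now contain both items — no LWW-style silent loss
```

The cost is the tombstone bookkeeping: production CRDT libraries add garbage collection so the removed-tag sets don't grow forever.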
Exercise 4: Design a Multi-Region Replication Strategy
Your system serves users in three regions: US-East, EU-West, and AP-Southeast. Design the data replication strategy for each of these data types:
(a) User profile data (name, avatar, bio). (b) Financial transactions (payments, transfers). (c) A global chat system (messages between users).
For each, specify: CP or AP, the consistency model, and the estimated replication latency.
(a) User profiles → AP with eventual consistency. Use async replication across all three regions. If a user in EU updates their avatar and a friend in AP sees the old one for 100-200ms, nobody notices. Replication latency: ~100-200ms cross-region. Every region has a full replica for low read latency.
(b) Financial transactions → CP with strong consistency. Use synchronous replication or a RaftA consensus algorithm where a leader node coordinates writes. A write is only committed after a majority of nodes confirm it. Used by etcd, CockroachDB, and TiKV.-based consensus protocol. Write latency will be 150-300ms (one cross-region round-trip) because the leader must wait for at least one remote replica to confirm. Correctness is non-negotiable — a lost payment or double-charge is far worse than 300ms of latency.
(c) Global chat → AP with causal consistency. Messages must appear in the right order (causal), but a 50-100ms delay between regions is acceptable. Use vector clocksA mechanism where each node maintains a counter for every other node. By comparing vectors, you can determine whether two events are causally related or concurrent. or Lamport timestamps to preserve message ordering. Async replication keeps latency low within each region (~5ms local, ~100-200ms cross-region).
Exercise 5: How Spanner Bends the Rules
Google SpannerGoogle's globally distributed database that provides strong consistency with high availability. Uses GPS and atomic clocks (TrueTime) to synchronize clocks across data centers with ~7ms uncertainty. claims to be "effectively CA" despite being a globally distributed database. Explain: (a) How does it achieve this using TrueTimeGoogle's internal time API that uses GPS receivers and atomic clocks in every data center. It provides a time interval [earliest, latest] rather than a single timestamp, with uncertainty typically under 7 milliseconds.? (b) What are the limitations? (c) Could you build a Spanner-equivalent without atomic clocks?
(a) How TrueTime works: Every Google data center has GPS receivers and atomic clocks. TrueTime doesn't give you a timestamp — it gives you a time interval: "the real time is between [earliest, latest], and the gap is at most ~7ms." When Spanner commits a write, it does a commit-wait: it waits for the uncertainty window to pass (~7ms) before making the write visible. This guarantees that any later transaction will see a later timestamp — linearizability without coordination.
(b) Limitations: (1) You need Google-grade hardware infrastructure in every data center — GPS antennas, atomic clocks, redundant time servers. (2) Write latency has a floor of ~7ms (the commit-wait). (3) Partitions can still happen — Spanner is technically CP, but partitions are so rare in Google's private network (not the public internet) that availability stays above 99.999%. "Effectively CA" means "CP but partitions almost never happen."
(c) Without atomic clocks: Standard NTP gives ~100-200ms clock uncertainty. Commit-wait would add 100-200ms to every single write — unacceptable for most workloads. CockroachDB (open-source Spanner-inspired) uses NTP and works around it with "uncertainty intervals" that occasionally restart transactions when clock skew is detected. It's usable but slower and less predictable than Spanner. The honest answer: you can get close to Spanner without atomic clocks, but not equivalent.
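The commit-wait described in (a) is only a few lines once you have an uncertainty-bounded clock. A simulation of the idea — the 7ms figure is TrueTime's typical bound, here just a constant, and the whole thing is a sketch rather than how Spanner's internals actually look:

```python
import time

UNCERTAINTY = 0.007   # ~7ms: the assumed earliest/latest gap of the clock

def truetime_now():
    """Return the interval [earliest, latest] the true time is known to lie in."""
    t = time.monotonic()
    return (t - UNCERTAINTY, t + UNCERTAINTY)

def commit(write_fn):
    write_fn()                           # replicate the write (not shown here)
    _, latest = truetime_now()
    commit_ts = latest                   # pessimistic timestamp for the write
    # Commit-wait: block until the uncertainty window has provably passed, so
    # any transaction starting afterwards gets a strictly later timestamp —
    # linearizability without a coordination round.
    while truetime_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```

This conservative sketch waits roughly twice the uncertainty; the point of exercise (c) is that with NTP's ~100-200ms uncertainty, the same loop would stall every write for a fifth of a second.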
Cheat Sheet — CAP at a Glance
Pin this to your wall. These ten one-liners cover every concept on this page — perfect for quick review before an interview or when you're knee-deep in a design doc at 2 AM.
- CAP theorem: during a network partition, you must choose consistency or availability — not both.
- P is not optional: partitions are a fact of distributed life, so the real decision is CP vs AP.
- CP: refuse requests rather than serve stale data (Spanner, etcd, MongoDB by default).
- AP: always answer, even if the answer is stale (DynamoDB, Cassandra, DNS).
- CA only exists on a single node; add a replica and you must pick a side.
- PACELC: during a Partition, trade Availability vs Consistency; Else, trade Latency vs Consistency.
- Consistency is a spectrum: strong > sequential > causal > read-your-writes > eventual.
- Conflict resolution: LWW (simple, lossy), vector clocks (detect, don't resolve), CRDTs (merge automatically), application logic (you decide).
- "Eventually" is usually milliseconds: 1-5ms within a datacenter, 50-200ms cross-region.
- Mix modes per operation: CP for payments and checkout, AP for catalogs, carts, and feeds.
Connected Topics — Where to Go Next
CAP theorem doesn't live in a vacuum — it's the foundation that every other distributed systems topic builds on. Whether you're designing a database, choosing a message broker, or sizing a multi-region deployment, the CP-vs-AP decision shows up everywhere. Pick the topics that matter most for your next interview or project.