Databases

Scaling Databases

Your single PostgreSQL handles 5K reads/sec today. Your app is about to need 50K. Vertical scaling buys you 6 months. Read replicas, connection pooling, and failover buy you the next 5 years. Real configs, real commands, real companies.

Section 1

TL;DR — The One-Minute Version

Mental Model: A single database is like a single chef in a busy restaurant. When the orders pile up, you have two choices: get a bigger kitchen (vertical scaling) or hire more chefs (horizontal scaling). Read replicas (copies of your primary database that handle read queries, i.e. SELECTs, while the primary handles all writes; since most applications are 80-95% reads, replicas absorb the majority of traffic) are the extra chefs who can take orders and serve food, but only the head chef can change the recipe. Connection pooling (a middleman like PgBouncer between your app and the database: instead of each request opening its own expensive connection, a pool of 25 real connections is shared by 1,000 app requests, because no one talks forever) is the phone switchboard that stops every single customer from dialing the kitchen directly. You can verify a replica is running right now: SELECT * FROM pg_stat_replication; on the primary.

Most backend scaling problems aren't about your code being slow. They're about your single database drowning under traffic it was never designed to handle. The fix isn't rewriting your app. It's understanding three things: where reads go (replicas), how connections are managed (pooling), and what happens when the primary dies (failover). Master these three, and you can scale any relational database from 5K reads/sec to 500K reads/sec without changing a single line of application code.

[Diagram] Before: four app servers hammer one PostgreSQL with 50K req/sec; CPU at 98%, overloaded. After: the primary takes only writes (5K/sec) while three read replicas take 15K reads/sec each. 5K writes to the primary + 45K reads across 3 replicas = 50K total. Same data. Same queries. 10x throughput. Zero code changes.
What top companies actually do: Instagram runs PostgreSQL with a 100:1 read/write ratio — all reads hit replicas. GitHub uses MySQL primary + read replicas with ProxySQL splitting traffic automatically. Shopify shards across Vitess-managed MySQL clusters, each merchant on its own shard. CricBuzz puts all reads through Redis cache, writing once every 30 seconds per match. Database scaling isn't theoretical — it's what every high-traffic system does on day one of growing.
Section 2

The Scenario — 5K Reads/Sec Today, 50K in Six Months

You're the backend engineer for a growing SaaS product. Things are going well — too well. You launched 8 months ago, your user base is doubling every quarter, and your single PostgreSQL instance (the free, open-source relational database that Instagram, Stripe, Reddit, and Notion all run in production; it listens on port 5432 by default, stores data in 8 KB pages, and supports streaming replication, logical replication, and JSONB; spin one up with docker run -e POSTGRES_PASSWORD=test -p 5432:5432 postgres:16) on AWS RDS (Amazon's managed database service: you pick the engine and instance size, and AWS handles backups, patching, and failover; a db.r6g.xlarge gives you 4 vCPUs and 32 GB RAM for about $450/month, a db.r6g.4xlarge 16 vCPUs and 128 GB RAM for about $1,800/month) is handling everything: user authentication, product catalog, order processing, analytics queries, admin dashboards, and background jobs.

Right now, it's handling about 5,000 reads per second and 500 writes per second. Your db.r6g.xlarge instance (4 vCPUs, 32 GB RAM, $450/month) is sitting at 70% CPU. Comfortable. You open Grafana on a Monday morning and the chart looks healthy:

[Diagram] Your single PostgreSQL today: one RDS db.r6g.xlarge (4 vCPUs, 32 GB RAM, $450/mo) at 70% CPU, serving 5,000 reads/sec and 500 writes/sec. In 6 months: 50K reads/sec needed.

But then the growth projections land on your desk. Marketing just signed three enterprise deals. The sales team is promising a mobile app launch. Your product manager shows you a graph: user base will 10x in the next six months. That means your database needs to handle 50,000 reads per second and 5,000 writes per second — ten times what it handles today.

You do the math in your head. 50K reads/sec on a single PostgreSQL instance? That's roughly 40 vCPUs of read throughput, assuming your current 4 vCPUs handle 5K reads. The biggest RDS instance you can buy (db.r6g.16xlarge) has 64 vCPUs. So technically, it could handle it — but at $7,200 per month, and you'd be running at roughly 65% capacity with little headroom for spikes.
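That mental math can be written down. A sketch under this section's own simplifying assumption that read throughput scales linearly with vCPUs (the next sections explain why it actually doesn't):

```python
# Back-of-the-envelope capacity math from the scenario above.
# Assumes read throughput scales linearly with vCPUs -- optimistic,
# as the sub-linear-scaling discussion later shows.

current_reads = 5_000      # reads/sec today
current_vcpus = 4
target_reads = 50_000      # reads/sec needed in six months

reads_per_vcpu = current_reads / current_vcpus   # 1,250 reads/sec per vCPU
needed_vcpus = target_reads / reads_per_vcpu     # 40 vCPUs

biggest_vcpus = 64                               # db.r6g.16xlarge
utilization = needed_vcpus / biggest_vcpus       # ~0.63: little headroom

print(f"Need ~{needed_vcpus:.0f} vCPUs; that's {utilization:.1%} "
      f"of the biggest instance AWS sells")
```

Forty vCPUs of steady-state demand on a 64-vCPU machine leaves you one traffic spike away from saturation.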

Think First

Your database is at 70% CPU handling 5K reads/sec. You need to get to 50K reads/sec. You have two options: upgrade to a bigger machine, or add copies of your database that handle reads. What are the trade-offs of each approach? Think about cost, failure modes, and how long each solution lasts.

For upgrading: there's a ceiling to how big one machine can be. For copies: you need to keep them in sync. What happens if the sync falls behind?

This is the exact problem that every growing startup hits. Instagram hit it. GitHub hit it. Shopify hit it. Notion hit it. And they all solved it the same way — not by buying one massive database server, but by understanding which traffic can be split off and served by copies. Let's trace the same journey they took.

Follow along: If you have Docker, you can spin up a real PostgreSQL right now and try every command in this page. docker run --name pg-primary -e POSTGRES_PASSWORD=test -p 5432:5432 -d postgres:16. We'll set up an actual replica in Section 6.
Section 3

The First Attempt — Buy a Bigger Machine

The fastest fix for a database that's running hot is the simplest one: give it more power. In AWS, this means clicking a dropdown and upgrading your RDS instance class, the hardware tier of your managed database: each class specifies vCPUs, RAM, network bandwidth, and EBS throughput (db.r6g.xlarge = 4 vCPUs / 32 GB, db.r6g.4xlarge = 16 vCPUs / 128 GB, db.r6g.16xlarge = 64 vCPUs / 512 GB; the "r" family is memory-optimized, which is what you want for databases). This is called vertical scaling — making one machine bigger instead of adding more machines.

Here's what the upgrade path looks like in real numbers:

Instance         | vCPUs | RAM    | Est. reads/sec | Monthly cost | Headroom at 50K
db.r6g.xlarge    | 4     | 32 GB  | ~5K            | $450         | None (current)
db.r6g.2xlarge   | 8     | 64 GB  | ~10K           | $900         | Not enough
db.r6g.4xlarge   | 16    | 128 GB | ~20K           | $1,800       | Not enough
db.r6g.8xlarge   | 32    | 256 GB | ~38K           | $3,600       | Close but tight
db.r6g.16xlarge  | 64    | 512 GB | ~70K           | $7,200       | Enough for now

So the plan looks straightforward: upgrade from db.r6g.xlarge ($450/mo) to db.r6g.8xlarge ($3,600/mo) to handle the immediate growth, and if that's not enough, go all the way to db.r6g.16xlarge ($7,200/mo). Your cost goes up 16x but you get 14x the throughput. Seems reasonable.

The upgrade itself takes about 15 minutes of downtime (RDS modifies the instance in-place), or zero downtime if you use Multi-AZ. (RDS Multi-AZ keeps a synchronous standby replica in a different Availability Zone; during an upgrade, RDS upgrades the standby first, fails over to it, then upgrades the old primary, so you see a few seconds of connection interruption instead of 15 minutes of downtime. It costs roughly 2x the single-instance price.) You schedule it for 3 AM on a Sunday, run the upgrade, and wake up Monday morning to a database that's barely sweating.

[Diagram] Vertical scaling, step by step: xlarge (4 vCPU, 32 GB, 5K reads/sec, $450/mo) → 4xlarge (16 vCPU, 128 GB, 20K reads/sec, $1,800/mo; buys ~3 months) → 16xlarge (64 vCPU, 512 GB, 70K reads/sec, $7,200/mo; buys ~6 months). Then what? You've hit the ceiling: there is nothing bigger to buy.

This actually works. For a while. If your growth is moderate — say 2x over 12 months — vertical scaling might be all you ever need. Many startups run on a single, beefy database for years without issues. The db.r6g.16xlarge is a seriously powerful machine: 64 cores, 512 GB of RAM, 25 Gbps of network bandwidth. It can handle a lot.

But you've already done the math. At 10x growth in 6 months, even the biggest single machine is a temporary bandage. And there are two other problems you might not have thought about yet — problems that no amount of CPU can fix.

Vertical scaling isn't dumb — it's the right first move. Instagram ran on a single PostgreSQL instance for their first million users. GitHub ran on one MySQL server for years. Don't prematurely optimize. If a bigger machine buys you 6 months of runway, take it. Use that time to plan the horizontal strategy properly.
Section 4

Where It Breaks — The Three Walls of Vertical Scaling

Vertical scaling is a ladder leaning against a building: you can climb higher and higher, but eventually you reach the roof and there's nowhere left to go. For database scaling, that ceiling comes in three forms, and all three are hard limits that money alone can't get you past.

Wall 1: The Hardware Ceiling

The biggest machine AWS sells for RDS is the db.r6g.16xlarge: 64 vCPUs, 512 GB RAM, $7,200/month. That's it. There is no db.r6g.32xlarge. If your traffic grows beyond what 64 cores can handle, you cannot buy a bigger machine from the dropdown. And this isn't just a procurement problem: even where more exotic hardware exists, a single PostgreSQL process can't put 1,000 cores and 10 TB of RAM to efficient use.

And here's the thing most people don't realize: CPU scaling isn't linear. Going from 4 cores to 64 cores doesn't give you 16x the throughput. PostgreSQL keeps its buffer pool (cached data pages), WAL buffers, and lock tables in shared memory, a region of RAM that all backend processes share. When many processes read and write shared memory simultaneously, they compete for access — this is called contention. The more cores you add, the more processes compete and the more the lock overhead grows. In practice, going from 4 to 64 cores gives you roughly 10-12x the throughput, not 16x.
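You can get an intuition for that sub-linear curve from Amdahl's law. The 1% serialized fraction below is purely illustrative, not a measured PostgreSQL number; the point is how little serialized work it takes to turn 16x the cores into roughly 10x the throughput:

```python
# Toy model: Amdahl's law with a small serialized fraction standing in
# for shared-memory and lock contention. The 1% figure is illustrative,
# not a PostgreSQL benchmark.

def speedup(cores: int, serial_fraction: float = 0.01) -> float:
    """Maximum speedup over 1 core when `serial_fraction` of the
    work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

gain = speedup(64) / speedup(4)   # throughput of 64 cores vs 4 cores
print(f"16x the cores -> {gain:.1f}x the throughput")
```

With these numbers the ratio comes out around 10x, right in the 10-12x range quoted above. Real databases are messier than this model, but the shape of the curve is the same: the serialized slice dominates as core count grows.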

Wall 2: Single Point of Failure

No matter how big your single database is, it's still one machine. If that machine dies — hardware failure, kernel panic, EBS volume corruption, a bad patch — your entire application goes down. Every read, every write, every background job, every API endpoint. Everything. Gone.

AWS RDS Multi-AZ helps here (a standby replica in a different data center), but the failover still takes 60-120 seconds. During those 60-120 seconds, every request to your application fails. For a service like Instagram or GitHub, 60 seconds of downtime means millions of failed requests and a very loud PagerDuty alarm. (PagerDuty is the incident-management service that wakes on-call engineers at 3 AM via SMS, phone calls, and push notifications, escalating to the next person in the rotation if no one acknowledges within 5 minutes; the most hated and most essential tool in DevOps.)

Wall 3: Maintenance Downtime

Databases need maintenance. PostgreSQL needs VACUUM to clean up dead rows: it doesn't immediately delete old row versions, it marks them "dead" and cleans them up later, and without VACUUM your tables bloat and queries slow down as they skip over dead rows (check yours with SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;). Auto-vacuum runs in the background, but heavy writes can outpace it. Major version upgrades (PostgreSQL 15 to 16) require downtime. Schema migrations on large tables can lock the table for minutes. On a single database, every one of these operations either degrades performance (during the operation) or requires downtime (for major changes).

With only one database, you're stuck scheduling maintenance windows at 3 AM on Sunday and hoping nothing goes wrong. With replicas, you can upgrade one at a time, fail over to a fresh replica, and upgrade the old primary — all with zero downtime for your users.

[Diagram] The three walls. 1) Hardware ceiling: max 64 vCPUs / 512 GB, no bigger machine exists, and scaling is sub-linear (16x cores = 10-12x throughput, from lock contention and shared memory). 2) Single point of failure: one machine means one point of death (hardware failure, kernel panic, EBS corruption, bad patch), and RDS failover takes 60-120 seconds during which every request fails. 3) Maintenance downtime: VACUUM, major version upgrades, large ALTER TABLE locks, 3 AM maintenance windows (with replicas: zero-downtime upgrades). Vertical scaling is a temporary fix; every growing company eventually outgrows it.
The Real Wake-Up Call

On January 31, 2017, GitLab's production PostgreSQL went down after an engineer accidentally ran rm -rf on the data directory of the primary database. Their replicas were behind, their backups were broken, and they lost 6 hours of production data. The postmortem revealed that their entire system depended on one database with no verified backup restoration process. This is what happens when you scale vertically without building redundancy. Read the full postmortem →

Section 5

The Breakthrough — Split Reads From Writes

Here's the insight that changes everything: in most applications, reads vastly outnumber writes. Think about Instagram — for every photo uploaded (one write), that photo is viewed thousands of times (thousands of reads). For every tweet posted, it's displayed in millions of timelines. For every product listed on Amazon, it's browsed by thousands of shoppers before one person buys it.

The typical read-to-write ratio for a web application is somewhere between 80:20 and 99:1. Instagram reports roughly 100:1. Most SaaS products sit around 90:10. Even write-heavy systems like messaging apps are often 70:30 when you count all the "display this conversation" reads.

Think First

If 90% of your database traffic is reads and 10% is writes, and your single database is at 100% capacity, what happens if you could magically move all the reads to a different server? How much load would be left on the original server? And what would that server now be free to focus on?

90% of traffic disappears. The original server drops to 10% capacity. It can now focus entirely on writes — which are the hardest operations to scale because they modify data.

The breakthrough is embarrassingly simple: make copies of your database and send all reads to the copies. The original database, the primary (also called "master" or "leader", though "primary" is the modern preferred term), handles only writes: every INSERT, UPDATE, and DELETE goes there, and the primary streams its changes out to the copies. The copies, called read replicas, each receive that continuous stream of changes and apply it to their own copy of the data; they can serve read queries (SELECT) but cannot accept writes, and they run slightly behind the primary (milliseconds to seconds). Since reads are 90% of your traffic, you just moved 90% of the load off the primary. In PostgreSQL you can ask any server which role it plays: SELECT pg_is_in_recovery(); returns false on the primary and true on replicas (which run in "recovery mode").

[Diagram] Read/write splitting: your app sends all 50K req/sec to a proxy (ProxySQL / PgBouncer) that routes by query type. Writes (10%, 5K/sec) go to the primary, now at 15% CPU; reads (90%, 45K/sec) fan out across three replicas at 15K reads/sec each, fed by the replication stream. Need more? Add another replica; each one adds 15K reads/sec of capacity.

And here's the best part: this scales linearly. Need more read capacity? Add another replica. Each replica you add gives you another 15K reads/sec. Need 150K reads/sec? Run 10 replicas. Need 300K? Run 20. There's no ceiling (well, not until your primary can't stream changes fast enough, but that's a problem at millions of writes/sec, not thousands).
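Capacity planning for read replicas then becomes simple division. A sketch, assuming the ~15K reads/sec-per-replica figure used in this section, plus one spare replica (N+1 provisioning, an assumption added here) so that losing a node doesn't overload the rest:

```python
import math

# Sketch: how many replicas for a target read load?
# Assumes ~15K reads/sec per replica (the figure used in this section)
# and one spare (N+1) so a replica failure doesn't overload the others.

def replicas_needed(target_reads: int,
                    per_replica: int = 15_000,
                    spares: int = 1) -> int:
    return math.ceil(target_reads / per_replica) + spares

for target in (45_000, 150_000, 300_000):
    print(f"{target:>7} reads/sec -> {replicas_needed(target)} replicas")
```

The N+1 spare is the difference between "a replica died, latency ticked up" and "a replica died, the survivors saturated and fell behind on replication".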

Let's compare the economics:

Approach              | Setup                                | Reads/sec | Monthly cost | Failure impact
Vertical (single)     | 1 × db.r6g.16xlarge                  | ~70K      | $7,200       | Total outage (60-120s failover)
Horizontal (replicas) | 1 primary + 3 replicas (all 4xlarge) | ~65K      | $7,200       | Lose 1 replica = 33% less read capacity (no outage)

Same cost. Same throughput. But the replica setup survives hardware failures gracefully — if one replica dies, the other two keep serving reads while the failed one is replaced. And you can scale beyond 70K reads/sec just by adding more replicas, while the vertical approach is at its absolute ceiling.

Instagram's numbers: Instagram uses PostgreSQL with a 100:1 read/write ratio. Their primary handles a relatively modest write load while dozens of read replicas serve the enormous read traffic (every time you open the app, scroll your feed, view a profile, check comments). They use pglogical for logical replication to keep specific tables in sync across replicas.
Section 6

How It Works — The Five Building Blocks

Splitting reads from writes sounds simple in theory. In practice, you need five pieces working together: a replication mechanism to keep copies in sync, a routing layer to direct traffic, connection pooling to manage resources, a failover strategy for when things die, and a consistency model to handle the gap between write and read. Let's open each one up.

1. Replication — Keeping Copies in Sync

When a user inserts a row on the primary, that change needs to show up on every replica. But when does it show up? Right away, or eventually? That single question defines the two fundamental replication modes, and everything else follows from it.

Synchronous replication means the primary waits for at least one replica to confirm it received and wrote the change before telling the client "OK, your write is committed." The write is slower (the primary has to wait for the network round-trip to the replica), but you're guaranteed that if the primary dies, at least one replica has every single committed transaction.

Asynchronous replication means the primary commits the write locally, tells the client "done," and then streams the change to replicas in the background. The write is fast (no waiting), but there's a gap — the replica might be milliseconds to seconds behind the primary. If the primary dies during that gap, those un-replicated writes are lost.

[Diagram] Synchronous: the client INSERTs; the primary writes its WAL, streams it to the replica, and waits for the replica's ACK before replying "committed". Slower writes, zero data loss. Asynchronous: the primary writes its WAL and immediately replies "committed", streaming to the replica in the background; the replica catches up after a gap, during which it is behind. Fast writes, possible data loss. Most production systems use async replication (faster) plus semi-sync for critical data (safer).

Here's how you actually set up PostgreSQL streaming replication. On the primary server, you enable replication in postgresql.conf:

postgresql.conf (primary)
# Enable WAL streaming to replicas
wal_level = replica                    # Log enough info for replicas to reconstruct
max_wal_senders = 10                   # Up to 10 replicas can connect
wal_keep_size = 1GB                    # Keep 1GB of WAL for slow replicas to catch up
synchronous_commit = on                # Local flush only, unless synchronous_standby_names is set;
                                       # then 'on' waits for the standby ('remote_apply' is stricter)
# synchronous_standby_names = 'ANY 1 (r1, r2)'   # example names; must match standby application_name

On the replica, you point it at the primary. In PostgreSQL 12+, you create a file called standby.signal and set the connection info in postgresql.conf:

postgresql.conf (replica)
# Connect to the primary and stream changes
primary_conninfo = 'host=pg-primary port=5432 user=replicator password=secret'
hot_standby = on          # Allow read queries while in recovery mode

Once the replica starts, you can verify it's connected by running this on the primary:

check-replication.sql (run on primary)
-- See all connected replicas, their lag, and sync state
SELECT client_addr,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       sync_state
FROM pg_stat_replication;

-- Example output:
-- client_addr  | state     | sent_lsn    | replay_lsn  | sync_state
-- 10.0.1.52    | streaming | 0/3A000148  | 0/3A000148  | async
-- 10.0.1.53    | streaming | 0/3A000148  | 0/3A000060  | async
--                                          ^ this one is slightly behind

The sent_lsn vs replay_lsn difference tells you exactly how far behind each replica is. When they match, the replica is fully caught up. When replay_lsn is behind sent_lsn, the replica received the data but hasn't applied it yet — this is replication lag: the delay between when a write is committed on the primary and when it's visible on a replica. In async replication it's typically 1-100ms on the same network, and it's usually CPU-bound (the replica is busy serving reads AND applying WAL), not network-bound. You can measure it on the replica: SELECT now() - pg_last_xact_replay_timestamp() AS lag;

Arpit's insight: Replication lag is usually CPU-bound, not network-bound. The replica is doing two things at once: serving your read queries AND applying the WAL stream from the primary. If you're hammering a replica with heavy analytical queries, it falls behind on applying WAL — not because the network is slow, but because the CPU is busy answering your queries instead of processing the replication stream.

2. Routing — Directing Traffic

Having replicas is useless if your application still sends all queries to the primary. You need a way to split traffic: writes go to the primary, reads go to replicas. There are three ways to do this, from simple to sophisticated.

Option A: Application-level routing. Your code explicitly decides. This is what most small teams start with. In your data access layer, you have two connection strings — one for the primary, one for replicas — and your code picks which to use based on the operation:

app-level-routing.py
# Pseudocode — your app decides where each query goes
primary_conn = connect("host=pg-primary dbname=myapp")
replica_conn = connect("host=pg-replica-1 dbname=myapp")

def get_user(user_id):
    # Reads go to replica
    return replica_conn.execute("SELECT * FROM users WHERE id = %s", user_id)

def update_user(user_id, name):
    # Writes go to primary
    primary_conn.execute("UPDATE users SET name = %s WHERE id = %s", name, user_id)

Simple, but fragile. What if a replica goes down? What about load balancing across multiple replicas? What about read-your-writes consistency?
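One step up from the hardcoded version, still at the application level: round-robin reads across several replicas and fall back to the primary when a replica is unreachable. This is a sketch; the connection objects are assumed to expose an execute() method and raise ConnectionError on failure, which is illustrative rather than any particular driver's real API:

```python
import itertools

class RoutingPool:
    """Minimal read/write splitter with replica fallback (sketch).

    `primary` and `replicas` are assumed to be connection-like objects
    with an .execute(query, *params) method that raises ConnectionError
    when the server is unreachable -- adapt to your real driver.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)
        self._replica_count = len(replicas)

    def read(self, query, *params):
        # Try each replica once, in round-robin order
        for _ in range(self._replica_count):
            replica = next(self._replicas)
            try:
                return replica.execute(query, *params)
            except ConnectionError:
                continue   # dead replica: try the next one
        # All replicas down: degrade gracefully to the primary
        return self.primary.execute(query, *params)

    def write(self, query, *params):
        # Writes always go to the primary
        return self.primary.execute(query, *params)
```

Even this version has no health checks, no weighting, and no read-your-writes handling, which is exactly why most teams graduate to proxy-level routing.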

Option B: Proxy-level routing with ProxySQL (MySQL) or PgBouncer + HAProxy (PostgreSQL). This is what GitHub, Shopify, and most mid-size companies use. A proxy sits between your app and the databases, inspects each query, and routes it automatically. GitHub uses ProxySQL with MySQL to route reads to replicas and writes to the primary — the application code just sends queries to one endpoint, and ProxySQL figures out where they go.

proxysql.cnf (MySQL read/write splitting)
# ProxySQL routes queries by inspecting the SQL
mysql_servers =
(
  { address="primary.db" , port=3306, hostgroup=0 },    # Writes (hostgroup 0)
  { address="replica-1.db", port=3306, hostgroup=1 },   # Reads (hostgroup 1)
  { address="replica-2.db", port=3306, hostgroup=1 },   # Reads (hostgroup 1)
  { address="replica-3.db", port=3306, hostgroup=1 }    # Reads (hostgroup 1)
)

mysql_query_rules =
(
  { match_pattern="^SELECT", destination_hostgroup=1 },  # All SELECTs → replicas
  { match_pattern=".*",      destination_hostgroup=0 }   # Everything else → primary
)

Option C: Cloud-native routing. Amazon Aurora (and RDS Multi-AZ DB clusters) give you a built-in reader endpoint: a single URL that automatically load-balances across all replicas using DNS round-robin. Your app connects to mydb-cluster.cluster-ro-xxxx.us-east-1.rds.amazonaws.com for reads and mydb-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com for writes. Zero proxy setup, zero configuration — but you get less control over routing logic.

GitHub's real setup: GitHub uses ProxySQL in front of MySQL. Every application server connects to ProxySQL on localhost (running as a sidecar). ProxySQL inspects each query, routes SELECTs to the nearest healthy replica, and routes writes to the primary. If a replica goes down, ProxySQL detects it via health checks and removes it from the rotation within seconds — no application changes needed.

3. Connection Pooling — Managing Resources

Here's a problem most people don't think about until it bites them: every database connection costs memory. A PostgreSQL connection takes about 5-10 MB of RAM on the server. If you have 20 application servers, each with 50 threads, and each thread opens its own database connection, that's 1,000 connections consuming 5-10 GB of RAM just for connection overhead — RAM that isn't available for caching your data.

Worse, PostgreSQL uses a process-per-connection model: unlike MySQL (which uses threads), PostgreSQL forks a new operating system process for every client connection, each with its own memory space. This is safer (a crash in one connection doesn't kill the others) but more expensive — each process needs its own copy of query plans, temp buffers, and sort memory, which is why connection limits matter more in PostgreSQL than in MySQL. And context-switching between 1,000 processes is expensive: the CPU spends more time switching between connections than actually executing queries.

The fix is a connection pooler — a lightweight proxy that maintains a small pool of actual database connections and shares them across all your application requests. The most popular one for PostgreSQL is PgBouncer: written in C, it uses only ~2 KB of memory per client connection (versus 5-10 MB for a real PostgreSQL connection) and can map 10,000+ client connections onto just 25 actual database connections.

pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# THE KEY SETTINGS:
max_client_conn = 1000       # Accept up to 1,000 connections from your app
default_pool_size = 25       # But only open 25 real connections to PostgreSQL
reserve_pool_size = 5        # 5 extra connections for spikes
pool_mode = transaction      # Share connections between requests (recommended)

That's it. Your 1,000 application connections now share 25 real database connections. The database server sees only 25 processes instead of 1,000. Memory usage drops from 5-10 GB to ~125 MB. CPU context-switching drops by 40x.
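The arithmetic behind those numbers, written out (the 5 MB figure is the low end of the 5-10 MB range quoted above, so the totals here are conservative):

```python
# Connection-overhead math behind the PgBouncer numbers above.
MB_PER_PG_CONN = 5          # low end of 5-10 MB per PostgreSQL backend
KB_PER_BOUNCER_CONN = 2     # PgBouncer's per-client overhead

app_conns = 20 * 50         # 20 app servers x 50 threads = 1,000
pool_size = 25              # real connections PgBouncer keeps open

direct_mb = app_conns * MB_PER_PG_CONN                  # 5,000 MB, no pooler
pooled_mb = (pool_size * MB_PER_PG_CONN
             + app_conns * KB_PER_BOUNCER_CONN / 1024)  # ~127 MB with PgBouncer

print(f"direct: {direct_mb} MB, pooled: {pooled_mb:.0f} MB, "
      f"{app_conns // pool_size}x fewer server processes")
```

Note that the PgBouncer side of the ledger barely registers: a thousand client connections cost it about 2 MB.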

[Diagram] Without pooling: 20 app servers × 50 threads = 1,000 connections straight into PostgreSQL; 5-10 GB of RAM for connections alone and CPU thrashing from context switches. With PgBouncer (pool=25): the same 1,000 app connections terminate at PgBouncer, which holds just 25 real connections to PostgreSQL; ~125 MB of RAM for connections and 40x less context switching. Same queries, 40x fewer resources. PgBouncer uses ~2 KB per client connection; PostgreSQL uses ~5-10 MB per real connection.
The pool_mode matters: PgBouncer's transaction mode reassigns connections between transactions — ideal for web requests. Session mode keeps a connection for the entire client session — needed if you use SET commands, prepared statements, or temp tables. Most web apps should use transaction mode. If you use session mode, you lose most of the pooling benefit.

4. Failover — When the Primary Dies

Your primary database just crashed. Maybe a disk failed. Maybe the kernel panicked. Maybe an engineer ran a bad migration. Whatever the cause, your primary is gone and every write is failing. What happens next depends entirely on whether you planned for this moment.

Manual failover means a human (you, at 3 AM) gets paged, logs into the infrastructure, promotes a replica to primary, updates connection strings, and restarts the application. This takes 10-30 minutes. During those minutes, your application can serve reads from replicas but can't process any writes — no new orders, no sign-ups, no state changes.

Automatic failover means software detects the primary is down and promotes a replica within seconds. There are two popular approaches:

AWS RDS Multi-AZ: Amazon maintains a synchronous standby in a different Availability Zone. When the primary dies, RDS automatically promotes the standby and updates the DNS endpoint. Your app reconnects within 60-120 seconds. You don't touch anything. This is what most teams on AWS use — it costs roughly 2x the price of a single instance but gives you automatic failover with zero data loss.

Patroni (self-managed PostgreSQL): If you're running PostgreSQL on bare metal or EC2 instances, Patroni is the standard tool — an open-source project from Zalando that automates PostgreSQL failover and runs in production at GitLab and Zalando, among others. It uses etcd (the distributed key-value store Kubernetes also uses for its control plane) as a consensus store: the Patroni nodes agree via etcd on which PostgreSQL instance is the primary, and if the primary stops sending heartbeats, Patroni elects and promotes the most up-to-date replica, handling promotion, DNS updates, and replication reconfiguration automatically. Failover takes 5-30 seconds depending on configuration.

[Diagram] Automatic failover timeline: 0s, the primary dies. ~5s, Patroni (via etcd heartbeats) detects the loss. ~15s, the most up-to-date replica is promoted to new primary. ~20s, DNS/VIP is updated to point at it and traffic is restored. Total downtime: 15-30 seconds, zero human intervention; reads from the other replicas continue uninterrupted throughout.
Key detail: During failover, reads keep working (other replicas are still alive). Only writes are interrupted for those 15-30 seconds. This is a massive advantage over a single-database setup where EVERYTHING stops.
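On the application side you can make writes ride out that 15-30 second window instead of surfacing errors, by retrying with exponential backoff until the new primary is reachable. A sketch: execute_write stands in for your driver's write call and is assumed to raise ConnectionError while failover is in progress:

```python
import time

def write_with_retry(execute_write, query,
                     max_wait=60.0, base_delay=0.5, max_delay=10.0):
    """Retry a write with exponential backoff during failover (sketch).

    `execute_write` is a stand-in for your driver's write call; it is
    assumed to raise ConnectionError until the new primary is promoted
    and DNS points at it.
    """
    deadline = time.monotonic() + max_wait
    delay = base_delay
    while True:
        try:
            return execute_write(query)
        except ConnectionError:
            if time.monotonic() + delay > deadline:
                raise    # failover took longer than we're willing to wait
            time.sleep(delay)
            delay = min(delay * 2, max_delay)   # 0.5s, 1s, 2s, ... capped
```

Pair this with idempotent writes (for example, unique request IDs) so that a retry after an ambiguous failure can't apply the same change twice.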

5. Consistency — Handling the Gap Between Write and Read

Here's the most common bug teams hit after setting up read replicas: a user updates their profile name, the write goes to the primary, the page refreshes, the read goes to a replica that's 200ms behind, and the user sees their old name. They click "save" again. Now they have a duplicate write. They refresh again. Still the old name. They think the app is broken.

This is the read-your-writes consistency problem. Read-your-writes is the guarantee that after you write something, any subsequent read by the same user sees that write. (It doesn't mean other users see it immediately — that's a stronger guarantee called "strong consistency". Read-your-writes is about you seeing your own changes without delay.) Here, the write succeeded, but the user can't see their own change because the read was served by a replica that hasn't caught up yet.

The most common fix is simple: after a write, read from the primary for a short window. The user writes to the primary, and for the next 5 seconds, all their reads also go to the primary. After 5 seconds, the replicas have caught up (typical replication lag is 1-100ms), and reads switch back to replicas.

read-your-writes.py
# After a write, pin the user's reads to the primary for 5 seconds.
# `primary`, `replica`, and `redis` are assumed pre-configured clients.
import time

WRITE_PIN_SECONDS = 5

def handle_write(user_id, query):
    primary.execute(query)
    # Record this user's last write time in Redis (or a cookie);
    # the key expires on its own after the pin window
    redis.set(f"last_write:{user_id}", time.time(), ex=WRITE_PIN_SECONDS)

def handle_read(user_id, query):
    last_write = redis.get(f"last_write:{user_id}")
    if last_write and (time.time() - float(last_write)) < WRITE_PIN_SECONDS:
        # User wrote recently — read from primary to guarantee consistency
        return primary.execute(query)
    # Safe to read from a replica
    return replica.execute(query)

For PostgreSQL, you can also enforce read-only connections on replicas to prevent accidental writes:

enforce-read-only.sql
-- Run on replica connections to prevent accidental writes
SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY;

-- Now any INSERT/UPDATE/DELETE on this connection will fail:
-- ERROR: cannot execute INSERT in a read-only transaction

And you can measure the actual replication lag on any replica, right now:

pg-replication-lag.sql
-- Run on replica: check how far behind this replica is
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Output: 00:00:00.047  (47 milliseconds behind — totally fine)

-- Run on primary (see all replicas):
SELECT client_addr,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
mysql-replication-lag.sql
-- Run on replica:
SHOW REPLICA STATUS\G

-- Look for this field:
-- Seconds_Behind_Source: 0
-- (0 means fully caught up, any positive number is lag in seconds)

-- Note: MySQL 8.0.22+ renamed the replication commands and status fields
-- (SHOW SLAVE STATUS → SHOW REPLICA STATUS, Seconds_Behind_Master → Seconds_Behind_Source)
redis-replication-lag.sh
# Run on Redis primary:
redis-cli INFO replication

# Look for:
# role:master
# connected_slaves:2
# slave0:ip=10.0.1.52,port=6379,state=online,offset=123456,lag=0
# slave1:ip=10.0.1.53,port=6379,state=online,offset=123450,lag=1
#                                                            ^ 1 second behind

# Also check for disconnected replicas:
# master_link_down_since_seconds: 0  (0 = connected, >0 = trouble)
CricBuzz's approach (from Arpit Bhayani): CricBuzz doesn't even read from the database for live match data. All reads go through Redis cache (~12KB per match). The database writes once every 30 seconds (when the score updates). With writes that infrequent, replication lag is essentially zero — every replica has time to fully catch up before the next write arrives. This is the ideal scenario: if your write rate is low, replication lag becomes a non-issue.
Section 7

Going Deeper — The Details That Bite You in Production

Everything in Section 6 works on paper. But production has a way of finding the edge cases that diagrams skip over. This section covers the four gotchas that experienced database engineers have learned the hard way — each one discovered through real outages and late-night debugging sessions.

Think First

A user updates their profile name. Your app writes to the primary. 200 milliseconds later, the user refreshes the page — and sees the old name. The data hasn't reached the replica yet. How would you fix this without sending all reads to the primary? (Hint: think about which reads need to be "fresh" and which can tolerate a small delay.)

One approach: after a write, direct that specific user's reads to the primary for the next N seconds. Everyone else still reads from replicas. This is called "read-your-own-writes" consistency.

Replication lag is the time between when a write is committed on the primary and when it becomes visible on a replica. In a perfectly idle system, this is under a millisecond. In production, it can spike to seconds or even minutes. Understanding why is the key to fixing it.

Cause 1: CPU contention on the replica. This is the most common cause and the one most people miss. Your replica is doing two jobs simultaneously: serving your read queries AND applying the WAL streamWrite-Ahead Log — a sequential log of every change made to the database. When you INSERT, UPDATE, or DELETE a row, PostgreSQL first writes the change to the WAL file (sequential write, very fast), then updates the actual data pages later. Replicas receive a copy of this WAL stream and "replay" it to reconstruct the same changes. The WAL is what makes crash recovery and replication possible. from the primary. If you're running heavy analytical queries on the replica (big JOINs, GROUP BYs, full table scans), the CPU is busy answering those queries instead of processing the replication stream. The WAL bytes pile up, and lag increases.

You can see this happening in real time. On the primary, run:

diagnose-lag.sql (run on primary)
-- See how far behind each replica is, in bytes and time
SELECT client_addr,
       state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       pg_wal_lsn_diff(sent_lsn, write_lsn) AS network_lag_bytes,
       pg_wal_lsn_diff(write_lsn, replay_lsn) AS apply_lag_bytes
FROM pg_stat_replication;

-- If network_lag_bytes is high → the replica isn't receiving data fast enough (network issue)
-- If apply_lag_bytes is high → the replica received data but can't apply it fast enough (CPU issue)
-- In practice, apply_lag is almost always the culprit — Arpit is right, it's CPU-bound

Fix for CPU contention: Separate your workloads. Use one set of replicas for your application's fast reads (point lookups, simple queries) and a different replica for heavy analytics. The analytics replica can fall behind without affecting your users.

Cause 2: Long-running transactions on the replica. PostgreSQL replicas have a conflict: if a long-running query is reading data that the replication stream wants to modify, the replica has to wait. By default, PostgreSQL cancels the long query after max_standby_streaming_delay (default: 30 seconds). If you increase this to avoid canceling queries, your lag grows. If you decrease it, analytical queries get killed.
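The two settings in tension look like this on the replica — a sketch, where 30s is the default and hot_standby_feedback is a judgment call rather than a universal recommendation:

postgresql.conf (replica, sketch)
```
max_standby_streaming_delay = 30s   # how long WAL replay waits behind a conflicting query before canceling it
hot_standby_feedback = on           # replica reports its oldest running query to the primary;
                                    # fewer canceled queries, but vacuum on the primary is delayed (bloat risk)
```

hot_standby_feedback trades one problem for another: the replica's long queries stop getting canceled, but the primary can't vacuum away row versions those queries might still need.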

Cause 3: Write spikes on the primary. A batch import or migration generates a massive burst of WAL. The replica can't process it as fast as it's generated. This is usually temporary — lag spikes during the burst and recovers afterward. If it doesn't recover, your replica is undersized.

CPU CONTENTION Replica busy serving reads Can't apply WAL fast enough Fix: Separate analytics replicas from app replicas LONG TRANSACTIONS Query conflicts with WAL WAL replay blocked by queries Fix: hot_standby_feedback or max_standby_delay tuning WRITE SPIKES Batch imports, migrations WAL generated faster than applied Fix: Throttle batch writes or use bigger replica instances Pro tip: Alert on lag > 1 second. Investigate if it stays above 5 seconds for > 1 minute. Normal lag: 1-50ms. Acceptable lag: up to 1s. Problem lag: > 5s sustained.

Synchronous replication is safe but slow. Asynchronous is fast but risky. MySQL offers a middle ground called semi-synchronous replication, and it's what many production MySQL deployments actually use.

Here's how it works: the primary waits for at least one replica to acknowledge that it received the WAL data (wrote it to the relay log), but does not wait for the replica to apply it to its data files. So the primary knows the data is safely on two machines (durability), but the replica might not have it queryable yet (consistency).

Think First

Semi-sync waits for the replica to receive the data but not apply it. If the primary crashes right after getting the ACK, the replica has the data in its relay log but hasn't applied it yet. Is the data safe or lost?

The data is in the relay log file on the replica's disk. When the replica starts up, it will apply the relay log. So the data is safe — it just might not be visible for a few milliseconds until the relay log is replayed.
enable-semi-sync.sql (MySQL)
-- On the primary:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- Wait up to 1 second for ACK
-- If no replica ACKs within 1 second, fall back to async (don't block writes forever)

-- On the replica:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;

-- Check status:
SHOW STATUS LIKE 'Rpl_semi_sync%';
-- Rpl_semi_sync_master_status: ON
-- Rpl_semi_sync_master_yes_tx: 4523  (commits that got ACK'd)
-- Rpl_semi_sync_master_no_tx: 0      (commits that timed out to async)

The key setting is rpl_semi_sync_master_timeout. If no replica ACKs within that window, the primary falls back to asynchronous mode — it won't block writes indefinitely. This is important because a network partition between primary and replicas shouldn't halt your entire application. Most teams set this to 500ms-2000ms.

Mode | Write Latency | Data Safety | When to Use
Async | Fastest (no waiting) | Possible loss on primary crash | Read-heavy apps where losing last few writes is acceptable
Semi-sync | +0.5-2ms per write | Data on 2 machines (disk, not applied) | Most production MySQL deployments (GitHub, Shopify)
Sync | +2-10ms per write | Zero data loss, fully applied | Financial systems, payment processing
PostgreSQL equivalent: PostgreSQL doesn't call it "semi-sync," but you get similar behavior with synchronous_commit = remote_write (waits for the replica to write to its OS cache but not flush to disk) or remote_apply (waits for the replica to apply the change and make it queryable). The default, on, guarantees only local durability unless synchronous_standby_names is also set — in which case on additionally waits for the replica to flush WAL to disk.
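Sketched as configuration — the standby name is illustrative, and note that this makes every commit wait on that standby, so apply it deliberately:

pg-semi-sync.sql (sketch)
```sql
-- Make 'replica1' (illustrative name) a synchronous standby
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
-- remote_write ≈ semi-sync: WAL is in the replica's OS cache, not yet flushed or applied
ALTER SYSTEM SET synchronous_commit = 'remote_write';
SELECT pg_reload_conf();
```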

When we say "replication," there are actually two fundamentally different approaches happening under the hood, and understanding the difference matters when you need to do things like replicate only certain tables or replicate between different PostgreSQL versions.

Physical replication (also called streaming replication) sends raw WAL bytesThe actual binary data that represents disk-level changes. When PostgreSQL updates a row, it modifies a specific data page on disk. Physical replication sends those raw byte changes to the replica, which applies them to the same page locations. The replica ends up as a byte-for-byte copy of the primary — same files, same page layout, same everything. from the primary to the replica. The replica applies these byte-level changes to its data files, ending up as an exact binary copy of the primary. This is what pg_basebackup and streaming replication use. It's fast, simple, and reliable — but the replica must be the same PostgreSQL major version, the same OS architecture, and it replicates everything (all databases, all tables).

Logical replication sends decoded row-level changes: "INSERT this row into the users table," "UPDATE row 42 in the orders table." The replica interprets these logical operations and applies them. This is slower but far more flexible: you can replicate specific tables, replicate between different PostgreSQL versions, replicate to a different database schema, or even replicate to a different database engine entirely.

PHYSICAL (Streaming) Sends: raw WAL bytes (binary) Speed: Fast (byte copy) Scope: ALL tables, ALL databases Limit: Same PG version required Limit: Replica is read-only Used by: pg_basebackup, streaming rep LOGICAL (Row-level) Sends: decoded SQL operations Speed: Slower (decode + apply) Scope: Selected tables only Flex: Cross-version, cross-engine Flex: Target can have extra indexes Used by: pglogical, PG 10+ PUBLICATION Instagram uses pglogical (logical) to replicate specific tables to analytics replicas
logical-replication-setup.sql (PostgreSQL 10+)
-- On the primary: publish specific tables
CREATE PUBLICATION my_pub FOR TABLE users, orders, products;

-- On the replica: subscribe to the publication
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=primary port=5432 dbname=mydb'
  PUBLICATION my_pub;

-- Only users, orders, products replicate. Everything else stays local.
-- The replica can have its own additional tables, indexes, and even writable tables.
When to use which: Use physical for standard read replicas (simple, fast, full copy). Use logical when you need to replicate only specific tables, do cross-version upgrades (replicate from PG 15 to PG 16, switch traffic, done — zero-downtime major upgrade), or feed data to an analytics system.

In Section 6 we configured PgBouncer with pool_mode = transaction. But there are actually three pool modes, and picking the wrong one either wastes your connection pool or breaks your application. Let's understand each one.

Session mode: A client gets a dedicated backend connection for its entire session (connect → multiple queries → disconnect). The connection isn't shared until the client disconnects. This is the safest mode — it supports everything PostgreSQL supports (prepared statements, SET commands, LISTEN/NOTIFY, temp tables). But it provides minimal pooling benefit, because connections are reserved even while the client is idle between queries.

Transaction mode: A client gets a backend connection only for the duration of each transaction. Between transactions, the connection goes back to the pool. This is the sweet spot for most web applications: each HTTP request opens a transaction, runs a few queries, commits, and the connection is immediately available for the next request. But it breaks features that span transactions: prepared statements, session-level SET commands, advisory locks, LISTEN/NOTIFY.

Statement mode: Each individual SQL statement gets its own connection. This is the most aggressive pooling but the most restrictive — you can't even use multi-statement transactions. Rarely used in practice.
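You can see transaction mode's core limitation directly: two transactions from the same client may land on different backend connections, so session state silently evaporates between them. A sketch:

pool-mode-gotcha.sql (sketch)
```sql
BEGIN;
SET LOCAL statement_timeout = '2s';  -- SET LOCAL is safe: scoped to this transaction only
COMMIT;

BEGIN;
SHOW statement_timeout;  -- may show the default — this transaction can be on a DIFFERENT backend
COMMIT;

-- Session-level SET, prepared statements, advisory locks, and LISTEN/NOTIFY
-- all break for the same reason: they assume one persistent backend connection.
```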

Pool Mode | Connection Sharing | Efficiency | Supports | Breaks
session | Per client session | Low | Everything | Nothing
transaction | Per transaction | High | Regular queries, transactions | Prepared stmts, SET, LISTEN, temp tables
statement | Per SQL statement | Highest | Simple queries only | Transactions, everything above

How to decide: Start with transaction mode. If your app uses server-side prepared statements (many ORMs, such as Rails, do by default), you have two options: (1) disable them in your ORM's connection config (Rails: prepared_statements: false; Django users behind a transaction pooler should also set DISABLE_SERVER_SIDE_CURSORS = True, since server-side cursors break in transaction mode too), or (2) use PgBouncer 1.21+, which added prepared-statement support in transaction mode.

pgbouncer-admin.sh
# Connect to PgBouncer's admin console
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer

# See current pool statistics
SHOW POOLS;
# database | user    | cl_active | cl_waiting | sv_active | sv_idle | pool_mode
# mydb     | webapp  | 847       | 12         | 23        | 2       | transaction
#                      ^ 847 app    ^ 12 waiting  ^ 23 busy    ^ 2 idle
#                        connections   for a conn    DB conns     DB conns

# See connected clients
SHOW CLIENTS;

# See backend (real) connections
SHOW SERVERS;
The "cl_waiting" trap: If cl_waiting in SHOW POOLS is consistently above zero, your app connections are queuing up waiting for a real database connection. This means your pool is too small. Increase default_pool_size gradually. But don't set it too high — if you set it to 500, PostgreSQL is back to managing 500 processes and you've defeated the purpose. The sweet spot is usually 2-4x your CPU count (e.g., 25-50 for a 16-vCPU database).
Section 8

The Variations — Different Topologies for Different Problems

Everything so far has been about one topology: a single primary that handles writes, with read replicas that handle reads. But that's not the only way to scale a database. Different problems call for different architectures, and knowing which topology fits which problem is what separates a junior who "adds replicas" from a senior who designs the right replication strategy.

Think First

What if you allowed two databases to both accept writes? User A writes to Database 1 and User B writes to Database 2 at the same time. Both update the same row — User A sets status = 'shipped' and User B sets status = 'cancelled'. When the two databases sync, which value wins? How would you resolve this conflict?

This is the fundamental problem with multi-primary replication. There's no universal answer — some systems use "last-write-wins" (timestamp), some use application-level conflict resolution, and some avoid the problem by ensuring the same row is only written on one primary.

Primary-Replica vs Multi-Primary — One Writer or Many?

In the primary-replica topology (what we've been discussing), one database accepts all writes and streams changes to replicas. This is the most common setup and works beautifully for the vast majority of applications. It's what Instagram, GitHub, Shopify, and most production systems use.

But what if writes are your bottleneck? What if you need 100K writes/sec and a single primary can only handle 20K? This is where multi-primary (also called multi-master) replication comes in: multiple databases all accept writes, and they synchronize changes with each other.

Multi-primary sounds perfect in theory. In practice, it introduces the hardest problem in distributed systems: write conflictsWhen two primaries accept writes to the same row at roughly the same time, they each have a different version of that row. When they try to synchronize, they conflict. Example: User 42's name is updated to "Alice" on Primary A and "Bob" on Primary B at the same instant. Which one wins? There's no universal answer — you need conflict resolution rules (last-write-wins, merge, or reject). This is why multi-master is rarely used for regular applications.. If two primaries accept a write to the same row at the same time, which one wins? There's no clean answer. Every conflict resolution strategy has trade-offs, and bugs in conflict resolution are some of the nastiest to debug.
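To make "last-write-wins" concrete, here's a minimal sketch of timestamp-based conflict resolution. Names and timestamps are illustrative, and real systems must also contend with clock skew between primaries:

conflict-lww.py (sketch)
```python
# Minimal last-write-wins (LWW) merge: each primary stores (value, timestamp)
# per row; when the primaries sync, the newer timestamp wins. Illustrative only.
def lww_merge(row_a, row_b):
    """Each row is a dict like {'value': ..., 'ts': unix_seconds}."""
    return row_a if row_a["ts"] >= row_b["ts"] else row_b

a = {"value": "shipped",   "ts": 1700000000.120}  # written on Primary A
b = {"value": "cancelled", "ts": 1700000000.450}  # written on Primary B
winner = lww_merge(a, b)
# "cancelled" wins because it carries the later timestamp — and the "shipped"
# update is silently discarded, which is exactly why LWW is dangerous here.
```

Whichever update carries the later timestamp wins and the other vanishes without a trace — acceptable for a "last seen at" field, terrifying for an order status.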

PRIMARY-REPLICA (recommended) Primary (W) Replica (R) Replica (R) Replica (R) No conflicts. Simple. Predictable. Write bottleneck = single primary MULTI-PRIMARY (rare, complex) Primary A Primary B sync WRITE CONFLICTS Same row updated on both primaries Higher write throughput but... conflict resolution is a nightmare Rule of thumb: Use primary-replica unless you've exhausted every other option for write scaling
When multi-primary makes sense: (1) Geographic distribution — you need writes to succeed in both US and EU data centers without cross-Atlantic latency. (2) Extremely high write throughput that a single primary genuinely can't handle. (3) Active-active failover where both data centers must be fully writable. In all other cases, primary-replica is simpler, cheaper, and more reliable.

So far we've assumed all our databases are in the same data center (or AWS region). But what happens when your users are worldwide? A user in Tokyo connecting to a database in Virginia experiences ~150ms of network latency on every query. For a page that fires 10 database queries, that's 1.5 seconds of pure network wait — before the database even starts working.

The fix: put read replicas near your users. Run replicas in us-east-1 (Virginia), eu-west-1 (Ireland), and ap-northeast-1 (Tokyo). Each user reads from the nearest replica. Writes still go to the primary in one region, but reads are served locally.
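The routing decision itself is simple. A sketch, with illustrative hostnames and regions:

geo-read-routing.py (sketch)
```python
# Region-aware routing sketch: writes always go to the single primary,
# reads go to the nearest replica. Hostnames/regions are illustrative.
REPLICAS = {
    "us-east-1":      "replica.us-east-1.db.internal",
    "eu-west-1":      "replica.eu-west-1.db.internal",
    "ap-northeast-1": "replica.ap-northeast-1.db.internal",
}
PRIMARY = "primary.us-east-1.db.internal"

def read_host(user_region):
    # Unknown region? Fall back to the replica co-located with the primary.
    return REPLICAS.get(user_region, REPLICAS["us-east-1"])

def write_host():
    return PRIMARY  # every write pays the cross-region round trip
```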

The challenge: replication lag increases across regions. Within the same data center, lag is 1-50ms. Across the Atlantic, the physical speed of light adds ~70ms of minimum latency (New York to London). Across the Pacific, it's ~120ms. Your replicas in Tokyo will always be at least 120ms behind the primary in Virginia. For reads that need strong consistency (showing a user their own just-submitted data), you might need to route those specific reads back to the primary.

Think First

Your primary is in us-east-1. You have replicas in eu-west-1 and ap-northeast-1. A user in Tokyo updates their email, then immediately refreshes their profile. If the read goes to the Tokyo replica, which is 120ms behind, they see the old email. How do you solve this without routing ALL reads to the US primary?

Think about the read-your-writes pattern from Section 6, Card 5. Can you apply it here? Route only the user's own reads to primary for a few seconds after they write.
US-EAST-1 (Virginia) PRIMARY Replica EU-WEST-1 (Ireland) Replica (~70ms lag) AP-NE-1 (Tokyo) Replica (~120ms lag) ~70ms ~120ms
Real-world example: Notion runs their primary PostgreSQL in US-East and has cross-region replicas for their global user base. Slack does something similar with MySQL. The pattern is always the same: one primary for writes (low latency within a region), replicas near users for reads (low latency globally).

People use the words "partitioning" and "sharding" interchangeably. They shouldn't. Arpit BhayaniA software engineer known for deep-dive content on system design and databases. Former engineer at Amazon and Unacademy. His YouTube channel and newsletter cover database internals, distributed systems, and real-world architecture decisions with a focus on understanding the WHY behind every design choice. draws a sharp distinction that clears up the confusion:

Partitioning splits a large table into smaller pieces, but all pieces live on the same database server. Think of it as organizing files into folders on the same hard drive. PostgreSQL's built-in PARTITION BY does this. It helps query performance (the planner skips irrelevant partitions) but doesn't help with server capacity — you're still on one machine.
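PostgreSQL's declarative partitioning looks like this — table name and date ranges are illustrative:

partition-by-range.sql (sketch)
```sql
-- Horizontal partitioning by range: smaller pieces, same server
CREATE TABLE orders (
    id         bigint NOT NULL,
    created_at timestamptz NOT NULL,
    total      numeric
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Partition pruning: a query filtered to March 2025 scans only orders_2025
```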

Sharding splits data across different database servers. Each server holds a subset of the data. This is true horizontal scaling — adding more machines to handle more traffic. But it comes with enormous complexity: cross-shard queries, distributed transactions, rebalancing when you add/remove shards.

Arpit's 2×2 Matrix: Partitioning vs Sharding SAME SERVER (Partition) DIFFERENT SERVERS (Shard) HORIZONTAL (split rows) VERTICAL (split columns) Horizontal Partitioning orders_2024, orders_2025 on same DB PG: PARTITION BY RANGE (created_at) Horizontal Sharding users A-M on Server 1, N-Z on Server 2 Shopify: each merchant on a Vitess shard Vertical Partitioning user_profile + user_blob in separate tables Keep hot columns separate from cold BLOBs Vertical Sharding Users DB on Server 1, Orders DB on Server 2 Service-per-database microservices pattern Start with partitioning (same server). Only move to sharding when one server isn't enough. Sharding adds cross-server JOINs, distributed transactions, and rebalancing complexity.

Shopify's real sharding: Shopify uses VitessAn open-source database clustering system originally built at YouTube/Google. It sits in front of MySQL instances and handles sharding transparently. Your application sends queries to Vitess (which looks like a single MySQL), and Vitess routes them to the correct shard. It handles shard splitting, online schema changes, and query routing. Shopify migrated to Vitess in 2020 to handle their scale. to shard their MySQL databases. Each merchant's data lives on a specific shard. When a merchant's traffic grows, Vitess can split their shard transparently. The application code doesn't know about shards — it sends queries to Vitess, which routes them to the right MySQL instance.

The Sharding Rule of Thumb

Don't shard until you've exhausted these options in order: (1) optimize queries (indexes, query rewriting), (2) add read replicas, (3) add caching, (4) vertical partition hot vs cold data. Sharding is a one-way door — once you shard, cross-shard JOINs become impossible or extremely expensive, and you can't easily go back. Shopify, Instagram, and GitHub all ran on a single (replicated) database for years before sharding.

Section 9

At Scale — How Real Companies Do It

Theory is great, but what does database scaling actually look like at companies handling millions of users? Let's walk through four real-world setups — each one solves a different kind of scaling problem.

Think First

GitHub serves 100 million developers. Every git push writes to the database, and every page view reads from it. Their read/write ratio is roughly 50:1. If a single primary handles 10K writes/sec and each replica handles 15K reads/sec, how many replicas does GitHub need to absorb 500K reads/sec? What's the bottleneck — reads or writes?

500K / 15K = ~34 replicas for reads. But 10K writes/sec from one primary? At that scale, they need write sharding too. This is why GitHub switched to Vitess — to shard writes across multiple primaries.

Instagram stores billions of photos, likes, and comments. Early on, they ran on a single PostgreSQL database. When that maxed out, they didn't jump straight to sharding. They first added read replicas to absorb the "show me my feed" traffic. But eventually, even the write load overwhelmed the primary — too many likes, comments, and follows per second.

Their solution: shard by user_id. Every user's data (photos, likes, comments, follows) lives on a specific shard determined by user_id % num_shards. They use pglogicalA PostgreSQL extension for logical replication that lets you selectively replicate specific tables or subsets of data between PostgreSQL instances. Unlike physical replication (which copies everything byte-for-byte), pglogical decodes changes into row-level operations. Instagram used it to migrate data between shards and keep cross-shard references eventually consistent. to replicate reference data (like user profiles) across all shards, so a feed query on Shard 3 can still resolve usernames without cross-shard JOINs.
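The modulo routing described above reduces to a few lines. A minimal sketch — hostnames are illustrative, and real deployments (Instagram's included) add a layer of logical shards mapped to physical servers so data can move without re-hashing every key:

shard-routing.py (sketch)
```python
# user_id % num_shards routing — illustrative hostnames
NUM_SHARDS = 4
SHARDS = [f"shard-{i}.db.internal" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    # All of this user's photos, likes, comments, and follows live on one shard
    return SHARDS[user_id % NUM_SHARDS]

host = shard_for(7)  # user 7 → shard-3.db.internal
```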

Instagram: user_id % N Sharding App + Routing Layer user_id % 4 = shard Shard 0 user_id % 4 = 0 photos, likes, follows Shard 1 user_id % 4 = 1 photos, likes, follows Shard 2 user_id % 4 = 2 photos, likes, follows Shard 3 user_id % 4 = 3 photos, likes, follows pglogical: replicate user_profiles table across all shards (reference data) Feed queries resolve usernames locally — no cross-shard JOINs needed
Key insight: Instagram doesn't shard every table. Reference data like user profiles is replicated to all shards via pglogical. Only user-generated content (photos, likes, comments) is sharded by user_id. This avoids 90% of cross-shard JOIN headaches.

GitHub runs on MySQL (not PostgreSQL) and handles over 10,000 queries per second across their fleet. Every application server talks to ProxySQLA high-performance MySQL proxy that sits between your application and MySQL. It can route read queries to replicas, manage connection pools, cache frequent queries, and perform query rewriting. GitHub runs ProxySQL as a sidecar on every app server, so the app connects to localhost:3306 and ProxySQL handles the rest. running as a local sidecar. The app connects to localhost:3306 and ProxySQL transparently routes queries to the right backend.

ProxySQL inspects every query: SELECT queries go to the nearest healthy replica. INSERT, UPDATE, DELETE go to the primary. If a replica fails a health check, ProxySQL removes it from the pool within seconds. The app never knows a replica went down.

GitHub: ProxySQL Sidecar on Every App Server App Server + ProxySQL sidecar localhost:3306 MySQL Primary All writes Replica 1 Replica 2 WRITES READS Replica down? Removed in <5 seconds
Why sidecar, not central proxy? A central ProxySQL would be a single point of failure. By running ProxySQL on every app server, each instance handles only local traffic. If one ProxySQL crashes, only that app server is affected. GitHub scales this to hundreds of app servers, each with its own ProxySQL routing 10K+ QPS in aggregate.

CricBuzz handles millions of concurrent users during major cricket matches. But here's the thing — live match data isn't that big. A complete match state (score, overs, batsmen, bowlers, commentary) fits in roughly 12 KB of JSON. And it only changes once every 30 seconds (when a ball is bowled).

Their insight: don't hit the database for live scores at all. Instead, one background worker writes the match state to RedisAn in-memory data store that can serve reads in microseconds (vs milliseconds for PostgreSQL). Data lives in RAM, making it incredibly fast for hot data that's read millions of times. Redis can handle 100K+ reads/sec on a single instance. The tradeoff: it's volatile (data can be lost on crash unless persistence is configured), and it's more expensive per GB than disk. every ~30 seconds. All user-facing reads come from Redis. The actual PostgreSQL database is only used for permanent storage (match history, player records, analytics).

CricBuzz: Redis Front, DB for Archival Scorer Worker 1 write / 30 sec Redis 12 KB / match state 100K+ reads/sec Millions of Users GET /live-score <1ms response PostgreSQL (archival) Match history, analytics async persist

The takeaway: sometimes the best way to scale your database is to not hit it at all. If your data is small, changes infrequently, and is read millions of times, put it in an in-memory cache and let the database handle only durable storage.
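The pattern itself is tiny. A sketch with a dict standing in for Redis (all names are illustrative):

cache-front-pattern.py (sketch)
```python
# One writer refreshes a small cached blob every N seconds; every reader
# hits the cache. The database only sees the occasional durable write.
import json

cache = {}  # stand-in for Redis

def scorer_update(match_id, state):
    # The ONLY writer — runs once per score change (~every 30s)
    cache[f"match:{match_id}:state"] = json.dumps(state)

def get_live_score(match_id):
    # Millions of reads/sec land here; the database is never touched
    return json.loads(cache[f"match:{match_id}:state"])

scorer_update(12345, {"score": "154/3", "overs": 16.4})
```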

Slack used PostgreSQL for years. Their messages table was the hottest table in the system — every message send, every channel load, every search hit it. As Slack grew to millions of workspaces, the single messages table became the bottleneck. Read replicas helped for a while, but the write load (every sent message = an INSERT) eventually overwhelmed the primary.

Slack migrated their messages to VitessAn open-source database clustering system built at YouTube. It puts a proxy layer in front of MySQL, handles sharding transparently, and supports online schema changes and shard splitting. Your application connects to Vitess as if it were a normal MySQL, and Vitess routes each query to the correct shard based on the sharding key you configure. — the same system YouTube and Shopify use. Vitess sits in front of MySQL instances, handles sharding by workspace_id, and makes the whole cluster look like a single database to the application. The migration took months of dual-writes and shadow reads, but the result was a system that could handle 10x the message volume without sweating.

Why not just shard PostgreSQL? PostgreSQL doesn't have a built-in sharding proxy. You'd need to build routing logic in your app layer, handle shard splits manually, and manage distributed transactions yourself. Vitess gives you all of that out of the box, but it only works with MySQL. This is why Slack migrated from PostgreSQL to MySQL+Vitess — the tooling for sharding MySQL is significantly more mature.
Section 10

Anti-Lessons — Things People Get Wrong

These are popular "rules of thumb" that sound right but are dangerously incomplete. They spread through blog posts and interview prep sites, and people repeat them without thinking through the consequences.

Think First

Your team proposes: "Let's use async replication everywhere — it's faster, and we'll deal with failures later." You have two workloads: a social media "like" counter and a payment processing system. Should you use async for both? What's the worst case for each if the primary crashes 0.5 seconds after a write?

Losing a "like" = annoying. Losing a confirmed payment = regulatory incident. The replication strategy should match the data's importance.

Yes, async replication is faster. But "faster" means the primary doesn't wait for replicas to confirm they received the data. If the primary crashes after committing a transaction but before the replica receives it, that transaction is permanently lost. Gone. No recovery possible.

For a social media "like" counter? Losing a few likes on crash is annoying but survivable. For a bank transfer? Losing a committed payment is a regulatory incident. The question isn't "async or sync" — it's "what's the cost of losing the last N seconds of writes?"

Async Replication: The Data Loss Window T1: Write committed (on primary disk) T2: WAL streaming... (replica hasn't received yet) T3: PRIMARY CRASHES Write from T1 is LOST DATA LOSS WINDOW (typically 0.1-5 seconds)
The fix: Use semi-synchronous replication for critical data (payments, orders). The primary waits for at least one replica to ACK receipt before confirming the commit. Adds ~1ms latency but guarantees the data exists on two machines. Use async for everything else (analytics, logs, sessions).

Anti-lesson: "Read replicas will fix my write bottleneck." This is one of the most common misconceptions in system design interviews. Read replicas do exactly one thing: absorb read traffic. They don't accept writes. If your primary is maxed out because of too many INSERTs and UPDATEs, adding 10 read replicas does nothing for you. Zero improvement.

Write bottlenecks require fundamentally different solutions: vertical scaling the primary (bigger CPU, faster SSD), write batching, queue-based writes, or — the nuclear option — sharding. Replicas only help when reads are the bottleneck, which is true for most apps (80-95% reads), but not all.

Quick test: Run SELECT * FROM pg_stat_user_tables ORDER BY n_tup_ins + n_tup_upd + n_tup_del DESC LIMIT 5; on your primary. If the top tables are doing millions of writes and your CPU is high, replicas won't help. You need to optimize writes or shard.

Anti-lesson: "Read throughput scales linearly with replica count." It's tempting to think: 1 replica handles 10K reads/sec, so 5 replicas handle 50K, and 10 replicas handle 100K. But there's a hidden cost: replication overhead on the primary. Every replica is a consumer of the primary's WAL stream. The primary has to send WAL data to each replica, and each additional replica adds network I/O and CPU load to the primary.

In practice, you hit diminishing returns around 5-7 replicas. Beyond that, the primary spends so much time streaming WAL to replicas that its write performance degrades. The fix is cascading replication: Replica A streams from the primary, and Replicas B and C stream from Replica A. The primary only manages one replication connection.

[Chart] Read throughput vs. number of replicas (1 to 15): the expected (linear) curve diverges from the actual curve past ~7 replicas, as primary WAL overhead causes diminishing returns.
Rule of thumb: 3-5 replicas is the sweet spot for most setups. If you need more read capacity, use cascading replication or add a caching layer (Redis/Memcached) in front of the replicas.
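Cascading replication needs no special mode: a downstream replica simply points its `primary_conninfo` at another replica instead of the primary. A sketch, with hostnames and credentials as placeholders:

```ini
# postgresql.conf on Replica B (sketch; hostnames/credentials are placeholders)
# standby.signal is present; Replica B streams from Replica A, not the primary:
primary_conninfo = 'host=replica-a port=5432 user=replicator password=secret application_name=replica-b'
hot_standby = on
```

Replica A must itself allow senders (`max_wal_senders > 0`), since it now acts as an upstream for B and C.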
Section 11

Common Mistakes — 6 Ways Teams Shoot Themselves

These aren't hypothetical. Every one of these mistakes has caused production incidents at real companies. They're easy to make and easy to prevent — if you know what to look for.

You set up replicas, route reads to them, and everything works great. Then one day, a big migration runs on the primary. The replica falls behind by 30 seconds. Users start seeing stale data — they update their profile, refresh, and see the old name. Support tickets flood in. You have no idea what's happening because you never set up a lag alert.

monitor-lag.sql (PostgreSQL)
-- Run on the PRIMARY to check all replicas:
SELECT
  client_addr,
  state,
  sent_lsn,
  replay_lsn,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
  replay_lag
FROM pg_stat_replication;

-- Alert if replay_lag > 1 second or replay_lag_bytes > 100MB
-- Most monitoring tools (Datadog, Prometheus) have built-in PostgreSQL integrations
Rule: Set up alerting on replication lag before you route any production traffic to replicas. Alert at 1 second, page at 10 seconds. If you're not monitoring pg_stat_replication, you're flying blind.

A classic race condition. The user submits a form (write goes to primary), the page redirects (read goes to replica), and the replica hasn't received the write yet. The user sees "nothing saved" and hits the button again. Now you have duplicate data.

This is called the read-your-writes consistency problem. The fix: after a write, read from the primary for a short window (1-5 seconds), then fall back to replicas. Some ORMs and proxies support this out of the box.

read-your-writes.py (Application-level)
# Sticky reads after writes (read-your-writes consistency)
import time

STICKY_WINDOW_SECONDS = 5.0

def get_connection(user_session):
    last_write = user_session.get("last_write_timestamp")
    if last_write is not None and (time.time() - last_write) < STICKY_WINDOW_SECONDS:
        return primary_connection   # Read from primary for 5s after a write
    return replica_connection       # Otherwise, read from a replica

def handle_write(user_session, query):
    primary_connection.execute(query)
    user_session["last_write_timestamp"] = time.time()  # Mark write time
ProxySQL can do this automatically: Configure mysql_query_rules with flagOUT to route reads to the primary for a configurable window after any write on the same connection.

Every Friday afternoon at 5 PM, traffic spikes. Each new request opens a database connection. PostgreSQL forks a new process for each one. At 500 connections, the server slows. At 1,000, it starts rejecting connections with "FATAL: sorry, too many clients already." At 1,500, the OOM killer terminates PostgreSQL entirely.

The fix is connection pooling (Section 6 covered PgBouncer in detail). But the mistake teams make is deploying PgBouncer in production without load-testing it first. PgBouncer's transaction pooling mode breaks session-level features like SET variables, prepared statements, and advisory locks. Test your full query suite against PgBouncer before going live.

The 100-connection rule: If your app has more than 100 connections to PostgreSQL, you need a connection pooler. No exceptions. PostgreSQL was designed for tens of connections, not thousands.
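A minimal pgbouncer.ini along these lines (host, database name, and auth details are placeholders):

```ini
; pgbouncer.ini sketch (host, database, and auth details are placeholders)
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 25
max_client_conn = 1000
```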

You've set up a primary and two replicas. Great. They're all in us-east-1. If AWS has a regional outage (and they do — us-east-1 went down in December 2021), your entire database layer goes dark. Replicas in the same region as the primary don't protect you against regional failures.

The fix: at least one cross-region replica. Put a replica in us-west-2 or eu-west-1. It'll have higher replication lag (50-100ms vs 1-5ms for same-region), but it's your insurance against a regional disaster. Use async replication for cross-region — waiting for a 100ms round-trip on every commit would destroy write performance.

AWS RDS makes this easy: Create a cross-region read replica with one click. RDS handles the networking, encryption in transit, and automatic catch-up. If us-east-1 goes down, you can promote the cross-region replica to a standalone primary.
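With the AWS CLI, that one click looks roughly like this. For cross-region replicas the source must be given as an ARN, and the command runs against the destination region (identifiers, account ID, and regions below are placeholders):

```shell
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica-usw2 \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:mydb-primary \
  --region us-west-2
```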

You set up Patroni (or RDS Multi-AZ), pat yourself on the back for having automatic failover, and never test it. Six months later, the primary actually dies. Patroni tries to promote the replica, but the replica's pg_hba.conf doesn't allow connections from the application servers. Or the DNS record TTL is 5 minutes, so apps keep connecting to the dead primary for 5 minutes after failover.

Failover is a mechanism, not a guarantee. Test it. Monthly. In production (during low traffic). Use Patroni's patronictl switchover to trigger a planned failover and verify everything works: DNS updates, connection rerouting, replication catch-up, zero data loss.

test-failover.sh
# Patroni: trigger a planned switchover
patronictl switchover --master primary-node --candidate replica-1 --force

# Verify:
patronictl list           # Check new primary is replica-1
psql -h primary-dns -c "SELECT pg_is_in_recovery();"  # Should be false
psql -h replica-dns -c "SELECT pg_is_in_recovery();"   # Should be true

# Check replication is flowing:
psql -h primary-dns -c "SELECT * FROM pg_stat_replication;"

Your payments table uses async replication because "it's faster." One day the primary crashes after committing 50 payment records that haven't replicated yet. You promote the replica. Those 50 payments are gone. Customers were charged, but your database has no record of the charges. Now you're reconciling with your payment gateway manually.

The fix: use semi-synchronous replication for tables where data loss is unacceptable. In PostgreSQL, name at least one standby in synchronous_standby_names and set synchronous_commit = remote_write on the payment service's connection (the setting does nothing if no synchronous standby is named). In MySQL, enable rpl_semi_sync_master_enabled. The extra 1-2ms of latency is nothing compared to the cost of lost financial records.
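Assuming a synchronous standby is already named on the primary, the durability level can be raised per session, so only the payment service pays the latency. A sketch:

```sql
-- In the payment service's session (sketch):
SET synchronous_commit = remote_write;  -- wait for a standby to receive the WAL

-- Less critical services can opt out of waiting on standbys:
-- SET synchronous_commit = local;      -- wait for local flush only
```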

Arpit Bhayani's 2×2 matrix: Plot your data on two axes — read frequency (low/high) and loss tolerance (acceptable/unacceptable). High-read + loss-acceptable (session data, analytics) = async replicas. Low-read + loss-unacceptable (payments, audit logs) = semi-sync + WAL archiving. This matrix is the fastest way to choose your replication strategy in an interview.
Section 12

Interview Playbook — "Scale This Database"

The Prompt: "Your PostgreSQL database handles 5K QPS reads but the product is growing and you'll need 50K QPS soon. How do you scale it?"

This is one of the most common system design interview questions. The difference between levels isn't knowledge — it's depth and tradeoff awareness. Here's what each level looks like:

Junior Answer (Passing)

"First, I'd check the read-to-write ratio. If it's 90% reads like most apps, I'd add read replicas. PostgreSQL supports streaming replication — I'd set up 2-3 replicas and route SELECT queries to them using a proxy like PgBouncer or ProxySQL. The primary handles all writes.

I'd also add a connection pooler (PgBouncer) to reduce the connection overhead on PostgreSQL. Instead of 1,000 direct connections, PgBouncer would share 25 real connections across all app servers."

Why this passes: Shows understanding of read/write splitting and connection pooling. Mentions specific tools. Knows the 90% read ratio heuristic.

Mid-Level Answer (Strong)

"Before adding infrastructure, I'd profile the queries. pg_stat_statements shows the top queries by total time. Often, adding an index or rewriting a query gets you 5x improvement for free.

If we still need to scale after optimization, I'd add 3 read replicas with streaming replication. I'd use PgBouncer in transaction mode with default_pool_size = 25 on each app server. For read routing, I'd use ProxySQL or application-level routing that sends SELECTs to replicas.

I'd also address replication lag: for the first 2-5 seconds after a write, reads for that user go to the primary (read-your-writes consistency). I'd set up monitoring on pg_stat_replication and alert if lag exceeds 1 second.

For failover, I'd use Patroni with etcd as the consensus store. It detects primary failure and promotes a replica automatically. I'd test failover monthly."

Why this is strong: Starts with profiling (cheapest fix first). Mentions specific configs. Addresses replication lag, read-your-writes, monitoring, and failover. Shows operational maturity.

Senior Answer (Exceptional)

"I'd approach this in phases. Phase 1 (this week): enable pg_stat_statements, identify the top 10 queries by total_exec_time, and add missing indexes. This alone often gets 3-5x improvement.

Phase 2 (this month): deploy PgBouncer as a sidecar on each app server. Transaction pooling mode, 25 connections per pool. This drops PostgreSQL's connection count from thousands to tens, freeing RAM for buffer cache.

Phase 3 (next month): set up 3 streaming replicas. Two in the same AZ for low-latency reads, one cross-region for disaster recovery. Use semi-sync for the same-AZ replicas (data safety) and async for cross-region (latency). Implement read-your-writes at the application layer with a 5-second sticky window per user session.

Phase 4 (if needed): add a Redis caching layer for hot queries (user profiles, product listings). Cache invalidation via PostgreSQL LISTEN/NOTIFY on updates. This handles the next 10x without touching the database schema.

Phase 5 (last resort): if writes become the bottleneck, evaluate vertical partitioning first (move audit_logs to a separate DB). Only shard if single-server write capacity is truly exhausted. Sharding is a one-way door — cross-shard JOINs, distributed transactions, and rebalancing are permanent operational costs."

Monitoring from day 1: pg_stat_replication for lag, pg_stat_activity for active connections, pg_stat_user_tables for seq scans (missing indexes). Patroni + etcd for automated failover, tested monthly with planned switchovers.

Why this is exceptional: Phased approach (cheapest first). Specific configs and tools. Addresses both reads and writes. Mentions cache invalidation strategy. Acknowledges sharding as last resort with clear tradeoffs. Shows operational awareness (monitoring, failover testing).
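The day-1 monitoring mentioned in the answer maps to three queries (a sketch):

```sql
-- Replication lag per replica:
SELECT client_addr, replay_lag FROM pg_stat_replication;

-- Active connections right now:
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Tables with heavy sequential scans (candidates for missing indexes):
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;
```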
The Scaling Decision Tree (always go left to right):
1. OPTIMIZE: indexes, queries (cost: $0, risk: low)
2. POOL: PgBouncer (cost: low, risk: low)
3. REPLICATE: read replicas (cost: medium, risk: medium)
4. CACHE: Redis / Memcached (cost: medium, risk: medium)
5. SHARD: last resort (cost: high, risk: high)
Always exhaust the cheaper option before moving to the next one. Steps 1-2 take hours. Step 3 takes days. Step 4 takes weeks. Step 5 takes months. Instagram, GitHub, and Shopify all ran on a single (replicated) database for YEARS before sharding.
Section 13

Practice Exercises

Hands-on is the only way to really learn this. Each exercise builds on the previous one. Do them in order.

Exercise 1: Set Up PostgreSQL Streaming Replication with Docker Easy

Create a Docker Compose file that spins up one PostgreSQL primary and two streaming replicas. Insert a row on the primary and verify it appears on both replicas within seconds.

  • Use the official postgres:16 Docker image for all three containers
  • The primary needs wal_level = replica and max_wal_senders = 5 in postgresql.conf
  • Create a replication user: CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';
  • On replicas, use pg_basebackup to clone the primary, then configure primary_conninfo in postgresql.conf
  • Verify replication: SELECT * FROM pg_stat_replication; on the primary should show both replicas
docker-compose.yml
version: '3.8'
services:
  primary:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: mydb
    command: >
      postgres
        -c wal_level=replica
        -c max_wal_senders=5
        -c hot_standby=on
    ports: ["5432:5432"]
    volumes: ["primary_data:/var/lib/postgresql/data"]

  replica1:
    image: postgres:16
    environment:
      PGUSER: replicator
      PGPASSWORD: secret
    depends_on: [primary]
    # Use entrypoint script to pg_basebackup from primary

  replica2:
    image: postgres:16
    environment:
      PGUSER: replicator
      PGPASSWORD: secret
    depends_on: [primary]

volumes:
  primary_data:

After starting, create the replication user on the primary, then configure each replica to stream from it. Run SELECT * FROM pg_stat_replication; on the primary to verify both replicas are connected and streaming.
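One way to perform the clone step inside each replica container, sketched with the names from the compose file above (the replication slot name is a placeholder):

```shell
# Clone the primary's data directory before the replica's first start.
# -R writes standby.signal and appends primary_conninfo for you;
# -C -S creates a replication slot so the primary retains needed WAL.
pg_basebackup -h primary -U replicator \
  -D /var/lib/postgresql/data \
  -R -X stream -C -S replica1_slot
```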

Exercise 2: Measure Replication Lag Easy

Using your Docker setup from Exercise 1, write a script that inserts a row on the primary with a timestamp, then polls the replica every 10ms until the row appears. Measure the lag in milliseconds. Run it 100 times and calculate p50, p95, and p99 lag.

  • Insert: INSERT INTO lag_test (ts) VALUES (clock_timestamp()) RETURNING id;
  • Poll replica: SELECT ts FROM lag_test WHERE id = $1; — loop until row exists
  • Lag = replica_poll_time - insert_timestamp
  • On a local Docker network, expect <10ms p50 and <50ms p99
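The lag samples themselves require a live primary/replica pair, but the percentile math is pure Python. A nearest-rank sketch (the sample values are hypothetical):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the ceil(p/100 * N)-th smallest sample."""
    ordered = sorted(samples)
    k = max(1, -(-len(ordered) * p // 100))  # ceiling division, 1-indexed rank
    return ordered[k - 1]

# Hypothetical lag measurements in milliseconds:
lags = [3, 5, 4, 8, 50, 6, 5, 7, 4, 9]
print(percentile(lags, 50), percentile(lags, 95), percentile(lags, 99))  # 5 50 50
```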
Exercise 3: Configure PgBouncer Medium

Add PgBouncer to your Docker Compose setup. Configure it in transaction mode with default_pool_size = 10. Connect your app through PgBouncer (port 6432) instead of directly to PostgreSQL. Run SHOW POOLS; on PgBouncer's admin console to verify pooling is working.

  • Use the edoburu/pgbouncer Docker image or build your own
  • Key config: pool_mode = transaction, default_pool_size = 10, max_client_conn = 500
  • Admin console: psql -h localhost -p 6432 -U pgbouncer pgbouncer
  • SHOW POOLS; shows active, waiting, and idle connections per database
  • Test: open 100 connections through PgBouncer and check that PostgreSQL only sees ~10
Exercise 4: Simulate Failover with Patroni Hard

Set up a 3-node Patroni cluster (1 primary, 2 replicas) with etcd as the consensus store. Use patronictl switchover to trigger a planned failover. Then simulate a crash by stopping the primary container and verify Patroni promotes a replica automatically within 30 seconds.

  • Use the patroni/patroni Docker image or build from the Patroni repo
  • Each Patroni node needs a patroni.yml with bootstrap config, etcd endpoints, and PostgreSQL settings
  • Key Patroni config: ttl: 30, loop_wait: 10, retry_timeout: 10
  • After stopping the primary, watch patronictl list — the new leader should appear within ~30 seconds
  • Check pg_stat_replication on the new primary to verify the remaining replica reconnected
Exercise 5: Implement Read-Your-Writes Routing Hard

Build a simple proxy (in Python, Node, or Go) that routes SQL queries. SELECTs go to a replica, INSERTs/UPDATEs/DELETEs go to the primary. After a write, all reads for that session go to the primary for 5 seconds (read-your-writes consistency). Log every routing decision.

  • Parse the first word of each SQL query to determine type (SELECT vs INSERT/UPDATE/DELETE)
  • Maintain a per-session last_write_time dictionary
  • If now() - last_write_time < 5 seconds, route reads to primary
  • Otherwise, route reads to a random healthy replica
  • Add health checks: periodically run SELECT 1 on each backend to detect failures
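The routing core of this exercise can be sketched as a pure function. Time and the replica list are passed in so the logic is testable without a database; the names below are assumptions, not a prescribed API:

```python
import random

STICKY_SECONDS = 5.0
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}

def route(query, session, replicas, now):
    """Return 'primary' or a replica name for this query; record write times in session."""
    verb = query.lstrip().split(None, 1)[0].upper()
    if verb in WRITE_VERBS:
        session["last_write"] = now       # Writes always go to the primary
        return "primary"
    if now - session.get("last_write", float("-inf")) < STICKY_SECONDS:
        return "primary"                  # Sticky reads shortly after a write
    return random.choice(replicas)        # Otherwise, any healthy replica
```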
Section 14

Cheat Sheet

Replication Modes
Async   → fast, data loss risk
Semi    → +1ms, data on 2 nodes
Sync    → +5ms, zero data loss

Physical → binary WAL, same PG ver
Logical  → row ops, cross-version
PgBouncer Quick Setup
pool_mode = transaction
default_pool_size = 25
max_client_conn = 1000
reserve_pool_size = 5

1000 app conns → 25 DB conns
Check Replication Lag
-- On primary:
SELECT client_addr,
       replay_lag
FROM pg_stat_replication;

-- Alert at 1s, page at 10s
Failover Tools
Patroni  → PG, etcd consensus
Orchestr → MySQL, raft-based
RDS      → Multi-AZ, 60-120s
Aurora   → Shared storage, 30s

Test monthly. Always.
Scaling Order
1. Optimize (indexes, queries)
2. Pool (PgBouncer)
3. Replicate (read replicas)
4. Cache (Redis)
5. Shard (last resort)

Exhaust each before next.
Arpit's 2×2 Matrix
High read + loss OK:
  → async replicas + cache

High read + loss BAD:
  → semi-sync + WAL archive

Low read + loss OK:
  → single DB, backups

Low read + loss BAD:
  → sync replication
Section 15

Connected Topics — Where to Go Next