TL;DR — The Mental Model
Here's everything you need to know in 60 seconds. Your database stores data in fixed-size chunks called pages (8 KB in PostgreSQL, 16 KB in MySQL). When you run a query, the database doesn't read individual rows — it reads entire pages from disk into a memory area called the buffer pool: a chunk of RAM the database uses as its own private cache, so pages live there and the database doesn't have to read from disk every time. Indexes are just cleverly organized trees — usually B+ trees, a tree structure optimized for disk storage in which every leaf node is linked to its neighbors, making range scans fast; nearly all relational database indexes use one — that tell the database exactly which page to read instead of scanning every page in the table. And the WAL (Write-Ahead Log) makes sure nothing is lost if the power goes out: before the database changes any data page, it writes a description of the change to this log file, and if the server crashes, the WAL is replayed to recover any changes that were in memory but not yet written to disk.
That's the whole machine. Four moving parts: pages on disk, a buffer pool in RAM, indexes to find the right pages, and a WAL to keep everything crash-safe. Every major relational database — PostgreSQL, MySQL, Oracle, SQL Server — is a variation on this theme. Learn these four pieces, and you can reason about any database problem you'll ever face.
The Scenario — When Your Query Takes 45 Seconds
It's Monday morning. Your team just shipped a new feature — a user search page. The query looks innocent enough:
SELECT * FROM users WHERE email = 'alice@example.com';
-- Table: users
-- Rows: 500,000,000 (500 million)
-- Result: 1 row
-- Time: 45.2 seconds 😱
Forty-five seconds to find a single row. The table has 500 million users. The product manager is furious. Users are leaving. The CEO sends a Slack message with exactly one word: "Fix."
You could just "add an index" and move on. Most developers do. But why does the query take 45 seconds without an index? What is the database actually doing during those 45 seconds? What is an index, physically? Where does it live on disk? How does the database decide to use it? Without understanding the internals, you're always guessing — and guessing stops working at scale.
You have 500 million rows in a table. Each row is about 200 bytes. How much raw data is that? If the database reads from a standard SSD at 500 MB/s, how long would it take to scan every single row? Work it out before scrolling.
Let's do the math together. 500 million rows × 200 bytes = 100 GB of raw data. An SSD reads at roughly 500 MB/s sequentially, so scanning the entire table should take 100 GB ÷ 500 MB/s = 200 seconds. Wait — the actual query took 45 seconds, not 200. So the database wasn't crawling through the table with random reads (fetching data from scattered locations on disk, which has far higher latency than a continuous sweep — brutally so on HDDs, which must seek to each location, and measurably even on SSDs). It was doing a sequential scan — also called a full table scan, the brute-force approach where the database reads every page of the table from start to finish, checking each row against the WHERE clause — and OS-level read-ahead was prefetching pages fast enough to beat our naive 500 MB/s estimate.
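The back-of-envelope arithmetic above fits in a few lines of Python (all inputs are the article's assumed numbers, not measurements):

```python
# The article's assumptions, spelled out
ROWS = 500_000_000
ROW_BYTES = 200
PAGE_BYTES = 8192                    # PostgreSQL page size
SSD_BYTES_PER_SEC = 500 * 10**6      # ~500 MB/s sequential

table_bytes = ROWS * ROW_BYTES                   # raw data size
rows_per_page = PAGE_BYTES // ROW_BYTES          # ignoring page headers
pages = ROWS // rows_per_page                    # pages in the heap
scan_seconds = table_bytes / SSD_BYTES_PER_SEC   # naive sequential scan

print(table_bytes // 10**9, "GB")   # 100 GB
print(f"{pages:,} pages")           # 12,500,000 pages
print(scan_seconds, "seconds")      # 200.0 seconds
```

Swap in an HDD-ish 150 MB/s for SSD_BYTES_PER_SEC and the naive estimate balloons past 11 minutes, which is why "just scan it" stops being an option long before 500 million rows.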
Twelve and a half million pages (100 GB of data at roughly 40 rows per 8 KB page). The database opened every single one, scanned every row inside, checked if the email matched, and moved on. For one row. There has to be a better way — and there is. But to understand it, you need to understand what pages are, how they're organized, and how indexes change the game.
The First Attempt — "Just Add an Index"
Every developer knows the answer: create an index. You've probably done it dozens of times. Two seconds of typing, and the query goes from 45 seconds to 2 milliseconds:
-- Before: 45.2 seconds (sequential scan of 12.5M pages)
CREATE INDEX idx_users_email ON users(email);
-- After: 0.002 seconds (index scan of ~4 pages)
SELECT * FROM users WHERE email = 'alice@example.com';
-- Time: 2 ms ✓
Problem solved, right? The query is fast. Ship it. Go home.
But wait. What actually happened? What did CREATE INDEX build? Where does it live on disk? How does 45 seconds become 2 milliseconds? Most developers treat indexes like magic spells — you say the incantation, and things go fast. But magic isn't real, and understanding the mechanics is what separates someone who can fix database problems from someone who can only google them.
The query went from scanning 12.5 million pages to reading about 4 pages. An index made that possible. But how? What data structure could possibly find one row among 500 million by reading only 4 pages? And if indexes are so great, why don't we just index every column?
Here are the questions that "just add an index" doesn't answer:
- What is a page? — You keep hearing "8 KB page" but what's actually in a page?
- What's inside an index? — It's a B+ tree, sure. But what does that look like on disk? How deep is it?
- Why only 4 page reads? — How did the index navigate 500 million rows in 4 hops?
- What's the buffer pool doing? — If the same query runs twice, does it hit disk again?
- What about writes? — You added an index for reads, but what happens to INSERT and UPDATE performance?
- What if the index doesn't help? — When would the database ignore your index and scan anyway?
You can use EXPLAIN ANALYZE to see what changed — but unless you understand pages, trees, and I/O patterns, the output is just cryptic numbers:
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- Without index:
-- Seq Scan on users (cost=0.00..9850000.00 rows=1 width=200)
-- Filter: (email = 'alice@example.com'::text)
-- Rows Removed by Filter: 499999999
-- Buffers: shared read=12500000  ← 12.5 MILLION page reads!
-- Planning Time: 0.1 ms
-- Execution Time: 45200 ms
-- With index:
-- Index Scan using idx_users_email on users (cost=0.57..8.59 rows=1 width=200)
-- Index Cond: (email = 'alice@example.com'::text)
-- Buffers: shared hit=4  ← 4 page reads, ALL from cache!
-- Planning Time: 0.1 ms
-- Execution Time: 0.02 ms
Look at that Buffers line. Without an index: shared read=12,500,000 — the database read 12.5 million pages from disk. With an index: shared hit=4 — just 4 pages, and they came from the buffer pool (RAM), not disk. That's a 3-million-fold reduction in I/O. But how? To answer that, we need to go deeper.
Where It Breaks — When "Just Add an Index" Isn't Enough
Adding an index fixed the email lookup. Great. But now your team hits a cascade of new problems — and every single one requires understanding what's happening inside the database to diagnose:
Problem 1: "Writes got 3× slower after adding 6 indexes"
Your team added indexes on email, name, created_at, city, phone, and status. Reads are blazing fast. But every INSERT now takes 18ms instead of 6ms. Every UPDATE that touches an indexed column triggers index maintenance on all affected indexes. Each index is a separate B+ tree on disk — a self-balancing tree optimized for page-oriented storage, whose nodes can have hundreds of children instead of a binary tree's two, keeping the tree extremely shallow — and each must be updated independently.
If each index update costs 2-3 page writes (finding the right leaf, possibly splitting a node), and you have 6 indexes, how many extra page writes does each INSERT require? What happens at 10,000 INSERTs per second?
The math: 6 indexes × 2-3 page writes each = 12-18 extra page writes per INSERT. At 10K inserts/sec, that's 120K-180K extra page writes per second. Without understanding pages and B+ tree maintenance, you'd never figure out why your writes are slow — the query itself looks fine.
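Here's that arithmetic as a quick sanity check, using the rough 2-3 page writes per index assumed above:

```python
# Assumed workload numbers from the paragraph above
indexes = 6
min_writes, max_writes = 2, 3      # page writes per index per INSERT
inserts_per_sec = 10_000

lo = indexes * min_writes          # 12 extra page writes per INSERT
hi = indexes * max_writes          # 18
print(f"per INSERT : {lo}-{hi} extra page writes")
print(f"per second : {lo * inserts_per_sec:,}-{hi * inserts_per_sec:,} extra page writes")
```

The point of writing it out: the cost scales linearly with the number of indexes, so every "harmless" index you add is a permanent tax on every write.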
Problem 2: "Queries are fast in staging, slow in production"
Same query, same index, same data shape. Staging: 2ms. Production: 800ms. Why? The answer is the buffer pool. In staging, your 8 GB buffer pool easily holds the entire index and most of the hot data pages. In production, 500 million rows means the index alone is 12 GB — larger than your buffer pool. Pages are constantly being evicted (when the pool is full and a new page must be loaded, the least-recently-used page is removed to make room; if it was modified, it must be written to disk first) and re-read from disk.
Run this in PostgreSQL to see exactly what's in your buffer pool right now:
-- See what's in the buffer pool RIGHT NOW
-- Each row = one 8KB page in shared memory
SELECT c.relname AS table_name,
count(*) AS pages_in_buffer,
pg_size_pretty(count(*) * 8192) AS size_in_buffer,
round(100.0 * count(*) / (SELECT setting::int
FROM pg_settings WHERE name = 'shared_buffers'), 2) AS pct_of_pool
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY pages_in_buffer DESC
LIMIT 10;
This shows you exactly which tables and indexes are hogging your buffer pool. If your users table is taking 60% and your index is fighting for the remaining 40%, you've found your bottleneck. Without understanding the buffer pool, you'd waste hours profiling the query itself — which isn't the problem.
Problem 3: "The database crashed, and we lost 30 seconds of transactions"
A power outage took down your database server. When it came back, you discovered that the last 30 seconds of committed transactions were missing. Users who got a "success" response had their data vanish. How is that possible if the database said it committed?
The answer lies in the WAL (Write-Ahead Log) and how fsync works. fsync is the system call that forces the operating system to flush its file buffers to the physical storage device; without it, data the OS reports as "written" may still sit in an OS cache in RAM — vulnerable to power loss. Some configurations delay flushing the WAL to disk for performance (synchronous_commit = off in PostgreSQL). The database told the client "committed" when it wrote to the WAL in memory, but the WAL hadn't been flushed to disk yet. Power dies — the in-memory WAL is gone — and those transactions are lost.
Without understanding the WAL, you'd never know this setting existed, let alone the trade-off: synchronous_commit = off makes commits ~5× faster but risks losing the last few hundred milliseconds of data on crash. For a logging system? Fine. For a payment system? Catastrophic.
These aren't exotic edge cases. They're the problems every team hits once their database has more than a few million rows. The fix isn't more indexes, more RAM, or a bigger server — it's understanding the building blocks. Let's start with the most fundamental one.
The Breakthrough — Everything Is Pages
Here's the single most important idea in database internals — the one that makes everything else click:
A disk-based database never reads or writes individual rows. It reads and writes fixed-size blocks called pages. A page is 8 KB in PostgreSQL, 16 KB in MySQL/InnoDB. Every table is a file on disk made up of these pages, laid out one after another. Every index is a separate file, also made of pages. The buffer pool caches pages. The WAL logs changes to pages. Pages are the atom of the database world — the smallest unit of everything.
Think about it like a library. You don't store books as individual loose pages scattered around — you put them on shelves, and each shelf holds a fixed number of books. When someone wants a book, the librarian fetches an entire shelf section (a page), not one book at a time. If the librarian is smart, they keep the popular shelves on a cart near the front desk (the buffer pool) so they don't have to walk to the back every time.
If a page is 8 KB and a row is 200 bytes, how many rows fit in one page? And if you have 500 million rows, how many pages is that? (This is the same math from Section 2 — but now you understand why the database thinks in pages.)
The math: 8,192 bytes per page ÷ 200 bytes per row = about 40 rows per page (a bit less after accounting for page headers and row pointers). 500 million rows ÷ 40 rows/page = 12.5 million pages. Now the EXPLAIN output makes sense: Buffers: shared read=12500000 means "I read 12.5 million pages from disk." That's not an arbitrary number — it's every page in the table.
You can actually see the file where your table lives. This is a real file on disk, not an abstraction:
-- Find the actual file path of your table on disk
SELECT pg_relation_filepath('users');
-- Result: base/16384/16389
-- That's a REAL file. You can see it in the filesystem:
-- $ ls -la /var/lib/postgresql/14/main/base/16384/16389
-- -rw------- 1 postgres postgres 107374182400 Mar 20 10:00 16389
-- ↑ that's ~100 GB — your 500M rows
-- For tables > 1 GB, PostgreSQL splits into 1 GB segments:
-- 16389, 16389.1, 16389.2, ... 16389.99
And you can peek inside a single page using the pageinspect extension — this shows you the actual bytes the database reads when it fetches page 0 of your users table:
-- Enable the page inspection extension (one-time setup)
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Look at the header of page 0 (the first page) of the users table
SELECT * FROM page_header(get_raw_page('users', 0));
-- lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-- 0/1A8E930 | 0 | 0 | 212 | 6920 | 8192 | 8192 | 4 | 0
-- pd_lower = 212  → row pointers end at byte 212
-- pd_upper = 6920 → tuple data starts at byte 6920
-- Free space = 6920 - 212 = 6708 bytes available on this page
-- Count how many rows (tuples) are on this page:
SELECT count(*) FROM heap_page_items(get_raw_page('users', 0));
-- Result: 38 (slightly less than our 40 estimate — page header + alignment overhead)
Every page has this same structure: a header at the top with bookkeeping info like the LSN (Log Sequence Number — a monotonically increasing position in the WAL; every page stores the LSN of the last WAL record that modified it, and during crash recovery the database replays WAL records with a higher LSN than the page's), item pointers that grow downward, free space in the middle, and actual row data (tuples) that grows upward. When the free space runs out, the page is full, and new rows go to a different page.
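To make the layout concrete, here's a toy model of a page in which line pointers grow down and tuples grow up (the header and tuple sizes are rough approximations, not PostgreSQL's exact accounting):

```python
# Toy page: line pointers grow down from the header, tuples grow up
# from the end. Sizes are rough; real PostgreSQL adds alignment, etc.
PAGE_SIZE = 8192
PAGE_HEADER = 24       # fixed header at the top of every page
LINE_POINTER = 4       # one small pointer per stored tuple
TUPLE_SIZE = 200 + 24  # our 200-byte row plus a per-tuple header

def fill_page():
    lower, upper, tuples = PAGE_HEADER, PAGE_SIZE, 0
    # stop when the downward pointers would collide with the upward tuples
    while lower + LINE_POINTER <= upper - TUPLE_SIZE:
        lower += LINE_POINTER   # pd_lower moves down the page
        upper -= TUPLE_SIZE     # pd_upper moves up the page
        tuples += 1
    return tuples, upper - lower  # rows stored, free bytes left

tuples, free = fill_page()
print(tuples, "tuples,", free, "bytes free")   # 35 tuples — in the ballpark of the 38 seen above
```

The exact count differs from the real heap_page_items result because the per-tuple overhead here is a guess, but the mechanism (two cursors converging until the page is full) is the real one.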
This is the breakthrough: once you see that everything is pages, the entire database becomes understandable. Reads = fetching pages. Writes = modifying pages. Caching = keeping pages in RAM. Recovery = replaying changes to pages. Indexes = a fast way to find the right page. It's all pages, all the way down.
How It Works — The Five Building Blocks
Now that you know everything is pages, let's look at how the five core subsystems work together. Each one is a piece of the puzzle — and each one is something you can inspect on a real PostgreSQL instance.
1. Pages & Heap Files — Where Your Data Actually Lives
When you create a table, the database creates a heap file — an unordered collection of pages, called a "heap" for the same reason as a pile of laundry: things go wherever they land. Think of it as a stack of paper where new rows get written on whatever page has space. There's no sorting, no ordering — it's just a pile.
In PostgreSQL, this heap file is literally a file in your data directory. For tables larger than 1 GB, it gets split into 1 GB segments (because some filesystems can't handle huge files well). Each segment is just more pages, numbered sequentially.
If the heap file is unordered (rows are just thrown in wherever there's space), what happens when you DELETE a row? Does the database physically remove it and compact the page? Or does something else happen?
PostgreSQL doesn't physically remove deleted rows immediately. Instead, it marks them as "dead" — the space is wasted until VACUUM, the maintenance process that reclaims storage occupied by dead tuples, comes along. (Plain VACUUM marks the space as reusable but doesn't return it to the OS; VACUUM FULL actually reclaims disk space, but locks the table.) This is why PostgreSQL tables can "bloat" over time — if VACUUM isn't running often enough, dead rows accumulate and pages fill up with garbage.
-- See the physical size of your table's heap file
SELECT pg_size_pretty(pg_relation_size('users')) AS heap_size,
pg_size_pretty(pg_indexes_size('users')) AS index_size,
pg_size_pretty(pg_total_relation_size('users')) AS total;
-- heap_size: 97 GB | index_size: 12 GB | total: 109 GB
-- Count dead vs live tuples (are you bloated?)
SELECT relname, n_live_tup, n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname = 'users';
-- dead_pct > 20%? You need more aggressive VACUUM settings.
MySQL/InnoDB is different. InnoDB doesn't use heap files — it uses clustered indexes: the table data IS the primary key index, with rows physically sorted by the primary key on disk inside a B+ tree. There is no separate heap file — the primary key B+ tree IS the table. This makes primary key range scans extremely fast (the data is contiguous), but it has huge implications for INSERT performance, especially with UUID primary keys: random inserts cause constant page splits.
2. B+ Tree Indexes — How 4 Page Reads Find 1 Row in 500 Million
A B+ tree is the data structure behind the vast majority of relational database indexes. The name sounds intimidating, but the idea is simple: it's a tree where each node can have hundreds of children instead of just two. This makes the tree incredibly shallow — even with billions of entries.
Here's the key number: fanout. Fanout means "how many children does each node have." In a B+ tree stored in 8 KB pages, if each key is 8 bytes and each pointer is 6 bytes, you can fit about (8192 - header) ÷ (8 + 6) ≈ 500 children per node. That's the magic.
If each B+ tree node (page) can point to 500 children, how many levels deep does the tree need to be to hold 500 million entries? (Hint: each level multiplies capacity by 500.)
The math is beautiful. Level 1 (the root): 1 node pointing to 500 children. Level 2: 500 nodes, covering 500 × 500 = 250,000 pointers. Level 3: 250,000 nodes, covering 125 million entries. Not quite enough for our 500 million rows — so the tree grows one more level, to 4, with room for 62.5 billion entries. Four levels for half a billion rows: that's why the EXPLAIN showed Buffers: shared hit=4 — one page read per level to walk from the root down to the leaf. (Fetching the full row from the heap touches one more page on top, and the root and upper levels are almost always already sitting in the buffer pool.)
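The depth calculation generalizes to any fanout; here's a quick sketch using the article's idealized fanout of 500:

```python
import math

# Each additional level multiplies the tree's capacity by the fanout.
def levels_needed(entries, fanout):
    levels, capacity = 1, fanout       # a single page already holds `fanout` entries
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels

print(levels_needed(500_000_000, 500))        # 4 levels for 500M rows
print(math.ceil(math.log(500_000_000, 500)))  # same answer via logarithms: 4
```

This is why B+ tree lookups are effectively constant-time in practice: going from 500 million rows to 62 billion adds exactly one more page read.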
You can inspect a real B+ tree index in PostgreSQL:
-- See the internal structure of your B+ tree index
-- Requires pageinspect extension
SELECT * FROM bt_metap('idx_users_email');
-- magic | version | root | level | fastroot | fastlevel
-- 340322 | 4 | 3 | 3 | 3 | 3
-- ↑ root is page 3
-- ↑ level = 3: leaves are level 0, so the path from root to leaf is 4 pages
-- Look at stats of any internal page
SELECT * FROM bt_page_stats('idx_users_email', 3); -- page 3 = root
-- blkno | type | live_items | dead_items | avg_item_size | free_size
-- 3 | r | 487 | 0 | 15 | 508
-- ↑ r = root — 487 child pointers (close to our 500 estimate!)
Critical detail: leaf nodes are linked. B+ tree leaf nodes form a doubly-linked list — each leaf page has a pointer to the next leaf page AND the previous one. This is what makes range queries fast: WHERE email BETWEEN 'a' AND 'b' finds the first leaf, then follows the forward pointers through all matching entries. No need to traverse the tree from the root again for each row.
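A toy sketch of why the sibling pointers matter: descend the tree once, then walk right. (Names like Leaf and range_scan are illustrative, not PostgreSQL internals.)

```python
# Minimal linked-leaf model: find the first matching leaf once,
# then follow `next` pointers instead of re-descending the tree.
class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted keys stored in this leaf page
        self.next = None      # forward pointer to the right sibling

# three "leaf pages" of a toy index, in key order
leaves = [Leaf(['a1', 'a9']), Leaf(['b2', 'b7']), Leaf(['c3', 'c8'])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right

def range_scan(start_leaf, lo, hi):
    out, leaf = [], start_leaf
    while leaf:
        for k in leaf.keys:
            if lo <= k <= hi:
                out.append(k)
            elif k > hi:          # past the range: stop, no backtracking
                return out
        leaf = leaf.next          # follow the sibling pointer
    return out

print(range_scan(leaves[0], 'a9', 'b9'))   # ['a9', 'b2', 'b7']
```

Without the sibling pointers, every key in the range would cost a fresh root-to-leaf descent; with them, the scan is one descent plus a sequential walk.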
3. Buffer Pool — The Database's Private RAM Cache
Every database has a section of RAM reserved for caching pages โ this is the buffer pool. In PostgreSQL, it's controlled by the shared_buffers setting (typically 25% of system RAM). In MySQL/InnoDB, it's innodb_buffer_pool_size (typically 70-80% of RAM).
The buffer pool is the single most important performance-tuning knob. When a query needs a page, the database checks the buffer pool first. If the page is there — a buffer hit, shown as shared hit=N in EXPLAIN output — that's a memory access (~100 nanoseconds). If not — a buffer miss, shown as shared read=N — the database must read the page from disk: ~100 microseconds on an SSD, ~10 milliseconds on an HDD. That's roughly 1,000× slower.
-- What's your buffer pool hit ratio?
-- (Should be > 99% for OLTP workloads)
SELECT
sum(blks_hit) AS hits,
sum(blks_read) AS misses,
round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2)
AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();
-- hits: 847,293,104 | misses: 2,341,987 | hit_ratio_pct: 99.72%
-- If hit_ratio < 95%, your buffer pool is too small.
-- Pages are being evicted and re-read constantly.
When the buffer pool is full and a new page needs to be loaded, the database evicts the least-recently-used page (using a clock-sweep algorithm in PostgreSQL, an LRU variant in InnoDB). If the evicted page was dirty — modified in the buffer pool but not yet written to disk — it must be flushed to disk first, adding I/O overhead.
Remember the "fast in staging, slow in production" problem from Section 4? Now you can diagnose it: if your index is 12 GB but your buffer pool is 8 GB, the B+ tree's upper levels compete with data pages for space. The root page stays cached (it's accessed on every query), but internal and leaf pages get evicted constantly — turning a 4-page buffer hit into a 4-page disk read.
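Here's a miniature LRU buffer pool that makes hits, misses, and eviction concrete (a simplification: PostgreSQL actually uses clock-sweep, and a real pool also tracks dirty pages):

```python
from collections import OrderedDict

class BufferPool:
    """Toy LRU page cache: hits are free, misses are 'disk reads'."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()       # page_id -> page contents
        self.hits = self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)       # mark as recently used
        else:
            self.misses += 1                       # would be a disk read
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)     # evict least recently used
            self.pages[page_id] = f"page-{page_id}"
        return self.pages[page_id]

pool = BufferPool(capacity=3)
for page in [1, 2, 3, 1, 4, 1]:   # page 1 is "hot" and survives eviction
    pool.get(page)
print(pool.hits, pool.misses)      # 2 4
```

Note what got evicted: page 2, the coldest one. The hot page stayed cached even though the pool only holds three pages — exactly why a B+ tree root is effectively always in RAM.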
4. Write-Ahead Log (WAL) — The Crash Safety Net
Here's a scary scenario. You INSERT a row. The database modifies the page in the buffer pool (RAM). It returns "INSERT 0 1" — success. Then the power goes out. The modified page was in RAM and never written to disk. Is your data gone?
No — because of the WAL. Before the database modifies any page in the buffer pool, it first writes a description of the change to a sequential log file on disk. This log is the Write-Ahead Log. The rule is absolute: the WAL record hits disk before the client gets a success response. If the server crashes, the database replays the WAL on startup — re-applying any changes that were in the buffer pool but hadn't been written to the actual data files yet.
Writing to the WAL is a sequential write (just append to the end of a file). Writing to the actual data page on disk is a random write (seek to a specific position in the file). Why is this distinction crucial for performance?
Sequential writes are much faster than random writes — 10-100× faster on HDDs, and still 2-5× faster on SSDs. So the WAL gives you the best of both worlds: the durability of writing to disk immediately (via sequential WAL appends) and the performance of batching random data page writes — dirty pages are flushed in the background by the checkpointer, the PostgreSQL process that periodically writes all dirty pages from the buffer pool to disk and records a checkpoint in the WAL; after a checkpoint, older WAL records are no longer needed for crash recovery and can be recycled.
# See actual WAL records — what was written when you did an INSERT
# (Run as postgres user on the server)
pg_waldump --start=0/1A8E000 --end=0/1A8F000 /var/lib/postgresql/14/main/pg_wal/000000010000000000000001
# Output looks like:
# rmgr: Heap   len (rec/tot): 59/211   tx: 7843291
#   desc: INSERT off 12, blkref #0: rel 1663/16384/16389 blk 7832491
#                ↑                                        ↑
#       inserted at item 12                 this is heap page 7,832,491!
#
# rmgr: Btree  len (rec/tot): 64/64    tx: 7843291
#   desc: INSERT_LEAF off 143, blkref #0: rel 1663/16384/16392 blk 94021
#                     ↑                                        ↑
#            B-tree leaf insert                       index page 94,021
Look at that output. One INSERT caused TWO WAL records — one for the heap page (the actual data) and one for the B-tree leaf page (the index). Every indexed column adds another WAL record per write. This is the write amplification cost of indexes — and it's why the WAL is often the bottleneck in write-heavy systems.
5. MVCC Version Chains — How Reads Don't Block Writes
Here's a question that seems impossible: how can two transactions see different versions of the same row at the same time? Transaction A is reading Alice's balance while Transaction B is updating it. A sees the old balance, B writes the new one, and neither blocks the other. How?
The answer is MVCC (Multi-Version Concurrency Control). Instead of locking a row when someone reads it, the database keeps multiple versions of the row simultaneously; each transaction sees the version that was current when it started. Each version is tagged with which transaction created it and which transaction deleted it.
In PostgreSQL, every row (tuple) on a page has two hidden columns: xmin (the transaction ID that created this version) and xmax (the transaction ID that deleted/updated it — 0 if it's still alive). You can see them:
-- See the hidden MVCC columns on your rows
SELECT xmin, xmax, ctid, email FROM users WHERE email = 'alice@example.com';
-- xmin    | xmax | ctid         | email
-- 7843291 | 0    | (7832491,12) | alice@example.com
--    ↑       ↑          ↑
-- created  alive   page 7,832,491, item 12 on that page
-- by tx            (this is the physical location!)
-- 7843291
-- Now UPDATE the row:
UPDATE users SET name = 'Alice Smith' WHERE email = 'alice@example.com';
-- Check again:
SELECT xmin, xmax, ctid, email FROM users WHERE email = 'alice@example.com';
-- xmin    | xmax | ctid         | email
-- 7843310 | 0    | (7832491,39) | alice@example.com
--    ↑                   ↑
-- NEW tx ID        DIFFERENT item on the page!
-- The old version (xmin=7843291) is still on the page
-- but with xmax=7843310. Running transactions that
-- started before tx 7843310 can still see the old version.
Notice the ctid changed — that means the UPDATE created a new copy of the row at a different position on the page. The old version is still there, marked as dead. This is PostgreSQL's MVCC in action: an UPDATE is really a DELETE + INSERT at the storage layer. The old version stays around until VACUUM removes it.
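The visibility rule can be sketched as a function of xmin, xmax, and a snapshot (heavily simplified: the real rules also handle aborted and in-progress transactions):

```python
# Toy visibility check in the spirit of xmin/xmax. A version is visible
# if it was created at or before our snapshot and not yet deleted by it.
def visible(tuple_, snapshot_xid):
    xmin, xmax = tuple_["xmin"], tuple_["xmax"]
    created_in_time = xmin <= snapshot_xid
    not_yet_deleted = xmax == 0 or xmax > snapshot_xid
    return created_in_time and not_yet_deleted

# The two versions of Alice's row from the example above
old_version = {"xmin": 7843291, "xmax": 7843310, "name": "Alice"}
new_version = {"xmin": 7843310, "xmax": 0,       "name": "Alice Smith"}

# A transaction that started before the UPDATE sees the old row...
print(visible(old_version, 7843300), visible(new_version, 7843300))  # True False
# ...a transaction that started after it sees the new row.
print(visible(old_version, 7843320), visible(new_version, 7843320))  # False True
```

Both versions coexist on the same page; which one a query sees depends only on the query's snapshot, which is how readers and writers avoid blocking each other.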
MySQL/InnoDB handles MVCC differently. Instead of creating new row versions in the main table pages, InnoDB stores old versions in a separate undo log. When a transaction needs to see an old version, it follows a chain of undo records backwards to reconstruct what the row looked like at the start of that transaction. This keeps the main table pages clean but makes long-running transactions expensive: the undo log grows, and old versions can't be purged while a transaction might still need them.
Going Deeper — LSM Trees, Compaction, and Bloom Filters
B+ trees are the king of read-heavy workloads (OLTP databases like PostgreSQL and MySQL). But what about systems that need to handle millions of writes per second? Social media feeds, IoT sensor data, time-series metrics — these systems need a different data structure. Enter the LSM tree.
The core trade-off in storage engines is simple: optimize for reads or optimize for writes. You can't have both. B+ trees optimize for reads (any row in 3-4 page reads), but writes are expensive (random I/O to insert into the right leaf page, plus possible page splits). LSM trees flip this: writes are blazing fast (just append to a memory buffer), but reads are slower (have to check multiple levels on disk).
If all writes go to an in-memory buffer (memtable), what happens when the buffer gets full? And if you flush it to disk as a sorted file, how do reads work when there are 50 such files on disk?
Here's how an LSM tree works, step by step:
Step 1: Write to the memtable. Every write goes to an in-memory sorted data structure — usually a red-black tree (a self-balancing binary search tree with O(log n) insertion, deletion, and lookup; in memory, page sizes don't matter, only speed of insertion and sorted iteration) or a skip list. This is RAM-speed — microseconds per write. The write is also appended to a WAL for crash safety (the same idea as PostgreSQL's).
Step 2: Flush to SSTables. When the memtable reaches a threshold (typically 64 MB), it's written to disk as an immutable sorted file called an SSTable (Sorted String Table) — a file of key-value pairs sorted by key. Once written, an SSTable is never modified; this immutability simplifies concurrency (no locks needed to read) and crash recovery.
Step 3: Read from everywhere. A read must check: the memtable first, then L0 SSTables (most recent flushes), then L1, then L2... This is why reads are slower — in the worst case, you check every level. Bloom filters (see below) skip levels that definitely don't have your key.
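The three steps above can be sketched as a toy LSM engine; a tiny flush threshold stands in for the usual 64 MB, and lookups go newest-first:

```python
# Toy LSM engine: writes land in a sorted in-memory memtable; when full,
# it is flushed as an immutable sorted "SSTable"; reads check newest-first.
MEMTABLE_LIMIT = 2
memtable = {}            # stand-in for a red-black tree / skip list
sstables = []            # newest first; each one is an immutable sorted dict

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:            # flush threshold reached
        sstables.insert(0, dict(sorted(memtable.items())))
        memtable.clear()

def get(key):
    if key in memtable:                            # 1. memtable first
        return memtable[key]
    for table in sstables:                         # 2. then each level, newest first
        if key in table:
            return table[key]                      # newest version shadows older ones
    return None

put("a", 1); put("b", 2)      # triggers flush #1
put("a", 99); put("c", 3)     # flush #2 shadows the old value of 'a'
print(get("a"), get("b"), get("c"))   # 99 2 3
```

Note that the stale value of "a" still sits in the older SSTable; reclaiming that space is exactly what compaction (next) is for.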
| Operation | B+ Tree (PostgreSQL) | LSM Tree (RocksDB) |
|---|---|---|
| Point read | 3-4 page reads (always) | 1 to N level checks (variable) |
| Range scan | Follow leaf linked list | Merge sorted streams from all levels |
| Point write | Random I/O to leaf page + WAL | Sequential write to memtable + WAL |
| Bulk write | Expensive (many random page writes) | Sequential memtable flushes |
| Space overhead | ~1.5× data size (fragmentation) | ~1.1× (compacted), 2× during compaction |
| Used by | PostgreSQL, MySQL, Oracle, SQL Server | RocksDB, Cassandra, LevelDB, CockroachDB |
Real-world numbers: Facebook's MyRocks (MySQL on RocksDB) saved 50% disk space compared to InnoDB because LSM compaction eliminates fragmentation. But read latency P99 increased from 2ms to 8ms. That's the trade-off — and for Facebook's use case (write-heavy social graph), it was worth it.
Compaction is the background process that merges SSTables to keep reads manageable. Without compaction, reads would have to check hundreds of SSTables. There are two main strategies:
Size-Tiered Compaction (Cassandra default): When there are enough SSTables of similar size, merge them into one larger SSTable. Simple and write-friendly — fewer background writes. But it can temporarily use 2× the disk space during a major compaction (old and new files coexist until the old ones are deleted). Also, read amplification is higher because SSTables at the same level can have overlapping key ranges.
Leveled Compaction (RocksDB default): SSTables at each level have non-overlapping key ranges (like B+ tree pages). When L0 fills up, its SSTables are merged with the overlapping range in L1. Each level is 10× larger than the previous. Reads are faster (at most one SSTable per level to check), but writes are amplified because compacting one L0 file might rewrite several L1 files.
| Property | Size-Tiered | Leveled |
|---|---|---|
| Write amplification | Lower (~10×) | Higher (~30×) |
| Read amplification | Higher (overlapping SSTables) | Lower (1 SSTable/level) |
| Space amplification | Higher (up to 2ร during compaction) | Lower (~1.1ร) |
| Best for | Write-heavy, space-tolerant | Read-write balanced, space-sensitive |
| Default in | Cassandra | RocksDB, LevelDB |
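The merge step at the heart of both compaction strategies is easier to see in code. Here's a toy, in-memory sketch (not RocksDB's or Cassandra's actual implementation): merge several sorted SSTables into one, letting the newest version of each key win and dropping tombstones.

```python
import heapq

def compact(sstables):
    """Toy compaction: merge sorted SSTables into one sorted run.
    sstables[0] is the NEWEST table. On duplicate keys the newest version
    wins; a value of None is a tombstone (a delete) and is dropped."""
    # Tag entries as (key, age) so heapq.merge yields newer versions first.
    streams = [
        [((key, age), value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged, last_key = [], object()
    for (key, _age), value in heapq.merge(*streams):
        if key == last_key:
            continue  # an older, superseded version of a key we already handled
        last_key = key
        if value is not None:  # tombstones disappear during compaction
            merged.append((key, value))
    return merged

newest = [("a", "a-v2"), ("c", None)]                    # c was deleted recently
oldest = [("a", "a-v1"), ("b", "b-v1"), ("c", "c-v1")]
print(compact([newest, oldest]))  # [('a', 'a-v2'), ('b', 'b-v1')]
```

A real engine streams the inputs from disk and writes the result as a new SSTable, but the logic is the same: one pass over sorted runs, no random I/O.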
Write amplification is the ratio of actual bytes written to disk vs bytes written by the application. If you write 1 byte and the database ends up writing 30 bytes (across WAL, memtable flush, and multiple compaction passes), that's 30× write amplification. This matters because SSDs have a limited number of write cycles — high write amplification wears out SSDs faster.
The biggest problem with LSM reads is checking every level. If you're looking for a key that only exists in L3, you'd waste time reading L0, L1, and L2 first. Bloom filters (a probabilistic data structure that can tell you "this key is DEFINITELY NOT in this SSTable" or "this key is PROBABLY in this SSTable" — it uses a bit array and multiple hash functions; false positives are possible, but false negatives never happen) solve this elegantly.
Each SSTable has a Bloom filter — a small bit array (typically 10 bits per key, so ~1.2 MB per million keys) that can answer: "Is key X definitely NOT in this SSTable?" If the Bloom filter says no, you skip the entire SSTable — no disk read needed. If it says "maybe yes," you check the SSTable. With a 1% false positive rate, you skip 99% of unnecessary SSTable reads.
RocksDB allocates 10 bits per key for its Bloom filters by default, giving a ~1% false positive rate. You can tune this — more bits per key means fewer false positives but more memory. For a 100 GB database with 1 billion keys, the Bloom filters use about 1.2 GB of RAM — a worthwhile investment when each avoided disk read saves ~100 microseconds.
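To make the "no false negatives" guarantee concrete, here's a minimal Bloom filter sketch (illustrative only — RocksDB's production filters are more sophisticated, cache-friendly variants):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived by salting
    SHA-256. Answers "definitely not present" or "probably present"."""
    def __init__(self, m_bits=1024, k_hashes=7):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, key):
        # k independent-ish bit positions via a salt prefix
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means the key is DEFINITELY absent: skip this SSTable.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    bf.add(key)

print(bf.might_contain("user:2"))       # True - an added key can never be missed
print(bf.might_contain("user:999999"))  # almost certainly False (false positives are rare)
```

Note that a key, once added, always tests positive — that's why a "no" answer lets the read path skip an SSTable without ever risking a missed key.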
Different databases choose different storage engines for different reasons. Here's what's actually happening under the hood in the three most important ones:
| Feature | PostgreSQL | MySQL/InnoDB | RocksDB |
|---|---|---|---|
| Data structure | Heap + B-tree indexes | Clustered B+ tree (PK) | LSM tree |
| Table organization | Unordered heap file | Sorted by primary key | Sorted SSTables per level |
| Page size | 8 KB | 16 KB | 4 KB (configurable) |
| MVCC approach | Multiple row versions in heap | Undo log for old versions | Multiple versions as separate entries |
| Crash recovery | WAL replay | Redo log + doublewrite buffer | WAL replay |
| PK range scan | Index scan + heap fetch | Extremely fast (data is sorted by PK) | Good (sorted within levels) |
| Random writes | Moderate (heap + index updates) | Moderate (page splits with UUID PKs) | Fastest (sequential memtable writes) |
| Space efficiency | Moderate (bloat from dead tuples) | Good (undo log is separate) | Best (compaction eliminates waste) |
| Used by | PostgreSQL | MySQL, MariaDB | CockroachDB, TiDB, Yugabyte, MyRocks |
Why InnoDB's clustered index matters: In InnoDB, the primary key index IS the table. Rows are physically sorted by PK on disk. This means: (1) PK range scans are incredibly fast — the data is contiguous on disk. (2) Secondary indexes must store the PK value (not a physical row pointer) because row locations change during page splits. (3) Random UUID primary keys are terrible for InnoDB — every insert goes to a random position in the tree, causing constant page splits. Auto-increment IDs always append to the end, which is much faster.
Why RocksDB is eating the database world: RocksDB (built by Facebook) is now used as the storage engine inside many modern databases. CockroachDB, TiDB, and YugabyteDB all use RocksDB (or a fork) under the hood — they implement SQL and transactions on top, but the actual bytes-on-disk part is RocksDB. Facebook's own MySQL fork (MyRocks) replaced InnoDB with RocksDB and saved 50% of their SSD storage fleet.
The Variations — Different Storage for Different Problems
Not every workload looks the same. An e-commerce site doing thousands of small transactions per second needs a very different storage layout than an analytics warehouse scanning billions of rows for a report. The storage engine you choose shapes everything — query speed, write throughput, compression ratio, and memory usage.
Row-Store vs Column-Store — OLTP vs OLAP
Every database we've discussed so far is a row-store: each page contains complete rows, laid out one after another. When you SELECT * FROM users WHERE id = 42, the database reads one page and gets all columns for that user. This is perfect for OLTP (Online Transaction Processing — workloads that do many small, fast operations: insert an order, update a balance, read one user's profile; your application's day-to-day operations) workloads.
But what about analytics? SELECT AVG(age) FROM users WHERE country = 'US' needs only the age and country columns from every row — but a row-store reads entire rows (all 50 columns). If each row is 500 bytes but you only need 8 bytes (age + country), you're reading 62× more data than necessary.
What if, instead of storing rows together, you stored each column separately? All the age values in one file, all the country values in another. What would that change about read performance? About compression? About write performance?
That's exactly what a column-store does. Each column is stored in its own file. To compute AVG(age), the database reads only the age file — which is tiny and highly compressible (all the same data type, similar values). Column stores like ClickHouse (an open-source columnar database built by Yandex, known for extreme analytical speed thanks to columnar storage, vectorized execution, and aggressive compression), BigQuery, and Redshift can scan billions of rows per second for analytical queries.
| Property | Row-Store | Column-Store |
|---|---|---|
| Best for | OLTP — single-row reads/writes | OLAP — scans over many rows |
| Single row read | 1 page read | 1 read per column file (many files) |
| Full scan (2 cols) | Read ALL columns (wasteful) | Read only needed columns |
| Compression | Moderate (mixed types per page) | Excellent (same type = similar values) |
| Write speed | Fast (one page write per row) | Slower (write to N column files) |
| Examples | PostgreSQL, MySQL, Oracle | ClickHouse, BigQuery, Redshift, Parquet |
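The 62× figure from the AVG(age) example falls straight out of a back-of-envelope I/O model. A sketch, using the text's illustrative numbers (500-byte rows, 8 bytes of needed columns):

```python
def bytes_scanned(rows, row_bytes, needed_col_bytes, layout):
    """Rough I/O for a full-table aggregate: a row-store must read whole
    rows; a column-store reads only the needed column files.
    (Ignores compression, which favors column-stores even further.)"""
    return rows * (row_bytes if layout == "row" else needed_col_bytes)

ROWS = 500_000_000  # the article's 500M-row users table
row_store = bytes_scanned(ROWS, row_bytes=500, needed_col_bytes=8, layout="row")
col_store = bytes_scanned(ROWS, row_bytes=500, needed_col_bytes=8, layout="column")
print(row_store / col_store)  # 62.5 - the ~62x from the text
```

The ratio is just row width over needed-column width, which is why wide tables with narrow analytical queries are the worst case for row-stores.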
Clustered vs Heap Storage — Where Your Rows Physically Live
We've touched on this difference in Section 6, but it's important enough to explore in detail. PostgreSQL and MySQL/InnoDB store table data in fundamentally different ways, and this affects everything from INSERT speed to secondary index size.
Heap storage (PostgreSQL): Rows are stored in an unordered heap file. New rows go wherever there's free space. The primary key index is just another B+ tree that points to the physical location (page number + item offset) of each row. This means:
- Inserts are always fast — just find a page with space and put the row there
- Primary key range scans require random I/O — the rows aren't physically sorted
- Secondary indexes are the same size as the primary index (both store physical locations)
- UPDATEs create new row versions that might be on a different page (HOT updates avoid this when possible)
Clustered storage (MySQL/InnoDB): The table data IS the primary key B+ tree. Rows are physically sorted by PK in the leaf pages. This means:
- PK range scans are blazing fast — rows are contiguous on disk
- Random PK inserts (UUIDs) cause page splits — terrible performance
- Sequential PK inserts (auto-increment) always append to the end — great performance
- Secondary indexes store the PK value (not a physical pointer), so they're smaller but require a second lookup
Using UUIDv4 as a primary key in InnoDB is one of the most common performance mistakes. Each insert goes to a random position in the B+ tree, causing constant page splits. At scale, insert throughput can drop by 3-5× compared to auto-increment. If you need UUIDs, use UUIDv7 (time-ordered) or ULID, which preserve insert order while still being globally unique.
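You can watch the scatter effect in a toy model: insert keys into sorted order and count how many land at the very end (the cheap "rightmost leaf" case). This is a simulation of key ordering, not InnoDB itself:

```python
import bisect
import uuid

def append_fraction(keys):
    """Insert keys one by one into sorted order and return the fraction
    that land at the very end - the cheap, cache-friendly append case."""
    sorted_keys, appends = [], 0
    for key in keys:
        pos = bisect.bisect(sorted_keys, key)
        appends += pos == len(sorted_keys)
        sorted_keys.insert(pos, key)
    return appends / len(keys)

N = 10_000
seq = append_fraction([f"{i:08d}" for i in range(N)])         # auto-increment style
rnd = append_fraction([str(uuid.uuid4()) for _ in range(N)])  # UUID v4 keys
print(seq)  # 1.0 - every insert is an append
print(rnd)  # tiny - almost every insert lands mid-tree
```

With sequential keys, 100% of inserts hit the hot rightmost position; with random UUIDs, essentially every insert lands somewhere in the middle, which in a real B+ tree means touching (and eventually splitting) a random leaf page.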
In-Memory Databases — When Disk I/O Is the Enemy
Everything we've discussed assumes data lives on disk and gets cached in the buffer pool. But what if the entire dataset fits in RAM? Then you eliminate disk I/O completely, and the data structures can be optimized for memory access patterns instead of disk page access patterns.
If all data is in memory, do you still need B+ trees? What data structure would you use for a key-value store if disk page size was no longer a constraint?
Redis internals: Redis keeps all data in memory and uses data structures optimized for RAM, not disk. Instead of B+ trees, Redis uses hash tables for key-value lookups (average O(1); the global keyspace is one big hash table — every key you SET lives there), skip lists for sorted sets (O(log n) like a tree but simpler), and ziplist/listpack for small collections (a compact encoding that packs entries into one contiguous block of memory instead of using 8-byte pointers, trading CPU for memory savings — great for small datasets, but O(n) and slow for large ones). Redis uses jemalloc as its memory allocator to minimize fragmentation.
But "in-memory" doesn't mean "data disappears on restart." Redis offers two persistence options: RDB snapshots (periodic full dumps to disk — fast recovery but you lose data since the last snapshot) and AOF (Append-Only File — like a WAL, it logs every write command; slower recovery but less data loss). You can use both together for the best of both worlds.
VoltDB takes a different approach: it's a full SQL relational database that lives entirely in memory. It uses no buffer pool at all โ every transaction operates directly on in-memory data. For durability, it uses command logging (similar to WAL) and K-safety replication (synchronous copies to other nodes). VoltDB achieves millions of SQL transactions per second on a single node because there's zero disk I/O in the critical path.
| Feature | Redis | VoltDB | Memcached |
|---|---|---|---|
| Data model | Key-value + data structures | Relational (SQL) | Key-value only |
| Persistence | RDB snapshots + AOF | Command log + replication | None (pure cache) |
| Data structures | Hash tables, skip lists, ziplist | B+ tree indexes (in RAM) | Hash table only |
| Transactions | Limited (MULTI/EXEC) | Full ACID | None |
| Threading | Single-threaded (6.0+ has I/O threads) | Partitioned, no locking | Multi-threaded |
| Use case | Caching, sessions, leaderboards | High-throughput OLTP | Simple caching |
At Scale — How the Giants Tame Their Storage Engines
Everything we've covered so far — pages, B+ trees, buffer pools, WAL, LSM trees — sounds clean in theory. But when your table has 2 billion rows and your cluster handles 500K queries per second, every internal detail becomes a landmine. Here's how four real companies hit the limits of database internals and what they did about it.
PostgreSQL at Instagram — Tuning shared_buffers for 1 Billion Users
Instagram runs one of the world's largest PostgreSQL deployments. Their biggest challenge? Buffer pool management on machines with 256 GB of RAM serving tables with hundreds of millions of rows.
You have a PostgreSQL server with 256 GB of RAM. The default shared_buffers is 128 MB. Your DBA sets it to 200 GB (80% of RAM) thinking "more cache = faster." What goes wrong? (Hint: there's another cache layer you're forgetting about.)
The answer: the OS page cache. Linux already caches file pages in RAM. When you set shared_buffers to 200 GB, PostgreSQL manages 200 GB of pages and the OS also caches the same files — you get double caching. The OS cache becomes tiny (256 - 200 = 56 GB for everything else, including the OS page cache), and PostgreSQL's own clock-sweep eviction algorithm is less efficient than the OS's LRU for large caches.
Instagram's approach: shared_buffers = 64 GB (25% of RAM) with huge_pages = on. Standard 4 KB memory pages require a page table entry each — 64 GB / 4 KB = ~16.8 million entries, on the order of 128 MB of page tables per backend process at 8 bytes per entry (and every PostgreSQL backend maps shared_buffers). Huge pages (2 MB each) cut that to 32,768 entries. They also tuned checkpoint_completion_target = 0.9, spreading dirty page writes across 90% of the checkpoint interval instead of the default 50%, smoothing out I/O spikes that caused latency blips during checkpoints.
InnoDB at Uber — Why They Left PostgreSQL
In 2016, Uber published a blog post titled "Why Uber Engineering Switched from Postgres to MySQL." The reason came down to one internal detail: how secondary indexes store row pointers.
In PostgreSQL, every index entry points to a physical location on disk — a (page_number, offset) tuple called a ctid (current tuple identifier; secondary indexes store this ctid to find rows, so when a row physically moves, every index pointing to it must be updated). When VACUUM compacts a table and moves rows to new locations, every single index pointing to that row must be updated — that's write amplification proportional to the number of indexes. Uber's trip tables had dozens of indexes, so a single VACUUM pass triggered massive cascading index updates.
InnoDB's clustered index solves this differently. Secondary indexes store the primary key value, not a physical location. When a row moves, secondary indexes don't change — only the clustered index (which IS the table, sorted by PK) needs updating. For Uber's workload — wide tables with many indexes and heavy UPDATE traffic — this meant 10x less write amplification during maintenance.
PostgreSQL has improved since 2016. The HOT (Heap-Only Tuples) optimization allows updates that don't change indexed columns to avoid index updates entirely. And covering indexes (INCLUDE clause) reduce the need for heap fetches. But the fundamental architecture difference — physical ctid pointers vs logical PK references — remains.
RocksDB at Facebook — 50% Disk Savings with MyRocks
Facebook stores petabytes of social graph data in MySQL. Their problem: InnoDB's B+ tree storage wastes space. A B+ tree page targets ~70% fill factor after splits, meaning 30% of every page is empty. For a 10 TB database, that's 3 TB of wasted disk. Multiply by thousands of shards and it adds up to millions of dollars in SSDs.
InnoDB B+ trees have ~70% space utilization. LSM trees compact data periodically and achieve ~90%+ utilization. If your database fleet is 30 petabytes on InnoDB, how many petabytes could you save by switching to an LSM engine? What's that worth at $0.10/GB/month for SSD?
The math: 30 PB × 30% waste = 9 PB of overhead with InnoDB. With LSM at ~90% utilization, overhead drops to ~3 PB. That's a 6 PB saving. At $0.10/GB/month: 6,000,000 GB × $0.10 = $600,000/month in storage costs alone. Facebook built MyRocks — MySQL with RocksDB as the storage engine — and achieved the predicted ~50% space reduction. The trade-off: read latency P99 increased from 2ms to 8ms because LSM reads might check multiple levels. For their write-heavy social graph, that was acceptable.
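The savings arithmetic generalizes to any fleet size. A quick sketch of the article's back-of-envelope model (the waste fractions and $0.10/GB/month price are the text's assumptions, not measured values):

```python
def monthly_savings(fleet_gb, btree_waste=0.30, lsm_waste=0.10, usd_per_gb=0.10):
    """The article's back-of-envelope: waste = fleet size x empty-space
    fraction; switching engines reclaims the difference in waste."""
    saved_gb = fleet_gb * btree_waste - fleet_gb * lsm_waste
    return saved_gb, saved_gb * usd_per_gb

saved_gb, usd_month = monthly_savings(30_000_000)  # 30 PB fleet, expressed in GB
print(f"{saved_gb / 1e6:.1f} PB saved, ${usd_month:,.0f}/month")
```

Plugging in your own fleet size and SSD pricing gives a first-order estimate of whether an engine migration is even worth investigating.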
The key compaction tuning: Facebook uses leveled compaction with a custom rate limiter that caps compaction I/O at 100 MB/s during peak hours and allows 500 MB/s during off-peak. Without rate limiting, compaction storms can consume all disk bandwidth and make foreground queries timeout.
SQLite — The Most Deployed Database on Earth
There are roughly 1 trillion active SQLite databases in the world right now. Every Android phone has dozens of them. Every iPhone. Every Mac. Every Chrome browser, every Firefox browser, every Electron app. It's in every Boeing 787 and every Airbus A350. SQLite is not a client-server database — it's a library linked directly into your application, reading and writing a single file on disk.
Internally, SQLite uses the same fundamentals: B+ tree pages (4 KB default), a rollback journal or WAL for crash recovery, and an in-process page cache. But with one crucial design constraint — single writer. Only one connection can write at a time (and in WAL mode, readers never block). This eliminates all the concurrency complexity (MVCC version chains, lock managers, deadlock detection) that makes PostgreSQL and MySQL so complex. For embedded use cases — a phone app's local storage, a browser's IndexedDB, an IoT sensor's log — you don't need concurrent writers. You need a reliable, zero-configuration, single-file database. SQLite is roughly 150,000 lines of C with 100% branch coverage in its test suite — one of the most tested software projects in history.
The Anti-Lesson — Dangerous "Advice" That Sounds Right
A team sets shared_buffers to 80% of RAM on a 64 GB machine, and the Linux OOM-killer starts murdering PostgreSQL backends under load. Another team leaves autovacuum at defaults and watches a 2 GB table bloat to 8 GB over six months. These three "tips" sound smart but cause real damage — and the fix always comes down to the internals we just covered.
Bad advice #1: "Set shared_buffers as high as possible"
Why it sounds right: More cache = more pages in memory = fewer disk reads. Simple logic.
Why it's wrong: PostgreSQL reads data through the operating system's file I/O. Linux automatically caches file pages in its own page cache. When shared_buffers is 80% of RAM, the OS cache shrinks to almost nothing — and you end up with the same pages cached twice: once in PostgreSQL's buffer pool and once in the OS page cache. You've halved your effective cache, not doubled it. The PostgreSQL documentation itself recommends 25% of system RAM as the starting point.
Real consequence: A startup set shared_buffers = 48 GB on a 64 GB machine. Linux OOM-killer started killing PostgreSQL backends under load because the OS had almost no free memory for its own operations. Queries that should have been sub-millisecond were taking 200ms because the OS page cache was thrashing.
The fix: shared_buffers = 25% of RAM. Let the OS page cache handle the rest. Monitor with pg_stat_database — if your buffer hit ratio is above 99%, you're fine. If it's below 95%, investigate which tables/indexes are too large to cache, not whether shared_buffers is big enough.
Bad advice #2: "B+ trees are always better than LSM trees"
Why it sounds right: B+ trees give O(log n) reads with no amplification. LSM trees might check multiple levels. PostgreSQL and MySQL both use B+ trees. Case closed?
Why it's wrong: It completely depends on your write:read ratio. B+ trees do random I/O on every write (seek to the correct leaf page, possibly split it). LSM trees convert all writes to sequential I/O (append to memtable, flush sorted runs). For a workload that does 90% writes and 10% reads — think time-series data, logging pipelines, social media feeds — LSM trees can sustain 5-10x higher write throughput on the same hardware. Facebook's migration from InnoDB (B+ tree) to MyRocks (LSM) on their social graph wasn't because B+ trees were "bad" — it was because write volume was so high that the random I/O cost was unjustifiable.
| Workload | Better Engine | Why |
|---|---|---|
| Read-heavy (90:10 read:write) | B+ tree | Single-page lookup, no level checking |
| Write-heavy (10:90 read:write) | LSM tree | Sequential writes, no random I/O |
| Balanced (50:50) | Either โ benchmark | Depends on data size, SSD vs HDD, key distribution |
| Space-constrained | LSM tree | Compaction achieves ~90%+ utilization vs ~70% for B+ tree |
Bad advice #3: "Autovacuum is on by default, so you're covered"
Why it sounds right: PostgreSQL has autovacuum enabled by default since version 8.1. It runs in the background. Set it and forget it.
Why it's wrong: Autovacuum's default settings are tuned for conservative, small databases. The defaults trigger a vacuum when 20% of a table's rows are dead (50 + 0.2 × n_live_tup). For a 100-million-row table, that means VACUUM won't kick in until there are 20 million dead rows — by which point the table has bloated significantly, sequential scans are crawling through dead space, and your VACUUM job might take hours.
Your orders table has 50 million live rows and 12 million dead rows (from UPDATEs that created new versions). Each dead row wastes ~200 bytes. How much disk space are dead rows wasting? And if autovacuum's threshold is 20% (10 million dead rows), has it triggered yet?
The math: 12 million dead rows × 200 bytes = 2.4 GB of dead space, sitting in pages that sequential scans still have to read through. And yes, autovacuum triggered at 10 million dead rows — but by the time it finished processing all 50 million rows, another 2 million had died. For hot tables, the right approach is aggressive per-table tuning:
-- Tune autovacuum for a hot table (50M+ rows, heavy updates)
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- trigger at 1% dead rows (not 20%)
autovacuum_analyze_scale_factor = 0.005, -- re-analyze stats at 0.5%
autovacuum_vacuum_cost_delay = 2 -- less throttling (2ms; the default was 20ms before PostgreSQL 12)
);
-- Check current bloat
SELECT schemaname, relname, n_live_tup, n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 100000
ORDER BY n_dead_tup DESC;
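The trigger-point math above is worth being able to reproduce. A sketch of the formula the text quotes (threshold + scale_factor × live tuples, with PostgreSQL's default values):

```python
def autovacuum_trigger(n_live_tup, vacuum_threshold=50, scale_factor=0.2):
    """Dead tuples needed before autovacuum runs, per the formula quoted in
    the text: threshold + scale_factor * n_live_tup (PostgreSQL defaults)."""
    return vacuum_threshold + scale_factor * n_live_tup

table = 100_000_000                                  # a 100M-row table
print(autovacuum_trigger(table))                     # ~20 million dead rows at defaults
print(autovacuum_trigger(table, scale_factor=0.01))  # ~1 million after per-table tuning
```

The same formula explains why defaults feel fine on small tables: at 10,000 rows the trigger is ~2,050 dead tuples, a blink for autovacuum; at 100 million rows it's a 20-million-row backlog.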
Real consequence: Sentry (the error tracking company) wrote about a PostgreSQL incident where autovacuum fell behind on a high-traffic table. Dead rows accumulated until the table was 4x its actual data size, sequential scans became 4x slower, and eventually autovacuum itself couldn't keep up because each pass was reading the bloated table. They had to do a VACUUM FULL during a maintenance window — which locks the entire table and rebuilds it from scratch.
Common Mistakes — 6 Things That Quietly Kill Performance
Your query took 2ms three months ago. Today it takes 12ms. Nothing changed in the code. The table grew from 5 million to 20 million rows, dead tuples accumulated, the buffer pool hit ratio dropped from 99.8% to 91%, and nobody noticed until users started complaining. These six mistakes are the usual suspects.
The mistake: You deployed your app, queries are "fast enough," and you never look at buffer pool stats. Six months later, your dataset outgrew the buffer pool and your hit ratio dropped from 99.8% to 87%. Now 13% of page reads hit disk instead of RAM — and each miss is roughly 1,000x slower than a cached read. At 50K queries/sec, those misses add up to seconds of latency.
The fix: Add this to your monitoring dashboard on day one. Alert if the hit ratio drops below 95%.
-- Buffer cache hit ratio (run periodically)
SELECT round(100.0 * sum(blks_hit) /
nullif(sum(blks_hit) + sum(blks_read), 0), 2) AS hit_ratio_pct
FROM pg_stat_database WHERE datname = current_database();
-- OLTP target: > 99%. Below 95% = investigate immediately.
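Why a few points of hit ratio matter so much: the expected cost of a page read is a weighted average of RAM and disk latency. A sketch with illustrative costs (0.1 µs for a cached page, 100 µs for an SSD read — real numbers vary by hardware):

```python
def avg_page_read_us(hit_ratio, ram_us=0.1, ssd_us=100.0):
    """Expected cost of one page read given a buffer-pool hit ratio.
    The 0.1us / 100us latencies are illustrative assumptions."""
    return hit_ratio * ram_us + (1 - hit_ratio) * ssd_us

for ratio in (0.998, 0.95, 0.87):
    print(f"{ratio:.1%} hit ratio -> {avg_page_read_us(ratio):.2f} us per page read")
```

Under these assumptions, sliding from 99.8% to 87% makes the average page read roughly 40x more expensive, which is why the alert threshold belongs well above 95%.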
The mistake: Your table has 10 million live rows but occupies the space of 40 million because VACUUM hasn't reclaimed dead tuple space. Sequential scans read 4x more pages than necessary. Indexes point to pages filled with dead tuples. You keep adding more RAM and faster SSDs, but the real problem is bloat.
The fix: Check pg_stat_user_tables regularly. If n_dead_tup is more than 10% of n_live_tup, your autovacuum settings are too conservative for that table.
-- Find the most bloated tables
SELECT relname, n_live_tup, n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC LIMIT 10;
The mistake: Leaving fillfactor at the default (100% for heap tables, 90% for B+ tree indexes) on a table that gets heavy UPDATEs. With fillfactor 100%, there's no room on the page for new tuple versions — PostgreSQL must put them on a different page, breaking HOT update chains and requiring index updates.
The flip side: Setting fillfactor to 50% on a read-heavy, rarely-updated table wastes half your pages and makes scans twice as slow.
The rule: Heavy UPDATEs → fillfactor 70-80% (leave room for new versions). Read-mostly → fillfactor 100% (pack pages tight, fewer pages to read).
The mistake: You set shared_buffers = 32 GB but leave huge_pages = off. The OS manages memory in 4 KB pages. 32 GB / 4 KB = 8,388,608 page table entries. At ~8 bytes per entry that's ~64 MB of page tables — per process, and every PostgreSQL backend maps shared_buffers, so with hundreds of connections the overhead reaches gigabytes. And every TLB miss (the CPU cache for page table lookups) triggers a multi-level page table walk.
The fix: Enable huge_pages = on in postgresql.conf and configure the OS to allocate huge pages. With 2 MB huge pages: 32 GB / 2 MB = 16,384 entries — per-process page table overhead drops from ~64 MB to ~128 KB, and TLB coverage increases 512x.
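The entry-count arithmetic is easy to check yourself (pure arithmetic; actual kernel memory per entry varies by architecture and is an assumption here):

```python
def page_table_entries(mapped_bytes, page_bytes):
    """Number of page table entries needed to map a region at a given page size."""
    return mapped_bytes // page_bytes

GiB, KiB, MiB = 2**30, 2**10, 2**20
small = page_table_entries(32 * GiB, 4 * KiB)  # regular 4 KB pages
huge = page_table_entries(32 * GiB, 2 * MiB)   # 2 MB huge pages
print(small, huge, small // huge)              # 8388608 16384 512
```

The 512x reduction in entries is also a 512x increase in how much memory each TLB entry can cover, which is where most of the speedup comes from.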
The mistake: Using UUID v4 as the primary key in MySQL/InnoDB. Remember, InnoDB's clustered index stores rows sorted by primary key. UUIDs are random — every INSERT goes to a random position in the B+ tree. At 10K inserts/sec, nearly every insert targets a different leaf page, causing constant page splits (when a full leaf must accept a new entry, it's split into two half-full pages — extra I/O now, wasted space until the pages fill back up). The table ends up with pages that are 50-70% full instead of ~90%.
InnoDB pages are 16 KB. With auto-increment IDs, inserts always append to the rightmost leaf (sequential). With UUID v4, inserts target random leaves. If your buffer pool caches 100,000 pages but your table has 10 million leaf pages, what fraction of inserts will cause a disk read to fetch the target page? (Hint: it's not good.)
The math: 100,000 / 10,000,000 = 1% of pages are cached. That means 99% of UUID inserts require a disk read before the write, plus a page split if the target page is full. With auto-increment, inserts hit the same rightmost pages repeatedly — they stay cached. Percona benchmarks showed UUID v4 PKs causing 10x worse insert throughput compared to auto-increment on large InnoDB tables. If you need UUIDs, use UUID v7 (time-ordered) or ULID — they're roughly sequential, so inserts append instead of scattering.
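A toy simulation makes the 99% figure tangible: cache 1% of leaf pages (the hottest tail) and measure how often sequential vs random-key inserts miss the cache. This crudely models a buffer pool, not InnoDB's actual LRU:

```python
import random

def insert_miss_rate(total_pages, cached_pages, n_inserts=50_000, sequential=False):
    """Fraction of inserts that target an uncached leaf page. Sequential
    inserts always hit the rightmost leaf; random-key inserts (the UUID v4
    case) pick a uniformly random leaf."""
    cache = set(range(total_pages - cached_pages, total_pages))  # hottest tail cached
    misses = 0
    for _ in range(n_inserts):
        page = total_pages - 1 if sequential else random.randrange(total_pages)
        misses += page not in cache
    return misses / n_inserts

print(insert_miss_rate(10_000_000, 100_000, sequential=True))   # 0.0
print(insert_miss_rate(10_000_000, 100_000, sequential=False))  # ~0.99
```

Sequential inserts never miss because the one page they touch is always hot; random inserts miss at almost exactly the uncached fraction of the tree.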
The mistake: You COPY 50 million rows into a table and immediately start querying. The query planner uses stale statistics — it thinks the table has 1,000 rows (from before the load) and chooses a sequential scan instead of an index scan. Your "fast" query takes 30 seconds instead of 2ms because the planner made a catastrophically wrong decision based on outdated cardinality estimates.
The fix: Always run ANALYZE immediately after bulk loads. It samples ~30,000 rows and updates pg_statistic with column value distributions, most common values, and distinct counts. The query planner depends entirely on these statistics to choose between index scans, hash joins, merge joins, and sequential scans.
-- After any bulk load:
COPY orders FROM '/data/orders.csv' WITH CSV HEADER;
ANALYZE orders; -- Update statistics NOW, don't wait for autovacuum
-- Verify planner has fresh stats
SELECT relname, last_analyze, last_autoanalyze, n_live_tup
FROM pg_stat_user_tables WHERE relname = 'orders';
Interview Playbook — "Explain B+ Tree vs LSM Tree"
You're in a system design round. The interviewer draws a database on the whiteboard and asks: "We're ingesting 200K writes per second of time-series data. Would you use a B+ tree or an LSM tree — and why?" Your answer reveals whether you understand storage engines or just memorize "use PostgreSQL for OLTP."
Before reading the model answers: write down 3 differences between B+ trees and LSM trees from memory. For each, identify which workload it favors. If you can't, go back to Sections 6 and 7.
What they want to hear: You understand both data structures at a conceptual level and know when each is used.
"B+ trees and LSM trees are the two main ways databases organize data on disk. A B+ tree keeps data sorted in fixed-size pages with a tree structure — think of it like a book's table of contents. You follow pointers from root to leaf to find any row in 3-4 disk reads. This is great for reads. PostgreSQL and MySQL both use B+ trees for their indexes.
An LSM tree takes a different approach: it writes everything to an in-memory buffer first, then flushes sorted batches to disk. Reads are slower because you might check multiple levels, but writes are much faster because they're all sequential — no random disk I/O. RocksDB, Cassandra, and LevelDB use LSM trees.
Rule of thumb: B+ tree for read-heavy, LSM for write-heavy."
What they want to hear: Trade-offs with specific numbers, real-world examples, and understanding of amplification.
"The core trade-off is between read amplification, write amplification, and space amplification.
B+ trees: reads hit 3-4 pages (tree depth), writes do random I/O to the target leaf page — plus potential page splits. Write amplification is moderate. Space utilization is ~70% after splits. InnoDB's clustered index sorts data by primary key, so range scans on the PK are extremely fast — the data is physically contiguous.
LSM trees: writes go to an in-memory memtable, then flush to sorted SSTables on disk. Write amplification comes from compaction โ leveled compaction in RocksDB can hit 30x. But write throughput is 5-10x higher because all disk I/O is sequential. Space utilization is ~90%+ because compaction eliminates fragmentation.
Real example: Facebook's MyRocks replaced InnoDB with RocksDB as MySQL's storage engine and saved 50% of their SSD fleet. Their social graph is write-heavy — the LSM trade-off made sense. Bloom filters reduce LSM read amplification — each SSTable has a filter that skips it in microseconds if the key isn't there."
What they want to hear: You've operated these in production, understand the failure modes, and can reason about compaction strategies.
"Beyond the standard read/write amplification trade-off, the operational differences are what matter in production.
B+ tree failure mode: page splits under random-insert workloads. InnoDB with UUID v4 primary keys is the classic — random inserts scatter across the tree, splitting pages constantly. Throughput drops 10x vs auto-increment. The fix: UUID v7 or ULID for time-ordered keys.
LSM failure mode: compaction storms. If write ingestion outpaces compaction, levels accumulate and reads degrade. RocksDB's rate limiter caps compaction I/O — Facebook limits it to 100 MB/s during peak. Leveled compaction gives better read performance but 3x higher write amplification than size-tiered. The choice depends on your SSD endurance budget.
Hybrid approaches exist: PostgreSQL pairs its nbtree B+ tree with heap-side HOT updates to avoid index churn. TiDB and CockroachDB use RocksDB underneath but expose a SQL interface. And newer designs like Bw-trees (used in Microsoft's Hekaton in-memory engine) attempt to combine the best of both — log-structured updates with B-tree read performance.
The real answer is always: benchmark YOUR workload. pgbench and sysbench exist for a reason."
Practice Exercises — Hands-On with Real Database Internals
You've read about 8 KB pages, B+ trees with 500 fanout, and buffer pool hit ratios. Now run the commands yourself — the difference between "I've read about pageinspect" and "I've used pageinspect to look inside a real page" is enormous in interviews. Every exercise below works on a standard PostgreSQL installation with just CREATE EXTENSION pageinspect;.
Install the pageinspect extension and examine the first page of any table. Count how many tuples (rows) fit in a single 8 KB page. Compare a narrow table (2 columns) vs a wide table (20 columns) — how does row width affect tuples per page?
Use CREATE EXTENSION pageinspect; then SELECT * FROM heap_page_items(get_raw_page('your_table', 0)); to see every tuple on page 0. The lp_len column shows each tuple's size in bytes.
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- Inspect page 0 of your table
SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('users', 0));
-- Count tuples per page
SELECT count(*) AS tuples_on_page_0
FROM heap_page_items(get_raw_page('users', 0))
WHERE lp_len > 0;
-- See page header (free space, etc.)
SELECT * FROM page_header(get_raw_page('users', 0));
Take a real table in your database. Calculate the theoretical B+ tree depth using the fanout formula: depth = ceil(log(rows) / log(fanout)). Then verify with bt_metap(). How close was your estimate?
PostgreSQL B+ tree pages are 8 KB. Each index entry is roughly key_size + 8 bytes (ctid pointer). For a bigint key (8 bytes), each entry is ~16 bytes. Fanout = 8,192 / 16 = ~500. Use SELECT * FROM bt_metap('your_index_name'); to see the actual tree height in the level column. Note that level is zero-based from the leaves: leaf pages are level 0, so total depth = level + 1.
-- Step 1: Count rows
SELECT reltuples::bigint AS estimated_rows
FROM pg_class WHERE relname = 'users';
-- Example: 10,000,000
-- Step 2: Calculate theoretical depth
-- fanout ~500 for bigint keys
-- depth = ceil(log(10,000,000) / log(500)) = ceil(7 / 2.7) = ceil(2.59) = 3
-- Step 3: Verify with pageinspect
SELECT * FROM bt_metap('idx_users_email');
-- level = 2 (leaves are level 0), i.e. 3 levels total -- matches our estimate!
Write a query that captures the buffer cache hit ratio, and run it before and after a workload. Use pgbench to generate load. Does the ratio change once the working set exceeds shared_buffers?
Use pg_stat_database for the hit/read counters. Run SELECT pg_stat_reset(); before your test to get clean numbers. Then run pgbench -c 10 -T 60 your_database and check the ratio again.
# Terminal 1: Reset stats and watch
psql -c "SELECT pg_stat_reset();"
# Wait for pgbench to finish, then:
psql -c "SELECT round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
FROM pg_stat_database WHERE datname = current_database();"
# Terminal 2: Generate load
pgbench -i -s 100 testdb # init with 10M rows (~1.5 GB)
pgbench -c 10 -T 60 testdb # 10 clients, 60 seconds
# Compare: if shared_buffers = 128 MB and data = 1.5 GB,
# you'll see hit ratio around 85-90% (working set > cache).
# Increase shared_buffers to 512 MB -> hit ratio jumps to 99%+.
Create two identical tables โ one with fillfactor = 100 (default) and one with fillfactor = 70. Insert 1 million rows into each, then run 100,000 UPDATEs (changing a non-indexed column). Compare: total time, n_dead_tup, and table size on disk.
Create with CREATE TABLE t1 (id int, val text) WITH (fillfactor = 100); and similarly for t2 with fillfactor 70. After UPDATEs, check pg_stat_user_tables for n_dead_tup and pg_total_relation_size() for disk usage. Fillfactor 70 should have more HOT updates (check n_tup_hot_upd).
-- Create tables with different fillfactors
CREATE TABLE bench_ff100 (id serial PRIMARY KEY, val text) WITH (fillfactor = 100);
CREATE TABLE bench_ff70 (id serial PRIMARY KEY, val text) WITH (fillfactor = 70);
-- Insert 1M rows each
INSERT INTO bench_ff100 (val) SELECT md5(random()::text) FROM generate_series(1, 1000000);
INSERT INTO bench_ff70 (val) SELECT md5(random()::text) FROM generate_series(1, 1000000);
ANALYZE bench_ff100; ANALYZE bench_ff70;
-- Run 100K updates
\timing on
UPDATE bench_ff100 SET val = md5(random()::text) WHERE id <= 100000;
UPDATE bench_ff70 SET val = md5(random()::text) WHERE id <= 100000;
-- Compare results
SELECT relname, n_tup_upd, n_tup_hot_upd, n_dead_tup,
pg_size_pretty(pg_total_relation_size(relid))
FROM pg_stat_user_tables WHERE relname LIKE 'bench_ff%';
Perform a single INSERT into a table with 3 indexes. Use pg_waldump to find all WAL records generated by that INSERT. Count: how many records? How many bytes? What's the write amplification?
Use SELECT pg_current_wal_lsn(); before and after the INSERT to know the WAL range. Then run pg_waldump --start=BEFORE --end=AFTER WAL_FILE. You should see 1 Heap INSERT + 3 Btree INSERT_LEAF records (one per index).
# In psql:
SELECT pg_current_wal_lsn(); -- e.g., 0/1A8E000
INSERT INTO users (email, name, age) VALUES ('test@example.com', 'Test', 25);
SELECT pg_current_wal_lsn(); -- e.g., 0/1A8E400
# In terminal (as postgres user):
pg_waldump --start=0/1A8E000 --end=0/1A8E400 \
/var/lib/postgresql/14/main/pg_wal/000000010000000000000001
# Expected output: 4 records
# rmgr: Heap  desc: INSERT       -- the row itself
# rmgr: Btree desc: INSERT_LEAF  -- PK index
# rmgr: Btree desc: INSERT_LEAF  -- email index
# rmgr: Btree desc: INSERT_LEAF  -- name index
# Total: ~800 bytes of WAL for a ~200-byte row = 4x write amplification
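If you don't have filesystem access for pg_waldump, the byte count alone can be measured from psql with LSN arithmetic. A sketch, assuming the same users table; the \gset meta-command stores the first LSN in a psql variable:

```sql
-- Capture the WAL position before the insert
SELECT pg_current_wal_lsn() AS before_lsn \gset

INSERT INTO users (email, name, age) VALUES ('test2@example.com', 'Test2', 30);

-- pg_wal_lsn_diff returns the number of WAL bytes between two positions
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'before_lsn') AS wal_bytes;
```

This won't show you the individual records the way pg_waldump does, but it's enough to compute write amplification: divide wal_bytes by your row size.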
Cheat Sheet: Database Internals at a Glance
Pin this. Every number below comes from the internals we covered; if any feels unfamiliar, the section reference tells you exactly where to go back.
Page sizes: PostgreSQL 8 KB; MySQL/InnoDB 16 KB; SQLite 4 KB (default); RocksDB 4 KB (configurable); OS page 4 KB (x86); huge page 2 MB (Linux)
B+ tree math: fanout = page_size / entry_size = 8 KB / 16 B = ~500 (bigint key); depth = ceil(log(rows)/log(fanout)); 500M rows -> depth 3-4; page reads = depth + 1 (heap)
Latency: buffer hit (RAM) ~0.1 us; SSD random read ~100 us; HDD random read ~10 ms; buffer hit vs SSD: 1,000x faster; buffer hit vs HDD: 100,000x faster
PostgreSQL targets: shared_buffers 25% of RAM; buffer hit ratio > 99% (OLTP); huge_pages on if RAM > 8 GB; checkpoint_completion_target 0.9; dead-tuple ratio < 10%
B+ tree vs LSM: reads O(log_f N) vs O(levels x log); writes random I/O vs sequential I/O; space utilization ~70% vs ~90%+
bt_metap('idx') -> tree depth
heap_page_items() -> tuples on page
pg_stat_database -> hit ratio
pg_stat_user_tables -> dead tuples
pg_waldump -> WAL records
EXPLAIN (BUFFERS) -> page reads
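The last tool in the list ties the whole cheat sheet together: EXPLAIN with the BUFFERS option shows the "depth + 1" page-read count directly. A sketch, assuming the users table and email index from the exercises:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'alice@example.com';

-- On the Index Scan node, look for a line like "Buffers: shared hit=4":
-- root + internal + leaf index pages, plus one heap page = depth + 1.
-- "hit" means the pages came from the buffer pool; "read" means disk.
```

Run the same query twice: the first run may show read counts (cold cache), the second should show all hits, which is the buffer pool doing its job.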
Connected Topics: Where to Go Next
The WAL you learned about in Section 6 is the same mechanism that powers database replication. The buffer pool from Section 6 is why Redis exists: it's an external buffer pool for when the internal one isn't enough. Every topic below connects directly to an internal you just mastered.
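You can see the WAL-to-replication connection for yourself: logical decoding turns the same WAL stream that crash recovery replays into a human-readable change feed. A sketch, assuming wal_level = logical in postgresql.conf; the slot name demo_slot is arbitrary, and test_decoding is the contrib plugin that ships with PostgreSQL:

```sql
-- Create a slot that retains and decodes WAL from this point on
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

INSERT INTO users (email, name, age) VALUES ('repl@example.com', 'Repl', 30);

-- Read the decoded changes: the same bytes a replica would consume
SELECT data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Always drop unused slots -- a forgotten slot pins WAL and fills the disk
SELECT pg_drop_replication_slot('demo_slot');
```

This is the foundation of change data capture tools like Debezium: they are, at heart, clients that drain a replication slot.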