TL;DR — The One-Minute Version
- Foreign keys are the links between sheets. If your orders table has user_id = 42, the database guarantees that user #42 actually exists in the users table. Try to insert an order for a non-existent user? The database refuses. That's referential integrity.
- Indexes are the table of contents. An index is a separate data structure (usually a B-tree) that the database maintains alongside your table — like the index at the back of a textbook: instead of reading every page to find "B-tree," you look it up and jump straight to page 247. Without an index, the database reads EVERY row to find your data (a "sequential scan"). You can create one right now: CREATE INDEX idx_email ON users(email);
- ACID is the promise that your data won't corrupt itself, even if the power goes out mid-write. It's four guarantees every relational database makes: Atomicity (all-or-nothing), Consistency (rules always hold), Isolation (concurrent transactions don't see each other's half-done work), Durability (committed data survives crashes). You can check your database's isolation level right now: SHOW transaction_isolation; in PostgreSQL.
You probably have a relational database on your machine right now and don't know it. Every Mac and Linux box ships with SQLite: a serverless, zero-configuration relational database that's embedded directly into your application and stores the entire database in a single file. There are over 1 trillion SQLite databases in active use — it's in every iPhone, every Android, every Mac, every web browser (Chrome, Firefox, Safari), every copy of Windows 10+, every Airbus A350, and every Dropbox client. Richard Hipp created it in 2000 for the US Navy. Open a terminal and type sqlite3 test.db — congratulations, you just created a relational database. Zero install, zero config, zero excuses for not following along.
# Create a database, a table, insert a row, query it — in 10 seconds
sqlite3 test.db "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);"
sqlite3 test.db "INSERT INTO users VALUES (1, 'Ada Lovelace', 'ada@example.com');"
sqlite3 test.db "SELECT * FROM users;"
# Output: 1|Ada Lovelace|ada@example.com
# That's it. You just used a relational database.
# The file 'test.db' IS the database. You can copy it, email it, back it up.
Here's the map of what we'll cover: where relational databases came from (a single paper by one IBM researcher in 1970), why they've survived 54 years of "NoSQL will replace everything" hype cycles, how ACID actually works under the hood (with commands you can break and fix), and when you should — and shouldn't — reach for one.
Every command in this page is real. If you have PostgreSQL — the most advanced open-source relational database, created by Michael Stonebraker at UC Berkeley in 1986 as "Postgres" (Post-Ingres) and now used by Instagram, Stripe, Shopify, Reddit, and thousands more — run psql -U postgres. (Install: brew install postgresql on Mac, sudo apt install postgresql on Ubuntu, or download from postgresql.org. Handy psql commands: \dt lists tables, \d users shows a schema, \di lists indexes.) If you have MySQL — the world's most popular open-source relational database by install count, created by Michael Widenius and David Axmark in 1995, acquired by Sun Microsystems (2008) and then Oracle (2010), and used by Facebook, Twitter, YouTube, and Shopify — run mysql -u root -p. (SHOW DATABASES; lists databases, SHOW TABLES; lists tables, DESCRIBE users; shows a schema.) If you have neither, sqlite3 is already on your machine. No excuses.
The Scenario — Your JSON File Just Ate $5,000
You're building a payment system for a small online store. Day 1, you make the perfectly reasonable decision that every beginner makes: store everything in a JSON file. Users, orders, products — one big data.json. It works. Your 10 beta testers love it.
{
"users": [
{ "id": 1, "name": "Alice", "state": "CA", "balance": 500 },
{ "id": 2, "name": "Bob", "state": "NY", "balance": 300 }
],
"orders": [
{
"id": 101,
"user_id": 1,
"user_name": "Alice",
"user_state": "CA",
"items": [
{ "product": "Mechanical Keyboard", "price": 149.99 }
],
"total": 149.99,
"status": "paid"
}
]
}
Looks clean. Feels simple. You ship it. Here's what happens over the next 30 days.
Let's do the math on that search time. A 500 MB JSON file with 10,000 users and 50,000 orders. To find one order, your code has to:
- Read the entire 500 MB file into memory (that's half a gigabyte of RAM, just to find one order)
- Parse the JSON (turning raw text into objects — at ~150 MB/sec for a typical JSON parser, that's over 3 seconds just for parsing)
- Loop through every order object until you find the one you want — a linear scan, O(n). A linear scan checks every single item one by one, first to last: if you have 50,000 orders and the one you want is #49,999, you check all 49,998 before it, and on average you check half of them. A database with an index can find the same record in O(log n) — for 50,000 records, that's ~16 lookups instead of 25,000.
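The gap between those two complexities is easy to verify. Here's a minimal Python sketch — a sorted list plus binary search standing in for a real B-tree index (real indexes are trees, but the comparison-count math is the same):

```python
import bisect
import math

# 50,000 orders, as in the scenario above.
order_ids = list(range(50_000))
target = 49_999  # worst case for a linear scan

# Without an index: check every ID until you hit the target.
linear_steps = 0
for oid in order_ids:
    linear_steps += 1
    if oid == target:
        break

# With an index: binary search halves the candidate range each step.
pos = bisect.bisect_left(order_ids, target)
index_steps = math.ceil(math.log2(len(order_ids)))

print(linear_steps)              # 50000 comparisons — every single row
print(index_steps)               # 16 comparisons — ceil(log2(50,000))
print(order_ids[pos] == target)  # True — the "index" found the same record
```

Sixteen comparisons versus fifty thousand, for the exact same lookup. That ratio only grows with your data: double the orders and the scan doubles, but the index needs just one more step.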
That's 3+ seconds to find a single order. A relational database with an index does the same thing in 0.05 milliseconds. That's 60,000 times faster. Not a typo.
But the speed isn't even the worst part. The worst part is the corruption. Here's exactly how it happens:
# Two customers click "Buy" at the EXACT same moment.
# Your server spawns two processes to handle them.
#
# t=0ms    Process A (Alice):  data = read_file("data.json")
# t=1ms    Process B (Bob):    data = read_file("data.json")   <- B reads BEFORE A writes
# t=50ms   Process A:          data["orders"].append(alice_order)
# t=51ms   Process B:          data["orders"].append(bob_order)
# t=100ms  Process A:          write_file("data.json", data)
# t=102ms  Process B:          write_file("data.json", data)   <- OVERWRITES A's write
#
# Bob's write overwrites Alice's. Alice's order is gone. $149.99 lost.
# Process B read the file before Process A wrote, so Process B's copy
# never included Alice's order. When Process B writes, it saves its
# version — without Alice.
# No error. No warning. The data is just silently wrong.
Can you fix this with file locking? What if you make each process lock the file before reading and unlock after writing?
File locking "works" for one server. But the moment you have two servers (which you will, the day your single server can't keep up), file locks don't work across machines. And even on one server, file locks are advisory on most operating systems — a buggy process can ignore them. You need a system PURPOSE-BUILT for concurrent access. That's exactly what a database is.
This is the moment every developer hits. The JSON file was fine for prototyping. But your store has real customers now, real money, and real consequences when data goes wrong. You need something that can handle thousands of reads and writes per second, keep data consistent even when multiple people are writing at the same time, and survive crashes without losing a single transaction.
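For completeness, here's what the single-server fix looks like — a minimal Python sketch using Unix advisory locks (fcntl.flock, so Unix-only), holding the exclusive lock across the WHOLE read-modify-write cycle, not just the write. The file name and helper are illustrative:

```python
import fcntl
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    json.dump({"orders": []}, f)

def append_order(order):
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        data = json.load(f)            # read AFTER acquiring the lock
        data["orders"].append(order)
        f.seek(0)
        f.truncate()
        json.dump(data, f)             # write BEFORE releasing the lock
        # the lock is released when the file is closed

append_order({"id": 101, "total": 149.99})  # Alice
append_order({"id": 102, "total": 12.99})   # Bob

with open(path) as f:
    print(len(json.load(f)["orders"]))  # 2 — both orders survived
```

This closes the race on one machine — but only because every writer voluntarily calls flock. A process that skips the lock can still clobber the file, and the lock means nothing to a second server. That's the gap a database fills.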
You need a database. But which kind?
First Attempt — "Just Use MongoDB"
You ask around. A coworker says: "Just use MongoDB. It's faster, it's easier, it stores JSON natively — perfect for your data." (MongoDB is a document-oriented NoSQL database that stores data as JSON-like documents called BSON. Created in 2009 by 10gen, now MongoDB Inc., it's very popular for prototyping because the schema is flexible — you can store any shape of document without defining columns first. But it doesn't enforce relationships between documents by default — no foreign keys — which means your application code has to maintain data consistency instead of the database.) And honestly, it sounds great. Your data is already JSON. MongoDB stores JSON documents. No schema to define. No migrations to run. Just shove the documents in.
// MongoDB: each order is a self-contained document
db.orders.insertOne({
_id: ObjectId("..."),
user: {
id: 1,
name: "Alice",
state: "CA",
email: "alice@example.com",
address: "123 Market St, San Francisco, CA 94105"
},
items: [
{ product: "Mechanical Keyboard", category: "Electronics", price: 149.99 },
{ product: "USB-C Cable", category: "Accessories", price: 12.99 }
],
total: 162.98,
status: "paid",
created_at: ISODate("2024-01-15T10:30:00Z")
});
// Fast read — everything in one document, no JOINs needed
db.orders.findOne({ _id: ObjectId("...") });
// Returns the complete order with user info embedded. Quick!
This works beautifully for simple reads. Fetch one order? Lightning fast — everything is in one document. No JOINs needed. (A JOIN is a SQL operation that combines rows from two or more tables based on a related column: SELECT * FROM orders JOIN users ON orders.user_id = users.id merges each order row with its corresponding user row. JOINs are a superpower of relational databases — they let you store data ONCE, normalized, and combine it on the fly. Document databases don't have native JOINs, so you either duplicate data across documents or run multiple queries and combine the results in application code.) MongoDB is living up to the hype. Your coworker was right.
Then your boss asks a simple question: "Show me all orders from users in California who spent more than $100 in the last 30 days, grouped by product category."
In your head, you think: "That's a basic business question. How hard can it be?"
db.orders.aggregate([
// Step 1: Only orders from last 30 days
{ $match: {
created_at: { $gte: new Date(Date.now() - 30*24*60*60*1000) }
}},
// Step 2: Only users in California (embedded in every document)
{ $match: { "user.state": "CA" } },
// Step 3: Unwind the items array (one row per item)
{ $unwind: "$items" },
// Step 4: Group by category, sum the prices
{ $group: {
_id: "$items.category",
total_spent: { $sum: "$items.price" },
order_count: { $sum: 1 }
}},
// Step 5: Only categories over $100
{ $match: { total_spent: { $gt: 100 } } },
// Step 6: Sort by total
{ $sort: { total_spent: -1 } }
]);
// 20 lines. A pipeline of stages. Each stage transforms the data.
// You had to learn a whole new query language ($match, $unwind, $group).
// And if you want to JOIN data from another collection? $lookup. Good luck.
SELECT p.category, SUM(oi.price) AS total_spent, COUNT(*) AS order_count
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.state = 'CA'
AND o.created_at >= NOW() - INTERVAL '30 days'
GROUP BY p.category
HAVING SUM(oi.price) > 100
ORDER BY total_spent DESC;
-- 10 lines. Reads like English:
-- "From users in CA, join their orders from the last 30 days,
-- group by product category, show only categories over $100."
-- No special syntax to learn. SQL was DESIGNED for this.
Look at the difference. The MongoDB version requires you to learn a custom aggregation pipeline syntax — $match, $unwind, $group, $sort — a whole domain-specific language. The SQL version reads almost like English: "from users in California, join their orders, group by category, where total is more than $100." SQL was literally designed for exactly this kind of question. It's been doing this since 1974.
But the query complexity isn't even the real problem. The real problem is what's hiding inside every order document. Look at the MongoDB document again. Alice's name, state, email, and address are embedded inside every single order she's ever made.
Alice moves from California to New York and updates her address. She has 47 orders in the system. What happens to her address in those 47 order documents?
Nothing. Those 47 documents still say "CA" and "123 Market St, San Francisco." You'd have to manually find and update all 47. Miss one? Now your reports show Alice in two different states. Your "California orders" report is wrong. This is the update anomaly — the same data stored in 47 places, and you have to update all 47 perfectly. In a relational database, Alice's address is stored ONCE in the users table. Update it once, and every query that JOINs to that row sees the new address automatically.
This is called denormalization: storing the same data in multiple places for faster reads. In the MongoDB example, Alice's address is copied into every order document so you can read one order without looking up the user separately. The trade-off is that writes and updates become dangerous — change Alice's address and you must update every copy; miss one, and your data is inconsistent. Document databases denormalize by default; relational databases normalize by default (store each fact once) and use JOINs to combine data on the fly. Denormalization is fine when data rarely changes. But user addresses change. Names change (people get married). Email addresses change. Every time something changes, you have to find and update every copy, across every document that embedded it. The more copies, the more chances you miss one. The more you miss, the more your reports lie to you.
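The update anomaly is easy to demonstrate with a toy in-memory sketch — plain Python dicts standing in for documents and rows, no real database involved:

```python
# Denormalized: Alice's state is COPIED into each of her 47 order documents.
denorm_orders = [{"id": i, "user": {"name": "Alice", "state": "CA"}}
                 for i in range(47)]

# Normalized: orders store only a user_id; the state lives in ONE place.
users = {1: {"name": "Alice", "state": "CA"}}
norm_orders = [{"id": i, "user_id": 1} for i in range(47)]

# Alice moves to New York. Normalized: exactly one update.
users[1]["state"] = "NY"

# Denormalized: you must find and rewrite all 47 copies — and a buggy
# update loop (deliberate off-by-one below) silently leaves stale data.
for order in denorm_orders[:-1]:   # oops: skips the last order
    order["user"]["state"] = "NY"

stale = [o for o in denorm_orders if o["user"]["state"] == "CA"]
print(len(stale))  # 1 — one order still says CA; your reports now lie

# Every normalized "join" sees the new state automatically:
print({users[o["user_id"]]["state"] for o in norm_orders})  # {'NY'}
```

The normalized version can't drift, because there's nothing to drift: one fact, one place.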
MongoDB is a great tool for specific use cases (we'll cover when to use it later). But for a payment system with relational data — users who have orders, orders that contain items, items that reference products — you're swimming upstream. You need a system that was designed for relationships.
Where It Breaks — The Four Sins of Flat Storage
You delete user #42 from your JSON file. But user #42 has 15 orders, 3 reviews, and a saved payment method — all referencing user_id: 42. What happens to those records? Who's responsible for cleaning them up? What if you miss one?
Let's name the four specific things that go wrong when you store relational data in flat files or document stores. These aren't theoretical — each one has cost real companies real money.
Let's prove sin #4 with real PostgreSQL commands. Open two terminal windows and connect to the same database.
-- Setup (run this first in either terminal):
CREATE TABLE products (id SERIAL PRIMARY KEY, name TEXT, stock INT);
INSERT INTO products VALUES (1, 'Limited Edition Keyboard', 1);
-- Alice's transaction:
BEGIN;
SELECT stock FROM products WHERE id = 1;
-- Returns: stock = 1. Great, it's in stock!
SELECT stock FROM products WHERE id = 1 FOR UPDATE;
-- ^^^ FOR UPDATE locks this row. Bob's transaction will WAIT here.
-- This is how relational databases prevent the oversell.
UPDATE products SET stock = stock - 1 WHERE id = 1;
COMMIT;
-- Alice gets the keyboard. Stock is now 0.
-- Bob's transaction (run while Alice's is still open):
BEGIN;
SELECT stock FROM products WHERE id = 1 FOR UPDATE;
-- ^^^ Bob's query HANGS here. It's waiting for Alice's lock to release.
-- In a flat file, both would read stock=1 and both would succeed.
-- In PostgreSQL, Bob waits his turn.
-- After Alice commits, Bob's query returns:
-- stock = 0. Out of stock.
-- Bob's app checks: if stock <= 0, abort
ROLLBACK; -- Bob doesn't get the keyboard. But no oversell happened.
-- That's ACID Isolation in action. You just prevented a $149.99 lawsuit.
Try this yourself. It takes 60 seconds with two terminal windows. The FOR UPDATE clause is the secret — it tells PostgreSQL "lock this row until I'm done with it." Any other transaction that tries to read the same row with FOR UPDATE has to wait. That's pessimistic locking: you lock the resource BEFORE modifying it, so nobody else can touch it until you're done. It's called "pessimistic" because you assume conflicts WILL happen, so you lock preemptively. The opposite, optimistic locking, doesn't lock — it checks at write time whether anyone else changed the data, and retries if so. Pessimistic is safer but slower (other transactions wait); optimistic is faster but requires retry logic. PostgreSQL supports both; here, the database assumes conflicts will happen and prevents them proactively.
A flat file can't do this. MongoDB has always had atomic single-document operations, and since version 4.0 it supports multi-document transactions — but they're slower and come with caveats. Relational databases have been doing this correctly, efficiently, and at scale since the 1980s.
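If you don't have PostgreSQL and two terminals handy, you can watch the same principle from a single Python script using SQLite. One caveat: SQLite has no row-level FOR UPDATE — a write transaction locks the whole database — but the effect is the same: a second writer cannot sneak in mid-transaction. A minimal sketch:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shop.db")
# isolation_level=None: we issue BEGIN/COMMIT ourselves.
# timeout=0: fail immediately instead of waiting for the lock.
conn1 = sqlite3.connect(path, isolation_level=None, timeout=0)
conn2 = sqlite3.connect(path, isolation_level=None, timeout=0)
conn1.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, stock INT)")
conn1.execute("INSERT INTO products VALUES (1, 1)")  # one keyboard left

conn1.execute("BEGIN IMMEDIATE")  # Alice takes the write lock
conn1.execute("UPDATE products SET stock = stock - 1 WHERE id = 1")

try:
    conn2.execute("BEGIN IMMEDIATE")       # Bob tries to write too
    bob_got_in = True
except sqlite3.OperationalError:           # "database is locked"
    bob_got_in = False
print(bob_got_in)  # False — Bob must wait his turn, just like FOR UPDATE

conn1.execute("COMMIT")  # Alice's purchase lands
print(conn2.execute("SELECT stock FROM products").fetchone()[0])  # 0 — no oversell
```

With a nonzero timeout, Bob's BEGIN IMMEDIATE would simply wait for Alice's commit instead of erroring — the same "Bob's query hangs here" behavior you saw in the psql demo.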
These four problems — orphans, duplication, query complexity, no ACID — are the symptoms. What's the root cause? What fundamental idea is missing from flat files and document stores that relational databases have?
The root cause is the absence of relationships as first-class citizens. In a flat file or document store, relationships between data exist only in your head (or in your application code). The storage system doesn't know or care that an order "belongs to" a user. A relational database knows. It enforces it. The foreign key IS the relationship, and the database guards it like a bouncer. You can't break it accidentally. That's the breakthrough Edgar Codd had in 1970.
The Breakthrough — Codd's Relational Model (1970)
The year is 1970. Computers fill entire rooms. Data is stored in hierarchical databases like IBM's IMS (Information Management System, released in 1966 for the Apollo space program), where data is organized as a tree — parent records, child records, grandchild records — and you find data by navigating from the root down. Or in network databases following the CODASYL model (formalized in 1969), where records can have MULTIPLE parents, forming a graph instead of a tree, but you still navigate by following pointers from record to record. Both work like a maze. Want to query across branches — say, "all orders across all customers in California"? You traverse the ENTIRE tree. Want to ask a question the original designers didn't anticipate? Tough luck. You'd need to restructure the entire database and rewrite every application that uses it. Programmers spent more time managing pointers than working with actual data.
Then an IBM researcher named Edgar F. Codd publishes a paper that changes everything: "A Relational Model of Data for Large Shared Data Banks" (Communications of the ACM, June 1970). The idea is radical in its simplicity:
Stop thinking about HOW data is stored. Think about WHAT the data IS. Organize data into tables (which he called "relations" — hence "relational"). Each table has rows (records) and columns (attributes). Link tables together using keys. Then let users describe WHAT they want in a simple language, and let the computer figure out HOW to get it efficiently.
You say "give me all orders from California users," and the database figures out whether to use an index, a sequential scan, a hash join, or a nested loop. You never touch a pointer. You never navigate a tree. You just ask.
IBM ignored him. They had billions invested in IMS and didn't want to cannibalize their own product. But a young programmer named Larry Ellison read the paper and saw the future. He founded a company to build a relational database. That company is Oracle. Founded in 1977 by Ellison, Bob Miner, and Ed Oates, it took Codd's papers and IBM's System R research papers (which described SQL) and built the first commercial relational database: Oracle v2, released in 1979 — before IBM's own SQL/DS (1981). Ellison literally beat IBM to market using IBM's own research. Oracle is now a $300B company, and Ellison is one of the richest people on Earth.
Meanwhile, at UC Berkeley, a professor named Michael Stonebraker built an open-source relational database called Ingres (Interactive Graphics and Retrieval System, built from 1973-1975) — one of the first relational implementations, though it used its own query language, QUEL, rather than SQL. Then he built its successor: Postgres (1986). "Post-Ingres." That project became PostgreSQL — the database that today powers Instagram, Stripe, Shopify, and Reddit. Ingres alumni spread across the industry: Robert Epstein co-founded Sybase (which became the basis for Microsoft SQL Server), and the Ingres codebase was commercialized as CA-Ingres. Stonebraker won the Turing Award — the highest honor in computer science, often called "the Nobel Prize of computing," awarded annually by the ACM with a $1 million prize — in 2014. Codd won it in 1981.
Now let's see what Codd's model actually looks like in practice. Remember that messy JSON document from Section 2 where Alice's address was embedded in every order? Here's how a relational database handles the same data.
Normalization — From One Giant Blob to Clean Tables
Normalization is the fancy word for "don't store the same thing twice." Formally, it's the process of organizing data to eliminate redundancy. "Third Normal Form" (3NF) means: (1) every column depends on the primary key (1NF), (2) every column depends on the WHOLE key (2NF), and (3) every column depends on NOTHING BUT the key (3NF) — the famous mantra is "the key, the whole key, and nothing but the key, so help me Codd." In practice: if data is duplicated, break it into a separate table and link it back with a foreign key. Instead of cramming everything into one massive document, you split data into separate tables, each storing one type of thing, then link them with keys. Here's the transformation:
Now let's create these tables for real. Copy and paste this into psql, mysql, or sqlite3:
-- Create the four normalized tables
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
state CHAR(2) NOT NULL,
addr TEXT
);
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
category TEXT NOT NULL,
price DECIMAL(10,2) NOT NULL
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL REFERENCES users(id), -- FK: must exist in users
total DECIMAL(10,2) NOT NULL DEFAULT 0,
status TEXT NOT NULL DEFAULT 'pending',
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INT NOT NULL REFERENCES orders(id), -- FK: must exist in orders
product_id INT NOT NULL REFERENCES products(id), -- FK: must exist in products
quantity INT NOT NULL DEFAULT 1,
price DECIMAL(10,2) NOT NULL -- price at time of purchase
);
-- Insert sample data
INSERT INTO users (name, email, state, addr) VALUES
('Alice', 'alice@example.com', 'CA', '123 Market St, SF'),
('Bob', 'bob@example.com', 'NY', '456 Broadway, NYC'),
('Carol', 'carol@example.com', 'CA', '789 Sunset Blvd, LA');
INSERT INTO products (name, category, price) VALUES
('Mechanical Keyboard', 'Electronics', 149.99),
('USB-C Cable', 'Accessories', 12.99),
('Monitor Stand', 'Furniture', 89.50),
('Webcam HD', 'Electronics', 79.00);
INSERT INTO orders (user_id, total, status) VALUES
(1, 162.98, 'paid'), -- Alice's order
(3, 89.50, 'paid'), -- Carol's order
(1, 79.00, 'pending'); -- Alice's second order
INSERT INTO order_items (order_id, product_id, quantity, price) VALUES
(1, 1, 1, 149.99), -- Alice: Keyboard
(1, 2, 1, 12.99), -- Alice: USB-C Cable
(2, 3, 1, 89.50), -- Carol: Monitor Stand
(3, 4, 1, 79.00); -- Alice: Webcam
Now try to break it. Try inserting an order for a user that doesn't exist:
-- Try to create an order for user_id 999 (doesn't exist):
INSERT INTO orders (user_id, total, status) VALUES (999, 50.00, 'paid');
-- ERROR: insert or update on table "orders" violates foreign key constraint
-- DETAIL: Key (user_id)=(999) is not present in table "users".
-- Try to delete Alice while she has orders:
DELETE FROM users WHERE id = 1;
-- ERROR: update or delete on table "users" violates foreign key constraint
-- DETAIL: Key (id)=(1) is still referenced from table "orders".
-- The database REFUSES to let you create orphaned data.
-- This is Codd's gift: the database guards your data's integrity.
-- No amount of "careful coding" in your application gives you this guarantee.
That's referential integrity: the guarantee that every foreign key value in one table actually points to an existing row in the referenced table, enforced by the database engine itself — not by your application code. The database doesn't trust your application. It doesn't trust you. It enforces the rules at the storage layer, where they can't be bypassed. Your application can have bugs, your ORM can have bugs, someone can connect directly with psql and run raw SQL — and the constraints still hold. You'd have to explicitly drop the constraint to break them.
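You can reproduce the same refusal in SQLite from Python, with one gotcha worth knowing: SQLite ships with foreign-key enforcement OFF for backwards compatibility, so you must switch it on per connection. A minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in!
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INT NOT NULL REFERENCES users(id),
    total REAL NOT NULL)""")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (101, 1, 149.99)")  # fine: user 1 exists

# Try to create an order for a user that doesn't exist:
try:
    conn.execute("INSERT INTO orders VALUES (102, 999, 50.00)")
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed

# Try to delete Alice while she still has an order:
try:
    conn.execute("DELETE FROM users WHERE id = 1")
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```

Same guarantee, different engine: the orphan never gets created, no matter how buggy the calling code is.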
The Magic Query
Remember your boss's question from Section 3? "All orders from California users who spent over $100 in the last 30 days, grouped by product category." Here it is, running against your four normalized tables:
-- "Show me California spending by category, over $100, last 30 days"
SELECT
p.category,
SUM(oi.price * oi.quantity) AS total_spent,
COUNT(DISTINCT o.id) AS order_count
FROM users u
JOIN orders o ON u.id = o.user_id -- link users → orders
JOIN order_items oi ON o.id = oi.order_id -- link orders → items
JOIN products p ON oi.product_id = p.id -- link items → products
WHERE u.state = 'CA'
AND o.created_at >= NOW() - INTERVAL '30 days'
AND o.status = 'paid'
GROUP BY p.category
HAVING SUM(oi.price * oi.quantity) > 100
ORDER BY total_spent DESC;
-- Read it out loud: "From users in California, join their paid orders
-- from the last 30 days, join the items, join the products, group by
-- category, only show categories where total > $100, sort by total."
--
-- You just described WHAT you want. The database figures out HOW.
-- It decides: use the index on users(state)? Hash join or nested loop?
-- Sequential scan or bitmap index scan? You don't care. It's optimal.
This is the core breakthrough of the relational model. You describe what you want; the database's query planner figures out how. The planner takes your SQL and weighs which indexes exist, how many rows each table has, what the data distribution looks like (are most users in CA or NY?), how much memory is available, and dozens of other factors, then generates an execution plan — a step-by-step recipe. (You can see the plan yourself: put EXPLAIN ANALYZE before any query. PostgreSQL's planner is one of the most sophisticated pieces of software ever written.) This style is called declarative querying: you declare the result you want — "give me all users in California" — without specifying the steps, unlike imperative code ("open file, read line by line, check if the state field equals CA, add to results"). It's the reason SQL has survived for 50 years. The execution engine gets better with every version of PostgreSQL, MySQL, and SQL Server — and your queries automatically get faster without you changing a single line.
Let's see the planner in action. Run the same query with EXPLAIN ANALYZE in front:
EXPLAIN ANALYZE
SELECT p.category, SUM(oi.price * oi.quantity) AS total_spent
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.state = 'CA'
GROUP BY p.category;
-- Output (simplified):
-- Hash Join (cost=2.15..3.28 rows=2) (actual time=0.089..0.093 rows=2)
-- -> Seq Scan on users (cost=0..1.04 rows=2) (actual time=0.008..0.010)
-- Filter: (state = 'CA')
-- -> Hash Join (cost=1.08..2.15 rows=4) (actual time=0.043..0.048)
-- -> Seq Scan on order_items
-- -> Hash (cost=1.03..1.03 rows=3)
-- -> Seq Scan on orders
-- Planning Time: 0.285 ms
-- Execution Time: 0.127 ms
--
-- 0.127 milliseconds. The database read 4 tables, joined them,
-- filtered by state, grouped by category — in 127 MICROSECONDS.
-- Your JSON file took 3 seconds for a simpler query.
Edgar Codd didn't just invent a way to store data. He invented a way to think about data. Separate the logical structure (tables, rows, keys) from the physical storage (files, indexes, memory). Describe what you want, not how to get it. Enforce rules at the data layer, not the application layer. Fifty-four years later, every major technology company on Earth still uses his model. Not because they're stuck in the past — because nobody has found anything better for structured, relational data.
Once you have data in PostgreSQL, you can see the health of every table with a single query: SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables; This shows: table name, live rows, dead rows (from updates/deletes), and when the last cleanup ran. This is how real DBAs know when a table is "sick" (too many dead tuples = bloat = slow queries). You now know something most junior developers don't.
How It Works — ACID Deep Dive
Alice has $500 in her account. She transfers $200 to Bob. The system deducts $200 from Alice's balance, but crashes before adding $200 to Bob's balance. Now $200 has vanished into thin air. How would a database prevent this — making sure BOTH changes happen or NEITHER does?
Section 1 mentioned ACID. Now let's prove each property with commands you can run in a real PostgreSQL terminal. These aren't theoretical — every demo below works on a fresh psql session. If you set up the tables from Section 5, you're ready to go.
Let's dive into each one. Every card below has commands you can paste into psql and see for yourself.
Atomicity — All or Nothing
Imagine you're transferring $500 from Account A to Account B. Two things must happen: debit A and credit B. If the system crashes after the debit but before the credit, the $500 just vanished into thin air. That's not a bug — that's a lawsuit.
Atomicity means: either both operations happen, or neither does. There is no in-between. The database treats the entire transaction as one indivisible unit — like an atom (before we split those, anyway).
-- Setup: create two accounts
CREATE TABLE accounts (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
balance DECIMAL(10,2) NOT NULL CHECK (balance >= 0)
);
INSERT INTO accounts (name, balance) VALUES ('Alice', 1000.00), ('Bob', 500.00);
-- Transfer $500 from Alice to Bob — ATOMICALLY
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 1; -- Alice: 1000 → 500
UPDATE accounts SET balance = balance + 500 WHERE id = 2; -- Bob: 500 → 1000
COMMIT;
SELECT * FROM accounts;
-- id | name | balance
-- 1 | Alice | 500.00
-- 2 | Bob | 1000.00
-- Both updates happened. Money is conserved. Everyone's happy.
Now the crash test. Start a transaction, do the debit, then kill the connection (close the terminal, pull the plug, whatever):
-- Reset balances for the test
UPDATE accounts SET balance = 1000 WHERE id = 1;
UPDATE accounts SET balance = 500 WHERE id = 2;
-- Start the transfer but DON'T COMMIT
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
-- Alice's balance is 500 INSIDE this transaction...
-- NOW: close the terminal. Kill the process. Ctrl+C. Whatever.
-- Open a NEW terminal, reconnect to psql, and check:
SELECT * FROM accounts;
-- id | name | balance
-- 1 | Alice | 1000.00 ← UNCHANGED! The debit was rolled back.
-- 2 | Bob | 500.00 ← UNCHANGED!
-- The WAL (Write-Ahead Log) protected you. No money vanished.
That's the Write-Ahead Log (WAL) in action. Before any change is applied to the actual data files, PostgreSQL writes a record of the change to a sequential log file on disk. If the system crashes, PostgreSQL replays the WAL on restart: committed transactions are reapplied (redo), uncommitted transactions are discarded (undo). That's why the uncommitted debit vanished — the WAL knows it was never committed, so recovery ignores it. The WAL is the backbone of both atomicity and durability.
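You can simulate the crash test from a single Python script with SQLite, whose journal plays the same role as PostgreSQL's WAL here. Closing the connection with the transaction still open stands in for pulling the plug:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bank.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 1000.0), (2, 500.0)])
conn.commit()

# Start the transfer but never commit — then "crash".
conn.execute("UPDATE accounts SET balance = balance - 500 WHERE id = 1")
conn.close()  # the connection dies with the transaction still open

# "Reboot": reconnect and check the balances.
conn = sqlite3.connect(path)
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(1000.0,), (500.0,)] — the half-done debit was rolled back; no money vanished
```

Same outcome as the psql demo: the journal knows the debit was never committed, so recovery discards it.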
One more trick: savepointsA bookmark within a transaction that lets you roll back to that specific point without aborting the entire transaction. Useful when you're doing multiple operations and one might fail (e.g., inserting into a table with a UNIQUE constraint). Instead of rolling back ALL the work, you roll back to the savepoint and try an alternative. Syntax: SAVEPOINT sp1; then ROLLBACK TO sp1; to undo everything after the savepoint. let you do partial rollbacks within a transaction:
BEGIN;
INSERT INTO users (name, email, state) VALUES ('Dave', 'dave@x.com', 'TX');
SAVEPOINT sp1;
-- Try something risky:
INSERT INTO users (name, email, state) VALUES ('Eve', 'dave@x.com', 'CA');
-- ERROR: duplicate key value violates unique constraint on email
ROLLBACK TO sp1; -- Undo only Eve's insert. Dave's insert is still alive.
INSERT INTO users (name, email, state) VALUES ('Eve', 'eve@x.com', 'CA'); -- Fixed!
COMMIT;
-- Both Dave and Eve are inserted. The failed attempt was surgically removed.
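SQLite speaks the same SAVEPOINT dialect, so you can watch the partial rollback from Python too. A sketch using the same made-up names and emails as above:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
conn.execute("CREATE TABLE users (name TEXT, email TEXT UNIQUE, state TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO users VALUES ('Dave', 'dave@x.com', 'TX')")
conn.execute("SAVEPOINT sp1")
try:
    # Risky insert: duplicate email violates the UNIQUE constraint.
    conn.execute("INSERT INTO users VALUES ('Eve', 'dave@x.com', 'CA')")
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK TO sp1")  # undo only Eve's failed insert
conn.execute("INSERT INTO users VALUES ('Eve', 'eve@x.com', 'CA')")
conn.execute("COMMIT")

names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY name")]
print(names)  # ['Dave', 'Eve']
```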
A common production disaster: a transaction starts but never commits or rolls back. It sits there holding locks, blocking other queries, and preventing the database from cleaning up dead rows. Find these zombies with: SELECT pid, state, xact_start, query FROM pg_stat_activity WHERE state = 'idle in transaction'; If xact_start (when the transaction began) is hours old, that's your problem. Kill it: SELECT pg_terminate_backend(pid);
Consistency — Rules That Can't Be Broken
Consistency means the database enforces rules that your data must always obey. Not "eventually" — always. Not "if the app remembers to check" — unconditionally. These rules are baked into the schema, and no query, no ORM, no admin can violate them without explicitly dropping the constraint.
There are four types of constraints, and you should know all of them:
-- 1. FOREIGN KEY: references must be valid
ALTER TABLE orders ADD CONSTRAINT fk_user
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;
-- ON DELETE CASCADE: if you delete a user, their orders are auto-deleted too.
-- Alternative: ON DELETE RESTRICT (refuse to delete user if orders exist)
INSERT INTO orders (user_id, total) VALUES (99999, 50.00);
-- ERROR: Key (user_id)=(99999) is not present in table "users".
-- The database says: "That user doesn't exist. I won't let you lie."
-- 2. CHECK: custom rules on column values
ALTER TABLE products ADD CONSTRAINT positive_price CHECK (price > 0);
INSERT INTO products (name, category, price) VALUES ('Free Thing', 'Test', -5.00);
-- ERROR: new row violates check constraint "positive_price"
-- Negative prices? Not in this database.
-- 3. UNIQUE: no duplicates allowed
ALTER TABLE users ADD CONSTRAINT unique_email UNIQUE (email);
INSERT INTO users (name, email, state) VALUES ('Fake Alice', 'alice@example.com', 'TX');
-- ERROR: duplicate key value violates unique constraint "unique_email"
-- Two users with the same email? Nope.
-- 4. NOT NULL: mandatory fields
ALTER TABLE users ALTER COLUMN name SET NOT NULL;
INSERT INTO users (name, email, state) VALUES (NULL, 'nobody@x.com', 'CA');
-- ERROR: null value in column "name" violates not-null constraint
-- Every user must have a name. Period.
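All four constraint types are enforceable in the SQLite on your machine as well. A quick Python sketch that attempts each violation and counts the refusals (one caveat: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    total REAL CHECK (total > 0))""")
conn.execute("INSERT INTO users VALUES (1, 'alice@example.com')")

violations = []
for bad_sql in [
    "INSERT INTO orders VALUES (1, 99999, 50.0)",         # FOREIGN KEY: no such user
    "INSERT INTO orders VALUES (1, 1, -5.0)",             # CHECK: negative total
    "INSERT INTO users VALUES (2, 'alice@example.com')",  # UNIQUE: duplicate email
    "INSERT INTO users (id) VALUES (3)",                  # NOT NULL: missing email
]:
    try:
        conn.execute(bad_sql)
    except sqlite3.IntegrityError as e:
        violations.append(str(e))

print(len(violations))  # 4: every bad row was refused
```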
For complex business rules that go beyond simple constraints, you can use triggersA stored procedure that the database runs automatically when a specific event occurs (INSERT, UPDATE, DELETE) on a specific table. Triggers fire INSIDE the same transaction as the triggering statement, so they're atomic. Common use: updating inventory counts when an order is placed, or logging every change to a sensitive table. Be careful — too many triggers make debugging hard because logic is hidden in the database, not in your application code.:
-- When an order item is inserted, automatically reduce product stock
CREATE OR REPLACE FUNCTION reduce_stock()
RETURNS TRIGGER AS $$
BEGIN
UPDATE products SET stock = stock - NEW.quantity
WHERE id = NEW.product_id;
-- Prevent overselling
IF (SELECT stock FROM products WHERE id = NEW.product_id) < 0 THEN
RAISE EXCEPTION 'Not enough stock for product %', NEW.product_id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_reduce_stock
AFTER INSERT ON order_items
FOR EACH ROW EXECUTE FUNCTION reduce_stock();
-- Now, inserting an order item automatically updates stock.
-- No application code needed. The database enforces it.
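SQLite triggers use RAISE(ABORT, ...) instead of plpgsql's RAISE EXCEPTION, but the effect is the same: the offending statement is rolled back inside the transaction. A sketch of the oversell guard (product and quantities are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, stock INT);
CREATE TABLE order_items (product_id INT, quantity INT);
INSERT INTO products VALUES (1, 'Keyboard', 3);

CREATE TRIGGER trg_reduce_stock AFTER INSERT ON order_items
BEGIN
    UPDATE products SET stock = stock - NEW.quantity WHERE id = NEW.product_id;
    SELECT RAISE(ABORT, 'Not enough stock')
    WHERE (SELECT stock FROM products WHERE id = NEW.product_id) < 0;
END;
""")

conn.execute("INSERT INTO order_items VALUES (1, 2)")  # fine: stock goes 3 -> 1
try:
    conn.execute("INSERT INTO order_items VALUES (1, 5)")  # would drive stock to -4
except sqlite3.IntegrityError as e:
    print("refused:", e)

remaining = conn.execute("SELECT stock FROM products WHERE id = 1").fetchone()[0]
print(remaining)  # 1: the oversell (and its stock update) was rolled back
```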
Isolation — Concurrent Access Without Chaos
You've got 10,000 users hitting your database at the same time. Two of them are trying to buy the same limited-edition item. Three others are updating their addresses while a report query reads those same rows. If everyone's changes were instantly visible to everyone else, you'd get corrupted data, phantom reads, and bugs that only happen on Fridays at 3pm under heavy load (the worst kind of bugs).
Isolation controls how much of one transaction's uncommitted work other transactions can see. The database gives you four levels, from "fast but risky" to "slow but bulletproof."
Here's the killer demo. You need two terminal windows, both connected to the same PostgreSQL database:
-- Make sure we have stock data
CREATE TABLE IF NOT EXISTS inventory (
id SERIAL PRIMARY KEY, name TEXT, stock INT
);
INSERT INTO inventory VALUES (1, 'Limited Hoodie', 10)
ON CONFLICT (id) DO UPDATE SET stock = 10;
-- TERMINAL 1: start a transaction and change the stock — but DON'T COMMIT
BEGIN;
UPDATE inventory SET stock = stock - 1 WHERE id = 1;
-- Stock is 9 inside THIS transaction.
-- But we haven't committed yet.
-- Go to Terminal 2 and run the SELECT there.
-- DON'T close this terminal. Leave the transaction open.
-- TERMINAL 2: while Terminal 1's transaction is still open (not committed)...
SELECT stock FROM inventory WHERE id = 1;
-- What do you see?
-- Under READ COMMITTED (PostgreSQL default): stock = 10
-- → You see the OLD value. Terminal 1's change is invisible.
-- → This is SAFE. You're reading committed data only.
-- Under READ UNCOMMITTED: stock = 9
-- → You see Terminal 1's UNCOMMITTED change. This is a "dirty read."
-- → If Terminal 1 rolls back, you just made a decision based on fake data.
-- → PostgreSQL doesn't even support this level. That's how bad it is.
-- Check YOUR database's isolation level:
SHOW transaction_isolation;
-- Probably: "read committed" — the sane default.
-- Set isolation level per-transaction:
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... your queries here ...
COMMIT;
-- The four levels (from weakest to strongest):
--
-- 1. READ UNCOMMITTED — see other transactions' dirty writes
-- PostgreSQL ignores this; treats it as READ COMMITTED.
-- MySQL supports it. Almost never used in production.
--
-- 2. READ COMMITTED — only see committed data (PostgreSQL default)
-- Safe for most apps. Instagram uses this for feeds.
--
-- 3. REPEATABLE READ — snapshot at transaction start
-- Same SELECT returns same results throughout the transaction,
-- even if other transactions commit changes in between.
-- MySQL InnoDB default. Good for reports.
--
-- 4. SERIALIZABLE — transactions execute as if one-at-a-time
-- Slowest but safest. No anomalies possible.
-- Stripe uses this for payment processing.
-- If a conflict is detected, PostgreSQL aborts one transaction
-- and you must retry it.
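If you'd rather not juggle two terminals, you can run the whole experiment in one script: two sqlite3 connections to the same file play the two terminals, and SQLite's WAL mode gives readers a committed snapshot, so there are no dirty reads (the same behavior you get from READ COMMITTED). File name and stock numbers are invented:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shop.db")
writer = sqlite3.connect(path, isolation_level=None)  # plays "Terminal 1"
writer.execute("PRAGMA journal_mode=WAL")             # readers get a committed snapshot
writer.execute("CREATE TABLE inventory (id INTEGER PRIMARY KEY, name TEXT, stock INT)")
writer.execute("INSERT INTO inventory VALUES (1, 'Limited Hoodie', 10)")

reader = sqlite3.connect(path, isolation_level=None)  # plays "Terminal 2"

writer.execute("BEGIN")
writer.execute("UPDATE inventory SET stock = stock - 1 WHERE id = 1")

# The uncommitted decrement is invisible to the other connection: no dirty read.
before = reader.execute("SELECT stock FROM inventory WHERE id = 1").fetchone()[0]
writer.execute("COMMIT")
after = reader.execute("SELECT stock FROM inventory WHERE id = 1").fetchone()[0]
print(before, after)  # 10 9
```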
If SERIALIZABLE prevents all anomalies, why doesn't everyone just use it? Why is READ COMMITTED the default?
Because SERIALIZABLE is slower. The database has to track every row you read and every row you write, and if two transactions conflict, one gets aborted and must be retried by your application code. For a social media feed (Instagram), losing a "like" count by 1 for a split second is fine — READ COMMITTED is fast enough. For a payment system (Stripe), losing even one cent is unacceptable — SERIALIZABLE is worth the performance cost. Pick the level that matches your tolerance for wrong answers.
Durability — Survives Power Outages
You committed a transaction. The database said "OK, committed." Then the power goes out. When the server comes back, is your data still there?
With durability: yes, always. Once the database says "committed," that data is on disk, period. Even if the machine catches fire one millisecond after the commit.
The magic behind this is the Write-Ahead Log (WAL)A sequential, append-only file on disk where PostgreSQL writes every change BEFORE applying it to the actual data tables. Because the WAL is sequential (not random I/O), writing to it is fast. On crash recovery, PostgreSQL reads the WAL from the last checkpoint forward: committed changes are replayed (redo), uncommitted changes are discarded (undo). The WAL is also the foundation of replication — replicas read the WAL stream to stay in sync with the primary.:
Every change goes to the WAL first, then to the actual data files later (in the background, when it's convenient). The WAL write is forced to physical disk with fsync() before the database tells your app "committed." That's the guarantee: if you got "COMMIT" back, the WAL has your data on physical disk.
-- Check your WAL configuration
SHOW wal_level;
-- 'replica' (default) — enough for streaming replication
-- 'logical' — needed for logical replication (CDC, Debezium)
-- 'minimal' — fastest but no replication possible
-- See your current position in the WAL
SELECT pg_current_wal_lsn();
-- Returns something like: 0/15A3D40
-- This is the Log Sequence Number — a pointer into the WAL stream.
-- Every committed transaction advances this number.
-- See actual WAL files on disk (run from shell, not psql):
-- ls -la /var/lib/postgresql/14/main/pg_wal/
-- Each file is 16MB by default. They accumulate until checkpointed.
-- The durability vs speed trade-off:
SHOW synchronous_commit;
-- 'on' (default): fsync every commit. Safest. ~600us per commit.
-- 'off': don't wait for fsync. Faster but can lose last ~600ms
-- of commits on crash. Some teams use this for log tables
-- or analytics data where losing a few rows is acceptable.
Setting synchronous_commit = off is NOT the same as disabling durability. It means the database doesn't wait for fsync before telling your app "committed." The data is still written to the WAL — just in OS buffers, not yet on physical disk. If the OS crashes within ~600ms, those buffered commits can be lost. Some teams accept this for non-critical data (session logs, click tracking) because the throughput improvement is dramatic: 2-3x more transactions per second. For financial data, you keep it on. Always.
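The same knob exists in SQLite as PRAGMA synchronous, which makes the trade-off easy to feel on your own machine. A rough benchmark sketch (timings vary wildly by disk and filesystem; the point is the ratio, not the absolute numbers):

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit: one commit per INSERT
conn.execute("CREATE TABLE log (msg TEXT)")

def bench(mode, n=100):
    conn.execute(f"PRAGMA synchronous={mode}")
    start = time.perf_counter()
    for i in range(n):
        conn.execute("INSERT INTO log VALUES (?)", (f"event {i}",))
    return time.perf_counter() - start

durable = bench("FULL")  # wait for the disk on every commit
fast = bench("OFF")      # let the OS buffer writes: fast, but lossy on a crash
print(f"FULL: {durable:.4f}s  OFF: {fast:.4f}s")
```

On a laptop with a real fsync per commit, FULL is typically much slower, which is exactly the throughput gap teams are buying when they relax synchronous_commit.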
Going Deeper — Indexing
Your users table has 10 million rows. You run SELECT * FROM users WHERE email = 'alice@example.com'. Without any optimization, the database reads all 10 million rows to find one match. That takes ~30 seconds. How would you make this query run in under 5 milliseconds? What data structure would you use behind the scenes?
If ACID is the safety net, indexing is the speed engine. A query on a table with 10 million rows can take 30 seconds without an index and 2 milliseconds with one. The difference between "our site is down" and "our site is fast" is often a single CREATE INDEX statement.
Here's the intuition: imagine a 500-page textbook with no table of contents and no index at the back. To find every mention of "B-tree," you'd read every page. That's a sequential scanThe database reads every single row in the table, one by one, checking each against your WHERE clause. For a 10 million row table, that's 10 million checks. It's the slowest access pattern, but it's always correct — it never misses a row. PostgreSQL falls back to sequential scan when no suitable index exists, when the table is tiny (faster than the index overhead), or when you're selecting most of the rows anyway (a full scan beats 8 million individual index lookups).. Now imagine the textbook has an index: "B-tree: pages 47, 112, 289." You jump straight to those pages. That's a database index.
When you run CREATE INDEX without specifying a type, PostgreSQL creates a B+ treeA self-balancing tree structure where all data pointers live in the leaf nodes, and leaf nodes are linked together in a doubly-linked list. This means: (1) lookups are O(log n) — follow the tree from root to leaf, (2) range scans are efficient — find the start leaf, then follow links forward, (3) the tree stays balanced automatically, so worst-case = average-case. The "B" stands for "balanced" (not "binary"). A B+ tree with fanout 500 and 10 million rows has depth 3 — that's just 3 disk reads to find any row.. It's the workhorse of database indexing — good at equality lookups (WHERE email = 'x'), range queries (WHERE created_at > '2024-01-01'), and sorting (ORDER BY name).
Here's how it works in plain English: imagine a phone book. The root page tells you "A-M is in the left half, N-Z is in the right half." Each half breaks down further: "A-F is in section 1, G-M is in section 2." You follow these signposts until you reach the bottom level (the leaf), which has the actual phone numbers (row pointers). For a table with 10 million rows, a B+ tree with a branching factor of 500 has just 3 levels. That's 3 disk reads to find any row out of 10 million.
-- Create a B+ tree index on email (this is the default type)
CREATE INDEX idx_users_email ON users(email);
-- Verify the query planner uses it:
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- Index Scan using idx_users_email on users
-- Index Cond: (email = 'alice@example.com'::text)
-- Planning Time: 0.085 ms
-- Execution Time: 0.042 ms
-- B+ trees also handle range queries:
EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at > '2024-01-01';
-- Index Scan using idx_orders_created_at (if you create one)
-- The tree finds the starting point, then follows leaf-node links forward.
-- See which indexes exist and how often they're used:
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'users';
-- idx_scan: how many times this index has been used (higher = good investment)
-- If idx_scan is 0, you're paying storage cost for an unused index. Drop it.
-- See index sizes:
SELECT pg_size_pretty(pg_indexes_size('users'));
-- Shows total size of ALL indexes on the users table.
A hash indexAn index that uses a hash function to convert the column value into a bucket number. Lookups are O(1) — compute the hash, jump to the bucket, find the row. But hash indexes can't do range queries (WHERE price > 100), can't do pattern matching (WHERE name LIKE 'A%'), and can't sort results (ORDER BY). They're specialists: blindingly fast at one thing, useless at everything else. In PostgreSQL, hash indexes became crash-safe only in version 10 (2017). Before that, they weren't WAL-logged. is like a coat check: give your coat (value), get a ticket number (hash). Later, hand back the ticket and get your coat instantly. No searching. But you can't say "give me all coats between ticket 50 and 75" — the numbers aren't in order. That's the trade-off: O(1) for exact matches, useless for everything else.
-- Create a hash index
CREATE INDEX idx_hash_email ON users USING hash(email);
-- This helps:
SELECT * FROM users WHERE email = 'alice@example.com';
-- Hash lookup: O(1). Faster than B-tree for exact equality.
-- This does NOT help (hash can't do ranges):
SELECT * FROM users WHERE email LIKE 'a%';
-- Falls back to sequential scan. Hash index is ignored.
-- This does NOT help (hash can't do ordering):
SELECT * FROM users ORDER BY email;
-- Also falls back to sequential scan.
-- Rule of thumb:
-- Use hash for: UUID lookups, session token lookups, exact key matches
-- Use B-tree for: almost everything else
-- In practice, B-tree is so good that hash indexes are rarely needed.
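The coat-check trade-off shows up in Python's own data structures, which makes it a handy mental model: a dict is a tiny hash index, and a sorted list plays the role of B-tree leaves. A purely illustrative sketch:

```python
import bisect

emails = sorted(f"user{i:05d}@example.com" for i in range(1000))
hash_index = {email: row_id for row_id, email in enumerate(emails)}  # value -> "row pointer"

# Exact match: the hash answers in one probe, no searching.
print(hash_index["user00042@example.com"])  # 42

# Range query (everything starting with 'user001'): only the ordered list helps,
# because hash buckets have no notion of "between".
lo = bisect.bisect_left(emails, "user001")
hi = bisect.bisect_left(emails, "user002")
matches = hi - lo
print(matches)  # 100
```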
A composite index covers multiple columns. It's like a phone book sorted by last name, then first name. You can look up "Smith" (last name only) quickly. You can look up "Smith, John" (both columns) even faster. But you can't look up just "John" (first name only) without scanning the whole book — because the book isn't sorted by first name.
This is called the left-prefix ruleA composite index on columns (A, B, C) can be used for queries that filter on: A alone, A+B together, or A+B+C together. But NOT for B alone, C alone, or B+C (without A). The index is sorted by A first, then by B within each A group, then by C within each B group. Without knowing A, the database can't narrow down where B values live in the index. It's exactly like a phone book sorted by (city, last name, first name): you can find all Smiths in Boston, but you can't find all Johns across all cities without scanning everything., and it's the most common indexing mistake in production. Let's see it in action:
-- Create a composite index: user_id first, then created_at
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
-- Uses the index (queries on the LEFT-most column):
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 5;
-- Index Scan using idx_orders_user_date
-- Uses the index (both columns, left-to-right):
EXPLAIN ANALYZE SELECT * FROM orders
WHERE user_id = 5 AND created_at > '2024-01-01';
-- Index Scan using idx_orders_user_date
-- Index Cond: (user_id = 5 AND created_at > '2024-01-01')
-- Does NOT use the index (skipped the left column):
EXPLAIN ANALYZE SELECT * FROM orders
WHERE created_at > '2024-01-01';
-- Seq Scan on orders (index is useless here!)
-- You'd need a separate index on created_at for this query.
-- Pro tip: put the most-filtered column FIRST in the composite index.
-- If 90% of your queries filter by user_id, it should be the left column.
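The left-prefix rule is easy to verify yourself in SQLite with EXPLAIN QUERY PLAN; a composite index there behaves just like its PostgreSQL counterpart (the schema and dates below are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INT, created_at TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(u, f"2024-{m:02d}-01", 9.99) for u in range(50) for m in range(1, 13)])
conn.execute("CREATE INDEX idx_orders_user_date ON orders(user_id, created_at)")

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

left = plan("SELECT * FROM orders WHERE user_id = 5")
both = plan("SELECT * FROM orders WHERE user_id = 5 AND created_at > '2024-06-01'")
skipped = plan("SELECT * FROM orders WHERE created_at > '2024-06-01'")
print(left)     # uses idx_orders_user_date (left-most column)
print(both)     # uses idx_orders_user_date (both columns, left to right)
print(skipped)  # full scan: the left column was skipped, so the index can't help
```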
Indexes aren't free. Every index is a separate data structure that the database maintains alongside your table. Every INSERT, UPDATE, and DELETE on the table must also update every index on that table. More indexes = slower writes.
-- See how much disk space your indexes consume:
SELECT
tablename,
pg_size_pretty(pg_total_relation_size(quote_ident(tablename))) AS total_size,
pg_size_pretty(pg_relation_size(quote_ident(tablename))) AS table_size,
pg_size_pretty(pg_indexes_size(quote_ident(tablename))) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(tablename)) DESC;
-- On a real 10M-row table, you might see:
-- tablename | total_size | table_size | index_size
-- users | 1.8 GB | 1.2 GB | 600 MB
-- 600 MB of indexes! That's RAM they're competing for.
-- Find UNUSED indexes (wasting space + slowing writes):
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE '%pkey'
ORDER BY pg_relation_size(indexrelid) DESC;
-- idx_scan = 0 means the index has NEVER been used since last stats reset.
-- If it's been weeks/months: drop it.
DO index: columns in WHERE clauses, JOIN conditions, ORDER BY, and columns used in GROUP BY. These are columns the query planner searches, sorts, or matches against.
DON'T index: boolean columns (only 2 values — the index doesn't narrow much), columns you rarely filter on, tables with fewer than ~1,000 rows (sequential scan is faster than the index overhead), and columns that change constantly (every update rewrites the index entry).
Monitor: Use pg_stat_user_indexes to find unused indexes. Use EXPLAIN ANALYZE to verify the planner uses your index. If idx_scan = 0 for weeks, that index is dead weight — drop it.
The Variations — Normalization & Denormalization
You have a users table with columns for city, state, and country. "New York" appears in 500,000 rows. If NYC renamed itself (unlikely, but bear with us), you'd update 500,000 rows. A purist would say: make a separate cities table and store just a city_id in users. But now every user query needs a JOIN. Which is worse — the redundancy or the JOIN overhead? When does each choice make sense?
Section 5 showed you normalized tables — "store each fact once." But how far do you take this? And when should you break the rules and store data in more than one place? This section gives you the theory (normal forms), the practice (when to break them), and the tools to see what's happening under the hood (the query planner).
1NF Through BCNF — The Normalization Staircase
Normalization is a process with steps. Each step removes a specific type of redundancy. Think of it like cleaning a messy room — you don't throw everything away at once; you sort by category.
Let's start with the messiest possible table and clean it up step by step, with real SQL at every stage:
-- UNNORMALIZED: one row per order, comma-separated items
CREATE TABLE bad_orders (
order_id INT, customer_name TEXT, customer_state TEXT,
items TEXT, item_prices TEXT
);
INSERT INTO bad_orders VALUES
(1, 'Alice', 'CA', 'Keyboard, Mouse', '149.99, 29.99'),
(2, 'Alice', 'CA', 'Cable', '12.99');
-- Problem: WHERE items LIKE '%Keyboard%' — slow, fragile, wrong.
-- 1NF: Every column holds ONE value (atomic).
CREATE TABLE orders_1nf (
order_id INT, customer_name TEXT, customer_state TEXT,
item_name TEXT, item_price DECIMAL(10,2)
);
INSERT INTO orders_1nf VALUES
(1, 'Alice', 'CA', 'Keyboard', 149.99),
(1, 'Alice', 'CA', 'Mouse', 29.99),
(2, 'Alice', 'CA', 'Cable', 12.99);
-- Each row has one item. But Alice's name/state repeated 3 times.
-- 2NF: Remove partial dependencies.
CREATE TABLE customers_2nf (
customer_id SERIAL PRIMARY KEY, name TEXT NOT NULL, state CHAR(2) NOT NULL
);
CREATE TABLE order_items_2nf (
order_id INT, customer_id INT REFERENCES customers_2nf(customer_id),
item_name TEXT, item_price DECIMAL(10,2)
);
-- Alice's name stored ONCE. Much better.
-- 3NF: Remove transitive dependencies.
CREATE TABLE products_3nf (
product_id SERIAL PRIMARY KEY, name TEXT UNIQUE NOT NULL, price DECIMAL(10,2) NOT NULL
);
CREATE TABLE order_items_3nf (
order_id INT, customer_id INT REFERENCES customers_2nf(customer_id),
product_id INT REFERENCES products_3nf(product_id),
quantity INT DEFAULT 1, price_at_purchase DECIMAL(10,2)
);
-- Product price in ONE place. price_at_purchase is a snapshot.
Computer scientist Bill Kent summarized 3NF with a phrase every DBA knows: "Every non-key column must provide a fact about the key, the whole key, and nothing but the key — so help me Codd." If a column depends on something other than the primary key, it belongs in a different table.
When to Denormalize — Breaking the Rules on Purpose
Normalization is the right default. But sometimes, enforcing 3NF everywhere makes your queries slow. If a dashboard runs a 7-table JOIN every time the page loads, and that page gets 100,000 hits per minute, you have a problem. The query might be "correct" but it's too slow to be useful.
DenormalizationDeliberately duplicating data across tables to avoid expensive JOINs at read time. You're trading write complexity (must update data in multiple places) for read speed (fewer JOINs, less computation per query). The key word is "deliberately" — accidental duplication is a bug. Intentional duplication with a refresh strategy is an optimization. Common tools: materialized views, summary tables, and cached aggregates. means intentionally duplicating data to make reads faster. It's not "wrong" — it's a trade-off. You're trading write complexity for read speed. The trick is doing it knowingly, with a plan to keep the copies in sync.
-- Pre-compute per-user stats with a materialized view
CREATE MATERIALIZED VIEW user_stats AS
SELECT u.id AS user_id, u.name, COUNT(o.id) AS order_count,
COALESCE(SUM(o.total), 0) AS lifetime_value, MAX(o.created_at) AS last_order_at
FROM users u
LEFT JOIN orders o ON u.id = o.user_id AND o.status = 'paid'
GROUP BY u.id, u.name;
-- Dashboard queries the view (instant):
SELECT * FROM user_stats WHERE user_id = 42;
-- No JOIN. No aggregation. Just a simple row lookup.
-- Refresh periodically (e.g., every 5 minutes via cron):
REFRESH MATERIALIZED VIEW CONCURRENTLY user_stats;
-- CONCURRENTLY means the view stays readable during refresh.
Instagram doesn't run SELECT COUNT(*) FROM followers WHERE following_id = 42 every time you view a profile. That query would be catastrophic for accounts with 400 million followers. Instead, they store follower_count directly on the user record and update it with a trigger or background job whenever someone follows/unfollows. That's denormalization. The count might be 1 second stale, but nobody notices — and the page loads in 50ms instead of 30 seconds.
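A minimal version of that counter pattern fits in a few lines of SQLite. Assume a hypothetical follower_count column kept in sync by triggers (this sketches the idea, not Instagram's actual implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, follower_count INT DEFAULT 0);
CREATE TABLE followers (follower_id INT, following_id INT);
INSERT INTO users (id, name) VALUES (42, 'celebrity');

CREATE TRIGGER trg_follow AFTER INSERT ON followers
BEGIN
    UPDATE users SET follower_count = follower_count + 1 WHERE id = NEW.following_id;
END;

CREATE TRIGGER trg_unfollow AFTER DELETE ON followers
BEGIN
    UPDATE users SET follower_count = follower_count - 1 WHERE id = OLD.following_id;
END;
""")

for fan in range(1, 6):
    conn.execute("INSERT INTO followers VALUES (?, 42)", (fan,))
conn.execute("DELETE FROM followers WHERE follower_id = 3")

# The profile page reads one column, never COUNT(*) over the followers table.
count = conn.execute("SELECT follower_count FROM users WHERE id = 42").fetchone()[0]
print(count)  # 4
```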
The Query Planner — How PostgreSQL Thinks
You write SQL. The database figures out how to execute it. The component that does this thinking is the query plannerAlso called the query optimizer. It takes your SQL statement, generates multiple possible execution plans (different JOIN orders, different index choices, different scan methods), estimates the cost of each plan using table statistics (row counts, value distributions, index sizes), and picks the cheapest one. PostgreSQL's planner considers hundreds of factors. You can see its decision with EXPLAIN ANALYZE. A bad plan can make a query 1000x slower than a good one — and the difference is often a missing index or stale statistics. (or query optimizer). Understanding it is the difference between "my query is slow" and "I know exactly why my query is slow and how to fix it."
EXPLAIN ANALYZE
SELECT u.name, o.total FROM users u
JOIN orders o ON u.id = o.user_id WHERE u.state = 'CA';
-- Output (annotated):
-- Hash Join (cost=1.04..2.09 rows=2 width=68)
-- Hash Cond: (o.user_id = u.id)
-- -> Seq Scan on orders o (cost=0..1.03 rows=3)
-- -> Hash (cost=1.03..1.03 rows=2)
-- -> Seq Scan on users u (cost=0..1.03 rows=2)
-- Filter: (state = 'CA')
-- Planning Time: 0.258 ms
-- Execution Time: 0.087 ms
--
-- Reading bottom-up:
-- 1. Scan users, filter state='CA' → 2 rows
-- 2. Build hash table from those rows
-- 3. Scan orders, look up each in hash table
-- 4. Return matches. 0.087ms total.
-- Force PostgreSQL to NOT use sequential scans (for testing only!):
SET enable_seqscan = off;
-- Re-run EXPLAIN ANALYZE — it'll pick an index scan if one exists.
SET enable_seqscan = on; -- RESET after testing
-- Find your slowest queries automatically:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
-- Update table statistics (helps the planner make better decisions):
ANALYZE users;
At Scale — Real Company Databases
Theory is nice, but you want to know: do relational databases actually work at internet scale? Not toy projects — we're talking billions of page views, millions of merchants, hundreds of millions of daily active users. The answer is yes, and here's the proof.
Stack Overflow — 1.3 Billion Pages on 2 SQL Servers
Stack Overflow is the canonical example of "you don't need microservices." As of 2024, the site that serves 1.3 billion page views per month runs on 2 SQL Server instances. Not 200. Not a distributed cluster. Two machines.
Their engineering team has written extensively about this. A few highlights that matter for system design:
- Tags as delimited strings: Instead of a normalized many-to-many table, they store tags as <javascript><react><typescript> directly in the posts table. That's denormalization. It avoids a JOIN on every page load. They use LIKE '%<javascript>%' for tag queries — which sounds terrible, but with proper full-text indexesAn index designed for text searching rather than exact value matching. Instead of matching "does this column equal X?", full-text indexes answer "does this column contain the WORDS X?". SQL Server and PostgreSQL both have built-in full-text search. PostgreSQL uses tsvector and tsquery. These indexes tokenize text, remove stop words, apply stemming (so "running" matches "run"), and build an inverted index (word → list of documents). Much faster than LIKE '%word%' for text-heavy searches., it's fast enough.
- Aggressive caching: Redis sits in front of SQL Server. Most reads hit Redis. SQL Server handles the writes and cache misses.
- Smart indexing: Their team obsesses over EXPLAIN plans. Nick Craver (their infrastructure lead) has blogged about indexing strategies that squeezed 10x more throughput from the same hardware.
The lesson: before you reach for a NoSQL database or a microservice architecture, ask yourself — have you actually pushed your relational database to its limits? Most teams haven't even tried.
Instagram — 5 Engineers, 30M Users, PostgreSQL
When Facebook acquired Instagram in 2012 for $1 billion, the entire engineering team was 5 people. They had 30 million users. Their database? PostgreSQL.
No custom database. No fancy distributed system. Plain PostgreSQL, sharded by user ID. They used pg_partmanA PostgreSQL extension for automatic table partitioning management. Instead of managing partition creation and dropping manually, pg_partman handles it based on time (daily, weekly, monthly) or ID ranges. Instagram uses it to automatically create new partitions for their ever-growing data. for automatic partition management within each shard.
At Meta scale (2B+ monthly active users), they still use PostgreSQL. They've added caching layers (Memcached, TAO), but the source of truth is still relational. The data model — users, photos, likes, comments, follows — is inherently relational. Foreign keys and JOINs are exactly the right tool.
Key takeaway: PostgreSQL with thoughtful sharding, proper indexing, and a caching layer in front can handle billions of users. The database wasn't the bottleneck at 30 million. It wasn't the bottleneck at 1 billion. It still isn't.
Uber — Why They LEFT PostgreSQL (and the Controversy)
In 2016, Uber published a blog post titled "Why Uber Engineering Switched from Postgres to MySQL." It was one of the most debated database articles ever written.
Their argument: PostgreSQL's MVCC implementationMulti-Version Concurrency Control — how PostgreSQL handles concurrent access. When you UPDATE a row, PostgreSQL doesn't overwrite the old row. It marks the old version as dead and creates a new version. This means every UPDATE creates a full new copy of the row, including all indexed columns. For a table with 10 indexes, one UPDATE writes 1 new row + 10 new index entries. This is "write amplification." MySQL/InnoDB handles MVCC differently: it stores row versions in an "undo log" and updates the original row in-place, so indexes don't need updating (unless the indexed column itself changed). caused excessive write amplification. For Uber's write-heavy workload (driver locations updating every few seconds), every GPS update wrote to multiple indexes — creating massive I/O overhead during replication.
MySQL/InnoDB uses a different approach: it updates the row in-place and stores old versions in an undo log. Indexes only need updating if the indexed column itself changed. For Uber's workload, MySQL was significantly more efficient.
Uber's problem was specific to their workload: extremely write-heavy, with frequent updates to non-indexed columns on tables with many indexes. For read-heavy workloads, PostgreSQL's MVCC is actually an advantage (readers never block writers). The lesson isn't "PostgreSQL is bad" — it's "understand your workload and pick the tool that matches."
Shopify — MySQL + Vitess for Millions of Merchants
Shopify supports millions of online stores, each with their own products, orders, and customers. Their database strategy: MySQL, sharded using VitessA database clustering system originally built at YouTube in 2010 to scale their MySQL infrastructure. Vitess sits between your application and MySQL, handling: connection pooling, query routing (which shard has this data?), query rewriting, and online schema changes. It presents itself as a single logical database while distributing data across many MySQL instances. Vitess is now a CNCF graduated project. Slack, GitHub, and Shopify all use it., the sharding proxy that YouTube built.
Each merchant is effectively a shard. Vitess routes queries to the correct MySQL instance based on the merchant ID. The genius is that each merchant's data is small enough to fit on a single MySQL instance. The scale comes from having millions of these small shards.
The pattern: when your data naturally partitions by tenant (user, merchant, organization), relational databases with a sharding layer can handle enormous scale. You don't need to abandon SQL — you just need to distribute it smartly.
Anti-Lesson — Myths That Cost Companies Millions
The internet is full of database advice. Most of it is wrong, and some of it is dangerously wrong. These three myths have caused more production outages, unnecessary rewrites, and wasted engineering months than any actual technical limitation. Let's kill them with evidence.
Myth #1 — "Relational Databases Don't Scale"
This is the single most harmful myth in system design. It's repeated in blog posts, YouTube videos, and even some system design interview prep materials. And it is demonstrably false.
The evidence:
- Facebook runs MySQL at a scale that would crash most "web-scale" NoSQL databases. Their TAO cache layer sits in front, but MySQL is the source of truth for social graph data.
- Instagram serves 2 billion monthly active users with PostgreSQL. They've never switched.
- Stack Overflow handles 1.3 billion page views per month on 2 SQL Server instances. Not 200 — two.
- Shopify supports millions of merchants on MySQL + Vitess. Every Black Friday, they process hundreds of millions of dollars in transactions through relational databases.
- Stripe processes billions of dollars in payments through PostgreSQL with SERIALIZABLE isolation — the strictest and slowest setting.
When someone says "we moved to MongoDB because PostgreSQL couldn't handle the load," the correct follow-up question is: "Did you add indexes? Did you run EXPLAIN ANALYZE? Did you check for N+1 queries?" Nine times out of ten, the answer is no. They didn't outgrow the database — they never learned to use it properly.
Relational databases scale vertically (bigger hardware) further than most people think. A single PostgreSQL instance on modern hardware (64 cores, 512GB RAM, NVMe SSDs) can handle hundreds of thousands of transactions per second. And when you do need horizontal scale, tools like Vitess, Citus, and native partitioning let you shard without abandoning SQL.
Myth #2 — "You Must Always Fully Normalize"
Normalization is a tool, not a religion. Yes, 3NF eliminates redundancy and prevents update anomalies. But taken to an extreme, it creates queries with 7-table JOINs that take 30 seconds to run — and your dashboard users aren't going to wait 30 seconds.
SELECT u.name, a.street, a.city, s.name AS state_name,
c.name AS country_name, o.total, p.name AS product
FROM users u
JOIN addresses a ON u.address_id = a.id
JOIN cities ci ON a.city_id = ci.id
JOIN states s ON ci.state_id = s.id
JOIN countries c ON s.country_id = c.id
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.id = 42;
-- 7 JOINs. For ONE user's order page.
-- At 10M users, this runs in 2-5 seconds. Unacceptable.
-- Store city/state/country as text on the address row.
-- "New York" is not going to rename itself.
SELECT u.name, a.street, a.city, a.state, a.country,
o.total, p.name AS product
FROM users u
JOIN addresses a ON u.address_id = a.id
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.id = 42;
-- 4 JOINs, runs in 5ms. The page loads instantly.
-- The "impure" schema is the better engineering decision.
The rule of thumb: normalize by default. Denormalize when you have evidence (slow queries, EXPLAIN ANALYZE showing expensive JOINs) that normalization is hurting you. Document WHY you denormalized and HOW you keep the copies in sync. Denormalization without a sync strategy is just a bug waiting to happen.
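A sync strategy can be as small as a trigger. Here's a runnable sketch using Python's stdlib sqlite3 (so you can try it anywhere); the schema and trigger name are illustrative, with orders.total_cents kept as a denormalized copy of the line-item sum, in integer cents to avoid float drift:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL DEFAULT 0);
CREATE TABLE order_items (
    id INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    quantity INTEGER NOT NULL,
    unit_price_cents INTEGER NOT NULL
);
-- The documented sync strategy: orders.total_cents is a denormalized
-- copy of SUM(quantity * unit_price_cents); this trigger is the ONE
-- place that keeps it correct.
CREATE TRIGGER order_items_sync_total AFTER INSERT ON order_items
BEGIN
    UPDATE orders
       SET total_cents = total_cents + NEW.quantity * NEW.unit_price_cents
     WHERE id = NEW.order_id;
END;
""")

conn.execute("INSERT INTO orders (id) VALUES (1)")
conn.execute("INSERT INTO order_items (order_id, quantity, unit_price_cents) VALUES (1, 2, 999)")
conn.execute("INSERT INTO order_items (order_id, quantity, unit_price_cents) VALUES (1, 1, 500)")
total = conn.execute("SELECT total_cents FROM orders WHERE id = 1").fetchone()[0]
print(total)  # 2498 (two items at $9.99 plus one at $5.00)
```

The point is that the sync logic lives in exactly one place, next to the schema, instead of being scattered across application code.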
Myth #3 — "With an ORM, You Don't Need to Know SQL"
ORMs (Object-Relational Mappers) like ActiveRecord (Rails), Entity Framework (.NET), SQLAlchemy (Python), and Hibernate (Java) are wonderful tools. They reduce boilerplate, prevent SQL injection, and let you work with objects instead of raw strings. But they have a dark side: they hide the SQL they generate, and that hidden SQL is often terrible.
The most infamous problem is the N+1 queryA performance bug where loading a list of N items triggers 1 query to get the list, plus N additional queries to load a related object for each item. Example: fetching 100 blog posts (1 query), then for each post, fetching its author (100 queries) = 101 queries total. The correct approach: eager loading (JOIN or IN clause) to fetch posts and authors in 1-2 queries. ORMs make N+1 easy to create accidentally because the lazy loading happens transparently.:
# Ruby on Rails — looks innocent:
posts = Post.all # 1 query: SELECT * FROM posts
posts.each do |post|
puts post.author.name # N queries! One for EACH post!
end
# What the ORM actually generated:
# SELECT * FROM posts; -- 1 query
# SELECT * FROM authors WHERE id = 1; -- query 2
# SELECT * FROM authors WHERE id = 2; -- query 3
# ... (for every single post)
# SELECT * FROM authors WHERE id = 100; -- query 101
#
# 100 posts = 101 database round-trips.
# Total: 200-500ms for a page that should take 5ms.
# Fix: eager loading
posts = Post.includes(:author).all
posts.each do |post|
puts post.author.name # No extra queries! Already loaded.
end
# What the ORM now generates:
# SELECT * FROM posts;
# SELECT * FROM authors WHERE id IN (1, 2, 3, ..., 100);
#
# 2 queries instead of 101. 5ms instead of 500ms.
# One word changed: .includes(:author)
The fix is easy if you know the problem exists. That's why you need to know SQL even if you use an ORM. Turn on query logging in development. Watch what your ORM generates. Tools like bullet (Rails), django-debug-toolbar (Django), and MiniProfiler (.NET) can automatically detect N+1 queries and warn you.
In PostgreSQL, enable query logging: SET log_min_duration_statement = 0; (logs every query). In development, this shows you exactly what SQL your ORM generates. In production, set it to something like 1000 (1 second) to log only slow queries. The first time you see the ORM generating 500 queries for one page load, you'll understand why knowing SQL matters.
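If Ruby isn't your language, the same N+1 pattern and its eager-loading fix can be reproduced with Python's stdlib sqlite3 module, counting round-trips by hand (table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO posts VALUES (1, 1, 'A'), (2, 2, 'B'), (3, 1, 'C');
""")

# The N+1 pattern: 1 query for the list, then 1 query PER row
queries = 0
posts = conn.execute("SELECT id, author_id, title FROM posts").fetchall()
queries += 1
for _, author_id, _ in posts:
    conn.execute("SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()
    queries += 1
print(queries)  # 4 round-trips for 3 posts (1 + N)

# Eager loading: batch the lookups with IN, 2 queries total
author_ids = sorted({p[1] for p in posts})
placeholders = ",".join("?" * len(author_ids))
authors = dict(conn.execute(
    f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
    author_ids).fetchall())
print(authors)  # {1: 'Alice', 2: 'Bob'}
```

This is exactly what .includes(:author) does under the hood: it collects the foreign keys and issues one batched IN query.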
Common Mistakes — The Six Sins of Production SQL
Every one of these has taken down a production system. They're easy to make, easy to miss in code review, and absolutely brutal at 3 AM when your pager goes off. The good news: they're all fixable in minutes once you know what to look for.
Sin #1 — Foreign Keys Without Indexes
Here's a mistake that hides perfectly until your table hits ~100K rows: you create a foreign key from orders.user_id to users.id, and PostgreSQL happily enforces the relationship. But it does not create an index on orders.user_id automatically. That means every time you run SELECT * FROM orders WHERE user_id = 42, the database reads every single row in the orders table to find user 42's orders. With 10 rows, nobody notices. With 10 million rows, your query takes 8 seconds.
The way to prove this is with EXPLAIN — the database's own confession of how it plans to run your query:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Output:
-- Seq Scan on orders (cost=0.00..18904.00 rows=52 width=68)
-- (actual time=0.021..127.483 rows=48 loops=1)
-- Filter: (user_id = 42)
-- Rows Removed by Filter: 999952
-- Planning Time: 0.085 ms
-- Execution Time: 127.551 ms
--
-- "Seq Scan" = sequential scan = reading EVERY row.
-- "Rows Removed by Filter: 999952" = it read ~1M rows to find 48.
-- That's like reading an entire phone book to find one person.
CREATE INDEX idx_orders_user_id ON orders(user_id);
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
-- Output:
-- Index Scan using idx_orders_user_id on orders
-- (cost=0.42..8.61 rows=52 width=68)
-- (actual time=0.028..0.043 rows=48 loops=1)
-- Index Cond: (user_id = 42)
-- Planning Time: 0.112 ms
-- Execution Time: 0.059 ms
--
-- "Index Scan" = jumped straight to the right rows.
-- 127ms → 0.059ms. That's over 2,000× faster.
-- One CREATE INDEX statement. That's all it took.
Run EXPLAIN on every query that filters by a foreign key. If you see Seq Scan on a table with more than a few thousand rows, you almost certainly need an index. In PostgreSQL, list all indexes with \di and compare against your foreign key columns.
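You can watch the same Seq Scan to Index Scan flip in SQLite, whose EXPLAIN QUERY PLAN plays the role of PostgreSQL's EXPLAIN. A minimal sketch with an illustrative orders table (exact plan wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany("INSERT INTO orders (user_id) VALUES (?)",
                 [(i % 1000,) for i in range(10_000)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

query = "SELECT * FROM orders WHERE user_id = 42"
before = plan(query)   # full table scan ("SCAN orders"; older versions say "SCAN TABLE orders")
conn.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")
after = plan(query)    # now it searches via idx_orders_user_id
print(before)
print(after)
```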
Sin #2 — SELECT * in Production Code
It's so convenient. SELECT * FROM users WHERE id = 42 — just give me everything. The problem? "Everything" might be 20 columns including a 50KB bio text field, a profile_photo BYTEA, and 8 columns you never display. You're hauling a shipping container when you needed a backpack.
The cost isn't just network bandwidth — it's disk I/O (reading wider rows from disk), memory (buffering larger result sets), and CPU (serializing unnecessary columns). On a high-traffic endpoint, this adds up fast.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE id = 42;
-- Output:
-- Index Scan using users_pkey on users
-- Buffers: shared hit=4
-- (actual time=0.018..0.020 rows=1 loops=1)
-- Planning Time: 0.071 ms
-- Execution Time: 0.035 ms
--
-- 4 buffer hits — reads 4 pages (32KB) from shared memory.
-- Fetches ALL 20 columns including that 50KB bio field.
EXPLAIN (ANALYZE, BUFFERS) SELECT name, email, created_at
FROM users WHERE id = 42;
-- Output:
-- Index Scan using users_pkey on users
-- Buffers: shared hit=2
-- (actual time=0.015..0.016 rows=1 loops=1)
-- Planning Time: 0.065 ms
-- Execution Time: 0.028 ms
--
-- 2 buffer hits vs 4 — half the I/O.
-- Multiply by 10,000 requests/sec and it matters.
The rule is simple: name the columns you need. It costs you 5 seconds of typing and saves megabytes of wasted I/O per minute in production. The one exception: ad-hoc debugging queries in your terminal, where convenience trumps efficiency.
Sin #3 — Too Many Database Connections
Every time your app connects to PostgreSQL, the database forks a new process. Each process costs about 10MB of RAM. That's fine when you have 20 connections. But a typical web app with 50 server instances, each opening 20 connections, means 1,000 connections — that's 10GB of RAM just for connection overhead, before the database stores a single byte of data.
-- How many connections can your database handle?
SHOW max_connections; -- Default: 100 (raising this is not the fix; pooling is)
-- How many are currently in use?
SELECT count(*) FROM pg_stat_activity;
-- What are they doing?
SELECT pid, state, query, wait_event_type
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
The fix is a connection pooler — a lightweight proxy that sits between your app and the database. Your 1,000 app connections talk to the pooler; the pooler maintains just 50 real database connections and recycles them. PgBouncerA lightweight connection pooler for PostgreSQL. It's a single binary (~2MB) that sits between your app and the database. Your app connects to PgBouncer (which looks exactly like a PostgreSQL server), and PgBouncer manages a small pool of real database connections. Three pooling modes: session (one connection per session — safest), transaction (one per transaction — best performance), statement (one per statement — most restrictive). Most production setups use transaction mode. is the most popular choice:
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432 ; App connects here (not 5432)
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction ; Recycle after each transaction
default_pool_size = 50 ; 50 real DB connections
max_client_conn = 2000 ; Accept up to 2000 app connections
reserve_pool_size = 5 ; Emergency overflow
reserve_pool_timeout = 3 ; Wait 3s before using overflow
With this config, your app sees 2,000 available connections while the database only maintains 50 real ones. That's 500MB of RAM instead of 20GB. In production, this is non-negotiable.
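The core idea behind PgBouncer fits in a few lines: a fixed-size pool of real connections handed out and returned through a queue. A toy sketch in Python (real poolers also handle auth, timeouts, and transaction state; SQLite connections stand in for PostgreSQL ones):

```python
import queue
import sqlite3

class ConnectionPool:
    """Toy pool: a few real connections serve many logical clients."""
    def __init__(self, size):
        self._pool = queue.Queue()
        for _ in range(size):
            # A SQLite connection stands in for a real PostgreSQL connection
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self, timeout=3):
        # Blocks (like PgBouncer queueing clients) until a connection frees up
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)      # 2 "real" connections
c1, c2 = pool.acquire(), pool.acquire()
pool.release(c1)                   # a client finishes; the connection goes back
c3 = pool.acquire()                # the next client reuses it, no reconnect cost
print(c3 is c1)  # True
```

The blocking acquire is the key design choice: excess clients wait in line instead of forcing the database to fork more processes.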
Sin #4 — Ignoring VACUUM and Table Bloat
PostgreSQL uses MVCCMulti-Version Concurrency Control. Instead of locking rows when you UPDATE them, PostgreSQL creates a NEW version of the row and marks the old one as "dead." This means readers never block writers and writers never block readers — they just see different versions. The downside: dead row versions pile up and waste space until VACUUM cleans them. (Multi-Version Concurrency Control) to handle concurrent access. When you UPDATE a row, PostgreSQL doesn't modify it in place — it creates a new version and marks the old one as "dead." Those dead tuples pile up like trash in an apartment nobody cleans. Without VACUUM, your 1GB table can bloat to 10GB, and queries slow to a crawl because the database is scanning through gigabytes of garbage.
-- Check dead tuple count and last vacuum time
SELECT relname,
n_live_tup,
n_dead_tup,
round(n_dead_tup::numeric / GREATEST(n_live_tup, 1) * 100, 1)
AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders'
ORDER BY n_dead_tup DESC;
-- If dead_pct > 20%, you have a problem.
-- If last_autovacuum is NULL, autovacuum has NEVER run on this table.
-- Manual vacuum (reclaims space within the table):
VACUUM ANALYZE orders;
-- Nuclear option — rewrites the entire table (locks it!):
VACUUM FULL orders; -- Use only during maintenance windows
PostgreSQL has autovacuum enabled by default, but it's conservative. High-churn tables (like event logs or session stores) generate dead tuples faster than autovacuum cleans them. For those tables, tune the thresholds:
-- Make autovacuum more aggressive for high-churn tables
ALTER TABLE events SET (
autovacuum_vacuum_threshold = 1000, -- Default: 50
autovacuum_vacuum_scale_factor = 0.01, -- Default: 0.2 (20%)
autovacuum_analyze_threshold = 500,
autovacuum_analyze_scale_factor = 0.005
);
-- Translation: "vacuum this table when 1% of rows are dead
-- (instead of waiting for 20%)."
Sin #5 — Using TEXT for Every Column
It's tempting. TEXT accepts any string, any length. No need to think about limits. But "no limits" means "no validation" — someone can insert a 10MB string into your phone_number column, and the database won't complain. You've traded a one-time schema decision for an infinite supply of data quality bugs.
-- ❌ The lazy way — no validation at all
CREATE TABLE users_bad (
name TEXT, -- Could be 10MB, no limit
email TEXT, -- No format check
country TEXT, -- "US", "us", "United States", "USA" — inconsistent
phone TEXT -- Could store "banana" and nobody would know
);
-- ✅ The intentional way — the schema IS the documentation
CREATE TABLE users_good (
name VARCHAR(100), -- Names rarely exceed 100 chars
email VARCHAR(255), -- RFC 5321 max is 254 characters
country CHAR(2), -- ISO 3166-1 alpha-2: "US", "GB", "IN"
phone VARCHAR(20), -- E.164 max is 15 digits + prefix
bio TEXT -- Actually variable-length: TEXT is right here
);
-- Rule of thumb:
-- CHAR(n) → fixed-length codes (country, state, currency)
-- VARCHAR(n) → bounded strings (name, email, phone)
-- TEXT → truly unbounded content (comments, articles, bios)
In PostgreSQL specifically, VARCHAR(n) and TEXT have the same underlying storage and performance. The difference is validation: VARCHAR(255) rejects inserts longer than 255 characters at the database level, acting as a safety net even if your application code has a bug. Use TEXT when the content is genuinely unbounded (blog posts, comments). Use VARCHAR(n) when you know a reasonable upper bound.
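One caveat if you're following along in SQLite: it ignores the n in VARCHAR(n) entirely, so you need an explicit CHECK to get the same safety net. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores VARCHAR(n) length limits, so enforce bounds with CHECK
conn.execute("""
CREATE TABLE users (
    name    TEXT CHECK (length(name) <= 100),
    email   TEXT CHECK (length(email) <= 255),
    country TEXT CHECK (length(country) = 2)   -- ISO 3166-1 alpha-2
)""")

conn.execute("INSERT INTO users VALUES ('Alice', 'alice@example.com', 'US')")  # passes

rejected = False
try:
    # 101-character name: the database itself refuses the bad data
    conn.execute("INSERT INTO users VALUES (?, 'x@y.z', 'US')", ("A" * 101,))
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```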
Sin #6 — Never Running EXPLAIN
You write a query. It returns the right data. You ship it. Three weeks later, someone inserts a million rows and your "fast" query is now taking 14 seconds. The query didn't change — the data did. And you never checked how the database actually executes the query.
EXPLAIN ANALYZE is the single most important debugging tool in SQL. It tells you exactly what the database does: which indexes it uses (or ignores), how many rows it scans, how long each step takes. Run it on every query before it goes to production.
-- Step 1: Check the plan BEFORE deploying
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 20;
-- Step 2: In production, auto-log slow query plans
-- postgresql.conf:
-- shared_preload_libraries = 'auto_explain'
-- auto_explain.log_min_duration = '1s' -- Log plans for 1s+ queries
-- auto_explain.log_analyze = true -- Include actual timing
-- auto_explain.log_buffers = true -- Include I/O stats
-- Step 3: Check pg_stat_statements for repeat offenders
SELECT query,
calls,
mean_exec_time::int AS avg_ms,
total_exec_time::int AS total_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Enable auto_explain in production with a threshold of 1–5 seconds. It automatically logs the execution plan of any query that exceeds the threshold — no code changes needed. Combined with pg_stat_statements, you get a complete picture of which queries are burning your CPU and why. This is how companies like Stripe and Instagram catch performance regressions before users notice.
Interview Playbook — E-Commerce Database Design
"Design the database schema for an e-commerce platform that supports users, products, orders, and payments."
This is one of the most common system design interview questions, and it's a perfect test of how deeply you understand relational databases. The same schema can be answered at three very different levels of depth. Let's walk through all three.
Junior Level
What the interviewer expects: Correct tables, primary keys, foreign keys, basic data types. Show you understand relational modeling fundamentals.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
description TEXT,
price DECIMAL(10, 2) NOT NULL CHECK (price > 0),
stock INTEGER NOT NULL DEFAULT 0 CHECK (stock >= 0)
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users(id),
status VARCHAR(20) DEFAULT 'pending',
total DECIMAL(10, 2) NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INTEGER NOT NULL REFERENCES orders(id),
product_id INTEGER NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
unit_price DECIMAL(10, 2) NOT NULL -- Price at time of order
);
Key points to mention: unit_price in order_items is stored separately from products.price because product prices change — you need the price at the time of purchase. The CHECK constraints prevent negative prices and zero-quantity orders at the database level. Foreign keys enforce that every order belongs to a real user and every line item references a real product.
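The price-snapshot point is easy to verify for yourself. A runnable sketch with Python's sqlite3 (prices in integer cents, column names illustrative) showing that the order line stays frozen when the product price changes later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price_cents INTEGER);
CREATE TABLE order_items (
    id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER,
    unit_price_cents INTEGER   -- snapshot of the price at purchase time
);
INSERT INTO products VALUES (42, 'Widget', 999);
""")

# At checkout, copy the CURRENT price into the order line
conn.execute("""
INSERT INTO order_items (product_id, quantity, unit_price_cents)
SELECT id, 1, price_cents FROM products WHERE id = 42
""")

# The product's price changes later...
conn.execute("UPDATE products SET price_cents = 1299 WHERE id = 42")

# ...but the order still shows what the customer actually paid
paid = conn.execute(
    "SELECT unit_price_cents FROM order_items WHERE product_id = 42"
).fetchone()[0]
print(paid)  # 999, not 1299
```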
Mid-Level
What the interviewer expects: Everything from Junior, plus indexes, normalization awareness, connection management, and performance thinking.
-- Indexes on every foreign key + common query patterns
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
CREATE INDEX idx_order_items_product_id ON order_items(product_id);
-- Composite index for "user's recent orders" (most common dashboard query)
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC);
-- Partial index: only index active products (saves space)
CREATE INDEX idx_products_active
ON products(name) WHERE stock > 0;
-- Add an addresses table (normalized from users)
CREATE TABLE addresses (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users(id),
type VARCHAR(10) CHECK (type IN ('billing', 'shipping')),
street VARCHAR(255) NOT NULL,
city VARCHAR(100) NOT NULL,
state CHAR(2),
zip VARCHAR(10),
country CHAR(2) NOT NULL DEFAULT 'US'
);
CREATE INDEX idx_addresses_user ON addresses(user_id);
Talking points at this level:
- Normalization trade-off: Addresses are normalized into their own table (a user can have multiple addresses). But we deliberately keep unit_price denormalized in order_items because we need historical accuracy.
- Connection pooling: "In production, I'd put PgBouncer in front of PostgreSQL in transaction mode with a pool of ~50 connections, even if we have 500 app instances."
- Composite index reasoning: The (user_id, created_at DESC) index serves the most common query pattern: "show me this user's recent orders." Without it, the database fetches all orders for a user and sorts them.
Senior Level
What the interviewer expects: Everything above, plus scaling strategy, isolation levels, replication, and materialized views. Think in systems, not just tables.
-- 1. ISOLATION LEVELS for payments (the money path)
-- Use SERIALIZABLE for payment transactions to prevent double-spend
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
UPDATE products SET stock = stock - 1 WHERE id = 42 AND stock > 0;
INSERT INTO orders (...) VALUES (...);
INSERT INTO payments (order_id, amount, status) VALUES (...);
COMMIT;
-- If two users buy the last item simultaneously, one gets
-- a serialization failure → retry. No overselling.
-- 2. READ REPLICAS for dashboards (no impact on write path)
-- Point analytics/dashboard queries to read replicas:
-- primary: handles writes (orders, payments)
-- replica1, replica2: handle reads (product catalog, search, dashboards)
-- 3. MATERIALIZED VIEWS for expensive dashboard aggregates
CREATE MATERIALIZED VIEW dashboard_stats AS
SELECT DATE_TRUNC('day', o.created_at) AS day,
COUNT(DISTINCT o.id) AS orders,
SUM(o.total) AS revenue,
COUNT(DISTINCT o.user_id) AS unique_buyers
FROM orders o
WHERE o.status = 'completed'
GROUP BY 1;
-- Refresh every 5 minutes (not real-time, but dashboards don't need it).
-- Note: CONCURRENTLY requires a UNIQUE index on the view (e.g. on day).
REFRESH MATERIALIZED VIEW CONCURRENTLY dashboard_stats;
-- 4. SHARDING STRATEGY (when single-node isn't enough)
-- Shard key: user_id (keeps all of a user's data on one shard)
-- Tool: Citus extension or Vitess
-- Avoid sharding until you NEED it — vertical scaling + read
-- replicas + connection pooling handles more than most people think.
The differentiators at this level:
- Isolation level choice: Use SERIALIZABLE on the payment path (correctness > speed), but READ COMMITTED (PostgreSQL's default) everywhere else for better throughput. Explain the why behind each choice.
- EXPLAIN plan analysis: "Before deploying any query to production, I'd run EXPLAIN ANALYZE against a staging database with production-volume data. I'd set up auto_explain to catch regressions."
- Read replica strategy: Writes go to the primary. All read-heavy queries (catalog browsing, search, dashboards) go to replicas. This lets you scale reads independently.
- Materialized views: Instead of running expensive aggregation queries live, pre-compute them and refresh periodically. Dashboards don't need real-time data — 5-minute staleness is fine.
- Sharding as last resort: "I'd exhaust vertical scaling, read replicas, caching, and query optimization before considering sharding. Sharding adds enormous operational complexity."
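That "serialization failure → retry" step is application code, and it's worth sketching. The SerializationFailure class and run_transaction helper below are illustrative stand-ins (a real driver such as psycopg raises its own serialization-failure error, SQLSTATE 40001); the demo fakes two conflicts before succeeding:

```python
import random
import time

class SerializationFailure(Exception):
    """Illustrative stand-in for a driver's serialization error (SQLSTATE 40001)."""

def run_transaction(txn_fn, max_retries=5):
    # Re-run the WHOLE transaction on serialization failure, with a small
    # randomized backoff so retrying transactions don't collide again.
    for attempt in range(max_retries):
        try:
            return txn_fn()
        except SerializationFailure:
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))
    raise RuntimeError(f"gave up after {max_retries} retries")

# Demo: simulate two conflicts, then success on the third attempt
attempts = {"n": 0}
def buy_last_item():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "order placed"

result = run_transaction(buy_last_item)
print(result)  # order placed
```

The crucial detail: you retry the entire transaction function, never just the failed statement, because the database rolled back everything the transaction did.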
Practice Exercises
Reading about databases is one thing. Running commands and watching what happens is where the learning actually sticks. Start with Exercise 1 — you can do it right now with nothing but sqlite3.
Exercise 1 — Blog Schema in SQLite
Create a SQLite database with three tables: users (id, name, email), posts (id, user_id, title, body, created_at), and comments (id, post_id, user_id, body, created_at). Insert at least 3 users, 5 posts, and 10 comments. Then write a JOIN query that shows each post's title, author name, and comment count.
- Use sqlite3 blog.db to create the database file.
- For foreign keys to work in SQLite, you need: PRAGMA foreign_keys = ON;
- The JOIN query needs LEFT JOIN (a post with 0 comments should still appear), GROUP BY, and COUNT().
PRAGMA foreign_keys = ON;
CREATE TABLE users (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL
);
CREATE TABLE posts (
id INTEGER PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users(id),
title TEXT NOT NULL,
body TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE comments (
id INTEGER PRIMARY KEY,
post_id INTEGER NOT NULL REFERENCES posts(id),
user_id INTEGER NOT NULL REFERENCES users(id),
body TEXT NOT NULL,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO users VALUES (1, 'Alice', 'alice@example.com');
INSERT INTO users VALUES (2, 'Bob', 'bob@example.com');
INSERT INTO users VALUES (3, 'Charlie', 'charlie@example.com');
INSERT INTO posts (id, user_id, title, body)
VALUES (1, 1, 'SQL is not scary', 'Here is why...'),
(2, 2, 'My first JOIN', 'I finally get it...'),
(3, 1, 'Indexes changed my life', 'Let me explain...'),
(4, 3, 'Why I switched from MongoDB', 'It started when...'),
(5, 2, 'ACID in plain English', 'Imagine a bank...');
INSERT INTO comments (post_id, user_id, body)
VALUES (1,2,'Great post!'),(1,3,'Thanks for this'),
(2,1,'Nice work'),(2,3,'Same here'),
(3,2,'So true'),(3,3,'Game changer'),
(3,1,'Updated my app today'),
(4,1,'Bold move!'),(4,2,'How did it go?'),
(5,3,'Best explanation I have seen');
-- The JOIN query:
SELECT p.title,
u.name AS author,
COUNT(c.id) AS comment_count
FROM posts p
JOIN users u ON p.user_id = u.id
LEFT JOIN comments c ON c.post_id = p.id
GROUP BY p.id
ORDER BY comment_count DESC;
Exercise 2 — Watch an Index Work
Using PostgreSQL, create a products table and insert 100,000 rows (use generate_series). Run EXPLAIN ANALYZE on a query filtering by price — note the Seq Scan and timing. Then create a B-tree index on price and run the same EXPLAIN. Compare the output: scan type, row count, and execution time.
- Generate the rows: INSERT INTO products (name, price) SELECT 'Product ' || i, (random() * 1000)::numeric(10,2) FROM generate_series(1, 100000) AS s(i);
- Query to test: SELECT * FROM products WHERE price BETWEEN 50.00 AND 51.00;
- Look for "Seq Scan" vs "Index Scan" or "Bitmap Index Scan" in the output.
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10, 2)
);
INSERT INTO products (name, price)
SELECT 'Product ' || i, (random() * 1000)::numeric(10,2)
FROM generate_series(1, 100000) AS s(i);
-- BEFORE index:
EXPLAIN ANALYZE SELECT * FROM products
WHERE price BETWEEN 50.00 AND 51.00;
-- → Seq Scan, ~25ms, reads all 100K rows
-- Add the index:
CREATE INDEX idx_products_price ON products(price);
-- AFTER index:
EXPLAIN ANALYZE SELECT * FROM products
WHERE price BETWEEN 50.00 AND 51.00;
-- → Index Scan (or Bitmap Index Scan), ~0.3ms
-- The speedup should be 50-100×.
Exercise 3 — Isolation Levels in Two Terminals
Open two psql terminals. In Terminal A, set isolation level to READ UNCOMMITTED and begin a transaction. In Terminal B, insert a row but do NOT commit. Can Terminal A see Terminal B's uncommitted row? (Spoiler: in PostgreSQL, no — figure out why.) Then try the same demo with the READ COMMITTED default and observe the difference when Terminal B commits mid-query.
- PostgreSQL treats READ UNCOMMITTED as READ COMMITTED — it refuses to do dirty reads. This is a deliberate design choice.
- To see an actual dirty read, you'd need MySQL with SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
- For the READ COMMITTED demo: Terminal A runs a long query, Terminal B commits halfway through — Terminal A may see different data if it runs the same SELECT again within the same transaction (non-repeatable read).
-- === TERMINAL A ===
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT * FROM accounts WHERE id = 1; -- balance = 1000
-- (pause — switch to Terminal B)
-- After Terminal B commits:
SELECT * FROM accounts WHERE id = 1; -- balance = 500!
-- Same transaction, different result → non-repeatable read.
COMMIT;
-- === TERMINAL B ===
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1;
COMMIT; -- ← Terminal A's 2nd SELECT sees this change
-- === FIX: use REPEATABLE READ ===
-- Terminal A:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM accounts WHERE id = 1; -- 1000
-- Terminal B commits the update...
SELECT * FROM accounts WHERE id = 1; -- STILL 1000
-- REPEATABLE READ gives you a consistent snapshot.
COMMIT;
Exercise 4 — Normalize a Movie Database
Start with one giant denormalized table: user_name, user_email, movie_title, genre1, genre2, genre3, director, rating, watchlist_added_at. Identify the redundancy and update anomalies. Then normalize step by step: 1NF (atomic values), 2NF (remove partial dependencies), 3NF (remove transitive dependencies). End with separate tables for users, movies, genres, ratings, and watchlists, connected by foreign keys.
- 1NF violation: genre1, genre2, genre3 should be a separate genres table with a junction table movie_genres.
- 2NF violation: movie_title and director depend only on the movie, not on the user+movie composite key.
- 3NF violation: director depends on movie_title, not on the primary key. But for simplicity, keeping director in the movies table is fine — full 3NF would extract a directors table.
- The junction tables: movie_genres(movie_id, genre_id), watchlist(user_id, movie_id, added_at), ratings(user_id, movie_id, score, rated_at).
Exercise 5 — Benchmark PgBouncer
Install PgBouncer on your machine. Configure it in transaction pooling mode with default_pool_size = 10. Create a test database and run pgbench directly against PostgreSQL first (record TPS). Then run pgbench through PgBouncer with 200 simulated clients. Compare transactions per second, connection time, and CPU usage.
- Install: brew install pgbouncer (Mac) or sudo apt install pgbouncer (Ubuntu).
- Initialize pgbench: pgbench -i -s 10 testdb (creates test tables with ~1M rows).
- Direct benchmark: pgbench -c 200 -j 4 -T 30 testdb (200 clients, 4 threads, 30 seconds).
- Via PgBouncer: pgbench -c 200 -j 4 -T 30 -h 127.0.0.1 -p 6432 testdb.
- Without PgBouncer, 200 direct connections may hit max_connections and fail. With PgBouncer, they all succeed because the pooler queues excess connections.
Cheat Sheet
Pin this section. Everything you'll reach for during debugging, interviews, and production firefighting — in one place.
A — Atomicity → BEGIN; ... COMMIT;
(all or nothing — ROLLBACK undoes everything)
C — Consistency → CHECK, NOT NULL, UNIQUE, FK
(rules enforced BEFORE data is written)
I — Isolation → SHOW transaction_isolation;
(concurrent txns don't see each other's mess)
D — Durability → WAL (Write-Ahead Log)
(committed = survives crash. Period.)
- B+ Tree (default) — =, <, >, BETWEEN, ORDER BY, LIKE 'abc%' → CREATE INDEX idx ON t(col);
- Hash — = only (no ranges); rarely used → CREATE INDEX idx ON t USING hash(col);
- GIN (Generalized Inverted) — JSONB, arrays, full-text search → CREATE INDEX idx ON t USING gin(col);
- GiST (Generalized Search Tree) — geometry, ranges, nearest-neighbor → CREATE INDEX idx ON t USING gist(col);
- \dt — List all tables
- \di — List all indexes
- \d table — Show table schema
- \d+ table — Schema + sizes + descriptions
- \x — Toggle expanded output
- \timing — Show query execution time
- EXPLAIN ANALYZE SELECT ...; → Shows actual execution plan + timing
- EXPLAIN (ANALYZE, BUFFERS) SELECT ...; → Plan + I/O stats (buffer hits/reads)
- 1NF — Atomic values: no arrays, no comma-separated lists. Every cell = one value.
- 2NF — No partial dependencies: every non-key column depends on the WHOLE primary key (not part of it).
- 3NF — No transitive dependencies: non-key columns depend ONLY on the primary key, not on other non-key columns.
Rule of thumb: "Every non-key column provides a fact about the key, the whole key, and nothing but the key."
-- Dead tuples (bloat):
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables;
-- Slow queries (needs pg_stat_statements):
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 5;
-- Auto-log slow plans:
-- shared_preload_libraries = 'auto_explain'
-- auto_explain.log_min_duration = '1s'
-- Cache hit ratio (should be > 99%):
SELECT sum(heap_blks_hit) /
sum(heap_blks_hit + heap_blks_read)
FROM pg_statio_user_tables;
SHOW max_connections; -- Default: 100
- Per-connection RAM: ~10MB (PostgreSQL)
- 100 connections ≈ 1GB RAM overhead; 1,000 connections ≈ 10GB
- Fix: PgBouncer (connection pooler) with pool_mode = transaction, default_pool_size = 50, max_client_conn = 2000
- Result: 2,000 app connections routed through 50 real DB connections. RAM saved: ~19.5GB.
Connected Topics — Where to Go Next
Relational databases don't live in isolation. Every topic below builds on what you've just learned — and several of them were mentioned throughout this page. Pick whichever sounds most interesting and keep going.