TL;DR — Caching in Plain English
- What a cache is and why it gives such a disproportionate performance boost for so little effort
- The 5 canonical caching strategies — cache-aside, read-through, write-through, write-behind, refresh-ahead — and what makes each one different
- Why cache invalidation is genuinely hard and what the main approaches are
- The gotchas — stampedes, stale reads, data loss on crash — that catch production teams off guard
A cache is just a fast, temporary copy of slower data placed close to whoever needs it. With a high hit ratio, 99% of reads can be served from that copy in ~1 ms instead of hitting the database at ~50 ms. The hard part isn't reading from the cache; it's deciding when the copy is no longer valid and needs to be refreshed.
Think of a cache the same way you'd think of keeping a sticky note of your boss's phone number on your desk. You could look it up in the company directory every time — accurate, slow. Or you keep the sticky note — instant, but you have to update it when the number changes. A cache in software is exactly that sticky note: a smaller, faster store that holds a copy of data from a bigger, slower store. The classic pairing is Redis (fast, in-memory) sitting in front of PostgreSQL (slower, on disk).
There are 5 canonical ways to connect your application to a cache and a database:
- Cache-aside (lazy load): your app checks the cache; on a miss, it fetches from the DB and fills the cache.
- Read-through: a caching library handles the miss transparently, so your app only ever asks the cache.
- Write-through: every write goes to both the cache and the DB before it is acknowledged. The cache stays fresh, but each write pays for two round trips.
- Write-behind (write-back): writes land in the cache first and get flushed to the DB asynchronously. Lowest latency, but data that hasn't flushed yet can be lost if the cache crashes.
- Refresh-ahead: the cache proactively fetches fresh data before the TTL expires, so requests for hot keys don't see a cold cache.
Reading from cache is trivial. The hard part is invalidation — knowing when your cached copy is no longer correct and needs to be thrown away. Phil Karlton famously said: "There are only two hard things in computer science: cache invalidation and naming things." The three main approaches are letting entries expire after a fixed time (TTL), explicitly deleting entries when data changes (purge), and using an event stream to propagate changes (CDC/pub-sub). Each trades simplicity for consistency.
Why You Need This — One Change, 100× Payoff
Let's start with a concrete story that plays out at almost every growing startup. Your homepage runs 5 SQL queries — user profile, featured products, active promotions, recent reviews, navigation menu. Each query averages about 16 ms. Total: roughly 80 ms of database time per page load. That sounds fine. Now multiply it.
The Math That Forces Your Hand
At 10,000 requests per second, those 5 queries fire 50,000 times per second against your database. A typical Postgres instance handles maybe 5,000–10,000 queries per second at comfortable CPU utilization. You're at 5–10× capacity. Your database CPU sits at 95%. Every slow query cascades into a queue, and your p99 latency climbs from 80 ms to 800 ms. Users leave. You spin up more database replicas, which costs money and only buys time before you hit the next ceiling.
Now make one change: store the homepage result in Redis with a 60-second TTL.
What Just Happened?
With a 60-second TTL, the homepage data is served from Redis for 99% of requests. The database sees 100 requests per second instead of 10,000 — a 99% reduction. Your database CPU drops from 95% to around 2%. Your p99 latency drops from 800 ms to about 2 ms. Your infrastructure bill for the database tier drops dramatically. And all you changed was adding one cache.set call.
The 1% that still hits the database? Those are the requests that arrive right after the TTL expires and the cache needs a fresh copy. That's the "cache miss" — and handling it well (quickly, without stampeding the database) is where the engineering work lives.
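The arithmetic above is easy to check. The sketch below is only an illustration: dbQueriesPerSecond is a made-up helper, and it assumes every cache miss runs all five homepage queries.

```javascript
// Hypothetical helper: database queries per second for a given hit ratio.
// Assumes every cache miss runs all of the page's queries.
function dbQueriesPerSecond(requestsPerSecond, queriesPerRequest, hitRatio) {
  const missesPerSecond = requestsPerSecond * (1 - hitRatio);
  return missesPerSecond * queriesPerRequest;
}

const withoutCache = dbQueriesPerSecond(10000, 5, 0);
const withCache = Math.round(dbQueriesPerSecond(10000, 5, 0.99));

console.log(withoutCache); // 50000
console.log(withCache);    // 500
```

At a 99% hit ratio, 100 page loads per second still reach the database, which works out to 500 queries per second instead of 50,000.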
Think about this: suppose one row in your products table, product:42, is read 100 times more than any other row. It gets updated (price change, stock count) roughly once per minute. Where would you cache it, at what TTL, and what happens if two servers simultaneously try to refresh it after it expires? Hold that question; we'll come back to it in S4 (stampede) and S6 (invalidation).
Mental Model — The Cache Hierarchy
When engineers say "add a cache," they usually mean Redis or Memcached. But in reality, your request already passes through five or six layers of caching before it reaches a database. Understanding the full hierarchy helps you pick the right level to optimize at — because sometimes the highest-leverage cache is the one you already have for free.
Think of the hierarchy like a series of progressively bigger, slower pantries. The fridge in the kitchen (L1 CPU cache) is tiny but right at your fingertips — nanoseconds. The pantry down the hall (app process memory) is bigger but a few steps away. The warehouse across town (database) holds everything but takes real time to retrieve. You stock the pantries from the warehouse; you check the closest one first.
The key insight from the hierarchy: each level is orders of magnitude slower than the one above it, but orders of magnitude larger. CPU caches are measured in kilobytes and nanoseconds; a CDN is measured in terabytes and milliseconds. You can't store everything at every level, so you store the hottest data at the fastest level and let less-frequently-accessed data fall down to slower levels.
For most backend engineers, the two most actionable levels are app-level (in-process) caching and distributed caching (Redis). Let's look at all five levels properly.
CPU Caches (L1/L2/L3)
Latency: ~1 nanosecond (L1) to ~10 ns (L3). Size: 32 KB to ~32 MB.
Managed entirely by the CPU hardware — your code doesn't control it directly. When the CPU reads memory, it stores a copy in L1/L2/L3 so future reads of the same address are near-instant. Why does this matter for backend engineers? Cache-friendly data structures (arrays over linked lists, struct-of-arrays over array-of-structs) get massive throughput improvements. A tight loop over a contiguous array hits L1 cache almost every access; chasing pointers across a linked list misses cache constantly.
App-Level (In-Process) Cache
Latency: ~1 microsecond or less. Size: MB to GB (limited by process heap). Examples: a Dictionary<string, Product> in C#, a HashMap in Java, the lru-cache npm package in Node.
Lives inside your application process — no network call needed. Reading from it is just a hash-map lookup: roughly 1 µs. The downside: it's per-process. If you have 20 app server instances, each has its own copy. A write to Product 42 on server 1 doesn't update the cache on servers 2–20. This makes in-process caching great for truly static data (config values, feature flags, reference tables that change hourly) and risky for frequently-changing data.
Distributed Cache (Redis / Memcached)
Latency: 0.1–1 ms. Size: GB to TB. The workhorse of backend caching.
A separate in-memory service that all your app servers share. Why separate? Because all 20 app servers hit the same Redis instance — a write on server 1 is immediately visible to servers 2–20. Why in-memory? Because RAM is ~10,000× faster than disk for random reads. Redis adds rich data structures (sorted sets for leaderboards, lists for queues, streams for event logs) and optional disk persistence. Memcached is simpler but lacks persistence and data structures.
CDN (Content Delivery Network)
Latency: 1–50 ms (from nearest edge node). Size: TB–PB. Examples: Cloudflare, Fastly, AWS CloudFront, Akamai.
A globally distributed network of servers that caches your static assets (images, JS, CSS) and increasingly dynamic API responses close to the user's physical location. Why does geography matter? Light travels at ~200 km/ms in fiber. A user in Singapore connecting to a server in Virginia adds ~170 ms of pure round-trip travel time. A CDN edge node in Singapore drops that to ~5 ms. Why does this help for dynamic content? Edge functions (Cloudflare Workers, Lambda@Edge) can run logic at the edge and cache personalized or semi-dynamic responses.
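The travel-time figures follow from the fiber speed quoted above. Here is a rough sketch of that calculation; the distances are assumptions chosen for illustration (about 15,500 km is the approximate straight-line distance from Singapore to Virginia, and real cable routes are longer), and roundTripMs is a made-up helper name.

```javascript
// Speed of light in fiber, from the text: roughly 200 km per millisecond.
const FIBER_KM_PER_MS = 200;

// Round-trip propagation delay for a given one-way cable distance.
function roundTripMs(distanceKm) {
  return (2 * distanceKm) / FIBER_KM_PER_MS;
}

// Assumed distances, for illustration only.
console.log(roundTripMs(15500)); // 155
console.log(roundTripMs(500));   // 5
```

This counts propagation only. Routing detours, queuing, and TLS handshakes all add to it, which is why measured latencies tend to sit above the physical floor.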
Browser Cache
Latency: ~0 ms (already on device). Size: Configurable, typically a few hundred MB per origin. Controlled by: HTTP headers.
The fastest cache of all — already on the user's machine, no network needed. Controlled by HTTP response headers: Cache-Control: max-age=86400 tells the browser to use its local copy for 24 hours without asking the server. ETag and Last-Modified enable conditional requests — "only send me the file if it changed." The challenge: browser caches are hard to invalidate immediately. The standard technique is cache-busting — change the filename or add a hash to the URL (app.a3f8c2.js) so the browser treats it as a brand-new file.
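The conditional-request flow can be sketched as a pure function. This is a simplified illustration, not a full HTTP implementation: respond is a hypothetical helper, the ETag value is made up, and a real server would also handle Last-Modified and weak validators.

```javascript
// Decide how to answer a GET for a static asset.
// `ifNoneMatch` is the request's If-None-Match header value (or undefined).
function respond(ifNoneMatch, body, etag) {
  const headers = { 'Cache-Control': 'max-age=86400', ETag: etag };
  if (ifNoneMatch === etag) {
    // The browser's copy is still current: send headers only, no body.
    return { status: 304, headers, body: null };
  }
  return { status: 200, headers, body };
}

const etag = '"a3f8c2"'; // hypothetical content hash
const firstVisit = respond(undefined, 'console.log("app")', etag);
const revalidation = respond(etag, 'console.log("app")', etag);

console.log(firstVisit.status);   // 200
console.log(revalidation.status); // 304
```

Within the max-age window the browser does not contact the server at all. The 304 path only matters once the local copy has aged out and needs revalidating.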
Core Concepts — The Vocabulary You Need
Before diving into the 5 strategies, there are 6 concepts that appear everywhere in caching literature and interviews. Each one is simple once named — but without the names, conversations about caching quickly become confusing.
Cache Hit / Cache Miss
When your code asks the cache for a key and the cache has it, that's a cache hit. Data is returned instantly — no database query needed. When the cache doesn't have it (because it never was cached, or it expired, or it was evicted), that's a cache miss. On a miss, your code falls back to the source of truth (usually a database), fetches the data, and typically writes it back to the cache for next time.
Why the distinction matters: hits cost ~1 ms; misses cost ~50 ms + the overhead of writing to cache. Your goal is to maximize the hit rate for your hot data.
TTL — Time-To-Live
A TTL is just a countdown timer attached to a cache entry. When you store a value in Redis with SET product:42 {...} EX 300, you're saying: "keep this for 300 seconds, then automatically delete it." After 300 seconds, the next request for product:42 gets a cache miss, your code fetches fresh data from the DB, and stores it again with a fresh TTL.
TTL is the simplest way to handle cache invalidation — you accept that your data might be up to N seconds stale. The right TTL depends on how often the data changes and how much staleness your users can tolerate. A product price might need a 10-second TTL; a country list might be fine with a 24-hour TTL.
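To make the countdown concrete, here is a minimal in-memory sketch of TTL behaviour. It is not how Redis implements expiry; TtlCache and its injectable clock are assumptions made so the example runs without a server.

```javascript
class TtlCache {
  // `now` is injectable so the example can control time.
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  set(key, value, ttlSeconds) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expire lazily, on read
      return undefined;
    }
    return entry.value;
  }
}

let fakeNow = 0;
const ttlCache = new TtlCache(() => fakeNow);
ttlCache.set('product:42', { price: 10 }, 300);

fakeNow = 299000;
const beforeExpiry = ttlCache.get('product:42'); // still cached
fakeNow = 300000;
const afterExpiry = ttlCache.get('product:42');  // expired, so a miss

console.log(beforeExpiry, afterExpiry); // { price: 10 } undefined
```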
Eviction Policy
A cache is finite. When it fills up and a new entry needs to be stored, something old must be thrown out — but which old entry? That decision is the eviction policy.
LRU (Least Recently Used): evict whatever was accessed longest ago. Most common; works well because recently-accessed items are likely to be accessed again. LFU (Least Frequently Used): evict whatever was accessed fewest times total. Better for workloads with very hot keys. FIFO (First In, First Out): evict the oldest entry regardless of access. Simple but often suboptimal. Redis evicts nothing by default (its maxmemory-policy is noeviction, so writes fail once the memory limit is reached); in a cache deployment you typically pick one of its approximate LRU or LFU policies, such as allkeys-lru.
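A small exact LRU can be sketched with a JavaScript Map, which iterates in insertion order. This is an illustrative toy (LruCache is a made-up class), not Redis's sampled approximation.

```javascript
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map(); // first key in iteration order = least recently used
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value); // re-insert to mark as most recently used
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      const leastRecent = this.map.keys().next().value;
      this.map.delete(leastRecent);
    }
  }
}

const lru = new LruCache(2);
lru.set('a', 1);
lru.set('b', 2);
lru.get('a');    // 'a' is now the most recently used
lru.set('c', 3); // over capacity: evicts 'b'

console.log([...lru.map.keys()]); // [ 'a', 'c' ]
```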
Cache Invalidation
Cache invalidation is the act of marking a cached entry as stale or removing it entirely so the next request fetches fresh data. This is the hard part of caching — Phil Karlton's famous "two hard problems" quote is about exactly this.
Why is it hard? Because data can be updated from multiple places (direct DB writes, background jobs, other services), and your cache might not know about those updates. TTL is the lazy approach: wait for the entry to expire. Explicit purge is more precise: delete the entry immediately when you write to the DB. Event-driven invalidation uses a CDC stream or pub/sub to invalidate across services. We'll dig into each in S6.
Hit Ratio
The hit ratio (or hit rate) is simply: hits ÷ (hits + misses). If you make 1,000 cache requests and 950 of them are hits, your hit ratio is 95%. This is the single most important metric for a cache.
A hit ratio below ~80% means your cache isn't saving much — most requests still hit the database. A hit ratio above 95% means your cache is doing real work. Target 95%+ for hot caches on read-heavy workloads. Low hit ratio diagnoses: TTL too short, cache too small (too much eviction), wrong keys being cached (caching long-tail data instead of hot data), or a cache that just cold-started.
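Tracking the ratio takes only two counters. The class below is a hypothetical sketch; in practice you would more likely read these counters from your cache server or metrics library.

```javascript
class CacheStats {
  constructor() {
    this.hits = 0;
    this.misses = 0;
  }

  record(wasHit) {
    if (wasHit) this.hits += 1;
    else this.misses += 1;
  }

  // hits / (hits + misses), or 0 before any lookups.
  hitRatio() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

const stats = new CacheStats();
for (let i = 0; i < 1000; i++) stats.record(i < 950); // 950 hits, 50 misses

console.log(stats.hitRatio()); // 0.95
```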
Cache Stampede (Thundering Herd)
A cache stampede — also called a thundering herd — happens when a popular cache entry expires (or is evicted) and many requests all miss at the same time, causing all of them to simultaneously hit the database to recompute and re-cache the same value.
Imagine product:42 has a 60-second TTL and is requested 500 times per second. At T=60, it expires. In the next 50 milliseconds — before any request has had time to re-populate the cache — 25 requests all miss, all query the database, all compute the same result, and all try to write it back. That's 25 redundant DB queries and 25 redundant writes. On a really popular key at high traffic, this number is in the thousands. Mitigation strategies: probabilistic early expiration, mutex/lock on miss, or background refresh — we'll cover each in S6.
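Of the mitigations listed, probabilistic early expiration is the least obvious, so here is a rough sketch of one common formulation. The function name, parameter names, and the injectable random source are assumptions for illustration; the idea is that each request treats the entry as expiring slightly early by a random margin, so one request usually refreshes it before the real expiry.

```javascript
// Decide whether this request should refresh the entry before it expires.
// recomputeMs: how long a refresh takes; beta > 1 makes early refresh more likely.
// rand is injectable so the behaviour can be checked deterministically.
function shouldRefreshEarly(nowMs, expiresAtMs, recomputeMs, beta = 1, rand = Math.random) {
  // -log(rand()) is a positive random number, so each request "sees" the
  // expiry early by a random margin scaled to the recompute cost.
  const margin = -recomputeMs * beta * Math.log(rand());
  return nowMs + margin >= expiresAtMs;
}

const half = () => 0.5; // fixed "random" value for the demo
console.log(shouldRefreshEarly(59900, 60000, 50, 1, half)); // false
console.log(shouldRefreshEarly(59970, 60000, 50, 1, half)); // true
```

With a real random source, most requests far from expiry return false, and the chance of a true rises as the expiry approaches.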
The 5 Caching Strategies — Patterns Every Engineer Knows
There isn't just "one way" to use a cache. The 5 canonical patterns differ in who triggers the cache fill (the app or the cache library), when writes are acknowledged (before or after the DB gets updated), and how fresh the cached data stays. Picking the wrong pattern for your workload is a common source of bugs and production incidents.
Most applications use cache-aside as their default. The others fit specific needs: read-through and write-through clean up your code when using a caching framework; write-behind maximizes write throughput at the cost of durability; refresh-ahead eliminates cold-start latency for predictably popular content.
1. Cache-Aside (Lazy Loading)
The most common pattern. Your application code is responsible for all cache interactions — it checks the cache first, handles misses by fetching from the DB, and writes back to the cache after a miss. The cache is only populated on demand (lazily).
Read path: Check cache → hit: return data; miss: query DB → write to cache → return data.
Write path: Write to DB → (optionally) delete or update the cache entry.
Best for: General-purpose read-heavy workloads. Works with any data store. Cache failures are non-fatal (app falls back to DB). Watch out for: Stampede on popular key expiry. Cache and DB can diverge if the write path doesn't invalidate cache. Cold start: first request for every key always misses.
2. Read-Through
Cleaner code, same logic. Like cache-aside, but the caching library itself handles the DB fetch on a miss — your application code only ever talks to the cache. You configure the cache with a "loader" function that knows how to fetch from the DB.
Read path: App asks cache → hit: return; miss: cache calls loader → loader fetches from DB → cache stores + returns.
Write path: Depends on configuration — often paired with write-through (below).
Best for: Applications using a caching framework or library with loader support (e.g., Spring Cache, NCache, or Caffeine's LoadingCache). Keeps business logic clean. Watch out for: First request for each key still misses. The loader adds coupling between cache and DB layers.
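The difference from cache-aside is easiest to see in code: the caller never touches the database. The sketch below uses an in-memory Map and a fake loader, both assumptions for illustration, and omits TTLs for brevity.

```javascript
class ReadThroughCache {
  // `loader` is an async function (key) => value, e.g. a DB lookup.
  constructor(loader) {
    this.loader = loader;
    this.store = new Map();
  }

  async get(key) {
    if (this.store.has(key)) return this.store.get(key); // hit
    const value = await this.loader(key); // miss: the cache calls the loader
    this.store.set(key, value);
    return value;
  }
}

let loads = 0;
const fakeDb = { 'product:42': { name: 'Widget' } }; // stand-in for a database
const products = new ReadThroughCache(async (key) => {
  loads += 1;
  return fakeDb[key];
});

const readThroughDemo = (async () => {
  const first = await products.get('product:42');  // loader runs
  const second = await products.get('product:42'); // served from cache
  return { first, second, loads };
})();

readThroughDemo.then((result) => console.log(result.loads)); // 1
```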
3. Write-Through
Cache stays perfectly fresh. Every write goes to the cache and the database synchronously — both are updated before the write is acknowledged to the caller. The cache never holds stale data; any read immediately after a write will get the new value.
Write path: App writes → update cache + update DB (both synchronously) → return success.
Read path: Cache always has the data (no cold start for written keys).
Best for: Read-heavy workloads where you can tolerate the extra write latency. Financial records, inventory counts, any data where reads must always see the latest value. Watch out for: Every write takes longer (cache write + DB write). If you write many keys that are rarely read, you're wasting cache space (write amplification).
4. Write-Behind (Write-Back)
Lowest write latency, highest risk. Writes go to the cache immediately and are acknowledged to the caller at once. The cache then flushes data to the database asynchronously — on a schedule, in batches, or when triggered. The app sees sub-millisecond write acknowledgment.
Write path: App writes → update cache → return success immediately → (background) flush to DB.
Risk: If the cache crashes before the flush completes, data is permanently lost.
Best for: High-frequency writes where occasional data loss is acceptable (analytics event counters, view counts, game state). Gaming backends use this heavily. Watch out for: Data loss on cache failure. Database can be seconds or minutes behind. Requires careful flush scheduling and crash recovery logic.
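A minimal sketch of the write-behind flow, assuming an in-memory store and a hypothetical flushToDb callback. A production version would also need a flush schedule, retries, and crash recovery, which are left out here.

```javascript
class WriteBehindCache {
  // `flushToDb` is an async function that receives [key, value] pairs.
  constructor(flushToDb) {
    this.flushToDb = flushToDb;
    this.store = new Map();
    this.dirty = new Map(); // latest unflushed value per key
  }

  write(key, value) {
    this.store.set(key, value);
    this.dirty.set(key, value); // acknowledged before the DB sees it
  }

  read(key) {
    return this.store.get(key);
  }

  async flush() {
    if (this.dirty.size === 0) return 0;
    const batch = [...this.dirty.entries()];
    this.dirty.clear();
    try {
      await this.flushToDb(batch);
    } catch (err) {
      // Re-queue failed entries unless a newer write has replaced them.
      for (const [key, value] of batch) {
        if (!this.dirty.has(key)) this.dirty.set(key, value);
      }
      throw err;
    }
    return batch.length;
  }
}

const fakeDb = new Map(); // stand-in for the database
const views = new WriteBehindCache(async (batch) => {
  for (const [key, value] of batch) fakeDb.set(key, value);
});

views.write('views:42', 1);
views.write('views:42', 2); // replaces the pending value: one DB write later
views.write('views:7', 1);
console.log(fakeDb.size); // 0: nothing has reached the DB yet

const flushed = views.flush();
flushed.then((count) => console.log(count, fakeDb.get('views:42'))); // 2 2
```

Everything in the dirty map exists only in memory until flush runs, which is exactly the data-loss window described above.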
5. Refresh-Ahead
Proactive, not reactive. Instead of waiting for a TTL to expire and then fetching fresh data on the next miss, the cache proactively refreshes an entry before it expires — typically when remaining TTL drops below a threshold (e.g., refresh when 80% of TTL is consumed). No request ever sees a cold-cache miss for hot data.
Read path: Always a hit (if entry is hot enough to keep refreshing). No miss latency.
Background: Cache detects entries approaching expiry → triggers a background fetch → updates cache entry silently.
Best for: Data that's always hot and expensive to compute (homepage content, featured items, aggregated dashboards). Watch out for: If you refresh too aggressively, you do unnecessary DB work. If you predict wrong (entry stops being hot), you refreshed for nothing. Complexity: requires a background job or thread.
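The refresh trigger itself is a small predicate. The sketch below assumes the 80% threshold mentioned above; needsRefresh is a made-up helper, and the background job that would call it is not shown.

```javascript
// True when an entry is old enough to refresh in the background
// but has not expired yet (an expired entry is a normal miss).
function needsRefresh(storedAtMs, ttlMs, nowMs, threshold = 0.8) {
  const age = nowMs - storedAtMs;
  return age >= ttlMs * threshold && age < ttlMs;
}

const ttlMs = 60000; // 60-second TTL
console.log(needsRefresh(0, ttlMs, 30000)); // false: only half the TTL used
console.log(needsRefresh(0, ttlMs, 50000)); // true: past 80% of the TTL
console.log(needsRefresh(0, ttlMs, 61000)); // false: already expired
```

Requests keep being served from the existing entry while the refresh runs, so they never pay the miss latency.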
Seeing It in Code
// cache-aside.js: the most common pattern.
// App code controls all cache interactions.
// Assumptions: node-redis v4+, and a `db` object (not defined here) with a
// pg-style query(text, params) that resolves to { rows }.
const redis = require('redis');
const client = redis.createClient();
client.on('error', (err) => console.error('Redis error', err));
client.connect().catch(console.error); // v4+ clients must connect before use

async function getProduct(productId) {
  const cacheKey = `product:${productId}`;

  // Step 1: Check the cache first.
  // Why: cache hit costs ~1 ms; DB query costs ~50 ms.
  const cached = await client.get(cacheKey);
  if (cached) {
    return JSON.parse(cached); // cache HIT: fast path
  }

  // Step 2: Cache MISS. Fall back to the database.
  // This is the slow path. We accept one slow request
  // and make all subsequent requests fast.
  const { rows } = await db.query(
    'SELECT * FROM products WHERE id = $1', [productId]
  );
  const product = rows[0];
  if (!product) return null; // unknown id: nothing to cache

  // Step 3: Populate the cache so the next request is fast.
  // TTL = 300 seconds. After 5 minutes, Redis auto-deletes
  // this entry and the next request will re-fetch from DB.
  // Why 300s? Product data changes rarely; 5 min staleness OK.
  await client.setEx(cacheKey, 300, JSON.stringify(product));
  return product;
}

// On a product UPDATE, invalidate the cache entry so the
// next read gets fresh data. Don't update cache here;
// let the next read fill it (lazy).
async function updateProduct(productId, data) {
  await db.query(
    'UPDATE products SET ... WHERE id = $1', [productId]
  );
  // Delete stale entry. Next GET will re-cache fresh data.
  await client.del(`product:${productId}`);
}
The cache-aside pattern in 3 lines of thinking: check cache → miss → fetch from DB and fill cache. Writes invalidate the cache entry; the next read re-populates it. Simple, decoupled, and works with any data store. The downside: if two requests miss simultaneously on the same key, both query the DB. That's the stampede problem the next example addresses.
// cache-aside with stampede protection using a Redis lock.
// Problem: product:42 expires, 500 simultaneous requests
// all miss at once and all hit the DB to re-compute the same value.
// Solution: the first miss acquires a lock; the others wait briefly,
// then read the now-populated cache.
// Assumes the `client` (node-redis v4+) and pg-style `db` from cache-aside.js.
const LOCK_TTL = 5; // lock held for max 5 seconds
const sleep = ms => new Promise(r => setTimeout(r, ms));

async function getProductWithLock(productId) {
  const cacheKey = `product:${productId}`;
  const lockKey = `lock:product:${productId}`;

  // 1. Check cache. The happy path is unchanged.
  const cached = await client.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 2. Cache miss. Try to acquire an exclusive lock.
  // NX = only set if Not eXists (atomic set-if-absent).
  // EX = expire after LOCK_TTL seconds (prevents deadlock).
  const acquired = await client.set(
    lockKey, '1', { NX: true, EX: LOCK_TTL }
  );

  if (acquired) {
    // 3. We won the lock. We are responsible for fetching.
    try {
      const { rows } = await db.query(
        'SELECT * FROM products WHERE id = $1', [productId]
      );
      const product = rows[0];
      if (!product) return null; // unknown id: nothing to cache
      await client.setEx(cacheKey, 300, JSON.stringify(product));
      return product;
    } finally {
      // Release the lock for waiters. Caveat: if the fetch outlasted
      // LOCK_TTL, this DEL can remove a lock another request now holds;
      // storing a unique token and checking it on release avoids that.
      await client.del(lockKey);
    }
  } else {
    // 4. Another request is already fetching. Wait briefly
    // then retry; the cache should be populated by then.
    await sleep(50); // 50ms, adjust to your DB latency
    const retried = await client.get(cacheKey);
    if (retried) return JSON.parse(retried);

    // Last resort: fetch directly (should be rare).
    const { rows } = await db.query(
      'SELECT * FROM products WHERE id = $1', [productId]
    );
    return rows[0] ?? null;
  }
}
The mutex-on-miss pattern: only the first request that misses acquires the lock and queries the DB. All other concurrent misses wait 50 ms and then read from the now-populated cache. The lock itself has a TTL (5 seconds) to prevent permanent deadlock if the lock-holder crashes before releasing. This reduces N redundant DB queries on a popular expiry to 1 in the common case; waiters that still find the cache empty after the 50 ms sleep fall back to the DB themselves.
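A complementary technique, not part of the Redis lock above, is to coalesce concurrent misses inside a single process so that each server issues at most one load per key. The sketch below is an assumption-laden illustration: singleFlight and loadProduct are made-up names, and the database call is simulated with a timer.

```javascript
const inFlight = new Map(); // key -> Promise of the pending load

// Run `load` once per key at a time; concurrent callers share the same promise.
function singleFlight(key, load) {
  if (inFlight.has(key)) return inFlight.get(key);
  const promise = load().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}

let dbCalls = 0;
const loadProduct = () => {
  dbCalls += 1; // count simulated database queries
  return new Promise((resolve) => setTimeout(() => resolve({ id: 42 }), 10));
};

// 500 concurrent misses for the same key.
const requests = Array.from({ length: 500 }, () => singleFlight('product:42', loadProduct));
const allDone = Promise.all(requests);

allDone.then((results) => console.log(results.length, dbCalls)); // 500 1
```

This only deduplicates within one process. With 20 app servers you could still see up to 20 loads, which is where the shared Redis lock earns its keep.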
// write-through.js: every write updates both cache + DB.
// Trade-off: each write makes two sequential round trips (DB, then cache),
// but a read right after a write is served the new value from cache.
// Assumes the `client` (node-redis v4+) and pg-style `db` from cache-aside.js.
async function updateProduct(productId, data) {
  const cacheKey = `product:${productId}`;

  // Step 1: Write to the database (source of truth).
  // Why DB first? If DB fails, we haven't put bad data in cache.
  // RETURNING * gives us the full updated row to cache.
  const { rows } = await db.query(
    'UPDATE products SET name=$2, price=$3 WHERE id=$1 RETURNING *',
    [productId, data.name, data.price]
  );
  const product = rows[0];
  if (!product) return null; // no such row: nothing to cache

  // Step 2: Update the cache to match.
  // The cache entry now holds the same row the DB returned, so it has
  // the same shape as entries written by getProduct below.
  // Any read immediately after this write will see new data.
  // We refresh the TTL too (reset the 5-minute timer).
  await client.setEx(cacheKey, 300, JSON.stringify(product));
  return product;

  // Why not the other way around (cache first, then DB)?
  // If the DB write fails after the cache write succeeds,
  // the cache holds data the DB never persisted.
  // Always write the source of truth first.
}

async function getProduct(productId) {
  // Reads are usually fast: cache was populated on the last write.
  // The miss path below covers cold start, TTL expiry, and eviction.
  const cacheKey = `product:${productId}`;
  const cached = await client.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Fallback for cold start or eviction.
  const { rows } = await db.query(
    'SELECT * FROM products WHERE id = $1', [productId]
  );
  const product = rows[0];
  if (!product) return null;
  await client.setEx(cacheKey, 300, JSON.stringify(product));
  return product;
}
Write-through in practice: DB write first (so the source of truth is always correct), then cache write. Reads are almost always cache hits because data was just written there. The cost: each write takes two round trips instead of one, so write latency grows by the cache round trip (it only doubles when the two stores are equally slow). For financial or inventory data where stale reads are unacceptable, that's a worthwhile trade.
Cache Invalidation — The Hard Part
Reading from a cache is the easy part; it's just a hash map lookup. The genuinely hard part is deciding when a cached copy is no longer correct and needs to be thrown away or updated. This is cache invalidation, and it's the topic of Phil Karlton's famous quote: "There are only two hard things in computer science: cache invalidation and naming things."
Why is it actually hard? Because data is changed by many actors (application writes, background jobs, migrations, admin tools, other services) and your cache doesn't automatically know about any of them. You have to explicitly tell it — and doing so correctly across all code paths is surprisingly difficult.
There are three main strategies for keeping your cache in sync. Each trades simplicity for consistency.
Strategy 1 — TTL-Based Expiration
How it works: Every cache entry has a time-to-live. When the timer hits zero, Redis auto-deletes the entry. The next request gets a miss, fetches from the DB, and re-caches with a fresh TTL. You never explicitly invalidate anything — you just wait for entries to age out.
Why it's attractive: Simplest possible implementation — just add EX 300 to your SET command. No need to track what changed or which keys to invalidate. Works automatically even if the write path forgets to invalidate.
The catch: During the TTL window, your cache serves stale data. If a product price changes at T=0 and the TTL is 5 minutes, users see the wrong price for up to 5 minutes. For some data (country names, navigation menus) that's fine. For prices, stock levels, or balances — it depends on your business requirements. A short TTL reduces staleness but increases DB load. A long TTL reduces DB load but increases staleness. There's no free lunch.
Strategy 2 — Explicit Purge (Invalidation on Write)
How it works: When your application writes to the database, it also immediately deletes (or updates) the corresponding cache entry. The next read gets a miss, fetches fresh data, and re-caches. No stale window — as soon as a write completes, the cache is clean.
Why it's attractive: Near-zero stale window. Reads immediately after a write see fresh data. Better cache efficiency: because writes purge entries, you can give them a long TTL, so the hit ratio stays high without a long stale window.
The catch: You must remember to invalidate on every code path that writes. One missed invalidation in a background job, a migration script, or an admin tool means the cache holds stale data indefinitely until it expires via TTL (which you may have set to hours or infinity). This is the "only two hard things" problem in concrete form. The standard defense: always have a TTL as a backstop even when using explicit purge.
Strategy 3 — Event-Driven Invalidation (CDC / Pub-Sub)
How it works: Instead of the application explicitly deleting cache entries, a change-data-capture (CDC) system reads the database's write-ahead log and publishes change events. A cache-invalidation service subscribes to these events and deletes or updates cache entries automatically.
Why it's attractive: Works even if application code forgets to invalidate. Catches writes from background jobs, migrations, admin tools, and other services that don't go through your main application code. Decoupled — the write path doesn't need to know about the cache.
The catch: Complexity. You need Kafka or a similar event bus, a CDC connector (e.g., Debezium for Postgres), and a cache-invalidation consumer service. Adds operational overhead and a small propagation delay (typically 10–200 ms between DB write and cache invalidation). Overkill for small systems; standard practice at scale, where many microservices write to shared data.
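The consumer side of this pipeline is usually small. The sketch below shows only the mapping from a change event to a cache delete; the event shape loosely follows a Debezium-style envelope (op, before, after, source.table) and should be treated as an assumption, as should the helper names and the Map standing in for Redis.

```javascript
// Map a change event to the cache key it invalidates, or null if none.
function keyForChange(event) {
  const row = event.after || event.before; // deletes carry only `before`
  if (!row || event.source.table !== 'products') return null;
  return `product:${row.id}`;
}

// Delete the affected key; returns true if an entry was removed.
function handleChange(event, cache) {
  const key = keyForChange(event);
  if (key === null) return false;
  return cache.delete(key);
}

const cache = new Map([['product:42', '{"price":10}'], ['product:7', '{"price":5}']]);
const update = {
  op: 'u',
  source: { table: 'products' },
  before: { id: 42, price: 10 },
  after: { id: 42, price: 12 },
};
const unrelated = { op: 'c', source: { table: 'orders' }, before: null, after: { id: 1 } };

console.log(handleChange(update, cache));    // true
console.log(handleChange(unrelated, cache)); // false
console.log([...cache.keys()]);              // [ 'product:7' ]
```

Because the events come from the database log, this path also catches writes made by batch jobs and admin tools that never call your application code.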
The 4 Invalidation Pitfalls That Bite Production Teams
Pitfall 1 — Forgetting to Invalidate
The single most common cause of stale data in production. Your main application code deletes the cache entry on update — but your nightly batch job, data migration script, and admin tool all write directly to the database and never touch the cache. The cache holds stale data until its TTL expires (which you may have set to hours). Users see old prices, wrong balances, deleted items that reappear.
Defense: Always use a TTL as a backstop, even when relying on explicit purge. If your primary strategy fails, the TTL limits damage to N seconds. Short TTL = less damage, more DB load. Pick based on data sensitivity.
Pitfall 2 — Read-Modify-Write Race
Two concurrent requests read the same cached value (e.g., stock count = 10), both decrement it (-1), and both write back 9. You sold two items but only decremented the count once. The correct value is 8; the cache (and possibly DB) says 9.
This is a TOCTOU (time-of-check / time-of-use) race. Defense: Use atomic operations where possible (Redis DECR, Lua scripts, or a database-level statement such as UPDATE products SET stock = stock - 1 WHERE id = $1 AND stock > 0). Never do read-modify-write at the application level for values that need to be exact.
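The race is easy to reproduce in miniature. In the sketch below a Map stands in for the cache and a zero-delay timer stands in for the network round trip between read and write; both are assumptions for illustration. The "atomic" version mimics what Redis DECR does on the server, where read and write happen as one step.

```javascript
const cache = new Map([['stock:42', 10], ['stock:43', 10]]);
const pause = () => new Promise((resolve) => setTimeout(resolve, 0));

// Unsafe: read, wait (as a network round trip would), then write back.
async function unsafeDecrement(key) {
  const current = cache.get(key);
  await pause();
  cache.set(key, current - 1); // may overwrite a concurrent update
}

// Safe: the read and the write happen in one uninterrupted step.
async function atomicDecrement(key) {
  await pause();
  cache.set(key, cache.get(key) - 1);
}

const raceDemo = (async () => {
  await Promise.all([unsafeDecrement('stock:42'), unsafeDecrement('stock:42')]);
  await Promise.all([atomicDecrement('stock:43'), atomicDecrement('stock:43')]);
  return { unsafe: cache.get('stock:42'), atomic: cache.get('stock:43') };
})();

raceDemo.then((result) => console.log(result)); // { unsafe: 9, atomic: 8 }
```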
Pitfall 3 — TTL Tuning Gone Wrong
TTL too long: users see stale data that damages trust (wrong prices, phantom deleted items). TTL too short: the cache expires so fast that the hit ratio drops below 80% and your DB is nearly as loaded as without a cache. Both extremes cost you.
Defense: Start with a TTL based on data change frequency, then tune with metrics. If your price changes once per minute and your TTL is 1 second, you're refreshing about 60 times for every actual change. Monitor your hit ratio and p99 DB query latency together. A drop in hit ratio without a change in traffic usually means your TTL is too short or your cache is evicting too aggressively.
Pitfall 4 — Cascading Invalidation (Thundering Herd on Write)
You update a popular entity (e.g., a major brand on an e-commerce site) and your invalidation logic deletes 100,000 cache keys — every product, every category listing, every search result page that references that brand. All 100,000 keys are now cold. In the next few seconds, all 100,000 keys get stampeded simultaneously, generating 100,000 DB queries.
Defense: Stagger invalidation (batch deletions with small delays). Use probabilistic or token-bucket rate limiting on DB re-population. Consider background refresh for highly-referenced entities rather than immediate delete. For extreme cases, use a short lock window — mark entries as "refreshing" and serve slightly-stale data while the refresh happens rather than serving a miss.
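Staggered invalidation can be sketched as a loop that deletes keys in batches with a pause between them. The batch size, delay, and the deleteBatch callback are all assumptions to tune for your system; here a Map stands in for the cache.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// deleteBatch is a hypothetical async callback, e.g. one that issues a single DEL.
async function staggeredInvalidate(keys, deleteBatch, batchSize = 1000, delayMs = 50) {
  let batches = 0;
  for (let i = 0; i < keys.length; i += batchSize) {
    await deleteBatch(keys.slice(i, i + batchSize));
    batches += 1;
    // Pause between batches so re-population load is spread over time.
    if (i + batchSize < keys.length) await sleep(delayMs);
  }
  return batches;
}

const cache = new Map();
for (let i = 0; i < 10; i++) cache.set(`product:${i}`, 'cached');

const batchSizes = [];
const done = staggeredInvalidate([...cache.keys()], async (batch) => {
  batchSizes.push(batch.length);
  batch.forEach((key) => cache.delete(key));
}, 4, 1);

done.then((count) => console.log(count, batchSizes, cache.size)); // 3 [ 4, 4, 2 ] 0
```

The trade-off is a longer window in which some keys are still stale, so this fits data where a few extra seconds of staleness is acceptable.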
Eviction Policies — Who Gets Kicked Out First?
A cache only works because it's smaller than your main store — that's the whole point. It fits in RAM precisely because it doesn't hold everything. But that means it fills up. When a new item arrives and there's no room, the cache has to make a choice: which existing entry gets evicted to create space?
That choice is the eviction policy. Get it wrong and you'll keep evicting the wrong items — the ones that would have been hit again shortly — and your hit ratio stays low. A well-matched policy typically lifts hit ratio by 5–20% over a naive default with zero infrastructure changes.
The Six Policies You'll See in the Wild
LRU — The Sensible Default
LRU keeps a mental note of when each item was last touched. The item untouched the longest gets kicked out first. The logic: if you haven't needed something in a while, you probably won't need it soon. Redis implements this by sampling a handful of random keys and evicting the oldest-touched among them — a good approximation with near-zero overhead. Use LRU as your starting point for any general-purpose cache.
LFU — Popularity Wins
LFU tracks how many times each item has been accessed. The item accessed least often goes first. Great for workloads where some items are dramatically more popular than others (think: the front page vs. an obscure product page). The downside is bookkeeping cost — you need a frequency counter for every entry, and that memory adds up. A clever variant called W-TinyLFU solves this elegantly: instead of exact counters, it uses a compact "approximate counter" structure that estimates frequencies with a tiny memory footprint.
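A naive LFU with exact counters looks like the sketch below. LfuCache is a made-up class for illustration; eviction scans every counter, which is fine for a toy but is the bookkeeping cost that approximate schemes try to avoid.

```javascript
class LfuCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.values = new Map();
    this.counts = new Map(); // key -> number of reads
  }

  get(key) {
    if (!this.values.has(key)) return undefined;
    this.counts.set(key, this.counts.get(key) + 1);
    return this.values.get(key);
  }

  set(key, value) {
    if (!this.values.has(key) && this.values.size >= this.capacity) this.evict();
    this.values.set(key, value);
    if (!this.counts.has(key)) this.counts.set(key, 0);
  }

  // Remove the entry with the lowest read count (linear scan).
  evict() {
    let victim;
    let lowest = Infinity;
    for (const [key, count] of this.counts) {
      if (count < lowest) {
        lowest = count;
        victim = key;
      }
    }
    this.values.delete(victim);
    this.counts.delete(victim);
  }
}

const lfu = new LfuCache(2);
lfu.set('front-page', 'A');
lfu.set('obscure-product', 'B');
for (let i = 0; i < 3; i++) lfu.get('front-page');
lfu.get('obscure-product');
lfu.set('new-item', 'C'); // evicts 'obscure-product' (1 read vs 3)

console.log([...lfu.values.keys()]); // [ 'front-page', 'new-item' ]
```

Note that 'new-item' starts with a count of zero, so it would be the next victim even if it is about to become popular.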
W-TinyLFU — Modern Champion
W-TinyLFU is the modern "best of both worlds" eviction policy. It blends frequency tracking (LFU's strength) with a short grace period for brand-new entries, so a hot new item is not killed before it has a chance to prove itself. It is the default policy in Caffeine (a popular Java cache library). Redis does not implement W-TinyLFU; its allkeys-lfu and volatile-lfu policies use a different approximate LFU built on a small probabilistic counter with decay. The internal design of W-TinyLFU splits the cache into a small "window" area for fresh items and a larger LFU-style main area, and that combination handles both bursty popularity and steady-state hot keys at the same time. If you are tuning a high-traffic cache and have the option, W-TinyLFU will typically outperform pure LRU.
TTL-Based — Time Is the Judge
Rather than tracking access patterns, TTL-based eviction simply lets time decide. Every entry has a clock ticking down; when it hits zero the entry expires. This is less about eviction strategy and more about a freshness guarantee: you're saying "I accept data up to N seconds stale." Redis can combine TTL with LRU: entries expire on schedule, and if memory still fills up under an LRU maxmemory-policy (for example volatile-lru or allkeys-lru), LRU eviction kicks in. TTL is mandatory for any data that changes in the source of truth, such as product prices, user sessions, and feature flags.
FIFO — Simple but Blunt
FIFO evicts by insertion order: the oldest entry in the cache goes first, regardless of whether it's being used constantly. It's easy to reason about and implement (just a queue), but it ignores actual usage entirely. You can end up evicting an item that's hit a thousand times per minute just because it was inserted first. FIFO is mostly useful in very constrained embedded systems where tracking access time would be too expensive, not in typical web caching scenarios.
Random — Surprisingly Viable
Pick a random victim. No bookkeeping, no clock, no counters. This sounds terrible, but on skewed workloads its hit ratio can land surprisingly close to LRU's. Every entry is equally likely to be chosen as the victim, but a popular item that gets evicted is re-fetched on its very next request, so it spends nearly all of its time in the cache, while a cold item that gets evicted simply stays gone. Redis relies on random sampling in its LRU approximation for a related reason: a small random sample is usually enough to find a good victim. Pure random eviction is a useful baseline, and unlike FIFO it has no systematic bias against long-lived hot entries. It's rarely the right first choice, but it's a useful reminder that complexity doesn't always buy you proportionate gain.
Distributed Caching — One Cache for 100 Servers
An in-process cache is the fastest thing you can have — data lives in the same memory space as your application code, so fetching it is literally a pointer dereference. But in-process cache has a fatal flaw at scale: every server has its own separate cache, and they don't talk to each other.
Imagine you have 50 app servers. User Alice updates her profile. Server 12 gets the write and clears its local cache. Servers 1–11 and 13–50 still have the old profile. You now have 49 stale copies floating around — and you have no mechanism to tell them. That inconsistency can persist until each server's TTL ticks down.
A distributed cache solves this by moving the cache out of each server's process and into a shared external store — typically Redis or Memcached. Now when Server 12 invalidates Alice's profile, all 50 servers see the change on their next read. The trade-off: you've added a network hop (~0.3–1 ms) compared to ~100 ns for an in-process read.
How Keys Get Distributed Across Nodes
Consistent Hashing
Picture every possible key arranged around the edge of a clock face. That imaginary ring is the "keyspace circle." Each cache node is placed at one or more points on the circle and owns the arc leading up to each point. To find where a key lives, you hash it to a point on the circle, then walk clockwise to the first node you meet. Why a circle? Because when you add or remove a node, only the keys in the affected arcs need to move, not every key in the cluster. That makes scaling cheap: adding a node reshuffles roughly 1/N of your data, not all of it. This is why consistent hashing became the standard for distributed caches. Redis Cluster takes a related but different approach and does not use a hash ring: every key maps to one of 16,384 fixed hash slots (CRC16 of the key, modulo 16,384), and each node owns an explicit set of slots. Rebalancing means moving slots between nodes, which makes administration even simpler than a pure circle.
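The ring lookup can be sketched in a few lines. This is a minimal illustration, not a production client: the HashRing class, the node names, and the use of 100 virtual points per node (a common refinement that evens out the arcs) are all assumptions added for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (a sketch, not a client library)."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of MD5 give a well-spread 64-bit position on the ring.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        # Walk clockwise: first point at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._points, self._hash(key))
        return self._owner[self._points[idx % len(self._points)]]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("cache-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in moved now lives on cache-d; no key shuffles between the three original nodes, which is the property that makes adding capacity cheap.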
Key-Range Sharding
Divide the keyspace alphabetically or numerically: keys starting with A–F go to node 1, G–M to node 2, and so on. Simple to understand and easy to implement. The problem: access patterns are rarely uniform. If every popular product ID starts with "1", node 1 gets hammered while nodes 2–5 sit idle. This is called a hot shard. Key-range sharding is fine for range queries but needs careful key design to avoid hotspots.
Replication for HA + Read Scale
Each primary (master) node can have one or more replica nodes that hold a copy of the same data. Replicas serve two purposes: failover (if the primary dies, a replica is promoted) and read scale (you can route read-only queries to replicas, spreading load). The catch: replicas lag behind the primary by a few milliseconds. If you read from a replica immediately after writing to the primary you might get stale data — this is the coherence problem covered in Section 11.
Redis vs. Memcached — The Honest Comparison
Memcached — Lean and Fast
Memcached does one thing: store opaque values by key. Its simplicity is its strength. It's multi-threaded and can squeeze more throughput per CPU than Redis on pure key-value workloads. If all you need is "cache this blob of bytes for 60 seconds" and you're at extreme throughput (millions of operations per second), Memcached is worth evaluating. Its weaknesses: no persistence, no built-in replication, no pub/sub, no rich data types. It does offer atomic increment/decrement and compare-and-swap, but there are no sorted sets, lists, or queues. It's a cache, only a cache, and always a cache.
Redis — Cache + Much More
Redis supports strings, hashes, lists, sets, sorted sets, bitmaps, HyperLogLog, streams, and geospatial indexes. It has optional persistence (RDB snapshots + AOF), cluster mode for horizontal scaling, Pub/Sub for messaging, and Lua scripting. Modern teams default to Redis because the feature set removes the need for multiple specialized services. The mild downside is that Redis executes commands on a single thread. Redis 6 and later add multi-threaded network I/O, and the single-thread bottleneck rarely matters below hundreds of thousands of commands per second.
CDN Caching — Cache at the Edge, Near Every User
All the caching we've discussed so far lives inside your data centre. Your database is in Virginia, your Redis is in Virginia, and your app servers are in Virginia. But your users are in Tokyo, London, and São Paulo. Even with signals travelling through fibre at close to the speed of light, a round trip from Tokyo to Virginia takes roughly 150–200 ms in practice. No amount of in-memory caching can change the physics of distance.
A CDN solves this by placing cache nodes in dozens (or hundreds) of cities around the world. When a user in Tokyo requests your homepage image, Cloudflare's Tokyo edge node serves it from local storage in ~5 ms — the image never travels across the Pacific.
CDNs typically have three or four layers: your browser's local cache → the nearest edge node → an optional regional "origin shield" that further reduces origin load → the actual origin server. For a well-cached static asset, the origin might only get hit once per region per TTL window regardless of how many users request it.
Static Asset Caching
Images, JavaScript bundles, CSS files: these are the ideal CDN citizens. They change infrequently (or never, once fingerprinted) and can be served from edge with a one-year TTL. The HTTP header Cache-Control: public, max-age=31536000, immutable tells both the browser and CDN to cache aggressively. The immutable directive says "while this response is fresh, don't bother revalidating it, even when the user reloads the page; the URL itself will change when the content changes." This is why modern build tools add content hashes to filenames: app.a1b2c3.js instead of app.js. Change the file → change the URL → the old cache entries are simply never requested again and age out on their own.
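A build step that fingerprints filenames might look roughly like the sketch below. The fingerprint helper, the 8-character digest length, and the example paths are assumptions for illustration; real build tools choose their own hash and naming scheme.

```python
import hashlib
from pathlib import PurePosixPath

# Header value to send with fingerprinted assets (taken from the text above).
IMMUTABLE_CACHE_CONTROL = "public, max-age=31536000, immutable"


def fingerprint(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Return e.g. 'static/app.<hash>.js', where <hash> is derived from the content."""
    path = PurePosixPath(filename)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


v1 = fingerprint("static/app.js", b"console.log('v1');")
v2 = fingerprint("static/app.js", b"console.log('v2');")
```

Because the name is a pure function of the content, an unchanged file keeps its URL (and its cached copies) across deploys, while any edit produces a new URL.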
API Response Caching
CDNs can cache JSON API responses too, not just assets. A product listing endpoint that doesn't personalise its response can be cached at the edge for 5–60 seconds. This is powerful: instead of your app servers and Redis handling 50,000 req/s, the CDN handles most of it, and only genuine cache misses (roughly one request per URL per TTL window from each edge location) reach your origin. The key constraint is that the response must be the same for all users. Once personalisation enters (prices, recommendations, inventory held in cart), you must either exclude those parts from CDN caching or use ESI (Edge Side Includes), which lets the edge stitch cached and uncached fragments into one response.
Stale-While-Revalidate
The header Cache-Control: max-age=60, stale-while-revalidate=30 tells the CDN: "serve this for 60 seconds. After 60 seconds, keep serving the stale version for up to 30 more seconds while you fetch a fresh copy in the background." The user sees no latency spike — they get the slightly stale response instantly, and the refresh happens asynchronously. This is one of the most effective patterns for busy endpoints where a few seconds of staleness is acceptable. It eliminates the "cold TTL expiry → every user hits origin simultaneously" problem at the CDN layer.
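The decision a cache makes for each request under this header can be sketched as a small function. The name swr_state and its three return labels are made up for this example; they simply mirror the three phases described above.

```python
def swr_state(age_s: float, max_age_s: float, swr_s: float) -> str:
    """Classify a cached response by its age, per max-age and stale-while-revalidate."""
    if age_s < max_age_s:
        return "fresh"             # serve from cache, no origin contact
    if age_s < max_age_s + swr_s:
        return "stale-revalidate"  # serve the stale copy now, refresh in the background
    return "expired"               # too old: must fetch from origin before responding


# Cache-Control: max-age=60, stale-while-revalidate=30
states = [swr_state(age, 60, 30) for age in (10, 75, 95)]
```

Only the third case makes the user wait on the origin; on a busy endpoint the background refresh in the second phase usually means that case is never reached.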
Smart Purging
Rather than waiting for TTLs to expire, CDN providers offer purge APIs. When a product price changes, your backend calls something like POST /api/purge?url=/products/42 (the exact endpoint depends on the provider) or issues a tag-based purge: "purge all responses tagged product-42". This is the push side of cache invalidation applied at the CDN level: you trade the simplicity of TTL-only for real-time freshness when it matters. Cloudflare and Fastly can propagate purges globally in under a second. The operational overhead is writing the purge calls wherever your data changes; it's worth it for high-value or price-sensitive content.
Cache Stampede — When the Cache Expires and the DB Drowns
Picture a popular news article cached in Redis with a 60-second TTL. At any given moment, a thousand users are reading it. For most of that minute, Redis serves all thousand in ~1 ms. Then the 60-second clock hits zero and Redis expires the entry.
In the next few milliseconds, all thousand requests in flight hit the cache simultaneously — and all of them get a cache miss. All thousand go straight to the database to fetch the same article. The database, which was previously doing almost no work for this article, now has a thousand concurrent queries for identical data. This is a cache stampede (also called a thundering herd).
Four Mitigations
Singleflight / Mutex Lock
The core idea is simple: imagine 500 concurrent requests all miss the cache for the same key at the same instant. Instead of all 500 racing to the database, you elect one request to do the fetch — the rest wait, and when the first one finishes they all share the same fresh result. This "only one person flies the mission, the rest ride along" pattern is called singleflight.
In Go, the singleflight package (golang.org/x/sync/singleflight) does exactly this (the name literally means "one flight": one fetch goes out, many results come back). In Redis-based systems, you use a distributed lock: the first request to miss sets a lock key with SET article:42:fetching 1 NX EX 5 (the older SETNX command cannot set an expiry in the same call, which risks a lock that never clears if the holder crashes). Other requests see the lock and wait (or serve slightly stale data). When the first request finishes and populates the cache, the lock is released and waiters serve from cache. Net effect: N duplicate DB queries collapse into 1.
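For a single process, the same idea can be sketched in Python with a lock and an event. This is an analogue of the pattern, not a port of Go's package: the SingleFlight class, the load_article stand-in, and the 0.3-second simulated query are all assumptions for the demo, and it only deduplicates within one process (cross-server deduplication still needs the Redis lock described above).

```python
import threading
import time


class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> record shared by leader and followers

    def do(self, key, fn):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "result": None, "error": None}
                self._in_flight[key] = call

        if leader:
            try:
                call["result"] = fn()  # the one real fetch
            except Exception as exc:
                call["error"] = exc    # followers see the same failure
            finally:
                with self._lock:
                    del self._in_flight[key]
                call["done"].set()
        else:
            call["done"].wait()        # ride along with the leader

        if call["error"] is not None:
            raise call["error"]
        return call["result"]


db_calls = []

def load_article():
    db_calls.append("SELECT ...")  # stand-in for the real database query
    time.sleep(0.3)                # simulate a slow query
    return {"id": 42, "title": "Hello"}

flight = SingleFlight()
results = []

def handle_request():
    results.append(flight.do("article:42", load_article))

threads = [threading.Thread(target=handle_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All 20 simulated requests get the article, but the loader runs once because the other 19 arrive while the first fetch is still in flight.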
Probabilistic Early Expiration
Rather than letting the TTL hit zero and causing a stampede, start refreshing the entry slightly before it expires. The intuition: as the remaining TTL gets smaller, more and more requests "flip a coin" and decide to refresh in the background. When 5 seconds remain, maybe 1% of requests flip heads and refresh. When 1 second remains, maybe 20% do. By the time the entry would have officially expired, somebody has almost certainly already refreshed it, so no request ever sees a cold miss. The formula from the "Optimal Probabilistic Cache Stampede Prevention" paper is: refresh if currentTime - delta * beta * log(rand()) > expiryTime. Here delta is how long the value takes to recompute, beta is a tuning knob (1.0 by default; larger values refresh earlier), and rand() is uniform between 0 and 1, so log(rand()) is negative and the subtraction pushes the effective time forward by a random amount. In English: the closer you are to expiry, the more likely a random coin flip lands on "refresh now." The practical result: the cache refreshes itself gradually, with at most a handful of DB fetches instead of thousands.
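The refresh test from the formula translates almost directly into code. In this sketch the function name should_refresh_early and the injectable now and rand parameters are assumptions added so the behaviour can be checked deterministically; delta_s is the measured recompute time in seconds.

```python
import math
import random
import time


def should_refresh_early(expiry_ts, delta_s, beta=1.0, now=None, rand=random.random):
    """Return True if this request should refresh the entry before it expires."""
    now = time.time() if now is None else now
    r = rand()
    if r <= 0.0:
        # random.random() can return exactly 0.0; avoid log(0) and just refresh.
        return True
    # log(r) is negative, so this adds a random head start to the current time.
    return now - delta_s * beta * math.log(r) >= expiry_ts


# Entry expires at t=100 and takes about 1 second to recompute.
near_expiry_lucky = should_refresh_early(100, 1.0, now=99.5, rand=lambda: 0.5)
near_expiry_unlucky = should_refresh_early(100, 1.0, now=99.5, rand=lambda: 0.9)
far_from_expiry = should_refresh_early(100, 1.0, now=50.0, rand=lambda: 0.5)
already_expired = should_refresh_early(100, 1.0, now=101.0, rand=lambda: 0.9)
```

With half a second left, a draw of 0.5 triggers a refresh and a draw of 0.9 does not; fifty seconds out, the same draw of 0.5 does nothing, which is the "more likely as expiry approaches" behaviour in miniature.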
Cache Warming
Rather than letting hot keys expire cold, actively pre-populate the cache before expiry. A background job running every 50 seconds (for a 60-second TTL) re-fetches the hot keys and refreshes them, so they never actually expire from the cache perspective. This is simple and reliable for a small, known set of hot keys (homepage, top-10 products, trending articles). It doesn't scale to thousands of keys but is very effective for the most critical ones. Some teams also warm the cache during deploys to prevent stampedes after a restart wipes the in-process cache.
Jittered TTLs
If 1,000 items are all cached at the same moment (say, during a deploy that hydrates the cache from a batch job), they'll all expire at the same moment too. One stampede of 1,000 entries × 1,000 concurrent users = chaos. The fix is trivial: instead of TTL = 3600 for all entries, use TTL = 3600 + random(0, 600). Now expirations spread across a 10-minute window. The total DB load is the same — but instead of 1,000 entries expiring in one second, they expire 1–2 per second over 10 minutes. Your DB barely notices. This is the simplest stampede mitigation and should be default behaviour in any cache wrapper you write.
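The jitter itself is a one-liner. The helper name jittered_ttl is made up for this sketch; the base and jitter values are the ones from the paragraph above.

```python
import random


def jittered_ttl(base_s: int = 3600, max_jitter_s: int = 600) -> int:
    """Return the base TTL plus a random 0..max_jitter_s seconds."""
    return base_s + random.randint(0, max_jitter_s)


# 1,000 entries written in the same batch no longer share one expiry time.
ttls = [jittered_ttl() for _ in range(1000)]
```

Putting this inside the cache wrapper, rather than at each call site, is what makes it the default instead of something people have to remember.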
Cache Coherence — Stale Reads After Writes
In a Redis cluster set up for high availability, you typically have a primary (master) node that accepts writes and one or more replica nodes that copy those writes asynchronously. The word "asynchronously" is the source of all the trouble here.
Here's the timeline that bites teams: User updates their shipping address (write goes to primary). Your app immediately reads the address back (read goes to a replica for load balancing). The replica hasn't received the update yet — it lags by a few milliseconds. User sees their old address and thinks the update failed. They submit again. You now have a confused user and possibly a duplicate action.
Typical Redis replica lag in a healthy, well-networked cluster is under 10 ms. But under network partition, high write throughput, or a slow replica (disk flush), lag can spike to seconds or longer. Your application needs a strategy for when freshness matters.
Read From Primary After Write
The simplest solution: for any read that immediately follows a write — specifically "read-your-own-writes" scenarios — route the read to the primary instead of a replica. Your app can tag requests with a "consistency required" flag, or certain operations always route to primary by convention. The downside: you lose the load-balancing benefit of replicas for those reads, and the primary sees more traffic. Acceptable for low-frequency, high-importance operations (profile updates, payment confirmations) but not for every read in a high-read system.
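One way to implement the "consistency required" flag is to pin a user's reads to the primary for a short window after they write. The sketch below assumes that approach; the ReadRouter class, the one-second window, and the "primary"/"replica" labels are illustrative, and the state here is per process (with several app servers you would carry the timestamp in the session or a cookie instead).

```python
import time


class ReadRouter:
    """Route a user's reads to the primary for a short window after they write."""

    def __init__(self, pin_seconds: float = 1.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self._clock = clock
        self._pinned_until = {}  # user_id -> time until which reads go to primary

    def record_write(self, user_id) -> None:
        self._pinned_until[user_id] = self._clock() + self.pin_seconds

    def target_for_read(self, user_id) -> str:
        deadline = self._pinned_until.get(user_id)
        if deadline is not None and self._clock() < deadline:
            return "primary"
        self._pinned_until.pop(user_id, None)  # window over: forget the pin
        return "replica"


now = [0.0]  # fake clock so the example is deterministic
router = ReadRouter(pin_seconds=1.0, clock=lambda: now[0])

router.record_write("alice")
right_after_write = router.target_for_read("alice")
now[0] = 2.0
two_seconds_later = router.target_for_read("alice")
```

The window only needs to exceed typical replica lag, so the extra primary traffic stays limited to users who have just written something.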
Redis WAIT Command
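Redis provides a WAIT numreplicas timeout command that blocks until a given number of replicas have acknowledged all previous writes from the current connection, or until the timeout fires. For example: WAIT 1 100 means "wait until at least 1 replica has the data, or 100 ms, whichever comes first." This trades latency for a stronger assurance that the write has reached a replica. It does not make Redis strongly consistent, because replication is still asynchronous and a failover can still lose acknowledged writes in rare cases. It's useful for critical writes (financial transactions, address changes before showing confirmation), but it adds at least one replica round trip per write, and up to the full timeout when a replica is slow, so it should not be used for every operation in a high-throughput path.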
Pub/Sub Cache Invalidation
In multi-region deployments, each region has its own Redis cluster. When a write happens in US-East, it publishes an invalidation event to a cross-region message bus (Kafka, Redis Streams, or a managed pub/sub). All regional caches subscribe and delete or update the affected keys. Facebook's "Scaling Memcache at Facebook" paper describes a pipeline in the same spirit: daemons called mcsqueal tail the MySQL commit log and send the resulting deletes to memcached through mcrouter. It's operationally complex but is the right pattern when you need near-real-time coherence across geographically separated caches. The lag is now bounded by the pub/sub delivery latency, typically 50–200 ms cross-region.
Negative Caching — Cache the Absence, Not Just the Data
Here's a scenario most teams don't anticipate until it bites them in production. You build a user lookup endpoint: the client sends a username, you check Redis, on a miss you check PostgreSQL. Works great. Then someone runs a script that generates 500,000 random usernames and sends them all as lookup requests. Every single one misses Redis (they don't exist). Every single one hits PostgreSQL. Your DB melts under 500,000 "no such user" queries, even though there's nothing to return.
Negative caching is the simple idea: cache the fact that something doesn't exist. Store a sentinel value in Redis with a short TTL — say, username:xyz → "NULL" with a 5-second expiry. Now the next lookup for that username hits Redis and returns "not found" instantly, without touching the database.
Cache NULL with Short TTL
The simplest approach: when a DB query returns no rows, store a sentinel value in Redis — e.g., user:xyz → "NULL" — with a short TTL of 1–5 seconds. The next request for that key hits Redis, gets "NULL", and your app returns a 404 without touching the DB. The TTL must be short because if the item is later created, you don't want to serve a stale "not found" indefinitely. 5 seconds is a common choice: long enough to absorb a burst attack, short enough that legitimately created items show up quickly.
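Here is a minimal sketch of the sentinel approach, using an in-memory dict as a stand-in for Redis so it runs without a server. The CachedLookup class, the MISSING sentinel, the TTL values, and the fake clock are assumptions for illustration.

```python
import time

MISSING = "__NULL__"  # sentinel meaning "the database has no such row"


class CachedLookup:
    """Cache-aside lookup that also caches 'not found' for a short time."""

    def __init__(self, db_lookup, positive_ttl=300.0, negative_ttl=5.0,
                 clock=time.monotonic):
        self._db_lookup = db_lookup
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._cache = {}  # key -> (value or MISSING, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now < entry[1]:
            return None if entry[0] == MISSING else entry[0]  # hit (maybe negative)

        value = self._db_lookup(key)  # miss: ask the database
        if value is None:
            self._cache[key] = (MISSING, now + self._negative_ttl)
            return None
        self._cache[key] = (value, now + self._positive_ttl)
        return value


users = {"alice": {"id": 1}}
db_calls = []

def db_lookup(username):
    db_calls.append(username)  # record every query that reaches the "database"
    return users.get(username)

now = [0.0]  # fake clock so expiry can be stepped by hand
lookup = CachedLookup(db_lookup, clock=lambda: now[0])
lookup.get("ghost")
lookup.get("ghost")  # answered by the cached sentinel, no second DB query
```

The separate, much shorter TTL for the sentinel is the important design choice: real rows can stay cached for minutes, while "not found" is only trusted for a few seconds.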
Bloom Filter for Existence
Imagine a tiny, memory-efficient "membership checker" that can quickly tell you "this username definitely does NOT exist in the database" without doing a database query. That is a Bloom filter: a compact data structure you pre-populate with all existing usernames at startup. Before any DB query, check the Bloom filter first. If it says "definitely not in the set," skip the DB entirely; that answer is always safe. If it says "might be in the set," fall through and check the DB normally (Bloom filters can occasionally say "maybe" when the answer is actually "no," but they never say "no" when the answer is "yes"). A Bloom filter with 1 million items and a 1% maybe-rate uses about 1.2 MB of memory. It can't replace full negative caching: every newly created username has to be added to the filter as well or it will be wrongly rejected, and a standard Bloom filter cannot remove entries when users are deleted. It is still an excellent first line of defence against enumeration attacks.
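A small Bloom filter can be written with only the standard library. This sketch uses the textbook sizing formulas and derives its bit positions from one SHA-256 digest (double hashing); the BloomFilter class and its method names are assumptions for illustration, and a production system would more likely use an existing implementation.

```python
import hashlib
import math


class BloomFilter:
    """Probabilistic set: 'no' is always right, 'maybe' is occasionally wrong."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self._bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return ((h1 + i * h2) % self.num_bits for i in range(self.num_hashes))

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


existing = [f"user{i}" for i in range(10_000)]
bloom = BloomFilter(expected_items=len(existing), false_positive_rate=0.01)
for name in existing:
    bloom.add(name)

# Usernames that were never added: most should be rejected without a DB query.
false_positives = sum(bloom.might_contain(f"ghost{i}") for i in range(10_000))
```

In the lookup path, a False from might_contain means "return 404 immediately", and a True means "continue to Redis and the database as usual".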
Singleflight for Missing Keys
Negative cache entries can stampede too. If a popular non-existent key expires (TTL hits zero), a burst of requests will all miss and all hit the DB simultaneously — even though the DB will return nothing. Apply the same singleflight pattern from Section 10: use a Redis distributed lock or an in-process singleflight group to ensure only one request fetches from the DB for a given missing key. All others wait a few milliseconds and share the "NULL" result. The combination of short TTL + singleflight means you get fast protection against floods while keeping the cache fresh enough that newly-created items appear promptly.
Multi-Level Caching — Layers That Compound
Here's a counterintuitive idea: one cache is rarely enough. Production systems almost always layer multiple caches in front of the database, each one absorbing the traffic the previous layer couldn't handle. The reason is simple math — if your browser cache absorbs 50% of requests, your CDN only sees the other 50%. If the CDN absorbs 90% of that, Redis only sees 5% of the original traffic. By the time you reach the database, only a tiny fraction of requests actually need to go there.
Think of it like a concert entrance. The first check is "do you have a pre-printed ticket?" and most people do, so they walk straight in. The second is "are you on the guest list?" and some people are. The third is "can you pay at the door?" and that covers a few more. The ticket booth (database) only sees the people none of the earlier checks could handle. Each layer acts as a filter that reduces load on everything below it.
The Four Layers in Detail
Each layer has a different job. They aren't redundant — they're complementary. Here's what each one does and why it exists where it does.
L1: Browser / Client Cache
The browser cache is the fastest possible cache because it never leaves the user's machine — no network round-trip at all. It's controlled by the HTTP Cache-Control header you send from your server.
Use it for static assets that almost never change: JavaScript bundles, CSS, images, fonts. A well-configured Cache-Control: max-age=31536000, immutable means the browser won't ask your server for that file for a full year. For API responses, a short max-age=60 lets browsers avoid duplicate requests on page refreshes.
Why it matters: this is the only layer that requires zero infrastructure on your end. It's free capacity.
L2: CDN Edge
A CDN sits at the network edge, close to your users geographically. It absorbs requests before they even reach your origin infrastructure.
Best for: large static assets (videos, large images), and short-TTL API responses that are the same for all users (e.g., trending product lists). Immutable assets with content-addressed filenames (e.g., bundle.abc123.js) can be cached indefinitely.
Why it matters: CDN PoPs are distributed globally. A user in London gets sub-10ms response from a London PoP vs 120ms round-trip to your Virginia origin. Performance and cost both improve.
L3: App-Level Redis Cache
Redis sits inside your infrastructure, between your application code and your database. This is where you cache hot data: user sessions, frequently read records, expensive query results, aggregated metrics.
Unlike the browser and CDN, Redis is programmable — you control exactly what goes in, when it expires, and when it's explicitly invalidated. You can also use Redis for more advanced patterns like distributed locks and pub/sub messaging.
Why it matters: a DB query takes 5–50 ms; a Redis read takes under 1 ms. For hot data hit thousands of times per second, this gap is enormous.
L4: Database Buffer Pool
Even inside the database itself, there's a cache: the buffer pool. PostgreSQL calls it shared_buffers; MySQL/InnoDB calls it the InnoDB buffer pool. It caches recently accessed database pages (fixed-size chunks of table/index data, 8 KB in PostgreSQL and 16 KB by default in InnoDB) in RAM.
This means even if a query "misses" all your application-level caches and actually runs against the database, there's a good chance the rows it needs are already in the DB's own memory — no actual disk I/O required.
Why it matters: disk reads are orders of magnitude slower than RAM reads (roughly 1,000× for an SSD, far more for a spinning disk). The buffer pool ensures that only cold data, meaning pages that haven't been accessed recently, requires actual disk I/O.
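Multiply the miss rates of the four layers to see what reaches disk. With illustrative hit ratios of 50% (browser), 90% (CDN), 80% (Redis), and 90% (buffer pool), the miss rates are 0.5 × 0.1 × 0.2 × 0.1 = 0.001. Translation: only 0.1% of requests make it all the way to disk, an equivalent overall hit ratio of 99.9%, and only 1% (0.5 × 0.1 × 0.2) reach the database at all. Well-tuned multi-level caches routinely achieve 99%+ overall, delivering 50–100× reductions in database load. The exact percentages vary by workload, but the compounding principle is universal.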
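The compounding arithmetic is easy to replay with your own numbers. This sketch reads the four factors above as per-layer miss rates and assumes each layer's hit ratio is independent of the others; the layer names and ratios are illustrative, not measurements.

```python
from math import prod


def fraction_passing(hit_ratios) -> float:
    """Fraction of requests that miss every layer in the list."""
    return prod(1.0 - h for h in hit_ratios)


layers = {"browser": 0.5, "cdn": 0.9, "redis": 0.8, "buffer_pool": 0.9}

reaching_disk = fraction_passing(layers.values())
reaching_database = fraction_passing(
    [layers["browser"], layers["cdn"], layers["redis"]])
```

Swapping in measured hit ratios for each layer shows quickly which one is worth tuning: improving the weakest layer usually moves the final number the most.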
Cache Sizing & Hit Ratio — Finding the Right Size
One of the most common caching mistakes is guessing how big the cache should be. Too small and your cache miss rate stays high — you're constantly hitting the database anyway, so the cache barely helps. Too large and you're paying for RAM that holds data nobody reads. The right size is the one where adding more memory gives diminishing returns in hit ratio improvement.
There's a concept called the working set — the subset of your total data that actually gets accessed in a given time window. If your database has 500 GB of data but 90% of queries touch the same 10 GB of hot rows, your working set is ~10 GB. A cache just large enough to hold the working set will have a very high hit ratio. Caching beyond the working set adds cost but almost no benefit.
Four Sizing Rules That Actually Work
Measure Your Working Set First
Before picking a cache size, find out how much data is actually hot. Guessing leads to either over-provisioning (wasted money) or under-provisioning (cache that barely helps).
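In PostgreSQL, pg_stat_statements shows which queries run most often and how many blocks they read from cache versus disk. In Redis, INFO stats reports keyspace_hits and keyspace_misses, and the MONITOR command streams live commands so you can see key access patterns (MONITOR is expensive, so run it only briefly on a production instance). Most APM tools (Datadog, New Relic) can show you cache hit rates in real time.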
A practical starting point: size your cache to hold the data accessed in the last 24 hours at typical traffic. That usually covers 80–90% of your working set.
Target 90%+ Hit Ratio for Hot Caches
A hit ratio below 80% means your cache is barely helping — more than 1 in 5 requests still hits the database. You're paying for cache infrastructure without getting the full benefit.
90% is a reasonable minimum target for a hot application cache. At 90%, only 1 in 10 requests reaches the database — a 10× load reduction. Getting from 90% to 95% typically requires doubling cache size because you're now catching the long tail of less-frequently-accessed data. From 95% to 99% can require 5× more memory. Chase hit ratio improvements with awareness of those costs.
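The load-reduction figures follow directly from the miss rate, as this small sketch shows. The helper name db_load_reduction is made up for the example.

```python
def db_load_reduction(hit_ratio: float) -> float:
    """How many times fewer requests reach the database at a given hit ratio."""
    if not 0.0 <= hit_ratio < 1.0:
        raise ValueError("hit_ratio must be in [0, 1)")
    return 1.0 / (1.0 - hit_ratio)


reductions = {h: round(db_load_reduction(h), 1) for h in (0.80, 0.90, 0.95, 0.99)}
```

Going from 90% to 95% halves the traffic that reaches the database, which is why the extra memory can still be worth paying for on a database that is the bottleneck.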
Plan 30% Headroom for Traffic Spikes
If your working set is 10 GB under normal traffic, provision 13 GB. During a traffic spike, more unique data gets requested, expanding the effective working set. A cache that runs at 95% memory utilization during normal traffic will start evicting hot data during spikes, dropping hit ratios exactly when you need them most.
Redis tracks memory usage in real time (INFO memory). Set an alert at 75–80% utilization to give yourself time to resize before you hit the ceiling.
Watch for Data Shape Changes
Even if your cache is perfectly sized today, a product change can silently explode the working set. Adding a new dimension to your cache key is the classic trap. If you cache products:{id} and there are 100K products, your key space is 100K keys. If you add per-user personalization and change the key to products:{id}:{user_id} with 1M users, the potential key space is now 100 billion keys, which is impossible to cache.
Before adding a new dimension to a cache key, calculate the cardinality explosion. If the product of all dimensions exceeds available memory, you need a different strategy (e.g., cache only the base data and apply personalization in-app).
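The cardinality check is simple multiplication, so it is worth scripting before a key change ships. In this sketch the estimate_keyspace helper and the 1 KB average entry size are assumptions; plug in your own dimensions and measured entry size.

```python
from math import prod


def estimate_keyspace(dimension_sizes: dict, avg_entry_bytes: int):
    """Return (worst-case number of keys, worst-case bytes) for a cache key design."""
    keys = prod(dimension_sizes.values())
    return keys, keys * avg_entry_bytes


# products:{id} versus products:{id}:{user_id}, assuming ~1 KB per cached entry
base_keys, base_bytes = estimate_keyspace({"product": 100_000}, 1024)
keys, total_bytes = estimate_keyspace(
    {"product": 100_000, "user": 1_000_000}, 1024)
```

The first design needs on the order of 100 MB; the second is bounded only by 100 billion keys, roughly 100 TB at this entry size. This is an upper bound, since not every combination will be requested, but it shows the scale of the risk.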
Application-Level Caching Patterns
Beyond the five canonical cache strategies (cache-aside, read-through, etc.), there are several recurring patterns in application code that you'll encounter constantly. These are less about how the cache connects to the database and more about what gets cached and how the key is constructed. Understanding these patterns helps you recognize when a caching opportunity exists — even when it isn't a simple "cache this DB query" scenario.
Memoization
Memoization is the simplest form of caching: if a function is pure (same input always gives same output, no side effects), you can cache the result the first time and skip the computation on future calls with the same input. It's most valuable for expensive computations — machine learning inference, complex mathematical transformations, report generation.
The key insight is that memoization works at the function level rather than the data store level. You're caching the result of a computation, not a row from a table. This can be in-process (a dictionary in memory) or distributed (store results in Redis keyed by a hash of the inputs).
Query Result Caching
Instead of caching individual rows, you cache the full result set of a specific query. The cache key is typically a hash of the query string plus its parameters. This is great for expensive aggregation queries or complex JOINs that are run frequently with the same filters.
The invalidation challenge: if any row in the underlying tables changes, the cached result may be stale. A practical approach is to use a TTL short enough for your use case (e.g., 60 seconds for a dashboard metric that refreshes every minute) rather than trying to invalidate precisely on every write — which would be complex and often slower than just recomputing.
HTTP Response Caching
For API endpoints that return the same response for the same inputs (route + query params + relevant headers), you can cache the entire serialized HTTP response body. On a cache hit, you skip the entire request pipeline — no controller logic, no DB query, no serialization.
This is especially powerful for public, anonymous endpoints like product listings, blog posts, or search results. Be careful with personalized or authenticated endpoints — the cache key must include anything that affects the response (user ID, locale, feature flags) or you'll return the wrong data to the wrong user.
Computed View Caching
Sometimes the most expensive operations are not single queries but the assembly of multiple data sources into a view. A user's activity feed might require joining posts, likes, comments, friend relationships, and personalization scores. Rather than recomputing this assembly on every request, you pre-compute it and store the final result.
Database materialized views do this at the DB level. Application-level computed views store the assembled JSON blob in Redis. Trade-off: write complexity increases (you must invalidate or rebuild when source data changes), but reads become extremely fast.
Per-User Caching
Sessions, auth tokens, preferences, shopping cart contents, and personalized recommendations are all data tied to a specific user and accessed on almost every request. Storing these in Redis (keyed by user ID or session token) avoids a DB read on every authenticated API call.
The key consideration is cache key design: include enough to uniquely identify the user and context (user ID, tenant ID in multi-tenant systems) but no more. Including unnecessary dimensions explodes the key space and wastes memory. Also set a TTL based on session lifetime — there's no point keeping a cache entry for a user who hasn't been active in 30 days.
Code Examples
A Python-style memoization decorator. The function computes an expensive result on first call and stores it. On subsequent calls with the same arguments, it returns the stored result immediately — skipping the computation entirely.
import hashlib
import json
from functools import wraps

import redis

r = redis.Redis()


def memoize(ttl=300):
    """Cache a function's return value in Redis, keyed by its arguments."""
    def decorator(fn):
        @wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            # Build a cache key from the function name + all arguments.
            # Arguments and results must be JSON-serializable for this to work.
            raw = json.dumps({"fn": fn.__name__, "args": args, "kwargs": kwargs},
                             sort_keys=True)
            key = "memo:" + hashlib.sha256(raw.encode()).hexdigest()

            # Check cache first. r.get returns None when the key is absent.
            cached = r.get(key)
            if cached is not None:
                return json.loads(cached)  # ← cache HIT, no computation

            # Cache miss: run the expensive function
            result = fn(*args, **kwargs)

            # Store result with TTL so stale results expire automatically
            r.setex(key, ttl, json.dumps(result))
            return result  # ← cache MISS, computed fresh
        return wrapper
    return decorator


@memoize(ttl=600)
def compute_recommendation_score(user_id: int, product_id: int) -> float:
    # Expensive ML inference, skipped on cache hit.
    # ml_model is assumed to be defined elsewhere in the application.
    return ml_model.predict(user_id, product_id)
Cache the full result set of a database query. The key is a hash of the SQL and parameters — same query, same params, same cached result. A short TTL (30–60 s) handles invalidation simply without complex change-tracking logic.
import hashlib
import json

import redis

r = redis.Redis()


def cached_query(db, sql: str, params: tuple = (), ttl: int = 60):
    """Execute a SQL query and cache its result set in Redis."""
    # Hash the query + params to build a stable cache key
    raw = json.dumps({"sql": sql, "params": params}, sort_keys=True, default=str)
    key = "qcache:" + hashlib.sha256(raw.encode()).hexdigest()

    # Try cache first
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # ← served from Redis, DB never touched

    # Miss: run the actual query against PostgreSQL.
    # db is assumed to be a DB-API connection (psycopg2-style).
    with db.cursor() as cur:
        cur.execute(sql, params)
        columns = [d[0] for d in cur.description]
        rows = [dict(zip(columns, row)) for row in cur.fetchall()]

    # Store in Redis with TTL so it expires automatically. default=str keeps
    # Decimal and datetime columns serializable, at the cost of those values
    # coming back as strings on a cache hit.
    r.setex(key, ttl, json.dumps(rows, default=str))
    return rows


# Usage: same as a regular query, but most calls never reach the DB
products = cached_query(
    db,
    "SELECT id, name, price FROM products WHERE category = %s ORDER BY sales DESC LIMIT 20",
    params=("electronics",),
    ttl=60,
)
Cache the entire HTTP response body for an endpoint. On a hit, the response is returned directly from Redis without any controller logic running. The cache key includes route + query parameters to ensure different inputs map to different cached responses.
import hashlib
import json
from functools import wraps

import redis
from flask import Flask, jsonify, make_response, request

app = Flask(__name__)
r = redis.Redis()


def cache_response(ttl=60):
    """Decorator: cache the entire JSON response of a Flask route."""
    def decorator(view_fn):
        @wraps(view_fn)
        def wrapper(*args, **kwargs):
            # Build key from route path + sorted query string
            # NOTE: never include user-specific data in a shared response cache
            raw = json.dumps({
                "path": request.path,
                "args": sorted(request.args.items(multi=True)),
            })
            key = "http:" + hashlib.sha256(raw.encode()).hexdigest()

            # Cache HIT: return stored response immediately
            cached = r.get(key)
            if cached is not None:
                data = json.loads(cached)
                resp = make_response(json.dumps(data["body"]), data["status"])
                resp.mimetype = "application/json"
                resp.headers["X-Cache"] = "HIT"
                return resp

            # Cache MISS: run the actual view function
            resp = make_response(view_fn(*args, **kwargs))
            if resp.status_code == 200:  # don't cache error responses
                payload = json.dumps({"body": resp.get_json(),
                                      "status": resp.status_code})
                r.setex(key, ttl, payload)
            resp.headers["X-Cache"] = "MISS"
            return resp
        return wrapper
    return decorator


@app.route("/products")
@cache_response(ttl=30)  # full response cached for 30 seconds
def list_products():
    # This only runs on a cache miss, roughly once per 30 seconds per distinct URL.
    # db.query is a placeholder for your data-access layer.
    return jsonify(db.query("SELECT * FROM products WHERE active = true"))
Real-World Architectures — How Big Companies Actually Cache
Reading about caching strategies in the abstract is useful, but seeing how large companies implement them at scale reveals something important: at a certain size, caching stops being a library call and becomes its own distributed system with dedicated teams, custom hardware, and bespoke protocols. These four examples show the full spectrum from clever library-level caching to globally distributed edge caches.
Facebook TAO — Graph-Aware Caching
Facebook's social graph is a web of objects (users, posts, comments) and associations (friendships, likes, shares). Standard key-value caching doesn't map well to graph traversals — fetching a user's feed requires following dozens of edges. TAO (The Associations and Objects) is a purpose-built, graph-aware caching layer that sits in front of MySQL.
TAO understands the semantics of graph operations, not just key lookups. When a post gets a new like, TAO can invalidate exactly the cache entries that reference that association — rather than doing a broad TTL-based sweep. This graph-aware invalidation is what allows Facebook to serve roughly 1 billion reads per second across its data centers (at ~96% cache hit rate) while keeping MySQL CPU at manageable levels.
The lesson: at extreme scale, a generic cache may not be enough. Sometimes the access pattern of your data (graph, time-series, geospatial) justifies building or adopting a domain-specific caching layer.
Netflix EVCache — Geo-Replicated Memcached
Netflix serves video metadata (titles, descriptions, thumbnails, recommendation lists) to 200M+ subscribers globally. Fetching this from a central database for every play request would be impossible — the round-trip latency alone would be hundreds of milliseconds for users far from the origin region.
EVCache is Netflix's solution: a Memcached-based caching system that replicates cache writes across multiple AWS regions automatically. A write to the US-East cache is asynchronously propagated to US-West and EU caches. Reads are always served locally from the nearest region. The trade-off is eventual consistency — for a few hundred milliseconds after a write, different regions may serve slightly different versions of the data. For video metadata (where a title's description changing by one word isn't user-critical), this is completely acceptable.
Cloudflare Workers KV — Edge-First Storage
Cloudflare's Workers KV is designed for a specific problem: you have configuration data, feature flags, or static content that needs to be read from anywhere in the world with very low latency. The solution is to keep the data in a small number of central stores and cache frequently read keys at Cloudflare's 300+ Points of Presence globally.
Reads of hot keys are served from the local PoP with no origin round-trip, which makes them fast regardless of geography. The first read of a cold key at a given PoP is slower because it has to be fetched from a central store. Writes go to the central store and become visible elsewhere only as edge copies expire, which Cloudflare documents as taking up to 60 seconds or more. The trade-off is that KV is eventually consistent and optimized for read-heavy workloads where writes are infrequent. It's not suitable for data that changes frequently or requires strong consistency between concurrent writers.
Reddit — Tiered Hot / Cold Caching
Reddit's architecture illustrates a clean separation between hot and cold data. Subreddit listing pages (the front page, r/gaming, etc.) are hit millions of times per day with the same content. These hot listings are pre-computed and stored in Memcached. User timelines and post histories — accessed less frequently but needed for longer — are stored in Cassandra, which provides durable storage with fast reads. Canonical user and post data lives in PostgreSQL.
The interesting design point is the hot/cold boundary: Reddit doesn't try to cache everything in RAM. Instead, they classify data by access frequency and match it to the right store. Hot = Memcached (RAM). Warm = Cassandra (SSD). Cold = PostgreSQL (disk). This tiering lets Reddit serve massive read traffic without over-provisioning expensive in-memory stores for data that's rarely accessed.
At small scale, caching can be as simple as a SETEX call in your API handler. At Facebook or Netflix scale, caching becomes its own distributed system, with dedicated infrastructure teams, custom replication protocols, and graph-aware invalidation logic. The principles are the same; the implementation complexity grows by orders of magnitude.
Cache Anti-Patterns — Mistakes That Bite in Production
Caching is one of those areas where doing it almost right can be worse than not doing it at all. A cache without a clear invalidation strategy serves stale data. A TTL that's too short turns your cache into an expensive pass-through to the database. A cache stampede during peak traffic can take down your database at the exact moment you need it most. These six anti-patterns show up repeatedly in production, and knowing them lets you avoid the potholes before you step in them.
Caching Without an Invalidation Strategy
The most dangerous anti-pattern is treating the cache as a write-once store. You put data in, set no expiry, and never think about removing it. Three months later, users are seeing product prices from last quarter because no one ever purged the cache when prices were updated in the database.
Every cache entry needs a clear answer to: "what event causes this entry to become invalid?" If the answer is a time window ("stale after 60 seconds is fine"), use TTL. If the answer is a specific data write ("this entry must be removed when this user's profile is updated"), use explicit invalidation from your write path. If neither applies, ask whether this data should be cached at all.
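As a minimal sketch of the second case (explicit invalidation from the write path), the snippet below pairs a cache-aside read with a delete on update. The `TTLCache` class and the `db` dict are hypothetical in-memory stand-ins for Redis and the database, used only so the example runs on its own. The function names are illustrative as well.

```python
import time

class TTLCache:
    """Tiny in-memory stand-in for Redis (hypothetical helper, not a real client)."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazily drop expired entries
            return None
        return value

    def setex(self, key, ttl, value):
        self._data[key] = (value, time.monotonic() + ttl)

    def delete(self, key):
        self._data.pop(key, None)


cache = TTLCache()
db = {"user:1": {"name": "Ada"}}  # stand-in for the database

def get_profile(user_id):
    """Cache-aside read: the TTL is only a safety net."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    row = dict(db[key])
    cache.setex(key, 60, row)
    return row

def update_profile(user_id, **fields):
    """Write path: update the source of truth, then invalidate the entry."""
    key = f"user:{user_id}"
    db[key] = {**db[key], **fields}
    cache.delete(key)  # the next read refills from the database
```

Deleting the key rather than overwriting it keeps the write path simple: the next reader repopulates the entry from the database, so the cache never holds a value the database did not produce.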
Forgetting Negative Caching
Imagine your application receives a request for user ID 99999, which doesn't exist. Your app checks the cache — miss. Fetches from DB — not found, returns 404. Now someone sends 10,000 requests per second for non-existent user IDs (intentional or not). Each one is a cache miss, each one hits the DB. Your cache is completely bypassed for the exact traffic pattern that could take you down.
The fix is negative caching: when a lookup returns "not found," store a sentinel value in the cache (e.g., the string "__NOT_FOUND__") with a short TTL (30–60 seconds). Future requests for the same missing key are served from cache with a 404, without touching the database. This turns a potential DoS amplifier into a safely absorbed response.
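The flow above might look roughly like this. `DictCache` is a hypothetical in-memory stand-in for a Redis client (it records TTLs but does not enforce them), and `load_from_db` is an assumed callable that returns `None` when the row does not exist.

```python
NOT_FOUND = "__NOT_FOUND__"  # sentinel value, as in the text

class DictCache:
    """In-memory stand-in for a Redis client (TTLs are recorded, not enforced)."""
    def __init__(self):
        self.store, self.ttls = {}, {}

    def get(self, key):
        return self.store.get(key)

    def setex(self, key, ttl, value):
        self.store[key], self.ttls[key] = value, ttl


def get_user(user_id, cache, load_from_db, ttl=300, negative_ttl=30):
    """Return the user row, or None if the user is known not to exist."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached == NOT_FOUND:
        return None  # known-missing: answer 404 without a DB call
    if cached is not None:
        return cached
    row = load_from_db(user_id)
    if row is None:
        # Remember the miss briefly so repeated lookups stay off the DB
        cache.setex(key, negative_ttl, NOT_FOUND)
        return None
    cache.setex(key, ttl, row)
    return row
```

The negative entry uses a much shorter TTL than real rows so that a user created a moment later becomes visible quickly.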
TTL Too Short — The Expensive Lookup
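Setting a 5-second TTL on a cache entry that takes 50 ms to refill from the database means that entry is refilled 12 times per minute. That sounds cheap, but every expiry is also a chance for a stampede: if 1,000 users request the same data concurrently and each expiry lets all of them through, you could be hitting the database up to 12,000 times per minute for a single key. Spread that across many keys and the database sees a large share of the load it would see with no cache at all. You've added infrastructure complexity and cost without meaningful benefit.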
Before setting a TTL, ask: what is the acceptable staleness for this data? For a trending posts list, 60 seconds of staleness is invisible to users. For a balance in a financial system, 0 seconds (no caching, or write-through only) may be required. TTL should be derived from business tolerance, not set to a default.
Caching Non-Deterministic Data
If a function returns different results for the same input based on random numbers, the current time, or external state, caching its result is dangerous. You're locking in one random outcome for the duration of the TTL — every user who hits the cache sees that same "random" result, which defeats the purpose of randomness.
Common culprits: A/B test assignment logic that uses a random number per request (should be deterministic based on user ID instead), recommendations with a random shuffle (shuffle after cache retrieval, not before), and time-sensitive checks embedded in cached responses. Before caching a function's output, verify it is deterministic for a given input.
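For the shuffle case, one way to keep the cached value deterministic is to cache the ranked list and randomize a copy per request. The sketch below uses a plain dict as a stand-in for the shared cache, and the ranking itself is a made-up placeholder.

```python
import random

_cache = {}  # stand-in for a shared cache such as Redis

def ranked_recommendations(segment):
    """Deterministic for a given segment, so safe to cache (placeholder ranking)."""
    if segment not in _cache:
        _cache[segment] = [f"{segment}-item-{i}" for i in range(5)]
    return _cache[segment]

def recommendations_for_request(segment, rng=random):
    """Shuffle after retrieval so every request gets its own ordering."""
    ranked = ranked_recommendations(segment)
    # sample() returns a new shuffled list; the cached list is never mutated
    return rng.sample(ranked, k=len(ranked))
```

Passing a seeded `random.Random` instance as `rng` makes the per-request ordering reproducible in tests.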
Cache Stampede — The Thundering Herd
A popular cache entry expires. In the milliseconds before any one request can refill it, 500 concurrent requests all see a cache miss. All 500 hit the database simultaneously. The database — which was previously handling 50 queries/sec comfortably — suddenly receives 500 queries at once. Response times spike. The DB may become overwhelmed and start timing out. The cascading failures spread.
Two practical fixes: (1) Mutex/lock: when a miss occurs, the first request acquires a Redis lock and refills the cache; all other concurrent requests wait for the lock to be released and then read from the now-filled cache. (2) TTL jitter: instead of letting all entries of the same type expire at exactly the same time (e.g., all cached after the same deploy), add random jitter: TTL = base_ttl + random.randint(0, int(base_ttl * 0.2)). The int() matters because random.randint expects integer bounds. This spreads expirations out in time, preventing synchronized stampedes.
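The jitter half of the fix is small enough to show in full. This is a sketch in which the function name and the 20% default spread are illustrative choices rather than a standard API.

```python
import random

def jittered_ttl(base_ttl: int, spread: float = 0.2) -> int:
    """Return base_ttl plus a random extra of up to `spread` (20% by default)."""
    # randint needs integer bounds, so the upper bound is truncated with int()
    return base_ttl + random.randint(0, int(base_ttl * spread))
```

A caller would pass the result wherever the fixed TTL was used before, for example as the TTL argument of a SETEX call, so entries written in the same batch expire at slightly different times.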
TTL Too Long — Stale Data Lingers
On the opposite extreme from TTL-too-short, a TTL set to hours or days creates a long window where your cache serves incorrect data. A user updates their profile picture; the old picture persists in cache for 24 hours. A product goes out of stock; the "in stock" flag stays cached. An admin disables a user account for abuse; the session remains valid in cache.
Long TTLs are fine for genuinely static data (images, archived content, product descriptions that rarely change). For data that can change in response to user actions or operational events, TTL should be short enough that the staleness window is acceptable, and you should also build an explicit invalidation path that fires when the data actually changes — so you don't have to wait for the TTL to expire.
Observability & Monitoring — Seeing What Your Cache Is Doing
Here's a frustrating truth about caching: when it works, it's completely invisible. Your response times are fast, your database CPU is low, everything feels fine. But you can't tell if the cache is working well or if it's quietly broken — serving stale data, evicting hot keys, or heading toward a memory ceiling. You only find out when things go wrong. The fix is instrumenting a handful of key signals so you can see your cache's health before it becomes a crisis.
Think of these six metrics as the vital signs of your cache. Just like a doctor checking heart rate, blood pressure, and temperature, you want these numbers visible at a glance. Any one of them moving in the wrong direction is an early warning.
Hit Ratio — The Primary Health Signal
Hit ratio is hits / (hits + misses). It tells you what fraction of requests are being served from cache versus falling through to the database. A hit ratio of 94% means 94 in every 100 requests are served from cache; 6 hit the database.
Why it matters: a sudden drop in hit ratio is the earliest warning that something is wrong. Common causes: a new feature that generates cache keys with much higher cardinality (more unique keys → more misses), a deploy that cleared the cache (normal, but causes a temporary dip), or a traffic pattern change (e.g., a bot crawling random URLs with unique parameters that all miss). Every cache monitoring setup should have an alert for "hit ratio dropped below 80% for more than 5 minutes."
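As a rough sketch, hit ratio can be derived from the `keyspace_hits` and `keyspace_misses` counters that Redis reports in `INFO stats`. Those counters are cumulative since server start, so an alert usually wants the ratio over a window between two samples. The function names below are illustrative, and the dicts are assumed to have the shape returned by a client call such as redis-py's `info("stats")`.

```python
def hit_ratio(stats: dict) -> float:
    """Lifetime hit ratio: hits / (hits + misses)."""
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0

def window_hit_ratio(previous: dict, current: dict) -> float:
    """Hit ratio for the interval between two INFO stats samples."""
    delta = {
        "keyspace_hits": current["keyspace_hits"] - previous["keyspace_hits"],
        "keyspace_misses": current["keyspace_misses"] - previous["keyspace_misses"],
    }
    return hit_ratio(delta)
```

Sampling once a minute and alerting on the windowed value avoids the problem where months of healthy history hide a sudden drop in the lifetime ratio.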
Eviction Rate — Is Your Cache Too Small?
When Redis runs out of memory, it must remove existing entries to make room for new ones — this is called eviction. A healthy cache has near-zero eviction rate. A cache with a high eviction rate is too small for its working set — it's constantly pushing out data that will be needed again soon, creating a cycle of misses that hammer the database.
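Track eviction rate in Redis with INFO stats → evicted_keys. This is a cumulative counter, so the rate is the difference between two samples divided by the interval. A rising eviction rate while hit ratio falls is a clear signal to increase cache memory. In Redis, set maxmemory-policy allkeys-lru so that, once maxmemory is reached, Redis evicts the least recently used keys (approximated by sampling). The default policy, noeviction, rejects writes with an error when memory is full, and the random policies tend to give worse hit ratios for the same memory.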
Latency p99 — Cache Should Be Fast
One of the main reasons you use a cache is to reduce latency. If your cache reads are slow, you're paying the infrastructure cost without getting the speed benefit. A healthy Redis read should be under 1 ms on the same network. Under 5 ms is acceptable. Consistent p99 latency above 5 ms indicates a problem.
Common causes of slow cache reads: Redis memory pressure (when Redis is using swap, reads are disk-speed, not RAM-speed), large values (serializing/deserializing a 1 MB JSON blob takes real time — split large objects into smaller keys), network congestion between app servers and Redis, and Redis CPU saturation from heavy Lua scripts or large SCAN operations running concurrently with reads.
Memory Utilization — Give Yourself Room
Redis performs best with 50–75% memory utilization. Below 50%, you may be over-provisioned (wasting money). Above 80%, you're cutting it close — a traffic spike will push you into evictions. Above 95%, Redis may start swapping to disk (on systems without maxmemory set), which destroys latency.
Set a maxmemory limit in Redis so it uses your configured eviction policy rather than growing unbounded. Alert at 75–80% utilization. When you receive the alert, you have time to evaluate: is this a temporary spike (do nothing), a trend (provision more memory), or a code change that added high-cardinality keys (fix the key design)? You want to make that decision proactively, not reactively when the cache falls over.
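The thresholds above can be turned into a small check over the `used_memory` and `maxmemory` fields that Redis reports in `INFO memory`. This is only a sketch: the function name, the status labels, and the default thresholds are assumptions drawn from the numbers in this section.

```python
def memory_status(info: dict, warn_at: float = 0.75, critical_at: float = 0.95) -> str:
    """Classify memory pressure from the fields of Redis INFO memory."""
    max_memory = info.get("maxmemory", 0)
    if max_memory == 0:
        # maxmemory unset: Redis can grow until the operating system swaps
        return "no-limit"
    utilization = info["used_memory"] / max_memory
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warn"
    return "ok"
```

Treating an unset `maxmemory` as its own status makes the missing limit visible instead of silently reporting a healthy cache.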
Stampede Count — How Often Does the Herd Run?
A stampede count tracks how often multiple concurrent requests miss the same cache key at the same time. This metric requires some custom instrumentation — Redis itself doesn't expose it directly. A common pattern is to track it in your application: increment a counter whenever a cache miss triggers a database fetch while another request for the same key is already in-flight.
Any non-zero stampede rate is worth investigating. A stampede count that spikes periodically often correlates with your TTL pattern: if all entries of a type were populated at the same time (e.g., after a cache warm-up), they all expire at the same time. Adding TTL jitter (randomizing expiry times by ±10–20%) is usually enough to eliminate the pattern, and it is a one-line change at the point where the TTL is set.
Network Errors — The Silent Cascading Failure
This metric is often overlooked until it causes an incident. If your application can't reach Redis — due to a network partition, Redis restart, or connection pool exhaustion — every request will either fail or fall through to the database. If the database can't handle the full load without the cache, you're now in a cascading failure: Redis is down, DB is overwhelmed, your entire service degrades.
Protection strategy: implement a circuit breaker on your Redis client. After N consecutive connection failures, stop trying to reach Redis for a cooldown period and fall back gracefully (go direct to DB, or return a degraded response). Alert as soon as connection errors to Redis rise above zero; in a healthy system this count stays at zero.
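A minimal circuit breaker along those lines could look like the sketch below. The class name, the defaults, and the injectable `clock` are illustrative assumptions. The example catches Python's built-in `ConnectionError`; with redis-py you would catch `redis.exceptions.ConnectionError` instead.

```python
import time

class CircuitBreaker:
    """Skip the cache for `cooldown` seconds after `max_failures` failures in a row."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def call(self, cache_fn, fallback_fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback_fn()    # open: do not touch the cache at all
            self.opened_at = None       # cooldown over: allow one trial call
        try:
            result = cache_fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # (re)open the breaker
            return fallback_fn()
        self.failures = 0               # any success closes the breaker fully
        return result
```

Because the failure count is only reset on success, a single failed trial call after the cooldown reopens the breaker immediately instead of waiting for N more failures.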
Tools & Platforms — The Cache Ecosystem in Plain English
Knowing why to cache and how each strategy works is the hard part. Choosing the right tool is mostly a matter of matching the tool's strengths to your use case. There are really only a handful of players worth knowing — each one occupies a distinct position in the stack.
Redis
Redis is the most widely deployed distributed cache in the world — and for good reason. It stores everything in RAM, so reads and writes take under 1 ms. But what makes Redis special is its rich data structures: strings, lists, sorted sets, hashes, bitmaps, and streams. These let you do more than just "store a blob and get it back." You can, for example, increment a counter atomically, pop the oldest item from a queue, or retrieve the top-10 scores from a leaderboard — all without a round-trip to your database. Redis also supports optional persistence (so a cache restart does not always mean a cold cache) and pub/sub messaging. The downside: it is a network hop away from your app, and cluster setup adds operational overhead.
Best for: distributed caches, session stores, leaderboards, rate limiters, job queues.
Memcached
Memcached is the stripped-down sibling of Redis. It does one thing: store a blob by key and retrieve it by key. There are no lists, no sorted sets, no pub/sub — just pure key-value. Why does it exist if Redis does everything Memcached does plus more? Because stripping everything out makes Memcached slightly faster and easier to scale horizontally — there is no replication state to manage, and the memory overhead per key is lower. At Facebook scale, those margins matter. For most teams, Redis is the right default; Memcached is a viable choice when you have a pure "store and retrieve blob" workload at extremely high throughput and want the simplest possible operational footprint.
Best for: pure string/blob caching at extreme throughput where Redis's extra features are not needed.
Caffeine (Java)
Caffeine is a Java in-process cache library that lives inside the same JVM as your application code. Because it uses the same heap memory as your app, there is zero network overhead: a cache lookup is a hash-map lookup in RAM, which typically takes nanoseconds to a few microseconds, far less than even a local network round-trip. Caffeine implements a sophisticated eviction algorithm called Window TinyLFU (W-TinyLFU) that achieves higher hit ratios than plain LRU by tracking access frequency alongside recency. It handles size-based eviction, time-based expiration, and asynchronous loading. The catch: because it lives in your process, every app server has its own copy of the cache, so cache state is not shared between servers. This is fine for very hot, read-heavy, rarely-changing data (like a config object), but wrong for user-session data that different servers might need to read.
Best for: JVM applications, in-process L1 cache in front of Redis, config and reference data.
Hazelcast
Hazelcast is a distributed in-memory data grid — a fancy way of saying it is a cache that automatically partitions and replicates data across a cluster of nodes, and the nodes can live inside the same JVM as your app (embedded mode) or as a separate cluster (client-server mode). Unlike Redis, Hazelcast is Java-native and integrates directly with the JDK's Map interface, which makes it feel natural to Java developers. It also supports distributed locks, queues, and topics. Hazelcast is a strong choice when you want in-process caching but need shared state across app servers without running a separate Redis cluster — it merges both layers into one. The trade-off: more complex than Redis to operate and tune, and less community tooling.
Best for: Java microservices needing distributed shared state without a separate cache tier.
AWS ElastiCache / GCP Memorystore
These are managed Redis (and Memcached) services from the major cloud providers. The core technology is identical to open-source Redis — you get the same data structures, the same commands, the same performance characteristics. What you are paying for is operational automation: the cloud handles cluster provisioning, automatic failover, patching, backups, and monitoring dashboards. If you are already on AWS or GCP, using ElastiCache or Memorystore typically takes an afternoon to set up versus days for a self-managed Redis cluster. The cost per GB is higher than running Redis on EC2 yourself, but for most teams the operational savings far outweigh the premium. The key thing to know: ElastiCache "Redis OSS" and ElastiCache "Serverless" behave slightly differently under high load — benchmark before committing at scale.
Best for: teams on AWS/GCP who want Redis without the ops burden of self-managing a cluster.
CDN Providers
A CDN (Content Delivery Network) is a cache that sits between your servers and your users, distributed across 200+ edge locations around the planet. When a user in Tokyo requests your homepage, the CDN serves it from a node in Tokyo rather than from your origin server in Virginia — shaving off 100–300 ms of round-trip latency just from geography. CDNs cache static assets (images, CSS, JS, fonts) automatically and can also cache API responses if you configure the right Cache-Control headers. Major players: Cloudflare (most popular, free tier), Fastly (highly programmable, used by GitHub and Stripe), Amazon CloudFront (deep AWS integration), Bunny.net (cost-effective for storage-heavy use cases). CDN caching is the single highest-leverage cache for public-facing web apps because it absorbs traffic before it ever reaches your infrastructure.
Best for: static assets, cacheable API responses, any public-facing content where geography adds latency.
Common Misconceptions — What People Get Wrong
Caching sounds simple — store data closer to the reader, fetch it fast. But in practice, the same half-dozen misconceptions show up on engineering teams again and again. Each one looks plausible on the surface and quietly causes bugs, cost overruns, or production incidents.
The most common mental shortcut is to treat "caching" and "Redis" as the same thing. It leads teams to bolt Redis onto their stack without a strategy and then wonder why they still have cache-related bugs. Caching is not a product you install; it is a multi-layer strategy that spans HTTP headers (Cache-Control), in-process memory (Caffeine, .NET MemoryCache), distributed stores (Redis, Memcached), and CDN edge caches. Redis is one tool that fits one layer. Getting caching right means deciding: which layer should hold what data? What is the TTL at each layer? How does an update propagate through all layers? Adding Redis without answering those questions is like installing a fire suppression system in only one room of a building and calling it "fire protection."
Hit ratio is a key health metric — but chasing it blindly leads to wasteful over-allocation. The relationship between cache size and hit ratio follows a curve of diminishing returns. Going from a 90% to 95% hit ratio might require doubling your cache memory. Going from 95% to 99% might require 5× the memory. At some point you are paying $10,000/month in extra RAM to avoid a handful of database queries per second that the database could handle trivially. The right question is: "What is the marginal cost of an additional cache miss (in DB load and latency) vs. the marginal cost of the RAM to prevent it?" When the DB can comfortably absorb the remaining misses, the money is better spent elsewhere.
TTL is a simple and effective tool, but it has one unavoidable property: within the TTL window, you are serving stale data. If you set a 5-minute TTL on a product price and a flash sale changes that price at 12:00:01, users who hit a cached entry will see the old price until 12:05:01. For a discount coupon, that might be acceptable. For a financial transaction, it might not be. TTL-only consistency is eventually consistent with a known maximum lag equal to the TTL. It is the right tool for data where staleness within that window is acceptable — but it is not a complete consistency solution for data that must be accurate at the moment of a write.
Phil Karlton's famous quote — "there are only two hard things in computer science: cache invalidation and naming things" — is often cited as proof that you should just use long TTLs and not try to invalidate explicitly. That is a misreading. Cache invalidation is hard, not impossible. The tools exist: CDC (Change Data Capture) reads your database's write log and publishes change events — your cache consumers can delete or update the relevant key the moment data changes. Event-driven invalidation via a pub/sub bus (Kafka, Redis pub/sub) achieves the same result. The complexity is real, but it is manageable and well-documented. The alternative — accepting permanent staleness — is often worse than tackling invalidation properly from the start.
Different data has different shapes, different freshness requirements, and different access patterns. A user session entry should disappear soon after the user goes idle or logs out, so it needs a short, sliding TTL (renewed on each request). A product catalog entry changes rarely but is read millions of times, so it needs a long TTL and explicit invalidation on update. An analytics aggregate is recomputed every hour, so it needs scheduled refresh-ahead. Stuffing all three into one cache with one eviction policy is a recipe for either serving stale sessions (security bug) or aggressively evicting catalog data (defeating the point). The right approach is to treat each data class individually: decide the appropriate strategy, TTL, and eviction policy for each, and if needed, isolate them with separate Redis instances (maxmemory and the eviction policy are configured per instance, not per logical database) or at least with distinct key namespaces.
Cache performance depends almost entirely on working set fit — whether the data your application actually accesses frequently fits in the cache. A 100 GB cache full of keys that are each read once a month will have a terrible hit ratio. A 1 GB cache that holds exactly the 50,000 hot keys that absorb 95% of traffic will have an excellent hit ratio. More memory does not fix a bad key design, a missing expiration strategy, or cache pollution from low-frequency items. Before adding memory, profile your cache's keyspace: which keys are accessed most? Are low-value keys evicting high-value ones? Are you caching data that nobody reads? Optimizing key selection and TTLs almost always yields better returns than simply adding RAM.
Real-World Disasters — When Caching Goes Wrong
Theory is easy. Production is where caching shows its fangs. The five incidents below are all real patterns — adapted from public post-mortems and engineering blog posts. Each one has a clear root cause, a concrete lesson, and a fix you can apply right now.
A popular e-commerce homepage cached its featured products list with a 5-minute TTL. On Black Friday, that TTL fired at peak traffic: 1,000 concurrent requests all got a cache miss simultaneously and launched 1,000 parallel queries against the same database table. The database CPU hit 100% in under 2 seconds. Response times climbed from 80 ms to 12 seconds. The site was effectively down for 4 minutes.
Root cause: All 1,000 threads simultaneously saw a miss and all immediately raced to refill the cache. No coordination.
Lesson: Use singleflight (deduplicate concurrent misses so only one request goes to DB) plus jittered TTL (add ±10–15% randomness to expiry times so all copies do not expire simultaneously).
An online retailer updated a product's price in Postgres. The application code used cache-aside with a 24-hour TTL. Nobody added explicit cache invalidation logic on price update because "the TTL will handle it eventually." Every request served from the cache kept showing the old price for up to 24 hours. For a limited flash sale that ran for 2 hours, this meant customers saw the discounted price well after it had expired, costing the company thousands in margin.
Root cause: TTL-only strategy with no explicit invalidation on write. Long TTL made the window of staleness unacceptable for price-sensitive data.
Lesson: Classify data by freshness tolerance. For pricing data — shorten TTL (5 minutes max) AND add explicit cache invalidation on every price update event. Consider never caching price data at all without a consistency guarantee.
A team upgraded their Redis client library. A subtle bug in the new version caused it to stop setting TTLs on new keys. The keys accumulated silently. Over 48 hours, Redis memory grew from 4 GB to its 8 GB maxmemory limit. Once the limit was reached, the eviction policy kicked in unexpectedly and Redis started evicting live session keys, logging out thousands of users simultaneously.
Root cause: No monitoring on Redis memory growth rate and no alert on eviction activity. The bug went undetected until it caused a user-facing incident.
Lesson: Monitor Redis memory utilization continuously. Alert on rapid growth rate (not just absolute size). Alert on non-zero eviction counts — unexpected evictions almost always indicate either a memory leak or a working-set size mismatch. Test client library upgrades with TTL inspection before rolling to production.
During a production deploy, an engineer ran a CDN cache purge to force fresh content. Due to a misconfigured wildcard pattern, the purge cleared all cached objects — including ones that were supposed to remain cached for another hour. Suddenly 100% of CDN traffic became cache misses and hit the origin. The origin autoscaling group was sized for 5% miss rate, not 100%. It took 11 minutes to scale up enough capacity; during that time the site returned 503s.
Root cause: No canary purge process (purge a small % of edge nodes first and monitor origin load). No blast radius limit on purge wildcards.
Lesson: Never use global CDN purge wildcards without a staged rollout. Purge a single PoP, monitor origin RPS and error rate for 60 seconds, then proceed. Size your origin for burst traffic at 2× expected peak miss rate — not steady-state.
A fintech application cached account balance lookups in Redis with a 60-second TTL to reduce database load. A user transferred money out of their account and immediately tried to place an order. The order service read the balance from Redis — which still showed the pre-transfer balance. The order was approved. The account went negative. Because this happened for dozens of users in a 60-second window during a peak period, the company processed orders worth more than available funds before the TTL expired.
Root cause: Eventually-consistent cache used for a strongly-consistent critical data path (financial authorization).
Lesson: Never cache strongly-consistent financial data without coherence guarantees. For balance reads during a transaction, always read from the authoritative source (the database, or a strongly-consistent replica). Cache is appropriate for display purposes (the balance number shown in the UI between transactions) but not for authorization decisions where stale reads have real financial consequences.
Performance & Best Practices — Rules That Hold Up in Production
These eight practices are not opinions — they are patterns that appear over and over in engineering post-mortems, conference talks, and production playbooks. Each one exists because someone learned the hard way what happens when you ignore it.
Cache Hot Data Only — Measure First
The instinct to cache everything is wrong. Caching has real costs: memory, operational complexity, staleness risk. The right approach is to identify your actual "hot working set" (the keys that absorb 80–90% of your traffic) and cache only those. How? Add a read counter to your application layer, or, if Redis runs with an LFU eviction policy, sample access frequency with OBJECT FREQ or redis-cli --hotkeys. Keyspace notifications are a poor fit here because they fire on writes and expirations, not on reads. In most systems, 80% of traffic hits less than 20% of the data (the Pareto principle applies to data access patterns). Caching the other 80% of data that absorbs only 20% of traffic wastes memory and pollutes your eviction queue. Profile before caching.
Pick TTL Based on Freshness Need + DB Capacity
TTL is not a magic number: it encodes a trade-off between data freshness and database load. The first input is the question "How many seconds of staleness is acceptable for this specific piece of data?" For a promotional banner, maybe 300 seconds. For a user's real-time balance, maybe 0 (do not cache at all). The second input is your database's capacity: if your DB can handle 5,000 QPS comfortably and you have 500,000 hot keys, a TTL of 100 seconds means each key is refetched from the DB roughly every 100 seconds, so about 500,000 / 100 = 5,000 DB reads per second just for cache misses. That alone would use up the entire comfortable budget, so this workload needs a longer TTL or explicit invalidation. The rule of thumb is: DB load from misses ≈ hot keys / TTL. Use the math to choose a defensible number, not intuition.
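The arithmetic in the example reduces to hot keys divided by TTL, which can also be inverted to find the shortest TTL that fits a given database budget. The helper names below are hypothetical. The model assumes every hot key is requested at least once per TTL window and is refilled exactly once per expiry (no stampedes).

```python
def miss_qps(hot_keys: int, ttl_seconds: float) -> float:
    """Approximate DB reads per second caused by TTL expiry alone."""
    return hot_keys / ttl_seconds

def min_ttl_for_budget(hot_keys: int, db_budget_qps: float) -> float:
    """Shortest TTL (seconds) that keeps refill traffic within the DB budget."""
    return hot_keys / db_budget_qps
```

With the numbers from the paragraph, `miss_qps(500_000, 100)` gives 5,000 reads per second. Reserving only 1,000 QPS of the database for refills would call for a TTL of at least `min_ttl_for_budget(500_000, 1_000)`, which is 500 seconds.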
Singleflight + Jitter — Prevent Stampedes
Stampedes (see Section 21) are one of the most common cache-related outages. Two tools prevent them. Singleflight means: when multiple concurrent threads/goroutines/requests all miss the same key, only one is allowed to fetch from the DB; the others wait and share the single result. In Go, this is the singleflight package. In Java, it is a ConcurrentHashMap of pending futures. Jitter means adding ±10–20% random offset to your TTLs at write time so that a batch of entries written at the same time do not all expire simultaneously. These two techniques together reduce stampede probability to near zero, and they are straightforward to implement.
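Python has no singleflight in its standard library, so the sketch below is a small hand-rolled version using threads, intended only to illustrate the idea. The class names are made up, and it deduplicates calls within one process only. Coordinating across several app servers would still need something like a distributed lock.

```python
import threading

class _Call:
    """Bookkeeping for one in-flight fetch."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None

class SingleFlight:
    """Let one caller per key run fn(); concurrent callers share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> _Call currently in flight

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()      # only the leader hits the database
            except BaseException as exc:
                call.error = exc        # followers see the same failure
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()         # wake up everyone who was waiting
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

In a cache-aside read path, the miss branch would call something like `flight.do(key, lambda: load_and_fill(key))`, where `load_and_fill` is your own function that queries the database and writes the cache.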
Negative Caching — Cache "Not Found" Too
Negative caching means storing a "this key does not exist" marker in the cache rather than just returning a miss. Why? Without it, every request for a non-existent key goes straight to the database. An attacker who queries thousands of non-existent user IDs can trivially DoS your database with misses — this is called a cache penetration attack. The fix: when the database returns "not found," store a small sentinel value (e.g., "__nil__") in Redis with a short TTL (30–60 seconds). Subsequent requests for the same key hit the cache, get the sentinel, and return 404 without touching the database. Use a short TTL on negatives so that when the object is eventually created, the cache heals quickly.
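The flow described above might look like the following sketch. To keep it runnable without a server, `TTLCache` is a tiny in-memory stand-in for Redis, and `CountingDB` is a hypothetical database that counts lookups; the sentinel value and the short negative TTL come from the text.

```python
import time

NOT_FOUND = "__nil__"  # sentinel stored for keys the DB says do not exist


class TTLCache:
    """In-memory stand-in for a cache with per-key expiry (assumption, not Redis)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)


class CountingDB(dict):
    """Hypothetical DB: a dict that counts how often it is queried."""

    lookups = 0

    def get(self, key):
        self.lookups += 1
        return super().get(key)


def get_user(user_id, cache, db, positive_ttl=300, negative_ttl=30):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached == NOT_FOUND:
        return None  # known-missing: answer without touching the DB
    if cached is not None:
        return cached
    row = db.get(user_id)
    if row is None:
        cache.set(key, NOT_FOUND, negative_ttl)  # short TTL so the cache heals
        return None
    cache.set(key, row, positive_ttl)
    return row


db = CountingDB({1: {"name": "Ada"}})
cache = TTLCache()
first = get_user(999, cache, db)   # miss: goes to the DB, stores the sentinel
second = get_user(999, cache, db)  # served from the negative entry
print(first, second, db.lookups)   # None None 1
```

A caller would translate the `None` result into a 404. Repeated probes for the same missing ID cost one DB lookup per negative TTL window instead of one per request.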
Multi-Level Caching — Stack Your Layers
The highest-performing systems stack multiple cache layers: in-process (Caffeine/MemoryCache) as L1 — zero network overhead, sub-millisecond; distributed Redis as L2 — shared across servers, ~1 ms; CDN as L3 — absorbs traffic before it reaches your data center, ~5 ms from edge but saves origin latency entirely. The layering means that even when L1 misses (rare request, or just started up), L2 absorbs it. Only true misses on L2 hit the origin — and those are rare enough that the origin can handle them comfortably. The complexity trade-off: multi-level means you have three places to invalidate. This is manageable with an event-driven invalidation bus that fans out a delete to all levels simultaneously.
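The lookup and fan-out invalidation can be sketched as below. Plain dicts stand in for the in-process cache (L1) and the distributed cache (L2), with no TTLs or eviction, and the CDN layer is left out; `get_multilevel`, `invalidate`, and `load_from_origin` are hypothetical names.

```python
def get_multilevel(key, l1, l2, load_from_origin):
    """Look up key in L1, then L2, then the origin; return (value, source)."""
    if key in l1:
        return l1[key], "L1"
    if key in l2:
        l1[key] = l2[key]  # promote so the next read on this server hits L1
        return l2[key], "L2"
    value = load_from_origin(key)
    l2[key] = value
    l1[key] = value
    return value, "origin"


def invalidate(key, *levels):
    """Fan a delete out to every cache level."""
    for level in levels:
        level.pop(key, None)


origin = {"price:42": 100}
l1, l2 = {}, {}

print(get_multilevel("price:42", l1, l2, origin.__getitem__))  # (100, 'origin')
print(get_multilevel("price:42", l1, l2, origin.__getitem__))  # (100, 'L1')

l1.clear()  # simulate a freshly started app server with a cold L1
print(get_multilevel("price:42", l1, l2, origin.__getitem__))  # (100, 'L2')

origin["price:42"] = 120
invalidate("price:42", l1, l2)
print(get_multilevel("price:42", l1, l2, origin.__getitem__))  # (120, 'origin')
```

In a real deployment each app server has its own L1, so the `invalidate` step would need to reach every server, which is what the invalidation bus is for.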
Monitor Hit Ratio + Eviction Rate Continuously
A cache that is not monitored is a silent source of production issues. Two metrics matter most. Hit ratio: what percentage of cache lookups result in a hit? Healthy caches run at 90%+ for hot workloads; a sudden drop signals either a keyspace change, a deploy that changed key names, or a memory constraint causing unexpected evictions. Eviction rate: how many keys per second is Redis discarding to make room? Zero evictions is normal for a cache sized to hold its working set. Sudden spikes in eviction indicate either a memory leak (keys accumulating without TTL) or a working-set size that exceeds your allocated memory. Alert on a drop of 5% or more in hit ratio from baseline, and alert on any non-zero eviction on production caches that should not be evicting. Between them, these two alerts catch a large share of cache production issues.
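One way to compute both metrics is from two snapshots of Redis INFO counters (`keyspace_hits`, `keyspace_misses`, `evicted_keys`). The sketch below takes the snapshots as plain dicts so it runs without a server. The function name, the sample numbers, and the reading of "5% drop" as 5 percentage points below baseline are assumptions.

```python
def cache_health(prev, curr, interval_seconds, baseline_hit_ratio, max_drop=0.05):
    """Compare two counter snapshots; return (hit_ratio, evictions_per_sec, alerts)."""
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    lookups = hits + misses
    hit_ratio = hits / lookups if lookups else None  # None: no traffic in window
    evictions_per_sec = (curr["evicted_keys"] - prev["evicted_keys"]) / interval_seconds

    alerts = []
    if hit_ratio is not None and baseline_hit_ratio - hit_ratio >= max_drop:
        alerts.append("hit_ratio_drop")
    if evictions_per_sec > 0:
        alerts.append("evictions")
    return hit_ratio, evictions_per_sec, alerts


# Hypothetical snapshots taken 60 seconds apart.
prev = {"keyspace_hits": 1000, "keyspace_misses": 100, "evicted_keys": 0}
curr = {"keyspace_hits": 1800, "keyspace_misses": 300, "evicted_keys": 120}

print(cache_health(prev, curr, interval_seconds=60, baseline_hit_ratio=0.92))
# (0.8, 2.0, ['hit_ratio_drop', 'evictions'])
```

Working from deltas between snapshots matters because the Redis counters are cumulative since startup; a ratio computed from the raw totals would hide a recent drop.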
Plan Invalidation Strategy Upfront
The most common cache technical debt: teams add caching as an optimization without designing an invalidation strategy, because "the TTL will handle it." This works until the data changes faster than the TTL — and then it does not. The right time to design invalidation is when you design the cache, not after you discover stale data in production. Ask: "What event causes this data to change? Who owns the write? How quickly does the read path need to see the update?" If the answer is "within the TTL window, staleness is acceptable," TTL-only is fine. If the answer is "immediately," plan explicit invalidation: the write path deletes the cache key, or publishes a "key invalidated" event. Do this upfront — retrofitting invalidation into a busy codebase is painful and error-prone.
Never Cache Critical Data Without Coherence
The financial app incident in Section 21 is the canonical example of this rule. Some data — account balances, inventory counts, access control permissions, payment authorizations — must be accurate at the moment of use. Caching these values with any form of eventual consistency (TTL-only, lazy invalidation) introduces a window where the cached value is wrong and a business decision is made on wrong data. The rule: before caching any piece of data, ask "what is the worst-case cost of acting on a stale copy of this value?" If the answer involves money, security, or user safety, the cache must either be bypassed for authoritative reads or use a coherence mechanism (read-your-writes guarantees, cache invalidation on every write, or strong consistency via a distributed lock) that eliminates the staleness window.
FAQ — Questions Everyone Asks About Caching
These questions come up in every system design interview, every architecture review, and every Slack thread that starts with "should we add a cache?" Each answer is designed to give you a clear, defensible position — not a vague "it depends," and not a textbook definition.
Default to Redis. Use Memcached only if you have a specific reason. Redis does everything Memcached does (pure key-value caching) plus a lot more: rich data structures (sorted sets for leaderboards, lists for queues, streams and pub/sub for messaging), optional persistence so a restart does not cold-start your cache, built-in replication and cluster mode, and Lua scripting for atomic operations. Memcached can be slightly faster and use slightly less memory per key for pure string values, which is a real difference at Facebook scale and meaningless for most applications. The concrete decision: if you need leaderboards, queues, session storage, real-time counters, or pub/sub alongside caching, use Redis. If you have a single use case (cache these blobs by key, nothing else) and are handling millions of QPS where every byte and microsecond counts, benchmark Memcached. For almost everyone, the extra capability of Redis is worth the tiny overhead.
TTL is a business decision disguised as a technical number. Start by asking: "What is the maximum staleness that is acceptable for this data, from the user's perspective?" A navigation menu: 5 minutes. Product price: 1 minute or less. User profile: 60 seconds. Real-time balance: 0 (do not cache). Once you have the freshness requirement, check it against your database's capacity. The number of users reading a key does not set the miss rate; the number of keys and the TTL do. With 100,000 hot keys and a 60-second TTL, roughly 1,667 keys expire and are refetched every second (100,000 ÷ 60), assuming concurrent misses on the same key are collapsed into one fetch. Can your database handle those misses? If yes, the TTL is fine. If no, either increase the TTL (accept more staleness) or add explicit invalidation on write (so the TTL can be longer without serving stale data). Always add ±10–20% jitter to the final number to avoid stampedes when many entries expire at once.
Big enough to hold your hot working set with 30% headroom. The working set is the set of keys that absorb 80–90% of your traffic — profile this from your application access logs or use Redis OBJECT FREQ (LFU policy) to find frequently accessed keys. Once you know the working set size, add 30% headroom for burst traffic (new users, trending items) and for growth over the next 3–6 months before the next capacity review. As a rough starting heuristic: if each cached object is on average 1 KB and you have 500,000 hot keys, your working set is ~500 MB. Allocate ~650 MB. If you see eviction rates climbing despite objects having valid TTLs, your working set has grown beyond the allocated memory and you need to either increase allocation or prune low-value keys from the cache.
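The sizing heuristic is simple enough to check in a few lines. This sketch uses decimal units (1 KB = 1,000 bytes, to match the round numbers in the text) and ignores per-key overhead in the cache server, which in practice adds to the total; the function name is illustrative.

```python
def cache_allocation_bytes(hot_keys, avg_object_bytes, headroom_pct=30):
    """Return (working_set_bytes, allocation_bytes) with integer headroom."""
    working_set = hot_keys * avg_object_bytes
    allocation = working_set * (100 + headroom_pct) // 100
    return working_set, allocation


working_set, allocation = cache_allocation_bytes(hot_keys=500_000, avg_object_bytes=1_000)
print(working_set // 1_000_000, "MB working set")  # 500 MB working set
print(allocation // 1_000_000, "MB allocated")     # 650 MB allocated
```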
The right strategy depends on how quickly stale data causes harm. Three approaches: (1) TTL-only — simplest, eventual consistency with known lag. Correct for data where staleness within the TTL window is acceptable (e.g., navigation menus, recommendation lists, reference data). (2) Explicit invalidation on write — the service that writes the data also deletes (or updates) the cache key immediately. Correct for data that must be fresh within seconds (pricing, inventory levels). (3) Event-driven invalidation via CDC — Debezium reads the database WAL and publishes a "key changed" event to Kafka; cache consumers subscribe and delete the key. Most robust for microservice architectures where the writer and the cache owner are different services. Combine (1) and (2) for defense-in-depth: use short TTL as a fallback even when you have explicit invalidation, so a bug in the invalidation path does not cause permanent stale data.
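Combining approaches (1) and (2) can be sketched as follows. Dicts stand in for the cache and the database, and the clock is passed in as `now` (seconds) so the behavior is easy to follow; `read_price`, `write_price`, and the 60-second fallback TTL are assumptions for illustration.

```python
def read_price(product_id, cache, db, now, fallback_ttl=60):
    """Cache-aside read; entries are (value, expires_at) tuples."""
    key = f"price:{product_id}"
    entry = cache.get(key)
    if entry is not None and entry[1] > now:
        return entry[0]
    value = db[product_id]
    cache[key] = (value, now + fallback_ttl)  # TTL is the safety net
    return value


def write_price(product_id, price, cache, db):
    """Update the source of truth first, then delete the cached copy."""
    db[product_id] = price
    cache.pop(f"price:{product_id}", None)


db, cache = {"sku-1": 100}, {}

print(read_price("sku-1", cache, db, now=0))   # 100 (miss, filled from DB)
write_price("sku-1", 120, cache, db)
print(read_price("sku-1", cache, db, now=1))   # 120 (invalidated on write)

# Simulate a buggy writer that skips invalidation:
db["sku-1"] = 150
print(read_price("sku-1", cache, db, now=30))  # 120 (stale, still inside TTL)
print(read_price("sku-1", cache, db, now=62))  # 150 (TTL expired, cache heals)
```

The last two reads show the defense-in-depth point: when the invalidation path fails, staleness is bounded by the fallback TTL rather than lasting indefinitely.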
Redis with TTL is the standard session caching pattern, and it works well for a straightforward reason: sessions are naturally key-value (session ID → session data), the TTL maps directly to session timeout (e.g., 30 minutes of inactivity), and Redis's optional persistence, when enabled, lets sessions survive a Redis restart. The typical setup: on login, write session:{uuid} → serialized session object with EXPIRE 1800 (30 minutes). On each authenticated request, do GET session:{uuid}. If the key exists, extend the TTL (EXPIRE 1800 again); if it does not exist, redirect to login. On logout, explicitly delete the key. For high availability, Redis Cluster or Redis Sentinel with replicas handles failover so users do not lose sessions on a single node failure, although replication is asynchronous and the most recent writes can be lost. Neither is a multi-region solution on its own; spanning regions needs a separate cross-region replication setup. For extremely high session volumes, Memcached is a reasonable alternative since sessions are plain blobs with simple expiry.
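The login, sliding-expiry, and logout steps can be sketched with an in-memory stand-in for Redis. The `SessionStore` class is hypothetical, and the clock is passed in as `now` (seconds) so expiry is easy to trace; with a real Redis client the same steps would map to SET with an expiry, GET followed by EXPIRE, and DEL.

```python
import uuid

SESSION_TTL = 1800  # 30 minutes of inactivity, as in the text


class SessionStore:
    """In-memory stand-in for Redis session storage (assumption, not a client)."""

    def __init__(self):
        self._data = {}  # "session:{id}" -> (payload, expires_at)

    def login(self, payload, now):
        session_id = str(uuid.uuid4())
        self._data[f"session:{session_id}"] = (payload, now + SESSION_TTL)
        return session_id

    def authenticate(self, session_id, now):
        key = f"session:{session_id}"
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            self._data.pop(key, None)
            return None  # caller redirects to login
        # Sliding expiry: every authenticated request resets the timeout.
        self._data[key] = (entry[0], now + SESSION_TTL)
        return entry[0]

    def logout(self, session_id):
        self._data.pop(f"session:{session_id}", None)


store = SessionStore()
sid = store.login({"user": "ada"}, now=0)

print(store.authenticate(sid, now=1000))  # {'user': 'ada'}; expiry moves to 2800
print(store.authenticate(sid, now=2500))  # {'user': 'ada'}; expiry moves to 4300
print(store.authenticate(sid, now=4300))  # None: 30 idle minutes have passed
```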
A CDN helps whenever two conditions are true: (1) the response is the same for many users (it is "public" data, not personalized), and (2) geography adds meaningful latency (your users are distributed globally or nationally). Static assets — JavaScript bundles, CSS, images, fonts — tick both boxes and should always go through a CDN. API responses also benefit if they satisfy (1): a product catalog page is the same for all users; a user's order history is not. You can cache the former at the CDN with Cache-Control: public, max-age=300. The latter should never be CDN-cached. A CDN also helps even if your users are regional but your origin is in a single cloud region — the CDN absorbs repeated identical requests so your origin only serves the first request per cache lifetime rather than every request. For content-heavy sites, a CDN is often the highest-ROI infrastructure investment — typically absorbing 80–95% of traffic at pennies per GB.
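The public-versus-personalized rule can be expressed as a small header helper. This is a sketch: the function name is hypothetical, and "private, no-store" is one conservative choice for keeping personalized responses out of shared caches, not the only valid one.

```python
def cache_control(is_personalized, max_age=300):
    """Pick a Cache-Control header value for a response."""
    if is_personalized:
        # Never let a CDN or other shared cache store per-user data.
        return "private, no-store"
    return f"public, max-age={max_age}"


print(cache_control(is_personalized=False))  # public, max-age=300  (product catalog)
print(cache_control(is_personalized=True))   # private, no-store    (order history)
```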
Yes at the highest scale, but only if the added complexity is justified. In-process cache (L1) has zero network overhead: a hit is a local memory lookup, typically microseconds or less, versus 0.5–2 ms for a Redis round trip. For very hot keys that are read thousands of times per second per server, that difference is real. But in-process caches are per-server: each app server has its own copy, and they can diverge. If you have 20 app servers and a price changes, you need to invalidate 20 in-process caches. The overhead of coordinating L1 invalidation is what makes teams skip it. The practical rule: start with Redis-only (L2). Add in-process caching (L1) only when you can measure that Redis latency is a bottleneck, typically when you have very hot keys (millions of reads/second for the same data) and are already at Redis's throughput limits. For the vast majority of applications, L2 Redis alone is sufficient.
Targets vary by layer and use case. CDN: 85–99% hit ratio for static assets; 50–80% for cacheable API responses (lower because personalized requests mix in). Distributed cache (Redis): 90%+ is healthy; 95%+ is excellent; below 80% means your cache is either too small for the working set or caching data that is accessed only once (which actively hurts eviction performance). In-process cache (Caffeine): 70–90% is typical because L1 is smaller; the point of L1 is to intercept the hottest requests so L2 and DB see less load. Monitor these ratios continuously and alert on drops of 5%+ from your established baseline — a sudden drop almost always indicates a deploy that changed key names, a working-set size change, or a memory constraint causing unexpected eviction of hot keys.