Caching Levels — System Guide

Section 1

TL;DR — Caching Levels in Plain English

Why every modern system has not one cache but six — stacked from the CPU die all the way out to the CDN edge — and why each layer exists for a different physical reason
The latency hierarchy: L1 cache at ~1 ns, DRAM at ~80 ns, Redis at ~0.5 ms, database cold read at 5–15 ms, CDN at 5–30 ms — and how to reason about where each layer fits
The compounding hit-ratio math: why four cache layers at 90% hit ratio each reduce downstream traffic to 0.01% of the original, not 60%
Which layer is right for which data — working sets, hot data, cold data, personalized vs shared — and the costs of putting data in the wrong layer
How to trace a real request through all six layers and calculate end-to-end latency under cold vs warm conditions

Every real system has six cache layers sitting between a user request and a disk. CPU L1/L2/L3 caches run at nanosecond speed and are invisible to your code. The OS page cache holds your hot files in RAM automatically. Your app-process cache (Caffeine, a Python dict) keeps computed results in the process's own memory. Redis or Memcached stores shared state across every server. The database buffer pool hides most disk reads. CDN edge nodes serve the whole page before the request ever reaches your data center. A request that hits every layer warm costs 20 ms. A request that misses every layer costs 600 ms. That 30× gap is decided entirely by how well you understand — and tune — these six layers.

Between a user's browser and your disk there are exactly six cache layers, each one progressively larger, slower, and cheaper per byte. At the top sit CPU L1/L2/L3 caches — kilobytes to megabytes of memory baked onto the processor die, running at 1–30 ns, completely invisible to application code. Below that, the OS page cache stores recently-read file data in ordinary RAM and is managed by the kernel — your database reads usually land here first. Next comes the app-process cache: a Java Caffeine cache, a Python dict, or any in-memory map that lives inside your application server's heap. Then the distributed cache — Redis or Memcached — a separate tier shared across all your servers, reached over the network in 0.5–2 ms. The database buffer pool (Postgres's shared_buffers, MySQL's InnoDB buffer pool) caches raw database pages in RAM on the DB server before they would need a disk read. And finally the CDN edge, with servers distributed worldwide near users, caching full HTTP responses 5–30 ms from the browser. Every layer absorbs as many requests as it can; only the misses fall to the next layer.

The layers form a pyramid because physics dictates a strict trade-off: speed, size, and cost per byte are locked in a triangle — you can only have two. CPU L1 cache is the fastest cache you can build (~1 ns, a handful of clock cycles) but is tiny (32–64 KB) because SRAM cells are expensive. The OS page cache is slower (~100 ns) but uses cheap DRAM and can be gigabytes. Redis is accessible to every server but pays a network round-trip (0.5–2 ms) while holding terabytes across a cluster. A CDN distributes globally across petabytes of storage but each response pays a geographic round-trip (5–50 ms). The pyramid isn't a design choice — it's an expression of physics. Understanding it tells you exactly which data should live where: hot, tiny, shared-nothing data → CPU caches; hot, large, cross-server data → distributed cache; hot, global, static data → CDN.

The most underappreciated insight in caching is that hit ratios multiply across layers; they don't add. If your app cache hits 90% of requests, the remaining 10% reach Redis. If Redis hits 90% of those, only 1% reach the database. If the database buffer pool hits 90% of those, a mere 0.1% ever touch disk. Put a fourth 90% layer anywhere in that chain and the disk sees just 0.01% of original traffic. This compounding is why every layer matters and why even a mediocre 50% hit ratio at a layer can be worth adding: it cuts downstream load in half without touching the layers below. End-to-end average latency follows the same compounding: most requests are served fast from high layers, dragging the average down even if slow cold-miss requests still exist.

Every modern system has six cache layers from CPU to CDN. Each layer is 5–100× slower and 5–100× larger than the layer above it. Hit ratios compound multiplicatively — four layers at 90% each mean the disk sees 0.01% of traffic. Understanding this hierarchy lets you place data in the right layer, tune hit ratios that matter most, and dramatically reduce latency and database load.

Section 2

Why You Need This — The Layered-Speed Story

Caching is one of those topics where you can learn each technique in isolation — Redis for session storage, CDN for static files, Caffeine for in-process lookups — without ever seeing the full picture. The full picture is this: all six layers are active simultaneously for every production request, and the difference between a well-tuned stack and a poorly-tuned one is measured in hundreds of milliseconds and in database servers you don't have to buy.

The Cold-Cache Horror Story: A Checkout Flow in Trouble

Picture an e-commerce checkout page. A user clicks "Buy Now" and their browser fires a request. Here is what happens on a system where nobody has thought carefully about caching layers — every cache is cold or misconfigured:

Browser cache miss — the CDN URL has no Cache-Control header, so the browser always re-fetches. +0 ms saved.
CDN miss — the checkout HTML is marked Cache-Control: no-store out of excess caution. Every request hits the origin. +30 ms network to origin.
Nginx reverse proxy — no proxy-side cache configured. Request passes through. +2 ms routing.
Redis miss — the user's cart, session, and pricing data are not in Redis (TTL expired). +1 ms network to Redis, then fall through to DB.
Database query — five SQL joins to assemble cart, shipping options, and product details. The query itself takes 60 ms on the primary. +60 ms.
Response assembly + serialization — another 15 ms server-side. +15 ms.
Response travels back — another 30 ms network to browser. +30 ms.

Total: roughly 138 ms per checkout page load — before accounting for the three round trips a browser needs for TLS and TCP. At p99 (with DB query variability), this regularly spikes to 800 ms. The team's load tests show they need eight database replicas to handle Black Friday traffic.

The Warm-Cache Story: Same Team, Three Months Later

The team audits each cache layer and makes targeted fixes. Here is the same request after the optimization:

Browser cache hit — static assets (JS bundles, CSS, images) have content-hash URLs and a one-year Cache-Control: max-age=31536000, immutable header. Browser serves them without a network request. ~0 ms for assets.
CDN edge hit — the shared parts of the checkout HTML (header, footer, product thumbnails) are now cached at the CDN with a 60-second TTL and surrogate-key invalidation on price changes. Served from the CDN edge in 8 ms. +8 ms for the shell.
Redis hit — user cart, pricing rules, and shipping options are cached in Redis with a 5-minute TTL. Redis answers in 0.6 ms. +0.6 ms.
Database buffer pool hit — the small remaining DB read (user's saved address) lands in Postgres shared_buffers and returns in 0.3 ms — no disk touch. +0.3 ms.
Response assembly + network back — 5 ms + 8 ms. +13 ms.

Total: roughly 22 ms per checkout page load. p99 is under 80 ms. And because the DB now sees only the cache misses (roughly 2% of requests after layered caching), the load test reveals they can handle Black Friday on two replicas instead of eight — a difference worth tens of thousands of dollars per month in infrastructure cost.
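The two totals are plain sums of the per-step figures, so they are easy to check mechanically. The following is a minimal Python sketch that uses only the numbers quoted in the cold and warm walkthroughs above; the dictionary keys are illustrative labels, not names from any real system.

```python
# Per-step latencies in milliseconds, copied from the two walkthroughs above.
cold_path_ms = {
    "cdn_miss_to_origin": 30,
    "nginx_routing": 2,
    "redis_miss": 1,
    "db_query": 60,
    "assembly_and_serialization": 15,
    "response_to_browser": 30,
}
warm_path_ms = {
    "cdn_edge_hit": 8,
    "redis_hit": 0.6,
    "buffer_pool_hit": 0.3,
    "assembly": 5,
    "response_to_browser": 8,
}

cold_total = sum(cold_path_ms.values())
warm_total = sum(warm_path_ms.values())
speedup = cold_total / warm_total

print(f"cold: {cold_total} ms, warm: {warm_total:.1f} ms, speedup: {speedup:.1f}x")
```

Keeping a budget like this next to real measurements makes it obvious which step dominates: in the cold path, the database query and the two origin round-trips account for 120 of the 138 ms.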

The key lesson: No single cache layer produced the 6× speedup and 4× DB reduction. It was each layer, in sequence, absorbing its slice of traffic. The browser handles assets. The CDN handles shared HTML. Redis handles computed state. The buffer pool handles DB reads. Together, they transform the system. Individually, none of them alone comes close.

The waterfall makes the contrast visceral. In the cold path, the majority of wall-clock time is spent in two places: the CDN round-trip (because caching was disabled) and the database disk I/O (because five joins bypassed every cache). In the warm path, those two expensive phases don't happen. The request returns from shallow layers — CDN and Redis — and the system is 6× faster with 4× fewer database servers needed. The six-layer hierarchy isn't academic. It is the architectural blueprint for every latency improvement your system will ever achieve.

The checkout flow above demonstrates the layered-cache effect clearly: every cache miss cascades to the next layer, and the most expensive layers (DB disk, network round-trips) dominate wall-clock time. Warming each cache layer systematically transformed a system with roughly 138 ms typical and 800 ms p99 page loads into one with roughly 22 ms typical and 80 ms p99, and cut database replica needs from eight to two. No single layer achieved this; it required the full stack.

Section 3

Mental Model — The Six-Layer Cache Pyramid

Before diving into each layer individually, you need one picture that holds the whole hierarchy in your head. Here it is: imagine a pyramid. At the very top — the smallest, fastest, most expensive tier — sit your CPU caches. At the very bottom — the largest, slowest, cheapest tier — sits the CDN serving billions of users globally. A lookup starts at the top of the pyramid, in the layer closest to the CPU. If it misses, it falls to the next layer. And the next. Each fall costs progressively more time.

The pyramid shape is not arbitrary — it reflects a hard physical reality. As you move from top to bottom: capacity grows (KB → GB → TB → PB), latency grows (1 ns → 50 ms), and cost per byte drops (SRAM at $10/MB → DRAM at $0.005/MB → SSD at $0.0001/MB → CDN storage at fractions of a cent). You cannot get all three wins at once. Speed costs space. Space costs money. This triangle defines where each layer sits.

Reading the pyramid: each layer you descend pays a latency penalty and gains capacity. CPU L1 is baked onto the processor die itself — it is the fastest cache in the machine, but it holds only 32–64 KB per core. By the time you reach the CDN edge, you are talking petabytes across thousands of servers worldwide, but each cache hit takes 5–50 ms depending on how far the user is from the nearest PoP. The pyramid is not a hierarchy of importance — all six layers run simultaneously in every production request. It is a hierarchy of speed and cost.

The most counterintuitive thing about this pyramid is who controls each layer. Layers 1–2 (CPU caches, OS page cache) are controlled entirely by the hardware and kernel — your application code cannot directly address them. Layer 3 (app-process cache) you control explicitly in your application code. Layer 4 (distributed cache) is a separate service you operate and configure. Layer 5 (DB buffer pool) is controlled by your database configuration — a single parameter like Postgres's shared_buffers or MySQL's innodb_buffer_pool_size can double or halve the number of disk reads your database generates. Layer 6 (CDN) is controlled by HTTP headers your application sends. Each layer has a different control knob, and knowing which knob to turn for which problem is the core skill this page builds.

A note on layer 5 (DB buffer pool): Many engineers treat their database as a black box. But the database has its own internal cache — the buffer pool — and it is often the single highest-leverage tuning knob in the entire system. By default, Postgres allocates only 128 MB for shared_buffers. On a server with 32 GB of RAM, setting it to 8 GB can eliminate 80–90% of disk I/O entirely. That is a single config line that can outperform a week of application-level caching work.

The six cache layers form a pyramid where capacity grows, latency grows, and cost per byte drops as you move from CPU caches (top) to CDN edges (bottom). Each layer is controlled differently — hardware/kernel for layers 1–2, application code for layer 3, separate services for layer 4, database config for layer 5, HTTP headers for layer 6. The pyramid is the mental model for deciding where each piece of data belongs and which knob to turn to fix a performance problem.

Section 4

Core Concepts — The Vocabulary

Caching has its own jargon, and a lot of the terms look similar but mean subtly different things. We'll warm up with the plain-English idea first, then introduce the name — so when you bump into these in docs, code reviews, or interviews, you already have a mental hook for what each one means and why anyone cares.

The Fundamentals

When your program asks for a piece of data and it is already sitting in the cache, ready to go — that is called a cache hit. When the cache doesn't have it and has to go fetch it from the layer below — that is a cache miss.

The fraction of requests that find data in the cache is the hit ratio. This is the single most important number in cache performance. A 90% hit ratio means only 10% of traffic reaches the next layer; a 50% hit ratio means half of all traffic falls through — very different outcomes. The flip side is the miss penalty — the extra time you pay every time the cache doesn't deliver.

The data your system accesses most often right now is called the working set. If your cache is large enough to hold the entire working set, hit ratios will be high and stable. If the working set is larger than your cache, items get evicted before they can be reused — a condition called cache thrashing.
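To see how sharp the edge between "fits" and "thrashes" can be, here is a small simulation. It is a sketch under stated assumptions: the cache uses least-recently-used (LRU) eviction, the workload scans its keys in a fixed cycle, and `LRUCache` and `run_cyclic_workload` are names invented for this example. Real workloads are less regular, so the drop is usually less abrupt.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        value = f"value-for-{key}"  # stand-in for a fetch from the layer below
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def run_cyclic_workload(cache_capacity, working_set_size, passes=10):
    """Scan keys 0..working_set_size-1 in order, `passes` times; return the hit ratio."""
    cache = LRUCache(cache_capacity)
    for _ in range(passes):
        for key in range(working_set_size):
            cache.get(key)
    return cache.hit_ratio()


fits = run_cyclic_workload(cache_capacity=100, working_set_size=100)
thrashes = run_cyclic_workload(cache_capacity=100, working_set_size=101)
print(f"working set fits: {fits:.0%}, one key too many: {thrashes:.0%}")
# prints "working set fits: 90%, one key too many: 0%"
```

With 100 slots and 100 keys, only the first pass misses. With 101 keys, each key is evicted just before it is needed again, so every access misses.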

Write Policies

When data changes, you have three strategies for keeping the cache in sync with the source. If you write to the cache AND the backing store simultaneously — guaranteeing they match at all times — that is write-through. It is safe and consistent, but every write still pays the cost of hitting the backing store.

If you write only to the backing store and skip the cache entirely (forcing the next read to re-fetch) — that is write-around. Good for data that is written once and not immediately re-read — it prevents polluting the cache with cold data.

If you write only to the cache and lazily flush to the backing store later — that is write-back (or write-behind). It is fastest because the caller doesn't wait for the DB, but it introduces risk: if the cache crashes before flushing, data is lost.
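The three policies differ only in which copy gets updated at write time. The sketch below models both the cache and the backing store as plain Python dicts; the class and method names are illustrative assumptions, and a real system would add eviction, error handling, and concurrency control.

```python
class WritePolicyDemo:
    """Toy cache plus backing store showing the three write policies."""

    def __init__(self):
        self.cache = {}
        self.backing_store = {}
        self.dirty_keys = set()  # written to the cache but not yet flushed (write-back only)

    def write_through(self, key, value):
        # Update both copies before returning, so they never disagree.
        self.backing_store[key] = value
        self.cache[key] = value

    def write_around(self, key, value):
        # Update only the backing store and drop any stale cached copy.
        self.backing_store[key] = value
        self.cache.pop(key, None)

    def write_back(self, key, value):
        # Update only the cache now; the backing store catches up on flush().
        self.cache[key] = value
        self.dirty_keys.add(key)

    def flush(self):
        # Persist every dirty entry. Anything still dirty when the cache is lost is gone.
        for key in self.dirty_keys:
            self.backing_store[key] = self.cache[key]
        self.dirty_keys.clear()

    def read(self, key):
        # Read path shared by all three policies: cache first, then backing store.
        if key in self.cache:
            return self.cache[key]
        value = self.backing_store[key]
        self.cache[key] = value
        return value


demo = WritePolicyDemo()
demo.write_through("price", 10)   # in both the cache and the store
demo.write_around("audit_log", 1) # in the store only
demo.write_back("counter", 7)     # in the cache only, until flush()
```

Between `write_back` and `flush`, the backing store has no record of `"counter"`. That window is the data-loss risk described above.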

Cache Locality

The reason CPU caches are so effective is that programs naturally exhibit locality — they tend to access the same data repeatedly, and data near recently-accessed data. When a program uses the same variable or memory address over and over — think a loop counter — that is temporal locality. When a program accesses sequentially adjacent memory locations — think iterating an array — that is spatial locality. Both principles explain why caches at every layer are so effective — they work because programs are not random in what they access.

Cache Relationships (Inclusive vs Exclusive)

When you have multiple cache layers, you have to decide: can the same data item live in more than one layer at a time? If yes, the caches are inclusive — each layer is a superset of the layer above it. If an item lives in exactly one layer at any time, the caches are exclusive — evicting from one layer moves the item to the next rather than discarding it. AMD CPUs use exclusive L2/L3 caches; Intel traditionally used inclusive L3 (this has evolved). For distributed systems, the inclusive vs exclusive choice maps to whether you populate multiple cache tiers simultaneously (inclusive) or only cache at the tier that served the miss (exclusive).

Hardware Cache Coherence

On a multi-core CPU, each core has its own L1 and L2 cache — meaning the same memory address can exist in multiple cores' caches simultaneously. If one core changes a value, all other cores with a copy need to know. The hardware protocol that manages this is called MESI (Modified, Exclusive, Shared, Invalid). Each cache line is tagged with one of these four states, and the hardware automatically coordinates between cores when a write happens. This is invisible to your code but explains why false sharing — two threads writing to adjacent variables on the same 64-byte cache line — can cause surprising performance cliffs: every write invalidates the other core's copy and forces a round-trip through L3.

The concept map shows how all the vocabulary connects. Hit ratio and miss penalty together determine end-to-end latency — you want maximum hit ratio and minimum miss penalty at every layer. Working set size tells you whether a given cache is large enough to hold hot data without thrashing. Write policies decide consistency trade-offs. Locality explains why caches work at all. Inclusive vs exclusive decides total capacity. MESI is the hardware implementation of consistency at the CPU level.

The core vocabulary — hit ratio, miss penalty, working set, write-through/around/back, temporal locality, spatial locality, inclusive/exclusive caches, MESI — forms the shared language for reasoning about any cache layer. Internalizing these terms makes documentation, config tuning, and performance debugging dramatically more efficient. Every subsequent section uses this vocabulary.

Section 5

The Six Cache Layers — A Layer-by-Layer Tour

With the vocabulary and pyramid mental model in place, let's walk each layer one by one — what it is, who controls it, the latency numbers, the typical capacity, and (critically) what makes it the right layer for certain types of data. Later blocks dive deep into each; here the goal is a clear, side-by-side picture.

Layer 1 — CPU L1 / L2 / L3 Caches (1–30 ns)

The fastest memory in any computer is not RAM — it is the cache baked directly onto the processor die. There are actually three tiers of this on-die cache, nested inside the CPU itself, named by how close they sit to each core: "level 1" is closest and smallest, "level 3" is furthest and largest. The innermost one, closest to a single core, is called L1 cache; it holds 32–64 KB per core and responds in about 1 nanosecond, which is 4 CPU clock cycles at 4 GHz. The middle tier, L2 cache, holds 256 KB to ~4 MB and responds in 4–12 ns. The outermost, L3 cache, is shared across all cores and holds 4–192+ MB depending on the CPU, responding in 15–30 ns.

Your application code cannot directly address L1/L2/L3 caches. You cannot say "put this variable in L1." The CPU hardware manages these caches automatically — the programmer's job is to write code with good cache-friendly access patterns. Accessing array elements in order (sequential → excellent spatial locality) is dramatically faster than jumping around a linked list (random pointers → terrible spatial locality). This matters for database internals, serialization, and hot-path server code. A cache-unfriendly inner loop can easily run 5–10× slower than an equivalent cache-friendly one — not because of algorithm complexity but because of L1/L2 thrashing.

Layer 2 — OS Page Cache (~100 ns, GBs of DRAM)

When your application reads a file — say, a database reading a data page from disk — the operating system does something clever: it keeps a copy of that file's data in ordinary RAM (DRAM) even after the read is done. The next time anyone reads the same data, the OS serves it directly from RAM without touching the disk. This memory region managed by the kernel is called the OS page cache.

Why does this matter? Because your database really sits on top of two caches: its own buffer pool (Layer 5) AND the OS page cache. When Postgres reads a data page and it's not in shared_buffers, it asks the kernel for the page, and the kernel checks its page cache first. If the page is there (a buffer-pool miss from Postgres's perspective, but a page-cache hit from the OS perspective), the read takes ~100 ns. Only if the page cache also misses does the OS issue an actual disk I/O — 50 µs for an SSD, 5–10 ms for an HDD. This two-level structure is often overlooked but explains why databases see much better performance than naïve latency calculations suggest.

Layer 3 — App-Process Cache (100 ns–1 µs, MBs–GBs)

When you store results in a HashMap inside your Java application, or a dict in Python, or use a library like Caffeine — that is Layer 3: the app-process cache. It lives inside your application's own memory (heap), so access is just a hash lookup — no network, no IPC, no syscall. Latency is effectively the same as accessing any other in-memory data structure: 100 ns to a few microseconds.

The critical limitation of Layer 3 is that it is local to one process. If you have ten application servers, each has its own independent in-process cache. If server A caches a user's profile, server B knows nothing about it. This makes Layer 3 best for shared-nothing, read-heavy, globally valid data — reference data that is the same for all servers (configuration values, product catalog snippets, static rate limits), not per-user state. For per-user or frequently-updated shared state, you need Layer 4.
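A Layer 3 cache can be as small as a dict plus an expiry time. The following is a minimal sketch, not a replacement for a library such as Caffeine: `TTLCache`, `get_or_load`, and `load_shipping_rates` are names made up for this example, and the class has no size bound and no thread safety.

```python
import time


class TTLCache:
    """Minimal in-process cache: a dict whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so tests can control time
        self.entries = {}    # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value for `key`, calling `loader(key)` on a miss or expiry."""
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: a dict lookup, no network and no syscall
        value = loader(key)
        self.entries[key] = (now + self.ttl_seconds, value)
        return value


def load_shipping_rates(region):
    # Hypothetical slow lookup; stands in for a database or remote call.
    load_shipping_rates.calls += 1
    return {"region": region, "flat_rate": 4.99}


load_shipping_rates.calls = 0

rates_cache = TTLCache(ttl_seconds=60)
rates_cache.get_or_load("eu", load_shipping_rates)
rates_cache.get_or_load("eu", load_shipping_rates)  # served from the dict
print(load_shipping_rates.calls)  # prints 1
```

Each application server would hold its own `rates_cache`, which is exactly the per-process limitation described above: ten servers mean ten independent copies and ten initial loads.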

Layer 4 — Distributed Cache: Redis / Memcached (0.5–2 ms, GBs–TBs)

When you need a cache that is visible to all your application servers simultaneously, you reach for a distributed cache. Redis and Memcached are separate server processes — your application connects to them over a TCP socket, pays a network round-trip, and gets back a cached value in 0.5–2 ms.

That network round-trip is the key trade-off: Layer 4 is on the order of 1,000× slower than Layer 3 for any single lookup, but it is shared. Every server in your fleet sees the same data. This makes it ideal for: user sessions, rate limit counters, cart contents, real-time leaderboards, feature flags, and any state that is per-user but needs to survive a process restart or be visible to multiple servers. The canonical rule: if the data lives in your database and you want to serve it without a DB query, Redis is your Layer 4.
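The usual way to put Layer 4 in front of a database is the cache-aside pattern: read the cache, fall back to the database on a miss, then populate the cache. The sketch below assumes a client object with `get(name)` and `set(name, value, ex=seconds)` methods, which is how the redis-py client exposes them. So that the example runs without a server, it uses a hypothetical in-memory stand-in called `FakeRedis`. The `cart:` key prefix and the `load_cart_from_db` function are also illustrative.

```python
import json


class FakeRedis:
    """In-memory stand-in so the sketch runs without a server.

    Only `get` and `set(..., ex=...)` are modelled; expiry is recorded but not enforced.
    """

    def __init__(self):
        self.data = {}
        self.ttls = {}

    def get(self, name):
        return self.data.get(name)

    def set(self, name, value, ex=None):
        self.data[name] = value
        self.ttls[name] = ex
        return True


def get_cart(user_id, cache, load_cart_from_db, ttl_seconds=300):
    """Cache-aside read: try the shared cache, fall back to the database, then populate."""
    key = f"cart:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)        # Layer 4 hit: no database query
    cart = load_cart_from_db(user_id)    # miss: pay for the database read once
    cache.set(key, json.dumps(cart), ex=ttl_seconds)
    return cart


db_calls = []


def load_cart_from_db(user_id):
    # Hypothetical database read; records each call so the effect is visible.
    db_calls.append(user_id)
    return {"user_id": user_id, "items": ["sku-1", "sku-2"]}


shared_cache = FakeRedis()
first = get_cart(42, shared_cache, load_cart_from_db)
second = get_cart(42, shared_cache, load_cart_from_db)  # any server sharing the cache would hit
```

Values are serialized to JSON because a distributed cache stores bytes or strings, not live objects. The 300-second default mirrors the 5-minute TTL used in the checkout example.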

Layer 5 — Database Buffer Pool (5 µs–15 ms, GBs on DB Server)

Your database has its own cache, and it is arguably the most powerful cache in the entire stack for database-heavy applications. The database buffer pool caches raw data pages in RAM on the database server itself. Postgres sizes it with the shared_buffers setting. MySQL's InnoDB sizes it with innodb_buffer_pool_size. The concept is the same: a chunk of the DB server's RAM is set aside for caching recently-read database pages.

When a query reads a row, the database loads the entire page containing that row into the buffer pool (8 KB in Postgres, 16 KB by default in InnoDB). Future queries for any row on that same page are served from RAM in ~5 µs instead of requiring a disk read (50 µs for NVMe SSD, 5–10 ms for HDD). For a production database with a well-tuned buffer pool and a working set that fits in RAM, 90–99% of reads never touch storage at all. The critical tuning fact: Postgres's default shared_buffers is just 128 MB regardless of how much RAM the server has. On a 32 GB server, a single config change — shared_buffers = 8GB — can eliminate most disk I/O entirely.
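Before changing the setting, it helps to know the current buffer pool hit ratio. Postgres exposes two counters in its pg_stat_database view, blks_hit and blks_read, and the ratio is simple arithmetic on them. The sketch below takes the two counters as plain numbers; how you fetch them is left out. The `recommended_shared_buffers_gb` helper and its one-quarter default are assumptions that simply generalize the 8 GB on 32 GB example above into a starting point, not a rule.

```python
def buffer_pool_hit_ratio(blks_hit, blks_read):
    """Fraction of page requests served from shared_buffers.

    blks_hit and blks_read are the counters of the same name in Postgres's
    pg_stat_database view. A "read" here means Postgres asked the kernel for
    the page, so it may still have been served by the OS page cache.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None  # no page requests recorded yet
    return blks_hit / total


def recommended_shared_buffers_gb(total_ram_gb, fraction=0.25):
    """Starting point matching the text's example: 8 GB on a 32 GB server."""
    return total_ram_gb * fraction


print(buffer_pool_hit_ratio(blks_hit=990_000, blks_read=10_000))  # prints 0.99
print(recommended_shared_buffers_gb(32))                          # prints 8.0
```

A ratio well below the 90–99% range mentioned above suggests the working set does not fit in the buffer pool. Because of the page cache underneath, though, a low ratio does not by itself prove the disk is being hit.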

Layer 6 — CDN Edge (5–50 ms, GBs–PBs, Globally Distributed)

The outermost cache layer sits between the Internet and your origin servers. A CDN edge node is a server in a data center physically close to the user — often within a city or metropolitan area. When the edge node has a cached copy of the requested resource (an HTML page, a JSON API response, an image), it serves the response directly: ~5–30 ms to the user regardless of where your origin server lives.

CDN caching is controlled by HTTP headers your application sends — Cache-Control, Surrogate-Control, Vary, ETag. The cache key is typically the URL. CDNs are ideal for: static files (JS/CSS/images with content-hash URLs, cached for a year with Cache-Control: max-age=31536000, immutable), shared page shells (header, footer, product grids), and public API responses (search results, product listings). They are not ideal for personalized responses (shopping carts, authenticated pages) without careful key partitioning or ESI (Edge Side Includes).
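In practice, the decision above usually ends up as a small function that maps a class of content to a Cache-Control value. A minimal sketch follows. The three class names and the function name are invented for this example. The directives themselves (public, private, max-age, s-maxage, immutable, no-store) are standard. The 60-second shared TTL follows the checkout example, and whether a given CDN honors every directive is something to confirm in its documentation.

```python
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31,536,000


def cache_headers(kind):
    """Return response headers for three broad classes of content."""
    if kind == "static_asset":
        # Content-hash URL: the bytes behind a given URL never change.
        return {"Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}, immutable"}
    if kind == "shared_shell":
        # Same for every user: browsers revalidate, shared caches keep it for 60 s.
        return {"Cache-Control": "public, max-age=0, s-maxage=60"}
    if kind == "personalized":
        # Per-user response: keep it out of shared caches entirely.
        return {"Cache-Control": "private, no-store"}
    raise ValueError(f"unknown content kind: {kind!r}")


for kind in ("static_asset", "shared_shell", "personalized"):
    print(kind, "->", cache_headers(kind)["Cache-Control"])
```

Raising on an unknown kind is deliberate: silently falling back to a cacheable default is how personalized pages end up served to the wrong user.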

The comparison table highlights the most important pattern: as you move down the layers, latency multiplies by 10–1,000× at each step, while capacity also multiplies by 10–1,000×. There is no layer that is "better" — each layer has a different role. The goal is to ensure that the hottest data (most frequently accessed) lives in the fastest layer that can hold it, and that each layer is sized large enough to achieve a high hit ratio for its working set.

The six layers — CPU L1/L2/L3, OS page cache, app-process cache, distributed cache, DB buffer pool, CDN edge — each have distinct latency, capacity, control mechanisms, and ideal data types. CPU L1 runs at ~1 ns but holds only 32–64 KB per core; the CDN runs at 5–50 ms but holds petabytes globally. Each layer is controlled differently: hardware auto-manages Layer 1 (the CPU caches), the kernel manages Layer 2 (the page cache), application code manages Layer 3, a separately operated service provides Layer 4, database config manages Layer 5, and HTTP headers manage Layer 6. The unified picture: every layer runs simultaneously, and the one closest to the requester that has valid cached data wins.

Section 6

The Math of Multi-Level Caching — Why Compounding Hit Ratios Win

Here is the insight that separates engineers who design caches from engineers who just use them: cache hit ratios across multiple layers do not add — they multiply. And multiplication is far more powerful than addition.

The Cascading Multiplication

Imagine a four-layer cache stack where each layer has a 90% hit ratio. You might intuitively think "90% + 90% + 90% + 90% = 360%" which doesn't even make sense, or maybe "90% average across layers." But the real math works like this: the 10% of requests that miss Layer 1 reach Layer 2. Layer 2 hits 90% of those — so 10% × 10% = 1% of original requests reach Layer 3. Layer 3 hits 90% of those — so 1% × 10% = 0.1% of original requests reach Layer 4. Layer 4 hits 90% — so only 0.1% × 10% = 0.01% of original requests reach the backing store (disk).

The formula is straightforward: the probability of a request reaching layer N is the product of all miss rates for layers 1 through N-1.

    P(reach layer N) = ∏ (1 − hit_ratio(i))  for i = 1 to N−1

    Example: 4 layers at 90% each

    P(reach L2)   = 1 − 0.90     = 0.10    (10%)
    P(reach L3)   = 0.10 × 0.10  = 0.01    (1%)
    P(reach L4)   = 0.01 × 0.10  = 0.001   (0.1%)
    P(reach disk) = 0.001 × 0.10 = 0.0001  (0.01% of original traffic)

Put differently: you need 10,000 requests at Layer 1 before a single request reaches disk. If your Layer 1 is processing 10,000 requests per second, your disk sees 1 request per second. That is the power of compounding.
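The cascade is a running product, which takes only a few lines to compute. The following Python sketch reproduces the four-layer example; the function name is illustrative, and the hit ratios are the 90% figures from the text.

```python
def fraction_reaching_each_layer(hit_ratios):
    """Return the fraction of original traffic reaching each layer, plus the backing store.

    hit_ratios[i] is the hit ratio of layer i+1. The result has one more entry
    than hit_ratios: the last entry is what falls through every layer.
    """
    fractions = [1.0]
    for hit_ratio in hit_ratios:
        fractions.append(fractions[-1] * (1.0 - hit_ratio))
    return fractions


fractions = fraction_reaching_each_layer([0.90, 0.90, 0.90, 0.90])
requests_per_second = 10_000
for layer, fraction in enumerate(fractions[:-1], start=1):
    print(f"layer {layer}: {fraction * requests_per_second:,.0f} req/s")
print(f"disk:    {fractions[-1] * requests_per_second:,.0f} req/s")
```

Swapping in measured hit ratios shows which layer lets the most traffic through. Passing a single-element list such as `[0.5]` covers the "is one mediocre layer worth it" question: half the traffic reaches the store.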

The cascade diagram makes the compounding visceral: each layer's bar is 10× shorter than the one before it, because each layer absorbs 90% of what reaches it. By the time you reach the disk, the traffic is a tiny sliver — 1 request per second from an original 10,000. The database doesn't need to scale to match your application traffic; it only needs to handle the residual after every upstream cache layer has done its job.

The End-to-End Latency Formula

The same compounding logic applies to average latency. The end-to-end average latency of a multi-level cache stack is not the slowest layer's latency — it is the probability-weighted sum across all layers.

    avg_latency = Σ P(hit at layer i) × latency(layer i)  +  P(miss all) × disk_latency

    Where: P(hit at layer i) = hit_ratio(i) × P(reach layer i)
           P(reach layer i)  = ∏ miss_ratio(j)  for j = 1 to i−1
           miss_ratio(j)     = 1 − hit_ratio(j)

Let's work through a realistic example with five layers and real latency numbers. The numbers below are representative for a typical production web application with a well-tuned Redis cluster, reasonable DB buffer pool, and CDN in front:

The worked example reveals something surprising: even though disk I/O is 10 ms — roughly 33,000× slower than an app-cache hit — the disk contributes only about 2% of the average latency. Why? Because only 0.075% of requests ever reach it. The slowest layer barely matters for the average, because the upstream layers filter out virtually all traffic. This has a counterintuitive implication for optimization: spending engineering effort to improve the first layer's hit ratio (e.g., from 85% to 92%) has a larger impact on average latency than speeding up the disk read path by 10×, because the disk path is already so rare.
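The probability-weighted sum is easy to evaluate for your own stack. The sketch below implements the formula as written: each request is charged only the latency of the layer that finally serves it. The three layers and their numbers in the usage lines are illustrative assumptions, not the figures from the worked example, and the function name is invented here.

```python
def average_latency(layers, backing_latency):
    """Probability-weighted average latency for a stack of cache layers.

    layers is a list of (hit_ratio, latency) pairs ordered from the first layer
    a request meets to the last; backing_latency is paid by requests that miss
    every layer. All latencies must use the same unit.

    Returns (average, fraction_reaching_backing_store, backing_share_of_average).
    """
    p_reach = 1.0
    total = 0.0
    for hit_ratio, latency in layers:
        total += p_reach * hit_ratio * latency  # P(hit at this layer) × its latency
        p_reach *= 1.0 - hit_ratio              # what is left falls to the next layer
    backing_part = p_reach * backing_latency
    total += backing_part
    share = backing_part / total if total else 0.0
    return total, p_reach, share


# Illustrative stack, latencies in microseconds (assumed values).
stack = [
    (0.80, 0.3),    # app-process cache
    (0.90, 500.0),  # distributed cache
    (0.90, 5.0),    # database buffer pool
]
avg_us, p_disk, disk_share = average_latency(stack, backing_latency=10_000.0)
print(f"average: {avg_us:.1f} µs, reach disk: {p_disk:.3%}, disk share: {disk_share:.0%}")
```

Changing one hit ratio at a time and re-running is a quick way to see which layer actually moves the average for a given stack, before spending engineering time on it.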

When a Single 50% Layer Is Worth Adding

The math also reveals why adding a mediocre cache layer with only a 50% hit ratio can still be worthwhile. If your stack currently sends 1,000 requests per second to the database, adding a layer with a 50% hit ratio reduces database traffic to 500 requests per second without any database tuning. Whether 500 req/s of saved DB load justifies the operational cost of the new cache layer depends on your specific system, but the math works in your favor as long as the cache layer is cheaper per request than the layer below it. A Redis lookup at 0.5 ms is far cheaper to scale than a Postgres query at 10 ms, so even a 50% Redis hit ratio meaningfully reduces DB scaling costs.

The multi-level caching optimization priority: Start with the highest cache layer. Raising the top layer's hit ratio from 90% to 95% halves the traffic reaching every layer beneath it, whereas an improvement at the lowest cache only reduces load on the backing store. Work from top to bottom: measure hit ratios at each layer, find the worst one relative to its potential, and fix it before moving down.

Cache hit ratios multiply across layers, not add. Four layers at 90% hit ratio each reduce bottom-layer traffic to 0.01% of original — meaning 10,000 requests per second at the top produces only 1 request per second at disk. End-to-end average latency is a probability-weighted sum: slow layers barely affect the average because they're so rarely reached. The optimization rule: always fix the highest layer with the worst hit ratio first — improvements there compound through every layer below.

Section 7

Layer 1 Deep Dive — CPU Cache (L1, L2, L3) and the Memory Hierarchy

Before your code ever touches Redis or a database, it runs on a CPU that is doing its own caching — silently, automatically, and at speeds so fast that most engineers never think about them. Understanding this layer won't let you configure it (the hardware controls it entirely), but it will explain why some algorithms are 10× faster than others on identical hardware, and why "cache-friendly" code is a real engineering concern worth talking about in a systems interview.

The Speed Gap Problem: Why CPU Caches Had to Exist

Here is the fundamental problem that forced CPU caches into existence: modern CPUs execute instructions in roughly 0.25–0.5 nanoseconds (a 3 GHz clock ticks every 0.33 ns), but reading from DRAM — the ordinary RAM in your computer — takes roughly 80 nanoseconds. That is a 200× speed gap. Without caching, the CPU would spend most of its time waiting for data, running at roughly 0.5% of its theoretical peak speed.

The solution: build a small chunk of extra-fast memory and bolt it onto the CPU itself. The memory used inside CPU caches is a different type than the RAM in your DIMM slots — it uses six transistors per bit instead of one transistor and a capacitor, so each bit is wired up to respond instantly without needing the periodic "refresh" pulse ordinary RAM needs. This faster type is called SRAM (Static RAM), and the slower bulk type used in main memory is DRAM (Dynamic RAM). SRAM is faster but takes several times the silicon area per bit and costs far more, so you can only afford a tiny amount. The hardware splits that tiny amount into a three-level hierarchy:

The memory hierarchy pyramid. Each level down is roughly 5–10× slower, 5–30× larger, and much cheaper per byte. The CPU checks L1 first; on a miss it checks L2, then L3, then DRAM. Every miss to a lower level costs more cycles.

What the CPU Actually Caches: Cache Lines, Not Individual Bytes

Here is a detail that trips up many people: the CPU doesn't cache individual bytes or even individual integers. Instead, every time it pulls something from RAM it grabs a fixed-size neighborhood — typically a 64-byte chunk on Intel/AMD x86 and most ARM64 server and mobile cores — and stuffs the whole chunk into the cache. Apple's M-series chips are a notable exception and use 128-byte lines. Those chunks are called cache lines. Why 64 bytes specifically? Because programs almost never read just one byte and stop. If your code touches address X, it will very likely touch X+1, X+2, ... X+63 within a few instructions — a pattern called spatial locality. Loading the whole neighborhood at once means one slow trip to DRAM pays for dozens of future fast reads.

This has a practical consequence for data structure design: an array of 64-bit integers stores 8 integers per cache line. Traversing the array sequentially loads a new cache line every 8 elements, but each load serves the next 7 reads for free — very efficient. A linked list, by contrast, stores each node at a random heap address; every pointer dereference may require loading an entirely new cache line, potentially causing a full DRAM round-trip for every node. Same algorithmic complexity (O(n)), but the array version can be 5–10× faster in practice purely because of cache efficiency.
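The arithmetic behind that comparison can be sketched in a few lines. This is a back-of-the-envelope model, not a benchmark: it assumes 64-byte lines, a line-aligned array, and a worst-case linked list in which every node sits on its own cache line. The function names are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # typical x86-64 line size (128 on Apple M-series)


def lines_touched_sequential(n_elements: int, element_bytes: int = 8) -> int:
    """Cache lines loaded when walking a contiguous, line-aligned array once."""
    total_bytes = n_elements * element_bytes
    return -(-total_bytes // CACHE_LINE_BYTES)  # ceiling division


def lines_touched_scattered(n_nodes: int) -> int:
    """Worst case for a linked list: every node lives on a different line."""
    return n_nodes


if __name__ == "__main__":
    n = 1_000_000
    array_loads = lines_touched_sequential(n)
    list_loads = lines_touched_scattered(n)
    print(f"array: {array_loads} line loads, list: {list_loads} line loads")
```

Under these assumptions a million 64-bit integers need one line load per eight elements, while the scattered list needs one per node. Hardware prefetching widens the real gap further for the array, which this model does not capture.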

Temporal and Spatial Locality: Why Caches Work at All

CPU caches work because real programs aren't random — they have patterns. Two key patterns make caching effective:

Temporal locality — if you accessed a piece of data once, you'll probably access it again soon. A loop counter used a million times fits this pattern perfectly. The CPU keeps it in L1 and never pays the DRAM cost after the first load.
Spatial locality — if you accessed address X, you'll likely access nearby addresses soon. Array traversal, struct field access, and sequential file reads all exhibit this. The cache-line mechanism exploits it automatically.

When your code violates these patterns — accessing memory in a random stride, jumping around a huge linked list — you "thrash the cache" and effectively reduce the CPU to running at DRAM speeds. The difference between cache-friendly and cache-hostile code on the same hardware is routinely 5–10×.

MESI Coherence: Keeping Multi-Core Caches Consistent

A modern CPU has 8–128 cores, each with its own L1 and L2 cache. If two cores both cache the same memory address and one of them writes to it, the other's cached copy is now stale and would return the wrong answer if used. The hardware fixes this by giving every cached 64-byte chunk a little status tag ("is this copy still trustworthy, or has somebody else changed it?") and sending invalidation messages between cores whenever a write happens. The protocol that defines the four possible tag values and the rules for switching between them is called MESI. The name is just the four states stitched together: Modified (this core has changed the line and holds the only valid copy), Exclusive (this core holds the only copy and it still matches memory), Shared (several cores hold clean copies), and Invalid (this copy must not be used).

The hardware circuitry managing MESI transitions runs entirely without software intervention. From your code's perspective, multi-core caches appear coherent. From a performance perspective, however, MESI coherence creates one dangerous trap.

False Sharing: The Silent Performance Killer

Imagine two threads on two different cores, each incrementing their own independent counter. Thread A increments counter_a; Thread B increments counter_b. Completely independent operations — in theory, they should run at full speed with zero contention. But if counter_a and counter_b happen to live in the same 64-byte cache line (e.g., they are adjacent fields in the same struct), every write by Thread A invalidates Thread B's cached copy of that line via MESI, and vice versa. Neither thread is actually touching the other's data, but the CPU's cache coherence mechanism can't tell — it only sees cache lines, not individual variables. This is false sharing: two threads contending over a cache line they don't logically share.

False sharing in action. Both cores read the same 64-byte cache line into their L1 caches. When Core 0 writes counter_a, MESI sends an invalidation to Core 1, even though Core 1 only cares about counter_b. Both threads slow to the speed of cross-core cache-line transfers (roughly L3 latency or worse on every write) despite touching logically independent data. The fix is to pad each counter to 64 bytes so they live in separate cache lines.
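The padding fix is purely a question of memory layout, which can be illustrated with ctypes. This is only a layout sketch: the struct names are hypothetical, the 64-byte line size is an assumption, and the offsets are relative to the start of the struct, so the conclusion holds only if the struct itself starts on a cache-line boundary. Python threads would not show the performance effect because of the interpreter lock.

```python
import ctypes

CACHE_LINE = 64  # assumed line size in bytes


class PackedCounters(ctypes.Structure):
    # Adjacent 8-byte fields: both land in the same 64-byte line.
    _fields_ = [
        ("counter_a", ctypes.c_int64),
        ("counter_b", ctypes.c_int64),
    ]


class PaddedCounters(ctypes.Structure):
    # Each counter is followed by 56 bytes of padding, so each one
    # occupies a full 64-byte line of its own.
    _fields_ = [
        ("counter_a", ctypes.c_int64),
        ("_pad_a", ctypes.c_char * (CACHE_LINE - 8)),
        ("counter_b", ctypes.c_int64),
        ("_pad_b", ctypes.c_char * (CACHE_LINE - 8)),
    ]


def line_index(struct_type, field_name: str) -> int:
    """Which cache line (relative to the struct start) holds this field."""
    return getattr(struct_type, field_name).offset // CACHE_LINE


def share_a_line(struct_type) -> bool:
    return line_index(struct_type, "counter_a") == line_index(struct_type, "counter_b")


if __name__ == "__main__":
    print("packed shares a line:", share_a_line(PackedCounters))
    print("padded shares a line:", share_a_line(PaddedCounters))
```

The padded layout costs 128 bytes instead of 16, which is the usual trade: a little wasted memory in exchange for independent cache lines.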

Why this matters in Java and Go: Java's @Contended annotation (usable in application code only with the JVM flag -XX:-RestrictContended) instructs the JVM to add 128 bytes of padding around a field to prevent false sharing. Go's sync.Mutex is not padded on its own; the standard library instead pads hot per-processor structures such as the sync.Pool shards, and golang.org/x/sys/cpu exposes a CacheLinePad type you can embed between fields. Disruptor (the high-performance Java ring buffer) avoids false sharing by padding its sequence counters so that each one occupies its own cache line, which is one of the reasons its authors report large throughput gains over traditional queues in their benchmarks.

The CPU's three-level cache hierarchy (L1 ~1 ns / 32 KB → L2 ~4 ns / 512 KB → L3 ~20 ns / 32 MB) exists entirely because DRAM is 200× slower than the CPU. The hardware caches in 64-byte cache lines to exploit spatial locality, and the MESI protocol keeps multi-core caches coherent. False sharing — two threads unknowingly sharing a cache line for unrelated variables — can silently kill multi-core performance by 5–10×. You can't configure these caches, but you can write cache-friendly code that works with them.

Section 8

Layer 2 Deep Dive — OS Page Cache (The Free Cache You Already Have)

Every Linux server you've ever run an application on has been quietly caching disk data in RAM behind your back — automatically, since the 1990s. This is the OS page cache, and it is genuinely the cheapest performance win in systems engineering because you didn't have to configure, buy, or install anything. It just works. Understanding it lets you reason about when it's working for you, when workloads fight against it, and how to tune it for your specific situation.

How the Page Cache Works: The Kernel's RAM Recycler

When your application calls read() on a file, the kernel doesn't fetch bytes straight from disk and hand them to your process. Instead, it reads an entire page (4 KB on x86 Linux) into a kernel-managed RAM buffer, then copies the data to your process's address space. The crucial part: after satisfying your read() call, the kernel keeps that page in RAM. The next read() — by you, by another process, or by the OS itself — is served entirely from RAM with no disk involved.

Where does this RAM come from? The kernel uses all RAM not currently needed by processes as page cache. If a process needs more heap memory, the kernel evicts cold pages from the page cache (using a variant of LRU) to reclaim space. The page cache is not a fixed allocation — it expands and contracts dynamically based on system pressure. This is why the output of free -h can be confusing at first glance:

# On a 32 GB server running Postgres (typical output)
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       4.2Gi       0.8Gi       312Mi      26.2Gi      26.1Gi
Swap:            0Bi         0Bi         0Bi
# "used" (4.2 GB) = RAM held by running processes
# "buff/cache" (26.2 GB) = OS page cache + kernel buffers (NOT wasted RAM)
# "available" (26.1 GB) = how much RAM a new process can actually get
#   (kernel will evict page cache if needed, so available ≈ free + most of buff/cache)
#
# KEY INSIGHT: a healthy server should show low "free" and high "buff/cache"
# "free" memory is literally wasted, so the kernel fills it with page cache automatically

A common mistake: seeing free = 0.8 GB and panicking that the server is out of memory. The 26 GB in buff/cache is available to processes on demand. The kernel will evict page cache pages to make room. available is the honest metric for "how much RAM can a new process get right now."
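The same reading can be done programmatically from /proc/meminfo, which is where free gets its numbers. The sketch below parses a meminfo-style text and derives the buff/cache figure as Buffers + Cached + SReclaimable, which is how current procps versions of free define it. The sample values are invented to match the output shown earlier, and the function names are illustrative.

```python
# Hypothetical /proc/meminfo excerpt matching the 32 GB example (values in kB).
SAMPLE_MEMINFO = """\
MemTotal:       32505856 kB
MemFree:          838860 kB
MemAvailable:   27367833 kB
Buffers:          524288 kB
Cached:         26214400 kB
SReclaimable:     734003 kB
"""


def parse_meminfo(text: str) -> dict:
    """Turn 'Key:   value kB' lines into {key: value_in_kb}."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        values[key.strip()] = int(rest.split()[0])
    return values


def memory_summary(info: dict) -> dict:
    """Derive the three numbers that matter when reading `free`."""
    buff_cache = info["Buffers"] + info["Cached"] + info["SReclaimable"]
    return {
        "free_kb": info["MemFree"],
        "buff_cache_kb": buff_cache,
        "available_kb": info["MemAvailable"],  # the honest headroom metric
    }


if __name__ == "__main__":
    summary = memory_summary(parse_meminfo(SAMPLE_MEMINFO))
    for name, kb in summary.items():
        print(f"{name}: {kb / 1024 / 1024:.1f} GiB")
```

On a real host, pass the contents of /proc/meminfo instead of the sample string. Alerting on available_kb rather than free_kb avoids the false alarm described above.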

The read() path through the OS page cache. On a hit, the kernel copies data from a RAM page (~100 ns). On a miss, it fetches from disk, stores the page in the cache, then copies to your process. Critically, the page stays in cache after the miss — subsequent reads are free.

Why Databases Love (and Sometimes Fight) the Page Cache

Every database you've ever used benefits from the page cache by default. Postgres reads its data files through the page cache, and the page cache is often the reason Postgres feels fast even when shared_buffers is set to a modest 2 GB on a 64 GB server — the OS is quietly caching another 40 GB of hot pages underneath. This creates a double-buffering situation: both Postgres's own buffer pool and the OS page cache hold copies of the same data. It's technically wasteful (RAM used twice), but in practice the two levels serve different purposes — shared_buffers knows it's holding database pages and evicts intelligently; the page cache just sees bytes and does generic LRU.

MySQL InnoDB takes a different approach: when innodb_flush_method is set to O_DIRECT (a common production setting on Linux; older releases default to fsync, which does go through the page cache), it opens its data files with the O_DIRECT flag, which bypasses the page cache and relies on InnoDB's own buffer pool exclusively. The motivation is to avoid double-buffering: if you've allocated 60 GB to the InnoDB buffer pool, there's no point caching the same data again in the page cache. InnoDB's buffer pool has smarter eviction policies (it knows about hot vs cold pages in its B-tree), so excluding the OS cache is a net win when RAM is the constraint.

Scan Eviction: The Page Cache's Achilles Heel

The page cache's biggest weakness is scan eviction. Imagine running a full-table scan on a 200 GB table while your production database also serves live OLTP queries. The scan reads 200 GB of pages sequentially, filling the page cache as it goes. Because the OS can't tell the difference between "scan I'll never see again" and "hot page accessed a thousand times today," it happily evicts hot pages to make room for scan data. After the scan finishes, all your hot OLTP pages are gone — every subsequent query causes a cache miss and a disk read. The database spends the next several minutes re-warming the cache. This is called cache pollution.

The fix is to tell the kernel about your access pattern so it stops treating the scan like hot data. Linux gives you two system calls for this: one for normal file reads, one for memory-mapped files. They are called posix_fadvise() (file advise) and madvise() (memory advise), and they let you whisper hints like "I'm scanning, don't bother keeping these pages" or "I'll be jumping around randomly, skip the pre-fetch." Used together, they make a scan walk through the page cache without pushing out the hot OLTP data:

// Requires <fcntl.h> for posix_fadvise() and <sys/mman.h> for madvise().
// For a sequential scan you will NOT re-read:
// 1. Before the scan, tell the kernel the range will be read linearly once.
posix_fadvise(fd, offset, length, POSIX_FADV_SEQUENTIAL);
// POSIX_FADV_SEQUENTIAL → kernel enlarges the read-ahead window (doubles it on Linux)
// 2. ... read the range ...
// 3. After a chunk has been consumed, ask the kernel to drop it.
posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
// POSIX_FADV_DONTNEED → kernel may free the cached pages in this range now.
// It acts on pages already in the cache, so call it AFTER reading, not before;
// that keeps the scan from pushing hot pages out of the LRU.

// For random access on a hot working set:
posix_fadvise(fd, offset, length, POSIX_FADV_RANDOM);
// Tells kernel: don't do sequential pre-fetch; I'll jump around.
// Avoids reading ahead pages you'll never use.

// madvise() does the same thing for mmap'd regions:
madvise(addr, length, MADV_SEQUENTIAL); // pre-fetch ahead
madvise(addr, length, MADV_DONTNEED);   // drop these pages now

Postgres historically used posix_fadvise(POSIX_FADV_WILLNEED) driven by effective_io_concurrency to prefetch pages during bitmap heap scans and similar operations (modern Postgres 17+ has moved sequential scans onto the streaming-I/O / async-I/O framework instead). On the application side, POSIX_FADV_DONTNEED is the surgical lever you reach for when you specifically want to prevent a one-shot batch scan from polluting the page cache. It's a rare but powerful tuning lever for mixed OLTP + analytical workloads.
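For a batch job written in a higher-level language, the same hints are reachable from the standard library. The sketch below shows one way a one-shot scan might use them from Python: os.posix_fadvise and os.pread are real standard-library calls on Unix, while the function name and the 1 MiB chunk size are illustrative choices. The hasattr guard is there because posix_fadvise is not available on every platform.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def scan_without_polluting(path: str, chunk: int = CHUNK) -> int:
    """Read a file once, front to back, dropping each chunk from the
    page cache after it has been consumed. Returns bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        has_fadvise = hasattr(os, "posix_fadvise")
        if has_fadvise:
            # Length 0 means "to end of file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            data = os.pread(fd, chunk, offset)
            if not data:
                break
            total += len(data)  # a real job would process `data` here
            if has_fadvise:
                # Drop only the range we just finished with.
                os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(fd)
    return total
```

One caveat worth keeping in mind: DONTNEED drops the pages for everyone, not just for the calling process, so this pattern fits files that only the batch job reads. Pointing it at a file that the live OLTP workload also depends on would evict exactly the pages you were trying to protect.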

tmpfs and shmfs: Explicit In-RAM Filesystems

Sometimes you want a guarantee that certain data always lives in RAM and never gets written to disk at all: temporary files, in-memory IPC buffers, shared-memory regions between processes. Linux offers a special filesystem that looks and acts like any other directory but is secretly backed by the page cache and swap rather than a real disk. It is called tmpfs (temporary filesystem). Files you write to a tmpfs mount stay in RAM (or swap, under memory pressure) and never reach a disk-backed filesystem. On Linux, POSIX shared memory is implemented on a tmpfs mount at the familiar /dev/shm path, and Postgres creates its dynamic shared memory segments (used by parallel queries, for example) there by default. The main shared_buffers region is ordinary shared memory rather than a tmpfs file, and its contents do not survive a Postgres restart; what keeps a restart from being painfully cold is that the hot pages are usually still sitting in the OS page cache.

The OS page cache is an automatic, always-on RAM cache for file data managed by the kernel. It expands to fill unused RAM and contracts under process memory pressure. The buff/cache line in free -h shows its size; high values are healthy, not wasteful. Databases like Postgres ride on top of the page cache; InnoDB can bypass it via O_DIRECT to avoid double-buffering. Full-table scans can evict hot pages (cache pollution); calling posix_fadvise() with POSIX_FADV_DONTNEED after each chunk is read is the surgical fix.

Section 9

Layer 3 Deep Dive — In-Process Application Cache (Caffeine, Guava, dicts)

The in-process application cache is the cache you build yourself, inside your application's own memory. Unlike the CPU cache (hardware-controlled) or the OS page cache (kernel-controlled), this one is yours to design: you choose what to store, how much, how long, and what eviction policy to use. It is the fastest cache you can write application code against — because the data is already in the same process, there is no network hop, no serialization, no protocol overhead. A hit costs roughly 100–500 ns on a modern JVM, compared to 0.5–2 ms for Redis. That is a 1,000–5,000× speed difference.

Why Not Just Use a HashMap? The Problem With Unbounded Caches

The simplest in-process cache is a HashMap: put data in on first access, read from it on subsequent accesses. This works until your JVM runs out of heap memory and starts either throwing OutOfMemoryError or spending 80% of its time in GC. An in-process cache without a size bound is just a memory leak with aspirations. You need at minimum: a maximum size (so memory is bounded), an eviction policy (so the least-useful entries are dropped first), and optionally a TTL (so stale data expires).
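Those three requirements fit in a small class. The following is a minimal sketch of a size-bounded LRU cache with a per-entry TTL, built on the standard library's OrderedDict. The class name and the injectable clock parameter are illustrative, and the sketch is deliberately not thread-safe; a production library adds locking, statistics, and a better eviction policy.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """LRU eviction plus time-based expiry. Not thread-safe."""

    def __init__(self, max_size: int, ttl_seconds: float, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._entries = OrderedDict()  # key -> (expires_at, value); last = newest

    def get(self, key, default=None):
        item = self._entries.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._entries[key]      # stale: drop it and report a miss
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def __len__(self) -> int:
        return len(self._entries)
```

Expired entries are removed lazily, on the next read of that key, so memory is bounded by max_size rather than by the TTL. That is usually the right priority: the size bound is what prevents the out-of-memory failure.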

Caffeine: The Modern Best-in-Class (Java)

For Java, Caffeine is the de-facto standard. It replaced Guava's cache as the go-to library (Spring Boot auto-configures a Caffeine-backed cache manager for @Cacheable when Caffeine is on the classpath and no higher-priority cache provider is present). The reason it pulled ahead is its eviction algorithm. The old default, "drop whatever was used longest ago" (plain LRU), has a known weakness: a one-time scan can shove out keys that have been hammered a million times. Caffeine instead remembers how often each key has been used recently, and protects frequently-used keys from one-shot scans. The technique has a mouthful of a name, W-TinyLFU (Windowed Tiny Least Frequently Used), but the idea is just "favour popular keys over recently-touched-once keys." Here is a minimal production-ready Caffeine setup:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.stats.CacheStats;
import java.util.concurrent.TimeUnit;

// Build a cache: max 10,000 entries, expire 5 min after write
Cache<String, UserProfile> profileCache = Caffeine.newBuilder()
    .maximumSize(10_000)                    // evict (via W-TinyLFU) once past this count
    .expireAfterWrite(5, TimeUnit.MINUTES)  // TTL per entry
    .recordStats()                          // enable hit/miss counters
    .build();

// get-if-present (returns null on miss)
UserProfile cached = profileCache.getIfPresent(userId);

// get-or-load (calls the loader on miss, caches result automatically)
// The loader must return a UserProfile; unwrap an Optional if your repository returns one.
UserProfile profile = profileCache.get(userId, id -> userRepository.findById(id));
// ↑ This is the preferred pattern: atomic load-and-cache, no stampede risk
// Caffeine serializes concurrent loads for the same key (only one DB call per key)

// Inspect hit rate in tests/staging. recordStats() adds a small per-operation cost,
// so measure it before deciding whether to leave it enabled in prod.
CacheStats stats = profileCache.stats();
System.out.printf("Hit rate: %.1f%%, Eviction count: %d%n",
    stats.hitRate() * 100, stats.evictionCount());

Key points from this snippet: maximumSize bounds memory; expireAfterWrite bounds staleness; get(key, loader) is the preferred read pattern because Caffeine serializes concurrent loads for the same key — if 100 threads request the same missing key simultaneously, only one DB call fires and the other 99 wait for that result. This prevents the cache stampede (also called "thundering herd") where a sudden miss causes a flood of identical DB queries.
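The load-serialization idea is not specific to Caffeine. The sketch below shows the same "single-flight" behaviour in plain Python: the first thread to miss a key becomes the leader and runs the loader, and every other thread asking for that key waits for the leader's result. It is not Caffeine's implementation; the class name is illustrative, and the sketch omits size bounds and TTL to keep the coordination logic visible.

```python
import threading


class SingleFlightCache:
    """At most one loader call per missing key, however many threads ask."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._in_flight = {}  # key -> threading.Event set when the load ends
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._in_flight.get(key)
                is_leader = event is None
                if is_leader:
                    event = threading.Event()
                    self._in_flight[key] = event
            if not is_leader:
                event.wait()  # leader finished (or failed); re-check the cache
                continue
            try:
                value = self._loader(key)  # runs outside the lock
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                with self._lock:
                    del self._in_flight[key]
                event.set()
```

If the leader's load raises, the exception propagates to the leader only; the waiting threads wake up, find no cached value, and one of them becomes the new leader and retries. Whether retrying or sharing the failure is the better behaviour depends on the backing store, so treat that as a design choice rather than a rule.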

The Multi-Instance Problem: Why In-Process Cache Hit Ratio Degrades at Scale

Here is the catch with in-process caching: when you run multiple application instances (horizontal scaling), each instance has its own private cache. Instance A might cache user profile 42; instance B might not. A request routed to instance B misses and hits the database, even though the exact same data is sitting warm in instance A's heap. Your effective hit ratio is per instance, not across the whole fleet.

With 3 instances and random load balancing, a hot key (user 42) that has just been loaded is cached in only 1 of 3 instances. Until the other two have loaded it themselves, roughly two-thirds of requests for it miss locally and fall through to the database, and every expiry repeats that cost once per instance. The fix is either sticky sessions (fragile, counter to horizontal scaling) or a combined strategy: a small Caffeine cache for the very hottest keys, backed by Redis for cross-instance sharing.

The L-Shape Cache Strategy: In-Process + Distributed Together

The best pattern for production is a two-level lookup: check the in-process Caffeine cache first (1 µs); if missed, check Redis (1 ms); if missed, query the database and populate both caches. This is called the "L-shape" or "layered" cache strategy. The Caffeine tier holds the hottest ~1,000–10,000 keys; Redis holds the full warm data set across all instances. The Caffeine layer absorbs the top 5–10% of traffic at microsecond speed. The Redis layer handles the rest at millisecond speed. Together, fewer than 1% of requests reach the database.
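The lookup order described above can be sketched with plain dictionaries standing in for the two tiers. In this illustration the local dict plays the role of Caffeine, the shared dict plays the role of Redis, and load_from_db is a hypothetical callable; a real implementation would use bounded caches with TTLs and a network client for the shared tier.

```python
class TwoLevelCache:
    """Check the in-process tier, then the shared tier, then the database."""

    def __init__(self, local: dict, shared: dict, load_from_db):
        self.local = local      # per-instance tier (stand-in for Caffeine)
        self.shared = shared    # fleet-wide tier (stand-in for Redis)
        self.load_from_db = load_from_db
        self.stats = {"local": 0, "shared": 0, "db": 0}

    def get(self, key):
        if key in self.local:
            self.stats["local"] += 1
            return self.local[key]
        if key in self.shared:
            self.stats["shared"] += 1
            value = self.shared[key]
        else:
            self.stats["db"] += 1
            value = self.load_from_db(key)
            self.shared[key] = value  # populate the shared tier for other instances
        self.local[key] = value       # populate the local tier for the next call
        return value


if __name__ == "__main__":
    shared_tier = {}
    instance_a = TwoLevelCache({}, shared_tier, lambda k: {"user_id": k})
    instance_b = TwoLevelCache({}, shared_tier, lambda k: {"user_id": k})
    instance_a.get(42)  # database
    instance_b.get(42)  # shared tier, no database call
    instance_a.get(42)  # local tier
    print(instance_a.stats, instance_b.stats)
```

The point of the shared tier shows up in the second call: instance B has never seen key 42, yet it does not touch the database. Invalidation is the part this sketch leaves out, and it is the harder half: the local tier needs a short TTL or an explicit invalidation channel, otherwise instances can serve different versions of the same key.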

Equivalents in Other Languages

Python: functools.lru_cache (built-in, size-bounded, LRU, but not TTL-aware) or the cachetools library (TTL, LFU, LRU). For async/FastAPI code, aiocache provides async-safe in-process caches.

.NET / C#: System.Runtime.Caching.MemoryCache (framework) or Microsoft.Extensions.Caching.Memory.IMemoryCache (recommended in .NET Core). Both support size limits, TTL, and eviction callbacks.

Node.js: The lru-cache npm package is the standard. Supports max (entry count), maxSize (byte-based), ttl, and a fetchMethod for atomic load-and-cache (equivalent to Caffeine's get(key, loader)).

The in-process cache lives in your application's heap and offers ~100–500 ns access time — 1,000–5,000× faster than Redis. Caffeine (Java) with W-TinyLFU is the best-in-class implementation; equivalents exist for Python, .NET, and Node.js. The critical trade-off: in-process caches don't share data across instances, so fleet-wide hit ratios degrade as instance count grows. The production solution is the L-shape strategy: tiny Caffeine tier for ultra-hot keys, Redis for cross-instance sharing, database as the final fallback.

Section 10

Layer 4 Deep Dive — Distributed Cache (Redis, Memcached)

A distributed cache solves the one problem that in-process caching cannot: shared state across multiple application servers. Instead of each server maintaining its own private copy of cached data, all servers talk to a central (or clustered) cache tier. Every server sees the same data, a single invalidation takes effect for all servers at once, and the cache can be sized independently of your application server heap. The tradeoff, and it is a real one, is a network round-trip for every cache operation: typically 0.5–2 ms on a well-connected internal network, compared to ~100–500 ns for in-process.

Redis vs Memcached: Which One and When

Dimension	Redis	Memcached
Data structures	Strings, hashes, lists, sets, sorted sets, streams, JSON (RedisJSON), bitmaps	Strings only (key → arbitrary byte blob)
Persistence	RDB snapshots, AOF (append-only log), or both	None — pure RAM; restart = empty cache
Replication	Built-in primary-replica; Sentinel for automatic failover; Cluster for sharding	No built-in replication; client-side sharding only
Scripting / atomics	Lua scripts, MULTI/EXEC transactions, atomic commands (INCR, SETNX)	CAS (check-and-set) and atomic incr/decr; no scripting
Threading model	Single-threaded command execution (I/O can be multi-threaded in Redis 6+)	True multi-threaded, scales across cores for pure key-value workloads
When to pick	Rich data structures needed; Lua; pub/sub; persistence or replication required	Pure cache, strings only, highest possible throughput per core, no persistence needed

In practice, Redis wins the majority of greenfield decisions because it is more flexible. The overhead difference between Redis and Memcached is rarely the bottleneck — network latency and serialization cost typically dwarf the few-microsecond difference in the servers themselves. Pick Memcached only when you've benchmarked a specific bottleneck that Memcached solves and Redis doesn't.

Redis Cluster: Horizontal Sharding via Hash Slots

A single Redis node is usually kept to roughly 25–50 GB of cache data as a rule of thumb (beyond that, the fork() used for RDB snapshots and its copy-on-write memory overhead start to cause noticeable pauses). Redis Cluster scales horizontally by splitting the keyspace into 16,384 hash slots. Every key is hashed to a slot (CRC16 of the key, mod 16384), and every primary node owns a subset of the slots (contiguous ranges at first, arbitrary sets after resharding). The client library calculates which slot a key belongs to before sending the request and connects directly to the node that owns it; if its slot map is stale, the node answers with a MOVED redirect. Strictly speaking this is not consistent hashing, because the key-to-slot mapping never changes, but it delivers the same benefit: adding a new node means moving only some of the slots, not rehashing the entire keyspace.

Redis Cluster splits the 16,384 hash slots across primary nodes. The client library computes the slot locally and connects directly to the right node — no central router hop. Each primary has at least one replica for failover. When you add a new node, only a proportional subset of slots moves; everything else stays in place.
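The slot calculation a client performs is small enough to write out. The sketch below implements it as the Redis Cluster specification describes it, to the best of my reading: a CRC16 (the XMODEM variant, polynomial 0x1021, initial value 0) over the key, taken modulo 16384. It also handles hash tags, a feature of the specification not covered above: if the key contains a non-empty {...} section, only that section is hashed, which lets related keys share a slot. The function names are illustrative.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 with polynomial 0x1021 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster slots."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # ignore an empty "{}"
            raw = raw[start + 1:end]        # hash only the tag
    return crc16_xmodem(raw) % 16384


if __name__ == "__main__":
    for key in ("user:42:profile", "{user:42}:profile", "{user:42}:cart"):
        print(key, "->", hash_slot(key))
```

The two keys tagged with {user:42} land in the same slot, which is what makes multi-key commands on them legal in cluster mode. Real client libraries use a table-driven CRC for speed; the bitwise loop here is chosen for readability.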

Reducing Round-Trips: Pipelining and Lua Scripts

The biggest cost of a distributed cache is the network: 0.5–2 ms per round-trip. If your request handler makes 20 independent Redis calls, that's 10–40 ms of pure network overhead before the first byte of your business logic runs. Two Redis features cut this dramatically:

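Pipelining: the client writes several commands to the socket without waiting for each reply, then reads all the replies together, so the whole batch costs about one round-trip. (MULTI/EXEC is a different feature: it makes a group of commands execute atomically, and unless it is itself pipelined, every queued command still costs its own round-trip.)

-- Without pipelining: 3 round-trips (3 × 1ms = ~3ms)
GET user:42:profile
GET user:42:cart
GET user:42:prefs
-- With pipelining: the client library sends the same three commands in one
-- batch and reads the three replies together
-- Cost: ~1ms instead of ~3ms
-- For plain string reads, MGET fetches several keys in a single command:
MGET user:42:profile user:42:cart user:42:prefs
-- Server replies with an array of 3 values in a single response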

Lua scripts — run arbitrary logic on the Redis server, eliminating the network round-trip for the "read → decide → write" pattern. For example: "if the rate-limit counter for this IP is under 100, increment and return 'allowed'; otherwise return 'blocked'" — in Lua, this is one atomic server-side operation. In application code, it would be a GET, then a conditional SET — two round-trips with a potential race between them.

Failure Modes: Hot-Key Meltdown and Network Partitions

Hot-key meltdown: In Redis Cluster, a single key like product:homepage-banner might receive 100,000 requests per second — but it always hashes to the same slot, meaning the same single primary node handles every one. That node becomes a bottleneck while other nodes are idle. Fixes: local in-process caching for hyper-hot keys (L-shape strategy), key sharding by appending a random suffix and reading one at random, or using Redis's read replicas with READONLY commands to distribute load.

Network partition & split-brain: If a Redis primary becomes unreachable, Redis Sentinel or Cluster will promote a replica. But if the replica was slightly behind (async replication lag), recently-written data is lost. This is the CAP theorem in action: Redis opts for availability (continue serving reads/writes) over consistency (guarantee every read sees the latest write). For cache data this is usually acceptable — a stale value is just a temporary miss penalty. For rate limiters or distributed locks, it's a real correctness concern that requires stronger patterns (Redlock, or switching to a CP system like ZooKeeper).

Distributed caches (Redis, Memcached) solve the cross-instance sharing problem at the cost of a 0.5–2 ms network round-trip per operation. Redis wins most decisions due to rich data structures, Lua scripting, and built-in clustering. Redis Cluster shards via 16,384 hash slots; clients calculate the target node locally for zero-hop routing. Pipelining and Lua scripts cut the round-trip count for multi-operation patterns. Key failure modes are hot-key bottlenecks (fix: L-shape caching, key sharding) and replication lag during failover (fix: accept stale cache, use stronger patterns only for non-cache state).

Section 11

Layer 5 Deep Dive — Database Buffer Pool (Postgres shared_buffers, InnoDB)

Every database that reads data from disk would be unbearably slow if it had to do an actual disk read for every row you touch. The database's solution is exactly the same as the OS page cache — keep recently-accessed data in RAM — but smarter: instead of caching generic file bytes, the database caches structured pages (8 KB blocks for Postgres, 16 KB for InnoDB) that it understands as B-tree nodes, heap tuples, or index entries. Because the database knows what these pages contain, it can make better eviction decisions than a generic kernel cache.

Postgres: shared_buffers and the Double-Buffering Dance

Postgres manages its own in-memory cache of database pages, called the shared buffer pool, configured by shared_buffers. The conventional starting point is 25% of total RAM — so on a 64 GB DB server, set shared_buffers = 16GB. Why only 25%? Because Postgres reads its data files through the OS page cache (it does not use O_DIRECT by default), and the OS page cache can beneficially cache the same pages again in the remaining 75% of RAM. This double-caching is technically redundant but provides a practical buffer: if shared_buffers is cold after a restart, the OS page cache (which the kernel persistently maintains across Postgres restarts) may already have hot pages. Postgres will re-populate shared_buffers from the OS cache at ~100 ns per page, not from disk at 0.1–8 ms per page.

# postgresql.conf: shared_buffers sizing
shared_buffers = 16GB          # 25% of total RAM (e.g., 64GB server)
# Why 25%? The OS page cache uses the rest; it leaves headroom against OOM; long-standing rule of thumb
# For analytics / data warehouse workloads: can go to 40%, less page cache needed
# effective_cache_size tells the query planner how much RAM (shared_buffers + OS cache)
# it can assume is available for caching. Does NOT allocate memory; just a hint.
effective_cache_size = 48GB    # typically 75% of RAM on dedicated DB servers

-- Inspect the buffer pool live (requires: CREATE EXTENSION pg_buffercache;)
SELECT c.relname,
       count(*) AS buffers,
       round(100.0 * count(*) / (SELECT count(*) FROM pg_buffercache), 2) AS pct_of_bufpool
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
-- restrict to the current database (and shared catalogs) so relation names resolve correctly
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;
-- Shows you which tables/indexes are consuming the most shared_buffers space

Postgres Eviction: Clock-Sweep

When shared_buffers is full and Postgres needs to load a new page, it has to throw something out to make room. How does it decide? Imagine the buffer pool as a circular table of slots, and a single hand pointing at one slot — like a clock. Each slot has a small counter that goes up every time something touches that page. The hand walks slowly around the ring, and at each stop it decrements the counter by one. When a counter reaches zero, that slot is the loser and gets evicted. This sweeping clock-hand approach is called the clock-sweep algorithm, and the reason Postgres uses it is mostly about avoiding lock contention: a strict "least recently used" list would require every cache hit to update a global linked list, which becomes a hotspot under high concurrency. Clock-sweep approximates LRU using only local counter increments, no shared list, no global lock.
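The sweep is easier to follow as code. Below is a toy model of the idea, not Postgres's implementation: it keeps a usage counter per slot (capped at 5, the cap Postgres uses), bumps the counter on every hit, and on a miss walks the hand around the ring, decrementing counters until it finds one at zero. Real Postgres also skips pinned buffers and does this with atomic operations; the class and method names here are illustrative.

```python
MAX_USAGE = 5  # per-buffer usage count cap


class ClockSweepPool:
    def __init__(self, n_slots: int):
        self.pages = [None] * n_slots  # page id held by each slot
        self.usage = [0] * n_slots     # usage counter per slot
        self.index = {}                # page id -> slot number
        self.hand = 0                  # position of the clock hand

    def access(self, page_id) -> bool:
        """Return True on a hit, False on a miss (page is then loaded)."""
        slot = self.index.get(page_id)
        if slot is not None:
            # Hit: only this slot's counter changes, no shared list to update.
            self.usage[slot] = min(self.usage[slot] + 1, MAX_USAGE)
            return True
        slot = self._find_victim()
        evicted = self.pages[slot]
        if evicted is not None:
            del self.index[evicted]
        self.pages[slot] = page_id
        self.usage[slot] = 1
        self.index[page_id] = slot
        return False

    def _find_victim(self) -> int:
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.pages[slot] is None or self.usage[slot] == 0:
                return slot
            self.usage[slot] -= 1  # give the page one fewer "life"
```

A page that has been touched five times survives up to five passes of the hand, while a page touched once is gone after one or two. That is the LRU approximation: frequently used pages accumulate lives, and the hit path never takes a global lock.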

InnoDB: Buffer Pool + O_DIRECT

MySQL's InnoDB storage engine takes a more aggressive approach to memory management: allocate a large fraction of RAM to the InnoDB buffer pool and use O_DIRECT to bypass the OS page cache entirely. The motivation is clear: if you're giving InnoDB 50 GB of buffer pool on a 64 GB server, there's no point having the OS also cache those same pages — you'd have no RAM left for anything else.

# my.cnf / my.ini: InnoDB buffer pool sizing
[mysqld]
innodb_buffer_pool_size = 40G
# Typically 60–80% of total RAM on dedicated DB servers
# OS needs ~4-8GB for its own use; leave some headroom
innodb_flush_method = O_DIRECT
# O_DIRECT means no OS page cache for data files
innodb_buffer_pool_instances = 8
# Split pool into 8 sub-pools, each with its own mutex
# Reduces lock contention at high concurrency
# Rule of thumb: keep each instance at 1 GB or larger
innodb_buffer_pool_dump_at_shutdown = ON
# Save list of hot pages to disk on shutdown
innodb_buffer_pool_load_at_startup = ON
# Reload them in the background on startup so the cache warms quickly
# Without this, the pool starts cold after every restart

-- Check buffer pool hit ratio live (run in the mysql client, not in my.cnf):
SHOW ENGINE INNODB STATUS\G
-- Look for: "Buffer pool hit rate NNN / 1000" → should be ≥ 997/1000 (99.7%+)

InnoDB Midpoint Insertion: Protecting Hot Pages From Scans

InnoDB has a clever twist that directly addresses the scan-pollution problem we saw in the OS page cache. The buffer pool is still arranged as a list with "hot" at the head and "evict next" at the tail, but instead of dropping new pages straight at the head, where they'd push the truly hot pages closer to eviction, InnoDB drops them partway down the list. That partway-down spot is called the midpoint (by default, about 3/8 of the way up from the tail). New pages enter here. A page only graduates to the front of the list (the "young" or hot sublist) if it gets touched again later; with the default innodb_old_blocks_time of 1000 ms, accesses within the first second after the page was read do not count, so a scan that touches many rows of one page in quick succession does not promote it. Pages that are only touched in that initial burst drift through the back half (the "old" sublist) and get evicted from the tail without ever disturbing the truly hot data.

The key insight: a full-table scan floods new pages in at the midpoint. If they are only read once (typical for a scan), they never get promoted to the young list and are quickly evicted without displacing the truly hot pages that live at the head. The hot OLTP pages survive the scan untouched.

InnoDB's scan-resistant LRU. New pages enter at the midpoint (boundary between young and old sublists). If a page is accessed again after its initial burst of reads, it gets promoted to the head of the young list and survives for a long time. Scan pages that are never re-accessed drift toward the tail of the old sublist and are evicted first, never displacing truly hot pages.
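A toy model makes the scan resistance concrete. The sketch below keeps two ordered maps, one per sublist, inserts new pages at the head of the old sublist, and promotes a page to the young sublist on its second touch. It is a simplification under stated assumptions: it ignores the time window that real InnoDB applies before a second touch counts, it only enforces the 3/8 split when the young sublist overflows, and the class name is illustrative.

```python
from collections import OrderedDict


class MidpointLRU:
    def __init__(self, capacity: int, old_fraction: float = 3 / 8):
        self.capacity = capacity
        self.old_capacity = max(1, int(capacity * old_fraction))
        self.young = OrderedDict()  # first = coldest young page, last = hottest
        self.old = OrderedDict()    # first = tail (evict next), last = midpoint

    def access(self, page_id) -> bool:
        """Return True on a hit, False on a miss (page is then inserted)."""
        if page_id in self.young:
            self.young.move_to_end(page_id)
            return True
        if page_id in self.old:
            del self.old[page_id]       # second touch: promote to young
            self.young[page_id] = True
            self._rebalance()
            return True
        self.old[page_id] = True        # miss: insert at the midpoint
        self._rebalance()
        return False

    def _rebalance(self) -> None:
        young_capacity = self.capacity - self.old_capacity
        while len(self.young) > young_capacity:
            demoted, _ = self.young.popitem(last=False)
            self.old[demoted] = True    # coldest young page re-enters at midpoint
        while len(self.young) + len(self.old) > self.capacity:
            self.old.popitem(last=False)  # evict from the tail of the old sublist


if __name__ == "__main__":
    pool = MidpointLRU(capacity=8)
    for page in ("h1", "h2", "h3", "h4"):
        pool.access(page)
        pool.access(page)               # second touch promotes to young
    for i in range(100):
        pool.access(f"scan-{i}")        # one-shot scan pages
    print("young after scan:", list(pool.young))
```

After the 100-page scan the young sublist still holds the four hot pages, because scan pages only ever cycle through the old sublist. With a plain LRU of the same capacity, all four would have been evicted after the first eight scan pages.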

Hit Ratio: The #1 Database Performance Metric

The buffer pool hit ratio tells you what percentage of page reads are served from RAM vs disk. This single number summarizes your database's cache efficiency better than any other metric. A rough guide:

≥ 99.5% (InnoDB ≥ 997/1000) — excellent. Almost all reads from RAM. Very few disk reads.
98–99.5% — good for a working set that's 90%+ hot. Occasional cold reads.
90–98% — 2–10% of reads are disk reads. Investigate: working set larger than buffer pool? Scan pollution? Need to increase buffer pool or add caching above the DB.
< 90% — serious I/O pressure. Page queries will be slow. The buffer pool is significantly undersized for the working set, or a batch scan is polluting it.

MongoDB WiredTiger cache: MongoDB uses the WiredTiger storage engine, which has its own internal cache configured by storage.wiredTiger.engineConfig.cacheSizeGB in mongod.conf (or the --wiredTigerCacheSizeGB flag). The default is the larger of 50% of (RAM − 1 GB) and 256 MB. Unlike InnoDB with O_DIRECT, WiredTiger does not bypass the OS page cache: the internal cache holds uncompressed pages, while the filesystem cache underneath holds the same data in its compressed on-disk form, so leave RAM free for the filesystem cache rather than giving it all to WiredTiger. Check hit ratio via: db.serverStatus().wiredTiger.cache["pages read into cache"] vs "pages requested from the cache".
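To see why the hit-ratio bands above are so narrow, it helps to turn a ratio into an average read latency. The sketch below does that arithmetic. The latency figures are assumptions picked from the ranges used earlier in this guide (about 100 ns for a RAM hit, 1 ms for a disk read), and the helper names are illustrative. The ratio itself comes from two counters, such as InnoDB's Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads, or the two WiredTiger statistics just mentioned.

```python
def hit_ratio(requests: int, misses: int) -> float:
    """Fraction of page requests served from RAM.

    requests: logical page reads (e.g. Innodb_buffer_pool_read_requests)
    misses:   reads that had to go to disk (e.g. Innodb_buffer_pool_reads)
    """
    if requests == 0:
        return 1.0
    return 1.0 - misses / requests


def effective_read_latency_us(ratio: float,
                              ram_us: float = 0.1,
                              disk_us: float = 1000.0) -> float:
    """Average latency per page read, in microseconds (assumed latencies)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return ratio * ram_us + (1.0 - ratio) * disk_us


if __name__ == "__main__":
    for ratio in (0.999, 0.995, 0.98, 0.90):
        print(f"{ratio:.1%} hit ratio -> "
              f"{effective_read_latency_us(ratio):.1f} µs per page read")
```

With these assumed latencies, dropping from 99.5% to 90% raises the average page read from about 5 µs to about 100 µs, roughly a 20× slowdown for a change that looks small on a dashboard. The disk term dominates almost immediately, which is why the useful targets all sit above 99%.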

The database buffer pool is the DB's internal RAM cache for structured data pages. Postgres allocates ~25% of RAM to shared_buffers and layers the OS page cache underneath (double-buffering). InnoDB allocates 60–80% to innodb_buffer_pool_size and bypasses the OS page cache via O_DIRECT. Both use scan-resistant eviction to protect hot OLTP data from analytical scan pollution. Buffer pool hit ratio ≥ 99.5% is the target; below 90% means I/O is the bottleneck. Tools: pg_buffercache for Postgres, SHOW ENGINE INNODB STATUS for MySQL.

Section 12

Layer 6 Deep Dive — CDN Edge Caches (Geographic, Shared, Tiered)

Every layer we've covered so far lives inside your data center — on the CPU die, in the kernel, in the JVM, in a Redis cluster, or in the database server. The CDN edge cache is different: it lives outside your data center entirely, in hundreds of small server rooms scattered across the world's cities, each one close to the users in that region. CDN providers call each of these little server rooms a Point of Presence, or PoP for short — just a name for "a CDN data center near users." Its job is to serve your content before a user's request ever reaches your origin servers. For a user in Singapore, that means the response travels from a CDN node in Singapore (~5 ms round-trip) instead of from your US-east origin (~200 ms). The CDN is the one cache that reduces latency not by making your server faster, but by moving the response physically closer to the person asking for it.

What a CDN Can (and Cannot) Cache

The conventional wisdom is "CDNs cache static files." The more accurate picture in 2024 is "CDNs cache anything with a cache key and a reasonable TTL." The canonical categories, from easiest to hardest:

Static assets — JavaScript bundles, CSS, images, fonts, video segments. Cache-Control: immutable, max-age=31536000 (one year) using content-hash URLs. Hit ratios of 99%+ are achievable. These are so static that the only invalidation trigger is deploying a new build with a new content hash.
Shared HTML fragments — product pages, blog posts, marketing pages. Cache with a moderate TTL (60 s–5 min) and use surrogate keys (Cloudflare Cache-Tags, Fastly Surrogate-Key header) to invalidate all related pages instantly when content changes. Hit ratios of 80–95%.
API responses — increasingly common. Product listings, search results, pricing queries — all can be cached at the CDN edge when they are not user-personalized. Use Cache-Control: public, s-maxage=30 (cache at CDN for 30 seconds, revalidate). Hit ratios of 50–80% for popular queries.
Personalized content — responses that contain user-specific data (a user's cart, their name in the header) must not be cached at shared CDN edge nodes. The fix is to split pages into a cacheable shell + dynamic user-specific fragment loaded client-side (Edge Side Includes, or a separate API call after initial render).
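One way to keep these four categories straight in application code is a single lookup from content class to response headers. This is a sketch under assumptions: the enum, the function name, and the 300-second TTL for shared HTML are illustrative choices, and "private, no-store" is one conservative option for personalized responses rather than the only valid one.

```python
from enum import Enum


class ContentClass(Enum):
    STATIC_ASSET = "static_asset"    # content-hashed JS/CSS/images
    SHARED_HTML = "shared_html"      # product pages, blog posts
    PUBLIC_API = "public_api"        # non-personalized API responses
    PERSONALIZED = "personalized"    # carts, account data


# Assumed TTLs, taken from the ranges in the list above.
CACHE_CONTROL = {
    ContentClass.STATIC_ASSET: "public, max-age=31536000, immutable",
    ContentClass.SHARED_HTML: "public, s-maxage=300",
    ContentClass.PUBLIC_API: "public, s-maxage=30",
    ContentClass.PERSONALIZED: "private, no-store",
}


def cache_headers(content_class, surrogate_keys=()):
    """Build response headers for a content class.

    Surrogate keys are emitted Fastly-style (space-separated) and are
    never attached to personalized responses, which must not be shared.
    """
    headers = {"Cache-Control": CACHE_CONTROL[content_class]}
    if surrogate_keys and content_class is not ContentClass.PERSONALIZED:
        headers["Surrogate-Key"] = " ".join(surrogate_keys)
    return headers


print(cache_headers(ContentClass.SHARED_HTML, ["product-42", "category-electronics"]))
```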

Cache Key: What Makes a CDN Response Unique

A CDN identifies a cacheable response by its cache key, which by default is the URL including the query string. Any two requests that map to the same cache key are served the same cached response. This is why cache keys need careful design:

The Vary header trap: If your origin sets Vary: Accept-Encoding, the CDN correctly caches separate versions for gzip and brotli — fine. But Vary: Cookie or Vary: Authorization effectively disables CDN caching for every logged-in user (every unique cookie = unique cache key = no sharing). Audit your Vary headers. The correct pattern for authenticated content is: never mark the main response cacheable; instead, cache the public shell and load personalization via a separate, non-CDN-cached API request.
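The effect of Vary on sharing is easy to see with a toy cache-key function. The sketch below is not any CDN's actual algorithm: it assumes the key is the host, path, and sorted query string, followed by the value of each header named in Vary. Sorting the query parameters is an extra normalization step that some CDNs offer as an option.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def cache_key(url, request_headers, vary=()):
    """Toy cache key: normalized URL plus the values of the Vary'd headers."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 share one entry.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    key = f"{parts.netloc.lower()}{parts.path}?{query}"
    lowered = {name.lower(): value for name, value in request_headers.items()}
    for name in sorted(h.lower() for h in vary):
        key += f"|{name}={lowered.get(name, '')}"
    return key


url = "https://shop.example/products?page=2&category=tv"
alice = {"Cookie": "session=alice", "Accept-Encoding": "br"}
bob = {"Cookie": "session=bob", "Accept-Encoding": "br"}

# Vary: Accept-Encoding -> both users share one cached entry.
print(cache_key(url, alice, vary=("Accept-Encoding",)) == cache_key(url, bob, vary=("Accept-Encoding",)))
# Vary: Cookie -> every session gets its own entry, so nothing is shared.
print(cache_key(url, alice, vary=("Cookie",)) == cache_key(url, bob, vary=("Cookie",)))
```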

Tiered Caching: Shield Nodes Cut Origin Load

In a basic CDN setup, every edge PoP has its own cache. A cache miss at the Singapore PoP goes directly to your origin, and so does a miss at the Tokyo PoP or the Sydney PoP. If you have 300 PoPs and a CDN miss rate of 20%, all of those misses land on your origin, and many of them are duplicates: each PoP has to fetch every object for itself, so a newly published or newly purged object can trigger up to 300 origin fetches per TTL window where a single shared cache would need only one. Origin fill traffic is amplified by up to 300× relative to a single-PoP scenario.

The solution is CDN shielding (also called "origin shield" or "tiered caching"): add a mid-tier cache layer between the edge PoPs and your origin. The Singapore, Tokyo, and Sydney edge nodes all forward misses to a single regional shield node (say, Tokyo shield), which itself caches the response. Your origin only sees one request even if three edge PoPs missed simultaneously for the same URL. The shield absorbs the long-tail miss traffic; your origin sees only the cache misses the shield itself couldn't serve.
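The amplification and the effect of a shield can be reproduced with a few lines of simulation. This is a deliberately simplified model: caches never expire, requests arrive one after another (so request collapsing for truly simultaneous misses is not modeled), and all class and function names are made up for the example.

```python
class Origin:
    """Counts how many times the origin actually has to produce a response."""

    def __init__(self):
        self.fetches = 0

    def get(self, url):
        self.fetches += 1
        return f"body of {url}"


class CacheNode:
    """An edge PoP or a shield: serve from the local store, else ask upstream."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.store = {}

    def get(self, url):
        if url not in self.store:
            self.store[url] = self.upstream.get(url)
        return self.store[url]


def cold_fetches(num_pops, use_shield):
    """Origin fetches needed for one cold URL requested once at every PoP."""
    origin = Origin()
    upstream = CacheNode(origin) if use_shield else origin
    pops = [CacheNode(upstream) for _ in range(num_pops)]
    for pop in pops:
        pop.get("/products/42")
    return origin.fetches


print(cold_fetches(300, use_shield=False))  # every PoP fills from the origin
print(cold_fetches(300, use_shield=True))   # only the shield fills from the origin
```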

The full six-layer request path. A CDN edge hit costs 5–30 ms total and is the default happy path for static and shared content. A full cold miss through all layers to disk costs 100–600 ms. Crucially, the CDN and the origin-side caches (Redis, buffer pool) are complementary: the CDN serves shared public content; Redis serves user-specific computed state; the buffer pool absorbs DB page reads. No single layer replaces the others.

CDN in the Six-Layer Context: What It Does That No Other Layer Can

The key insight that ties this section to the full hierarchy: the CDN is the only layer that operates on the user's side of the network. Every other cache layer (in-process, Redis, buffer pool) is co-located with your application — they reduce work on your servers, but they do nothing about the physics of data traveling across the Atlantic or Pacific. The CDN collapses the geographic distance problem. It doesn't make your app smarter; it makes the content physically closer.

This also defines exactly what the CDN is bad at: anything that requires real-time user-specific computation. A user's shopping cart, their personalized recommendations, their account balance: these need data from your database that only your application can assemble. The CDN can serve the shell of the page in 10 ms; Redis and the buffer pool assemble the personalized data in another 2–3 ms of server-side time; for a user close to the origin, together they give you a total page experience under 20 ms without the CDN ever touching the private data. A distant user still pays the network round trip to the origin for the personalized fragment, but sees the cached shell immediately.

Invalidation: The Hard Problem at the CDN Layer

The most dangerous CDN failure mode isn't a miss — it's serving a stale hit. If a product price changes and the CDN is still serving the old price from a 5-minute cached response, customers see incorrect prices. CDNs solve this with two mechanisms: TTL expiry (wait for the cache to age out naturally) and instant purge. Purge needs a clever trick, because a single product change might affect dozens of cached pages: the product page itself, the category page, the homepage, the search results. Listing every URL to purge by hand is tedious and error-prone. Instead, CDNs let you attach labels to each cached response — a product page can be labelled "product-42, category-electronics, homepage-featured" — and then you purge by label, not by URL. These labels are called surrogate keys (called Cache-Tags on Cloudflare, Surrogate-Key on Fastly). One API call ("purge everything tagged product-42") clears every cached response touching that product, across every PoP, in a few hundred milliseconds.
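Purge-by-label comes down to a reverse index from tag to cached URLs. The sketch below models that idea for a single in-memory cache. It is an illustration of the data structure, not the API of Cloudflare or Fastly, and the class and method names are hypothetical.

```python
from collections import defaultdict


class TaggedCache:
    """Toy response cache that supports purging by surrogate key (tag)."""

    def __init__(self):
        self._responses = {}                   # url -> body
        self._urls_by_tag = defaultdict(set)   # tag -> urls carrying that tag

    def put(self, url, body, tags):
        self._responses[url] = body
        for tag in tags:
            self._urls_by_tag[tag].add(url)

    def get(self, url):
        return self._responses.get(url)

    def purge_tag(self, tag):
        """Remove every cached response labelled with `tag`; return the count."""
        purged = 0
        for url in self._urls_by_tag.pop(tag, set()):
            if self._responses.pop(url, None) is not None:
                purged += 1
        return purged


cache = TaggedCache()
cache.put("/products/42", "<html>tv</html>", ["product-42", "category-electronics"])
cache.put("/category/electronics", "<html>list</html>", ["product-42", "category-electronics"])
cache.put("/products/7", "<html>radio</html>", ["product-7", "category-electronics"])

# One call clears both pages that mention product 42; product 7 stays cached.
print(cache.purge_tag("product-42"))
```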

The stale-while-revalidate pattern: Cache-Control: max-age=60, stale-while-revalidate=30 instructs the CDN to serve the cached response (even if it's 60–90 seconds old) while asynchronously refetching the latest version in the background. The user always gets an instant response; the cache is refreshed in the background. This is the pattern that lets CDNs achieve both low latency and reasonable freshness simultaneously, at the cost of accepting a brief window of stale data. Most modern CDNs (Cloudflare, Fastly, CloudFront) support it.
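The three states a cached response passes through under this header can be written as a small decision function. This is a sketch of the freshness rules only (no actual refetching), and the function name and return labels are made up for the example.

```python
def freshness(age_s: float, max_age_s: float, swr_s: float) -> str:
    """Classify a cached response by its age in seconds.

    fresh                  -> serve from cache, no origin contact
    stale-while-revalidate -> serve from cache now, refetch in the background
    expired                -> must fetch from origin before responding
    """
    if age_s < max_age_s:
        return "fresh"
    if age_s < max_age_s + swr_s:
        return "stale-while-revalidate"
    return "expired"


# Cache-Control: max-age=60, stale-while-revalidate=30
for age in (10, 75, 120):
    print(age, freshness(age, max_age_s=60, swr_s=30))
```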

CDN edge caches are the outermost cache layer — physically distributed near users, serving responses 5–30 ms from the browser without touching your origin. They cache static assets (near-perfect hit ratios), shared HTML (80–95%), and increasingly API responses. Tiered caching (edge → shield → origin) prevents origin amplification. Cache keys (URL + Vary headers) must be designed carefully — Vary: Cookie disables sharing. Surrogate keys enable instant bulk invalidation. In the six-layer stack, the CDN is the only layer that solves geographic latency; all other layers reduce server-side work. CDN + Redis + buffer pool are complementary: CDN serves shared static content; Redis serves cross-instance dynamic state; the buffer pool absorbs DB reads.

Section 13

Choosing the Right Layer for Each Piece of Data

The six cache layers exist simultaneously in every production system — but that doesn't mean every piece of data belongs in every layer. Putting data in the wrong layer wastes memory, introduces stale-data bugs, or misses the latency benefit you were hoping for. The decision isn't hard once you understand the three questions that drive it: Who shares this data? How often does it change? How large is it?

The Three Axes That Drive Layer Selection

Before picking a layer, ask these three questions about the data you want to cache:

Scope — who needs it? Data that every server in every region needs (public homepage HTML) is a CDN candidate. Data that only the current server process needs (a parsed config file) belongs in an in-process cache. Data that all servers in one region share (a logged-in user's session) belongs in a distributed cache like Redis.
Mutability — how often does it change? Static assets that change only on deploy → CDN with long TTLs. Price data updated every few minutes → distributed cache with short TTL. A user's shopping cart, mutated on every click → distributed cache with write-through on every change. A CPU-computed result like a cryptographic hash → in-process cache with a bounded capacity.
Size — how much memory does it occupy? A parsed JWT token is a few hundred bytes and belongs in-process. Gigabytes of product catalog belong in Redis, not in every app-server heap. Video files belong in the CDN, not in Redis.

The decision tree above walks you through the dominant question at each branch. A public homepage with the same HTML for all users? CDN, because geographic proximity gives you the biggest latency win. A user's shopping cart? Redis, because multiple app servers need to read it — you can't keep it in one server's heap. A single server's parsed YAML config? In-process map, because every network hop is wasted cost for data only that process needs.

The matrix above is worth internalizing. Notice the pattern: anything user-specific and mutable goes to Redis, because you need it shared across servers but it must not be cached at CDN (which would serve your cart to someone else). Anything DB-internal goes to the buffer pool — that layer is managed automatically by Postgres or MySQL, and you only tune it by configuring memory allocation. Financial records don't live in any cache as their primary store — durability requirements mean the disk is the record of truth, and any cache layer holding a financial value is secondary at best.

Common Placement Mistakes and Why They Happen

The most frequent mistake teams make is reaching for Redis reflexively, for everything. Config values that never change and are only read by one service end up in Redis even though an in-process map would be thousands of times faster (roughly 100 ns for a local lookup versus ~0.5 ms for a Redis round-trip) and involve no network hop. The motivation is understandable: Redis is already in the stack, so it feels "easy." But every Redis GET is a network round-trip (~0.5 ms), and if you're calling it 50 times per request for config lookups, you've added 25 ms of latency for data that could live in a local map refreshed every 60 seconds.

The second common mistake is caching per-user data at the CDN level. CDN caching is fundamentally designed for shared content — the same response, delivered to many users. The moment you cache a response that varies by user identity, you risk the CDN serving one user's data to another. Some CDN providers support Vary headers or user-aware cache segments, but these are complex to configure correctly. The safer default: anything with user-specific data should bypass CDN caching entirely and go to Redis or the app layer.

Quick mental model: Ask "If two users request this at the same time, should they get the same response?" If yes → CDN is a candidate. If no → CDN is dangerous. The same test applies going deeper: "If two app servers handle this user's requests, do they need the same data?" If yes → Redis. If no → in-process cache is fine.

Choose a cache layer by asking three questions: who shares the data (scope), how often it changes (mutability), and how large it is (size). Public, rarely-changing content belongs at CDN or browser. Per-user shared state belongs in Redis. Single-server computed results belong in-process. Raw database pages belong in the DB buffer pool. Mixing these up — especially caching user-specific data at CDN or over-using Redis for local config — introduces stale data bugs and wasted latency.

Section 14

Cross-Layer Invalidation — The Hardest Problem in Multi-Level Caching

Here is the uncomfortable truth about multi-level caching: adding a cache layer doesn't just make reads faster, it creates a new copy of your data. And the moment you have multiple copies, you have a consistency problem — when the real data changes, who tells the copies? This is why invalidation is often called the hardest problem in caching. Phil Karlton's famous quip — "There are only two hard things in computer science: cache invalidation and naming things" — is funny because it's deeply true. Let's understand why, and what to do about it.

The Layers That Take Care of Themselves

Good news first: the lowest two layers — CPU cache and OS page cache — handle their own consistency automatically, and you never need to think about them.

CPU caches use a hardware protocol called MESI (Modified, Exclusive, Shared, Invalid). When one CPU core writes to a memory address, the MESI protocol automatically broadcasts an invalidation signal to other cores that hold a cached copy of that address. This happens at the hardware level, in nanoseconds, without any software involvement. You never write application code to manage CPU cache coherence.

The OS page cache uses write-back semantics managed by the kernel. When your database writes to a file, the kernel updates the page cache immediately — so any subsequent read on the same machine sees fresh data. Across machines (NFS), page cache coherence is more complex, but for a typical single-node database setup, you don't manage this yourself.

The Layers Where You're Responsible

The layers that your application code manages — in-process cache, distributed cache (Redis/Memcached), and CDN — do not have any automatic invalidation mechanism. When the source data changes, nothing tells them automatically. You have to build that mechanism yourself. This is where bugs live.

Three strategies exist for keeping application-managed cache layers consistent with the source of truth:

Write-through: when the app writes to the database, it also writes the new value to the cache at the same time, synchronously. The cache is always current. Cost: every write hits both DB and cache, so write latency is the sum of the two and the write path depends on two systems being available.
Write-behind (write-back): the app writes only to the cache immediately, and a background job flushes to the database later. Lowest write latency; risk of data loss if the cache crashes before the flush. More on this in Section 15.
Explicit invalidation (delete on write): the app writes to the database, then deletes (or marks stale) the cache entry. The next read is a cache miss and repopulates from the DB. Simple and safe, but causes a brief miss storm after writes. This is the most common pattern in practice.

The diagram above shows what a correctly implemented cascading invalidation looks like: a single write event fans out to every layer that might hold a copy. The red "forgotten layer" on the right is the classic production bug. One layer is missed — usually the in-process cache on individual app servers — and it serves the old price for as long as its TTL lasts. Users on servers that got the invalidation see $25; users on servers that didn't see $29. Both prices are "real" at different layers of the stack, which is worse than just being consistently wrong.

Three Invalidation Patterns in Practice

1. Pub/Sub Fan-Out (Redis PUBLISH)

The most common way to keep every server's in-process cache fresh in a multi-server fleet is to use Redis as a tiny one-way notification service: one server shouts "the price for product 42 changed!" and every other server hears it and clears the affected key. This shout-and-listen pattern is called publish/subscribe, or pub/sub for short, and "fanning out" just means the one shout reaches many listeners. When a server writes a new value, it publishes an invalidation message to a shared channel. All servers — including itself — subscribe to that channel and evict the affected key from their local in-process cache when they receive the message.

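    # Assumes: redis is a redis-py client created with decode_responses=True;
    # db and local_cache are application objects.

    # Server A: handles a price update request
    db.execute("UPDATE products SET price = 25 WHERE id = 42")
    redis.delete("product:42")                        # evict from distributed cache
    redis.publish("cache:invalidate", "product:42")   # broadcast to all servers

    # Every server (including A): background subscriber thread
    pubsub = redis.pubsub()
    pubsub.subscribe("cache:invalidate")
    for message in pubsub.listen():
        if message["type"] == "message":              # skip subscribe confirmations
            local_cache.invalidate(message["data"])   # evict from own in-process cache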

The advantage: all servers learn about the invalidation in milliseconds. The limitation: Redis pub/sub is fire-and-forget — if a server is temporarily disconnected, it misses the message and its in-process cache stays stale until the TTL expires. For high-stakes data (prices, inventory), you pair pub/sub with a short TTL as a fallback safety net.
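The TTL safety net can be sketched as a small in-process cache where every entry expires on its own even if no invalidation message ever arrives. This is a minimal illustration with hypothetical names. It is unbounded and not thread-safe, so a production service would more likely use a library cache with size limits.

```python
import time


class TTLCache:
    """In-process cache whose entries expire after ttl_seconds, message or not."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour is testable
        self._entries = {}           # key -> (value, expires_at)

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # fallback path: the TTL bounds staleness
            return None
        return value

    def invalidate(self, key):
        """Fast path: called by the pub/sub subscriber when a message arrives."""
        self._entries.pop(key, None)


now = [0.0]                                   # fake clock for the demonstration
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("product:42", 29)
now[0] = 59.0
print(cache.get("product:42"))                # still served: message was missed
now[0] = 61.0
print(cache.get("product:42"))                # expired: stale window is capped at 60 s
```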

2. CDC-Driven Cascading Invalidation

The pub/sub approach above only catches writes that go through your application. A migration script, an admin running a manual SQL update, or a different microservice writing to the same DB will all bypass the invalidation. A safer trick: have something watch the database itself and emit a notification for every row that changes, no matter who changed it. Databases already keep an internal log of every change for replication purposes (Postgres calls it the WAL, MySQL calls it the binlog); a tool can tap into that log and turn each row change into a message. This watch-the-log pattern is called Change Data Capture, or CDC. Tools like Debezium implement it for most databases. A separate invalidation service consumes the CDC stream and clears the appropriate keys in Redis and CDN. The benefit is that all writes are caught. The cost is operational complexity — you're adding Kafka or a similar message broker to the architecture.
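The core of the invalidation service is a mapping from a row-change event to the cache keys that row feeds. The sketch below assumes a simplified Debezium-style event envelope (op, source.table, before, after) already parsed into a dict, and a hypothetical table-to-key-prefix mapping. Consuming from Kafka and calling Redis or the CDN purge API are left out.

```python
# Hypothetical mapping from table name to cache key prefix.
KEY_PREFIX_BY_TABLE = {"products": "product", "users": "user"}


def keys_to_invalidate(event: dict) -> list[str]:
    """Return the cache keys affected by one row-change event."""
    if event.get("op") not in ("c", "u", "d"):    # create, update, delete
        return []                                 # ignore snapshot reads etc.
    table = event["source"]["table"]
    prefix = KEY_PREFIX_BY_TABLE.get(table)
    # Deletes carry no "after" image, so fall back to "before".
    row = event.get("after") or event.get("before") or {}
    if prefix is None or "id" not in row:
        return []
    return [f"{prefix}:{row['id']}"]


update_event = {
    "op": "u",
    "source": {"table": "products"},
    "before": {"id": 42, "price": 29},
    "after": {"id": 42, "price": 25},
}
print(keys_to_invalidate(update_event))
```

Because the event comes from the database log, the same function fires for a migration script or a manual SQL update exactly as it does for an application write.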

3. Versioned Key Fingerprinting

Instead of invalidating, you change the key. When a product's data changes, you append a version hash to the cache key: product:42:v7 becomes product:42:v8. Old caches silently expire because no one asks for :v7 anymore. New reads always hit :v8, which is a miss on first access but is then populated correctly. This sidesteps the invalidation problem entirely — you never delete, you rotate. The trade-off: old versioned keys accumulate in the cache until TTL expires, wasting memory.
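For readers to ask for the right version, the current version number has to live somewhere every reader can see it. The sketch below keeps it in a plain dict for illustration. In a real deployment it would more plausibly be a shared counter or a version column on the row, which is an assumption of this example and not something the pattern prescribes.

```python
class VersionedCache:
    """Toy cache that rotates keys instead of deleting entries."""

    def __init__(self):
        self._versions = {}   # entity -> current version number
        self._store = {}      # versioned key -> value

    def _key(self, entity):
        return f"{entity}:v{self._versions.get(entity, 1)}"

    def get(self, entity):
        return self._store.get(self._key(entity))

    def put(self, entity, value):
        self._store[self._key(entity)] = value

    def bump(self, entity):
        """Called on write: readers start asking for the next version."""
        self._versions[entity] = self._versions.get(entity, 1) + 1

    def stored_keys(self):
        return sorted(self._store)


cache = VersionedCache()
cache.put("product:42", {"price": 29})
cache.bump("product:42")                 # price changed in the DB
print(cache.get("product:42"))           # miss on the new key, never the old value
cache.put("product:42", {"price": 25})   # repopulated under the new version
print(cache.stored_keys())               # the old key lingers until its TTL expires
```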

This stale-instance diagram describes a real class of production incident. A team deploys a new price. Redis is correctly invalidated. They even configured pub/sub to broadcast. But Servers 2, 3, and 4 were temporarily under load during the pub/sub broadcast and their subscriber threads were backlogged. For the next 45 minutes — until the in-process Caffeine cache TTL expired — 75% of checkout requests showed the old price. The fix is not complex: combine pub/sub with short in-process TTLs (60 seconds as a ceiling), and during high-stakes deploys, restart app servers in rolling order to flush all in-process caches cleanly.

The cardinal rule of multi-level invalidation: An invalidation is only as reliable as your weakest signal path. If you invalidate Redis but not in-process caches, you get mixed responses. If you invalidate CDN but not Redis, you get a CDN miss that still returns stale data from Redis. Every copy of the data needs an invalidation signal — and you need a fallback TTL for the cases where signals are dropped.

CPU cache and OS page cache handle their own consistency via MESI and kernel write-back; you don't touch them. Application-managed layers (in-process, Redis, CDN) require you to build invalidation. The three consistency strategies are write-through (synchronous update), write-behind (asynchronous flush), and explicit delete-on-write; in practice the invalidation signal is delivered by pub/sub fan-out, by CDC, or sidestepped with versioned keys. Pub/sub fan-out via Redis is the most common approach for in-process cache invalidation across a fleet, but it's fire-and-forget, so always pair it with a bounded TTL as a safety net. The stale-instance problem, where one server's in-process cache is missed, is one of the most common post-deploy incidents in multi-tier caching systems.

Section 15

Write Strategies Across Layers — Write-Through, Write-Around, Write-Back

We've talked a lot about what happens when a read hits or misses a cache. But caching strategies are equally important for writes. When a user updates their cart, changes a password, or completes a purchase, where does that data go first — and in what order? The answer determines your write latency, your consistency guarantees, and how much data you can lose if a server crashes. There are three main strategies, and knowing when to reach for each one is a key engineering judgment.

Let's walk through each strategy in plain terms, then look at how they compose in a real multi-level stack.

Write-Through — Consistency First

Write-through means: when you write data, you write it to the cache and the database in the same synchronous operation, before returning success to the client. The cache is always up to date because it gets every write immediately. The cost is that every write pays the latency of two storage operations: the cache write plus the database write. For a typical setup, that's ~0.5 ms to Redis plus ~5 ms to Postgres, or about 10% more write latency than bypassing the cache when the two writes run back to back. The write also now depends on two systems succeeding before it can be acknowledged.

This is the right strategy for financial transactions, inventory counts, and anything where stale reads are unacceptable. The slightly higher write latency is a reasonable price to pay for the guarantee that every read from the cache reflects the latest confirmed write.
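A minimal sketch of the write-through contract, using two dicts as stand-ins for the cache and the database. Writing the database first and the cache second is a design choice of this example: a failed database write then never leaves a value in the cache that was not persisted. The class name is hypothetical.

```python
class WriteThroughStore:
    """Every write updates the database and the cache before returning."""

    def __init__(self, cache: dict, db: dict):
        self.cache = cache
        self.db = db

    def write(self, key, value):
        self.db[key] = value       # 1. durable write first
        self.cache[key] = value    # 2. cache updated before success is returned

    def read(self, key):
        if key in self.cache:
            return self.cache[key]           # always reflects the latest write
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value          # populate on a cold read
        return value


cache, db = {}, {}
store = WriteThroughStore(cache, db)
store.write("inventory:42", 17)
print(cache["inventory:42"], db["inventory:42"])   # both copies agree immediately
```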

Write-Around — Protecting the Cache from Junk

Write-around means: writes go directly to the database and skip the cache entirely. The cache is only populated on cache misses during reads. This pattern makes sense when you're writing data that will rarely or never be read from the cache — for example, a bulk import of historical records, log ingestion, or append-only analytics events. Caching these writes would evict useful data from the cache for no benefit, since the written data won't be read back through the cache path.

The downside: the first read after a write is always a cache miss, which adds the database read latency. For frequently-read data, write-around is the wrong choice — you get the worst of both worlds (write goes to DB, read still misses the cache).

Write-Back — Performance at the Cost of Durability

Write-back is the most aggressive strategy: writes land in the cache immediately and return success to the client, then a background flush thread pushes dirty cache entries to the database asynchronously. This gives you the lowest write latency of the three: a cache write takes well under a millisecond (about 0.5 ms for a Redis round-trip, microseconds or less in-process), versus 5–15 ms for a database write. The risk: if the cache server crashes before the flush, you lose the writes that hadn't been persisted. For counters (page views, "likes"), losing a few counts is acceptable. For financial records, it is not.
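The durability trade-off is easiest to see with a counter. In the sketch below, increments accumulate in memory and only reach the "database" (a dict here) when flush() runs. The crash() method is purely a teaching device that shows what an unflushed buffer costs, and all names are hypothetical.

```python
class WriteBackCounter:
    """Buffers increments in memory and persists them in batches."""

    def __init__(self, db: dict):
        self.db = db
        self.pending = {}    # dirty increments not yet persisted

    def incr(self, key, amount=1):
        # Fast path: no database round-trip on the request path.
        self.pending[key] = self.pending.get(key, 0) + amount

    def flush(self):
        """Background job: apply buffered increments to the database."""
        dirty, self.pending = self.pending, {}
        for key, delta in dirty.items():
            self.db[key] = self.db.get(key, 0) + delta
        return len(dirty)

    def crash(self):
        """Simulate losing the cache: return how many increments vanished."""
        lost = sum(self.pending.values())
        self.pending = {}
        return lost


db = {}
views = WriteBackCounter(db)
for _ in range(3):
    views.incr("views:post:9")
views.flush()
views.incr("views:post:9", 2)
print(views.crash(), db)     # the two unflushed views are gone; three survived
```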

How These Strategies Compose in a Multi-Level Stack

In practice, different layers use different write strategies simultaneously. Here's how a typical e-commerce system stacks them:

The key insight from this table: different data types use different write strategies simultaneously within the same application. User profiles use write-through to Redis because consistency matters — you must not serve a user's stale name to their friends seconds after an update. View counters use write-back because losing a few counts is fine, and the throughput requirement is enormous. Static assets use write-around entirely — you never put a JS bundle in Redis or the app heap — and at CDN you effectively sidestep invalidation by using a new URL with a content hash on every deploy.

Write-back for DB-of-record writes is almost always wrong. The pattern makes sense for high-throughput counters (Redis INCR with periodic flush) but not for rows that represent the authoritative state of an entity. If your cache crashes between the write and the flush, that data is permanently gone. For transactional writes, use write-through — accept the latency and gain the durability guarantee.

Write-through updates cache and DB synchronously: strong consistency, write latency equal to the sum of both writes, right for user state and financial data. Write-around skips the cache on write and lets reads populate it on miss, which is right for bulk imports and one-time-write data. Write-back writes to cache only and flushes asynchronously: lowest latency, data-loss risk, right for high-throughput counters where losing a few increments is acceptable. In a real multi-layer stack, all three strategies may be in use simultaneously across different data types.

Section 16

Production Patterns — Cache-of-Caches, Multi-Tier, Look-Aside, Read-Through

Once you understand the six layers individually, the next step is seeing how real engineering teams wire them together. At scale, you don't pick one layer — you compose several into named architectural patterns. The four patterns below represent the most-used approaches in production today, from the simple default that most teams reach for, to the sophisticated multi-region hierarchies that companies like Facebook and Netflix built to serve billions of requests.

(a) Cache-Aside (Look-Aside) — The Default Pattern

The most common pattern by a large margin. The application manages the cache explicitly: on a read, check the cache first; if it misses, fetch from the database and populate the cache; on a write, write to the database and then invalidate (or update) the cache entry. The application code is in full control — there is no caching middleware silently doing things for you.

Why it's the default: it's simple to understand, easy to debug (you can trace exactly which line populated a cache entry), and compatible with any database or cache store. Most Redis + Postgres architectures use this pattern. The downside is that the responsibility for cache correctness sits entirely in application code — if a developer forgets to invalidate after a write, the bug is silent and difficult to find in production.

    # Assumes: redis is a redis-py client; db, serialize and deserialize
    # are application helpers.

    # Cache-aside (look-aside) read path
    def get_product(product_id):
        key = f"product:{product_id}"
        cached = redis.get(key)
        if cached is not None:
            return deserialize(cached)                # cache hit: fast path
        product = db.query("SELECT * FROM products WHERE id = ?", product_id)
        redis.set(key, serialize(product), ex=300)    # populate cache, TTL = 5 min
        return product

    # Cache-aside write path
    def update_product_price(product_id, new_price):
        db.execute("UPDATE products SET price = ? WHERE id = ?", new_price, product_id)
        redis.delete(f"product:{product_id}")         # invalidate stale cache entry

(b) Read-Through — Transparent Cache Population

In read-through, the cache itself is responsible for fetching from the database on a miss — the application only ever talks to the cache. A cache library like Caffeine (Java) or a Redis module with a loader function handles the miss transparently. The application says "get me product:42" and the cache returns it, fetching from the DB if necessary, without the application seeing the DB at all.

This centralizes the cache-loading logic in one place and ensures the "populate on miss" step is never forgotten. The trade-off: it can be harder to implement correctly for complex queries, and the application loses fine-grained control over when and what gets cached.
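The difference from cache-aside is where the loader lives: the cache owns it, and callers never see the database. A minimal sketch with hypothetical names follows. The loader is any function from key to value, and eviction, TTLs, and concurrent-miss handling are omitted.

```python
from typing import Callable, Dict, Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class ReadThroughCache(Generic[K, V]):
    """The cache fetches missing values itself via a loader supplied once."""

    def __init__(self, loader: Callable[[K], V]):
        self._loader = loader
        self._store: Dict[K, V] = {}
        self.loads = 0               # how many times the backing store was hit

    def get(self, key: K) -> V:
        if key not in self._store:
            self._store[key] = self._loader(key)   # miss handled inside the cache
            self.loads += 1
        return self._store[key]


# Stand-in for a database lookup.
fake_db = {"product:42": {"name": "TV", "price": 25}}
products = ReadThroughCache(loader=lambda key: fake_db[key])

products.get("product:42")    # first call loads from the backing store
products.get("product:42")    # second call is served from the cache
print(products.loads)
```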

(c) Cache-of-Caches / Hierarchical — Two-Tier Architecture

When a single shared cache cluster becomes the bottleneck, teams add a local in-process cache in front of the distributed cache. Reads check local memory first; only misses reach the shared Redis or Memcached cluster. This pattern cuts Redis round-trips dramatically for hot keys that every app server needs repeatedly.

Twitter, Facebook, and many large-scale systems use this pattern. The naming varies — "L1/L2 cache," "local cache + regional cache," "near cache" — but the structure is the same: a fast, small, in-process layer absorbs the hottest traffic, and a larger, shared, network-accessed layer handles the rest.

    // Caffeine (L1 in-process) → Redis (L2 distributed) → DB
    // Assumes: redis is a Jedis-style client; db, serialize and deserialize
    // are application helpers.
    LoadingCache<String, Product> localCache = Caffeine.newBuilder()
        .maximumSize(10_000)                        // bounded to prevent OOM
        .expireAfterWrite(30, TimeUnit.SECONDS)     // short TTL: safety net for invalidation gaps
        .build(key -> {
            // L1 miss: check L2 (Redis)
            String raw = redis.get(key);
            if (raw != null) return deserialize(raw);
            // L2 miss: go to DB
            Product p = db.findProduct(key);
            redis.setex(key, 300, serialize(p));    // populate L2, TTL = 5 min
            return p;
        });

    // Read path: single call, hierarchy is transparent to caller
    Product product = localCache.get("product:42");

The code above shows the hierarchy in action. Caffeine acts as the L1 gatekeeper — if the key is in local memory, the call returns in microseconds without a network hop. If L1 misses, Caffeine's CacheLoader function fires, checks Redis (L2), and only if that misses too does it reach the database. From the caller's perspective, it's a single get call. The layered fallback is invisible.

(d) Multi-Tier Heterogeneous — Netflix EVCache Pattern

Netflix built its own globally-distributed cache layer on top of Memcached. They named it EVCache — short for "Ephemeral Volatile cache," signalling up front that the data can be lost on a node crash and that's fine, because every entry can be rebuilt from the source. EVCache is a well-documented example of a multi-tier heterogeneous cache. Netflix uses Memcached (not Redis) for raw speed on hot ephemeral data — Memcached has lower overhead per operation and better CPU efficiency for simple key-value lookups. Redis is used separately for data that needs data structures (sets, sorted sets, pub/sub). The two stores coexist in the same architecture, each used for what it does best.

EVCache adds zone-aware routing: within an AWS region, Netflix runs one Memcached cluster per availability zone. Reads prefer the local zone's cluster to minimize cross-zone latency. Writes replicate across all zones asynchronously. This means a zone failure doesn't lose cached data, because other zones still hold copies, and normal reads stay within the same availability zone, keeping latency under 1 ms.
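The routing rule described above (read local, write everywhere) can be modeled in a few lines. This is a toy model and not the EVCache client API: each zone's cluster is a dict, writes are applied to all zones synchronously for simplicity where the text describes asynchronous replication, and the fallback order on a local miss is an assumption of the sketch.

```python
class ZoneAwareCache:
    """Toy model: one cache cluster per zone, reads prefer the local zone."""

    def __init__(self, zones, local_zone):
        self._clusters = {zone: {} for zone in zones}
        self._local = local_zone

    def set(self, key, value):
        # Write to every zone so any single zone can be lost without losing data.
        for cluster in self._clusters.values():
            cluster[key] = value

    def get(self, key, failed_zones=()):
        """Return (value, zone_served_from); try the local zone first."""
        order = [self._local] + [z for z in self._clusters if z != self._local]
        for zone in order:
            if zone in failed_zones:
                continue
            if key in self._clusters[zone]:
                return self._clusters[zone][key], zone
        return None, None


cache = ZoneAwareCache(["us-east-1a", "us-east-1b", "us-east-1c"], local_zone="us-east-1a")
cache.set("profile:alice", {"plan": "premium"})

print(cache.get("profile:alice"))                                 # served from the local zone
print(cache.get("profile:alice", failed_zones={"us-east-1a"}))    # local zone down: another zone answers
```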

Facebook TAO — A Real-World Two-Tier Example

The social-graph data Facebook serves is mostly a giant collection of two things: objects (a user, a post, a photo) and associations between them (Alice is friends with Bob; Bob liked this photo). Facebook's caching system is literally named after those two shapes — TAO stands for "The Associations and Objects." Published in a USENIX ATC 2013 paper by Bronson et al., it is one of the most studied real-world caching architectures. TAO sits in front of MySQL and serves Facebook's social graph — friends, likes, comments — as a two-tier cache.

TAO's architecture encodes a key insight: for a globally-distributed social graph, writes need strong consistency (all writes go to the leader MySQL) but reads can tolerate brief lag (follower regions cache data that may be a few hundred milliseconds behind the leader). The two-tier design — TAO cache servers in front of MySQL, per region — means most social graph reads are served from in-region memory without any cross-Atlantic network hop. The consistency cost is bounded: when a user updates their profile, followers in EU will see the change within seconds once the async invalidation propagates. That's an acceptable trade-off for a social feed. It would not be acceptable for a bank balance.

The cache-of-caches diagram shows the steady-state topology used by many large-scale Java services: each app server has a bounded Caffeine cache (typically 5,000–50,000 entries) that absorbs the hottest fraction of keys with microsecond latency. The shared Redis cluster handles everything the L1 doesn't hold — much larger working set, shared across all instances. Only genuine misses at both layers reach the database. In a well-tuned deployment, the database might see 1–3% of total read traffic. The rest is absorbed by the two-layer hierarchy.
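The 1–3% figure follows from multiplying miss ratios: a request reaches the database only if it misses every layer in front of it. The hit ratios below are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def fraction_reaching_db(*hit_ratios: float) -> float:
    """Share of reads that miss every cache layer, given per-layer hit ratios.

    Each ratio is measured against the traffic that layer actually sees,
    i.e. the misses from the layer above it.
    """
    fraction = 1.0
    for hit_ratio in hit_ratios:
        fraction *= 1.0 - hit_ratio
    return fraction


# Assumed: L1 (Caffeine) absorbs 80% of reads, L2 (Redis) absorbs 90% of the rest.
print(f"{fraction_reaching_db(0.80, 0.90):.1%} of reads reach the database")
```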

Cache-aside (look-aside) is the default pattern: the application manages the cache explicitly on reads and writes. Read-through delegates cache population to the cache library, centralizing miss handling. Cache-of-caches stacks an in-process L1 (Caffeine) in front of a distributed L2 (Redis) to absorb the hottest traffic with microsecond-scale latency. Facebook TAO demonstrates a two-tier globally-distributed design: writes go through a leader region with write-through, while follower regions serve reads locally with async invalidation. Netflix EVCache shows how a multi-tier Memcached architecture with zone-aware routing achieves high availability and sub-millisecond cache reads at planetary scale.

Section 17

Pitfalls & Anti-Patterns

Knowing what layers exist and what write strategies to use is only half the battle. The other half is knowing the mistakes teams make in production — the easy-to-reach-for decisions that feel right but quietly destroy performance, introduce stale data, or cause catastrophic failures under load. Here are the seven most damaging pitfalls, each with a clear explanation of why the mistake is tempting, what goes wrong, and how to fix it.

The mistake: A team adds Redis caching, then adds an in-process cache, then adds CDN caching, then enlarges the DB buffer pool, all for the same data, without measuring which layers actually help. The thinking is: "more caching = faster." So every layer gets turned on, everywhere.

Why it's bad: Each cache layer has real costs. An in-process cache consumes heap memory that your application needs for GC and object allocation. A CDN cache requires correct Cache-Control headers, invalidation logic, and monitoring. A Redis cache adds operational complexity (replication, eviction policy, sentinel/cluster setup). If a layer has a low hit ratio — say, 30% — it costs more in complexity and memory than it saves in latency. The 70% of requests that miss still pay the full downstream latency plus the overhead of the miss check.

The fix: Instrument before adding. Before deploying a new cache layer, write down a hit-ratio hypothesis you can test: "I believe 80%+ of reads for this data pattern will be cache hits." After deploying, confirm with metrics. If a layer consistently hits below 60%, reconsider its existence. Cache layers should be added one at a time, with a clear latency/throughput win to justify each one. The right question is not "should we cache this?" but "at which layer does caching this produce a measurable benefit?"

// Team adds caching at every possible layer, no measurement
public Product getProduct(String id) {
    // L1: in-process
    Product p = localCache.getIfPresent(id);  // ~5% hit ratio (cold products)
    if (p != null) return p;
    // L2: Redis
    String raw = redis.get("product:" + id);  // ~40% hit ratio (long tail)
    if (raw != null) {
        p = deserialize(raw);
        localCache.put(id, p);                // populate L1 with rarely-read data
        return p;
    }
    // L3: DB
    p = db.find(id);
    redis.setex("product:" + id, 3600, serialize(p));  // cache with 1hr TTL regardless
    localCache.put(id, p);
    return p;
}

// Only cache where hit ratio is verified ≥ 80%
// Measurement showed: hot products (top 1000 SKUs) hit 95%+ in Redis
// Long-tail products (millions of SKUs) hit 8% in Redis → skip Redis for those
public Product getProduct(String id) {
    // Long-tail products go straight to the DB: caching them mostly adds overhead
    if (!hotProductIds.contains(id)) {
        return db.find(id);
    }
    // L1 in-process for hot products only
    Product p = localCache.getIfPresent(id);
    if (p != null) return p;
    // Redis only for known-hot products (measured hit ratio: 93%)
    String raw = redis.get("product:" + id);
    if (raw != null) {
        p = deserialize(raw);
        localCache.put(id, p);  // repopulate L1 so the next read skips Redis
        return p;
    }
    // Miss at both layers: load from the DB and populate both caches
    p = db.find(id);
    redis.setex("product:" + id, 300, serialize(p));
    localCache.put(id, p);
    return p;
}
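To make the "low hit ratio costs more than it saves" argument concrete, here is a small sketch of the break-even arithmetic. The function names and the example latencies (0.8 ms cache hit, 0.8 ms wasted on a miss check, 5 ms downstream) are assumptions chosen for illustration, not measurements from this guide.

```python
def expected_latency_ms(hit_ratio: float, hit_ms: float,
                        miss_overhead_ms: float, downstream_ms: float) -> float:
    """Average read latency once the cache layer is in place.

    Hits pay only the cache latency. Misses pay the wasted cache lookup
    plus the full downstream latency.
    """
    hit_part = hit_ratio * hit_ms
    miss_part = (1.0 - hit_ratio) * (miss_overhead_ms + downstream_ms)
    return hit_part + miss_part


def latency_saved_ms(hit_ratio: float, hit_ms: float,
                     miss_overhead_ms: float, downstream_ms: float) -> float:
    """Latency saved per request compared with having no cache layer at all."""
    with_cache = expected_latency_ms(hit_ratio, hit_ms, miss_overhead_ms, downstream_ms)
    return downstream_ms - with_cache


# Hypothetical numbers: compare a 30% layer with a 90% layer.
low = latency_saved_ms(0.30, hit_ms=0.8, miss_overhead_ms=0.8, downstream_ms=5.0)
high = latency_saved_ms(0.90, hit_ms=0.8, miss_overhead_ms=0.8, downstream_ms=5.0)
```

Under these assumed numbers the 30% layer saves about 0.7 ms per request while the 90% layer saves about 3.7 ms, which is the kind of gap worth checking before paying the operational cost of a new layer.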

The mistake: A developer sets all cache layers to the same TTL — say, 5 minutes — reasoning that "data should be stale for at most 5 minutes everywhere." So in-process cache, Redis, and any CDN fragment cache all get TTL=300s.

Why it's bad: If all layers expire simultaneously, every server loses its cached version of popular data at the same moment. Thousands of requests hit Redis at the same time (all missing), then thousands of application threads hit the database simultaneously: a cache stampede. The database, which normally sees a trickle of cache misses, suddenly receives 100% of traffic. Under heavy load, this can cascade into a complete outage.

The fix: Stagger TTLs across layers. In-process cache should have the shortest TTL (30–60 seconds) — it's the fastest to re-warm but also the most hidden from invalidation. Redis can have a longer TTL (5–30 minutes). CDN can be longer still (minutes to hours). Additionally, add jitter to TTL values so even entries in the same layer don't all expire at exactly the same second.

# All three layers expire at the exact same time
localCache.set(key, value, TTL=300)          # in-process: 5 min
redis.setex(key, 300, value)                 # Redis: 5 min
cdn_header = "Cache-Control: max-age=300"    # CDN: 5 min
# At T=300s: ALL layers expire simultaneously
# → stampede to database

import random
BASE_TTL = 300  # 5 minutes for Redis
# Stagger: in-process is much shorter (safety net for invalidation gaps)
local_ttl = 45 + random.randint(0, 15)        # 45-60s
redis_ttl = BASE_TTL + random.randint(0, 30)  # 300-330s
cdn_ttl = 600  # CDN can hold longer (explicit purge on changes)
localCache.set(key, value, TTL=local_ttl)
redis.setex(key, redis_ttl, value)
response.header("Cache-Control", f"max-age={cdn_ttl}")

The mistake: A team caches the entire checkout page at the CDN — including the user's name, cart contents, and personalized recommendations — to save origin bandwidth. The CDN happily caches the response. The next user to request the same URL gets served the first user's checkout page.

Why it's bad: CDN caching is fundamentally designed for shared responses — the same content for all users. The moment a response contains user-specific data (session tokens, cart contents, account details, personalized recommendations), CDN caching becomes a privacy and correctness violation. Depending on the data, this can be a serious data leak. Even when Vary: Cookie headers are used to segment the CDN cache by session, the complexity and risk surface grows significantly.

The fix: Split your pages into a cacheable shell and a dynamic fragment. The page shell (header, footer, navigation, product thumbnails) can be CDN-cached. The personalized fragment (cart count, user name, recommendations) is fetched client-side via a separate authenticated API call that bypasses CDN. This is a client-side form of fragment caching; edge-side includes (ESI) apply the same split but assemble the fragments at the CDN. The rule: if the content changes per user, it must never be stored in a shared CDN cache.
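One way to enforce the shell/fragment rule is to decide the Cache-Control header in a single place. The sketch below assumes a hypothetical helper named cache_headers and an illustrative 600-second shared TTL; only the Cache-Control directives themselves (public, private, no-store, max-age) are standard HTTP.

```python
def cache_headers(personalized: bool, shared_ttl_seconds: int = 600) -> dict:
    """Return response headers for a shared shell or a per-user fragment."""
    if personalized:
        # private: shared caches such as CDNs must not store the response.
        # no-store: nothing should keep a copy, including the browser.
        return {"Cache-Control": "private, no-store"}
    # Shared content: safe for the CDN and browsers to reuse until max-age.
    return {"Cache-Control": f"public, max-age={shared_ttl_seconds}"}


# Hypothetical routes: the shell is shared, the cart fragment is per user.
shell_headers = cache_headers(personalized=False)
cart_headers = cache_headers(personalized=True)
```

Centralizing the decision means a new personalized endpoint has to opt in to shared caching explicitly, rather than being cached by accident.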

The mistake: A developer adds an in-process cache as a plain Java HashMap or Python dict with no eviction policy and no size limit. Over time, as more keys are added, the map grows without bound. After a few hours under production traffic, the heap fills up and the service dies with an OutOfMemoryError (or the container's memory limit kills the process), or GC pauses become so severe that the service becomes unresponsive.

Why it's bad: In-process caches live in the application heap. Unlike Redis — which runs in a separate process with its own memory and eviction policy — the in-process cache competes directly with your application objects for heap space. A 10 GB in-process cache on a JVM with a 16 GB heap leaves only 6 GB for everything else. GC pressure from the cache alone can cause 10-second stop-the-world pauses.

The fix: Always use a bounded cache with an LRU or TinyLFU eviction policy. In Java, use Caffeine's .maximumSize() or .maximumWeight(). In Python, use cachetools.LRUCache(maxsize=10_000). Set the bound based on the maximum heap budget you're willing to allocate to the cache — typically 10–20% of total heap, never more than 30%.

// Unbounded cache: will grow until OOM
private final Map<String, Product> productCache = new HashMap<>();
public Product getProduct(String id) {
    return productCache.computeIfAbsent(id, db::find);  // never evicts!
}

// Bounded, evicting, TTL-based in-process cache
// (assumes: import static java.util.concurrent.TimeUnit.SECONDS;)
private final LoadingCache<String, Product> productCache = Caffeine.newBuilder()
    .maximumSize(50_000)            // size cap: Caffeine evicts entries once this is exceeded
    .expireAfterWrite(60, SECONDS)  // safety TTL even if invalidation missed
    .recordStats()                  // expose hit ratio for monitoring
    .build(id -> db.find(id));      // auto-loads on miss
public Product getProduct(String id) {
    return productCache.get(id);    // safe: evicts, expires, stats
}
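For Python services, the same idea can be sketched with only the standard library. This is an illustrative, single-threaded example (the class name BoundedTTLCache and its parameters are assumptions); a real service would more likely reach for cachetools or add locking.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """LRU cache with a hard entry cap and a per-entry TTL. Not thread-safe."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[1]
        # Missing or expired: reload and store with a fresh expiry time.
        self.misses += 1
        value = loader(key)
        self._data[key] = (now + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    def __len__(self):
        return len(self._data)
```

The hits and misses counters give you the hit ratio for free, so the same object covers all three concerns: a size bound, a TTL, and a metric to watch.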

The mistake: A product's price is updated. The team correctly invalidates Redis (DEL product:42). But they don't realize that four app servers also hold the same product in their Caffeine in-process caches with a 10-minute TTL. For the next 10 minutes, those servers serve the old price regardless of what Redis says.

Why it's bad: In-process caches are the most invisible layer in the stack. They are fast, cheap, and easy to add — but they are also local and silent. Unlike Redis (a shared, observable store), each server's in-process cache is a private island. When data changes, you can see the Redis update in redis-cli. You cannot easily inspect every running server's heap. A stale in-process cache can survive a Redis invalidation undetected.

The fix: Two complementary approaches. First, use a short in-process TTL (30–60 seconds) as the baseline safety net — even without any explicit invalidation, stale data expires quickly. Second, implement Redis pub/sub broadcast invalidation (see Section 14): when any server writes a new value, it publishes to an invalidate channel, and all servers (including itself) clear the affected in-process key on receipt. The combination means most invalidations propagate in milliseconds via pub/sub, with the TTL as a fallback for servers that miss the broadcast.
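The broadcast half of this fix can be sketched without a running Redis. In the example below, InvalidationBus is an in-memory stand-in for a Redis pub/sub channel and AppServer is a hypothetical wrapper around one server's local cache; both names are assumptions made for illustration.

```python
class InvalidationBus:
    """In-memory stand-in for a pub/sub channel shared by all app servers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in self._subscribers:
            callback(key)


class AppServer:
    """One app server: a private local cache in front of a shared store."""

    def __init__(self, shared_cache, bus):
        self.local = {}             # private in-process cache
        self.shared = shared_cache  # stand-in for Redis
        self.bus = bus
        bus.subscribe(self._on_invalidate)

    def _on_invalidate(self, key):
        self.local.pop(key, None)   # drop the stale local copy, if any

    def read(self, key):
        if key not in self.local:
            self.local[key] = self.shared[key]
        return self.local[key]

    def write(self, key, value):
        self.shared[key] = value
        self.bus.publish(key)       # tell every server, including this one
```

Real Redis pub/sub is fire-and-forget: a subscriber that is disconnected when the message is published never receives it. That is why the short in-process TTL stays in place as the fallback.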

The mistake: A team adds Redis, tunes Caffeine, adds CDN caching — and never looks at the database buffer pool hit ratio. They assume the DB is fine because query latency looks acceptable. Six months later, a traffic spike causes query latency to jump 10×. The root cause: the working set grew beyond the buffer pool size, so most queries are now doing disk reads instead of memory reads. The buffer pool hit ratio had been quietly degrading from 99% to 65% over the previous two months.

Why it's bad: The database buffer pool is the most impactful single parameter in database performance. A Postgres shared_buffers at 25% of RAM with a 99% buffer pool hit ratio means almost no disk I/O. Drop to 70% hit ratio and you're doing disk I/O on 30% of page accesses — which adds 2–10 ms per affected query. Yet buffer pool hit ratio is rarely monitored. Teams monitor query latency (a lagging indicator) but not the leading indicator that predicts when it will degrade.

The fix: Add buffer pool hit ratio to your monitoring dashboard as a first-class metric. In Postgres: SELECT sum(blks_hit) / (sum(blks_hit) + sum(blks_read)) FROM pg_stat_database. Target ≥ 99%. If it falls below 95%, the working set is growing — consider increasing shared_buffers, adding RAM, or archiving cold data. Don't wait for query latency to spike before looking at this metric.
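One detail worth handling in the dashboard: the blks_hit and blks_read counters in pg_stat_database are cumulative since the last statistics reset, so a lifetime ratio can look healthy long after the recent ratio has degraded. A minimal sketch of an interval-based check follows; the function names, the sample dictionaries, and the "watch"/"alert" labels are assumptions for illustration, while the 99% and 95% thresholds come from the text above.

```python
def interval_hit_ratio(prev: dict, curr: dict):
    """Hit ratio over the interval between two counter samples.

    Each sample is a dict with cumulative 'blks_hit' and 'blks_read' values.
    Returns None when there was no block activity in the interval.
    """
    hit = curr["blks_hit"] - prev["blks_hit"]
    read = curr["blks_read"] - prev["blks_read"]
    total = hit + read
    if total == 0:
        return None
    return hit / total


def classify(ratio: float, target: float = 0.99, alert_below: float = 0.95) -> str:
    """Map a hit ratio onto the thresholds described in the text."""
    if ratio >= target:
        return "ok"
    if ratio >= alert_below:
        return "watch"
    return "alert"


# Hypothetical samples taken a few minutes apart.
earlier = {"blks_hit": 1_000_000, "blks_read": 10_000}
later = {"blks_hit": 1_065_000, "blks_read": 45_000}
recent = interval_hit_ratio(earlier, later)
lifetime = later["blks_hit"] / (later["blks_hit"] + later["blks_read"])
```

In this made-up sample the lifetime ratio is still about 96% while the most recent interval is at 65%, which is exactly the quiet drift the pitfall describes.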

The mistake: A viral product goes on sale and suddenly 50,000 requests per second are all asking for product:viral_item_id. Even though the key is in Redis, it is hit so heavily that the single Redis shard holding it becomes a CPU bottleneck. Worse: when the key's TTL expires, every in-flight request finds a miss at once and attempts to fetch from the database, a textbook cache stampede.

Why it's bad: Cache stampede is a self-inflicted DDoS on your own database. A single hot key expiring can generate a spike of database load proportional to your request concurrency, not your normal cache-miss rate. At high QPS, this can knock the database offline within seconds of a TTL expiry.

The fix: Two techniques. (1) Probabilistic early expiration: before a key expires, a small fraction of requests (e.g., 1%) proactively refresh it while the rest still serve the cached value. For in-process caches, Caffeine's refreshAfterWrite offers a related, non-probabilistic mechanism: the first read after the refresh interval triggers an asynchronous reload while callers keep receiving the old value. (2) Per-key mutex / request coalescing: when a miss occurs, only one request fetches from the database; all others wait for that result. This converts N simultaneous database hits into a single database hit with N-1 waiters. Additionally, for extreme hot keys, replicate the value across multiple Redis shards and route reads by a hash of request ID, distributing the load.

# All 50k concurrent requests miss simultaneously when TTL expires
def get_viral_product(id):
    val = redis.GET(f"product:{id}")
    if val is None:
        # ALL requests hit this simultaneously: stampede!
        val = db.query(id)
        redis.SETEX(f"product:{id}", 300, val)
    return val

import random
import threading
_locks = {}
_locks_guard = threading.Lock()  # protects creation of per-key locks
def get_viral_product(id):
    key = f"product:{id}"
    val = redis.GET(key)
    if val is not None:
        return val  # fast path, no lock needed
    # Per-key mutex: only one thread in this process fetches from DB.
    # (Across many servers, pair this with a distributed lock.)
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        # Double-check after acquiring lock (another thread may have populated)
        val = redis.GET(key)
        if val is not None:
            return val
        # Single DB fetch: all other requests waited for this
        val = db.query(id)
        # Add jitter to TTL to avoid synchronized expiry
        ttl = 300 + random.randint(0, 30)
        redis.SETEX(key, ttl, val)
        return val
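The hot-key replication idea mentioned in the fix can be sketched as a pair of key-naming helpers. The "#r<index>" suffix scheme, the function names, and the choice of CRC32 are assumptions for illustration; whether the suffixed keys actually land on different shards depends on how your Redis deployment hashes keys.

```python
import zlib


def replica_keys(base_key: str, replicas: int) -> list:
    """All replica key names; a writer must update every one of them."""
    return [f"{base_key}#r{i}" for i in range(replicas)]


def replica_key_for(base_key: str, request_id: str, replicas: int) -> str:
    """Pick one replica deterministically from the request ID."""
    index = zlib.crc32(request_id.encode("utf-8")) % replicas
    return f"{base_key}#r{index}"


# Spread reads for one hot product over four replica keys.
all_keys = replica_keys("product:viral_item_id", 4)
chosen = {replica_key_for("product:viral_item_id", f"req-{n}", 4) for n in range(1000)}
```

The trade-off is write amplification: every update and every invalidation now has to touch all replica keys, so this is only worth doing for the handful of keys that are genuinely hot.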

The seven pitfalls cover the most common ways multi-level caching breaks in production: adding layers without measuring hit ratios, synchronizing TTLs causing stampedes, leaking user data through CDN, unbounded in-process caches causing OOM kills, missing in-process invalidation after writes, ignoring DB buffer pool hit ratio drift, and unprotected hot-key stampedes. Each mistake is fixable with a targeted countermeasure: bounded caches, staggered TTLs, pub/sub invalidation, probabilistic early expiration, and per-key mutex coalescing.

Section 18

Practice Exercises — Multi-Level Cache Reasoning

The goal of these exercises is not to test whether you memorized the layer names. It's to build the habit of reasoning through multi-level cache decisions — the kind of thinking an interviewer is looking for when they describe a system and ask "where would you add caching?"

You have a five-layer cache stack with the following hit ratios and per-layer latencies:

Layer	Hit Ratio	Latency on Hit	Latency on Miss (falls through)
L1: In-process (Caffeine)	70%	0.2 ms	falls to L2
L2: Redis	80%	0.8 ms	falls to L3
L3: DB buffer pool (warm)	80%	3 ms	falls to L4
L4: DB query (no page cache)	90%	10 ms	falls to L5
L5: Disk (cold read)	100%	50 ms	—

Calculate the end-to-end average latency per request across the full stack. Show your work.

Think about it as: what fraction of requests reach each layer? L1 serves 70%. The remaining 30% fall to L2. Of those 30%, L2 serves 80% (= 24% of total). And so on. Average latency = sum of (fraction_served_at_layer × latency_at_layer) for every layer.

Solution:

L1 hit: 70% × 0.2 ms = 0.14 ms
L2 hit: 30% × 80% = 24% of requests × 0.8 ms = 0.192 ms
L3 hit: 30% × 20% × 80% = 4.8% × 3 ms = 0.144 ms
L4 hit: 30% × 20% × 20% × 90% = 1.08% × 10 ms = 0.108 ms
L5 hit: 30% × 20% × 20% × 10% = 0.12% × 50 ms = 0.060 ms
Total average latency ≈ 0.14 + 0.192 + 0.144 + 0.108 + 0.060 = 0.644 ms

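Notice: even though a cold disk read takes 50 ms, it only affects 0.12% of requests, so its contribution to average latency is just 0.06 ms. L1 handles 70% of all traffic yet adds only 0.14 ms to the average; the largest single term is actually L2 at 0.192 ms. (This calculation charges each request only for the layer that serves it and ignores the small lookup cost at the layers it missed.) This is the compounding math in action.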
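The same calculation can be sketched as a short function, which makes it easy to try other hit ratios. The function name and the tuple layout are assumptions for illustration; the numbers are the ones from the exercise table, and the model is the same simplified one used in the solution (each request pays only the latency of the layer that serves it).

```python
def average_latency_ms(layers):
    """Average latency for a stack of cache layers.

    layers: list of (name, hit_ratio, latency_ms), ordered from the first
    layer checked to the last. The last layer should have hit_ratio 1.0.
    Returns (total_ms, breakdown) where breakdown lists, per layer, the
    fraction of all requests served there and its latency contribution.
    """
    reaching = 1.0  # fraction of requests that get as far as this layer
    total = 0.0
    breakdown = []
    for name, hit_ratio, latency_ms in layers:
        served = reaching * hit_ratio
        contribution = served * latency_ms
        breakdown.append((name, served, contribution))
        total += contribution
        reaching *= (1.0 - hit_ratio)
    return total, breakdown


stack = [
    ("L1 in-process", 0.70, 0.2),
    ("L2 Redis", 0.80, 0.8),
    ("L3 DB buffer pool", 0.80, 3.0),
    ("L4 DB query", 0.90, 10.0),
    ("L5 disk", 1.00, 50.0),
]
total_ms, per_layer = average_latency_ms(stack)
```

Changing a single hit ratio in the stack list and re-running shows quickly which layer moves the average most.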

You're designing caching for a product detail page on an e-commerce site. The page includes: (a) the product's main image and static description (rarely changes), (b) the current price (changes 0–5 times per day), (c) the inventory count (changes on every purchase, potentially hundreds per minute), (d) the user's personalized "You might also like" recommendations (generated per-user per-session), and (e) the user's recently-viewed history badge. For each piece of data, identify the appropriate cache layer(s) and explain why.

Apply the three-axis test: who shares it? how often does it change? how large is it? For the inventory count, ask whether stale reads are tolerable — what's the business consequence of showing "5 left" when there are actually 0?

Solution:

(a) Product image + description → CDN + Browser cache. Public, shared across all users, rarely changes. Long TTLs safe. Content-hash URL enables permanent caching.
(b) Price → Redis with write-through invalidation. Shared across all users (same price for everyone) but changes daily — CDN is okay with short TTL + surrogate-key purge on change. Redis for the rendered price component. In-process cache acceptable with TTL ≤ 60s.
(c) Inventory count → Redis only, no in-process cache, no CDN. Changes very frequently. Stale count risks overselling (serious business consequence). Use Redis with short TTL and write-through on purchase. CDN must not cache this.
(d) Personalized recommendations → Redis keyed by user ID. User-specific — never CDN. Computed per session. Acceptable to cache for the session duration (15–30 min TTL).
(e) Recently-viewed history badge → Redis keyed by user ID. User-specific. In-process cache technically possible but risky if the user updates their browsing on another tab/device. Redis with short TTL is safest.

A price change is made in the admin console at 2:00 PM. At 2:45 PM, a support ticket reports that some users still see the old price. Your investigation shows: the database has the correct new price, Redis has the correct new price (confirmed with redis-cli GET product:42), and the CDN is correctly returning fresh content (confirmed by checking response headers — no Age header). Users who are affected are on app servers 2, 3, and 4. App server 1 shows the correct price. Identify the cache layer responsible for the bug and describe the fix.

If the DB is correct, Redis is correct, and CDN is correct — what's left? Think about what cache layer exists only on individual app servers, not in shared infrastructure.

Solution: The in-process cache (app-server-local Caffeine or similar) on servers 2, 3, and 4 is serving the stale price. Server 1 presumably had its process restarted (flushing the heap cache) or received the pub/sub invalidation. Servers 2–4 did not receive the invalidation message — either because Redis pub/sub was not configured, the subscriber thread was lagging, or the subscriber wasn't implemented at all. The fix: implement Redis pub/sub invalidation as described in Section 14, and set a short in-process TTL (≤ 60 seconds) as a safety net so stale data expires quickly even when pub/sub messages are missed.

You need to deploy a change to your main JavaScript bundle (app.js). The current bundle is being served from CDN and browsers. After the deploy, the new app.js must be used. Design the invalidation strategy: what headers do you set, what CDN operation (if any) do you trigger, and how do you ensure that browsers that have the old bundle cached are forced to fetch the new one? The constraint: you cannot issue a hard browser-level invalidation (you cannot clear users' browser caches).

Think about the difference between invalidating a URL vs changing the URL. If you change the filename (or add a content hash), the browser has never seen it before — no invalidation needed. This sidesteps the "you can't clear browser caches" constraint entirely.

Solution: Use content-addressed (fingerprinted) filenames. During build, compute the hash of app.js and rename the output to app.8f3a2c.js. The HTML file references this hashed filename. Set Cache-Control: public, max-age=31536000, immutable on the bundle, giving a one-year immutable CDN and browser cache. On the next deploy, the new bundle gets a new hash: app.9d1b4f.js. The HTML is updated to reference the new filename. For this to work, the HTML itself must stay fresh: serve it with a short TTL or Cache-Control: no-cache so browsers revalidate it and pick up the new reference. No CDN purge of the bundle is needed. No browser cache invalidation is needed. Browsers see a URL they've never fetched and request the new file fresh. The old file stays in CDN and browser caches until it is evicted or its TTL runs out, but it is never requested again (no HTML points to it). This is the canonical way to handle static asset deployment. The key insight is that you never invalidate, you rotate.
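The build step that produces the fingerprinted name can be sketched in a few lines. Real bundlers do this for you; the function name, the choice of SHA-256, and the 8-character digest length below are assumptions for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, content: bytes, digest_chars: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    path = PurePosixPath(filename)
    return f"{path.stem}.{digest}{path.suffix}"


# Two builds with different contents get different URLs.
old_name = fingerprinted_name("app.js", b"console.log('v1');")
new_name = fingerprinted_name("app.js", b"console.log('v2');")
```

Because the name depends only on the bytes, an unchanged bundle keeps its URL across deploys and stays cached, while any change, however small, produces a URL no browser has seen before.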

You're reviewing a system with two cache tiers in the read path: L4 is a Redis cluster (hit ratio: 95%) and L5 is the DB buffer pool (hit ratio: 60%). The overall average latency is acceptable when L4 hits (0.5 ms) but a 5% L4 miss rate is generating noticeable load: each L4 miss hits the DB, and of those DB hits, 40% are disk reads because the buffer pool is only 60% warm. What are the possible interventions, and which one should you prioritize first? Explain the trade-offs.

Work backwards from the bottleneck. The disk reads are the expensive path — 50 ms vs 3 ms. How many disk reads are happening per 1000 requests? What causes a 60% buffer pool hit ratio? Is the working set too large for the current buffer pool size, or is traffic patchy (lots of cold/long-tail reads)?

Solution:

Per 1000 requests: 950 hit L4 (0.5 ms each = 475 ms total). 50 miss L4 → hit DB. Of those 50: 30 hit buffer pool (3 ms each = 90 ms), 20 hit disk (50 ms each = 1000 ms). So disk reads represent 2% of total requests but contribute 1000 ms / (475 + 90 + 1000) = ~64% of total latency. The bottleneck is clear: disk reads.

Interventions in priority order:

Increase DB buffer pool (highest ROI): if the 60% hit ratio is caused by an undersized buffer pool relative to the hot working set, increasing shared_buffers (Postgres) or innodb_buffer_pool_size (MySQL) will directly raise the hit ratio. First, verify: check whether the DB working set fits in RAM. If the working set is 20 GB but the buffer pool is only 4 GB, adding RAM and increasing the buffer pool to 12–16 GB will likely push hit ratio above 95%, eliminating most disk reads.
Add L3 in-process cache on app servers: add a bounded Caffeine L3 in front of L4 (Redis). It serves the hottest keys without a network hop and keeps serving them when Redis evicts or expires a hot entry, so fewer requests fall through to the DB. Effective only if L4 misses are concentrated on a small set of hot keys.
Improve L4 hit ratio (lower priority): 95% is already quite good. Going from 95% to 98% cuts misses from 50 to 20 per 1000, but every remaining miss still pays the same disk-read cost, so fix that cost first.
Data tiering (if working set is inherently large): if the buffer pool can never fit the working set, archive cold data, partition hot vs cold tables, or route cold-read queries to a separate read replica with a larger buffer pool.
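To compare the interventions numerically, the per-1000-requests arithmetic from the solution can be wrapped in a small what-if function. The function name and the two alternative scenarios (buffer pool raised to 95%, L4 raised to 98%) are assumptions for illustration; the latencies are the ones used in the exercise.

```python
def latency_budget_per_1000(l4_hit: float, pool_hit: float,
                            l4_ms: float = 0.5, pool_ms: float = 3.0,
                            disk_ms: float = 50.0):
    """Total latency (ms) spent per 1000 requests, split by serving layer."""
    l4_hits = 1000 * l4_hit
    db_reads = 1000 - l4_hits
    pool_hits = db_reads * pool_hit
    disk_reads = db_reads - pool_hits
    parts = {
        "l4": l4_hits * l4_ms,
        "buffer_pool": pool_hits * pool_ms,
        "disk": disk_reads * disk_ms,
    }
    return parts, sum(parts.values())


baseline_parts, baseline_total = latency_budget_per_1000(0.95, 0.60)
_, bigger_pool_total = latency_budget_per_1000(0.95, 0.95)  # fix the buffer pool
_, better_l4_total = latency_budget_per_1000(0.98, 0.60)    # improve Redis instead
```

Under these assumptions the baseline spends about 1565 ms per 1000 requests, raising the buffer pool hit ratio brings that to roughly 742 ms, and raising the L4 hit ratio brings it to roughly 926 ms, which supports putting the buffer pool first.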

These five exercises build the core multi-level cache reasoning skills: computing average latency across compounding hit ratios (Exercise 1), mapping data types to appropriate cache layers (Exercise 2), diagnosing stale-data bugs by process of elimination across layers (Exercise 3), designing static asset deployments that sidestep browser cache invalidation entirely (Exercise 4), and identifying the bottleneck layer in a two-tier system and prioritizing the highest-ROI intervention (Exercise 5).

Section 19

Bug Studies — When Multi-Level Caches Go Wrong

Theory says caches make things faster. Production says: it depends on how you configure them. The four incidents below are composite sketches drawn from real patterns — exact companies are softened, but every failure mode is documented in public postmortems or engineering blogs. Read each one to understand not just what broke, but why that specific layer interaction caused the failure.

Bug #1 — Postgres shared_buffers Set to 90% of RAM Caused OOM Kills

Incident: A team provisioned a new Postgres instance with 64 GB RAM. Following advice that "more memory for Postgres = better," they set shared_buffers = 57600MB (≈90%). Within hours the kernel's OOM killer began terminating Postgres background workers. Query latency, paradoxically, got worse, not better. Restoring shared_buffers = 16384MB (25%) immediately resolved the OOM kills and improved throughput by roughly 40%.

What Went Wrong

Postgres does not operate in isolation — it shares the operating system with the OS page cache. When Postgres reads a database page, the read first passes through the OS page cache, then lands in shared_buffers. That means if shared_buffers is enormous, the same page can be double-buffered: once in kernel space (page cache) and once in Postgres's own buffer pool. You burn RAM twice for the same data.

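Worse: by taking 90% of RAM for shared_buffers, the configuration left almost nothing for the OS page cache or for per-backend memory (work_mem, maintenance_work_mem, connection overhead), which is allocated outside shared_buffers. Under write load those allocations pushed total usage past physical RAM, and the kernel's OOM killer stepped in. Meanwhile WAL writes, VACUUM I/O, and large sequential scans, which lean on the OS page cache rather than shared_buffers, hit the disk cold. The OS page cache, which normally absorbs this I/O, was effectively disabled. The commonly recommended split is roughly 25% of RAM for shared_buffers, leaving most of the remainder for the OS page cache to complement it.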

The diagram shows how over-allocating shared_buffers starves the OS page cache, forcing disk I/O for every operation that bypasses the Postgres buffer pool. The correct split lets both layers complement each other.

# postgresql.conf: DO NOT DO THIS on a 64 GB host
shared_buffers = 57600MB    # 90% of RAM, too much
# Consequence: leaves ~6 GB for OS page cache
# WAL, VACUUM, sequential scans hit disk cold
# Background workers OOM-killed under write load

# postgresql.conf: correct values for a 64 GB host
shared_buffers = 16384MB          # 25% of RAM, sweet spot
effective_cache_size = 49152MB    # tell planner OS page cache is ~48 GB
work_mem = 64MB                   # per-sort/hash operation (tune separately)
# Now: Postgres buffer pool + OS page cache work as complements
# Sequential scans warm the page cache; random reads warm shared_buffers

Lesson: The Postgres buffer pool and the OS page cache are partners, not competitors. Postgres's own documentation suggests 25% of RAM as a reasonable starting value on a dedicated server, precisely because Postgres also relies on the OS cache and leaving room for that layer compounds the total effective cache size. Giving Postgres 90% doesn't give you 90% of the benefit; it destroys the partner layer.

Bug #2 — In-Process Cache Without a Size Cap Caused a Slow OOM Over Weeks

Incident: A JVM service used a plain HashMap as an in-process cache for user-preference objects. It grew unbounded. Heap usage climbed by roughly 400 MB per week. After six weeks, the service began triggering full GC pauses (2–5 seconds) every few hours. After ten weeks, it OOM-crashed. The fix was a 10-line change: replace the raw HashMap with a size-bounded Caffeine cache.

What Went Wrong

In-process caches store data inside your application's heap. That heap is managed by the garbage collector, and the GC has a fixed budget. An unbounded cache means you keep accumulating objects. JVM's GC can handle young-generation objects well — they're short-lived and collected cheaply. But long-lived cache entries promote to old generation, and old generation collection (full GC) is expensive: it pauses the entire process.

The root cause is conceptual: a cache without a size cap is not a cache — it is a memory leak with lookup syntax. Every cache must have either a maximum entry count, a maximum byte size, or a TTL, or all three. Without at least one bound, the cache grows until it eats the heap.

// ANTI-PATTERN: unbounded HashMap used as a cache
private final Map<String, UserPreferences> prefCache = new HashMap<>();
public UserPreferences getPreferences(String userId) {
    return prefCache.computeIfAbsent(userId, id -> db.loadPreferences(id));
    // Every unique userId that ever visits adds a permanent entry.
    // After months: millions of entries, gigabytes of heap, GC death spiral.
}

// CORRECT: Caffeine cache with explicit bounds
// (assumes: import static java.util.concurrent.TimeUnit.MINUTES;)
private final Cache<String, UserPreferences> prefCache = Caffeine.newBuilder()
    .maximumSize(50_000)            // size cap: Caffeine evicts entries once this is exceeded
    .expireAfterWrite(10, MINUTES)  // stale data evicted regardless of access
    .recordStats()                  // expose hit ratio to monitoring
    .build();
public UserPreferences getPreferences(String userId) {
    return prefCache.get(userId, id -> db.loadPreferences(id));
    // Maximum ~50,000 entries regardless of how many users visit.
    // Old entries expire automatically. Heap stays predictable.
}

Lesson: Always configure three safeguards on an in-process cache: maximum size (prevents heap exhaustion), TTL (prevents stale data), and hit-ratio monitoring (detects misconfiguration early). Caffeine, Guava Cache, and Ehcache all support these. A plain HashMap or Python dict as a long-lived cache is almost always wrong.

Bug #3 — Cache Stampede: A Single Hot Key Expiry Melted the Database

Incident: A social platform cached a "trending topics" list in Redis with a 60-second TTL. The key was identical across all 1,200 application servers. When the key expired, all 1,200 servers detected a cache miss simultaneously and each fired the same database query: a 200 ms aggregation over 50 million rows. The database went from 2% CPU to 100%, query latency spiked to 45 seconds, and the site went partially down for 8 minutes. The next TTL expiry caused the same problem 60 seconds later.

What Went Wrong

This failure is called a cache stampede (also called thundering herd). The problem arises because expiry of a shared key is a synchronized event: the key lives once in Redis, so every server reading it sees the miss at the same wall-clock moment. If all those processes immediately try to rebuild the cache, you've turned one database query into 1,200 simultaneous identical queries.

The fix has two parts: probabilistic early expiry (rebuild the cache slightly before it expires, stochastically, so only one server does it) and locking (use a distributed lock or a Redis SETNX so only the first miss triggers the rebuild; all others wait for the result).

The stampede (left) fires 1,200 identical queries simultaneously. The mutex fix (right) serializes the rebuild to one query while all others wait for the fresh value.

def get_trending():
    value = redis.get("trending:topics")
    if value is None:
        # ALL 1,200 servers reach here simultaneously after TTL expiry
        value = db.compute_trending()  # heavy 200 ms aggregation
        redis.setex("trending:topics", 60, value)  # same TTL, same expiry next time
    return value

import random
import time
LOCK_KEY = "trending:lock"
DATA_KEY = "trending:topics"
TTL_SECONDS = 60
def get_trending():
    value = redis.get(DATA_KEY)
    if value is not None:
        return value
    # Only one server wins the rebuild lock
    acquired = redis.set(LOCK_KEY, "1", nx=True, ex=5)  # 5-second lock TTL
    if acquired:
        try:
            value = db.compute_trending()
            # Add TTL jitter so not all keys expire at the same second
            jitter = random.randint(0, 10)
            redis.setex(DATA_KEY, TTL_SECONDS + jitter, value)
        finally:
            redis.delete(LOCK_KEY)
        return value
    # Lost the lock race: poll until the lock holder populates the key.
    # A single 50 ms sleep is not enough, because the rebuild takes ~200 ms.
    for _ in range(20):  # wait up to ~1 second
        time.sleep(0.05)
        value = redis.get(DATA_KEY)
        if value is not None:
            return value
    # Lock holder failed or is very slow: fall back to a direct rebuild
    return db.compute_trending()

Lesson: Any hot key with a TTL is a potential stampede. Two defences together are stronger than either alone: (1) mutex locking so only one process rebuilds at a time, and (2) TTL jitter so keys don't all expire in the same second. For very hot keys, probabilistic early expiry (each reader may start a rebuild shortly before the real TTL, with a probability that rises as expiry approaches) avoids most stampedes at the cost of occasional extra rebuilds.
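Probabilistic early expiry is easier to see in code than in prose. The sketch below uses the commonly cited exponential form (sometimes called XFetch): a reader refreshes early when the current time plus a random head start reaches the expiry time, and the head start scales with how long a rebuild takes. The function name, the injectable rng parameter, and the beta default of 1.0 are assumptions for illustration.

```python
import math
import random
import time


def should_refresh_early(expires_at: float, recompute_seconds: float,
                         beta: float = 1.0, now=None, rng=random.random) -> bool:
    """Decide whether this reader should rebuild the value before it expires.

    expires_at:        absolute expiry time of the cached value (seconds)
    recompute_seconds: how long a rebuild takes (e.g. 0.2 for a 200 ms query)
    beta:              values above 1.0 refresh earlier, below 1.0 later
    """
    if now is None:
        now = time.time()
    u = 1.0 - rng()  # uniform in (0, 1], so log(u) is always defined
    head_start = -recompute_seconds * beta * math.log(u)  # always >= 0
    return now + head_start >= expires_at
```

Far from expiry, almost every reader gets False and serves the cached value. As expiry approaches, the chance that some reader gets True rises, so the rebuild usually finishes before the key actually disappears. A caller that gets True would then take the rebuild lock as in the mutex example, so that still only one rebuild runs.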

Bug #4 — Cross-DC Cache Coherence: EU Users Saw 4 Hours of Stale Pricing Data

Incident: A global e-commerce company invalidated a product pricing cache after a price update. The invalidation message was published to the US Redis cluster — but the EU Redis cluster was a separate deployment connected via async replication. The replication lag was normally under 100 ms, but a network partition during a deployment window meant EU cluster invalidations were queued and not delivered. EU users saw the old price for 4 hours. Legal and customer-support fallout followed.

What Went Wrong

Multi-region caches are not automatically coherent. Redis's cross-datacenter replication (whether via built-in replication or an intermediary like a Pub/Sub fan-out) is asynchronous. Under normal conditions this is fine — the lag is negligible. But under a network partition or a deployment gap, invalidation messages queue up and are delivered late — or in the worst case, dropped.

The team's mental model was "we invalidate the cache on write" — true, but only for the local region. Multi-region invalidation needs explicit propagation acknowledgement or a fallback strategy. Common approaches: (a) use a global TTL short enough that stale windows are acceptable, (b) use versioned keys (the new price gets a new key; old keys expire naturally), (c) route write-path invalidation to all regional clusters with acknowledgement.

# Invalidation only touches the local Redis cluster
def update_price(product_id, new_price):
    db.save_price(product_id, new_price)
    local_redis.delete(f"price:{product_id}")  # EU cluster never told
    # EU users see stale price until TTL expires (could be hours)

# Option A: Fan-out invalidation to all regional clusters
def update_price(product_id, new_price):
    db.save_price(product_id, new_price)
    for region_redis in [us_redis, eu_redis, apac_redis]:
        region_redis.delete(f"price:{product_id}")
    # Or publish to a Pub/Sub topic that each region subscribes to

# Option B: Versioned keys, so no explicit invalidation is needed
def cache_price(product_id, price, version):
    key = f"price:{product_id}:v{version}"
    redis.setex(key, 300, price)  # 5-min TTL; old version keys expire naturally

def get_price(product_id, current_version):
    key = f"price:{product_id}:v{current_version}"
    return redis.get(key) or db.load_price(product_id)

Lesson: Multi-region cache correctness requires explicit design. Async replication is a best-effort delivery mechanism, not a guarantee. For data where stale reads have business consequences (pricing, inventory, permissions), either use short TTLs, versioned keys, or synchronous fan-out invalidation with acknowledgement. Document the acceptable stale window and make it a monitored SLA.
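To make "fan-out invalidation with acknowledgement" concrete, here is a minimal sketch. It assumes each regional client exposes a redis-py style delete(key) method that raises on connection failure. FakeCache and invalidate_everywhere are hypothetical names used only for illustration, and what to do with unacknowledged regions (retry queue, alert, temporary read bypass) is left to the caller.

```python
def invalidate_everywhere(regional_clients, key):
    """Delete `key` in every region; return the regions that did not acknowledge."""
    failed = []
    for region, client in regional_clients.items():
        try:
            client.delete(key)  # redis-py style call; raises if the region is unreachable
        except Exception:  # e.g. a connection error during a partition
            failed.append(region)
    return failed


class FakeCache:
    """In-memory stand-in for a regional cache client (illustration only)."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.data = {}

    def delete(self, key):
        if not self.reachable:
            raise ConnectionError("region unreachable")
        return 1 if self.data.pop(key, None) is not None else 0


clients = {"us": FakeCache(), "eu": FakeCache(reachable=False)}
clients["us"].data["price:42"] = "9.99"
clients["eu"].data["price:42"] = "9.99"
unacked = invalidate_everywhere(clients, "price:42")
print(unacked)  # regions to retry or alert on
```

The important property is that a failed region is surfaced to the write path instead of being silently queued, which is the gap that allowed the four-hour stale window in the incident above.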

Four production failure modes dominate multi-level cache incidents: double-buffering from oversized DB buffer pools, unbounded in-process caches that OOM over weeks, thundering-herd stampedes when hot keys expire, and cross-region stale data from async-only invalidation. Each failure has a structural fix: right-size buffer pools (about 25% of RAM for Postgres shared_buffers), bound every in-process cache, use mutex + jitter for hot keys, and fan-out invalidations explicitly to all regions.

Section 20

Real-World Architectures — How Big Systems Stack Their Caches

Textbook cache descriptions make it sound like a choice between Redis and Memcached. Real production systems layer four or five different caching mechanisms simultaneously, each chosen for a specific reason. The architectures below are drawn from public engineering posts, USENIX papers, and conference talks — they show how the six-layer hierarchy plays out at scale.

Facebook TAO (USENIX ATC 2013, Bronson et al.): writes go to the leader region and propagate to followers via async invalidation. Each follower region reads from its local TAO cache backed by a MySQL replica — reads never cross regions. This means EU users always get low latency but may briefly see stale data after a write.

Facebook TAO — Graph-Optimized Layered Cache

Facebook's social graph is a massive graph of objects (users, posts, photos) and associations (friendships, likes). TAO (The Associations and Objects system, described in the USENIX ATC 2013 paper by Bronson et al.) is purpose-built for this shape of data. The key architectural insight is that graph reads massively outnumber writes — at reported ratios of hundreds-to-one — so TAO is optimized for read-local, write-global.

Cache layers in TAO:

In-region TAO cache — a distributed Memcached-like tier within each data center. Object and association data is cached here; reads served from this layer at sub-millisecond latency.
MySQL with InnoDB buffer pool — each region has a MySQL primary or replica. The InnoDB buffer pool caches the hot pages of the graph DB, so cache misses that fall through TAO usually still avoid disk.
Write-through to leader — writes always go to the leader region's DB, then TAO sends async invalidations to all follower regions. Followers re-populate their caches on the next miss.

The tradeoff: follower regions trade strong consistency for low read latency. A user in Frankfurt who updates their profile will see the update immediately (they wrote to leader, and the local TAO is invalidated synchronously for their own read), but their friend in London might see the old value for up to a few hundred milliseconds while async invalidation propagates.

Netflix EVCache — Memcached With Multi-Region Replication

Netflix's EVCache (described in multiple Netflix Tech Blog posts) is a Memcached-based distributed cache with a custom replication layer added on top. The core Memcached layer is standard — Netflix chose it for its simplicity, predictable latency, and well-understood memory model. What Netflix added was multi-region replication baked into the client library.

Cache layers in EVCache:

EVCache cluster per AWS region — each region runs its own Memcached clusters, sharded across multiple nodes. The cluster is entirely local — no cross-region traffic on the hot read path.
Client-side replication — writes are replicated to multiple regions by the client library itself, not by Memcached. The client writes to all regions; reads always go to the local region.
No dedicated in-process cache tier — Netflix uses EVCache as the primary cache for everything from session tokens to personalized recommendations. The assumption is that EVCache is fast enough (sub-millisecond over the same AWS VPC) that an additional in-process layer is usually not worth the complexity.
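The client-side replication item above (write to every region, read only locally) can be sketched in a few lines. This is not the EVCache client API. ReplicatingCacheClient is a hypothetical class, plain dicts stand in for each region's Memcached cluster, and the replication here is synchronous for simplicity, which a real deployment may not be.

```python
class ReplicatingCacheClient:
    def __init__(self, local_region, regions):
        self.local_region = local_region
        self.regions = regions  # region name -> dict acting as that region's cache

    def set(self, key, value):
        # Write path: the client itself copies the value to every region.
        for store in self.regions.values():
            store[key] = value

    def get(self, key):
        # Read path: only the local region is consulted.
        return self.regions[self.local_region].get(key)


regions = {"us-east-1": {}, "eu-west-1": {}}
us_client = ReplicatingCacheClient("us-east-1", regions)
eu_client = ReplicatingCacheClient("eu-west-1", regions)

us_client.set("title:123", "metadata")
print(eu_client.get("title:123"))  # served from the EU region's own copy
```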

EVCache is used for data with high reuse across many users: content metadata, genre lists, row data for the Netflix home screen. Personalized data (your specific recommendation ranking) is computed fresh or cached with short TTLs because the personalization model changes frequently.

Twitter (X) — Manhattan KV With In-Process + Distributed Cache Tiers

Twitter (now X) built its own distributed key-value store called Manhattan, first publicly described on the Twitter engineering blog in 2014. The social graph (followers, timelines) was backed by Manhattan, with multiple cache layers in front of it to handle the extreme read-to-write ratio of social timelines.

Cache layers in Twitter's stack:

In-process timeline cache — each timeline service instance kept a Scala/Java in-process cache of recently fetched timelines (a fixed-size LRU). A heavy Twitter user's home timeline was often served entirely from this tier.
Memcached cluster — the shared distributed tier between all timeline service instances. Cache keys included the user ID and a cursor position for paginated timelines.
Manhattan KV — the persistent store. Manhattan itself uses a log-structured merge-tree (LSM) internally, which means it has its own multi-layer cache: memtable (in-process), block cache (in-process), and OS page cache. So even the "database layer" has three sub-layers of caching.

Discord — Removing a Cache Tier by Upgrading the Database

Discord's 2023 engineering post on migrating from Cassandra to ScyllaDB is a fascinating counter-example: they removed a cache tier rather than adding one. Cassandra's read performance degraded as their message table grew to trillions of rows, so Discord added a Redis cache in front of Cassandra to protect it from read amplification. This worked but added operational complexity.

When they migrated to ScyllaDB (a Cassandra-compatible database written in C++ with a much more efficient cache implementation), ScyllaDB's internal row cache — which is part of ScyllaDB's architecture — was good enough to replace the Redis tier entirely. The "right number of cache layers" dropped from four to three. The lesson is that sometimes improving the backing store is a better investment than adding another cache layer in front of a struggling one.

Shopify — Rails Fragment Caching + Fastly CDN

Shopify runs one of the largest Rails deployments in existence. Their caching story is documented across several engineering blog posts. Unlike the previous examples which are read-heavy social platforms, Shopify's challenge is handling flash sales: millions of users hitting a single storefront in seconds.

Cache layers in Shopify's stack:

Fastly CDN — the outermost layer. Shopify uses Fastly's Varnish-based CDN to cache entire page responses. For a flash sale, Fastly absorbs a substantial share of requests before they reach Shopify's origin servers. Fastly's surrogate keys allow instant cache purge on inventory or price changes.
Rails fragment caching — inside the Rails app, expensive HTML fragments (product grids, collection pages) are cached in Memcached using Rails's built-in fragment cache API. Generation-based cache busting (incrementing a global generation counter rather than enumerating every key to invalidate) allows broad invalidation with a single atomic operation.
Redis for session and cart — user session and cart state live in Redis with short TTLs, ensuring personalized data is always fresh without hitting the database.
Postgres with shared_buffers + PgBouncer — the database tier, with connection pooling via PgBouncer to handle the connection spike from flash sales, and Postgres's own buffer pool caching the hot inventory rows.
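The generation-based cache busting described in the fragment caching item above can be illustrated with a small sketch. It is not Rails or Shopify code. GenerationalCache is a hypothetical class, a dict stands in for Memcached, and the counter is kept per namespace here (an assumption; the text describes a global counter, which is the same idea with a single namespace).

```python
class GenerationalCache:
    def __init__(self):
        self.store = {}  # stands in for Memcached

    def _generation(self, namespace):
        return self.store.setdefault(f"gen:{namespace}", 1)

    def key(self, namespace, name):
        # The current generation number is baked into every fragment key.
        return f"{namespace}:g{self._generation(namespace)}:{name}"

    def get(self, namespace, name):
        return self.store.get(self.key(namespace, name))

    def set(self, namespace, name, value):
        self.store[self.key(namespace, name)] = value

    def invalidate_all(self, namespace):
        # One counter bump makes every old key unreachable; in a real cache
        # the orphaned entries age out via TTL or eviction.
        self.store[f"gen:{namespace}"] = self._generation(namespace) + 1


cache = GenerationalCache()
cache.set("shop:1", "product_grid", "<div>grid</div>")
print(cache.get("shop:1", "product_grid"))  # cached fragment
cache.invalidate_all("shop:1")
print(cache.get("shop:1", "product_grid"))  # None: the old generation is orphaned
```

The design choice is that invalidation cost is constant regardless of how many fragments exist, at the price of leaving dead entries in memory until they are evicted.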

No two production stacks are identical. Facebook TAO skips CDN and in-process caches because social graph data is too personalized to cache at the edge. Netflix skips in-process caches because EVCache latency is already sub-millisecond within the same VPC. Shopify leads with CDN because static content cacheable at the edge handles the bulk of flash-sale traffic.

Real production cache stacks are always multi-layer and always shaped by the specific data access pattern: Facebook TAO optimizes for graph locality, Netflix EVCache for multi-region read scale, Twitter for timeline fan-out, Discord for simplicity, Shopify for flash-sale throughput. The common thread is that every system uses at least three layers simultaneously, and the design choices follow directly from understanding what each layer does and what its tradeoffs are.

Section 21

Common Misconceptions — Mental-Model Errors

These aren't beginner mistakes — they're the mental-model errors that experienced engineers make when reasoning about cache hierarchies. Each one is distinct from the pitfalls in S17 (which focused on implementation gotchas). These are wrong beliefs about how caches work, and correcting them changes how you design systems.

The belief: If the system is slow, add Redis. Adding a cache layer can only help.

Why it's wrong: A cache speeds things up only when the data has high temporal or spatial locality. If you're caching data that is accessed only once — a report query keyed on a unique date range, a user profile fetch that's never requested twice because users churn — you pay the overhead of cache writes, serialization, and network hops without ever getting a cache hit. The cache hit ratio stays near 0%, and you've added 0.5–2 ms of latency to every request (the cost of checking the cache and missing) instead of saving time.

The corrected mental model: Before adding a cache layer, measure the access pattern. What is the expected hit ratio for this data? If most requests are for unique keys, the data has no locality and a cache will hurt, not help. Caches are only beneficial when data is reused — either the same key is read many times (temporal locality) or nearby keys are read together (spatial locality).
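One quick way to measure the access pattern is to compute the best hit ratio any cache could achieve on a sample of request keys. The sketch below assumes you can export such a trace from logs. max_hit_ratio is a hypothetical helper, and it models an unbounded cache with no expiry, so a real cache will do no better than this number.

```python
def max_hit_ratio(keys):
    """Best-case hit ratio for a trace, assuming an unbounded cache and no expiry."""
    keys = list(keys)
    if not keys:
        return 0.0
    # The first access to each distinct key is always a miss.
    return 1.0 - len(set(keys)) / len(keys)


reused = ["a", "b", "a", "a", "b", "c", "a", "b"]   # 3 distinct keys, 8 requests
unique = [f"report:{i}" for i in range(8)]          # every request is a new key

print(max_hit_ratio(reused))  # 0.625
print(max_hit_ratio(unique))  # 0.0
```

If the upper bound is already low, as in the second trace, no cache configuration will rescue it and the extra lookup is pure overhead.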

The belief: Redis, Memcached, Caffeine, and the CDN all do the same thing. Pick one.

Why it's wrong: Each layer has fundamental physical constraints that make it non-substitutable. You cannot replace an in-process Caffeine cache with Redis if your bottleneck is latency below 0.1 ms — Redis adds a network round trip. You cannot replace a CDN with Redis if your bottleneck is geographic distance — Redis is in one data center. You cannot replace the OS page cache with Redis — the page cache operates at the kernel level and intercepts file I/O automatically, something Redis cannot do.

The corrected mental model: The six layers exist because each occupies a unique point in the speed/size/location trade-off space. Layers complement each other; they don't substitute for each other. The right question is not "which cache?" but "which cache belongs at which layer for this specific data?"

The belief: If my Redis hit ratio is 85%, I should add more RAM to push it to 95%.

Why it's wrong: Hit ratio improves with cache size only up to the working set boundary. Once your cache is large enough to hold the entire working set, additional memory adds nothing — every key that would ever be a hit is already cached. Beyond that point, you're paying for RAM that holds data nobody accesses. The marginal value of each additional gigabyte drops toward zero once you've covered the working set.

The corrected mental model: Profile the working set size first. Redis's INFO keyspace reports how many keys are stored, not how many are read, so combine it with access sampling or monitoring to estimate how many unique keys are accessed in a 24-hour window. Size the cache to cover that working set, then stop. Cache RAM beyond the working set yields near-zero return.
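The plateau at the working set boundary is easy to see in a toy simulation. The sketch below replays a synthetic trace through an LRU cache at several capacities. lru_hit_ratio and the round-robin trace are illustrative assumptions, and the cyclic access pattern is deliberately the worst case for LRU below the boundary.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity):
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)


# Synthetic trace: a working set of 100 distinct keys read round-robin, 50 passes.
trace = [i % 100 for i in range(5000)]
for capacity in (50, 100, 200, 400):
    print(capacity, lru_hit_ratio(trace, capacity))
```

On this trace the hit ratio is 0.0 at capacity 50 and 0.98 at capacity 100, and it stays at 0.98 for 200 and 400, because only the 100 unavoidable first-access misses remain once the working set fits.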

The belief: If I have Redis, I don't need an in-process cache. If I have an in-process cache, Redis is redundant.

Why it's wrong: They serve different purposes and work best together. An in-process cache (Caffeine, Python dict) costs effectively 0 ms (roughly 100 ns–1 µs per lookup): no serialization, no network hop, no thread switch. A distributed cache (Redis) costs 0.3–2 ms but is shared across all your application servers, so every server benefits from every other server's cache population. The optimal design is usually both: an in-process cache for the hottest data (top 1% of requests) that eliminates even the Redis round trip, backed by a distributed cache for the next tier down.

The corrected mental model: Think of them as two levels of a hierarchy, not competing options. The in-process cache is a "local Redis" with near-zero latency and limited scope. The distributed cache is a "shared store" that aggregates all servers' cache populations. Use both. Cap the in-process cache at a small fraction of heap (tens of thousands of entries); let Redis handle the long tail.
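A minimal sketch of the two-level lookup described above follows. TwoTierCache and load_user are hypothetical names, a plain dict stands in for Redis, and TTLs and invalidation are omitted to keep the read path visible.

```python
from collections import OrderedDict


class TwoTierCache:
    def __init__(self, shared, load_from_db, local_capacity=10_000):
        self.local = OrderedDict()   # tier 1: bounded in-process LRU
        self.shared = shared         # tier 2: stand-in for Redis (dict-like)
        self.load_from_db = load_from_db
        self.local_capacity = local_capacity

    def _remember_locally(self, key, value):
        self.local[key] = value
        self.local.move_to_end(key)
        if len(self.local) > self.local_capacity:
            self.local.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self.local:                 # no network hop
            self.local.move_to_end(key)
            return self.local[key]
        value = self.shared.get(key)          # one network round trip in practice
        if value is None:
            value = self.load_from_db(key)    # slowest path
            self.shared[key] = value
        self._remember_locally(key, value)
        return value


db_reads = []

def load_user(key):  # hypothetical loader standing in for a database query
    db_reads.append(key)
    return f"user-record-for-{key}"

shared_store = {}
server_a = TwoTierCache(shared_store, load_user)
server_b = TwoTierCache(shared_store, load_user)
server_a.get("user:1")   # misses both tiers, reads the database once
server_b.get("user:1")   # misses locally, hits the shared tier
server_a.get("user:1")   # served from server A's own memory
print(len(db_reads))     # 1
```

Server B benefits from server A's earlier miss, which is the shared-population effect, while the third call shows the local tier removing even the shared-tier round trip.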

The belief: The database handles its own caching. I don't need to think about shared_buffers or the InnoDB buffer pool — that's DBA territory.

Why it's wrong: The DB buffer pool is the most consequential cache layer for most applications, because it determines whether a DB query touches disk or RAM. If your buffer pool hit ratio is 60% — meaning 40% of DB reads go to disk — adding a Redis cache in front of the database will only help with the queries Redis can cache. For queries that are too complex or too personalized for Redis, every miss still causes a disk read. Understanding the buffer pool hit ratio tells you whether application-level caching is even the right lever to pull.

The corrected mental model: Monitor pg_stat_database and the ratio blks_hit / (blks_hit + blks_read) in Postgres (or, in MySQL, 1 − Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests). If the buffer pool hit ratio is above 99%, adding more app-level cache is low value. If it's 80%, right-sizing the buffer pool first will often give a larger gain than adding Redis.
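The two ratios can be computed from the raw counters as in the sketch below. The function names are hypothetical, and the counter values in the usage lines are made-up numbers chosen only to show the arithmetic.

```python
def postgres_hit_ratio(blks_hit, blks_read):
    """Share of block requests served from shared_buffers (pg_stat_database counters)."""
    total = blks_hit + blks_read
    return blks_hit / total if total else None


def innodb_hit_ratio(read_requests, disk_reads):
    """Share of logical reads served from the InnoDB buffer pool.

    read_requests: Innodb_buffer_pool_read_requests (logical reads)
    disk_reads:    Innodb_buffer_pool_reads (reads that had to go to disk)
    """
    return 1 - disk_reads / read_requests if read_requests else None


print(postgres_hit_ratio(blks_hit=990_000, blks_read=10_000))          # 99% from memory
print(innodb_hit_ratio(read_requests=1_000_000, disk_reads=200_000))   # 80% from memory
```

Both counters are cumulative, so in practice you would difference two samples taken a few minutes apart rather than use the totals since server start.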

The belief: My Redis hit ratio is 97%. The cache is working great.

Why it's wrong: Hit ratio is only one dimension of cache health. A cache can have a 97% hit ratio while simultaneously: (a) evicting aggressively because it's undersized — you're serving hits but constantly churning the working set; (b) returning stale data — hits are "successful" from the cache's perspective but wrong from the application's perspective; (c) adding 5 ms of latency due to memory pressure causing slower lookups or swap usage.

The corrected mental model: Monitor at least four metrics per cache layer: hit ratio, eviction rate, memory utilization, and p99 latency. A healthy cache combines a high hit ratio with a low eviction rate (it is not churning to achieve the hits), sub-millisecond p99 latency, and stable memory utilization below the configured maximum.

The belief: I have a small app with a few thousand users. All this CPU cache / OS page cache / buffer pool talk is irrelevant to me.

Why it's wrong: Small systems already use three or four cache layers whether you configured them or not. The OS page cache caches your application's static files and your database's data files automatically. The database buffer pool caches your hot rows. The browser caches your CSS and JS. These layers are always present; the question is only whether you've configured them intelligently. Even a hobby project benefits from setting appropriate HTTP cache headers (CDN/browser layer), ensuring the database buffer pool is large enough for the working set (DB layer), and avoiding unbounded in-process data structures (in-process layer).

The corrected mental model: You're not choosing whether to have cache layers — they exist regardless. You're choosing whether to understand and configure them, or to ignore them and wonder why the app feels sluggish.

The seven misconceptions share a common root: treating caches as a simple on/off optimization rather than as a physical hierarchy with specific tradeoffs at each layer. Correcting these mental models shifts thinking from "add a cache" to "which layer, for which data, at what cost, and how do I measure that it's working?"

Section 22

Operational Playbook — Tuning the Whole Stack

Knowing the theory is necessary but not sufficient. This section is a practical five-stage playbook for tuning a multi-level cache stack in a running production system. Follow it in order — skipping stages leads to optimizing the wrong layer.

The five stages must be followed in order. Tuning eviction before measuring hit ratios is guessing. Right-sizing before locating the bottleneck layer is optimizing the wrong thing.

Stage 1 — Measure: Instrument Every Cache Layer

You cannot tune what you cannot see. Before changing anything, add instrumentation to expose hit ratio, miss ratio, latency, and eviction rate for each layer. Here are the specific commands and APIs:

OS Page Cache — use free -h to see total vs used vs buff/cache. Use vmstat 1 to see page-in/page-out rates. On Linux, cachestat (part of bcc tools) shows page cache hit/miss per file.

Postgres shared_buffers — the pg_buffercache extension exposes what is currently in the buffer pool:

-- Install the extension once:
CREATE EXTENSION pg_buffercache;

-- Buffer hit ratio for the current database (cumulative since the last stats reset):
SELECT round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS buffer_hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();

-- Which relations are consuming the most buffer pool pages:
SELECT c.relname,
       count(*) AS buffer_pages,
       round(100.0 * count(*) /
             (SELECT setting::int FROM pg_settings WHERE name = 'shared_buffers'), 2) AS pct_of_pool
FROM pg_buffercache bc
JOIN pg_class c
  ON bc.relfilenode = pg_relation_filenode(c.oid)
 AND bc.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffer_pages DESC
LIMIT 20;

Redis — run INFO memory and INFO stats:

# Connect to Redis and run:
redis-cli INFO memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses"

# Compute hit ratio manually:
#   hit_ratio = keyspace_hits / (keyspace_hits + keyspace_misses)

# For eviction rate:
redis-cli INFO stats | grep evicted_keys
# If evicted_keys is growing rapidly, the cache is too small for the working set
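The manual hit-ratio calculation can be scripted. The sketch below parses the text that redis-cli INFO prints (lines of name:value, with section headers starting with #), so it needs no Redis client library. parse_info and redis_hit_ratio are hypothetical helper names, and the sample text is made up.

```python
def parse_info(text):
    """Parse `redis-cli INFO` output (lines of `name:value`) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Stats"
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def redis_hit_ratio(info_text):
    info = parse_info(info_text)
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


sample = "# Stats\r\nkeyspace_hits:9500\r\nkeyspace_misses:500\r\nevicted_keys:12\r\n"
print(redis_hit_ratio(sample))  # 0.95
```

The counters are cumulative since the server started (or since CONFIG RESETSTAT), so a dashboard would normally compute the ratio over the difference between two samples.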

In-process cache (Caffeine) — enable stats recording and expose via JMX or Micrometer:

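import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.stats.CacheStats;

Cache<String, Object> cache = Caffeine.newBuilder()
    .maximumSize(100_000)
    .recordStats()   // enables CacheStats collection
    .build();

// Periodically log stats (`log` is assumed to be an SLF4J-style logger):
CacheStats stats = cache.stats();
log.info("Hit ratio: {}, Evictions: {}, Load count: {}",
    stats.hitRate(), stats.evictionCount(), stats.loadCount());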

Stage 2 — Locate: Find the Layer With the Highest Miss Penalty

With metrics in hand, identify two things: which layer has the lowest hit ratio relative to its allocated capacity, and which layer's miss penalty is highest. These are not always the same layer.

Hit ratio per layer: if your in-process cache is 40% and your Redis is 95%, the in-process cache is the weak link — but it may be weak because it's too small or because its TTL is too short. If your Redis hit ratio is 60%, that's alarming — it suggests the working set doesn't fit, or the TTL is too aggressive, or the eviction policy is wrong.

Miss penalty by layer: an in-process cache miss falls to Redis (≈1 ms penalty). A Redis miss falls to the database (≈5–50 ms penalty). A database miss falls to disk (≈50 µs–10 ms penalty). The layer where misses are most expensive is the most important to fix first. Usually this is the distributed cache or the database buffer pool.

Decision rule: Fix the layer where (miss rate × miss penalty) is highest. A 10% miss rate with a 50 ms miss penalty costs 5 ms of average added latency per request. A 1% miss rate with a 1 ms penalty costs only 0.01 ms. Always weight miss rate by its cost, not in isolation.
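The decision rule is simple enough to automate. The sketch below ranks layers by miss rate × miss penalty using the two examples from the paragraph above. expected_miss_cost is a hypothetical helper, and it treats each layer's numbers independently rather than modelling how traffic thins out from one layer to the next.

```python
def expected_miss_cost(layers):
    """Return (name, miss_rate * penalty_ms) pairs, most expensive first."""
    costs = [(name, miss_rate * penalty_ms) for name, miss_rate, penalty_ms in layers]
    return sorted(costs, key=lambda pair: pair[1], reverse=True)


layers = [
    ("in-process", 0.01, 1.0),   # 1% misses, ~1 ms to reach Redis
    ("redis", 0.10, 50.0),       # 10% misses, ~50 ms to reach the database
]
ranked = expected_miss_cost(layers)
for name, cost_ms in ranked:
    print(f"{name}: {cost_ms:.2f} ms added per request on average")
```

With these inputs the Redis layer comes first at 5.00 ms, against 0.01 ms for the in-process layer, so Redis is the layer to fix.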

Stage 3 — Right-Size: Allocate Correct RAM to Each Layer

Right-sizing means allocating enough memory to cover the working set — no more, no less. Here are the validated rules of thumb for each layer:

Postgres shared_buffers: 25% of total system RAM. Set effective_cache_size to 75% to tell the query planner about the OS page cache. Never exceed 40% even on a dedicated DB host — you need room for the page cache.
MySQL InnoDB buffer pool (innodb_buffer_pool_size): 60–80% of RAM on a dedicated DB host. InnoDB can take a larger share than Postgres because it relies mainly on its own buffer pool (commonly bypassing the OS page cache with O_DIRECT), whereas Postgres delegates more to the OS page cache.
In-process cache (Caffeine, Guava): set maximumSize to the number of unique "hot" keys your service handles in one TTL window. A typical API service might cache 10 000–100 000 hot user records. Always cap it — if you're uncertain, start at 50 000 and measure eviction rate; if eviction rate is near zero, you have headroom.
Redis working set: use redis-cli --bigkeys and per-key size sampling (MEMORY USAGE, or DEBUG OBJECT where that is unavailable) to estimate total working set size. Size the Redis maxmemory at 110–120% of the working set estimate to leave headroom for hot-key clustering without eviction pressure.
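The rules of thumb above reduce to a few multiplications. The sketch below turns them into starting values for a host with a given amount of RAM. suggest_sizes is a hypothetical helper, the percentages are the ones stated in the list, and the outputs are starting points to validate with measurements, not final settings.

```python
def suggest_sizes(total_ram_gb, redis_working_set_gb):
    """Starting-point sizes (in GB) from the rules of thumb; ranges are (low, high)."""
    return {
        "postgres_shared_buffers_gb": total_ram_gb * 0.25,
        "postgres_effective_cache_size_gb": total_ram_gb * 0.75,
        "innodb_buffer_pool_gb": (total_ram_gb * 0.60, total_ram_gb * 0.80),
        "redis_maxmemory_gb": (redis_working_set_gb * 1.10, redis_working_set_gb * 1.20),
    }


# Example: a 64 GB database host and an estimated 20 GB Redis working set.
for setting, value in suggest_sizes(64, 20).items():
    print(setting, value)
```

For the 64 GB example this gives 16 GB for shared_buffers and 48 GB for effective_cache_size, with roughly 38–51 GB for an InnoDB buffer pool and 22–24 GB for Redis maxmemory.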

Stage 4 — Tune Eviction: Match Policy to Access Pattern

Eviction policy determines which entries are removed when the cache is full. The wrong policy can cut your effective hit ratio even with correct sizing.

LRU (Least Recently Used): evicts the entry that hasn't been accessed in the longest time. Good for workloads with strong temporal locality. Weakness: a single large sequential scan evicts all hot data ("scan pollution").
LFU (Least Frequently Used): evicts the entry accessed the fewest times total. Good for frequency-skewed workloads (a small set of keys is always hot). Weakness: new entries start with count 0 and are immediately vulnerable to eviction even if they'll be popular.
TinyLFU (used in Caffeine): a scan-resistant, frequency-based policy that uses a compact frequency sketch to approximate LFU without the cold-start problem. This is the best general-purpose policy for in-process caches and is Caffeine's default.
TTL jitter: add a small random offset (e.g., ±10% of TTL) to each entry's expiry time. This spreads expiry events across time and prevents cache stampedes when many keys were populated simultaneously.
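The TTL jitter item above takes only a few lines to implement. The sketch below applies a uniform ±10% offset. jittered_ttl is a hypothetical helper, and the seeded generator is there only to make the demonstration repeatable.

```python
import random


def jittered_ttl(base_ttl_seconds, fraction=0.10, rng=random):
    """Return the base TTL shifted by a uniform random offset within +/- fraction."""
    offset = rng.uniform(-fraction, fraction) * base_ttl_seconds
    return max(1, round(base_ttl_seconds + offset))


rng = random.Random(7)  # seeded only to make the demo repeatable
ttls = [jittered_ttl(300, rng=rng) for _ in range(1000)]
print(min(ttls), max(ttls))  # both fall within 270..330
```

A thousand keys populated in the same second now expire across a window of about a minute instead of all at once, which spreads the rebuild load on the backing store.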

Redis note: Redis supports allkeys-lru, allkeys-lfu, volatile-lru, and volatile-lfu eviction policies, among others (set via maxmemory-policy in redis.conf). For a general-purpose shared cache where all keys have TTLs, use allkeys-lfu. For a cache where some keys are permanent (no TTL) and some are ephemeral, use volatile-lru to evict only TTL-bearing keys.

Stage 5 — Validate: Load Test and Compare Metrics Before / After

Tuning in production without load testing is gambling. Always validate changes under realistic load before declaring success. The validation checklist:

Run a realistic load test: use a traffic replay tool (e.g., wrk, k6, Gatling) with a request distribution that mirrors production — not uniform random, but skewed toward hot keys. If 1% of your users account for 80% of requests, your load test should reflect that.
Compare hit ratios before and after: for each layer, record hit ratio at steady state before the change and after. The change should increase hit ratio at the targeted layer and should not decrease it at adjacent layers.
Measure p99 latency, not just average: average latency hides tail-latency spikes caused by GC pauses, eviction storms, or cold-start effects after a cache flush. p99 is what your users actually experience on their worst requests.
Monitor eviction rate over 24 hours: a successful right-sizing will show a drop in eviction rate (fewer entries being thrown out to make room). If eviction rate remains high after right-sizing, the working set is larger than estimated and you need to re-profile.
Verify end-to-end DB query rate dropped: the final proof that cache tuning worked is that the database sees fewer queries, not just that the cache metrics improved. Instrument DB query throughput (queries per second per table) and confirm it dropped after cache improvements.
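The point in the checklist about p99 versus average is easiest to see with numbers. The sketch below uses a nearest-rank percentile on a made-up latency sample in which 2% of requests are slow cache misses. percentile is a hypothetical helper, and load-testing tools typically report these figures for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# 1000 requests: 980 fast cache hits at 2 ms, 20 slow misses at 120 ms.
latencies_ms = [2.0] * 980 + [120.0] * 20
mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean:.2f} ms  p50={p50} ms  p99={p99} ms")
```

The mean is 4.36 ms and the median is 2 ms, both of which look healthy, while the p99 of 120 ms exposes the slow misses that the average hides.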

The operational playbook is: measure hit ratio and miss penalty at every layer, locate the layer where (miss rate × miss penalty) is highest, right-size that layer to the working set, tune eviction policy to match the access pattern (TinyLFU for in-process, LFU for Redis), then validate with load testing against p99 latency and DB query rate, not cache hit ratios alone.

Section 23

Cheat Sheet & Glossary — The 30-Second Recap

Everything from this page distilled into quick-reference cards and a glossary. Bookmark this section for interview prep.

The Latency Hierarchy (commit to memory)

L1 cache: ~1 ns, a few CPU cycles of on-core SRAM access, 32–64 KB per core
L2 cache: ~3–5 ns, slightly larger (256 KB–4 MB), still on the die
L3 cache: ~15–40 ns, shared across cores, 8–64 MB, still on-chip
DRAM / OS page cache: ~80–100 ns for a DRAM read; ~10–50 µs for a full page fault from disk
Redis / Memcached: ~0.3–2 ms over LAN, network-bound, not compute-bound
SSD read: ~50–150 µs, orders of magnitude slower than DRAM
HDD read: ~5–15 ms, mechanical arm movement is the bottleneck
CDN edge: ~5–30 ms, geographic latency to nearest PoP
Full miss through every layer: ~50–600 ms, full DB query + serialization + round trip

Key Rules of Thumb

Postgres shared_buffers: Set to 25% of total RAM. Leave the rest for the OS page cache.
MySQL InnoDB buffer pool: Set to 60–80% of RAM on a dedicated DB host (MySQL relies on it more).
Compounding hit ratios: 4 layers at 90% each → disk sees only 0.01% of traffic (miss-through-all-layers = 0.1⁴ = 0.0001).
In-process + distributed cache: These are complements, not alternatives. Use both. Cap in-process at a fixed entry count.
Stampede prevention: Mutex + TTL jitter. Never set the same TTL for all keys of the same type.
Eviction policy: TinyLFU for in-process; allkeys-lfu for Redis general cache; volatile-lru when mixing TTL and permanent keys.
Cache sizing: Size cache to cover the working set. Past that, additional RAM yields near-zero improvement.
Buffer pool hit ratio: Target >99% for Postgres/MySQL buffer pool. Below 95% means the working set doesn't fit, so add RAM or scale down data.
Multi-region invalidation: Async replication is not a guarantee. For consistency-sensitive data, fan-out explicitly or use versioned keys + short TTLs.

Glossary

cache hierarchy: The ordered set of cache layers from fastest/smallest (CPU L1) to slowest/largest (CDN), each layer catching misses from the layer above.
MESI protocol: The cache coherence protocol used between CPU cores: each cache line is in one of four states — Modified, Exclusive, Shared, Invalid. Prevents two cores seeing different values for the same memory address.
false sharing: When two threads modify different variables that happen to live on the same CPU cache line, causing constant MESI invalidations even though the threads aren't sharing data logically. Solved by padding structs to align variables to separate cache lines.
page cache: The Linux kernel's in-RAM cache of recently-read or recently-written file data. File I/O goes through the page cache automatically; Postgres, for example, benefits from it without any configuration.
buffer pool: The database's own in-process cache of recently-accessed database pages. Postgres calls it shared_buffers; MySQL calls it the InnoDB buffer pool. It sits in front of the OS page cache for database-specific I/O.
working set: The subset of data actively accessed in a given time window. Once a cache is large enough to hold the working set, additional capacity yields near-zero improvement in hit ratio.
temporal locality: The tendency to re-access the same data soon after accessing it. High temporal locality makes caching effective — the same key is likely in cache when requested again.
spatial locality: The tendency to access data near recently-accessed data. Exploited by CPU cache lines (fetch 64 bytes at once) and OS page cache (fetch 4 KB at once, even if only 8 bytes needed).
miss penalty: The latency cost paid when a cache lookup fails and must fall through to the next (slower) layer. Miss penalty × miss rate = average added latency from misses.
write-through / write-around / write-back: Three strategies for handling writes in a cache. Write-through writes to both cache and backing store synchronously. Write-around skips the cache on write. Write-back (write-behind) writes to the cache and flushes to the backing store asynchronously.
cache stampede: When a popular cached item expires and many clients simultaneously attempt to rebuild it, overwhelming the backing store with identical requests. Prevented by mutex locking and TTL jitter.
TAO: Facebook's "The Associations and Objects" distributed cache system, described in the USENIX ATC 2013 paper by Bronson et al. Designed for graph data with a write-to-leader, read-from-local-cache pattern across multiple regions.

The key numbers to internalize: L1 ≈ 1 ns, DRAM ≈ 80 ns, Redis ≈ 1 ms, SSD ≈ 100 µs, HDD ≈ 10 ms, CDN ≈ 20 ms. The key rules: Postgres 25%, MySQL 60–80%, stampede = mutex + jitter, in-process + distributed = complements. The key insight: hit ratios compound multiplicatively across layers.


Pitfalls & Anti-Patterns

Practice Exercises — Multi-Level Cache Reasoning