TL;DR: Caching Levels in Plain English
- Why every modern system has not one cache but six, stacked from the CPU die all the way out to the CDN edge, and why each layer exists for a different physical reason
- The latency hierarchy: L1 cache at ~1 ns, DRAM at ~80 ns, Redis at ~0.5 ms, database cold read at 5-15 ms, CDN at 5-30 ms, and how to reason about where each layer fits
- The compounding hit-ratio math: why four cache layers at 90% hit ratio each reduce downstream traffic to 0.01% of the original, not 60%
- Which layer is right for which data (working sets, hot data, cold data, personalized vs shared) and the costs of putting data in the wrong layer
- How to trace a real request through all six layers and calculate end-to-end latency under cold vs warm conditions
Every real system has six cache layers sitting between a user request and a disk. CPU L1/L2/L3 caches run at nanosecond speed and are invisible to your code. The OS page cache holds your hot files in RAM automatically. Your app-process cache (Caffeine, a Python dict) keeps computed results in your process's own memory. Redis or Memcached stores shared state across every server. The database buffer pool hides most disk reads. CDN edge nodes serve the whole page before it ever reaches your data center. A request that hits every layer warm costs 20 ms. A request that misses every layer costs 600 ms. That 30× gap is decided entirely by how well you understand, and tune, these six layers.
Between a user's browser and your disk there are exactly six cache layers, each one progressively larger, slower, and cheaper per byte. At the top sit CPU L1/L2/L3 caches: kilobytes to megabytes of memory baked onto the processor die, running at 1-30 ns, completely invisible to application code. Below that, the OS page cache stores recently read file data in ordinary RAM and is managed by the kernel; your database reads usually land here first. Next comes the app-process cache: a Java Caffeine cache, a Python dict, or any in-memory map that lives inside your application server's heap. Then the distributed cache (Redis or Memcached), a separate tier shared across all your servers, reached over the network in 0.5-2 ms. The database buffer pool (Postgres's shared_buffers, MySQL's InnoDB buffer pool) caches raw database pages in RAM on the DB server before they would need a disk read. And finally the CDN edge, with servers distributed worldwide near users, caching full HTTP responses 5-30 ms from the browser. Every layer absorbs as many requests as it can; only the misses fall to the next layer.
The layers form a pyramid because physics dictates a strict trade-off: speed, size, and cost per byte are locked in a triangle, and you can only have two. CPU L1 cache is the fastest memory you can build (~1 ns, a handful of clock cycles) but is tiny (32-64 KB) because SRAM cells are expensive. The OS page cache is slower (~100 ns) but uses cheap DRAM and can be gigabytes. Redis is accessible to every server but pays a network round-trip (0.5-2 ms) while holding terabytes across a cluster. A CDN distributes globally across petabytes of storage but each response pays a geographic round-trip (5-50 ms). The pyramid isn't a design choice; it's an expression of physics. Understanding it tells you which data should live where: hot, tiny, shared-nothing data → L1/L2; hot, large, cross-server data → distributed cache; hot, global, static data → CDN.
The most underappreciated insight in caching is that hit ratios multiply across layers; they don't add. If your app cache hits 90% of requests, the remaining 10% reach Redis. If Redis hits 90% of those, only 1% reach the database. If the database buffer pool hits 90% of those, a mere 0.1% fall through to whatever sits below it. After four layers at 90% each, the bottom layer (disk) sees 0.01% of original traffic. This compounding is why every layer matters and why even a mediocre 50% hit ratio at a layer is worth adding: it cuts downstream load by half. End-to-end average latency follows the same compounding: most requests are served fast from high layers, dragging the average down even if slow cold-miss requests still exist.
Why You Need This: The Layered-Speed Story
Caching is one of those topics where you can learn each technique in isolation (Redis for session storage, CDN for static files, Caffeine for in-process lookups) without ever seeing the full picture. The full picture is this: all six layers are active simultaneously for every production request, and the difference between a well-tuned stack and a poorly-tuned one is measured in hundreds of milliseconds and in database servers you don't have to buy.
The Cold-Cache Horror Story: A Checkout Flow in Trouble
Picture an e-commerce checkout page. A user clicks "Buy Now" and their browser fires a request. Here is what happens on a system where nobody has thought carefully about caching layers, so every cache is cold or misconfigured:
- Browser cache miss: the CDN URL has no Cache-Control header, so the browser always re-fetches. +0 ms saved.
- CDN miss: the checkout HTML is marked Cache-Control: no-store out of excess caution. Every request hits the origin. +30 ms network to origin.
- Nginx reverse proxy: no proxy-side cache configured. Request passes through. +2 ms routing.
- Redis miss: the user's cart, session, and pricing data are not in Redis (TTL expired). +1 ms network to Redis, then fall through to DB.
- Database query: five SQL joins to assemble cart, shipping options, and product details. The query itself takes 60 ms on the primary. +60 ms.
- Response assembly + serialization: another 15 ms server-side. +15 ms.
- Response travels back: another 30 ms network to browser. +30 ms.
Total: roughly 138 ms per checkout page load, before accounting for the three round trips a browser needs for TCP and TLS. At p99 (with DB query variability), this regularly spikes to 800 ms. The team's load tests show they need eight database replicas to handle Black Friday traffic.
The Warm-Cache Story: Same Team, Three Months Later
The team audits each cache layer and makes targeted fixes. Here is the same request after the optimization:
- Browser cache hit: static assets (JS bundles, CSS, images) have content-hash URLs and a one-year Cache-Control header marked immutable. Browser serves them without a network request. ~0 ms for assets.
- CDN edge hit: the shared parts of the checkout HTML (header, footer, product thumbnails) are now cached at the CDN with a 60-second TTL and surrogate-key invalidation on price changes. Served from the CDN edge in 8 ms. +8 ms for the shell.
- Redis hit: user cart, pricing rules, and shipping options are cached in Redis with a 5-minute TTL. Redis answers in 0.6 ms. +0.6 ms.
- Database buffer pool hit: the small remaining DB read (user's saved address) lands in Postgres shared_buffers and returns in 0.3 ms, with no disk touch. +0.3 ms.
- Response assembly + network back: 5 ms + 8 ms. +13 ms.
Total: roughly 22 ms per checkout page load. p99 is under 80 ms. And because the DB now sees only the cache misses (roughly 2% of requests after layered caching), the load test reveals they can handle Black Friday on two replicas instead of eight, a difference worth tens of thousands of dollars per month in infrastructure cost.
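The two totals can be checked by adding up the per-step costs from the walkthroughs. This is a minimal sketch: the numbers are copied from the two lists above, and the dictionary keys are just illustrative labels.

```python
# Per-step latencies in milliseconds, copied from the cold and warm walkthroughs.
COLD_PATH_MS = {
    "browser cache miss": 0.0,
    "CDN miss, network to origin": 30.0,
    "Nginx pass-through": 2.0,
    "Redis miss": 1.0,
    "database query": 60.0,
    "response assembly + serialization": 15.0,
    "network back to browser": 30.0,
}
WARM_PATH_MS = {
    "browser cache hit (assets)": 0.0,
    "CDN edge hit": 8.0,
    "Redis hit": 0.6,
    "DB buffer pool hit": 0.3,
    "response assembly + network back": 13.0,
}


def total_ms(path):
    """Sum the per-step latencies of one request path."""
    return sum(path.values())


cold = total_ms(COLD_PATH_MS)
warm = total_ms(WARM_PATH_MS)
print(f"cold: {cold:.1f} ms, warm: {warm:.1f} ms, speedup: {cold / warm:.1f}x")
```

The speedup comes out at a little over 6×, which matches the rounded figures quoted in the text.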
The waterfall makes the contrast visceral. In the cold path, the majority of wall-clock time is spent in two places: the CDN round-trip (because caching was disabled) and the database disk I/O (because five joins bypassed every cache). In the warm path, those two expensive phases don't happen. The request returns from shallow layers (CDN and Redis) and the system is 6× faster with 4× fewer database servers needed. The six-layer hierarchy isn't academic. It is the architectural blueprint for every latency improvement your system will ever achieve.
Mental Model: The Six-Layer Cache Pyramid
Before diving into each layer individually, you need one picture that holds the whole hierarchy in your head. Here it is: imagine a pyramid. At the very top (the smallest, fastest, most expensive tier) sit your CPU caches. At the very bottom (the largest, slowest, cheapest tier) sits the CDN serving billions of users globally. A lookup starts at the top of the pyramid, at the layer closest to the CPU. If it misses, it falls to the next layer. And the next. Each fall costs progressively more time.
The pyramid shape is not arbitrary; it reflects a hard physical reality. As you move from top to bottom: capacity grows (KB → GB → TB → PB), latency grows (1 ns → 50 ms), and cost per byte drops (SRAM at $10/MB → DRAM at $0.005/MB → SSD at $0.0001/MB → CDN storage at fractions of a cent). You cannot get all three wins at once. Speed costs space. Space costs money. This triangle defines where each layer sits.
Reading the pyramid: each layer you descend pays a latency penalty and gains capacity. CPU L1 is baked onto the processor die itself. It is the fastest memory in the machine, but it holds only 32-64 KB per core. By the time you reach the CDN edge, you are talking petabytes across thousands of servers worldwide, but each cache hit takes 5-50 ms depending on how far the user is from the nearest PoP. The pyramid is not a hierarchy of importance, since all six layers run simultaneously in every production request. It is a hierarchy of speed and cost.
The most counterintuitive thing about this pyramid is who controls each layer. Layers 1-2 (CPU caches, OS page cache) are controlled entirely by the hardware and kernel; your application code cannot directly address them. Layer 3 (app-process cache) you control explicitly in your application code. Layer 4 (distributed cache) is a separate service you operate and configure. Layer 5 (DB buffer pool) is controlled by your database configuration: a single parameter like Postgres's shared_buffers or MySQL's innodb_buffer_pool_size can double or halve the number of disk reads your database generates. Layer 6 (CDN) is controlled by HTTP headers your application sends. Each layer has a different control knob, and knowing which knob to turn for which problem is the core skill this page builds.
Take Postgres's shared_buffers. On a server with 32 GB of RAM, setting it to 8 GB can eliminate 80-90% of disk I/O. That is a single config line that can outperform a week of application-level caching work.
Core Concepts: The Vocabulary
Caching has its own jargon, and a lot of the terms look similar but mean subtly different things. We'll warm up with the plain-English idea first, then introduce the name, so when you bump into these in docs, code reviews, or interviews, you already have a mental hook for what each one means and why anyone cares.
The Fundamentals
When your program asks for a piece of data and it is already sitting in the cache, ready to go, that is called a cache hit. When the cache doesn't have it and has to go fetch it from the layer below, that is a cache miss.
The fraction of requests that find data in the cache is the hit ratio. This is the single most important number in cache performance. A 90% hit ratio means only 10% of traffic reaches the next layer; a 50% hit ratio means half of all traffic falls through. Those are very different outcomes. The flip side is the miss penalty: the extra time you pay every time the cache doesn't deliver.
The data your system accesses most often right now is called the working set. If your cache is large enough to hold the entire working set, hit ratios will be high and stable. If the working set is larger than your cache, items get evicted before they can be reused, a condition called cache thrashing.
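To make the working-set point concrete, here is a small simulation, offered as a sketch rather than a model of any particular cache. It assumes an LRU eviction policy and a looping scan over 100 keys; the function name `lru_hit_ratio` is made up for this example.

```python
from collections import OrderedDict


def lru_hit_ratio(accesses, capacity):
    """Replay a sequence of keys through an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(accesses)


# A working set of 100 keys, scanned in order 50 times.
accesses = list(range(100)) * 50
fits = lru_hit_ratio(accesses, capacity=100)   # cache holds the whole working set
thrash = lru_hit_ratio(accesses, capacity=99)  # cache is one slot too small
print(fits, thrash)
```

With room for all 100 keys, only the first pass misses. With 99 slots, a looping scan under LRU evicts each key just before it is needed again, so every access misses. Real workloads are rarely this adversarial, but the cliff at the working-set boundary is the same effect.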
Write Policies
When data changes, you have three strategies for keeping the cache in sync with the source. If you write to the cache AND the backing store simultaneously, guaranteeing they match at all times, that is write-through. It is safe and consistent, but every write still pays the cost of hitting the backing store.
If you write only to the backing store and skip the cache entirely (forcing the next read to re-fetch), that is write-around. It is good for data that is written once and not immediately re-read, because it prevents polluting the cache with cold data.
If you write only to the cache and lazily flush to the backing store later, that is write-back (or write-behind). It is fastest because the caller doesn't wait for the DB, but it introduces risk: if the cache crashes before flushing, data is lost.
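The three policies differ only in what a write touches. The toy class below is a sketch under simple assumptions: the backing store is a plain dict standing in for a database, and `PolicyCache`, `dirty`, and `flush` are names invented for this illustration.

```python
class PolicyCache:
    """Toy cache in front of a dict-backed store; `policy` selects the write path."""

    def __init__(self, store, policy):
        self.store = store    # stands in for the database
        self.cache = {}
        self.dirty = set()    # keys written to the cache but not yet flushed
        self.policy = policy

    def read(self, key):
        if key in self.cache:             # cache hit
            return self.cache[key]
        value = self.store[key]           # cache miss: fetch and populate
        self.cache[key] = value
        return value

    def write(self, key, value):
        if self.policy == "write-through":
            self.cache[key] = value       # cache and store updated together
            self.store[key] = value
        elif self.policy == "write-around":
            self.store[key] = value       # store only
            self.cache.pop(key, None)     # drop any stale cached copy
        elif self.policy == "write-back":
            self.cache[key] = value       # cache only, flushed later
            self.dirty.add(key)
        else:
            raise ValueError(f"unknown policy: {self.policy}")

    def flush(self):
        """Write-back only: push pending writes to the store."""
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

In write-back mode, anything still in `dirty` when the process dies is lost, which is the risk described above.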
Cache Locality
The reason CPU caches are so effective is that programs naturally exhibit locality: they tend to access the same data repeatedly, and data near recently accessed data. When a program uses the same variable or memory address over and over (think of a loop counter), that is temporal locality. When a program accesses sequentially adjacent memory locations (think of iterating an array), that is spatial locality. Both principles explain why caches at every layer are so effective: they work because programs are not random in what they access.
Cache Relationships (Inclusive vs Exclusive)
When you have multiple cache layers, you have to decide: can the same data item live in more than one layer at a time? If yes, the caches are inclusive: each layer is a superset of the layer above it. If an item lives in exactly one layer at any time, the caches are exclusive: evicting from one layer moves the item to the next rather than discarding it. AMD CPUs use exclusive L2/L3 caches; Intel traditionally used inclusive L3 (this has evolved). For distributed systems, the inclusive vs exclusive choice maps to whether you populate multiple cache tiers simultaneously (inclusive) or only cache at the tier that served the miss (exclusive).
Hardware Cache Coherence
On a multi-core CPU, each core has its own L1 and L2 cache, meaning the same memory address can exist in multiple cores' caches simultaneously. If one core changes a value, all other cores with a copy need to know. The hardware protocol that manages this is called MESI (Modified, Exclusive, Shared, Invalid). Each cache line is tagged with one of these four states, and the hardware automatically coordinates between cores when a write happens. This is invisible to your code but explains why false sharing (two threads writing to adjacent variables on the same 64-byte cache line) can cause surprising performance cliffs: every write invalidates the other core's copy and forces a round-trip through L3.
The concept map shows how all the vocabulary connects. Hit ratio and miss penalty together determine end-to-end latency: you want maximum hit ratio and minimum miss penalty at every layer. Working set size tells you whether a given cache is large enough to hold hot data without thrashing. Write policies decide consistency trade-offs. Locality explains why caches work at all. Inclusive vs exclusive decides total capacity. MESI is the hardware implementation of consistency at the CPU level.
The Six Cache Layers: A Layer-by-Layer Tour
With the vocabulary and pyramid mental model in place, let's walk each layer one by one: what it is, who controls it, the latency numbers, the typical capacity, and (critically) what makes it the right layer for certain types of data. Later blocks dive deep into each; here the goal is a clear, side-by-side picture.
Layer 1: CPU L1 / L2 / L3 Caches (1-30 ns)
The fastest memory in any computer is not RAM; it is the cache baked directly onto the processor die. There are three tiers of this on-die cache, named by how close they sit to each core: "level 1" is closest and smallest, "level 3" is furthest and largest. The innermost one, closest to a single core, is called L1 cache; it holds 32-64 KB per core and responds in about 1 nanosecond, which is 4 CPU clock cycles at 4 GHz. The middle tier, L2 cache, holds 256 KB to ~4 MB and responds in 4-12 ns. The outermost, L3 cache, is shared across all cores and holds 4-192+ MB depending on the CPU, responding in 15-50 ns.
Your application code cannot directly address L1/L2/L3 caches. You cannot say "put this variable in L1." The CPU hardware manages these caches automatically; the programmer's job is to write code with cache-friendly access patterns. Accessing array elements in order (sequential, excellent spatial locality) is dramatically faster than jumping around a linked list (random pointers, terrible spatial locality). This matters for database internals, serialization, and hot-path server code. A cache-unfriendly inner loop can run 10-100× slower than an equivalent cache-friendly one, not because of algorithmic complexity but because of L1/L2 thrashing.
Layer 2: OS Page Cache (~100 ns, GBs of DRAM)
When your application reads a file (say, a database reading a data page from disk), the operating system does something clever: it keeps a copy of that file's data in ordinary RAM (DRAM) even after the read is done. The next time anyone reads the same data, the OS serves it directly from RAM without touching the disk. This memory region managed by the kernel is called the OS page cache.
Why does this matter? Because a database read effectively passes through two caches: the database's own buffer pool (Layer 5) and the OS page cache. When Postgres reads a data page and it's not in shared_buffers, it asks the kernel, which checks its page cache first. If the page is there (a buffer-pool miss from Postgres's perspective, but a page-cache hit from the OS perspective), the read takes ~100 ns. Only if the page cache also misses does the OS issue an actual disk I/O: 50 µs for an SSD, 5-10 ms for an HDD. This two-level structure is often overlooked but explains why databases see much better performance than naïve latency calculations suggest.
Layer 3: App-Process Cache (100 ns-1 µs, MBs-GBs)
When you store results in a HashMap inside your Java application, or a dict in Python, or use a library like Caffeine, that is Layer 3: the app-process cache. It lives inside your application's own memory (heap), so access is just a hash lookup: no network, no IPC, no syscall. Latency is effectively the same as accessing any other in-memory data structure: 100 ns to a few microseconds.
The critical limitation of Layer 3 is that it is local to one process. If you have ten application servers, each has its own independent in-process cache. If server A caches a user's profile, server B knows nothing about it. This makes Layer 3 best for shared-nothing, read-heavy, globally valid data: reference data that is the same for all servers (configuration values, product catalog snippets, static rate limits), not per-user state. For per-user or frequently updated shared state, you need Layer 4.
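A Layer 3 cache can be as small as a dict with expiry times. The sketch below is one possible shape, not a replacement for a library like Caffeine: `TTLCache`, `get_or_load`, and the injectable `clock` parameter are names chosen for this example, and there is no size bound or eviction beyond the TTL.

```python
import time


class TTLCache:
    """Minimal in-process cache: a dict whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so tests can control time
        self._data = {}      # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # fresh hit: just a hash lookup
        value = loader(key)                      # miss or expired: go to the slower layer
        self._data[key] = (now + self.ttl, value)
        return value
```

Each application server would hold its own instance, which is why this pattern suits reference data that is identical everywhere rather than per-user state.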
Layer 4: Distributed Cache: Redis / Memcached (0.5-2 ms, GBs-TBs)
When you need a cache that is visible to all your application servers simultaneously, you reach for a distributed cache. Redis and Memcached are separate server processes: your application connects to them over a TCP socket, pays a network round-trip, and gets back a cached value in 0.5-2 ms.
That network round-trip is the key trade-off: Layer 4 is ~1,000-2,000× slower than Layer 3 for any single lookup, but it is shared. Every server in your fleet sees the same data. This makes it ideal for: user sessions, rate limit counters, cart contents, real-time leaderboards, feature flags, and any state that is per-user but needs to survive a process restart or be visible to multiple servers. The canonical rule: if the data lives in your database and you want to serve it without a DB query, Redis is your Layer 4.
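A common way to use Layer 4 is the cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache. The sketch below assumes a client object exposing `get(key)` and `setex(key, ttl, value)`, which is the shape of the redis-py client's methods. `get_cart`, `load_cart_from_db`, the `cart:` key prefix, and `FakeCache` are hypothetical names for illustration, and the 300-second TTL mirrors the 5-minute TTL in the checkout story.

```python
import json


def get_cart(client, load_cart_from_db, user_id, ttl_seconds=300):
    """Cache-aside read: try the shared cache first, fall back to the database."""
    key = f"cart:{user_id}"
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)                     # hit: no DB query
    cart = load_cart_from_db(user_id)                 # miss: pay the DB cost once
    client.setex(key, ttl_seconds, json.dumps(cart))  # populate for the next reader
    return cart


class FakeCache:
    """In-memory stand-in for a cache client, for local experiments only."""

    def __init__(self):
        self.values = {}

    def get(self, key):
        return self.values.get(key)

    def setex(self, key, ttl_seconds, value):
        self.values[key] = value   # TTL is ignored in this stand-in
```

Passing the client in as a parameter keeps the function testable without a running server. In production the same call shape would go over the network and cost the 0.5-2 ms round-trip described above.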
Layer 5: Database Buffer Pool (5 µs-15 ms, GBs on DB Server)
Your database has its own cache, and it is arguably the most powerful cache in the entire stack for database-heavy applications. The database buffer pool caches raw data pages in RAM on the database server itself. Postgres sizes it with shared_buffers. MySQL's InnoDB sizes it with innodb_buffer_pool_size. The concept is the same: a chunk of the DB server's RAM is set aside for caching recently read database pages.
When a query reads a row, the database loads the entire page containing that row (8 KB in Postgres, 16 KB by default in InnoDB) into the buffer pool. Future queries for any row on that same page are served from RAM in ~5 µs instead of requiring a disk read (50 µs for NVMe SSD, 5-10 ms for HDD). For a production database with a well-tuned buffer pool and a working set that fits in RAM, 90-99% of reads never touch storage at all. The critical tuning fact: Postgres's default shared_buffers is just 128 MB regardless of how much RAM the server has. On a 32 GB server, a single config change (shared_buffers = 8GB) can eliminate most disk I/O.
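One way to see whether the buffer pool is doing its job in Postgres is to compare the `blks_hit` and `blks_read` counters from the `pg_stat_database` view. The helper below is a small sketch that takes those two counters as plain integers; how you fetch them (driver, query wrapper) is left out, and the function name is made up here.

```python
def buffer_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from shared_buffers.

    blks_hit:  blocks found in the buffer pool
    blks_read: blocks Postgres had to request from the operating system
    """
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0


print(buffer_hit_ratio(990, 10))
```

Note that `blks_read` counts requests handed to the OS, so some of those "reads" may still be answered by the OS page cache rather than the disk.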
Layer 6: CDN Edge (5-50 ms, GBs-PBs, Globally Distributed)
The outermost cache layer sits between the Internet and your origin servers. A CDN edge node is a server in a data center physically close to the user, often within the same city or metropolitan area. When the edge node has a cached copy of the requested resource (an HTML page, a JSON API response, an image), it serves the response directly: ~5-30 ms to the user regardless of where your origin server lives.
CDN caching is controlled by HTTP headers your application sends: Cache-Control, Surrogate-Control, Vary, ETag. The cache key is typically the URL. CDNs are ideal for: static files (JS/CSS/images with content-hash URLs, cached for a year with Cache-Control: max-age=31536000, immutable), shared page shells (header, footer, product grids), and public API responses (search results, product listings). They are not ideal for personalized responses (shopping carts, authenticated pages) without careful key partitioning or ESI (Edge Side Includes).
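Since the control knob for this layer is the response header, a sketch of a header policy can make the three cases concrete. The category names and the function below are illustrative assumptions, and the 60-second shared TTL is borrowed from the checkout example. The directives themselves (`public`, `private`, `max-age`, `s-maxage`, `immutable`, `no-store`) are standard Cache-Control directives.

```python
ONE_YEAR_SECONDS = 31536000


def cache_control_for(kind):
    """Return a Cache-Control value for a (hypothetical) content category."""
    policies = {
        # Content-hashed URL: the bytes never change, so cache for a year everywhere.
        "hashed_asset": f"public, max-age={ONE_YEAR_SECONDS}, immutable",
        # Shared page shell: browsers revalidate, shared caches (the CDN) keep it 60 s.
        "shared_shell": "public, max-age=0, s-maxage=60",
        # Per-user response: keep it out of shared caches entirely.
        "personalized": "private, no-store",
    }
    return policies[kind]


print(cache_control_for("hashed_asset"))
```

Whether a given CDN honors `s-maxage`, `Surrogate-Control`, or its own header for edge TTLs varies by provider, so treat the middle policy as a starting point to check against your CDN's documentation.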
The comparison table highlights the most important pattern: as you move down the layers, latency multiplies by 10-1,000× at each step, while capacity also multiplies by 10-1,000×. There is no layer that is "better"; each layer has a different role. The goal is to ensure that the hottest data (most frequently accessed) lives in the fastest layer that can hold it, and that each layer is sized large enough to achieve a high hit ratio for its working set.
The Math of Multi-Level Caching: Why Compounding Hit Ratios Win
Here is the insight that separates engineers who design caches from engineers who just use them: cache hit ratios across multiple layers do not add; they multiply. And multiplication is far more powerful than addition.
The Cascading Multiplication
Imagine a four-layer cache stack where each layer has a 90% hit ratio. You might intuitively think "90% + 90% + 90% + 90% = 360%" which doesn't even make sense, or maybe "90% average across layers." But the real math works like this: the 10% of requests that miss Layer 1 reach Layer 2. Layer 2 hits 90% of those, so 10% × 10% = 1% of original requests reach Layer 3. Layer 3 hits 90% of those, so 1% × 10% = 0.1% of original requests reach Layer 4. Layer 4 hits 90%, so only 0.1% × 10% = 0.01% of original requests reach the backing store (disk).
The formula is straightforward: the probability of a request reaching layer N is the product of all miss rates for layers 1 through N-1.
Example: 4 layers at 90% each
P(reach L2) = 1 − 0.90 = 0.10 (10%)
P(reach L3) = 0.10 × 0.10 = 0.01 (1%)
P(reach L4) = 0.01 × 0.10 = 0.001 (0.1%)
P(reach disk) = 0.001 × 0.10 = 0.0001 (0.01% of original traffic)
Put differently: you need 10,000 requests at Layer 1 before a single request reaches disk. If your Layer 1 is processing 10,000 requests per second, your disk sees 1 request per second. That is the power of compounding.
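The cascade is easy to reproduce in a few lines. This is a minimal sketch of the multiplication described above; the function name is invented for the example, and the hit ratios are the 90% figures from the text.

```python
def traffic_reaching_each_layer(requests_per_second, hit_ratios):
    """Return the request rate arriving at each cache layer, then at the backing store."""
    remaining = requests_per_second
    reaching = []
    for hit_ratio in hit_ratios:
        reaching.append(remaining)        # traffic that arrives at this layer
        remaining *= (1.0 - hit_ratio)    # only the misses continue downward
    reaching.append(remaining)            # what finally hits the backing store
    return reaching


rates = traffic_reaching_each_layer(10_000, [0.90, 0.90, 0.90, 0.90])
print([round(r, 3) for r in rates])
```

The last element is the disk load: about one request per second out of the original 10,000, up to floating-point rounding.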
The cascade diagram makes the compounding visceral: each layer's bar is 10× shorter than the one before it, because each layer absorbs 90% of what reaches it. By the time you reach the disk, the traffic is a tiny sliver: 1 request per second from an original 10,000. The database doesn't need to scale to match your application traffic; it only needs to handle the residual after every upstream cache layer has done its job.
The End-to-End Latency Formula
The same compounding logic applies to average latency. The end-to-end average latency of a multi-level cache stack is not the slowest layer's latency; it is the probability-weighted sum across all layers.
Average latency = Σ P(hit at layer i) × latency(i), summed over every layer i, with the backing store as the final layer that always hits
Where: P(hit at layer i) = hit_ratio(i) × P(reach layer i)
P(reach layer i) = ∏ miss_ratio(j) for j = 1 to i-1
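The weighted sum can be sketched directly from those definitions. The function below is an illustration with made-up names, and the numbers in the usage lines are deliberately toy values, not measurements. It charges each request the latency of the layer that finally serves it, which is the simplification the formula makes.

```python
def average_latency(layers, backing_store_latency):
    """Probability-weighted average latency of a cache stack.

    layers: list of (hit_ratio, latency) pairs, ordered from the first layer
            checked to the last. The backing store is treated as always hitting.
    Returns (average_latency, fraction_of_requests_reaching_backing_store).
    """
    p_reach = 1.0
    total = 0.0
    for hit_ratio, latency in layers:
        total += p_reach * hit_ratio * latency   # P(hit at layer i) * latency(i)
        p_reach *= (1.0 - hit_ratio)             # product of miss ratios so far
    total += p_reach * backing_store_latency
    return total, p_reach


# Toy numbers: two layers at 50% hit ratio (1 ms and 10 ms), backing store at 100 ms.
avg, to_store = average_latency([(0.5, 1.0), (0.5, 10.0)], 100.0)
print(avg, to_store)
```

Here a quarter of requests reach the backing store and contribute 25 ms of the 28 ms average, which shows how quickly the picture changes once upstream hit ratios are mediocre.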
Let's work through a realistic example with five layers and real latency numbers. The numbers below are representative for a typical production web application with a well-tuned Redis cluster, reasonable DB buffer pool, and CDN in front:
The worked example reveals something surprising: even though disk I/O is 10 ms, roughly 33,000× slower than an app-cache hit, the disk contributes only about 2% of the average latency. Why? Because only 0.075% of requests ever reach it. The slowest layer barely matters for the average, because the upstream layers filter out virtually all traffic. This has a counterintuitive implication for optimization: spending engineering effort to improve the first layer's hit ratio (e.g., from 85% to 92%) has a larger impact on average latency than making the disk read path 10× faster, because the disk path is already so rare.
When a Single 50% Layer Is Worth Adding
The math also reveals why adding a mediocre cache layer with only 50% hit ratio can still be worthwhile. If your stack currently sends 1,000 requests per second to the database, adding a layer with 50% hit ratio reduces database traffic to 500 requests per second, without any database tuning. Whether 500 req/s of saved DB load justifies the operational cost of the new cache layer depends on your specific system, but the math is in your favor as long as the cache layer is cheaper per request than the layer below it. Redis at 0.5 ms is dramatically cheaper to scale than a Postgres query at 10 ms, so even a 50% Redis hit ratio dramatically reduces DB scaling costs.
Layer 1 Deep Dive: CPU Cache (L1, L2, L3) and the Memory Hierarchy
Before your code ever touches Redis or a database, it runs on a CPU that is doing its own caching: silently, automatically, and at speeds so fast that most engineers never think about them. Understanding this layer won't let you configure it (the hardware controls it entirely), but it will explain why some algorithms are 10× faster than others on identical hardware, and why "cache-friendly" code is a real engineering concern worth talking about in a systems interview.
The Speed Gap Problem: Why CPU Caches Had to Exist
Here is the fundamental problem that forced CPU caches into existence: modern CPUs execute instructions in roughly 0.25-0.5 nanoseconds (a 3 GHz clock ticks every 0.33 ns), but reading from DRAM, the ordinary RAM in your computer, takes roughly 80 nanoseconds. That is a 200× speed gap. Without caching, the CPU would spend most of its time waiting for data, running at roughly 0.5% of its theoretical peak speed.
The solution: build a small chunk of extra-fast memory and bolt it onto the CPU itself. The memory used inside CPU caches is a different type than the RAM in your DIMM slots. It uses six transistors per bit instead of one transistor and a capacitor, so each bit holds its value and responds quickly without needing the periodic "refresh" pulse ordinary RAM needs. This faster type is called SRAM (Static RAM), and the slower bulk type used in main memory is DRAM (Dynamic RAM). SRAM is faster but far more expensive per bit, because six transistors take much more die area than one, so you can only afford a tiny amount. The hardware splits that tiny amount into a three-level hierarchy:
The memory hierarchy pyramid. Each level down is roughly 5-10× slower, 5-30× larger, and much cheaper per byte. The CPU checks L1 first; on a miss it checks L2, then L3, then DRAM. Every miss to a lower level costs more cycles.
What the CPU Actually Caches: Cache Lines, Not Individual Bytes
Here is a detail that trips up many people: the CPU doesn't cache individual bytes or even individual integers. Instead, every time it pulls something from RAM it grabs a fixed-size neighborhood, typically a 64-byte chunk on Intel/AMD x86 and most ARM64 server and mobile cores, and stuffs the whole chunk into the cache. Apple's M-series chips are a notable exception and use 128-byte lines. Those chunks are called cache lines. Why 64 bytes specifically? Because programs almost never read just one byte and stop. If your code touches address X, it will very likely touch X+1, X+2, ... X+63 within a few instructions, a pattern called spatial locality. Loading the whole neighborhood at once means one slow trip to DRAM pays for dozens of future fast reads.
This has a practical consequence for data structure design: an array of 64-bit integers stores 8 integers per cache line. Traversing the array sequentially loads a new cache line every 8 elements, but each load serves the next 7 reads for free, which is very efficient. A linked list, by contrast, stores each node at a random heap address; every pointer dereference may require loading an entirely new cache line, potentially causing a full DRAM round-trip for every node. Same algorithmic complexity (O(n)), but the array version can be 5-10× faster in practice purely because of cache efficiency.
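The "8 integers per line" arithmetic can be checked with a small counting model. This sketch only counts how many distinct 64-byte lines a sequence of byte addresses covers. It assumes the array starts on a line boundary and ignores prefetching, associativity, and cache capacity. The scattered addresses are a made-up stand-in for linked-list nodes spread across the heap.

```python
CACHE_LINE_BYTES = 64


def cache_lines_touched(addresses, line_bytes=CACHE_LINE_BYTES):
    """Count the distinct cache lines covered by a sequence of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


n = 1024
# 8-byte integers packed back to back in an array starting at address 0.
sequential = [i * 8 for i in range(n)]
# Hypothetical list nodes scattered one per 4 KB page.
scattered = [i * 4096 for i in range(n)]

array_lines = cache_lines_touched(sequential)
list_lines = cache_lines_touched(scattered)
print(array_lines, list_lines)
```

The array needs one line fetch per eight elements, while the scattered layout needs one per element. The real-world gap depends on the prefetcher and on how scattered the nodes actually are.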
Temporal and Spatial Locality: Why Caches Work at All
CPU caches work because real programs aren't random β they have patterns. Two key patterns make caching effective:
- Temporal locality: if you accessed a piece of data once, you'll probably access it again soon. A loop counter used a million times fits this pattern perfectly. The CPU keeps it in L1 and never pays the DRAM cost after the first load.
- Spatial locality: if you accessed address X, you'll likely access nearby addresses soon. Array traversal, struct field access, and sequential file reads all exhibit this. The cache-line mechanism exploits it automatically.
When your code violates these patterns (accessing memory in a random stride, jumping around a huge linked list), you "thrash the cache" and effectively reduce the CPU to running at DRAM speeds. The difference between cache-friendly and cache-hostile code on the same hardware is routinely 5-10×.
MESI Coherence: Keeping Multi-Core Caches Consistent
A modern CPU has 8-128 cores, each with its own L1 and L2 cache. If two cores both cache the same memory address and one of them writes to it, the other's cached copy is now stale and would return the wrong answer if used. The hardware fixes this by giving every cached 64-byte chunk a little status tag ("is this copy still trustworthy, or has somebody else changed it?") and signalling between cores whenever a write happens. The protocol that defines the four possible tag values and the rules for switching between them is called MESI. The name is just the four states stitched together: Modified, Exclusive, Shared, Invalid.
The hardware circuitry managing MESI transitions runs entirely without software intervention. From your code's perspective, multi-core caches appear coherent. From a performance perspective, however, MESI coherence creates one dangerous trap.
False Sharing: The Silent Performance Killer
Imagine two threads on two different cores, each incrementing their own independent counter. Thread A increments counter_a; Thread B increments counter_b. These are completely independent operations, so in theory they should run at full speed with zero contention. But if counter_a and counter_b happen to live in the same 64-byte cache line (e.g., they are adjacent fields in the same struct), every write by Thread A invalidates Thread B's cached copy of that line via MESI, and vice versa. Neither thread is actually touching the other's data, but the CPU's cache coherence mechanism can't tell, because it only sees cache lines, not individual variables. This is false sharing: two threads contending over a cache line they don't logically share.
False sharing in action. Both cores read the same 64-byte cache line into their L1 caches. When Core 0 writes counter_a, MESI sends an invalidation to Core 1, even though Core 1 only cares about counter_b. Both threads slow down sharply as the line bounces between cores, despite touching logically independent data. The fix is to pad each counter to 64 bytes so they live in separate cache lines.
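The padding fix is a matter of struct layout, which can be inspected without running any threads. The sketch below uses Python's ctypes only to compute field offsets; it does not measure false sharing. The struct names are invented for the example, and it assumes 64-byte lines and a struct that starts on a line boundary.

```python
import ctypes

CACHE_LINE_BYTES = 64


class SharedLine(ctypes.Structure):
    """Two adjacent 8-byte counters: both land in the same 64-byte line."""
    _fields_ = [
        ("counter_a", ctypes.c_int64),
        ("counter_b", ctypes.c_int64),
    ]


class Padded(ctypes.Structure):
    """Same counters, with filler pushing counter_b onto the next line."""
    _fields_ = [
        ("counter_a", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE_BYTES - 8)),
        ("counter_b", ctypes.c_int64),
    ]


def same_cache_line(struct, field_x, field_y, line_bytes=CACHE_LINE_BYTES):
    """True if both fields fall in one line, assuming the struct is line-aligned."""
    off_x = getattr(struct, field_x).offset
    off_y = getattr(struct, field_y).offset
    return off_x // line_bytes == off_y // line_bytes


print(same_cache_line(SharedLine, "counter_a", "counter_b"))
print(same_cache_line(Padded, "counter_a", "counter_b"))
```

The padded layout costs 56 wasted bytes per counter, which is the usual trade: a little memory in exchange for keeping independent writers off each other's lines.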
Java's @Contended annotation (enabled for application classes with the JVM flag -XX:-RestrictContended) instructs the JVM to add padding around a field, 128 bytes by default, to prevent false sharing. Go's standard library pads some per-processor structures, such as the per-P pools inside sync.Pool, for the same reason. The Disruptor (the high-performance Java ring buffer) avoids false sharing by padding its sequence counters so that each one sits alone on its cache line; that is one of the main reasons it outperforms traditional queues by a wide margin in its published throughput benchmarks.
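The padding fix can be checked without any threads by looking at field offsets. The sketch below uses Python's ctypes to lay out two counters the way a C struct would, and computes which 64-byte line each field falls on. It assumes a 64-byte line and a struct that starts on a line boundary; the struct and helper names are illustrative.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed line size, matching the text


class Packed(ctypes.Structure):
    # Two adjacent 8-byte counters: offsets 0 and 8, same cache line.
    _fields_ = [("counter_a", ctypes.c_int64),
                ("counter_b", ctypes.c_int64)]


class Padded(ctypes.Structure):
    # Filler bytes push counter_b to offset 64, onto the next line.
    _fields_ = [
        ("counter_a", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_int64))),
        ("counter_b", ctypes.c_int64),
    ]


def line_index(struct, field):
    """Cache-line number of a field, assuming the struct is line-aligned."""
    return getattr(struct, field).offset // CACHE_LINE


for struct in (Packed, Padded):
    print(struct.__name__,
          line_index(struct, "counter_a"),
          line_index(struct, "counter_b"))
```

The cost is visible too: the padded struct is 72 bytes instead of 16, which is why padding is reserved for fields that are known to be written concurrently.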
Layer 2 Deep Dive: OS Page Cache (The Free Cache You Already Have)
Every Linux server you've ever run an application on has been quietly caching disk data in RAM behind your back, automatically, since the 1990s. This is the OS page cache, and it is arguably the cheapest performance win in systems engineering because you didn't have to configure, buy, or install anything. It just works. Understanding it lets you reason about when it's working for you, when workloads fight against it, and how to tune it for your specific situation.
How the Page Cache Works: The Kernel's RAM Recycler
When your application calls read() on a file, the kernel doesn't fetch bytes straight from disk and hand them to your process. Instead, it reads an entire page (4 KB on x86 Linux) into a kernel-managed RAM buffer, then copies the data to your process's address space. The crucial part: after satisfying your read() call, the kernel keeps that page in RAM. The next read() of that page, whether by you, by another process, or by the OS itself, is served entirely from RAM with no disk involved.
Where does this RAM come from? The kernel uses all RAM not currently needed by processes as page cache. If a process needs more heap memory, the kernel evicts cold pages from the page cache (using a variant of LRU) to reclaim space. The page cache is not a fixed allocation: it expands and contracts dynamically based on system pressure. This is why the output of free -h can be confusing at first glance.
A common mistake: seeing free = 0.8 GB and panicking that the server is out of memory. The 26 GB in buff/cache is available to processes on demand. The kernel will evict page cache pages to make room. available is the honest metric for "how much RAM can a new process get right now."
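The same numbers can be read programmatically. The sketch below parses text in the /proc/meminfo format and derives the three figures discussed above. The `SAMPLE` values are illustrative, chosen to match the 0.8 GB free / 26 GB buff/cache example, and are not output captured from a real server; the buff/cache sum (Buffers + Cached + SReclaimable) follows how recent versions of `free` compute that column.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo'-style 'Key:   value kB' lines into {key: kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


# Illustrative sample only (a 32 GiB machine), not real measurements.
SAMPLE = """\
MemTotal:       33554432 kB
MemFree:          838860 kB
MemAvailable:   27262976 kB
Buffers:          524288 kB
Cached:         26214400 kB
SReclaimable:     524288 kB
"""


def summarize(info):
    kb_per_gib = 1024 * 1024
    buff_cache = info["Buffers"] + info["Cached"] + info["SReclaimable"]
    return {
        "free_gib": round(info["MemFree"] / kb_per_gib, 1),
        "buff_cache_gib": round(buff_cache / kb_per_gib, 1),
        "available_gib": round(info["MemAvailable"] / kb_per_gib, 1),
    }


print(summarize(parse_meminfo(SAMPLE)))
```

On a live Linux host the same functions can be fed `open("/proc/meminfo").read()`; alerting on `available_gib` rather than `free_gib` avoids the false alarm described above.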
The read() path through the OS page cache. On a hit, the kernel copies data from a RAM page (~100 ns). On a miss, it fetches from disk, stores the page in the cache, then copies to your process. Critically, the page stays in cache after the miss, so subsequent reads are served from RAM.
Why Databases Love (and Sometimes Fight) the Page Cache
Every database you've ever used benefits from the page cache by default. Postgres reads its data files through the page cache, and the page cache is often the reason Postgres feels fast even when shared_buffers is set to a modest 2 GB on a 64 GB server: the OS is quietly caching another 40 GB of hot pages underneath. This creates a double-buffering situation: both Postgres's own buffer pool and the OS page cache hold copies of the same data. It's technically wasteful (RAM used twice), but in practice the two levels serve different purposes. shared_buffers knows it's holding database pages and evicts intelligently; the page cache just sees bytes and does generic LRU.
MySQL's InnoDB is usually run differently: with innodb_flush_method=O_DIRECT it opens its data files with the O_DIRECT flag, which bypasses the page cache so that InnoDB's own buffer pool is the only cache. The motivation is to avoid double-buffering: if you've allocated 60 GB to the InnoDB buffer pool, there's no point caching the same data again in the page cache. InnoDB's buffer pool has smarter eviction policies (it knows about hot vs cold pages in its B-tree), so excluding the OS cache is a net win when RAM is the constraint.
Scan Eviction: The Page Cache's Achilles Heel
The page cache's biggest weakness is scan eviction. Imagine running a full-table scan on a 200 GB table while your production database also serves live OLTP queries. The scan reads 200 GB of pages sequentially, filling the page cache as it goes. Because the OS has only limited ability to tell "scan I'll never see again" apart from "hot page accessed a thousand times today," it can evict hot pages to make room for scan data. After the scan finishes, many of your hot OLTP pages are gone, and subsequent queries hit cache misses and disk reads. The database spends the next several minutes re-warming the cache. This is called cache pollution.
The fix is to tell the kernel about your access pattern so it stops treating the scan like hot data. Linux gives you two system calls for this: one for normal file reads, one for memory-mapped files. They are called posix_fadvise() (file advise) and madvise() (memory advise), and they let you whisper hints like "I'm scanning, don't bother keeping these pages" or "I'll be jumping around randomly, skip the pre-fetch." Used well, they let a scan walk through the page cache without pushing out the hot OLTP data.
Postgres historically used posix_fadvise(POSIX_FADV_WILLNEED) driven by effective_io_concurrency to prefetch pages during bitmap heap scans and similar operations (modern Postgres 17+ has moved sequential scans onto the streaming-I/O / async-I/O framework instead). On the application side, POSIX_FADV_DONTNEED is the surgical lever you reach for when you specifically want to prevent a one-shot batch scan from polluting the page cache. It's a rare but powerful tuning lever for mixed OLTP + analytical workloads.
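As a rough illustration of the application-side lever, the sketch below reads a file once and then asks the kernel to drop its pages. It uses Python's `os.posix_fadvise`, which wraps the same system call and exists only on Unix-like platforms (hence the `hasattr` guard). The function name and chunk size are assumptions, and the hints are advisory: the kernel is free to ignore them, and POSIX_FADV_DONTNEED only discards clean pages.

```python
import os


def scan_without_polluting(path, chunk_size=1 << 20):
    """Read a file once, sequentially, then hint that its pages can go."""
    have_fadvise = hasattr(os, "posix_fadvise")
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if have_fadvise:
            # offset=0, length=0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)  # a real job would process the chunk here
        if have_fadvise:
            # One-shot scan is done: these pages are not worth keeping.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

For very large files a batch job would typically issue DONTNEED periodically on the range it has already consumed, rather than once at the end, so the scan never occupies much of the cache at any moment.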
tmpfs and shmfs: Explicit In-RAM Filesystems
Sometimes you want certain data to live in memory and never be written to a disk-backed filesystem: temporary files, in-memory IPC buffers, shared-memory regions between processes. Linux offers a special filesystem that looks and acts like any other directory but is backed by the page cache and swap rather than a real disk. It is called tmpfs (temporary filesystem). Files you write to a tmpfs mount stay in RAM unless memory pressure pushes them out to swap. The familiar /dev/shm path is a tmpfs mount, and POSIX shared memory lives there, including the dynamic shared memory segments Postgres creates on Linux for parallel queries, so processes can share those regions without any disk I/O.
The buff/cache line in free -h shows the page cache's size; high values are healthy, not wasteful. Databases like Postgres ride on top of the page cache; InnoDB typically bypasses it via O_DIRECT to avoid double-buffering. Full-table scans can evict hot pages (cache pollution); posix_fadvise() with POSIX_FADV_DONTNEED is the surgical fix.
Layer 3 Deep Dive: In-Process Application Cache (Caffeine, Guava, dicts)
The in-process application cache is the cache you build yourself, inside your application's own memory. Unlike the CPU cache (hardware-controlled) or the OS page cache (kernel-controlled), this one is yours to design: you choose what to store, how much, how long, and what eviction policy to use. It is the fastest cache you can write application code against: because the data is already in the same process, there is no network hop, no serialization, no protocol overhead. A hit costs roughly 100–500 ns on a modern JVM, compared to 0.5–2 ms for Redis. That is a difference of three to four orders of magnitude (roughly 1,000× to 20,000×).
Why Not Just Use a HashMap? The Problem With Unbounded Caches
The simplest in-process cache is a HashMap: put data in on first access, read from it on subsequent accesses. This works until your JVM runs out of heap memory and starts either throwing OutOfMemoryError or spending 80% of its time in GC. An in-process cache without a size bound is just a memory leak with aspirations. You need at minimum: a maximum size (so memory is bounded), an eviction policy (so the least-useful entries are dropped first), and optionally a TTL (so stale data expires).
Caffeine: The Modern Best-in-Class (Java)
For Java, Caffeine is the de-facto standard. It replaced Guava's cache as the go-to library (Spring Boot can auto-configure it as the backend for @Cacheable when it's on the classpath). The reason it pulled ahead is its eviction algorithm. The old default, "drop whatever was used longest ago" (plain LRU), has a known weakness: a one-time scan can shove out keys that have been hammered a million times. Caffeine instead remembers how often each key has been used recently, and protects frequently-used keys from one-shot scans. The technique has a mouthful of a name, W-TinyLFU (Window Tiny Least Frequently Used), but the idea is just "favour popular keys over recently-touched-once keys." A minimal production-ready Caffeine setup needs only three things: a maximumSize, an expireAfterWrite duration, and reads that go through get(key, loader).
Key points: maximumSize bounds memory; expireAfterWrite bounds staleness; get(key, loader) is the preferred read pattern because Caffeine coalesces concurrent loads for the same key. If 100 threads request the same missing key simultaneously, only one DB call fires and the other 99 wait for that result. This prevents the cache stampede (also called "thundering herd") where a sudden miss causes a flood of identical DB queries.
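The three ideas (a size bound, a TTL, and one load per missing key) can be sketched in a few dozen lines of Python. This is not Caffeine and does not implement W-TinyLFU: it uses plain LRU eviction, and the `LoadingCache` class with its `get(key, loader)` method is a hypothetical stand-in that only mirrors the shape of the API described above.

```python
import threading
import time
from collections import OrderedDict


class LoadingCache:
    """Bounded LRU cache with TTL and one loader call per missing key."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = OrderedDict()   # key -> (value, expires_at), LRU first
        self._lock = threading.Lock()
        self._key_locks = {}         # key -> lock held while key is loading

    def _lookup(self, key):
        entry = self._data.get(key)
        if entry is None:
            return False, None
        if entry[1] <= self._clock():
            del self._data[key]      # expired: treat as a miss
            return False, None
        self._data.move_to_end(key)  # mark as most recently used
        return True, entry[0]

    def get(self, key, loader):
        with self._lock:
            hit, value = self._lookup(key)
            if hit:
                return value
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:               # concurrent callers for this key queue here
            with self._lock:
                hit, value = self._lookup(key)
                if hit:
                    return value     # another thread loaded it while we waited
            value = loader(key)
            with self._lock:
                self._data[key] = (value, self._clock() + self._ttl)
                self._data.move_to_end(key)
                while len(self._data) > self._max_size:
                    self._data.popitem(last=False)   # evict least recently used
                self._key_locks.pop(key, None)       # keep the lock table small
            return value
```

Eight threads asking for the same missing key trigger a single `loader` call; the rest block briefly and read the cached value. Under heavy eviction a duplicate load for the same key is still possible in this sketch, which costs an extra query but never returns a wrong value.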
The Multi-Instance Problem: Why In-Process Cache Hit Ratio Degrades at Scale
Here is the catch with in-process caching: when you run multiple application instances (horizontal scaling), each instance has its own private cache. Instance A might cache user profile 42; instance B might not. A request routed to instance B misses and hits the database, even though the exact same data is sitting warm in instance A's heap. Your effective hit ratio is per instance, not across the whole fleet.
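With 3 instances and random load balancing, a key that has just been loaded (user 42) is cached in only 1 of 3 instances, so the next request for it has a two-in-three chance of landing on an instance that has never seen it and falling through to the database. Every key has to be loaded once per instance, which means cold-miss traffic and duplicated memory both grow with the instance count. The fix is either sticky sessions (fragile, counter to horizontal scaling) or a combined strategy: a small Caffeine cache for the very hottest keys, backed by Redis for cross-instance sharing.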
The L-Shape Cache Strategy: In-Process + Distributed Together
The best pattern for production is a two-level lookup: check the in-process Caffeine cache first (~1 µs); on a miss, check Redis (~1 ms); on a miss there, query the database and populate both caches. This is called the "L-shape" or "layered" cache strategy. The Caffeine tier holds the hottest ~1,000–10,000 keys; Redis holds the full warm data set across all instances. The Caffeine layer absorbs the hottest 5–10% of traffic at microsecond speed. The Redis layer handles the rest at millisecond speed. In a well-tuned setup, only a small fraction of requests (often under 1%) reach the database.
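The lookup order can be sketched as follows. Plain dicts stand in for both tiers here (a per-process dict for Caffeine, a shared dict for Redis), and `TwoLevelCache`, `load_from_db`, and the `stats` counters are hypothetical names used only for illustration. A real implementation would also need TTLs and size bounds on the local tier.

```python
class TwoLevelCache:
    """Local cache first, shared cache second, database last."""

    def __init__(self, local, shared, load_from_db):
        self.local = local            # per-process dict (in-process tier)
        self.shared = shared          # dict standing in for a Redis client
        self.load_from_db = load_from_db
        self.stats = {"local": 0, "shared": 0, "db": 0}

    def get(self, key):
        if key in self.local:
            self.stats["local"] += 1
            return self.local[key]
        if key in self.shared:
            self.stats["shared"] += 1
            value = self.shared[key]
        else:
            self.stats["db"] += 1
            value = self.load_from_db(key)
            self.shared[key] = value  # populate the shared tier for everyone
        self.local[key] = value       # populate this instance's local tier
        return value


# Two app instances share one "Redis" but have private local caches.
shared_tier = {}
instance_a = TwoLevelCache({}, shared_tier, lambda key: f"row-for-{key}")
instance_b = TwoLevelCache({}, shared_tier, lambda key: f"row-for-{key}")

instance_a.get("user:42")   # database
instance_b.get("user:42")   # shared tier, no database call
instance_a.get("user:42")   # local tier
print(instance_a.stats, instance_b.stats)
```

The second instance never touches the database for a key the first instance already loaded, which is exactly the multi-instance problem the shared tier is there to solve.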
Equivalents in Other Languages
- Python: functools.lru_cache (built-in, size-bounded, LRU, but not TTL-aware) or the cachetools library (TTL, LFU, LRU). For async/FastAPI code, aiocache provides async-safe in-process caches.
- .NET / C#: System.Runtime.Caching.MemoryCache (framework) or Microsoft.Extensions.Caching.Memory.IMemoryCache (recommended in .NET Core). Both support size limits, TTL, and eviction callbacks.
- Node.js: The lru-cache npm package is the standard. Supports max (entry count), maxSize (byte-based), ttl, and a fetchMethod for atomic load-and-cache (equivalent to Caffeine's get(key, loader)).
Layer 4 Deep Dive: Distributed Cache (Redis, Memcached)
A distributed cache solves the one problem that in-process caching cannot: shared state across multiple application servers. Instead of each server maintaining its own private copy of cached data, all servers talk to a central (or clustered) cache tier. Every server sees the same data, a single invalidation takes effect for all of them at once, and the cache can be sized independently of your application server heap. The tradeoff, and it is a real one, is a network round-trip for every cache operation: typically 0.5–2 ms on a well-connected internal network, compared to ~100–500 ns for in-process.
Redis vs Memcached: Which One and When
| Dimension | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, JSON (RedisJSON), bitmaps | Strings only (key → arbitrary byte blob) |
| Persistence | RDB snapshots, AOF (append-only log), or both | None: pure RAM, so restart = empty cache |
| Replication | Built-in primary-replica; Sentinel for automatic failover; Cluster for sharding | No built-in replication; client-side sharding only |
| Scripting / atomics | Lua scripts, MULTI/EXEC transactions, atomic commands (INCR, SETNX) | CAS (check-and-set) only; no scripting |
| Threading model | Single-threaded command execution (I/O can be multi-threaded in Redis 6+) | True multi-threaded, scales across cores for pure key-value workloads |
| When to pick | Rich data structures needed; TTL per key; Lua; pub/sub; persistence required | Pure cache, strings only, highest possible throughput per core, no persistence needed |
In practice, Redis wins the majority of greenfield decisions because it is more flexible. The overhead difference between Redis and Memcached is rarely the bottleneck: network latency and serialization cost typically dwarf the few-microsecond difference in the servers themselves. Pick Memcached only when you've benchmarked a specific bottleneck that Memcached solves and Redis doesn't.
Redis Cluster: Horizontal Sharding via Hash Slots
A single Redis node is usually kept to roughly 25–50 GB of cache data (beyond that, the pauses caused by fork() for RDB snapshots start to hurt). Redis Cluster scales horizontally by splitting the keyspace into 16,384 hash slots. Every key is hashed to a slot (using CRC16 mod 16384), and every primary node owns a subset of the slots. The client library calculates which slot a key belongs to before sending the request and connects directly to the right node. This is not classic consistent hashing (there is no hash ring), but it delivers the same benefit: adding a new node means moving only some of the slots, not rehashing the entire keyspace.
Redis Cluster splits the 16,384 hash slots across primary nodes. The client library computes the slot locally and connects directly to the right node, with no central router hop. Each primary should have at least one replica for failover. When you add a new node, only a proportional subset of slots moves; everything else stays in place.
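The slot calculation is small enough to write out. The sketch below implements the CRC16 variant Redis Cluster uses (the XMODEM flavour: polynomial 0x1021, initial value 0) and takes the result mod 16384. It also applies the hash-tag rule from the Redis Cluster specification, which the text above does not cover: if a key contains a non-empty `{...}` section, only that section is hashed, so related keys can be forced onto one slot. The function names are illustrative.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 (polynomial 0x1021, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot (0..16383) a key maps to in Redis Cluster."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:           # non-empty tag: hash only its contents
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


print(key_slot("foo"))                                   # 12182
print(key_slot("{user42}:cart") == key_slot("{user42}:profile"))
```

Real client libraries use a table-driven CRC for speed, but the mapping is the same, which is why any client can route a key without asking the cluster first.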
Reducing Round-Trips: Pipelining and Lua Scripts
The biggest cost of a distributed cache is the network: 0.5–2 ms per round-trip. If your request handler makes 20 independent Redis calls, that's 10–40 ms of pure network overhead before the first byte of your business logic runs. Two Redis features cut this dramatically:
Pipelining: send multiple commands in one network write without waiting for each reply. The server executes them in order and the replies come back together, so N commands cost roughly one round-trip instead of N. Pipelining is separate from MULTI/EXEC, which makes a group of commands atomic; the two are often combined.
Lua scripts: run arbitrary logic on the Redis server, eliminating the network round-trip for the "read → decide → write" pattern. For example: "if the rate-limit counter for this IP is under 100, increment and return 'allowed'; otherwise return 'blocked'". In Lua, this is one atomic server-side operation. In application code, it would be a GET followed by a conditional SET: two round-trips with a potential race between them.
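The round-trip arithmetic is easy to see with a toy model. `FakeServer` below is an in-memory stand-in that counts how many network exchanges a client performs; it is not a Redis client, and the 0.5 ms figure is simply the low end of the range quoted above.

```python
class FakeServer:
    """In-memory stand-in for a cache server that counts round-trips."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def send(self, commands):
        """One round-trip: run every command in order, return all replies."""
        self.round_trips += 1
        replies = []
        for op, key, *args in commands:
            if op == "SET":
                self.data[key] = args[0]
                replies.append("OK")
            elif op == "GET":
                replies.append(self.data.get(key))
            else:
                raise ValueError(f"unsupported command: {op}")
        return replies


def estimated_network_ms(round_trips, rtt_ms):
    return round_trips * rtt_ms


commands = [("SET", f"k{i}", i) for i in range(20)]

one_by_one = FakeServer()
for command in commands:
    one_by_one.send([command])       # 20 separate round-trips

pipelined = FakeServer()
pipelined.send(commands)             # 1 round-trip for the whole batch

print(estimated_network_ms(one_by_one.round_trips, 0.5))   # 10.0
print(estimated_network_ms(pipelined.round_trips, 0.5))    # 0.5
```

Both servers end up with identical data; only the number of network exchanges differs.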
Failure Modes: Hot-Key Meltdown and Network Partitions
Hot-key meltdown: a single key such as product:homepage-banner might receive 100,000 requests per second, but it always hashes to the same slot, meaning the same single primary node handles every one. That node becomes a bottleneck while other nodes are idle. Fixes: local in-process caching for hyper-hot keys (the L-shape strategy), key sharding (write copies under several suffixed keys and read one at random), or spreading reads across Redis replicas (enabled per connection with the READONLY command in cluster mode).
Redlock, or switching to a CP system like ZooKeeper).
Layer 5 Deep Dive: Database Buffer Pool (Postgres shared_buffers, InnoDB)
Every database that reads data from disk would be unbearably slow if it had to do an actual disk read for every row you touch. The database's solution is the same as the OS page cache (keep recently-accessed data in RAM) but smarter: instead of caching generic file bytes, the database caches structured pages (8 KB blocks for Postgres, 16 KB for InnoDB) that it understands as B-tree nodes, heap tuples, or index entries. Because the database knows what these pages contain, it can make better eviction decisions than a generic kernel cache.
Postgres: shared_buffers and the Double-Buffering Dance
Postgres manages its own in-memory cache of database pages, called the shared buffer pool, configured by shared_buffers. The conventional starting point is 25% of total RAM, so on a 64 GB DB server, set shared_buffers = 16GB. Why only 25%? Because Postgres reads its data files through the OS page cache (it does not use O_DIRECT by default), and the OS page cache can beneficially cache the same pages again in the remaining 75% of RAM. This double-caching is technically redundant but provides a practical buffer: if shared_buffers is cold after a restart, the OS page cache (which the kernel keeps across Postgres restarts, though not across reboots) may already have hot pages. Postgres will re-populate shared_buffers from the OS cache at ~100 ns per page, not from disk at 0.1–8 ms per page.
Postgres Eviction: Clock-Sweep
When shared_buffers is full and Postgres needs to load a new page, it has to throw something out to make room. How does it decide? Imagine the buffer pool as a circular table of slots, and a single hand pointing at one slot, like a clock. Each slot has a small usage counter that goes up every time something touches that page (capped at 5). The hand walks around the ring, and at each stop it decrements the counter by one. When the hand arrives at a slot whose counter is already zero, that slot is the loser and gets evicted. This sweeping clock-hand approach is called the clock-sweep algorithm, and the reason Postgres uses it is mostly about avoiding lock contention: a strict "least recently used" list would require every cache hit to update a global linked list, which becomes a hotspot under high concurrency. Clock-sweep approximates LRU using only local counter increments, no shared list, no global lock.
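A single-threaded sketch of the idea follows. It is a simplified model, not Postgres source: it ignores pinned and dirty buffers, the class and method names are hypothetical, and it assumes a newly loaded page starts with a usage count of 1 and that the count is capped at 5.

```python
MAX_USAGE = 5  # cap on the per-slot usage counter


class ClockSweepPool:
    """Toy buffer pool with clock-sweep eviction."""

    def __init__(self, size):
        self.pages = [None] * size   # page id held by each slot
        self.usage = [0] * size      # usage counter per slot
        self.hand = 0                # position of the clock hand
        self.index = {}              # page id -> slot

    def access(self, page):
        """Touch a page; return True on a hit, False on a miss."""
        slot = self.index.get(page)
        if slot is not None:
            self.usage[slot] = min(self.usage[slot] + 1, MAX_USAGE)
            return True
        slot = self._find_victim()
        if self.pages[slot] is not None:
            del self.index[self.pages[slot]]   # evict the old occupant
        self.pages[slot] = page
        self.usage[slot] = 1
        self.index[page] = slot
        return False

    def _find_victim(self):
        # Sweep until a free slot or a slot with usage 0 is found,
        # decrementing every non-zero counter the hand passes.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.pages[slot] is None or self.usage[slot] == 0:
                return slot
            self.usage[slot] -= 1


pool = ClockSweepPool(size=3)
for page in ["A", "B", "C", "A", "A", "A"]:
    pool.access(page)
pool.access("D")                     # forces an eviction
print(sorted(pool.index))            # ['A', 'C', 'D']
```

Page A survives because its counter was high enough to outlast one full sweep, while B, touched only once, reaches zero first. Note that a hit only increments a counter on the slot it touched, which is the contention advantage described above.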
InnoDB: Buffer Pool + O_DIRECT
MySQL's InnoDB storage engine takes a more aggressive approach to memory management: allocate a large fraction of RAM to the InnoDB buffer pool and use O_DIRECT to bypass the OS page cache entirely. The motivation is clear: if you're giving InnoDB 50 GB of buffer pool on a 64 GB server, there's no point having the OS also cache those same pages, because you'd have no RAM left for anything else.
InnoDB Midpoint Insertion: Protecting Hot Pages From Scans
InnoDB has a clever twist that directly addresses the scan-pollution problem we saw in the OS page cache. The buffer pool is still arranged as a list with "hot" at the head and "evict next" at the tail, but instead of dropping new pages straight at the head, where they'd push the truly hot pages closer to eviction, InnoDB drops them partway down the list. That partway-down spot is called the midpoint (by default, about 3/8 of the way up from the tail). New pages enter here. A page only graduates to the front of the list (the "young" or hot sublist) if it gets touched again later, and by default only if that access comes after a short grace period (innodb_old_blocks_time, 1 second by default). Pages that are only touched once after insertion drift through the back half (the "old" sublist) and get evicted from the tail without ever disturbing the truly hot data.
The key insight: a full-table scan floods new pages in at the midpoint. If they are only read once (typical for a scan), they never get promoted to the young list and are quickly evicted without displacing the truly hot pages that live at the head. The hot OLTP pages survive the scan untouched.
InnoDB's scan-resistant LRU. New pages enter at the midpoint (boundary between young and old sublists). If a page is accessed again soon, it gets promoted to the head of the young list and survives for a long time. Scan pages that are never re-accessed slowly drift toward the tail of the old sublist and are evicted first, never displacing truly hot pages.
Hit Ratio: The #1 Database Performance Metric
The buffer pool hit ratio tells you what percentage of page reads are served from RAM vs disk. This single number summarizes your database's cache efficiency better than any other metric. A rough guide:
- ≥ 99.5% (InnoDB ≥ 997/1000): excellent. Almost all reads come from RAM. Very few disk reads.
- 98–99.5%: good for a working set that's 90%+ hot. Occasional cold reads.
- 90–98%: 2–10% of reads are disk reads. Investigate: working set larger than buffer pool? Scan pollution? Need to increase buffer pool or add caching above the DB.
- < 90%: serious I/O pressure. Queries will be slow. The buffer pool is significantly undersized for the working set, or a batch scan is polluting it.
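The ratio itself is simple arithmetic over two counters. The sketch below assumes you have already fetched them: `blks_hit` and `blks_read` from Postgres's pg_stat_database view, or the `Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads` status variables from MySQL. The function names and the `classify` helper are illustrative; the thresholds are the ones from the list above.

```python
def hit_ratio(hits, misses):
    """Fraction of page reads served from the buffer pool, or None if idle."""
    total = hits + misses
    return hits / total if total else None


def postgres_hit_ratio(blks_hit, blks_read):
    # blks_read counts pages not found in shared_buffers. Some of those
    # are still served by the OS page cache rather than by the disk.
    return hit_ratio(blks_hit, blks_read)


def innodb_hit_ratio(read_requests, reads):
    # read_requests counts all logical reads; reads counts the subset
    # that could not be satisfied from the buffer pool.
    return hit_ratio(read_requests - reads, reads)


def classify(ratio):
    if ratio >= 0.995:
        return "excellent"
    if ratio >= 0.98:
        return "good"
    if ratio >= 0.90:
        return "investigate"
    return "serious I/O pressure"


print(classify(postgres_hit_ratio(blks_hit=9_950, blks_read=50)))
print(classify(innodb_hit_ratio(read_requests=1_000, reads=3)))
```

Because these counters are cumulative since the last stats reset, sampling them twice and computing the ratio over the difference gives a better picture of current behaviour than the lifetime totals.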
MongoDB's WiredTiger engine: the cache size is set by storage.wiredTiger.engineConfig.cacheSizeGB in mongod.conf (or the --wiredTigerCacheSizeGB flag). The default is the larger of 50% of (RAM − 1 GB) and 256 MB. Unlike InnoDB with O_DIRECT, WiredTiger does not bypass the OS page cache: MongoDB reads its data files through the filesystem cache as well, so leave RAM for it rather than handing everything to WiredTiger. Check the hit ratio by comparing db.serverStatus().wiredTiger.cache["pages read into cache"] with "pages requested from the cache".
Postgres gives about 25% of RAM to shared_buffers and layers the OS page cache underneath (double-buffering). InnoDB allocates 60–80% to innodb_buffer_pool_size and bypasses the OS page cache via O_DIRECT. Both use scan-resistant eviction to protect hot OLTP data from analytical scan pollution. Buffer pool hit ratio ≥ 99.5% is the target; below 90% means I/O is the bottleneck. Tools: pg_buffercache for Postgres, SHOW ENGINE INNODB STATUS for MySQL.
Layer 6 Deep Dive: CDN Edge Caches (Geographic, Shared, Tiered)
Every layer we've covered so far lives inside your data center: on the CPU die, in the kernel, in the JVM, in a Redis cluster, or in the database server. The CDN edge cache is different: it lives outside your data center entirely, in hundreds of small server rooms scattered across the world's cities, each one close to the users in that region. CDN providers call each of these little server rooms a Point of Presence, or PoP for short, which is just a name for "a CDN data center near users." Its job is to serve your content before a user's request ever reaches your origin servers. For a user in Singapore, that means the response travels from a CDN node in Singapore (~5 ms round-trip) instead of from your US-east origin (~200 ms). The CDN is the one cache that reduces latency not by making your server faster, but by moving the response physically closer to the person asking for it.
What a CDN Can (and Cannot) Cache
The conventional wisdom is "CDNs cache static files." The more accurate picture in 2024 is "CDNs cache anything with a cache key and a reasonable TTL." The canonical categories, from easiest to hardest:
- Static assets: JavaScript bundles, CSS, images, fonts, video segments. Serve them with Cache-Control: immutable, max-age=31536000 (one year) and content-hash URLs. Hit ratios of 99%+ are achievable. These are so static that the only invalidation trigger is deploying a new build with a new content hash.
- Shared HTML fragments: product pages, blog posts, marketing pages. Cache with a moderate TTL (60 s–5 min) and use surrogate keys (Cloudflare's Cache-Tag header, Fastly's Surrogate-Key header) to invalidate all related pages instantly when content changes. Hit ratios of 80–95%.
- API responses: increasingly common. Product listings, search results, and pricing queries can all be cached at the CDN edge when they are not user-personalized. Use Cache-Control: public, s-maxage=30 (cache at the CDN for 30 seconds, then revalidate). Hit ratios of 50–80% for popular queries.
- Personalized content: responses that contain user-specific data (a user's cart, their name in the header) must not be cached at shared CDN edge nodes. The fix is to split pages into a cacheable shell plus a dynamic user-specific fragment that is either stitched in at the edge (Edge Side Includes) or loaded client-side with a separate API call after initial render.
Cache Key: What Makes a CDN Response Unique
A CDN identifies a cacheable response by its cache key: by default, the URL including query string. Two requests that map to the same cache key are served the same cached response. This is why cache keys need careful design.
Watch the Vary header in particular. With Vary: Accept-Encoding, the CDN correctly caches separate versions for gzip and brotli, which is fine. But Vary: Cookie or Vary: Authorization effectively disables CDN caching for every logged-in user (every unique cookie = unique cache key = no sharing). Audit your Vary headers. The correct pattern for authenticated content is: never mark the main response cacheable; instead, cache the public shell and load personalization via a separate, non-CDN-cached API request.
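To see how key design changes sharing, here is a small sketch of a cache-key function. It is a hypothetical normalizer, not the algorithm of any particular CDN: it lowercases the host, sorts query parameters, drops an illustrative set of tracking parameters, and appends the value of each header named in `vary`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Illustrative list of parameters that do not change the response.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def cache_key(url, headers=None, vary=()):
    """Build a normalized cache key from a URL plus any Vary'd headers."""
    parts = urlsplit(url)
    query = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS
    )
    key = f"{parts.netloc.lower()}{parts.path}"
    if query:
        key += "?" + urlencode(query)
    lowered = {name.lower(): value for name, value in (headers or {}).items()}
    for name in vary:
        key += f"|{name.lower()}={lowered.get(name.lower(), '')}"
    return key


# Same content, different parameter order and a tracking tag: one key.
print(cache_key("https://Shop.example/p?b=2&a=1&utm_source=mail"))
print(cache_key("https://shop.example/p?a=1&b=2"))

# Vary: Cookie gives every session its own key, so nothing is shared.
print(cache_key("https://shop.example/p", {"Cookie": "sid=1"}, vary=("Cookie",)))
print(cache_key("https://shop.example/p", {"Cookie": "sid=2"}, vary=("Cookie",)))
```

The first pair collapses to a single cached object; the second pair never will, which is the Vary: Cookie problem in miniature.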
Tiered Caching: Shield Nodes Cut Origin Load
In a basic CDN setup, every edge PoP has its own cache. A cache miss at the Singapore PoP goes directly to your origin. A cache miss at the Tokyo PoP also goes directly to your origin. A cache miss at the Sydney PoP does the same. If you have 300 PoPs and a CDN miss rate of 20%, each of those 300 PoPs independently sends its misses to your origin. For any given object, origin fetches can be amplified by up to 300× relative to what a single shared cache would generate.
The solution is CDN shielding (also called "origin shield" or "tiered caching"): add a mid-tier cache layer between the edge PoPs and your origin. The Singapore, Tokyo, and Sydney edge nodes all forward misses to a single regional shield node (say, Tokyo shield), which itself caches the response. Your origin only sees one request even if three edge PoPs missed simultaneously for the same URL. The shield absorbs the long-tail miss traffic; your origin sees only the cache misses the shield itself couldn't serve.
The full six-layer request path. A CDN edge hit costs 5–30 ms total and is the default happy path for static and shared content. A full cold miss through all layers to disk costs 100–600 ms. Crucially, the CDN and the origin-side caches (Redis, buffer pool) are complementary: the CDN serves shared public content; Redis serves user-specific computed state; the buffer pool absorbs DB page reads. No single layer replaces the others.
CDN in the Six-Layer Context: What It Does That No Other Layer Can
The key insight that ties this section to the full hierarchy: the CDN is the only layer that operates on the user's side of the network. Every other cache layer (in-process, Redis, buffer pool) is co-located with your application. They reduce work on your servers, but they do nothing about the physics of data traveling across the Atlantic or Pacific. The CDN collapses the geographic distance problem. It doesn't make your app smarter; it makes the content physically closer.
This also defines exactly what the CDN is bad at: anything that requires real-time user-specific computation. A user's shopping cart, their personalized recommendations, their account balance: these need data from your database that only your application can assemble. The CDN can serve the shell of the page in 10 ms; Redis and the buffer pool assemble the personalized data in another 2–3 ms; together they give you a total page experience under 20 ms without the CDN ever touching the private data.
Invalidation: The Hard Problem at the CDN Layer
The most dangerous CDN failure mode isn't a miss; it's serving a stale hit. If a product price changes and the CDN is still serving the old price from a 5-minute cached response, customers see incorrect prices. CDNs solve this with two mechanisms: TTL expiry (wait for the cache to age out naturally) and instant purge. Purge needs a clever trick, because a single product change might affect dozens of cached pages: the product page itself, the category page, the homepage, the search results. Listing every URL to purge by hand is tedious and error-prone. Instead, CDNs let you attach labels to each cached response (a product page can be labelled "product-42, category-electronics, homepage-featured") and then you purge by label, not by URL. These labels are called surrogate keys (sent in the Cache-Tag header on Cloudflare, the Surrogate-Key header on Fastly). One API call ("purge everything tagged product-42") clears every cached response touching that product, across every PoP, typically within a few hundred milliseconds.
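Purge-by-label is essentially a reverse index from tag to URLs. The sketch below models one edge cache in memory to show the bookkeeping; `TaggedCache` and its methods are hypothetical and do not correspond to any CDN's API.

```python
from collections import defaultdict


class TaggedCache:
    """Cache that can purge every response carrying a given tag."""

    def __init__(self):
        self.responses = {}                   # url -> cached body
        self.tags_by_url = {}                 # url -> set of tags
        self.urls_by_tag = defaultdict(set)   # tag -> set of urls

    def store(self, url, body, tags):
        self._drop(url)                       # forget any older tagging first
        self.responses[url] = body
        self.tags_by_url[url] = set(tags)
        for tag in tags:
            self.urls_by_tag[tag].add(url)

    def _drop(self, url):
        self.responses.pop(url, None)
        for tag in self.tags_by_url.pop(url, set()):
            self.urls_by_tag[tag].discard(url)

    def purge_tag(self, tag):
        """Remove every response labelled with tag; return how many."""
        urls = list(self.urls_by_tag.get(tag, ()))
        for url in urls:
            self._drop(url)
        self.urls_by_tag.pop(tag, None)
        return len(urls)


edge = TaggedCache()
edge.store("/p/42", "<product 42>", ["product-42", "category-electronics"])
edge.store("/c/electronics", "<listing>", ["category-electronics", "product-42", "product-7"])
edge.store("/p/7", "<product 7>", ["product-7"])

print(edge.purge_tag("product-42"))   # 2
print(sorted(edge.responses))         # ['/p/7']
```

The application only has to know the tag of the thing that changed, not every page that happens to render it.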
Cache-Control: max-age=60, stale-while-revalidate=30 instructs the CDN to treat the response as fresh for 60 seconds and, for up to 30 seconds after that (so at 60–90 seconds old), to keep serving the cached copy while asynchronously refetching the latest version in the background. The user always gets an instant response; the cache is refreshed in the background. This is the pattern that lets CDNs achieve both low latency and reasonable freshness simultaneously, at the cost of accepting a brief window of stale data. Most modern CDNs (Cloudflare, Fastly, CloudFront) support it.
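The three possible outcomes for a cached response under this header can be written as one small function. This is a simplified sketch of the decision only (it ignores other directives such as must-revalidate), and the function name and return labels are illustrative.

```python
def swr_decision(age_s, max_age_s, swr_s):
    """What a cache does with a stored response of the given age."""
    if age_s < max_age_s:
        return "serve_fresh"
    if age_s < max_age_s + swr_s:
        # Serve the stale copy now and refresh it in the background.
        return "serve_stale_and_revalidate"
    return "fetch_from_origin"


for age in (10, 75, 95):
    print(age, swr_decision(age, max_age_s=60, swr_s=30))
```

Only the last case makes the user wait for the origin, and on a busy URL it is rare, because some request during the 30-second window will already have triggered the background refresh.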
Choosing the Right Layer for Each Piece of Data
The six cache layers exist simultaneously in every production system, but that doesn't mean every piece of data belongs in every layer. Putting data in the wrong layer wastes memory, introduces stale-data bugs, or misses the latency benefit you were hoping for. The decision isn't hard once you understand the three questions that drive it: Who shares this data? How often does it change? How large is it?
The Three Axes That Drive Layer Selection
Before picking a layer, ask these three questions about the data you want to cache:
- Scope: who needs it? Data that every server in every region needs (public homepage HTML) is a CDN candidate. Data that only the current server process needs (a parsed config file) belongs in an in-process cache. Data that all servers in one region share (a logged-in user's session) belongs in a distributed cache like Redis.
- Mutability: how often does it change? Static assets that change only on deploy → CDN with long TTLs. Price data updated every few minutes → distributed cache with short TTL. A user's shopping cart, mutated on every click → distributed cache with write-through on every change. A CPU-computed result like a cryptographic hash → in-process cache with a bounded capacity.
- Size: how much memory does it occupy? A parsed JWT is a few hundred bytes and belongs in-process. Gigabytes of product catalog belong in Redis, not in every app-server heap. Video files belong in the CDN, not in Redis.
The decision tree above walks you through the dominant question at each branch. A public homepage with the same HTML for all users? CDN, because geographic proximity gives you the biggest latency win. A user's shopping cart? Redis, because multiple app servers need to read it, so you can't keep it in one server's heap. A single server's parsed YAML config? In-process map, because every network hop is wasted cost for data only that process needs.
The matrix above is worth internalizing. Notice the pattern: anything user-specific and mutable goes to Redis, because you need it shared across servers but it must not be cached at CDN (which would serve your cart to someone else). Anything DB-internal goes to the buffer pool. That layer is managed automatically by Postgres or MySQL, and you only tune it by configuring memory allocation. Financial records don't live in any cache as their primary store: durability requirements mean the disk is the record of truth, and any cache layer holding a financial value is secondary at best.
Common Placement Mistakes and Why They Happen
The most frequent mistake teams make is reaching for Redis reflexively, for everything. Config values that never change and are only read by one service end up in Redis even though an in-process map would be over 1,000× faster and involve no network hop. The motivation is understandable: Redis is already in the stack, so it feels "easy." But every Redis GET is a network round-trip (~0.5 ms), and if you're calling it 50 times per request for config lookups, you've added 25 ms of latency for data that could live in a local map refreshed every 60 seconds.
The second common mistake is caching per-user data at the CDN level. CDN caching is fundamentally designed for shared content: the same response, delivered to many users. The moment you cache a response that varies by user identity, you risk the CDN serving one user's data to another. Some CDN providers support Vary headers or user-aware cache segments, but these are complex to configure correctly. The safer default: anything with user-specific data should bypass CDN caching entirely and go to Redis or the app layer.
Cross-Layer Invalidation: The Hardest Problem in Multi-Level Caching
Here is the uncomfortable truth about multi-level caching: adding a cache layer doesn't just make reads faster, it creates a new copy of your data. And the moment you have multiple copies, you have a consistency problem: when the real data changes, who tells the copies? This is why invalidation is often called the hardest problem in caching. Phil Karlton's famous quip, "There are only two hard things in computer science: cache invalidation and naming things," is funny because it's deeply true. Let's understand why, and what to do about it.
The Layers That Take Care of Themselves
Good news first: the lowest two layers, CPU cache and OS page cache, handle their own consistency automatically, and you never need to think about them.
CPU caches use a hardware protocol called MESI (Modified, Exclusive, Shared, Invalid). When one CPU core writes to a memory address, the MESI protocol automatically broadcasts an invalidation signal to other cores that hold a cached copy of that address. This happens at the hardware level, in nanoseconds, without any software involvement. You never write application code to manage CPU cache coherence.
The OS page cache uses write-back semantics managed by the kernel. When your database writes to a file, the kernel updates the page cache immediately, so any subsequent read on the same machine sees fresh data. Across machines (NFS), page cache coherence is more complex, but for a typical single-node database setup, you don't manage this yourself.
The Layers Where You're Responsible
The layers that your application code manages (in-process cache, distributed cache such as Redis or Memcached, and CDN) do not have any automatic invalidation mechanism. When the source data changes, nothing tells them automatically. You have to build that mechanism yourself. This is where bugs live.
Three strategies exist for keeping application-managed cache layers consistent with the source of truth:
- Write-through: when the app writes to the database, it also writes the new value to the cache at the same time, synchronously. The cache is always current. Cost: every write hits both DB and cache, so write latency is the sum of the two operations.
- Write-behind (write-back): the app writes only to the cache immediately, and a background job flushes to the database later. Lowest write latency; risk of data loss if the cache crashes before the flush. More on this in Section 15.
- Explicit invalidation (delete on write): the app writes to the database, then deletes (or marks stale) the cache entry. The next read is a cache miss and repopulates from the DB. Simple and safe, but causes a brief burst of misses after writes. This is the most common pattern in practice.
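To make the difference between write-through and delete-on-write concrete, here is a minimal Python sketch. Plain dictionaries stand in for the database and the cache, and the function names are invented for this example; it illustrates the ordering of operations only, not a production client.

```python
from typing import Dict, Optional

db: Dict[str, str] = {}      # stand-in for the database (source of truth)
cache: Dict[str, str] = {}   # stand-in for Redis or an in-process cache


def read(key: str) -> Optional[str]:
    """Cache-first read that repopulates the cache on a miss."""
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value


def write_through(key: str, value: str) -> None:
    """Update the database and the cache before returning."""
    db[key] = value
    cache[key] = value


def write_with_invalidation(key: str, value: str) -> None:
    """Update the database, then delete the cache entry so the next read reloads it."""
    db[key] = value
    cache.pop(key, None)
```

After write_with_invalidation the key is absent from the cache, so the next read pays one database lookup and repopulates it; after write_through the next read is a hit.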
The diagram above shows what a correctly implemented cascading invalidation looks like: a single write event fans out to every layer that might hold a copy. The red "forgotten layer" on the right is the classic production bug. One layer is missed, usually the in-process cache on individual app servers, and it serves the old price for as long as its TTL lasts. Users on servers that got the invalidation see $25; users on servers that didn't see $29. Both prices are "real" at different layers of the stack, which is worse than just being consistently wrong.
Three Invalidation Patterns in Practice
1. Pub/Sub Fan-Out (Redis PUBLISH)
The most common way to keep every server's in-process cache fresh in a multi-server fleet is to use Redis as a tiny one-way notification service: one server shouts "the price for product 42 changed!" and every other server hears it and clears the affected key. This shout-and-listen pattern is called publish/subscribe, or pub/sub for short, and "fanning out" just means the one shout reaches many listeners. When a server writes a new value, it publishes an invalidation message to a shared channel. All servers, including the writer itself, subscribe to that channel and evict the affected key from their local in-process cache when they receive the message.
The advantage: all servers learn about the invalidation in milliseconds. The limitation: Redis pub/sub is fire-and-forget. If a server is temporarily disconnected, it misses the message and its in-process cache stays stale until the TTL expires. For high-stakes data (prices, inventory), you pair pub/sub with a short TTL as a fallback safety net.
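The fan-out and its fire-and-forget weakness can be simulated without a Redis server. In the Python sketch below, InvalidationBus is an in-memory stand-in for a pub/sub channel and AppServer is a hypothetical app instance; neither is a real Redis client API.

```python
from typing import Callable, Dict, List


class InvalidationBus:
    """In-memory stand-in for a pub/sub channel (hypothetical, not a Redis client)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, key: str) -> None:
        for handler in self._subscribers:    # fan-out: every subscriber gets the message
            handler(key)


class AppServer:
    def __init__(self, name: str, bus: InvalidationBus, connected: bool = True) -> None:
        self.name = name
        self.local_cache: Dict[str, str] = {}
        if connected:                        # a disconnected server never hears the message
            bus.subscribe(self._on_invalidate)

    def _on_invalidate(self, key: str) -> None:
        self.local_cache.pop(key, None)      # evict the affected key only
```

A server created with connected=False keeps its stale entry after publish, which is exactly the gap the short fallback TTL is meant to close.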
2. CDC-Driven Cascading Invalidation
The pub/sub approach above only catches writes that go through your application. A migration script, an admin running a manual SQL update, or a different microservice writing to the same DB will all bypass the invalidation. A safer trick: have something watch the database itself and emit a notification for every row that changes, no matter who changed it. Databases already keep an internal log of every change for replication purposes (Postgres calls it the WAL, MySQL calls it the binlog); a tool can tap into that log and turn each row change into a message. This watch-the-log pattern is called Change Data Capture, or CDC. Tools like Debezium implement it for most databases. A separate invalidation service consumes the CDC stream and clears the appropriate keys in Redis and CDN. The benefit is that all writes are caught. The cost is operational complexity: you're adding Kafka or a similar message broker to the architecture.
3. Versioned Key Fingerprinting
Instead of invalidating, you change the key. When a product's data changes, you bump a version number embedded in the cache key: product:42:v7 becomes product:42:v8. Old entries silently expire because no one asks for :v7 anymore. New reads always ask for :v8, which is a miss on first access but is then populated correctly. This sidesteps the invalidation problem entirely: you never delete, you rotate. The trade-off: old versioned keys accumulate in the cache until their TTL expires, wasting memory.
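A minimal Python sketch of key rotation follows. The versions dictionary and both helper names are assumptions for illustration; in practice the current version has to live somewhere every reader can see it, for example alongside the source row.

```python
from typing import Dict

versions: Dict[str, int] = {}   # current version per entity (assumed to be readable by all servers)
cache: Dict[str, str] = {}      # stand-in for the shared cache


def cache_key(entity: str) -> str:
    """Build the versioned key, e.g. product:42:v7."""
    return f"{entity}:v{versions.get(entity, 1)}"


def bump_version(entity: str) -> None:
    """Called on write: rotate the key instead of deleting the old entry."""
    versions[entity] = versions.get(entity, 1) + 1
```

After bump_version, readers compute a key that has never been populated, so the first read misses and reloads, while the old entry simply sits unused until its TTL removes it.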
This stale-instance diagram describes a real class of production incident. A team deploys a new price. Redis is correctly invalidated. They even configured pub/sub to broadcast. But Servers 2, 3, and 4 were temporarily under load during the pub/sub broadcast and their subscriber threads were backlogged. For the next 45 minutes, until the in-process Caffeine cache TTL expired, 75% of checkout requests showed the old price. The fix is not complex: combine pub/sub with short in-process TTLs (60 seconds as a ceiling), and during high-stakes deploys, restart app servers in rolling order to flush all in-process caches cleanly.
Write Strategies Across Layers: Write-Through, Write-Around, Write-Back
We've talked a lot about what happens when a read hits or misses a cache. But caching strategies are equally important for writes. When a user updates their cart, changes a password, or completes a purchase, where does that data go first, and in what order? The answer determines your write latency, your consistency guarantees, and how much data you can lose if a server crashes. There are three main strategies, and knowing when to reach for each one is a key engineering judgment.
Let's walk through each strategy in plain terms, then look at how they compose in a real multi-level stack.
Write-Through: Consistency First
Write-through means: when you write data, you write it to the cache and the database in the same synchronous operation, before returning success to the client. The cache is always up to date because it gets every write immediately. The cost is that every write pays the latency of two storage operations: the cache write plus the database write. For a typical setup, that's ~0.5 ms to Redis plus ~5 ms to Postgres, or about 5.5 ms when the two writes run back to back: roughly 10% more than writing to Postgres alone.
This is the right strategy for financial transactions, inventory counts, and anything where stale reads are unacceptable. The slightly higher write latency is a reasonable price to pay for the guarantee that every read from the cache reflects the latest confirmed write.
Write-Around: Protecting the Cache from Junk
Write-around means: writes go directly to the database and skip the cache entirely. The cache is only populated on cache misses during reads. This pattern makes sense when you're writing data that will rarely or never be read from the cache, for example a bulk import of historical records, log ingestion, or append-only analytics events. Caching these writes would evict useful data from the cache for no benefit, since the written data won't be read back through the cache path.
The downside: the first read after a write is always a cache miss, which adds the database read latency. For frequently-read data, write-around is the wrong choice: you get the worst of both worlds (write goes to DB, read still misses the cache).
Write-Back: Performance at the Cost of Durability
Write-back is the most aggressive strategy: writes land in the cache immediately and return success to the client, then a background flush thread pushes dirty cache entries to the database asynchronously. This gives you the lowest write latency: a cache write takes microseconds in-process or roughly 0.5 ms over the network to Redis, versus 5-15 ms for a database write. The risk: if the cache server crashes before the flush, you lose the writes that hadn't been persisted. For counters (page views, "likes"), losing a few counts is acceptable. For financial records, it is not.
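The dirty-entry bookkeeping behind write-back can be sketched in a few lines of Python. Dictionaries stand in for the cache and the database, and the names increment and flush are invented for this example; a real system would run flush on a timer or background thread.

```python
from typing import Dict, Set

db_counts: Dict[str, int] = {}      # stand-in for the database
cache_counts: Dict[str, int] = {}   # stand-in for the cache tier
dirty: Set[str] = set()             # keys written to the cache but not yet persisted


def increment(key: str) -> None:
    """Write-back: update only the cache and mark the key dirty."""
    current = cache_counts.get(key, db_counts.get(key, 0))
    cache_counts[key] = current + 1
    dirty.add(key)


def flush() -> int:
    """Background step: persist dirty entries; returns how many keys were written."""
    flushed = 0
    for key in list(dirty):
        db_counts[key] = cache_counts[key]
        dirty.discard(key)
        flushed += 1
    return flushed
```

Anything still in dirty when the cache process dies is the data you lose, which is why this shape suits counters and not ledgers.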
How These Strategies Compose in a Multi-Level Stack
In practice, different layers use different write strategies simultaneously. Here's how a typical e-commerce system stacks them:
The key insight from this table: different data types use different write strategies simultaneously within the same application. User profiles use write-through to Redis because consistency matters: you must not serve a user's stale name to their friends seconds after an update. View counters use write-back because losing a few counts is fine, and the throughput requirement is enormous. Static assets use write-around entirely (you never put a JS bundle in Redis or the app heap), and at CDN you effectively sidestep invalidation by using a new URL with a content hash on every deploy.
Production Patterns: Cache-of-Caches, Multi-Tier, Look-Aside, Read-Through
Once you understand the six layers individually, the next step is seeing how real engineering teams wire them together. At scale, you don't pick one layer; you compose several into named architectural patterns. The four patterns below represent the most-used approaches in production today, from the simple default that most teams reach for, to the sophisticated multi-region hierarchies that companies like Facebook and Netflix built to serve billions of requests.
(a) Cache-Aside (Look-Aside): The Default Pattern
The most common pattern by a large margin. The application manages the cache explicitly: on a read, check the cache first; if it misses, fetch from the database and populate the cache; on a write, write to the database and then invalidate (or update) the cache entry. The application code is in full control: there is no caching middleware silently doing things for you.
Why it's the default: it's simple to understand, easy to debug (you can trace exactly which line populated a cache entry), and compatible with any database or cache store. Most Redis + Postgres architectures use this pattern. The downside is that the responsibility for cache correctness sits entirely in application code. If a developer forgets to invalidate after a write, the bug is silent and difficult to find in production.
(b) Read-Through: Transparent Cache Population
In read-through, the cache itself is responsible for fetching from the database on a miss; the application only ever talks to the cache. A cache library like Caffeine (Java) or a Redis module with a loader function handles the miss transparently. The application says "get me product:42" and the cache returns it, fetching from the DB if necessary, without the application seeing the DB at all.
This centralizes the cache-loading logic in one place and ensures the "populate on miss" step is never forgotten. The trade-off: it can be harder to implement correctly for complex queries, and the application loses fine-grained control over when and what gets cached.
(c) Cache-of-Caches / Hierarchical: Two-Tier Architecture
When a single shared cache cluster becomes the bottleneck, teams add a local in-process cache in front of the distributed cache. Reads check local memory first; only misses reach the shared Redis or Memcached cluster. This pattern cuts Redis round-trips dramatically for hot keys that every app server needs repeatedly.
Twitter, Facebook, and many large-scale systems use this pattern. The naming varies ("L1/L2 cache," "local cache + regional cache," "near cache") but the structure is the same: a fast, small, in-process layer absorbs the hottest traffic, and a larger, shared, network-accessed layer handles the rest.
The code above shows the hierarchy in action. Caffeine acts as the L1 gatekeeper: if the key is in local memory, the call returns in microseconds without a network hop. If L1 misses, Caffeine's CacheLoader function fires, checks Redis (L2), and only if that misses too does it reach the database. From the caller's perspective, it's a single get call. The layered fallback is invisible.
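The same L1, then L2, then database fallback can be sketched in Python. This is an illustration under stated assumptions, not the Caffeine API: a per-instance dictionary plays the local L1, a shared dictionary plays Redis, and load_from_db is a hypothetical database accessor. A production L1 would also be size-bounded and carry a short TTL.

```python
from typing import Callable, Dict, Optional


class TwoTierCache:
    """L1 (local dict) in front of L2 (shared dict stand-in) in front of a loader."""

    def __init__(
        self,
        l2: Dict[str, str],
        load_from_db: Callable[[str], Optional[str]],
    ) -> None:
        self.l1: Dict[str, str] = {}         # private to this app server
        self.l2 = l2                         # shared by every app server
        self._load_from_db = load_from_db    # hypothetical database accessor

    def get(self, key: str) -> Optional[str]:
        if key in self.l1:                   # L1 hit: no network hop
            return self.l1[key]
        value = self.l2.get(key)             # L2 lookup (a network round-trip in production)
        if value is None:
            value = self._load_from_db(key)  # both tiers missed
            if value is None:
                return None
            self.l2[key] = value
        self.l1[key] = value                 # promote into L1 for the next call
        return value
```

Two TwoTierCache instances sharing one l2 dictionary behave like two app servers: the first read on either one reaches the database, and the second server's first read is served from the shared tier.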
(d) Multi-Tier Heterogeneous: Netflix EVCache Pattern
Netflix built its own globally-distributed cache layer on top of Memcached. They named it EVCache, short for "Ephemeral Volatile cache," signalling up front that the data can be lost on a node crash and that's fine, because every entry can be rebuilt from the source. EVCache is a well-documented example of a multi-tier heterogeneous cache. Netflix uses Memcached (not Redis) for raw speed on hot ephemeral data: Memcached has lower overhead per operation and better CPU efficiency for simple key-value lookups. Redis is used separately for data that needs data structures (sets, sorted sets, pub/sub). The two stores coexist in the same architecture, each used for what it does best.
EVCache adds zone-aware routing: within an AWS region, Netflix runs one Memcached cluster per availability zone. Reads prefer the local zone's cluster to minimize cross-zone latency. Writes replicate across all zones asynchronously. This means a zone failure doesn't lose cached data, because other zones still hold copies, and normal reads stay within the same availability zone, keeping latency under 1 ms.
Facebook TAO: A Real-World Two-Tier Example
The social-graph data Facebook serves is mostly a giant collection of two things: objects (a user, a post, a photo) and associations between them (Alice is friends with Bob; Bob liked this photo). Facebook's caching system is literally named after those two shapes: TAO stands for "The Associations and Objects." Published in a USENIX ATC 2013 paper by Bronson et al., it is one of the most studied real-world caching architectures. TAO sits in front of MySQL and serves Facebook's social graph (friends, likes, comments) as a two-tier cache.
TAO's architecture encodes a key insight: for a globally-distributed social graph, writes need strong consistency (all writes go to the leader MySQL) but reads can tolerate brief lag (follower regions cache data that may be a few hundred milliseconds behind the leader). The two-tier design (TAO cache servers in front of MySQL, per region) means most social graph reads are served from in-region memory without any cross-Atlantic network hop. The consistency cost is bounded: when a user updates their profile, followers in EU will see the change within seconds once the async invalidation propagates. That's an acceptable trade-off for a social feed. It would not be acceptable for a bank balance.
The cache-of-caches diagram shows the steady-state topology used by many large-scale Java services: each app server has a bounded Caffeine cache (typically 5,000-50,000 entries) that absorbs the hottest fraction of keys with microsecond latency. The shared Redis cluster handles everything the L1 doesn't hold: a much larger working set, shared across all instances. Only genuine misses at both layers reach the database. In a well-tuned deployment, the database might see 1-3% of total read traffic. The rest is absorbed by the two-layer hierarchy.
Pitfalls & Anti-Patterns
Knowing what layers exist and what write strategies to use is only half the battle. The other half is knowing the mistakes teams make in production: the easy-to-reach-for decisions that feel right but quietly destroy performance, introduce stale data, or cause catastrophic failures under load. Here are the seven most damaging pitfalls, each with a clear explanation of why the mistake is tempting, what goes wrong, and how to fix it.
The mistake: A team adds Redis caching, then adds an in-process cache, then adds CDN caching, then enables the DB buffer pool, all for the same data, without measuring which layers actually help. The thinking is: "more caching = faster." So every layer gets turned on, everywhere.
Why it's bad: Each cache layer has real costs. An in-process cache consumes heap memory that your application needs for GC and object allocation. A CDN cache requires correct Cache-Control headers, invalidation logic, and monitoring. A Redis cache adds operational complexity (replication, eviction policy, sentinel/cluster setup). If a layer has a low hit ratio (say, 30%), it costs more in complexity and memory than it saves in latency. The 70% of requests that miss still pay the full downstream latency plus the overhead of the miss check.
The fix: Instrument before adding. Before deploying a new cache layer, measure the hit ratio hypothesis: "I believe 80%+ of reads for this data pattern will be cache hits." After deploying, confirm with metrics. If a layer consistently hits below 60%, reconsider its existence. Cache layers should be added one at a time, with a clear latency/throughput win to justify each one. The right question is not "should we cache this?" but "at which layer does caching this produce a measurable benefit?"
The mistake: A developer sets all cache layers to the same TTL (say, 5 minutes), reasoning that "data should be stale for at most 5 minutes everywhere." So in-process cache, Redis, and any CDN fragment cache all get TTL=300s.
Why it's bad: If all layers expire simultaneously, every server loses its cached version of popular data at the same moment. Thousands of requests hit Redis at the same time (all missing), then thousands of application threads hit the database simultaneously: a cache stampede. The database, which normally sees a trickle of cache misses, suddenly receives 100% of traffic. Under heavy load, this can cascade into a complete outage.
The fix: Stagger TTLs across layers. In-process cache should have the shortest TTL (30-60 seconds): it's the fastest to re-warm but also the most hidden from invalidation. Redis can have a longer TTL (5-30 minutes). CDN can be longer still (minutes to hours). Additionally, add jitter to TTL values so even entries in the same layer don't all expire at exactly the same second.
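Jitter is a one-line change wherever a TTL is computed. The Python helper below is a sketch; the function name and the 10% default spread are assumptions, so pick a spread that matches how much extra staleness you can tolerate.

```python
import random
from typing import Optional


def jittered_ttl(
    base_seconds: float,
    jitter_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the base TTL scaled by a random factor in [1 - jitter, 1 + jitter]."""
    source = rng if rng is not None else random
    factor = 1.0 + source.uniform(-jitter_fraction, jitter_fraction)
    return base_seconds * factor
```

With a 300-second base and 10% jitter, entries written in the same instant expire spread across roughly a one-minute window instead of all at once.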
The mistake: A team caches the entire checkout page at the CDN, including the user's name, cart contents, and personalized recommendations, to save origin bandwidth. The CDN happily caches the response. The next user to request the same URL gets served the first user's checkout page.
Why it's bad: CDN caching is fundamentally designed for shared responses: the same content for all users. The moment a response contains user-specific data (session tokens, cart contents, account details, personalized recommendations), CDN caching becomes a privacy and correctness violation. Depending on the data, this can be a serious data leak. Even when Vary: Cookie headers are used to segment the CDN cache by session, the complexity and risk surface grows significantly.
The fix: Split your pages into a cacheable shell and a dynamic fragment. The page shell (header, footer, navigation, product thumbnails) can be CDN-cached. The personalized fragment (cart count, user name, recommendations) is fetched client-side via a separate authenticated API call that bypasses CDN. This is a form of fragment caching; edge-side includes (ESI) apply the same split but assemble the fragments at the CDN edge instead of in the browser. The rule: if the content changes per user, it must never be stored in the CDN cache.
The mistake: A developer adds an in-process cache as a plain Java HashMap or Python dict with no eviction policy and no size limit. Over time, as more keys are added, the map grows without bound. After a few hours under production traffic, the JVM heap fills up and the process is killed by the OS, or GC pauses become so severe that the service becomes unresponsive.
Why it's bad: In-process caches live in the application heap. Unlike Redis, which runs in a separate process with its own memory and eviction policy, the in-process cache competes directly with your application objects for heap space. A 10 GB in-process cache on a JVM with a 16 GB heap leaves only 6 GB for everything else. GC pressure from the cache alone can cause 10-second stop-the-world pauses.
The fix: Always use a bounded cache with an LRU or TinyLFU eviction policy. In Java, use Caffeine's .maximumSize() or .maximumWeight(). In Python, use cachetools.LRUCache(maxsize=10_000). Set the bound based on the maximum heap budget you're willing to allocate to the cache: typically 10-20% of total heap, never more than 30%.
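If you want to see what a bounded cache does under the hood, here is a minimal LRU sketch in Python built on the standard library's OrderedDict. The class name BoundedLRU is made up for this example; in real code a maintained library such as Caffeine or cachetools is the better choice.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class BoundedLRU:
    """Size-bounded cache that evicts the least recently used entry."""

    def __init__(self, maxsize: int) -> None:
        if maxsize <= 0:
            raise ValueError("maxsize must be positive")
        self._maxsize = maxsize
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)           # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)    # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)
```

The important property is that len(cache) can never exceed maxsize, so the cache's heap footprint has a ceiling no matter how many distinct keys production traffic generates.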
The mistake: A product's price is updated. The team correctly invalidates Redis (DEL product:42). But they don't realize that four app servers also hold the same product in their Caffeine in-process caches with a 10-minute TTL. For the next 10 minutes, those servers serve the old price regardless of what Redis says.
Why it's bad: In-process caches are the most invisible layer in the stack. They are fast, cheap, and easy to add, but they are also local and silent. Unlike Redis (a shared, observable store), each server's in-process cache is a private island. When data changes, you can see the Redis update in redis-cli. You cannot easily inspect every running server's heap. A stale in-process cache can survive a Redis invalidation undetected.
The fix: Two complementary approaches. First, use a short in-process TTL (30-60 seconds) as the baseline safety net: even without any explicit invalidation, stale data expires quickly. Second, implement Redis pub/sub broadcast invalidation (see Section 14): when any server writes a new value, it publishes to an invalidate channel, and all servers (including itself) clear the affected in-process key on receipt. The combination means most invalidations propagate in milliseconds via pub/sub, with the TTL as a fallback for servers that miss the broadcast.
The mistake: A team adds Redis, tunes Caffeine, adds CDN caching, and never looks at the database buffer pool hit ratio. They assume the DB is fine because query latency looks acceptable. Six months later, a traffic spike causes query latency to jump 10×. The root cause: the working set grew beyond the buffer pool size, so most queries are now doing disk reads instead of memory reads. The buffer pool hit ratio had been quietly degrading from 99% to 65% over the previous two months.
Why it's bad: The database buffer pool is the most impactful single parameter in database performance. A Postgres shared_buffers at 25% of RAM with a 99% buffer pool hit ratio means almost no disk I/O. Drop to a 70% hit ratio and 30% of page accesses miss the buffer pool; each miss that the OS page cache cannot absorb becomes a disk read, which adds 2-10 ms per affected query. Yet buffer pool hit ratio is rarely monitored. Teams monitor query latency (a lagging indicator) but not the leading indicator that predicts when it will degrade.
The fix: Add buffer pool hit ratio to your monitoring dashboard as a first-class metric. In Postgres: SELECT sum(blks_hit) / (sum(blks_hit) + sum(blks_read)) FROM pg_stat_database. Target ≥ 99% (a result of 0.99 or higher). If it falls below 95%, the working set is growing: consider increasing shared_buffers, adding RAM, or archiving cold data. Don't wait for query latency to spike before looking at this metric.
The mistake: A viral product goes on sale and suddenly 50,000 concurrent requests per second are all asking for product:viral_item_id. Even though the key is in Redis, it is hit so heavily that the single Redis shard holding it becomes a CPU bottleneck. Worse: when the key's TTL expires, all 50,000 requests simultaneously find a miss and simultaneously attempt to fetch from the database: a textbook cache stampede.
Why it's bad: Cache stampede is a self-inflicted DDoS on your own database. A single hot key expiring can generate a spike of database load proportional to your request concurrency, not your normal cache-miss rate. At high QPS, this can knock the database offline within seconds of a TTL expiry.
The fix: Two techniques. (1) Probabilistic early expiration: before a key expires, a small fraction of requests (e.g., 1%) proactively refresh it while the rest still serve the cached value. Caffeine's refreshAfterWrite offers a related refresh-ahead behavior: once an entry passes its refresh age, the next access triggers an asynchronous reload while the old value keeps being served. (2) Per-key mutex / request coalescing: when a miss occurs, only one request fetches from the database; all others wait for that result. This converts N simultaneous database hits into a single database hit with N-1 waiters. Additionally, for extreme hot keys, replicate the value across multiple Redis shards and route reads by a hash of request ID, distributing the load.
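Request coalescing is easy to get subtly wrong, so here is a minimal per-key mutex sketch in Python using only the standard library. The class name CoalescingCache and the loader callback are assumptions for illustration; the sketch never evicts entries or lock objects, which a real implementation would need to handle.

```python
import threading
from typing import Callable, Dict


class CoalescingCache:
    """On a miss, one caller loads the value; concurrent callers for the same key wait."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader                          # hypothetical database fetch
        self._values: Dict[str, str] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()                 # protects the lock table itself

    def get(self, key: str) -> str:
        if key in self._values:                        # fast path: no locking on a hit
            return self._values[key]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            if key not in self._values:                # re-check: another thread may have loaded it
                self._values[key] = self._loader(key)
            return self._values[key]
```

The re-check inside the per-key lock is the step that matters: without it, every waiter would call the loader as soon as it acquired the lock, and the stampede would merely be serialized instead of avoided.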
Practice Exercises: Multi-Level Cache Reasoning
The goal of these exercises is not to test whether you memorized the layer names. It's to build the habit of reasoning through multi-level cache decisions, the kind of thinking an interviewer is looking for when they describe a system and ask "where would you add caching?"
You have a five-layer cache stack with the following hit ratios and per-layer latencies:
| Layer | Hit Ratio | Latency on Hit | Latency on Miss (falls through) |
|---|---|---|---|
| L1: In-process (Caffeine) | 70% | 0.2 ms | falls to L2 |
| L2: Redis | 80% | 0.8 ms | falls to L3 |
| L3: DB buffer pool (warm) | 80% | 3 ms | falls to L4 |
| L4: DB query (no page cache) | 90% | 10 ms | falls to L5 |
| L5: Disk (cold read) | 100% | 50 ms | n/a |
Calculate the end-to-end average latency per request across the full stack. Show your work.
- L1 hit: 70% × 0.2 ms = 0.14 ms
- L2 hit: 30% × 80% = 24% of requests × 0.8 ms = 0.192 ms
- L3 hit: 30% × 20% × 80% = 4.8% × 3 ms = 0.144 ms
- L4 hit: 30% × 20% × 20% × 90% = 1.08% × 10 ms = 0.108 ms
- L5 hit: 30% × 20% × 20% × 10% = 0.12% × 50 ms = 0.060 ms
- Total average latency ≈ 0.14 + 0.192 + 0.144 + 0.108 + 0.060 = 0.644 ms
Notice: even though a cold disk read takes 50 ms, it only affects 0.12% of requests, so its contribution to average latency is just 0.06 ms. L1 absorbs 70% of all traffic at 0.2 ms, and the largest single term is L2 at 0.192 ms. This calculation charges each request only the latency of the layer that serves it, not the cost of the miss checks above it. This is the compounding math in action.
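The same arithmetic generalizes to any stack. The Python sketch below uses the exercise's simplification that a request pays only the latency of the layer that serves it; the function name and the tuple layout are choices made for this example.

```python
from typing import List, Tuple


def expected_latency_ms(layers: List[Tuple[float, float]]) -> float:
    """layers: (hit_ratio, latency_on_hit_ms), ordered from the first layer to the last."""
    reach = 1.0      # fraction of requests that reach the current layer
    total = 0.0
    for hit_ratio, latency_ms in layers:
        total += reach * hit_ratio * latency_ms   # requests served at this layer
        reach *= 1.0 - hit_ratio                  # the rest fall through
    return total


# The five-layer stack from the exercise table.
stack = [(0.70, 0.2), (0.80, 0.8), (0.80, 3.0), (0.90, 10.0), (1.00, 50.0)]
print(round(expected_latency_ms(stack), 3))       # 0.644
```

Changing one hit ratio in the list and re-running is a quick way to see which layer's tuning moves the average the most.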
You're designing caching for a product detail page on an e-commerce site. The page includes: (a) the product's main image and static description (rarely changes), (b) the current price (changes 0-5 times per day), (c) the inventory count (changes on every purchase, potentially hundreds per minute), (d) the user's personalized "You might also like" recommendations (generated per-user per-session), and (e) the user's recently-viewed history badge. For each piece of data, identify the appropriate cache layer(s) and explain why.
- (a) Product image + description → CDN + Browser cache. Public, shared across all users, rarely changes. Long TTLs safe. Content-hash URL enables permanent caching.
- (b) Price → Redis with write-through invalidation. Shared across all users (same price for everyone) but changes daily; CDN is okay with short TTL + surrogate-key purge on change. Redis for the rendered price component. In-process cache acceptable with TTL ≤ 60s.
- (c) Inventory count → Redis only, no in-process cache, no CDN. Changes very frequently. Stale count risks overselling (serious business consequence). Use Redis with short TTL and write-through on purchase. CDN must not cache this.
- (d) Personalized recommendations → Redis keyed by user ID. User-specific, so never CDN. Computed per session. Acceptable to cache for the session duration (15-30 min TTL).
- (e) Recently-viewed history badge → Redis keyed by user ID. User-specific. In-process cache technically possible but risky if the user updates their browsing on another tab/device. Redis with short TTL is safest.
A price change is made in the admin console at 2:00 PM. At 2:45 PM, a support ticket reports that some users still see the old price. Your investigation shows: the database has the correct new price, Redis has the correct new price (confirmed with redis-cli GET product:42), and the CDN is correctly returning fresh content (confirmed by checking response headers: no Age header). Users who are affected are on app servers 2, 3, and 4. App server 1 shows the correct price. Identify the cache layer responsible for the bug and describe the fix.
You need to deploy a change to your main JavaScript bundle (app.js). The current bundle is being served from CDN and browsers. After the deploy, the new app.js must be used. Design the invalidation strategy: what headers do you set, what CDN operation (if any) do you trigger, and how do you ensure that browsers that have the old bundle cached are forced to fetch the new one? The constraint: you cannot issue a hard browser-level invalidation (you cannot clear users' browser caches).
Fingerprint the bundle by content hash: hash the contents of app.js and rename the output to app.8f3a2c.js. The HTML file references this hashed filename. Set Cache-Control: public, max-age=31536000, immutable, which gives a one-year immutable CDN and browser cache. On the next deploy, the new bundle gets a new hash: app.9d1b4f.js. The HTML is updated to reference the new filename, so the HTML itself must be served with a short TTL or no-cache; otherwise browsers keep loading the old reference. No CDN purge needed. No browser cache invalidation needed. Browsers see a URL they've never fetched and request the new file fresh. The old file stays in CDN/browser caches until its TTL expires or it is evicted, but is never requested again (no HTML points to it). This is the canonical way to handle static asset deployment. The key insight is that you never invalidate, you rotate.
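Build tools normally do the fingerprinting for you, but the idea fits in a few lines. The Python sketch below is illustrative only: the function name, the choice of SHA-256, and the six-character digest length are assumptions, and it expects a bare filename without directories.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, content: bytes, digest_chars: int = 6) -> str:
    """Insert a short content hash before the extension: app.js -> app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    path = PurePosixPath(filename)
    return f"{path.stem}.{digest}{path.suffix}"
```

Because the name is derived from the bytes, an unchanged bundle keeps its URL across deploys and stays cached, while any change to the content produces a URL no browser has seen before.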
You're reviewing a system with two cache tiers: L4 is a Redis cluster (hit ratio: 95%) and L5 is the DB buffer pool (hit ratio: 60%). The overall average latency is acceptable when L4 hits (0.5 ms) but a 5% L4 miss rate is generating noticeable load: each L4 miss hits the DB, and of those DB hits, 40% are disk reads because the buffer pool is only 60% warm. What are the possible interventions, and which one should you prioritize first? Explain the trade-offs.
Per 1000 requests: 950 hit L4 (0.5 ms each = 475 ms total). 50 miss L4 and hit the DB. Of those 50: 30 hit buffer pool (3 ms each = 90 ms), 20 hit disk (50 ms each = 1000 ms). So disk reads represent 2% of total requests but contribute 1000 ms / (475 + 90 + 1000) = ~64% of total latency. The bottleneck is clear: disk reads.
Interventions in priority order:
- Increase DB buffer pool (highest ROI): if the 60% hit ratio is caused by an undersized buffer pool relative to the hot working set, increasing shared_buffers (Postgres) or innodb_buffer_pool_size (MySQL) will directly raise the hit ratio. First, verify: check whether the DB working set fits in RAM. If the working set is 20 GB but the buffer pool is only 4 GB, adding RAM and increasing the buffer pool to 12-16 GB will likely push hit ratio above 95%, eliminating most disk reads.
- Add L3 in-process cache on app servers: add a Caffeine L3 between L4 (Redis) and the DB. This absorbs hot L4 misses before they reach the DB at all. Effective if L4 misses are concentrated on a small set of hot keys.
- Improve L4 hit ratio (lower priority): 95% is already quite good. Going from 95% to 98% cuts misses from 50 to 20 per 1000, a marginal gain compared to fixing the disk read cost.
- Data tiering (if working set is inherently large): if the buffer pool can never fit the working set, archive cold data, partition hot vs cold tables, or route cold-read queries to a separate read replica with a larger buffer pool.
Bug Studies: When Multi-Level Caches Go Wrong
Theory says caches make things faster. Production says: it depends on how you configure them. The four incidents below are composite sketches drawn from real patterns. Exact companies are softened, but every failure mode is documented in public postmortems or engineering blogs. Read each one to understand not just what broke, but why that specific layer interaction caused the failure.
shared_buffers Set to 90% of RAM Caused OOM Kills
shared_buffers = 57600MB (≈90%). Within hours the kernel's OOM killer began killing Postgres background workers. Query latency, paradoxically, got worse, not better. Restoring shared_buffers = 16384MB (25%) immediately stopped the OOM kills and improved throughput by roughly 40%.
What Went Wrong
Postgres does not operate in isolation: it shares the machine's RAM with the OS page cache. When Postgres reads a database page, the read first passes through the OS page cache, then lands in shared_buffers. That means if shared_buffers is enormous, the same page can be double-buffered: once in kernel space (page cache) and once in Postgres's own buffer pool. You burn RAM twice for the same data.
Worse: by stealing 90% of RAM for shared_buffers, Postgres left almost nothing for the OS page cache or for per-connection working memory. Every WAL write, every VACUUM I/O, every sequential scan that bypasses shared_buffers hit the disk cold. The kernel's page cache, which is extremely good at scan-resistant eviction, was effectively disabled. The usual starting point is roughly 25% of RAM for shared_buffers, leaving roughly 75% for the OS page cache and everything else on the host.
The diagram shows how over-allocating shared_buffers starves the OS page cache, forcing disk I/O for every operation that bypasses the Postgres buffer pool. The correct split lets both layers complement each other.
HashMap as an in-process cache for user-preference objects. It grew unbounded. Heap usage climbed by roughly 400 MB per week. After six weeks, the service began triggering full GC pauses (2–5 seconds) every few hours. After ten weeks, it OOM-crashed. The fix was a 10-line change: replace the raw HashMap with a size-bounded Caffeine cache.
What Went Wrong
In-process caches store data inside your application's heap. That heap is managed by the garbage collector, and the GC has a fixed budget. An unbounded cache means you keep accumulating objects. The JVM's GC handles young-generation objects well: they're short-lived and collected cheaply. But long-lived cache entries are promoted to the old generation, and old-generation collection (full GC) is expensive: it pauses the entire process.
The root cause is conceptual: a cache without a size cap is not a cache; it is a memory leak with lookup syntax. Every cache must have at least one bound: a maximum entry count, a maximum byte size, or a TTL (ideally more than one). Without a bound, the cache grows until it eats the heap.
HashMap or Python dict as a long-lived cache is almost always wrong.
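For the Python side of that warning, here is a minimal sketch of what "bounded" means in practice: an LRU cache with both a hard entry cap and a per-entry TTL, built on `collections.OrderedDict`. The class name `BoundedCache` and its parameters are illustrative, and the sketch is not thread-safe; a production service would more likely reach for an existing library.

```python
import time
from collections import OrderedDict


class BoundedCache:
    """LRU cache with a hard entry cap and a per-entry TTL (not thread-safe)."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it so it cannot pin memory
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)
```

The entry cap is what guarantees a memory ceiling; the TTL only limits staleness, since expired entries here are removed lazily on lookup.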
What Went Wrong
This failure is called a cache stampede (also called thundering herd). The problem arises because TTL expiry is a synchronized event: every process that has the key cached sees it expire at exactly the same wall-clock moment. If all those processes immediately try to rebuild the cache, you've turned one database query into one-thousand simultaneous identical queries.
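The fix has two parts: probabilistic early expiry (each server may rebuild the entry slightly before it expires, with a probability that rises as expiry approaches, so one server usually refreshes it before the others ever miss) and locking (use a distributed lock, for example Redis SET with NX and an expiry, or the older SETNX, so only the first miss triggers the rebuild; all others wait for the result).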
The stampede (left) fires 1 200 identical queries simultaneously. The mutex fix (right) serializes the rebuild to one query while all others wait for the fresh value.
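The locking half of the fix can be sketched inside a single process. This is an assumption-laden stand-in: `threading.Lock` plays the role of the distributed lock, the class name `SingleFlightCache` and the `slow_loader` function are hypothetical, and TTL handling is left out to keep the locking pattern visible.

```python
import threading
import time


class SingleFlightCache:
    """On a miss, one caller per key runs the loader; the others wait for it."""

    def __init__(self):
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the _key_locks dict

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key, loader):
        if key in self._values:  # warm hit: no locking needed
            return self._values[key]
        with self._lock_for(key):
            # Re-check: another thread may have rebuilt the value while we waited.
            if key not in self._values:
                self._values[key] = loader(key)
            return self._values[key]


calls = []


def slow_loader(key):
    calls.append(key)  # record every trip to the "database"
    time.sleep(0.05)   # simulate an expensive query
    return f"value-for-{key}"


cache = SingleFlightCache()
threads = [
    threading.Thread(target=cache.get, args=("price:42", slow_loader))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(threads)} concurrent readers, {len(calls)} loader call(s)")
```

The re-check after acquiring the lock is the important line: without it, every waiting thread would run the loader in turn and the stampede would merely be serialized rather than removed.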
What Went Wrong
Multi-region caches are not automatically coherent. Redis's cross-datacenter replication (whether via built-in replication or an intermediary like a Pub/Sub fan-out) is asynchronous. Under normal conditions this is fine: the lag is negligible. But under a network partition or a deployment gap, invalidation messages queue up and are delivered late or, in the worst case, dropped.
The team's mental model was "we invalidate the cache on write", which was true, but only for the local region. Multi-region invalidation needs explicit propagation acknowledgement or a fallback strategy. Common approaches: (a) use a global TTL short enough that stale windows are acceptable, (b) use versioned keys (the new price gets a new key; old keys expire naturally), (c) route write-path invalidation to all regional clusters with acknowledgement.
Real-World Architectures: How Big Systems Stack Their Caches
Textbook cache descriptions make it sound like a choice between Redis and Memcached. Real production systems layer four or five different caching mechanisms simultaneously, each chosen for a specific reason. The architectures below are drawn from public engineering posts, USENIX papers, and conference talks, and they show how the six-layer hierarchy plays out at scale.
Facebook TAO (USENIX ATC 2013, Bronson et al.): writes go to the leader region and propagate to followers via async invalidation. Each follower region reads from its local TAO cache backed by a MySQL replica, so reads never cross regions. This means EU users always get low latency but may briefly see stale data after a write.
Facebook's social graph is a massive graph of objects (users, posts, photos) and associations (friendships, likes). TAO (The Associations and Objects system, described in the USENIX ATC 2013 paper by Bronson et al.) is purpose-built for this shape of data. The key architectural insight is that graph reads massively outnumber writes, at reported ratios of hundreds-to-one, so TAO is optimized for read-local, write-global.
Cache layers in TAO:
- In-region TAO cache: a distributed Memcached-like tier within each data center. Object and association data is cached here; reads are served from this layer at sub-millisecond latency.
- MySQL with InnoDB buffer pool: each region has a MySQL primary or replica. The InnoDB buffer pool caches the hot pages of the graph DB, so cache misses that fall through TAO usually still avoid disk.
- Write-through to leader: writes always go to the leader region's DB, then TAO sends async invalidations to all follower regions. Followers re-populate their caches on the next miss.
The tradeoff: follower regions trade strong consistency for low read latency. A user in Frankfurt who updates their profile will see the update immediately (they wrote to leader, and the local TAO is invalidated synchronously for their own read), but their friend in London might see the old value for up to a few hundred milliseconds while async invalidation propagates.
Netflix's EVCache (described in multiple Netflix Tech Blog posts) is a Memcached-based distributed cache with a custom replication layer added on top. The core Memcached layer is standard; Netflix chose it for its simplicity, predictable latency, and well-understood memory model. What Netflix added was multi-region replication baked into the client library.
Cache layers in EVCache:
- EVCache cluster per AWS region: each region runs its own Memcached clusters, sharded across multiple nodes. The cluster is entirely local, with no cross-region traffic on the hot read path.
- Client-side replication: writes are replicated to multiple regions by the client library itself, not by Memcached. The client writes to all regions; reads always go to the local region.
- No dedicated in-process cache tier: Netflix uses EVCache as the primary cache for everything from session tokens to personalized recommendations. The assumption is that EVCache is fast enough (sub-millisecond over the same AWS VPC) that an additional in-process layer is usually not worth the complexity.
EVCache is used for data with high reuse across many users: content metadata, genre lists, row data for the Netflix home screen. Personalized data (your specific recommendation ranking) is computed fresh or cached with short TTLs because the personalization model changes frequently.
Twitter (now X) built its own distributed key-value store called Manhattan, first publicly described on the Twitter engineering blog in 2014. The social graph (followers, timelines) was backed by Manhattan, with multiple cache layers in front of it to handle the extreme read-to-write ratio of social timelines.
Cache layers in Twitter's stack:
- In-process timeline cache: each timeline service instance kept a Scala/Java in-process cache of recently fetched timelines (a fixed-size LRU). A heavy Twitter user's home timeline was often served entirely from this tier.
- Memcached cluster: the shared distributed tier between all timeline service instances. Cache keys included the user ID and a cursor position for paginated timelines.
- Manhattan KV: the persistent store. Manhattan itself uses a log-structured merge-tree (LSM) internally, which means it has its own multi-layer cache: memtable (in-process), block cache (in-process), and OS page cache. So even the "database layer" has three sub-layers of caching.
Discord's 2023 engineering post on migrating from Cassandra to ScyllaDB is a fascinating counter-example: they removed a cache tier rather than adding one. Cassandra's read performance degraded as their message table grew to trillions of rows, so Discord added a Redis cache in front of Cassandra to protect it from read amplification. This worked but added operational complexity.
When they migrated to ScyllaDB (a Cassandra-compatible database written in C++ with a much more efficient cache implementation), ScyllaDB's internal row cache, which is part of ScyllaDB's architecture, was good enough to replace the Redis tier entirely. The "right number of cache layers" dropped from four to three. The lesson is that sometimes improving the backing store is a better investment than adding another cache layer in front of a struggling one.
Shopify runs one of the largest Rails deployments in existence. Their caching story is documented across several engineering blog posts. Unlike the previous examples which are read-heavy social platforms, Shopify's challenge is handling flash sales: millions of users hitting a single storefront in seconds.
Cache layers in Shopify's stack:
- Fastly CDN: the outermost layer. Shopify uses Fastly's Varnish-based CDN to cache entire page responses. For a flash sale, Fastly absorbs a substantial share of requests before they reach Shopify's origin servers. Fastly's surrogate keys allow instant cache purge on inventory or price changes.
- Rails fragment caching: inside the Rails app, expensive HTML fragments (product grids, collection pages) are cached in Memcached using Rails's built-in fragment cache API. Generation-based cache busting (incrementing a global generation counter rather than enumerating every key to invalidate) allows broad invalidation with a single atomic operation.
- Redis for session and cart: user session and cart state live in Redis with short TTLs, ensuring personalized data is always fresh without hitting the database.
- MySQL with the InnoDB buffer pool + a connection-pooling proxy: the database tier, with connection pooling in front of MySQL to handle the connection spike from flash sales, and the InnoDB buffer pool caching the hot inventory rows.
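The generation-based cache busting mentioned in the fragment-caching bullet is easy to picture with a toy model. This is a sketch of the general technique, not Shopify's or Rails's code: the class name `GenerationCache` is hypothetical, and plain dicts stand in for Memcached.

```python
class GenerationCache:
    """Keys embed a namespace generation; bumping it orphans every old key."""

    def __init__(self):
        self._store = {}        # stand-in for the shared cache (e.g. Memcached)
        self._generations = {}  # namespace -> int; would live in the shared cache too

    def _key(self, namespace, name):
        gen = self._generations.get(namespace, 0)
        return f"{namespace}:g{gen}:{name}"

    def get(self, namespace, name):
        return self._store.get(self._key(namespace, name))

    def set(self, namespace, name, value):
        self._store[self._key(namespace, name)] = value

    def invalidate_namespace(self, namespace):
        # One counter increment; no enumeration of individual keys.
        self._generations[namespace] = self._generations.get(namespace, 0) + 1


fragments = GenerationCache()
fragments.set("shop:1", "product_grid", "<div>old grid</div>")
fragments.invalidate_namespace("shop:1")  # e.g. after a price change
print(fragments.get("shop:1", "product_grid"))  # the old fragment is unreachable
```

The orphaned entries are never deleted explicitly. They stop being read and are eventually pushed out by the cache's normal eviction, which is what makes the invalidation itself a single cheap operation.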
No two production stacks are identical. Facebook TAO skips CDN and in-process caches because social graph data is too personalized to cache at the edge. Netflix skips in-process caches because EVCache latency is already sub-millisecond within a VPC. Shopify leads with CDN because static content cacheable at the edge handles the bulk of flash-sale traffic.
Common Misconceptions: Mental-Model Errors
These aren't beginner mistakes. They're the mental-model errors that experienced engineers make when reasoning about cache hierarchies. Each one is distinct from the pitfalls in S17 (which focused on implementation gotchas). These are wrong beliefs about how caches work, and correcting them changes how you design systems.
The belief: If the system is slow, add Redis. Adding a cache layer can only help.
Why it's wrong: A cache speeds things up only when the data has high temporal or spatial locality. If you're caching data that is accessed only once (a report query keyed on a unique date range, a user profile fetch that's never requested twice because users churn), you pay the overhead of cache writes, serialization, and network hops without ever getting a cache hit. The cache hit ratio stays near 0%, and you've added 0.5–2 ms of latency to every request (the cost of checking the cache and missing) instead of saving time.
The corrected mental model: Before adding a cache layer, measure the access pattern. What is the expected hit ratio for this data? If most requests are for unique keys, the data has no locality and a cache will hurt, not help. Caches are only beneficial when data is reused: either the same key is read many times (temporal locality) or nearby keys are read together (spatial locality).
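A quick way to put numbers on "will hurt, not help" is to compute the break-even hit ratio. The model below is deliberately simple and is an assumption of this sketch: every request pays the cache check, misses additionally pay the backend, and the cost of writing the entry back on a miss is ignored (which makes the cache look slightly better than it is).

```python
def avg_latency_with_cache(hit_ratio, cache_ms, backend_ms):
    """Every request pays the cache check; misses additionally pay the backend."""
    return cache_ms + (1.0 - hit_ratio) * backend_ms


def break_even_hit_ratio(cache_ms, backend_ms):
    """Hit ratio above which the cache lowers average latency."""
    return cache_ms / backend_ms


CACHE_MS, BACKEND_MS = 1.0, 10.0  # illustrative figures
for hit_ratio in (0.0, 0.05, 0.5, 0.9):
    with_cache = avg_latency_with_cache(hit_ratio, CACHE_MS, BACKEND_MS)
    verdict = "helps" if with_cache < BACKEND_MS else "hurts"
    print(f"hit ratio {hit_ratio:4.0%}: {with_cache:5.2f} ms vs {BACKEND_MS} ms -> {verdict}")
```

With a 1 ms cache in front of a 10 ms backend the break-even point is a 10% hit ratio; in front of a 2 ms backend it jumps to 50%, so a fast backend raises the bar a cache has to clear.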
The belief: Redis, Memcached, Caffeine, and the CDN all do the same thing. Pick one.
Why it's wrong: Each layer has fundamental physical constraints that make it non-substitutable. You cannot replace an in-process Caffeine cache with Redis if your latency budget is below 0.1 ms, because Redis adds a network round trip. You cannot replace a CDN with Redis if your bottleneck is geographic distance, because Redis is in one data center. You cannot replace the OS page cache with Redis, because the page cache operates at the kernel level and intercepts file I/O automatically, something Redis cannot do.
The corrected mental model: The six layers exist because each occupies a unique point in the speed/size/location trade-off space. Layers complement each other; they don't substitute for each other. The right question is not "which cache?" but "which cache belongs at which layer for this specific data?"
The belief: If my Redis hit ratio is 85%, I should add more RAM to push it to 95%.
Why it's wrong: Hit ratio improves with cache size only up to the working set boundary. Once your cache is large enough to hold the entire working set, additional memory adds nothing: every key that would ever be a hit is already cached. Beyond that point, you're paying for RAM that holds data nobody accesses. The marginal value of each additional gigabyte drops toward zero once you've covered the working set.
The corrected mental model: Profile the working set size first. Redis's INFO keyspace reports how many keys are stored, not how many are actually read, so combine it with access monitoring (for example, application-side counters or sampling keys with OBJECT IDLETIME) to estimate how many unique keys are accessed in a 24-hour window. Size the cache to cover that working set, then stop. Spending money on cache RAM beyond the working set has close to zero return.
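The plateau is easy to reproduce with a toy simulation. The sketch below replays a synthetic trace against an LRU cache at several capacities; the uniform-random trace, the seed, and the sizes are arbitrary choices for illustration, and real traces are usually far more skewed.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(accesses, capacity):
    """Replay a list of keys against an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # refresh recency
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(accesses)


rng = random.Random(7)
WORKING_SET = 1_000  # number of distinct keys that are ever requested
accesses = [rng.randrange(WORKING_SET) for _ in range(50_000)]

for capacity in (250, 500, 1_000, 2_000, 4_000):
    print(f"capacity {capacity:5d}: hit ratio {lru_hit_ratio(accesses, capacity):.3f}")
```

Once capacity reaches the number of distinct keys, nothing is ever evicted, so the remaining misses are all first-time reads and extra capacity cannot remove them.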
The belief: If I have Redis, I don't need an in-process cache. If I have an in-process cache, Redis is redundant.
Why it's wrong: They serve different purposes and work best together. An in-process cache (Caffeine, Python dict) costs effectively 0 ms: no serialization, no network hop, no thread switch. A distributed cache (Redis) costs 0.5–2 ms but is shared across all your application servers, so every server benefits from every other server's cache population. The optimal design is usually both: an in-process cache for the hottest data (top 1% of requests) that eliminates even the Redis round trip, backed by a distributed cache for the next tier down.
The corrected mental model: Think of them as two levels of a hierarchy, not competing options. The in-process cache is a "local Redis" with zero latency and limited scope. The distributed cache is a "shared store" that aggregates all servers' cache populations. Use both. Cap the in-process cache at a small fraction of heap (tens of thousands of entries); let Redis handle the long tail.
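A minimal sketch of the two-level lookup, assuming a plain dict as a stand-in for Redis and a caller-supplied loader as the database. The class name `TwoLevelCache` and its parameters are hypothetical, and TTLs and invalidation are omitted to keep the read path readable.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Small in-process LRU in front of a shared store, with a DB loader behind both."""

    def __init__(self, shared_store, local_capacity, load_from_db):
        self._local = OrderedDict()   # per-process tier, capped
        self._shared = shared_store   # dict stand-in for Redis, shared by all instances
        self._capacity = local_capacity
        self._load = load_from_db
        self.stats = {"local": 0, "shared": 0, "db": 0}

    def _remember_locally(self, key, value):
        self._local[key] = value
        self._local.move_to_end(key)
        if len(self._local) > self._capacity:
            self._local.popitem(last=False)  # keep the local tier small

    def get(self, key):
        if key in self._local:
            self._local.move_to_end(key)
            self.stats["local"] += 1
            return self._local[key]
        if key in self._shared:
            self.stats["shared"] += 1
            value = self._shared[key]
        else:
            self.stats["db"] += 1
            value = self._load(key)
            self._shared[key] = value  # populate the shared tier for other servers
        self._remember_locally(key, value)
        return value
```

Two instances built over the same `shared_store` behave like two app servers: the first one to miss pays the database cost, and the second finds the value in the shared tier.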
The belief: The database handles its own caching. I don't need to think about shared_buffers or the InnoDB buffer pool; that's DBA territory.
Why it's wrong: The DB buffer pool is the most consequential cache layer for most applications, because it determines whether a DB query touches disk or RAM. If your buffer pool hit ratio is 60% (meaning 40% of DB reads go to disk), adding a Redis cache in front of the database will only help with the queries Redis can cache. For queries that are too complex or too personalized for Redis, every miss still causes a disk read. Understanding the buffer pool hit ratio tells you whether application-level caching is even the right lever to pull.
The corrected mental model: Monitor pg_stat_database and the ratio blks_hit / (blks_hit + blks_read) in Postgres (or Innodb_buffer_pool_reads vs Innodb_buffer_pool_read_requests in MySQL). If the buffer pool hit ratio is above 99%, adding more app-level cache is low value. If it's 80%, right-sizing the buffer pool first will often give a larger gain than adding Redis.
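Both ratios are simple to compute once you have the counters. The helpers below take the raw numbers as arguments, so they make no assumption about how you fetch them; in Postgres the blks_hit and blks_read counters are columns of the pg_stat_database view, and in MySQL the two values come from SHOW GLOBAL STATUS. The function names are illustrative.

```python
def postgres_hit_ratio(blks_hit, blks_read):
    """shared_buffers hit ratio from pg_stat_database counters."""
    total = blks_hit + blks_read
    return blks_hit / total if total else None


def innodb_hit_ratio(read_requests, reads):
    """InnoDB buffer pool hit ratio.

    read_requests: Innodb_buffer_pool_read_requests (logical reads)
    reads:         Innodb_buffer_pool_reads (logical reads that had to hit disk)
    """
    return 1.0 - reads / read_requests if read_requests else None


# Illustrative counter values, not taken from a real server.
print(f"postgres: {postgres_hit_ratio(blks_hit=990_000, blks_read=10_000):.2%}")
print(f"innodb:   {innodb_hit_ratio(read_requests=5_000_000, reads=1_000_000):.2%}")
```

One caveat for Postgres: blks_read counts blocks that were not found in shared_buffers, and some of those are still served from the OS page cache rather than disk, so this ratio can understate how often reads are satisfied from RAM.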
The belief: My Redis hit ratio is 97%. The cache is working great.
Why it's wrong: Hit ratio is only one dimension of cache health. A cache can have a 97% hit ratio while simultaneously: (a) evicting aggressively because it's undersized, so you're serving hits but constantly churning the working set; (b) returning stale data, where hits are "successful" from the cache's perspective but wrong from the application's perspective; (c) adding 5 ms of latency due to memory pressure causing slower lookups or swap usage.
The corrected mental model: Monitor at least four metrics per cache layer: hit ratio, eviction rate, memory utilization, and p99 latency. A healthy cache has a high hit ratio, a low eviction rate (it's not churning to achieve the hits), sub-millisecond p99 latency, and stable memory utilization below the configured maximum.
The belief: I have a small app with a few thousand users. All this CPU cache / OS page cache / buffer pool talk is irrelevant to me.
Why it's wrong: Small systems already use three or four cache layers whether you configured them or not. The OS page cache caches your application's static files and your database's data files automatically. The database buffer pool caches your hot rows. The browser caches your CSS and JS. These layers are always present; the question is only whether you've configured them intelligently. Even a hobby project benefits from setting appropriate HTTP cache headers (CDN/browser layer), ensuring the database buffer pool is large enough for the working set (DB layer), and avoiding unbounded in-process data structures (in-process layer).
The corrected mental model: You're not choosing whether to have cache layers; they exist regardless. You're choosing whether to understand and configure them, or to ignore them and wonder why the app feels sluggish.
Operational Playbook: Tuning the Whole Stack
Knowing the theory is necessary but not sufficient. This section is a practical five-stage playbook for tuning a multi-level cache stack in a running production system. Follow it in order, because skipping stages leads to optimizing the wrong layer.
The five stages must be followed in order. Tuning eviction before measuring hit ratios is guessing. Right-sizing before locating the bottleneck layer is optimizing the wrong thing.
You cannot tune what you cannot see. Before changing anything, add instrumentation to expose hit ratio, miss ratio, latency, and eviction rate for each layer. Here are the specific commands and APIs:
OS Page Cache: use free -h to see total vs used vs buff/cache. Use vmstat 1 to see page-in/page-out rates. On Linux, cachestat (part of bcc tools) shows page cache hit/miss counts system-wide, and cachetop breaks them down per process.
Postgres shared_buffers: the pg_buffercache extension exposes what is currently in the buffer pool.
Redis: run INFO memory and INFO stats.
In-process cache (Caffeine): enable stats recording (recordStats() on the builder) and expose the counters via JMX or Micrometer.
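As one concrete example of turning raw output into a hit ratio, the sketch below parses the text that Redis's INFO stats returns. The field names keyspace_hits, keyspace_misses, and evicted_keys are real INFO fields; the sample values and the helper names are made up for illustration, and fetching the text from a live server is left to whichever client you use.

```python
def parse_info(text):
    """Parse 'key:value' lines from Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def redis_hit_ratio(info_text):
    """Hit ratio from the keyspace_hits / keyspace_misses counters."""
    fields = parse_info(info_text)
    hits = int(fields["keyspace_hits"])
    misses = int(fields["keyspace_misses"])
    total = hits + misses
    return hits / total if total else None


# Hypothetical excerpt of `INFO stats` output.
SAMPLE_INFO = """\
# Stats
keyspace_hits:9500
keyspace_misses:500
evicted_keys:12
"""

print(f"hit ratio: {redis_hit_ratio(SAMPLE_INFO):.1%}")
print(f"evicted keys: {parse_info(SAMPLE_INFO)['evicted_keys']}")
```

These counters are cumulative since the server started (or since the stats were last reset), so for dashboards it is more useful to sample them periodically and compute the ratio over the deltas.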
With metrics in hand, identify two things: which layer has the lowest hit ratio relative to its allocated capacity, and which layer's miss penalty is highest. These are not always the same layer.
Hit ratio per layer: if your in-process cache is 40% and your Redis is 95%, the in-process cache is the weak link, but it may be weak because it's too small or because its TTL is too short. If your Redis hit ratio is 60%, that's alarming: it suggests the working set doesn't fit, or the TTL is too aggressive, or the eviction policy is wrong.
Miss penalty by layer: an in-process cache miss falls to Redis (≈1 ms penalty). A Redis miss falls to the database (≈5–50 ms penalty). A database miss falls to disk (≈50 µs–10 ms penalty). The layer where misses are most expensive is the most important to fix first. Usually this is the distributed cache or the database buffer pool.
Right-sizing means allocating enough memory to cover the working set: no more, no less. Here are commonly used rules of thumb for each layer:
- Postgres shared_buffers: 25% of total system RAM. Set effective_cache_size to 75% to tell the query planner about the OS page cache. Never exceed 40% even on a dedicated DB host, because you need room for the page cache.
- MySQL InnoDB buffer pool (innodb_buffer_pool_size): 60–80% of RAM on a dedicated DB host. MySQL's buffer pool is more aggressive about using memory because MySQL relies on the buffer pool more than Postgres (Postgres delegates more to the OS page cache).
- In-process cache (Caffeine, Guava): set maximumSize to the number of unique "hot" keys your service handles in one TTL window. A typical API service might cache 10 000–100 000 hot user records. Always cap it. If you're uncertain, start at 50 000 and measure eviction rate; if eviction rate is near zero, you have headroom.
- Redis working set: use redis-cli --bigkeys and DEBUG OBJECT sampling to estimate total working set size. Size the Redis maxmemory at 110–120% of the working set estimate to leave headroom for hot-key clustering without eviction pressure.
Eviction policy determines which entries are removed when the cache is full. The wrong policy can cut your effective hit ratio even with correct sizing.
- LRU (Least Recently Used): evicts the entry that hasn't been accessed in the longest time. Good for workloads with strong temporal locality. Weakness: a single large sequential scan evicts all hot data ("scan pollution").
- LFU (Least Frequently Used): evicts the entry accessed the fewest times total. Good for frequency-skewed workloads (a small set of keys is always hot). Weakness: new entries start with count 0 and are immediately vulnerable to eviction even if they'll be popular.
- TinyLFU (used in Caffeine): a scan-resistant, frequency-based policy that uses a compact frequency sketch to approximate LFU without the cold-start problem. Caffeine's built-in policy is the windowed variant, W-TinyLFU, and it is a strong general-purpose choice for in-process caches.
- TTL jitter: add a small random offset (e.g., ±10% of TTL) to each entry's expiry time. This spreads expiry events across time and prevents cache stampedes when many keys were populated simultaneously.
Redis offers the allkeys-lru, allkeys-lfu, volatile-lru, and volatile-lfu eviction policies, among others (set via maxmemory-policy in redis.conf). For a general-purpose shared cache where all keys have TTLs, use allkeys-lfu. For a cache where some keys are permanent (no TTL) and some are ephemeral, use volatile-lru to evict only TTL-bearing keys.
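TTL jitter from the list above takes only a few lines. The function name `jittered_ttl` and the 10% default are illustrative; the optional `rng` parameter exists only so the behavior can be made reproducible in tests.

```python
import random


def jittered_ttl(base_ttl_seconds, jitter_fraction=0.10, rng=random):
    """Return the base TTL shifted by a uniform random offset of ±jitter_fraction."""
    spread = base_ttl_seconds * jitter_fraction
    return base_ttl_seconds + rng.uniform(-spread, spread)


# Keys written in the same instant no longer expire in the same instant.
seeded = random.Random(1)
ttls = [jittered_ttl(300, rng=seeded) for _ in range(5)]
print([round(t, 1) for t in ttls])
```

With a 300-second base and 10% jitter, expiries land anywhere in a 60-second window instead of on a single tick, which spreads the rebuild load across that window.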
Tuning in production without load testing is gambling. Always validate changes under realistic load before declaring success. The validation checklist:
- Run a realistic load test: use a load-testing or traffic replay tool (e.g., wrk, k6, Gatling) with a request distribution that mirrors production: not uniform random, but skewed toward hot keys. If 1% of your users account for 80% of requests, your load test should reflect that.
- Compare hit ratios before and after: for each layer, record hit ratio at steady state before the change and after. The change should increase hit ratio at the targeted layer and should not decrease it at adjacent layers.
- Measure p99 latency, not just average: average latency hides tail-latency spikes caused by GC pauses, eviction storms, or cold-start effects after a cache flush. p99 is what your users actually experience on their worst requests.
- Monitor eviction rate over 24 hours: a successful right-sizing will show a drop in eviction rate (fewer entries being thrown out to make room). If eviction rate remains high after right-sizing, the working set is larger than estimated and you need to re-profile.
- Verify end-to-end DB query rate dropped: the final proof that cache tuning worked is that the database sees fewer queries, not just that the cache metrics improved. Instrument DB query throughput (queries per second per table) and confirm it dropped after cache improvements.
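To see why the checklist insists on p99 rather than the average, here is a small sketch with a hypothetical latency sample in which 2% of requests hit a slow path. The nearest-rank `percentile` helper is one common definition among several; monitoring systems may interpolate differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 980 warm requests at 2 ms, 20 requests stalled at 400 ms.
latencies_ms = [2.0] * 980 + [400.0] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.2f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean lands around 10 ms and the median at 2 ms, both of which look healthy, while the p99 exposes the 400 ms stalls that one request in fifty actually experiences.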
Cheat Sheet & Glossary: The 30-Second Recap
Everything from this page distilled into quick-reference cards and a glossary. Bookmark this section for interview prep.
The Latency Hierarchy (commit to memory)
Key Rules of Thumb
Glossary
- cache hierarchy
- The ordered set of cache layers from fastest/smallest (CPU L1) to slowest/largest (CDN), each layer catching misses from the layer above.
- MESI protocol
- The cache coherence protocol used between CPU cores: each cache line is in one of four states (Modified, Exclusive, Shared, Invalid). Prevents two cores seeing different values for the same memory address.
- false sharing
- When two threads modify different variables that happen to live on the same CPU cache line, causing constant MESI invalidations even though the threads aren't sharing data logically. Solved by padding structs to align variables to separate cache lines.
- page cache
- The Linux kernel's in-RAM cache of recently-read or recently-written file data. File I/O goes through the page cache automatically; Postgres, for example, benefits from it without any configuration.
- buffer pool
- The database's own in-process cache of recently-accessed database pages. Postgres calls it shared_buffers; MySQL calls it the InnoDB buffer pool. It sits in front of the OS page cache for database-specific I/O.
- working set
- The subset of data actively accessed in a given time window. Once a cache is large enough to hold the working set, additional capacity yields near-zero improvement in hit ratio.
- temporal locality
- The tendency to re-access the same data soon after accessing it. High temporal locality makes caching effective: the same key is likely in cache when requested again.
- spatial locality
- The tendency to access data near recently-accessed data. Exploited by CPU cache lines (fetch 64 bytes at once) and OS page cache (fetch 4 KB at once, even if only 8 bytes needed).
- miss penalty
- The latency cost paid when a cache lookup fails and must fall through to the next (slower) layer. Miss penalty × miss rate = average added latency from misses.
- write-through / write-around / write-back
- Three strategies for handling writes in a cache. Write-through writes to both cache and backing store synchronously. Write-around skips the cache on write. Write-back (write-behind) writes to the cache and flushes to the backing store asynchronously.
- cache stampede
- When a popular cached item expires and many clients simultaneously attempt to rebuild it, overwhelming the backing store with identical requests. Prevented by mutex locking and TTL jitter.
- TAO
- Facebook's "The Associations and Objects" distributed cache system, described in the USENIX ATC 2013 paper by Bronson et al. Designed for graph data with a write-to-leader, read-from-local-cache pattern across multiple regions.