Eviction Internals — System Guide

Section 1

TL;DR — Eviction in Plain English

Why cache eviction matters: the direct link between picking the wrong eviction victim and a hit-ratio collapse that melts your database
The 7 canonical eviction policies (LRU, LFU, FIFO, Random, MRU, ARC, TinyLFU) — the idea behind each, when each shines, and when each fails spectacularly
Why LRU is O(1): the HashMap + doubly-linked-list trick that makes list reordering constant-time
The failure modes — scan pollution, one-hit wonders, sequential flooding — that bring even battle-tested policies to their knees
How to pick the right policy for any workload: temporal locality vs. frequency locality, and why most caches in the wild use TinyLFU under the hood today

When a cache fills up, every new entry needs to kick something out. The question is: which entry? Pick the wrong one — the one your traffic actually needs — and your hit ratio quietly collapses. Pick the right one — the entry least likely to ever be requested again — and a 1 GB cache can serve 99% of traffic for a 100 GB dataset. Eviction policy is the algorithm that makes that decision, and understanding why each policy works (or breaks) is what separates engineers who configure Redis from engineers who design caching systems.

A cache is finite. Your data isn't. So the moment the cache fills up, you have to make a choice: when a new key arrives, which old key gets thrown out to make room? This is the eviction policy. Different policies use different signals — "when was this last touched?" (recency), "how often has this been requested?" (frequency), "when did this arrive?" (insertion order) — to guess which entry is least valuable to keep. Get the guess right and the working set stays warm. Get it wrong and you spend CPU and DB time constantly refetching entries you just evicted.

LRU (Least Recently Used) throws out the entry you haven't touched the longest — great for bursty, recency-driven traffic, terrible when a full-table scan blasts through the cache. LFU (Least Frequently Used) throws out the entry with the fewest total accesses — great for stable hot keys, terrible when a new viral item can't survive infancy. FIFO throws out the oldest-inserted entry regardless of access — simple and wrong for most real workloads. Random throws out a random entry — surprisingly competitive with LRU in practice because cache misses are cheap compared to tracking recency. MRU (Most Recently Used) throws out the most recently used entry — bizarre but optimal for repeated sequential scans. ARC (Adaptive Replacement Cache) runs both an LRU queue and an LFU queue, dynamically shifting weight between them — adapts automatically to workload character. TinyLFU approximates frequency using a sketch data structure and filters entries through a recency-biased admission gate — the modern industry default because it gets nearly optimal hit ratios at very low memory overhead.

The tricky part is that the best eviction policy depends entirely on your workload's access pattern — and access patterns change. An e-commerce site in the morning is mostly browsing (broad reads across many products). At checkout time it hammers a handful of product and inventory keys. A CDN node sees mostly one-time reads for unique URLs. A social feed sees a power-law distribution where 1% of posts get 99% of reads. LRU handles the e-commerce case well. TinyLFU handles the CDN case. ARC adapts between them. The page that follows goes deep on the mechanics of each policy so you can reason about workloads you've never seen before — not just memorize a decision table.

Eviction policy determines which cache entry gets sacrificed when the cache fills up. The choice of policy determines whether your hit ratio stays high (your working set stays warm) or collapses (you keep evicting the wrong things). Seven canonical policies exist, each exploiting a different signal about future access likelihood. LRU is the most widely deployed; TinyLFU is the modern default for general-purpose caches.

Section 2

Why You Need This — When Your Hit Ratio Quietly Collapses

Most engineers treat eviction as a checkbox. Redis has maxmemory-policy set to allkeys-lru by default. Job done. But there's a specific failure mode that hits production teams every week and takes hours to diagnose, because it doesn't announce itself loudly — it just makes everything slightly slower until suddenly everything is very slow.

The Production Story: One Nightly Job, One Database Fire

Picture a social platform. The cache holds the 2 million most recently accessed user profile objects. At peak hours, 95% of requests hit the cache — the app is snappy, the database idles at 8% CPU, life is good. Then the data team runs a nightly analytics job. The job reads every user row in the database — all 50 million — sequentially. Each read comes back to the application layer, which (because the cache-aside pattern is in place) writes the result to Redis.

Within minutes, those 50 million rows have completely replaced the 2 million warm entries. The analytics job accessed each row exactly once, so none of those 50 million rows will ever be requested again. But LRU doesn't know that — they were all "recently used" as of the analytics scan. Morning traffic arrives. Every user profile request is a cache miss. The database CPU spikes to 85%. The on-call engineer pages at 3 AM wondering why a healthy system is suddenly melting.

This failure has a name: scan pollution (sometimes called "sequential flooding"). The analytics scan polluted the cache with entries that have zero future value, evicting the warm working set that actually served real traffic. It is the canonical LRU failure mode, and it happens any time a large sequential access pattern shares a cache with a hot-key workload.

The Math That Makes Eviction Decisions Life or Death

Let's make this concrete with numbers. Suppose your cache can hold 10,000 entries and your hot working set is 8,000 entries. At steady state with LRU, those 8,000 hot entries live in the cache and hit ratio sits at 95%. Now suppose the analytics job writes 50,000 new entries. LRU evicts the 40,000 least-recently-seen entries — which, because the hot entries haven't been touched during the scan, includes most of your working set. After the scan, your hot entries are mostly gone. Hit ratio drops toward 60%. Every miss costs a database roundtrip: call it 15 ms. At 5,000 requests per second, a drop from 95% to 60% hit ratio means 1,750 more database reads per second instead of 250. That's a 7× increase in database load from a single eviction policy mismatch.

The diagram above shows what scan pollution looks like in the cache's memory. Before the analytics scan, 8,000 of 10,000 slots hold hot working-set entries that real traffic re-reads constantly. After the scan, almost all of those slots are occupied by one-time-read analytics rows that will never be requested again. The cache is still full — but it's full of garbage from the perspective of your production traffic.

The Question Eviction Policy Answers

Every eviction policy is essentially an answer to one question: "Of all the entries currently in the cache, which one is least likely to be requested in the near future?" You can't know the future, so every policy uses a proxy signal — past recency, past frequency, insertion order, or a combination. The gap between the proxy signal and the true future access likelihood is what determines whether you evict the right entry or the wrong one. Scan pollution is what happens when recency (the LRU signal) is a terrible proxy because the most recent accesses came from a one-time scan, not from real traffic.

The insight: A 10% drop in hit ratio can translate to a 5–10× increase in database load, because misses are expensive and compound. If your cache handles 5,000 req/s at 95% hit ratio, only 250 reads/s hit the database. Drop to 85% hit ratio and 750 reads/s hit the database — a 3× increase from what looks like a modest 10-point drop. This is why eviction policy tuning has outsized leverage: even small hit-ratio improvements can prevent database scaling events entirely.

Scan pollution is the canonical LRU failure: a sequential read of N entries blasts the working set out of the cache, collapsing hit ratio and multiplying database load. Eviction policy is the algorithm that decides which entry to sacrifice — and the gap between that algorithm's proxy signal and actual future access likelihood determines how well your cache survives unusual access patterns.

Section 3

Mental Model — The Memory-Pressure Loop

Before we get into specific algorithms, let's build the mental model of what happens inside a cache every single time a new entry needs to go in and the cache is already full. This five-step loop runs billions of times per day in any serious production system, and understanding it in sequence is the foundation for understanding why each policy makes the choices it does.

The Five Steps Every Eviction Follows

Think of the cache as a coat check room at a restaurant — limited hooks, first-come policy, but with a bouncer at the door who decides who has to leave when the room fills up. Here's what happens every time a new coat arrives and the room is full:

New entry arrives — a write request comes in: "cache this key-value pair."
Overflow detected — the cache checks its current size against its configured maximum. It's at capacity.
Policy fires — the eviction algorithm runs and selects a victim: the entry that the policy considers least valuable to keep.
Victim deleted — the victim entry is removed from the cache, freeing one slot.
New entry inserted — the new entry takes the freed slot. The cache is full again, at steady state.

Steps 3 and 4 are where policies differ. FIFO's "policy fires" step is trivial: look at the queue head, that's the victim. LRU's step is more expensive conceptually but still O(1) in a good implementation: look at the tail of the recency-ordered doubly-linked list, that's the victim. TinyLFU's step is the most sophisticated: consult a frequency sketch, compare the incoming entry's estimated frequency to the victim candidate's, and admit the new entry only if it's likely to be accessed more. We'll unpack each of these in later sections.

The diagram above shows the loop as a cycle — it repeats forever as long as traffic writes to a full cache. Notice that step ③ (policy fires) is the only step that varies between eviction algorithms. Every other step is identical regardless of which policy you use. This means that when we compare LRU to LFU to TinyLFU, we're really only comparing one thing: which victim gets selected in step ③ and why.

What "Good" Looks Like

A good eviction policy selects victims that will not be requested again (or will be requested the least) in the near future. The ideal victim is an entry that the system will have to fetch from the database at some point anyway — but that point is so far in the future that paying a miss penalty then is much cheaper than paying it for entries that will be needed imminently. The classic way to measure this is the Miss Ratio Curve (MRC) — but for now, the intuition is: a perfect policy minimizes the number of times you evict something you end up fetching again shortly after.

One more concept worth planting here: ghost entries. Some advanced policies (ARC, TinyLFU) keep a lightweight record of recently evicted keys — just the key, no value. If a ghost entry gets re-requested, it tells the algorithm "you made a bad eviction decision — consider adjusting your strategy." Ghost entries are essentially the cache's mechanism for learning from its mistakes.

Every cache eviction follows the same five-step loop: new entry arrives, overflow detected, policy selects victim, victim deleted, new entry inserted. The only step that differs between policies is step 3 — victim selection. A good policy minimizes re-fetches by evicting entries least likely to be needed soon. Ghost entries let advanced policies detect and learn from bad eviction decisions.

Section 4

Core Concepts — The Vocabulary of Eviction

Before we can talk about why LRU beats FIFO for certain workloads, or why TinyLFU uses a "window" area, we need a shared vocabulary. Eight terms appear constantly in any discussion of eviction. For each one, here's the plain-English meaning first — then the precise technical term you'll see in papers and documentation.

The Eight Terms You Must Know

Working set — The group of cache entries that real traffic actually reads repeatedly. Think of it as "the stuff that matters." If your cache is large enough to hold the entire working set, hit ratio will be high. If it's smaller, some working-set entries keep getting evicted and re-fetched — that's cache churn, and it's where hit-ratio collapses come from.

Recency — How recently an entry was last accessed. An entry that was read 1 millisecond ago has high recency. An entry that was last read 2 hours ago has low recency. Recency is the signal that LRU (Least Recently Used) uses to decide victims: the entry with the lowest recency (longest time since last access) gets evicted first. Recency is a good proxy for future use when traffic has temporal locality — but a terrible proxy when scans blast through the cache.

Frequency — How many times an entry has been accessed in total (or in a recent time window). An entry that's been read 10,000 times is a hot key — even if it hasn't been touched in the last 5 minutes. Frequency is the signal that LFU (Least Frequently Used) uses: the entry with the lowest total access count gets evicted first. Frequency is great for stable hot-key workloads, but it has an "aging" problem: an entry that was hot last week but is irrelevant now keeps a high count and resists eviction even though it's no longer valuable.

Age — How long an entry has been in the cache since it was first inserted. FIFO (First In, First Out) uses age as its signal: the oldest-inserted entry gets evicted regardless of how often it's been accessed since. Age is almost never a good proxy for future access likelihood, which is why FIFO performs poorly on most real workloads.

Churn — The rate at which entries are evicted and then re-fetched. Low churn means the policy is making good eviction decisions (kicking out things that won't be needed). High churn means the policy keeps kicking out things that are immediately re-fetched — every eviction triggers a database read shortly after, which defeats the purpose of caching. The hit ratio is the aggregate measure of churn: low hit ratio = high churn.

Miss penalty — The cost of a single cache miss. When an entry isn't in the cache, the application has to fetch it from the database (or origin server, or S3, etc.) and typically write it back to the cache. The miss penalty is the extra latency and CPU cost of that roundtrip. Miss penalties of 5–50 ms are typical for database reads. In a CDN context, a miss penalty might be a full origin fetch costing hundreds of milliseconds. Eviction policy matters more when the miss penalty is high — because every wrong eviction decision directly costs that many milliseconds of latency for real user requests.

Ghost entries — A lightweight record (just the key, no value) of recently evicted entries. Advanced algorithms like ARC and TinyLFU keep ghost queues. If a ghost entry is re-requested — meaning the cache evicted it but traffic immediately wanted it again — that's evidence of a bad eviction decision. Ghost queues let algorithms self-correct without requiring external tuning.

Miss Ratio Curve (MRC) — A graph where the x-axis is cache capacity and the y-axis is miss ratio. For most realistic workloads, the MRC has a steep drop-off at the working-set size: caches smaller than the working set miss a lot; caches larger than the working set miss very little. The MRC tells you how much each additional megabyte of cache capacity is worth in terms of hit ratio — useful for sizing decisions and for comparing how well two different policies exploit the same capacity.

The diagram above shows how these terms relate. The top group (recency, frequency, age) are input signals — what the eviction policy reads from the cache's metadata to decide which entry to evict. The left side (working set, ghost entries) provide context that helps the policy make better decisions. The bottom group (churn, hit ratio, MRC) are output metrics — how you measure whether the policy's decisions were any good. If your churn is high and hit ratio is low, your policy is selecting bad victims and you should try a different signal.

Eight terms underpin all eviction discussion: working set (entries that matter), recency and frequency (input signals for LRU/LFU), age (FIFO's signal), churn (evict-then-re-fetch rate), miss penalty (cost per cache miss), ghost entries (evicted-key records for self-correction), and MRC (miss rate vs. cache size — the key sizing and policy-comparison tool).

Section 5

The 7 Canonical Eviction Policies — Overview

Seven eviction policies cover the vast majority of what you'll encounter in production systems, academic papers, and system design interviews. Each policy is a different answer to "which entry is least valuable to keep?" — and each one makes that judgment using a different signal and a different data structure. This section gives you the overview; later blocks go deep on each one.

A useful framing before we start: think of each policy as a different kind of "bouncer" at the coat-check door. The bouncer's job is to pick who has to give up their coat hook. Different bouncers use different criteria. One (FIFO) asks "who's been here longest?" Another (LRU) asks "who came in most recently — because that person will probably leave soon and won't need their coat back." Another (TinyLFU) asks "who uses their coat the least across their entire history here?" Different workloads reward different bouncers.

The seven-panel diagram above gives you the quick-reference view. Notice a few patterns. First, the simpler policies (FIFO, Random) have obvious weaknesses because they ignore relevant signals. Second, the "complex" policies (ARC, TinyLFU) exist specifically to fix the failure modes of the simpler ones — they're more sophisticated precisely because real workloads have edge cases that simpler policies can't handle. Third, complexity has a cost: ARC requires tracking twice as many entries (real + ghost), and TinyLFU requires a frequency sketch that needs periodic decay to stay accurate.

The Reference Table

Here's the same information in table form — useful as a quick-reference when you're deciding which policy to reach for:

Policy	Eviction Signal	Best Workload	Killer Weakness	Data Structure
LRU	Recency (last access time)	Bursty, hot-key, temporal locality	Scan pollution — one sequential sweep destroys working set	HashMap + Doubly-Linked List
LFU	Frequency (total accesses)	Stable hot keys, power-law distribution	New-entry starvation — viral content can't survive infancy	Min-heap or frequency-bucket list
FIFO	Age (insertion time)	Streaming data, event queues	Belady's anomaly — evicts items regardless of access count	Circular queue / ring buffer
Random	None (random slot)	Uniform access, hardware TLBs	Non-deterministic; can evict the hottest entry by chance	Array index selection
MRU	Most-recent access (inverted)	Repeated sequential scans (DBs)	Catastrophic for any non-sequential workload	Stack / reversed LRU list
ARC	Recency + frequency (adaptive)	Mixed or shifting workloads	Patent issues (IBM); doubles entry tracking overhead	2 LRU queues + 2 ghost queues
TinyLFU	Frequency sketch (probabilistic)	General purpose (industry default)	Count-Min Sketch needs decay; higher implementation complexity	Count-Min Sketch + SLRU queues

Interview tip: When asked "which eviction policy would you use?", the right answer is never a single policy without context. Lead with: "It depends on the workload's access pattern." Then anchor the tradeoff: recency-driven traffic favors LRU; stable hot-key traffic favors LFU or TinyLFU; workloads that mix both or shift over time favor ARC or TinyLFU. For production general-purpose caches with no specific workload constraints, TinyLFU (used by default in Caffeine, Ristretto, and other modern in-process caches) is widely regarded as the modern default precisely because it adapts well across workload types.

Seven canonical eviction policies cover the design space: LRU (recency), LFU (frequency), FIFO (age), Random (none), MRU (inverted recency), ARC (adaptive recency + frequency), and TinyLFU (probabilistic frequency filter). Simpler policies have sharper failure modes. Complex policies exist to fix those failure modes at the cost of implementation complexity. TinyLFU is the modern general-purpose industry default.

Section 6

LRU Deep Dive — Why HashMap + Doubly-Linked List is O(1)

LRU is the most widely deployed eviction policy in production systems. Redis's default is allkeys-lru. Memcached uses LRU. Countless in-process caches default to it. It's not because LRU is the best policy for all workloads — it isn't, as we saw in S2. It's because LRU is a very good policy for most workloads AND it can be implemented with O(1) operations for every cache operation: get, set, and evict. That combination — good behavior, constant-time guarantees — is hard to beat for a general-purpose default.

The Analogy: A Stack of Parking Permits

Imagine a parking lot with 5 spaces and a single attendant. When a new car arrives and the lot is full, the attendant picks the car whose permit was stamped the longest ago — that's the car least recently "used." But here's the constraint: the lot is busy, and the attendant needs to do this in constant time, not by scanning the whole lot. So instead of checking timestamps on every car, the attendant keeps a physical stack of permits ordered by recency: the most recently arrived car's permit is on top, the least recently arrived car's permit is on the bottom. When a car needs to be evicted, just pull the bottom permit. When a car is accessed (re-entered or moved), pull its permit from wherever it is and put it on top. Every operation is O(1) if you can pull a permit from any position instantly.

That "pull from any position instantly" is the key insight. A regular stack or array can't do that — removing from the middle is O(n). You need a data structure where removing an element from any position is O(1) regardless of where it is. That data structure is the doubly-linked list.

The Data Structure: HashMap + Doubly-Linked List

The classic LRU cache uses exactly two data structures working together:

A doubly-linked list ordered by recency — most recently used at the head, least recently used at the tail. The tail node is always the eviction victim.
A HashMap from cache key to the node's pointer (or reference) in the linked list.

Why do you need the HashMap? Because when a get request arrives for key K, you need to find the node for K in the linked list in O(1) time — without scanning the list. The HashMap gives you direct pointer access to any node, bypassing the list traversal entirely. Once you have the pointer, you can remove the node from its current position (O(1) doubly-linked-list splice) and move it to the head (O(1) insert). Total cost: O(1).

Here's the contract every operation follows:

GET(key): look up node in HashMap → O(1). If found (cache hit), move the node to the head of the list → O(1). Return the value. Total: O(1).
SET(key, value) — no eviction needed: create a new node, insert at head of list → O(1). Insert key→node into HashMap → O(1). Total: O(1).
SET(key, value) — eviction needed: read the tail node (the LRU victim) → O(1). Remove it from the list → O(1). Delete its key from HashMap → O(1). Then do the normal insert for the new entry → O(1). Total: O(1).

The diagram shows a cache with capacity 4 holding keys A, D, B, C — ordered from most recently used (A, at the head) to least recently used (C, at the tail). C is the current eviction victim. The HashMap at the bottom holds a direct pointer to each node, so any get request can jump to the right node in O(1) without walking the list. When key B is accessed, the HashMap immediately locates nodeB, the list splices it out of its current position and re-inserts it at the head — two pointer updates — and the list order becomes A→B→D→C. No list traversal was needed.

A 6-Step Worked Example

Let's walk through a cache with capacity 3 and trace six operations. This is the standard interview demonstration of LRU correctness.

Step ⑤ is the crucial one to understand. When GET A comes in, A is at the tail — it would have been the next victim. But because GET A is a cache hit, LRU promotes A to the head of the list. This single operation changes who gets evicted in step ⑥: B (which hasn't been touched since step ③) becomes the tail and gets evicted instead of A. LRU is essentially a "save yourself by being useful" policy: every time you're accessed, you move to safety. Only entries that nobody reads for the longest time end up at the tail and get evicted.

The O(1) Proof

Let's be precise about why every operation is O(1). The claim rests on three facts about the data structures:

HashMap lookup: average O(1) per standard hash table analysis. In the worst case (all keys hash to the same bucket), it degrades to O(n), but with a good hash function and reasonable load factor this is negligible in practice.
Doubly-linked list removal from an arbitrary position: O(1) given a pointer to the node. You update exactly 4 pointers: node.prev.next = node.next and node.next.prev = node.prev. Done. No traversal needed.
Doubly-linked list insertion at head: O(1). Update 2 pointers: node.next = head; head.prev = node. Point the list's head reference at node. Done.

The HashMap gives you the node pointer in O(1), then the doubly-linked list lets you remove and reinsert in O(1). Put them together and every cache operation — get (hit), set (no eviction), set (with eviction) — is O(1) end-to-end. This is why LRU is so widely deployed: it gives you recency-ordered eviction at the cost of two data structures and zero algorithmic overhead per operation.

The Pseudocode

Here's the core LRU implementation logic — clean pseudocode first so the concept is clear before you see it in a real language:


class LRUCache:
    def __init__(self, capacity):
        self.cap = capacity
        self.map = {}          # key → Node
        # sentinel head/tail nodes — simplify edge cases (no null checks)
        self.head = Node()     # dummy MRU sentinel
        self.tail = Node()     # dummy LRU sentinel / victim pointer
        self.head.next = self.tail
        self.tail.prev = self.head

    def get(self, key):
        if key not in self.map:
            return -1          # cache miss
        node = self.map[key]
        self._remove(node)     # unlink from current position  — O(1)
        self._insert_at_head(node)  # move to MRU position    — O(1)
        return node.value

    def set(self, key, value):
        if key in self.map:
            self._remove(self.map[key])   # update existing entry
        node = Node(key, value)
        self.map[key] = node
        self._insert_at_head(node)        # O(1)
        if len(self.map) > self.cap:
            # evict: the node just before the tail sentinel is the LRU victim
            victim = self.tail.prev
            self._remove(victim)          # O(1) — 4 pointer updates
            del self.map[victim.key]      # O(1) — HashMap delete

    def _remove(self, node):
        # splice node out of the doubly-linked list — 4 pointer updates
        node.prev.next = node.next
        node.next.prev = node.prev

    def _insert_at_head(self, node):
        # insert node immediately after the head sentinel
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

The sentinel head and tail nodes (dummy nodes with no data) are a common implementation trick. They eliminate the special-casing you'd need when the list is empty or when you're removing the first or last real node. With sentinels, every _remove call looks identical: four pointer updates, no if-statements. The LRU victim is always self.tail.prev — the node just before the tail sentinel. No scanning, no comparison, no computation: just one pointer dereference.

The Killer Weakness: Scan Pollution (Preview)

We previewed this in S2, but now you can see it at the data-structure level. When an analytics scan reads 50 million rows and writes them to the cache, each write calls set(), which calls _insert_at_head(). Every scan entry becomes the new MRU. The previously warm working-set entries get pushed toward the tail with every scan insert. After 10,000 scan entries fill a 10,000-slot cache, every working-set entry has been evicted — not because it was cold, but because the scan's volume overwhelmed the capacity. LRU has no way to distinguish "scan entry I wrote once" from "hot entry re-read 10,000 times" — they both look like "recently written" from the list's perspective.

The fix? Policies that use frequency (TinyLFU) can see that the scan entries have been read exactly once, while hot entries have been read thousands of times, and refuse to admit low-frequency scan entries into the main cache. That's the architectural reason why TinyLFU outperforms plain LRU on mixed workloads. We'll cover that in full in a later block.

The key insight: LRU's O(1) is possible because the doubly-linked list and HashMap are complementary: the list maintains recency order without scanning (O(1) moves via pointer splicing), and the HashMap provides O(1) access to any list node by key. Neither data structure alone is sufficient — the list can't find a node by key without traversal, and the HashMap can't maintain recency order. Together they give you both capabilities at O(1).

LRU evicts the entry that hasn't been accessed the longest. Its O(1) guarantee comes from pairing a HashMap (O(1) key lookup → node pointer) with a doubly-linked list (O(1) remove-from-any-position and insert-at-head via pointer splicing). Every cache operation — hit, miss, eviction — is O(1) end-to-end. The killer weakness is scan pollution: sequential reads appear "recent" to LRU and evict the hot working set. TinyLFU fixes this by filtering on frequency rather than recency alone.

Section 7

LFU Deep Dive — Why Frequency Wins for Long-Tail Data

LRU asks: "when did you last show up?" LFU asks a different question entirely: "how often do you show up?" Those two questions produce dramatically different eviction victims, and understanding when each question is the right one to ask is the core skill this section builds.

The Plain-English Intuition: Most-Played vs. Most-Recently-Played

Imagine you're managing a music player's offline cache — you can only keep 100 songs on the device. LRU would keep the 100 songs you most recently played. If you went on a nostalgia trip last night and played 80 obscure 90s songs once each, LRU would evict your top 20 all-time favorites to make room — because those favorites were last played two days ago. That's wrong. LFU would keep the 100 songs you've played the most total times. Your top 20 favorites played 500 times each? They stay. Last night's one-time plays? Gone first. For a stable, predictable "hot songs" workload, LFU's frequency signal is a far better proxy for future access than LRU's recency signal.

This is exactly what happens in databases with long-tail access distributions. A product catalog might have 10 million SKUs, but 200 of them — the bestsellers — account for 80% of all reads. LFU naturally keeps those 200 in cache permanently. LRU might evict them if a broad promotional email causes a burst of one-time lookups across thousands of different product pages.

The Algorithm: Frequency Buckets in O(1)

The naive LFU implementation scans all entries to find the one with the lowest frequency counter — that's O(n) per eviction, which is too slow. The production-grade version, described by Shah, Mitra, and Matani in 2010, achieves O(1) insert, O(1) access update, and O(1) eviction using a clever two-level data structure:

A HashMap from key → node (for O(1) lookups, same as LRU)
A doubly-linked list of frequency buckets — bucket 1 holds all entries accessed exactly once, bucket 2 holds entries accessed twice, etc.
Within each bucket, a doubly-linked list of entries — these are ordered by insertion time within the bucket, so when two entries have the same frequency, the older one (inserted earlier into that bucket) is evicted first
A pointer to the minimum-frequency bucket — always points to the bucket with the fewest accesses, so finding the eviction victim is O(1): just look at the tail of the min-bucket's list

When a key is accessed, its node moves from bucket f to bucket f+1. If bucket f+1 doesn't exist yet, create it. If bucket f is now empty, delete it. If f was the minimum-frequency bucket, the minimum pointer advances to f+1. All of this is O(1) — pointer manipulations on doubly-linked lists, no scanning.

The diagram above shows the bucket chain in action. The min-freq pointer always points at the leftmost (lowest-frequency) bucket — that's where evictions happen. High-frequency hot keys accumulate in high-numbered buckets and are immune from eviction until something less-accessed takes the min-freq slot. Notice that within the min-freq bucket, entries are ordered by insertion time: the oldest entry in that bucket (leftmost) is evicted first, breaking frequency ties fairly.

The trace above shows the key insight: key A was accessed 3 times early on, so even when newer keys B, C, D, and E arrive, A is protected. B gets evicted first (same frequency as C, but inserted into that bucket earlier). C gets evicted next, despite being in the cache longer than D, because D accumulated frequency faster. This is LFU working correctly — it protects long-term hot keys and sacrifices cold ones.

When LFU Outperforms LRU — and When It Fails

LFU wins when your working set is stable over time. Product bestsellers, API endpoints, configuration values, user session tokens for active users — these get accessed repeatedly over hours or days. LFU accumulates enough history to recognize and protect them even under scan-like write bursts.

LFU fails in two situations. First: cache pollution from stale popularity. If a news article went viral yesterday and got accessed 10,000 times, it has a very high frequency count today. But nobody reads yesterday's viral articles. LFU will keep it in cache for a long time because of historical frequency, even though its future access probability has collapsed to near zero. This is sometimes called the "stale popularity" problem — frequency counts are a lagging indicator.

Second: new hot keys can't survive infancy. A brand-new viral article starts with frequency=1. Every existing key in cache has higher frequency. The new hot key will be evicted before it ever gets a chance to accumulate frequency. This is the cold-start problem for LFU, and it's why pure LFU is rarely used in production without modification. TinyLFU (next section) solves both problems.

The rule of thumb: LFU is the right choice when your access distribution is stable (hot keys stay hot for days) and your working set is small relative to cache size (so new keys have time to build frequency before they're threatened with eviction). For dynamic workloads where popularity shifts rapidly, TinyLFU or ARC handle the transitions far better.

LFU keeps the most-frequently-accessed entries by organizing keys into frequency buckets, achieving O(1) eviction via a min-frequency pointer at the lowest-count bucket. It outperforms LRU for stable long-tail workloads but suffers from stale popularity (old viral content hogs cache) and cold-start failure (new hot keys get evicted before accumulating frequency).

Section 8

TinyLFU & W-TinyLFU — How Caffeine Crushes LRU

Here's an uncomfortable truth about real-world cache traffic: most items are one-hit wonders. Studies on web caches, CDN logs, and database access traces consistently show that 60–90% of all unique keys are accessed exactly once, then never again. If your cache admits every key on first access (as LRU and LFU both do), you're constantly evicting genuinely hot keys to make room for items that will never be requested again.

TinyLFU is built on one core insight: before you admit a new key into the cache, check whether it's likely to be accessed again. If it's probably a one-hit wonder, don't evict a warm key for it. This admission filter is the key innovation, and the data structure that makes it memory-efficient is called a Count-Min Sketch.

The Count-Min Sketch: Approximate Frequency in Tiny Space

Think of the Count-Min Sketch as a cheat sheet for frequency counting. Instead of storing exact hit counts for every possible key (which would require memory proportional to your total keyspace — potentially billions of entries), you use a fixed-size 2D array with multiple hash functions. When a key is accessed, you increment one cell per row using a different hash function for each row. To estimate a key's frequency, you read all the cells it maps to and return the minimum — hence "Count-Min." The sketch can over-estimate (due to hash collisions between different keys sharing cells) but never under-estimates, which is the safe direction for an admission filter.

The critical property: a sketch with 8 rows and 4096 columns uses just 32,768 counters — typically 4 bits each — which is about 16 KB of memory. That 16 KB of sketch lets you track approximate frequency for an effectively unlimited keyspace, and TinyLFU's admission decision costs O(1) per access (8 hash lookups and 8 increments).

The diagram shows how one key's frequency is estimated: each of 4 hash functions maps the key to a different column in its row. The estimate is the minimum across those four cells — the min takes care of hash collisions, since a colliding key can only inflate a cell, not deflate it. The true count is always at most the minimum (other keys may have inflated some cells, but not all of them simultaneously, probabilistically speaking).

W-TinyLFU: The Window That Fixes Cold-Start

Plain TinyLFU still has the cold-start problem: a brand-new viral item has frequency=0 in the sketch (it hasn't been seen yet), so the admission filter would reject it from cache. W-TinyLFU — the variant used by Caffeine — solves this with a two-region design:

Window cache (small LRU, ~1% of capacity) — every new item is admitted here unconditionally. This is the incubation zone. New hot items survive here long enough to accumulate frequency in the sketch.
Main cache (TinyLFU-protected, ~99% of capacity) — split into a small LRU segment ("probation") and a large LRU segment ("protected"). Items graduate from probation to protected on their second access within the main cache.

The admission gate sits between the window and the main cache. When an item is evicted from the window, TinyLFU compares its frequency to the frequency of the candidate eviction victim in the main cache. If the window item has higher estimated frequency, it replaces the main-cache victim. If not, it's discarded. This way, one-hit wonders die in the window (harmlessly), while genuinely popular new items build enough frequency during their window stay to win the admission comparison.

The flow above shows why W-TinyLFU handles both failure modes of plain LFU. New viral items enter the window unconditionally — no cold-start problem. If they're accessed again before the window evicts them, their frequency in the sketch climbs above zero and they can win the admission filter comparison. One-hit wonders die quietly in the window and are discarded by the filter rather than evicting genuinely hot main-cache entries. The 1%/20%/79% split is configurable but is the Caffeine default, tuned from benchmarks across many real workloads.

Why Caffeine beats hand-tuned LRU: In benchmark studies on diverse workloads (database traces, CDN logs, web server logs), W-TinyLFU consistently achieves higher hit ratios than pure LRU — often by a substantial margin on workloads with many one-hit wonders. The memory overhead of the Count-Min Sketch is negligible (a few KB for a sketch that covers any cache size), and all operations remain O(1). The only cost is slightly higher implementation complexity, which is why you use Caffeine rather than writing it yourself.

TinyLFU uses a Count-Min Sketch to estimate key frequency in tiny memory, then gates cache admission: new entries only replace existing ones if their estimated frequency is higher. W-TinyLFU adds a small LRU window for unconditional admission of new items, solving the cold-start problem. This two-region design is why Caffeine achieves better hit ratios than classic LRU across most real workloads — it rejects one-hit wonders without starving genuinely new hot keys.

Section 9

ARC — The Adaptive Replacement Cache

LRU is a pure recency policy. LFU is a pure frequency policy. Both are rigid: LRU performs badly when a scan pollutes the cache, and LFU performs badly when access patterns shift. What if the cache could observe its own performance and dynamically shift between recency-biased and frequency-biased eviction? That's exactly what ARC does.

The Four-List Design

ARC maintains four lists, which sounds complex but follows a clean logic:

T1 (recent) — entries that have been accessed exactly once recently. This is the recency arm — behaves like a standard LRU for recent single-access items.
T2 (frequent) — entries that have been accessed at least twice. This is the frequency arm — a key moves here after its second access and stays here as long as it keeps getting hits.
B1 (ghost for T1) — a list of ghost entries from T1. When an item is evicted from T1, its key (but not its value) is kept in B1. No memory is consumed for the value.
B2 (ghost for T2) — ghost entries from T2. Same concept: evicted T2 keys are remembered in B2 without their values.

The total capacity of T1 + T2 = cache size (the real entries). B1 + B2 can grow larger — they're just key lists. A single parameter p controls the balance: p is the target size for T1. When p is large, ARC is biased toward recency (more T1 slots). When p is small, it's biased toward frequency (more T2 slots).

How ARC Adapts — Ghost Hits

Here's the genius: when a cache miss occurs and the missing key is found in B1 (a recent ghost hit), ARC infers that T1 is too small — it's evicting recently-used keys too quickly. So ARC increases p (grows T1, shrinks T2). When a miss occurs and the key is found in B2, ARC infers that T2 is too large — it's evicting frequently-used keys while keeping recency noise. So ARC decreases p (shrinks T1, grows T2). Ghost hits are free feedback signals — they cost no memory for the value, but they tell ARC exactly where it made a mistake.

The diagram shows all four ARC lists at once. The real memory cost is T1 + T2 (the actual cached values). B1 and B2 are metadata-only lists of keys — they track which items were recently evicted, enabling the adaptation signal. When traffic shifts from scan-like to hot-key, B2 ghost hits grow (T2 is too small, hot keys are getting evicted), so p decreases and T2 expands to protect frequency. This self-tuning is why ARC is used in ZFS and IBM DS8000 storage, where workloads vary enormously across different users and time of day.

The adaptation trace shows ARC silently shifting from a 50/50 T1/T2 split to a 20/80 split as the workload transitions from mixed to hot-key. No configuration change, no operator intervention — the ghost-hit signal does all the work. This makes ARC particularly valuable in systems where you can't predict or control the access pattern in advance: storage arrays, database buffer pools, OS page caches.

Where ARC lives in the wild: ZFS uses ARC as its primary page cache algorithm. IBM's DS8000 enterprise storage uses a variant. PostgreSQL briefly adopted ARC in 8.0.0 but dropped it almost immediately due to IBM patent concerns, and has used a CLOCK-sweep variant ever since. ARC is not commonly available as a user-configurable option in Redis or Memcached — those systems use simpler policies for operational simplicity. But understanding ARC's four-list design helps you reason about any cache that claims to "adapt" to workload patterns.

ARC maintains four lists — T1 (recent), T2 (frequent), B1 (T1 ghosts), B2 (T2 ghosts) — and adapts the T1/T2 split based on ghost hits: B1 hits grow T1 (recency bias), B2 hits grow T2 (frequency bias). This self-tuning behavior means ARC consistently outperforms fixed-policy LRU or LFU on workloads that shift between scan-like and hot-key patterns, which is why it appears in ZFS and enterprise storage systems.

Section 10

Random & FIFO — Simple Policies, Surprisingly Resilient

Every engineer's first instinct is to reach for LRU. But there are two policies that look naive on paper and turn out to be surprisingly competitive — or outright optimal — in specific scenarios: Random eviction and FIFO (First In, First Out). Understanding when simplicity beats sophistication is a genuine system design skill.

Random Eviction: The Underrated Contender

Random eviction does exactly what it sounds like: when the cache fills up, pick a key at random and evict it. No tracking of access times, no frequency counters, no data structures beyond the cache itself. The overhead is essentially zero — no bookkeeping on every access, no linked-list pointer updates.

Intuition says random should be terrible. But intuition is wrong for workloads where access locality is weak. Here's why: LRU's advantage over random comes entirely from its ability to exploit temporal locality. When that assumption holds (most workloads), LRU wins comfortably. When it's weak — broadly distributed access with no hot spots — random eviction converges to nearly the same hit ratio as LRU, because LRU's recency signal is no better than a coin flip.

Random eviction genuinely shines in three situations:

Metadata overhead matters — embedded systems and hardware TLBs use random or pseudo-random eviction because the hardware to track recency order would cost more than the hit-ratio gain.
Scan resistance by design — random eviction cannot be scan-polluted. A sequential scan writes many new entries, each evicting a random existing entry. The hot working set survives with probability proportional to its size, not to when it was last accessed.
Fairness across keys — in multi-tenant caches where no key should be privileged over others, random eviction treats all keys equally regardless of access history.

FIFO: Ordered Simplicity

FIFO is insertion-order eviction: the entry that has been in the cache the longest gets evicted first, regardless of how recently or frequently it was accessed. Think of a queue — new entries join at the back, evictions happen at the front. A ring buffer or linked queue implements this with O(1) insert and O(1) evict, no per-access bookkeeping.

FIFO is the right choice when your data has a natural freshness expiry that correlates with insertion time. If your data becomes stale after a fixed window (weather data, stock quotes, rate-limit windows), FIFO naturally evicts the oldest — and therefore most likely stale — entries. It's also extremely predictable: the eviction order is deterministic and auditable, which matters in regulated environments.

FIFO fails when popular items keep getting evicted just because they've been in cache for a long time. A configuration value inserted at startup will be evicted after enough time, even if it's accessed millions of times per minute. For this reason, FIFO is almost always combined with either TTL-based expiry or a second-chance mechanism — the CLOCK algorithm used in Linux's page cache.

Under high temporal locality (a small hot set dominates traffic), LRU's recency tracking gives it a substantial hit-ratio edge over both FIFO and Random. As locality weakens — more unique keys, broader access distribution — all three policies converge toward the same hit ratio. In the convergence zone, Random wins on simplicity: zero per-access bookkeeping, no linked-list updates, no memory overhead beyond the cache entries themselves.

Hardware or embedded context (no LRU metadata cost)
Scan resistance matters more than optimizing hit ratio
Multi-tenant fairness required across keys
Access pattern is known to be broadly uniform

Data freshness correlates with insertion time
TTL-based expiry already handles most eviction
Deterministic, auditable eviction order is required
Combined with second-chance (CLOCK algorithm)

Random eviction has zero bookkeeping overhead and converges to near-LRU hit ratios when temporal locality is weak — making it ideal for hardware TLBs, scan-resistant caches, and broadly uniform workloads. FIFO naturally evicts the oldest (most likely stale) entries and suits time-sensitive data, but fails for long-lived hot keys unless combined with a second-chance mechanism like the CLOCK algorithm used in Linux's page cache.

Section 11

Redis maxmemory & The 8 Eviction Policies

Redis is the cache most engineers reach for first, and it exposes eviction policy as a first-class configuration option. The challenge is that Redis ships with eight eviction policies, and the names alone don't tell you when to use which. This section walks through all eight, explains what "approximated LRU" means (and why it matters), and gives you a decision framework.

Setting maxmemory and maxmemory-policy

First, nothing happens until you set a memory limit. By default, Redis has no maxmemory limit — it will consume all available RAM and let the OS invoke the OOM (Out of Memory) killer, which is a production incident waiting to happen. Always set a limit explicitly:

# In redis.conf
maxmemory 2gb
maxmemory-policy allkeys-lru

# Or at runtime via redis-cli
redis-cli CONFIG SET maxmemory 2gb
redis-cli CONFIG SET maxmemory-policy allkeys-lfu

Once memory hits the ceiling, every write operation — SET, LPUSH, HSET, and so on — triggers the eviction policy to free space before the write proceeds. The maxmemory line sets the hard ceiling; the maxmemory-policy line chooses the algorithm.

The 8 Policies Explained

The eight policies break down along two axes: scope (allkeys = evict any key; volatile = evict only keys that have a TTL set) and algorithm (lru, lfu, random, ttl). The naming pattern makes them easy to remember once you see the grid:

Policy	What it evicts	Best for
noeviction	Nothing — returns an error on writes when full	When data loss is unacceptable; Redis used as a primary store, not a cache
allkeys-lru	Least recently used key across all keys	General-purpose cache; hot/warm access distribution; most common default
allkeys-lfu	Least frequently used key across all keys	Stable access patterns; long-tail distribution; power-law skew in reads
allkeys-random	A random key across all keys	Uniform access patterns; scan-resistance; when simplicity is valued
volatile-lru	Least recently used key among keys with TTL set	Mixed store: some persistent keys (no TTL), some cache keys (with TTL)
volatile-lfu	Least frequently used key among keys with TTL set	Same mixed-store scenario; frequency-biased eviction among expirable keys
volatile-random	Random key among keys with TTL set	Mixed store; uniform cache-key access; simplest volatile option
volatile-ttl	Key with the shortest remaining TTL among TTL keys	When keys expiring soonest are least valuable; TTL encodes business priority

Why Redis's LRU Is "Approximated" — Not Exact

Here's a common interview gotcha: Redis does not maintain a true LRU doubly-linked list across all keys. Instead, when eviction is needed, Redis samples a small set of random keys (default: 5, configurable with maxmemory-samples), compares their last-access timestamps, and evicts the one with the oldest timestamp. This is approximate LRU, and it matters for three reasons:

O(1) per eviction — True LRU requires a linked-list pointer update on every cache read, which is costly in Redis's single-threaded model. Approximated LRU only runs during eviction events, sampling a fixed number of keys — constant time regardless of database size.
Close enough in practice — Redis's own benchmarks show that a sample of 10 produces hit ratios nearly identical to true LRU across diverse workloads. The default of 5 is slightly less accurate but still excellent for most use cases.
LFU is also approximated — Redis implements LFU using a per-key 8-bit counter with a logarithmic increment (Morris counter) that decays over time via the lfu-decay-time setting. This approximates true frequency using just 1 extra byte per key in the key's metadata.

The sampling diagram makes the approximation concrete. Redis picks 5 random keys, finds the oldest last-access timestamp (key:B at 120 seconds), and evicts it. There may be keys elsewhere with timestamps older than 120s — approximate LRU accepts this imprecision in exchange for O(1) cost per eviction event. The decision guide on the right is the fastest path to the right policy for a new service.

Practical default for new services: Start with allkeys-lru. It handles most general caching workloads well. Switch to allkeys-lfu if you observe hot keys getting evicted during bursts of diverse reads — that's a signal your workload has stable frequency patterns LFU can exploit. Monitor keyspace_hits and keyspace_misses in INFO stats output to track hit ratio changes after any policy switch.

Redis exposes 8 eviction policies combining scope (allkeys vs. volatile) with algorithm (LRU, LFU, random, TTL). Its LRU and LFU are both approximated: LRU samples a small random set of keys and evicts the oldest; LFU uses a probabilistic 8-bit counter with time-based decay. The approximation trades a tiny hit-ratio cost for O(1) eviction regardless of database size. Always set maxmemory; use allkeys-lru for general caches, allkeys-lfu for stable hot-key workloads, volatile-* when you need to protect permanent keys from eviction.

Section 12

Memcached Slab Allocator & Per-Slab LRU

Memcached takes a fundamentally different approach to memory management than Redis, and that difference produces a specific, real-world failure mode that can quietly collapse your cache's effective capacity without any obvious cause. To understand the failure — called slab calcification — you first need to understand why Memcached uses slabs at all.

Why Slab Allocation Exists: Defeating Fragmentation

The problem slab allocation solves is memory fragmentation. A general-purpose allocator (like malloc) storing cache entries of wildly varying sizes — some 64 bytes, some 64 KB — ends up with a Swiss-cheese pattern over time: lots of small gaps that are too small to hold new entries, even though total free space is large.

Memcached sidesteps fragmentation entirely by pre-allocating memory in fixed-size chunks called slabs. At startup, Memcached creates a series of slab classes — typically 40–50 classes growing by a factor of roughly 1.25 from ~96 bytes to ~1 MB. When a new cache entry arrives, Memcached rounds it up to the nearest slab class and allocates one chunk from that class. O(1), zero fragmentation within a class.

Per-Slab LRU: The Isolation That Creates the Problem

Here's the critical design consequence: each slab class has its own independent LRU list. Eviction in Memcached does not choose from the globally least-recently-used entry across all entries. It chooses from the least-recently-used entry within the specific slab class that needs a new chunk. Eviction is local to the class that's full.

When this works well: you have a mix of small user-session objects (fitting in slab class 4) and large HTML fragments (fitting in slab class 18). Each class evicts its own LRU entries independently. Small objects don't compete with large objects for eviction slots. Clean separation.

When this creates a catastrophe: enter slab calcification.

The isolation diagram shows the core problem clearly. Slab Class 3 has free chunks that will never be used by Class 6, because slab pages are not transferable between classes at runtime. Class 6 must evict from its own LRU — even if all its entries are stale and would be freely giveable if global eviction existed. The cache is wasting potentially large amounts of memory on data it will never read again, while the slab class that actually needs space keeps evicting its own genuinely warm entries.

The timeline shows the failure progression and the fix. At T0, slab classes are proportionally filled — everything is healthy. At T1, the workload shifts to only small objects. Class 3 fills up and starts aggressively evicting its own warm entries, while Classes 6 and 9 sit full of stale data from the old workload and refuse to give up their memory. The cache's effective capacity collapses to just the capacity of Class 3. The slab automover (T2) detects the mismatch via per-class eviction rate metrics and reassigns whole slab pages from the idle classes to the overloaded one.

The real-world tell: Slab calcification looks exactly like "cache is too small" — hit ratio drops, database load rises. But adding RAM doesn't help because the problem isn't total capacity; it's per-class distribution. Check Memcached's per-slab stats: if one slab class shows a high eviction rate while another shows near-zero eviction with many idle slabs, calcification is the culprit. Fix: enable the slab automover (-o slab_reassign,slab_automove on the Memcached command line, available since 1.4.11) and consider whether your object-size distribution has shifted since the server started.

Memcached uses slab allocation to eliminate memory fragmentation: memory is pre-divided into fixed-size slab classes, each with its own independent LRU. The side effect is slab calcification: when a workload's object-size distribution shifts, the formerly dominant slab classes retain their memory while the newly dominant class evicts aggressively, collapsing effective cache capacity. The slab automover (Memcached 1.4.11+) rebalances slab pages based on per-class eviction rates; Extstore (1.5+) spills cold data to SSD to reduce calcification pressure entirely.

Section 13

CPU Cache & TLB Eviction — Hardware Lessons at Nanosecond Speed

Before Redis, before Memcached, before any software cache you'll ever configure — there is a tiny piece of hardware inside every CPU that has been solving the eviction problem since the 1960s. The L1, L2, and L3 caches inside your processor are real eviction systems, holding between 32 KB and 64 MB of data, making eviction decisions in single-digit nanoseconds. The eviction algorithm the CPU uses — and crucially, why it doesn't use "proper" LRU — teaches a lesson that applies all the way up to distributed caches.

Why the CPU Can't Afford True LRU

Think about what proper LRU requires: every time you access a cache line, you update a timestamp or move a node to the head of a linked list. In software, that costs a handful of CPU instructions — negligible for Redis operating on millisecond timescales. For a CPU cache operating on nanosecond timescales, those extra instructions would cost more than the cache hit saves. The whole point of L1 cache is to return data in 1–2 ns. If maintaining LRU metadata takes 3 ns, you've paid more than you gained.

So CPU designers invented Pseudo-LRU (PLRU) — an approximation that costs almost nothing. Here's the idea: instead of tracking the exact recency order of every cache line, you track one bit per set that says "recently used the left half or the right half?" It's a binary tree of yes/no bits, and eviction just walks the tree following the "least recently used" direction at each level. Four cache lines need only 3 bits. Sixteen cache lines need only 15 bits. The answer is approximate — you might occasionally evict something slightly hotter than the true LRU victim — but you're wrong by at most one level of the tree, and the decision takes a single clock cycle.

The diagram above shows the cache hierarchy that every program you run lives inside. The key insight: as you move outward from the CPU core, size grows but latency grows with it. Each level needs an eviction algorithm, and at L1 the algorithm must complete in a single clock cycle — which rules out any data structure that requires pointer chasing or sorting.

This diagram shows a 4-way set-associative cache using Pseudo-LRU. The root bit says "left half is less recently used." The left subtree bit says "Way 0 is less recently used than Way 1." So the eviction victim is Way 0 — found in two bit-checks, no pointer chasing, no list reordering. When Way 0 gets replaced and later accessed, both bits flip toward it. The whole algorithm runs in a single clock cycle. The lesson from hardware: approximate beats perfect when the cost of perfect exceeds the benefit. This is exactly why TinyLFU uses a Count-Min Sketch instead of exact counters — the same principle, sixty years later, in software.

TLB Eviction — The Same Problem, One Level Up

The TLB (Translation Lookaside Buffer) is another hardware cache with an eviction problem. It holds virtual-to-physical address mappings. When it fills up, it must evict an entry — and once evicted, the next access to that virtual address requires a full page-table walk, costing dozens of nanoseconds. TLBs typically hold 64 to 1,024 entries (tiny!) and use either Pseudo-LRU or fully-random eviction. Why random? At 64 entries, even Pseudo-LRU has measurable overhead per memory access. Random is truly free — and with only 64 slots, the difference in miss rate between random and LRU is small.

The hardware lesson for distributed systems: At every level of the storage hierarchy — L1 cache, TLB, Redis, CDN edge node — the eviction algorithm faces the same constraint: the cost of the algorithm must be less than the cost of a miss. At nanosecond scale, Pseudo-LRU wins. At microsecond scale, true LRU (O(1) with HashMap + doubly-linked list) wins. At millisecond scale, TinyLFU with Count-Min Sketch wins because it approximates frequency cheaply. Scale determines which approximation is worth making.

CPU caches use Pseudo-LRU — a binary tree of bits — because true LRU requires metadata updates that cost more cycles than a cache hit saves. TLBs go further and often use random eviction for the same reason. The lesson that scales all the way to distributed caches: approximate eviction beats perfect eviction when the overhead of perfection exceeds the benefit of accuracy.

Section 14

Browser & CDN Cache Eviction — Disk Limits and Edge Nodes

When you open a website, your browser is running its own eviction system on your hard drive. When a CDN like Cloudflare or Fastly serves your request from an edge node, that edge node is running another eviction system in RAM. Neither system works quite like Redis, and the reasons why matter for anyone designing a web-facing architecture.

Browser Cache: You Don't Control the Eviction Policy

Here's the surprise: Cache-Control: max-age=31536000 tells the browser "this asset is fresh for one year" — but it does not guarantee the asset survives in the disk cache for a year. Browsers cap their disk cache at roughly 5–10% of available disk space (Chrome's heuristic is complex but this is the ballpark). When the cache fills, the browser evicts entries based on its own policy — typically a combination of recency, access frequency, and resource size. A large video file might be evicted after two days even with a one-year max-age, simply because the disk is full and the browser decides the video is too expensive to keep.

What Cache-Control actually controls is freshness — whether the browser even bothers to check with the server before using a cached copy. It's a hint about validity, not a guarantee of retention. The eviction decision is always up to the browser's own heuristics.

This is the most common browser caching misconception in engineering: ETag and Last-Modified are revalidation headers, not invalidation. They let the browser ask "has this changed since I last fetched it?" — and if the server says "no" (304 Not Modified), the browser serves the cached copy. They have no power over whether the browser evicts the entry. If disk space is tight and the browser's eviction policy selects your asset, the ETag is meaningless — the asset is gone.

CDN Eviction: The Two-Tier Edge Model

CDN edge nodes sit geographically close to users and cache content in RAM. When RAM fills up, they need to evict — and the strategy differs from a single-node cache because CDNs have a natural second tier to fall back to. Fastly's architecture is a good example of a real production approach: instead of immediately deleting an eviction candidate, the edge node can demote it to a secondary RAM tier or a disk tier, where it survives at lower cost. Only when even the secondary tier is full does the entry truly disappear.

The two-tier model matters because it halves the effective miss rate. When a popular item gets evicted from L1 due to a sudden traffic spike, it lands in L2 rather than disappearing entirely. If traffic returns to that item within minutes, the L2 hit avoids an origin fetch. The practical implication: when you configure a CDN, cache sizes at L1 are more important than absolute cache size — keeping the working set warm in fast RAM matters more than having a huge disk cache that's slow to serve from.

ETag is not a cache control mechanism. ETag and Last-Modified tell the server "I have version X of this resource — do you have a newer one?" The server's 304 response means "use your copy." This is useful for saving bandwidth on revalidation, but it does not prevent eviction. If you want to force a new asset onto all clients, you change the URL (add a hash to the filename), not the ETag.

Browser caches evict independently of Cache-Control headers — those headers control freshness, not retention. ETag and Last-Modified are revalidation tools, not invalidation tools. CDN edges use a two-tier model where eviction candidates land in a slower tier before true deletion, halving effective miss rates when traffic returns to recently-evicted items.

Section 15

Working Set Sizing & the 80/20 Curve — How Much Cache Do You Actually Need?

Here's the question every team eventually asks: "We're at 75% hit ratio. If we double the cache size, do we get to 90%? Or do we need to triple it?" Without the right tool, you're guessing. With a Miss Ratio Curve (MRC), you can answer that question precisely before spending a dollar on infrastructure.

The Working Set: The Entries That Actually Matter

Not all cached data is equally hot. In most real workloads, access patterns follow a power-law distribution — a small fraction of keys absorb a large fraction of requests. The classic approximation is 80% of requests touch 20% of keys, but the real ratio is often more extreme: 95% of requests touching 5% of keys is common in social feeds and e-commerce. The working set is the set of entries that, if kept warm in cache, will satisfy the vast majority of requests.

Why does this matter for sizing? Because once your cache is large enough to hold the entire working set, adding more cache gives diminishing returns. Going from a cache that holds 50% of the working set to one that holds 100% might move hit ratio from 70% to 95%. Going from 100% to 200% of the working set might move hit ratio from 95% to 96%. The first doubling was worth it. The second doubling was mostly wasted money.

The knee of this curve is exactly where you want to land. To the left of the knee, every extra gigabyte of cache pays off handsomely. To the right, you're paying cloud infrastructure costs for almost no improvement. The question is: where is the knee for your specific workload? This is what the Miss Ratio Curve tells you.

Mattson's Stack Distance Algorithm — One Trace, Every Cache Size

In 1970, Mattson, Gecsei, Slutz, and Traiger at IBM published one of the most elegant results in computer science caching theory. The idea: for each access to a key, count how many other distinct keys were touched since you last saw that key — that count is the "reuse distance" for this access. They called this measurement the stack distance, because conceptually you imagine an infinite LRU stack and the answer is how far down the stack the key currently sits. If stack distance is D, then key K would cause a miss in any LRU cache smaller than D entries, and a hit in any LRU cache of D or more entries.

The elegant payoff: by computing stack distances for the entire trace, you get the miss ratio for every possible cache size at once. You don't need to simulate the cache at size 1 GB, then again at 2 GB, then again at 4 GB. One pass through the trace gives you the full MRC curve. This was revolutionary in 1970 and remains the foundation of MRC analysis today.

An MRC is the decision tool that turns "how much cache should we buy?" from a gut-feel question into a data-driven answer. The curve tells you: at your current cache size, what's the miss ratio? And if you doubled the cache, what would it become? The gap between the LRU and TinyLFU lines on the same curve is the improvement you'd get from upgrading your eviction algorithm without buying any new hardware — often worth investigating before spending money.

Practical MRC without full Mattson simulation: Mattson's full algorithm runs in O(N × C) time (N accesses, C unique keys) which is slow for high-volume systems. Modern approximations like SHARDS (Stateless Hits-per-Request Adaptive Random Down-Sampling) sample 0.1% of requests and produce an MRC within 1% of exact — fast enough to run continuously in production. Several observability platforms expose this as a first-class metric.

The working set is the subset of keys that, if kept warm, absorbs most traffic. Once cache size exceeds the working set, adding more cache yields diminishing returns. Mattson's stack distance algorithm (IBM, 1970) computes the full Miss Ratio Curve in one pass — telling you exactly how much hit ratio gain each dollar of additional cache buys.

Section 16

Scan Pollution & Sequential Flooding — The Killer Failure Mode

You've already seen the story in Section 2: one analytics job, one nightly scan, cache hit ratio from 95% to 60%, 3 AM incident. Now let's go deep on exactly why LRU is defenceless against this, what each production database does about it, and how to protect your own application caches.

The Anatomy of a Scan Pollution Event

When a sequential scan happens, the entries being read are each accessed exactly once — and then never again. But LRU's entire logic is "most recently used is most likely to be used again." So it places every scanned entry at the head of the eviction queue, treating them as the most valuable entries in the cache. Meanwhile, the genuinely hot entries — the ones your real traffic needs — get pushed to the tail and evicted one by one. By the time the scan finishes, the cache is full of entries with zero future value and empty of everything that had value.

The cascade is silent and fast. Each new scan entry doesn't just waste one cache slot — it actively evicts one hot entry. A scan of 1 million rows through a cache of 100,000 slots evicts the entire working set ten times over. The hot entries that get evicted will be re-requested by real traffic, causing real database load. The scan entries that replaced them will never be requested again.

The Four Production Fixes

This failure mode is well-known enough that every major database and caching system has built in a countermeasure. Here are the four most important ones:

Fix 1 — SLRU's Protected/Probation Split

Segmented LRU (SLRU) divides the cache into two regions: a protected segment (typically 80% of cache) for entries that have been accessed more than once, and a probation segment (20%) for newly admitted entries. New entries start in probation. Only when they get a second access do they graduate to the protected segment. Eviction always takes victims from the probation segment first.

Why does this stop scan pollution? Because scan entries are accessed exactly once — they never graduate to protected. When the cache fills, the probation segment fills with scan entries, but the protected segment keeps all the hot working-set entries safe. The scan entries fight each other for probation slots and eventually evict each other. The hot data is untouched.

Fix 2 — TinyLFU's Admission Filter

TinyLFU doesn't just evict entries — it filters incoming entries. Before admitting a new entry to the main cache, TinyLFU checks whether its estimated frequency (from the Count-Min Sketch) is higher than the estimated frequency of the current eviction candidate. If the newcomer has a lower estimated frequency than the victim, it's rejected outright — the victim stays.

For scan entries, their estimated frequency is 0 or 1 — they've never been seen before. The entry they'd evict has been accessed many times and has a high frequency score. TinyLFU's admission filter rejects the scan entry. The scan produces zero impact on the working set. This is the most effective protection against scan pollution of any production cache algorithm.

Fix 3 — PostgreSQL's Ring Buffer for Sequential Scans

PostgreSQL detected this problem and built the fix directly into its buffer pool manager. When the executor begins a sequential scan on a relation larger than one-quarter of shared_buffers, it switches to a BufferAccessStrategy ring buffer — 256 KB by default for sequential reads — instead of allocating freely from the main buffer pool. The scan recycles its own pages within the small ring, so the bulk of the shared buffer pool stays untouched.

The shared buffer pool (configured via shared_buffers) is completely protected. Other queries' hot pages are not disturbed. The ring buffer is recycled at the end of the scan. This is why PostgreSQL's performance under mixed OLTP + analytics workloads is often better than naive expectations — the buffer ring is a first-class protection mechanism, not an afterthought.

# Tune the buffer pool — ring buffer behaviour is automatic # once Postgres detects a sequential scan beyond the threshold shared_buffers = 4GB # main buffer pool — hot OLTP data lives here effective_cache_size = 12GB # hint to the planner (not memory actually allocated) # The ring buffer for seq scans is allocated automatically by the executor. # Size is controlled internally: 256KB for normal seq scans and VACUUM, # 16MB for bulk-write operations like COPY. # Since PostgreSQL 16+, VACUUM's ring size is tunable via vacuum_buffer_usage_limit.

There is no knob to turn here — Postgres handles it automatically whenever it identifies a sequential scan strategy. The key takeaway: if you see sequential scan activity in EXPLAIN ANALYZE, Postgres is already protecting your buffer pool from that scan. The danger is when your application layer is doing the scan (e.g., iterating rows and writing them to Redis) — Postgres protects its own cache, but nothing protects your application cache unless you build that protection yourself.

Fix 4 — MySQL InnoDB's Midpoint Insertion Strategy

MySQL's InnoDB buffer pool uses a clever variant of LRU called midpoint insertion. Instead of inserting newly read pages at the MRU head (which is what standard LRU does), InnoDB inserts them at a configurable point in the middle of the LRU list — by default 37% from the tail. Pages that survive long enough and get re-accessed eventually migrate to the "young" (head) part of the list.

Why does this protect against scan pollution? Pages inserted at the midpoint only survive if they get accessed again before they drift to the tail. Scan pages, accessed once and never again, drift to the tail and get evicted before they can displace the truly hot pages at the head. The hot working set is protected at the front of the list.

[mysqld] # innodb_old_blocks_pct: percentage of buffer pool treated as "old" (tail region) # Default: 37 — new pages are inserted 37% from the tail innodb_old_blocks_pct = 37 # innodb_old_blocks_time: how long (ms) a page must stay in the old region # before a second access promotes it to the "young" region. # Prevents a single scan from rapidly promoting all its pages. # Default: 1000 (1 second) innodb_old_blocks_time = 1000

The innodb_old_blocks_time parameter is the second line of defence: even if a scan page is accessed twice within one second (e.g., two queries hitting the same row during the scan), it still won't get promoted to the hot region. Only pages that survive in the old region for at least 1 second before their second access get promoted — a threshold that eliminates nearly all scan-era accesses.

SLRU's diagram makes the protection crystal clear: scan entries fill the probation segment and evict each other, while the protected segment stays intact. The only way scan entries can damage the protected segment is if there are so many of them that they fill probation and start overflowing — at which point you have a sizing problem, not just a policy problem.

Scan pollution occurs when a sequential read of a large dataset fills LRU's recency queue with one-hit entries, evicting the hot working set. The four production fixes — SLRU's protected segment, TinyLFU's admission filter, Postgres's ring buffer, and MySQL's midpoint insertion — each prevent scan entries from displacing hot data through different mechanisms, but all share the same insight: new arrivals must prove their value before getting premium cache real estate.

Section 17

One-Hit Wonders & The Frequency Filter — The 50% Problem

Scan pollution is dramatic — it's a catastrophic failure that happens fast and is easy to diagnose. The one-hit wonder problem is more insidious: it's a quiet, constant drain on cache efficiency that's present in virtually every cache workload, all day, every day.

What Is a One-Hit Wonder?

A one-hit wonder is a cached entry that gets requested exactly once and then never again. You pay the cache miss cost to fetch it from the database, you pay the memory cost to store it, and then it gets evicted having never served a single cache hit. In many web workloads, around 40–60% of all cache entries fall into this category.

Why is this so common? Think about long-tail content. A news article goes viral for two hours and gets read a million times. The same news site has half a million articles that were published years ago and get one organic search hit per week. Each of those rare articles costs a cache miss every time — but if you cache them on that one hit, you're filling your cache with entries that won't be read again for a week, evicting the genuinely hot entries in the process.

The Two-Miss Math: Why One-Hit Wonders Are Doubly Expensive

Here's the counter-intuitive math. Suppose you have a cache of 100 slots. Your working set is 80 hot entries accessed frequently. The other 20 slots are constantly churned by one-hit wonders — each one-hit wonder enters, occupies a slot for a while, gets evicted to make room for the next one-hit wonder, and so on.

Now consider what happens when a one-hit wonder entry is evicted to make room for a new one. If any of your 80 hot working-set entries happened to be in those 20 churning slots, they just got evicted — and the next access to that hot entry is a miss. So the one-hit wonder cost you not just its own original miss (accessing it from the database the first time), but also the future miss it caused by displacing a hot entry. A single one-hit wonder can produce two database reads: one for itself, one for the entry it evicted. Hence "doubly expensive."

At scale: if 50% of your cache entries are one-hit wonders and your cache has 1 million slots, that's 500,000 slots actively displacing hot data. The effective working-set capacity of your cache is roughly halved.

The diagram captures the core benefit: with TinyLFU's admission filter, a one-hit wonder never makes it into the cache at all. The frequency check asks "is this new entry more likely to be accessed in the future than the current eviction candidate?" For a brand-new entry with zero prior accesses, the answer is almost always "no" — so the hot entry at the eviction candidate position stays put. The admission filter is what allows TinyLFU to achieve near-optimal hit ratios on workloads with heavy one-hit-wonder traffic, typically moving hit ratio from the 65-75% range that LRU achieves up to the 85-93% range.

ARC handles one-hit wonders differently. ARC's ghost lists track the history of recently evicted entries without storing their data. If a one-hit wonder gets evicted and then re-requested, the ghost list knows it was a miss-then-evict pattern — and ARC shifts weight away from the LFU portion of its algorithm for that key. It's a feedback mechanism rather than an upfront filter, but it achieves a similar effect: entries that only ever get requested once never accumulate in the main cache.

One-hit wonders — entries accessed exactly once, never again — make up 40–60% of cache entries in typical web workloads. They waste cache slots and, worse, evict hot entries that serve real traffic. TinyLFU's admission filter solves this by rejecting new entries whose estimated future access frequency is lower than the current eviction candidate's, typically improving hit ratio by 15–25 percentage points over LRU on the same workload.

Section 18

Measurement — Hit Ratio, Miss Cost, MRC Curves & Belady's Optimal

Everything in this page so far has pointed at one skill: knowing when your cache is working and when it isn't. Without measurement, eviction policy tuning is superstition. With the right metrics, it becomes engineering. This section walks through the metrics that actually matter — and the one theoretical benchmark that tells you how close to perfect your cache actually is.

Hit Ratio: The Vanity Metric

Hit ratio — the fraction of cache lookups that return a hit — is the most commonly reported cache metric and often the least useful one on its own. Here's why: a cache hit for a key whose miss cost is 0.1 ms is worth nothing. A cache hit for a key whose miss cost is 800 ms (because the fallback requires a slow external API call) is worth everything. Two caches can have identical hit ratios but wildly different performance impact depending on what they're caching.

Hit ratio is useful as a trend indicator — if it drops from 92% to 78% over an hour, something changed and needs investigation. But it's not useful as a standalone success metric. "Our hit ratio is 95%" sounds great until you realise 90% of your traffic is hitting a tiny set of trivially cheap keys and the 5% of misses are all for your slowest database queries.

Miss Cost: The Metric That Actually Matters

The metric that actually predicts user experience is miss cost — the latency, compute, or dollar cost incurred when a cache miss forces a fallback to the source of truth. The formula for average request latency is:

Average request latency = (hit_ratio × cache_hit_latency) + (miss_ratio × miss_cost)

Example: hit_ratio=0.92, cache_hit_latency=1ms, miss_cost=50ms
→ avg latency = (0.92 × 1ms) + (0.08 × 50ms) = 0.92ms + 4ms = 4.92ms

Improve hit_ratio to 0.96 (same cache, better eviction policy):
→ avg latency = (0.96 × 1ms) + (0.04 × 50ms) = 0.96ms + 2ms = 2.96ms

A 4-percentage-point improvement in hit ratio cut average latency by 40% — because the miss cost dominates.

This is why measuring P99 miss latency — the 99th percentile of requests that missed the cache — is often more valuable than measuring average hit ratio. If your P99 miss latency is 2 seconds (because it requires a complex join on a cold database), then your eviction policy should be optimised to keep the keys that serve that join result warm, even if those keys represent a tiny fraction of total traffic.

The Miss Ratio Curve: Your Capital Expenditure Tool

We introduced the MRC in Section 15. Here we use it for decision-making. The MRC answers three questions that determine caching infrastructure spending:

"Is more cache worth it?" — Look at the slope of the MRC at your current cache size. If the slope is steep (miss ratio drops significantly with each additional GB), adding cache has high ROI. If the slope is flat (diminishing returns), you're past the knee and more hardware won't help much.
"Should I change eviction policy first?" — Plot the MRC for your current policy and for TinyLFU on the same chart. If TinyLFU's curve is significantly lower at your current cache size, switching policies (a software change, free) might get you more hit-ratio improvement than doubling RAM (an infrastructure cost).
"What's my working set size?" — The MRC knee is the working set. If your current cache is much smaller than the knee, you're memory-constrained. If you're well past the knee, you might be over-provisioned.

Belady's Optimal — The Unreachable Ceiling

In 1966, László Bélády at IBM published a result that defined the theoretical upper bound for any cache eviction policy. His algorithm, now called Belady's MIN algorithm or just "Belady's optimal," is simple: when you must evict an entry, evict the entry whose next access is furthest in the future. If an entry will never be accessed again, evict it first. If an entry will be accessed in 1 second and another in 1 hour, keep the 1-second one.

Belady's algorithm produces the minimum possible miss ratio for any given cache size and access trace. It is provably optimal. There is no algorithm that can do better on the same trace and same cache size — this is mathematically proven, not just empirically observed.

The catch is obvious: Belady's algorithm is a clairvoyant oracle. It requires knowing the future — specifically, when each key will next be accessed. In any real system, you don't know the future. So Belady's optimal is unachievable in practice. It's used as a benchmark: you measure how close your real policy gets.

The contrast above is extreme by design, but it illustrates the real lesson: optimise for miss cost, not hit ratio. If you have one keyspace where misses cost 2,000ms and another where misses cost 0.5ms, your eviction policy should be biased toward keeping the 2,000ms keys warm, even at the cost of slightly lower overall hit ratio.

The Belady curve sits at the theoretical floor — no algorithm, no matter how clever, can have a lower miss ratio for a given cache size on a given workload. The LRU curve sits noticeably above it, especially at intermediate cache sizes where the gap between optimal and LRU is largest. TinyLFU's curve sits within 1–2 percentage points of Belady optimal on most production workloads — this is why it became the industry default. It's the closest any practical, online algorithm has come to matching a clairvoyant oracle.

A Working MRC Pseudocode Sketch

For completeness, here's the core idea of stack-distance computation — the algorithm behind MRC generation:

# Stack distance algorithm (Mattson et al., 1970) # Simplified pseudocode — actual implementations use a sorted tree # for O(N log C) time instead of the O(N*C) naive version shown here. from collections import OrderedDict def compute_miss_ratio_curve(access_trace): """ Given a sequence of key accesses (the 'trace'), compute the miss ratio for every possible LRU cache size in a single pass. """ stack = OrderedDict() # ordered by recency: front = MRU, back = LRU # stack_distance_counts[d] = number of accesses with stack distance d stack_distance_counts = {} total_accesses = 0 for key in access_trace: total_accesses += 1 if key in stack: # Key is in the "virtual infinite cache" # Stack distance = position from the MRU end distance = list(stack.keys()).index(key) # O(N) — use a Fenwick tree in production stack_distance_counts[distance] = stack_distance_counts.get(distance, 0) + 1 del stack[key] # move to MRU position else: # Key has never been seen, or was evicted from infinite cache stack_distance_counts['inf'] = stack_distance_counts.get('inf', 0) + 1 stack[key] = True # insert at MRU front # Build the MRC: for each cache size C, # miss_ratio(C) = (accesses with stack_distance >= C) / total_accesses miss_ratio = {} cumulative_hits = 0 for d in sorted(k for k in stack_distance_counts if k != 'inf'): cumulative_hits += stack_distance_counts[d] miss_ratio[d + 1] = 1.0 - (cumulative_hits / total_accesses) return miss_ratio # {cache_size: miss_ratio} for all sizes at once

Walk through the logic: stack is a conceptual "infinite LRU cache" — it holds every key ever seen, ordered by recency. When we see key K, its position in the stack is its stack distance — the number of distinct keys accessed since we last saw K. If K is at position 5, it would be a hit in any LRU cache of size 6 or larger, and a miss in any LRU cache of size 5 or smaller. Counting up all the stack distances for the entire trace gives us the full miss ratio curve in one pass. The bottleneck is finding the position, which naive implementations solve in O(N) time — production implementations use an augmented balanced BST (Fenwick tree) for O(log N) per lookup.

Putting It Together: The Measurement Checklist

Before tuning any eviction policy, collect these four numbers:

Overall hit ratio — baseline, trend over time, not an absolute target
Hit ratio per keyspace — split by key prefix. Often one keyspace has 40% hit ratio dragging the whole system down.
P99 miss latency — the cost of a miss for each keyspace. This tells you which misses to prioritise eliminating.
MRC curve — even an approximated one. Tells you whether the fix is "more cache" or "better policy" before you spend any money.

Hit ratio alone is a vanity metric — what matters is miss cost multiplied by miss rate. Belady's MIN algorithm (1966) defines the theoretical optimum: evict whatever entry will be accessed furthest in the future. This is unachievable in practice because it requires knowing the future, but TinyLFU gets within 1–2% of Belady optimal on most workloads — the closest any online algorithm has come. The MRC curve translates policy quality into a capital expenditure tool: it tells you whether switching algorithms or buying more RAM delivers better ROI for your specific traffic pattern.

Section 19

Tuning Eviction in Production — A Decision Playbook

Theory is useful in interviews. Production is where you actually pay for bad decisions. This section is the five-step playbook you run when you need to answer the question: "Which eviction policy should I pick for this specific system, and how big should the cache be?" The answer is never "just use LRU" — it always starts with measuring what you actually have.

The top row shows the five steps in order. The bottom table summarises step 4 — matching workload character to the right policy. Steps 1-3 are measurement; step 4 is the decision; step 5 is the feedback loop that keeps the decision honest over time.

Step 1 — Measure Hit Ratio Before Changing Anything

The single most common eviction-tuning mistake is changing the policy before knowing what the current policy is doing. Run INFO stats in Redis and look at keyspace_hits and keyspace_misses. Calculate hit_ratio = hits / (hits + misses). If it's already above 95% and latency is fine, you don't have an eviction problem — you might have a capacity problem, a TTL problem, or no problem at all. Measure first. Diagnose second. Change last.

Step 2 — Sample a Workload Trace and Build an MRC

A Miss Ratio Curve (MRC) tells you what your actual working set looks like. You don't need to run the full MRC algorithm on production traffic (that's expensive). Sample 1% of requests for an hour using a log drain or the Redis MONITOR command (with extreme caution — MONITOR halves Redis throughput). Feed that trace into a tool like the SHARDS algorithm or even a simple Python simulation. The curve will show you whether your working set fits in memory, or whether you're fighting a fundamentally oversized problem.

Step 3 — Estimate the Marginal Cost per GB of Hit Ratio Gain

Not every hit-ratio improvement is worth the RAM it costs. Use the MRC to find the "knee" — the point where adding more cache gives rapidly diminishing returns. Then compute the economics: if adding 1 GB of cache RAM costs some amount per month, and it buys you a 2% hit ratio gain, ask what 2% fewer cache misses saves in DB compute cost and latency SLA. Sometimes the math is clear (a busy DB cluster charges a lot per additional query, so more cache pays for itself quickly). Sometimes you're past the knee and adding 10x the RAM only buys a 0.3% improvement — in which case, optimise something else.

Step 4 — Pick the Policy Matched to Your Workload

Now you can use the MRC data and what you know about your access pattern to pick a policy. Here are the four most common production scenarios:

Long-tail content (Zipfian distribution, stable hot keys): A small fraction of keys gets the vast majority of hits — product images, top articles, popular user profiles. LFU or TinyLFU is the right choice because frequency data accurately identifies the hot keys and protects them from eviction even if they weren't accessed in the last few seconds.
Recency-dominated (social feeds, session data, real-time events): The most recently created items are the most valuable, and popularity shifts quickly. Standard LRU or Redis's volatile-lru handles this well. The hot set changes constantly, so frequency history becomes stale quickly — recency is a better proxy for future access than historical counts.
Analytics / broad uniform reads: Every key gets accessed roughly once as queries scan across many records. LRU actually performs badly here (every scan entry looks equally "recent"). Random eviction is surprisingly competitive because there's no hot set to protect — you're just keeping a random sample of the data universe.
Unknown or mixed workloads: If you genuinely don't know the access pattern, or it shifts significantly across the day, use ARC or TinyLFU. Both adapt automatically. TinyLFU in particular has become the default for new general-purpose caches (Caffeine, Ristretto) precisely because it degrades gracefully across workload types without manual tuning.

Step 5 — Monitor Continuously and Adjust

Access patterns drift. A product launch changes what keys are hot. A new reporting feature adds a scan workload you didn't have before. Export keyspace_hits, keyspace_misses, and evicted_keys to your metrics platform (Prometheus, Datadog, CloudWatch — whatever you have). Alert when hit ratio drops more than 5% from baseline. Track eviction rate as a percentage of total requests — if you're evicting more than 5-10% of requests, the cache is too small or the workload has changed. Review the policy choice quarterly if the product changes frequently.

The single most actionable tip: set up a keyspace_hit_ratio metric in your monitoring dashboard today, before you need it. Teams that don't have this metric spend hours during incidents guessing whether the cache is the problem. Teams that do have it diagnose eviction issues in minutes.

Production eviction tuning is a five-step loop: measure current hit ratio, build an MRC from a sampled trace, estimate the economics of adding more cache, pick the policy that matches your workload character, then monitor continuously. Long-tail/Zipfian workloads favour LFU or TinyLFU; recency-dominated workloads favour LRU; broad-scan analytics workloads favour Random; adaptive policies like ARC and TinyLFU are the right default when the workload is mixed or unknown.

Section 20

Bug Studies — When Eviction Goes Wrong in Production

The best way to understand why eviction policy matters is to see it break. These four incidents follow the shape of real production failures — the exact numbers have been softened, but the failure modes are documented and recur regularly. Read each one like a post-mortem: what was the system expecting, what did the eviction policy actually do, and what would have prevented it.

Bug #1 — Nightly Analytics Job Collapses the Morning Hit Ratio

Incident: A web application ran a nightly reporting job that read every row from a large table. By morning, the cache hit ratio had dropped from a healthy baseline to well below 50%. The database CPU spiked hard at peak hour. On-call pages fired. The job had been running for months before this was noticed — the team had gradually scaled the reporting query to cover more rows over time until it finally exceeded the cache capacity.

What Went Wrong

The application used a cache-aside pattern with Redis set to allkeys-lru. The reporting job read millions of rows sequentially — each one a unique key that would never be accessed again by real user traffic. Because the cache-aside layer wrote each result into Redis before discarding it, the LRU eviction saw all those rows as "recently used." By the end of the scan, the rows that real users needed — the hot working set — had been completely evicted to make room for one-time analytical reads.

The diagram shows the eviction damage in two snapshots. Before the scan, the cache is healthy — hot working-set entries fill most of the slots and 95% of requests hit. After the scan finishes, those same slots hold one-time analytical rows that real traffic will never request again, and the hit ratio is more than cut in half. The cache is still "full" but its useful content is gone.

The Fix

# Cache-aside: write every analytical result to Redis def get_row_for_report(row_id): cached = redis.get(f"row:{row_id}") if cached: return cached row = db.query(f"SELECT * FROM events WHERE id = {row_id}") # BUG: writes one-time analytical row into the shared user-facing cache redis.setex(f"row:{row_id}", 3600, serialize(row)) return row

# Fixed: analytical reads bypass the user-facing cache entirely # Option A: read directly from the DB replica — no cache writes def get_row_for_report(row_id): return db_replica.query(f"SELECT * FROM events WHERE id = {row_id}") # Option B: if caching analytical results is useful, use a separate # Redis instance (or different keyspace) with its own eviction policy def get_row_for_report(row_id): cached = analytics_redis.get(f"report-row:{row_id}") if cached: return cached row = db_replica.query(f"SELECT * FROM events WHERE id = {row_id}") analytics_redis.setex(f"report-row:{row_id}", 300, serialize(row)) return row

Lesson: Scan workloads and hot-key workloads must not share a cache. Either route analytics reads to a separate cache instance, bypass caching entirely for batch jobs, or switch to TinyLFU which uses an admission filter to reject one-hit-wonder entries before they can pollute the cache.

How to Spot: Watch the evicted_keys counter in Redis. If it spikes at a consistent time of day (when a batch job runs), and the hit ratio drops shortly after, scan pollution is almost certainly the culprit.

Bug #2 — Memcached Slab Calcification After a Product Launch

Incident: A team launched a new feature that changed the average size of cached objects from around 200 bytes to around 4 KB (richer product metadata). Memcached memory usage barely changed, but hit ratio gradually degraded over the following days, and eventually cache memory appeared "full" while actual useful data density was low. The on-call engineer tried flushing and re-warming, which helped briefly before the problem returned.

What Went Wrong

Memcached allocates memory in fixed-size slab classes. Before the launch, most objects were under 256 bytes, so the 256-byte slab class had accumulated the lion's share of memory. After the launch, new 4 KB objects tried to go into the 4096-byte slab class — which had very few pages allocated to it.

Memcached evicted aggressively from the 4096-byte class even while the 256-byte class sat nearly idle, because slab pages are not transferable between classes at runtime without the automover. The result: new, heavily-accessed 4 KB objects were constantly being evicted because their slab class was starved, while old, stale 256-byte objects happily occupied most of the cache's memory.

Lesson: When object sizes in your workload change significantly (new feature, schema change, serialisation format change), Memcached slab allocation becomes unbalanced. The fix is to restart Memcached so it can re-allocate slab pages to the new size distribution — or to use Redis, which does not have this problem, as a slab-free allocator.

How to Spot: Use stats slabs and stats items in Memcached. Look for slab classes with high evictions and low total_pages relative to hit rates. A healthy cache has evictions spread proportionally to slab usage — a calcified cache has one hot slab class with massive evictions and others nearly empty.

Bug #3 — Redis Sampled LRU Producing Wildly Suboptimal Evictions at Scale

Incident: A team running Redis with allkeys-lru on a dataset of tens of millions of keys noticed that their observed hit ratio was significantly lower than what an ideal LRU should produce for their access pattern. Profiling showed many recently-accessed hot keys were being evicted while genuinely stale keys remained. The Redis version in use had maxmemory-samples set to the default of 5.

What Went Wrong

Redis LRU is not a true LRU. When it needs to evict, it samples a small number of random keys (5 by default) and evicts the one with the oldest LRU clock value. With only 5 samples across tens of millions of keys, the probability of picking a genuinely old key is low. The algorithm frequently evicts keys that just happened to have an older timestamp in the random sample, even if much older keys exist elsewhere. The result looks like LRU to the configuration, but behaves much closer to Random eviction at scale.

# Default: samples only 5 keys — adequate for small datasets, # poor approximation at tens of millions of keys maxmemory-policy allkeys-lru maxmemory-samples 5

# Increase sample size for better LRU approximation at scale. # 10 samples gives near-perfect LRU accuracy with modest CPU overhead. # For very large key spaces where frequency matters, switch to lfu entirely. maxmemory-policy allkeys-lfu # frequency-aware, no sampling needed # -- OR -- maxmemory-policy allkeys-lru maxmemory-samples 10 # 2x samples, ~2x CPU cost on eviction path

Lesson: Redis LRU is approximate by design — exact LRU requires a doubly-linked list reordering on every access, which Redis avoids to keep access paths fast. If your dataset is large and accurate recency ordering matters, either raise maxmemory-samples to 10, or switch to allkeys-lfu (added in Redis 4.0), which uses a frequency counter that doesn't require sampling at eviction time.

How to Spot: Compare observed hit ratio against a theoretical LRU simulation on your access trace. If observed is significantly lower, sampled LRU approximation may be the gap. Also check the Redis docs for your version — the default sample size changed across Redis releases.

Bug #4 — TTL Race Causing Cascading Misses Under Load

Incident: A product catalogue cache used a uniform TTL of 300 seconds on all keys. Every day at the same time, a wave of cache misses would fire simultaneously — the DB would spike, latency would jump, and the system would be slow for a few minutes before recovering. The team initially suspected a traffic spike but the pattern was too regular to be user-driven.

What Went Wrong

The cache was warmed at startup (or after a periodic flush) and all keys were written to Redis at roughly the same time, all with the same TTL of 300 seconds. When the TTL expired 5 minutes later, all keys expired simultaneously. Every incoming request found a cold cache. Every one of those requests went to the database at the same moment. This pattern is called a cache stampede (or thundering herd), and it's caused by synchronised TTL expiry — not eviction exactly, but it interacts with eviction policy because the eviction algorithm doesn't protect you from coordinated expiry.

# All keys written at startup with the same TTL → expire simultaneously def warm_catalogue(products): for product in products: redis.setex(f"product:{product.id}", 300, serialize(product)) # BUG: uniform TTL means they ALL expire at T+300

import random def warm_catalogue(products): base_ttl = 300 for product in products: # Jitter: +/- 20% so expirations spread across ~120s window jitter = random.randint(-60, 60) ttl = base_ttl + jitter redis.setex(f"product:{product.id}", ttl, serialize(product)) # Alternative: probabilistic early refresh # When remaining TTL < threshold AND random() < p, refresh proactively def get_product(product_id): key = f"product:{product_id}" remaining_ttl = redis.ttl(key) value = redis.get(key) if value and remaining_ttl < 30 and random.random() < 0.1: # 10% of requests proactively refresh when TTL is low refresh_in_background(product_id) return value

Lesson: Always add random jitter to TTLs when populating a cache from a bulk load. A ±20% jitter spreads expiry over a window wide enough to prevent stampedes while keeping data reasonably fresh. For high-value keys, consider probabilistic early refresh (refresh the value before it expires, triggered by a small percentage of near-expiry reads).

How to Spot: If your database CPU shows a perfectly regular spike every N seconds or minutes (exactly matching your TTL), and the spike is preceded by a brief dip in cache hit ratio, you have a stampede pattern. The regularity is the giveaway — organic traffic spikes are never that precise.

Four recurring production failure shapes: scan pollution (a batch job evicts the hot working set via LRU), Memcached slab calcification (object size changes starve the right slab class), Redis sampled-LRU suboptimality at scale (small sample sizes make LRU approximate Random), and TTL stampedes (uniform expiry causes coordinated cache misses). All four are preventable: use TinyLFU or separate caches for scan workloads, restart Memcached after size distribution changes, raise sample size or switch to LFU in Redis, and always jitter TTLs on bulk writes.

Section 21

Common Misconceptions — What Engineers Get Wrong

Eviction is one of those topics where a little knowledge confidently leads to bad decisions. Here are the six most common mental models that sound right until you test them in production.

The myth: When you set maxmemory-policy allkeys-lru in Redis, Redis maintains a true least-recently-used ordering of all keys and always evicts the single least-recently-used key.

The reality: Redis LRU is sampled. When Redis needs to evict, it picks a small random sample of keys (5 by default, tunable via maxmemory-samples) and evicts the one with the oldest lru_clock timestamp among that sample. It does not inspect all keys. This is a deliberate engineering trade-off: maintaining a true LRU doubly-linked list across every single access would add overhead to every read and write operation. The sampled approach keeps the eviction path fast but introduces approximation — especially on very large key spaces where the probability of sampling the globally-oldest key in 5 tries is low.

Why it matters: If you're relying on "LRU will definitely keep my recently-used hot keys" as a correctness guarantee, you need to understand it's a probabilistic approximation. The larger the key space and the smaller the sample, the further the actual behaviour drifts from true LRU. For large key spaces where accuracy matters more than approximation speed, use allkeys-lfu or increase maxmemory-samples to 10.

The myth: The goal of cache tuning is to maximise hit ratio. A 99% hit ratio is always better than a 95% hit ratio, so always add more cache until you hit 99%+.

The reality: Hit ratio improvements are subject to diminishing returns, and the value of a hit depends on the cost of a miss. Going from 80% to 90% hit ratio might require doubling the cache (the working set is large, you're near the knee of the MRC). Going from 98% to 99% might require 5x more memory because those last few percent of keys are accessed rarely but are still in the working set. The math you need to do: what does one DB miss cost in latency, compute, and money? If a DB read costs very little (fast read replica, cheap compute), a 95% hit ratio is probably fine. If DB reads are expensive (cross-region, heavy joins, shared DB with little headroom), 99% might be worth the extra RAM.

Why it matters: Blindly optimising hit ratio leads to over-provisioning cache RAM. Always compute the marginal benefit per GB added and compare to the marginal cost of that GB. The MRC gives you the inflection point — the knee beyond which additional cache investment yields rapidly diminishing returns.

The myth: LFU is objectively superior because it uses more information (how often a key was accessed, not just when it was last accessed). More data = better decisions. Always use LFU.

The reality: LFU has a well-known failure mode called the stale popularity problem (also called the "frequency decay" problem). A key that was extremely popular six months ago has accumulated a huge frequency count. If the content it represents is now irrelevant (old news article, retired product, last season's campaign page), LFU will protect it from eviction almost indefinitely — it has a high frequency count and will always win over newer, potentially more relevant keys with low counts. Meanwhile, a genuinely hot new key that launched today has a low frequency count and is at high risk of early eviction.

Why it matters: Plain LFU without frequency decay is inappropriate for workloads where popularity is time-sensitive. This is why TinyLFU uses a frequency sketch with periodic aging — it halves all frequency counts periodically so old popularity fades and new popularity can gain traction. If you use LFU in Redis, it has built-in decay via the lfu-decay-time config parameter (default: 1 minute). Set this to match how quickly popularity changes in your workload.

The myth: Random eviction is a lazy fallback with no intelligence. LRU must always beat it because LRU at least tracks recency — Random throws away a key that might be very hot.

The reality: For workloads where the access distribution is uniform (or close to uniform), random eviction performs nearly as well as LRU — and sometimes better. Here's the intuition: if every key has a roughly equal probability of being accessed next, then the probability of a randomly-chosen eviction victim being the next-needed key is about 1/(cache size). That's the same probability as any other eviction strategy, because no strategy can predict which of N equally-likely keys will be accessed next. The benefit of Random is that it carries zero metadata overhead — no linked list, no access timestamps, no frequency counters. This makes it extremely fast and cache-friendly. For analytics workloads or batch processing where access distribution is broad and uniform, Random is often the right choice.

Why it matters: Don't dismiss Random as "not serious." Several production databases and CDNs use random eviction because the workload is broad enough that recency and frequency metadata provides minimal benefit over the cost of maintaining it.

The myth: Setting a TTL on a cache key is a form of eviction. Expired keys are evicted by the cache.

The reality: Expiry (TTL) and eviction are two completely separate mechanisms with different triggers and different purposes. Expiry is invalidation — a key expires when it becomes stale (data freshness concern). Eviction is capacity management — a key is evicted when the cache is full and needs space (memory pressure concern). A key can be evicted before its TTL expires (if memory is under pressure). A key can persist well past its nominal TTL expiry if memory is not under pressure and the cache backend implements lazy expiry (Redis does this — expired keys are cleaned up on next access or by a background sweep, not immediately on the TTL tick).

Why it matters: If you're relying on TTLs to keep memory usage bounded ("all keys expire in 10 minutes, so memory can't fill up"), you're wrong. Between bulk writes and TTL expiry, memory can absolutely fill up, triggering eviction of entries that haven't yet expired. Set maxmemory in Redis to enforce a hard cap, and treat eviction policy and TTL as two independent, complementary tools — not the same tool.

The myth: Cache is fast, databases are slow, therefore more cache is always better. When in doubt, provision more Redis RAM.

The reality: The MRC curve has a characteristic shape — steep improvement at first as the most-popular items fit into cache, then a knee where improvement slows dramatically, then near-flat tail where adding large amounts of cache buys almost no additional hit ratio improvement. Past the knee, you're caching items that will be accessed so rarely that even if you cache them perfectly, the hit ratio won't meaningfully improve. Beyond the MRC argument, there are practical costs: more Redis RAM means bigger RDB snapshots, longer AOF replay times on restart, more memory pressure on the host, higher cloud costs, and more replication traffic if you run a replica. The right model is: cache the hot working set, measure where the MRC knee is, and stop there. Use the saved budget on database read replicas or query optimisation for the long-tail traffic that isn't worth caching.

Six misconceptions to retire: Redis LRU is approximate (sampled), not exact; higher hit ratio isn't always worth the cost (check the MRC knee first); LFU beats LRU only when frequency is stable (stale popularity kills it otherwise); Random isn't dumb (it's competitive on uniform-access workloads); TTLs do not evict things (they invalidate — eviction is a separate mechanism); and more cache doesn't always help (diminishing returns are real past the working-set knee).

Section 22

Practice Exercises — Build Your Intuition

Understanding eviction policy on paper is one thing. Building intuition — the kind that fires instantly when you're staring at a production incident or a system design whiteboard — requires actually working through problems. These five exercises range from concrete implementation to open-ended design, covering the skills an interviewer would probe.

Implement an LRU cache that supports get(key) and put(key, value) in O(1) time. The cache has a fixed capacity and evicts the least recently used entry when full.

Constraints: O(1) for both operations. No language-specific library (no Java LinkedHashMap shortcuts).

You need two data structures: one for O(1) key lookup, and one for O(1) ordering. A HashMap gives you O(1) lookup. A doubly-linked list gives you O(1) insertion, deletion, and move-to-front — because you can splice any node in O(1) if you have a direct pointer to it. The HashMap stores pointers to list nodes, not the values directly. On every get, move the accessed node to the head of the list. On put when full, remove the tail (LRU entry).

Use sentinel (dummy) head and tail nodes to avoid null checks. The real data nodes live between them. The head's next is the most recently used; the tail's prev is the least recently used.

class LRUCache: capacity: int map: HashMap head: Node # dummy sentinel (MRU end) tail: Node # dummy sentinel (LRU end) init(capacity): self.capacity = capacity self.map = {} head.next = tail tail.prev = head get(key) → value or -1: if key not in map: return -1 node = map[key] remove(node) # detach from current position insert_at_front(node) # move to MRU end return node.value put(key, value): if key in map: remove(map[key]) # update existing: remove old position node = Node(key, value) map[key] = node insert_at_front(node) if len(map) > capacity: lru_node = tail.prev # tail's prev is LRU remove(lru_node) del map[lru_node.key] # helpers remove(node): node.prev.next = node.next node.next.prev = node.prev insert_at_front(node): node.next = head.next node.prev = head head.next.prev = node head.next = node

You are given an access trace of 1 million requests against a dataset of 100,000 unique keys. You simulate LRU caches of sizes 1K, 5K, 10K, 20K, 40K, 60K, 80K, and 100K keys and measure the following hit ratios: 31%, 58%, 73%, 85%, 92%, 94%, 95.5%, 96%.

Questions: (a) Where is the MRC knee? (b) If cache RAM costs X per 1000 keys and a DB miss costs 5X in compute, what is the economically optimal cache size? (c) If a new product launch doubles the working set size from 20K to 40K unique hot keys, what happens to your hit ratio at the current cache size of 20K?

The knee is where the marginal gain per additional cache unit drops sharply. Look at the incremental hit ratio gain per 1K keys added at each step. The steepest drop in marginal gain marks the knee. For part (b), compute: "revenue" of a 1% hit ratio increase (= 10K misses/million avoided × 5X cost per miss) vs. "cost" of the cache size jump needed for that 1% gain.

(a) The knee is between 20K and 40K keys. Going from 1K to 20K: +54 percentage points over 19K keys = ~2.8 pp/1K. Going from 20K to 40K: +7 pp over 20K = 0.35 pp/1K. Going from 40K onward: <0.15 pp/1K. The marginal gain drops roughly 8× between the pre-knee and post-knee regions — that's the inflection point at around 20-25K keys.

(b) Optimal size at 20K. At 20K keys, hit ratio is 85%. Adding 20K more (to 40K) buys +7 pp. That 7 pp prevents 70K misses per million requests. Cost of those 70K misses = 70K × 5X = 350,000X compute units saved. Cost of 20K more cache = 20 × X = 20X RAM cost. The ROI is 350,000/20 = 17,500×. Clearly worth it. Adding another 20K (to 60K) buys only +2 pp = 20K misses avoided = 100,000X saved, at 20X cost — still positive (5,000× ROI) but declining fast. At 80K the ROI drops to around 750×. The exact economic knee depends on your specific cost ratios, but the principle — check ROI, not raw hit ratio — is the answer.

(c) A doubled working set to 40K hot keys at 20K cache size means only half the working set fits in cache. Hit ratio will collapse from 85% toward the hit ratio of a 10K cache on the original 20K working set — roughly equivalent to a 50% cache-to-working-set ratio. From the MRC above, that's somewhere around 58-73% hit ratio, depending on the Zipfian distribution of the new keys. This is the "cache is now half the working set" scenario — the MRC analysis is mandatory after any product launch that significantly changes key space size.

You are designing the caching layer for three independent services. For each one, pick the Redis maxmemory-policy and justify why.

Service A — Session store: Stores user session tokens. Sessions are created on login, accessed frequently during active use, and should automatically expire when the user is inactive. TTLs are set on all keys. Memory must stay bounded.

Service B — Product catalogue: Stores enriched product metadata for an e-commerce site. A small fraction of products (popular / trending items) get the vast majority of views. Product data changes occasionally via editorial updates. No TTLs on keys — invalidation is done explicitly.

Service C — Feature flag store: Stores feature flag evaluations (key = flag name + user segment, value = true/false). There are many flags and segments, but flag data changes infrequently. All lookups should return a value; a cache miss is acceptable but should be rare. Memory is tightly constrained.

Think about what signal matters for each use case. Sessions: TTL is the natural eviction driver — prefer evicting expired entries first. Product catalogue: no TTLs, access is Zipfian (popularity matters more than recency). Feature flags: small working set relative to memory, uniform importance, no natural ordering.

Service A → volatile-lru: All keys have TTLs (session expiry). The right policy is one that evicts keys that have TTLs, prioritising the least recently used. volatile-lru does exactly this — it only evicts keys with TTLs, and among those, picks the LRU. This respects session expiry intent while managing memory pressure. Do NOT use allkeys-lru — you don't want to evict sessions that are still actively used just because a more recent session displaced them.

Service B → allkeys-lfu: No TTLs, Zipfian access distribution, stable popularity for popular products. LFU is the right choice because frequency is a stable and accurate proxy for future access likelihood in e-commerce product catalogues. TinyLFU (available via Caffeine or Ristretto if you front Redis) would be even better, but in Redis 4.0+, allkeys-lfu with a sensible lfu-decay-time is a solid approximation.

Service C → allkeys-lru (or allkeys-random): Feature flag evaluations have no TTLs, the access distribution is relatively uniform across flags and segments (many flags, accessed fairly evenly), and the working set is small relative to expected memory. LRU is a safe default. If memory constraints are severe and you need zero metadata overhead, Random is also defensible here because the uniform access distribution means recency is a weak signal anyway.

You're looking at a 24-hour hit ratio graph for a Redis cache. The graph shows a stable plateau at 91% from midnight to 2 AM, then a sharp drop to 54% at 2:15 AM, then a gradual climb back to 88% over the following 3 hours, then another plateau. This pattern repeats identically the following day. There are no traffic spikes in the incoming request rate. evicted_keys shows a spike at 2:15 AM matching the hit ratio drop.

Questions: (a) What is the most likely cause? (b) How do you confirm it? (c) What are three ways to fix it?

The pattern is too regular to be organic user traffic. The exact time (2:15 AM) and the spike in evictions both point to a scheduled background job. What kind of job runs at 2 AM that would write many new keys into a cache?

(a) Scan pollution from a scheduled batch job. A nightly job (reporting, ETL, full-table cache warm-up for a new feature) ran at 2:15 AM, wrote a large number of keys into the cache, and evicted the hot working set. LRU treated the batch keys as "recently used" because they were just written, even though they will never be accessed again by the production workload. The 3-hour recovery time is the warm-up time for the working set to re-populate via normal traffic.

(b) Confirm by: checking application logs or cron job schedules for a job running at 2:15 AM; looking at Redis INFO keyspace to see if the key count spikes at that time; checking the rate of new keys written per second using redis-cli --latency-history or monitoring; and, if possible, examining whether the evicted keys are from a specific key pattern (batch job keys often have a distinct prefix).

(c) Three fixes: (1) Route the batch job to a separate Redis instance (or a read replica/separate DB) so it doesn't share a cache with the production workload. (2) Modify the batch job to skip the cache-aside write — if each row is only accessed once, there's no reason to cache it. (3) Switch the production cache policy to TinyLFU — the admission filter will reject one-hit-wonder keys (the batch job writes) from entering the cache at all, protecting the hot working set.

You are building a caching layer for a social media feed. The access pattern is highly Zipfian: roughly the top 10% of posts by popularity get about 85% of requests. However, there is a strong "novelty bias" — new posts receive a burst of traffic immediately after posting, then their popularity stabilises or decays. Additionally, many posts are accessed exactly once (users open a post, never return to it). LRU is failing because new posts look "recently used" even when they will only ever be accessed once.

Design task: Design an admission filter for this cache that allows a post to enter the cache only when there is evidence it will be accessed more than once. Specify: (a) the data structure for the filter, (b) the admission rule, (c) how the filter handles memory, and (d) how you handle the novelty-bias problem (new posts that will become popular but start with low frequency counts).

TinyLFU is the canonical solution to exactly this problem. The frequency sketch (Count-Min Sketch) tracks approximate access counts. The admission rule is: only admit a new entry to the main cache if its frequency count exceeds the eviction candidate's count. Think about how you'd add a "new entry grace period" for the novelty-bias part.

(a) Data structure: a Count-Min Sketch (or equivalent frequency sketch). This is a probabilistic data structure using a 2D array of counters and multiple hash functions. It can approximate the access frequency of any key using very little memory — a sketch tracking billions of events can fit in tens of megabytes. Use it to count how many times each post ID has been requested, before it's admitted to the main LRU cache.

(b) Admission rule: When a new post needs to be cached and the cache is at capacity, compare the new post's frequency count (from the sketch) to the current eviction candidate's frequency count. Only evict the candidate and admit the new entry if freq(new) >= freq(candidate). If the new post has only been seen once, it won't pass this test unless the eviction candidate is also a near-zero-frequency entry.

(c) Memory management for the filter: The sketch has a fixed size (set at initialisation). To prevent stale counts from old posts dominating, periodically "age" all counters — halve every counter every N minutes (or every N access events). This is the same decay mechanism TinyLFU uses. The aging step ensures old popularity fades and new popular posts can accumulate counts quickly.

(d) Novelty bias — the W-TinyLFU approach: Maintain a small "window LRU" cache (perhaps 1% of total capacity) that admits all new entries without filtering. New posts enter the window cache immediately, which serves as the novelty buffer. After accumulating some access history in the window, entries are promoted to the main cache only if their frequency count qualifies them. If the window overflows, entries are evaluated against the main cache's eviction candidate. This is exactly the W-TinyLFU (Window-TinyLFU) design used by the Caffeine library — the small window absorbs novelty bursts while the main cache is protected by the frequency admission gate.

Five exercises that build progressively: implementing LRU with HashMap + doubly-linked list (O(1) guarantee), reading and reasoning from an MRC curve, matching Redis policies to real service requirements, diagnosing scan pollution from a hit-ratio chart, and designing a TinyLFU-style admission filter with novelty handling. Working through these builds the intuition to answer eviction questions on a whiteboard without needing to recall memorised answers.

Section 23

Cheat Sheet & Glossary — The 30-Second Recap

Everything in this page, compressed to flashcard size. Use this for pre-interview review or as a quick reference when configuring a new cache.

Policy Quick-Reference

Evicts the entry not accessed for the longest time. Signal: recency. Best for: bursty traffic, session stores, social feeds. Fails on: sequential scans (scan pollution).

Evicts the entry with the fewest total accesses. Signal: frequency. Best for: stable hot-key workloads, product catalogues. Fails on: new popular items (infancy starvation); needs decay to avoid stale popularity.

Approximates frequency with a Count-Min Sketch. Admits new entries only if their frequency beats the eviction candidate. Modern default. Best for: general-purpose caches, CDN-style workloads, one-hit-wonder rejection.

Two LRU lists (recency + frequency) plus ghost lists for both. Automatically shifts weight between recency and frequency based on observed miss patterns. Best for: unknown or mixed workloads that don't stay constant.

Evicts a randomly chosen entry. Zero metadata overhead. Best for: uniform-access analytics, broad-read workloads where no policy can predict access patterns better than chance.

Evicts the entry inserted first, regardless of access. Signal: insertion order. Rarely correct for caching. Used mainly in write buffers and simple queues where insertion time = freshness.

Evicts the entry accessed most recently. Counter-intuitive but optimal for repeated sequential scans where the item just processed is the least likely to be needed again soon.

Evicts the entry whose next access is furthest in the future. Provably optimal — cannot be beaten. Requires knowing the future access sequence. Used only as a benchmark to measure how close real policies get to optimal.

Two-segment LRU: probationary (new entries) and protected (promoted after second access). Entries must be accessed twice to move to the protected segment. Resistant to one-hit wonders without a full frequency sketch.

Key Production Rules

Default 5-key sample. Raise maxmemory-samples to 10 for large key spaces. Or switch to allkeys-lfu.

Uniform TTL on bulk writes → stampede on expiry. Add ±20% random jitter. Use probabilistic early refresh for critical keys.

Batch/analytics reads must not share a cache with hot-key production workloads. Use a separate instance or skip caching.

Adding RAM past the MRC knee yields diminishing returns. Find the knee before provisioning. The knee is where marginal hit-ratio gain per GB drops sharply.

Track evicted_keys / total requests. Alert if >5-10%. A rising eviction rate means the working set no longer fits — cache is too small or workload has grown.

If your average object size changes significantly, slab allocation becomes unbalanced. Restart Memcached to rebalance slab pages to the new size distribution.

Glossary

Working set: The collection of cache entries that are actively needed to serve the current workload. If the cache is large enough to hold the working set, hit ratio is high. If the working set exceeds the cache, hit ratio is limited by how well the eviction policy approximates Belady's OPT.
Miss ratio curve (MRC): A plot of miss ratio vs. cache size for a given access trace. Shows exactly how much hit ratio you gain per additional unit of cache. The knee of the curve marks the economically optimal cache size.
Scan pollution: A failure mode where a sequential scan (full table scan, batch export, nightly analytics) writes a large number of one-time keys into the cache, evicting the hot working set. The LRU policy treats scan entries as "recently used" even though they will never be accessed again.
Ghost entry (ghost list): In ARC, a ghost entry is the key (without its value) of a recently evicted item. Ghost lists track what was evicted so the algorithm can learn: if a ghost entry is accessed, the miss could have been prevented, and the policy shifts weight toward the queue that would have retained it.
Admission filter: A gate that decides whether a new item is worth admitting to the main cache at all. TinyLFU uses a frequency sketch as its admission filter: a new item is only admitted if its estimated frequency exceeds the eviction candidate's frequency. Prevents one-hit wonders from polluting the cache.
Frequency sketch: A probabilistic data structure (typically a Count-Min Sketch) that approximates how many times a key has been accessed, using a small fixed amount of memory. Used by TinyLFU and similar algorithms to approximate LFU frequency counters without the per-key memory cost of exact counters.
One-hit wonder: A cache entry that is accessed exactly once and never again. Caching one-hit wonders wastes capacity and causes pollution. CDN and feed workloads have very high one-hit-wonder rates. TinyLFU's admission filter is specifically designed to reject these entries before they enter the main cache.
Cache stampede (thundering herd): When many cached keys expire simultaneously, causing a flood of concurrent cache misses that all hit the origin (database or upstream service) at the same time. Caused by uniform TTLs on bulk writes. Prevented by TTL jitter and probabilistic early refresh.
LRU clock (Redis): A 24-bit timestamp stored per key in Redis, recording the last-access time relative to the server's internal LRU clock. Updated on access. Used during sampled LRU eviction to find the least-recently-used key among a random sample. Cheaper than a doubly-linked list but less accurate.
Slab class (Memcached): Memcached divides memory into fixed-size slab classes, each serving a range of object sizes. Eviction is per-slab-class: a full slab evicts from within that class only, and cannot borrow free space from other classes. Object size changes can cause slab imbalance ("slab calcification").
Temporal locality: The tendency for recently-accessed data to be accessed again soon. LRU exploits temporal locality. Workloads with strong temporal locality are well-served by LRU; workloads with weak temporal locality (sequential scans) are not.
Frequency locality: The tendency for a small subset of data to be accessed far more frequently than the rest (Zipfian / power-law distribution). LFU and TinyLFU exploit frequency locality. Most real-world workloads have strong frequency locality — a small fraction of keys accounts for the vast majority of requests.

Nine eviction policies at a glance, six production rules, and twelve glossary entries covering the core vocabulary of cache eviction — from working set and MRC knee to ghost lists, admission filters, and slab calcification. Enough to orient you before any cache-related interview question or production incident.

Eviction Internals — How a Cache Decides What to Throw Away

TL;DR — Eviction in Plain English

Why You Need This — When Your Hit Ratio Quietly Collapses

The Production Story: One Nightly Job, One Database Fire

The Math That Makes Eviction Decisions Life or Death

The Question Eviction Policy Answers

Mental Model — The Memory-Pressure Loop

The Five Steps Every Eviction Follows

What "Good" Looks Like

Core Concepts — The Vocabulary of Eviction

The Eight Terms You Must Know

The 7 Canonical Eviction Policies — Overview

The Reference Table

LRU Deep Dive — Why HashMap + Doubly-Linked List is O(1)

The Analogy: A Stack of Parking Permits

The Data Structure: HashMap + Doubly-Linked List

A 6-Step Worked Example

The O(1) Proof

The Pseudocode

The Killer Weakness: Scan Pollution (Preview)

LFU Deep Dive — Why Frequency Wins for Long-Tail Data

The Plain-English Intuition: Most-Played vs. Most-Recently-Played

The Algorithm: Frequency Buckets in O(1)

When LFU Outperforms LRU — and When It Fails

TinyLFU & W-TinyLFU — How Caffeine Crushes LRU

The Count-Min Sketch: Approximate Frequency in Tiny Space

W-TinyLFU: The Window That Fixes Cold-Start

ARC — The Adaptive Replacement Cache

The Four-List Design

How ARC Adapts — Ghost Hits

Random & FIFO — Simple Policies, Surprisingly Resilient

Random Eviction: The Underrated Contender

FIFO: Ordered Simplicity

Redis maxmemory & The 8 Eviction Policies

Setting maxmemory and maxmemory-policy

The 8 Policies Explained

Why Redis's LRU Is "Approximated" — Not Exact

Memcached Slab Allocator & Per-Slab LRU

Why Slab Allocation Exists: Defeating Fragmentation

Per-Slab LRU: The Isolation That Creates the Problem

CPU Cache & TLB Eviction — Hardware Lessons at Nanosecond Speed

Why the CPU Can't Afford True LRU

TLB Eviction — The Same Problem, One Level Up

Browser & CDN Cache Eviction — Disk Limits and Edge Nodes

Browser Cache: You Don't Control the Eviction Policy

CDN Eviction: The Two-Tier Edge Model

Working Set Sizing & the 80/20 Curve — How Much Cache Do You Actually Need?

The Working Set: The Entries That Actually Matter

Mattson's Stack Distance Algorithm — One Trace, Every Cache Size

Scan Pollution & Sequential Flooding — The Killer Failure Mode

The Anatomy of a Scan Pollution Event

The Four Production Fixes

Fix 1 — SLRU's Protected/Probation Split

Fix 2 — TinyLFU's Admission Filter

Fix 3 — PostgreSQL's Ring Buffer for Sequential Scans

Fix 4 — MySQL InnoDB's Midpoint Insertion Strategy

One-Hit Wonders & The Frequency Filter — The 50% Problem

What Is a One-Hit Wonder?

The Two-Miss Math: Why One-Hit Wonders Are Doubly Expensive

Measurement — Hit Ratio, Miss Cost, MRC Curves & Belady's Optimal

Hit Ratio: The Vanity Metric

Miss Cost: The Metric That Actually Matters

The Miss Ratio Curve: Your Capital Expenditure Tool

Belady's Optimal — The Unreachable Ceiling

A Working MRC Pseudocode Sketch

Putting It Together: The Measurement Checklist

Tuning Eviction in Production — A Decision Playbook

Step 1 — Measure Hit Ratio Before Changing Anything

Step 2 — Sample a Workload Trace and Build an MRC

Step 3 — Estimate the Marginal Cost per GB of Hit Ratio Gain

Step 4 — Pick the Policy Matched to Your Workload

Step 5 — Monitor Continuously and Adjust

Bug Studies — When Eviction Goes Wrong in Production

What Went Wrong

The Fix

What Went Wrong

What Went Wrong

What Went Wrong

Common Misconceptions — What Engineers Get Wrong

Practice Exercises — Build Your Intuition