Redis Deep Dive — System Guide

Section 1

TL;DR — Redis in Plain English

Why keeping data in RAM instead of on disk makes reads and writes orders of magnitude faster
What the eight data structures in Redis are and when to reach for each one
How Redis stays durable despite living entirely in memory (AOF + RDB)
When Redis is the right tool — and when it is emphatically the wrong one

Redis lives in RAM — that is the whole insight. Everything else (speed, rich data structures, sub-millisecond latency) is a direct consequence of that one choice.

Redis stores every key and value in RAM. A disk read takes roughly 0.1–10 milliseconds, depending on the storage device and on whether the OS page cache helped. A RAM read takes roughly 100 nanoseconds, which is 1,000–100,000 times faster. That speed gap is not an implementation detail you can optimize away; it is physics. Redis is fast because it chose RAM, full stop.

Most people hear "key-value store" and picture a giant dictionary. Redis goes much further: its server understands eight data structures — strings, hashes, lists, sets, sorted sets, streams, bitmaps, and HyperLogLog. You can push an item onto a list, increment a score in a sorted set, or count unique visitors with HyperLogLog — all with a single command, all server-side, all in microseconds.

"But if Redis is just RAM, what happens when the server restarts?" Redis can write every command to an Append-Only File (AOF) on disk — replay it on startup and you are back. It can also take periodic binary snapshots (RDB). Most production setups enable both. The result is sub-millisecond speed during normal operation plus crash recovery — not a trade-off, a combination.

Redis is a RAM-first key-value store with eight built-in data structures; its microsecond speed, server-side data operations, and optional AOF/RDB durability make it the go-to layer for caching, sessions, leaderboards, and real-time patterns.

Section 2

Why You Need This — The Melting Database Story

Let's start with a story you will live through at some point in your career. It does not require any prior knowledge — just follow the numbers.

The situation: your top-products page

You built an e-commerce site. The homepage shows the top 20 best-selling products. To generate that list, your application runs five SQL queries: fetch products, fetch inventory counts, fetch ratings, fetch review counts, fetch discount info. Together they take about 80 milliseconds — fast enough in development where you are the only user.

Then your site gets popular. Traffic climbs to 10,000 requests per second during a sale. Let's do the math:

10,000 req/sec × 5 queries = 50,000 SQL queries per second hitting your database.
Each query holds a database connection while it runs (and, depending on the database and isolation level, may also take locks).
Your database has maybe 200 connection slots total.
Connection pool exhausted → requests queue → latency spikes to 2–5 seconds → users bounce → your boss calls.

The page content does not actually change every millisecond. The top-20 products list is essentially the same for the next 60 seconds. You are paying the full SQL cost for every single request even though the answer is the same. That is the problem Redis solves.

The fix: cache the result in Redis

The pattern is three lines of logic:

On request: try to read the pre-computed result from Redis with GET top-products.
Cache hit (99% of requests): return it. Time: ~1 ms. No database touched.
Cache miss (1% of requests, after TTL expires): run the SQL queries once, store the result with SET top-products <json> EX 60, return it.
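
The three steps above can be sketched as follows. This is a minimal illustration, assuming an in-process dictionary stands in for Redis so that it runs without a server; `TinyCache`, `get_top_products`, `load_top_products`, and the injected clock are hypothetical names, and a real deployment would issue GET and SET ... EX 60 through a Redis client instead.

```python
import json
import time


class TinyCache:
    """In-memory stand-in for Redis GET and SET ... EX (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # TTL ran out: behave as if the key is gone
            return None
        return value

    def set(self, key, value, ex):
        self._data[key] = (value, self._clock() + ex)


def get_top_products(cache, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    cached = cache.get("top-products")
    if cached is not None:
        return json.loads(cached)           # hit: no database work
    products = load_from_db()               # miss: run the expensive queries once
    cache.set("top-products", json.dumps(products), ex=ttl_seconds)
    return products


# Demo with a fake clock so the TTL can be "advanced" without sleeping.
calls = {"db": 0}
now = [0.0]


def load_top_products():
    calls["db"] += 1
    return [{"id": 1, "name": "Keyboard"}]


cache = TinyCache(clock=lambda: now[0])
for _ in range(1000):
    get_top_products(cache, load_top_products)
print(calls["db"])   # 1: only the first request reached the "database"

now[0] = 61.0        # move past the 60-second TTL
get_top_products(cache, load_top_products)
print(calls["db"])   # 2: one more load after expiry
```

The sketch ignores what happens when many requests miss at the same instant; under real traffic several of them may run the SQL concurrently before the first one repopulates the key.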

The result in production: database load drops ~99%. User-facing latency drops from ~80 ms to ~1 ms, an 80× improvement. At a 99% hit rate only ~100 requests per second still reach the database, which is ~500 SQL queries per second instead of 50,000, leaving plenty of headroom for writes and other operations.

Think First: Your gaming app has a global leaderboard updated every 100 ms, with 1 million active players. In development, SELECT user_id, score FROM scores ORDER BY score DESC LIMIT 100 runs in 50 ms. Why does this fall apart at scale? Think before reading on.

Answer

Every score change means the sort order changes. At 1M players with scores updating 10 times a second each, you have 10 million writes per second invalidating the sort. Without a suitable index, a SQL ORDER BY on a disk-backed table re-reads and re-sorts rows on every query; it cannot maintain a pre-sorted result across writes at that rate. An index helps reads, but each write must update the index too, causing lock contention. Redis's sorted set (ZSET) was designed exactly for this: it maintains a sorted skip-list in memory, so ZINCRBY (update a score) is O(log N) and ZRANGE (get top N) is O(log N + M) for M returned members. It stays fast whether you are reading or writing.

The canonical Redis use case is a cache: run expensive SQL once, store the result with a TTL, serve 99% of traffic from RAM at ~1 ms — dropping database load by ~99% and user latency by ~80×.

Section 3

Mental Model — Server-Side Data Structures

Here is the key mental shift that separates Redis from every other cache or database you have probably used.

Traditional stores: you move data, your code does the work

In most caches and databases, the server just stores bytes. You want to add an item to a list? You have to: (1) fetch the whole list as a string or JSON blob, (2) parse it into an object in your application, (3) append the new item, (4) re-serialize the whole thing back to a string, (5) write the whole string back to the server. That is five steps and two network round-trips, and the whole list had to travel across the network twice. If two clients do this at the same time, one update can silently overwrite the other.

Redis: the server does the work, you send one command

Redis stores your data in a typed structure that the server understands. You want to add an item to a list? You send one command: LPUSH my-list "new-item". Redis pushes the item onto the head of the list internally in O(1) time and sends back the new list length. No fetching. No parsing. No re-serializing. No second write. The compute happened where the data lives.

This idea — moving computation to the data instead of moving data to the computation — is why Redis stays fast even for operations that look complex. Incrementing a counter, updating a score, adding a member to a set: all atomic, all server-side, all in microseconds.
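
To make the contrast concrete, here is a rough sketch that counts round-trips and bytes moved for both approaches. `CountingStore` is a hypothetical toy server invented for this illustration, not a Redis client; its `lpush` method only mimics the shape of the real LPUSH command.

```python
import json


class CountingStore:
    """Toy server that counts round-trips and payload bytes (illustration only)."""

    def __init__(self):
        self.blobs = {}      # key -> JSON string (the "server stores bytes" model)
        self.lists = {}      # key -> Python list (the "typed structure" model)
        self.round_trips = 0
        self.bytes_moved = 0

    def get(self, key):
        self.round_trips += 1
        value = self.blobs.get(key, "[]")
        self.bytes_moved += len(value)
        return value

    def set(self, key, value):
        self.round_trips += 1
        self.bytes_moved += len(value)
        self.blobs[key] = value

    def lpush(self, key, item):
        self.round_trips += 1
        self.bytes_moved += len(item)
        self.lists.setdefault(key, []).insert(0, item)
        return len(self.lists[key])


def push_client_side(store, key, item):
    items = json.loads(store.get(key))    # fetch + parse
    items.insert(0, item)                 # modify in the application
    store.set(key, json.dumps(items))     # serialize + write back


blob_store = CountingStore()
blob_store.blobs["my-list"] = json.dumps(["item"] * 1000)
push_client_side(blob_store, "my-list", "new-item")

typed_store = CountingStore()
typed_store.lists["my-list"] = ["item"] * 1000
new_length = typed_store.lpush("my-list", "new-item")

print(blob_store.round_trips, typed_store.round_trips)   # 2 1
print(blob_store.bytes_moved > typed_store.bytes_moved)  # True
```

The client-side version moves the entire list in both directions, so its cost grows with the list; the server-side command moves only the new item.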

The durability side of the story

You might worry: "If Redis is all RAM, a server restart wipes everything." Redis has two durability mechanisms to prevent this. The Append-Only File (AOF) logs every write command to disk as it arrives; think of it as a receipt tape. On restart, Redis replays the logged commands and rebuilds the dataset. The RDB snapshot takes a complete binary copy of the dataset at configured intervals, typically every few minutes. Most production deployments run both, giving you the speed of RAM during normal operation and the safety net of disk for recovery.

Key insight: Redis does not choose between speed and durability. It uses RAM for speed during live operation and writes to disk asynchronously (AOF) or periodically (RDB) for recovery. The crash window with appendfsync everysec is about 1 second of writes. For a cache, losing 1 second is acceptable. For primary data storage, you would evaluate whether that fits your durability requirements.

Redis's big idea is moving computation to the data — the server understands your data structures natively, so you send one command instead of fetch-parse-modify-serialize-write. AOF + RDB give durability without sacrificing RAM speed.

Section 4

Core Concepts — The Six Terms You Must Know

Before diving into data structures and commands, let's pin down six concepts. You will see these in every Redis discussion, every config file, and every interview. Understand these and everything else will click.

Key — the name of your data

A key is just a string that identifies a piece of data. You choose the name. Convention is to use colons as a namespace separator: user:42:profile, session:abc123, rate-limit:ip:1.2.3.4. Keys can be up to 512 MB in theory, but in practice you want them short (under 100 bytes) since Redis stores them in memory. The key is your primary access path — there are no secondary indexes by default.

Best practice: Use a consistent naming scheme across your application. object-type:id:field is the most common pattern — it keeps keys readable and makes SCAN commands for maintenance predictable.

Value — the typed data attached to a key

Every key maps to a value, but unlike a plain dictionary that only holds strings, Redis values have types: string, hash, list, set, sorted set, stream, bitmap, or HyperLogLog (the last two are stored as strings with their own command families). The type determines which commands you can run on that key. Trying to run a list command (LPUSH) on a string key returns a WRONGTYPE error. This is why Redis is described as a "data structure server" rather than just a key-value store.

TTL (Time-To-Live) — automatic expiry

TTL is the number of seconds until a key automatically deletes itself. You set it when writing: SET session:abc123 "{...}" EX 3600 (expire in 3600 seconds = 1 hour). This is how caches stay fresh without manual cleanup: set the TTL to match how often your source data changes, and Redis handles the rest.

Without TTL, keys live forever and your memory fills up. With TTL, Redis expires stale data on its own, with no cron job needed. Use TTL my-key to check how many seconds remain; it returns -1 for a key with no expiry and -2 for a key that does not exist. PERSIST my-key removes the expiry if you decide the data should live forever.
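
The expiry behavior can be modeled in a few lines. The sketch below is a toy, assuming a dictionary with lazily checked deadlines; `ExpiringDict` is a hypothetical name, and the return codes mirror the TTL command (-1 for "no expiry", -2 for "no such key").

```python
import time


class ExpiringDict:
    """Toy model of SET ... EX, TTL, and PERSIST (not how Redis is implemented)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._values = {}
        self._expires_at = {}

    def _purge_if_expired(self, key):
        deadline = self._expires_at.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._values.pop(key, None)
            self._expires_at.pop(key, None)

    def set(self, key, value, ex=None):
        self._values[key] = value
        if ex is None:
            self._expires_at.pop(key, None)   # a plain SET clears any old TTL
        else:
            self._expires_at[key] = self._clock() + ex

    def get(self, key):
        self._purge_if_expired(key)
        return self._values.get(key)

    def ttl(self, key):
        self._purge_if_expired(key)
        if key not in self._values:
            return -2                          # key does not exist
        if key not in self._expires_at:
            return -1                          # key exists, no expiry set
        return int(self._expires_at[key] - self._clock())

    def persist(self, key):
        self._purge_if_expired(key)
        return 1 if self._expires_at.pop(key, None) is not None else 0


now = [0.0]
store = ExpiringDict(clock=lambda: now[0])
store.set("session:abc123", "{...}", ex=3600)
ttl_at_start = store.ttl("session:abc123")        # 3600
now[0] = 100.0
ttl_later = store.ttl("session:abc123")           # 3500
store.persist("session:abc123")
ttl_after_persist = store.ttl("session:abc123")   # -1
store.set("tmp", "x", ex=10)
now[0] = 200.0
ttl_missing = store.ttl("tmp")                    # -2 (expired and gone)
```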

AOF (Append-Only File) — the write log

Every write command that Redis processes gets appended to a file on disk. Think of it as a running receipt tape: SET user:1 Alice → written. LPUSH queue "job-1" → written. On crash and restart, Redis replays every line in the file top-to-bottom and rebuilds the entire dataset in memory. The replay takes time proportional to the number of commands in the log — Redis will periodically compact ("rewrite") the AOF to keep it small.

How often to flush to disk is tunable via appendfsync: every command (slowest but safest), every second (the default — you risk losing at most 1 second of writes), or never (let the OS decide — fastest but riskiest).
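
The receipt-tape idea is small enough to sketch. The example below is a simplified model, assuming plain-text lines and only three commands (SET, INCR, DEL) with space-free values; the real AOF uses the Redis wire protocol rather than this format, and `TinyAof`, `apply_command`, and `recover` are hypothetical names.

```python
import os
import tempfile


def apply_command(dataset, command):
    """Apply one write command (SET, INCR, or DEL) to an in-memory dict."""
    name, *args = command.split()
    if name == "SET":
        dataset[args[0]] = args[1]
    elif name == "INCR":
        dataset[args[0]] = str(int(dataset.get(args[0], "0")) + 1)
    elif name == "DEL":
        dataset.pop(args[0], None)
    else:
        raise ValueError(f"unsupported command: {name}")


class TinyAof:
    """Executes commands in memory and appends each one to a log file."""

    def __init__(self, path):
        self.dataset = {}
        self._log = open(path, "a", encoding="utf-8")

    def execute(self, command):
        apply_command(self.dataset, command)
        self._log.write(command + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())   # fsync per write, similar in spirit to "always"

    def close(self):
        self._log.close()


def recover(path):
    """Rebuild the dataset by replaying the log from top to bottom."""
    dataset = {}
    with open(path, encoding="utf-8") as log:
        for line in log:
            apply_command(dataset, line.strip())
    return dataset


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "appendonly.aof")
    server = TinyAof(path)
    for cmd in ["SET user:1 Alice", "INCR page-views", "INCR page-views", "DEL user:1"]:
        server.execute(cmd)
    before_crash = dict(server.dataset)
    server.close()                     # pretend the process died here
    recovered = recover(path)

print(recovered == before_crash)       # True
```

Moving the `os.fsync` call onto a once-per-second timer would trade a small loss window for far fewer disk syncs, which is the trade-off the appendfsync setting exposes.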

RDB Snapshot — the point-in-time photograph

RDB stands for Redis Database. It is a compact binary snapshot of the entire dataset at a moment in time. Redis forks the process (a copy-on-write fork, so the parent keeps serving requests apart from the brief fork call itself), writes the snapshot to a new dump.rdb file, then atomically replaces the old file. Typical configuration: save a snapshot if at least 1 write happened in the last 3600 seconds, or at least 100 writes in the last 300 seconds. Production sites often tune these thresholds.

RDB is faster to load on restart than replaying a full AOF log (binary format vs. command replay). It is also more compact for backups. The downside: you lose changes since the last snapshot — which could be minutes of writes. AOF gives finer-grained recovery; RDB gives faster recovery. Together they cover both cases.

Pipeline — batch commands, one round-trip

Every Redis command normally needs a network round-trip: send command, wait for reply, send next command. If your network latency is 1 ms, 100 commands take 100 ms of waiting, even if each command itself executes in microseconds. A pipeline lets you queue up many commands client-side, send them all at once, and receive all replies at once. Result: 100 commands in ~1 ms total instead of ~100 ms. This is especially powerful for bulk imports or multi-step operations where you do not need the result of command N before sending command N+1.
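
The arithmetic behind that claim, as a back-of-the-envelope sketch. The 5-microsecond execution time per command is an assumed figure chosen for illustration, and `total_latency_ms` is a hypothetical helper.

```python
def total_latency_ms(commands, rtt_ms, exec_us_per_command=5.0, pipelined=False):
    """Rough wall-clock estimate: network round-trips plus server execution time."""
    round_trips = 1 if pipelined else commands
    return round_trips * rtt_ms + commands * exec_us_per_command / 1000.0


sequential = total_latency_ms(100, rtt_ms=1.0)
batched = total_latency_ms(100, rtt_ms=1.0, pipelined=True)
print(sequential, batched)   # 100.5 1.5
```

The network term dominates the sequential case, which is why batching helps most when round-trip latency is high relative to execution time.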

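Pipeline vs. transactions: A pipeline batches for network efficiency. A transaction (MULTI/EXEC) batches for isolation: the queued commands run back-to-back with no other client's commands interleaved. Redis does not roll back, so if one command fails at runtime the others still execute. They solve different problems. You can combine them: pipeline a transaction.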

The six must-know Redis concepts are: Key (the name), Value (the typed data), TTL (auto-expiry), AOF (write log for crash recovery), RDB (binary snapshot), and Pipeline (batched commands for network efficiency).

Section 5

The Eight Data Structures — Pick the Right Tool

This is where Redis gets genuinely exciting. Most key-value stores let you put a string in and get a string back. Redis gives you eight server-understood data structures, each with its own set of commands optimized for that shape. Pick the right structure and your code becomes three lines instead of thirty.

Structure-by-structure plain English

String — the universal type

A string in Redis holds any sequence of bytes — a JSON blob, an integer, a serialized object, even binary data like a JPEG thumbnail. Maximum size is 512 MB. The most common use: plain cache values (SET product:42 "{...json...}" EX 300) and counters (INCR page-views is atomic and will never produce a race condition). The INCR / DECR family is huge for rate limiting: atomically increment a counter, check if it exceeds a threshold, expire the key after the window.

Hash — an object in one key

A hash is a map of field-value pairs stored under one Redis key — think of it as a row in a table, or a JSON object, but one that the server can update field-by-field. HSET user:42 name Alice age 30 city London sets three fields at once. HGET user:42 name retrieves just the name without fetching the whole object. HINCRBY user:42 login_count 1 atomically increments one counter without touching other fields. This is ideal for user sessions, product records, and any object where you frequently update individual fields.

List — queue and activity feed

A Redis list is a doubly-linked list of strings, accessible from both ends in O(1). LPUSH prepends to the head; RPOP removes from the tail — together they make a FIFO queue. LRANGE my-list 0 9 fetches the first 10 items without removing them, which is perfect for an activity feed ("last 10 actions by user X"). The blocking variants BLPOP/BRPOP let a worker process sleep until a job arrives — a lightweight job queue without polling.

Set — unique membership and set math

A set is an unordered collection of unique strings. Adding a duplicate is a no-op — Redis silently ignores it, so you never need to check before adding. The power is in set operations: SUNION (all items in A or B), SINTER (items in both A and B), SDIFF (items in A but not B). Real use: "which users liked both Product A and Product B?" — two sets, one SINTER command, one round-trip. Also perfect for tagging: a set per tag with the user IDs that have that tag.
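
Python's built-in set type has the same semantics, which makes it a convenient way to see what each command returns. This is only an analogy: the variable names are made up, and in Redis the computation happens on the server rather than in your process.

```python
liked_a = {"user:1", "user:2", "user:3"}
liked_b = {"user:2", "user:3", "user:4"}

liked_a.add("user:1")          # adding a duplicate changes nothing, like SADD

both = liked_a & liked_b       # SINTER: liked A and B
either = liked_a | liked_b     # SUNION: liked A or B
only_a = liked_a - liked_b     # SDIFF: liked A but not B

print(sorted(both))            # ['user:2', 'user:3']
print(sorted(only_a))          # ['user:1']
```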

Sorted Set (ZSET) — the leaderboard structure

A sorted set is like a regular set but every member has a floating-point score. Redis keeps members sorted by score at all times using a skip-list + hash table combination. ZINCRBY leaderboard 10 alice increments Alice's score by 10, and she is re-positioned in the sort order atomically. ZRANGE leaderboard 0 9 WITHSCORES REV (Redis 6.2+) returns the top 10 players in descending order. ZREVRANK leaderboard alice returns Alice's rank counted from the highest score (0-indexed); ZRANK counts from the lowest. This covers leaderboards, priority queues, rate-limit sliding windows (score = timestamp, member = event ID), and range queries over numeric scores.

Stream — the event log

A stream is an append-only log where each entry has an auto-generated timestamp-based ID and a set of field-value pairs. XADD events * action "click" page "/home" appends a new event. XREAD lets consumers read from a position forward. Consumer groups (similar to Kafka consumer groups) allow multiple workers to divide work: each message is delivered to one worker in the group, and workers acknowledge messages so the server knows what has been processed. Streams are lighter-weight than Kafka for simpler event-driven workflows, and a better fit than plain lists when you need consumer groups and acknowledgement.

Bitmap — presence tracking at scale

A bitmap is not a separate data type in Redis — it is a string with commands for individual bit operations. You treat the string as an array of bits and address each bit by offset. SETBIT daily-active:2024-01-01 42 1 marks user 42 as active today. GETBIT daily-active:2024-01-01 42 checks it. BITCOUNT daily-active:2024-01-01 counts how many users were active. For 10 million users, you need 10 million bits = ~1.2 MB — impossibly compact for what it does. This is the go-to for daily-active-user tracking, feature flags per user, and "has this user completed X?" in O(1) per user.
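
A bytearray is enough to reproduce the idea and check the memory figure. The `Bitmap` class below is a hypothetical stand-in for SETBIT, GETBIT, and BITCOUNT on a single key; like Redis, it numbers bits from the most significant bit of each byte and grows the underlying string on demand.

```python
class Bitmap:
    """Toy model of a Redis string addressed bit by bit."""

    def __init__(self):
        self._bytes = bytearray()

    def setbit(self, offset, value):
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self._bytes):
            # Grow with zero bytes so the offset becomes addressable.
            self._bytes.extend(bytes(byte_index + 1 - len(self._bytes)))
        mask = 0x80 >> bit_index
        previous = 1 if self._bytes[byte_index] & mask else 0
        if value:
            self._bytes[byte_index] |= mask
        else:
            self._bytes[byte_index] &= ~mask & 0xFF
        return previous                     # old bit value, as SETBIT returns

    def getbit(self, offset):
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self._bytes):
            return 0                        # bits past the end read as 0
        return 1 if self._bytes[byte_index] & (0x80 >> bit_index) else 0

    def bitcount(self):
        return sum(bin(b).count("1") for b in self._bytes)

    def size_bytes(self):
        return len(self._bytes)


daily_active = Bitmap()
daily_active.setbit(42, 1)                  # user 42 was active
daily_active.setbit(9_999_999, 1)           # highest ID among 10 million users
print(daily_active.getbit(42))              # 1
print(daily_active.bitcount())              # 2
print(daily_active.size_bytes())            # 1250000
```

Note that the size is driven by the highest offset written, not by the number of active users, so sparse IDs with very large values waste space.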

HyperLogLog — unique count at billion scale

Imagine you want to count how many different people visited your site today, but you do not actually need a perfect number — "around 4.2 million" is fine. HyperLogLog is the data structure for exactly that question. It is a probabilistic algorithm that estimates the number of unique items in a set using a fixed ~12 KB of memory, regardless of how many billions of items you add. The trade-off: the count is approximate with ~0.81% standard error. PFADD visitors "user-1" "user-2" "user-3" adds three visitors. PFCOUNT visitors returns an estimate of unique visitors. For "how many unique IPs hit our API today?" at billion-request scale, exact counting would need gigabytes of memory. HyperLogLog does it in 12 KB with sub-1% error — an engineering bargain if exactness is not required.
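
Both headline numbers fall out of the structure's layout. The sketch below assumes the dense encoding with 16,384 registers of 6 bits each and the usual HyperLogLog error estimate of 1.04 divided by the square root of the register count; the 4.2 million figure is simply the example from the paragraph above.

```python
import math

REGISTERS = 16384          # 2**14 registers in the dense encoding
BITS_PER_REGISTER = 6

memory_bytes = REGISTERS * BITS_PER_REGISTER // 8
standard_error = 1.04 / math.sqrt(REGISTERS)

print(memory_bytes)                        # 12288 bytes, i.e. 12 KB
print(round(standard_error * 100, 2))      # 0.81 (percent)

# What one standard error means for a true count of 4.2 million visitors.
true_count = 4_200_000
one_sigma = true_count * standard_error
print(round(one_sigma))                    # 34125
```

So an estimate within a few tens of thousands of 4.2 million is the expected behavior, not a bug.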

See it in code

redis-cli — strings & counters

# Plain cache value with 5-minute TTL
SET product:42 '{"name":"Keyboard","price":99}' EX 300

# Read it back
GET product:42
# → '{"name":"Keyboard","price":99}'

# Atomic counter — no race conditions ever
SET page-views 0
INCR page-views   # → 1
INCR page-views   # → 2
INCRBY page-views 10  # → 12

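# Rate limiting: allow 100 requests per minute per IP
# Each call increments the counter and checks it
INCR rate:1.2.3.4
# Only when INCR returned 1 (first request in the window), start the 60-second window.
# Calling EXPIRE on every request would keep pushing the expiry back.
EXPIRE rate:1.2.3.4 60
# If the INCR result > 100, reject the request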

# Check TTL (seconds remaining)
TTL product:42   # → 287 (about 4:47 left)

# NX flag: set only if key does NOT exist (distributed lock pattern)
SET lock:resource-1 "worker-A" NX EX 30
# → OK if lock acquired, (nil) if already held

Strings are the workhorse. INCR is special: it is guaranteed atomic, so 1,000 concurrent callers each running INCR will always produce exactly 1,000 increments with no lost updates — no transactions needed.
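
One subtlety in the INCR plus EXPIRE rate-limit pattern is when the expiry gets set: the window should start at the first request and not be extended by later ones. The sketch below models that fixed-window behavior in memory; `FixedWindowLimiter` and its injected clock are hypothetical, and against a real server the same logic would be INCR followed by EXPIRE only when the counter was just created.

```python
import time


class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per `window_seconds` per client."""

    def __init__(self, limit=100, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._counters = {}   # key -> [count, expires_at]

    def allow(self, client_id):
        key = f"rate:{client_id}"
        now = self._clock()
        entry = self._counters.get(key)
        if entry is None or now >= entry[1]:
            # First request of a window: create the counter and set its expiry once.
            entry = [0, now + self.window_seconds]
            self._counters[key] = entry
        entry[0] += 1                          # the INCR step
        return entry[0] <= self.limit


now = [0.0]
limiter = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: now[0])
results = [limiter.allow("1.2.3.4") for _ in range(101)]
print(results.count(True))                     # 100

now[0] = 59.0
still_blocked = not limiter.allow("1.2.3.4")   # same window, still over the limit
now[0] = 60.0
allowed_again = limiter.allow("1.2.3.4")       # window expired, counter restarts
print(still_blocked, allowed_again)            # True True
```

A fixed window allows short bursts across a window boundary; the sorted-set sliding window mentioned earlier in this section is the usual refinement when that matters.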

redis-cli — hashes & lists

# Hash: store user profile fields individually
HSET user:42 name Alice age 30 city London login_count 0

# Read one field — no need to fetch the whole object
HGET user:42 name        # → Alice
HGET user:42 city        # → London

# Read all fields
HGETALL user:42
# → name Alice age 30 city London login_count 0

# Update one field without touching others
HINCRBY user:42 login_count 1   # → 1

# ─── Lists as a FIFO queue ───
# Producer pushes jobs to the head
LPUSH job-queue '{"task":"send-email","to":"alice@example.com"}'
LPUSH job-queue '{"task":"resize-image","id":99}'

# Worker pops from the tail (FIFO order)
RPOP job-queue   # → '{"task":"send-email",...}'

# Blocking pop: worker sleeps until a job arrives (no polling!)
BRPOP job-queue 30   # wait up to 30 seconds

# Recent activity feed: keep last 10 items
LPUSH activity:user:42 "logged in"
LTRIM activity:user:42 0 9   # trim to 10 items max
LRANGE activity:user:42 0 9  # read all 10

Hashes let you update one field (HINCRBY for login_count) without fetching and re-writing the whole object. Lists with BRPOP replace a polling loop — the worker does zero work while idle, waking instantly when a job arrives.

redis-cli — sorted set leaderboard

# Add players with initial scores
ZADD leaderboard 1500 alice
ZADD leaderboard 2300 bob
ZADD leaderboard 1800 carol

# Alice scores 200 more points
ZINCRBY leaderboard 200 alice   # alice now has 1700

# Top 3 players (high score first)
ZRANGE leaderboard 0 2 WITHSCORES REV
# → 1) bob 2300
# → 2) carol 1800
# → 3) alice 1700

# Alice's rank (0-indexed, highest score = rank 0)
ZREVRANK leaderboard alice   # → 2 (3rd place)

# Players in score range 1500–2000
ZRANGEBYSCORE leaderboard 1500 2000 WITHSCORES
# → alice 1700, carol 1800

# How many players have score >= 1600?
ZCOUNT leaderboard 1600 +inf   # → 3 (alice, carol, bob)

# Remove a player
ZREM leaderboard alice

The sorted set handles leaderboard reads and score updates in O(log N) time. At 1 million players, ZINCRBY + ZREVRANK still complete in microseconds — try doing that with a SQL ORDER BY on every score update.

Redis provides eight native data structures — String, Hash, List, Set, Sorted Set, Stream, Bitmap, and HyperLogLog — each with server-executed commands optimized for its shape, making complex operations a single round-trip instead of fetch-modify-write cycles in application code.

Section 6

Persistence — AOF and RDB: How Redis Survives a Crash

The most common question about Redis: "If everything lives in RAM, what happens when the server crashes or reboots?" The answer is: you decide. Redis gives you two complementary persistence mechanisms, and most production setups use both.

AOF — Append-Only File (the receipt tape)

Every write command that Redis processes (every SET, LPUSH, ZINCRBY, every mutation) is appended to a file on disk once it has executed. On a crash and restart, Redis opens the AOF file and replays every command from top to bottom, rebuilding the dataset in memory as it was at the last write that reached disk.

How often to force the OS to flush the file buffer to the physical disk is controlled by the appendfsync setting:

appendfsync always: fsync after every write. Safest (an acknowledged write is already on disk), but slowest, because every write pays disk fsync latency.
appendfsync everysec: fsync once per second (the default). In the worst case you lose about the last 1 second of writes. Fast enough for most workloads.
appendfsync no: let the OS decide when to flush (typically every 30 seconds on Linux). Fastest, but you could lose up to 30 seconds of writes on power loss.

Over time the AOF grows large. Redis can rewrite it ("AOF rewrite") — it forks and writes a minimal set of commands that produce the current dataset, then atomically replaces the old file. A typical trigger: rewrite when the AOF file has grown to double its size at the last rewrite point.
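
A rewrite can be pictured as "replay the log, then emit one command per surviving key." The sketch below assumes a simplified text log with only SET, INCR, and DEL; `replay`, `rewrite`, and `should_rewrite` are hypothetical helpers, and the default thresholds in `should_rewrite` mirror the auto-aof-rewrite-percentage (100) and auto-aof-rewrite-min-size (64 MB) settings.

```python
def replay(commands):
    """Rebuild a dataset from a simplified command log."""
    dataset = {}
    for command in commands:
        name, *args = command.split()
        if name == "SET":
            dataset[args[0]] = args[1]
        elif name == "INCR":
            dataset[args[0]] = str(int(dataset.get(args[0], "0")) + 1)
        elif name == "DEL":
            dataset.pop(args[0], None)
    return dataset


def rewrite(commands):
    """Return a minimal log that reproduces the same final dataset."""
    return [f"SET {key} {value}" for key, value in sorted(replay(commands).items())]


def should_rewrite(current_size, size_after_last_rewrite,
                   growth_percent=100, min_size=64 * 1024 * 1024):
    """Trigger when the file is big enough and has grown by the given percentage."""
    if current_size < min_size:
        return False
    base = size_after_last_rewrite or 1        # avoid dividing by zero
    return (current_size - base) * 100 / base >= growth_percent


old_log = (["SET counter 0"] + ["INCR counter"] * 1000
           + ["SET user:1 Alice", "SET tmp x", "DEL tmp"])
new_log = rewrite(old_log)
print(len(old_log), len(new_log))   # 1004 2
print(new_log)                      # ['SET counter 1000', 'SET user:1 Alice']
```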

RDB — Binary Snapshot (the photograph)

RDB takes a complete point-in-time snapshot of the dataset and saves it to a compact binary file (dump.rdb). Redis uses a UNIX fork() to create a child process. Thanks to copy-on-write memory semantics the fork copies only page tables, not the data, so it is usually quick, though it can pause the server briefly on very large datasets. The child writes the snapshot; the parent keeps serving requests. When the child finishes, the new dump.rdb atomically replaces the old one.

Typical configuration (from redis.conf):

Save if 1 or more keys changed in the last 3600 seconds (hourly minimum)
Save if 100 or more keys changed in the last 300 seconds (5-minute interval on moderate write load)
Save if 10,000 or more keys changed in the last 60 seconds (very frequent on high write load)
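
The three rules above are checked independently, and any one of them is enough to trigger a snapshot. A small sketch of that decision, assuming each rule is a (seconds, changes) pair; `should_snapshot` and `SAVE_POINTS` are illustrative names, not part of Redis.

```python
# (seconds since last save, minimum number of changes), matching the list above
SAVE_POINTS = [(3600, 1), (300, 100), (60, 10000)]


def should_snapshot(seconds_since_last_save, changes_since_last_save,
                    save_points=SAVE_POINTS):
    """True if any save point has both its time and change thresholds met."""
    return any(
        seconds_since_last_save >= seconds and changes_since_last_save >= changes
        for seconds, changes in save_points
    )


print(should_snapshot(61, 10_000))   # True: busy server, the 60-second rule fires
print(should_snapshot(61, 500))      # False: not enough changes for the 60-second rule
print(should_snapshot(301, 500))     # True: the 300-second rule fires
print(should_snapshot(3601, 0))      # False: nothing changed, nothing to save
```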

RDB files are compact — a dataset that uses 10 GB of RAM might compress to a 2–4 GB RDB file. Loading an RDB on restart is faster than replaying a long AOF log because it is a binary format, not command-by-command replay. The downside: you lose changes since the last snapshot — potentially minutes of writes.

AOF vs RDB — the trade-off in plain terms

Choosing between AOF and RDB

AOF gives you near-zero data loss. With appendfsync everysec, the worst-case data loss is about 1 second of writes. In its plain command-log form an AOF is human-readable, so you can inspect it and even edit it manually before replay (recent versions may store the rewritten base portion in binary RDB format). The downside: AOF files grow large (they record every command, not just the final state), and replay on restart takes longer than loading a binary RDB file.

RDB gives you faster recovery and smaller backups. A binary snapshot loads in seconds even for large datasets. The downside: you lose changes since the last snapshot — which could be anywhere from 1 minute to an hour depending on your configuration.

Best of both worlds: enable both. On restart, if both files exist, Redis prefers the AOF (more complete). If only RDB exists, it loads that. The typical production setup: appendonly yes with appendfsync everysec, plus RDB saves every 5–15 minutes as a backup/fast-restore option.

Cache-only deployments: If Redis is purely a cache — you can rebuild all data from a source of truth (your main database) — then disabling both AOF and RDB is completely fine. You gain a small write performance boost and eliminate disk I/O entirely. A cache miss after restart just means one slow SQL query per key — acceptable. Never run persistence-off mode if Redis holds data that cannot be rebuilt, like sessions or rate-limit counters, without a plan for what happens when those disappear on restart.

Redis persists via AOF (append every write command to a log — near-zero data loss with everysec) and RDB (periodic binary snapshots via fork — fast restart, loses minutes of writes). Production typically uses both; cache-only deployments can safely disable both.

Section 7

Replication & High Availability — Never Lose Your Cache

A single Redis node is blazing fast — but it is also a single point of failure. If that one server goes down, every cache miss hits your database at once (a thundering herd), sessions vanish, rate-limit counters reset. For any production workload that matters, you need a backup plan. Redis's answer is replication: one primary node that accepts writes, plus one or more replica nodes that copy those writes and can serve reads.

How replication works — plain English first

Think of the primary as the "source of truth" notebook. Replicas are photocopies that automatically stay in sync. When your app writes SET user:42:name "Alice", the primary executes it instantly and — asynchronously, a tiny fraction of a second later — ships that same command to every replica. The replica applies it to its own in-memory state. Reads can go to any replica; writes always go to the primary.

The word asynchronously is critical. The primary does not wait for replicas to confirm before acknowledging the write to your application. This means in the normal case replication is essentially instantaneous (typically under 100 ms on a healthy LAN), but during a crash there may be a small window of commands that reached the primary but never made it to replicas — those writes can be lost.
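
The loss window is easier to see in a toy model. The sketch below assumes a single replica and an explicit `ship` step standing in for the background replication stream; `Primary` and `Replica` are hypothetical classes, not a description of the real replication protocol.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    def __init__(self, replica):
        self.data = {}
        self.replica = replica
        self.in_flight = deque()   # acknowledged to the client, not yet on the replica

    def set(self, key, value):
        self.data[key] = value
        self.in_flight.append((key, value))
        return "OK"                # the client gets its reply before the replica sees the write

    def ship(self, max_commands=None):
        """Deliver queued writes to the replica (the asynchronous step)."""
        shipped = 0
        while self.in_flight and (max_commands is None or shipped < max_commands):
            self.replica.apply(*self.in_flight.popleft())
            shipped += 1
        return shipped


replica = Replica()
primary = Primary(replica)
for i in range(3):
    primary.set(f"order:{i}", "paid")

primary.ship(max_commands=2)       # the primary crashes before the third write ships
lost = sorted(set(primary.data) - set(replica.data))
print(lost)                        # ['order:2']
```

If the replica is promoted at this point, the client that wrote `order:2` holds an "OK" for a write the new primary has never seen.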

Three things you must know about Redis replication

Async by default — understand the trade-off. The primary does not wait for replicas to confirm a write. In practice this means replication lag on a healthy cluster is typically under 100 ms. But on primary failure, the replica that gets promoted may be missing the last few commands. This is called async data loss risk. For a cache, losing a few keys is fine — you just refetch from the database. For session data or rate-limit counters, think carefully.

The WAIT command: synchronous opt-in. If you need stronger guarantees on a specific write, Redis lets you follow it with WAIT <numreplicas> <timeout_ms>. This blocks until at least N replicas have acknowledged the writes sent so far on that connection (or the timeout expires) and returns the number of replicas that acknowledged. It makes losing that write much less likely, but it does not turn Redis into a strongly consistent store. Teams use this for critical writes like financial ledger updates while keeping normal cache writes asynchronous.

Replica reads = eventual consistency. Replicas serve the same data as the primary — eventually. Between the write landing on the primary and replicating to a replica, a client reading from that replica will see stale data. For cache workloads this is usually acceptable. For anything requiring read-your-own-writes (e.g., a user just changed their profile and immediately reloads the page), always read from the primary or use sticky routing.

Redis replication gives you one write-accepting primary plus N read-serving replicas; replication is asynchronous (typically under 100 ms lag) which means a tiny window of data loss on failover — use WAIT on critical writes to opt into synchronous acknowledgement.

Section 8

Sentinel — Automatic Failover Without Cluster Complexity

Replication gives you copies of your data. But copies are useless if nobody promotes one when the original breaks. Redis Sentinel is the watchdog process that does exactly that: it monitors your primary and replicas, detects when the primary is unreachable, and orchestrates the promotion of a replica to become the new primary — automatically, while notifying your clients where to reconnect.

You run Sentinel as a separate process (usually three instances on separate machines). Each Sentinel monitors the primary independently. The key word is quorum: a single Sentinel saying "the primary is down" is not enough to trigger failover. A configured number of Sentinels (the quorum) must agree that the primary is unreachable, and a majority of all Sentinels must then authorize the failover. This prevents a network hiccup from one machine causing a false alarm that breaks your production cluster.
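
Sentinel applies two separate thresholds: the configured quorum decides whether the primary counts as down, and a majority of all Sentinels must be available to authorize the failover itself. A sketch of both checks, with hypothetical function names:

```python
def is_objectively_down(down_votes, quorum):
    """Enough Sentinels agree that the primary is unreachable."""
    return down_votes >= quorum


def can_authorize_failover(reachable_sentinels, total_sentinels):
    """A strict majority of all Sentinels must be able to vote."""
    return reachable_sentinels > total_sentinels // 2


# Three Sentinels, quorum of 2.
print(is_objectively_down(down_votes=1, quorum=2))    # False: one opinion is not enough
print(is_objectively_down(down_votes=2, quorum=2))    # True
print(can_authorize_failover(2, total_sentinels=3))   # True: 2 of 3 is a majority

# Five Sentinels, quorum of 2: detection succeeds, but two Sentinels
# cut off from the other three cannot carry out a failover on their own.
print(can_authorize_failover(2, total_sentinels=5))   # False
```

This is why an odd number of Sentinels on separate machines is the usual recommendation: a minority partition can notice a failure but cannot act on it.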

Sentinel vs Cluster — when to use which

Use Sentinel when: you have a single large Redis instance (or small replica set) and need automatic failover without sharding. Sentinel is simpler to operate — just three extra processes. Your data all lives on one master; replicas handle read scaling only. Most teams start here.

Sentinel's limits: no horizontal write scaling — all writes still go to one primary. Maximum memory = one machine's RAM. If you outgrow a single node, you need Redis Cluster (Section 9). Sentinel also cannot help if the primary crashes mid-write with no durable log (AOF off) — those writes are gone regardless.

Sentinel is Redis's HA watchdog for single-master deployments — three or more Sentinel processes reach quorum consensus on a primary failure, elect a leader, and promote the best replica in roughly 10–30 seconds, with client notification via pub/sub.

Section 9

Redis Cluster — Horizontal Sharding Built In

Sentinel keeps you alive when one node dies. But what if your dataset is simply too large for one machine's RAM, or your write throughput is too high for one CPU? That is where Redis Cluster comes in. It splits your data across multiple master nodes automatically — each node owns a slice of the keyspace — and each master has its own replica(s) for HA. Sharding and replication in a single system.

Hash slots — the math behind the split

Redis Cluster divides the keyspace into exactly 16,384 hash slots. When you write a key, Redis runs CRC16(key) % 16384 to determine which slot it belongs to. Slots are then assigned across master nodes. With three masters you might assign slots 0–5460 to Master A, 5461–10922 to Master B, and 10923–16383 to Master C. Every key finds exactly one node — no guessing, no coordination needed per-request.
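
The slot calculation is short enough to reproduce. The sketch below relies on Python's `binascii.crc_hqx`, which with an initial value of 0 computes the CRC-16/XMODEM variant that Redis Cluster uses; `key_slot`, `owner`, and the `RANGES` table (taken from the three-master example above) are illustrative names. Hash tags are deliberately ignored here.

```python
import binascii

TOTAL_SLOTS = 16384

# Slot ranges from the three-master example above.
RANGES = {
    "Master A": (0, 5460),
    "Master B": (5461, 10922),
    "Master C": (10923, 16383),
}


def key_slot(key: str) -> int:
    """CRC16(key) mod 16384, ignoring hash tags."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % TOTAL_SLOTS


def owner(slot: int) -> str:
    for name, (first, last) in RANGES.items():
        if first <= slot <= last:
            return name
    raise ValueError(f"slot out of range: {slot}")


slot = key_slot("foo")
print(slot, owner(slot))   # 12182 Master C
```

Because every client can run the same calculation locally, a cluster-aware client can send each command straight to the right node.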

The number 16,384 was chosen deliberately: it is large enough that slot assignment is fine-grained (you can balance evenly across many nodes), but small enough that the slot bitmap (a 2 KB structure) fits in a heartbeat gossip message between nodes, keeping cluster overhead low.

Key concepts you must understand

Hash Tags — forcing co-location

Normally each key hashes independently. But sometimes you need two keys on the same node — for example to run MGET user:42:profile user:42:cart in one round trip. Redis solves this with hash tags: if a key contains a {...} section, only the content inside the braces is hashed. So {user:42}:profile and {user:42}:cart both hash to slot CRC16("user:42") % 16384 — guaranteed same node. Use hash tags deliberately; overusing them can push too many keys onto one slot.
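
A sketch of the hash-tag rule, assuming the standard behavior: the text between the first `{` and the next `}` is hashed, provided it is not empty; otherwise the whole key is hashed. `hash_tag_part`, `key_slot`, and `same_slot` are hypothetical helper names, and `binascii.crc_hqx` with an initial value of 0 supplies the CRC-16/XMODEM checksum.

```python
import binascii

TOTAL_SLOTS = 16384


def hash_tag_part(key: bytes) -> bytes:
    """Return the part of the key that gets hashed."""
    start = key.find(b"{")
    if start == -1:
        return key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        return key                      # no closing brace, or empty "{}": hash the whole key
    return key[start + 1:end]


def key_slot(key: str) -> int:
    return binascii.crc_hqx(hash_tag_part(key.encode("utf-8")), 0) % TOTAL_SLOTS


def same_slot(*keys: str) -> bool:
    """True if a multi-key command over these keys would stay on one slot."""
    return len({key_slot(k) for k in keys}) == 1


print(same_slot("{user:42}:profile", "{user:42}:cart"))            # True
print(key_slot("{user:42}:profile") == key_slot("user:42"))        # True
```

Running a check like `same_slot` in tests for your key-naming helpers is a cheap way to catch cross-slot mistakes before they surface in production.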

Resharding — moving slots online

Adding a fourth node? Redis Cluster can migrate individual slots from existing nodes to the new one while the cluster stays live. The key-move is atomic: during migration, a key is accessible on either the source or destination (MOVED/ASK redirects tell the client which). Zero downtime resharding is one of Cluster's killer features for scaling up without a maintenance window.

Multi-key constraint. Commands like MGET key1 key2 key3, SUNION, and EVAL spanning multiple keys only work if ALL keys hash to the same slot. If they don't, Redis returns a CROSSSLOT error. Design your key naming scheme with hash tags upfront — retrofitting is painful. This is the most common Cluster gotcha.

Cluster is complex to operate. Many teams use single-master + Sentinel until they genuinely need horizontal write scaling or their dataset exceeds one machine's RAM. Do not add Cluster complexity prematurely: it makes debugging harder, and a client with a stale slot map pays an extra round-trip whenever it is redirected.

Redis Cluster shards data across masters using 16,384 hash slots (CRC16 hash); each master owns a slot range and has replica(s) for HA — use hash tags to co-locate related keys, and expect CROSSSLOT errors if you skip that step.

Section 10

Memory Management & Eviction — What Happens When Redis Is Full

Redis is in-memory — so RAM is the hard resource limit. You can configure a maximum memory usage with maxmemory 4gb. When Redis hits that limit and a new write arrives, it has to decide: refuse the write, or evict (delete) an existing key to make space. The eviction policy controls that choice. Pick the wrong policy and you get either silent data loss or production errors — both bad in different ways.

There are eight built-in policies, split along two axes: which keys are eligible (all keys, or only keys with a TTL set) and how to pick the victim (LRU — least recently used; LFU — least frequently used; random; or by remaining TTL).

The eight policies — when to use each

`allkeys-lru` — the standard cache policy

Evict the key that was accessed least recently across all keys — regardless of whether a TTL was set. This is the right default for a pure cache: you assume that recently touched keys are more likely to be needed soon (temporal locality). Most cache workloads follow this pattern, so evicted keys are usually the ones you would have chosen to drop anyway.
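To make the LRU selection rule concrete, here is a toy in-process cache that evicts the least recently used entry once it is full. It only illustrates the idea: Redis itself approximates LRU by sampling a few keys rather than keeping an exact ordering, capacity here is counted in keys instead of bytes, and `TinyLRU` is a hypothetical class, not part of Redis or any client.

```python
from collections import OrderedDict
from typing import Optional


class TinyLRU:
    """Toy exact-LRU cache; capacity is a number of keys, not bytes."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key: str, value: str) -> Optional[str]:
        """Store a value and return the evicted key, if any."""
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            evicted, _ = self._data.popitem(last=False)  # oldest entry
            return evicted
        return None
```

With a capacity of two, writing `a` and `b`, reading `a`, and then writing `c` evicts `b`: it is the key that has gone longest without being touched.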

`allkeys-lfu` — better for stable hot sets

Evict the key accessed least frequently across all keys. LFU is better when your hot data is stable — certain keys are always popular (landing page content, product catalog) while others spike briefly then go cold. LFU keeps the perpetually-popular keys alive even if they haven't been touched in the last few seconds. LRU would evict them during brief quiet periods.

`volatile-*` — evict only TTL-carrying keys

These policies only consider keys that have a TTL set (via EXPIRE or SET ... EX). Keys without a TTL are never evicted; they are effectively "pinned." Use this when your Redis holds both cache entries (which have TTLs and can be dropped) and critical state that is stored without a TTL and must survive. Two caveats: any key that carries a TTL, including sessions and locks written with an expiry, is a candidate for eviction, and if no TTL-carrying keys are left to evict, Redis behaves like noeviction and rejects writes.

`noeviction` — danger for caches

When memory is full, Redis returns an OOM error to the client and refuses the write. Nothing is deleted. This is the correct behavior if Redis holds primary data that must not be silently dropped (e.g., a message queue where lost messages are unacceptable). For a cache, this will halt your application when Redis fills up — a production incident in waiting.

Default is noeviction. Redis ships with noeviction as the default. If you deploy Redis as a cache and forget to set maxmemory-policy allkeys-lru, your cache will fill up and start returning errors instead of evicting stale entries. This is a very common production surprise. Always set the policy explicitly.

Redis eviction policies control what happens when maxmemory is hit; for caches use allkeys-lru (or allkeys-lfu for stable hot sets); noeviction is correct for primary data stores but dangerous for caches — and it is the default, so always set the policy explicitly.

Section 11

Pub/Sub & Streams — Redis as a Message Bus

Redis is not just a cache. It has two distinct messaging primitives built in, and picking the right one matters a lot. The confusion between them trips up engineers regularly, so let's be very clear about the difference before any code.

Pub/Sub is like a live TV broadcast: the channel is live, and only viewers watching right now receive the message. Miss the show? You missed it. No replay, no buffer, no history. This is fire-and-forget, and that is intentional — it is incredibly fast and requires zero storage.

Streams are like a DVR recording: every message is appended to a durable log. Consumers can read from any point in time (replay), multiple consumer groups can independently track their position, and messages survive disconnections. Think of it as a lightweight Kafka built into Redis, available since Redis 5.0 (released 2018).

Three messaging patterns and when to pick each

Pub/Sub — real-time broadcast

Use Pub/Sub when you need real-time delivery to all currently-connected clients and you are OK with offline subscribers missing messages. Good fits: live chat notifications, presence indicators ("Alice is typing"), cache invalidation signals (broadcast "key X changed" so all app servers flush their local cache), and live dashboards. The simplicity is a feature — no consumer state, no ACKs, no backpressure.

Streams (XADD / XREADGROUP) — durable queues

Use Streams when messages must not be lost even if consumers are temporarily offline. Consumer groups track each group's read position independently, so two different services (say, analytics and billing) can consume the same event stream at their own pace without interfering with each other. Individual consumers within a group get load-balanced delivery with explicit acknowledgement (XACK). Unacknowledged messages are not redelivered automatically: they stay in the group's pending entries list until the same consumer re-reads them or another consumer takes them over with XCLAIM or XAUTOCLAIM (Redis 6.2+). This is producer-consumer reliability without running Kafka.
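The read, handle, acknowledge cycle of a consumer group might look like the sketch below. It assumes `client` is a redis-py `redis.Redis` instance created with `decode_responses=True` and the default RESP2 reply format; the helper names `ensure_group` and `consume_once` are hypothetical, and error handling is deliberately minimal.

```python
from typing import Callable, Dict


def ensure_group(client, stream: str, group: str) -> None:
    """Create the consumer group (and the stream) if it does not exist yet."""
    try:
        # id="0" makes the group start from the beginning of the stream.
        client.xgroup_create(stream, group, id="0", mkstream=True)
    except Exception as exc:  # redis-py reports an existing group as BUSYGROUP
        if "BUSYGROUP" not in str(exc):
            raise


def consume_once(client, stream: str, group: str, consumer: str,
                 handler: Callable[[Dict[str, str]], None],
                 count: int = 10) -> int:
    """Read up to `count` new messages, handle each one, then XACK it."""
    # ">" asks for messages never delivered to any consumer in this group.
    batches = client.xreadgroup(group, consumer, {stream: ">"}, count=count)
    handled = 0
    for _stream_name, messages in batches or []:
        for message_id, fields in messages:
            handler(fields)
            # Until XACK is sent, the entry stays in the group's pending list.
            client.xack(stream, group, message_id)
            handled += 1
    return handled
```

If `handler` raises, the message is never acknowledged and remains pending, which is the hook a recovery job would use to retry it.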

Sorted-set queue — priority queue pattern

Before Streams existed, teams used sorted sets as priority queues: ZADD queue <priority_score> <job_id> to enqueue, ZPOPMIN queue (Redis 5.0+) to dequeue. ZPOPMIN removes the member with the lowest score, so this works when a lower score means a higher priority (use ZPOPMAX if you score the other way round). This still works and is perfectly valid for simple priority queues. The trade-off versus Streams: sorted sets have no consumer groups, no ACK mechanism, and no replay, but they are simpler and the sorted-by-priority behaviour is built in.

Historical context: Streams were added in Redis 5.0 (2018). Before that, "Redis as a queue" meant using List operations (LPUSH + BRPOP) or Pub/Sub — both of which have significant limitations for durable messaging. If you read old Redis tutorials showing list-based queues, that is why. Streams are now the recommended pattern for anything requiring delivery guarantees.

Redis Pub/Sub is fire-and-forget broadcast (zero persistence, offline subscribers miss messages); Redis Streams are a durable append-only log with consumer groups, independent offsets, and replay — use Pub/Sub for live notifications and Streams when messages cannot be lost.

Section 12

Lua Scripting & Transactions — Atomic Operations in Redis

Here is something surprising about Redis: it is fundamentally single-threaded for command execution. One command runs to completion, then the next starts. No two commands ever interleave inside Redis. This sounds like a weakness — but it is actually a superpower for atomic operations, because it means a Lua script you send to Redis runs entirely without interruption. No other client can sneak a command in between your script's steps.

Why does this matter? Imagine you need to "increment a counter, but only if it is below 100 — and do it safely even with 10,000 concurrent clients." With separate GET then SET commands there is a race condition: two clients could both read 99, both decide to increment, and both write 100. With a Lua script, that read-check-write is a single atomic unit. Nobody else touches the key while it runs.

Three patterns for atomic Redis operations

Send a Lua script inline with EVAL. The first argument is the script source, then the number of keys, then the key names, then any extra arguments. The script runs atomically on the server: the whole read-check-write sequence costs one round trip and leaves no window for a race condition.

increment-with-cap.redis

-- Increment counter only if below cap.
-- KEYS[1] = counter key, ARGV[1] = cap value
EVAL "
  local current = tonumber(redis.call('GET', KEYS[1]) or 0)
  if current < tonumber(ARGV[1]) then
    return redis.call('INCR', KEYS[1])
  else
    return current
  end
" 1 rate:user:42 100

The entire get-check-increment runs as one uninterruptible unit. Ten thousand concurrent clients all running this script will each see a consistent state — no two can simultaneously read the same "current" value and both decide to increment past the cap.

MULTI/EXEC is Redis's explicit transaction syntax. Commands between MULTI and EXEC are queued, then executed as a batch. Pair it with WATCH for optimistic locking: if any watched key changes between WATCH and EXEC, the entire transaction is aborted (returns nil) — your code can retry.

transfer-with-watch.redis

-- Optimistic transfer: debit A, credit B.
-- WATCH lets us detect if balance changed under us.

WATCH balance:alice
GET balance:alice           -- read current balance: 200

MULTI
  DECRBY balance:alice 50   -- queued (not run yet)
  INCRBY balance:bob 50     -- queued (not run yet)
EXEC
-- If balance:alice changed between WATCH and EXEC → returns nil (retry)
-- Otherwise → both commands execute atomically

Important caveat: with MULTI/EXEC, if one queued command fails at runtime (e.g., calling INCR on a non-numeric string), the other commands in the batch still execute; there is no rollback. The same is true of Lua: writes made before a script error are not undone. WATCH + EXEC gives you optimistic concurrency, not full ACID transactions.
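From application code, the WATCH / MULTI / EXEC retry loop is usually wrapped by the client library. The sketch below assumes `client` is a redis-py `redis.Redis` instance and uses its `transaction()` helper, which watches the given keys, runs the callback, and retries if a watched key changed; the function name `transfer` and the key names are illustrative only.

```python
def transfer(client, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst; return False if funds are insufficient."""

    def attempt(pipe) -> bool:
        # While the pipeline is in WATCH mode, reads execute immediately.
        balance = int(pipe.get(src) or 0)
        if balance < amount:
            return False  # nothing queued, so EXEC has nothing to run
        pipe.multi()                 # from here on, commands are queued
        pipe.decrby(src, amount)
        pipe.incrby(dst, amount)
        return True

    # value_from_callable=True returns attempt()'s result instead of EXEC's.
    return client.transaction(attempt, src, value_from_callable=True)
```

A call such as `transfer(client, "balance:alice", "balance:bob", 50)` either applies both updates or, if `balance:alice` was modified concurrently, reruns the callback against the fresh balance.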

Redis Functions (added in 7.0) are server-stored, named scripts. Instead of sending script source code with every request, you register a library once and call it by name. Repeat calls are cheaper because no source travels over the wire, and the library is persisted with the dataset and replicated to replicas automatically.

-- Step 1: Register the library (once, at deploy time)
FUNCTION LOAD "#!lua name=mylib\n
  redis.register_function('incr_cap', function(keys, args)
    local current = tonumber(redis.call('GET', keys[1]) or 0)
    if current < tonumber(args[1]) then
      return redis.call('INCR', keys[1])
    end
    return current
  end)
"

-- Step 2: Call it by name — no script source sent over the wire
FCALL incr_cap 1 rate:user:42 100

Redis Functions are the modern replacement for complex EVALSHA patterns. They are versioned (you can reload a library), named (readable in FUNCTION LIST), and first-class citizens in the Redis replication/persistence system.

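Long Lua scripts block everything. Because Redis is single-threaded and Lua runs atomically, a script that takes 500 ms of CPU time will freeze all other clients for 500 ms. Keep scripts under 50 ms wall time. The configuration option lua-time-limit (default 5000 ms; renamed busy-reply-threshold in Redis 7.0) does not kill a long-running script. Once the limit passes, Redis merely starts answering other clients with BUSY errors and accepts SCRIPT KILL (which only works if the script has not written anything yet) or SHUTDOWN NOSAVE. By then the damage is done. If your logic is complex, consider moving it to the application layer with optimistic retry loops rather than a monolithic Lua script.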

Redis executes Lua scripts atomically on its single-threaded server — a script runs end-to-end with no interleaving, making it the tool for read-check-write operations; use MULTI/EXEC + WATCH for optimistic transactions, and Redis Functions (7.0+) for server-stored named scripts — but keep all scripts short to avoid blocking other clients.

Section 13

Caching Patterns

Redis as a cache is the number-one use case you will encounter. But "just throw Redis in front of the database" is not a strategy — it is a plan to introduce subtle bugs. The pattern you choose determines whether your cache stays correct, how much latency it adds on writes, and what happens when the cache restarts cold.

There are four battle-tested patterns. Each one answers a slightly different question: who is responsible for filling the cache? Who decides when to write through to the database? How much latency is acceptable on writes?

Cache-Aside (Lazy Loading)

This is the most common pattern. Your application code does the heavy lifting: on a read, check Redis first; if it misses, fetch from the database and write the result back into Redis with a TTL. The cache only ever contains data that was actually requested — no wasted RAM on cold, rarely-used records.

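The risk with cache-aside is the thundering herd. On a cold start (cache flushed or server restart), every request misses at once, and with thousands of concurrent users they all hit the database simultaneously. The same thing happens on a smaller scale when many keys that were cached at the same moment expire together. Adding a short random jitter to each TTL handles that second case by spreading expirations out; a true cold start additionally calls for warming the cache or letting only one caller reload each key.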

Read-Through

Same logic as cache-aside, but the cache library handles the miss transparently. Your application code calls the library, which checks Redis, fetches from the DB on miss, writes back, and returns the value — all without the application needing to know. The benefit is cleaner application code; the trade-off is that you are now coupled to the library's behavior.
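A read-through layer can be as small as the wrapper sketched here. It assumes `client` exposes redis-py style `get` and `setex` methods and that `loader` is whatever function fetches a record from your database; `ReadThroughCache` is a made-up name for illustration, not an existing library class.

```python
import json
from typing import Any, Callable, Optional


class ReadThroughCache:
    """Hides the miss-handling logic behind a single get() call."""

    def __init__(self, client, loader: Callable[[str], Optional[Any]],
                 ttl_seconds: int = 300) -> None:
        self.client = client
        self.loader = loader
        self.ttl_seconds = ttl_seconds

    def get(self, key: str) -> Optional[Any]:
        cached = self.client.get(key)
        if cached is not None:
            return json.loads(cached)       # hit: the loader is never called
        value = self.loader(key)            # miss: fall through to the source
        if value is not None:
            self.client.setex(key, self.ttl_seconds, json.dumps(value))
        return value
```

Application code then calls only `cache.get("user:42")` and never touches Redis or the database directly, which is the coupling trade-off mentioned above.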

Write-Through

Every write goes to Redis and the database synchronously, in sequence. The cache is always up to date — there is no stale window. The cost is obvious: write latency roughly doubles because you are waiting for both systems to confirm. Use this when read-after-write consistency is more important than write throughput.

Write-Behind (Write-Back)

Writes land in Redis immediately; a background worker flushes them to the database asynchronously. This gives you the lowest possible write latency from the client's point of view. The danger is real: if Redis crashes between a write and its database flush, that data is gone. Use this only for metrics, analytics, or situations where losing a few writes is acceptable.
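The data-loss window of write-behind is easier to see in code. In this simplified, single-process sketch a plain dict stands in for Redis and `db_write` stands in for the database call; both are assumptions made for illustration, and `WriteBehindBuffer` is a hypothetical class.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class WriteBehindBuffer:
    """Accept writes immediately and persist them later in batches."""

    def __init__(self, db_write: Callable[[str, str], None]) -> None:
        self.cache: Dict[str, str] = {}               # stands in for Redis
        self.pending: Deque[Tuple[str, str]] = deque()
        self.db_write = db_write

    def write(self, key: str, value: str) -> None:
        self.cache[key] = value            # the caller returns right away
        self.pending.append((key, value))  # the database is updated later

    def flush(self, max_items: int = 100) -> int:
        """Run by a background worker; returns how many writes were persisted."""
        flushed = 0
        while self.pending and flushed < max_items:
            key, value = self.pending.popleft()
            self.db_write(key, value)
            flushed += 1
        return flushed
```

Anything still sitting in `pending` when the process or the cache dies never reaches the database, which is exactly the loss the pattern accepts in exchange for low write latency.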

cache_aside.py

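import json
from typing import Optional

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
# `db` below stands in for your application's database handle; it is not defined in this snippet.

def get_user(user_id: int) -> Optional[dict]: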
    cache_key = f"user:{user_id}"

    # Step 1 — check the cache
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)   # cache hit: fast path

    # Step 2 — cache miss: load from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if not user:
        return None

    # Step 3 — populate the cache for next time (TTL = 5 minutes)
    r.setex(cache_key, 300, json.dumps(user))
    return user

cache_aside_jitter.py

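import json
import random
from typing import Optional

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
# `db` below stands in for your application's database handle; it is not defined in this snippet.

BASE_TTL = 300  # 5 minutes base

def get_user(user_id: int) -> Optional[dict]: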
    cache_key = f"user:{user_id}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if not user:
        return None

    # Add jitter: ±30 seconds so all keys don't expire at the same instant
    # Without jitter, if 1000 keys were set at the same time, they all
    # expire together — thundering herd. Jitter spreads expiry out.
    ttl = BASE_TTL + random.randint(-30, 30)
    r.setex(cache_key, ttl, json.dumps(user))
    return user

write_through.py

import redis
import json

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
# `db` below stands in for your application's database handle; it is not defined in this snippet.

def update_user(user_id: int, data: dict) -> None:
    # Write-through: update DB first (source of truth), then cache.
    # Order matters: if the DB write fails, we don't pollute the cache.
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)

    cache_key = f"user:{user_id}"
    r.setex(cache_key, 300, json.dumps(data))
    # Now both DB and cache have the fresh value.
    # Reads will hit the cache and always see up-to-date data.

Cache invalidation is notoriously hard. Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. The realistic answers are: use TTLs aggressively (accept slight staleness), design cache keys so a user update invalidates exactly the right keys, and prefer write-through for data that absolutely must be consistent.

Cache-aside is the simplest and most common pattern; add TTL jitter so keys cached together do not expire together and stampede the database.
Write-through keeps the cache fresh at the cost of doubled write latency; good for read-heavy, consistency-sensitive data.
Write-behind gives the lowest write latency but risks data loss on crash — use only for tolerable-loss workloads.
Cache invalidation is hard; TTLs plus careful key design are the practical answer in most production systems.

Section 14

Common Use Cases

Caching is how most people first meet Redis, but it is only one of many jobs Redis does well; this section covers eight more. Once you understand the data structures, you start seeing Redis-shaped holes everywhere in your architecture: places where a small, fast, in-memory operation can replace an expensive database round-trip or a complex piece of application logic.

Sessions

When a user logs in, your server creates a session: a small blob of data (user ID, roles, preferences) with a unique random token. Store that token as the Redis key, the JSON blob as the value, with a TTL matching your session expiry policy. Every subsequent request just does a single GET session:<token> — one RAM lookup instead of a database query.

SET session:abc123 '{"user":42,"role":"admin"}' EX 1800

Rate Limiting

To limit a user to 100 API calls per minute, increment a counter on every request and set a 60-second expiry on the first increment. If the counter exceeds 100, reject the request. The key trick is doing both INCR and EXPIRE atomically in a Lua script. If they are sent as two separate commands and the client dies in between, the counter never gets an expiry and lives forever.

-- Lua script (atomic): increment + set expiry only on first call
local count = redis.call('INCR', KEYS[1])
if count == 1 then redis.call('EXPIRE', KEYS[1], 60) end
return count
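Calling that script from Python could look like the following sketch. It assumes `client` is a redis-py `redis.Redis` instance (its `eval(script, numkeys, *keys_and_args)` method is used), passes the window length as ARGV[1] instead of hard-coding 60, and uses a hypothetical key layout `rate:user:<id>`.

```python
RATE_LIMIT_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end
return count
"""


def allow_request(client, user_id: str, limit: int = 100,
                  window_seconds: int = 60) -> bool:
    """Fixed-window limiter: True while the caller is within the limit."""
    key = f"rate:user:{user_id}"
    # One round trip: increment and (on the first hit) set the expiry.
    count = client.eval(RATE_LIMIT_LUA, 1, key, window_seconds)
    return int(count) <= limit
```

This is a fixed window, so a burst straddling two windows can briefly exceed the nominal rate; that is a property of the approach, not of the script.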

Distributed Locks

To prevent two servers from processing the same job simultaneously, use SET lock:job:42 <unique-id> NX EX 30. The NX flag means "only set if the key does not exist": if another server got the lock first, the command returns nil instead of OK and your server backs off. The EX 30 ensures the lock auto-releases even if the lock holder crashes. See Section 15 for the deeper nuances.

Leaderboards

A sorted set stores members with a floating-point score. When a player earns points, call ZINCRBY leaderboard 50 "player:42" and Redis atomically adds 50 to their current score. To get the top 10, call ZRANGE leaderboard 0 9 REV WITHSCORES (the REV option requires Redis 6.2+; on older versions use ZREVRANGE leaderboard 0 9 WITHSCORES). Redis maintains the sorted order internally using a skip list, so rank queries are O(log N) even with millions of players.

Real-Time Analytics

Counting unique visitors to a page sounds simple, but storing a full set of user IDs uses memory proportional to the number of users. HyperLogLog solves this: it estimates unique counts with ~0.81% error using only 12 KB of RAM regardless of cardinality. For "did user X visit today?" questions, a bitmap is even faster — one bit per user ID, so 10 million users costs only 1.25 MB.

PFADD page:views:2025-05-09 user:42 user:99 user:7
PFCOUNT page:views:2025-05-09   -- returns ~3 (approx unique count)
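The bitmap figure quoted above is plain arithmetic: 10,000,000 bits / 8 = 1,250,000 bytes, or 1.25 MB. The toy class below reproduces the idea with a Python `bytearray`; it mirrors what SETBIT, GETBIT, and BITCOUNT do conceptually, but the bit layout is illustrative and `DailyVisitBitmap` is a hypothetical name.

```python
class DailyVisitBitmap:
    """One bit per user ID: 1 means the user visited today."""

    def __init__(self, num_users: int) -> None:
        self.bits = bytearray((num_users + 7) // 8)  # round up to whole bytes

    def mark(self, user_id: int) -> None:
        self.bits[user_id // 8] |= 1 << (user_id % 8)

    def visited(self, user_id: int) -> bool:
        return bool(self.bits[user_id // 8] & (1 << (user_id % 8)))

    def count(self) -> int:
        """Number of distinct users marked (exact, unlike HyperLogLog)."""
        return sum(bin(byte).count("1") for byte in self.bits)
```

Marking the same user twice does not change the count, which is what makes a bitmap suitable for "did user X visit today?" questions.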

Pub/Sub

Pub/Sub lets one publisher broadcast messages to many subscribers instantly. Think of it as a chat room: publishers post to a channel; all subscribers receive the message in real time. Redis Pub/Sub does not persist messages — if a subscriber is offline, it misses the message. For durable message delivery, use Redis Streams instead (see Job Queues below).

Job Queues

A simple queue uses a list: producers push jobs with LPUSH queue:email <job-json>; consumers pop with BRPOP queue:email 0 (blocking pop — waits if the queue is empty). For more sophisticated needs — acknowledgements, consumer groups, replay — use Redis Streams, which work like a persistent append-only log with consumer group semantics modeled after Kafka.

Geospatial

Redis encodes latitude/longitude as a 52-bit geohash stored in a sorted set. GEOADD locations 13.361 38.115 "Palermo" adds a point (longitude first, then latitude); GEORADIUSBYMEMBER locations "Palermo" 200 km returns all points within 200 km. Since Redis 6.2 that command is deprecated in favor of GEOSEARCH locations FROMMEMBER "Palermo" BYRADIUS 200 km. This is not a replacement for PostGIS on complex geo queries, but for simple "find nearby" lookups it is orders of magnitude faster because everything stays in RAM.

Sessions, rate limiting, and leaderboards are the three use cases most teams reach for first after caching.
HyperLogLog and bitmaps trade a tiny amount of accuracy for dramatic RAM savings in analytics workloads.
Redis Pub/Sub is fire-and-forget; use Redis Streams when you need durability or consumer group semantics.
Geospatial commands are a convenient built-in, but PostGIS is still the right choice for complex polygon and shape queries.

Section 15

Distributed Locks: Why It's Hard

A lock prevents two processes from doing the same thing at the same time. In a single program, this is solved with a mutex — a language primitive backed by the operating system. In a distributed system spanning multiple servers, there is no shared memory, no OS mutex. You have to coordinate through a network, and networks are unreliable.

The naive Redis approach, SET lock <unique-value> NX EX 30, works surprisingly well for many everyday tasks like "only one process should run this cron job." But once the stakes rise (financial operations, idempotency guarantees), three subtle failure modes appear: the TTL can expire before the work finishes, a failover can lose the lock, and clocks can drift across nodes.

Single-Node SET NX EX

For a single-master Redis setup, SET lock:resource <unique-random-value> NX EX 30 is correct and simple. The unique value is critical: when releasing the lock, you must verify that the value in Redis matches your value before deleting — otherwise you might delete someone else's lock if your TTL expired. This check-and-delete must be done atomically in a Lua script.

-- Release lock safely (Lua — atomic check-and-delete)
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0   -- someone else's lock, don't touch it
end
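Put together, acquire and safe release might be wrapped as in this sketch. It assumes `client` is a redis-py `redis.Redis` instance (using `set(..., nx=True, ex=...)` and `eval`); the function names and the `lock:<resource>` key layout are illustrative, not a library API.

```python
import uuid
from typing import Optional

RELEASE_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""


def acquire_lock(client, resource: str, ttl_seconds: int = 30) -> Optional[str]:
    """Try once to take the lock; return the owner token, or None if it is held."""
    token = uuid.uuid4().hex  # unique value proving ownership
    # SET key value NX EX ttl succeeds only if the key does not exist yet.
    if client.set(f"lock:{resource}", token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_lock(client, resource: str, token: str) -> bool:
    """Delete the lock only if we still own it (atomic check-and-delete)."""
    return client.eval(RELEASE_LUA, 1, f"lock:{resource}", token) == 1
```

The caller keeps the returned token for the duration of the work and passes it back to `release_lock`; a `False` result means the TTL expired and someone else may now hold the lock.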

Redlock

Proposed by Antirez (Redis's creator), Redlock uses N independent Redis nodes (typically 5). A lock is considered acquired only if you successfully set it on a majority (at least 3 of 5) within a short time window. This protects against a single Redis node failing — even if two nodes crash, you still have 3, and you can always get a majority. However, distributed systems researcher Martin Kleppmann argued that Redlock is still unsafe against process pauses and clock drift. The debate is worth reading if this matters to you.

Fencing Tokens

A fencing token is a monotonically increasing number attached to each lock grant (lock #1 gets token 1, lock #2 gets token 2, and so on). The downstream resource (database, file system) only accepts requests with a token greater than the last token it saw. This means even if two processes simultaneously think they hold the lock, the downstream resource will accept only the newer one. This is Kleppmann's recommended approach for truly safe locking.
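The enforcement has to live in the downstream resource, not in the lock service. A minimal sketch of that check, following the rule as stated above (accept only a token greater than the last one seen), could look like this; `FencedResource` is a hypothetical stand-in for a database or file store.

```python
class FencedResource:
    """Downstream store that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.last_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token <= self.last_token:
            return False          # stale holder: its lock has been superseded
        self.last_token = token
        self.value = value
        return True
```

If client A is paused while holding token 33 and client B then writes with token 34, A's delayed write with token 33 is rejected. A real system that lets one holder issue several writes under the same grant would compare with "less than" rather than "less than or equal".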

Practical guidance: For most use cases — cron job deduplication, preventing duplicate email sends, rate limiting — single-node SET NX EX with safe release is more than sufficient. Only reach for Redlock or fencing tokens when the cost of running the same operation twice is catastrophic (financial transactions, idempotency in payment flows). When correctness is that critical, also evaluate etcd or ZooKeeper, which are built specifically for consensus.

The naive SET NX EX approach is correct for single-master Redis with proper safe-release (Lua check-and-delete).
Three failure modes threaten naive locks: TTL shorter than actual work, node failover losing the lock, and clock drift across nodes.
Redlock uses a quorum of 5 independent Redis nodes but remains debated for high-stakes use cases.
Fencing tokens (monotonic counter + downstream enforcement) are the safest general solution; etcd/ZooKeeper are the right tool if you need consensus at the infrastructure level.

Section 16

Performance & Throughput

Redis is famously fast, but "fast" needs a number attached to it. A single Redis instance typically handles somewhere in the range of 50,000 to 200,000 operations per second on commodity hardware. That wide range reflects real variables: command type (O(1) GET vs O(N) LRANGE), payload size, network topology, and whether you are using pipelining.

Understanding why Redis is fast also tells you exactly where it can become slow — and how to avoid those traps.

Pipelining

By default, every Redis command requires one network round-trip. If you need to issue 1,000 commands and your network latency is 0.5 ms, that is 500 ms of pure waiting — nothing to do with Redis speed. Pipelining lets you send many commands in a single network packet and receive all the responses together. The server processes them sequentially but you pay only one round-trip. Typical improvement is 5–10× throughput for workloads with many small commands.
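The round-trip arithmetic above, and the way a pipeline is typically used from Python, are sketched below. `network_wait_ms` is just the calculation from the paragraph; `set_many` assumes `client` is a redis-py `redis.Redis` instance and uses `pipeline(transaction=False)` so the batch is sent together without MULTI/EXEC. Both function names are made up for this example.

```python
import math
from typing import Dict, List


def network_wait_ms(num_commands: int, rtt_ms: float, batch_size: int = 1) -> float:
    """Time spent waiting on the network, ignoring server processing time."""
    round_trips = math.ceil(num_commands / batch_size)
    return round_trips * rtt_ms


def set_many(client, items: Dict[str, str]) -> List[bool]:
    """Send all SET commands in one batch and collect the replies together."""
    pipe = client.pipeline(transaction=False)  # plain pipelining, not atomic
    for key, value in items.items():
        pipe.set(key, value)                   # buffered locally, not sent yet
    return pipe.execute()                      # one flush, one list of replies
```

With 1,000 commands and a 0.5 ms round trip, `network_wait_ms(1000, 0.5)` gives 500 ms, while `network_wait_ms(1000, 0.5, batch_size=1000)` gives 0.5 ms.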

Note: commands in a pipeline are not atomic. If you need atomicity, use a Lua script or a MULTI/EXEC transaction instead.

Hash-Tag Co-location (Cluster)

In Redis Cluster, keys are spread over 16,384 hash slots that are divided among the master nodes. A command like MGET user:1 user:2 fails with a CROSSSLOT error if those keys hash to different slots. Hash tags solve this: when a key contains a {...} section, only the text inside the braces is used for slot assignment. user:{1}:profile and user:{1}:settings are guaranteed to share the same slot, so multi-key operations work. Use this for any keys that must be operated on together.

Avoid O(N) Commands

Redis runs on a single event loop. One blocking command blocks every other client. The most dangerous command is KEYS * — it scans every key in the database and blocks for hundreds of milliseconds on a large instance. Always use SCAN instead: it iterates in small batches, returning a cursor. Other O(N) traps include LRANGE on huge lists, SMEMBERS on huge sets, and large Lua scripts with loops.
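The cursor loop behind SCAN can be sketched as follows, assuming `client` is a redis-py `redis.Redis` instance whose `scan(cursor=..., match=..., count=...)` call returns a `(next_cursor, keys)` pair. `iter_keys` is a hypothetical helper; redis-py also ships a ready-made `scan_iter` that does the same job.

```python
from typing import Iterator


def iter_keys(client, pattern: str = "*", batch_hint: int = 100) -> Iterator[str]:
    """Yield matching keys in small batches instead of one blocking KEYS call."""
    cursor = 0
    while True:
        # COUNT is only a hint for how much work the server does per call.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_hint)
        yield from keys
        if cursor == 0:   # the server returns cursor 0 when the scan is complete
            break
```

Because other clients run between calls, SCAN may return a key more than once and can miss keys added mid-scan, so consumers should tolerate duplicates.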

maxmemory Policy

When Redis reaches its memory limit, it must decide what to evict. The right policy depends on your use case. For a pure cache where any key can be regenerated from the database, allkeys-lru (evict the least recently used key from all keys) is the right choice. For a data store where some keys must never be evicted, use volatile-lru (evict only keys with a TTL set). Leaving the default noeviction policy in place for a cache workload means Redis will return errors when full, which is almost always the wrong behavior.

Connection Pooling

Opening a new TCP connection to Redis for every operation is expensive — each connection costs time and a file descriptor. Redis client libraries handle this with connection pools: a fixed set of persistent connections shared across your application threads. Tune the pool size to match your concurrency: too small causes queuing; too large wastes file descriptors. A good starting point for most web servers is 10–50 connections.

Redis is single-threaded for command execution. This is why it is predictable and avoids locking overhead. But it also means ONE slow command blocks ALL clients. The culprits to watch: KEYS * on large databases (use SCAN), deleting a list with millions of items (use UNLINK, which is async), and Lua scripts that loop over large data sets. Monitor the slowlog in production — Redis logs any command exceeding a configurable threshold.

A single Redis instance handles roughly 50K–200K ops/sec; pipelining can push this 5–10× higher by amortizing network round-trips.
Never use KEYS * in production — it blocks the server. Use SCAN for safe iterative key enumeration.
Set maxmemory and choose allkeys-lru for cache workloads so Redis evicts gracefully rather than returning errors.
Redis is single-threaded: one slow command (large Lua, big DEL) blocks all clients. Use UNLINK for async deletion and watch the slowlog.

Section 17

Redis vs Valkey vs KeyDB vs Dragonfly

For most of its life, Redis was a beloved open-source project under the permissive BSD license. In March 2024, Redis Inc. changed the license to a dual SSPL/RSALv2 model. The practical effect: cloud providers can no longer offer a managed Redis service without commercial terms. The response from the broader community was swift — a fork of the last BSD-licensed version (7.2.4) was announced under the Linux Foundation banner and named Valkey.

This is not just a legal footnote. It changes which software your cloud provider's managed service runs, which version gets community bug fixes, and what the long-term maintenance path looks like for your infrastructure.

Redis (Post-2024)

Redis Inc. continues to develop Redis under the SSPL/RSALv2 dual license introduced in March 2024; starting with Redis 8.0 (2025), AGPLv3 is offered as a third option. If you run Redis on your own servers, the license change likely does not affect you directly, because the new terms primarily restrict cloud providers from reselling it as a managed service. Redis Inc. offers commercial cloud hosting and enterprise features. If your team is on a managed cloud service, your provider likely switched to Valkey.

Valkey

Valkey is a BSD-licensed fork from Redis 7.2.4, governed by the Linux Foundation. It is designed to be a drop-in replacement — your existing Redis clients, commands, and configuration files work without modification. As of 2026, Valkey is the managed service that AWS (ElastiCache), Google Cloud (Memorystore), and Oracle offer. It has strong community momentum and active development from major cloud contributors.

KeyDB

KeyDB is an earlier multi-threaded fork of Redis. It was originally created in 2019 by John Sully and Ben Schermel at EQ Alpha Technology, then acquired by Snap Inc. in 2022, which open-sourced the previously enterprise-only codebase under the permissive BSD 3-clause license. KeyDB broke Redis's single-threaded model to handle higher concurrency on multi-core machines. Today KeyDB is still developed under Snapchat's GitHub organization and remains BSD-licensed — but momentum has largely shifted to Valkey for new open-source Redis-compatible projects.

Dragonfly

Dragonfly is not a fork of Redis — it is a ground-up reimplementation in modern C++, built around a multi-threaded, shared-nothing architecture. Its benchmarks claim dramatically higher single-instance throughput compared to Redis (the "25× faster" figure comes from their own tests and should be taken with appropriate skepticism). It supports the Redis protocol, so most Redis clients work without changes. It is a legitimate option for teams whose bottleneck is single-instance Redis throughput, though it is less battle-tested in production than Redis or Valkey.

Practical recommendation (as of 2026): For new projects on managed cloud infrastructure, use Valkey — it is what the major cloud providers ship by default and it has BSD licensing. For self-hosted setups where you want official support and commercial features, Redis Inc.'s offering may suit you. Dragonfly is worth evaluating only if single-instance throughput is a demonstrated bottleneck. The landscape is moving fast; check your cloud provider's latest documentation before committing.

In March 2024 Redis moved from BSD to SSPL/RSALv2, preventing cloud providers from offering managed Redis without commercial agreements.
Valkey (Linux Foundation, BSD) forked from Redis 7.2.4 and is now the default on AWS, Google Cloud, and Oracle managed services.
KeyDB (multi-threaded fork from 2019, acquired by Snap in 2022) remains BSD-licensed under Snap; Valkey now has more community momentum for new projects.
Dragonfly is a multi-threaded C++ reimplementation with Redis-compatible protocol — worth evaluating for throughput-bound workloads, less proven in production.

Section 18

Operational Considerations

Running Redis in a proof-of-concept is trivial: redis-server and you are done. Running Redis in production is a different story. Production Redis needs persistence tuned to your durability requirements, monitoring so you know when something is wrong before your users do, security so that it is not an open door on your network, a backup strategy, capacity headroom, and a plan for cluster operations.

None of these are exotic requirements — they apply to any stateful service. But Redis has a few specific characteristics that make each one worth understanding explicitly.

Persistence Config

Redis offers two persistence mechanisms. RDB (snapshot) writes a point-in-time snapshot to disk on a schedule — fast to create, fast to load on restart, but you can lose up to minutes of data. AOF (append-only file) logs every write command — can be configured to fsync every second (at most 1 second of data loss) or on every command (no data loss, higher write overhead). For production systems that store real data, enable both: RDB for fast restarts, AOF for durability. For pure cache deployments where losing all data is fine, disable both and save the I/O.

Monitoring

The INFO command returns a comprehensive snapshot of Redis internals — memory usage, ops/sec, keyspace stats, replication status — in a few milliseconds. In production, run a Prometheus exporter (redis_exporter is the standard one) that polls INFO and exposes metrics for Grafana dashboards. The MONITOR command streams every command in real time — useful for debugging but dangerous in production since it roughly halves throughput. The SLOWLOG records commands that exceeded a threshold and is the right tool for performance investigation.

Security

Redis was designed for trusted internal networks — it has no encryption in transit by default and historically had no authentication. The first line of defense is binding Redis to a private IP and never exposing port 6379 to the public internet. Add a password with requirepass in redis.conf. For more serious setups, Redis 6.0 introduced ACL (access control lists): named users, per-user command allow-lists, and key pattern restrictions. This lets you give your cache service a user that can only GET and SET, not CONFIG SET or FLUSHDB.
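
To make the ACL idea concrete, here is a minimal sketch that builds an ACL SETUSER rule for a cache-only user. The user name cache-svc, the password placeholder, and the cache:* key pattern are all hypothetical; the helper only assembles the command arguments, so it runs without a server.

```python
def acl_setuser_args(username, password, key_pattern, commands):
    """Build the argument list for ACL SETUSER (Redis 6.0+)."""
    # "on" enables the user, ">pw" adds a password, "~pattern" limits keys,
    # and each "+cmd" allows one command. A new user starts with no permissions.
    args = ["ACL", "SETUSER", username, "on", f">{password}", f"~{key_pattern}"]
    args += [f"+{cmd.lower()}" for cmd in commands]
    return args


# Hypothetical user for a cache service: GET and SET only, on cache:* keys.
args = acl_setuser_args("cache-svc", "change-me", "cache:*", ["GET", "SET"])
print(" ".join(args))

# With a redis-py client `r`, the same rule could be applied with:
#     r.execute_command(*args)
```

Because the user is granted only +get and +set, commands such as CONFIG SET or FLUSHDB are rejected for that connection.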

Backups

RDB snapshot files are self-contained and portable — they are the natural backup artifact. Configure a cron job (or your managed service's built-in backup) to copy RDB files to S3 or equivalent object storage regularly. AOF files grow unboundedly; use BGREWRITEAOF (or configure auto-aof-rewrite-percentage) to compact the AOF in the background periodically. Test your backups by restoring them to a separate instance — a backup you have never tested is not a backup.

Capacity Planning

Redis's memory usage is roughly: (key size + value size + per-key overhead of ~50–100 bytes) × number of keys. The per-key overhead seems small but adds up: at 50–100 bytes per key, a million tiny keys carry roughly 50–100 MB of overhead alone, and a hundred million carry 5–10 GB. Always set maxmemory to about 70–75% of your total RAM, leaving 25–30% headroom for replication buffers, AOF rewrites, and OS use. Never run Redis in a container without a maxmemory limit; it will happily consume all available RAM until the OOM killer terminates it.
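
The sizing formula above is easy to turn into a back-of-the-envelope calculator. This is a rough sketch: the 75-byte default overhead is simply the midpoint of the 50–100 byte range, and the key and value sizes in the example are made up.

```python
def estimate_memory_bytes(num_keys, avg_key_bytes, avg_value_bytes, overhead_bytes=75):
    """Rough estimate: (key + value + per-key overhead) x number of keys."""
    return num_keys * (avg_key_bytes + avg_value_bytes + overhead_bytes)


def suggested_maxmemory(total_ram_bytes, headroom=0.25):
    """Leave `headroom` of RAM free for replication buffers, AOF rewrites, and the OS."""
    return int(total_ram_bytes * (1 - headroom))


GIB = 1024 ** 3

# Example: 10 million keys with 20-byte names and 100-byte values.
dataset = estimate_memory_bytes(10_000_000, 20, 100)
print(f"estimated dataset: {dataset / GIB:.2f} GiB")
print(f"maxmemory on a 16 GiB host: {suggested_maxmemory(16 * GIB) / GIB:.0f} GiB")
```

Treat the result as a starting point and confirm it against INFO memory on a realistic data sample, since actual overhead depends on data types and encodings.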

Cluster Operations

Redis Cluster distributes data across shards (each shard is a primary + replicas). Adding a new node involves joining the cluster and triggering a slot migration — Redis moves key-value pairs from existing nodes to the new one, one slot at a time, without downtime. Use redis-cli --cluster rebalance to redistribute slots evenly. Regularly test failover: take down a primary and verify that a replica is promoted and clients reconnect within your SLA window.

Pre-6.0 Redis had only requirepass — a single shared password for all clients. Redis 6.0 added ACL with named users, per-user command restrictions, and key pattern filters. If your Redis version predates 6.0, you have no fine-grained access control. For any non-trivial production setup (multiple services sharing one Redis, separation of cache vs. queue namespaces), upgrade to 6.0+ and configure ACL users.

Enable both RDB and AOF for production data stores; disable both for pure caches where data loss is acceptable.
Use redis_exporter + Prometheus for monitoring; the six key metrics to watch are memory, ops/sec, hit rate, connected clients, evicted keys, and replication lag.
Never expose Redis to the public internet; use requirepass at minimum, ACL users for anything more complex.
Always set maxmemory with 25–30% headroom and test backups by actually restoring them to a separate instance.

Section 19

Tools & Clients — Your Redis Toolbox

Redis has a tight, well-organised toolbox. You will reach for the command-line client for quick checks and scripting, the official GUI when you need to understand what is inside a live server, the built-in benchmark when you want raw numbers, and a language driver when your application talks to Redis. Here is the rundown of the six you will use most.

redis-cli

The official command-line client that ships with every Redis installation. You open an interactive REPL (redis-cli), run commands one at a time, and see the results immediately; think of it as the psql equivalent for Redis. It also works non-interactively for scripting: redis-cli SET foo bar from a shell script is perfectly valid. Beyond basic commands it supports redis-cli --scan for safe key traversal, redis-cli monitor to watch every command hitting the server in real time (debugging only, given its throughput cost), and redis-cli --latency for measuring round-trip latency. Any time you are wondering "what is actually in this Redis server?", open redis-cli first.

RedisInsight

The official cross-platform GUI from Redis Inc. (free download). Connect it to any Redis instance — local, remote, or Redis Cloud — and you get a visual browser for your keyspace: see all keys, inspect their types and values, watch memory usage, and run commands in an embedded terminal with autocomplete. The built-in Profiler tab shows a live stream of incoming commands, and the Slow Log tab surfaces commands that exceeded your configured latency threshold. Particularly useful when you inherit an unfamiliar Redis instance and want to understand what is inside before writing code.

redis-benchmark

A built-in throughput and latency tester that ships with Redis. Running redis-benchmark -n 100000 -c 50 fires 100,000 requests across 50 concurrent connections and reports operations-per-second and percentile latencies for common command types (GET, SET, LPUSH, and so on). It is your first port of call for capacity planning questions — "how many ops/sec can this server handle before latency climbs?" — and for validating that a configuration change (enabling persistence, changing maxmemory-policy) had no unexpected performance impact. Run it on the same network as your application for realistic numbers; running from a remote laptop adds network RTT noise.

Official Driver Libraries

Redis maintains or endorses official clients for every major language. Java: Lettuce (reactive, thread-safe) and Jedis (synchronous, simpler). Python: redis-py, the reference implementation. Node.js: ioredis (feature-rich, supports pipelining, Cluster, Sentinel) and node-redis (published as the redis npm package). Go: go-redis. Rust: fred and redis-rs. Use the official driver for your language rather than rolling your own protocol implementation: the drivers handle RESP3 framing, connection pooling, automatic reconnection, and cluster slot routing so you don't have to. These details are genuinely subtle to get right under failure conditions.

Frameworks & Integrations

Higher-level abstractions built on top of the drivers: Spring Data Redis gives Java/Spring applications a repository abstraction and object serialization layer. Sidekiq (Ruby) uses a Redis list as a job queue — the dominant background-jobs framework in the Rails ecosystem. Bull / BullMQ (Node.js) does the same for Node, adding priorities, retries, rate limiting, and a dashboard UI. Various sorted-set ranking libraries wrap Redis sorted sets in a friendlier API for leaderboard use cases. These frameworks save significant boilerplate but add a layer between you and raw Redis features — when you hit their limits, drop down to the driver.

Managed Services

Running Redis yourself means you own backups, failover configuration, version upgrades, and security patching. Managed services handle all of that: Redis Cloud (from Redis Inc.) supports all Redis modules and has a generous free tier. Amazon ElastiCache offers both Redis-compatible and Valkey engine options on AWS infrastructure. Google Memorystore is the GCP equivalent. Azure Cache for Redis integrates tightly with Azure services. For most teams the managed route is the right default — the operational overhead saved usually outweighs the cost premium. Consider self-hosted Redis only when compliance requirements, extreme scale, or budget constraints make managed services impractical.

Run these directly in a redis-cli session. The shell is synchronous and stateful — every command returns immediately, and the connection stays open until you type QUIT. Great for exploratory work and one-off admin tasks.

# Connect to a local Redis (default port 6379)
redis-cli

# Or connect to a remote server with auth
redis-cli -h my-redis.example.com -p 6379 -a $REDIS_PASSWORD

# Basic string ops
SET user:42:name "Rafikul"          # store a string
GET user:42:name                     # => "Rafikul"
SET page:hits 0
INCR page:hits                       # atomic increment => 1
EXPIRE page:hits 86400               # expire in 24 h

# Hash (flat object)
HSET session:abc token "xyz123" user_id 42 created_at 1715000000
HGETALL session:abc                  # returns all fields and values
HGET session:abc user_id             # => "42"

# Sorted set (leaderboard)
ZADD leaderboard 1500 "alice"
ZADD leaderboard 2300 "bob"
ZADD leaderboard 1900 "carol"
ZRANGE leaderboard 0 -1 REV WITHSCORES   # highest score first (REV needs Redis 6.2+)

# Safe key traversal — NEVER use KEYS * in production
SCAN 0 MATCH "user:*" COUNT 100     # returns cursor + batch of keys
# loop: feed returned cursor back until cursor == 0

# Check server health
INFO server      # version, uptime, config
INFO memory      # used_memory, mem_fragmentation_ratio
INFO stats       # ops/sec, keyspace hits and misses

redis-py is the official synchronous Python client. Create one Redis instance per process — it manages a connection pool internally and is safe to share across threads. The async variant (redis.asyncio) drops in cleanly for async frameworks like FastAPI.

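import os
import time   # used by the rate limiter below

import redis

# One client, reused across the application. The built-in pool is effectively
# uncapped by default; pass max_connections to bound it.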
r = redis.Redis(
    host=os.environ["REDIS_HOST"],
    port=6379,
    password=os.environ.get("REDIS_PASSWORD"),
    decode_responses=True,   # return str, not bytes
)

# --- Cache pattern: get-or-set with TTL ---
def get_user_profile(user_id: int) -> dict:
    key = f"user:{user_id}:profile"
    cached = r.hgetall(key)
    if cached:
        return cached                         # cache HIT

    # cache MISS — fetch from database
    profile = db_fetch_user(user_id)          # your DB call here
    r.hset(key, mapping=profile)
    r.expire(key, 3600)                       # 1-hour TTL
    return profile

# --- Sorted set: real-time leaderboard ---
def update_score(user_id: str, delta: int):
    r.zincrby("game:scores", delta, user_id)  # atomic increment

def top_10() -> list[tuple]:
    # ZRANGE with REV + WITHSCORES returns highest-score first
    return r.zrange("game:scores", 0, 9, withscores=True, desc=True)

# --- Rate limiter: fixed window via INCR + EXPIRE ---
def is_rate_limited(ip: str, limit: int = 100, window: int = 60) -> bool:
    key = f"rate:{ip}:{int(time.time()) // window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window * 2)  # keep key slightly longer than window
    return count > limit

ioredis is the most popular Node.js client. Its built-in pipeline bundles multiple commands into one TCP round-trip — that is the key to 5–10× throughput gains over issuing commands one at a time. The pipeline sends all commands then collects all responses in one batch.

import Redis from 'ioredis';

// One client instance — reuse across the application
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  password: process.env.REDIS_PASSWORD,
  lazyConnect: true,           // don't connect until first command
  retryStrategy: times => Math.min(times * 50, 2000), // linear back-off, capped at 2 s
});

// --- Pipeline: batch 3 writes in ONE round-trip ---
// Without pipeline: 3 × RTT. With pipeline: 1 × RTT.
async function recordPageView(userId, page) {
  const pipeline = redis.pipeline();
  pipeline.incr(`page:${page}:views`);
  pipeline.lpush(`user:${userId}:history`, page);
  pipeline.ltrim(`user:${userId}:history`, 0, 99);  // keep last 100 only
  const results = await pipeline.exec();
  // results: [[null, 42], [null, 5], [null, "OK"]]  (err, value) pairs
  return results;
}

// --- Cache-aside with TTL ---
async function getCachedProduct(productId) {
  const key = `product:${productId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);         // cache HIT

  const product = await db.getProduct(productId); // DB fallback
  await redis.setex(key, 300, JSON.stringify(product)); // 5-min TTL
  return product;
}

// --- Pub/Sub: fire-and-forget notifications ---
async function publishEvent(channel, payload) {
  await redis.publish(channel, JSON.stringify(payload));
}

// Subscriber must use a SEPARATE connection; duplicate() creates one with the same options
const sub = redis.duplicate();
sub.subscribe('order:created', (err, count) => {
  if (err) console.error(err);
});
sub.on('message', (channel, message) => {
  console.log(`[${channel}]`, JSON.parse(message));
});

The Redis toolbox: redis-cli for interactive exploration and admin scripting, RedisInsight (GUI) for browsing keys and profiling live traffic, redis-benchmark for throughput and latency capacity testing, official drivers (Lettuce/Jedis, redis-py, ioredis, go-redis) for application integration, framework integrations (Spring Data Redis, Sidekiq, BullMQ) for higher-level job and queue abstractions, and managed services (Redis Cloud, ElastiCache, Memorystore, Azure Cache) for production hosting without operational overhead.

Section 20

Common Misconceptions About Redis

Redis has a reputation that is partly accurate and partly a decade-old snapshot of a very different product. The six beliefs below circulate widely in interviews and tech Twitter. Each one contains a grain of truth — which is exactly why they persist. Knowing why each one is at least partially wrong is more useful than just knowing the correct answer.

"Redis is just a NoSQL database."

Yes, technically — but calling Redis a NoSQL database is like calling a Swiss Army knife a knife. It undersells it badly. Redis is more accurately an in-memory data structure server: it speaks a custom protocol over TCP, understands eight native data types (strings, hashes, lists, sets, sorted sets, streams, bitmaps, HyperLogLog), and executes server-side operations on those structures atomically. Most NoSQL databases are optimised for durably persisting large datasets and querying them. Redis is optimised for serving the hot layer of your data stack — the data that needs to be read and updated thousands of times per second. Treating it as just another key-value store means you miss its most powerful features: sorted sets for leaderboards, streams for event sourcing, HyperLogLog for cardinality estimation, and Lua scripting for multi-step atomic operations.

"Redis loses data on crash."

Only if you choose not to enable persistence — which might be the right choice for a pure cache, but is a configuration decision, not an inherent limitation. Redis offers two durability mechanisms. RDB snapshots write a point-in-time fork of the dataset to disk periodically; you lose at most the writes since the last snapshot (configurable from minutes to seconds). AOF (Append-Only File) journals every write command; with appendfsync everysec you lose at most one second of writes; with appendfsync always you lose at most one command — a durability level comparable to a synchronous relational database. For maximum durability, enable both: AOF for low data-loss recovery, RDB for fast restart from a compact file. The "Redis loses data" reputation comes from deployments where the operator left persistence disabled because it was a cache, then later started storing data they actually cared about.

"Redis is single-threaded, so it's slow."

This conflates two very different things: the command execution model and the I/O model. Redis executes commands on a single thread, meaning commands are serialised and never run concurrently, which eliminates locking complexity and makes atomic operations trivially safe. That single thread routinely achieves 50,000–200,000+ operations per second because every operation is a RAM access, typically completing in microseconds. There is no disk I/O to wait for, no B-tree traversal, no page cache miss on the hot path. Redis 6.0 added optional multi-threaded I/O (off by default, enabled with the io-threads setting): network reads and writes can use multiple threads, so the main thread is no longer the bottleneck for socket I/O and protocol parsing. The command execution itself remains single-threaded. The practical implication: Redis is rarely CPU-bound in real-world caching workloads. It is far more often bounded by network bandwidth or memory.

"Redis Cluster solves all sharding problems."

Redis Cluster does solve horizontal write scaling — it shards your keyspace across up to 1,000 master nodes, each owning a subset of the 16,384 hash slots. But it introduces real operational complexity: cluster-aware clients (not all clients support Cluster mode), hash-tag requirements for multi-key operations, careful key design to avoid hot slots, and a non-trivial failure recovery process. For the vast majority of applications — where the working set fits comfortably in a single instance (typically under 100 GB) and throughput is under 200,000 ops/sec — a single master with one or two replicas behind Redis Sentinel is dramatically simpler to operate. Cluster is the right tool when you have genuinely outgrown a single node; it is overkill before that point. The common mistake is reaching for Cluster as a first architecture choice before any scaling problems emerge.

"TTL means the key disappears the instant it expires."

Close, but not quite accurate. Redis uses a lazy expiration strategy combined with active expiration. Lazy: when you access a key, Redis checks its TTL and deletes it on the spot if it has expired — so from your application's perspective the key is gone immediately on the next access. Active: a background task runs periodically (roughly 10 times per second) and samples random keys from the "keys with TTL" set, deleting any that have expired. This means a key can exist in memory for a brief window past its expiry time before the active expiration sweep happens to pick it up. In practice the lag is tiny (milliseconds to tens of milliseconds) and inconsequential for most use cases. The implication worth knowing: if you have millions of keys all expiring at exactly the same time (a "TTL storm"), the active expiration loop will have a lot of work to do and you may see a brief memory spike. Spread expiry times with random jitter to prevent this.

"Distributed locks with Redis are bulletproof."

The basic single-node distributed lock — SET lock:resource uuid NX EX 30 — is correct and widely used for mutual exclusion under normal failure conditions. It atomically sets a key only if it does not already exist, with a TTL to prevent stale locks. The grain of truth in the misconception is that this pattern works well in practice. The nuance: under split-brain scenarios (network partition where the node becomes isolated) or clock skew, the safety guarantee weakens.

The Redlock algorithm (using N independent Redis instances, acquiring locks on a majority) was proposed by Redis's creator to address multi-node safety — but academic distributed systems experts have raised concerns about whether Redlock's safety guarantees hold under adversarial timing conditions. For most applications, single-node SET NX EX is sufficient. For genuinely mission-critical mutual exclusion (financial transactions, inventory reservations), use fencing tokens (a monotonically increasing lock token stored alongside the protected resource) or a consensus-based system like etcd or ZooKeeper. Redis locks are pragmatically safe; they are not formally proven safe in all failure modes.
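
The single-node lock described above can be sketched in a few lines with redis-py style calls. This is a minimal sketch under assumptions, not a hardened implementation: the lock: key prefix and the function names are illustrative, `client` is assumed to be a redis-py client, and the release step uses a small Lua compare-and-delete so that one holder cannot delete a lock that has expired and been re-acquired by someone else. Fencing tokens are not shown.

```python
import uuid

# Delete the key only if it still holds our token (compare-and-delete).
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
end
return 0
"""


def acquire_lock(client, resource, ttl_seconds=30):
    """Try to take the lock; return a token on success, None if already held."""
    token = str(uuid.uuid4())
    # Equivalent to: SET lock:<resource> <token> NX EX <ttl>
    if client.set(f"lock:{resource}", token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_lock(client, resource, token):
    """Release the lock only if we still own it; return True if it was deleted."""
    return client.eval(RELEASE_SCRIPT, 1, f"lock:{resource}", token) == 1
```

The random token is what makes release safe: a plain DEL would remove whatever lock currently sits at the key, including one owned by another client.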

Six Redis misconceptions debunked: Redis is an in-memory data structure server, not just a key-value DB. Persistence with AOF + RDB can give near-zero data loss. Single-threaded command execution still hits 50–200K ops/sec because it runs in RAM, and multi-threaded I/O arrived in 6.0. Cluster is powerful but operationally complex — Sentinel + single master covers most use cases. TTL expiry has a brief lazy/active lag, not instantaneous removal. Distributed locks via SET NX EX are pragmatically solid but not formally bulletproof under adversarial failure; add fencing tokens for critical resources.

Section 21

Real-World Disasters & Lessons

These are real patterns drawn from widely-reported production failures and recurring incidents that Redis practitioners encounter repeatedly. Every one was preventable. Read them as the most concrete possible evidence that command choice, eviction policy, and replication configuration are not abstract concerns — they are the difference between a five-minute fix and a multi-hour outage.

Incident 1 — KEYS * on a 10M-Key Production Server

Incident: An on-call engineer running a routine audit typed KEYS user:* in redis-cli against the production instance. The server had roughly 10 million keys. The command ran for approximately 4 seconds while the single-threaded event loop was completely occupied scanning the entire keyspace. During those 4 seconds every client request timed out. Users saw 500 errors. The application was effectively down.

Why it happens: KEYS is O(N) where N is the total number of keys in the database. It returns all matching keys in a single call, which means it must scan every key before returning a single result. Because Redis command execution is single-threaded, no other command can run while KEYS is scanning. On a server with millions of keys, "scanning every key" takes seconds.

The fix: SCAN is the only safe way to traverse the keyspace in production. SCAN cursor MATCH pattern COUNT batch_size returns a small batch of keys and a new cursor. You loop, feeding the returned cursor back each iteration, until the cursor returns to 0. Each individual SCAN call returns quickly, interleaving with other client commands between batches. The COUNT hint tells Redis roughly how many keys to examine per call (with MATCH, a call can return fewer matches than COUNT, or none at all); 100–500 is typical for a balance of speed and server yield. Never use KEYS in any code path that runs against a production Redis instance, and restrict operator permissions if your team is large.
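
The cursor loop is short enough to write out. A minimal sketch, assuming `client` is a redis-py client (its scan() call returns the next cursor and a batch of keys); the function name is illustrative.

```python
def scan_keys(client, pattern, count=200):
    """Yield keys matching `pattern` using cursor-based SCAN instead of KEYS."""
    cursor = 0
    while True:
        # scan() returns (next_cursor, batch_of_keys); a batch may be empty.
        cursor, batch = client.scan(cursor=cursor, match=pattern, count=count)
        yield from batch
        if int(cursor) == 0:  # cursor back at 0 means the iteration is complete
            break


# Usage with a real client might look like:
#     for key in scan_keys(r, "user:*"):
#         ...
```

SCAN can return the same key more than once if the keyspace changes during iteration, so collect results into a set when duplicates matter. redis-py also ships scan_iter(), which wraps the same loop.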

Incident 2 — noeviction Cache Fills Up, Writes Fail

Incident: A caching layer running Redis was deployed with maxmemory-policy noeviction (the default). Over several months of traffic growth, the cache filled to its memory limit. New cache writes started failing with OOM command not allowed when used memory > 'maxmemory' errors. Because the application did not handle those errors gracefully, cache misses turned into unhandled exceptions and users saw 500 errors across the product.

Why it happens: Redis has configurable behaviour for when it runs out of memory. noeviction refuses new writes but preserves all existing data — correct for a primary data store, catastrophic for a cache. The operator had not explicitly set an eviction policy, so the default applied without anyone thinking about it.

The fix: cache deployments should use allkeys-lru (evict the least-recently-used key across the entire keyspace when memory is full) or allkeys-lfu (evict the least-frequently-used). Both policies keep Redis writable at all times by making room for new data automatically. noeviction and volatile-* policies (which only evict keys that have a TTL set) are appropriate for primary data stores where you cannot afford to lose data silently. Set your eviction policy explicitly at deployment time, never rely on the default for a cache workload.

Incident 3 — Long Lua Script Blocks All Clients

Incident: A team implemented a complex rate-limiting and quota calculation as a Lua script that could take 200–400 ms to run on large inputs. Under peak traffic the script ran dozens of times per second. Other clients queued behind each script execution. Effective Redis throughput dropped by 60% and latency p99 climbed from 2 ms to over 300 ms site-wide.

Why it happens: Lua scripts in Redis execute atomically, so no other command runs while a script is running. This is the source of their correctness guarantee for multi-step operations. But it means a slow script is as bad as a slow KEYS call: the server is blocked for its entire duration. Redis has a lua-time-limit configuration (default 5 seconds, expressed in milliseconds) after which it starts answering other clients with a BUSY error. The limit does not stop the script; it only lets an operator get in to run SCRIPT KILL or SHUTDOWN NOSAVE. It is a last resort, not a safety net.

The fix: keep Lua scripts under 50 ms of wall-clock execution time as a firm rule. Profile scripts with redis-cli --latency and SLOWLOG GET before deploying. If complex logic requires more time, break it into smaller scripts or move the logic to the application layer. Redis Functions (Redis 7.0+) offer a more structured way to package server-side code than ad hoc Lua scripts, but the same time constraint applies. Set lua-time-limit 100 (100 ms) in your config as a safety tripwire.

Incident 4 — Async Replication Data Loss on Failover

Incident: A Redis primary received 500 writes in the 800 ms window between the last successful replication sync to its replica and a primary crash. Sentinel promoted the replica. Those 500 writes — which the primary had acknowledged to clients — were gone. The application had recorded successful database writes that were silently discarded.

Why it happens: Redis replication is asynchronous by default. The primary writes to its own memory and acknowledges the client immediately; the replication stream to replicas is best-effort. If the primary crashes before those writes propagate, they are lost on failover. This is not a bug — it is a fundamental trade-off between latency and durability. Synchronous replication would add the latency of a round-trip to every write.

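The fix: for writes that must not be lost, use the WAIT numreplicas timeout command after the write. WAIT 1 100 blocks until at least one replica has acknowledged the writes sent on that connection, or 100 ms elapse, and returns the number of replicas that acknowledged. Check that return value: on timeout it can be lower than the number you asked for, and the write is then not yet safe. This makes that specific write effectively synchronous; it narrows the loss window rather than turning Redis into a strongly consistent store. Use it selectively: for financial ledger entries, user data after sign-up, or any data where silent loss is unacceptable. Do not apply it universally; the latency cost on every cache write is rarely justified.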
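
Here is one way the write-then-WAIT pattern might look with a redis-py style client. It is a sketch under assumptions: the function name is illustrative, and a non-transactional pipeline is used so that SET and WAIT travel over the same connection, since WAIT only covers writes made on the connection that issues it.

```python
def write_and_wait(client, key, value, replicas=1, timeout_ms=100):
    """SET a key, then WAIT for replica acknowledgement on the same connection.

    Returns True only if at least `replicas` replicas acknowledged in time.
    """
    # transaction=False: plain pipelining, no MULTI/EXEC wrapper.
    pipe = client.pipeline(transaction=False)
    pipe.set(key, value)
    pipe.wait(replicas, timeout_ms)
    _set_ok, acked = pipe.execute()
    return acked >= replicas
```

A False result means the write reached the primary but was not confirmed by enough replicas within the timeout; the caller decides whether to retry, alert, or surface an error.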

Incident 5 — Memory Fragmentation Creep

Incident: A long-running Redis server (18 months, never restarted) showed used_memory of 12 GB but RSS (Resident Set Size, the actual OS memory consumed) of 21 GB. The server was approaching OOM from the OS's perspective even though Redis thought it had room to spare. Applications started seeing latency spikes as the OS swapped pages to disk.

Why it happens: When Redis writes and deletes many objects of varying sizes, the allocator (jemalloc) can accumulate fragmented free blocks — memory that has been freed but cannot be reused for new allocations of different sizes. The ratio of RSS to used_memory is the mem_fragmentation_ratio. A ratio above roughly 1.5 indicates significant fragmentation. Long-running Redis instances that mix large and small objects frequently, or that experience heavy delete/replace workloads, are most susceptible.

The fix: monitor INFO memory regularly — specifically mem_fragmentation_ratio. Redis 4.0+ introduced active defragmentation (activedefrag yes) which incrementally compacts memory during idle periods without requiring a restart. Enable it proactively on long-running instances. For severely fragmented instances, a controlled restart (saving an RDB, restarting, loading from RDB) resets fragmentation instantly. Plan for periodic Redis restarts during maintenance windows — treating Redis as a service that never needs restarts is the mental model that leads to fragmentation surprises.
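
The fragmentation check is plain arithmetic on two INFO memory fields. A small sketch using the incident's own numbers; the helper names are illustrative, and the 1.5 threshold is the rule of thumb from the text rather than a hard limit.

```python
GB = 10 ** 9


def fragmentation_ratio(used_memory, used_memory_rss):
    """RSS divided by the allocator's view of used memory."""
    return used_memory_rss / used_memory


def needs_attention(used_memory, used_memory_rss, threshold=1.5):
    """True when the ratio exceeds the rule-of-thumb threshold."""
    return fragmentation_ratio(used_memory, used_memory_rss) > threshold


# The incident's numbers: 12 GB used_memory, 21 GB RSS.
ratio = fragmentation_ratio(12 * GB, 21 * GB)
print(f"mem_fragmentation_ratio ~ {ratio:.2f}")  # 21 / 12 = 1.75

# With a redis-py client `r`, the inputs could come from:
#     info = r.info("memory")
#     needs_attention(info["used_memory"], info["used_memory_rss"])
```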

Five production disaster patterns — KEYS * blocking (fix: SCAN), noeviction cache filling up (fix: allkeys-lru), long Lua scripts blocking clients (fix: <50 ms scripts, set lua-time-limit), async replication data loss on failover (fix: WAIT for critical writes), and memory fragmentation creep (fix: activedefrag + monitor mem_fragmentation_ratio) — are all preventable with explicit configuration choices and a handful of operational habits established from day one.

Section 22

Performance & Best Practices Recap

Everything on this page distills into eight practices. None of these are arbitrary rules — each one has a clear mechanical reason rooted in how Redis actually works. Follow them and you avoid the overwhelming majority of Redis production problems. Skip any one of them and you are relying on luck.

1 · Use Pipelining

Without pipelining, each command waits for its response before the next is sent. With pipelining, you bundle N commands and send them together, then collect all N responses. The network round-trip, which can be 0.5 ms on a LAN and 10+ ms between data centres, is paid once instead of N times. In loops that set 1,000 keys at once, pipelining typically gives 5–10× throughput improvement. Both ioredis and redis-py expose a simple pipeline API; use it any time you have more than 2–3 independent commands to issue together.
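
For the "set 1,000 keys in a loop" case, a batched pipeline might look like the sketch below. It assumes `client` is a redis-py client; the function name and the batch size of 1,000 are illustrative choices, not recommendations from the library.

```python
def bulk_set(client, items, batch_size=1000):
    """Write key/value pairs in pipelined batches; return the number written."""
    written = 0
    pipe = client.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    for key, value in items.items():
        pipe.set(key, value)
        written += 1
        if written % batch_size == 0:
            pipe.execute()  # one round-trip for the whole batch
    pipe.execute()  # flush the remainder (returns an empty list if nothing is queued)
    return written
```

Batching keeps each pipeline bounded, which matters because both the client and the server buffer a pipeline's commands and replies in memory until it is flushed.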

2 · Right Data Structure

Using a String where a Hash would do means your application serialises/deserialises JSON on every access, whereas Redis's server-side Hash type reads or writes a single field in O(1). Using a List for a leaderboard means re-sorting in the application on every read (O(N log N)); a Sorted Set gives O(log N) insert and O(log N + M) range query. HyperLogLog uses 12 KB of fixed memory to count unique items across billions; a Set holding billions of members would use gigabytes. Match the data structure to the operation, not to familiarity.

3 · TTL on Everything Cacheable

A key without TTL lives forever in Redis — it is not a cache entry, it is a permanent record. Over time, TTL-less cache keys accumulate, memory fills, and you start seeing eviction or OOM errors. Set TTL on every key whose purpose is caching. Add small random jitter (e.g., base TTL ± 10%) to spread expiry times — if thousands of cache keys were all set at the same time they will all expire at the same time, causing a simultaneous cache-miss storm on your database, sometimes called a "thundering herd".
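
Adding jitter takes one small helper. A minimal sketch: the function name and the 10% default spread are illustrative, matching the "base TTL ± 10%" example above.

```python
import random


def ttl_with_jitter(base_seconds, spread=0.10, rng=random):
    """Return base_seconds scaled by a random factor in [1 - spread, 1 + spread]."""
    factor = 1 + rng.uniform(-spread, spread)
    return max(1, round(base_seconds * factor))


# A one-hour base TTL lands somewhere between 54 and 66 minutes.
print(ttl_with_jitter(3600))

# With a redis-py client `r`, this could be used as:
#     r.set("product:42", payload, ex=ttl_with_jitter(3600))
```

Keys written in the same burst then expire over a spread of several minutes instead of all at once.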

4 · allkeys-lru vs noeviction

For a cache deployment, set maxmemory-policy allkeys-lru. When Redis fills up, it evicts the least-recently-used key to make room — the cache stays writable and your application keeps working, just with a slightly lower hit ratio. For a primary data deployment (session store, job queue, primary record store), use noeviction: writes fail when memory is full, which forces you to notice and provision more capacity instead of silently discarding data. Set the policy explicitly in your config file — never rely on the default.

5 · SCAN, Never KEYS

The rule is absolute: KEYS (with * or any other pattern) must not run in production, and the same goes for other O(N) whole-collection reads such as SMEMBERS on a huge set or LRANGE 0 -1 on a huge list. Use SCAN cursor MATCH pattern COUNT batch in a loop instead. For set members, SSCAN. For hash fields, HSCAN. For sorted set members, ZSCAN. The COUNT hint is advisory (Redis may return more or fewer), so always loop until the cursor returns to 0. Search your codebase for KEYS, audit every hit, and replace each one.

6 · Bounded Lua Scripts

Lua scripts in Redis are atomic and fast — perfect for multi-step operations like compare-and-swap or atomic rate-limit increment. Keep them under 50 ms of wall-clock time. Profile them with SLOWLOG GET and redis-cli --latency. Set lua-time-limit 100 in your redis.conf to get a BUSY error rather than a silent hang if a script runs long. Redis Functions (Redis 7.0+) provide a library model for reusable server-side code with slightly better tooling, but the same time constraint applies.

7 · Hash-Tag Keys for Cluster

In Redis Cluster, keys are distributed across 16,384 hash slots by computing CRC16(key) % 16384. Multi-key commands (MGET, MSET, SUNIONSTORE) and Lua scripts must operate on keys in the same slot — issuing them across slots returns a CROSSSLOT error. Hash tags solve this: the slot for a key is computed only on the substring inside {}. So {user:42}:profile and {user:42}:sessions both hash to the slot for user:42 and always live on the same node. Design your key naming convention with hash tags from the start if you plan to use Cluster.
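
The slot calculation can be reproduced locally to check a naming convention before deploying. This sketch assumes the CRC16 variant is the XMODEM one (polynomial 0x1021, initial value 0), which Python's binascii.crc_hqx computes when given an initial value of 0; the function name is illustrative.

```python
import binascii


def key_slot(key: str) -> int:
    """Compute a key's Redis Cluster hash slot, honouring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts.
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


# Both keys hash on the tag "user:42", so they share a slot.
print(key_slot("{user:42}:profile"), key_slot("{user:42}:sessions"))
print(key_slot("user:42"))
```

On a live cluster, CLUSTER KEYSLOT returns the authoritative value and is a good way to confirm a local calculation like this one.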

8 · Monitor 6 Metrics

Run redis-cli INFO all, or a single section such as INFO memory, INFO stats, or INFO replication, and watch these six: (1) used_memory_rss vs used_memory — ratio above 1.5 signals fragmentation. (2) instantaneous_ops_per_sec — your throughput baseline. (3) keyspace_hits / (keyspace_hits + keyspace_misses) — your cache hit ratio; below roughly 85% is worth investigating. (4) latency — redis-cli --latency reports min, max, and average round-trip time, so pair it with client-side percentiles; p99 above 5 ms on a LAN is a red flag. (5) master_repl_offset - slave_repl_offset — replication lag in bytes. (6) mem_fragmentation_ratio — the RSS-to-used ratio from (1) as Redis reports it; a value below 1.0 usually means the OS has swapped part of the dataset out. Set up alerts on all six.
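The derived metrics above are simple arithmetic on INFO counters. The sketch below assumes the counters have already been collected into a flat dictionary keyed by the field names used in the text; the sample numbers are made up, and `derive_metrics` and `alerts` are hypothetical helper names.

```python
def derive_metrics(info: dict) -> dict:
    """Compute the ratios and lag described above from raw INFO counters."""
    hits = info["keyspace_hits"]
    misses = info["keyspace_misses"]
    lookups = hits + misses
    return {
        # RSS divided by logical usage; above 1.5 suggests fragmentation.
        "fragmentation_ratio": info["used_memory_rss"] / info["used_memory"],
        # None until the instance has served at least one lookup.
        "hit_ratio": hits / lookups if lookups else None,
        # Bytes the replica still has to apply.
        "replication_lag_bytes": info["master_repl_offset"] - info["slave_repl_offset"],
    }


def alerts(metrics: dict) -> list:
    """Return the names of metrics that cross the thresholds in the text."""
    fired = []
    if metrics["fragmentation_ratio"] > 1.5:
        fired.append("fragmentation_ratio")
    if metrics["hit_ratio"] is not None and metrics["hit_ratio"] < 0.85:
        fired.append("hit_ratio")
    return fired


sample = {  # made-up numbers for illustration
    "used_memory": 1_000_000,
    "used_memory_rss": 1_800_000,
    "keyspace_hits": 900,
    "keyspace_misses": 100,
    "master_repl_offset": 5_000,
    "slave_repl_offset": 4_200,
}
metrics = derive_metrics(sample)
fired = alerts(metrics)
```

Guarding the hit ratio against zero lookups matters in practice: a freshly restarted instance has no hits or misses yet, and an alert rule should not divide by zero or fire on it.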

Eight Redis best practices — pipeline for throughput, right data structure for the operation, TTL with jitter on all cache keys, allkeys-lru for cache and noeviction for primary data, SCAN instead of KEYS, Lua under 50 ms, hash-tag keys for Cluster co-location, and monitoring six key metrics — collectively prevent the configuration disasters, production outages, and capacity surprises that dominate Redis war stories.

Section 23

FAQ — Your Redis Questions Answered

These are the questions engineers most commonly ask when they start working with Redis in practice — or when they are deciding whether to use it at all. Plain English first, then the nuance that matters for real decisions.

Redis or Memcached — which should I pick?

Redis for almost everything in 2026. Memcached is a pure, simple, ephemeral key-value cache: strings only, no persistence, no replication, no data structures beyond key → string. Its advantages are simplicity and marginally lower per-operation overhead for pure GET/SET workloads. Redis adds sorted sets, hashes, streams, Pub/Sub, persistence, replication, Lua scripting, and Cluster sharding — and its raw throughput for simple GET/SET is within a few percent of Memcached in most benchmarks. Choose Memcached only if you have an existing deployment you are maintaining, or a very specific workload where you have benchmarked a meaningful throughput advantage. For all new deployments, Redis is the clear default: you get the same speed, and you retain the option to use richer features when your use case evolves.

When does Redis Cluster make sense?

When you have genuinely outgrown a single instance — which typically means your working set exceeds roughly 100 GB of RAM, or you need more than roughly 200,000 ops/sec and a single-threaded instance is saturating its CPU. Below those thresholds, a single Redis primary with one or two replicas behind Redis Sentinel gives you high availability with far less operational complexity. Cluster requires cluster-aware clients, hash-tagged keys for multi-key operations, careful slot distribution planning, and a non-trivial failure recovery process. Start with single-master + Sentinel, add Cluster only when a specific scaling constraint forces the upgrade.

Should I enable AOF, RDB, or both?

It depends on how much data loss you can tolerate. For a cache-only deployment where all data is derived and can be rebuilt from the source of truth: disable both, accept that a restart starts with an empty cache. For a deployment where you want quick restart from a compact file and can tolerate losing the last few minutes of writes: RDB only (snapshots every 1–5 minutes). For maximum durability: enable both. AOF with appendfsync everysec gives at most about one second of data loss and handles crash recovery; RDB gives a compact backup file for fast full restarts and point-in-time copies. Hybrid persistence (the aof-use-rdb-preamble option, available since Redis 4.0) combines the two formats: an AOF rewrite writes an RDB-format snapshot as the preamble and appends subsequent commands after it. When in doubt, enable both — the disk overhead is modest and the durability benefit is significant.
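The decision above maps onto a handful of redis.conf directives. The helper below is only a sketch: the profile names are hypothetical labels for the three cases in the answer, and the 300-second snapshot interval is an example value, not a recommendation from the text. The directives themselves (save, appendonly, appendfsync) are standard redis.conf settings.

```python
def persistence_directives(profile: str) -> list:
    """Map a durability profile to redis.conf lines (intervals are examples)."""
    if profile == "cache-only":
        # No snapshots, no AOF: a restart begins with an empty dataset.
        return ['save ""', "appendonly no"]
    if profile == "rdb-only":
        # Snapshot if at least 1 key changed in the last 300 seconds.
        return ["save 300 1", "appendonly no"]
    if profile == "max-durability":
        # Snapshots plus an AOF fsynced once per second.
        return ["save 300 1", "appendonly yes", "appendfsync everysec"]
    raise ValueError(f"unknown profile: {profile}")


config_text = "\n".join(persistence_directives("max-durability"))
```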

Redis or Valkey — should I care?

Valkey is a community fork of Redis created in 2024 after Redis Inc. changed Redis's license from BSD to a source-available license (SSPL + RSALv2). Functionally, as of mid-2026, Valkey and Redis are largely drop-in compatible at the protocol and API level — most clients work against both, most commands behave identically. The practical distinction is licensing: Valkey is licensed under BSD (permissive open-source) and is the version offered by major managed services (AWS ElastiCache, Google Memorystore) as their open-source option. Redis Cloud and Redis Inc. distributions include proprietary modules (Search, JSON, TimeSeries, Bloom) that Valkey does not ship with by default. Choose Valkey for open-source compliance or cost reasons; choose Redis Cloud or Redis Enterprise for commercial module features like Redis Search or Redis JSON with full vendor support.

How do I scale Redis writes beyond a single instance?

Three options: (1) Redis Cluster shards writes across multiple master nodes, each owning a subset of hash slots. This is the official Redis scaling path. (2) Application-level sharding: your application hashes keys to one of N independent Redis instances. There is less to operate than with Cluster, but the routing logic, and any later resharding, lives in your application code. (3) Dragonfly — a Redis-compatible server that replaces Redis's single-threaded model with a multi-threaded shared-nothing architecture, designed to scale write throughput with CPU cores on a single machine. Dragonfly is a newer option (stable as of 2024) that can substitute for Redis Cluster for many workloads without the operational complexity. Evaluate based on your specific write throughput requirements and on how you weigh operational complexity against the maturity of a newer project.
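Option (2) can be sketched in a few lines. This is a minimal illustration assuming three independent instances at hypothetical addresses. It uses CRC32 from the standard library because Python's built-in hash() for strings is randomized per process, so it would send the same key to different shards on different application servers.

```python
import zlib


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard index for key with a stable hash (not Python's hash())."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


# Hypothetical addresses of three independent Redis instances.
SHARDS = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]


def address_for(key: str) -> str:
    """Return the instance address that owns key."""
    return SHARDS[shard_for(key, len(SHARDS))]


first_lookup = address_for("user:42")
second_lookup = address_for("user:42")
```

The trade-off of plain modulo routing is that changing the number of shards remaps most keys at once; consistent hashing or a fixed slot table (the approach Redis Cluster takes) limits that movement.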

What is the maximum key and value size?

The hard limit is 512 MB for a key and 512 MB for a String value. In practice, storing values anywhere near that size defeats the purpose of Redis: reading a 100 MB value takes measurable time, consumes large chunks of your instance's RAM for one key, and stresses your network. A healthy rule of thumb: keys under 1 KB (shorter keys = smaller memory footprint across millions of entries), String values under 1 MB, and Hash fields under 64 KB. If you find yourself wanting to store a 10 MB JSON blob in Redis, reconsider: Redis is not a document store. Store the large object in object storage (S3, GCS) and cache only the metadata or a compressed summary in Redis.

Should I run Redis on Kubernetes?

Yes, with the right operator. Running Redis as a bare Deployment in Kubernetes is a bad idea — pods can be rescheduled, and a stateful server like Redis needs stable network identities, persistent volumes, and proper failover orchestration. Use purpose-built tooling instead: the Bitnami Redis Helm chart (a chart rather than an operator) and the OT-CONTAINER-KIT Redis Operator are the most widely used community options. The Redis Enterprise Operator (from Redis Inc.) is the supported commercial path. Operators handle replica management, Sentinel or Cluster topology, TLS certificate rotation, and rolling upgrades. Attach PersistentVolumeClaims backed by SSD storage classes for durability. For most Kubernetes environments, a managed service (ElastiCache, Memorystore) remains simpler — use the operator route when you need full control or are not on a major cloud.

How do I migrate data between Redis instances?

Depends on scope. For a single key or small set of keys: the MIGRATE host port key db timeout command atomically moves a key from one instance to another over the wire. For a full dataset migration: use redis-cli --rdb /path/to/dump.rdb to dump an RDB file from the source, then copy that file into the destination's data directory and restart the destination so it loads the snapshot at startup. If the destination has AOF enabled, it loads the AOF instead of the RDB file, so turn AOF off for that first start. (redis-cli --pipe bulk-loads commands in Redis protocol format; it does not read RDB files.) For zero-downtime live migrations: configure the destination as a replica of the source with REPLICAOF, let replication sync, then promote and cut over your application connection strings. Managed services (Redis Cloud, ElastiCache) usually provide native migration tools in their consoles that orchestrate replication-based migration automatically — use those when available.

Eight FAQ answers — Redis vs Memcached (Redis wins in 2026), when Cluster makes sense (above ~100 GB RAM or 200K ops/sec), AOF vs RDB vs both (match to durability needs), Redis vs Valkey (licensing split, mostly compatible), scaling writes (Cluster, app sharding, or Dragonfly), max key/value sizes (512 MB hard limit; keep values under 1 MB practically), running on Kubernetes (use an operator + PVCs), and live migration strategies (MIGRATE, RDB dump, or replication-based cutover) — cover the practical gaps engineers encounter when moving from Redis theory to production decisions.

Section 24

Connected Topics — What to Study Next

Redis does not exist in isolation. The topics below are either the databases you will be asked to compare it against, the underlying concepts that explain why Redis works the way it does, or the complementary technologies it is commonly paired with. Work through them in the order that matches where you want to go next.

Six connected topics — MongoDB (document model contrast and persistent companion), Database Internals (disk I/O physics that make RAM speed tangible), Relational Databases (the SQL source-of-truth tier), Sharding (hash-slot theory behind Cluster), Real-time Protocols (Pub/Sub and Streams in context), and Caching Strategies (the coordination patterns that govern Redis use) — form the complete context around Redis that turns isolated knowledge into systems-level thinking.

Redis — In-Memory Data Structures at Microsecond Speed

TL;DR — Redis in Plain English

Why You Need This — The Melting Database Story

The situation: your top-products page

The fix: cache the result in Redis

Mental Model — Server-Side Data Structures

Traditional stores: you move data, your code does the work

Redis: the server does the work, you send one command

The durability side of the story

Core Concepts — The Six Terms You Must Know

Key — the name of your data

Value — the typed data attached to a key

TTL (Time-To-Live) — automatic expiry

AOF (Append-Only File) — the write log

RDB Snapshot — the point-in-time photograph

Pipeline — batch commands, one round-trip

The Eight Data Structures — Pick the Right Tool

Structure-by-structure plain English

String — the universal type

Hash — an object in one key

List — queue and activity feed

Set — unique membership and set math

Sorted Set (ZSET) — the leaderboard structure

Stream — the event log

Bitmap — presence tracking at scale

HyperLogLog — unique count at billion scale

See it in code

Persistence — AOF and RDB: How Redis Survives a Crash

AOF — Append-Only File (the receipt tape)

RDB — Binary Snapshot (the photograph)

AOF vs RDB — the trade-off in plain terms

Choosing between AOF and RDB

Replication & High Availability — Never Lose Your Cache

How replication works — plain English first

Three things you must know about Redis replication

Sentinel — Automatic Failover Without Cluster Complexity

Sentinel vs Cluster — when to use which

Redis Cluster — Horizontal Sharding Built In

Hash slots — the math behind the split

Key concepts you must understand

Hash Tags — forcing co-location

Resharding — moving slots online

Memory Management & Eviction — What Happens When Redis Is Full

The eight policies — when to use each

allkeys-lru — the standard cache policy

allkeys-lfu — better for stable hot sets

volatile-* — evict only TTL-carrying keys

noeviction — danger for caches

Pub/Sub & Streams — Redis as a Message Bus

Three messaging patterns and when to pick each

Pub/Sub — real-time broadcast

Streams (XADD / XREADGROUP) — durable queues

Sorted-set queue — priority queue pattern

Lua Scripting & Transactions — Atomic Operations in Redis

Three patterns for atomic Redis operations

Caching Patterns

Cache-Aside (Lazy Loading)

Read-Through

Write-Through

Write-Behind (Write-Back)

Common Use Cases

Sessions

Rate Limiting

Distributed Locks

Leaderboards

Real-Time Analytics

Pub/Sub

Job Queues

Geospatial

Distributed Locks: Why It's Hard

Single-Node SET NX EX

Redlock

Fencing Tokens

Performance & Throughput

Pipelining

Hash-Tag Co-location (Cluster)

Avoid O(N) Commands

maxmemory Policy

Connection Pooling

Redis vs Valkey vs KeyDB vs Dragonfly
