Cassandra — System Guide

Section 1

TL;DR — Cassandra in Plain English

Why Cassandra was built for write-heavy workloads at a scale that breaks traditional databases
How Cassandra's ring architecture automatically distributes data without a single primary node
What trade-offs you accept — no JOINs, no cross-row transactions, eventual consistency by default
The query-first schema philosophy that makes Cassandra succeed or fail based on how you design it

Cassandra's core insight: every node is equal, every write goes to memory first, and the ring topology means adding a new machine automatically absorbs part of the load — no resharding ceremony required.

Cassandra was built when Facebook engineers needed to store billions of inbox messages and search them fast. Their problem was not reading — it was absorbing an endless firehose of writes. Relational databases handle writes by updating rows in-place on disk, which requires seeking to the right position and locking rows. Cassandra writes to an in-memory buffer first (the Memtable) and appends to a commit log on disk — two sequential operations, no locking, no seeking. Sequential disk I/O is roughly 100× faster than random I/O. That physics advantage is why a well-tuned Cassandra node typically handles tens of thousands of writes per second on commodity hardware, and a cluster of dozens of nodes can absorb hundreds of thousands to millions of writes per second in aggregate.

In PostgreSQL or MySQL you have a primary that accepts writes and replicas that serve reads. If the primary dies, you need a failover procedure. Cassandra has no primary. Every node is equal. When a client writes a row, it contacts any node — that node acts as the coordinator, hashes the partition key to find which 3 (or however many) nodes own the data, and fans the write out to them. There is no single point of failure because there is no throne to lose.

Nothing is free. Cassandra gives you write throughput and masterless resilience; it takes away JOINs, multi-row ACID transactions, and the freedom to query data any way you want. Because there is no query planner doing JOINs at runtime, you have to pre-shape your data for each access pattern at design time. If you have 5 different ways to query the same data, you keep 5 copies in 5 tables. That is not a bug — that is the architecture. Embrace it or fight it forever.

Cassandra is a masterless wide-column database that wins on write throughput and fault tolerance by accepting explicit trade-offs: no JOINs, eventual consistency by default, and a query-first schema design discipline.

Section 2

Why You Need This — The IoT Firehose Problem

Let's walk through a concrete scenario. No Cassandra knowledge required yet — just follow the numbers and feel the pain point.

The situation: a global sensor network

You are building a telemetry platform for industrial IoT sensors. Each sensor reports temperature, pressure, and vibration readings 100 times per second. You start with 10,000 sensors. That is 1,000,000 writes per second — one million rows landing in your database every second of every day.

You try Postgres first, because you know SQL and it works great for everything else you have built. Here is what happens:

A single Postgres node can handle roughly 10,000–30,000 writes per second under favorable conditions.
At 1M writes/sec you need 30–100 Postgres nodes just to absorb the load.
Sharding Postgres is manual work: you pick a shard key, you route writes in application code, and when you outgrow your shard count you go through a painful re-sharding migration where you move data around while staying live.
Six months later your sensor count grows to 50M — now you need 5× as many shards. Re-shard ceremony again. And again next year.

The re-sharding problem is not a Postgres bug. It is a fundamental property of any system not designed for horizontal elasticity. You are bolting distribution onto a system built for a single powerful machine.

Why Cassandra changes the equation

Cassandra treats sharding as a first-class feature, not an afterthought. From the moment you create your first table, your data is automatically distributed across however many nodes you have using consistent hashing. When you add a new node, Cassandra automatically transfers a share of the token range from existing nodes to the new one — no application code changes, no downtime, no migration script.

The practical effect: scaling from 10M sensors to 50M sensors means adding machines to the cluster. Cassandra handles the rest. Your application never needs to know how many nodes exist.

Think First

Your IoT platform currently writes at 100K rows/sec and grows roughly 30% year-over-year. In three years you will be at roughly 220K rows/sec. A single well-tuned Postgres node tops out around 20–30K writes/sec. Before reading further: what would you need to do to keep Postgres viable here? What does that operational cost look like compared to just using a database designed for this load?

The trade you make

Cassandra's write performance comes from a specific design choice: writes go to an in-memory structure (Memtable) plus a commit log, then are flushed to immutable files on disk (SSTables) in the background. There are no in-place row updates — every change is a new timestamped entry. Reads are more work because Cassandra may need to merge multiple SSTables and check for tombstones (deletion markers). The conclusion: Cassandra is optimized for write-heavy workloads where you know your access patterns up front. For complex ad-hoc queries, analytics, or ACID transactions, a relational database or a warehouse will serve you better.

Cassandra was designed to absorb massive write loads that re-sharding SQL databases cannot handle gracefully — elastic scale by adding nodes, no re-sharding ceremony, at the cost of no JOINs and query-pattern-aware schema design.

Section 3

Mental Model — The Ring

Before looking at any CQL syntax, you need one mental picture burned in permanently: the ring. Every design decision in Cassandra makes sense once you understand the ring. Without it, the partition key, replication factor, and consistency levels feel like arbitrary rules.

The ring explained simply

Imagine a circular number line from 0 to 1,023 (Cassandra actually uses a much larger range — around −2⁶³ to 2⁶³ − 1 — but the concept is the same). Each Cassandra node is assigned responsibility for a range of numbers on that circle. When you write a row, Cassandra hashes your partition key to produce a number (called a token). Then it walks clockwise around the ring until it finds the first node whose range includes that token. That node stores the row. Simple.

Replication works the same way. If your replication factor is 3, the row is stored on three consecutive nodes clockwise from the token. Node 1 fails? Nodes 2 and 3 still have the data. The cluster keeps serving requests without any failover script.

Four design heuristics that flow from the ring

Once you picture the ring, four key ideas click into place:

1. Partition Key Picks the Node

The partition key is hashed to determine which token range — and therefore which node — holds your data. This is not just a uniqueness requirement. It is a routing decision. If you make your partition key too coarse (e.g., a single value like "all_sensors"), all data lands on one node and you get a hot partition. If you make it too granular (e.g., each individual sensor reading), you lose the ability to range-query. The art of Cassandra schema design is choosing partition keys that spread load evenly AND group the rows your queries need.

2. Replication Factor (RF) — How Many Copies

RF=3 means three nodes always hold a copy of each partition. Why 3? Because it allows for one node failure while still being able to do a quorum read/write (two out of three). RF=1 gives you no fault tolerance; RF=5 gives you stronger durability but more write cost and storage. Most production clusters use RF=3 per datacenter. In a multi-datacenter setup you can have RF=3 in DC-East and RF=3 in DC-West — Cassandra replicates asynchronously between DCs, giving you active-active geographic redundancy.

3. Consistency Level (CL) — How Many Must ACK

With RF=3 copies of each partition, you choose how many must acknowledge before Cassandra considers an operation successful. ONE = fastest, weakest (1 of 3 must ACK). QUORUM = balanced (majority — 2 of 3 must ACK). ALL = strongest, slowest (all 3 must ACK). The sweet spot for most production workloads is LOCAL_QUORUM — a majority within the local datacenter must ACK, balancing speed with consistency. The magic formula: if write CL + read CL > RF, you get strong consistency (every read sees the latest write).

4. No Primary — Every Node Is Equal

In a traditional primary/replica setup, all writes go to the primary. If the primary dies, the cluster stalls until a replica is elected. Cassandra has no such bottleneck. Any node can coordinate any request. If a client connects to N5 to write a row that belongs to N3, N5 just proxies the write to N3 (and the replicas) and returns an ACK. This means adding nodes increases write throughput linearly — each new node takes on a share of the token ring and can independently coordinate requests.

The ring is Cassandra's core abstraction: consistent hashing maps partition keys to nodes, RF controls redundancy, CL controls durability strength per operation, and masterless design means linear write scaling with no single point of failure.

Section 4

Core Concepts — The Vocabulary

Before writing a single CQL statement you need to feel comfortable with six concepts. Each one maps to a very concrete thing — there is no hand-waving here.

Keyspace

A keyspace is the top-level container in Cassandra — roughly equivalent to a database in SQL. When you create a keyspace you declare two critical things: the replication factor (how many copies of each partition to keep) and the replication strategy (SimpleStrategy for single-datacenter setups, NetworkTopologyStrategy for multi-datacenter). Everything else in Cassandra lives inside a keyspace.

Table

A table in Cassandra looks superficially like a SQL table — columns, rows, a schema. But the primary key is fundamentally different. Instead of a simple unique identifier, the primary key has a two-part structure: a partition key (which node?) and optional clustering keys (what order within the node?). That structure is not optional decoration — it is the heart of how Cassandra stores and retrieves your data efficiently. Designing the primary key badly is the single most common Cassandra mistake.

Partition Key

The partition key is hashed to determine which node (and which replica nodes) store a given set of rows. All rows with the same partition key value live together in one partition on one node (plus its RF replicas). This is powerful: a query that filters by partition key requires hitting exactly one node — O(1) routing. A query that does NOT filter by partition key requires hitting every node — this is a full cluster scan, and Cassandra will warn you about it or refuse unless you explicitly allow it. Choose your partition key to match your most common query pattern.

Clustering Key

Within a partition, rows are sorted on disk by the clustering key. This sorting is not cosmetic — it is physical storage order. Because Cassandra stores data in sorted immutable files (SSTables), range queries on the clustering key are extremely fast: Cassandra seeks to the start of the range and reads contiguous bytes. A query like "give me all sensor readings for sensor A-42 between 09:00 and 10:00" is efficient only if sensor_id is the partition key (routes to the right node) and recorded_at is the clustering key (sorted within the partition for fast range scan).

Replication Factor (RF)

RF is the number of nodes that store a copy of each partition. RF=1 means if one node dies, that data is temporarily unavailable. RF=3 means three nodes hold the data — you can lose one node and still read and write with QUORUM consistency (2-of-3). RF=5 adds more redundancy at higher storage cost. The rule of thumb: RF=3 per datacenter is the production standard. It gives you one-node-fault tolerance with quorum reads while keeping storage overhead manageable.

Consistency Level (CL)

CL is a per-operation dial that controls how many of the RF replicas must respond before Cassandra returns success. ONE — fastest, weakest: any single replica ACKs. QUORUM — balanced: majority of replicas must ACK. ALL — strongest, slowest: every replica must ACK. LOCAL_QUORUM — quorum within the local datacenter only, ignoring remote DCs (great for multi-DC setups where cross-DC latency is too slow for synchronous ACK). The magic tuning insight: write at LOCAL_QUORUM + read at LOCAL_QUORUM gives you strong consistency within a datacenter, because any write was acknowledged by at least ⌊RF/2⌋+1 nodes, which overlaps with any read quorum.

Cassandra's vocabulary — keyspace, table, partition key, clustering key, RF, CL — maps directly to physical storage and routing decisions; understanding each term in operational terms is the prerequisite for every design decision that follows.

Section 5

Data Model — Wide-Column Explained

The phrase "wide-column database" sounds intimidating. In practice it means one thing: a partition (identified by the partition key) can hold a very large number of rows, all sorted together on disk. Think of each partition as its own mini-sorted-table, stored contiguously on one node. Different partitions live on different nodes, but within a partition everything is local and sorted.

The two-part primary key — the most important concept in CQL

In SQL, a primary key uniquely identifies one row. In Cassandra, the primary key has two jobs:

Partition key — hashed to determine which node. All rows with the same partition key go to the same node.
Clustering key — determines the physical sort order of rows within a partition. This enables efficient range scans.

Together they look like: PRIMARY KEY ((sensor_id), recorded_at) — the inner parentheses mark the partition key; recorded_at is the clustering key. Multiple partition key columns are allowed (composite partition key), and multiple clustering keys too.

Three common schema patterns in CQL

-- Time-series: sensor readings bucketed by date + sensor
-- Why composite partition key? Without the date bucket,
-- one sensor_id could accumulate billions of rows in one
-- partition (unbounded growth). Bucketing by day caps a
-- partition at ~8.64M rows/day at 100 readings/sec.

CREATE TABLE sensor_readings (
  sensor_id   TEXT,
  date_bucket DATE,          -- partition key part 2: limits partition size
  recorded_at TIMESTAMP,     -- clustering key: sorts rows in time order
  temperature FLOAT,
  pressure    FLOAT,
  PRIMARY KEY ((sensor_id, date_bucket), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC); -- newest first

-- Write:
INSERT INTO sensor_readings (sensor_id, date_bucket, recorded_at, temperature, pressure)
VALUES ('A-42', '2026-05-09', toTimestamp(now()), 21.3, 1013.2);

-- Read last 100 readings for sensor A-42 today (hits 1 partition, 1 node):
SELECT * FROM sensor_readings
WHERE sensor_id = 'A-42' AND date_bucket = '2026-05-09'
LIMIT 100;

-- User profile: simple lookup by username
-- Single-column partition key is fine here — one user = one partition.
-- No clustering key needed (only one row per user).

CREATE TABLE users (
  username    TEXT PRIMARY KEY,  -- shorthand: partition key only, no clustering key
  email       TEXT,
  display_name TEXT,
  created_at  TIMESTAMP,
  preferences MAP<TEXT, TEXT>   -- Cassandra supports map/list/set column types
);

-- Write:
INSERT INTO users (username, email, display_name, created_at)
VALUES ('rafikul', 'rafikul@example.com', 'Raf', toTimestamp(now()));

-- Read (routes to exactly 1 node — O(1)):
SELECT * FROM users WHERE username = 'rafikul';

-- Update is actually an upsert — Cassandra never locks a row:
UPDATE users SET display_name = 'Rafikul' WHERE username = 'rafikul';

-- Range query within a partition using clustering key
-- Scenario: fetch all orders for a user placed in a date range.
-- Partition key = user_id (routes all user orders to same node).
-- Clustering key = order_date DESC (newest first on disk).

CREATE TABLE orders_by_user (
  user_id    TEXT,
  order_date DATE,        -- clustering key 1 (for date range queries)
  order_id   UUID,        -- clustering key 2 (for uniqueness within a day)
  total      DECIMAL,
  status     TEXT,
  PRIMARY KEY (user_id, order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id DESC);

-- Range query — FAST because order_date is the clustering key:
SELECT * FROM orders_by_user
WHERE user_id = 'user-123'
  AND order_date >= '2026-01-01'
  AND order_date <= '2026-05-09';

-- Note: You CANNOT do WHERE status = 'PENDING' alone —
-- filtering non-key columns requires ALLOW FILTERING (slow scan).
-- The solution: create a separate table keyed by (user_id, status).

Soft Rule: Partition Size

A Cassandra partition can technically hold up to around 2 billion cells, but in practice performance degrades well before that. A rough rule of thumb used by many Cassandra practitioners: keep individual partitions under 100MB of data. If a single sensor writes 100 readings/sec with 200 bytes per row, one day of data is roughly 1.7GB — way too large. Adding a date_bucket to the partition key caps each partition at 1.7GB ÷ 365 ≈ 4.6MB per day. Many small partitions are always preferable to a few giant ones.

Cassandra's wide-column model stores rows sorted by clustering key within partitions; the two-part primary key (partition key + clustering keys) is a routing and storage decision, not just a uniqueness constraint — design it around your read queries.

Section 6

Query-First Schema Design

Here is the single biggest mindset shift you need to make when coming from SQL: in Cassandra, you design your tables around your queries, not around your data.

In SQL you normalize data — one table per entity, no duplication. When you need to answer a question, the database JOINs tables at query time to reconstruct the answer. The database engine handles the complexity of combining data from multiple tables. That is the relational model's superpower.

Cassandra has no JOIN operation. Zero. The reason is physical: your data is spread across dozens or hundreds of nodes. A JOIN would require coordinating between potentially hundreds of nodes on every query — the latency would be unbearable. Instead, Cassandra's answer is denormalization: you pre-compute the answer to each query and store it in its own table, shaped exactly the way the query needs it.

The four rules of query-first design

Start with Queries, Not Entities

Before writing a single CREATE TABLE statement, write down every query your application needs to run. Literally list them: "get user profile by username," "get all orders for a user sorted by date," "get all orders for a product," etc. Each query on that list will become its own table. This is not optional — it is the methodology. Trying to design the tables first and fit the queries in later is what causes Cassandra projects to fail.

One Table Per Query Pattern

This feels wasteful to SQL veterans. If you have 5 different ways to query orders, you create 5 tables — all holding largely the same order data, shaped differently. The duplication is intentional: each table is pre-joined and pre-sorted exactly the way its query needs. The payoff is that every read hits one partition on one node and returns in a handful of milliseconds regardless of whether your cluster has 5 nodes or 500. Storage is cheap; query latency is precious.

Partition Keys Reflect Access Patterns

Choose your partition key to be whatever value you always know at query time. If you always look up orders by user_id, make user_id the partition key for your orders-by-user table. This routes every query to exactly one node — no scatter-gather, no multi-node coordination. If your access pattern requires querying by two values (e.g., by user_id and date), consider a composite partition key or use one as the partition key and the other as a clustering key, depending on whether you need range queries on it.

Keep Partitions Bounded

A partition that grows without bound will eventually become a performance bottleneck — Cassandra has to read and compact increasingly large SSTable files for that partition. The most common cause is using a partition key with low cardinality (e.g., country_code or status), where all rows for "US" or "PENDING" pile into one giant partition. The fix is almost always to add a bucketing column (like a date or a hash-range bucket) to the partition key to cap the size per partition.

The #1 Cassandra Design Mistake

Copying your SQL schema directly into Cassandra is the most common reason Cassandra projects fail. You will get a normalized schema with tables like users, orders, products — and then discover that the queries you need require JOINs that Cassandra cannot do. You will try workarounds (ALLOW FILTERING, secondary indexes, Spark), and each one will add latency and operational complexity. The lesson: treat Cassandra schema design as a fresh start. Begin with query patterns, not entities.

Query-first schema design means listing every query before writing a single CREATE TABLE — each query becomes a denormalized table shaped exactly for that access pattern, with partition keys chosen to route the query to a single node for constant-time reads.

Section 7

Replication & Multi-DC

Cassandra copies every piece of data to multiple nodes automatically. This is called replication — and unlike MySQL primary/replica setups, all copies are equal. Any replica can answer reads or accept writes. There is no "master" to fail over from.

The Replication Factor (RF) tells Cassandra how many copies to keep. RF=3 means every row lives on 3 different nodes. If one node dies, 2 still have your data. If two die simultaneously, 1 still does. You typically need to lose all RF nodes at once before data becomes unavailable — which is why RF=3 is the production minimum and often the sweet spot.

Within each datacenter, Cassandra places replicas on different racks automatically — so a single rack power failure can't take out all copies. Cross-DC replication happens asynchronously; the local DC responds to clients immediately, and the remote DC catches up in the background.

Two Replication Strategies

SimpleStrategy

Places the first replica on the node chosen by the partitioner, then walks clockwise around the ring for each additional replica. It has no concept of racks or datacenters — all nodes are treated as equals.

When to use: single-datacenter clusters, local development, quick prototyping. Never use this in production with more than one DC — it can place all replicas in the same physical cabinet.

CREATE KEYSPACE my_app
WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

NetworkTopologyStrategy

You specify the replication factor per datacenter. Cassandra then places replicas across racks within each DC to avoid correlated failures. Cross-DC replication is automatic and asynchronous.

Why it matters: with RF=3 in each of two DCs, you can lose an entire datacenter and still serve every read and write from the surviving DC. This is the production standard for serious deployments.

CREATE KEYSPACE my_app
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'us-east': 3,
  'eu-west': 3
};

Soft numbers: a typical production deployment runs RF=3 in each DC. Cross-DC replication lag is typically in the low milliseconds on fast backbone links, though spikes to tens of milliseconds under load are normal. The local DC always answers first — remote replication never adds latency to your client.

Cassandra replicates each row to N nodes (typically 3) using SimpleStrategy for dev or NetworkTopologyStrategy for production multi-DC — that's how it eliminates single-datacenter risk while keeping cross-DC writes asynchronous.

Section 8

Consistency Levels

Every Cassandra read and write carries a consistency level — a number that says "how many replicas must confirm this operation before we call it done?" This is Cassandra's most powerful feature: you can dial safety vs. speed on a per-query basis. Most SQL databases give you one global setting and force you to choose between strong consistency everywhere or eventual consistency everywhere. Cassandra lets you choose for each individual query.

Think of it like a vote: with RF=3, you have 3 votes. Requiring 2 out of 3 to agree (quorum) means you can tolerate 1 disagreeing replica. Requiring all 3 is safest but slowest. Requiring just 1 is fastest but you might read stale data.

The R + W > N Formula

Here's the math that makes tunable consistency work: if the number of replicas that must acknowledge a write (W) plus the number that must respond to a read (R) is greater than the total replica count (N), then those two sets of replicas must overlap by at least one node. That overlapping node holds the latest write — so your read is always fresh. This is how you get strong consistency without a single primary.

In practice: RF=3, W=LOCAL_QUORUM (2), R=LOCAL_QUORUM (2): 2 + 2 = 4 > 3. Guaranteed overlap. The typical p99 write latency at LOCAL_QUORUM on commodity hardware is roughly 5–20 ms within a single datacenter.

Use LOCAL_QUORUM for writes on anything where data loss is unacceptable — user profiles, orders, payments. Two replicas in your local DC must confirm before the coordinator returns success.

# Python driver — consistency per-statement
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

stmt = SimpleStatement(
    "INSERT INTO orders (order_id, user_id, total) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
session.execute(stmt, (order_id, user_id, total))
# Waits for 2/3 replicas in the local DC to confirm

Read at ONE when you're OK with slightly stale data — product catalog, recommendation feeds, analytics. The coordinator picks the closest replica, which is extremely fast but may be a few milliseconds behind.

stmt = SimpleStatement(
    "SELECT * FROM product_catalog WHERE product_id = %s",
    consistency_level=ConsistencyLevel.ONE
)
row = session.execute(stmt, (product_id,)).one()
# Returns immediately from nearest replica — may be slightly stale

To guarantee strong consistency: write at LOCAL_QUORUM AND read at LOCAL_QUORUM. The overlapping replica ensures you always read the latest write. Use this for financial records, inventory counts, anything where stale reads cause real harm.

# Strong consistency: R + W > N
# RF=3, W=LOCAL_QUORUM(2), R=LOCAL_QUORUM(2) → 4 > 3 ✓

write_stmt = SimpleStatement(
    "UPDATE inventory SET stock = %s WHERE sku = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
read_stmt = SimpleStatement(
    "SELECT stock FROM inventory WHERE sku = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
session.execute(write_stmt, (new_stock, sku))
row = session.execute(read_stmt, (sku,)).one()
# Guaranteed to see the write above ✓

Per-query tunable consistency lets the same cluster serve audit logs at QUORUM and analytics counters at ONE — the formula R+W>N guarantees strong consistency when needed, while lower levels trade safety for speed.

Section 9

Storage: SSTables, Memtables & Commit Log

Here's the secret behind Cassandra's write speed: writes never go directly to disk in random positions. Random disk writes are slow — even on SSDs. Instead, Cassandra layers three structures that turn random writes into sequential ones. This is called an LSM tree (Log-Structured Merge tree), and it's the same idea behind RocksDB, LevelDB, and HBase.

The Four Components

Commit Log

Every write is appended to the commit log on disk before anything else. This is a crash safety net — if the server loses power while data is in the memtable (RAM), the commit log lets Cassandra replay missed writes on restart. It's append-only so it's extremely fast.

Why it's append-only: sequential disk writes are 10–100x faster than random writes. The commit log never seeks backwards.

Memtable

An in-memory sorted data structure — one per table. Writes land here after the commit log. Reads check here first. When the memtable fills up (or a timer fires), it's flushed to an SSTable on disk. Flushing is sequential, which is fast.

Why in memory: RAM is orders of magnitude faster than disk. Batching up many writes and flushing as one sequential write is the core trick behind Cassandra's write throughput — a single node can handle tens of thousands of writes per second on commodity hardware.

SSTable (Sorted String Table)

An immutable, sorted file on disk. Once written, it is never modified — updates and deletes create new entries (deletes create a "tombstone"). Many SSTables accumulate over time for the same table. Compaction periodically merges them, resolving conflicts by timestamp and discarding tombstones.

Why immutable: updating an existing file requires random I/O. Never updating keeps all disk writes sequential. The trade-off is that reads may need to check multiple SSTables — which is why bloom filters exist.

Bloom Filter

A probabilistic data structure stored in memory for each SSTable. Before reading an SSTable from disk, Cassandra asks the bloom filter: "does this SSTable maybe contain key X?" If the filter says definitely not, Cassandra skips that SSTable entirely — saving a disk read. The filter can have false positives (say "maybe" when it shouldn't) but never false negatives.

Why it matters: in practice, a bloom filter eliminates roughly 80–95% of unnecessary SSTable reads. Without it, Cassandra would have to check every SSTable for every key — reads would be slow regardless of write speed.

Writes go to commit log + memtable (fast); SSTables flushed periodically; reads check bloom filter then memtable then SSTables; compaction reclaims space — this LSM-tree design is why Cassandra crushes write throughput.

Section 10

Tunable Consistency in Action

Imagine running an e-commerce platform. Your audit log absolutely cannot lose a row — you'd fail a compliance audit. Your product view counter, on the other hand, being off by a few counts for a few seconds is completely fine. Your user profile needs to be consistent within a datacenter but can lag slightly across continents. In most databases, you're forced to use the same consistency level for everything. Cassandra lets each of these workloads use the right level for it.

Audit Logging — Write at QUORUM, Read at QUORUM

Compliance logs must never be lost or return stale data. QUORUM means a majority of all replicas across DCs must confirm. R + W = 2 + 2 = 4 > 3 — overlap guaranteed, strong consistency achieved. The trade-off: writes wait for cross-DC acknowledgment, adding latency. For audit logs, this is acceptable.

Time-Series Telemetry — Write at ONE, Read at ONE

Analytics counters and event streams can tolerate a few seconds of staleness. Write at ONE means Cassandra returns success the moment one replica confirms — the other two will catch up. Read at ONE means you might see a slightly old count, which is perfectly fine for a dashboard that refreshes every 30 seconds anyway. Extremely fast, low overhead.

User Profile — Write at LOCAL_QUORUM, Read at LOCAL_ONE

User preferences need to be consistent within the local datacenter (your latest bio change should be visible immediately), but slight cross-DC lag is fine. LOCAL_QUORUM writes ensure 2 of 3 local replicas have the data. LOCAL_ONE reads are very fast — just one local replica needed. Cross-DC replication happens asynchronously in the background.

Cross-DC Replicated Cache — EACH_QUORUM Write, LOCAL_QUORUM Read

For globally consistent cached data (e.g., feature flags, rate limit configs), EACH_QUORUM ensures a quorum of replicas in every DC confirms the write before returning. Reads use LOCAL_QUORUM for speed. This gives you the strongest consistency guarantee Cassandra can offer across multiple DCs, at the cost of write latency proportional to your slowest DC.

Why this matters architecturally: per-query tunability is impossible in most SQL databases — you get one setting for the whole cluster. Cassandra's approach means a single cluster can replace what would otherwise require separate database deployments for "fast analytics store" vs. "strongly consistent OLTP store." This simplifies operations significantly.

Same cluster, same data, but each query picks its own consistency dial — audit logging at QUORUM, telemetry at ONE, profiles at LOCAL_QUORUM, cross-DC cache at EACH_QUORUM. This per-query tunability is impossible in most SQL databases.

Section 11

Lightweight Transactions & Counters

Cassandra is eventually consistent by default — but sometimes you genuinely need "if X is still true, do Y." Think of claiming a unique username: two users try to register "alice" simultaneously. Without any coordination, both succeed, and now you have two accounts with the same name. This is exactly the kind of problem Lightweight Transactions (LWT) solve.

LWT uses the Paxos consensus protocol — a distributed agreement algorithm — to coordinate across replicas. Before writing, Cassandra runs a four-phase protocol (prepare/promise, read, propose/accept, commit) to ensure all replicas agree on the current state of that row. Only then does the write proceed. This is robust, but expensive: roughly 4 round trips instead of 1, which translates to roughly 3–4× the latency of a normal write (Paxos v2 in Cassandra 4.1+ cuts this in half).

LWT — IF NOT EXISTS / IF (Compare-and-Swap)

LWT adds conditional logic to writes: "insert this row only if it doesn't already exist" or "update this column only if its current value is X." The Paxos protocol ensures the check and the write are atomic across all replicas — no two clients can win the same race condition.

Best uses: unique username/email registration, seat reservation, coupon redemption — any "first one wins" scenario. Avoid on hot rows (many concurrent writers serialize through Paxos, creating a bottleneck).

Counters — Atomic Increments Across Replicas

Cassandra has a special counter column type that handles distributed increments safely. Rather than "read current value, add 1, write back" (which has race conditions), a counter write says "add N to whatever the current value is." Cassandra reconciles the deltas across replicas automatically.

Constraint: counter columns cannot live in the same table as regular columns. A counter table contains only the partition key, clustering columns, and counter columns — nothing else. Counter reads are also eventually consistent, not strongly consistent.

Use IF NOT EXISTS to guarantee uniqueness — only the first writer succeeds. Cassandra returns a result set with an [applied] column: true if the insert happened, false if a row already existed.

-- Reserve a username: only one client can win
INSERT INTO users (username, email, created_at)
VALUES ('alice', 'alice@example.com', toTimestamp(now()))
IF NOT EXISTS;

-- Check [applied] in your driver code:
-- Row(applied=True)  → success, username claimed
-- Row(applied=False) → username already taken

Use IF <condition> for compare-and-swap: "only update if the current value is what I expect." This prevents lost-update races where two clients both read a stale value and overwrite each other's changes.

-- State machine transition: only move from 'pending' to 'confirmed'
UPDATE orders
SET status = 'confirmed', confirmed_at = toTimestamp(now())
WHERE order_id = 12345
IF status = 'pending';

-- If another process already changed status, [applied]=false
-- Your code should re-read and decide what to do next

Counter tables are dedicated — they cannot mix with regular data columns. Increments and decrements are atomic; Cassandra resolves concurrent updates across replicas automatically.

-- Counter table definition (counters only — no regular columns)
CREATE TABLE page_views (
  page_id   text,
  day       date,
  views     counter,
  PRIMARY KEY (page_id, day)
);

-- Increment: safe to call from multiple nodes simultaneously
UPDATE page_views
SET views = views + 1
WHERE page_id = 'homepage' AND day = '2026-05-09';

-- Read (eventually consistent — may be slightly behind)
SELECT views FROM page_views
WHERE page_id = 'homepage' AND day = '2026-05-09';

Performance warning: never use LWT on a row that receives many concurrent writes. Every writer must serialize through Paxos — one at a time. For a popular item (think: a product with thousands of simultaneous purchasers), this creates a traffic jam. Consider using counters, or redesigning to avoid the contention entirely.

Lightweight transactions (LWT) use Paxos for compare-and-swap on critical operations like unique usernames; counters increment atomically across replicas. LWT is roughly 3–4× slower than normal writes (4 round trips vs 1) — use sparingly.

Section 12

Schema Design Patterns

In a relational database, you design the schema around the data model, then write whatever queries you need. In Cassandra, you do the opposite: you start with the queries and work backwards to the schema. Every table is essentially a pre-computed answer to a specific question. This sounds strange at first, but it becomes natural — and it's what makes Cassandra fast at scale.

The reason is physical: Cassandra can only efficiently read data within a single partition (one partition key) or do a range scan within a partition (using clustering columns). It cannot efficiently join two tables or filter on non-key columns across millions of partitions. So your schema must put the data a query needs in exactly the partition that query will hit.

Five Schema Patterns

Time-Series Bucketing

For sensor data, logs, or any timestamped stream: include a time bucket (day, month, year) in the partition key alongside the entity ID. The clustering key is the exact timestamp for ordering within the partition.

Why: without bucketing, all of a sensor's history lives in one partition that grows forever. Cassandra partitions should stay under roughly 100MB — a partition with 5 years of per-second sensor data would be gigabytes. Bucketing keeps each partition bounded and keeps the "last N days" query fast (it's just a range scan on the clustering key).

Multi-Key Lookup (Two Tables)

If you need "find user by ID" AND "find user by email," create two tables: users_by_id and users_by_email. Write to both in a BATCH statement (Cassandra's way of making multiple writes atomic across tables).

Why: you can't efficiently query by a non-partition-key column. Duplicating data costs storage (cheap) but avoids full-cluster scans (expensive). This is the fundamental trade-off in Cassandra: RAM and disk are cheap; network round trips and full-cluster scans are not.

Wide Row Aggregation

Collect all events for one entity in a single partition; use the clustering key to order them by time. To retrieve "all events for user X in the last 7 days," specify the partition key (user X) and give a range on the clustering key (timestamp >= 7 days ago). One partition scan, extremely fast.

Why: Cassandra is optimized for sequential reads within a partition (which are essentially sequential disk reads from an SSTable). Cross-partition reads require hitting multiple nodes — always prefer a single-partition design when the access pattern allows it.

Materialized Views

Cassandra can automatically maintain a denormalized copy of a table with a different partition key — called a Materialized View. You write to the base table; Cassandra keeps the view in sync automatically. This sounds like magic, but materialized views have well-documented consistency and performance issues at scale.

When to use cautiously: low-throughput tables where the alternative (application-level dual writes) is even harder to manage. Avoid materialized views on tables with high write rates — they add synchronization overhead that can cause availability issues.

Inverted Index (Manual)

To answer "find all products in category X," maintain a separate table with category as the partition key and product_id as a clustering column. A write to the product table also writes to this index table. On read, look up the index table first to get the list of IDs, then look up each product.

Why not use Cassandra's built-in secondary indexes? Built-in secondary indexes perform a scatter-gather across all nodes — fine for low-cardinality columns on small clusters, but they become slow and expensive as the cluster grows. A manually managed inverted index table gives you full control and predictable performance.

Including the month in the partition key keeps each partition to one month's worth of readings. The timestamp clustering key lets you range-scan within the month efficiently.

-- Table: one partition per (sensor, month)
CREATE TABLE sensor_readings (
  sensor_id  text,
  year_month text,          -- e.g. '2026-05' — the bucket
  recorded_at timestamp,    -- clustering key: fine-grained ordering
  temperature float,
  humidity    float,
  PRIMARY KEY ((sensor_id, year_month), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

-- Write: bucket is determined by application code
INSERT INTO sensor_readings
  (sensor_id, year_month, recorded_at, temperature, humidity)
VALUES
  ('sensor-42', '2026-05', toTimestamp(now()), 22.4, 58.1);

-- Query: last 30 days = just one partition scan
SELECT * FROM sensor_readings
WHERE sensor_id = 'sensor-42'
  AND year_month = '2026-05'
  AND recorded_at >= '2026-05-01'
LIMIT 1000;

Two tables, each optimized for one lookup pattern. A BATCH makes both writes atomic — either both succeed or neither does (within a single partition; cross-partition BATCHes are logged but not transactional).

-- Table 1: look up by user ID (primary access pattern)
CREATE TABLE users_by_id (
  user_id   uuid PRIMARY KEY,
  email     text,
  username  text,
  created_at timestamp
);

-- Table 2: look up by email (login flow)
CREATE TABLE users_by_email (
  email     text PRIMARY KEY,
  user_id   uuid,
  username  text
);

-- Application writes both atomically using BATCH
BEGIN BATCH
  INSERT INTO users_by_id   (user_id, email, username, created_at)
    VALUES (uuid(), 'alice@example.com', 'alice', toTimestamp(now()));
  INSERT INTO users_by_email (email, user_id, username)
    VALUES ('alice@example.com', uuid(), 'alice');
APPLY BATCH;

The index table stores category → product mappings. Writes go to both tables. Reads first look up the index to get product IDs, then fetch each product by primary key — both steps are single-partition lookups.

-- Main product table (by product ID)
CREATE TABLE products (
  product_id uuid PRIMARY KEY,
  name       text,
  price      decimal,
  category   text
);

-- Inverted index: category → product IDs
CREATE TABLE products_by_category (
  category   text,
  product_id uuid,
  name       text,
  price      decimal,
  PRIMARY KEY (category, product_id)
);

-- Application writes both on product creation
BEGIN BATCH
  INSERT INTO products (product_id, name, price, category)
    VALUES (uuid(), 'Wireless Mouse', 29.99, 'peripherals');
  INSERT INTO products_by_category (category, product_id, name, price)
    VALUES ('peripherals', uuid(), 'Wireless Mouse', 29.99);
APPLY BATCH;

-- Query by category: single partition read, fast
SELECT * FROM products_by_category
WHERE category = 'peripherals'
LIMIT 50;

Five patterns cover ~95% of Cassandra schemas: time-series bucketing, multi-key duplication for lookup variations, wide rows for aggregation, materialized views (with caveats), and inverted indexes for filtering — every pattern starts with the QUERY, not the data.

Section 13

Compaction Strategies

Every time Cassandra flushes data from memory to disk it creates a new immutable file called an SSTable. Left unchecked, dozens of SSTables pile up on disk. A read then has to check every single one to find all versions of a row — that's called read amplification, and it makes reads painfully slow. Compaction is Cassandra's housekeeping job: it merges SSTables together, applies tombstone deletions, and produces fewer, cleaner files. The strategy you pick tells Cassandra how to choose which files to merge.

The Four Strategies

STCS — SizeTieredCompactionStrategy

This is the default strategy and the simplest to understand. Cassandra groups SSTables by size and merges them when enough similarly-sized files accumulate — think of it like combining piles of papers that are roughly the same height. Because it only merges whole groups at once, each individual write causes little extra I/O. That's why STCS shines for write-heavy workloads. The downside: a read might still need to scan several large SSTables to reconstruct a row, so read performance is less predictable.

When to use: bulk ingestion, write-heavy IoT streams, batch ETL where reads are rare.

LCS — LeveledCompactionStrategy

LCS organizes SSTables into levels (L0, L1, L2 …). Each level has a size cap roughly 10x larger than the previous. When a level fills up, it compacts into the next. The key property: within any level, SSTables never overlap in key range. This means a read checks at most one SSTable per level — that's bounded, predictable read amplification. The cost: more frequent compaction activity means more disk writes (higher write amplification). Great for SLA-sensitive reads.

When to use: read-heavy workloads where consistent latency matters more than write throughput.

TWCS — TimeWindowCompactionStrategy

TWCS groups SSTables into time windows (e.g. one window per hour or per day). Data within a window only compacts with data in the same window. When a window is old enough and its TTL has expired, the entire SSTable can be dropped without reading it. This is transformative for time-series: instead of hunting through tombstones scattered across dozens of files, Cassandra just discards the whole old window. The caveat: TWCS assumes writes only arrive for the current window — late-arriving data can break the model.

When to use: sensor telemetry, app logs, analytics events — anything with a TTL and mostly-append access.

UCS — UnifiedCompactionStrategy (5.0+)

UCS is the newest addition and was designed to replace all three strategies with a single, tunable algorithm. It exposes a scaling_parameters knob that lets you dial between STCS-like (favour writes) and LCS-like (favour reads) behaviour without switching strategies. For most new clusters on Cassandra 5.0+, UCS is the recommended starting point because you can tune it in one place rather than migrating between strategies as workload patterns evolve.

When to use: new clusters on Cassandra 5.0+. Tune the single parameter rather than switching strategies later.

TWCS is essential for time-series — do not skip it. Teams that use STCS for time-series workloads (sensor data, application logs) often end up with compaction backlogs that grow faster than Cassandra can process them. Each flush creates new SSTables, compaction falls behind, read latency degrades, and disk fills up. Using TWCS and a matching TTL sidesteps this entirely because expired windows are simply discarded.

Compaction merges SSTables to limit read amplification; STCS favours writes, LCS favours reads, TWCS is essential for time-series, and UCS (5.0+) is a single tunable replacement for all three.

Section 14

Tombstones & Deletes

In most databases, DELETE means the row is gone immediately. In Cassandra, DELETE is a lie — or at least a delayed truth. Cassandra is append-only by design: every node writes data forward, never backwards. When you delete a row, Cassandra writes a special marker called a tombstone that says "this data is dead." The actual bytes sit on disk until compaction + a safety window (gc_grace_seconds) pass. Understanding tombstones is crucial because misusing them is one of the top causes of Cassandra performance disasters.

Why the 10-day grace period?

Imagine a replica node goes offline for a week. While it's down, a DELETE happens on the rest of the cluster. The replica comes back, and Cassandra runs repair to sync it. If the tombstone were already gone, the repair would just re-add the deleted row — a zombie resurrection. The gc_grace_seconds window gives every replica a safe period to learn about tombstones before they vanish. This is why you should never allow a node to be down longer than gc_grace_seconds before running repair.

The Tombstone Graveyard Anti-Pattern

The most damaging tombstone problem comes from treating Cassandra like a queue: constantly insert rows, then delete them after processing. The result is a partition that contains mostly tombstones. Reads have to scan all those dead markers to confirm no live rows exist underneath — Cassandra doesn't short-circuit this. Queries scanning more than roughly 1,000 tombstones emit a warning in the logs. Queries scanning more than roughly 100,000 tombstones are automatically aborted by default. That threshold is lower than you think for a busy partition.

Three Mitigation Patterns

TTL Columns

Instead of explicitly deleting rows, set a time-to-live when you write them: INSERT ... USING TTL 86400. Cassandra automatically marks them as expired and turns them into tombstones at the right time. This is much cleaner than explicit deletes because the expiry is baked into the write path. TWCS compaction (Section 13) then drops whole SSTables for expired windows — no tombstone scanning at read time at all.

TWCS for Time-Series

Pairing TTL with TimeWindowCompactionStrategy is the gold standard for time-bounded data. TWCS groups SSTables by time window, and once a window's TTL has elapsed, the entire SSTable is deleted without reading it. No tombstones accumulate, no reads slow down, no graveyard builds up. This is the architectural solution — don't fight tombstones, make them unnecessary.

Don't Use Cassandra as a Queue

The queue pattern (insert → process → delete → repeat) is the fastest path to tombstone hell. Cassandra was designed for high-volume writes that are rarely deleted and frequently read by known partition keys. If you need a queue, use Kafka, RabbitMQ, or Redis Streams — tools that were specifically built for the consume-and-delete access pattern.

Cassandra deletes write tombstone markers instead of removing data; tombstones are physically purged only after compaction plus gc_grace_seconds, so avoid queue-like insert-delete patterns that create tombstone graveyards.

Section 15

Partition Sizing

A Cassandra partition is all the rows that share the same partition key. If your table has PRIMARY KEY (user_id, event_time), then every row for user_id = 42 lives in the same partition — and that partition lives entirely on the same set of replica nodes. This is what makes reads fast: one network hop to the right node, then a sequential scan within the partition. The problem arises when one partition key attracts vastly more data than others, creating a hot partition: a single node drowns while the rest sit idle.

A partition over roughly 100 MB or 100,000 rows starts attracting operational problems: reads slow down, repair takes longer, streaming during node replacement takes longer. A partition over ~1 GB is genuinely painful to operate. Design your partition keys to keep partitions well-bounded before you have data — retrofitting this is hard.

Four Strategies for Bounded Partitions

Composite Partition Keys

Instead of partitioning by a single high-cardinality field, combine it with a natural time or category bucket: (user_id, year_month) instead of just user_id. A user active for five years would have 60 partitions instead of one monster partition. Each stays bounded, and queries for a specific month are still one-hop reads.

PRIMARY KEY ((user_id, year_month), event_time)

Bucketing

Add a synthetic bucket column derived from time or another range: bucket = FLOOR(event_time / 3600) (one bucket per hour). Your application queries the relevant bucket(s) and you know exactly how much data fits in each. This is essentially a manual form of composite partition key, useful when the natural time column has too high a resolution to use directly.

Salting

For keys that are inherently singular — think a global counter, a celebrity's social feed, or a viral post — append a random number suffix: post_id + "_" + RANDOM(0, 9). Writes scatter across 10 sub-partitions. Reads fan out to all 10 and merge results. This trades read complexity for write scalability, so use it only when you know a key will be genuinely extreme.

Hash-Based Sharding

Compute a shard from the natural key: shard = natural_key % 100. The shard becomes part of the partition key. This gives deterministic, even distribution across 100 sub-partitions without randomness. The benefit over salting: reads know exactly which shard to query — no fan-out needed. A good choice for uniform-access data with no time component.

Plan partition size from day one. A partition that grows past roughly 1 GB becomes operationally painful in multiple ways: reads slow as Cassandra loads more data from disk; repair has to stream a massive partition across the network; streaming during node replacement is slower. There is no online command to split an existing partition — you must rewrite the data with a new schema. Design for bounded partitions before you write row one.

All rows for a partition key land on the same node, so over-large or hot partitions create single-node bottlenecks; keep partitions under 100 MB and 100K rows using composite keys, bucketing, salting, or hash sharding.

Section 16

Performance & Tuning

Cassandra's performance ceiling is mostly determined by three things: schema design (the most important), consistency level (fewer replicas queried = faster but less safe), and hardware (SSDs are effectively mandatory for production). JVM and daemon tuning then let you squeeze the last bit of performance from your hardware. Here are the five most impactful levers operators reach for.

The Five Tuning Levers

Heap & Memory

Cassandra runs on the JVM. The recommended JVM heap is typically 8–16 GB — enough to buffer active memtables and key metadata without triggering frequent garbage collection pauses. Giving Cassandra too much heap actually hurts: large heaps mean longer GC stop-the-world pauses, which cause latency spikes. The rest of your RAM (often 50–100+ GB on modern servers) should be left to the OS page cache, which Cassandra relies on heavily for "warm" read paths.

Compaction Throughput

The compaction_throughput_mb_per_sec setting limits how aggressively compaction consumes disk I/O. Set it too low and compaction falls behind your write rate — SSTable count grows, reads slow down. Set it too high and compaction starves read and write I/O. A common starting point is around 64–128 MB/s per node on SSDs. Monitor pending compaction count and tune this until it stays comfortably low without impacting read latency.

Bloom Filter False Positive Rate

Bloom filters are in-memory probabilistic structures that tell Cassandra "this key is definitely NOT in this SSTable" — skipping an SSTable entirely without a disk read. The false positive rate controls accuracy vs memory cost. A lower FPR (say 0.01 vs 0.1) means fewer unnecessary disk reads but more RAM per SSTable. For read-heavy tables, investing memory in tighter bloom filters often yields more benefit than buying more disk.

Speculative Retry

When a primary replica is slow (network hiccup, GC pause, disk I/O spike), Cassandra can automatically send the same request to a second replica after a short timeout — a speculative retry. Whichever replica responds first wins. This dramatically cuts tail latency (p99, p999) because you no longer wait for a struggling replica. The cost is slightly higher overall load, since some requests hit two replicas. A common setting is 99percentile — retry if the replica is slower than the 99th-percentile read time.

Read Repair Chance

During a read, Cassandra can optionally compare the response from the coordinator replica against other replicas and fix any inconsistencies in the background. The legacy dclocal_read_repair_chance and read_repair_chance table options controlled how often this happened (0.0 to 1.0). These options were removed in Cassandra 4.0 (CASSANDRA-13910) in favour of a new read_repair table option (with values blocking or none) plus scheduled full-cluster repair. On modern clusters, rely on nodetool repair (or Reaper) instead.

Cassandra performance depends primarily on schema design, consistency level, and SSD hardware; key JVM/daemon levers include heap sizing (8-16 GB), compaction I/O budget, bloom filter FPR, and speculative retry for tail latency.

Section 17

Operations & Maintenance

Cassandra is famous for availability — but that availability requires active maintenance. Unlike managed SQL databases where the cloud provider handles backups and upgrades, Cassandra clusters need deliberate operational care. The six areas below cover what you'll actually spend time on in production. The good news: Cassandra's nodetool CLI gives you direct access to most of these operations without cluster downtime.

Repair

Replicas can drift out of sync over time — a node misses a write while it's briefly offline, or a hint expires before delivery. Repair is the housekeeping job that walks each replica, compares what they hold, and copies any missing rows so all replicas agree again. The technical name for this is the anti-entropy process. You run it with nodetool repair. The rule of thumb: repair every node at least once within gc_grace_seconds (default: within 10 days). Why? Because if a node was ever down during a write (and hints were dropped or exceeded), the node may have missed writes. Repair catches those gaps. Skipping repair is the single most common cause of stale reads even at QUORUM consistency.

Recommended practice: schedule repair with a tool like Reaper (open source) to run rolling repairs continuously, so no node ever goes more than a week without a partial repair pass.

Backups

Cassandra's nodetool snapshot command creates a point-in-time snapshot by hard-linking all current SSTable files into a snapshot directory. Because SSTables are immutable, hard links are instant and cheap — no data is copied. Your backup job then uploads those SSTable files to remote storage (S3, GCS, etc.). To restore, copy the SSTables back and use nodetool refresh or a full sstableloader import. Full cluster backup requires snapshotting all nodes and uploading from each.

Monitoring

Cassandra exposes hundreds of metrics over JMX. In practice, teams deploy a Prometheus JMX exporter and scrape a curated set of key SLOs:

Read/Write latency p99 — primary user-facing signal
Pending compactions — if climbing, compaction can't keep up with writes
Hint queue depth — non-zero means a node is unreachable; high = risk of data loss
Dropped mutations — write requests that were rejected; a critical alert threshold
GC pause duration — sustained pauses over 200ms indicate heap tuning is needed

Adding Nodes

Adding a node to a Cassandra cluster is called bootstrapping. The new node joins the ring, announces its token range, and existing nodes stream their share of data to it. For large clusters with terabytes of data, this can take hours or days. Cassandra remains fully available during this process because the existing nodes still serve all requests while streaming happens in the background. After bootstrap completes, run nodetool cleanup on the other nodes to remove data they no longer own.

Removing Nodes

Graceful node removal is called decommission: nodetool decommission on the node being removed. Cassandra streams that node's data to its neighbours, then the node leaves the ring cleanly. If a node is dead and unrecoverable, use nodetool removenode from another node in the cluster. In both cases, never remove more nodes simultaneously than your replication factor can tolerate — losing RF-many nodes at once means data loss.

Upgrades

Cassandra supports rolling upgrades — upgrade one node at a time while the cluster stays online. The protocol is: stop node → upgrade binary → restart → verify healthy → move to next node. Because all nodes must be able to communicate during the rolling upgrade, Cassandra maintains backward compatibility for at least one major version. Always test the upgrade in staging first, and follow the official upgrade notes for your specific version jump (some versions require an intermediate stop, e.g. 3.x → 4.0 → 5.0).

Skipping repair is the #1 operational sin. Engineers often disable or forget to schedule repair because it generates I/O and "the cluster seems fine." Inconsistencies accumulate silently. Eventually, a node goes down, data diverges, and queries return stale results even at QUORUM — because QUORUM only guarantees you've heard from a majority, not that the majority is correct. Regular repair is the only way to guarantee consistency over time.

Running Cassandra in production requires scheduled repair (within gc_grace_seconds), snapshot-based backups, JMX metric monitoring, and careful rolling procedures for adding nodes, removing nodes, and version upgrades.

Section 18

Cassandra vs Alternatives

Cassandra occupies a specific niche: massive write throughput, global distribution, no single point of failure, predictable latency at scale. Other databases cover overlapping territory but make different trade-offs. Knowing which tool fits which problem is more valuable than knowing any one tool deeply — the wrong database choice is expensive to undo.

Five Alternatives at a Glance

ScyllaDB

ScyllaDB is a complete rewrite of Cassandra in C++. It exposes the exact same CQL interface and wire protocol, so existing Cassandra drivers and tooling work without changes. The C++ implementation sidesteps JVM garbage collection pauses and uses a shared-nothing architecture that avoids lock contention between cores. The result: dramatically better single-node throughput and more consistent tail latency. Discord's widely-read engineering blog post documented their migration of trillions of messages from Cassandra to ScyllaDB specifically to eliminate latency spikes caused by JVM GC pauses. If you're considering Cassandra for a new project and have a strong performance requirement, ScyllaDB is worth evaluating first.

DynamoDB

DynamoDB is AWS's fully managed wide-column database. Like Cassandra, it has no JOINs and no cross-partition transactions (though single-item operations are atomic). The key difference is operational: DynamoDB is a hosted service — you don't manage nodes, JVM, compaction, or repair. You pay per request or for provisioned capacity. This is compelling for teams that want Cassandra-like scale without the operational burden. The hard constraints: it's AWS-only, vendor lock-in is significant, and cost can be unpredictable at very high throughput.

HBase

HBase is Apache's wide-column store built on top of HDFS (Hadoop's distributed file system). Like Cassandra, it's inspired by Google's Bigtable paper. Unlike Cassandra, HBase has a single master node (with hot-standby) and relies on Hadoop's infrastructure for storage. This makes HBase a good fit for teams already running a Hadoop ecosystem who want Cassandra-like access patterns. For greenfield projects, the operational overhead of maintaining Hadoop plus HBase makes Cassandra or ScyllaDB a more pragmatic choice.

CockroachDB

CockroachDB is a distributed SQL database with full ACID transactions. It uses the Raft consensus protocol to replicate data globally and presents a PostgreSQL-compatible SQL interface. When you need JOINs, foreign key constraints, and strong transactional guarantees at global scale, CockroachDB (or Google Spanner) fills the gap that Cassandra deliberately ignores. The trade-off is latency: ACID cross-shard transactions require network round-trips for coordination, which Cassandra avoids by forgoing the guarantee entirely.

Google Bigtable

Cassandra was directly inspired by the Bigtable paper (Chang et al., 2006) and the Amazon Dynamo paper. Bigtable is Google's internal wide-column store, available externally as a managed Google Cloud service. It powers Google Search, Maps, and Gmail. Bigtable has a slightly different API (no CQL; uses HBase-compatible Java API via Cloud Bigtable), requires Google Cloud, and has similar trade-offs to Cassandra in terms of no SQL JOINs and partition-key-oriented access. Useful context: understanding Bigtable deeply helps you understand Cassandra, because many of its concepts (tablets, SSTs, memtable) map directly.

ScyllaDB is wire-compatible with Cassandra. Teams at Discord, Numberly, and others have migrated from Cassandra to ScyllaDB without changing application code — only the cluster endpoints changed. If you're experiencing JVM GC-related latency spikes and aren't ready to re-architect your data model, a ScyllaDB migration is one of the most impactful things you can do for performance with the least application risk.

Cassandra fits high-write, low-ACID workloads; ScyllaDB is a drop-in C++ replacement with better tail latency, DynamoDB is the managed AWS alternative, CockroachDB covers distributed SQL needs, and Bigtable is the Google Cloud equivalent that originally inspired Cassandra.

Section 19

Tools & Drivers — Your Cassandra Toolbox

Cassandra has a well-rounded toolbox for every phase of development and operations. You will use the CQL shell to explore and prototype, the nodetool CLI to check cluster health and run maintenance tasks, the official DataStax drivers to connect from your application code, and a cloud-hosted option when you want all of this without managing servers yourself. Here is the rundown of the six tools you will reach for most.

cqlsh — CQL Shell

The official command-line shell for Cassandra. CQL stands for Cassandra Query Language — it looks and feels a lot like SQL, which makes it accessible to anyone who already knows relational databases. You open a REPL with cqlsh (or cqlsh hostname port for a remote node) and run statements interactively: CREATE TABLE, INSERT, SELECT, DESCRIBE TABLE. It is your first stop for exploring an unfamiliar cluster, checking table schema, and prototyping queries before wiring them into application code. cqlsh also supports COPY FROM / COPY TO for bulk CSV import and export — useful for small data migrations and test fixtures. Always start here when debugging production query issues: confirm the row actually exists, confirm the partition key you are using, and see the raw data before blaming your application code.

nodetool — Operational CLI

The primary operational command-line tool that ships with Cassandra. Think of it as the system administrator's remote control for the cluster. The commands you will use most often: nodetool status shows every node's IP, state (Up/Down), load, and ownership percentage of the token ring — the quick health check you run first when something seems wrong. nodetool repair keyspace table synchronises data between replicas to fix drift (run this weekly — see the Disasters section). nodetool flush keyspace forces in-memory memtables to be written to SSTables on disk immediately — useful before a snapshot or before a node maintenance window. nodetool snapshot creates a point-in-time backup by hard-linking SSTable files. nodetool compactionstats shows ongoing compaction work. nodetool tpstats exposes thread pool saturation — the best early indicator of a node that is falling behind.

DataStax Studio

A browser-based IDE and notebook for Cassandra, similar in spirit to Jupyter notebooks for data science. You write CQL cells, run them, and see results inline — great for exploratory analysis, schema documentation, and sharing query patterns with teammates in a format that is more readable than raw cqlsh output. Studio also understands Gremlin (for DSE Graph) and Spark SQL (for DSE Analytics) if your deployment uses those DataStax Enterprise features. For pure open-source Cassandra users, Studio's CQL notebook is the most useful part. It is free to download and run locally; DataStax Astra DB users get a hosted version in the Astra console. Use it for onboarding new engineers to a schema — a shared notebook of "here are the five queries our service runs and why" is far more useful than a wall of comments in code.

DataStax Drivers

DataStax maintains officially supported drivers for the major languages your services are likely written in. Java: the DataStax Java Driver — async, reactive-friendly, built-in connection pooling and load balancing. Python: cassandra-driver — the reference Python implementation, synchronous by default with async event loop support. C# / .NET: DataStax C# Driver — full CQL support with LINQ integration. Community-maintained drivers exist for Go (gocql — widely used, battle-tested), Rust (scylla-rust-driver also works with Cassandra), and Node.js (datastax/nodejs-driver). All drivers handle token-aware routing automatically — meaning a write for partition key X goes directly to the node that owns X, skipping an extra network hop. This token awareness is important for performance: without it, every request routes through a coordinator node first, doubling network latency for simple queries.

Spring Data Cassandra

The Spring framework integration for Cassandra, built on top of the DataStax Java Driver. If your service is a Java Spring Boot application, Spring Data Cassandra gives you a familiar repository abstraction: annotate a Java class with @Table, annotate your primary key field with @PrimaryKey, extend CassandraRepository, and Spring generates the CQL automatically for common CRUD operations. It handles object mapping (Java class ↔ Cassandra row), connection lifecycle, and session management. For complex queries, you drop down to CqlTemplate for direct CQL execution. The trade-off of any ORM-style abstraction: it is convenient for simple cases but can hide the performance implications of your query structure — always check what CQL it is actually generating when performance matters.

Astra DB

DataStax's managed, serverless Cassandra cloud offering. You create a database in a browser, pick a cloud provider and region, and get a Cassandra-compatible endpoint without managing nodes, repair schedules, compaction, or upgrades. Astra DB is built on Apache Cassandra internally, so the CQL you write locally works unchanged in Astra. It also exposes a REST API and a GraphQL API as additional interfaces on top of CQL — useful for serverless functions and edge deployments. The free tier is generous enough for learning and small projects. For production use, the pricing model is consumption-based (reads + writes + storage), which is cost-effective for bursty workloads but can be more expensive than self-hosted Cassandra at sustained high throughput. Consider Astra when your team lacks Cassandra operational expertise or when time-to-production beats cost optimisation.

Run these directly in a cqlsh session. The shell is synchronous and stateful — every CQL statement returns before the next prompt appears, and the session stays open until you type EXIT. Great for schema exploration and one-off admin tasks.

-- Connect to a local Cassandra node (default port 9042)
cqlsh

-- Or connect to a remote node with credentials
cqlsh my-cassandra.example.com 9042 -u cassandra -p mypassword

-- Create a keyspace (use NetworkTopologyStrategy in production)
CREATE KEYSPACE IF NOT EXISTS events
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3   -- 3 replicas in datacenter1
  };

USE events;

-- Create a time-series table (partition by sensor, cluster by time DESC)
CREATE TABLE IF NOT EXISTS readings (
  sensor_id   UUID,
  recorded_at TIMESTAMP,
  value       DOUBLE,
  unit        TEXT,
  PRIMARY KEY (sensor_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

-- Insert a row
INSERT INTO readings (sensor_id, recorded_at, value, unit)
VALUES (uuid(), toTimestamp(now()), 22.4, 'celsius');

-- Query the latest 10 readings for a sensor (fast: single partition)
SELECT recorded_at, value, unit
FROM readings
WHERE sensor_id = 550e8400-e29b-41d4-a716-446655440000
LIMIT 10;

-- Inspect schema
DESCRIBE TABLE readings;

-- Explore cluster topology
DESCRIBE CLUSTER;

The cassandra-driver for Python manages a session pool internally. Create one Cluster and one Session per process and reuse them — opening a new session per request is expensive. Always use prepared statements for queries you run repeatedly; the driver sends the statement once, gets back an ID, and sends only the ID + values on subsequent executions.

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import DCAwareRoundRobinPolicy
import os

# One Cluster + one Session — reuse across the application
auth = PlainTextAuthProvider(
    username=os.environ["CASS_USER"],
    password=os.environ["CASS_PASS"],
)
cluster = Cluster(
    contact_points=[os.environ["CASS_HOST"]],
    auth_provider=auth,
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="datacenter1"),
)
session = cluster.connect("events")

# --- Prepared statement (prepare once, execute many times) ---
INSERT_READING = session.prepare("""
    INSERT INTO readings (sensor_id, recorded_at, value, unit)
    VALUES (?, ?, ?, ?)
""")

GET_LATEST = session.prepare("""
    SELECT recorded_at, value, unit
    FROM readings
    WHERE sensor_id = ?
    LIMIT ?
""")

def record_reading(sensor_id, ts, value, unit):
    """Write-heavy: Cassandra shines here."""
    session.execute(INSERT_READING, (sensor_id, ts, value, unit))

def get_latest(sensor_id, n=10):
    """Fast single-partition query — O(log N) in partition."""
    rows = session.execute(GET_LATEST, (sensor_id, n))
    return [{"ts": r.recorded_at, "val": r.value, "unit": r.unit} for r in rows]

# --- Batch insert (same partition only — mixed partitions = anti-pattern) ---
from cassandra.query import BatchStatement, BatchType

def bulk_insert(sensor_id, readings):
    """Only use batch when all rows share the same partition key."""
    batch = BatchStatement(batch_type=BatchType.UNLOGGED)
    for ts, value, unit in readings:
        batch.add(INSERT_READING, (sensor_id, ts, value, unit))
    session.execute(batch)

nodetool runs locally on each Cassandra node (it connects to the JMX port). These are the commands you will run most often during routine operations and incident investigation. Run nodetool help <command> for full option documentation.

# ── CLUSTER HEALTH ──────────────────────────────────────────────────────
# Quick overall view: node states, load, token ownership
nodetool status

# Verbose: shows address, rack, DC, token ownership per node
nodetool status --resolve-ip

# ── MAINTENANCE ─────────────────────────────────────────────────────────
# Run repair on a keyspace (sync replicas, fix drift — schedule WEEKLY)
nodetool repair events

# Repair a specific table only
nodetool repair events readings

# Force a memtable flush to SSTables (run before a snapshot or restart)
nodetool flush events readings

# ── BACKUPS ─────────────────────────────────────────────────────────────
# Create a snapshot (hard-links SSTable files — instant, space-efficient)
nodetool snapshot --tag my-snapshot-2026-05-09 events

# List all snapshots
nodetool listsnapshots

# Clear a snapshot when done (free the hard-linked space)
nodetool clearsnapshot --tag my-snapshot-2026-05-09

# ── PERFORMANCE INVESTIGATION ────────────────────────────────────────────
# Thread pool saturation (look for Pending > 0 in any stage)
nodetool tpstats

# Ongoing compaction jobs (how much I/O is compaction eating?)
nodetool compactionstats

# Current reads/writes per second, latency percentiles
nodetool tablestats events.readings

# ── DECOMMISSION / REPLACE ───────────────────────────────────────────────
# Gracefully remove THIS node from the ring (streams its data first)
nodetool decommission

The Cassandra toolbox: cqlsh for interactive CQL exploration and schema inspection, nodetool for all operational tasks (status checks, repair, snapshots, compaction monitoring), DataStax Studio for notebook-style query development and team sharing, the official Java/Python/C# drivers and community Go/Rust/Node drivers for application integration, Spring Data Cassandra for Java framework convenience, and Astra DB for managed serverless Cassandra when self-hosting is not worth the operational cost.

Section 20

Common Misconceptions About Cassandra

Cassandra has a reputation built up over years — and parts of that reputation are out of date, oversimplified, or just plain wrong. The six beliefs below come up repeatedly in interviews, design reviews, and tech blog posts. Each one has a grain of truth, which is exactly why it persists. Knowing why each one is at least partly wrong is more useful than just memorising the correction.

"Cassandra is just another NoSQL database."

This is technically true in the broadest sense — Cassandra is not a relational SQL database. But "NoSQL" is a notoriously vague umbrella that covers document stores (MongoDB), key-value stores (Redis), graph databases (Neo4j), and wide-column stores (Cassandra, HBase) — all with completely different architectures and trade-offs. Calling Cassandra "just NoSQL" is like calling a helicopter "just a vehicle." The accurate description is: Cassandra is a wide-column, distributed, write-optimised database designed for multi-data-centre deployments. Its LSM-tree storage engine, leaderless replication, peer-to-peer architecture, and tunable consistency model are specific design choices that set it apart from every other NoSQL database. When you say "NoSQL" without qualifying it, you lose all of that nuance — and nuance is what drives the decision to use Cassandra versus MongoDB, DynamoDB, or HBase.

"Eventual consistency means your data can get lost or corrupted."

This confuses two different things: availability and correctness. Eventual consistency means that if you stop writing, all replicas will eventually converge to the same value — it says nothing about data loss. In Cassandra, writes are never silently discarded; they are written to a commit log on disk before the acknowledgement is sent to the client. What eventual consistency actually means in practice: a reader might briefly see stale data if they read from a replica that has not yet received the latest write. That brief window is typically milliseconds in a healthy cluster. Furthermore, you can opt out of eventual consistency entirely: QUORUM consistency requires a majority of replicas to agree before a read returns — if R + W > N (replication factor), you are guaranteed to read your own writes every time. "Eventual consistency" is a default configuration choice, not an immutable property of Cassandra.

"Cassandra has no transactions."

Cassandra does not support full multi-row, multi-partition ACID transactions — that part is accurate. But it is not transaction-free. Cassandra has Lightweight Transactions (LWT), which implement a compare-and-swap (CAS) operation using the Paxos consensus protocol. An example: INSERT INTO users (email) VALUES ('alice@x.com') IF NOT EXISTS — this is an atomic operation that only inserts if the row does not already exist. Similarly, UPDATE accounts SET balance = 90 WHERE id = 1 IF balance = 100 updates only if the current value matches. LWTs are correct and safe, but they are significantly slower than regular Cassandra writes (roughly 4× the latency because of Paxos round-trips) and should be used sparingly on hot rows. For true multi-row atomic operations, Cassandra is not the right tool — that is a use case for a relational database or a distributed SQL system.

"Cassandra scales horizontally, so it is good for any big-data problem."

Cassandra scales beautifully — but only for a specific shape of problem. Its sweet spot is: write-heavy at scale, time-series or event data, simple access patterns that you know in advance, and multi-data-centre replication. Step outside that sweet spot and Cassandra fights you. It has no JOINs — data that needs to be queried in multiple ways must be stored in multiple denormalised tables. It has no aggregations beyond basic COUNT and SUM on small result sets — analytics workloads need a separate system (Spark, BigQuery, etc). Complex ad-hoc queries are impractical because partition key equality is required. If your problem involves unpredictable query patterns, lots of joins, or heavy analytics, Cassandra is the wrong tool regardless of scale.

"Schema changes in Cassandra are easy — just add a column."

Adding a new column to an existing table is indeed painless — it is an O(1) metadata operation and Cassandra handles it with zero downtime. But the deeper truth is that Cassandra's schema is tightly coupled to your query patterns, not just your data shape. The partition key determines which node stores your data. The clustering key determines sort order within a partition. If your access patterns change — you need to query by a different field, or sort in a different order — you cannot just alter the primary key. You have to create a new table, backfill the data, and migrate your application. This is fundamentally different from a relational database where you can add an index and immediately query by a new field. The correct mental model: design your Cassandra schema around your queries first, and treat any change to query patterns as a potentially large migration effort.

"Higher consistency level is always the safe choice."

This sounds intuitive — more consistency, more safety — but it gets the trade-off backwards for many workloads. In Cassandra, choosing ALL consistency (every replica must respond) gives you the strongest consistency guarantee but also the weakest availability: if any one replica is down, your query fails. QUORUM requires a majority — balanced consistency and availability. ONE or LOCAL_ONE gives you the fastest, most available reads at the cost of potentially reading slightly stale data. The right consistency level depends on what you are actually asking for. A sensor reading from two seconds ago being one value vs another is usually fine — use LOCAL_ONE. A financial balance that must always be current warrants LOCAL_QUORUM or QUORUM. Blanket-applying ALL to everything means a single unhealthy node breaks your entire application during routine maintenance or failure scenarios. Tune consistency level per query, not per cluster.

Six Cassandra misconceptions debunked: it is a wide-column write-optimised system, not generic NoSQL. Eventual consistency is about brief read staleness, not data loss — QUORUM gives strong consistency. Lightweight Transactions (Paxos CAS) exist but are slow; use sparingly. Cassandra only scales well for write-heavy, simple-access patterns — not JOINs or analytics. Schema changes beyond adding columns can require full table migrations; design schema around queries first. Higher consistency = lower availability during failures — tune per query, not per cluster.

Section 21

Real-World Disasters & Lessons

These are real patterns drawn from widely-reported production failures and recurring incidents that Cassandra practitioners encounter repeatedly. Every one was preventable. Read them as the most concrete possible evidence that table design, compaction strategy, and operational habits are not abstract concerns — they are the difference between a five-minute fix and a multi-hour outage.

Disaster 1 — Tombstone Graveyard (Cassandra Used as a Queue)

Incident: A team built a job queue on top of Cassandra — insert a row when a job arrives, delete the row when the job is processed. The table looked clean in monitoring: usually just a few thousand pending jobs. Within months, reads started timing out. Investigation revealed over 100,000 tombstones per partition from all the deleted rows.

Why it happens: In Cassandra, deleting a row does not immediately remove it from disk. Instead, Cassandra writes a special marker called a tombstone — a record saying "this row was deleted at time T." The actual SSTable files are immutable; data is never updated in place. Tombstones are cleaned up by compaction, but only after gc_grace_seconds (default 10 days) have passed since the deletion — this grace period exists to prevent deleted rows from "coming back to life" on a replica that missed the delete. A read that scans a partition must step over every tombstone to find live rows. With 100,000 tombstones and two live rows, every read becomes an O(100,000) scan.

The lesson: Cassandra is not a queue. If your write pattern is "insert + delete," you will accumulate tombstones faster than compaction removes them. Use Kafka, RabbitMQ, or SQS for job queues. Cassandra's strength is append-only time-series data — store events and never delete them (or use TTL to expire old data in bulk, which is far friendlier to the compaction engine).

Disaster 2 — Hot Partition from Low-Cardinality Key

Incident: An analytics team partitioned a user-events table by country_code (e.g., "US", "GB", "IN"). The United States generated roughly 70% of all traffic. One Cassandra node became responsible for the "US" partition and received 70% of all writes while other nodes were nearly idle. That node's CPU, disk I/O, and network all saturated. US users experienced degraded latency; other regions were fine.

Why it happens: In Cassandra's ring architecture, each partition key hashes to a token and lands on the node(s) responsible for that token range. If your partition key has very few distinct values and some values are far more common than others, you get a hot partition — one node receiving a disproportionate share of traffic. The remaining nodes sit mostly idle while the hot node becomes a bottleneck, negating Cassandra's horizontal scaling entirely.

The lesson: Partition keys must have high cardinality and roughly uniform distribution. If you genuinely need to partition by a low-cardinality field like country, add a bucket suffix to spread load: country_code + bucket_id where bucket_id = hash(user_id) % 100 distributes one logical "US" partition across 100 physical partitions. The trade-off is that queries that need all US data must now query 100 partitions (a scatter-gather), but write performance and balance improve dramatically.

Disaster 3 — Three Years Without Running Repair

Incident: A small startup ran Cassandra in production for three years. Repair had never been set up — the original engineer who deployed Cassandra had not read past the "get it running" section of the docs. Over time, subtle replica drift accumulated. QUORUM reads began returning stale data inconsistently. Debugging was extremely difficult because the issue was non-deterministic.

Why it happens: Cassandra's leaderless replication means each replica accepts writes independently. If a replica briefly goes offline (network blip, rolling restart, maintenance), it misses some writes. When it comes back, it receives hints or read repairs for recently missed data — but hints expire after max_hint_window_in_ms (about 3 hours by default). For long-lived replica outages, or for subtle bit-rot over years, some data may silently diverge between replicas without any error being thrown.

The lesson: Run nodetool repair on every keyspace, on every node, at least once per week. Repair is a full anti-entropy synchronisation — it compares Merkle tree hashes between replicas and streams any missing rows. It is I/O-intensive, so schedule it during low-traffic windows and stagger it across nodes. Many teams automate repair with tools like Reaper (open-source Cassandra repair scheduler). Treat un-repaired Cassandra like un-vacuumed PostgreSQL — it looks fine until suddenly it doesn't.

Disaster 4 — Wrong Compaction Strategy for Time-Series Data

Incident: A time-series team used Cassandra's default compaction strategy (STCS — Size-Tiered Compaction Strategy) for a table storing one year's worth of sensor readings. By month 12, the table had grown to hundreds of gigabytes. Compaction constantly re-read and rewrote old data that would never be queried again, burning significant disk I/O and slowing down recent-data reads.

Why it happens: STCS works by grouping SSTables of similar size and merging them into larger files. This is fine for general-purpose workloads but terrible for time-series data, where old data is never updated and new data arrives in predictable time windows. STCS has no concept of "this SSTable contains only old data that is past its useful life" — it treats all data equally. The result is compaction that spends most of its I/O budget on historical data the application never reads.

The lesson: For time-series tables, use TWCS — Time Window Compaction Strategy. TWCS partitions SSTables into non-overlapping time windows. Old windows seal and are never touched again; only the current window's SSTables are actively compacted. When old data expires via TTL, entire sealed windows can be dropped with a single file delete — zero bytes read, zero bytes written. STCS is the correct default for general-purpose tables; TWCS is mandatory for time-series. Set it at table creation time with WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'HOURS', 'compaction_window_size': 1}.

Lesson 5 — Discord's Migration: Cassandra → ScyllaDB

Context: Discord stored trillions of messages in a Cassandra cluster of roughly 177 nodes. They completed the migration to ScyllaDB in May 2022 and published a detailed engineering blog post in March 2023: the new cluster needed only about 72 ScyllaDB nodes, p99 read latency dropped from 40–125 ms on Cassandra to ~15 ms on ScyllaDB, and p99 write latency dropped from 5–70 ms to ~5 ms.

Why it matters: ScyllaDB is a wire-compatible reimplementation of Cassandra written in C++ instead of Java. It uses a share-nothing, per-core architecture that avoids JVM garbage collection pauses — a known source of latency spikes in Cassandra. The CQL protocol is identical, so most drivers and client code work against ScyllaDB without modification. The migration was possible primarily because the protocol compatibility meant Discord could migrate data table-by-table while running both clusters in parallel.

The lesson: Wire-compatible alternatives exist and should be evaluated seriously at scale. Cassandra's JVM GC pauses, especially with multi-GB heap sizes and large SSTables, produce unpredictable tail latency spikes. If your team is hitting consistent p99/p999 latency issues that tuning does not fix, evaluate ScyllaDB as a drop-in replacement. The migration cost is real (dual-run periods, validation, driver adjustments) but often less than continuing to tune an architecture that is fundamentally limited by GC. Additionally: the existence of alternatives emphasises that no database is a permanent choice — design for eventual replaceability from day one.

Five Cassandra disaster patterns — tombstone graveyard from insert+delete queue pattern (fix: use a real queue), hot partition from low-cardinality key like country_code (fix: composite key with bucket suffix), three years without repair causing stale QUORUM reads (fix: weekly nodetool repair / Reaper), wrong compaction strategy STCS for time-series (fix: TWCS from day one), and Discord's migration to ScyllaDB for better tail latency (lesson: wire-compatible alternatives exist; evaluate them) — are all preventable with correct schema design, compaction configuration, and operational discipline from day one.

Section 22

Performance & Best Practices Recap

Everything on this page distils into eight practices. None of these are arbitrary rules — each one has a clear mechanical reason rooted in how Cassandra actually works. Follow them and you avoid the overwhelming majority of Cassandra production problems. Skip any one of them and you are relying on luck.

1 · Schema Follows Queries

Cassandra has no query planner, no index joins, and no ad-hoc query flexibility — it executes exactly the access pattern encoded in its primary key structure. This means the only way to serve a new query efficiently is to have a table whose partition key equals the equality predicate in that query's WHERE clause. Design the table first, then confirm the query is a direct partition key lookup. Every table is purpose-built for one specific query shape. If you need to answer three different queries, build three tables and write to all three at write time.

2 · Bound Partition Size

A partition is the unit of physical co-location in Cassandra — all rows in a partition live on the same set of nodes. Very wide partitions (hundreds of thousands of rows, many gigabytes) cause two problems: reads that touch the whole partition must deserialise enormous amounts of data, and compaction for that partition consumes disproportionate I/O. The generally recommended soft ceiling is around 100 MB of data or 100,000 rows per partition. If a partition will naturally grow beyond that, redesign with a composite key that includes a time bucket or a hash suffix to spread data across multiple partitions. Use nodetool tablestats to check your partition size distributions in production.

3 · LOCAL_QUORUM as Default CL

LOCAL_QUORUM — a majority of replicas in the local data centre must acknowledge — is the right default consistency level for most production applications. It is strong enough that you will always read your own writes when W+R exceeds N within a DC, fast enough that cross-DC latency does not affect routine requests, and available enough that a single node failure does not break queries. Use LOCAL_ONE for read-heavy workloads where slight staleness is acceptable (IoT telemetry, analytics counters). Use EACH_QUORUM or ALL only for operations that absolutely require global consistency across DCs — the latency cost is the round-trip to every DC involved.

4 · TWCS for Time-Series

Time Window Compaction Strategy works by grouping SSTables into non-overlapping time windows (e.g., one window per hour). Once a window's time range is in the past, it is sealed and never compacted again — its SSTables are already optimal. New writes go into the current window only. When you set a TTL on rows and those rows age out, entire sealed windows can be deleted as a single file operation — zero bytes read, zero CPU consumed. Contrast with STCS: STCS merges all SSTables regardless of age, repeatedly rewriting cold historical data to achieve the same size tiers. For any table where rows have a natural time axis and a predictable TTL, TWCS is mandatory.

5 · Run Repair Weekly

Repair is Cassandra's anti-entropy mechanism. It works by computing a Merkle tree hash of each replica's data for a token range, comparing hashes between replicas, and streaming any rows that differ. This catches drift from: nodes that were briefly offline and missed writes, hinted handoff that expired before delivery, and subtle storage-level corruption. A cluster that has not been repaired accumulates drift silently — there is no error or alert that tells you. Use Reaper (open-source at cassandra-reaper.io) to automate repair scheduling, rate-limiting, and progress tracking. Run repair in a rolling fashion: one node at a time, during off-peak hours, with a bandwidth cap so it does not disrupt production traffic.

6 · Avoid LWT on Hot Rows

Lightweight Transactions use the Paxos consensus protocol to implement compare-and-swap. Paxos requires multiple round-trips between nodes — a prepare phase, a promise phase, an accept phase, and a commit phase. This means an LWT write takes roughly 4× as long as a regular write and consumes significantly more coordinator and replica CPU. On a hot partition (high writes per second), LWT creates a serialised bottleneck: each LWT must complete before the next can begin. Reserve LWT for genuinely rare idempotency guards: INSERT IF NOT EXISTS for user registration (low frequency), UPDATE IF balance = X for a financial CAS operation (low frequency). Never use LWT on a write path that runs thousands of times per second.

7 · Cassandra Is Not a Queue

The pattern of "insert when a job arrives, delete when done" is the most common Cassandra anti-pattern. Every delete writes a tombstone. Tombstones persist for gc_grace_seconds (10 days by default). Any read that touches a partition with many tombstones must scan all of them to find live rows. Beyond a certain threshold (configurable but typically around 100,000 tombstones per read), Cassandra throws a TombstoneOverwhelmingException. The correct architecture: use Kafka, RabbitMQ, or SQS for queuing; use Cassandra to store events append-only with a TTL. The TTL expiry model is friendly to TWCS and creates no tombstones when entire time-window SSTables are dropped.

8 · Test Failure Scenarios

Cassandra's value proposition is that it keeps working when nodes fail and data centres go down. But "it should keep working" and "it actually keeps working with your application code and your consistency level configuration" are very different things. Test in staging: kill one node and confirm reads succeed at LOCAL_QUORUM. Kill two nodes and confirm writes succeed with RF=3. Simulate a full DC outage (firewall off the DC) and confirm the surviving DC continues serving requests. Run repair under write load and measure the impact. Chaos engineering tools like Chaos Monkey or Gremlin make this repeatable. You learn more from one 20-minute failure drill than from weeks of reading documentation about what Cassandra is designed to do.

Eight Cassandra best practices — schema follows queries (denormalise aggressively), keep partitions under 100 MB / 100K rows, use LOCAL_QUORUM as default consistency level, TWCS for all time-series tables, run repair weekly with Reaper, avoid LWT on hot rows (4× latency cost), never use Cassandra as an insert+delete queue (tombstone accumulation), and test failure scenarios in staging — collectively prevent the schema mistakes, operational disasters, and capacity surprises that dominate Cassandra war stories.

Section 23

FAQ — Your Cassandra Questions Answered

These are the questions engineers most commonly ask when they are evaluating Cassandra for the first time or debugging it in production. Plain English first, then the nuance that matters for real decisions.

Cassandra or ScyllaDB — which should I pick?

ScyllaDB is a wire-compatible reimplementation of Cassandra written in C++ rather than Java. You use the same CQL, the same drivers, and the same data model — just a different server binary. The practical differences: ScyllaDB avoids JVM garbage collection pauses, which reduces tail latency spikes (p99, p999) significantly at high throughput. It also typically achieves better per-node throughput, meaning you may need fewer nodes for the same workload. Discord's published migration showed they went from roughly 177 Cassandra nodes to roughly 72 ScyllaDB nodes with better latency. The counter-argument for Cassandra: ecosystem maturity, the largest community, and the most operations tooling (Reaper, DataStax connectors, deep Apache project history). For new projects, both are excellent choices; if tail latency and hardware efficiency are priorities, ScyllaDB is worth serious evaluation.

When does Cassandra actually make sense?

Cassandra is the right choice when: (1) your write volume is genuinely high — millions of inserts per second across a cluster is where Cassandra's LSM-tree engine shines; (2) your data has a natural time series or event stream shape with predictable access patterns; (3) you need multi-data-centre replication with no single point of failure — Cassandra's leaderless architecture handles DC-level outages gracefully; (4) your access patterns are known and simple — primary key lookups, range scans within a partition, no ad-hoc queries. Cassandra is the wrong choice when: you need JOINs, complex aggregations, ad-hoc query flexibility, full-text search, or ACID transactions across multiple rows. The strongest signal for Cassandra: you are replacing a time-series table in PostgreSQL that is growing faster than you can shard it.

How big can a Cassandra cluster realistically get?

Very large. Apple reportedly runs Cassandra clusters with over 75,000 nodes and stores more than 10 petabytes. Netflix uses hundreds of nodes per cluster across multiple data centres. Discord ran 177 nodes before switching to ScyllaDB. Instagram ran Cassandra for years at similar scales. The practical limits are not in Cassandra's architecture — the gossip protocol that nodes use to discover each other scales well into the thousands. The practical limits are in your team's ability to operate the cluster: repair scheduling at scale, coordinating rolling upgrades across thousands of nodes, and monitoring become increasingly complex. Most teams never hit architectural limits; they hit operational complexity limits well before then. Start with a small cluster (6–12 nodes), prove the access patterns work, then grow.

SimpleStrategy vs NetworkTopologyStrategy — when does it matter?

It matters immediately in production. SimpleStrategy places replicas on the next N nodes around the token ring without any awareness of which rack or data centre they are in. This means you might end up with all three replicas on nodes in the same rack or the same data centre — a rack-level or DC-level failure takes out all your replicas simultaneously. NetworkTopologyStrategy is rack- and DC-aware: it distributes replicas across different racks within a DC and allows you to specify per-DC replication factors. Use SimpleStrategy only for local development and learning. Any production keyspace, even a single-DC one, should use NetworkTopologyStrategy with explicit DC configuration. Changing strategy later requires altering the keyspace and running a full repair — easy to do but a noisy operation to schedule.

Can I use secondary indexes in Cassandra?

Yes, with CREATE INDEX ON table (column) — but understand what that index actually does. Cassandra's built-in secondary indexes are per-node local indexes: each node indexes only the rows it stores. A query that uses a secondary index must be sent to every node in the cluster (a scatter-gather query), because no single node knows which nodes have matching rows. This is fine for low-cardinality columns (e.g., status = 'active' where most nodes have some matching rows) but terrible for high-cardinality columns where most nodes have zero matching rows. The practical recommendation: prefer denormalised lookup tables over secondary indexes. A separate table with the lookup column as the partition key gives you a fast single-partition query instead of a cluster-wide scatter. Cassandra's SAI (Storage Attached Index — experimental in Cassandra 4.x, production-ready and the recommended index in 5.0+) is a newer, more performant index implementation — worth evaluating for Cassandra 5.0 deployments.

How does Cassandra handle backups?

nodetool snapshot is the primary backup mechanism. It creates hard links to the current SSTable files for a keyspace — because SSTables are immutable and never modified in place, hard-linking them is instant and space-efficient. You then copy those hard-linked files to remote storage (S3, GCS, etc.) using your standard file transfer tooling. Tools like Tablesnap (Python) or DataStax's Medusa automate incremental SSTable uploads to S3 as new SSTables are written. Restore is a full-cluster operation: copy SSTable files back onto each node's data directory, run nodetool import or restart the node and let it load the files. cqlsh COPY is an alternative for smaller tables (exports to CSV) but is too slow for large datasets. Schedule automated snapshots with offsite upload daily at minimum; combine with weekly repair to ensure backup data is consistent across replicas.

What about JOINs in Cassandra?

There are no JOINs in Cassandra — full stop. Cassandra does not support them at the query layer, and there is no query planner that could implement them efficiently given that related data may live on different nodes. The solution is to denormalise at write time: when you write data, you write it into every table that will need to read it. If you need to look up users by email AND by user_id, you maintain two tables — one partitioned by email, one by user_id — and write to both on every user creation. The application code coordinates the dual write. This feels redundant coming from a relational background, but it is the correct mental model for Cassandra. The writes are cheap; the reads are fast single-partition lookups. If your data model has so many relationships that denormalising becomes unmanageable, Cassandra may not be the right database for that problem.

Cassandra vs DynamoDB — how do they compare?

They share a very similar data model: both are wide-column stores with a partition key + sort key primary key structure, leaderless replication, tunable consistency, and a focus on write-heavy workloads. The practical differences: DynamoDB is fully managed, AWS-only, with simpler operations but less control — you cannot tune compaction, repair, or the replication topology. Cassandra is open-source and self-hosted (or via Astra DB), giving you full control over every operational parameter and no vendor lock-in to a single cloud provider. DynamoDB has predictable per-request pricing that is excellent for bursty or unpredictable workloads; Cassandra on your own hardware is more cost-effective at sustained high throughput. DynamoDB has tighter AWS integration (Lambda triggers, Streams for CDC, fine-grained IAM permissions). Cassandra is the right choice when you need multi-cloud, open-source guarantees, or fine-grained operational control. DynamoDB is the right choice when you want managed infrastructure and are already deep in the AWS ecosystem.

Eight FAQ answers — Cassandra vs ScyllaDB (wire-compatible, Scylla wins on tail latency and hardware efficiency), when Cassandra makes sense (write-heavy, time-series, multi-DC, simple access patterns), cluster scale (Apple 75K+ nodes, Netflix hundreds), SimpleStrategy vs NetworkTopologyStrategy (always use NTS in production for rack/DC awareness), secondary indexes (per-node local scatter-gather; prefer denormalised lookup tables), backups with nodetool snapshot + Medusa, no JOINs (denormalise at write time), and Cassandra vs DynamoDB (open-source multi-cloud control vs AWS-managed simplicity) — cover the practical gaps engineers encounter when moving from theory to production decisions.

Section 24

Connected Topics — What to Study Next

Cassandra does not exist in isolation. The topics below are either the databases you will be asked to compare it against, the underlying concepts that explain why Cassandra works the way it does, or the complementary systems it is commonly deployed alongside. Work through them in the order that matches where you want to go next.

Six connected topics — MongoDB (document model contrast), Redis Deep Dive (hot-cache partner for Cassandra), Sharding (token ring and consistent hashing theory), Schema Design (relational normalisation as the counterpoint to Cassandra denormalisation), CAP Theorem (Cassandra's AP position and what it means in practice), and Database Internals (LSM-trees, SSTables, compaction — the mechanical foundations of Cassandra's performance) — form the complete context around Cassandra that turns isolated knowledge into systems-level thinking.

Cassandra — Wide-Column at Planet Scale

TL;DR — Cassandra in Plain English

Why You Need This — The IoT Firehose Problem

The situation: a global sensor network

Why Cassandra changes the equation

The trade you make

Mental Model — The Ring

The ring explained simply

Four design heuristics that flow from the ring

1. Partition Key Picks the Node

2. Replication Factor (RF) — How Many Copies

3. Consistency Level (CL) — How Many Must ACK

4. No Primary — Every Node Is Equal

Core Concepts — The Vocabulary

Keyspace

Table

Partition Key

Clustering Key

Replication Factor (RF)

Consistency Level (CL)

Data Model — Wide-Column Explained

The two-part primary key — the most important concept in CQL

Three common schema patterns in CQL

Query-First Schema Design

The four rules of query-first design

Start with Queries, Not Entities

One Table Per Query Pattern

Partition Keys Reflect Access Patterns

Keep Partitions Bounded

Replication & Multi-DC

Two Replication Strategies

SimpleStrategy

NetworkTopologyStrategy

Consistency Levels

The R + W > N Formula

Storage: SSTables, Memtables & Commit Log

The Four Components

Commit Log

Memtable

SSTable (Sorted String Table)

Bloom Filter

Tunable Consistency in Action

Audit Logging — Write at QUORUM, Read at QUORUM

Time-Series Telemetry — Write at ONE, Read at ONE

User Profile — Write at LOCAL_QUORUM, Read at LOCAL_ONE

Cross-DC Replicated Cache — EACH_QUORUM Write, LOCAL_QUORUM Read

Lightweight Transactions & Counters

LWT — IF NOT EXISTS / IF (Compare-and-Swap)

Counters — Atomic Increments Across Replicas

Schema Design Patterns

Five Schema Patterns

Time-Series Bucketing

Multi-Key Lookup (Two Tables)

Wide Row Aggregation

Materialized Views

Inverted Index (Manual)

Compaction Strategies

The Four Strategies

STCS — SizeTieredCompactionStrategy

LCS — LeveledCompactionStrategy

TWCS — TimeWindowCompactionStrategy

UCS — UnifiedCompactionStrategy (5.0+)

Tombstones & Deletes

Why the 10-day grace period?

The Tombstone Graveyard Anti-Pattern

Three Mitigation Patterns

TTL Columns

TWCS for Time-Series

Don't Use Cassandra as a Queue

Partition Sizing

Four Strategies for Bounded Partitions

Composite Partition Keys

Bucketing

Salting

Hash-Based Sharding

Performance & Tuning

The Five Tuning Levers

Heap & Memory

Compaction Throughput

Bloom Filter False Positive Rate