Database Foundations

NewSQL & Distributed SQL β€” ACID at Global Scale

For years, engineers faced an ugly dilemma: traditional SQL gave you ACID transactions and rich joins but couldn't scale horizontally past a single beefy server, while NoSQL scaled effortlessly across hundreds of nodes but threw ACID out the window. NewSQL β€” pioneered by Google Spanner in 2012 and commercialized by CockroachDB, YugabyteDB, and TiDB β€” cracks that trade-off by layering SQL semantics and full ACID guarantees on top of a consensus-replicated distributed storage engine. The sweet spot: globally-distributed applications that can't tolerate data loss or dirty reads but need to serve millions of writes per second across regions. The honest cost: writes are slower than single-node Postgres because every commit must achieve consensus across N replicas, and operating a distributed database adds meaningful complexity.

8 Think Firsts ~25 SVG Diagrams 24 Sections ~40 Tooltips 5 Exercises
Section 1

TL;DR β€” NewSQL in Plain English

  • Why traditional SQL databases hit a hard scale ceiling and why NoSQL's solution sacrifices too much
  • How NewSQL layers full ACID guarantees on top of a distributed consensus engine
  • Which databases belong in the NewSQL family and what makes each distinctive
  • The real cost: why every distributed write is slower than a local write β€” and when that trade-off is worth it

NewSQL's core insight: you don't have to choose between horizontal scale and ACID transactions. Replicate data across nodes using a consensus protocol, shard the keyspace into ranges, and serve SQL on top β€” the app sees a single database; the engine handles the rest.

WHERE NewSQL LIVES β€” Scale vs ACID Guarantees Horizontal Scale (nodes) ACID Guarantees Strong ACID, limited scale High scale, weak consistency Postgres MySQL / Oracle Cassandra DynamoDB / MongoDB NewSQL Spanner Β· CockroachDB YugabyteDB Β· TiDB + horizontal scale + ACID

Traditional relational databases like Postgres are brilliant at enforcing ACID transactions, but they scale vertically β€” add more RAM and CPU to one machine. Once your write throughput overwhelms that machine, you're stuck. NoSQL databases like Cassandra and DynamoDB spread writes across hundreds of nodes easily, but they relax consistency: you can read stale data, cross-row transactions are awkward or impossible, and JOINs don't exist. For a payments ledger, inventory system, or any domain where "lost or duplicated rows" means real money lost, that trade-off is unacceptable.

NewSQL databases tackle the problem at the storage layer rather than the application layer. They split data into contiguous key ranges called shards or tablets, assign each range to a small cluster of replicas, and use a consensus protocol β€” most commonly Raft β€” to keep those replicas in sync. Because every write goes through consensus, the database can guarantee that all replicas saw it in the same order, which is exactly what ACID serialization requires. Your SQL query looks like a normal Postgres query; under the hood it may touch data across twenty machines in three datacenters β€” transparently.

Consensus takes time. A single-node Postgres write commits to WAL and returns in roughly 0.1–1 ms. A CockroachDB write in a single-region deployment typically takes 2–10 ms; a multi-region write that must reach a quorum across datacenters in different continents can take 50–200 ms. This is physics, not engineering failure β€” photons travel at the speed of light and round-trip latency between continents is non-negotiable. NewSQL is the right tool when you need global distribution AND strong consistency AND SQL semantics. If you need single-digit millisecond writes, stay on a single-region Postgres until your traffic demands otherwise.

NewSQL databases give you SQL + ACID + horizontal scale by running a consensus protocol (Raft/Paxos) across replicas for every write. Writes are slower than single-node Postgres because consensus across nodes takes time. The sweet spot: globally distributed apps that cannot tolerate data loss or dirty reads.
Section 2

Why You Need This β€” A Story of Scale Pain

Let's start with a situation that's painfully real for a lot of engineering teams. Picture a global SaaS company β€” say, a fintech platform handling payments for users in the US, Europe, and Japan. The early days were easy: one beefy Postgres instance in us-east-1, 50 K requests per second, everything humming. Then growth hit.

The Three Pain Points That Force a Decision

Pain 1 β€” Tokyo users get 200 ms read latency. Your Postgres is in Virginia. A user in Tokyo fires a query, packets cross the Pacific (roughly 150 ms one-way), the query runs, and the response travels back. That's 300+ ms for what should feel instant. You can add a read replica in Tokyo β€” but now you have replication lag and your app must decide which replica to read from.

Pain 2 β€” Write throughput ceiling. At 200 K writes per second, your single Postgres primary is at 95% CPU. You can't just add another primary β€” Postgres has no built-in multi-primary mode. You start considering sharding: split users by ID mod 4 across four Postgres instances. Suddenly your engineers are writing shard-routing logic, schema migrations become nightmares, and any transaction that touches two users on different shards requires custom two-phase commit code.

Pain 3 β€” You can't drop ACID. This is a payments platform. "Eventual consistency" means a user's balance might temporarily show the wrong value. A transfer could debit Account A but fail before crediting Account B. Cassandra or DynamoDB would scale horizontally but they'd force you to build your own transaction layer β€” and that's a distributed systems PhD project, not a feature sprint.

What Each Option Gives You

THREE OPTIONS β€” What Each Gives You Sharded Postgres NoSQL (Cassandra/Dynamo) NewSQL (CockroachDB) ACID txns βœ“ βœ— βœ“ Horiz. scale manual βœ“ βœ“ SQL + JOINs βœ“ βœ— βœ“ Global dist. painful βœ“ βœ“ Cross-shard txns 2PC only βœ— βœ“

With NewSQL, you write to the nearest replica (low latency for local users), ACID is enforced across regions (no dirty reads), SQL and JOINs work exactly as you'd expect, and the database automatically re-balances ranges as data grows β€” no shard routing code in your app.

Think First. Your fintech app has a transaction ledger. Every debit and credit must be strongly consistent β€” a partial transfer (debit succeeds, credit fails) is a real-money bug. Reads are 99% of traffic. You're evaluating three options: (A) Postgres with read replicas and manual sharding, (B) DynamoDB with application-level idempotency logic, (C) CockroachDB. Which do you pick, and why? What's the cost of each? Think it through before reading on.
Global SaaS platforms eventually outgrow a single Postgres: write throughput ceiling, cross-region latency, and operational sharding complexity all compound. NoSQL scales but drops ACID, which is unacceptable for financial data. NewSQL delivers horizontal scale without sacrificing SQL or ACID.
Section 3

Mental Model β€” SQL on Top of Consensus

When you type a SQL query at a NewSQL database, you're talking to something that looks exactly like a normal SQL database. But underneath that familiar interface, there are several layers doing very different jobs. Understanding this stack is the key to predicting how NewSQL databases behave β€” when they're fast, when they're slow, and what can go wrong.

Think of it like a post office. You hand a letter to the counter clerk β€” that's the SQL parser. The clerk checks the address and picks a route β€” that's the query planner. The letter then enters a sorting system that tracks exactly which letters go in which order β€” that's the distributed transaction layer. The actual trucks that carry identical copies of the letter to multiple delivery depots to prevent loss β€” that's the consensus replication layer. And at the bottom, the physical shelves and bins where envelopes are actually stored β€” that's the key-value store and disk.

NewSQL ARCHITECTURE STACK β€” Every Layer Has One Job SQL Interface (Parser + Optimizer) Your app sends standard SQL β€” SELECT, INSERT, UPDATE, BEGIN TRANSACTION… Distributed Transaction Layer (MVCC + Serialization) Assigns timestamps, tracks read/write sets, detects conflicts, coordinates cross-range commits Consensus Replication (Raft / Paxos per Range) Each keyspace range has its own Raft group (leader + followers). A write is durable only when a majority ack. Distributed Key-Value Store Rows encoded as KV pairs; range-partitioned across nodes LSM-Tree Storage (RocksDB or similar) β†’ NVMe / SSD

Two Architectural Families

Not all NewSQL databases are built the same way. There are two dominant approaches, and understanding which family a database belongs to helps you predict how it will behave in your environment.

Spanner-Derived (Distributed-First)

These databases were designed from the ground up to be distributed. There is no single-node version β€” distribution is not optional, it's the foundation. CockroachDB and YugabyteDB belong here. They use range-based sharding (the keyspace is split into contiguous ranges) and each range is managed by its own Raft consensus group. Because they start distributed, features like geo-partitioning and multi-region failover feel natural rather than bolted on.

Best for: greenfield applications that need global distribution from day one, or teams willing to re-architect for it.

MySQL/Postgres-Compatible (Protocol-Shim)

These databases plug into an existing MySQL or Postgres ecosystem by speaking the same wire protocol. TiDB (MySQL-compatible) and Vitess (sharding proxy for MySQL) let you keep your existing ORM, drivers, and operational tooling. Under the hood they have a distributed storage layer, but the interface is familiar. The trade-off: some advanced SQL features may behave slightly differently, and cross-shard transactions are more expensive than in distributed-first designs.

Best for: teams with large existing MySQL/Postgres codebases that need to scale without rewriting the application.

NewSQL databases are a five-layer stack: SQL interface β†’ distributed transaction layer β†’ consensus replication (Raft/Paxos) β†’ key-value store β†’ disk. The SQL layer hides distribution from your app. Two families exist: distributed-first (CockroachDB, YugabyteDB) and MySQL/Postgres-compatible shims (TiDB, Vitess).
Section 4

Core Concepts β€” The Vocabulary You Need

Before going deeper, let's pin down the six terms that come up in every NewSQL conversation. You'll see these in documentation, job interviews, and architecture diagrams. Each one earns its place β€” none is jargon for jargon's sake.

NewSQL

A category name coined roughly around 2011 to describe databases that recombine SQL + ACID transactions with horizontal scalability β€” something the world had concluded was impossible to achieve simultaneously. The "New" is a marketing label, not a technical term. CockroachDB, YugabyteDB, TiDB, Google Spanner, and Amazon Aurora (in serverless mode) are commonly placed in this bucket. Newer marketing often says "Distributed SQL" instead β€” same idea, snappier branding.

Distributed SQL

A newer label favored by CockroachDB and YugabyteDB in their documentation. It emphasizes that the distribution is built into the engine β€” not achieved by a sharding proxy sitting in front of a normal database. If the SQL engine knows about multiple nodes natively, it can optimize cross-shard queries, enforce foreign keys across nodes, and run cross-shard transactions using the same protocol as single-shard transactions. That's what "distributed SQL" specifically promises over sharding proxies.

Consensus Protocol

An algorithm that lets a group of servers agree on something β€” in this case, the exact sequence of writes β€” even if some servers are slow, crashed, or lying. The two most important consensus protocols in databases are Paxos (invented by Leslie Lamport, complex, provably correct) and Raft (designed by Diego Ongaro in 2013 to be easier to understand and implement). CockroachDB and YugabyteDB use Raft. Spanner uses Paxos. The key property: a write is only acknowledged to the client after a majority of replicas have durably recorded it β€” so losing one server never causes data loss.

Range / Shard

A contiguous slice of the keyspace. Imagine all rows in a table sorted by primary key. Now draw a line at every millionth row β€” each segment is a range. In CockroachDB these are called "ranges" (default size ~512 MB); in Spanner and YugabyteDB they're called "tablets." Each range is independently replicated β€” it has its own Raft group, its own leader, and its own set of followers. This is why adding a node helps: the database can move ranges to the new node and each range's Raft group becomes self-sufficient.

Leader / Follower

Within each range's Raft group, one replica is elected leader. All writes to that range go through the leader; the leader replicates the write to followers, waits for a majority to ack, then acknowledges the client. Followers serve reads in some configurations (with a possible staleness trade-off) but never accept writes directly. If the leader crashes, the remaining replicas hold an election and pick a new leader β€” usually within 1–3 seconds. This is how NewSQL databases survive node failures without losing data or availability.

HLC β€” Hybrid Logical Clock

Ordering events across distributed machines is surprisingly hard. Wall clocks drift β€” two servers' clocks can disagree by tens of milliseconds, so you can't fully trust them. The flip side: a purely logical counter that just says "this event came after that one" (a classic Lamport clock) tracks cause-and-effect but throws away any sense of real time. A Hybrid Logical Clock (HLC) splits the difference: it uses the wall clock as a base but adds a logical counter that increments whenever two events happen "at the same millisecond." You get timestamps that respect causality and stay close to real wall-clock time β€” exactly what you need for snapshot reads and transaction ordering in a distributed database. CockroachDB and YugabyteDB both use HLC.

Six terms to know: NewSQL (category), Distributed SQL (built-in distribution, not sharding proxy), Consensus Protocol (Raft/Paxos β€” majority-ack before write confirmed), Range/Shard (contiguous keyspace slice, independently replicated), Leader/Follower (Raft election within each range), HLC (wall clock + logical counter for distributed timestamp ordering).
Section 5

Spanner β€” The Original Distributed SQL Database

Google Spanner is the database that proved you could run one big SQL database across the entire planet β€” with the same correctness guarantees you get from a single Postgres on one server. When the design was published at OSDI 2012, the distributed systems community was surprised: not just that Google had built a globally distributed SQL database, but that they'd done it with what appeared to be a fundamentally new approach to distributed time.

Before Spanner, the accepted wisdom was that you couldn't have both global distribution and strong consistency without either (a) giving up on consistency during network partitions, or (b) paying an enormous latency penalty. Google found a third option: make the clocks so accurate that the uncertainty window β€” the gap between "it definitely happened before now" and "it definitely happened after now" β€” becomes small enough to commit around. (We'll explain "external consistency" in plain English in a moment β€” it's the stronger-than-serializable guarantee Spanner achieves.)

TrueTime β€” The Key Innovation

In a normal datacenter, servers synchronize their clocks via NTP (Network Time Protocol), which keeps them within about 50–200 ms of each other on a good day β€” far too imprecise for database transaction ordering. Google solved this by installing atomic clocks and GPS receivers in every datacenter. These hardware clocks keep servers within about 7 ms of each other globally. The Spanner team then built the TrueTime API, which doesn't give you a single timestamp β€” it gives you an interval (earliest, latest) representing the range of possible true times, where the uncertainty is bounded by the hardware.

Why does bounded uncertainty matter? Because Spanner can commit a transaction by waiting out the uncertainty window. If TrueTime says "the current time is somewhere in [T - 4ms, T + 4ms]", Spanner waits 4 ms after writing and then says: any transaction that started before this commit is guaranteed to have an earlier timestamp, and any transaction that starts after this commit is guaranteed to have a later timestamp. This property is called external consistency β€” the strongest possible consistency guarantee β€” and it means even clients outside the database who compare timestamps will see a consistent ordering.

GOOGLE SPANNER β€” Global Architecture (Simplified) US-East Datacenter βš› Atomic Clock + GPS TT: [Tβˆ’3ms, T+3ms] Paxos Leader handles writes Spanserver / Tablets KV store + WAL Europe Datacenter βš› Atomic Clock + GPS TT: [Tβˆ’4ms, T+4ms] Paxos Follower replicates writes Spanserver / Tablets KV store + WAL Asia Datacenter βš› Atomic Clock + GPS TT: [Tβˆ’5ms, T+5ms] Paxos Follower replicates writes Spanserver / Tablets KV store + WAL replicate replicate Client Application routes write to Paxos leader

Four Key Innovations

TrueTime API

Instead of returning a single point in time, TrueTime returns a bounded interval (earliest, latest). Hardware β€” atomic clocks and GPS receivers in each datacenter β€” guarantees the true current time falls within that interval. Why does this matter? Because Spanner can use the interval to enforce transaction ordering without needing a single global time authority. By waiting out the uncertainty window at commit time, Spanner ensures that any later transaction's start time is definitively greater than the committed transaction's timestamp. This is external consistency β€” the gold standard.

External Consistency

In plain English: if a wall-clock observer sees transaction T1 finish before T2 even begins, the database must agree β€” T2 sees everything T1 did, no exceptions. That promise is called external consistency. It extends linearizability (operations look instantaneous) to distributed transactions spanning multiple machines and regions. It is stronger than serializability, which only requires some consistent order β€” not necessarily the real-time order. Google considers this essential because AdWords and Gmail require it β€” a charge that commits at the datacenter level must be visible everywhere before any dependent billing query runs.

MultiPaxos Per Split

Each table is split into "splits" (Spanner's term for ranges/tablets). Every split has its own independent Paxos group β€” its own leader election, its own log, its own replication flow. This is the key to parallel scalability: a write to split #1 does not need to coordinate with a write to split #2 (unless they're in the same transaction). Hundreds of thousands of splits can be making progress independently and simultaneously, which is how Spanner handles enormous write throughput across Google's fleet.

Schematized Data + Interleaving

Spanner has a full SQL schema system (DDL, types, foreign keys) plus one unusual feature: table interleaving. You can declare that a child table's rows should be physically stored next to the parent row they belong to β€” similar to clustered indexes, but hierarchical. For example, Albums rows can be interleaved inside their Singers row. This means a query joining Singer and Album for the same singer is likely a single range read on one Spanserver, not a distributed join. It's a performance optimization that gets more important the more geographically distributed your data is.

Context. Spanner is reported to power AdWords, Google Photos, Gmail, Google Drive, and many other Google services. Open-source equivalents that drew heavy inspiration from the Spanner paper include CockroachDB (launched ~2014) and YugabyteDB (launched ~2016). Neither has TrueTime hardware β€” they use HLC instead β€” which is why they require slightly different approaches to external consistency.
Google Spanner (2012) proved that global ACID SQL is possible. Its key innovation is TrueTime β€” atomic clocks + GPS receivers giving bounded clock uncertainty β€” which enables external consistency at planetary scale. Each table split has its own Paxos group for parallel replication. Open-source equivalents (CockroachDB, YugabyteDB) use HLC instead of hardware clocks.
Section 6

Why Postgres Sharding Wasn't Enough

Companies have been sharding Postgres and MySQL for decades. Instagram sharded Postgres. Stripe shards Postgres. If sharding works for them, why do NewSQL databases exist? The answer comes down to one gnarly problem: cross-shard transactions.

When data lives on a single Postgres instance, a transaction is a local operation β€” begin, write some rows, commit. The database engine handles locking, versioning, and rollback internally. When data is split across four Postgres instances, and a single business operation (say, transferring money between two users) needs to atomically update rows on Shard 1 AND Shard 3 β€” you need a protocol to coordinate the commit across two independent databases. That protocol is called Two-Phase Commit (2PC), and it has significant problems.

The 2PC Problem

In Two-Phase Commit, a coordinator (often your application or a proxy) sends a "prepare" message to all shards involved. Each shard votes yes or no. If all vote yes, the coordinator sends "commit" to everyone. Sounds reasonable β€” but here's where it breaks:

Sharded Postgres + 2PC Coordinator App Shard 1 (PG) Shard 2 (PG) β‘  PREPARE β‘  PREPARE β‘‘ vote yes β‘‘ vote yes ⚠ Coordinator crashes here? Shards stuck in PREPARED state β‘’ COMMIT (if coordinator survives) NewSQL β€” Consensus-Based Commit Client Range Leader (cross-range TXN coordinator) Follower A Follower B Follower C β‘  WRITE β‘‘ replicate via Raft β‘’ majority ack β†’ ACK client. No coordinator crash risk.

Four Key Differences

Sharded SQL β€” Cross-Shard Transactions

The application or a proxy layer must be shard-aware. It routes writes to the right shard and avoids cross-shard transactions wherever possible. When cross-shard transactions are unavoidable, 2PC is used β€” but it's treated as an edge case because of its fragility. Teams often design schemas specifically to avoid cross-shard joins, which constrains the data model.

NewSQL β€” Cross-Range Transactions

Distribution is invisible to the application. A transaction that touches rows on five different ranges is handled by the database using the same consensus mechanism as a single-range transaction β€” there's no special "cross-shard mode" to worry about. The app writes normal SQL and the engine figures out how to coordinate it. Cross-range transactions are slightly slower than single-range (more network round-trips), but they're first-class operations with the same ACID guarantees.

Sharded SQL β€” Re-sharding

When a shard grows too large or traffic is uneven (a "hot shard"), you need to split it. This typically involves dumping data, re-routing traffic to new instances, and re-importing β€” a multi-hour operation with a maintenance window. If the shard key was chosen poorly (user ID instead of a more distributed key), some shards may be 10Γ— larger than others and no amount of careful balancing fully solves it.

NewSQL β€” Auto-Rebalancing

Ranges split automatically when they exceed a target size (often ~512 MB in CockroachDB). The database's load balancer moves ranges to underloaded nodes continuously in the background. Adding a new node causes the cluster to rebalance ranges onto it β€” no manual migration, no maintenance window. This is why teams say NewSQL "scales like NoSQL" from an operational standpoint, even though the consistency model is much stronger.

The marketing angle. When CockroachDB and YugabyteDB use the term "Distributed SQL" rather than "NewSQL," they're specifically emphasizing this point: distribution is built into the engine β€” not bolted on via a sharding proxy. "Distributed SQL" = the SQL engine knows about multiple nodes natively, can optimize queries across them, and can enforce ACID across them without 2PC fragility.
Sharded Postgres requires 2PC for cross-shard transactions, which is fragile (coordinator crash = hung state) and slow (two round-trips). NewSQL bakes distributed transactions into the same consensus mechanism that handles all writes, making cross-range transactions first-class. NewSQL also auto-rebalances ranges as data grows β€” no manual re-sharding.
Section 7

Raft & Consensus Replication

Before any NewSQL database can guarantee ACID transactions across multiple nodes, all those nodes need a way to agree on what was written and in what order. That agreement process is called consensus. Think of it like a committee vote: a write only "counts" once a majority of voting members say "yes, I recorded that." The algorithm most NewSQL databases use today is called Raft.

Raft is not magic β€” it is a carefully engineered rulebook for how a small cluster of nodes elects a leader, replicates each write to followers, and decides when a write is safe to call "committed." Every NewSQL database carves its keyspace into ranges (sometimes called tablets or regions), and each range runs its own independent Raft group β€” typically 3 or 5 replicas. This means a 30-node cluster might have thousands of independent Raft groups all humming in parallel.

RAFT COMMIT FLOW β€” From Client Write to Durable Commit Client INSERT ... Leader Range [A–M] Follower 1 Node 2 Follower 2 Node 3 Follower 3 slow / partitioned STEP-BY-STEP 1 Client sends write to Leader (any node redirects if wrong node) 2 Leader appends write to its local log, sends AppendEntries RPC to all followers 3 Followers 1 & 2 ACK β€” majority (2 of 3) have logged it β†’ Leader marks COMMITTED 4 Leader replies success to Client. Follower 3 catches up later when it reconnects. β‘  AppendEntries β‘‘ ACK β‘’ Success β‘£

The Four Guarantees Raft Provides

Leader Election

When a leader goes silent (network partition, crash, restart), followers wait a randomised timeout β€” roughly 150–300 ms β€” before declaring themselves a candidate and asking the others to vote. The candidate with the most up-to-date log wins. This randomisation prevents two followers from starting an election simultaneously and splitting the vote forever.

A range is briefly unavailable during election β€” typically <500 ms.

Log Replication

Every write the leader accepts is first appended to its own log, then fanned out to all followers via AppendEntries RPCs. Followers append the entry and send an ACK. The leader does not wait for every follower β€” just a quorum (majority). A 3-node group needs 2 ACKs; a 5-node group needs 3. This means one slow or failed node never blocks progress.

Commit

Once a quorum has acknowledged an entry, the leader marks it committed and applies it to the state machine (the actual data). It also notifies followers so they can advance their commit index. Committed entries are permanent β€” no future leader can ever remove a committed write, which is exactly what "durable" means in ACID.

Safety

Raft's election rules guarantee that any new leader must have every previously committed entry in its own log before it can win an election. This is enforced by comparing log indices during voting. The result: a committed write can never disappear, even across leader changes, network partitions, or node restarts.

Soft Latency Numbers

Within one datacenter, a Raft commit round-trip is roughly 1–3 ms β€” fast enough that most OLTP apps don't notice it. Cross-region (e.g., US-East to EU-West, ~80 ms RTT), a Raft commit adds around 80–200 ms per write because the AppendEntries call must physically travel to the other continent and back. This is why NewSQL databases offer follower-read options for reads β€” you can read from a local replica (possibly a few ms stale) instead of forcing every read through the leader.

Paxos vs Raft: Paxos is the older and mathematically equivalent algorithm. Both guarantee the same safety properties. Raft won developer mindshare largely because Diego Ongaro's 2014 doctoral paper ("In Search of an Understandable Consensus Algorithm") was written for humans, not mathematicians. CockroachDB, TiKV (TiDB), and YugabyteDB all use Raft. Google Spanner uses a Paxos variant internally β€” but from a correctness standpoint they are equivalent.
Raft gives every NewSQL range a leaderless-resilient log by electing one leader per group and requiring majority ACKs before a write is committed. Same-DC commits take ~1–3 ms; cross-region commits take as long as the network RTT. Paxos and Raft are interchangeable in correctness; Raft won on readability.
Section 8

Distributed Transactions

A single write to a single range is straightforward β€” one Raft group, one commit. But real-world transactions usually touch multiple rows that may not live on the same machine. Picture a money transfer: subtract $100 from Alice's account, add $100 to Bob's. If Alice's row sits on one node and Bob's row on another, the database must still guarantee atomicity β€” either both rows change, or neither does. No middle ground where Alice loses money but Bob never receives it. In SQL that looks like UPDATE accounts SET balance = balance - 100 WHERE id = 42 followed by UPDATE accounts SET balance = balance + 100 WHERE id = 77, but the hard part isn't the syntax β€” it's making both updates land or neither, across independent Raft groups. That's the distributed transaction problem.

2PC OVER RAFT β€” Coordinator Touching 3 Ranges Coordinator Transaction Manager Range Leader A rows 1–1000 Range Leader B rows 1001–2000 Range Leader C rows 2001–3000 PREPARE β‘  PREPARE β‘  PREPARE β‘  All OK β€” All prepared β‘‘ COMMIT β‘’ (broadcast) Range A βœ“ Range B βœ“ Range C βœ“

Three Approaches Used in Practice

2PC over Raft (most NewSQL)

Two-Phase Commit (2PC) is the classic protocol. A coordinator node sends a Prepare message to every affected range leader. Each leader locks the relevant rows in its Raft log and replies "yes, I'm ready." Once the coordinator hears "yes" from every range, it broadcasts Commit. If any range says "no" (lock conflict, timeout, error) the coordinator broadcasts Abort.

The downside of plain 2PC is the coordinator becomes a single point of failure during the commit phase β€” if it crashes after sending Prepare but before Commit, the ranges are stuck holding locks. NewSQL databases solve this by logging the transaction record inside a Raft group, so the coordinator state itself is replicated and can be recovered by any node.

Used by: CockroachDB, YugabyteDB, Google Spanner.

Spanner External Consistency (TrueTime)

Google Spanner takes a different angle. Instead of a coordinator managing locks through two phases, Spanner assigns every transaction a globally unique commit timestamp using TrueTime β€” an API that returns a guaranteed clock interval backed by GPS and atomic clocks. Transactions wait out the uncertainty window (~7 ms) before committing, ensuring no two transactions can have conflicting timestamps. This is called external consistency β€” a stronger property than serializability.

Available only on Google Cloud Spanner (hardware dependency).

Calvin β€” Deterministic Ordering

Calvin (originally from Yale, 2012) takes a completely different philosophy: agree on the order of transactions BEFORE executing them. A sequencing layer broadcasts a globally ordered log of transactions to all nodes. Every node then executes the same log in the same order β€” deterministically β€” which means 2PC is never needed. There is no prepare/commit dance; the log IS the commit.

The cost is that transactions must declare their read/write sets upfront (before execution). This works well for predictable OLTP but is awkward for transactions where the write set depends on what you read. FaunaDB was inspired by Calvin; its successor (Fauna v2) continues this lineage.

Avoids 2PC entirely β€” no coordinator bottleneck.

Latency Cost of Distribution

A distributed transaction pays a tax compared to a local transaction. Within one datacenter, a distributed 2PC transaction might add roughly 5–50 ms over a single-shard transaction β€” most of that is the extra Raft round-trips on multiple ranges plus lock coordination. Cross-region, that tax balloons to 100–500 ms because the Prepare messages must physically traverse long-haul fiber. This is why you want to co-locate frequently-joined data in the same range wherever possible β€” a pattern called geo-partitioning in CockroachDB.

Distributed SQL transactions require all affected ranges to agree on a single atomic outcome. Most NewSQL uses 2PC layered on Raft consensus; Spanner uses globally-ordered TrueTime timestamps instead; Calvin avoids 2PC by pre-ordering transactions deterministically. Cross-region distributed transactions cost 100–500 ms due to physics.
Section 9

CockroachDB Deep Dive

CockroachDB started as an open-source project on GitHub in 2014, and Cockroach Labs (the company) was founded in 2015 by Spencer Kimball, Peter Mattis, and Ben Darnell β€” all former Google engineers who had worked on Google's internal distributed systems (Kimball and Mattis on the GFS/Colossus team; Darnell on Google Reader). The name is deliberate: cockroaches survive anything. The goal was to build a database that kept running even when nodes, racks, or entire datacenters went offline.

The most important practical detail: CockroachDB speaks the PostgreSQL wire protocol. This means the psql CLI, every Postgres driver (Python psycopg2, Java JDBC, Node.js pg), and most Postgres-compatible ORMs (Django ORM, SQLAlchemy, Hibernate) connect to CockroachDB without modification. For teams already on Postgres, the migration story is much smoother than switching to a brand-new API.

COCKROACHDB ARCHITECTURE β€” 4-Layer Stack SQL Layer Parses SQL Β· Plans queries Β· Postgres wire protocol Β· DistSQL for cross-node query execution Distributed Transaction Layer 2PC coordinator Β· MVCC timestamps Β· Transaction heartbeating Β· Conflict resolution Distributed KV Layer Range management Β· Raft consensus per range Β· Leaseholder routing Β· Auto-rebalancing Local Storage (Pebble / RocksDB) LSM-tree on each node Β· Compaction Β· MVCC key-value pairs on disk Β· WAL Distributed ↑ Higher abstraction

Key Features

Postgres Compatibility

CockroachDB implements a large subset of PostgreSQL's SQL dialect and the full Postgres wire protocol. In practice, most Postgres applications work without code changes β€” just swap the connection string. Unsupported features include some PL/pgSQL stored procedures, a few exotic data types, and certain extensions (PostGIS had limited support as of 2024).

Multi-Region Support

CockroachDB supports several multi-region primitives. Geo-partitioning pins ranges for specific tenant IDs or geographic keys to specific regions β€” a German user's rows stay in the EU. Follower reads let you read from the nearest replica at the cost of a small staleness window (typically under 10 seconds). Regional tables vs global tables let you express per-table locality preferences declaratively.

Auto-Rebalancing

When you add nodes to a CockroachDB cluster, the range balancer automatically migrates Raft replicas from busy nodes to new ones. You don't have to reshard or migrate data manually β€” the cluster converges on a balanced state on its own. This also handles hot-spot detection: a heavily-read range can be split into two smaller ranges and distributed.

Online Schema Changes

Traditional databases lock the table during ALTER TABLE β€” adding a column to a 500M-row table could block writes for minutes or hours. CockroachDB executes DDL (schema changes) using a multi-version approach: the old and new schema versions coexist during the migration window. Backfilling happens in the background while the table remains live for reads and writes.

Trade-offs to Know

CockroachDB is not a drop-in replacement with zero downsides. Every write goes through Raft consensus β€” in a single-region cluster that costs roughly 2–10 ms per transaction. A single-node Postgres write returns in under 1 ms. If your application is write-latency sensitive and fits on one machine (or one primary + replicas), Postgres is faster and simpler. Additionally, CockroachDB has a smaller extension ecosystem than Postgres β€” if you depend on PostGIS for geospatial or TimescaleDB for time-series, you'd need a different approach.

On the positive side: a well-tuned CockroachDB cluster can handle roughly 10,000–100,000 writes per second and scales linearly β€” add nodes, get more throughput. Reads scale even more easily because any follower with a valid lease can serve them.

CockroachDB is the leading open-source distributed SQL database: Postgres-wire-compatible, auto-rebalancing, multi-region ready, with online schema changes. Its costs are write latency (Raft overhead: 2–10 ms in single-region) and a smaller extension ecosystem. Best for teams already on Postgres that need global distribution.
Section 10

YugabyteDB & TiDB

CockroachDB is not the only open-source distributed SQL database worth knowing. Two strong alternatives are YugabyteDB (founded 2016, open-sourced 2017) and TiDB (founded 2015 by PingCAP). Both follow the same core model β€” range-based sharding, Raft per range, ACID transactions β€” but they make different wire-protocol and architectural choices that make each a better fit for certain teams.

YugabyteDB vs TiDB β€” Architecture Comparison YugabyteDB TiDB YSQL (PostgreSQL query layer β€” forked) Full Postgres SQL dialect + PL/pgSQL YCQL (Cassandra-compatible API) DocDB β€” Custom Distributed Storage Raft per tablet Β· RocksDB local engine Β· MVCC RocksDB (local disk on each node) Wire protocols: PostgreSQL + Cassandra CQL Founded 2016 Β· Apache 2.0 core TiDB β€” MySQL-compatible SQL engine Stateless Β· horizontally scalable SQL nodes TiKV β€” Distributed Transactional KV Raft per region Β· RocksDB Β· MVCC (Percolator) TiFlash β€” Columnar OLAP engine Real-time sync from TiKV Β· ClickHouse-style storage PD (Placement Driver) β€” cluster metadata + scheduling Wire protocol: MySQL 5.7 compatible Founded 2015 Β· Apache 2.0 Β· HTAP in one cluster HTAP: OLTP (TiKV) + OLAP (TiFlash) in same cluster

YugabyteDB

YugabyteDB's most distinctive feature is its dual API: it supports both PostgreSQL (via YSQL, a forked Postgres query layer) and Cassandra (via YCQL, a reimplementation of the Cassandra Query Language) on top of the same distributed storage engine called DocDB. This means a team that has some Cassandra-style wide-column access patterns and some relational patterns can run both against one cluster.

The storage layer (DocDB) uses Raft per tablet and RocksDB as the local disk engine β€” similar in structure to CockroachDB. YugabyteDB is Apache 2.0 licensed for the core and available as a managed service (YugabyteDB Aeon / Anywhere).

Best for: teams migrating from Postgres who want full PL/pgSQL support, or teams that need both relational and wide-column access patterns simultaneously.

TiDB

TiDB is MySQL-compatible rather than Postgres-compatible β€” the SQL nodes speak MySQL 5.7 protocol, so existing MySQL clients and ORMs connect without changes. This makes TiDB the primary choice for teams on MySQL that need horizontal scaling.

TiDB's headline differentiator is HTAP (Hybrid Transactional/Analytical Processing): TiKV handles OLTP with row-based storage and Raft replication; TiFlash is a columnar engine that replicates data from TiKV in real time. A single SQL query can hit both engines simultaneously β€” analytical aggregations fan out to TiFlash while point lookups hit TiKV. This eliminates the ETL pipeline you'd normally need to feed an analytics warehouse.

Best for: MySQL shops that need scale, and teams with mixed OLTP + real-time analytics needs in one cluster. Heavily deployed in China (JD.com, Bilibili, Pinduoduo).

Wire-protocol compatibility matters: YugabyteDB and TiDB being wire-protocol compatible with Postgres and MySQL respectively means you don't have to rewrite your application layer or swap out your ORM. The migration risk lives in SQL dialect differences (a small set of unsupported features) rather than in API rewrites β€” a much smaller scope.
YugabyteDB (Postgres + Cassandra wire protocols, DocDB storage, Apache 2.0) and TiDB (MySQL-compatible, TiKV for OLTP + TiFlash for OLAP) are the two main open-source NewSQL alternatives to CockroachDB. Choose YugabyteDB for Postgres migration; choose TiDB for MySQL migration or real-time HTAP in one cluster.
Section 11

Cloud Spanner

Google built Spanner internally around 2007–2012 to solve its own global-scale data challenges β€” AdWords billing, Photos storage, Drive metadata β€” where even eventual consistency was unacceptable. For years it was Google-only. In 2017, Cloud Spanner launched as a publicly available managed service on Google Cloud. It is arguably the most technically sophisticated production database in existence: globally distributed, externally consistent, fully managed, with a 99.999% multi-region SLA.

The key thing to understand about Cloud Spanner is that it doesn't give you a machine to SSH into β€” it's a fully managed service. You pay for processing capacity in "nodes" and storage separately. Google handles everything: replication, failover, backups, software updates. You interact with it via standard SQL (ANSI 2011 compatible) or through client libraries.

Multi-Regional Configurations

Cloud Spanner lets you choose from single-region, multi-region, and custom configurations. A multi-region config like nam6 spans 5 read-write nodes across US data centers. A eur3 config spans EU. The nam-eur-asia1 config puts replicas on three continents. Spanner automatically routes reads to the nearest replica and synchronously replicates writes to all voting nodes before committing β€” externally consistent by design.

Pricing Reality

Cloud Spanner pricing is based on compute capacity (processing units β€” 1 node = 1000 processing units) plus storage. As of 2024, a single-region 1-node configuration runs roughly $650–$700/month before storage and egress. A multi-region 3-continent config can cost several thousand dollars per month at minimum. This makes Spanner appropriate for large-scale production workloads at companies with significant revenue, but overengineered for startups that could use Postgres or CockroachDB instead.

Five-Nines SLA

Cloud Spanner's multi-region configurations carry a 99.999% monthly uptime SLA β€” that is roughly 26 seconds of allowed downtime per month. This is not a marketing claim; it is contractually backed. Spanner achieves this through redundancy across multiple independent failure domains: if an entire Google datacenter goes offline, your Spanner instance keeps running on the remaining replicas, elects a new leader, and continues serving traffic without operator intervention.

Where Google Uses It

Internally, Google's most demanding services run on Spanner. AdWords (now Google Ads) uses Spanner for billing and campaign management β€” money must not be double-charged or lost. Google Photos and Drive use Spanner for file metadata. The scale is extreme: billions of rows, millions of queries per second, 99.999% availability measured continuously. The public Cloud Spanner service runs on the same infrastructure.

Alternatives to Cloud Spanner: If you want Spanner-like distributed SQL without Google Cloud lock-in, the closest open-source paths are CockroachDB Dedicated (managed) or YugabyteDB Aeon. For Google Cloud users who want a more Postgres-friendly managed DB, AlloyDB (Google's Postgres-compatible managed DB) is worth considering β€” though it does not have Spanner's global distribution capability. Spanner remains the highest-performance, highest-availability, and most expensive option.
Cloud Spanner is Google's fully managed global distributed SQL service, publicly available since 2017. It offers 99.999% SLA in multi-region configs, external consistency via TrueTime, and is battle-tested at Google's internal scale. The trade-off: it's the most expensive option, roughly $650+/month per node before storage.
Section 12

When NOT to Use NewSQL

NewSQL solves a real problem β€” but it also adds real complexity and latency. The question isn't "is NewSQL impressive?" (it is) β€” the question is "does my application actually need what NewSQL provides?" More often than engineers expect, the honest answer is no.

Every distributed NewSQL write goes through consensus β€” that's the irreducible cost of safety across nodes. If your application can fit on a single primary plus a few read replicas, that cost buys you nothing. Before adopting NewSQL, ask: do I need writes to scale horizontally? Do I need multi-region active-active? Do I need the database itself to survive a datacenter loss with zero data loss? If all three answers are "no," NewSQL is likely overkill.

WHEN DOES NEWSQL MAKE SENSE? Need horizontal write scale? (single Postgres maxed out?) NO Use Postgres + read replicas YES Need strong ACID guarantees? (payments, inventory, ledger?) NO Cassandra DynamoDB YES Multi-region or global? (active users in 2+ continents?) NO Postgres + Vitess (MySQL sharding) YES NewSQL is the right choice CockroachDB / Spanner / YugabyteDB / TiDB Postgres lineage? CockroachDB / YugabyteDB MySQL lineage or HTAP? TiDB Β· Cloud Spanner (GCP)

Five Anti-Patterns β€” When NewSQL Is the Wrong Tool

Single-Region, Modest Scale

If all your users are in one country and your peak write load is under 5,000 writes/second, a single Postgres primary with a couple of read replicas is simpler, cheaper, and faster than a NewSQL cluster. You avoid consensus overhead, get sub-millisecond writes, and have access to the full Postgres extension ecosystem (PostGIS, pg_vector, Citus, etc.).

Postgres on a modern M-series or high-core-count x86 server handles tens of thousands of queries per second.

Read-Heavy Workloads

If 90%+ of your traffic is reads, adding read replicas gives you horizontal read scale without the consensus overhead on writes. A single Postgres primary + 5 read replicas can serve millions of reads per minute. Tools like PgBouncer (connection pooling) and a load balancer in front of replicas make this a mature, well-understood pattern. NewSQL adds complexity without benefit here.

Time-Series Workloads

Metrics, IoT sensor data, financial tick data β€” all share a pattern: append-mostly, queries over time windows, high-velocity ingest. Dedicated time-series databases like TimescaleDB (Postgres extension with automatic partitioning by time), InfluxDB, or QuestDB are purpose-built for this. They use time-aware compression and time-range partitioning that NewSQL general-purpose storage cannot match.

Unstructured / Document Data

If your data is schema-less or deeply nested (user-generated content, product catalogs with variable attributes, JSON-heavy APIs), document databases like MongoDB or DynamoDB (with a document model) offer flexible schemas, rich query operators on nested fields, and horizontal scale without forcing relational structure onto inherently non-relational data.

Pure OLAP / Analytics

NewSQL is optimised for OLTP β€” many small, fast transactions. If your primary workload is large analytical queries (full-table scans, aggregations, GROUP BY across billions of rows), columnar analytical databases are the right choice. ClickHouse compresses column data extremely efficiently and parallelises scans. BigQuery and Snowflake scale to petabytes. Using NewSQL for OLAP is like using a scalpel to dig a ditch.

Premature adoption is common: Many engineering teams reach for NewSQL (or even NoSQL) before exhausting what a well-tuned Postgres setup can do. Postgres + read replicas + connection pooling + PgBouncer handles more scale than most teams ever reach. If you're on MySQL, Vitess (developed at YouTube, now CNCF) layers horizontal sharding on top without changing your application. Migrate to NewSQL when you have evidence you've outgrown these tools β€” not as a prophylactic architectural decision.
NewSQL has real costs: write latency overhead, operational complexity, and smaller ecosystems. Skip it for single-region apps (Postgres is faster and simpler), read-heavy workloads (use replicas), time-series (TimescaleDB/InfluxDB), documents (MongoDB), or pure analytics (ClickHouse/Snowflake). Adopt NewSQL only when you have evidence that simpler options are genuinely insufficient.
Section 13

Multi-Region Deployments

NewSQL's biggest pitch is global ACID β€” the ability to run strongly consistent transactions across data that lives in multiple countries at once. On paper that sounds magical. In practice it costs latency, because making nodes in different continents agree requires messages to physically cross the ocean.

Here is the core mechanic: data is split into ranges, each range has a leaseholder (the leader), and all writes must reach a quorum of that range's replicas before they commit. When replicas are in the same datacenter, quorum is a ~1-3 ms hop. When replicas span continents, quorum requires a ~50-200 ms round-trip β€” that is the speed-of-light tax, not a bug.

3-REGION NewSQL CLUSTER β€” CROSS-REGION WRITE COMMIT LATENCY us-east-1 ⭐ Leaseholder Range R1 β€” Leader Write origin local commit: 1-3ms eu-west-1 Follower Replica Range R1 β€” Replica 2 RTT β‰ˆ 80-120ms ap-south-1 Follower Replica Range R1 β€” Replica 3 RTT β‰ˆ 140-200ms Raft log Raft log Raft Quorum Logic for 3 replicas Write commits when any 2 of 3 replicas ACK the log entry. Cross-region commit latency = RTT to the faster of the two followers.

Soft numbers to keep in mind: a same-region commit is roughly 1–3 ms. A cross-region commit (US ↔ Europe) is roughly 80–120 ms. US ↔ Asia-Pacific can be 140–200 ms. These are physics-bounded β€” you cannot engineer your way below the speed of light.

4 Strategies to Soften Latency

Geo-Partitioning

Pin certain rows to specific regions based on a partition key β€” usually user_region. A European user's rows live in the EU region, so reads and writes stay in-region and skip cross-ocean consensus entirely. The catch: cross-region queries that join EU rows with US rows still pay the full RTT.

Best for: multi-tenant SaaS, GDPR compliance, and apps where users clearly belong to one region.

Follower Reads

Instead of routing reads to the leaseholder (potentially remote), allow a local follower replica to serve the read with bounded staleness β€” the data may be up to a configurable number of seconds behind. Since most reads don't need the absolute latest millisecond of data, follower reads can bring read latency below 10 ms even in multi-region clusters.

Best for: dashboards, reporting queries, and any read that can tolerate a few seconds of lag.

Read-Your-Writes via Bookmarks

A user submits a form and immediately expects to see their change reflected. If their next read hits a stale follower, they'll feel a ghost β€” their edit disappeared. The fix: track a causal token (a timestamp or log sequence number) per session. Route reads to a replica that has caught up to at least that token. This gives "your own writes are always visible" without requiring every read to hit the leader.

Leaseholder Placement

You can tell the cluster which region should hold the leaseholder for a range. If 90% of your writes originate in the US, pin the leaseholder to us-east. This makes writes fast for the dominant write region and only hurts the minority of writes coming from elsewhere. Most NewSQL databases support this via zone configs or placement policies.

NewSQL enables global ACID by replicating ranges across 3+ regions via Raft. Cross-region writes pay a 50–200 ms latency tax (physics). Mitigate with geo-partitioning, follower reads, causal bookmarks, and deliberate leaseholder placement.
Section 14

Transactions in Practice

From an application developer's perspective, writing NewSQL transactions feels almost identical to writing Postgres transactions β€” you use BEGIN, run some SQL, and call COMMIT. But a few important differences lurk under the surface, and ignoring them will cause subtle bugs and mysterious latency spikes.

The biggest difference is that every write in a NewSQL transaction must travel through consensus. A long-running transaction holds locks across multiple Raft rounds. The longer a transaction is open, the more likely another transaction will conflict with it β€” leading to an abort. Unlike Postgres where aborts are rare, in distributed SQL under load, abort-and-retry is expected and must be handled in application code.

4 Best Practices

Keep Transactions Short

Target a maximum of roughly 100 ms per transaction. Long transactions hold range locks across consensus rounds; while they're open, other transactions touching the same ranges must wait or abort. A transaction that holds locks for 2 seconds in a high-traffic system can cause a cascade of aborts that is very hard to debug. Move any slow I/O (HTTP calls, file reads) outside the transaction boundary.

Retry on Serialization Failures

CockroachDB and YugabyteDB return a SQLSTATE 40001 error ("serialization failure") when two transactions conflict and one must be retried. This is not a bug β€” it is the correct behavior of a serializable isolation level. Your application must catch this error and retry the entire transaction from scratch. Most ORM libraries don't do this automatically; you must add retry logic yourself.

Avoid Hot Rows

In NewSQL, each range is served by one leaseholder. If many transactions all write to the same single row (a global counter, a "likes" count, a running balance), all those writes serialize through the same Raft log entry. This becomes a bottleneck that no amount of horizontal scaling will fix β€” it is a design problem. Instead, use distributed counters (sum partitioned counters at read time) or accumulate updates in a queue.

Use Bounded Staleness Reads When Possible

Not every read in your app needs the freshest possible data. Reporting queries, dashboards, and "what did we sell last week?" analytics can tolerate data that is a few seconds old. Using bounded staleness (or follower reads) for these paths dramatically reduces load on leaseholders and cuts read latency for everyone β€” including the transactional paths that actually need strong reads.

Code Patterns

Wrap any transaction that touches shared rows in a retry loop. The key: on a 40001 serialization error, roll back and try again from the beginning β€” do not try to salvage a partially executed transaction.

retry_txn.py
import psycopg2
from psycopg2 import OperationalError

MAX_RETRIES = 5

def run_with_retry(conn_str, txn_fn):
    """
    Execute txn_fn(cursor) inside a transaction.
    Retries up to MAX_RETRIES times on serialization failure (40001).
    """
    conn = psycopg2.connect(conn_str)
    conn.autocommit = False

    for attempt in range(MAX_RETRIES):
        try:
            with conn.cursor() as cur:
                txn_fn(cur)       # run all SQL statements
            conn.commit()         # success β€” exit the loop
            return
        except OperationalError as e:
            conn.rollback()
            if e.pgcode == "40001":
                # Serialization failure β€” retry is correct
                if attempt == MAX_RETRIES - 1:
                    raise RuntimeError("Max retries exceeded") from e
                continue          # back to top of loop
            else:
                raise             # different error β€” do not retry

# Usage
def transfer_funds(cur):
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))

run_with_retry("postgresql://localhost:26257/bank", transfer_funds)

CockroachDB exposes AS OF SYSTEM TIME to read from a point slightly in the past, allowing the query to be served by any follower replica. No cross-region coordination needed β€” latency drops to the nearest node.

bounded_staleness.sql
-- Read data that is at most 10 seconds old.
-- CockroachDB will route this to the nearest follower replica,
-- so even a us-east app reading eu-west data pays only local latency.
SELECT
    order_id,
    status,
    total_amount
FROM orders
AS OF SYSTEM TIME follower_read_timestamp()
WHERE created_at > now() - interval '24 hours';

-- follower_read_timestamp() returns (now - closed_timestamp_lag)
-- Typical lag: 4.8 seconds (configurable via kv.closed_timestamp.target_duration)

When you expect low contention but want to avoid lock escalation, use a version column. Read the row, remember its version, update only if the version hasn't changed. This is optimistic locking β€” no locks are held during the think time between read and write.

optimistic_lock.sql
-- Step 1: read the current version
SELECT id, balance, version FROM accounts WHERE id = 42;
-- Returns: { id: 42, balance: 500, version: 7 }

-- Step 2: update only if version matches what we read
UPDATE accounts
SET
    balance  = 400,
    version  = version + 1
WHERE
    id      = 42
    AND version = 7;   -- if another writer bumped version, 0 rows updated

-- Step 3: check rows_affected
-- If 0 β†’ someone else changed the row; re-read and decide whether to retry
-- If 1 β†’ success, your update won the race
NewSQL transactions look like Postgres but need extra care: keep them under ~100 ms, handle 40001 serialization errors with retry logic, avoid hot-row bottlenecks, and use bounded staleness for reads that don't need strong consistency.
Section 15

Performance & Tuning

A freshly provisioned NewSQL cluster will handle your workload β€” but a tuned cluster can handle 5-10Γ— more. The good news: unlike hand-sharded Postgres, you are tuning a system that understands its own topology. Most gains come from five practical levers.

TYPICAL NewSQL PERFORMANCE DASHBOARD β€” KEY SIGNALS Write Throughput 42K txn/sec (well-tuned) p99 Write Latency 8ms single-region cluster Cross-Region Throughput ~8K txn/sec (multi-region) Leaseholder Balance β‰ˆ33% per node (target: even) Soft reference numbers for a 3-node, 16-core cluster. Actual numbers depend heavily on workload shape and hardware. Cross-region throughput is roughly half single-region because quorum requires an intercontinental round-trip per write. watch for skew

5 Performance Levers

Batch Writes

Sending 1,000 rows as a single multi-value INSERT INTO t VALUES (...), (...), ... is dramatically faster than 1,000 individual inserts. Why? Each insert is a separate Raft round-trip. Batching collapses 1,000 round-trips into one β€” or at most a few. Aim for batch sizes of 100–1,000 rows depending on row width. This is the single highest-ROI optimization for bulk ingestion workloads.

Range Management

NewSQL databases split and merge ranges automatically as data grows. Most of the time you should leave this alone. The exception: hot ranges β€” a single range absorbing a disproportionate share of writes (visible in the DB console as "hot spots"). You can manually split a hot range or pre-split on insert to spread writes across multiple ranges immediately rather than waiting for the auto-split timer.

Connection Pooling

Each open connection consumes memory and CPU on the NewSQL nodes. A serverless app that opens a new connection per request will exhaust resources quickly. Always sit a connection pooler (PgBouncer for CockroachDB/Yugabyte, ProxySQL for TiDB) between your app and the cluster. Keep pool size proportional to node count Γ— CPU cores β€” a rough starting target is 4Γ— the number of vCPUs.

Cache Hot Reads

NewSQL is optimized for transactional writes and strong reads. It is not a low-latency cache. For lookup patterns that need sub-millisecond responses β€” product details, user profile, feature flags β€” put a Redis or Memcached layer in front. NewSQL becomes the source of truth; the cache absorbs the hot read traffic. Invalidate cache on write, not on a timer, to keep consistency tight.

Avoid Full Table Scans

A full table scan on a single-node Postgres reads one machine's disk. A full table scan on a NewSQL cluster fans out across every node, collects partial results, and merges them β€” it is far more expensive in proportion. Use EXPLAIN to confirm index usage. Add partial indexes for common filtered queries. Ensure that high-cardinality columns used in WHERE clauses have covering indexes so no scan is needed.

The five biggest NewSQL performance wins: batch writes (1 round-trip instead of N), manual range splits for hot spots, aggressive connection pooling, Redis caching for hot reads, and filtered indexes to eliminate distributed scans. A well-tuned single-region cluster handles roughly 10K–100K txn/sec; multi-region roughly halves that.
Section 16

Operations & Maintenance

Running a distributed SQL database is meaningfully more complex than running a single Postgres server. A single Postgres goes down and you notice immediately. A NewSQL cluster has nodes, ranges, leaseholders, replication factors, and certificates β€” any of which can silently degrade without taking down the whole system.

The good news: NewSQL clusters are self-healing for most failure modes. A node dies, the cluster re-elects leaseholders, re-replicates affected ranges, and continues serving. You do not need a manual failover script. What you do need is visibility and a runbook for the things the cluster cannot fix itself.

Cluster Monitoring

Track these signals daily: node health (are all nodes live?), range under-replication (are any ranges below their replication factor?), leaseholder balance (is one node holding 80% of leases?), and inter-node RTT (is network latency spiking?). Most NewSQL dashboards surface these automatically. Set alerts on under-replicated ranges β€” it means you have lost a replica and are one more failure away from losing quorum.

Backups

Full backups plus incremental backups to object storage (S3, GCS). NewSQL systems typically support online backups that do not pause queries. CockroachDB's BACKUP command writes directly to S3 with point-in-time restore support. Cross-region replication is not a backup β€” it replicates deletes and corruptions too. Always maintain a separate backup that is not a live replica.

Rolling Upgrades

NewSQL upgrades work node-by-node: take one node offline, upgrade it, bring it back, wait for re-replication, move to the next node. The cluster stays online throughout because the remaining nodes maintain quorum. Most databases provide a one-click rolling upgrade in their managed console. For self-hosted, follow the official upgrade checklist β€” version skew between nodes is allowed for at most one major version.

Adding / Removing Nodes

Scale out by adding a node; the cluster automatically rebalances ranges to the new node. Scale in by decommissioning β€” the system moves all ranges off the target node before removing it. On large clusters with terabytes of data, rebalancing can take hours. Plan capacity changes in advance; don't add nodes in response to a live incident and expect instant relief.

Monitoring Tools

Each system ships its own dashboard: DB Console (CockroachDB), Yugabyte Platform, TiDB Dashboard. All expose similar panels: node list, range distribution, slow queries, hot spots, Raft log metrics. Export metrics to Prometheus + Grafana if you need unified observability across your stack. Datadog and Grafana Cloud have pre-built NewSQL dashboards.

Cluster Certificates (TLS)

NewSQL nodes communicate over TLS by default β€” node-to-node and client-to-node. Certificate rotation is the trickiest operational task: you must roll new certs to all nodes without breaking the in-flight Raft connections. Most teams use short-lived certs (90 days) managed by cert-manager or HashiCorp Vault with automatic rotation. Don't let certs expire on a production cluster β€” the entire cluster can become unreachable.

Managed offerings handle most of this automatically. CockroachDB Cloud, Yugabyte Aeon, and TiDB Cloud handle upgrades, certificate rotation, monitoring dashboards, and backup schedules out of the box. Unless compliance or data sovereignty rules require self-hosting, managed is the recommended starting point for production workloads β€” the operational burden is roughly equivalent to using RDS instead of self-hosted Postgres.
Distributed SQL needs more operational care than single-node Postgres: monitor node/range health, schedule offline backups (replication is not a backup), use rolling upgrades, plan capacity changes ahead of time, and keep TLS certs rotating. Managed offerings absorb most of this complexity.
Section 17

Comparing the Big 4

Four systems dominate the NewSQL landscape today. They all solve the same core problem β€” distributed ACID at scale β€” but they differ in wire compatibility, deployment model, licensing, and heritage. Picking the right one usually comes down to two questions: what wire protocol does my app already speak? and where do I need to run this?

NewSQL BIG 4 β€” WIRE COMPATIBILITY Γ— DEPLOYMENT MODEL Prefer Managed Prefer Self-Hosted Postgres wire MySQL wire Spanner Google Cloud only GSQL / Spanner API CockroachDB Postgres wire βœ“ Cloud + self-host YugabyteDB PG + CQL wire Cloud + self-host TiDB MySQL wire βœ“ TiDB Cloud + self-host

Spanner β€” Google's Gold Standard

Spanner is the original globally distributed SQL database β€” the system that proved NewSQL was possible at scale. It uses Google's TrueTime API (atomic clocks + GPS receivers in every datacenter) to achieve external consistency without any clock drift assumptions. The result is arguably the strongest consistency model in production databases today.

The catch: Spanner is Google Cloud only. You cannot run it on-premises. It is priced at the premium end β€” roughly 10Γ— the cost of equivalent CockroachDB on GKE. But for a GCP-native team that needs 99.999% global ACID and money is not the primary constraint, Spanner is the safe, boring, correct choice.

CockroachDB β€” Postgres-Compatible Multi-Region

CockroachDB speaks the PostgreSQL wire protocol, so existing Postgres applications typically connect with zero driver changes. Multi-region is a first-class feature β€” you can define REGION constraints per table, geo-partition rows by user location, and set GLOBAL tables that replicate everywhere for low-latency reads. It runs on any Kubernetes cluster or as a managed cloud service (CockroachDB Cloud).

Licensing: the core is BSL 1.1 (source available, not fully open-source). The change date to Apache 2.0 is 4 years after release. For most commercial use this is fine β€” only competing hosting providers need to worry about the BSL terms.

YugabyteDB β€” PostgreSQL Fork + Cassandra API

Yugabyte is unique in that it exposes two wire protocols: YSQL (fully compatible PostgreSQL) and YCQL (Cassandra Query Language). This matters if you have existing Cassandra applications you want to migrate while also adding relational capabilities. The storage layer (DocDB) is an LSM-tree based engine optimized for both workloads simultaneously.

It is fully open-source (Apache 2.0), available as Yugabyte Aeon (managed) or self-hosted on any cloud or bare metal. Benchmarks suggest comparable write throughput to CockroachDB with lower read latency in some workloads, though real-world numbers vary significantly by query shape.

TiDB β€” MySQL-Compatible HTAP

TiDB speaks MySQL wire protocol β€” any MySQL client library connects without modification. What makes TiDB distinctive beyond the others is HTAP (Hybrid Transactional/Analytical Processing) via TiFlash, a columnar replica that runs analytical queries without impacting transactional throughput. You get a data warehouse and an OLTP database from one system.

TiDB has the largest user base in China and is growing globally. TiDB Cloud is the managed option. If your stack is MySQL-heavy and you need HTAP without running a separate Spark or ClickHouse cluster, TiDB is worth evaluating seriously.

Spanner: best consistency, GCP-only, premium price. CockroachDB: Postgres wire, multi-region first-class, cloud or self-host. YugabyteDB: Postgres + Cassandra APIs, open-source. TiDB: MySQL wire, HTAP via TiFlash, strong in Asia. Choose based on wire compatibility with your existing stack and deployment model preference.
Section 18

When NewSQL Wins

NewSQL is not the right answer for every database problem β€” in fact, for the majority of applications it is overkill. But for a specific shape of problem, it is the only good answer. Understanding that shape clearly is what makes a senior engineer: knowing not just what a tool does, but precisely when to reach for it.

WHEN TO CHOOSE NewSQL β€” DECISION TREE Do you need ACID? No NoSQL (Cassandra/Dynamo) Yes Fits single Postgres? (<50K writes/sec, 1 region) Yes Postgres (stay simple) No Multi-region writes needed? No Sharded Postgres or PlanetScale Yes NewSQL βœ“ Spanner Β· CockroachDB Β· Yugabyte Β· TiDB Measure real bottlenecks before choosing. Premature migration to NewSQL adds cost and complexity with no benefit.

5 Winning Use Cases

Global SaaS with Strong Consistency

Banking, fintech, payments, identity management, and insurance systems all have one thing in common: a wrong write is not an inconvenience, it is a compliance failure or financial loss. If your company processes money or stores user identity across regions, you need a database that cannot produce "I debited account A but didn't credit account B" scenarios. That guarantee is precisely what NewSQL provides at global scale.

Example: a global payment processor with users in the US, EU, and Singapore. Postgres can handle the US alone, but the moment you need consistent cross-region writes (a user transfers money to someone in another region), you need distributed ACID.

Apps That Outgrew Sharded Postgres

Many companies handle scale by manually sharding Postgres β€” splitting user data across multiple independent database instances by user ID range or hash. This works until it doesn't: cross-shard joins require application-level scatters, schema migrations become nightmares, and adding a new shard involves data reshuffling that causes downtime. When your sharding logic is more complex than your application logic, it is time to delegate that complexity to a NewSQL database designed to handle it transparently.

Multi-Region Active-Active

"Active-active" means every region can accept writes, not just reads. In a traditional Postgres setup, you have one primary that accepts writes and one or more read replicas. If the primary region goes down, you have to failover and potentially lose the last few seconds of data. With NewSQL and geo-partitioning, a user in Europe writes to the EU region's leaseholder, and a user in the US writes to the US leaseholder β€” both regions are writable simultaneously with no data loss on failure.

Compliance-Driven Data Residency

GDPR requires that EU citizen data be stored and processed within the EU. PIPL requires Chinese user data to stay in China. Historically, enforcing this required separate database deployments per jurisdiction β€” a huge operational burden. NewSQL with geo-partitioning lets you declare "rows where user_region = 'EU' must be stored in eu-west-1 only" directly in the database, with the cluster enforcing it automatically. One cluster, multiple jurisdictions, zero application changes needed to comply.

Migration from Legacy Databases

Wire protocol compatibility is a superpower here. A company running on MySQL can migrate to TiDB β€” same connection strings, same SQL dialects, same ORM drivers. A Postgres shop migrates to CockroachDB or YugabyteDB with minimal driver changes. The migration path goes: deploy NewSQL alongside legacy DB β†’ replicate data β†’ cut over read traffic β†’ cut over write traffic β†’ decommission legacy. Compare that to migrating to Cassandra, which requires rewriting every query and removing every JOIN.

Don't over-engineer. A well-tuned Postgres instance with read replicas handles up to roughly 50,000 writes per second in a single region. Citus (sharded Postgres) extends that significantly. Move to NewSQL only when you have measured real bottlenecks β€” write throughput exceeding single-node Postgres limits, multi-region consistency requirements, or data residency compliance needs. NewSQL adds operational complexity and cost that is not justified for a 10K DAU startup.
NewSQL wins for global SaaS requiring strong consistency (payments, fintech), teams that have outgrown manual sharding, multi-region active-active write patterns, GDPR/PIPL data residency, and legacy MySQL/Postgres migrations. Don't migrate until you've measured real bottlenecks β€” Postgres handles more than most people think.
Section 19

Tools & Drivers β€” The NewSQL Toolbox

You do not need to learn an entirely new ecosystem to work with NewSQL databases. Most of the tools you already know β€” psql, JDBC, Python database drivers β€” work unchanged because NewSQL databases speak existing wire protocols (Postgres or MySQL). The handful of new tools listed below are mostly operational: they help you deploy clusters, manage nodes, and monitor cluster health. Learn these six and you are ready to run NewSQL in production.

DEPLOY DISTRIBUTED SQL β€” TYPICAL WORKFLOW Your App psql / JDBC / Go / Python Gateway Node SQL parse + route to range leaders Range Leaders each range has one Raft leader across 3-5 replica nodes Admin & Monitoring DB Console Β· tiup Β· yugabyted Spanner Console Β· Cloud UIs standard drivers built-in, stateless auto-rebalanced observe & operate tiup bootstraps TiDB clusters with one command Β· yugabyted starts a single-node YugabyteDB in seconds Β· CockroachDB uses cockroach start All three expose Postgres wire protocol β†’ your existing psql / JDBC / ORM works without code changes

cockroach sql / psql

CockroachDB speaks the full Postgres wire protocol, which means the Postgres command-line client psql connects to it out of the box β€” no special driver needed. CockroachDB also ships its own cockroach sql shell that adds cluster-aware features like showing range distribution and lease holders. This matters because it lowers the migration bar enormously: any tool, ORM, or script that already works with Postgres will almost certainly work with CockroachDB without any changes. The same logic applies to YugabyteDB's YSQL API β€” it is Postgres wire-compatible.

tiup

TiUP is TiDB's all-in-one deployment and lifecycle management tool. Think of it like npm but for TiDB clusters. A single tiup playground command starts a full local TiDB cluster β€” TiDB SQL nodes, TiKV storage nodes, and a Placement Driver β€” inside minutes. For production, tiup cluster deploy provisions a multi-node cluster across machines with a YAML topology file. TiUP also handles rolling upgrades, scaling out (adding nodes), and scaling in (decommissioning nodes). Before TiUP existed, running TiDB in production required manually orchestrating four different binaries across many machines β€” TiUP turned that into a single command.

yugabyted

yugabyted is YugabyteDB's single-binary all-in-one tool, designed to make getting started effortless. Running yugabyted start on any machine brings up a full single-node YugabyteDB cluster with both YSQL (Postgres-compatible) and YCQL (Cassandra-compatible) APIs active. For multi-node clusters, you run yugabyted start --join <existing-node> on each additional machine and YugabyteDB forms the cluster automatically. Why does this matter? The traditional database setup experience involves many configuration files, certificates, and service restarts. yugabyted makes the first experience immediate and the operational model easy to reason about.

DB Console / Yugabyte Platform / TiDB Dashboard

Each major NewSQL database ships an administrative web UI that gives you real-time visibility into the cluster. CockroachDB's DB Console (built in) shows node health, range distribution, SQL statement statistics, and replication lag. TiDB's TiDB Dashboard (built into the Placement Driver) shows slow queries, key-visualizer heat maps (which ranges are hot), and cluster topology. YugabyteDB's Yugabyte Platform (the enterprise management layer) goes further, providing multi-cluster management, alerting, and backup scheduling. These UIs are your first stop when something feels slow β€” the range heat map and slow query log together explain about 80% of production performance issues.

Drivers β€” JDBC, libpq, Go/Python/Node

Because NewSQL databases speak standard wire protocols, they work with the full ecosystem of existing drivers. CockroachDB and YugabyteDB YSQL: use any libpq-based driver β€” psycopg2 (Python), node-postgres (Node.js), pgx (Go), or any JDBC driver with a postgresql:// connection string. TiDB: use any MySQL driver β€” MySQL Connector/J for Java, mysql2 for Node.js, PyMySQL for Python. The one nuance: CockroachDB recommends using its own verified driver versions and connection pool settings (max open connections, retry-on-serialization-failure logic) to handle distributed SQL's slightly different transaction retry semantics.

Cloud Consoles

All major NewSQL databases now have fully managed cloud offerings that handle operations for you. CockroachDB Serverless (and Dedicated) on CockroachDB Cloud auto-scales storage and compute, handles backups, and provides a web console for cluster management. Google Cloud Spanner is fully managed on GCP β€” you provision instances and Spanner handles replication, upgrades, and global distribution. Yugabyte Aeon (YugabyteDB's cloud) runs on AWS, GCP, and Azure. The trade-off: managed services cost more per GB than self-hosted but dramatically reduce operational burden. For teams without dedicated database administrators, the managed path is almost always the right starting point.

Connection Code Examples

# CockroachDB β€” psql works because CockroachDB speaks Postgres wire protocol
# The only difference from a normal Postgres connection is the port (26257 default)
psql "postgresql://root@localhost:26257/defaultdb?sslmode=disable"

# YugabyteDB YSQL β€” identical to Postgres, default port 5433
psql "host=localhost port=5433 dbname=yugabyte user=yugabyte"

# Once connected, everything is standard SQL
# CockroachDB-specific: show which node owns each range
SHOW RANGES FROM TABLE orders;
// CockroachDB JDBC configuration
// Uses the standard PostgreSQL JDBC driver β€” no special CockroachDB driver needed
// Key: set reWriteBatchedInserts=true for 2-3x faster batch INSERT performance
// Key: set ApplicationName so the DB Console shows which app is causing load

import java.sql.*;
import java.util.Properties;

public class CockroachExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:26257/bank"
                   + "?sslmode=require"
                   + "&reWriteBatchedInserts=true"  // batches writes β†’ faster
                   + "&ApplicationName=payment-service";

        Properties props = new Properties();
        props.setProperty("user", "myuser");
        props.setProperty("password", "secret");

        try (Connection conn = DriverManager.getConnection(url, props)) {
            // IMPORTANT: CockroachDB recommends wrapping transactions in a retry loop
            // because distributed serializable transactions can return "40001 restart"
            // errors when two transactions conflict. The retry is your responsibility.
            boolean success = false;
            while (!success) {
                try (var tx = conn.prepareStatement("BEGIN")) {
                    tx.execute();
                    // ... your SQL here ...
                    conn.createStatement().execute("COMMIT");
                    success = true;
                } catch (SQLException e) {
                    if ("40001".equals(e.getSQLState())) {
                        conn.createStatement().execute("ROLLBACK");
                        // retry the whole transaction
                    } else throw e;
                }
            }
        }
    }
}
# Google Cloud Spanner β€” Python client
# Spanner does NOT speak Postgres wire protocol β€” it has its own gRPC API
# The google-cloud-spanner library wraps that API cleanly

from google.cloud import spanner

# 1. Create a client pointing at your Spanner instance
client  = spanner.Client(project="my-gcp-project")
instance = client.instance("my-spanner-instance")
database = instance.database("my-database")

# 2. Read-only transaction (uses bounded staleness β€” cheap, no locks)
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT user_id, balance FROM accounts WHERE user_id = @uid",
        params={"uid": "usr_123"},
        param_types={"uid": spanner.param_types.STRING},
    )
    for row in results:
        print(f"User {row[0]}: balance = {row[1]}")

# 3. Read-write transaction (serializable, uses TrueTime for external consistency)
def transfer(transaction, from_id, to_id, amount):
    # Spanner automatically retries this function on serialization conflicts
    transaction.execute_update(
        "UPDATE accounts SET balance = balance - @amt WHERE user_id = @uid",
        params={"amt": amount, "uid": from_id},
        param_types={"amt": spanner.param_types.INT64, "uid": spanner.param_types.STRING},
    )
    transaction.execute_update(
        "UPDATE accounts SET balance = balance + @amt WHERE user_id = @uid",
        params={"amt": amount, "uid": to_id},
        param_types={"amt": spanner.param_types.INT64, "uid": spanner.param_types.STRING},
    )

database.run_in_transaction(transfer, "usr_123", "usr_456", 5000)
NewSQL databases reuse the tools you already know β€” psql, JDBC, and standard ORM drivers work because CockroachDB and YugabyteDB speak Postgres wire protocol; TiDB speaks MySQL. New tools to learn: tiup (TiDB deployment), yugabyted (YugabyteDB all-in-one), and the built-in admin UIs (DB Console, TiDB Dashboard, Yugabyte Platform) for cluster visibility. All three databases also have fully managed cloud offerings that eliminate the operational burden entirely.
Section 20

Common Misconceptions

NewSQL databases attract enthusiastic adoption β€” and with that comes a wave of misconceptions. Some come from the marketing copy ("infinite scale!"), some from half-remembering a conference talk, and some from assuming NewSQL is simply "Postgres but bigger." Clearing these up before you start a project saves you from architectural regrets that are very painful to undo.

1. "NewSQL is just sharded Postgres."

This is the most common misconception and it is wrong in an important way. Sharded Postgres means you run multiple independent Postgres instances and put a routing proxy in front of them. Cross-shard transactions require Two-Phase Commit (2PC) layered on top of those independent databases. 2PC is fragile β€” if the coordinator crashes between the prepare and commit phases, all participating shards are stuck waiting with locked rows until a human intervenes.

NewSQL databases are fundamentally different: consensus (Raft or Paxos) is baked into the storage layer, not an afterthought. Every write to a range goes through a Raft group β€” leader proposes, followers ack, majority reached, then acknowledged. There is no separate coordinator that can crash and leave things stuck. The consistency guarantee comes from the physics of the consensus protocol, not from a bolt-on middleware layer. The architecture is different at the foundation β€” not the same thing with a proxy in front.

2. "It's a drop-in replacement for Postgres."

NewSQL databases are mostly compatible with Postgres syntax, but "mostly" is doing a lot of work in that sentence. A few real-world friction points:

  • Write latency is higher β€” a CockroachDB write in a single-region cluster typically takes 2–10 ms vs Postgres's 0.5–2 ms because consensus requires at least two network round-trips between replicas.
  • Some Postgres extensions do not work β€” CockroachDB does not support PostGIS, pg_vector, or many other pg extensions that live inside the Postgres storage engine.
  • Long transactions behave differently β€” in Postgres you can hold a transaction open for hours; in CockroachDB, long transactions encounter dramatically more serialization conflicts as the write timestamp ages.

If you are migrating an existing Postgres application, budget time to audit these differences β€” they are manageable but not zero.

3. "Spanner is the fastest because of TrueTime."

TrueTime gives Spanner external consistency β€” the strongest possible consistency guarantee β€” but it does not make Spanner faster than other databases. In fact, Spanner's commit protocol deliberately waits out the TrueTime uncertainty window (roughly 3–7 ms) at commit time to ensure the timestamp ordering guarantees hold. That commit-wait adds latency. Same-region writes on Spanner are comparable in latency to CockroachDB and YugabyteDB. The value of TrueTime is correctness, not speed: it guarantees that if transaction T1 commits before T2 starts (in wall-clock time), T2 observes T1's effects β€” even across continents and even for clients comparing timestamps externally. That is a correctness property that matters deeply for Google's billing and AdWords systems. For most applications, HLC (used by CockroachDB and YugabyteDB) provides equivalent practical guarantees at no extra latency cost.

4. "I'll get NoSQL scale and SQL semantics for free."

NewSQL genuinely delivers both horizontal scale and SQL semantics β€” but the word "free" is misleading. There are real costs:

  • Write latency: distributed consensus adds milliseconds that a single-node database does not pay.
  • Operational complexity: running a 3- or 5-node cluster β€” with certificate management, range rebalancing, Raft leader elections, and upgrade rollouts β€” is meaningfully harder than running a single Postgres primary.
  • Smaller ecosystem: tooling around NewSQL databases is younger and thinner than decades of Postgres tooling β€” fewer battle-tested extensions, fewer migration tools, fewer DBA experts to hire.
  • Write throughput per node: because writes must achieve consensus across replicas, a NewSQL cluster does not achieve the same write throughput per machine as a single-node NoSQL database. You get more machines β€” but each write still touches at least three nodes.

These costs are worth paying for the right use case; they are not negligible.

5. "Multi-region distribution is automatic β€” I don't need to think about it."

NewSQL databases do handle replication automatically β€” but where data lives and where the Raft leader sits is something you still need to think about carefully. By default, a CockroachDB cluster distributes range leaders evenly across nodes, which means a user in Tokyo might write to a leader that is physically in us-east. Their write crosses the Pacific twice (to the leader and back) β€” about 300 ms. The solution is geo-partitioning: you tell CockroachDB which rows belong to which region and pin the range leaders for those rows to the nearest region. This requires understanding your data's access patterns. Similarly, "surviving a region outage" requires 5+ nodes across 3+ regions with specific replication zone configs. None of this is hard to configure β€” but it does not happen without intent. Treating multi-region as fully automatic is how teams end up with 200 ms writes for half their user base.

6. "Eventual consistency is impossible β€” NewSQL is always strongly consistent."

This is actually backwards from how most people think about it. By default, NewSQL reads are strongly consistent β€” you read from the Raft leader and get the latest committed value. But most NewSQL databases also let you opt into bounded staleness reads (or "follower reads"), where you read from a nearby follower instead of the leader. The follower may be slightly behind β€” perhaps 4–8 seconds β€” but you get much lower latency because you are not crossing regions to reach the leader. In CockroachDB this is AS OF SYSTEM TIME follower_read_timestamp(). In Spanner it is a stale read bound. In YugabyteDB, yb_read_from_followers = on. This gives you the dial: full strong consistency for writes and read-your-writes scenarios, and bounded-staleness eventual consistency for large read queries where a few seconds of lag is acceptable (dashboards, reports, analytics). The system gives you both β€” you choose which one to use per query.

Six misconceptions corrected: NewSQL is not sharded Postgres β€” consensus is baked in at the storage layer, not layered on top; it is mostly but not fully Postgres-compatible (latency, extensions, long transactions differ); TrueTime gives correctness, not speed; the scale-plus-SQL combination has real costs in write latency, operational complexity, and ecosystem maturity; multi-region still requires you to configure geo-partitioning and leader placement; and eventual consistency IS possible via opt-in bounded staleness reads.
Section 21

Real-World Disasters & Lessons

The stories below reflect failure patterns that have been observed across multiple teams adopting distributed SQL. Names and specifics are composites, but the failure modes are real and well-documented in engineering post-mortems. The price of learning these the hard way is measured in weeks of incident recovery and often in architectural rewrites. Read them now instead.

CROSS-REGION WRITE STORM β€” Leader in Wrong Region EU Users eu-west-1 write traffic ~140ms RTT Atlantic crossing every write Β· every transaction Raft Leader us-east-1 wrong placement Fix: Geo-partition EU rows β†’ pin Raft leader to eu-west-1 EU writes now stay in-region: ~5ms RTT instead of ~140ms Β· 28Γ— improvement
Disaster 1 β€” Premature NewSQL Adoption

A startup with roughly 100 active users adopted CockroachDB in the early months "to be ready for the scale we know is coming." The 3-node CockroachDB cluster required TLS certificate management, upgrade runbooks, Raft health monitoring, and range rebalancing awareness that no engineer on the team had before. Feature velocity dropped sharply because every infrastructure change was now a distributed systems operation. When the team's Postgres engineer left, nobody fully understood the cluster topology. Six months later, the team migrated back to a managed Postgres instance.

Lesson: Scale to your actual problem, not your projected one. A single managed Postgres handles the write load of most applications up to millions of users. Adopt NewSQL when you can measure a specific problem it solves β€” cross-region latency, write throughput at >100 K/sec, global ACID across regions β€” not as a speculative hedge. The operational complexity is real and requires dedicated expertise to manage well.

Disaster 2 β€” Cross-Region Transaction Storm

A financial application deployed CockroachDB with the Raft leader for its accounts table in us-east. EU users were 40% of the write traffic. Every EU write β€” a balance update, a payment record β€” crossed the Atlantic to reach the leader, waited for consensus, and crossed back. The 140 ms round-trip per write meant EU checkout latency was 400–600 ms even under light load. The team had assumed CockroachDB would "automatically route writes to nearby nodes," which it does for reads but not for writes to a specific range β€” writes must go to the Raft leader for that range.

Lesson: Place Raft leaders near your write traffic. In CockroachDB, use geo-partitioning with ALTER PARTITION ... CONFIGURE ZONE USING lease_preferences = [[+region=eu-west]] to pin EU rows' leaders to the EU region. In YugabyteDB, use tablespaces and preferred zones. This requires thinking about your data's geographic access patterns before you deploy β€” not after you see 400 ms writes in production.

Disaster 3 β€” Long Transaction Collapse

An analytics team needed to backfill a derived column across 200 million rows. They wrote a Python script that opened a single transaction, iterated through all rows with a cursor, computed the new value, and updated each row before committing. In Postgres, this works (slowly) β€” the transaction holds a consistent snapshot and updates proceed. In CockroachDB, a 4-hour transaction causes writes to accumulate a long "write history" that must be carried and checked against for the entire duration. After about 90 minutes, the cluster's serialization conflict rate spiked, other transactions started timing out, and throughput dropped by 70% across the entire cluster.

Lesson: NewSQL databases are especially sensitive to long transactions. Aim for transactions under 100 ms. For bulk operations, use batched transactions β€” break 200 million rows into 10 000-row batches, each committed separately. CockroachDB explicitly recommends keeping transactions short and provides documentation on the internal write-intent mechanism that makes long transactions so expensive. Check your ORM for accidental long transactions (e.g., lazy-loaded associations held across HTTP requests).

Disaster 4 β€” Hot-Row Contention

An e-commerce platform stored a global "available inventory" counter for each product in a single row. At peak sales, 10 000 writes per second were updating the same row. Despite having a 5-node CockroachDB cluster, throughput collapsed β€” the cluster was reporting as healthy with 20% CPU, but inventory updates were timing out. The reason: a single row belongs to one range, and one range has one Raft leader. All 10 000 writes per second had to serialize through that single leader and its Raft log. Adding more nodes did not help because the bottleneck was a single Raft consensus group, not cluster-wide capacity.

Lesson: Distributed databases serialize writes within a range β€” adding more nodes does not help a hot single row. Partition your high-write counters: instead of one inventory row, maintain 10 or 20 inventory_shard_N rows and sum them at read time. This spreads writes across multiple ranges and multiple Raft groups. The same principle applies to "global counter" tables, "last-event" tables, and any row that is written more than a few hundred times per second.

Disaster 5 β€” Cluster-Wide Certificate Expiry

A self-hosted CockroachDB cluster had all nodes' TLS certificates set to expire on the same date β€” they were all provisioned in the same automation run during initial setup. Nobody added certificate expiry monitoring because "we will renew when it comes up." On expiry day, all inter-node TLS handshakes failed simultaneously. Nodes could not reach each other. The cluster lost quorum and refused writes entirely. Recovering required rotating certificates on all nodes under time pressure with an angry production outage in progress.

Lesson: Stagger certificate expiry dates across nodes β€” provision them 30 days apart so a certificate rotation affects at most one node at a time. Add certificate expiry monitoring to your alerting stack (CockroachDB's DB Console shows cert expiry; export it as a metric). Write and test your certificate rotation runbook before production. This is a generic distributed systems lesson: anything that expires simultaneously across all nodes is a single point of failure in time.

Five disaster patterns: premature adoption where operational complexity overwhelms a small team (adopt when you measure the problem, not speculatively); cross-region write storm from misplaced Raft leaders (geo-partition to put leaders near write traffic); long transaction collapse in bulk operations (keep transactions under 100 ms, batch large jobs); hot-row contention that extra nodes cannot fix (shard high-write counters across multiple rows); and cluster-wide certificate expiry causing simultaneous quorum loss (stagger cert dates, monitor expiry, test the runbook).
Section 22

Performance & Best Practices Recap

Eight rules cover the vast majority of NewSQL production decisions. They are not advanced tuning tips β€” they are the baseline minimum. Running through this checklist before you design your schema and before you go to production catches most problems before they become incidents.

NewSQL Production Best Practices β€” Quick Reference β‘  Don't adopt prematurely Postgres + read replicas handles most apps. Adopt NewSQL when you can measure the specific problem. β‘‘ Geo-partition by access pattern Place Raft leaders near write traffic. EU users writing to US leaders pay 140 ms RTT every transaction. β‘’ Keep transactions under 100 ms Distributed serializable txns accumulate write intents. Long txns stall the cluster. Batch bulk ops instead. β‘£ Retry on serialization failures SQLSTATE 40001 = restart transaction. Build retry logic into your DAL β€” it is a first-class operation. β‘€ Shard high-write rows Counter tables, hot IDs β†’ split into N shards and sum at read time. Consensus serializes per range. β‘₯ Use bounded staleness for analytics Reports and dashboards don't need leader reads. Follower reads: ~4–8s stale, much lower latency. ⑦ Right cluster size for HA 3 nodes for HA (survive 1 failure). 5 nodes for region tolerance + better availability during maintenance. β‘§ Monitor cluster health continuously Track: node count, range health, Raft leader status, p99 write latency, certificate expiry. Alert on all. Apply all 8 rules before tuning anything else β€” they prevent the most common NewSQL production failures

When to adopt NewSQL

The decision should be driven by measurable pain, not speculation. The three signals that suggest NewSQL is the right answer: (1) your write throughput is consistently above 50 000–100 000/sec and you have already exhausted vertical scaling on Postgres; (2) your users are globally distributed and cross-region latency is measurably impacting your product (not just theoretically); (3) you need globally consistent transactions and your team's custom 2PC implementation is a maintenance burden. None of these require NewSQL preemptively. A well-tuned Postgres + read replicas architecture handles the vast majority of production workloads at a fraction of the operational cost.

Geo-partition your schema

In a global deployment, the most impactful performance lever is where the Raft leader sits relative to where writes originate. The mechanics in CockroachDB: create a PARTITION BY LIST (region) on your primary key, then set zone configurations using ALTER PARTITION ... CONFIGURE ZONE USING lease_preferences = [[+region=eu-west]]. This pins the Raft leader for EU rows to EU nodes. EU writes now commit in ~5 ms instead of ~140 ms. YugabyteDB achieves the same via tablespaces and preferred-zone settings. The key insight: geo-partitioning is a schema design decision, not an infrastructure decision β€” you make it when you define the table, not when you add nodes.

Keep transactions short

The 100 ms rule is a practical target derived from how distributed transaction protocols work. In CockroachDB, a serializable transaction must maintain a "read timestamp" for all rows it has read. As the transaction ages, other transactions modify those rows, creating write intents. When the long transaction finally tries to commit, it must resolve all those intents β€” and if any conflict, it restarts from the beginning. The older the transaction, the more likely it conflicts, and the more expensive the restart. For bulk jobs (backfills, migrations, large inserts), break the work into 1 000–10 000 row batches committed independently. Each batch is a short transaction; the overall job is a sequence of short transactions with a progress checkpoint.

Build retry logic into your DAL

Serialization conflicts are a normal part of operating a distributed serializable database β€” they are not errors in the failure sense, they are the database's mechanism for enforcing correctness when two transactions race. Your data access layer (DAL) must handle SQLSTATE 40001 (or the equivalent "restart transaction" error) by rolling back and retrying the entire transaction from scratch. Do not log these as errors; track them as a metric. A retry rate below 1% is healthy. Above 5% suggests a hot row or an overly broad transaction scope. Most modern ORMs (GORM, SQLAlchemy with CockroachDB dialect, Hibernate) can be configured to retry automatically β€” check your ORM's documentation for the setting.

Right cluster size for your HA goal

Raft requires a majority of replicas to be alive to make progress. A 3-node cluster (replication factor 3) can survive 1 node failure β€” the remaining 2 are a majority. A 5-node cluster can survive 2 simultaneous failures. For multi-region setups, the general guidance is 5+ nodes spread across 3+ regions: one region can go down entirely and the other two still form a majority. Adding a 6th node does not improve fault tolerance on its own (you still need 4 out of 6 = majority), but a 7-node cluster (survive 3 failures) does. Size your cluster based on your availability SLA and how many simultaneous failures you need to tolerate β€” not just on write throughput, which can be addressed by adding CPU/RAM to existing nodes.

Monitor cluster health

A NewSQL cluster has more failure modes than a single Postgres instance, and you need visibility into each one. The four metrics to alert on: node health β€” any node marked as unavailable means you are one failure away from losing quorum; range health β€” "under-replicated ranges" mean some ranges have fewer than the required replicas, reducing fault tolerance; Raft leader distribution β€” if all leaders are on one node, that node is both a performance bottleneck and a single point of failure; p99 write latency β€” a spike signals a hot range, a network partition, or a disk I/O problem. All four are exposed by CockroachDB's DB Console and via its Prometheus metrics endpoint, which integrates directly with Grafana.

Use bounded staleness for analytics

Not every read needs to be the freshest possible value. Dashboards, reports, trend calculations, and read-heavy APIs that do not implement "read your own writes" semantics can tolerate data that is a few seconds old. Bounded staleness reads in CockroachDB (AS OF SYSTEM TIME follower_read_timestamp()) route the read to the nearest replica β€” which may be a follower in the same datacenter as the reader β€” instead of routing to the Raft leader, which may be in another region. The staleness is typically 4–8 seconds, far less than the 15–30 second refresh rate of most dashboards. This can reduce cross-region read traffic dramatically and lowers pressure on the Raft leader, freeing it to serve writes faster.

Avoid hot rows

Any row that receives more than a few hundred writes per second will create a hot range. Because all writes to a range must go through that range's Raft leader, the throughput ceiling for a single row is roughly the throughput of one Raft consensus round β€” typically a few thousand writes per second at best. At 10 000 writes/sec, a single-row counter will collapse. The fix: split into N shards. If you have a product_inventory table with a single row per product, change it to have N shard rows (e.g., 10 rows with shard IDs 0–9), write to a randomly chosen shard, and read the inventory by summing all shards. This distributes the Raft load across N independent ranges and N independent leaders.

Eight NewSQL best practices: only adopt when you measure the problem Postgres cannot solve; geo-partition to keep Raft leaders near write traffic; keep transactions under 100 ms and batch bulk jobs; build serialization retry (40001) into your DAL; shard hot rows across multiple ranges; use bounded staleness reads for analytics and dashboards; size clusters at 3+ for HA or 5+ for region tolerance; and monitor node health, range health, Raft leader distribution, and p99 write latency continuously.
Section 23

Frequently Asked Questions

These are the questions that come up in every system design interview, every architecture review, and every engineering Slack thread that ends with "should we migrate to CockroachDB?" Each answer is written for someone who understands databases generally but is still learning the distributed SQL space.

Q1: When is NewSQL actually the right choice?

NewSQL earns its operational complexity only when you have exhausted what Postgres + replicas can give you. The three clear signals: (1) Global ACID + horizontal scale together β€” you have users on multiple continents, each writing transactionally, and you need all writes to be globally consistent without application-level coordination. (2) Write throughput beyond a single primary β€” you are consistently above ~50 000 writes per second and vertical scaling is no longer cost-effective. (3) You cannot drop ACID β€” your domain (payments, inventory, reservations) requires strict serializability and you cannot build a compensating transaction system on top of eventual consistency. If none of these three apply, a well-tuned Postgres primary with read replicas is almost certainly the right choice β€” simpler to operate, better ecosystem, and a larger pool of engineers who know it well.

Q2: CockroachDB or Spanner β€” how do I choose?

Choose Google Cloud Spanner if: you are already committed to GCP; you need the absolute highest consistency guarantees (external consistency via TrueTime, not just serializability); or you need five-nines availability backed by Google's SLA. Spanner is fully managed β€” Google handles everything β€” and it has a long track record powering Google's own internal systems. The trade-off: you are fully GCP-locked-in and Spanner is expensive (roughly $0.90/node-hour for a single processing unit). Choose CockroachDB if: you want cloud portability (CockroachDB runs on AWS, GCP, and Azure); you want Postgres wire-protocol compatibility so your existing drivers and tools work; you want the option to self-host; or you need a smaller starting cluster size (Spanner's minimum recommended for production is 3 nodes per region). Both are excellent for the right context β€” the GCP commitment is usually the deciding factor in practice.

Q3: What is the write latency hit compared to Postgres?

The numbers depend on your deployment topology. Single-region (all replicas in same datacenter or AZ): CockroachDB single-region writes typically take around 1–3 ms per transaction, versus Postgres at roughly 0.5–2 ms. The difference is the Raft round-trip between leader and followers β€” two or three network hops within a datacenter, each adding sub-millisecond latency. Multi-region (replicas across datacenters): writes that must achieve consensus across regions pay the full network round-trip latency. With replicas in us-east, eu-west, and ap-southeast, a write needs a majority ack β€” meaning at least one cross-region round-trip (~50–70 ms for us-east to eu-west). Multi-region writes commonly take 50–200 ms. With geo-partitioning: if you pin the Raft leader to the same region as the writer, multi-region writes can drop back to the single-region range (1–10 ms for the local consensus round-trip). The latency hit is a physics problem β€” reduce it by minimizing the distance between writers and their range leaders.

Q4: Can I run analytical queries on NewSQL?

You can run SQL analytical queries, but NewSQL databases are row-oriented OLTP engines by design β€” they are not optimized for the large table scans that OLAP queries require. A complex aggregation over a billion rows will be significantly slower than the same query on a columnar database like ClickHouse or BigQuery. That said, there are practical options. TiDB with TiFlash: TiDB's analytical extension TiFlash is a columnar storage engine that replicates data from TiKV in real time. You can mark specific tables to replicate to TiFlash and queries automatically route there for analytical workloads β€” essentially HTAP (hybrid transactional/analytical processing) built in. Dedicated warehouse: for serious analytics, stream data from your NewSQL database (via CDC / Kafka) to a dedicated columnar warehouse (BigQuery, Snowflake, Redshift) and run analytics there. Read replicas: for lighter analytical needs, route long-running queries to follower replicas to avoid impacting the leader's write throughput.

Q5: Is Google Spanner truly serializable?

Spanner provides something stronger than serializable β€” it provides external consistency, also called linearizability at the transaction level. Serializability guarantees that the outcome of concurrent transactions is equivalent to some serial execution β€” but does not specify which serial order. External consistency adds a real-time constraint: if transaction T1 commits before T2 starts (in real wall-clock time), then the database's transaction order must show T1 before T2 β€” even if T1 and T2 are running in different datacenters and different clients. This matters for distributed systems where clients observe the database from the outside and compare timestamps across machines. Google achieves this via TrueTime's bounded clock uncertainty β€” by waiting out the uncertainty window at commit, Spanner can guarantee that any later-starting transaction has a definitively higher timestamp. For most applications, the practical difference between serializable and externally consistent is invisible β€” but for billing systems and globally synchronized workflows, external consistency prevents subtle ordering bugs that serializability alone does not catch.

Q6: Can NewSQL replace MySQL with Vitess?

These are two different solutions to similar-sounding problems. TiDB is MySQL wire-compatible and stores data in TiKV, a distributed key-value store β€” it is a genuine NewSQL database that gives you cross-shard ACID transactions natively. Vitess is a sharding proxy that sits in front of real MySQL instances β€” your data still lives in MySQL, but Vitess routes queries to the right shard and handles some cross-shard merging. Vitess does NOT give you native cross-shard ACID transactions β€” cross-shard operations still require 2PC. Choose TiDB if you need true cross-shard transactions and want a single coherent database engine with distributed semantics. Choose Vitess if you have a very large existing MySQL installation and want to scale it without migrating to a new storage engine β€” Vitess lets you keep MySQL while adding horizontal write capacity for shardable workloads. The two are not interchangeable; the right choice depends on whether cross-shard ACID is a requirement.

Q7: What is the minimum cluster size I should run?

The minimum depends on your availability goal. 3 nodes: the minimum for high availability. With replication factor 3, Raft requires 2 out of 3 nodes to be alive. If one node fails (hardware, network, maintenance reboot), the cluster keeps serving reads and writes. If two nodes fail simultaneously, the cluster loses quorum and stops accepting writes β€” protecting consistency over availability. 5 nodes: required for region-level fault tolerance in a multi-region setup. With 5 nodes across 3 regions (e.g., 2+2+1), one entire region can fail and the remaining two regions still have a majority (4 out of 5 nodes alive). Also gives better load distribution during maintenance (rolling upgrades take one node offline at a time; 5 nodes gives more headroom than 3). Never run 2 nodes: two nodes cannot form a majority if either fails β€” the cluster would stop writes to stay consistent, giving you no benefit over a primary+replica Postgres setup.

Q8: What about Postgres extensions like pg_vector?

This is one of the real compatibility gaps between NewSQL and Postgres. Because CockroachDB and YugabyteDB implement the Postgres wire protocol and SQL dialect from scratch (they are not actually running a Postgres storage engine), they do not support Postgres extensions that require hooks into the Postgres storage layer. pg_vector (vector similarity search for AI embeddings) does not work in CockroachDB as of its 2024 releases. PostGIS (geospatial) does not work. pg_partman, pg_cron, and other popular extensions have varying support. YugabyteDB has broader extension support than CockroachDB (it uses the actual Postgres query layer on top of its distributed storage), but even YugabyteDB does not support all extensions. Before committing to a NewSQL database, explicitly check each extension your application depends on against that database's documented extension support list β€” do not assume Postgres compatibility means extension compatibility.

Eight key answers: adopt NewSQL for global ACID + horizontal scale when Postgres + replicas is genuinely insufficient; choose Spanner for GCP commitment + highest consistency, CockroachDB for cloud portability + Postgres compatibility; single-region writes add ~1–3 ms, multi-region adds 50–200 ms (fix with geo-partitioning); NewSQL is OLTP-first β€” use TiFlash or a separate warehouse for analytics; Spanner provides external consistency, stronger than serializable; TiDB gives native cross-shard ACID while Vitess is a MySQL sharding proxy without native cross-shard ACID; minimum 3 nodes for HA, 5 nodes for region tolerance; check extension compatibility explicitly β€” pg_vector and PostGIS are not available in all NewSQL databases.