Database Foundations

Time-Series Databases: When Every Row Has a Timestamp

Most data generated by modern systems is timestamped: server metrics, IoT sensor readings, stock ticks, application traces. The workload is append-mostly, time-anchored, and retention-driven: you write fast, you query by time window, and you drop old data when it ages out. Time-series databases (TSDBs) like TimescaleDB, InfluxDB, Prometheus, and ClickHouse optimize storage layout, ingestion pipelines, and query engines specifically for this shape. Their sweet spot is observability, IoT, financial analytics, and any system where you need to ask "what happened between 14:00 and 15:00?" with sub-second response times. They are not a replacement for a primary OLTP store; pick the right tool for the right shape.

Section 1

TL;DR: Time-Series Databases in Plain English

  • Why timestamped data has unique storage requirements that regular databases struggle with
  • What a time-series database actually optimizes, and what it sacrifices to do it
  • How the four main TSDBs (TimescaleDB, InfluxDB, Prometheus, ClickHouse) position themselves differently
  • When a TSDB is the right tool, and when Postgres with good indexes is perfectly fine

Time-series databases make one radical bet: time is not just a column, it IS the storage layout. Physically group data by time bucket, compress similar values together, and drop old buckets like deleting a folder. The result: 100K writes/sec at 10–50× less disk space than a general-purpose database.

[Figure: Time-series data flow, from sources to insights. Sources (IoT sensors, server metrics, stock ticks, app traces) feed an ingestion stage that batches and compresses (100K+ writes/sec). The TSDB stores time-partitioned chunks (hot, warm, cold; chunks past the 30-day retention window are dropped; columns compressed 10–50×). Dashboards (Grafana, Kibana) run time-window queries using PromQL, Flux, or SQL (TimescaleDB).]

A time-series database is a database where every row has a timestamp and the physical storage is organised by time rather than by row ID. Think of it as a conveyor belt: new readings land at the front, old ones roll off the back (retention), and everything in between is aggressively compressed because adjacent sensor readings are almost identical. The four main players are TimescaleDB (Postgres under the hood, so familiar SQL with time-aware storage), InfluxDB (purpose-built, with its own Flux query language), Prometheus (pull-based metrics for monitoring, queried with PromQL), and ClickHouse (a columnar OLAP powerhouse that happens to be spectacular at time-series analytics).

A plain Postgres table with a created_at index works fine at modest volumes. As the table grows toward billions of rows, writes start hammering the WAL, the B-tree index outgrows memory, and SELECT avg(cpu) FROM metrics WHERE ts > now() - interval '1 day' has to navigate an index that spans months of history. A TSDB sidesteps all of this with time-partitioned chunks (only the most recent chunk takes writes), columnar compression (values from the same column are compressed together, not row-by-row), and automatic retention (dropping an entire old chunk in milliseconds rather than running a row-by-row DELETE).

Reach for a TSDB when your workload has all three of: (1) a timestamp on every record, (2) 90%+ inserts with rare or no updates, and (3) queries that always have a time-window filter. Classic fits: observability dashboards, IoT telemetry, financial tick stores, application performance monitoring. Do NOT use a TSDB as your primary user or order store: it optimises away the random-access and update patterns you need for OLTP.

Time-series databases optimise for append-mostly, timestamp-anchored data by physically partitioning storage by time, compressing similar values together, and auto-expiring old partitions, making high-throughput writes and time-windowed queries orders of magnitude faster than general-purpose databases.
Section 2

Why You Need This: The 100 K Writes/Sec Wall

Imagine you are building an IoT platform. You have 100 000 temperature sensors, each reporting once per second. That is 100 000 inserts every single second, 8.6 billion rows a day. Before you even think about queries, you have a write problem. Let's trace exactly where a conventional database breaks, and why a TSDB doesn't.

The Postgres Breaking Points

Postgres is a fantastic database. It is not built for this shape of work. Here is what actually happens at scale:

Problem 1: WAL Saturation

Every insert writes to the WAL before touching the heap. At 100 K inserts/sec with rows of around 100 bytes, that is at least 10 MB of WAL per second (roughly 36 GB per hour) before index updates, full-page writes, and replication are counted. Replication lag climbs, checkpoints become expensive, and the whole cluster slows down.

Problem 2: Index Bloat

A B-tree index on created_at grows with every insert. After a few weeks it no longer fits in shared_buffers, so every time-range query becomes a disk-seeking index scan. The index that was supposed to speed things up is now the bottleneck.

Problem 3: Query Scans Everything

SELECT avg(temperature) FROM metrics WHERE ts > now() - interval '1 day' sounds simple. But if the table holds 1 year of data, Postgres still has to touch the index to find yesterday's rows, and the index spans the entire history. A TSDB with time-partitioning physically doesn't even open partitions older than your query window.

Problem 4: Deleting Old Data Is Expensive

Running DELETE FROM metrics WHERE ts < now() - interval '30 days' is a full row-by-row delete that generates even more WAL, leaves dead tuples, and requires a VACUUM pass. On a billion-row table this can take hours. A TSDB drops an old chunk the same way you delete a folder: one filesystem operation.

Think First

You have 1 billion sensor readings covering 1 year. A business analyst asks: "What was the average CPU temperature per hour, for every hour, over the last year?"

In a flat Postgres table that query scans and groups 1 billion rows; it might run for minutes. In a TSDB, this is a continuous aggregate: a pre-computed rollup that is updated incrementally every time new data arrives. The same query hits a tiny summary table and returns in milliseconds. The computation happened at write time, distributed across thousands of small incremental updates, not crammed into one giant query.

What a TSDB Does Instead

A TSDB solves all four problems by rethinking the storage model from the start, not by adding indexes on top of row-oriented storage:

At IoT scale (100K+ writes/sec), Postgres hits four walls: WAL saturation, index bloat, full-history scans, and expensive row-by-row deletes. TSDBs eliminate all four by using time-partitioned chunks, columnar compression, partition pruning, and partition-drop retention.
Section 3

Mental Model: Time as the First-Class Axis

The single biggest conceptual shift when moving from a general-purpose database to a TSDB is this: time is not just a column you filter on; it is the axis that determines where data physically lives on disk. This one idea explains almost every design decision a TSDB makes.

In a regular relational database, rows are laid out roughly in insertion order, indexed by primary key or B-tree. Time is just another value in a column. Two readings from the same sensor, 1 second apart, might live in completely different data pages. To find "everything from yesterday" the database has to consult an index tree that spans the entire history.

In a TSDB, incoming data is partitioned by time window, often called a chunk (TimescaleDB) or shard (InfluxDB). All data for "2025-05-09 14:xx" lives in one chunk. All data for "2025-05-09 15:xx" lives in the next. The storage looks less like a giant table and more like a series of filing cabinet drawers, one drawer per time window.
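The filing-cabinet idea can be sketched in a few lines. This is a toy in-memory model, not how any particular TSDB implements chunks: the ChunkedStore class, the one-hour CHUNK_WIDTH, and the chunk_start helper are all assumptions made up for illustration. It routes each insert to the chunk covering its timestamp and, on read, opens only the chunks that overlap the query window.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

CHUNK_WIDTH = timedelta(hours=1)  # assumed chunk size for this sketch
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def chunk_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of the chunk that contains it."""
    return ts - ((ts - EPOCH) % CHUNK_WIDTH)


class ChunkedStore:
    """Toy store: one list of (timestamp, value) rows per time chunk."""

    def __init__(self) -> None:
        self.chunks = defaultdict(list)

    def insert(self, ts: datetime, value: float) -> None:
        # Route the row to the drawer (chunk) that covers its timestamp.
        self.chunks[chunk_start(ts)].append((ts, value))

    def query(self, start: datetime, end: datetime):
        """Return rows with start <= ts < end, plus the chunks that were opened."""
        # Pruning step: keep only chunks whose window overlaps [start, end).
        opened = [c for c in sorted(self.chunks)
                  if c < end and c + CHUNK_WIDTH > start]
        rows = [r for c in opened for r in self.chunks[c] if start <= r[0] < end]
        return rows, opened
```

A query for one hour touches one chunk no matter how many other chunks exist, which is the property the rest of this section builds on.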

[Figure: Regular database vs. time-series database. Left: heap data pages ordered by primary key, with timestamps from different days interleaved; a "last 1 hour" query must consult a B-tree spanning all rows, and deleting old data means a row-by-row scan plus VACUUM. Right: data physically grouped into time chunks (a hot, writable chunk; compressed read-only chunks; cold chunks eligible for tiered storage; chunks past the 30-day retention window dropped in O(1)); a "last 1 hour" query opens only the hot chunk with a sequential read.]

Four Architectural Decisions That Follow From This Model

Once you commit to "time is the physical layout axis", four other design choices fall out naturally:

Time-Bucketed Storage

Incoming data is routed to the chunk that covers the current time window. Chunk size (e.g., 1 day, 1 week) is tuned to match query patterns: smaller chunks mean more precise pruning but more metadata overhead. TimescaleDB uses "hypertables" that transparently chunk a Postgres table by time and optionally by a secondary dimension (like host).

Append-Only

Sensor readings don't get corrected in place. If you sent the wrong value, you send a new corrected reading. This means chunks are write-once: once a chunk's time window closes, nothing inserts or updates into it. Read-only data compresses much better because the compressor can see the whole chunk at once, not just a rolling window.

Compression on Write

When a chunk's time window closes, the TSDB compresses it. Compression works column-by-column: all timestamps together, all values together, all tag strings together. Adjacent sensor readings are almost identical, so compression ratios of 10–50× are routine. This is not a nice-to-have: at 100 K writes/sec with rows of around 100 bytes you'd otherwise be writing close to a terabyte per day.

Retention Policies

Most time-series data has a useful life. You need raw 1-second readings for 7 days; you need hourly averages for 1 year; you don't need anything after 3 years. A retention policy says "drop any chunk whose time window is older than N days". Because a chunk is a single storage object, the drop is O(1): no scanning, no VACUUM, no downtime.

In a TSDB, time is the physical partitioning key: all data for a time window lives together on disk. This single decision enables append-only writes, columnar compression, partition pruning for fast reads, and O(1) retention by dropping entire old chunks.
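A retention pass over chunked storage can be sketched as follows. This is an illustrative model only: the apply_retention function and the dictionary-of-chunks layout are hypothetical, standing in for whatever storage objects a real TSDB uses. The point is that the work is proportional to the number of expired chunks, not the number of rows inside them.

```python
from datetime import datetime, timedelta, timezone


def apply_retention(chunks: dict, now: datetime, keep: timedelta,
                    chunk_width: timedelta) -> list:
    """Drop every chunk whose whole time window is older than `now - keep`.

    `chunks` maps chunk start time -> rows. Returns the dropped chunk keys.
    """
    cutoff = now - keep
    expired = [start for start in chunks if start + chunk_width <= cutoff]
    for start in expired:
        del chunks[start]  # one operation per chunk; rows are never inspected
    return sorted(expired)
```

Compare this with a row-level DELETE, which would have to visit every expired row individually.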
Section 4

Core Concepts: The Vocabulary of Time-Series

Before we dive into how each TSDB works, let's nail down the six terms you'll see everywhere. They're not complicated; they're just precise names for things you already intuit.

Time-Series Data

A stream of (timestamp, value, tags) triples. The timestamp says when, the value says what, and the tags (also called labels or dimensions) say who or where. Example: ts=14:02:01, value=44.1, tags={metric="cpu", host="web1", region="us-east"}. Every measurement in your system fits this shape, whether it's a CPU reading, a blood-pressure sensor, or a stock price.

Series

A series is a unique combination of metric name + tag values. Think of it as a single continuous line on a graph. cpu_usage{host="web1", region="us-east"} is one series; cpu_usage{host="web2", region="us-east"} is a different series. Data within one series is guaranteed to be time-ordered, which is exactly what compression algorithms exploit.

Cardinality

Cardinality is the total number of distinct series in your database. If you have 10 metrics, each for 1 000 hosts, across 2 regions, that's 20 000 series. Cardinality is the main scaling challenge for time-series databases, especially ones like Prometheus that keep series metadata in RAM. Adding a new high-cardinality tag (like a user ID on every metric) can explode cardinality from thousands to billions and bring a TSDB to its knees.
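The arithmetic behind that explosion is a simple product. The sketch below computes an upper bound on the number of series, assuming every combination of tag values actually occurs; the series_cardinality name and the one-million user_id figure are illustrative assumptions.

```python
from math import prod


def series_cardinality(metrics: int, tag_value_counts: dict) -> int:
    """Upper bound on distinct series: metrics x product of distinct values per tag."""
    return metrics * prod(tag_value_counts.values())


# The example from the text: 10 metrics x 1 000 hosts x 2 regions.
baseline = series_cardinality(10, {"host": 1000, "region": 2})

# Adding a hypothetical user_id tag with one million distinct values.
with_user_id = series_cardinality(
    10, {"host": 1000, "region": 2, "user_id": 1_000_000}
)
```

In practice not every combination appears, so real cardinality is usually lower than this bound, but the multiplicative growth is what makes a single unbounded tag dangerous.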

Downsampling

Downsampling means replacing many fine-grained readings with a single coarser summary. You keep raw 1-second readings for 7 days (they're big but you need them for debugging). After 7 days you downsample to 1-minute averages (60× smaller). After 90 days you downsample to 1-hour averages (another 60× smaller). You lose resolution but gain enormous storage savings for data you'd only ever look at in aggregate anyway.
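A minimal sketch of downsampling, assuming readings arrive as (unix_timestamp, value) pairs and that a plain mean is the summary you want. The downsample function is hypothetical; real TSDBs expose this through their own query functions or scheduled tasks.

```python
from collections import defaultdict
from statistics import fmean


def downsample(readings, bucket_seconds: int):
    """Average (unix_ts, value) readings into fixed-width buckets.

    Returns a sorted list of (bucket_start_ts, mean_value).
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        # Round the timestamp down to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


# Three minutes of 1-second readings collapse to three 1-minute rows.
raw = [(t, float(t % 60)) for t in range(0, 180)]
per_minute = downsample(raw, 60)
```

Going from 180 rows to 3 is the per-minute reduction of 60 rows into 1 described above; averaging is lossy, so keep max or percentile columns as well if spikes matter.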

Retention Policy

A retention policy is a rule that automatically deletes data older than a configured age. "Keep raw metrics for 30 days, hourly aggregates for 1 year." The TSDB enforces this without any application code; it just drops old chunks on schedule. This is a core feature of every major TSDB because without it, an append-only write pattern would fill your disk indefinitely.

Continuous Aggregate

A continuous aggregate is a pre-computed rollup that the database keeps up to date automatically as new data arrives. Instead of running SELECT avg(cpu) GROUP BY hour at query time (expensive on billions of rows), the TSDB incrementally updates a "cpu per hour" summary table every time a chunk of raw data is compressed or written. Queries hit the tiny summary, not the raw data: milliseconds instead of minutes.

The six core TSDB concepts are: time-series data (timestamp + value + tags), series (unique metric+tag combination), cardinality (number of distinct series, the main scaling challenge), downsampling (coarsen old data to save space), retention policy (auto-delete old data), and continuous aggregate (pre-computed rollup updated at write time).
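The incremental idea can be shown with a running sum and count per hour. This is a toy sketch: the HourlyAverage class is hypothetical, and it folds in every insert immediately, whereas real systems typically refresh their aggregates in batches on a schedule.

```python
class HourlyAverage:
    """Toy continuous aggregate: a running (sum, count) per hour bucket."""

    def __init__(self) -> None:
        self.buckets = {}  # bucket start (unix seconds) -> [sum, count]

    def on_insert(self, ts: int, value: float) -> None:
        """Fold one new reading into its bucket; other buckets are untouched."""
        bucket = ts - ts % 3600
        acc = self.buckets.setdefault(bucket, [0.0, 0])
        acc[0] += value
        acc[1] += 1

    def query(self, start: int, end: int):
        """Hourly averages for buckets with start <= bucket_start < end."""
        return [(b, s / n) for b, (s, n) in sorted(self.buckets.items())
                if start <= b < end]
```

The query cost depends on the number of hour buckets in the window, not on how many raw readings were ever written.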
Section 5

The Time-Series Workload Pattern: What Makes It Different

It might be tempting to think "a TSDB is just Postgres with time-based partitioning". But TSDBs exist because the time-series workload has five very specific characteristics, and each one justifies a different design decision. Understanding these characteristics will help you decide whether a TSDB is the right tool for your situation. As you read each card, ask yourself: does my workload look like this? If you can't tick all five, a regular database is probably the better answer.

Mostly Writes

In a typical time-series workload, 90–99% of operations are inserts. Reads are far less frequent: a dashboard polls every 30 seconds, while a sensor writes every second. This means the database should be optimised for write throughput first, read latency second. TSDBs batch and buffer writes, compressing them before flushing to disk, in a way that would be unsafe for an OLTP database where you need immediate read-your-own-writes consistency.

Append-Mostly

Once a sensor reading is written, it is almost never updated. UPDATE is rare; DELETE is retention-driven, not application-driven. This constraint is what makes columnar compression work so well: the compressor can process an entire closed chunk without worrying about concurrent modifications invalidating its work. It's also why many purpose-built TSDBs can avoid most of the heavy bookkeeping that lets Postgres handle concurrent updates safely (the technical name is MVCC). That bookkeeping costs CPU and disk, and a store that rarely updates rows has little use for it. TimescaleDB, being a Postgres extension, still inherits it.

Time-Windowed Reads

Every query has a time range. Nobody asks "show me all CPU readings ever". They ask "last 5 minutes", "yesterday", "Q3 2024". This is fundamentally different from OLTP queries (fetch order by ID) or OLAP queries (full table aggregations). Time-windowed queries map perfectly onto time-partitioned storage: the database only opens the chunks that overlap the query window.

Aggregation-Heavy

Raw readings are usually too noisy to display directly. Applications want avg, max, sum, P95, P99 over time buckets: "average CPU per minute", "max latency per hour". This is why every TSDB has built-in support for time-bucket aggregation (time_bucket() in TimescaleDB, aggregateWindow() in InfluxDB's Flux, range-vector functions such as rate() and avg_over_time() in Prometheus). Columnar storage accelerates these because all values for an aggregation live contiguously on disk.

Retention-Bounded

Unlike a user database where you'd never want to lose a customer record, time-series data has a natural shelf life. Raw 1-second sensor data from 2 years ago has almost zero operational value. TSDBs treat data expiry as a first-class feature, not an afterthought. Retention policies are configured at database creation, not added as a cron job later. This keeps storage costs flat even as writes continue forever.

[Figure: Time-series query patterns on a 10:00–15:00 timeline. Point lookup (ts = 12:00 exactly); range scan (13:00–14:00, all raw rows in the window, a sequential read of 1–2 chunks); bucketed aggregation (hourly averages of 23.4, 31.7, 44.1, and 40.2 for the hours from 10:00 to 14:00, pre-computed by a continuous aggregate).]
When a TSDB is Overkill

If your workload doesn't have all five of these characteristics, you probably don't need a specialised TSDB. A well-indexed Postgres table with created_at partitioned monthly handles millions of rows per day without drama. The jump to a purpose-built TSDB is worth making when you hit the 100K+ writes/sec wall, need automatic retention, or are building a multi-tenant metrics platform where cardinality and compression efficiency become existential concerns.

The five time-series workload characteristics (mostly writes, append-mostly, time-windowed reads, aggregation-heavy, and retention-bounded) each drive a specific TSDB design decision. If your workload lacks these patterns, a general-purpose database with partitioning is sufficient.
Section 6

Compression: Why TSDBs Are 10–50× Smaller

At 100 K writes/sec, raw uncompressed data would consume around 0.9 TB per day (each row ~100 bytes × 100K rows/sec × 86 400 sec ≈ 864 GB). That's clearly unsustainable: you'd fill a new 1 TB disk roughly every day. Time-series databases get away with much smaller storage budgets, often under 100 GB/day for the same load, because they exploit something unique about timestamped data: adjacent readings are nearly identical. A CPU at 44.1% one second ago is overwhelmingly likely to be 44.1% or 44.2% right now, and that near-sameness is the gift compression algorithms feast on.
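The back-of-envelope formula in parentheses is easy to check directly. The sketch below takes the ~100-byte row size and the 10–50× compression range as given and works the numbers through; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def raw_bytes_per_day(rows_per_sec: int, bytes_per_row: int) -> int:
    """Uncompressed bytes written per day at a steady ingest rate."""
    return rows_per_sec * bytes_per_row * SECONDS_PER_DAY


raw = raw_bytes_per_day(rows_per_sec=100_000, bytes_per_row=100)
raw_gb = raw / 1e9             # decimal gigabytes of raw data per day
compressed_best = raw_gb / 50  # at 50x compression
compressed_worst = raw_gb / 10 # at 10x compression
```

With these inputs the raw figure is 864 GB per day, and the compressed range is about 17 to 86 GB per day.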

If your CPU usage is 44.1% at 14:02:00 and 44.3% at 14:02:01, those two numbers share almost all their bits. A general-purpose row-oriented database stores them as two independent 8-byte floats. A TSDB stores them in a column together and encodes only the difference, a tiny fraction of the original data.

[Figure: Row-oriented vs. columnar storage for the same readings. Row-oriented (Postgres-style): timestamp, value, and tag interleaved per row, about 8 MB per day per series uncompressed. Columnar (TSDB-style): each column stored separately and compressed by type, with delta-of-delta for timestamps, Gorilla XOR for float values, and dictionary plus run-length encoding for tags, for about 80–320 KB per day per series (25–100× smaller in this example; 10–50× is typical).]

The Four Compression Techniques

Each column type has a different compression algorithm tuned to its specific patterns:

Delta Encoding

Instead of storing absolute values 1715256120, 1715256121, 1715256122, store the first value and then the differences: 1715256120, +1, +1, +1. The deltas are tiny integers instead of 10-digit Unix timestamps. Why this works: sensors send readings at regular intervals, so the delta between two consecutive timestamps is almost always the same small number, often just 1.
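A minimal sketch of delta encoding over integer timestamps. It shows only the transform, not the variable-width bit packing a real engine would apply to the small deltas afterwards; the function names are illustrative.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value as its difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    return list(accumulate(encoded))


timestamps = [1715256120, 1715256121, 1715256122, 1715256123]
encoded = delta_encode(timestamps)
```

The encoded list is the same length as the input; the saving comes from the deltas needing far fewer bits each than a full timestamp.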

Delta-of-Delta Encoding

Take delta one step further: if the delta between timestamps is always +1, +1, +1, +1, then the delta of the delta is always 0, 0, 0, 0. Storing a stream of zeros compresses to almost nothing. This is the key trick for Gorilla timestamps: a real-world column of evenly-spaced timestamps might compress to a handful of bytes regardless of how many rows there are.
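The second-order transform can be sketched the same way. This shows the arithmetic only; a Gorilla-style encoder would then write each zero as a single bit and use short variable-length codes for the rare non-zero entries. Function names are hypothetical.

```python
from itertools import accumulate


def delta_of_delta_encode(timestamps):
    """Encode as [first value, first delta, then the change in delta at each step]."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods


def delta_of_delta_decode(encoded):
    """Invert the encoding with two running sums: deltas first, then values."""
    if len(encoded) < 2:
        return list(encoded)
    first, first_delta, *dods = encoded
    deltas = list(accumulate([first_delta] + dods))
    return list(accumulate([first] + deltas))


regular = [1715256120 + i for i in range(6)]
encoded = delta_of_delta_encode(regular)
```

For evenly spaced timestamps everything after the first two entries is zero; a late or early reading shows up as a small non-zero pair rather than breaking the scheme.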

Gorilla / XOR Float Compression

Facebook's insight (2015, published from their in-memory TSDB): two consecutive floats from the same sensor are usually almost identical at the bit level. They share the same sign bit and the same exponent bits (the parts of the standard 64-bit float format that say "positive number, in roughly the same magnitude"). XOR the two values (a bit-level "what's different?" operation) and you get a result dominated by leading zeros. Encode only the meaningful (non-zero) bits. A repeated value compresses to 1 bit; a small change compresses to ~9 bits instead of 64. This is why InfluxDB and Prometheus use Gorilla-style compression for value columns.
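The XOR step can be inspected directly. This sketch only measures how many leading zero bits and how many meaningful bits the XOR of two doubles has; it does not implement the full Gorilla bitstream with its control bits. The helper names are assumptions for illustration.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret a Python float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_summary(prev: float, curr: float):
    """Return (leading_zero_bits, meaningful_bits) for prev XOR curr."""
    x = float_bits(prev) ^ float_bits(curr)
    if x == 0:
        return 64, 0  # identical values: nothing to store beyond a flag bit
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1  # zero bits below the lowest set bit
    return leading, 64 - leading - trailing
```

For example, xor_summary(12.0, 24.0) reports a single meaningful bit, because the two values differ only in the lowest exponent bit. Values whose decimal fractions are not exactly representable in binary can differ in more mantissa bits, which is one reason real-world ratios vary.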

Dictionary Encoding

Tag values like host="web1" or region="us-east" repeat millions of times. Rather than storing the full string every row, build a dictionary: "web1" → 3, "us-east" → 7. Store the integer ID instead of the string. This typically reduces tag column sizes by 10–100×. Combined with run-length encoding, a series where all rows have the same host tag compresses to a single (tag_id, count) pair.
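Both steps fit in a short sketch. The IDs here are assigned in order of first appearance, which is an arbitrary choice for this example; the function names are illustrative.

```python
def dictionary_encode(values):
    """Map each distinct string to a small integer ID, in order of first appearance."""
    dictionary = {}
    ids = []
    for v in values:
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, ids


def run_length_encode(ids):
    """Collapse consecutive repeats into (id, count) pairs."""
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return [tuple(r) for r in runs]


hosts = ["web1"] * 5 + ["web2"] * 2
dictionary, ids = dictionary_encode(hosts)
runs = run_length_encode(ids)
```

Seven strings become a two-entry dictionary plus two (id, count) pairs; a chunk holding a single series would collapse to one pair.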

Soft Numbers: Real-World Compression Ratios

Compression ratios vary widely depending on how "stable" your data is. Typical figures from TimescaleDB documentation suggest native compression achieves 90–96% reduction (roughly 10–25× smaller than uncompressed row storage) for most real IoT and metrics workloads. InfluxDB reports similar results. These numbers are realistic for well-behaved sensor data: highly chaotic data (random noise) will compress less; extremely stable data (CPU idle at exactly 0.0% for hours) may compress even more.

TSDBs achieve 10–50× compression by storing data in columns (not rows), then applying algorithm-specific compression per column type: delta-of-delta for timestamps (near-zero stream for regular intervals), Gorilla XOR for float values (similar readings share bits), and dictionary + run-length encoding for repeated tag strings.
Section 7

TimescaleDB: Postgres for Time-Series

If your team already knows Postgres, you don't have to abandon it to get serious time-series performance. TimescaleDB is a Postgres extension (not a fork, not a separate database) that bolts on time-series superpowers while keeping every Postgres feature you already rely on: full SQL, JOINs, foreign keys, row-level security, pg_dump, and every Postgres client library. The sweet spot is teams who want time-series write throughput without paying the operational cost of learning an entirely new system.

The fundamental trick TimescaleDB plays is called a hypertable. From your application's perspective, a hypertable looks exactly like a normal Postgres table. Under the hood, TimescaleDB automatically splits that table into smaller time-partitioned chunks. When you query "give me the last hour of data", Postgres only opens the one or two chunks that cover that time range; it never touches the other chunks. That's why time-range queries get fast: instead of scanning a table with 10 billion rows, you scan a chunk with 5 million.

[Figure: TimescaleDB hypertable abstraction. The application queries one logical table, metrics (time, host, value). TimescaleDB auto-partitions it into physical chunks hidden from the app: a hot chunk for the current day, compressed chunks for previous days (about 5M rows each), and old chunks dropped by the retention policy. The query planner knows each chunk's time range and scans only chunks that overlap the WHERE time range, so a "last hour" query on a 2-year table scans under 0.1% of rows.]

Four superpowers TimescaleDB adds to Postgres

Hypertables

The core abstraction. You create a normal table, then call create_hypertable('metrics', 'time') and TimescaleDB takes over partitioning. All your existing SQL (SELECT, INSERT, UPDATE, JOINs) keeps working. The time-partitioning is completely transparent to your queries.

Why it matters: chunk exclusion turns full-table scans into small window scans without you changing a single query.

Continuous Aggregates

A continuous aggregate is a materialized view that updates itself incrementally. Instead of re-computing "average CPU per 5-minute bucket for the last year" from scratch each time, TimescaleDB only re-processes the newly arrived time buckets and appends them to the cached result. Query the aggregate view and you get instant answers.

Why it matters: pre-aggregating from 1-second resolution to 5-minute resolution cuts dashboard query time from seconds to milliseconds.

Native Compression

TimescaleDB compresses older chunks column-by-column using delta-encoding and Gorilla compression, the same techniques purpose-built TSDBs use. Compression ratios of 90–96% are commonly reported, meaning a chunk that took 100 GB uncompressed may take 4–10 GB compressed. Queries on compressed chunks are also faster because less data moves off disk.

Compression is per-column because time-series values in the same column are highly similar: the differences between consecutive readings are tiny, and that's exactly what delta-encoding exploits.

Retention Policies

Instead of writing a cron job to DELETE old rows (which in Postgres generates extra WAL and leaves dead tuples behind for VACUUM to clean up), you set a policy: add_retention_policy('metrics', INTERVAL '30 days'). TimescaleDB drops entire old chunks as whole-file operations, which is fast, clean, and leaves no fragmentation. Old data disappears automatically.

Dropping a whole chunk file is orders of magnitude faster than row-level DELETEs, which is why purpose-built TSDBs all use chunk/segment-based storage.

Performance ballpark: TimescaleDB on commodity hardware can sustain roughly 100K–1M row inserts per second, depending on row width and hardware. Time-window queries (the most common pattern) scan only the relevant chunks, making them dramatically faster than equivalent queries on a plain Postgres table of the same total size. These are soft guidance numbers; your actual throughput will depend on your schema, hardware, and workload.
TimescaleDB extends Postgres with hypertables (automatic time-partitioned chunks), continuous aggregates (incremental materialized views), native compression (90-96% reduction), and retention policies (drop old chunks automatically). It lets teams keep Postgres expertise while gaining TSDB performance.
Section 8

InfluxDB: Purpose-Built TSDB

InfluxDB was one of the first widely-deployed databases built exclusively for time-series data. It launched around 2013 and gave a generation of engineers their first taste of a database that treats timestamps as first-class citizens. The project has gone through three major architectural versions, each one a bigger departure from the last, and understanding those versions helps explain why some production deployments are stuck on v1 and why v3 is exciting for new projects.

[Figure: InfluxDB evolution across three major versions. v1 (~2013–2020, Go): TSM storage engine, SQL-like InfluxQL, simple tag/field model, cardinality limit around 10M series, widely adopted. v2 (~2020–2022, Go): Flux query language, embedded UI and tasks, buckets replace databases, mixed Flux adoption. v3 / IOx (2022+, Rust rewrite): Apache Arrow and Parquet, native SQL, object-storage backend, addresses the cardinality limits of v1/v2.]

InfluxDB v1: The Classic

v1 introduced the TSM (Time-Structured Merge Tree) storage engine and InfluxQL, a query language deliberately similar to SQL so it felt familiar. You defined measurements (like table names), tags (indexed metadata), and fields (the actual numeric values). It was simple and effective, so effective that many teams deployed it around 2015–2020 and never needed to upgrade. It has a cardinality ceiling (roughly 10M unique series on commodity hardware) but that's fine for most use cases.

InfluxDB v2: The Big Rewrite

InfluxData tried to make v2 the future: a new Flux query language (a functional pipeline language, very different from SQL), an embedded web UI, and a new data model. The problem was Flux had a steep learning curve and v2 offered no easy migration path from v1. Teams found themselves having to rewrite all their queries. Many decided the v1 they had was good enough and stayed put. v2 saw real deployments but never achieved the universal adoption v1 had.

InfluxDB v3 / IOx: The Future

IOx is a complete rewrite from scratch in Rust, built around Apache Arrow for in-memory columnar processing and Parquet for on-disk storage. The decision to use an object-storage backend (S3-compatible) makes it cloud-native by default. Crucially, v3 brought back SQL as the primary query language, which addressed the biggest adoption barrier of v2. It also fundamentally changed how cardinality is handled, removing the hard ceiling that plagued v1 deployments at high label diversity.

Production reality: Many organisations have large InfluxDB v1 deployments in production today (2025) with no immediate plans to upgrade, because v1 is stable, well-understood, and "good enough." This is a very common pattern in infrastructure: if it ain't broke, don't break it. For new projects, v3 is the right choice.
InfluxDB pioneered purpose-built TSDB with its TSM storage engine (v1). v2 attempted a functional query language (Flux) but disrupted existing users. v3/IOx is a Rust rewrite using Apache Arrow and Parquet, restores SQL, and fixes the cardinality limits of earlier versions.
Section 9

Prometheus: Pull-Based Metrics

Prometheus is the de-facto standard for open-source metrics monitoring, especially in Kubernetes environments. It was built at SoundCloud starting around 2012 and later donated to the CNCF (Cloud Native Computing Foundation), where it became only the second graduated project after Kubernetes itself. If you've worked on any cloud-native stack in the last five years, you've almost certainly encountered Prometheus, or something that talks to it.

What makes Prometheus unusual is a design choice that sounds backwards at first: it pulls metrics from your applications instead of having them push. Your app exposes a simple HTTP endpoint at /metrics that lists all current metric values in a text format. Every 15–30 seconds, Prometheus visits each endpoint and collects those values. This "scraping" approach has a clever consequence: if an app goes down, Prometheus notices at the next scrape because the scrape fails. The monitoring system itself acts as a health check.
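To make the /metrics text format concrete, here is a minimal sketch that renders one metric the way a scrape target would serve it. The render_metric function and the sample metric are hypothetical, and label values are assumed to contain no quotes, backslashes, or newlines (a real client library escapes those). In practice you would use an official Prometheus client library rather than formatting this by hand.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Labels are sorted only to make the output deterministic.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [({"job": "api", "code": "200"}, 1027)],
)
```

Each scrape simply reads the current values; the application keeps no history, which is why the format can stay this simple.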

[Figure: Prometheus pull architecture. Applications and exporters (web app on :8080/metrics, node_exporter on :9100/metrics, postgres_exporter on :9187/metrics) are scraped every 15s by a single Prometheus binary containing the scrape engine with service discovery, a local TSDB (15-day default retention), the PromQL engine, and alert rules evaluated every 1m. Grafana queries it via PromQL; Alertmanager routes, groups, and silences alerts; Thanos or Mimir add long-term storage and HA. A single Prometheus handles about 1M active series.]

Prometheus Server

A single Go binary that does everything: service discovery (finds which apps to scrape from Kubernetes, Consul, or static config), scrapes each target's /metrics endpoint on schedule, stores the results in its built-in TSDB, and evaluates alert rules. No separate storage node needed for basic setups. The 15-day default retention trades long-term history for simplicity.

Exporters

Most systems don't natively expose a Prometheus /metrics endpoint. Exporters bridge that gap. node_exporter exposes Linux OS metrics (CPU, memory, disk). postgres_exporter translates Postgres internal statistics. blackbox_exporter probes HTTP/TCP endpoints. Each exporter is a small sidecar that speaks Prometheus's text format on behalf of the thing it monitors. There are hundreds of community exporters.

PromQL

Prometheus Query Language is a functional query language designed for time-series math. A query like rate(http_requests_total{job="api"}[5m]) means: "compute the per-second rate of HTTP requests for the api job, averaged over a 5-minute sliding window." PromQL handles label filtering, aggregation across label dimensions (sum by, avg by), and rate calculations that respect counter resets: all the math you need for monitoring dashboards.
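The phrase "rate calculations that respect counter resets" can be illustrated with a simplified sketch. This is not Prometheus's implementation: the real rate() also extrapolates to the edges of the window, which is omitted here, and simple_rate is a hypothetical helper name.

```python
def simple_rate(samples):
    """Per-second rate of a counter from (timestamp_seconds, value) samples.

    Simplified: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # A drop means the process restarted and the counter began
            # again at 0, so the whole current value is new increase.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# The counter resets between t=60 and t=120 (160 -> 20).
window = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(simple_rate(window))
```

Without the reset handling, the drop from 160 to 20 would show up as a large negative rate; treating it as a restart keeps the result meaningful.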

Alertmanager

Alert rules are PromQL expressions evaluated on a schedule. When an expression fires (e.g., CPU > 90% for 5 minutes), Prometheus sends the alert to Alertmanager. Alertmanager handles deduplication (don't page for the same alert twice), grouping (bundle 100 related alerts into one notification), routing (send database alerts to the DB team, app alerts to the app team), and silencing (suppress alerts during planned maintenance).

Scale limits: A single Prometheus instance can typically handle around 1 million active series on modest hardware. With well-tuned hardware and careful configuration, teams have pushed this to tens of millions. The limiting factor is cardinality (the number of unique time series), which we'll cover in depth in the next section. For production scale beyond a single instance, the ecosystem offers Thanos, Cortex, and Mimir as horizontal scaling layers on top of Prometheus.
Prometheus is the CNCF-graduated pull-based metrics system. It scrapes /metrics endpoints on a schedule, stores in a local TSDB, and queries via PromQL. Exporters bridge non-native systems. Alertmanager handles routing and deduplication. Single instance handles ~1M series; Thanos/Mimir add scale.
Section 10

Cardinality: The Hidden Killer

There's one concept that trips up almost every team deploying a time-series database for the first time, and it sounds deceptively simple: cardinality. Plain English first: cardinality is just "how many distinct lines could I draw on a graph?" Every line you'd plot is one series. In TSDB terms, a series is one unique combination of a metric name plus all its labels (the descriptive tags like host or region). Every distinct series gets its own slot in memory and on disk. The problem is that label combinations multiply. Add one new label with a thousand possible values and you just multiplied your series count by a thousand.

Let's make this concrete. Imagine you're tracking HTTP requests. You label them by HTTP method (GET, POST, PUT, DELETE: 4 values) and by response status code (200, 201, 400, 404, 500: 5 values). That's 4 × 5 = 20 unique series. Totally fine. Now you add a "path" label for your API routes, say 1,000 different endpoints. Now you have 4 × 5 × 1,000 = 20,000 series. Still manageable. Then someone suggests adding a user_id label so you can track per-user request rates. Your system has 1 million users. Suddenly: 4 × 5 × 1,000 × 1,000,000 = 20 billion series. Your TSDB collapses.
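The multiplication is easy to check in a few lines. This sketch computes the worst-case series count, assuming every label combination actually occurs; series_count is a hypothetical helper, and the user_tier label with 3 values is the bucketing fix discussed in this section.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


labels = {"method": 4, "status": 5}
print(series_count(labels))                      # method x status

labels["path"] = 1_000
print(series_count(labels))                      # ... x path

with_user_id = {**labels, "user_id": 1_000_000}
print(series_count(with_user_id))                # ... x user_id

with_user_tier = {**labels, "user_tier": 3}
print(series_count(with_user_tier))              # bucketed alternative
```

In practice not every combination appears, so real counts are lower, but the upper bound is what you have to plan memory for.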

[Diagram: how labels multiply into series. Step 1 (safe): method (4) × status (5) = 20 series. Step 2 (manageable): × path (1,000) = 20,000 series. Step 3 (disaster): × user_id (1,000,000) = 20 billion series. The fix: replace user_id (1M values) with user_tier (3 values), giving 20,000 × 3 = 60,000 series.]

Four mitigation strategies

Bucket High-Cardinality Dimensions

Instead of storing the raw high-cardinality value as a label, map it to a low-cardinality bucket. User IDs become user tiers (free, pro, enterprise โ€” 3 values). IP addresses become geographic regions (6 continents, maybe 50 countries). This loses per-user granularity but keeps the TSDB alive. If you need per-user data, that's what traces are for (see strategy 4).

Drop Labels at Scrape Time

Prometheus lets you strip labels from metrics before storing them using metric_relabel_configs. If an exporter exposes a high-cardinality label you don't actually need for your alerts or dashboards, configure Prometheus to drop it during the scrape. Prevention is easier than treatment: once a high-cardinality series is ingested, it's already in memory.

Use Histograms Instead of Labels

A common temptation is labeling each request with its exact latency: latency_ms="47", latency_ms="51", etc. That creates an unbounded label. Instead, use a Prometheus histogram that counts requests into a fixed set of buckets (for example ≤10ms, ≤50ms, ≤100ms, and +Inf; Prometheus buckets are cumulative). You can still estimate percentiles (p50, p95, p99) from a small fixed number of series, not millions, though the estimates are only as precise as the bucket boundaries.
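A rough sketch of how percentiles can be estimated from a handful of bucket counters follows. It mirrors the general idea behind PromQL's histogram_quantile (cumulative buckets plus linear interpolation inside the matching bucket) but is a simplification; the bucket bounds, function names, and latency values are assumptions made up for this example.

```python
import math

# Cumulative upper bounds in milliseconds (assumed for this example).
BOUNDS = [10.0, 50.0, 100.0, math.inf]


def build_histogram(latencies_ms, bounds=BOUNDS):
    """Count observations into cumulative buckets (value <= upper bound)."""
    counts = [0] * len(bounds)
    for value in latencies_ms:
        for i, upper in enumerate(bounds):
            if value <= upper:
                counts[i] += 1
    return counts


def estimate_quantile(q, counts, bounds=BOUNDS):
    """Estimate the q-quantile by linear interpolation within a bucket."""
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    idx = next(i for i, c in enumerate(counts) if c >= rank)
    if math.isinf(bounds[idx]):
        # Nothing is known above the last finite bound, so report that bound.
        return bounds[idx - 1]
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = counts[idx - 1] if idx > 0 else 0
    in_bucket = counts[idx] - below
    if in_bucket == 0:
        return lower
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket


latencies = [5, 7, 20, 30, 40, 60, 70, 80, 90, 200]
counts = build_histogram(latencies)
print(counts, estimate_quantile(0.5, counts), estimate_quantile(0.99, counts))
```

Ten requests or ten million, the histogram is still four series per label combination. The cost is precision: the 200 ms outlier is reported as "at least 100 ms" because no finite bucket covers it.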

Use Traces for High-Cardinality Data

The fundamental insight: metrics and traces serve different purposes. Metrics are aggregates, great for "what's the p95 latency for my API?" but wrong for "why was this specific request slow?" Use Jaeger, Tempo, or Zipkin for distributed tracing, which is designed to handle high-cardinality identifiers (request IDs, user IDs, session IDs) because traces are stored differently: as individual events, not as time series.

Warning: the #1 cause of TSDB outages. Cardinality explosions are the most common reason teams wake up to a dead Prometheus or InfluxDB. The failure mode is always the same: a well-intentioned engineer adds a label with high cardinality, series count multiplies, the process OOMs, and the monitoring system goes down exactly when you need it. Always cap unique label values. A rule of thumb: no single label should have more than a few thousand unique values in a well-run metrics system.
Each unique metric+label combination is a "series." Label values multiply: adding a user_id label with 1M users turns 20K series into 20 billion. Mitigations: bucket high-cardinality values, drop labels at scrape time, use histograms over per-value labels, and offload high-cardinality data to traces instead of metrics.
Section 11

Pull vs Push Models

When you're collecting metrics from your services, there are two fundamentally different directions data can flow. Think of it like checking on a friend: you can call them and ask how they're doing (pull), or they can text you whenever something changes (push). Both work; they just shift who initiates the conversation. In the pull model, the monitoring server reaches out to each application and asks "what are your current metrics?" In the push model, each application proactively sends its metrics to the database as they're generated. Both approaches run in production at scale, and the right choice depends on your infrastructure topology and operational preferences.

[Diagram: pull vs push. Pull model (Prometheus): the server scrapes each app's /metrics endpoint every 15–30s. Pros: server controls the rate, a down target is visible, service discovery. Cons: harder for ephemeral jobs, apps must expose HTTP. Push model (InfluxDB, Graphite, OpenTSDB): apps send data to the metrics database. Pros: works for ephemeral/batch jobs, no HTTP server needed in the app. Cons: the app controls the rate and may flood the database, and a down app is not detected implicitly.]

Pull (Prometheus)

The monitoring server drives the scrape schedule. This gives you central control: if a service goes rogue and starts updating its metrics 10× as often, Prometheus is not flooded, because it still scrapes at its own pace. (A service can still expose more series per scrape; the sample_limit scrape setting guards against that.) Service discovery integrates naturally: Kubernetes pod discovery automatically adds and removes scrape targets as pods come and go. The one pain point: ephemeral jobs (batch scripts, CI pipelines) may finish before Prometheus scrapes them, losing their metrics entirely. The Pushgateway solves this edge case.

Push (InfluxDB, Graphite, OpenTSDB)

The application sends its own data on its own schedule. This is natural for short-lived processes: a batch job can push its final metrics right before it exits without waiting for a scrape. The challenge is backpressure: if 10,000 apps all start pushing at once during an incident, the metrics database can be overwhelmed. Push systems usually require authentication and rate limiting to protect the ingestion endpoint.

Hybrid patterns that bridge both worlds

Pushgateway

Prometheus's own escape hatch for ephemeral jobs. A batch job pushes its final metrics to the Pushgateway (a small, long-running service deployed alongside Prometheus), then exits. Prometheus scrapes the Pushgateway on its normal schedule. The gateway holds the last-pushed values until they are overwritten or deleted. This lets short-lived jobs participate in a pull-based metrics system without changing the core architecture.
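As a sketch of what "a batch job pushes its final metrics" looks like on the wire, the snippet below builds (but does not send) an HTTP request for the Pushgateway's /metrics/job/... endpoint. The host name, job name, and metric names are hypothetical, and build_push_request is an illustrative helper; in practice most jobs would use the push helpers in an official Prometheus client library.

```python
from urllib.parse import quote
from urllib.request import Request


def build_push_request(base_url, job, metrics, instance=None):
    """Build a PUT request carrying metrics in the text exposition format."""
    path = f"/metrics/job/{quote(job, safe='')}"
    if instance:
        path += f"/instance/{quote(instance, safe='')}"
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return Request(
        base_url.rstrip("/") + path,
        data=body.encode("utf-8"),
        method="PUT",
    )


# Hypothetical gateway address and metrics for a nightly backup job.
req = build_push_request(
    "http://pushgateway.example.internal:9091",
    job="nightly_backup",
    instance="worker-1",
    metrics={
        "backup_duration_seconds": 42.5,
        "backup_last_success_timestamp_seconds": 1700000000,
    },
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

The job and instance path segments become grouping labels, which is how a later run of the same job overwrites its previous values.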

OpenTelemetry Collector

Applications push telemetry (metrics, traces, logs) to an OTel Collector using OTLP (OpenTelemetry Protocol). The Collector can then expose a Prometheus-compatible /metrics endpoint, letting Prometheus pull from it normally. This decouples your applications from your monitoring backend: you can swap Prometheus for a different TSDB without touching your application code.

Pull (Prometheus) lets the server control scrape rate and makes downtime visible through failed scrapes, but struggles with ephemeral jobs. Push (InfluxDB, Graphite) works naturally for short-lived processes but risks flooding the DB. Pushgateway and OpenTelemetry Collector bridge both models without forcing an all-or-nothing choice.
Section 12

ClickHouse for Time-Series

ClickHouse comes from Yandex, where it was built to power Yandex.Metrica (one of the world's largest web analytics platforms) and open-sourced in 2016. Here's the surprising thing: ClickHouse is not a time-series database. It's a columnar OLAP (Online Analytical Processing) database designed for analytical queries on large datasets. But its architecture turns out to be an excellent fit for time-series workloads, often outperforming purpose-built TSDBs for analytical access patterns. Understanding why reveals something important about what makes time-series storage hard.

The magic is in ClickHouse's MergeTree storage engine. Data is stored in column-per-file format, physically sorted by a primary key, which for time-series data typically includes the timestamp. Time-range queries become largely sequential reads over the relevant columns instead of scattered random lookups. Combined with aggressive per-column compression and the ability to pre-aggregate through materialized views and background merges, ClickHouse achieves TSDB-grade performance without being a TSDB.

[Diagram: ClickHouse MergeTree columnar sorted storage. Batched inserts are sorted on flush into data parts (for example 2025-05-09_1 and 2025-05-09_2), each holding separate ts, host, cpu_pct, and mem_mb column files. A background merge combines parts, with per-column compression (ts: ZSTD, host: dictionary, cpu_pct and mem_mb: LZ4). Tables are partitioned by toYYYYMM(ts), one directory per partition, which enables fast partition pruning. A query such as SELECT avg(cpu_pct) WHERE ts > now()-1h reads only the columns it needs from the relevant partitions.]

Why ClickHouse excels at time-series analytics

Four architectural features combine to deliver TSDB-grade performance from a general-purpose columnar engine. Each one has a reason it helps time-series specifically:

Columnar Storage

In a row-based database, fetching "the CPU percentage for the last hour" reads every column of every matching row, including host name, region, and tags you don't need. In ClickHouse, each column is a separate file. Fetching CPU percentages reads only the CPU column file. For time-series analytics (aggregations across a single metric over time), you're typically reading one or two columns out of dozens. Columnar storage turns a 50× data penalty into a near-minimum read.

Sort by Primary Key

Time-series data arrives roughly in time order. ClickHouse stores data sorted by primary key; for metrics tables the key typically combines the timestamp with a low-cardinality column such as host. A query for "last hour of data" becomes a seek to the right position and a sequential forward scan. No random I/O, no index lookups chasing pointers across pages. Sequential reads are much faster than random reads, on SSDs and especially on HDDs. This is the same insight behind LSM trees, which MergeTree resembles: write sorted runs, merge them in the background, and read them sequentially.
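The "seek, then scan forward" access pattern can be sketched with two parallel sorted lists standing in for column files. This is only an analogy for how a sorted layout helps, not ClickHouse's actual sparse index; range_scan and the sample data are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def range_scan(timestamps, values, start, end):
    """Return values whose timestamp lies in [start, end].

    timestamps must be sorted; two binary searches find the slice
    boundaries, then the matching values are read as one contiguous run.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return values[lo:hi]


# One reading every 10 seconds; cpu values are arbitrary demo numbers.
ts_column = list(range(0, 100, 10))
cpu_column = [t * 2 for t in ts_column]

print(range_scan(ts_column, cpu_column, 25, 60))
```

Because the data is sorted, the work is proportional to the size of the result, not the size of the table.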

Compression

Each column compresses independently. CPU percentage values in the same time window are highly correlated (they typically change by a few percent per second), so delta-style codecs achieve very high compression ratios. ClickHouse uses LZ4 by default and offers ZSTD for higher ratios at the cost of more CPU; specialised codecs such as Delta are chosen per column. String columns (hostnames, regions) can use dictionary encoding through the LowCardinality type. Typical metrics data compresses at 5–20× in ClickHouse, directly reducing disk cost and improving query speed by reducing I/O volume.

MergeTree Variants

The base MergeTree keeps all rows. Specialised variants pre-aggregate during the background merge process. SummingMergeTree automatically sums values with the same sorting key, collapsing many individual rows into one aggregate. AggregatingMergeTree stores partial aggregate states (e.g., partial sums + counts for averages) that are combined at merge time. This effectively moves downsampling work to the background, so queries read far fewer, already pre-aggregated rows. Because merges happen asynchronously, queries still finish the job with a GROUP BY and the matching -Merge functions, but over a much smaller table.
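Why store "partial aggregate states" rather than finished averages? A small sketch shows the reason: averages of averages are wrong, while (sum, count) states merge correctly in any order. AvgState is a hypothetical class written for this illustration and is not how ClickHouse represents its states internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Partial state for an average: enough to merge without raw rows."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        return AvgState(self.total + value, self.count + 1)

    def merge(self, other):
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        return self.total / self.count if self.count else float("nan")


def state_of(values):
    state = AvgState()
    for v in values:
        state = state.add(v)
    return state


part_a = state_of([10.0, 20.0, 30.0])   # one data part
part_b = state_of([50.0])               # another data part

merged = part_a.merge(part_b)
naive = (part_a.finalize() + part_b.finalize()) / 2
print(merged.finalize(), naive)
```

The merged state gives the true average over all four readings, while the naive average of the two per-part averages over-weights the part with a single row.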

Real-world use cases

Observability at Scale

Several large-scale observability platforms use ClickHouse or ClickHouse-derived storage for metrics and logs at the ingestion tier. The combination of high write throughput and fast analytical queries makes it attractive when you need to support both real-time dashboards and historical analysis across months or years of data. Cloudflare uses ClickHouse for analytics workloads. GitLab uses it for analytics features inside GitLab itself.

Logs at Scale

Logs are time-series data with a string payload. ClickHouse handles log analytics well as an alternative to the ELK (Elasticsearch, Logstash, Kibana) stack, particularly when you have very high log volume and need cost-efficient storage with fast analytical queries. ELK's inverted index trades storage efficiency for full-text search capability; if you don't need arbitrary full-text search and can filter by structured fields, ClickHouse is typically cheaper and faster for the common "show me errors from service X in the last 2 hours" query pattern.

Performance ballpark: ClickHouse can scan billions of rows per second on a single server for analytical queries, and is commonly cited as 5–10× faster than Postgres on OLAP workloads, depending on the query. These are rough guidance numbers from benchmarks; your actual performance depends on schema design, hardware, and query patterns. The key point is that columnar + sorted + compressed storage is very well suited to time-range aggregation queries.
ClickHouse is a columnar OLAP database that excels at time-series analytics despite not being a TSDB. MergeTree stores data sorted by timestamp in per-column files with per-column compression. Time-range queries become sequential reads on minimal data. SummingMergeTree and AggregatingMergeTree pre-aggregate during background merges. Used for observability, real-time analytics, and log analytics at scale.
Section 13

Continuous Aggregates & Downsampling

Pre-compute aggregates over time windows so queries hit the right resolution instead of scanning billions of raw points. Combined with retention policies, this keeps storage nearly flat over time.

Imagine you store one CPU reading every second. After a year that's around 31 million rows per machine. Ask "what was the average CPU last Tuesday?" and the database has to scan ~86 400 rows just for that one day, on one machine. Scale to 1 000 servers and a query that should feel instant takes seconds, because the database is doing the same arithmetic over and over again, every time someone refreshes a dashboard.

The fix is to do the arithmetic once and remember the answer. The technical name is a continuous aggregate: a saved summary the database keeps automatically up to date as new data arrives. The closely related idea, downsampling, just means storing those summaries at coarser time windows so the saved data shrinks. Raw 1-second data stays for a day or two. After that, you only need 1-minute averages. After a week, 5-minute averages are fine. After a month, 1-hour rollups are enough. This is exactly how your phone's Health app works: it keeps every step count for the last day, but shows you weekly bar charts that already summarise earlier data.
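Stripped of any particular database, downsampling is just "group by time bucket, then aggregate." The sketch below shows that core step on an in-memory list; the downsample function and the synthetic readings are assumptions for illustration, and a real TSDB does this incrementally rather than over the whole dataset.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (unix_ts, value) points into fixed-width time buckets.

    Returns {bucket_start_ts: average_value}, ordered by bucket start.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in points:
        bucket = ts - ts % bucket_seconds   # floor to the bucket start
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {b: total / n for b, (total, n) in sorted(sums.items())}


# Two minutes of synthetic 1-second readings.
raw = [(t, float(t % 60)) for t in range(120)]
rollup = downsample(raw, 60)
print(len(raw), "raw rows ->", len(rollup), "rollup rows:", rollup)
```

Going from 1-second to 1-minute resolution shrinks the row count 60-fold, which is where the storage and query-time savings come from.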

[Diagram: data resolution pyramid; older data is kept at lower resolution. Raw 1-second data, last day only: ~86 400 rows/series (full fidelity, biggest cost). 1-minute rollups, last week: ~10 080 rows/series. 5-minute rollups, last month: ~8 640 rows/series. 1-hour rollups, last year: ~8 760 rows/series (tiny footprint).]

Four Ways to Implement Continuous Aggregates

TimescaleDB Continuous Aggregates

TimescaleDB lets you declare a CREATE MATERIALIZED VIEW … WITH (timescaledb.continuous). The database incrementally refreshes only the buckets that have new data, so you don't reprocess everything, which is why refreshes are cheap even on large tables.

InfluxDB Tasks / Continuous Queries

InfluxDB v2 uses Tasks: scheduled Flux scripts that run on a cron-like schedule and write rollup results back to a new bucket. InfluxDB v1 called these Continuous Queries. Either way the idea is: every N minutes, compute averages for the last N-minute window and write them to a separate measurement.

Prometheus Recording Rules

Prometheus recording rules let you pre-compute expensive PromQL expressions (like per-job request rate) and store them as new synthetic metrics. Dashboards query the recording rule instead of re-computing the aggregation on the fly, which is often 10–100× faster for complex range queries.

ClickHouse Materialized Views

ClickHouse materialized views fire on every INSERT: data lands in the source table and is simultaneously aggregated into the view's target table. No scheduled jobs needed. This makes ClickHouse a natural fit for real-time rollup pipelines at very high write throughput.

Implementation Examples

-- Step 1: Create the continuous aggregate (1-minute CPU averages)
CREATE MATERIALIZED VIEW cpu_1min
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 minute', time) AS bucket,
  host,
  AVG(cpu_pct)                  AS avg_cpu,
  MAX(cpu_pct)                  AS max_cpu
FROM metrics
GROUP BY bucket, host;

-- Step 2: Set an automatic refresh policy
-- Refresh the last 1 hour of data every 30 seconds
SELECT add_continuous_aggregate_policy('cpu_1min',
  start_offset => INTERVAL '1 hour',
  end_offset   => INTERVAL '30 seconds',
  schedule_interval => INTERVAL '30 seconds'
);

-- Queries now hit the materialised view, not the raw table
SELECT bucket, host, avg_cpu
FROM cpu_1min
WHERE bucket > NOW() - INTERVAL '7 days'
  AND host = 'web-01'
ORDER BY bucket;

# Prometheus rules file (e.g. rules.yml, loaded via rule_files in prometheus.yml)
# Recording rules pre-compute expensive aggregations.
# Name convention: level:metric:operation
groups:
  - name: cpu_rollups
    interval: 1m          # evaluate every minute
    rules:
      # 1-minute per-instance CPU usage rate
      - record: instance:cpu_usage:rate1m
        expr: |
          100 - (
            avg by (instance) (
              rate(node_cpu_seconds_total{mode="idle"}[1m])
            ) * 100
          )

      # 5-minute job-level p99 request latency
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

# Dashboards query instance:cpu_usage:rate1m instead of
# recomputing the full expression on every dashboard refresh.

-- ClickHouse source table: raw 1-second metrics
CREATE TABLE metrics_raw (
  ts       DateTime,
  host     LowCardinality(String),
  cpu_pct  Float32
) ENGINE = MergeTree()
  ORDER BY (host, ts);

-- Target table: 1-minute rollups (AggregatingMergeTree)
CREATE TABLE cpu_1min (
  bucket   DateTime,
  host     LowCardinality(String),
  avg_cpu  AggregateFunction(avg, Float32),
  max_cpu  AggregateFunction(max, Float32)
) ENGINE = AggregatingMergeTree()
  ORDER BY (host, bucket);

-- Materialized view fires on every INSERT into metrics_raw
CREATE MATERIALIZED VIEW cpu_1min_mv TO cpu_1min AS
SELECT
  toStartOfMinute(ts) AS bucket,
  host,
  avgState(cpu_pct)   AS avg_cpu,
  maxState(cpu_pct)   AS max_cpu
FROM metrics_raw
GROUP BY bucket, host;

-- Query using Merge functions to finalise the aggregation state
SELECT
  bucket,
  host,
  avgMerge(avg_cpu) AS avg_cpu,
  maxMerge(max_cpu) AS max_cpu
FROM cpu_1min
WHERE bucket > NOW() - toIntervalDay(7)
GROUP BY bucket, host
ORDER BY bucket;
The storage constant trick: Downsampling + retention together mean your disk usage stays roughly constant no matter how long you run. Keep raw data for 30 days, then only the 1-hour rollup. A year of 1-hour rollups is only 8 760 rows per series, which is tiny next to the roughly 2.6 million raw rows a single series produces in those 30 days. You get long-term historical retention while storage grows by only a few thousand rows per series per year, instead of tens of millions.
Section 14

Retention Policies

Automate data expiry by dropping entire time-partitioned chunks instead of issuing slow row-level DELETEs. Match the retention window to the data's value: raw data for days, rollups for years.

Without a plan, a time-series database grows forever. Every second that passes adds more rows. After six months a busy deployment can have tables with hundreds of billions of rows, most of which nobody will ever query. The older data is, the less valuable it usually is: an ops team cares about the last 24 hours far more than what happened 400 days ago.

The correct approach is retention policies: a rule that says "delete anything older than X days". But the key insight is how to delete. A traditional DELETE FROM metrics WHERE time < now() - interval '30 days' has to find every matching row, update indexes, and write tombstones, so it can be slower than your incoming writes. TSDBs avoid this by storing data in chunks (one chunk per time window). To expire data, you just drop the whole chunk, which is about as fast as deleting a folder.
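The difference between row-level deletes and chunk drops can be sketched with a dictionary keyed by chunk start time. This is a toy model, not any database's storage engine: the 7-day chunk width, the drop_expired helper, and the dates are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

CHUNK_WIDTH = timedelta(days=7)   # assumed chunk size for this sketch


def drop_expired(chunks, now, retention):
    """Remove chunks whose whole time range is older than the retention window.

    chunks maps each chunk's start time to its rows. Returns the start
    times of the chunks that were dropped.
    """
    cutoff = now - retention
    expired = [start for start in chunks if start + CHUNK_WIDTH <= cutoff]
    for start in expired:
        # One operation per chunk, regardless of how many rows it holds.
        del chunks[start]
    return expired


now = datetime(2025, 5, 9, tzinfo=timezone.utc)
# Sixteen weekly chunks, each standing in for millions of rows.
chunks = {now - CHUNK_WIDTH * k: f"rows for week -{k}" for k in range(1, 17)}

dropped = drop_expired(chunks, now, retention=timedelta(days=90))
print(len(dropped), "chunks dropped,", len(chunks), "retained")
```

Note that a chunk is only dropped once its entire range is past the boundary, so data can outlive the retention window by up to one chunk width.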

[Diagram: retention policy. Chunks on a timeline from today back past a 90-day boundary; chunks newer than the boundary are retained, and chunks entirely older than it (for example -97, -104, -111 days) are dropped instantly by deleting the whole chunk file.]

Four Retention Strategies

Time-Based Retention

The simplest policy: drop any chunk whose entire time range is older than a threshold. "Delete everything older than 90 days." This is the default in almost every TSDB and works for most use cases because data value decays with age.

Size-Based Retention

Drop the oldest chunks until total disk usage falls below a cap (e.g. keep at most 1 TB). Useful when you have a fixed budget and write rate is unpredictable: storage cost is capped regardless of how much data flows in. Prometheus supports this natively through its --storage.tsdb.retention.size flag; InfluxDB does not, so a size cap there has to be enforced with your own tooling.

Tiered Retention

Different tables/buckets hold different resolutions with different TTLs. Raw data lives 7 days. 1-minute rollups live 30 days. 1-hour rollups live 1 year. Daily rollups live forever. Each tier is a separate storage policy, so you never delete the rollups when you drop raw data.

Per-Tenant Retention

In a SaaS or multi-tenant environment different customers may have different SLAs. Enterprise customers keep 2 years; free-tier customers keep 30 days. Most TSDBs support per-table or per-bucket retention, so you isolate tenant data and apply different policies without running separate databases.

Configuration Examples

-- Add a retention policy: drop chunks older than 30 days
-- TimescaleDB drops entire chunk files, so the I/O cost is near zero
SELECT add_retention_policy(
  'metrics',                -- hypertable name
  INTERVAL '30 days'        -- drop chunks whose end time is older than this
);

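-- Tiered retention: raw 7 days, 1-min rollup 90 days, 1-hour rollup 365 days.
-- Use this instead of the 30-day policy above: a hypertable accepts only one
-- retention policy, so a second add_retention_policy('metrics', ...) call fails.
SELECT add_retention_policy('metrics',       INTERVAL '7 days');
SELECT add_retention_policy('metrics_1min',  INTERVAL '90 days');
SELECT add_retention_policy('metrics_1hour', INTERVAL '365 days');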

-- Check scheduled policies
SELECT * FROM timescaledb_information.jobs
WHERE proc_name = 'policy_retention';
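
# InfluxDB v2: set retention on an existing bucket (via CLI).
# "influx bucket update" selects the bucket by ID; --name would rename it.
# BUCKET_ID is a placeholder; look the ID up with "influx bucket list".
influx bucket update \
  --id "$BUCKET_ID" \
  --retention 720h   # 30 days in hours (0 = infinite)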

# Create a separate bucket for 1-year rollups
influx bucket create \
  --name rollup_1hour \
  --retention 8760h  # 365 days

# InfluxDB v1: retention policy on a database
# (legacy InfluxQL syntax, run from the v1 influx shell)
CREATE RETENTION POLICY "30d" ON "telegraf"
  DURATION 30d REPLICATION 1 DEFAULT;

# Size-based cap is not a native InfluxDB feature;
# use Flux tasks to enforce it via monitoring bucket sizes.

# Prometheus CLI flags: retention configuration

# Time-based: keep 90 days of data
# (set as a startup flag, not in prometheus.yml)
# --storage.tsdb.retention.time=90d

# Size-based: keep at most 50 GB
# --storage.tsdb.retention.size=50GB

# Both flags can be combined; whichever limit is hit first wins
# --storage.tsdb.retention.time=90d
# --storage.tsdb.retention.size=50GB

# Docker Compose example
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=90d"
      - "--storage.tsdb.retention.size=50GB"
      - "--storage.tsdb.path=/prometheus"
Back-of-envelope math: 100 000 series each emitting one reading per second = 100 000 rows/sec ≈ 8.6 billion rows per day. At roughly 30 bytes/row after compression that's around 260 GB per day, or about 7.8 TB for 30 days. Downsample to 1-minute resolution after 7 days and each later day holds only ~144 million rollup rows (100 000 series × 1 440 minutes), about 4.3 GB. A 30-day window then costs roughly 1.8 TB of raw data plus 0.1 TB of rollups, about 1.9 TB in total: a 4× saving. The saving grows with the window, because each extra day costs about 4 GB instead of 260 GB.
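The arithmetic is worth running rather than trusting. This sketch uses the same assumptions as the scenario above (100 000 series, one reading per second, about 30 bytes per stored row, 1-minute rollups once raw data ages out) and further assumes that rollup rows cost the same 30 bytes; storage_bytes is a hypothetical helper.

```python
SERIES = 100_000
BYTES_PER_ROW = 30            # assumed on-disk size per row
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440


def storage_bytes(total_days, raw_days):
    """Bytes needed to keep total_days of history.

    The newest raw_days are kept at 1-second resolution; anything older
    is kept only as 1-minute rollups.
    """
    raw_rows = SERIES * SECONDS_PER_DAY * min(raw_days, total_days)
    rollup_rows = SERIES * MINUTES_PER_DAY * max(total_days - raw_days, 0)
    return (raw_rows + rollup_rows) * BYTES_PER_ROW


all_raw = storage_bytes(30, raw_days=30)
tiered = storage_bytes(30, raw_days=7)

print(f"30 days, all raw: {all_raw / 1e12:.2f} TB")
print(f"30 days, 7 raw + rollups: {tiered / 1e12:.2f} TB")
print(f"saving: {all_raw / tiered:.1f}x")
```

Changing total_days to 365 shows how the gap widens with longer history, since only the rollup term keeps growing.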
Section 15

Performance & Scaling

Write throughput, query latency, cardinality limits, and compression ratio are the four dials that matter for TSDB performance. Each has a practical ceiling and a set of well-known techniques to push it.

When engineers say a TSDB is "slow," they almost always mean one of four things: writes are backing up, dashboard queries are taking seconds, they've hit a cardinality ceiling, or disk is growing faster than expected. Each problem has a different root cause and a different fix, which is why it's worth understanding the mechanics behind each lever.

[Diagram: key TSDB performance metrics to watch. Write rate (rows per second; improve with batches of 1K–10K rows), query p99 for a 7-day window (improve with continuous aggregates), active series (unique label combinations; audit high-cardinality tags), and compression ratio (raw vs compressed size; Gorilla and delta-of-delta encoding). Ballpark figures, heavily dependent on hardware, schema, and query patterns: TimescaleDB ~100K–1M writes/sec, Prometheus ~1M active series, ClickHouse ~1 billion rows/sec scanned.]

Five Performance Levers

Batch Writes

TSDBs are optimised for bulk ingestion. Sending one row per HTTP request burns TCP handshakes, serialisation overhead, and network round-trips. Send 1 000–10 000 data points per request instead. Most client libraries support automatic batching, so enable it. A well-batched write pipeline can be 20–50× faster than unbatched.
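If a client library lacks built-in batching, the grouping step itself is small. The sketch below only splits a stream of points into fixed-size batches; in_batches is a hypothetical helper, and the actual write call (and any flush-on-timeout logic) is left out because it depends on the database client.

```python
from itertools import islice


def in_batches(points, size):
    """Yield lists of up to `size` points from any iterable."""
    iterator = iter(points)
    while batch := list(islice(iterator, size)):
        yield batch


# 2 500 synthetic data points, written as three requests instead of 2 500.
points = ((f"cpu,host=web-01 value={i}", i) for i in range(2_500))
batch_sizes = [len(batch) for batch in in_batches(points, 1_000)]
print(batch_sizes)
```

A production pipeline would typically also flush a partial batch after a short time limit so that low-traffic periods do not delay data.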

Index Strategy

TSDBs index on time (always) plus a small number of low-cardinality tag columns (host, region, service). Adding high-cardinality fields to indexes, like user IDs or request IDs, causes cardinality explosion. The rule of thumb: only tag columns whose values come from a bounded set (hundreds to thousands of unique values, not millions).

Compression Algorithm

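Floating-point metrics (CPU%, temperature, latency) compress very well with Gorilla-style XOR encoding: each value is XORed with the previous one, and because neighbouring readings share most of their bits, only a few bits need to be stored. Timestamps compress with delta encoding (store only the difference between consecutive values) or, for regularly spaced samples, delta-of-delta encoding (store the change in that difference, which is usually zero). On a columnar layout, general-purpose compressors like LZ4 or Zstandard then give another 2–5× on top. The result is often a 10–20× raw-to-compressed ratio.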
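To see why these encodings work so well on metrics, the sketch below computes the intermediate values they produce: delta-of-delta for timestamps and XOR-with-previous for floats. It stops short of the bit-packing that a real Gorilla-style encoder performs, and both function names and the sample data are illustrative assumptions.

```python
import struct


def delta_of_delta(timestamps):
    """Second-order differences; all zeros for perfectly regular intervals."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def xor_with_previous(values):
    """XOR each float's 64-bit pattern with its predecessor's."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


# Scrapes every 10 s, with one sample arriving a second late.
ts = [1000, 1010, 1020, 1030, 1041, 1050]
print(delta_of_delta(ts))

# A slowly changing gauge: identical readings XOR to zero.
cpu = [42.0, 42.0, 42.5]
print([hex(x) for x in xor_with_previous(cpu)])
```

Most outputs are zero or have only a few significant bits, which is exactly what a variable-length bit encoding can store in one or two bits instead of sixty-four.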

Cardinality Limits

Each TSDB has a practical ceiling on active series (unique combinations of metric name + label values). Prometheus starts struggling around 1–2 million active series; memory grows roughly proportionally. Monitor your cardinality with prometheus_tsdb_head_series (the number of series currently held in the in-memory head block) and alert before hitting limits, not after.

Cache Hot Windows

The last few hours of data are almost always in RAM, because the OS page cache keeps recently written chunks warm. Dashboard queries for "last 1 hour" or "last 6 hours" are therefore usually fast, since they rarely touch disk. Queries for data from six months ago are typically slower because they read and decompress cold chunks from disk.

Section 16

Observability Stack Patterns

Modern observability combines three distinct data types (metrics, logs, and traces), each stored in a specialised backend. A time-series database is the foundation of the metrics pillar.

When your application is misbehaving, you need three different kinds of evidence to understand why. Metrics tell you that something is wrong: CPU is 95%, error rate spiked. Logs tell you what happened: "NullPointerException in UserService line 42". Traces tell you where time was spent: this request took 800ms because the database call alone took 750ms. Together these three are called the three pillars of observability.

Each pillar has different storage needs. Metrics are numeric time series, perfect for a TSDB. Logs are semi-structured text, better in Elasticsearch or Loki. Traces are directed acyclic graphs of spans, stored in Jaeger, Zipkin, or Tempo. Trying to cram all three into one database usually means compromising on at least one of them.

[Diagram: the observability triad. Your application emits metrics, logs, and traces. Metrics (numeric time series in a TSDB): Prometheus, VictoriaMetrics, Mimir, M3, InfluxDB. Logs (text search): Loki, Elasticsearch, Splunk, OpenSearch, Datadog Logs. Traces (span trees): Jaeger, Zipkin, Tempo, Datadog APM, New Relic. Grafana provides unified dashboards and alerts on top.]

Five Stack Components

Metrics

The backbone of any monitoring setup. Prometheus is the de facto standard in cloud-native environments: it uses a pull model, scraping metrics endpoints every 15–60 seconds. VictoriaMetrics and Grafana Mimir are Prometheus-compatible alternatives that scale horizontally for very high cardinality or multi-tenant setups.

Logs

Grafana Loki stores logs as compressed blobs indexed only by labels (not by content), which makes it cheap to run and good for label-filtered queries. Elasticsearch / OpenSearch index every token, which is expensive to run but powerful for full-text search. Splunk and Datadog are the commercial heavyweights with advanced analytics.

Traces

A trace is a tree of spans, where each span records one operation (a DB call, an HTTP request, a function). Jaeger (CNCF) and Grafana Tempo are the open-source standards. Tempo is particularly efficient because it stores raw trace data in object storage (S3/GCS) with minimal indexing, keeping costs low.

Visualisation

Grafana has become the universal visualisation layer: it can query Prometheus, Loki, Tempo, Elasticsearch, InfluxDB, ClickHouse, and many more from a single dashboard. Kibana serves a similar role for the Elastic ecosystem. In commercial stacks, Datadog and New Relic provide integrated dashboards out of the box.

Alerting

Alertmanager is the standard Prometheus alerting router: it deduplicates, groups, and routes alerts to Slack, PagerDuty, OpsGenie, and email. Grafana has a built-in alerting engine that can fire on any data source query. Commercial tools like PagerDuty add on-call scheduling, escalation policies, and incident management on top.

OpenTelemetry is the emerging standard: instead of each language SDK sending data directly to Prometheus, Jaeger, and Loki separately, your code instruments once with the OpenTelemetry SDK and sends everything to an OTel Collector, which routes metrics to your TSDB, logs to Loki, and traces to Tempo. One instrumentation point, backend-agnostic. Many new projects start with OTel today.
Section 17

Cloud-Managed Time-Series

Most teams offload TSDB operations to cloud-managed services. Six options cover different trade-offs between control, cost, and ecosystem lock-in.

Running your own Prometheus cluster sounds fine until it's 2 AM and your TSDB has run out of memory, taking your entire monitoring stack offline. Managed services push that operational burden to someone else. The trade-off is cost and lock-in: managed services bill per query, per series, or per GB, which can surprise you at scale.

[Figure: Cloud TSDB Decision Guide. Heavy AWS shop (all infra on AWS, IAM integration matters) → Amazon Timestream (serverless, pay-per-query, tight AWS integration). Prometheus team (existing PromQL expertise, open-source preference) → Grafana Cloud (managed Prometheus + Loki + Tempo bundle). Turn-key commercial observability (everything in one product) → Datadog (metrics, logs, APM, dashboards, alerts). Other managed options: InfluxDB Cloud (InfluxData-hosted v2/v3), Google Cloud Monitoring (GCP-native TSDB), Azure Monitor (Microsoft stack + KQL). Choose by ecosystem: GCP shop → Google Cloud Monitoring; Azure shop → Azure Monitor; mixed → Grafana Cloud; InfluxDB users migrating from self-hosted → InfluxDB Cloud.]

Six Managed Services

Amazon Timestream

AWS's fully managed time-series service. It is serverless, so you don't provision nodes. It automatically moves data between an in-memory store (recent data) and a magnetic store (historical data). Writes, queries, and storage are priced separately. Best fit when you're already deep in AWS and want tight IAM + CloudWatch integration without running your own Prometheus.

Google Cloud Monitoring

Google's built-in monitoring service, historically called Stackdriver. Built on Google's internal time-series infrastructure. Best for GCP-native shops: it automatically ingests metrics from GCE, GKE, Cloud Run, and other GCP services with zero setup. MQL (Monitoring Query Language) is Google's query language for it.

Azure Monitor / Log Analytics

Microsoft's unified observability stack. Metrics go to Azure Monitor, logs go to a Log Analytics workspace, and both are queryable via KQL (Kusto Query Language). It integrates directly with Azure resources: VMs, AKS, Functions, and App Service all emit metrics automatically. It also supports OpenTelemetry ingestion via the Azure Monitor OpenTelemetry Distro.

Datadog Metrics

Datadog is a fully managed observability platform that bundles metrics (including ingestion from Prometheus/OpenMetrics endpoints), logs, traces, dashboards, and alerting under one roof. It has high adoption in enterprise. The pricing model is per host plus per custom metric, which can get expensive fast at scale. The benefit is seamless correlation between metrics, logs, and traces in one UI.

InfluxDB Cloud

InfluxData's managed version of InfluxDB. Available on AWS, GCP, and Azure. Runs the same Flux query language and line protocol as self-hosted InfluxDB v2, making migration straightforward. InfluxDB Cloud Serverless (v3) uses Apache Arrow + Parquet under the hood for columnar storage.

Grafana Cloud

A popular choice for teams already using the Prometheus + Grafana stack. Grafana Cloud bundles managed Prometheus (backed by Cortex/Mimir), Loki for logs, and Tempo for traces. Generous free tier. The Prometheus-compatible API means you can point an existing Prometheus remote_write config at Grafana Cloud with minimal changes.

Cost trap: Cloud TSDBs bill on dimensions that scale non-linearly. High cardinality (many unique label combinations) and long retention can produce surprisingly large bills. Always run the vendor's cost estimator before committing. Datadog in particular can become one of your top infrastructure costs at scale. Consider self-hosted VictoriaMetrics or Mimir for very high-volume workloads where managed costs become significant.
Section 18

Use Cases & Patterns

Time-series databases appear wherever something is measured repeatedly over time. Six canonical use cases cover nearly every production TSDB deployment you'll encounter.

The common thread across all TSDB use cases is the same pattern: something generates a number at regular intervals, that number needs to be stored cheaply and queried by time range efficiently, and old data gradually becomes less valuable. If your problem fits that shape, it's a time-series workload.

[Figure: Six Use Cases and what makes each one a time-series pattern. Server / infra monitoring: CPU/mem every 15 s, retention 30–90 d, 1K–1M series. APM: request rate, error rate, p99 latency; high-cardinality labels. IoT telemetry: millions of devices, irregular intervals, edge batching. Financial market data: tick-by-tick quotes, microsecond precision, never deleted. Real-user monitoring (RUM): page load times, clicks; bursty, user-scoped tags. Industrial OT (SCADA): factory sensors, regulatory retention (7–25 yrs), OSIsoft PI.]

Six Production Use Cases

Server & Infrastructure Monitoring

The most common TSDB use case. Collect CPU, memory, disk I/O, network throughput, and process counts from every server every 15–60 seconds. Prometheus + Grafana is the standard open-source stack. The cardinality is manageable (one series per host per metric), retention is typically 30–90 days, and alerting on threshold breaches is straightforward.

APM: Application Performance Monitoring

Track request rate (how many requests per second), error rate (what fraction fail), and latency percentiles (p50, p95, p99) per service, endpoint, and environment. This is the RED method (Rate, Errors, Duration). APM metrics tend to have higher cardinality than infrastructure metrics because each endpoint or user journey is a separate dimension.
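To make the RED method concrete, here is a minimal sketch that derives rate, error ratio, and latency percentiles from a list of raw request records. The function names, the `(duration_ms, ok)` record shape, and the nearest-rank percentile definition are assumptions chosen for illustration; production systems usually compute these from pre-aggregated histograms rather than raw records.

```python
import math
from typing import Dict, List, Tuple

Request = Tuple[float, bool]  # (duration in ms, succeeded?) -- hypothetical shape


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one service over one time window."""
    durations = sorted(duration for duration, _ in requests)
    failures = sum(1 for _, ok in requests if not ok)
    return {
        "rate_per_sec": len(requests) / window_seconds,
        "error_ratio": failures / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 100 requests in a 10-second window; every 10th request failed.
sample = [(float(ms), ms % 10 != 0) for ms in range(1, 101)]
print(red_summary(sample, window_seconds=10.0))
```

Each extra dimension (endpoint, environment) means running this summary once per label combination, which is where APM cardinality comes from.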

IoT Telemetry

Smart meters, environmental sensors, connected vehicles, and industrial equipment all generate streams of readings. The challenge here is scale (millions of devices) and irregular intervals (a sensor might drop off and reconnect). TimescaleDB and InfluxDB are popular choices. Edge devices often batch locally and send in bursts to avoid connectivity costs.

Financial Market Data

Stock exchanges generate tick-by-tick price and volume data at microsecond resolution. This is some of the most demanding time-series work: high write throughput, microsecond precision, regulatory requirements that often mandate indefinite retention, and complex analytical queries (OHLCV aggregations, rolling averages, correlation analysis). Kdb+ (from KX Systems) is the specialist database built for this use case, though ClickHouse is also used.
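An OHLCV aggregation (open, high, low, close, volume) is essentially a time-bucketed GROUP BY. The sketch below shows the idea on an in-memory list of ticks. The tick tuple layout and the function name are illustrative assumptions, and ticks sharing an identical timestamp are ordered arbitrarily here, which a real market-data pipeline would need to resolve with a sequence number.

```python
from typing import Dict, List, Tuple

Tick = Tuple[float, float, float]  # (unix seconds, price, volume) -- hypothetical shape


def ohlcv(ticks: List[Tick], bucket_seconds: int) -> Dict[int, Dict[str, float]]:
    """Group ticks into fixed-width time buckets and build one bar per bucket."""
    bars: Dict[int, Dict[str, float]] = {}
    for ts, price, volume in sorted(ticks):
        start = int(ts // bucket_seconds) * bucket_seconds  # bucket start time
        bar = bars.get(start)
        if bar is None:
            # First tick in the bucket sets the open and seeds everything else.
            bars[start] = {"open": price, "high": price, "low": price,
                           "close": price, "volume": volume}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last tick seen so far in this bucket
            bar["volume"] += volume
    return bars


ticks = [(0.5, 100.0, 10.0), (20.0, 102.0, 5.0), (59.9, 99.0, 1.0), (60.0, 101.0, 2.0)]
for start, bar in ohlcv(ticks, bucket_seconds=60).items():
    print(start, bar)
```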

Real-User Monitoring (RUM)

Measure the experience of real users in production: page load times, time-to-first-byte, JavaScript errors, click events, and funnel drop-offs. Traffic is bursty (weekday business hours vs. weekends). Each event is tagged with browser, country, and device type, which pushes cardinality up. Datadog RUM and similar tools store this in managed TSDBs with pre-built dashboards.

Industrial OT (Operational Technology)

Factory assembly lines, oil refineries, power plants, and water treatment facilities all run SCADA (Supervisory Control and Data Acquisition) systems generating continuous sensor streams. Regulatory bodies often require retention of 7–25 years. OSIsoft PI (now AVEVA PI) is the dominant proprietary TSDB in this space. Reliability and data integrity matter more than query speed.

Section 19

Tools & Ecosystem: The TSDB Toolbox

A time-series database does not operate in isolation. Around every TSDB there is a small constellation of tools: collectors that push data in, dashboards that pull data out, query languages that let you express time-windowed questions, and SDKs that instrument your own code. Knowing the six tools below gets you from zero to a working observability stack in an afternoon.

[Figure: Typical TSDB Ecosystem Flow. Application (OTel SDK or Prometheus client) emits metrics → Collector (Telegraf / OTel Collector, push) batches and forwards → TSDB (TimescaleDB, InfluxDB, Prometheus, ClickHouse; kdb+ for finance) stores chunked, compressed data and applies retention → Grafana runs time-window queries (PromQL / Flux / SQL) for dashboards and alerts. Alternative path: Prometheus scrapes the application directly (pull).]

Grafana

The industry-standard open-source dashboard platform. Grafana speaks directly to almost every TSDB (Prometheus, InfluxDB, TimescaleDB, ClickHouse, and many more) through a plugin data-source system. You build dashboards by writing queries in the native language of your TSDB (PromQL, Flux, SQL) and Grafana renders them as time-series graphs, heatmaps, stat panels, or tables. It also handles alerting: define a threshold in a query, and Grafana notifies your on-call team when the value crosses it. If you only add one tool to a TSDB deployment, make it Grafana, because it covers the entire visualization and alerting layer.

PromQL / Flux / SQL

Each major TSDB has its own query language, because time-series queries have patterns (rate-of-change, rolling windows, percentiles) that plain SQL handles awkwardly. PromQL (Prometheus Query Language) is a functional language built around instant vectors and range vectors, and it excels at rate calculations and threshold alerts. Flux (InfluxDB v2) is a pipeline language: you chain operations like |> filter() |> aggregateWindow() to transform streams. SQL via TimescaleDB gives Postgres users familiar syntax augmented with time_bucket() and continuous aggregates. Choosing a TSDB partly means choosing which query language your team will live in.
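The "rate-of-change" pattern is easier to reason about once you see what a rate function has to do with a monotonic counter. The sketch below is a deliberately simplified stand-in for PromQL's rate(): it handles counter resets but skips the window-boundary extrapolation that Prometheus applies, so its results will differ slightly from real PromQL output. The function name and sample layout are assumptions for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the samples of one window.

    Simplified: counter resets are handled, boundary extrapolation is not.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter dropped, so the process restarted and counted up
            # from zero again: the whole current value is new increase.
            increase += curr
    return increase / elapsed


# Scraped every 15 s; the process restarted between the 2nd and 3rd scrape.
window = [(0.0, 100.0), (15.0, 130.0), (30.0, 10.0), (45.0, 40.0)]
print(simple_rate(window))  # 70 units of increase over 45 seconds
```

Expressing the reset handling alone in plain SQL takes a window function plus a CASE expression per series, which is the awkwardness the dedicated languages remove.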

Telegraf

InfluxData's open-source metrics collection agent, the "push" counterpart to Prometheus's pull model. You deploy Telegraf as a sidecar or system service; it reads from 200+ input plugins (CPU, memory, Docker stats, MySQL queries, Kafka lag, AWS CloudWatch, and nearly anything else) and writes to 50+ output plugins (InfluxDB, Prometheus remote write, Kafka, Elasticsearch, etc.). Why use it instead of writing your own collector? Telegraf handles batching, buffering during network outages, and plugin versioning so you do not have to build any of that yourself. It is the fastest way to get a machine's metrics flowing into any TSDB.
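When Telegraf (or any client) writes to InfluxDB, each point travels as one line of InfluxDB line protocol: measurement and tags, then fields, then a timestamp. The sketch below builds such a line by hand to show the shape of the format. It is a simplified encoder under stated assumptions (nanosecond timestamps, only bool/int/float/str field values, no unsigned integers); the helper names are hypothetical, and a real project would use an official client library instead.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]
TAG_SPECIALS = ",= "         # escaped in tag keys, tag values, and field keys
MEASUREMENT_SPECIALS = ", "  # escaped in measurement names


def _escape(text: str, specials: str) -> str:
    for ch in specials:
        text = text.replace(ch, "\\" + ch)
    return text


def _field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integer fields carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'        # string fields are double-quoted


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, FieldValue], timestamp_ns: int) -> str:
    head = _escape(measurement, MEASUREMENT_SPECIALS)
    for key in sorted(tags):  # sorted tag keys are recommended for write speed
        head += "," + _escape(key, TAG_SPECIALS) + "=" + _escape(tags[key], TAG_SPECIALS)
    body = ",".join(_escape(k, TAG_SPECIALS) + "=" + _field(v) for k, v in fields.items())
    return f"{head} {body} {timestamp_ns}"


line = to_line_protocol("cpu", {"region": "eu", "host": "web 01"},
                        {"usage_percent": 42.5, "cores": 8}, 1700000000000000000)
print(line)
```

Tags are indexed and fields are not, so the choice of what goes before the first space is also a cardinality decision.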

Prometheus Exporters

Prometheus uses a pull model: the Prometheus server scrapes an HTTP endpoint (/metrics) on each monitored target at a configured interval (15 seconds is a common setting). An exporter is a small process that translates a system's internal metrics format into the Prometheus text exposition format. The node_exporter exposes Linux kernel metrics (CPU, disk I/O, network). The mysqld_exporter exposes MySQL performance schema. The redis_exporter exposes INFO output. There are 100+ community exporters. For your own application, you use an official client library (Go, Python, Java, Ruby) to instrument code directly, exposing custom counters, gauges, and histograms at the same /metrics endpoint.
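The text exposition format that exporters produce is plain enough to generate by hand, which helps demystify what a scrape actually returns. The sketch below renders a single counter family; the function names and the sample data are assumptions for illustration, and in practice you would let an official client library do this for you.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]  # e.g. (("method", "GET"), ("status", "200"))


def _escape_label_value(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: Dict[Labels, float]) -> str:
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        if labels:
            rendered = ",".join(f'{key}="{_escape_label_value(val)}"' for key, val in labels)
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"), ("status", "200")): 1027.0,
     (("method", "POST"), ("status", "500")): 3.0},
)
print(body, end="")
```

Every distinct label combination becomes its own line, and therefore its own series in Prometheus.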

OpenTelemetry SDK

OpenTelemetry (OTel) is a vendor-neutral observability standard for instrumentation, hosted by the CNCF, where it is the second most active project after Kubernetes. Instead of writing Prometheus-specific client code or InfluxDB-specific line protocol code, you instrument your application once using the OTel SDK and then configure an exporter at startup to ship to whichever backend you want (Prometheus, Jaeger, Tempo, InfluxDB, Datadog, Honeycomb). This decouples your code from your storage choice. OTel covers three signals: metrics (counters, histograms, gauges), traces (distributed request flows), and logs, making it a genuine observability standard rather than just a metrics library. Increasingly, new projects start with OTel rather than a TSDB-specific SDK.

kdb+ / q

kdb+ is the financial industry's TSDB of choice, and it is in a different performance league from everything else on this list. Built by KX Systems (founded 1993; kdb first released in 1998, with kdb+ following in 2003), kdb+ stores time-series data in a column-oriented, in-memory-first architecture and ships with q, a terse array programming language that executes time-series queries at speeds measured in microseconds rather than milliseconds. A single kdb+ process can ingest millions of ticks per second and answer complex rolling-window queries over years of market data in single-digit milliseconds. The tradeoff: kdb+ is expensive (commercial license), has a steep q learning curve, and is almost exclusively used in finance (high-frequency trading, options pricing, backtesting). For most engineers it is background knowledge, not a practical choice.

Query Language Examples

# Rate of HTTP requests per second, averaged over the last 5 minutes
# rate() calculates per-second increase of a counter over a time window
rate(http_requests_total{job="api-server", status="200"}[5m])

# 95th percentile request duration over last 5 minutes
# histogram_quantile() reconstructs percentile from Prometheus histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage by host, as an instant vector (point-in-time)
# 1 - idle = used; avg by instance collapses all CPU cores to one number
1 - avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
)

-- time_bucket() is TimescaleDB's key extension: rounds timestamp to a bucket width
-- This gives CPU average per 1-minute window over the last hour
SELECT
  time_bucket('1 minute', recorded_at) AS bucket,
  host,
  AVG(cpu_percent)                     AS avg_cpu,
  MAX(cpu_percent)                     AS peak_cpu
FROM metrics
WHERE recorded_at > NOW() - INTERVAL '1 hour'
  AND metric_name = 'cpu'
GROUP BY bucket, host
ORDER BY bucket DESC;

-- Continuous aggregate: pre-computed 1-hour rollup (created once, updated automatically)
CREATE MATERIALIZED VIEW cpu_hourly
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', recorded_at) AS hour,
  host,
  AVG(cpu_percent) AS avg_cpu
FROM metrics
GROUP BY hour, host;

// Flux: pipeline syntax, chain operations with |>
// Average CPU usage per host, bucketed to 1-minute windows, last hour
from(bucket: "telemetry")
  |> range(start: -1h)                         // time window
  |> filter(fn: (r) => r._measurement == "cpu" // narrow to cpu measurement
    and r._field == "usage_percent")
  |> aggregateWindow(
       every: 1m,                               // bucket width
       fn: mean,                                // aggregation function
       createEmpty: false                       // skip empty buckets
     )
  |> group(columns: ["host"])                   // one line per host
  |> yield(name: "cpu_by_host")
The TSDB ecosystem has six key tools: Grafana (universal dashboard + alerting), the native query languages (PromQL, Flux, SQL) each tuned for time-series patterns, Telegraf (push-model collector for 200+ sources), Prometheus exporters (pull-model with language clients + pre-built adapters), OpenTelemetry SDK (vendor-neutral instrumentation standard), and kdb+ (financial-grade ultra-low-latency TSDB).
Section 20

Common Misconceptions

Time-series databases look deceptively simple at first. You write timestamped data, you read it back โ€” how hard can it be? The six misconceptions below are responsible for most TSDB production incidents and architecture regrets. Clearing them up before you deploy saves painful rework later.

1. "Just use Postgres for everything. It handles time-series fine."

Postgres handles time-series data fine up to roughly 1 billion rows with good partitioning. The friction starts at high write throughput and long retention. At 100 K inserts/sec, Postgres's WAL becomes a bottleneck, the B-tree index on your timestamp column balloons beyond shared_buffers, and DELETE FROM metrics WHERE ts < now() - interval '30 days' is a slow row-by-row delete that leaves dead tuples for VACUUM to clean up. A TSDB's time-partitioned chunks, columnar compression (10–50×), and O(1) retention-by-chunk-drop make a real difference at that scale. The rule of thumb: if you are below a few hundred million rows and your team is already fluent in Postgres, stay there. If you are writing millions of rows per minute and need years of retention, a dedicated TSDB pays for its operational complexity.

2. "TSDBs replace my main database. I can store everything there."

A TSDB is a secondary store for derived data, not a replacement for your primary database. TSDBs generally do not support referential integrity, multi-row transactions, or random-access updates. If a node crashes and you have not configured replication, InfluxDB OSS can lose recent data. The right architecture: your primary store (Postgres, MongoDB, etc.) holds authoritative state; the TSDB holds time-indexed copies of measurements and metrics. Where those measurements are derived from primary data, you can rebuild them from the primary store if needed. Treating a TSDB as a primary store is a common early mistake that causes data consistency headaches and tricky recovery scenarios later.

3. "Disk is filling up? Just shorten the retention period."

Slashing retention solves the immediate disk crisis but destroys the insight you might need in three months. Imagine your servers had an anomaly six weeks ago that is now manifesting as a production issue. If your raw retention is 30 days, you have lost the evidence. The better approach is downsampling: keep raw 1-second data for 7 days (needed for debugging), downsample to 1-minute averages for 90 days (trending), and downsample to 1-hour averages for 2 years (capacity planning). You lose resolution but keep historical context. Most TSDBs support tiered retention natively: InfluxDB through downsampling tasks, TimescaleDB through continuous aggregates, Prometheus through recording rules.
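Downsampling itself is a small operation: floor each timestamp to its bucket start and average the values that land in the same bucket. The sketch below shows that on an in-memory list, assuming integer Unix-second timestamps; the function name is hypothetical, and a real TSDB runs the equivalent inside the database on a schedule.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (unix seconds, value)


def downsample_mean(points: List[Point], bucket_seconds: int) -> List[Point]:
    """Collapse raw points into one averaged point per time bucket."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds  # floor to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


# Two minutes of 1-second data, reduced to two 1-minute averages.
raw = [(t, float(t % 60)) for t in range(120)]
print(downsample_mean(raw, bucket_seconds=60))
```

Averages hide spikes, so it is common to store min and max alongside the mean in each downsampled tier.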

4. "All TSDBs are interchangeable. Just pick any one."

TSDBs differ significantly on four axes. Push vs pull: InfluxDB and TimescaleDB accept pushed data; Prometheus scrapes targets on a schedule. Push works well for IoT devices and short-lived jobs that cannot be scraped; pull works well for targets Prometheus can discover and reach. Cardinality limits: Prometheus keeps its series index in RAM, so cardinality above a few million active series can cause OOM crashes; TimescaleDB and ClickHouse handle much higher cardinality on disk. Query language: PromQL, Flux, and SQL are not equivalent in expressiveness or learning curve. Durability model: InfluxDB OSS has weaker durability guarantees than TimescaleDB (which inherits Postgres's WAL). Picking the wrong one means a painful migration. Match the tool to the workload before committing.

5. "Cardinality doesn't matter at our scale. We're a small team."

Cardinality explosions sneak in through seemingly small decisions. A common scenario: a developer adds request_id as a Prometheus label "just for debugging." Each HTTP request gets a unique ID, so at 1 000 req/sec that is about 86 million new series per day. Prometheus stores series metadata in a head block in RAM; within hours the server OOMs. The fix requires a rollback, a restart, and possibly a manual series deletion. The rule is absolute: never use an unbounded value (user ID, request ID, session token, IP address) as a label or tag. Every label value that can vary per request multiplies your cardinality by the number of possible values.
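The arithmetic behind that warning fits in a few lines. This sketch computes an upper bound on series count as the product of per-label value counts, then reproduces the 86 million figure; the label names and counts are made-up illustrative numbers.

```python
from math import prod
from typing import Dict


def max_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of label cardinalities.

    Real counts are usually lower, because not every combination occurs.
    """
    return prod(label_value_counts.values())


bounded = {"host": 50, "region": 4, "status": 12}
print(max_series(bounded))  # 50 * 4 * 12 = 2400 series at most

# An unbounded label creates a new series for every new value it ever takes.
requests_per_second = 1_000
seconds_per_day = 86_400
new_series_per_day = requests_per_second * seconds_per_day
print(new_series_per_day)  # 86400000
```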

6. "Real-time dashboards mean millisecond data freshness."

"Real-time" in the TSDB world usually means 5–30 seconds of ingestion-to-query latency, not milliseconds. Prometheus scrapes on a fixed interval (the default is 1 minute; 15 seconds is a common setting), so with a 15-second interval a value that changes right after a scrape will not appear until the next one, up to 15 seconds later. InfluxDB OSS flushes its write cache every 10 seconds. TimescaleDB continuous aggregates refresh on a schedule you configure (often 30–60 seconds). True millisecond freshness requires kdb+ (which costs significantly) or a purpose-built stream processor (Flink, Kafka Streams) in front of the TSDB. For alerting and dashboards, 15–30 second lag is almost always fine. Design your architecture around actual latency numbers, not the marketing term "real-time."

Six misconceptions to internalize: Postgres is fine up to ~1B rows but TSDBs win at high write throughput; TSDBs are secondary stores, not primaries; shorten retention only as a last resort and downsample instead; TSDBs differ widely on push/pull model, cardinality limits, and query language; never use unbounded values as labels; and "real-time" in TSDBs typically means 5–30 second lag, not milliseconds.
Section 21

Real-World Disasters & Lessons

The disasters below all happened in real production systems. The patterns repeat with embarrassing regularity across different companies and different TSDBs, mostly because TSDB-specific failure modes are not widely taught. Reading these stories costs nothing. Learning them the hard way can cost your team weeks of incident recovery.

[Figure: Cardinality Explosion. Active series (y-axis: 500K, 2M, 5M, OOM) against time in hours after deploying the request_id label (x-axis). With bounded labels only, the series count stays flat; once the unbounded label is added, it follows a hockey-stick curve until the head block exhausts RAM and Prometheus OOM-crashes.]
Disaster 1: Cardinality Explosion in Production

A team added request_id as a Prometheus label on their HTTP metrics "so we can trace individual requests in Grafana." Every HTTP request generates a unique ID. At 1 000 req/sec, about 86 million new series appear per day. Prometheus holds its series index in RAM. Within 6 hours the server OOM-crashed; the monitoring and alerting system went dark at the same moment as a production incident was starting. Recovery required restarting Prometheus with a clean data directory, losing hours of metrics history.

Lesson: Never use unbounded values (request ID, user ID, session token, IP address) as Prometheus labels or InfluxDB tags. Label values should come from a small, finite set: host name (tens), region (a handful), status code (a dozen). If you need to trace individual requests, use distributed tracing (Jaeger, Tempo). That is what it is for.

Disaster 2: InfluxDB v2 Migration Regret

A team upgraded from InfluxDB v1 to v2 on the day v2 was released, excited by the new Flux query language. The upgrade path broke the InfluxQL-to-Flux compatibility layer, and every Grafana dashboard stopped working. The Flux equivalents for their InfluxQL queries were not documented clearly at the time. Two engineers spent a week rewriting dashboards; some complex aggregations required re-architecting how data was stored.

Lesson: Wait for ecosystem maturity before upgrading to a new major version of a TSDB, especially when the query language changes. If you do upgrade, run old and new in parallel during a migration window, migrate dashboards systematically (one at a time, not all at once), and have a tested rollback path before cutting over. "Upgrade first" is a TSDB anti-pattern.

Disaster 3: Retention Policy Forgotten Until Disk Is Full

A team deployed TimescaleDB to store application metrics. They set up the table and continuous aggregates but forgot to configure a data retention policy. Three months later, the cluster's disks filled completely. Writes failed, and the application started returning errors. The alerting pipeline also broke, because the alerts themselves were being written to the same database. The chicken-and-egg problem: the system that would have warned them was the first thing to break.

Lesson: Configure retention policies before production launch, not after. Monitor disk usage and ingestion lag as first-class SLO metrics. If your TSDB disk is above 70%, you should already know and have a plan. In TimescaleDB, set an automated retention job: SELECT add_retention_policy('metrics', INTERVAL '30 days');. In InfluxDB, configure the retention period when you create the bucket. Never launch without it.

Disaster 4: Prometheus Pull Failures in Cloud Autoscaling

A team ran Prometheus with a static scrape_configs file listing their servers by IP. They moved to Kubernetes and enabled horizontal autoscaling. New pods got new IPs; old pods disappeared. Prometheus's static config knew nothing about the pod churn: it was still trying to scrape dead IPs and had never learned about the new ones. Gaps appeared in dashboards, and CPU spike alerts stopped firing for the new pod fleet entirely, masking a real performance regression.

Lesson: In dynamic environments (Kubernetes, AWS ASG, GCP managed instance groups), use service discovery. Prometheus has built-in Kubernetes SD (kubernetes_sd_configs), and the Prometheus Operator automates this entirely by watching Kubernetes resources and updating scrape targets automatically. Static configs are fine for stable bare-metal environments; they are an antipattern in cloud-native autoscaling.

Disaster 5: No Backups on InfluxDB OSS

A team ran InfluxDB OSS on a single EC2 instance without snapshots, reasoning that metrics data is "not critical, we can lose some." The EBS volume suffered a silent corruption event during a storage maintenance window. Two months of metrics history were unrecoverable. Post-incident, the team realized they had used those metrics for capacity planning, trending, and post-mortems. The data was not as disposable as assumed.

Lesson: Even "disposable" time-series data has value. Configure regular backups (influx backup to S3 for InfluxDB, pg_dump or PITR for TimescaleDB). Consider Telegraf double-write: configure two output plugins so every metric goes to your primary TSDB and simultaneously to a cheaper backup destination (e.g., S3 as line-protocol files). For Prometheus, enable remote write to a durable long-term storage backend (Thanos, Cortex, Grafana Mimir).

Five production disaster patterns to memorize: cardinality explosion from unbounded labels (never label per-request); TSDB major-version migration regret (wait for ecosystem maturity, run parallel); retention forgotten until disk full (configure before launch, monitor disk as SLO); Prometheus pull gaps in autoscaling (use service discovery, not static configs); and data loss from no backups (even metrics need backups: Telegraf double-write, Thanos remote write).
Section 22

Performance & Best Practices Recap

Eight rules cover the vast majority of TSDB performance decisions. None of them require deep internals knowledge. They are practical choices every engineer running a TSDB in production should have already made. If your TSDB is struggling, run through this checklist before investigating anything else.

Figure: TSDB Production Best Practices, Quick Reference
1. Bound cardinality absolutely. Never label with user ID, request ID, or IP. Labels must come from a small, finite set. Cap before deploying.
2. Batch writes: 1 000–10 000 points. Each individual write costs one HTTP round-trip plus one fsync; batching amortises both. Use Telegraf or client buffers.
3. Time-bucket your data, always. Partitioning by time is what enables partition pruning and O(1) chunk-drop retention. Non-negotiable.
4. Tier retention with downsampling. Raw 7–30 d → 1-minute aggregates 90 d → 1-hour aggregates 2 y. Keeps insight without keeping raw data forever.
5. Match TSDB to workload. Prometheus for K8s metrics, ClickHouse for analytics, TimescaleDB if the team knows SQL, InfluxDB for IoT.
6. Monitor disk + cardinality + ingest lag. The top three SLO metrics for any TSDB. Disk >70% → act now. Cardinality spike → find the offending label.
7. TSDB = secondary, not primary store. Authoritative state lives in Postgres / DynamoDB / etc. The TSDB is derived and re-buildable. No transactions needed.
8. Use Grafana for all visualization. Don't reinvent dashboards. Grafana supports every TSDB, handles alerting, and has a huge plugin ecosystem.
Apply all 8 rules before tuning anything else; they eliminate 90% of common TSDB production problems.

Bound cardinality absolutely

Before adding any label or tag, ask: "How many distinct values can this take?" If the answer is "unbounded" (per-request, per-user, per-IP), do not make it a label. Labels should come from a finite, small set: host name, region, environment, status code. If you need per-request context for debugging, use distributed tracing (Jaeger, Tempo), not metric labels. A useful mental model: if you printed all your label combinations on a whiteboard and they could not fit, the cardinality is already too high.
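The "how many distinct values?" question can be asked of data you already have. Here is a minimal audit sketch that counts distinct values per label across a sample of series and flags anything over a threshold. The function name, the input shape (one dict of labels per series), and the default limit of 100 are illustrative assumptions, not a recommendation from any particular TSDB.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def high_cardinality_labels(series: Iterable[Dict[str, str]],
                            limit: int = 100) -> Dict[str, int]:
    """Return labels whose number of distinct values exceeds `limit`."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            seen[name].add(value)
    return {name: len(values) for name, values in seen.items() if len(values) > limit}


# 500 series: host is bounded (5 values), request_id is unique per series.
sample_series = [{"host": f"web-{i % 5}", "request_id": str(i)} for i in range(500)]
print(high_cardinality_labels(sample_series))  # only request_id is flagged
```

Running a check like this in CI against a staging scrape is one way to catch an unbounded label before it reaches production.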

Batch writes

Every single-point write request carries HTTP overhead (~0.5–1 ms round-trip) and often a write-ahead log fsync. At 100 points/sec, individual writes pay that per-request overhead 100 times, where a single batched write of 100 points pays it once. Use client libraries with built-in batching (InfluxDB Python client, Telegraf's output buffer), or batch manually: accumulate points in memory for 1 second and flush as one HTTP request. Aim for 1 000–10 000 points per batch. Smaller is OK for low-volume sources; larger can cause timeout issues on slow networks.
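Manual batching is mostly bookkeeping: buffer points, and flush when the buffer is big enough or old enough. The class below is a minimal sketch of that idea. `PointBatcher`, its parameters, and the `flush` callback are hypothetical names; the age check only runs when a new point arrives, so a production version would add a background timer and retry handling.

```python
import time
from typing import Any, Callable, List


class PointBatcher:
    """Buffer points and hand them to `flush` in batches."""

    def __init__(self, flush: Callable[[List[Any]], None],
                 max_points: int = 5_000, max_age_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_points = max_points
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[Any] = []
        self._oldest = 0.0

    def add(self, point: Any) -> None:
        if not self._buffer:
            self._oldest = self._clock()  # age is measured from the first point
        self._buffer.append(point)
        too_big = len(self._buffer) >= self._max_points
        too_old = self._clock() - self._oldest >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)  # e.g. one HTTP request carrying the whole batch


sent: List[List[int]] = []
batcher = PointBatcher(sent.append, max_points=3, max_age_seconds=3600.0)
for point in range(7):
    batcher.add(point)
batcher.flush()  # always flush the remainder on shutdown
print(sent)
```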

Time-bucket partitioning

Every major TSDB does this automatically, but you need to understand it to tune it. TimescaleDB's default chunk interval is 7 days, which is fine for most workloads. If you query mostly within the last hour, a 1-day chunk means only one chunk is usually opened per query. If your queries span months, wider chunks reduce metadata overhead. InfluxDB and Prometheus manage this internally without configuration. ClickHouse requires you to define the sorting key explicitly (for example ORDER BY (metric, timestamp)) and, optionally, a separate PARTITION BY expression on the timestamp. The point: always understand your TSDB's partitioning strategy and tune chunk size to match your most common query window.

Tiered retention + downsampling

A practical tiered retention strategy for most production systems: raw 1-second data for 7 days (needed for real-time debugging), 1-minute aggregates for 90 days (trend analysis and capacity planning), 1-hour aggregates for 2 years (annual reviews, SLA reporting). Each tier holds roughly 60× fewer points per unit of time than the one before it, so the hourly aggregates are about 3 600× sparser than the raw data. In TimescaleDB, continuous aggregates handle the rollups automatically. In Prometheus, recording rules pre-compute the aggregates, and remote write ships them to long-term storage (Thanos/Mimir). In InfluxDB, scheduled Flux tasks run the downsampling and write to a second bucket with a longer retention.
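A quick back-of-the-envelope calculation shows why tiering pays off. The sketch below counts stored points per series for the three tiers described above and compares the total with keeping raw 1-second data for the full two years. It assumes one stored value per aggregate point and a 730-day "2 years"; storing min, max, and mean per bucket would multiply the aggregate tiers accordingly.

```python
SECONDS_PER_DAY = 86_400

# (tier name, seconds between points, retention in days)
tiers = [
    ("raw 1-second", 1, 7),
    ("1-minute aggregates", 60, 90),
    ("1-hour aggregates", 3_600, 730),
]

points_per_tier = {
    name: retention_days * SECONDS_PER_DAY // step_seconds
    for name, step_seconds, retention_days in tiers
}
tiered_total = sum(points_per_tier.values())
raw_for_two_years = 730 * SECONDS_PER_DAY

for name, count in points_per_tier.items():
    print(f"{name}: {count:,} points per series")
print(f"tiered total:  {tiered_total:,}")
print(f"raw for 2 yrs: {raw_for_two_years:,}")
print(f"reduction:     {raw_for_two_years / tiered_total:.1f}x")
```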

Right TSDB for the workload

There is no universally best TSDB. Prometheus: pull model, Kubernetes-native, excellent for infrastructure metrics + alerting, cardinality is the main limit. TimescaleDB: full SQL, Postgres ecosystem, best if team already knows Postgres, handles high cardinality on disk. InfluxDB: purpose-built for IoT and general time-series push workloads, simpler operational model than TimescaleDB. ClickHouse: columnar OLAP powerhouse, use when you also need complex analytics and JOIN-style queries across large datasets, not just pure time-series. Choose based on your team's existing skills, your write model (push vs pull), and your expected cardinality.

Monitor disk + cardinality + ingest lag

These are the three leading indicators that your TSDB is in trouble. Disk usage above 70%: you are six weeks from a full-disk write failure โ€” act now (add storage, tighten retention, enable compression). Cardinality spike: a new label was added with unbounded values โ€” find it and remove it before it OOMs the server. Ingestion lag growing: your write throughput exceeds the TSDB's flush rate โ€” add nodes, reduce scrape frequency, or increase batch size. Track all three in Grafana with alerts on each. They are early-warning indicators, not post-mortems.
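How many weeks remain depends entirely on your growth rate, which is easy to estimate from two disk-usage readings. The sketch below does a straight-line projection; the function name and the sample numbers are made up for illustration, and real growth is rarely perfectly linear, so treat the result as a rough planning figure.

```python
from typing import List, Optional, Tuple

Reading = Tuple[float, float]  # (day number, bytes used)


def days_until_full(readings: List[Reading], capacity_bytes: float) -> Optional[float]:
    """Linear projection from the first to the last reading.

    Returns None when usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = readings[0], readings[-1]
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (capacity_bytes - last_used) / growth_per_day


GB = 10 ** 9
# Hypothetical 1 TB volume: 60% full two weeks ago, 70% full today.
readings = [(0.0, 600.0 * GB), (14.0, 700.0 * GB)]
remaining = days_until_full(readings, capacity_bytes=1_000.0 * GB)
print(f"about {remaining:.0f} days until the disk is full")
```

With these particular numbers the projection comes out at roughly six weeks; a faster-growing dataset at the same 70% mark could have far less headroom.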

Eight TSDB best practices: never use unbounded values as labels (cardinality kills); batch 1K–10K points per write; always use time-partitioned chunks; tier retention with downsampling rather than slashing it; match TSDB to workload (Prometheus for K8s, TimescaleDB for SQL teams, ClickHouse for analytics); monitor disk/cardinality/lag as SLOs; treat TSDB as secondary store only; and use Grafana for all visualization.
Section 23

Frequently Asked Questions

These are the questions that come up in every architecture review, every interview, and every Slack thread where someone says "should we add a TSDB?" Each answer is written for someone who understands databases generally but is still learning the time-series niche.

Q1: Do I actually need a TSDB, or is Postgres with partitioning enough?

Use Postgres if: your write rate is below a few thousand rows per second, your dataset will stay under a few hundred million rows, and your team is already fluent in Postgres. Postgres with PARTITION BY RANGE (created_at) and a B-tree or BRIN index on the timestamp handles this well, with no new operational overhead and no new query language. Reach for a dedicated TSDB when: you are hitting 10K+ writes/sec and Postgres WAL is a bottleneck; you need 10–50× compression to make storage costs manageable; you are building a multi-tenant metrics platform where cardinality and retention complexity are real concerns; or you need built-in continuous aggregates and tiered downsampling. The decision should be driven by actual numbers hitting actual walls, not by enthusiasm for a new tool.

Q2: Prometheus or InfluxDB: which should I choose?

They solve the same problem differently. Choose Prometheus if: you are running Kubernetes (the Prometheus Operator is the de facto standard); you want a pull model (Prometheus scrapes your services, which simplifies firewall rules); you need tight integration with the CNCF alerting ecosystem (Alertmanager). Choose InfluxDB if: your data sources push metrics (IoT sensors, embedded devices, legacy systems that cannot run an HTTP server for Prometheus to scrape); you want a richer query language for complex transformations (Flux); or you need a standalone TSDB without the Prometheus ecosystem overhead. In practice: Kubernetes observability → Prometheus. IoT / general telemetry → InfluxDB. Both together → instrument with the OTel SDK and export to both.

Q3: TimescaleDB or InfluxDB: when does SQL win over Flux?

TimescaleDB wins when: your team already knows Postgres SQL; you want to JOIN time-series data with relational data (e.g., enrich metrics with user account details from the same Postgres cluster); you need the full Postgres feature set (foreign keys, triggers, pg extensions like PostGIS); or you are migrating an existing Postgres metrics table and want the TSDB benefits without switching databases. InfluxDB wins when: you want a purpose-built TSDB with no relational overhead; your team is comfortable with Flux's pipeline syntax; you need native multi-tenancy via bucket isolation; or you are starting fresh and your use case is purely time-series with no relational cross-queries. The bottom line: if SQL is your team's first language, TimescaleDB has a much gentler learning curve.

Q4: What about ClickHouse? Is it a TSDB?

ClickHouse is a columnar OLAP database that happens to be excellent at time-series analytics. It is not purpose-built as a TSDB: it has no native notion of a series, and retention and downsampling are expressed through general-purpose features (TTL clauses, materialized views) rather than dedicated TSDB primitives. But its columnar compression, vectorized query execution, and SQL dialect with useful time-oriented functions (toStartOfMinute(), windowFunnel()) make it genuinely competitive for time-series analytics workloads, especially when you also need to JOIN against other large datasets or run complex analytical queries alongside your metrics. Use ClickHouse when: you need both time-series and general OLAP in one system; your queries are complex SQL with GROUP BY, JOIN, and window functions; and you can ingest in large batches rather than many small writes. It is overkill for a pure Kubernetes metrics stack; use Prometheus there.

Q5: How long should I keep data? What retention tiers make sense?

A practical starting point for most production systems: raw data 7–30 days (enough for on-call debugging and incident investigation); 1-minute aggregates 90 days (trend analysis, regression detection, capacity planning); 1-hour aggregates 1–5 years (annual capacity reviews, SLA reporting, long-term trends). The key insight: the large majority of queries (90% is a common rule of thumb) hit data from the last 24 hours. Data from more than 90 days ago is almost always accessed at hourly resolution. Paying for full-resolution raw data beyond 30 days usually provides no practical benefit. Start with these defaults and adjust based on how often your team actually queries historical data and at what granularity.
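A quick back-of-the-envelope calculation shows why the tiers matter. The sketch below counts stored points per series and assumes 1-second raw resolution and the upper end of each retention range. Both are assumptions for illustration, and compression is ignored.

```python
SECONDS_PER_DAY = 86_400


def points_per_series(resolution_seconds: int, retention_days: int) -> int:
    """Number of points one series holds at a given resolution and retention."""
    return retention_days * SECONDS_PER_DAY // resolution_seconds


tiers = {
    "raw, 1 s, 30 days": points_per_series(1, 30),
    "1-minute, 90 days": points_per_series(60, 90),
    "1-hour, 5 years": points_per_series(3600, 5 * 365),
}
tiered_total = sum(tiers.values())

# Alternative: keep raw 1-second data for the full 5 years.
raw_only = points_per_series(1, 5 * 365)

for name, count in tiers.items():
    print(f"{name}: {count:,} points")
print(f"tiered total: {tiered_total:,}  vs  raw for 5 years: {raw_only:,}")
```

Under these assumptions, keeping everything raw for five years stores roughly 57 times as many points per series as the tiered layout, and nearly all of the tiered footprint is the 30 days of raw data.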

Q6: What is the difference between Prometheus and Grafana?

They are different layers of the observability stack. Prometheus is a time-series database and collection system: it scrapes metrics from your services, stores them, and evaluates alert rules. You can query it directly via PromQL, but its built-in UI is minimal. Grafana is a visualization and dashboard platform: it connects to Prometheus (and dozens of other data sources) as a read-only client, renders beautiful graphs and panels, and has a sophisticated alerting UI. Prometheus stores the data; Grafana makes it beautiful and actionable. Most production observability stacks use both: Prometheus for storage and alert evaluation, Grafana for the human-readable dashboards. You could use Grafana without Prometheus (connecting it to InfluxDB or TimescaleDB instead), because the two are independent.

Q7: How do I migrate from one TSDB to another without losing history?

There is no magic: migration is a batch backfill process. The safest approach: (1) run both in parallel: configure your collectors (Telegraf, OTel) to double-write to old and new simultaneously; (2) backfill history: read old data in time chunks from the source TSDB and write it to the new one via the new TSDB's bulk write API; (3) validate: compare a sample of queries against both systems; (4) migrate dashboards: update Grafana data sources one at a time, keeping old dashboards on the old data source during the transition; (5) cut over: stop double-writing once you trust the new system. The hardest part is usually query translation. PromQL and Flux are not equivalent, and some aggregations need rethinking. Budget 2–4 weeks for a careful migration of a production metrics stack.
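The backfill step (2) is mostly a loop over time windows. Here is a minimal sketch of that loop, assuming you supply two callables of your own: read_window to fetch rows from the old system and write_batch to bulk-write them to the new one. Both are hypothetical placeholders, not real client APIs, and a real job would add retries and checkpointing.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Iterator, Tuple

Window = Tuple[datetime, datetime]


def time_windows(start: datetime, end: datetime, step: timedelta) -> Iterator[Window]:
    """Yield consecutive half-open windows [lo, hi) that cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def backfill(
    start: datetime,
    end: datetime,
    step: timedelta,
    read_window: Callable[[datetime, datetime], Iterable],  # hypothetical source reader
    write_batch: Callable[[list], None],                    # hypothetical bulk writer
) -> int:
    """Copy history one window at a time; returns the number of rows copied."""
    copied = 0
    for lo, hi in time_windows(start, end, step):
        rows = list(read_window(lo, hi))
        if rows:
            write_batch(rows)
        copied += len(rows)
    return copied


# In-memory stand-ins for the old and new TSDB, for demonstration only.
t0 = datetime(2025, 5, 1, tzinfo=timezone.utc)
source = [(t0 + timedelta(hours=h), float(h)) for h in range(48)]
destination: list = []

copied = backfill(
    t0,
    t0 + timedelta(hours=48),
    timedelta(hours=6),
    read_window=lambda lo, hi: [row for row in source if lo <= row[0] < hi],
    write_batch=destination.extend,
)
print(copied, destination == source)
```

Half-open windows matter here: with [lo, hi) boundaries, a point that falls exactly on a boundary is copied once, not twice.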

Q8: Is OpenTelemetry replacing Prometheus?

They are complementary, not competing. OpenTelemetry is an instrumentation and collection standard: it defines how your application code emits metrics, traces, and logs, and how those signals flow to a backend. It is agnostic about what stores the data. Prometheus is a storage and query system, and it is one of many backends that OTel can export to. The direction of the ecosystem: new projects increasingly instrument with the OTel SDK (avoiding lock-in to a specific client library), then export to Prometheus for storage and Grafana for visualization. Old projects using the native Prometheus client library still work fine and there is no urgent reason to migrate. OTel adds value at the instrumentation layer; Prometheus keeps its role as a dominant metrics backend, especially in Kubernetes environments.

Eight key answers: use Postgres up to ~hundreds of millions of rows, TSDB beyond that; Prometheus for Kubernetes pull-model, InfluxDB for IoT push; TimescaleDB if you know SQL and need relational joins; ClickHouse for time-series + OLAP but overkill for pure metrics; tiered retention (raw 7–30d, 1-min 90d, 1-hour 1–5yr) is the standard pattern; Prometheus stores, Grafana visualizes (different layers); TSDB migration = double-write + backfill + dashboard migration; OTel is the instrumentation standard, Prometheus is one storage backend.