Part II · Programming & Software Engineering · Chapter 05

Databases, where the data actually lives.

Somewhere behind every model, every dashboard, every feature in every pipeline, there is a database. For fifty years that database has most often been a SQL engine running on a single machine; for the last fifteen, it has been an increasingly strange ecosystem of distributed key-value stores, document engines, graph databases, and columnar warehouses. This chapter covers the relational core first — the model, the language, the physical machinery that makes it fast — and then the landscape around it, with a practitioner's eye on what shows up in ML and analytics work.

How to read this chapter

The first seven sections are about the relational core: what a table is, what SQL is, how to join and aggregate and compose queries. Sections eight through eleven cover the physical layer — how schemas are designed, how indexes make reads fast, how the planner turns a query into an execution plan, and what transactions guarantee under concurrency. Section twelve is about the small list of SQL pitfalls that catch even experienced people. Sections thirteen through fifteen step outside the relational world, surveying the NoSQL families, the OLTP/OLAP split, and the columnar engines that now dominate analytics. The last section brings it back to ML — feature stores, training-set extraction, experiment tracking, and the surprisingly large database footprint of a well-run modelling team.

Conventions: SQL examples are written in a portable dialect, with notes where PostgreSQL, SQLite, and BigQuery differ meaningfully. Database references assume the current (2026) state of the ecosystem. The goal is not to make anyone a DBA but to give an ML practitioner enough literacy to read production queries, reason about their performance, and know which kind of database a new problem actually calls for.

Contents

  1. What databases do for ML practitioners · Motivation
  2. The relational model · Tables, rows, types, keys
  3. SQL, the shape of a query · DDL, DML, SELECT
  4. Joins, the operation SQL exists for · Inner, outer, semi, anti
  5. Aggregation, GROUP BY, and HAVING · Counting done right
  6. Window functions · The analyst's secret weapon
  7. CTEs, subqueries, and composition · Readable queries
  8. Schema design and normalization · 1NF, 2NF, 3NF, the star
  9. Indexes, or how queries get fast · B-trees, hash, GIN, covering
  10. Query planning and EXPLAIN · Cost-based, scans, joins
  11. Transactions, ACID, isolation · Concurrency without tears
  12. Nulls, dates, and the quiet foot-guns · Where SQL surprises you
  13. NoSQL, briefly and honestly · Document, KV, wide-column, graph
  14. OLTP vs. OLAP, two worlds · Row-store vs. column-store
  15. Columnar analytics engines · DuckDB, BigQuery, Snowflake
  16. Where databases meet ML · Features, training sets, tracking

Section 01

What databases do for ML practitioners

The data that feeds a model almost never arrives from thin air. It arrives from a database — a transactional one that records what users did, an analytics one that aggregates it, or an event log that is on its way to being one. Fluency in SQL is the single highest-leverage technical skill a data-science practitioner can learn, because every downstream artefact begins with the rows you manage to pull out.

Pandas is a great tool, and notebooks are a great medium, but they describe computations on data that is already in memory. The question a real project has to answer first is where that data comes from, how it is selected, how it is joined to the other data it needs, and how much of it there is. These questions are answered in the language of databases, and the answers shape everything downstream — the size of your training set, the latency of your feature pipeline, the reproducibility of your experiments, the cost of your cloud bill.

This chapter is a working field guide. The relational core comes first, because relational databases are still the workhorse of most organisations and because SQL is the one data language that every ML engineer, analyst, and statistician on a team will share. The later sections cover the NoSQL families, the OLTP/OLAP split, and the columnar analytics engines (DuckDB, BigQuery, Snowflake, ClickHouse) that have quietly become the default for data-science work. The final section brings it back to ML specifically — feature stores, training-set extraction, experiment tracking, and the surprisingly large database footprint of any serious modelling pipeline.

Key idea

The best ML practitioners are usually very good at SQL. Not because databases are glamorous, but because the bulk of what separates a working model from a stuck one is the upstream work of selecting, joining, and shaping data — and that work happens, overwhelmingly, in SQL.

Section 02

The relational model, rows and types and the shapes between

The relational model is Edgar Codd's 1970 idea that data should live in tables — unordered sets of rows with a fixed shape — and that everything else you want to know should be computed from relationships between tables. Fifty-five years later, most of the world's important data still lives there, because the model is astonishingly good at keeping its promises.

What a table is

A table (relation, in Codd's vocabulary) is a set of rows with a named, typed, ordered list of columns. Each row is a single record with one value per column; each column has a name and a data type (integer, text, timestamp, boolean, decimal, JSON, and a few more). Unlike a spreadsheet, a relational table is an unordered set — there is no "row 1 is special because it is first." If you want order, you ask for it.

Keys, uniqueness, and references

Two kinds of columns do disproportionate work. A primary key is the column (or small tuple of columns) that uniquely identifies a row — usually an integer id, sometimes a natural key like an ISO country code. A foreign key is a column that points at another table's primary key, declaring "this row belongs to that row." The database enforces the reference: you cannot insert an order for a nonexistent customer, and you cannot delete the customer without telling the database what to do with their orders. These constraints are not decoration; they are the reason relational data stays consistent even when many people are writing to it at once.
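The enforcement is observable in a few lines. Below is a minimal sketch using Python's built-in sqlite3 module; the customers/orders tables and their rows are invented for illustration, and note that SQLite only checks foreign keys once you ask it to with a PRAGMA:

```python
import sqlite3

# In-memory database; SQLite enforces foreign keys only when asked.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_usd   REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (customer_id, total_usd) VALUES (1, 19.99)")  # fine

# An order pointing at a nonexistent customer is rejected outright.
try:
    conn.execute("INSERT INTO orders (customer_id, total_usd) VALUES (99, 5.00)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The rejection happens at the database layer, before any application code gets a chance to reason about it: that is the point of declaring the constraint rather than checking it by hand.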

Types and domains

Columns have types, and databases enforce them. An INTEGER column rejects a string; a DATE column rejects "N/A". This feels restrictive compared to a CSV or a pandas DataFrame — where anything can be anywhere — until you inherit a six-year-old dataset where someone wrote "about 40" in an integer column. The types are a schema contract, and the contract is cheap insurance.

A few type distinctions are worth knowing. VARCHAR(n) vs. TEXT — modern engines treat them almost identically; pick the one your style guide prefers. FLOAT vs. NUMERIC(p, s) — always use NUMERIC (or DECIMAL) for money and exact quantities; floats round in unexpected places. TIMESTAMP vs. TIMESTAMP WITH TIME ZONE — pick the second unless you have a very specific reason, and keep everything in UTC at the storage layer.
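The float-vs-decimal advice is easy to demonstrate outside any database. A quick sketch with Python's standard decimal module, which is what a NUMERIC/DECIMAL column gives you at the storage layer:

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly,
# so "obvious" money arithmetic quietly drifts.
print(0.10 + 0.20)            # 0.30000000000000004
print(0.10 + 0.20 == 0.30)    # False

# Decimal arithmetic is exact for decimal fractions,
# which is why NUMERIC/DECIMAL is the right type for money.
print(Decimal("0.10") + Decimal("0.20"))                     # 0.30
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True
```

The same drift happens inside a FLOAT column; it is just harder to notice until a reconciliation report is off by a cent.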

Key idea

The relational model's power is that it separates what the data means (the schema, the keys, the constraints) from how you query it (SQL, the planner, the indexes). Both sides can evolve independently, which is why a schema written in 1985 can still be queried effectively in 2026.

Section 03

SQL, the shape of a query

SQL is a declarative language: you describe the result you want, not the steps to compute it. A query planner decides the steps. This separation is what lets a 1974 query still run sensibly on a 2026 database — the syntax says what, the engine says how.

The three kinds of SQL

SQL statements fall into three loose families. DDL (data-definition language) creates, alters, and drops objects — CREATE TABLE, ALTER TABLE, DROP INDEX. DML (data-manipulation language) changes rows — INSERT, UPDATE, DELETE, MERGE. Queries — the SELECT statement and its relatives — read the database. Analytics and ML work is almost entirely the third, with an occasional CREATE TABLE AS SELECT ... to materialise a result.

The anatomy of a SELECT

The canonical clauses of a SELECT, in the order you write them, are:

SELECT   column_list               -- what to return
FROM     table_or_subquery          -- where rows come from
JOIN     other_table ON ...         -- combined with which other rows
WHERE    predicate                  -- filter individual rows
GROUP BY grouping_columns           -- collapse into groups
HAVING   group_predicate            -- filter groups
ORDER BY sort_keys                  -- order the final rows
LIMIT    n                          -- and take only this many

The important surprise is that SQL executes these clauses in a different order than you write them. Logically: FROM and JOIN first (assemble rows), then WHERE (filter), then GROUP BY (collapse), then HAVING (filter groups), then SELECT (project the columns you want), and finally ORDER BY and LIMIT. Knowing the logical order explains why you can't use an alias from the SELECT clause in a WHERE (the alias doesn't exist yet) but you can in an ORDER BY (it does).

A first real query

SELECT
    c.country,
    COUNT(*)                     AS order_count,
    SUM(o.total_usd)             AS total_revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.placed_at >= DATE '2026-01-01'
  AND o.status = 'fulfilled'
GROUP BY c.country
HAVING SUM(o.total_usd) > 10000
ORDER BY total_revenue DESC
LIMIT 20;

Every analytics query you will ever write is some elaboration of this template. The mental habit that makes SQL tractable is reading it in logical order: first understand what rows you are starting from (FROM/JOIN), then what subset survives (WHERE), then how they get collapsed (GROUP BY/aggregates), and only then what the final shape looks like.
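The template can be exercised end-to-end against an in-memory SQLite database. This is a sketch with invented data; SQLite has no DATE literal, but ISO-8601 strings compare correctly as text, so plain strings stand in for the DATE syntax above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER,
        total_usd REAL, placed_at TEXT, status TEXT
    );
    INSERT INTO customers VALUES (1, 'DE'), (2, 'DE'), (3, 'FR');
    INSERT INTO orders VALUES
        (1, 1, 12000, '2026-02-01', 'fulfilled'),
        (2, 2,  4000, '2026-03-10', 'fulfilled'),
        (3, 3,  9000, '2026-01-15', 'fulfilled'),
        (4, 3, 50000, '2025-12-31', 'fulfilled'),
        (5, 1,  7000, '2026-04-02', 'cancelled');
""")

rows = conn.execute("""
    SELECT c.country, COUNT(*) AS order_count, SUM(o.total_usd) AS total_revenue
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.placed_at >= '2026-01-01' AND o.status = 'fulfilled'
    GROUP BY c.country
    HAVING SUM(o.total_usd) > 10000
    ORDER BY total_revenue DESC
""").fetchall()

print(rows)  # [('DE', 2, 16000.0)]
```

Walking the logical order by hand: the 2025 order dies in the WHERE, the cancelled order dies in the WHERE, FR's remaining $9,000 dies in the HAVING, and only Germany survives to the SELECT.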

Rule of thumb

When a SQL query confuses you, rewrite it with every clause on its own line, indent the JOINs, and fully qualify every column (customers.country, not country). You will catch a surprising fraction of bugs in the rewriting.

Section 04

Joins, the operation SQL exists for

A join combines rows from two tables on a condition — usually "these rows are related because one's foreign key matches the other's primary key." Joins are the operation that makes the relational model pay: you factor your data into clean tables, and at query time you stitch it back together in whatever shape the question wants.

The standard joins

Four standard joins and two lesser-known ones cover the territory.

INNER JOIN returns only rows that match on both sides. orders INNER JOIN customers returns one row per order, with the customer's columns attached; orders whose customer record is missing are silently dropped. This is the join you want roughly 80% of the time.

LEFT JOIN (also LEFT OUTER JOIN) returns every row from the left table, with NULLs filled in where the right side has no match. Useful for "every customer, and their orders if they have any" — the unmatched customers survive into the result with NULL in the order columns. RIGHT JOIN is the mirror image and is rarely used in practice because flipping the tables and using LEFT JOIN reads more clearly.

FULL OUTER JOIN returns every row from both sides, matched where it can, with NULLs elsewhere. It shows up in data-reconciliation work — comparing two snapshots of the same table — and almost nowhere else.

CROSS JOIN returns the Cartesian product: every row of the left table paired with every row of the right. It is a foot-gun when accidental (a missing ON clause in an older dialect will silently produce one), and a useful idiom when deliberate (generating a date spine, pairing users with every product in a catalogue).

Semi-joins and anti-joins

The two less obvious joins are the ones that express membership rather than attachment. A semi-join returns rows from A that have at least one match in B, without duplicating A's rows for each match. An anti-join returns rows from A that have no match in B. SQL writes them with EXISTS and NOT EXISTS:

-- Customers who placed at least one order in 2026 (semi-join)
SELECT c.*
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id
    AND o.placed_at >= DATE '2026-01-01'
);

-- Customers who placed no order in 2026 (anti-join)
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id
    AND o.placed_at >= DATE '2026-01-01'
);

You can write the anti-join as LEFT JOIN ... WHERE right_table.id IS NULL, and on many engines the planner produces the same execution plan. The NOT EXISTS form reads more like what you mean, which matters when someone else has to maintain it.
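The equivalence of the two anti-join spellings can be checked directly. A sketch with invented data, using Python's sqlite3; note that in the LEFT JOIN form the date filter must live in the ON clause, not the WHERE, or matched-but-filtered customers silently vanish:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, placed_at TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, '2026-02-01'), (2, 3, '2025-06-15');
""")

# Anti-join, NOT EXISTS form: customers with no 2026 order.
not_exists = conn.execute("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o
        WHERE o.customer_id = c.id AND o.placed_at >= '2026-01-01'
    )
    ORDER BY c.name
""").fetchall()

# Anti-join, LEFT JOIN / IS NULL form: the date condition is part of
# the join, so non-matching customers survive with NULL order columns.
left_join = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o
        ON o.customer_id = c.id AND o.placed_at >= '2026-01-01'
    WHERE o.id IS NULL
    ORDER BY c.name
""").fetchall()

print(not_exists)               # [('Bo',), ('Cy',)]
print(not_exists == left_join)  # True
```

Bo never ordered and Cy's only order is from 2025, so both forms return exactly those two; Ada's 2026 order excludes her from each.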

Join order and cardinality

The performance question in any multi-join query is usually "how many rows are flowing between joins?" If each join step expands the row count by 10x, a five-step chain is 100,000x; if each step shrinks it, the query is cheap. Most planners are smart about reordering joins on their own, but a predicate on the most selective table at the top (WHERE early) is a gift: it tells the planner where the small set lives, and the rest of the query rides on that.

Rule of thumb

Name your joins by cardinality as you write them. "Each order has exactly one customer" is a 1-to-1 attach; "each customer has many orders" is a 1-to-many fan-out. A query that unintentionally chains 1-to-many fan-outs across three tables can multiply rows so fast that an aggregate at the bottom is off by orders of magnitude.

Section 05

Aggregation, GROUP BY, and HAVING

Aggregation collapses many rows into one, by group. It is the step that turns raw event data into the kind of summary humans and dashboards actually consume — revenue per country, click-through rate per experiment, loss per epoch. Most analytics queries end in an aggregate; understanding how GROUP BY works is what separates good SQL from cargo-cult SQL.

The aggregation functions

The standard list is short: COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE. Modern engines add ARRAY_AGG (collect values into an array), STRING_AGG (concatenate with a delimiter), APPROX_COUNT_DISTINCT (HyperLogLog — fast and approximate), and PERCENTILE_CONT/PERCENTILE_DISC (quantiles). Knowing the fine print matters: the difference between COUNT(*) (every row, regardless of NULLs) and COUNT(column) (only rows where column is not NULL) catches people at every level.
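The COUNT distinction fits in one query. A minimal sketch against an invented single-column table in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# COUNT(*) counts rows; COUNT(x) skips rows where x IS NULL;
# COUNT(DISTINCT x) additionally deduplicates.
star, col, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(x), COUNT(DISTINCT x) FROM t"
).fetchone()
print(star, col, distinct)  # 3 2 2
```

Three rows went in, but only two have a non-NULL x, which is exactly the gap that makes per-column counts disagree with row counts in real data.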

GROUP BY, the rule and the habit

GROUP BY partitions rows by one or more columns, and the aggregates are computed within each partition. The rule the standard enforces, and that every engine agrees on, is: every non-aggregated column in the SELECT list must appear in the GROUP BY. If you try to SELECT country, name, SUM(revenue) FROM ... GROUP BY country, the engine will refuse (Postgres, BigQuery, DuckDB) or silently pick a "random" name (MySQL in lax mode) — the second behaviour is what nightmares are made of.

HAVING vs. WHERE

WHERE filters rows before grouping; HAVING filters groups after grouping. "Countries with more than ten thousand dollars in revenue" is a HAVING on an aggregate; "orders placed after January 1" is a WHERE on a row. Mix them up and you either filter too early (losing rows that would have contributed to a group) or too late (forcing the engine to group rows it will throw away).

SELECT
    customer_id,
    COUNT(*)                   AS sessions,
    AVG(duration_seconds)      AS avg_duration,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_seconds) AS median_duration
FROM sessions
WHERE started_at >= DATE '2026-03-01'          -- filter rows
GROUP BY customer_id
HAVING COUNT(*) >= 5                            -- filter groups
ORDER BY avg_duration DESC;

The arithmetic of averages

A trap that never stops catching people: the average of averages is not the average. If country A has 1000 orders averaging $20 and country B has 10 orders averaging $200, the average of the two averages is $110, but the true mean across orders is about $21.78 ($22,000 over 1,010 orders). Always compute AVG from the underlying rows, not from pre-aggregated per-group averages — or if you must aggregate in stages, carry the counts and use a weighted sum at the top.
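The arithmetic is worth doing once by hand. A sketch of the example above in plain Python, with the weighted correction spelled out:

```python
# Country A: 1000 orders averaging $20; country B: 10 orders averaging $200.
groups = [(1000, 20.0), (10, 200.0)]   # (order_count, avg_order_value)

# Naive: average the per-group averages, ignoring group sizes.
avg_of_avgs = sum(avg for _, avg in groups) / len(groups)

# Correct: weight each group's average by its row count.
true_mean = sum(n * avg for n, avg in groups) / sum(n for n, _ in groups)

print(avg_of_avgs)           # 110.0
print(round(true_mean, 2))   # 21.78
```

The naive number is off by a factor of five because the tiny group gets the same vote as the huge one; the weighted sum is what AVG over the raw rows would have computed.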

Rule of thumb

Before running a GROUP BY on a query you did not write, guess the row count. Is it one row per customer? One per day? One per customer-day? Being wrong by one key is the single most common source of "wait, why do I have two billion rows" queries.

Section 06

Window functions, the analyst's secret weapon

A window function is an aggregate that does not collapse its rows. It computes "the average over this customer's last seven orders" and attaches the answer to each of those rows, without losing the rows themselves. Learning window functions is the single biggest step from novice SQL to analyst-grade SQL.

The OVER clause

A window function is any aggregate or ranking function followed by OVER (...). The OVER clause defines the window: which rows the function sees for the current row. Three optional pieces: PARTITION BY splits the input into groups (like GROUP BY but without collapsing), ORDER BY orders rows within each partition, and a frame clause (ROWS BETWEEN ... AND ...) decides which subset of the ordered partition is actually visible.

SELECT
    order_id,
    customer_id,
    total_usd,
    placed_at,
    -- lifetime spend per customer, attached to every row
    SUM(total_usd) OVER (PARTITION BY customer_id) AS lifetime_spend,
    -- this customer's N-th order, 1-indexed
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY placed_at) AS order_rank,
    -- trailing 7-order moving average of the order total
    AVG(total_usd) OVER (
      PARTITION BY customer_id
      ORDER BY placed_at
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS trailing_avg_7
FROM orders;

The ranking family

Three ranking functions do more work than all the rest combined. ROW_NUMBER() gives every row a unique 1-indexed position within its partition — perfect for "pick the most recent event per user." RANK() and DENSE_RANK() give ties the same rank, differing only in whether they skip the next value (RANK) or not (DENSE_RANK). Combined with PARTITION BY and ORDER BY, they express almost all "top-N per group" queries.

LAG and LEAD

LAG(column, n) returns the value of column from the row n positions before the current one within the window; LEAD is the same thing looking forward. They are how you compute session durations (next_event.timestamp - LAG(timestamp)), churn flags (LEAD(session_start) IS NULL), and any "compare this row to its neighbour" logic that would require a painful self-join without them.

Running totals and moving averages

Once you have frames, running totals (SUM(x) OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) and moving averages fall out naturally. Most analytics dashboards you have ever seen were built on exactly this pattern, and knowing how to write it straight — rather than via seven nested subqueries — is what makes an analyst fast.
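The running-total pattern runs as-is on SQLite (window functions need SQLite 3.25+, which every Python 3.8+ build bundles). A sketch with an invented three-day sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (day TEXT, amount REAL);
    INSERT INTO daily_sales VALUES
        ('2026-03-01', 10), ('2026-03-02', 20), ('2026-03-03', 30);
""")

# Each row keeps its own amount and gains a cumulative sum up to that day.
rows = conn.execute("""
    SELECT day,
           amount,
           SUM(amount) OVER (
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM daily_sales
""").fetchall()

for row in rows:
    print(row)
# ('2026-03-01', 10.0, 10.0)
# ('2026-03-02', 20.0, 30.0)
# ('2026-03-03', 30.0, 60.0)
```

Note that all three input rows survive into the output; a GROUP BY formulation of the same cumulative sum would need a self-join or a correlated subquery and would still collapse the detail.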

Key idea

Think of GROUP BY as the operation that collapses rows and window functions as the operation that attaches a per-group computation to rows without collapsing them. Which one you need is a function of whether the consumer wants one row per group or all the underlying rows with the summary stapled on.

Section 07

CTEs, subqueries, and composition

A 200-line SQL query that looks like a single SELECT is almost always a nightmare; the same query factored into named intermediate results reads like an outline. The tools for that factoring are common table expressions (CTEs) and subqueries, and using them well is what turns long SQL from a wall into a staircase.

Common table expressions

A CTE is a named query that exists only for the duration of the surrounding statement. You introduce it with a WITH clause, give it a name, and then reference it downstream as if it were a table. The effect is exactly like declaring a local variable in a Python script — you push the messy parts into named boxes and compose the final result out of them.

WITH active_customers AS (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE placed_at >= DATE '2026-01-01'
),
per_customer_spend AS (
    SELECT
        o.customer_id,
        SUM(o.total_usd) AS spend_ytd,
        COUNT(*)         AS orders_ytd
    FROM orders o
    JOIN active_customers ac ON ac.customer_id = o.customer_id
    WHERE o.placed_at >= DATE '2026-01-01'
    GROUP BY o.customer_id
)
SELECT
    c.name,
    c.country,
    p.spend_ytd,
    p.orders_ytd
FROM per_customer_spend p
JOIN customers c ON c.id = p.customer_id
ORDER BY p.spend_ytd DESC
LIMIT 100;

The same query without CTEs would be three nested subqueries, each harder to read than the last. With CTEs, the pipeline reads top-down: which customers are active, then how much did they spend, then join with customer details and take the top N.

Scalar subqueries

A subquery that returns exactly one row and one column can be used anywhere a literal can — the SELECT list, the WHERE, a CASE. "What fraction of this user's orders were fulfilled?" is a scalar subquery in a projection; "customers who spent above the overall median" is a scalar subquery in a WHERE. The engine is usually smart enough to compute the scalar once and reuse it, but if performance matters, promote the scalar to a CTE so the plan is explicit.

Recursive CTEs

SQL's underappreciated power move is the recursive CTE: a CTE that references itself. It expresses tree and graph traversals that would otherwise require application code — exploding a bill-of-materials, walking an org hierarchy, computing transitive dependencies. The syntax is heavy (WITH RECURSIVE cte AS (anchor UNION ALL recursive_case)) but the capability is the reason SQL can, with some effort, replace a whole graph query language for simple cases.
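An org-hierarchy walk makes the anchor/recursive-case split concrete. A minimal sketch with an invented four-person employees table, run through sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),
        (2, 'Bo',  1),
        (3, 'Cy',  1),
        (4, 'Dee', 2);
""")

rows = conn.execute("""
    WITH RECURSIVE reports(id, name, depth) AS (
        -- anchor: the person whose org we are exploding
        SELECT id, name, 0 FROM employees WHERE id = 1
        UNION ALL
        -- recursive case: everyone managed by someone already in the set
        SELECT e.id, e.name, r.depth + 1
        FROM employees e
        JOIN reports r ON e.manager_id = r.id
    )
    SELECT name, depth FROM reports
    ORDER BY depth, name
""").fetchall()

print(rows)  # [('Ada', 0), ('Bo', 1), ('Cy', 1), ('Dee', 2)]
```

The engine repeats the recursive case until it produces no new rows, so the query terminates on any finite tree; on a graph with cycles you need to carry a visited-path column to break the loop yourself.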

Materialisation and re-reading

One subtlety worth knowing: different engines treat CTEs differently. Postgres 11 and earlier always materialised CTEs (computed them once, stored the result, and read it back); Postgres 12+ and most other engines inline them, treating the CTE like a named view and letting the planner push predicates in. In most cases the inline behaviour is what you want; when you specifically want materialisation (to force a cheap result to be computed once), use WITH cte AS MATERIALIZED (...).

Rule of thumb

If a query has more than two nested SELECTs, rewrite it with CTEs. The result is almost always the same logic, the readability is always better, and the plan is usually identical. Your six-months-older self will send thanks.

Section 08

Schema design and normalization

Schema design is the act of deciding what tables exist and how they relate. The classical discipline for doing it well is normalization — the progressive decomposition of one big wide table into many smaller tables, each with a single subject. A well-normalized schema is cheap to update and expensive to be wrong about; a denormalized one is fast to read and easy to corrupt.

The first three normal forms

The normal forms are Codd's ladder of schema rigour. A table is in 1NF if every column holds an atomic value — no comma-separated lists, no embedded JSON standing in for three fields. It is in 2NF if, additionally, every non-key column depends on the whole primary key (this only matters for composite keys). It is in 3NF if, additionally, every non-key column depends only on the primary key and not on another non-key column — if zip_code determines city, then city should not also live in a table whose key is customer_id; it belongs in a zip_codes table.

3NF is where most transactional schemas should land. Higher forms (BCNF, 4NF, 5NF) exist, solve rarer anomalies, and are worth knowing by name; you will not reach for them often. Normalization's purpose is to prevent update anomalies — the bug where a customer changes their email, you forget to update it in the three places it was copied, and now your database disagrees with itself.

When to denormalize

Analytics workloads often deliberately denormalize. The star schema — one central "fact" table (orders, impressions, events) surrounded by descriptive "dimension" tables (customers, products, regions) — trades strict 3NF for query simplicity: every analytics question is "aggregate the facts, joined to the dimensions you want to group by." The snowflake variant normalizes the dimensions one level further; in practice, stars are more common. For high-throughput reads where joins are expensive, you sometimes also go further: materialize a denormalized view that embeds the dimensional attributes directly into the fact rows.

Surrogate keys and natural keys

A natural key is a real-world identifier (ISBN, country code, social security number). A surrogate key is an opaque integer or UUID generated by the database. Modern practice favours surrogates for primary keys — they are fixed-size, stable under business changes, and never accidentally encode semantics (an ISBN that turns out not to be unique, a country code that gets retired). Keep the natural key as a unique constraint on another column; use the surrogate for joins.

Evolving a schema

Schemas change. The discipline of managing those changes — migrations — is as important as the schema itself. Tools like Flyway, Liquibase, Alembic, and sqitch make each change a versioned, reversible script that runs in order on every environment. The two habits that keep migrations painless: write each one to be independently runnable on a database of any size, and never edit an existing migration — add a new one that undoes or extends it.

Rule of thumb

Start normalized. Denormalize when a concrete query proves that the joins are too slow, and document the trade. A schema that was denormalized speculatively ages into a tangle of inconsistent copies; a schema that was denormalized with receipts ages into a fast, legible warehouse.

Section 09

Indexes, or how queries get fast

An index is a secondary data structure that lets the database find rows by a column's value without scanning the whole table. The difference between a query that takes 2 ms and one that takes 20 seconds is, in roughly 90% of cases, whether the right index exists. Understanding them — at least at the shape level — is the highest-leverage SQL literacy there is.

The B-tree, workhorse of databases

The default index type on every major relational engine is a B-tree — or more precisely a B+tree, which stores data only in the leaves. A B-tree is a balanced tree in which each node holds many keys (typically hundreds, sized to one disk page) and children sorted by those keys. A lookup walks from root to leaf following the key ranges, touching only O(log n) pages along the way. For a billion-row table, that is about four disk reads instead of a billion.

[Figure: a B+tree index, showing a root page, internal pages, and sorted, doubly-linked leaf pages; a lookup for key = 49 descends root → internal → leaf in about three page reads.]
A B+tree index. The root holds the coarsest key ranges, internal nodes narrow them down, and the sorted leaves hold (or point to) the row data. A lookup for any key descends O(log n) levels. Because leaves are doubly-linked, range scans (WHERE key BETWEEN 40 AND 80) walk neighbouring leaves in order without re-descending the tree.

What indexes help with

A B-tree index on column x makes every query with a predicate of the form x = ?, x < ?, x BETWEEN ? AND ?, or ORDER BY x ... LIMIT n much faster. It does not help with WHERE f(x) = ?, unless you build an expression index on f(x), and it does not help with WHERE x LIKE '%pattern' (leading wildcard; the sort order cannot be used).

Composite indexes and the left-prefix rule

A composite index on (a, b, c) can serve queries that filter on a, or a and b, or a and b and c — always from the left. A query that filters only on c cannot use it; a query that filters on a and c can use the a part but not the c part. The mnemonic is "left prefix," and getting the column order right is how a good index helps ten queries at once.
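The left-prefix rule is visible in SQLite's plan output. A sketch with an invented table and composite index, using EXPLAIN QUERY PLAN (SQLite's lightweight cousin of EXPLAIN; the exact detail strings vary slightly across SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (a INTEGER, b INTEGER, c INTEGER, d TEXT);
    CREATE INDEX events_abc ON events (a, b, c);
""")

def plan(sql):
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filters on a left prefix (a, b) can use the composite index...
print(plan("SELECT * FROM events WHERE a = 1 AND b = 2"))
# ...SEARCH ... USING INDEX events_abc (a=? AND b=?)

# ...but a filter on c alone cannot, and falls back to a full scan.
print(plan("SELECT * FROM events WHERE c = 3"))
# ...SCAN events
```

Same index, same table: the only difference is whether the predicate starts at the left edge of the index's column list.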

Other index types

Beyond B-trees, most engines offer: hash indexes (equality only, faster than B-tree for pure =, useless for ranges), GIN and GiST (Postgres, for JSON, arrays, full-text, geospatial), BRIN (block-range indexes on huge, naturally-ordered tables — a time-series table by timestamp is the canonical case), and bitmap indexes (classical analytics databases, low-cardinality columns).

The cost of indexes

Indexes make reads fast and writes slow. Every INSERT, UPDATE, and DELETE must update every index on the affected table. A table with ten indexes takes roughly ten times the write work of the same table with one. Analytics tables can afford many indexes; high-write transactional tables cannot. The rule is: index deliberately, measure, and be willing to drop indexes that are not earning their cost.

Rule of thumb

The first index on any table should cover its most common WHERE/JOIN predicates. The second should cover the next-most-common. By the fourth or fifth, stop and check whether they are actually being used — most engines expose index-usage statistics (pg_stat_user_indexes in Postgres, for example) that will tell you.

Section 10

Query planning and EXPLAIN

SQL is declarative, but the engine executes an imperative plan. The query planner takes your query, estimates the cost of every reasonable way to answer it, and picks the cheapest. Reading the plan — with the EXPLAIN statement — is how you understand why a query is fast or slow, and how you fix it when it is slow.

How the planner thinks

A plan is a tree of physical operators: sequential scans, index scans, nested-loop joins, hash joins, merge joins, sorts, aggregates, limits. The planner walks the logical query, generates candidate plans, estimates their cost using table statistics (row counts, distinct values, histograms) and per-operator cost models, and picks the cheapest. Crucially, the estimates are estimates; if the statistics are stale or the data is skewed, the planner will sometimes pick badly, and the fix is to update statistics (ANALYZE), add an index, or rewrite the query to nudge the plan in a better direction.

Reading EXPLAIN

Every major engine supports EXPLAIN, which shows the plan, and most also support EXPLAIN ANALYZE (Postgres, MySQL 8+, DuckDB), which actually runs the query and reports real timings. Read plans from the inside out — the leaves scan tables, and their outputs flow up through joins, filters, aggregates, and finally the top-level sort or limit.

EXPLAIN ANALYZE
SELECT c.country, SUM(o.total_usd) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.placed_at >= DATE '2026-01-01'
GROUP BY c.country;

-- Postgres output, lightly edited:
HashAggregate  (cost=8421..8431 rows=12)        (actual time=184..185)
  Group Key: c.country
  -> Hash Join  (cost=220..8301 rows=48000)      (actual time=4..140)
       Hash Cond: (o.customer_id = c.id)
       -> Index Scan using orders_placed_at_idx on orders o
            (cost=0..7800 rows=48000)            (actual time=0.1..80)
            Index Cond: (placed_at >= '2026-01-01')
       -> Hash  (cost=120..120 rows=8000)       (actual time=3..3)
            -> Seq Scan on customers c

The three things to look for

Three patterns cover most plan-reading: sequential scans on large tables (often a missing index); large row-count estimates that are wildly off (stale statistics; run ANALYZE); and nested-loop joins on big inputs (the planner chose the wrong join strategy, usually because it underestimated a row count). Fixing any of these can take a query from seconds to milliseconds.

Join strategies, briefly

Three physical joins appear over and over. Nested loop — for each row of the outer table, look up matches in the inner; fast for small inputs or when the inner is indexed on the join key. Hash join — build a hash table of the smaller side, probe it with the larger; the planner's default for medium-to-large joins. Merge join — both inputs are sorted on the join key, walk them together; cheapest of all when inputs happen to be already sorted. Knowing which one the planner picked tells you a lot about why your query performs the way it does.

Rule of thumb

When a query surprises you with its speed or slowness, run EXPLAIN ANALYZE and read the plan before theorising. The plan's actual row counts are the ground truth; everything you believed about the shape of the data is a guess until you check it.

Section 11

Transactions, the unit of correctness

A database serving many clients at once must decide what concurrent writers see of each other's in-progress work, and what survives a crash halfway through a multi-statement change. Transactions answer both questions, and the guarantees they carry — summarised by the acronym ACID — are what separate a database from a directory of files.

The ACID properties

ACID is the list of guarantees a transactional database gives a group of statements that happen inside BEGIN ... COMMIT. Atomicity — the group either all happens or none of it does; a crash mid-way leaves the database as if nothing was attempted. Consistency — the database's invariants (constraints, foreign keys) hold before and after, even though they may be violated in the middle. Isolation — concurrent transactions do not see each other's in-progress work; the result is as if they had run one after another. Durability — once COMMIT returns, the change survives a crash, power failure, or kernel panic.

Why atomicity matters

Consider the classic transfer: debit one account, credit another. Without atomicity, a crash between the two statements leaves money vanished. With it, the two statements are either both applied or both rolled back; there is no state of the database in which only one happened. Every multi-statement update to related rows should live inside a transaction for the same reason.

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
UPDATE accounts SET balance = balance + 100 WHERE id = 77;
COMMIT;    -- both rows updated, or neither
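The same transfer can be driven from Python against SQLite to watch the rollback actually happen — an in-memory sketch with invented accounts, using sqlite3's connection context manager (commit on success, rollback on any exception):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [(42, 500), (77, 100)])
db.commit()

def transfer(db, src, dst, amount, crash_midway=False):
    with db:  # one transaction: commit on success, rollback on exception
        db.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                   (amount, src))
        if crash_midway:
            raise RuntimeError("simulated crash between the two statements")
        db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                   (amount, dst))

try:
    transfer(db, 42, 77, 100, crash_midway=True)
except RuntimeError:
    pass
balances = dict(db.execute("SELECT id, balance FROM accounts"))
assert balances == {42: 500, 77: 100}   # the half-applied debit rolled back

transfer(db, 42, 77, 100)               # no crash: both updates commit
assert dict(db.execute("SELECT id, balance FROM accounts")) == {42: 400, 77: 200}
```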

Isolation levels

Isolation is a sliding scale. The SQL standard names four levels — read uncommitted, read committed, repeatable read, serializable — each stricter and more expensive than the last. Read committed — the default in PostgreSQL, Oracle, and SQL Server (MySQL's InnoDB defaults to repeatable read) — rules out "dirty reads" (seeing another transaction's uncommitted changes). Repeatable read additionally rules out "non-repeatable reads" (the same row returning different values when read twice within one transaction). Serializable additionally rules out "phantom reads" (new rows appearing mid-transaction) and is the only level where concurrent transactions are guaranteed to produce the same result as some serial execution of them.

Most production systems run at read-committed because stricter levels cost throughput. The trade is worth knowing: if you need to compute a multi-step financial aggregate from a consistent snapshot, serializable (or an explicit transaction with FOR UPDATE locking) is what prevents you from reading half-updated rows.

Deadlocks and optimistic concurrency

Two transactions that each hold a lock the other wants will deadlock; most engines detect this and abort one with an error ("deadlock detected, transaction rolled back"). The application is expected to retry. An alternative pattern is optimistic concurrency: read a row with its version number, do your work, write back conditioned on the version; if the write fails because someone else updated the row first, retry. This works well when conflicts are rare — which, in most systems, they are.
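The optimistic pattern fits in a dozen lines of SQL driven from Python — a sketch against SQLite with an invented docs table and version column; the UPDATE is conditioned on the version read earlier, and zero rows affected means the caller lost the race:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
db.execute("INSERT INTO docs VALUES (1, 'draft', 1)")

def update_doc(db, doc_id, new_body):
    # Read the current version, then write back conditioned on it.
    body, version = db.execute(
        "SELECT body, version FROM docs WHERE id = ?", (doc_id,)).fetchone()
    cur = db.execute(
        "UPDATE docs SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?", (new_body, doc_id, version))
    return cur.rowcount == 1     # False => lost the race, caller should retry

assert update_doc(db, 1, "v2") is True      # version is now 2

# Simulate a concurrent writer bumping the version under our feet:
db.execute("UPDATE docs SET version = version + 1 WHERE id = 1")
stale = db.execute("UPDATE docs SET body = 'late' WHERE id = 1 AND version = 2")
assert stale.rowcount == 0   # version is now 3, so the stale write misses
```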

Key idea

A transaction is the smallest unit of correctness in a database. Everything the database promises about consistency holds at transaction boundaries and nowhere else. Wrap related writes in one, keep them short, and let the database do the locking so you do not have to.

Section 12

Nulls, dates, and the quiet foot-guns

SQL is an old language, and it carries a few design decisions that surprise every practitioner at least once. Most of the bugs that reach production are not about joins or indexes; they are about NULL, timezones, and the fact that 'Abel' and 'abel' may or may not be the same string. Knowing the foot-guns is half of becoming fluent.

NULL is not a value, and = does not compare it

NULL represents "unknown" — not zero, not empty string, not false. The comparison x = NULL does not return TRUE; it returns NULL, which in a WHERE clause is indistinguishable from false. That means WHERE column = NULL silently returns zero rows, even on rows where the column is in fact NULL. The correct test is IS NULL. This is the single most common SQL bug. Most dialects will accept the malformed comparison without warning.

NULL also propagates through arithmetic: NULL + 1 is NULL, NULL OR TRUE is TRUE (because one operand is enough to decide), but NULL OR FALSE is NULL. This three-valued logic — true, false, unknown — is mathematically clean and operationally surprising. Defend against it with COALESCE(column, default_value), which returns the first non-NULL argument.
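These rules can be checked live. A minimal sqlite3 session (SQLite maps SQL NULL to Python's None; the toy table is invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x INTEGER)")
db.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])

# x = NULL is never TRUE, so it matches zero rows -- even the NULL row:
assert db.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0] == 0
# IS NULL is the correct test:
assert db.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0] == 1
# NULL propagates through arithmetic...
assert db.execute("SELECT NULL + 1").fetchone()[0] is None
# ...and COALESCE substitutes a default:
assert db.execute("SELECT COALESCE(NULL, 0)").fetchone()[0] == 0
```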

NULL in joins and aggregates

Aggregates ignore NULL: COUNT(column) counts non-NULL values, while COUNT(*) counts all rows including those where every column is NULL. AVG over a column with NULLs averages the non-NULL subset — which may or may not be what you want. SUM of an empty set is NULL, not zero. These are the kind of subtleties that produce "the dashboard is wrong" tickets; the cure is to know the rules and to wrap totals in COALESCE(SUM(amount), 0) when you want a guaranteed number.
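The aggregate rules, verified against SQLite with an invented payments table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (amount INTEGER)")
db.executemany("INSERT INTO payments VALUES (?)", [(10,), (None,), (20,)])

assert db.execute("SELECT COUNT(*) FROM payments").fetchone()[0] == 3       # all rows
assert db.execute("SELECT COUNT(amount) FROM payments").fetchone()[0] == 2  # non-NULL only
assert db.execute("SELECT AVG(amount) FROM payments").fetchone()[0] == 15.0 # (10+20)/2

# SUM over an empty set is NULL, not zero -- COALESCE guarantees a number:
empty = "SELECT SUM(amount) FROM payments WHERE amount > 99"
assert db.execute(empty).fetchone()[0] is None
safe = "SELECT COALESCE(SUM(amount), 0) FROM payments WHERE amount > 99"
assert db.execute(safe).fetchone()[0] == 0
```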

Outer joins manufacture NULLs deliberately, to represent rows on one side that had no match on the other. A filter on a right-hand column — such as WHERE orders.status = 'shipped' after a LEFT JOIN — will silently convert your outer join back into an inner join, because the manufactured NULLs fail the equality. If the intent is to preserve the outer behaviour, move the predicate into the join condition (ON orders.status = 'shipped') or use IS NULL explicitly.
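The conversion is easy to demonstrate — a SQLite sketch with invented customers and orders, where Bob has no orders:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("CREATE TABLE orders (customer_id INTEGER, status TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Anna"), (2, "Bob")])
db.execute("INSERT INTO orders VALUES (1, 'shipped')")   # Bob has no orders

# WHERE on the right-hand column: Bob's manufactured NULL fails, row dropped.
where = db.execute("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.status = 'shipped'
    ORDER BY c.name""").fetchall()
assert where == [("Anna",)]

# Predicate moved into ON: the join stays outer, Bob survives with NULLs.
on = db.execute("""
    SELECT c.name, o.status FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'shipped'
    ORDER BY c.name""").fetchall()
assert on == [("Anna", "shipped"), ("Bob", None)]
```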

Dates, timezones, and "today"

Time is the other great foot-gun. A column declared TIMESTAMP without qualifier is, in most engines, a wall-clock time with no timezone — ambiguous across the world and especially around daylight-saving transitions. Prefer TIMESTAMP WITH TIME ZONE (in PostgreSQL, stored as UTC and translated on display) for anything that represents a moment. Store instants in UTC; convert to local time only at the edge, for display.

"Today" is a localised concept: the boundary between Tuesday and Wednesday happens at different UTC instants in New York, London, and Tokyo. A report titled "sales today" must decide whose today. Most subtle bugs in analytics pipelines come from the quiet assumption that every timestamp was generated in the same timezone; they were not. Write it down, agree on UTC for storage, and pin down the reporting timezone in the query.

Strings, collations, and empty versus NULL

String comparison depends on collation — the rules for case, accent, and ordering. Two databases may disagree on whether 'Abel' = 'abel'; even one database may disagree with itself depending on column definitions. If case-insensitive matching matters, normalise with LOWER(column) in both the query and, if possible, a functional index.

Finally: an empty string '' is not the same as NULL (except in Oracle, which treats them identically and confuses everyone else). A column that allows both will have two distinct "missing" states, and filters that check one will miss the other. Pick a convention per column — NULL means "unknown", empty means "explicitly empty" — document it, and enforce it with a CHECK constraint if you care.
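A CHECK constraint enforcing the convention looks like this — a SQLite sketch with an invented users table where NULL means "unknown" and the empty string is banned outright:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    nickname TEXT CHECK (nickname IS NULL OR nickname <> '')  -- forbid ''
)""")
db.execute("INSERT INTO users VALUES (1, 'anna')")
db.execute("INSERT INTO users VALUES (2, NULL)")        # unknown: allowed

# The empty string is not NULL...
assert db.execute("SELECT '' IS NULL").fetchone()[0] == 0
# ...and the constraint rejects it as a second "missing" state:
try:
    db.execute("INSERT INTO users VALUES (3, '')")
    raised = False
except sqlite3.IntegrityError:
    raised = True
assert raised
assert db.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 2
```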

Rule of thumb

When a query returns a number that seems off by one, check for NULLs first, timezones second, and collation third. That is the order of how often each one is the culprit.

Section 13

NoSQL, briefly and honestly

"NoSQL" is an umbrella label for four quite different families of databases that happened to reject the relational model around the same time, for different reasons. Understanding which family a product belongs to — and what problem that family solves — is more useful than debating "SQL versus NoSQL" in the abstract, which is the wrong argument on almost every axis.

The four families

Document stores (MongoDB, Couchbase, Firestore) keep JSON-shaped documents keyed by an ID. The schema lives in the document, not in the database, so fields can vary per row and nested structures are first-class. They are natural for content with heterogeneous fields (product catalogues, user profiles with optional attributes) and awkward for relationships that span many documents. Most modern document stores have grown secondary indexes, partial transactions, and rich query languages, narrowing the practical gap with relational systems.

Key-value stores (Redis, DynamoDB, RocksDB) map a key to a blob. That is the whole interface. They are the simplest possible database, extremely fast, and serve as caches, session stores, and building blocks for higher-level systems. Queries by anything other than the key are impossible or expensive; the schema is whatever you put in the blob.

Wide-column stores (Cassandra, HBase, ScyllaDB) look like a table but behave more like a distributed sorted map. Rows are keyed by a partition key that determines which machine holds the data, and columns within a row can be added dynamically. They are built for write-heavy, horizontally-scaled workloads — time-series telemetry, event logs, product analytics — where a single-node relational database would drown. They give up flexible joins and strong transactions in return.

Graph databases (Neo4j, JanusGraph, TigerGraph, Amazon Neptune) treat nodes and edges as the primary citizens. Queries traverse the graph — "friends of friends", "paths between A and B with these constraints" — which is possible but awkward in SQL and natural in a graph query language like Cypher or Gremlin. They are the right tool for genuinely graph-shaped data (social networks, knowledge graphs, fraud rings) and usually the wrong one when the data is fundamentally tabular.

CAP, and what it really says

The CAP theorem (conjectured by Brewer in 2000, formalised by Gilbert and Lynch in 2002) states that a distributed data store cannot simultaneously provide consistency, availability, and partition tolerance — in the presence of a network partition, the system must choose either consistency or availability. In practice, partitions are inevitable, so the real choice is between CP (stay consistent, refuse writes during a partition) and AP (stay available, reconcile divergence later). Most NoSQL systems default to AP with eventual consistency: writes may temporarily disagree across replicas, but will converge given enough time and no new updates.

"Eventual" is not a synonym for "soon". In a system that has chosen availability, an application that reads its own writes must do so through the same replica, or be prepared to see stale data. This is the source of most "I just updated my profile, why does it still show the old name?" bugs.

When to reach for NoSQL

Choose a document store when schema varies per row and you need it today, not after a migration. Choose a key-value store when access is keyed and latency budgets are in single-digit milliseconds. Choose a wide-column store when write throughput exceeds what a single relational node can sustain. Choose a graph database when queries are genuinely path-shaped. Choose PostgreSQL — which now supports JSON, arrays, full-text search, time-series extensions, and horizontal scaling in various forms — when you are not sure, and upgrade only when the relational model stops fitting.

Rule of thumb

The question "should we use SQL or NoSQL?" is almost always the wrong framing. The right question is "which family of database solves this specific access pattern?" The answer, for most application workloads, is still a mature relational engine.

Section 14

OLTP vs. OLAP, two worlds

A database optimised to serve a checkout page and a database optimised to compute the weekly revenue report are, internally, almost unrelated systems — even though both may speak SQL. The split between transactional and analytical workloads is the single most important structural choice in a data stack, and the reason most mature organisations end up running both.

Transactional (OLTP) workloads

Online Transaction Processing (OLTP) is the database serving a running application — small, frequent reads and writes, each touching a handful of rows, all with strict latency expectations. A user clicks "place order"; the system must insert one order row, decrement one inventory row, insert a payment row, and return within about a hundred milliseconds. The total data volume may be modest; the concurrency is high; correctness under concurrent update is non-negotiable. PostgreSQL, MySQL, SQL Server, and Oracle are the classical OLTP engines. They store rows together on disk — row-store layout — because most operations touch every column of a few rows.

Analytical (OLAP) workloads

Online Analytical Processing (OLAP) is the database answering business and research questions — "what was our revenue by region last quarter?" "how many distinct users saw this feature in March?" Each query scans millions or billions of rows but touches only a few columns and returns an aggregate. The workload is read-mostly, with nightly or hourly batch ingest; latency budgets are seconds to minutes; the queries are too expensive to serve from the transactional system without crushing it.

For this workload, row-store layout is precisely wrong. A query that averages one column over a billion rows would, in a row-store, read the other twenty columns of every row off disk and throw them away. The fix is a column-store — data on disk is organised column by column, so reading one column touches only its file and the I/O drops by an order of magnitude or more. Column storage also compresses exceptionally well because adjacent values in a column are often similar; typical compression ratios of 5–20× are routine.

Row-store versus column-store in one picture

A row-store writes (1, 'Anna', 30) (2, 'Bob', 25) (3, 'Cho', 28) and reads a row efficiently. A column-store writes (1, 2, 3) for the id, ('Anna', 'Bob', 'Cho') for the name, (30, 25, 28) for the age, and reads a column efficiently. Either can emulate the other at the cost of performance. The lesson is to pick the representation that matches the access pattern, and to put transactional and analytical data in systems whose internal layout agrees with the queries being asked.
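The same picture in executable form — two in-memory layouts of the same three records, and the aggregate that favours one of them:

```python
# Row layout versus column layout for identical data.
rows = [(1, "Anna", 30), (2, "Bob", 25), (3, "Cho", 28)]

columns = {
    "id":   [1, 2, 3],
    "name": ["Anna", "Bob", "Cho"],
    "age":  [30, 25, 28],
}

# Row store: every tuple is visited; two of three fields are read and discarded.
avg_row = sum(r[2] for r in rows) / len(rows)

# Column store: only the "age" column is touched -- the I/O argument in miniature.
avg_col = sum(columns["age"]) / len(columns["age"])

assert avg_row == avg_col   # same answer, very different bytes read at scale
```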

The normal pattern: both, and a pipeline between them

Most mature stacks run an OLTP database for the application and an OLAP warehouse for the analysts, with a nightly or streaming pipeline moving data from one to the other. The pipeline — extract from the transactional store, transform into analytical shape, load into the warehouse — is called ETL (or, in modern stacks that transform after loading, ELT). Tools like Airflow, dbt, and Fivetran exist to make this pipeline declarative and testable. Skipping the split — running analytics against the production database — works until the first serious query saturates the transactional system at exactly the wrong moment.

Key idea

OLTP and OLAP look like the same discipline — both are relational, both speak SQL — but they are engineered for opposite access patterns. Serving an application and powering analytics from the same engine is a beginner's mistake; serving each from a specialised system connected by a pipeline is the grown-up architecture.

Section 15

Columnar analytics, the engines you query with

The last fifteen years have produced a generation of column-store engines that are now the default for serious analytics. They share a small set of design ideas — columnar storage, vectorised execution, massively parallel scans — and differ mostly in whether you run them on a laptop or on someone else's cloud. Knowing what they have in common is more portable than knowing any one product.

Vectorised execution and why it is fast

Traditional databases process queries one row at a time, dispatching through a tree of operators with one virtual call per row. Vectorised engines instead process batches of a few thousand values from a single column in tight loops, fitting inside CPU caches and SIMD instructions. Combined with column storage (no wasted I/O on unused columns) and compressed encodings (run-length, dictionary, bit-packed), this is how a modern engine can scan a hundred million rows in a few seconds on a laptop.
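Two of the encodings named above can be sketched on a toy column — adjacent similarity is exactly what makes them bite:

```python
from itertools import groupby

country = ["DE", "DE", "DE", "FR", "FR", "DE", "DE"]

# Run-length encoding: (value, run length) pairs.
rle = [(v, len(list(g))) for v, g in groupby(country)]
assert rle == [("DE", 3), ("FR", 2), ("DE", 2)]

# Dictionary encoding: small integer codes into a sorted vocabulary.
vocab = sorted(set(country))                   # ["DE", "FR"]
codes = [vocab.index(v) for v in country]
assert codes == [0, 0, 0, 1, 1, 0, 0]

# Aggregation can run on the compressed form without decompressing:
de_count = sum(n for v, n in rle if v == "DE")
assert de_count == 5
```

Real engines combine these with bit-packing and operate on thousands of encoded values per batch; the sketch only shows why sorted, low-cardinality columns compress so well.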

The current landscape

DuckDB is the surprise of the last few years — an embedded OLAP engine, like SQLite but for analytics, that runs in-process in Python, R, or at the command line. It reads Parquet and CSV natively, supports the full ANSI SQL surface most analyses need, and is the right default for single-node analytical work up to the hundred-gigabyte range. It has quietly replaced pandas as many practitioners' preferred way to transform data.

BigQuery (Google), Snowflake, and Redshift (AWS) are the commercial cloud warehouses: you push data in, you write SQL, the service provisions compute behind the scenes. They differ in pricing model (per-query bytes scanned versus per-second compute) and in their approach to separating storage from compute, but all three are cost-effective for teams that would rather pay by the query than operate their own cluster.

ClickHouse is the self-hosted counterpart — an open-source columnar engine built for ingest-heavy, real-time analytical workloads at very large scale. It tends to show up when the numbers get big (hundreds of terabytes, millions of events per second) and the team has the operational appetite to run its own cluster.

Apache Parquet deserves a name of its own: it is not an engine but a columnar file format, the interchange layer between all of the above. Data lakes store Parquet on object storage; DuckDB reads it directly; warehouses import it. Learning Parquet's model — row groups, column chunks, page statistics, pushdown predicates — pays off the first time a query runs against a Parquet file on S3 and returns in seconds instead of hours.

The lakehouse pattern

The architectural idea that has consolidated in the last few years is the lakehouse: Parquet files on cheap object storage (the "lake"), a table-format layer (Delta, Iceberg, Hudi) that adds ACID transactions and schema evolution on top, and a compute engine (DuckDB, Spark, Trino) that queries the tables. It gives you warehouse semantics on lake economics, and it decouples storage from compute so that a ten-terabyte dataset does not have to sit inside any one vendor's system.

Rule of thumb

If the data fits on one machine, start with DuckDB over Parquet. If it does not, pick a warehouse (BigQuery, Snowflake) for elasticity or ClickHouse for raw throughput. Save Spark for the cases that truly need distributed compute — which, thanks to modern single-node performance, is fewer than you might think.

Section 16

Where databases meet machine learning

An ML system is, for most of its life, a database system with some models attached. Training data is queried; features are computed, cached, and joined; predictions are logged; experiments are compared. The teams that treat their data layer with the same seriousness as their model layer are the ones whose pipelines survive contact with production.

Training sets are queries. "The dataset" is rarely a file — it is the output of a query that joins many tables, filters by time, and possibly samples. Writing that query in SQL, versioning it alongside the code, and materialising the result to Parquet gives you a reproducible dataset with an audit trail. Every published metric should trace back to exactly one such query, run against exactly one database snapshot.
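A minimal version of that discipline, sketched with SQLite and an invented events table: run the versioned query, materialise the rows, and fingerprint the result so the dataset behind a metric is auditable:

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, label INTEGER, ts TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?, ?)",
               [(1, 0, "2026-01-03"), (2, 1, "2026-01-05"),
                (3, 1, "2025-12-30")])

# The dataset IS this query; version the string alongside the model code.
TRAINING_QUERY = """
    SELECT user_id, label FROM events
    WHERE ts >= '2026-01-01'
    ORDER BY user_id
"""
rows = db.execute(TRAINING_QUERY).fetchall()
assert rows == [(1, 0), (2, 1)]    # the 2025 row is excluded by the filter

# A content fingerprint ties every reported metric to exactly this extract.
fingerprint = hashlib.sha256(repr(rows).encode()).hexdigest()[:12]
print("dataset fingerprint:", fingerprint)
```

In a real pipeline the rows would go to Parquet and the fingerprint into the experiment tracker, but the shape of the idea is the same.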

Feature stores. A feature store (Feast, Tecton, Vertex AI Feature Store, Databricks) solves a deceptively hard problem: the same feature computed the same way at training time and at serving time, so that a model's predictions are not subtly different in production than they were in the offline evaluation. Under the hood, a feature store is usually two databases — an offline columnar store for training queries, an online key-value store for low-latency serving — kept in sync by a pipeline. Most of the value is less in the tool and more in the discipline of defining features once, by name, with an owner.

Training–serving skew. The bug most responsible for quiet model degradation is the one where a feature is computed one way in the offline pipeline and another way in the live service. A NULL handled differently, a timezone off by an hour, a categorical encoded against a different vocabulary — each looks innocuous in code review, each drops production accuracy measurably. The cure is to compute every feature exactly once, in one place, and to let both training and serving read from there.

Experiment tracking. Tools like MLflow, Weights & Biases, and Neptune are, in the end, specialised databases — each run gets a row, each metric a time series, each artefact a pointer. What matters is that every model you ever consider shipping has a complete, queryable record: config, code SHA, dataset snapshot, metrics on every split. Without that record, the question "was this model actually better than the last one?" is unanswerable.

Prediction logs are a dataset. A model in production emits predictions, and those predictions — together with the inputs that produced them and, eventually, the ground-truth outcomes — are the training data for the next iteration of the model. Logging them into a warehouse, by user or request ID, with schema stability, is what makes continuous improvement possible. A deployed model that does not log its inputs is one that cannot learn from its own mistakes.

Vector databases. The rise of embedding-based retrieval (recommendations, semantic search, RAG systems) has produced a new family — pgvector, Pinecone, Weaviate, Milvus, Qdrant — optimised for nearest-neighbour queries in high-dimensional space. They are, architecturally, a specialised index layered on top of the older database families; the same disciplines (schema, transactions, backups, reproducibility) still apply.
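The core operation these systems accelerate is nearest-neighbour search, which is a one-liner when done by brute force — a sketch with made-up three-dimensional vectors; real systems use approximate indexes (HNSW, IVF) over hundreds of dimensions:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

index = {
    "doc_a": (1.0, 0.0, 0.2),
    "doc_b": (0.1, 0.9, 0.0),
    "doc_c": (0.9, 0.1, 0.3),
}
query = (1.0, 0.1, 0.2)

# Brute force: score every stored vector against the query, keep the best.
best = max(index, key=lambda k: cosine(index[k], query))
assert best == "doc_a"
```

A vector database replaces the `max` over everything with an index that inspects only a small fraction of the vectors, trading a little recall for orders of magnitude in speed.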

Key idea

The boundary between "data engineering" and "machine learning" is mostly historical. A serious ML practitioner reads query plans, understands indexes, and treats the database as a first-class participant in the modelling loop. The next section of the compendium — the algorithms and data structures underneath these engines, and the version-control discipline that keeps the code around them tractable — completes the picture.

Further reading

Where to go next

The database literature spans four decades and several worlds — the theoretical foundations of the relational model, the practical craft of SQL and query tuning, the modern architecture of distributed and analytical engines, and the specific documentation that pays back an hour's reading. The list below picks a handful from each.

Foundations and the relational model

SQL, query tuning, and indexing

Modern architectures, NoSQL, and scale

Documentation that rewards reading

This page is the fifth chapter of Part II: Programming & Software Engineering. Up next, the sixth chapter: version control and collaborative development — git fluency, branching models, code review as a practice, and the habits that let many people work on one codebase without stepping on each other. After that, the compendium turns to Part III: data engineering and the ML pipeline.