Part III · Data Engineering & Systems · Chapter 02

Data storage, and the slow triumph of the columnar lakehouse.

Between the ingestion pipeline that lands a row and the model that trains on it sits the whole of data storage — the question of where bytes live, in what shape, on which hardware, at what cost, with which guarantees, and for how long. The answer has changed more in the last decade than in the three before it. The classical data warehouse, the Hadoop-era data lake, and the modern lakehouse are in some sense the same artefact viewed from three epochs; this chapter works through the architectures, file formats (Parquet, ORC, Avro), open table formats (Delta Lake, Apache Iceberg, Apache Hudi), and compute-storage separation patterns that together make analytical data engineering possible at any non-trivial scale.

How to read this chapter

The first two sections are orientation: why storage is the substrate every downstream system depends on, and the transactional/analytical split (OLTP versus OLAP) that explains why different workloads demand different storage. Sections three through five tell the architectural story — warehouse, lake, lakehouse — and why the industry has, after twenty years of oscillation, settled on the third as the reasonable default. Sections six and seven are the row-versus-columnar question and a detailed walk through Parquet, the file format most analytical systems now write. Sections eight and nine cover the rest of the file-format landscape (ORC, Avro, JSON/CSV) and the compression and encoding techniques that make columnar storage fast. Section ten is about physical layout: partitioning, clustering, file sizing. Sections eleven and twelve bring in the open table formats — Delta, Iceberg, Hudi — and the ACID guarantees they add on top of object storage. Sections thirteen through sixteen cover the architectural consequences: dimensional data modelling, separation of storage from compute, the cloud object stores (S3, GCS, Azure Blob) that now underlie nearly every serious analytical platform, and the access, encryption, and cost questions at the edges. The final section connects all of it to what it changes about training, serving, and evaluating machine-learning models.

Conventions: code snippets are Python, SQL, or shell where illustrative; sizes and throughput figures assume commodity cloud object storage and current-generation warehouse engines. "Warehouse" in what follows means a columnar analytical engine — Snowflake, BigQuery, Redshift, Databricks SQL, DuckDB — rather than the classical 1990s appliance. The goal is to leave the reader able to reason about any analytical data system at the level of its storage and table-format choices, which is where most of the decisions with long-lived consequences are actually made.

Contents

  1. Why storage is the substrate · The hidden foundation
  2. OLTP versus OLAP · Transactions vs. analytics
  3. The classical data warehouse · Kimball, Inmon, the star schema
  4. The data lake and its discontents · HDFS, Hadoop, schema-on-read
  5. The lakehouse synthesis · Warehouse semantics on lake storage
  6. Why columnar wins for analytics · Row-oriented vs. column-oriented
  7. Parquet, in detail · Row groups, column chunks, pages
  8. ORC, Avro, and the file-format landscape · Columnar cousins, row-wise siblings
  9. Compression and encoding · Dictionary, RLE, bit-packing, Snappy, Zstd
  10. Partitioning, clustering, and file layout · Physical layout for query speed
  11. Open table formats · Delta Lake, Iceberg, Hudi
  12. ACID on object storage · Atomicity, time travel, isolation
  13. Dimensional modelling · Star, snowflake, wide-table
  14. Separating storage from compute · Why the decoupling changed everything
  15. S3, GCS, Azure Blob — the substrate · Object storage as a primitive
  16. Access, encryption, cost · The non-technical edges of storage
  17. Where it compounds in ML · Training, serving, evaluation
Section 01

Storage is the substrate every other system depends on

A data platform is a tower, and the floor beneath it is storage. Every pipeline, every query engine, every dashboard, every training run inherits its properties — its latency, its durability, its cost, its schema — from the storage layer. The choices made there compound for years; the choices made in the layers above are comparatively easy to undo.

Storage as a constraint, not a detail

It is tempting to think of storage as a commodity — a place where bytes wait for work to be done on them — and treat the interesting questions as belonging to the compute layer. For an individual analyst running a single query against a small table, that view is defensible. For any team operating at the scale of terabytes or larger, it is exactly backwards. The cost, the throughput, the consistency model, and the schema evolution of the storage layer are the constraints within which everything else operates.

The three questions storage answers

Every storage decision is an answer, explicit or otherwise, to three questions. Where do the bytes live? — on attached disk, in an object store, in a warehouse's proprietary format. In what shape? — row-oriented or columnar, one file or many, partitioned or not. With what guarantees? — durable to how many copies, consistent under which workloads, recoverable after which failures, readable by which engines. The answers vary by workload, and this chapter is largely an elaboration of what those variations look like.

Three decades of architectural swings

The dominant architecture has swung twice in thirty years. The 1990s and early 2000s were the age of the data warehouse — dedicated appliances, proprietary columnar formats, rigid schemas. The late 2000s and 2010s were the age of the data lake — cheap object storage, flexible schemas, MapReduce and then Spark on top. The 2020s have brought the lakehouse, which restores warehouse semantics (ACID, schema enforcement, time travel) to lake-scale storage. Understanding the swings is not a history lesson; it is how to recognise which stage of the pendulum any given team is actually on.

The long-horizon property

Storage is the longest-lived part of a data platform. Pipelines get rewritten, warehouses get migrated, job schedulers change every few years; the bytes themselves last decades. The effort spent on getting storage right early pays back every time something above it gets replaced.

Section 02

OLTP versus OLAP, the split that explains most of storage

The single most useful distinction in data storage is between the systems that record what just happened and the systems that analyse what has happened over time. They look similar from the outside — both store tables, both are queried with SQL — yet they are engineered on almost opposite principles.

OLTP — operational workloads

OLTP (online transaction processing) systems are the operational backends of a business: the user database, the order ledger, the payments system. Their workload is many small transactions per second, each touching a handful of rows, each requiring strict consistency (no double-charged credit cards) and low latency (no user waiting on a page load). PostgreSQL, MySQL, SQL Server, Oracle, CockroachDB, Spanner — all are row-oriented, B-tree indexed, and optimised for tens of thousands of single-row operations per second per core.

OLAP — analytical workloads

OLAP (online analytical processing) systems answer questions about what happened. Their workload is few large queries, each scanning millions or billions of rows but usually only a handful of columns, often aggregating across time. Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse, Druid, DuckDB — all are columnar, compressed, and optimised to scan tens of gigabytes per second per core.

The one-table example

A table of a billion orders, 40 columns wide, lives on the OLTP side as a row-oriented B-tree where finding order #12345 is an O(log n) operation touching one page. On the OLAP side, it lives as a columnar file where summing the total_usd column across ten years reads exactly one column, compressed, in a single streaming scan. Either architecture, asked to do the other's workload, performs one to two orders of magnitude worse.

The dual-write pattern

The reconciliation, in practice, is to keep both — the OLTP system of record is the source of truth; a copy of its state, updated continuously or periodically, lives in the OLAP warehouse. Change-data-capture (CDC) pipelines read the OLTP write-ahead log and replay its mutations into the warehouse; dual writes are more fragile and less favoured. The next chapter covers pipelines and orchestration in detail; the storage story depends on which side of this split a given dataset is meant to support.
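The replay step can be sketched in a few lines. A toy model, assuming each change event carries an operation, a primary key, and (for inserts and updates) the new row image — real CDC tools such as Debezium emit much richer envelopes, but the shape of the replay loop is the same:

```python
def replay(events, table):
    """Apply ordered WAL-style change events to `table`, a dict keyed
    by primary key standing in for the warehouse-side copy."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            table[ev["pk"]] = ev["row"]   # upsert the new row image
        elif ev["op"] == "delete":
            table.pop(ev["pk"], None)     # remove the row if present
    return table

events = [
    {"op": "insert", "pk": 1, "row": {"status": "placed"}},
    {"op": "update", "pk": 1, "row": {"status": "shipped"}},
    {"op": "insert", "pk": 2, "row": {"status": "placed"}},
    {"op": "delete", "pk": 2},
]
state = replay(events, {})
# state == {1: {"status": "shipped"}}
```

Order matters: replaying the same events in a different order yields a different table, which is why CDC pipelines go to such lengths to preserve the log's per-key ordering.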

HTAP — the nearly-one-system ambition

A decade of effort has gone into HTAP systems (hybrid transaction-analytical processing) that serve both workloads from one store. SingleStore, TiDB, and Snowflake's Unistore are the best-known attempts. They work well for modest scales; at large scale, the fundamental trade-offs between row-oriented and column-oriented layouts reassert themselves, and most serious platforms still run both.

Section 03

The classical data warehouse, and why it still shapes the field

The data warehouse, as an idea, is fifty years old. As a product category, it is thirty. The modern cloud warehouse has little in common with the Teradata appliance of 1995 — but the vocabulary, the schema patterns, and the mental model all come from that lineage, and are still how working engineers describe what they are doing.

Inmon versus Kimball

Two competing philosophies dominate the warehouse literature. Bill Inmon's 1990 Building the Data Warehouse argued for a single, enterprise-wide, third-normal-form warehouse — the "single version of the truth" — feeding downstream data marts. Ralph Kimball's 1996 The Data Warehouse Toolkit argued for a bottom-up collection of dimensional data marts, each structured as a star schema, and is the approach most modern warehouses actually implement. The two books are still the canon; both repay a slow reading, thirty years later.

What the warehouse actually is

A warehouse is a schema-on-write store — every table has a declared schema and writes must conform — optimised for analytical queries. Classically it sat on dedicated hardware with columnar storage, massive parallelism across shared-nothing nodes, and a SQL dialect with extensions for windowing, rollups, and cube operations. Teradata, Netezza, Vertica, and early Redshift are the archetypes.

The cloud-native reinvention

Snowflake (2014) broke the appliance model by separating storage (S3 by default) from compute (ephemeral virtual warehouses). BigQuery (2011) did the same on Google's Colossus. Databricks SQL added warehouse semantics on top of lakehouse storage. The result, by the mid-2020s, is that almost no serious organisation buys warehouse hardware any more; the warehouse is a SQL engine rented by the second, reading from cheap object storage.

What the warehouse is good for

Schema enforcement, SQL support, transactional semantics, mature permissions, and predictable query latency. A warehouse is what a BI team wants to query; it is what a finance team wants to audit; it is what a compliance team wants to govern. For workloads that fit its schema-on-write model — structured tabular data, joined across well-understood business entities — it remains, thirty years on, the right tool.

Why the warehouse is not enough

Schema-on-write is a virtue when the schema is known and a constraint when it is not. Semi-structured logs, image and video data, machine-learning features, and the long tail of "we do not yet know what this will be used for" data all pushed against the warehouse's rigidity, and eventually produced the data lake.

Section 04

The data lake and its discontents

If the warehouse was an answer to "how do we query structured business data efficiently", the data lake was the answer to "how do we store everything else, cheaply, without deciding in advance what it is for". It was a brilliant idea for about five years and a warning story for the five after that.

The original architecture

A data lake is object or distributed-file storage that holds raw data in its native format — CSVs, JSON logs, images, audio, Parquet, whatever — organised by folder conventions rather than by a central schema. The canonical 2010s stack was HDFS (Hadoop Distributed File System) for storage and MapReduce, then Spark, for compute. The promise: cheap petabyte-scale storage, flexible schemas, any workload that could be expressed as a distributed job.

Schema-on-read

The lake's defining property is schema-on-read: data is written in whatever shape it arrives; a consumer imposes a schema only at query time. This is liberating when a team does not yet know what the data will be used for, and corrosive when many teams impose incompatible schemas on the same files. The virtue at small scale becomes a liability at organisational scale.

The data-swamp failure mode

By the late 2010s, almost every large company that built a lake had also built a data swamp: petabytes of undocumented files, redundant copies, broken partition layouts, jobs that no one remembers writing. The failure was not technical — the storage worked fine — but governance: lakes lacked the schema enforcement, access control, transactions, and discoverability that warehouses had spent thirty years getting right.

Cloud object storage replaces HDFS

The other shift was underneath: HDFS, which required dedicated clusters and complex operations, was superseded by cloud object stores — Amazon S3, Google Cloud Storage, Azure Blob Storage — which offered effectively infinite capacity, eleven-nines durability, pay-per-use pricing, and no cluster to run. By the early 2020s almost every new lake was built on object storage; HDFS survives mostly in on-premises Hadoop installations.

The lake was right about storage

What the lake got right — cheap, decoupled, format-agnostic storage — is now universal. What it got wrong — no transactions, no schema, no lineage — is what the lakehouse was built to fix.

Section 05

The lakehouse, or warehouse semantics on lake storage

The lakehouse is the third act: keep the cheap, scalable, open-format storage of the lake, and bolt onto it the transactional guarantees, schema enforcement, and indexing that made the warehouse usable. The idea was named and defended in a 2021 paper by Armbrust, Ghodsi, Xin, and Zaharia; in the years since it has become the de-facto architecture for new analytical platforms.

The architectural move

A lakehouse stores table data as columnar files (almost always Parquet) in an object store, and stores metadata — which files belong to which table, what their schema is, which versions exist, who touched them — as a small, versioned, ACID-managed catalogue. The catalogue lives either as JSON/Avro log files alongside the data (Delta Lake, Iceberg) or in a separate service (Hive Metastore, Unity Catalog, Snowflake's Polaris). Query engines read the catalogue first, then the data.

What it fixes

Four properties come back. Atomic writes: a failed job leaves no half-committed partial files visible. Schema enforcement: writes that violate the declared schema are rejected before landing. Time travel: any historical version of the table can be queried by timestamp or ID, which is the reproducibility property ML training depends on. Concurrent readers and writers: a streaming job and an analytical query can touch the same table without stepping on each other.

Open formats, open engines

The critical change relative to the old warehouse model is that the data and the metadata are open. A Parquet file written by Spark is readable by DuckDB; a Delta table written by Databricks is now readable by Snowflake, BigQuery, and Trino; an Iceberg table written by Flink is readable by almost everything. Storage and compute become independently replaceable components.

The mid-2020s consensus

By 2025, the three major cloud warehouses (Snowflake, BigQuery, Databricks) all read and write open table formats natively; Iceberg specifically has become the near-universal wire format. The industry argument of "warehouse versus lake" is effectively over, and what replaced it is a set of engineering decisions about which open format, which compute engine, and which catalogue to pick — all within a broadly lakehouse shape.

The useful mental model

A modern data platform is: cheap columnar files in object storage, an open table format layered on top for transactions and time travel, a catalogue service for discovery and access control, and one or more query engines renting compute by the second to run queries against it. Everything else in this chapter is a zoom-in on one of those four boxes.

Section 06

Why columnar storage wins for analytics

Two workloads, two storage layouts. Row-oriented storage keeps all fields of a record together; columnar storage keeps all values of a column together. The choice looks trivial and is not — it produces order-of-magnitude performance differences on exactly the workloads that analytical systems run.

The physical layout

Consider a table with columns (user_id, country, signup_ts, revenue_usd, ...) and a billion rows. In a row-oriented store, byte 0 is user_id of row 0, byte 4 is country of row 0, and so on — each record is contiguous. In a columnar store, the first block is every user_id in order, the next block is every country, and the fourth block is every revenue_usd. Physically, the same data; logically, the same table; I/O behaviour, totally different.

Why it helps

Four compounding effects. I/O reduction: an analytical query of the form SELECT SUM(revenue_usd) FROM orders WHERE country='FR' reads two columns, not forty, so 95% of the file is skipped before any work begins. Compression: a column of a single type compresses far better than a row of mixed types — a country column with high repetition compresses 20–50×. Vectorised execution: a CPU operating on a tight array of floats can use SIMD and keep its pipelines full in a way a pointer-chasing row scan cannot. Cache friendliness: each column fits contiguous cache lines, so L1 and L2 hit rates rise dramatically.
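The I/O asymmetry can be made concrete with plain Python lists — a sketch, not a storage engine, but the access pattern is the point:

```python
# The same three-row table in both layouts.
rows = [  # row-oriented: each record is contiguous
    ("u1", "FR", 10.0),
    ("u2", "DE", 20.0),
    ("u3", "FR", 5.0),
]
columns = {  # column-oriented: each column is contiguous
    "user_id":     ["u1", "u2", "u3"],
    "country":     ["FR", "DE", "FR"],
    "revenue_usd": [10.0, 20.0, 5.0],
}

# Row layout: every record must be visited to reach its third field.
row_total = sum(r[2] for r in rows)

# Columnar layout: one tight array is streamed; the other columns are
# never touched at all.
col_total = sum(columns["revenue_usd"])

assert row_total == col_total == 35.0
```

On disk the difference is far starker than in memory: the row scan decompresses and deserialises all forty columns of every record, while the columnar scan reads one compressed array.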

When row orientation still wins

Transactional workloads. Reading the entire record for user 42 — all forty columns — costs forty seeks in a columnar store and one in a row store. Writing a new order pays a similar penalty: a columnar store has to append to forty column files; a row store appends to one. OLTP still runs on row-oriented storage for this reason, and nothing in the last decade has changed that.

The history compressed

Columnar databases existed in research before 2000 (Sybase IQ, MonetDB) but did not go mainstream until C-Store (2005) and its commercial descendant Vertica (2007). The idea then spread to Parquet (2013) and ORC (2013) as file formats, Redshift (2012), Snowflake (2014), and BigQuery's Dremel/Capacitor as cloud warehouses, and ClickHouse (2016) as the open-source poster child. By now every serious analytical engine is column-oriented; the only arguments are about which flavour.

The mental model to keep

A columnar query engine is reading tight arrays of one type, compressed, vectorised, and streamed. A row-oriented engine is reading records one at a time, mixed types, with pointer chasing. Analytics looks nothing like OLTP once you look at it from the storage up, and almost all the interesting engineering sits at this boundary.

Section 07

Parquet, in more detail than most practitioners know

Apache Parquet is the file format analytical systems write. Every warehouse, every lakehouse engine, every Spark job, every DuckDB session, every Hugging Face dataset of any size lands ultimately in Parquet. Knowing the internals is the difference between writing files that query well and files that are a drag on every consumer of them.

The hierarchical layout

A Parquet file is a tree. At the top, a file contains one or more row groups (typical size: 128–512 MB). Each row group contains one column chunk per column. Each column chunk contains a sequence of pages (typical size: 1 MB), which are the smallest unit of compression and decoding. Metadata — schemas, statistics, offsets, encodings — lives in a footer at the end of the file, which lets readers do one seek to the footer, decide which row groups and column chunks are relevant, and skip the rest.

Statistics and predicate pushdown

Every column chunk in the footer carries min/max/null-count statistics. A query with WHERE revenue_usd > 10000 reads the statistics first and skips any column chunk whose max is below 10000 — often skipping 90% of the file without reading a byte of data. This predicate pushdown is why Parquet files with well-sorted columns and correctly-sized row groups can outperform a full-table scan by two orders of magnitude.
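A minimal sketch of the mechanism, with hypothetical chunk statistics: the reader consults min/max before touching any values, and prunes whole chunks the predicate cannot match.

```python
# Each "chunk" carries footer-style min/max statistics alongside its
# (normally compressed, here in-memory) values.
chunks = [
    {"min": 10,   "max": 900,   "values": [10, 500, 900]},
    {"min": 1200, "max": 80000, "values": [1200, 15000, 80000]},
    {"min": 5,    "max": 9000,  "values": [5, 9000]},
]

def scan_gt(chunks, threshold):
    """Return values > threshold, skipping chunks whose statistics
    prove that no row inside can match."""
    hits, skipped = [], 0
    for c in chunks:
        if c["max"] <= threshold:   # pruned without decoding a value
            skipped += 1
            continue
        hits.extend(v for v in c["values"] if v > threshold)
    return hits, skipped

hits, skipped = scan_gt(chunks, 10_000)
# two of the three chunks are pruned on statistics alone
```

The sketch also shows why sort order matters: if `revenue_usd` is sorted, each chunk's min/max range is narrow and pruning is sharp; if values are shuffled, every chunk's range spans the whole column and nothing is skipped.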

Encodings

Parquet supports multiple encodings per column. PLAIN is the fallback. Dictionary encoding replaces repeated values with small integer codes and a dictionary — dramatically effective for categorical columns. Run-length encoding (RLE) compresses runs of identical values. Bit-packing uses only as many bits as the value range requires. Delta encoding stores differences rather than absolute values, which wins for sorted integers and timestamps. Modern writers (pyarrow, Spark) pick encodings automatically; a Parquet tuner manually overrides them only in extreme cases.
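Dictionary encoding, the workhorse of that list, is simple enough to sketch directly (a toy version; real Parquet writers add fallback logic for when the dictionary grows too large to pay for itself):

```python
def dict_encode(column):
    """Replace repeated values with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

col = ["FR", "FR", "DE", "FR", "US", "DE"]
dictionary, codes = dict_encode(col)
# dictionary == ["FR", "DE", "US"], codes == [0, 0, 1, 0, 2, 1]
assert dict_decode(dictionary, codes) == col
```

For a billion-row country column with 200 distinct values, the codes fit in a byte each and the dictionary is paid for once — which is where the 10–100× wins on categorical columns come from.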

Writing Parquet well

import pyarrow as pa
import pyarrow.parquet as pq

# `table` is an existing pyarrow.Table, e.g. pa.Table.from_pandas(df)
pq.write_table(
    table,
    "orders/part-0001.parquet",
    compression="zstd",        # better than snappy for most analytical workloads
    row_group_size=500_000,    # rows per group; tune so groups land near ~256 MB
    use_dictionary=True,
    write_statistics=True,     # enables predicate pushdown
)

Three rules of thumb: row groups of 128–512 MB (smaller causes excessive metadata overhead; larger hurts parallelism), Zstd compression unless the query engine cannot read it, and sort on the columns most commonly filtered — which makes min/max statistics far more selective.

Why Parquet is the lingua franca

Every major engine — Spark, Flink, Trino, DuckDB, Snowflake, BigQuery, Databricks, Athena, Redshift Spectrum, pandas, Polars — reads Parquet natively. Writing Parquet rather than a proprietary format is the single most portable storage decision a team can make, and it costs nothing beyond picking sensible defaults.

Section 08

ORC, Avro, and the rest of the file-format landscape

Parquet won the columnar file-format wars, but it did not eliminate the competition. A working engineer encounters at least three other formats regularly and should know which one is meant to solve which problem.

ORC — Parquet's columnar cousin

Apache ORC (Optimized Row Columnar) emerged from the Hive ecosystem in 2013, roughly simultaneously with Parquet. Internally it is similar — row groups called stripes, columnar chunks, per-stripe statistics — but with somewhat richer built-in indexing (bloom filters, row-level indexes) and tighter coupling to Hive and older Hadoop engines. It remains the default in some Cloudera deployments; outside that ecosystem, Parquet dominates.

Avro — the row-oriented sibling

Apache Avro is row-oriented, not columnar, and aimed at a different workload: the serialised-message problem. An Avro file stores records with their schema; readers can evolve the schema (add optional fields, rename with aliases) without breaking. This makes Avro the default choice for Kafka topics, Snowflake's Snowpipe ingestion, and any workload where each record is read as a whole. It is strictly worse than Parquet for analytical scans, and strictly better for event streams.

JSON and CSV — the formats you will still see

CSV is 1972 technology; JSON is 2001. Both are row-oriented, uncompressed, and slow — and both are still ubiquitous in ingestion because humans can read them, every tool writes them, and the path of least resistance from an API to a file is json.dump. A pragmatic pipeline ingests in JSON or CSV, converts to Parquet on landing, and queries only the Parquet. CSV is, in particular, a minefield — no schema, ambiguous quoting, no portable null encoding — and no production analytical system should be built on it.

Feather/Arrow IPC — the in-memory cousin

Apache Arrow is an in-memory columnar layout, not a file format, but its on-disk serialisation — Feather V2 / Arrow IPC — is a useful near-zero-copy format for handing columnar data between Python, R, and Rust processes on the same machine. It is faster to read than Parquet but larger on disk and lacks Parquet's statistics; good for caches and scratch space, not for the long-term warehouse.

The emerging contenders

Lance (2023, by LanceDB) is a columnar format aimed specifically at ML: vector indexes, random row access, versioning. Vortex (2024, by Spiral) promises better compression than Parquet with faster scans. Neither has displaced Parquet yet; both are worth watching for 2027.

The one-line rule

Store analytical tables in Parquet. Store event streams in Avro. Use JSON and CSV as transient ingest formats only. Revisit the decision if, and only if, a specific workload shows that a newer format would actually move a measurable number.

Section 09

Compression and encoding, the quiet doublings

A columnar file stacks two independent kinds of size reduction: the column-specific encodings that exploit structure inside a single column, and a general-purpose compressor applied on top. Understanding the two layers is how a team decides whether a 2 TB file should really be 200 GB.

The encoding layer

Encodings are type-aware and structure-aware. Dictionary encoding gives a 10–100× win on categorical columns with low cardinality — a country column with 200 distinct values becomes an 8-bit dictionary code plus a 200-entry dictionary. Run-length encoding turns [1,1,1,1,2,2] into [(1,4),(2,2)]. Bit-packing uses the minimum bits needed for the range — a column of integers in [0,1023] uses 10 bits per value, not 32 or 64. Delta encoding stores differences — brilliant for sorted timestamps where consecutive differences are small.
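The RLE and bit-packing examples in the paragraph above can be sketched directly:

```python
def rle_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def bits_needed(max_value):
    """Minimum bits to represent integers in [0, max_value]."""
    return max(1, max_value.bit_length())

assert rle_encode([1, 1, 1, 1, 2, 2]) == [(1, 4), (2, 2)]
assert bits_needed(1023) == 10   # values in [0, 1023] pack into 10 bits
```

Real writers combine the two: dictionary codes are themselves RLE/bit-packed, which is why a low-cardinality column often costs well under a bit per row before the general-purpose compressor even runs.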

The compression layer

On top of encoded bytes, a general-purpose compressor runs. The common choices: Snappy (fast, modest ratio — the historical Parquet default); Gzip (slow, good ratio — older and slower than alternatives); Zstandard / Zstd (tunable ratio, excellent speed — the modern default); LZ4 (fastest, worst ratio — useful for hot caches); Brotli (highest ratio, slowest — useful for write-once archival data).

The empirical rule

For most analytical workloads, Zstd at level 3 is the current sweet spot: roughly 2× better compression than Snappy, roughly 5% slower to decompress, readable by every modern engine. For long-lived cold archives, Zstd at level 9 or Brotli produce 10–30% smaller files at a much higher write cost. For a real-time ingest path where write throughput dominates, Snappy or LZ4 still win.

Why the two-layer view matters

A well-encoded Parquet file gains most of its compression from the encoding layer: a country column of a billion rows is already dictionary-encoded to 1 GB before any compressor runs. The compressor then adds another 2–3×. A poorly-encoded file — e.g., a PLAIN-encoded string column — leans entirely on the compressor and produces files two to five times larger than they should be. Diagnosing Parquet size issues almost always means checking the encoding, not the compressor.

The rule of thumb

Typical Parquet files with sensible defaults are 5–20× smaller than the equivalent CSV, and roughly 2× smaller than ORC on the same workload. A file that does not hit those ratios almost always has a miscoded column — usually a high-cardinality string that defeated the dictionary, or a wide JSON blob that defeated everything.

Section 10

Partitioning, clustering, and file layout that makes queries fast

A columnar file is fast. A directory of columnar files is fast only if its physical layout matches the queries run against it. The decisions about how to partition and sort data are among the most consequential and least visible a team makes.

Partitioning

Partitioning splits a table into disjoint subsets — typically by date, region, or tenant — stored in separate folders. A query filtered by the partition column reads only the relevant folders; the engine prunes the rest without opening a file. A well-partitioned events table (events/year=2026/month=04/day=17/) serves a one-day query by reading 1/365 of the data, with no other optimisation.
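Pruning is a pure string-matching exercise over paths, which a few lines of Python can illustrate (the paths and helper names are illustrative, not any engine's API):

```python
paths = [
    "events/year=2026/month=04/day=16/part-0.parquet",
    "events/year=2026/month=04/day=17/part-0.parquet",
    "events/year=2026/month=04/day=17/part-1.parquet",
    "events/year=2025/month=12/day=31/part-0.parquet",
]

def partition_of(path):
    """Parse Hive-style key=value components out of a path."""
    return dict(p.split("=") for p in path.split("/") if "=" in p)

def prune(paths, **filters):
    """Keep only paths whose partition values satisfy every filter."""
    return [p for p in paths
            if all(partition_of(p).get(k) == v for k, v in filters.items())]

one_day = prune(paths, year="2026", month="04", day="17")
# only the two day=17 files survive; the others are never opened
```

The engine's real pruning happens against the table format's metadata rather than by listing folders, but the logic — eliminate files on partition values before reading anything — is the same.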

The over-partitioning trap

Partitioning on a high-cardinality column — user_id, say — produces a partition per value and catastrophic small-file overhead. Object stores charge per request; a table split into a million partitions costs a million HEADs before any real work begins. The rule: partition on columns with tens to low thousands of distinct values, not more. Date at day granularity is almost always safe; user_id almost never is.

Clustering and sort order

Inside a partition, sorting the data on columns commonly filtered makes min/max statistics in Parquet (Section 7) sharply more selective. Z-ordering and Hilbert curves — supported in varying combinations by Delta Lake, Hudi, Iceberg's rewrite procedures, and Databricks' liquid clustering — are multi-dimensional sort techniques that preserve locality across two or more columns at once, useful when both user_id and event_ts are common filters.
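The core of Z-ordering is bit interleaving, sketched here for two integer columns; real engines first map arbitrary column types onto fixed-width integers, but the interleave itself looks like this:

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return key

# Sorting on the interleaved key keeps rows close in *both* dimensions
# near each other, which a plain lexicographic sort on (x, y) does not.
rows = [(3, 7), (0, 0), (1, 1), (7, 3)]
rows.sort(key=lambda r: z_order_key(*r))
# rows == [(0, 0), (1, 1), (7, 3), (3, 7)]
```

The payoff is that min/max statistics stay reasonably tight on both columns at once, so predicate pushdown works for filters on either one.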

File size, and compaction

Optimal Parquet files are 128 MB to 1 GB. Smaller files pay per-file overhead; larger files hurt parallelism and predicate pushdown. A streaming ingest writes thousands of small files; a batch compaction job periodically rewrites them into a few large ones. Delta Lake's OPTIMIZE, Iceberg's rewrite_data_files, and Hudi's compaction schedulers all exist specifically for this.
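A compaction planner is, at heart, a bin-packing pass over file sizes. A greedy sketch — the target size and strategy here are illustrative, not any engine's actual algorithm:

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group files into rewrite batches of roughly target_mb."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)          # close the full batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 10,000 tiny 1 MB streaming files become 20 rewrite batches of ≤512 MB
plan = plan_compaction([1] * 10_000)
```

Each batch then becomes one rewrite job that reads the small files and emits a single well-sized Parquet file — which is essentially what Delta's OPTIMIZE and Iceberg's rewrite_data_files schedule under the hood.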

The layout-driven query

For almost any analytical table, roughly 90% of practical query performance comes from three decisions: the partition column, the clustering column, and whether compaction runs. A query that is slow on a well-laid-out table almost never gets faster by adding indexes; it gets faster by fixing the layout.

Section 11

Open table formats — Delta Lake, Iceberg, Hudi

A single Parquet file is a file. A table is the claim that a specific set of files, taken together with a schema and a history, constitute a logical unit. The open table formats — Delta Lake, Apache Iceberg, Apache Hudi — are the three widely-adopted standards for making that claim.

Delta Lake

Delta Lake was created at Databricks and open-sourced in 2019. Its metadata is a sequence of JSON log files (_delta_log/) that record every commit — files added, files removed, schema changes, statistics. The log is append-only, ACID-managed via optimistic concurrency control, and replayable from scratch. Delta's tight integration with Spark was historically its main advantage; as of 2024, Delta is also natively readable by Snowflake, BigQuery, DuckDB, Trino, and many more.
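The log-replay idea is small enough to sketch. A toy version with simplified action records — real Delta commits also carry schema, statistics, and protocol fields — showing both replay and the time-travel read it enables:

```python
log = [  # one entry per committed version of the table
    [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}],    # version 0
    [{"add": "part-2.parquet"}],                               # version 1
    [{"remove": "part-0.parquet"}, {"add": "part-3.parquet"}], # version 2
]

def live_files(log, as_of_version=None):
    """Replay commits 0..n to compute the live file set at version n."""
    files = set()
    last = len(log) - 1 if as_of_version is None else as_of_version
    for actions in log[: last + 1]:
        for a in actions:
            if "add" in a:
                files.add(a["add"])
            else:
                files.discard(a["remove"])
    return files

assert live_files(log) == {"part-1.parquet", "part-2.parquet", "part-3.parquet"}
# time travel: the table exactly as it stood at version 0
assert live_files(log, as_of_version=0) == {"part-0.parquet", "part-1.parquet"}
```

Note that version 2's "update" removed a file and added its rewritten replacement — data files are never modified in place, which is what makes replaying to any historical version possible.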

Apache Iceberg

Apache Iceberg originated at Netflix (2017) to solve the Hive-metastore problems that had become acute at their scale. Its metadata is more elaborate than Delta's: a table-metadata file (JSON) records the current state and snapshot history; each snapshot points at a manifest list, which in turn points at manifests carrying per-file statistics — the lists and manifests stored as Avro. The layered design makes Iceberg's commits cheap even for tables with millions of files, and its schema and partition evolution are more flexible than Delta's. Iceberg has become the de-facto standard for multi-engine interoperability, with first-class support across Snowflake, BigQuery, Databricks, Trino, Flink, and most others.

Apache Hudi

Apache Hudi (Hadoop Upserts Deletes and Incrementals) originated at Uber (2016) with a specific workload in mind: very frequent upserts and deletes, incremental consumption by downstream pipelines. It supports two table types — Copy-on-Write, like Delta and Iceberg, and Merge-on-Read, which keeps recent changes in log files and merges on query. Hudi is less universally adopted than Delta or Iceberg, but it is still the right choice when the workload is dominated by streaming upserts.

Convergence and interoperability

The three formats solve the same problem with different shapes, and the 2023–2025 period saw a wave of interoperability: Apache XTable (née OneTable) translates metadata between all three; Databricks' Delta UniForm publishes Iceberg metadata alongside Delta; Snowflake's Polaris and Databricks' Unity Catalog both speak Iceberg natively. The practical consequence: picking any one of the three no longer locks a team out of the others.

The choice for most teams

If the stack is Databricks-first, use Delta; the integration is tightest and the tooling is most mature. Otherwise, use Iceberg; it has the broadest multi-engine support and vendor-neutral Apache governance. Use Hudi only if the dominant workload is high-frequency upserts from streaming sources.

Section 12

ACID on object storage, and why it was a hard problem

Object storage offers no multi-object transactions, no atomic rename, and charges per request — and for most of its history it was only eventually consistent. Turning that into an ACID-compliant transactional store — which is exactly what the open table formats did — is one of the more underappreciated engineering achievements of the last decade.

The classical ACID properties

Atomicity: a transaction either commits in full or leaves no trace. Consistency: the table never appears in a state that violates its schema or constraints. Isolation: concurrent transactions appear to run one at a time. Durability: once a transaction is acknowledged, it survives failure. OLTP databases have delivered all four for fifty years; delivering them on object storage took more than a decade after S3 launched, and it is the core contribution of the table formats.

Optimistic concurrency on immutable files

The shared trick across Delta, Iceberg, and Hudi: data files are immutable — once written, they are never modified. Updates and deletes produce new files; the "version" of the table is a pointer to a specific set of them. A writer prepares its change, then attempts to commit by atomically swapping the pointer; if another writer beat it to the commit, it retries. This optimistic concurrency control works because the pointer swap can be made atomic — via a catalog service, an external lock table, or, since S3 added conditional writes in 2024, the object store itself.
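The commit protocol can be sketched as a compare-and-swap loop on the version pointer — a single-process toy in which a lock stands in for the store's atomic-swap primitive:

```python
import threading

class TablePointer:
    """Toy stand-in for the atomic 'current version' pointer."""
    def __init__(self):
        self.version = 0
        self._lock = threading.Lock()  # stands in for put-if-absent

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self.version != expected:
                return False           # someone else committed first
            self.version = new
            return True

def commit(pointer, prepare, max_retries=5):
    """Optimistic commit: prepare against a snapshot, CAS the pointer, retry on conflict."""
    for _ in range(max_retries):
        snapshot = pointer.version     # read the current version
        prepare(snapshot)              # write new immutable data files
        if pointer.compare_and_swap(snapshot, snapshot + 1):
            return snapshot + 1        # committed
        # lost the race: re-read, re-validate, retry
    raise RuntimeError("too many concurrent commits")

ptr = TablePointer()
v1 = commit(ptr, lambda snap: None)
v2 = commit(ptr, lambda snap: None)
print(v1, v2)  # 1 2
```

The key property: losing the race costs only a retry of the cheap metadata step — the expensive data files already written remain valid, because they are immutable and unreferenced until a commit points at them.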

Time travel

Because old commits are never deleted until an explicit cleanup, every historical version of the table is queryable. Time travel (SELECT ... FROM orders TIMESTAMP AS OF '2026-04-01') is the reproducibility property ML depends on — retrain the exact model on the exact data, six months later, with zero ambiguity. It is also the primary tool for debugging data-quality incidents: diff the versions before and after the bad commit to find what changed.

The isolation level is snapshot

All three formats implement snapshot isolation: a transaction sees a consistent view of the table as of its start time, even if other transactions commit in the meantime. This is weaker than serialisable isolation but strong enough for almost every analytical workload, and the performance is dramatically better.

The surprising robustness

The combination of immutable files, optimistic concurrency, and snapshot isolation turns out to scale to very large tables — hundreds of millions of files, petabytes of data, hundreds of concurrent readers and a handful of concurrent writers — with correctness guarantees that the older warehouse appliances also had, but at a fraction of the cost. This, more than any other single technical fact, is why the lakehouse architecture has become dominant.

Section 13

Dimensional modelling, still the most useful thing Kimball ever wrote

Storage format is what; schema shape is how. Regardless of whether the table lives in a warehouse or a lakehouse, the way its columns are organised — which columns are facts, which are dimensions, which are keys — determines whether analytical queries are simple or tortured.

Facts and dimensions

Fact tables record the business events: orders, clicks, sensor readings, payments. They are narrow (a dozen columns), deep (billions of rows), and mostly numeric. Dimension tables describe the entities referenced by facts: users, products, dates, regions. They are wide (dozens to hundreds of columns), shallow (thousands to millions of rows), and mostly text. A query almost always joins one fact table to several dimension tables.

The star schema

The canonical layout: a central fact table surrounded by dimension tables joined by foreign keys. It is called a star because the diagram looks like one. A sale-fact table keyed by (date_id, store_id, product_id, customer_id) joined to dim_date, dim_store, dim_product, dim_customer is, thirty years on, still the most productive way to model a retail warehouse.
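A miniature star schema can be stood up in a few lines of SQLite; the table and column names here are illustrative:

```python
import sqlite3

# One fact table, two dimensions — the smallest possible star.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
    CREATE TABLE fact_sales  (date_id INTEGER, product_id INTEGER, amount REAL);
""")
db.executemany("INSERT INTO dim_product VALUES (?,?,?)",
               [(1, "widget", "hardware"), (2, "ebook", "media")])
db.executemany("INSERT INTO dim_date VALUES (?,?,?)",
               [(10, "2026-04-01", "2026-04"), (11, "2026-04-02", "2026-04")])
db.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
               [(10, 1, 9.5), (10, 2, 4.0), (11, 1, 9.5)])

# The canonical star query: fact joined to dimensions, grouped by attributes.
rows = db.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_date d    USING (date_id)
    GROUP BY p.category, d.month
    ORDER BY p.category
""").fetchall()
print(rows)  # [('hardware', '2026-04', 19.0), ('media', '2026-04', 4.0)]
```

The shape of that query — one fact table, several dimension joins, a GROUP BY over dimension attributes — is the signature of the star schema, and it is exactly what columnar engines optimise for.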

Snowflake, and why most teams do not use it

In a snowflake schema (no relation to the product), dimension tables are themselves normalised — dim_product splits into dim_product, dim_category, dim_supplier. It looks cleaner on a diagram and queries worse: each extra join adds a shuffle. In analytical systems where disk is cheap, the denormalised star wins; most modern warehouses recommend it.

Slowly-changing dimensions

A customer moves from France to Germany; does the old sale become a German sale? The answer depends on the business, and the modelling technique that expresses the choices is slowly-changing dimensions (SCD). SCD Type 1 overwrites; SCD Type 2 keeps history with effective-date ranges; SCD Type 3 keeps a previous-value column. Type 2 is the right choice for almost any dimension where history matters for analytics.
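An SCD Type 2 update can be sketched in plain Python; the schema and field names below are hypothetical:

```python
from datetime import date

# SCD Type 2: history kept as rows with effective-date ranges.
# valid_to of None marks the currently-effective row.
dim_customer = [
    {"customer_id": 42, "country": "FR",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def scd2_update(rows, key, new_country, as_of):
    """The Type 2 move: close the current row, open a new one."""
    for row in rows:
        if row["customer_id"] == key and row["valid_to"] is None:
            row["valid_to"] = as_of                    # close history
    rows.append({"customer_id": key, "country": new_country,
                 "valid_from": as_of, "valid_to": None})

scd2_update(dim_customer, 42, "DE", date(2026, 3, 15))
current = [r for r in dim_customer if r["valid_to"] is None]
print(current[0]["country"])  # DE
print(len(dim_customer))      # 2  (history preserved)
```

Type 1, by contrast, would simply overwrite the country field in place — cheaper, but the January sale would silently become a German sale.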

The one big table and the wide-table renaissance

Modern columnar engines compress sparse wide tables so well that many teams have abandoned star schemas in favour of a single denormalised wide table — every sale joined to every dimension at write time, stored once. This loses the update efficiency of dimensions but simplifies queries and is often faster. The choice between modelled star and denormalised wide depends on whether dimensions change often and whether the team has the discipline to maintain a model.

The still-valid advice

The exact physical shape matters less than the discipline of naming facts, dimensions, and grain explicitly. A team that can answer "at what grain is this fact table, and which dimensions describe it?" has already done the hardest part of dimensional modelling, regardless of whether the answer ends up as a star schema or a one-big-table.

Section 14

Separating storage from compute, the architectural move that changed everything

For most of the history of databases, storage and compute were welded together — data lived on the machines that queried it, scaling one meant scaling the other. The cloud era broke the welding; the consequences have propagated through almost every decision in this chapter.

The old coupling

A classical warehouse appliance — Teradata, Netezza, original Redshift — stored its data on the local disks of the compute nodes. Adding capacity meant adding a node; adding compute meant moving data to rebalance; shrinking a cluster was almost impossible without downtime. The capacity of the cluster and the size of the data were the same number, which is fine when both grow together and punishing when they do not.

The decoupling move

Snowflake (2014) and BigQuery (announced 2010, generally available 2012) shared the critical design choice: data lives permanently in object storage (S3, GCS); compute is a stateless, elastic pool of virtual warehouses that mount the storage on demand. Ten virtual warehouses can query the same table simultaneously without contention. A warehouse can be scaled up for a big ETL job and scaled down the moment it finishes. The cost of storage (pennies per GB-month) is decoupled from the cost of compute (dollars per node-hour).

What it unlocked

Four things followed. Isolation by workload: finance, analytics, and ML each get their own compute pool, sized to their own workload. Pay-per-query economics: idle compute costs nothing. Multi-engine access: the same files can be queried by Spark, DuckDB, and Snowflake, depending on which is best for the job. The lakehouse itself: the whole architecture of Section 5 is an outcome of this decoupling.

The catch: caching and locality

The weakness of the decoupled model is that every query reads from object storage, which has 10–100 ms latency per request — much slower than a local disk. Modern warehouses compensate with aggressive local caching on the compute nodes: the first query on a file is slow, subsequent queries on the same file are fast. Designing for a good cache-hit rate — keeping queries against a compact, recent slice of the data — is the main performance tuning concern of the modern era.

The architectural consequence

Almost every data-platform decision made since 2015 that looked modern and strange — open table formats, the lakehouse, Iceberg's multi-engine claim, the ability to run ten query engines against one copy of the data — is an implication of storage and compute having come apart. Understanding the decoupling is the single most useful way to reason about why the architectures of this chapter look the way they do.

Section 15

S3, GCS, Azure Blob — the substrate of modern data platforms

Underneath every warehouse, every lakehouse, every ML training job that reads a dataset, sits cloud object storage. Understanding it as a primitive — with its specific latency, consistency, and cost profile — is what lets an engineer reason about what everything built on top of it can and cannot do.

The object-storage model

A bucket is a flat namespace of objects. An object is a blob of bytes with a key, some metadata, and a content hash. There is no filesystem hierarchy, no rename, no partial update — an object is written once and either replaced whole or deleted, though ranged GETs allow reading a byte span, which is exactly what columnar readers exploit. The data plane exposes five operations: PUT, GET, HEAD, DELETE, and LIST. The semantics are simple enough that the system can guarantee eleven nines of durability and near-unlimited scale, and restrictive enough that building a transactional table on top of it required a decade.
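The object model fits in a toy class — an in-memory sketch of the data-plane operations, not a client for any real store:

```python
# A toy object store: a flat namespace, whole-object writes, prefix listing.
class Bucket:
    def __init__(self):
        self._objects = {}                      # key -> bytes; no hierarchy

    def put(self, key, data: bytes):
        self._objects[key] = data               # replace whole, never modify

    def get(self, key) -> bytes:
        return self._objects[key]

    def head(self, key) -> dict:
        return {"key": key, "size": len(self._objects[key])}

    def delete(self, key):
        self._objects.pop(key, None)

    def list(self, prefix=""):
        # "Directories" are an illusion: just keys sharing a prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))

b = Bucket()
b.put("sales/date=2026-04-01/part-000.parquet", b"...")
b.put("sales/date=2026-04-02/part-000.parquet", b"...")
print(b.list(prefix="sales/date=2026-04-01/"))
```

The slashes in the keys are pure convention — which is why "partition folders" in object storage are really just key prefixes, and why LIST-by-prefix is the operation that table-format metadata exists to avoid.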

The three major providers

Amazon S3 (2006) invented the category and remains the market and technical reference. Google Cloud Storage is broadly equivalent with a cleaner API and tighter integration with BigQuery. Azure Blob Storage is the Microsoft equivalent, with the same primitives and a separate Data Lake Storage Gen2 layer for hierarchical access. A dataset written to one can be made accessible to engines running on the others, usually via a cross-cloud read path or a replication job.

Consistency, finally

For most of S3's history it was eventually consistent — a PUT might not be visible to an immediate GET. In December 2020, S3 became strongly consistent: every GET, HEAD, and LIST reflects the latest PUT or DELETE, without any architectural change visible to clients. GCS and Azure Blob already offered comparable guarantees. This single change retroactively simplified every data pipeline ever written against object storage, and is one of the quieter large wins of the last decade.

Cost — the one dimension people get wrong

Storage itself is cheap — roughly $0.02 per GB-month in 2026. What tends to surprise first-time operators is request cost ($0.005 per 1000 PUTs, $0.0004 per 1000 GETs), egress cost ($0.09 per GB moved out of a region), and cross-region replication. A badly-designed pipeline that writes millions of small files, or reads entire datasets across regions for every query, produces bills that dwarf storage. This is why Section 10's "files should be 128 MB to 1 GB" rule is not aesthetics: it is arithmetic.
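The arithmetic is worth doing once. A back-of-envelope sketch using the list prices quoted above (treated as assumptions; check current pricing):

```python
# Approximate list prices, per the figures in the text (assumptions).
PUT_PER_1000 = 0.005
GET_PER_1000 = 0.0004
STORAGE_PER_GB_MONTH = 0.02

def monthly_cost(total_gb, file_size_mb, reads_per_file_per_month):
    """Storage plus request cost for one month, in dollars."""
    n_files = total_gb * 1024 / file_size_mb
    storage = total_gb * STORAGE_PER_GB_MONTH
    puts = n_files / 1000 * PUT_PER_1000
    gets = n_files * reads_per_file_per_month / 1000 * GET_PER_1000
    return round(storage + puts + gets, 2)

# 1 TB as 1 MB files versus the same data as 512 MB files,
# each file read 100 times a month:
print(monthly_cost(1024, 1, 100))    # small files: requests dominate
print(monthly_cost(1024, 512, 100))  # right-sized files: storage dominates
```

With a million 1 MB files the request charges are roughly twice the storage bill; at 512 MB per file they vanish into rounding. The 128 MB-to-1 GB rule is that arithmetic, applied.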

Storage classes and lifecycle

Each provider offers a ladder: Standard → Infrequent Access → Glacier / Archive, trading availability for lower per-GB cost. A lifecycle policy moves objects down the ladder automatically — after thirty days to Standard-IA, after a year to Glacier, after seven years to deletion. For long-horizon data — audit logs, ML training sets that may be retrained on — the difference between "store everything forever in Standard" and "store everything forever with a lifecycle" is the difference between an expensive platform and a sustainable one.
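The ladder described above can be written down as a lifecycle rule. The sketch below follows the field names of the S3 lifecycle API as used by boto3's put_bucket_lifecycle_configuration; treat it as illustrative and verify against current documentation:

```python
# The thirty-days / one-year / seven-years ladder from the text, as an
# S3-style lifecycle rule (field names follow the S3 lifecycle API;
# a sketch, not a tested production policy).
lifecycle = {
    "Rules": [{
        "ID": "audit-logs-ladder",
        "Status": "Enabled",
        "Filter": {"Prefix": "audit-logs/"},
        "Transitions": [
            {"Days": 30,  "StorageClass": "STANDARD_IA"},  # infrequent access
            {"Days": 365, "StorageClass": "GLACIER"},      # archive
        ],
        "Expiration": {"Days": 7 * 365},                   # delete after ~7 years
    }]
}

# With boto3 this would be applied roughly as (not executed here):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["ID"])
```

Once the rule is attached, the downward moves happen server-side with no pipeline involvement — which is the whole point: the sustainable platform is the one where aging data demotes itself.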

The mental image to hold

Every table in the modern platform is ultimately a pile of Parquet files in an S3 or GCS or Azure Blob bucket, with a small metadata log next to them. Everything above — the warehouses, the catalogues, the query engines — is about making that pile act like a first-class transactional table. The pile itself is where durability, scale, and cost live.

Section 16

Access, encryption, and cost — the non-technical edges of storage

A data platform that computes the right answer, fast, and exposes the wrong data to the wrong people, is a disaster. A platform that is correct, fast, and secure but costs five times what it should is a different kind of disaster. Storage is where both categories of failure most commonly land.

Access control

The dominant model is role-based access control (RBAC): users inherit permissions from roles; roles grant permissions on objects (tables, views, columns, rows). Every modern warehouse and lakehouse catalogue — Snowflake, BigQuery, Databricks Unity Catalog — implements RBAC with similar grammar: GRANT SELECT ON table TO role. Fine-grained controls — column masking for PII, row-level security for tenant isolation — are now standard.

Encryption — at rest and in transit

Every major object store encrypts data at rest by default, using provider-managed keys. Customer-managed keys (CMK) add the option of encrypting with a key the customer controls, so revoking the key renders the data unreadable — useful for tenant isolation and for regulatory requirements that demand it. Encryption in transit is universal (TLS 1.2+) and does not require configuration. The remaining question is which data, if any, needs client-side encryption — encrypted before it reaches the storage provider — for workloads where the provider itself is outside the trust boundary.

Cost controls

The four axes of cost on a modern platform: storage (cheap but cumulative), compute (expensive, pay-per-query or pay-per-hour), egress (expensive, especially cross-cloud), and request overhead (small per request, catastrophic at small-file scale). The controls: reservation pricing for predictable compute; lifecycle policies for aging storage; query governance (resource monitors, query timeouts, workload management) to cap runaway jobs; and plain measurement — a FinOps dashboard that surfaces cost per team, per table, per query.

Multi-tenancy

A platform shared by multiple teams, products, or external customers multiplies every governance challenge. The common architectural answer: one catalogue per organisation; one schema (database) per team; row-level security enforcing customer boundaries inside tables. A heavier variant is one physical warehouse per tenant — simpler to reason about but more expensive and harder to evolve.

The boring part is the hard part

Access, encryption, and cost are not technically deep compared to the storage-format choices earlier in this chapter. They are, in practice, the part that gets platforms shut down, audited, or quietly abandoned because they became too expensive. Investing in them is unglamorous and disproportionately rewarded.

Section 17

Where storage compounds in machine learning

An ML system is a read workload against a storage system. Training scans enormous slices of the warehouse; feature serving reads tiny rows at low latency; evaluation reproduces historical states. Each of these demands something specific from the storage layer, and each has pathological failure modes when the layer is wrong.

Training reads columnar

A training job on a billion-row table reads a handful of columns, streams them through a model, and repeats. This is exactly the workload Parquet and columnar engines are optimised for. A team training against a CSV or a row-oriented PostgreSQL table is paying an order-of-magnitude penalty for the privilege; moving to Parquet-backed tables is, in practice, the highest-leverage storage change most ML teams ever make.
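The order-of-magnitude claim is just geometry. A pure-Python illustration with made-up sizes, not a Parquet benchmark:

```python
# Bytes touched when training reads 3 of 50 columns (illustrative sizes).
N_ROWS, N_COLS, BYTES_PER_VALUE = 1_000_000_000, 50, 8

def bytes_scanned_row_oriented(cols_needed):
    # Row storage interleaves the columns: every row is read in full,
    # however few columns the job actually needs.
    return N_ROWS * N_COLS * BYTES_PER_VALUE

def bytes_scanned_columnar(cols_needed):
    # Column storage keeps each column contiguous: read only what you need.
    return N_ROWS * cols_needed * BYTES_PER_VALUE

row_gb = bytes_scanned_row_oriented(3) / 1e9
col_gb = bytes_scanned_columnar(3) / 1e9
print(f"row-oriented: {row_gb:.0f} GB, columnar: {col_gb:.0f} GB")
# row-oriented: 400 GB, columnar: 24 GB — before compression widens the gap
```

Per-column compression and encoding then widen the gap further, since a homogeneous column compresses far better than interleaved rows.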

Feature stores sit on a different workload

Serving a model in production needs features for a specific entity at a specific moment: user 42's last-30-day spend, right now, in under ten milliseconds. This is row-oriented, low-latency, high-QPS — the opposite of the training workload. A feature store (Feast, Tecton, Databricks Feature Store) therefore maintains two copies of each feature — an offline columnar store for training and an online row-oriented store (Redis, DynamoDB, CockroachDB) for serving — plus an ingest pipeline that keeps them consistent.

Time travel, for reproducibility

Training produces a model; the model is a function of the exact dataset it saw. The table-format time-travel feature (Section 12) is, for ML, the reproducibility property. "Retrain the model on the state of the warehouse as of 2026-04-01" is a one-line query against an Iceberg or Delta table; it is an archaeological dig against a traditional warehouse. Every experiment's provenance record should include the table version read.

Point-in-time correctness

A subtler trap: joining current dimensions onto historical facts. If a customer moved countries in March, a feature computed in April must not use the current country for a January event — or the model learns from information it could not have known at the time. Point-in-time joins, supported natively by modern feature stores and by any warehouse with SCD Type 2 dimensions, prevent this form of temporal leakage and are one of the most common sources of silent bugs.
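A point-in-time join can be sketched in plain Python — hypothetical schema, with SCD Type 2 validity treated as a half-open interval [valid_from, valid_to):

```python
from datetime import date

# SCD Type 2 dimension rows: the customer moved FR -> DE on 2026-03-15.
dim_customer = [
    {"customer_id": 42, "country": "FR",
     "valid_from": date(2020, 1, 1), "valid_to": date(2026, 3, 15)},
    {"customer_id": 42, "country": "DE",
     "valid_from": date(2026, 3, 15), "valid_to": None},
]
facts = [
    {"customer_id": 42, "event_date": date(2026, 1, 10), "amount": 30.0},
    {"customer_id": 42, "event_date": date(2026, 4, 2),  "amount": 55.0},
]

def point_in_time_join(facts, dim):
    """Each fact picks the dimension row that was valid at its event time."""
    out = []
    for f in facts:
        for d in dim:
            if (d["customer_id"] == f["customer_id"]
                    and d["valid_from"] <= f["event_date"]
                    and (d["valid_to"] is None or f["event_date"] < d["valid_to"])):
                out.append({**f, "country": d["country"]})
    return out

joined = point_in_time_join(facts, dim_customer)
print([(r["event_date"].isoformat(), r["country"]) for r in joined])
# [('2026-01-10', 'FR'), ('2026-04-02', 'DE')]
```

A naive join to the current dimension would label the January event "DE" — information the model could not have known at the time, which is exactly the temporal leakage the text describes.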

The cost of a bad storage choice

A typical retraining job that reads from a well-laid-out Parquet lakehouse completes in minutes and costs cents. The same job against a badly-modelled JSON pile can take hours and cost thousands, and is often why ML teams talk about "our training is too expensive" — the model is not the problem, the storage is. The investment in storage discipline pays back, literally, on every training run for the life of the project.

The through-line

Machine learning looks like an algorithmic discipline and is, in production, a storage discipline wearing an algorithmic mask. The shape of the tables, the format of the files, the partitioning of the folders, the version log of the lakehouse — these determine how fast training is, how reproducible experiments are, how cheap the platform is to run, and how fast the next model can be shipped. Every chapter in the rest of Part III builds on top of these choices; getting them right early is the single most durable compounding investment a data team makes.

Further reading

Where to go next

Storage is a field with a small canonical bibliography (the warehouse bibles, the Dremel and C-Store papers, the lakehouse paper) and a much larger practical one (format specs, engine docs, operational playbooks). The list below picks the references that repay serious reading, the specs worth knowing in detail, and the project documentation most working engineers return to.

The canonical books

Foundational papers

Columnar file-format specifications

Open table formats — project documentation

Warehouses and query engines

Cloud object storage

ML-adjacent storage topics

This page is the second chapter of Part III: Data Engineering & Systems. The next — Data Pipelines & Orchestration — covers how bytes actually get from the ingestion sources of Chapter 01 into the storage layouts described here, and how they get transformed along the way: Airflow, Prefect, Dagster, dbt, the batch-versus-stream trade-off, and the discipline of making a pipeline whose failure modes are visible. After that come streaming and real-time systems, distributed compute, the cloud platforms everything runs on, and finally the governance layer that keeps it all auditable.