Between the ingestion pipeline that lands a row and the model that trains on it sits the whole of data storage — the question of where bytes live, in what shape, on which hardware, at what cost, with which guarantees, and for how long. The answer has changed more in the last decade than in the three before it. The classical data warehouse, the Hadoop-era data lake, and the modern lakehouse are in some sense the same artefact viewed from three epochs; this chapter works through the architectures, file formats (Parquet, ORC, Avro), open table formats (Delta Lake, Apache Iceberg, Apache Hudi), and compute-storage separation patterns that together make analytical data engineering possible at any non-trivial scale.
The first two sections are orientation: why storage is the substrate every downstream system depends on, and the transactional/analytical split (OLTP versus OLAP) that explains why different workloads demand different storage. Sections three through five tell the architectural story — warehouse, lake, lakehouse — and why the industry has, after twenty years of oscillation, settled on the third as the reasonable default. Sections six and seven are the row-versus-columnar question and a detailed walk through Parquet, the file format most analytical systems now write. Sections eight and nine cover the rest of the file-format landscape (ORC, Avro, JSON/CSV) and the compression and encoding techniques that make columnar storage fast. Sections ten and eleven are about physical layout: partitioning, clustering, file sizing. Sections twelve and thirteen bring in the open table formats — Delta, Iceberg, Hudi — and the ACID guarantees they add on top of object storage. Sections fourteen through sixteen cover the architectural consequences: dimensional data modelling, separation of storage from compute, and the cloud object stores (S3, GCS, Azure Blob) that now underlie nearly every serious analytical platform. The final section connects all of it to what it changes about training, serving, and evaluating machine-learning models.
Conventions: code snippets are Python, SQL, or shell where illustrative; sizes and throughput figures assume commodity cloud object storage and current-generation warehouse engines. "Warehouse" in what follows means a columnar analytical engine — Snowflake, BigQuery, Redshift, Databricks SQL, DuckDB — rather than the classical 1990s appliance. The goal is to leave the reader able to reason about any analytical data system at the level of its storage and table-format choices, which is where most of the decisions with long-lived consequences are actually made.
A data platform is a tower, and the floor beneath it is storage. Every pipeline, every query engine, every dashboard, every training run inherits its properties — its latency, its durability, its cost, its schema — from the storage layer. The choices made there compound for years; the choices made in the layers above are comparatively easy to undo.
It is tempting to think of storage as a commodity — a place where bytes wait for work to be done on them — and treat the interesting questions as belonging to the compute layer. For an individual analyst running a single query against a small table, that view is defensible. For any team operating at the scale of terabytes or larger, it is exactly backwards. The cost, the throughput, the consistency model, and the schema evolution of the storage layer are the constraints within which everything else operates.
Every storage decision is an answer, explicit or otherwise, to three questions. Where do the bytes live? — on attached disk, in an object store, in a warehouse's proprietary format. In what shape? — row-oriented or columnar, one file or many, partitioned or not. With what guarantees? — durable to how many copies, consistent under which workloads, recoverable after which failures, readable by which engines. The answers vary by workload, and this chapter is largely an elaboration of what those variations look like.
The dominant architecture has swung twice in thirty years. The 1990s and early 2000s were the age of the data warehouse — dedicated appliances, proprietary columnar formats, rigid schemas. The late 2000s and 2010s were the age of the data lake — cheap object storage, flexible schemas, MapReduce and then Spark on top. The 2020s have brought the lakehouse, which restores warehouse semantics (ACID, schema enforcement, time travel) to lake-scale storage. Understanding the swings is not a history lesson; it is how to recognise which stage of the pendulum any given team is actually on.
Storage is the longest-lived part of a data platform. Pipelines get rewritten, warehouses get migrated, job schedulers change every few years; the bytes themselves last decades. The effort spent on getting storage right early pays back every time something above it gets replaced.
The single most useful distinction in data storage is between the systems that record what just happened and the systems that analyse what has happened over time. They look similar from the outside — both store tables, both are queried with SQL — and they are engineered on almost opposite principles.
OLTP (online transaction processing) systems are the operational backends of a business: the user database, the order ledger, the payments system. Their workload is many small transactions per second, each touching a handful of rows, each requiring strict consistency (no double-charged credit cards) and low latency (no user waiting on a page load). PostgreSQL, MySQL, SQL Server, Oracle, CockroachDB, Spanner — all are row-oriented, B-tree indexed, and optimised for tens of thousands of single-row operations per second per core.
OLAP (online analytical processing) systems answer questions about what happened. Their workload is few large queries, each scanning millions or billions of rows but usually only a handful of columns, often aggregating across time. Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse, Druid, DuckDB — all are columnar, compressed, and optimised to scan tens of gigabytes per second per core.
A table of a billion orders, 40 columns wide, lives on the OLTP side as a row-oriented B-tree where finding order #12345 is an O(log n) operation touching one page. On the OLAP side, it lives as a columnar file where summing the total_usd column across ten years reads exactly one column, compressed, in a single streaming scan. Either architecture, asked to do the other's workload, performs one to two orders of magnitude worse.
The reconciliation, in practice, is to keep both — the OLTP system of record is the source of truth; a copy of its state, updated continuously or periodically, lives in the OLAP warehouse. Change-data-capture (CDC) pipelines read the OLTP write-ahead log and replay its mutations into the warehouse; dual writes are more fragile and less favoured. The next chapter covers pipelines and orchestration in detail; the storage story depends on which side of this split a given dataset is meant to support.
A decade of effort has gone into HTAP systems (hybrid transaction-analytical processing) that serve both workloads from one store. SingleStore, TiDB, and Snowflake's Unistore are the best-known attempts. They work well for modest scales; at large scale, the fundamental trade-offs between row-oriented and column-oriented layouts reassert themselves, and most serious platforms still run both.
The data warehouse, as an idea, is fifty years old. As a product category, it is thirty. The modern cloud warehouse has little in common with the Teradata appliance of 1995 — but the vocabulary, the schema patterns, and the mental model all come from that lineage, and are still how working engineers describe what they are doing.
Two competing philosophies dominate the warehouse literature. Bill Inmon's 1992 Building the Data Warehouse argued for a single, enterprise-wide, third-normal-form warehouse — the "single version of the truth" — feeding downstream data marts. Ralph Kimball's 1996 The Data Warehouse Toolkit argued for a bottom-up collection of dimensional data marts, each structured as a star schema, and is the approach most modern warehouses actually implement. The two books are still the canon; both repay a slow reading, thirty years later.
A warehouse is a schema-on-write store — every table has a declared schema and writes must conform — optimised for analytical queries. Classically it sat on dedicated hardware with columnar storage, massive parallelism across shared-nothing nodes, and a SQL dialect with extensions for windowing, rollups, and cube operations. Teradata, Netezza, Vertica, and early Redshift are the archetypes.
Snowflake (2014) broke the appliance model by separating storage (S3 by default) from compute (ephemeral virtual warehouses). BigQuery (2011) did the same on Google's Colossus. Databricks SQL added warehouse semantics on top of lakehouse storage. The result, by the mid-2020s, is that almost no serious organisation buys warehouse hardware any more; the warehouse is a SQL engine rented by the second, reading from cheap object storage.
Schema enforcement, SQL support, transactional semantics, mature permissions, and predictable query latency. A warehouse is what a BI team wants to query; it is what a finance team wants to audit; it is what a compliance team wants to govern. For workloads that fit its schema-on-write model — structured tabular data, joined across well-understood business entities — it remains, thirty years on, the right tool.
Schema-on-write is a virtue when the schema is known and a constraint when it is not. Semi-structured logs, image and video data, machine-learning features, and the long tail of "we do not yet know what this will be used for" data all pushed against the warehouse's rigidity, and eventually produced the data lake.
If the warehouse was an answer to "how do we query structured business data efficiently", the data lake was the answer to "how do we store everything else, cheaply, without deciding in advance what it is for". It was a brilliant idea for about five years and a cautionary tale for the five after that.
A data lake is object or distributed-file storage that holds raw data in its native format — CSVs, JSON logs, images, audio, Parquet, whatever — organised by folder conventions rather than by a central schema. The canonical 2010s stack was HDFS (Hadoop Distributed File System) for storage and MapReduce, then Spark, for compute. The promise: cheap petabyte-scale storage, flexible schemas, any workload that could be expressed as a distributed job.
The lake's defining property is schema-on-read: data is written in whatever shape it arrives; a consumer imposes a schema only at query time. This is liberating when a team does not yet know what the data will be used for, and corrosive when many teams impose incompatible schemas on the same files. The virtue at small scale becomes a liability at organisational scale.
By the late 2010s, almost every large company that built a lake had also built a data swamp: petabytes of undocumented files, redundant copies, broken partition layouts, jobs that no one remembers writing. The failure was not technical — the storage worked fine — but governance: lakes lacked the schema enforcement, access control, transactions, and discoverability that warehouses had spent thirty years getting right.
The other shift was underneath: HDFS, which required dedicated clusters and complex operations, was superseded by cloud object stores — Amazon S3, Google Cloud Storage, Azure Blob Storage — which offered effectively infinite capacity, eleven-nines durability, pay-per-use pricing, and no cluster to run. By the early 2020s almost every new lake was built on object storage; HDFS survives mostly in on-premises Hadoop installations.
What the lake got right — cheap, decoupled, format-agnostic storage — is now universal. What it got wrong — no transactions, no schema, no lineage — is what the lakehouse was built to fix.
The lakehouse is the third act: keep the cheap, scalable, open-format storage of the lake, and bolt onto it the transactional guarantees, schema enforcement, and indexing that made the warehouse usable. The idea was named and defended in a 2021 paper by Armbrust, Ghodsi, Xin, and Zaharia; in the years since it has become the de-facto architecture for new analytical platforms.
A lakehouse stores table data as columnar files (almost always Parquet) in an object store, and stores metadata — which files belong to which table, what their schema is, which versions exist, who touched them — as a small, versioned, ACID-managed catalogue. The catalogue lives either as JSON/Avro log files alongside the data (Delta Lake, Iceberg) or in a separate service (Hive Metastore, Unity Catalog, Snowflake's Polaris). Query engines read the catalogue first, then the data.
Four properties come back. Atomic writes: a failed job leaves no half-committed partial files visible. Schema enforcement: writes that violate the declared schema are rejected before landing. Time travel: any historical version of the table can be queried by timestamp or ID, which is the reproducibility property ML training depends on. Concurrent readers and writers: a streaming job and an analytical query can touch the same table without stepping on each other.
The critical change relative to the old warehouse model is that the data and the metadata are open. A Parquet file written by Spark is readable by DuckDB; a Delta table written by Databricks is now readable by Snowflake, BigQuery, and Trino; an Iceberg table written by Flink is readable by almost everything. Storage and compute become independently replaceable components.
By 2025, the three major cloud warehouses (Snowflake, BigQuery, Databricks) all read and write open table formats natively; Iceberg specifically has become the near-universal wire format. The industry argument of "warehouse versus lake" is effectively over, and what replaced it is a set of engineering decisions about which open format, which compute engine, and which catalogue to pick — all within a broadly lakehouse shape.
A modern data platform is: cheap columnar files in object storage, an open table format layered on top for transactions and time travel, a catalogue service for discovery and access control, and one or more query engines renting compute by the second to run queries against it. Everything else in this chapter is a zoom-in on one of those four boxes.
Two workloads, two storage layouts. Row-oriented storage keeps all fields of a record together; columnar storage keeps all values of a column together. The choice looks trivial and is not — it produces order-of-magnitude performance differences on exactly the workloads that analytical systems run.
Consider a table with columns (user_id, country, signup_ts, revenue_usd, ...) and a billion rows. In a row-oriented store, byte 0 is user_id of row 0, byte 4 is country of row 0, and so on — each record is contiguous. In a columnar store, the first block is every user_id in order, the next block is every country, and the fourth block is every revenue_usd. Physically, the same data; logically, the same table; I/O behaviour, totally different.
Four compounding effects. I/O reduction: an analytical query of the form SELECT SUM(revenue_usd) FROM orders WHERE country='FR' reads two columns, not forty, so 95% of the file is skipped before any work begins. Compression: a column of a single type compresses far better than a row of mixed types — a country column with high repetition compresses 20–50×. Vectorised execution: a CPU operating on a tight array of floats can use SIMD and keep its pipelines full in a way a pointer-chasing row scan cannot. Cache friendliness: each column fits contiguous cache lines, so L1 and L2 hit rates rise dramatically.
Transactional workloads. Reading the entire record for user 42 — all forty columns — costs forty seeks in a columnar store and one in a row store. Writing a new order pays a similar penalty: a columnar store has to append to forty column files; a row store appends to one. OLTP still runs on row-oriented storage for this reason, and nothing in the last decade has changed that.
Columnar databases existed in research before 2000 (Sybase IQ, MonetDB) but did not go mainstream until C-Store (2005) and its commercial descendant Vertica (2007). The idea then spread to Parquet (2013) and ORC (2013) as file formats, Redshift (2012), Snowflake (2014), and BigQuery's Dremel/Capacitor as cloud warehouses, and ClickHouse (2016) as the open-source poster child. By now every serious analytical engine is column-oriented; the only arguments are about which flavour.
A columnar query engine is reading tight arrays of one type, compressed, vectorised, and streamed. A row-oriented engine is reading records one at a time, mixed types, with pointer chasing. Analytics looks nothing like OLTP once you look at it from the storage up, and almost all the interesting engineering sits at this boundary.
Apache Parquet is the file format analytical systems write. Every warehouse, every lakehouse engine, every Spark job, every DuckDB session, every Hugging Face dataset of any size lands ultimately in Parquet. Knowing the internals is the difference between writing files that query well and files that are a drag on every consumer of them.
A Parquet file is a tree. At the top, a file contains one or more row groups (typical size: 128–512 MB). Each row group contains one column chunk per column. Each column chunk contains a sequence of pages (typical size: 1 MB), which are the smallest unit of compression and decoding. Metadata — schemas, statistics, offsets, encodings — lives in a footer at the end of the file, which lets readers do one seek to the footer, decide which row groups and column chunks are relevant, and skip the rest.
Every column chunk in the footer carries min/max/null-count statistics. A query with WHERE revenue_usd > 10000 reads the statistics first and skips any column chunk whose max is below 10000 — often skipping 90% of the file without reading a byte of data. This predicate pushdown is why Parquet files with well-sorted columns and correctly-sized row groups can outperform a full-table scan by two orders of magnitude.
Parquet supports multiple encodings per column. PLAIN is the fallback. Dictionary encoding replaces repeated values with small integer codes and a dictionary — dramatically effective for categorical columns. Run-length encoding (RLE) compresses runs of identical values. Bit-packing uses only as many bits as the value range requires. Delta encoding stores differences rather than absolute values, which wins for sorted integers and timestamps. Modern writers (pyarrow, Spark) pick encodings automatically; manual overrides are worth the trouble only in extreme cases.
```python
import pyarrow as pa
import pyarrow.parquet as pq

# `table` is a pyarrow.Table assembled upstream
pq.write_table(
    table,
    "orders/part-0001.parquet",
    compression="zstd",       # better than snappy for most analytical workloads
    row_group_size=500_000,   # rows per row group; target ~256 MB on disk
    use_dictionary=True,
    write_statistics=True,    # enables predicate pushdown
)
```
Three rules of thumb: row groups of 128–512 MB (smaller causes excessive metadata overhead; larger hurts parallelism), Zstd compression unless the query engine cannot read it, and sort on the columns most commonly filtered — which makes min/max statistics far more selective.
Every major engine — Spark, Flink, Trino, DuckDB, Snowflake, BigQuery, Databricks, Athena, Redshift Spectrum, pandas, Polars — reads Parquet natively. Writing Parquet rather than a proprietary format is the single most portable storage decision a team can make, and it costs nothing beyond picking sensible defaults.
Parquet won the columnar file-format wars, but it did not eliminate the competition. A working engineer encounters at least three other formats regularly and should know which one is meant to solve which problem.
Apache ORC (Optimized Row Columnar) emerged from the Hive ecosystem in 2013, roughly simultaneously with Parquet. Internally it is similar — row groups called stripes, columnar chunks, per-stripe statistics — but with somewhat richer built-in indexing (bloom filters, row-level indexes) and tighter coupling to Hive and older Hadoop engines. It remains the default in some Cloudera deployments; outside that ecosystem, Parquet dominates.
Apache Avro is row-oriented, not columnar, and aimed at a different workload: the serialised-message problem. An Avro file stores records with their schema; readers can evolve the schema (add optional fields, rename with aliases) without breaking. This makes Avro the default choice for Kafka topics, Snowflake's Snowpipe ingestion, and any workload where each record is read as a whole. It is strictly worse than Parquet for analytical scans, and strictly better for event streams.
CSV is 1972 technology; JSON is 2001. Both are row-oriented, uncompressed, and slow — and both are still ubiquitous in ingestion because humans can read them, every tool writes them, and the path of least resistance from an API to a file is json.dump. A pragmatic pipeline ingests in JSON or CSV, converts to Parquet on landing, and queries only the Parquet. CSV is, in particular, a minefield — no schema, ambiguous quoting, no portable null encoding — and no production analytical system should be built on it.
Apache Arrow is an in-memory columnar layout, not a file format, but its on-disk serialisation — Feather V2 / Arrow IPC — is a useful near-zero-copy format for handing columnar data between Python, R, and Rust processes on the same machine. It is faster to read than Parquet but larger on disk and lacks Parquet's statistics; good for caches and scratch space, not for the long-term warehouse.
Lance (2023, by LanceDB) is a columnar format aimed specifically at ML: vector indexes, random row access, versioning. Vortex (2024, by Spiral) promises better compression than Parquet with faster scans. Neither has displaced Parquet yet; both are worth watching for 2027.
Store analytical tables in Parquet. Store event streams in Avro. Use JSON and CSV as transient ingest formats only. Revisit the decision if, and only if, a specific workload shows that a newer format would actually move a measurable number.
A columnar file stacks two independent kinds of size reduction: the column-specific encodings that exploit structure inside a single column, and a general-purpose compressor applied on top. Understanding the two layers is how a team decides whether a 2 TB file should really be 200 GB.
Encodings are type-aware and structure-aware. Dictionary encoding gives a 10–100× win on categorical columns with low cardinality — a country column with 200 distinct values becomes an 8-bit dictionary code plus a 200-entry dictionary. Run-length encoding turns [1,1,1,1,2,2] into [(1,4),(2,2)]. Bit-packing uses the minimum bits needed for the range — a column of integers in [0,1023] uses 10 bits per value, not 32 or 64. Delta encoding stores differences — brilliant for sorted timestamps where consecutive differences are small.
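Run-length encoding is simple enough to sketch in full; a toy illustration of the idea, not Parquet's actual RLE/bit-packing hybrid:

```python
def rle_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)  # extend the current run
        else:
            out.append((v, 1))             # start a new run
    return out

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in pairs for _ in range(n)]

print(rle_encode([1, 1, 1, 1, 2, 2]))  # [(1, 4), (2, 2)]
```

Six values become two pairs; on a billion-row column sorted by the encoded value, the ratio is far more dramatic.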
On top of encoded bytes, a general-purpose compressor runs. The common choices: Snappy (fast, modest ratio — the historical Parquet default); Gzip (slow, good ratio — older and slower than alternatives); Zstandard / Zstd (tunable ratio, excellent speed — the modern default); LZ4 (fastest, worst ratio — useful for hot caches); Brotli (highest ratio, slowest — useful for write-once archival data).
For most analytical workloads, Zstd at level 3 is the current sweet spot: roughly 2× better compression than Snappy, roughly 5% slower to decompress, readable by every modern engine. For long-lived cold archives, Zstd at level 9 or Brotli produce 10–30% smaller files at a much higher write cost. For a real-time ingest path where write throughput dominates, Snappy or LZ4 still win.
A well-encoded Parquet file gains most of its compression from the encoding layer: a country column of a billion rows is already dictionary-encoded to 1 GB before any compressor runs. The compressor then adds another 2–3×. A poorly-encoded file — e.g., a PLAIN-encoded string column — leans entirely on the compressor and produces files two to five times larger than they should be. Diagnosing Parquet size issues almost always means checking the encoding, not the compressor.
Typical Parquet files with sensible defaults are 5–20× smaller than the equivalent CSV, and broadly comparable in size to ORC on the same workload. A file that does not hit those ratios almost always has a miscoded column — usually a high-cardinality string that defeated the dictionary, or a wide JSON blob that defeated everything.
A columnar file is fast. A directory of columnar files is fast only if its physical layout matches the queries run against it. The decisions about how to partition and sort data are among the most consequential and least visible a team makes.
Partitioning splits a table into disjoint subsets — typically by date, region, or tenant — stored in separate folders. A query filtered by the partition column reads only the relevant folders; the engine prunes the rest without opening a file. A well-partitioned events table (events/year=2026/month=04/day=17/) serves a one-day query by reading 1/365 of the data, with no other optimisation.
Partitioning on a high-cardinality column — user_id, say — produces a partition per value and catastrophic small-file overhead. Object stores charge per request; a table split into a million partitions costs a million HEADs before any real work begins. The rule: partition on columns with tens to low thousands of distinct values, not more. Date at day granularity is almost always safe; user_id almost never is.
Inside a partition, sorting the data on columns commonly filtered makes min/max statistics in Parquet (Section 7) sharply more selective. Z-ordering (Delta Lake, Hudi) and Hilbert curves (Iceberg) are multi-dimensional sort techniques that preserve locality across two or more columns at once — useful when both user_id and event_ts are common filters.
Optimal Parquet files are 128 MB to 1 GB. Smaller files pay per-file overhead; larger files hurt parallelism and predicate pushdown. A streaming ingest writes thousands of small files; a batch compaction job periodically rewrites them into a few large ones. Delta Lake's OPTIMIZE, Iceberg's rewrite_data_files, and Hudi's compaction schedulers all exist specifically for this.
For almost any analytical table, roughly 90% of practical query performance comes from three decisions: the partition column, the clustering column, and whether compaction runs. A query that is slow on a well-laid-out table almost never gets faster by adding indexes; it gets faster by fixing the layout.
A single Parquet file is a file. A table is the claim that a specific set of files, taken together with a schema and a history, constitute a logical unit. The open table formats — Delta Lake, Apache Iceberg, Apache Hudi — are the three widely-adopted standards for making that claim.
Delta Lake was created at Databricks and open-sourced in 2019. Its metadata is a sequence of JSON log files (_delta_log/) that record every commit — files added, files removed, schema changes, statistics. The log is append-only, ACID-managed via optimistic concurrency control, and replayable from scratch. Delta's tight integration with Spark was historically its main advantage; as of 2024, Delta is also natively readable by Snowflake, BigQuery, DuckDB, Trino, and many more.
Apache Iceberg originated at Netflix (2017) to solve the Hive-metastore problems that had become acute at their scale. Its metadata is more elaborate than Delta's: a JSON table-metadata file points to manifest lists, which point to manifests carrying per-file statistics, both stored as Avro. The layered design makes Iceberg's commits cheap even for tables with millions of files, and its schema and partition evolution are more flexible than Delta's. Iceberg has become the de-facto standard for multi-engine interoperability, with first-class support across Snowflake, BigQuery, Databricks, Trino, Flink, and most others.
Apache Hudi (Hadoop Upserts Deletes and Incrementals) originated at Uber (2016) with a specific workload in mind: very frequent upserts and deletes, incremental consumption by downstream pipelines. It supports two table types — Copy-on-Write, like Delta and Iceberg, and Merge-on-Read, which keeps recent changes in log files and merges on query. Hudi is less universally adopted than Delta or Iceberg, but it is still the right choice when the workload is dominated by streaming upserts.
The three formats solve the same problem with different shapes, and the 2023–2025 period saw a wave of interoperability: Apache XTable (née OneTable) translates metadata between all three; Databricks' Delta UniForm publishes Iceberg metadata alongside Delta; Snowflake's Polaris and Databricks' Unity Catalog both speak Iceberg natively. The practical consequence: picking any one of the three no longer locks a team out of the others.
If the stack is Databricks-first, use Delta; the integration is tightest and the tooling is most mature. Otherwise, use Iceberg; it has the broadest multi-engine support and the most conservative governance. Use Hudi only if the dominant workload is high-frequency upserts from streaming sources.
Object storage is eventually consistent, offers no multi-object transactions, and charges per request. Turning that into an ACID-compliant transactional store — which is exactly what the open table formats did — is one of the more underappreciated engineering achievements of the last decade.
Atomicity: a transaction either commits in full or leaves no trace. Consistency: the table never appears in a state that violates its schema or constraints. Isolation: concurrent transactions appear to run one at a time. Durability: once a transaction is acknowledged, it survives failure. Every OLTP database has delivered all four for fifty years; doing it on object storage took twenty years and is the core contribution of table formats.
The shared trick across Delta, Iceberg, and Hudi: data files are immutable — once written, they are never modified. Updates and deletes produce new files; the "version" of the table is a pointer to a specific set of them. A writer prepares its change, attempts to commit it by atomically swapping the pointer; if another writer beat it to the commit, it retries. This optimistic concurrency control works because the atomic swap is supported (barely) by modern object stores.
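The commit protocol fits in a few lines of Python: immutable files, a version pointer, compare-and-swap with retry. A toy illustration of the idea, not any real format's log layout:

```python
class Table:
    """Toy table: immutable 'files' plus a version pointer (the commit log)."""

    def __init__(self):
        self.versions = [set()]  # version i -> the set of live file names

    def current_version(self):
        return len(self.versions) - 1

    def try_commit(self, base_version, add_files):
        # Atomic compare-and-swap: succeeds only if no one committed since.
        if base_version != self.current_version():
            return False  # lost the race; the caller must retry
        self.versions.append(self.versions[base_version] | set(add_files))
        return True

def commit_with_retry(table, add_files):
    """Optimistic concurrency: re-read the pointer and retry until we win."""
    while True:
        base = table.current_version()
        if table.try_commit(base, add_files):
            return table.current_version()

t = Table()
commit_with_retry(t, {"part-0001.parquet"})
commit_with_retry(t, {"part-0002.parquet"})
# Every old version survives in the log, which is what makes time travel free.
```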
Because old commits are never deleted until an explicit cleanup, every historical version of the table is queryable. Time travel (SELECT ... FROM orders VERSION AS OF '2026-04-01') is the reproducibility property ML depends on — retrain the exact model on the exact data, six months later, with zero ambiguity. It is also the primary tool for debugging data-quality incidents: diff the version before and after the bad commit to find what changed.
All three formats implement snapshot isolation: a transaction sees a consistent view of the table as of its start time, even if other transactions commit in the meantime. This is weaker than serialisable isolation but strong enough for almost every analytical workload, and the performance is dramatically better.
The combination of immutable files, optimistic concurrency, and snapshot isolation turns out to scale to very large tables — hundreds of millions of files, petabytes of data, hundreds of concurrent readers and a handful of concurrent writers — with correctness guarantees that the older warehouse appliances also had, but at a fraction of the cost. This, more than any other single technical fact, is why the lakehouse architecture has become dominant.
Storage format is what; schema shape is how. Regardless of whether the table lives in a warehouse or a lakehouse, the way its columns are organised — which columns are facts, which are dimensions, which are keys — determines whether analytical queries are simple or tortured.
Fact tables record the business events: orders, clicks, sensor readings, payments. They are narrow (a dozen columns), deep (billions of rows), and mostly numeric. Dimension tables describe the entities referenced by facts: users, products, dates, regions. They are wide (dozens to hundreds of columns), shallow (thousands to millions of rows), and mostly text. A query almost always joins one fact table to several dimension tables.
The canonical layout: a central fact table surrounded by dimension tables joined by foreign keys. It is called a star because the diagram looks like one. A sales fact table keyed by (date_id, store_id, product_id, customer_id) joined to dim_date, dim_store, dim_product, dim_customer is, thirty years on, still the most productive way to model a retail warehouse.
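The canonical fact-to-dimension query is easy to show end to end. A minimal sketch using sqlite3 from the standard library, with hypothetical `fact_sale` and `dim_product` tables standing in for a real warehouse:

```python
import sqlite3

# Star-schema sketch: one fact table joined to a dimension table by a
# surrogate key, aggregated the way nearly every analytical query is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sale   (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sale   VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")
rows = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sale f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
# rows == [('books', 15.0), ('games', 7.5)]
```

A production query differs only in scale: the same join shape, repeated across several dimensions at once.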
In a snowflake schema (no relation to the product), dimension tables are themselves normalised — dim_product splits into dim_product, dim_category, dim_supplier. It looks cleaner on a diagram and queries worse: each extra join adds a shuffle. In analytical systems where disk is cheap, the denormalised star wins; most modern warehouses recommend it.
A customer moves from France to Germany; does the old sale become a German sale? The answer depends on the business, and the modelling technique that expresses the choices is slowly-changing dimensions (SCD). SCD Type 1 overwrites; SCD Type 2 keeps history with effective-date ranges; SCD Type 3 keeps a previous-value column. Type 2 is the right choice for almost any dimension where history matters for analytics.
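The Type 2 mechanics are compact enough to sketch. This assumes a hypothetical row shape with `effective_from` / `effective_to` columns and a far-future sentinel date marking the current row; it is an illustration of the technique, not any particular tool's API:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # sentinel: marks the open-ended "current" row

def scd2_update(history, key, new_attrs, as_of):
    """SCD Type 2 sketch: close the current row for `key`
    and append a new version effective from `as_of`."""
    for row in history:
        if row["key"] == key and row["effective_to"] == HIGH_DATE:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return history          # attributes unchanged: nothing to version
            row["effective_to"] = as_of  # close the old version
    history.append({"key": key, **new_attrs,
                    "effective_from": as_of, "effective_to": HIGH_DATE})
    return history

dim = [{"key": 42, "country": "FR",
        "effective_from": date(2025, 1, 1), "effective_to": HIGH_DATE}]
scd2_update(dim, 42, {"country": "DE"}, date(2026, 3, 15))
# dim now holds two rows: FR effective until 2026-03-15, DE current
```

The payoff is that a January fact can always be joined back to the row that was current in January, which is exactly what the point-in-time joins of Section 17 rely on.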
Modern columnar engines compress sparse wide tables so well that many teams have abandoned star schemas in favour of a single denormalised wide table — every sale joined to every dimension at write time, stored once. This loses the update efficiency of dimensions but simplifies queries and is often faster. The choice between modelled star and denormalised wide depends on whether dimensions change often and whether the team has the discipline to maintain a model.
The exact physical shape matters less than the discipline of naming facts, dimensions, and grain explicitly. A team that can answer "at what grain is this fact table, and which dimensions describe it?" has already done the hardest part of dimensional modelling, regardless of whether the answer ends up as a star schema or a one-big-table.
For most of the history of databases, storage and compute were welded together — data lived on the machines that queried it, scaling one meant scaling the other. The cloud era broke the welding; the consequences have propagated through almost every decision in this chapter.
A classical warehouse appliance — Teradata, Netezza, original Redshift — stored its data on the local disks of the compute nodes. Adding capacity meant adding a node; adding compute meant moving data to rebalance; shrinking a cluster was almost impossible without downtime. The capacity of the cluster and the size of the data were the same number, which is fine when both grow together and punishing when they do not.
Snowflake (2014) and BigQuery (2011, public 2012) shared the critical design choice: data lives permanently in object storage (S3, GCS); compute is a stateless, elastic pool of virtual warehouses that mount the storage on demand. Ten virtual warehouses can query the same table simultaneously without contention. A warehouse can be scaled up for a big ETL job and scaled down the moment it finishes. The cost of storage (pennies per GB-month) is decoupled from the cost of compute (dollars per node-hour).
Four things followed. Isolation by workload: finance, analytics, and ML each get their own compute pool, sized to their own workload. Pay-per-query economics: idle compute costs nothing. Multi-engine access: the same files can be queried by Spark, DuckDB, and Snowflake, depending on which is best for the job. The lakehouse itself: the whole architecture of Section 5 is an outcome of this decoupling.
The weakness of the decoupled model is that every query reads from object storage, which has 10–100 ms latency per request — much slower than a local disk. Modern warehouses compensate with aggressive local caching on the compute nodes: the first query on a file is slow, subsequent queries on the same file are fast. Designing for a good cache-hit rate — keeping queries against a compact, recent slice of the data — is the main performance tuning concern of the modern era.
Almost every data-platform decision made since 2015 that looked modern and strange — open table formats, the lakehouse, Iceberg's multi-engine claim, the ability to run ten query engines against one copy of the data — is an implication of storage and compute having come apart. Understanding the decoupling is the single most useful way to reason about why the architectures of this chapter look the way they do.
Underneath every warehouse, every lakehouse, every ML training job that reads a dataset, sits cloud object storage. Understanding it as a primitive — with its specific latency, consistency, and cost profile — is what lets an engineer reason about what everything built on top of it can and cannot do.
A bucket is a flat namespace of objects. An object is a blob of bytes with a key, some metadata, and a content hash. There are no filesystems, no seeks, no partial updates — an object is written once and either replaced whole or deleted. The data plane exposes a handful of operations: PUT, GET, HEAD, DELETE, and LIST. The semantics are simple enough that the system can guarantee eleven nines of durability and near-unlimited scale, and restrictive enough that building a transactional table on top of it required a decade.
Amazon S3 (2006) invented the category and remains the market and technical reference. Google Cloud Storage is broadly equivalent with a cleaner API and tighter integration with BigQuery. Azure Blob Storage is the Microsoft equivalent, with the same primitives and a separate Data Lake Storage Gen2 layer for hierarchical access. A dataset written to one can be made accessible to engines running on the others, usually via a cross-cloud read path or a replication job.
For most of S3's history it was eventually consistent — a PUT might not be visible to an immediate GET. In December 2020, S3 became strongly consistent — a GET, HEAD, or LIST issued after a PUT immediately sees the new object — with no architectural change visible to clients. GCS and Azure Blob followed. This single change retroactively simplified every data pipeline ever written against object storage, and is one of the quieter large wins of the last decade.
Storage itself is cheap — roughly $0.02 per GB-month in 2026. What tends to surprise first-time operators is request cost ($0.005 per 1000 PUTs, $0.0004 per 1000 GETs), egress cost ($0.09 per GB moved out of a region), and cross-region replication. A badly-designed pipeline that writes millions of small files, or reads entire datasets across regions for every query, produces bills that dwarf storage. This is why Section 10's "files should be 128 MB to 1 GB" rule is not aesthetics: it is arithmetic.
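The arithmetic is worth doing explicitly. A back-of-envelope sketch using the list prices above, assuming a hypothetical 1 TB dataset whose files are each rewritten once a day for a month:

```python
# Request-cost arithmetic at the chapter's list prices:
# PUTs $0.005 per 1000 requests, storage $0.02 per GB-month.
PUT_PER_1000 = 0.005
STORAGE_PER_GB_MONTH = 0.02

def monthly_write_cost(total_gb, file_size_mb, rewrites_per_month=30):
    """Storage plus PUT cost for one month, as a function of file size."""
    files = (total_gb * 1024) / file_size_mb
    puts = files * rewrites_per_month
    return puts / 1000 * PUT_PER_1000 + total_gb * STORAGE_PER_GB_MONTH

small = monthly_write_cost(1024, file_size_mb=1)    # ~1M tiny files: ~$177.77/month
big   = monthly_write_cost(1024, file_size_mb=512)  # ~2K big files:   ~$20.79/month
```

Same bytes, same storage bill — the eight-fold difference is pure request overhead, which is the arithmetic behind the file-sizing rule.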
Each provider offers a ladder: Standard → Infrequent Access → Glacier / Archive, trading availability for lower per-GB cost. A lifecycle policy moves objects down the ladder automatically — after thirty days to Standard-IA, after a year to Glacier, after seven years to deletion. For long-horizon data — audit logs, ML training sets that may be retrained on — the difference between "store everything forever in Standard" and "store everything forever with a lifecycle" is the difference between an expensive platform and a sustainable one.
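The ladder described above is expressed as a declarative rule attached to the bucket. A sketch in the shape of an S3 lifecycle configuration (the prefix and day counts are illustrative, matching the thirty-day / one-year / seven-year example):

```json
{
  "Rules": [
    {
      "ID": "age-out-audit-logs",
      "Filter": { "Prefix": "audit/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
```

Once attached, the policy runs with no compute of the platform's own: the provider moves and expires objects on its side.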
Every table in the modern platform is ultimately a pile of Parquet files in an S3 or GCS or Azure Blob bucket, with a small metadata log next to them. Everything above — the warehouses, the catalogues, the query engines — is about making that pile act like a first-class transactional table. The pile itself is where durability, scale, and cost live.
A data platform that computes the right answer, fast, and exposes the wrong data to the wrong people, is a disaster. A platform that is correct, fast, and secure but costs five times what it should is a different kind of disaster. Storage is where both categories of failure most commonly land.
The dominant model is role-based access control (RBAC): users inherit permissions from roles; roles grant permissions on objects (tables, views, columns, rows). Every modern warehouse and lakehouse catalogue — Snowflake, BigQuery, Databricks Unity Catalog — implements RBAC with similar grammar: GRANT SELECT ON table TO role. Fine-grained controls — column masking for PII, row-level security for tenant isolation — are now standard.
Every major object store encrypts data at rest by default, using provider-managed keys. Customer-managed keys (CMK) add the option of encrypting with a key the customer controls, so revoking the key renders the data unreadable — useful for tenant isolation and for regulatory requirements that demand it. Encryption in transit is universal (TLS 1.2+) and does not require configuration. The remaining question is which data, if any, needs client-side encryption — encrypted before it reaches the storage provider — for workloads where the provider itself is outside the trust boundary.
The four axes of cost on a modern platform: storage (cheap but cumulative), compute (expensive, pay-per-query or pay-per-hour), egress (expensive, especially cross-cloud), and request overhead (small per request, catastrophic at small-file scale). The controls: reservation pricing for predictable compute; lifecycle policies for aging storage; query governance (resource monitors, query timeouts, workload management) to cap runaway jobs; and simply measurement — a FinOps dashboard that surfaces cost per team, per table, per query.
A platform shared by multiple teams, products, or external customers multiplies every governance challenge. The common architectural answer: one catalogue per organisation; one schema (database) per team; row-level security enforcing customer boundaries inside tables. A heavier variant is one physical warehouse per tenant — simpler to reason about but more expensive and harder to evolve.
Access, encryption, and cost are not technically deep compared to the storage-format choices earlier in this chapter. They are, in practice, the part that gets platforms shut down, audited, or quietly abandoned because they became too expensive. Investing in them is unglamorous and disproportionately rewarded.
An ML system is a read workload against a storage system. Training scans enormous slices of the warehouse; feature serving reads tiny rows at low latency; evaluation reproduces historical states. Each of these demands something specific from the storage layer, and each has pathological failure modes when the layer is wrong.
A training job on a billion-row table reads a handful of columns, streams them through a model, and repeats. This is exactly the workload Parquet and columnar engines are optimised for. A team training against a CSV or a row-oriented PostgreSQL table is paying an order-of-magnitude penalty in scan time and cost for the privilege; moving to Parquet-backed tables is, in practice, the highest-leverage storage change most ML teams ever make.
Serving a model in production needs features for a specific entity at a specific moment: user 42's last-30-day spend, right now, in under ten milliseconds. This is row-oriented, low-latency, high-QPS — the opposite of the training workload. A feature store (Feast, Tecton, Databricks Feature Store) maintains two copies — an offline columnar store for training and an online row-oriented store (Redis, DynamoDB, CockroachDB) for serving — plus an ingest pipeline that keeps them consistent.
Training produces a model; the model is a function of the exact dataset it saw. The table-format time-travel feature (Section 12) is, for ML, the reproducibility property. "Retrain the model on the state of the warehouse as of 2026-04-01" is a one-line query against an Iceberg or Delta table and an archaeological dig against a traditional warehouse. Every experiment's provenance record should include the table version read.
A subtler trap: joining current dimensions onto historical facts. If a customer moved countries in March, a feature computed in April must not use the current country for a January event — or the model learns from information it could not have known at the time. Point-in-time joins, supported natively by modern feature stores and by any warehouse with SCD Type 2 dimensions, prevent this form of temporal leakage, which is one of the most common sources of silent bugs in production ML.
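The lookup at the heart of a point-in-time join is a binary search over effective dates. A minimal sketch, assuming a hypothetical SCD Type 2 history stored as `(effective_from, value)` pairs sorted ascending:

```python
import bisect
from datetime import date

def as_of(dim_rows, event_date):
    """Return the attribute value that was current at `event_date`.
    dim_rows: list of (effective_from, value), sorted ascending."""
    starts = [r[0] for r in dim_rows]
    i = bisect.bisect_right(starts, event_date) - 1
    if i < 0:
        return None  # entity did not exist yet at event time
    return dim_rows[i][1]

history = [(date(2025, 1, 1), "FR"), (date(2026, 3, 15), "DE")]
as_of(history, date(2026, 1, 10))  # -> "FR": the January event keeps France
as_of(history, date(2026, 4, 1))   # -> "DE": only post-move events see Germany
```

The whole point is the third argument that a naive join lacks: the event's own timestamp, not "now", decides which dimension row applies.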
A typical retraining job that reads from a well-laid-out Parquet lakehouse completes in minutes and costs cents. The same job against a badly-modelled JSON pile can take hours and cost thousands, and is often why ML teams talk about "our training is too expensive" — the model is not the problem, the storage is. The investment in storage discipline pays back, literally, on every training run for the life of the project.
Machine learning looks like an algorithmic discipline and is, in production, a storage discipline wearing an algorithmic mask. The shape of the tables, the format of the files, the partitioning of the folders, the version log of the lakehouse — these determine how fast training is, how reproducible experiments are, how cheap the platform is to run, and how fast the next model can be shipped. Every chapter in the rest of Part III builds on top of these choices; getting them right early is the single most durable compounding investment a data team makes.
Storage is a field with a small canonical bibliography (the warehouse bibles, the Dremel and C-Store papers, the lakehouse paper) and a much larger practical one (format specs, engine docs, operational playbooks). The list below picks the references that repay serious reading, the specs worth knowing in detail, and the project documentation most working engineers return to.
This page is the second chapter of Part III: Data Engineering & Systems. The next — Data Pipelines & Orchestration — covers how bytes actually get from the ingestion sources of Chapter 01 into the storage layouts described here, and how they get transformed along the way: Airflow, Prefect, Dagster, dbt, the batch-versus-stream trade-off, and the discipline of making a pipeline whose failure modes are visible. After that come streaming and real-time systems, distributed compute, the cloud platforms everything runs on, and finally the governance layer that keeps it all auditable.