Part III · Data Engineering & Systems · Chapter 02

Data storage, and the slow triumph of the columnar lakehouse.

Between the ingestion pipeline that lands a row and the model that trains on it sits the whole of data storage — the question of where bytes live, in what shape, on which hardware, at what cost, with which guarantees, and for how long. The answer has changed more in the last decade than in the three before it. The classical data warehouse, the Hadoop-era data lake, and the modern lakehouse are in some sense the same artefact viewed from three epochs; this chapter works through the architectures, file formats (Parquet, ORC, Avro), open table formats (Delta Lake, Apache Iceberg, Apache Hudi), and compute-storage separation patterns that together make analytical data engineering possible at any non-trivial scale.

How to read this chapter

The first two sections are orientation: why storage is the substrate every downstream system depends on, and the transactional/analytical split (OLTP versus OLAP) that explains why different workloads demand different storage. Sections three through five tell the architectural story — warehouse, lake, lakehouse — and why the industry has, after twenty years of oscillation, settled on the third as the reasonable default. Sections six and seven are the row-versus-columnar question and a detailed walk through Parquet, the file format most analytical systems now write. Sections eight and nine cover the rest of the file-format landscape (ORC, Avro, JSON/CSV) and the compression and encoding techniques that make columnar storage fast. Section ten is about physical layout: partitioning, clustering, file sizing. Sections eleven and twelve bring in the open table formats — Delta, Iceberg, Hudi — and the ACID guarantees they add on top of object storage. Sections thirteen through sixteen cover the architectural consequences: dimensional data modelling, separation of storage from compute, the cloud object stores (S3, GCS, Azure Blob) that now underlie nearly every serious analytical platform, and the access, encryption, and cost questions at the edges. The final section connects all of it to what it changes about training, serving, and evaluating machine-learning models.

Conventions: code snippets are Python, SQL, or shell where illustrative; sizes and throughput figures assume commodity cloud object storage and current-generation warehouse engines. "Warehouse" in what follows means a columnar analytical engine — Snowflake, BigQuery, Redshift, Databricks SQL, DuckDB — rather than the classical 1990s appliance. The goal is to leave the reader able to reason about any analytical data system at the level of its storage and table-format choices, which is where most of the decisions with long-lived consequences are actually made.

Contents

  1. Why storage is the substrate · The hidden foundation
  2. OLTP versus OLAP · Transactions vs. analytics
  3. The classical data warehouse · Kimball, Inmon, the star schema
  4. The data lake and its discontents · HDFS, Hadoop, schema-on-read
  5. The lakehouse synthesis · Warehouse semantics on lake storage
  6. Why columnar wins for analytics · Row-oriented vs. column-oriented
  7. Parquet, in detail · Row groups, column chunks, pages
  8. ORC, Avro, and the file-format landscape · Columnar cousins, row-wise siblings
  9. Compression and encoding · Dictionary, RLE, bit-packing, Snappy, Zstd
  10. Partitioning, clustering, and file layout · Physical layout for query speed
  11. Open table formats · Delta Lake, Iceberg, Hudi
  12. ACID on object storage · Atomicity, time travel, isolation
  13. Dimensional modelling · Star, snowflake, wide-table
  14. Separating storage from compute · Why the decoupling changed everything
  15. S3, GCS, Azure Blob — the substrate · Object storage as a primitive
  16. Access, encryption, cost · The non-technical edges of storage
  17. Where it compounds in ML · Training, serving, evaluation
Section 01

Storage is the substrate every other system depends on

A data platform is a tower, and the floor beneath it is storage. Every pipeline, every query engine, every dashboard, every training run inherits its properties — its latency, its durability, its cost, its schema — from the storage layer. The choices made there compound for years; the choices made in the layers above are comparatively easy to undo.

Storage as a constraint, not a detail

It is tempting to think of storage as a commodity — a place where bytes wait for work to be done on them — and treat the interesting questions as belonging to the compute layer. For an individual analyst running a single query against a small table, that view is defensible. For any team operating at the scale of terabytes or larger, it is exactly backwards. The cost, the throughput, the consistency model, and the schema evolution of the storage layer are the constraints within which everything else operates.

The three questions storage answers

Every storage decision is an answer, explicit or otherwise, to three questions. Where do the bytes live? — on attached disk, in an object store, in a warehouse's proprietary format. In what shape? — row-oriented or columnar, one file or many, partitioned or not. With what guarantees? — durable to how many copies, consistent under which workloads, recoverable after which failures, readable by which engines. The answers vary by workload, and this chapter is largely an elaboration of what those variations look like.

Three decades of architectural swings

The dominant architecture has swung twice in thirty years. The 1990s and early 2000s were the age of the data warehouse — dedicated appliances, proprietary columnar formats, rigid schemas. The late 2000s and 2010s were the age of the data lake — cheap object storage, flexible schemas, MapReduce and then Spark on top. The 2020s have brought the lakehouse, which restores warehouse semantics (ACID, schema enforcement, time travel) to lake-scale storage. Understanding the swings is not a history lesson; it is how to recognise which stage of the pendulum any given team is actually on.

The long-horizon property

Storage is the longest-lived part of a data platform. Pipelines get rewritten, warehouses get migrated, job schedulers change every few years; the bytes themselves last decades. The effort spent on getting storage right early pays back every time something above it gets replaced.

Section 02

OLTP versus OLAP, the split that explains most of storage

The single most useful distinction in data storage is between the systems that record what just happened and the systems that analyse what has happened over time. They look similar from the outside — both store tables, both are queried with SQL — yet they are engineered on almost opposite principles.

OLTP — operational workloads

OLTP (online transaction processing) systems are the operational backends of a business: the user database, the order ledger, the payments system. Their workload is many small transactions per second, each touching a handful of rows, each requiring strict consistency (no double-charged credit cards) and low latency (no user waiting on a page load). PostgreSQL, MySQL, SQL Server, Oracle, CockroachDB, Spanner — all are row-oriented, B-tree indexed, and optimised for tens of thousands of single-row operations per second per core.

OLAP — analytical workloads

OLAP (online analytical processing) systems answer questions about what happened. Their workload is few large queries, each scanning millions or billions of rows but usually only a handful of columns, often aggregating across time. Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse, Druid, DuckDB — all are columnar, compressed, and optimised to scan tens of gigabytes per second per core.

The one-table example

A table of a billion orders, 40 columns wide, lives on the OLTP side as a row-oriented B-tree where finding order #12345 is an O(log n) operation touching one page. On the OLAP side, it lives as a columnar file where summing the total_usd column across ten years reads exactly one column, compressed, in a single streaming scan. Either architecture, asked to do the other's workload, performs one to two orders of magnitude worse.

The dual-write pattern

The reconciliation, in practice, is to keep both — the OLTP system of record is the source of truth; a copy of its state, updated continuously or periodically, lives in the OLAP warehouse. Change-data-capture (CDC) pipelines read the OLTP write-ahead log and replay its mutations into the warehouse; dual writes are more fragile and less favoured. The next chapter covers pipelines and orchestration in detail; the storage story depends on which side of this split a given dataset is meant to support.
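The replay step can be sketched in a few lines. A toy model, assuming each change event carries an operation, a primary key, and (for inserts and updates) the new row image — real CDC tools such as Debezium emit much richer envelopes, but the shape of the replay loop is the same:

```python
def replay(events, table):
    """Apply ordered WAL-style change events to `table`, a dict keyed
    by primary key standing in for the warehouse-side copy."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            table[ev["pk"]] = ev["row"]   # upsert the new row image
        elif ev["op"] == "delete":
            table.pop(ev["pk"], None)     # remove the row if present
    return table

events = [
    {"op": "insert", "pk": 1, "row": {"status": "placed"}},
    {"op": "update", "pk": 1, "row": {"status": "shipped"}},
    {"op": "insert", "pk": 2, "row": {"status": "placed"}},
    {"op": "delete", "pk": 2},
]
state = replay(events, {})
# state == {1: {"status": "shipped"}}
```

Order matters: replaying the same events in a different order yields a different table, which is why CDC pipelines go to such lengths to preserve the log's per-key ordering.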

HTAP — the nearly-one-system ambition

A decade of effort has gone into HTAP systems (hybrid transaction-analytical processing) that serve both workloads from one store. SingleStore, TiDB, and Snowflake's Unistore are the best-known attempts. They work well for modest scales; at large scale, the fundamental trade-offs between row-oriented and column-oriented layouts reassert themselves, and most serious platforms still run both.

Section 03

The classical data warehouse, and why it still shapes the field

The data warehouse, as an idea, is fifty years old. As a product category, it is thirty. The modern cloud warehouse has little in common with the Teradata appliance of 1995 — but the vocabulary, the schema patterns, and the mental model all come from that lineage, and are still how working engineers describe what they are doing.

Inmon versus Kimball

Two competing philosophies dominate the warehouse literature. Bill Inmon's 1990 Building the Data Warehouse argued for a single, enterprise-wide, third-normal-form warehouse — the "single version of the truth" — feeding downstream data marts. Ralph Kimball's 1996 The Data Warehouse Toolkit argued for a bottom-up collection of dimensional data marts, each structured as a star schema, and is the approach most modern warehouses actually implement. The two books are still the canon; both repay a slow reading, thirty years later.

What the warehouse actually is

A warehouse is a schema-on-write store — every table has a declared schema and writes must conform — optimised for analytical queries. Classically it sat on dedicated hardware with columnar storage, massive parallelism across shared-nothing nodes, and a SQL dialect with extensions for windowing, rollups, and cube operations. Teradata, Netezza, Vertica, and early Redshift are the archetypes.

The cloud-native reinvention

Snowflake (2014) broke the appliance model by separating storage (S3 by default) from compute (ephemeral virtual warehouses). BigQuery (2011) did the same on Google's Colossus. Databricks SQL added warehouse semantics on top of lakehouse storage. The result, by the mid-2020s, is that almost no serious organisation buys warehouse hardware any more; the warehouse is a SQL engine rented by the second, reading from cheap object storage.

What the warehouse is good for

Schema enforcement, SQL support, transactional semantics, mature permissions, and predictable query latency. A warehouse is what a BI team wants to query; it is what a finance team wants to audit; it is what a compliance team wants to govern. For workloads that fit its schema-on-write model — structured tabular data, joined across well-understood business entities — it remains, thirty years on, the right tool.

Why the warehouse is not enough

Schema-on-write is a virtue when the schema is known and a constraint when it is not. Semi-structured logs, image and video data, machine-learning features, and the long tail of "we do not yet know what this will be used for" data all pushed against the warehouse's rigidity, and eventually produced the data lake.

Section 04

The data lake and its discontents

If the warehouse was an answer to "how do we query structured business data efficiently", the data lake was the answer to "how do we store everything else, cheaply, without deciding in advance what it is for". It was a brilliant idea for about five years and a warning story for the five after that.

The original architecture

A data lake is object or distributed-file storage that holds raw data in its native format — CSVs, JSON logs, images, audio, Parquet, whatever — organised by folder conventions rather than by a central schema. The canonical 2010s stack was HDFS (Hadoop Distributed File System) for storage and MapReduce, then Spark, for compute. The promise: cheap petabyte-scale storage, flexible schemas, any workload that could be expressed as a distributed job.

Schema-on-read

The lake's defining property is schema-on-read: data is written in whatever shape it arrives; a consumer imposes a schema only at query time. This is liberating when a team does not yet know what the data will be used for, and corrosive when many teams impose incompatible schemas on the same files. The virtue at small scale becomes a liability at organisational scale.

The data-swamp failure mode

By the late 2010s, almost every large company that built a lake had also built a data swamp: petabytes of undocumented files, redundant copies, broken partition layouts, jobs that no one remembers writing. The failure was not technical — the storage worked fine — but governance: lakes lacked the schema enforcement, access control, transactions, and discoverability that warehouses had spent thirty years getting right.

Cloud object storage replaces HDFS

The other shift was underneath: HDFS, which required dedicated clusters and complex operations, was superseded by cloud object stores — Amazon S3, Google Cloud Storage, Azure Blob Storage — which offered effectively infinite capacity, eleven-nines durability, pay-per-use pricing, and no cluster to run. By the early 2020s almost every new lake was built on object storage; HDFS survives mostly in on-premises Hadoop installations.

The lake was right about storage

What the lake got right — cheap, decoupled, format-agnostic storage — is now universal. What it got wrong — no transactions, no schema, no lineage — is what the lakehouse was built to fix.

Section 05

The lakehouse, or warehouse semantics on lake storage

The lakehouse is the third act: keep the cheap, scalable, open-format storage of the lake, and bolt onto it the transactional guarantees, schema enforcement, and indexing that made the warehouse usable. The idea was named and defended in a 2021 paper by Armbrust, Ghodsi, Xin, and Zaharia; in the years since it has become the de-facto architecture for new analytical platforms.

The architectural move

A lakehouse stores table data as columnar files (almost always Parquet) in an object store, and stores metadata — which files belong to which table, what their schema is, which versions exist, who touched them — as a small, versioned, ACID-managed catalogue. The catalogue lives either as JSON/Avro log files alongside the data (Delta Lake, Iceberg) or in a separate service (Hive Metastore, Unity Catalog, Snowflake's Polaris). Query engines read the catalogue first, then the data.

What it fixes

Four properties come back. Atomic writes: a failed job leaves no half-committed partial files visible. Schema enforcement: writes that violate the declared schema are rejected before landing. Time travel: any historical version of the table can be queried by timestamp or ID, which is the reproducibility property ML training depends on. Concurrent readers and writers: a streaming job and an analytical query can touch the same table without stepping on each other.

Open formats, open engines

The critical change relative to the old warehouse model is that the data and the metadata are open. A Parquet file written by Spark is readable by DuckDB; a Delta table written by Databricks is now readable by Snowflake, BigQuery, and Trino; an Iceberg table written by Flink is readable by almost everything. Storage and compute become independently replaceable components.

The mid-2020s consensus

By 2025, the three major cloud warehouses (Snowflake, BigQuery, Databricks) all read and write open table formats natively; Iceberg specifically has become the near-universal wire format. The industry argument of "warehouse versus lake" is effectively over, and what replaced it is a set of engineering decisions about which open format, which compute engine, and which catalogue to pick — all within a broadly lakehouse shape.

The useful mental model

A modern data platform is: cheap columnar files in object storage, an open table format layered on top for transactions and time travel, a catalogue service for discovery and access control, and one or more query engines renting compute by the second to run queries against it. Everything else in this chapter is a zoom-in on one of those four boxes.

Section 06

Why columnar storage wins for analytics

Two workloads, two storage layouts. Row-oriented storage keeps all fields of a record together; columnar storage keeps all values of a column together. The choice looks trivial and is not — it produces order-of-magnitude performance differences on exactly the workloads that analytical systems run.

The physical layout

Consider a table with columns (user_id, country, signup_ts, revenue_usd, ...) and a billion rows. In a row-oriented store, byte 0 is user_id of row 0, byte 4 is country of row 0, and so on — each record is contiguous. In a columnar store, the first block is every user_id in order, the next block is every country, and the fourth block is every revenue_usd. Physically, the same data; logically, the same table; I/O behaviour, totally different.

Why it helps

Four compounding effects. I/O reduction: an analytical query of the form SELECT SUM(revenue_usd) FROM orders WHERE country='FR' reads two columns, not forty, so 95% of the file is skipped before any work begins. Compression: a column of a single type compresses far better than a row of mixed types — a country column with high repetition compresses 20–50×. Vectorised execution: a CPU operating on a tight array of floats can use SIMD and keep its pipelines full in a way a pointer-chasing row scan cannot. Cache friendliness: each column fits contiguous cache lines, so L1 and L2 hit rates rise dramatically.
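The I/O asymmetry can be made concrete with plain Python lists — a sketch, not a storage engine, but the access pattern is the point:

```python
# The same three-row table in both layouts.
rows = [  # row-oriented: each record is contiguous
    ("u1", "FR", 10.0),
    ("u2", "DE", 20.0),
    ("u3", "FR", 5.0),
]
columns = {  # column-oriented: each column is contiguous
    "user_id":     ["u1", "u2", "u3"],
    "country":     ["FR", "DE", "FR"],
    "revenue_usd": [10.0, 20.0, 5.0],
}

# Row layout: every record must be visited to reach its third field.
row_total = sum(r[2] for r in rows)

# Columnar layout: one tight array is streamed; the other columns are
# never touched at all.
col_total = sum(columns["revenue_usd"])

assert row_total == col_total == 35.0
```

On disk the difference is far starker than in memory: the row scan decompresses and deserialises all forty columns of every record, while the columnar scan reads one compressed array.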

When row orientation still wins

Transactional workloads. Reading the entire record for user 42 — all forty columns — costs forty seeks in a columnar store and one in a row store. Writing a new order pays a similar penalty: a columnar store has to append to forty column files; a row store appends to one. OLTP still runs on row-oriented storage for this reason, and nothing in the last decade has changed that.

The history compressed

Columnar databases existed in research before 2000 (Sybase IQ, MonetDB) but did not go mainstream until C-Store (2005) and its commercial descendant Vertica (2007). The idea then spread to Parquet (2013) and ORC (2013) as file formats, Redshift (2012), Snowflake (2014), and BigQuery's Dremel/Capacitor as cloud warehouses, and ClickHouse (2016) as the open-source poster child. By now every serious analytical engine is column-oriented; the only arguments are about which flavour.

The mental model to keep

A columnar query engine is reading tight arrays of one type, compressed, vectorised, and streamed. A row-oriented engine is reading records one at a time, mixed types, with pointer chasing. Analytics looks nothing like OLTP once you look at it from the storage up, and almost all the interesting engineering sits at this boundary.

Section 07

Parquet, in more detail than most practitioners know

Apache Parquet is the file format analytical systems write. Every warehouse, every lakehouse engine, every Spark job, every DuckDB session, every Hugging Face dataset of any size lands ultimately in Parquet. Knowing the internals is the difference between writing files that query well and files that are a drag on every consumer of them.

The hierarchical layout

A Parquet file is a tree. At the top, a file contains one or more row groups (typical size: 128–512 MB). Each row group contains one column chunk per column. Each column chunk contains a sequence of pages (typical size: 1 MB), which are the smallest unit of compression and decoding. Metadata — schemas, statistics, offsets, encodings — lives in a footer at the end of the file, which lets readers do one seek to the footer, decide which row groups and column chunks are relevant, and skip the rest.

Statistics and predicate pushdown

Every column chunk in the footer carries min/max/null-count statistics. A query with WHERE revenue_usd > 10000 reads the statistics first and skips any column chunk whose max is below 10000 — often skipping 90% of the file without reading a byte of data. This predicate pushdown is why Parquet files with well-sorted columns and correctly-sized row groups can outperform a full-table scan by two orders of magnitude.
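A minimal sketch of the mechanism, with hypothetical chunk statistics: the reader consults min/max before touching any values, and prunes whole chunks the predicate cannot match.

```python
# Each "chunk" carries footer-style min/max statistics alongside its
# (normally compressed, here in-memory) values.
chunks = [
    {"min": 10,   "max": 900,   "values": [10, 500, 900]},
    {"min": 1200, "max": 80000, "values": [1200, 15000, 80000]},
    {"min": 5,    "max": 9000,  "values": [5, 9000]},
]

def scan_gt(chunks, threshold):
    """Return values > threshold, skipping chunks whose statistics
    prove that no row inside can match."""
    hits, skipped = [], 0
    for c in chunks:
        if c["max"] <= threshold:   # pruned without decoding a value
            skipped += 1
            continue
        hits.extend(v for v in c["values"] if v > threshold)
    return hits, skipped

hits, skipped = scan_gt(chunks, 10_000)
# two of the three chunks are pruned on statistics alone
```

The sketch also shows why sort order matters: if `revenue_usd` is sorted, each chunk's min/max range is narrow and pruning is sharp; if values are shuffled, every chunk's range spans the whole column and nothing is skipped.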

Encodings

Parquet supports multiple encodings per column. PLAIN is the fallback. Dictionary encoding replaces repeated values with small integer codes and a dictionary — dramatically effective for categorical columns. Run-length encoding (RLE) compresses runs of identical values. Bit-packing uses only as many bits as the value range requires. Delta encoding stores differences rather than absolute values, which wins for sorted integers and timestamps. Modern writers (pyarrow, Spark) pick encodings automatically; a Parquet tuner manually overrides them only in extreme cases.
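Dictionary encoding, the workhorse of that list, is simple enough to sketch directly (a toy version; real Parquet writers add fallback logic for when the dictionary grows too large to pay for itself):

```python
def dict_encode(column):
    """Replace repeated values with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dict_decode(dictionary, codes):
    return [dictionary[c] for c in codes]

col = ["FR", "FR", "DE", "FR", "US", "DE"]
dictionary, codes = dict_encode(col)
# dictionary == ["FR", "DE", "US"], codes == [0, 0, 1, 0, 2, 1]
assert dict_decode(dictionary, codes) == col
```

For a billion-row country column with 200 distinct values, the codes fit in a byte each and the dictionary is paid for once — which is where the 10–100× wins on categorical columns come from.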

Writing Parquet well

import pyarrow as pa
import pyarrow.parquet as pq

# `table` is an existing pyarrow.Table, e.g. pa.Table.from_pandas(df)
pq.write_table(
    table,
    "orders/part-0001.parquet",
    compression="zstd",        # better than snappy for most analytical workloads
    row_group_size=500_000,    # rows per group; tune so groups land near ~256 MB
    use_dictionary=True,
    write_statistics=True,     # enables predicate pushdown
)

Three rules of thumb: row groups of 128–512 MB (smaller causes excessive metadata overhead; larger hurts parallelism), Zstd compression unless the query engine cannot read it, and sort on the columns most commonly filtered — which makes min/max statistics far more selective.

Why Parquet is the lingua franca

Every major engine — Spark, Flink, Trino, DuckDB, Snowflake, BigQuery, Databricks, Athena, Redshift Spectrum, pandas, Polars — reads Parquet natively. Writing Parquet rather than a proprietary format is the single most portable storage decision a team can make, and it costs nothing beyond picking sensible defaults.

Section 08

ORC, Avro, and the rest of the file-format landscape

Parquet won the columnar file-format wars, but it did not eliminate the competition. A working engineer encounters at least three other formats regularly and should know which one is meant to solve which problem.

ORC — Parquet's columnar cousin

Apache ORC (Optimized Row Columnar) emerged from the Hive ecosystem in 2013, roughly simultaneously with Parquet. Internally it is similar — row groups called stripes, columnar chunks, per-stripe statistics — but with somewhat richer built-in indexing (bloom filters, row-level indexes) and tighter coupling to Hive and older Hadoop engines. It remains the default in some Cloudera deployments; outside that ecosystem, Parquet dominates.

Avro — the row-oriented sibling

Apache Avro is row-oriented, not columnar, and aimed at a different workload: the serialised-message problem. An Avro file stores records with their schema; readers can evolve the schema (add optional fields, rename with aliases) without breaking. This makes Avro the default choice for Kafka topics, Snowflake's Snowpipe ingestion, and any workload where each record is read as a whole. It is strictly worse than Parquet for analytical scans, and strictly better for event streams.

JSON and CSV — the formats you will still see

CSV is 1972 technology; JSON is 2001. Both are row-oriented, uncompressed, and slow — and both are still ubiquitous in ingestion because humans can read them, every tool writes them, and the path of least resistance from an API to a file is json.dump. A pragmatic pipeline ingests in JSON or CSV, converts to Parquet on landing, and queries only the Parquet. CSV is, in particular, a minefield — no schema, ambiguous quoting, no portable null encoding — and no production analytical system should be built on it.

Feather/Arrow IPC — the in-memory cousin

Apache Arrow is an in-memory columnar layout, not a file format, but its on-disk serialisation — Feather V2 / Arrow IPC — is a useful near-zero-copy format for handing columnar data between Python, R, and Rust processes on the same machine. It is faster to read than Parquet but larger on disk and lacks Parquet's statistics; good for caches and scratch space, not for the long-term warehouse.

The emerging contenders

Lance (2023, by LanceDB) is a columnar format aimed specifically at ML: vector indexes, random row access, versioning. Vortex (2024, by Spiral) promises better compression than Parquet with faster scans. Neither has displaced Parquet yet; both are worth watching for 2027.

The one-line rule

Store analytical tables in Parquet. Store event streams in Avro. Use JSON and CSV as transient ingest formats only. Revisit the decision if, and only if, a specific workload shows that a newer format would actually move a measurable number.

Section 09

Compression and encoding, the quiet doublings

A columnar file stacks two independent kinds of size reduction: the column-specific encodings that exploit structure inside a single column, and a general-purpose compressor applied on top. Understanding the two layers is how a team decides whether a 2 TB file should really be 200 GB.

The encoding layer

Encodings are type-aware and structure-aware. Dictionary encoding gives a 10–100× win on categorical columns with low cardinality — a country column with 200 distinct values becomes an 8-bit dictionary code plus a 200-entry dictionary. Run-length encoding turns [1,1,1,1,2,2] into [(1,4),(2,2)]. Bit-packing uses the minimum bits needed for the range — a column of integers in [0,1023] uses 10 bits per value, not 32 or 64. Delta encoding stores differences — brilliant for sorted timestamps where consecutive differences are small.
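The RLE and bit-packing examples in the paragraph above can be sketched directly:

```python
def rle_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def bits_needed(max_value):
    """Minimum bits to represent integers in [0, max_value]."""
    return max(1, max_value.bit_length())

assert rle_encode([1, 1, 1, 1, 2, 2]) == [(1, 4), (2, 2)]
assert bits_needed(1023) == 10   # values in [0, 1023] pack into 10 bits
```

Real writers combine the two: dictionary codes are themselves RLE/bit-packed, which is why a low-cardinality column often costs well under a bit per row before the general-purpose compressor even runs.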

The compression layer

On top of encoded bytes, a general-purpose compressor runs. The common choices: Snappy (fast, modest ratio — the historical Parquet default); Gzip (slow, good ratio — older and slower than alternatives); Zstandard / Zstd (tunable ratio, excellent speed — the modern default); LZ4 (fastest, worst ratio — useful for hot caches); Brotli (highest ratio, slowest — useful for write-once archival data).

The empirical rule

For most analytical workloads, Zstd at level 3 is the current sweet spot: roughly 2× better compression than Snappy, roughly 5% slower to decompress, readable by every modern engine. For long-lived cold archives, Zstd at level 9 or Brotli produce 10–30% smaller files at a much higher write cost. For a real-time ingest path where write throughput dominates, Snappy or LZ4 still win.

Why the two-layer view matters

A well-encoded Parquet file gains most of its compression from the encoding layer: a country column of a billion rows is already dictionary-encoded to 1 GB before any compressor runs. The compressor then adds another 2–3×. A poorly-encoded file — e.g., a PLAIN-encoded string column — leans entirely on the compressor and produces files two to five times larger than they should be. Diagnosing Parquet size issues almost always means checking the encoding, not the compressor.

The rule of thumb

Typical Parquet files with sensible defaults are 5–20× smaller than the equivalent CSV, and roughly 2× smaller than ORC on the same workload. A file that does not hit those ratios almost always has a miscoded column — usually a high-cardinality string that defeated the dictionary, or a wide JSON blob that defeated everything.

Section 10

Partitioning, clustering, and file layout that makes queries fast

A columnar file is fast. A directory of columnar files is fast only if its physical layout matches the queries run against it. The decisions about how to partition and sort data are among the most consequential and least visible a team makes.

Partitioning

Partitioning splits a table into disjoint subsets — typically by date, region, or tenant — stored in separate folders. A query filtered by the partition column reads only the relevant folders; the engine prunes the rest without opening a file. A well-partitioned events table (events/year=2026/month=04/day=17/) serves a one-day query by reading 1/365 of the data, with no other optimisation.
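Pruning is a pure string-matching exercise over paths, which a few lines of Python can illustrate (the paths and helper names are illustrative, not any engine's API):

```python
paths = [
    "events/year=2026/month=04/day=16/part-0.parquet",
    "events/year=2026/month=04/day=17/part-0.parquet",
    "events/year=2026/month=04/day=17/part-1.parquet",
    "events/year=2025/month=12/day=31/part-0.parquet",
]

def partition_of(path):
    """Parse Hive-style key=value components out of a path."""
    return dict(p.split("=") for p in path.split("/") if "=" in p)

def prune(paths, **filters):
    """Keep only paths whose partition values satisfy every filter."""
    return [p for p in paths
            if all(partition_of(p).get(k) == v for k, v in filters.items())]

one_day = prune(paths, year="2026", month="04", day="17")
# only the two day=17 files survive; the others are never opened
```

The engine's real pruning happens against the table format's metadata rather than by listing folders, but the logic — eliminate files on partition values before reading anything — is the same.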

The over-partitioning trap

Partitioning on a high-cardinality column — user_id, say — produces a partition per value and catastrophic small-file overhead. Object stores charge per request; a table split into a million partitions costs a million HEADs before any real work begins. The rule: partition on columns with tens to low thousands of distinct values, not more. Date at day granularity is almost always safe; user_id almost never is.

Clustering and sort order

Inside a partition, sorting the data on columns commonly filtered makes min/max statistics in Parquet (Section 7) sharply more selective. Z-ordering and Hilbert curves — supported in varying combinations by Delta Lake, Hudi, Iceberg's rewrite procedures, and Databricks' liquid clustering — are multi-dimensional sort techniques that preserve locality across two or more columns at once, useful when both user_id and event_ts are common filters.
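The core of Z-ordering is bit interleaving, sketched here for two integer columns; real engines first map arbitrary column types onto fixed-width integers, but the interleave itself looks like this:

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return key

# Sorting on the interleaved key keeps rows close in *both* dimensions
# near each other, which a plain lexicographic sort on (x, y) does not.
rows = [(3, 7), (0, 0), (1, 1), (7, 3)]
rows.sort(key=lambda r: z_order_key(*r))
# rows == [(0, 0), (1, 1), (7, 3), (3, 7)]
```

The payoff is that min/max statistics stay reasonably tight on both columns at once, so predicate pushdown works for filters on either one.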

File size, and compaction

Optimal Parquet files are 128 MB to 1 GB. Smaller files pay per-file overhead; larger files hurt parallelism and predicate pushdown. A streaming ingest writes thousands of small files; a batch compaction job periodically rewrites them into a few large ones. Delta Lake's OPTIMIZE, Iceberg's rewrite_data_files, and Hudi's compaction schedulers all exist specifically for this.
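A compaction planner is, at heart, a bin-packing pass over file sizes. A greedy sketch — the target size and strategy here are illustrative, not any engine's actual algorithm:

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group files into rewrite batches of roughly target_mb."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)          # close the full batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 10,000 tiny 1 MB streaming files become 20 rewrite batches of ≤512 MB
plan = plan_compaction([1] * 10_000)
```

Each batch then becomes one rewrite job that reads the small files and emits a single well-sized Parquet file — which is essentially what Delta's OPTIMIZE and Iceberg's rewrite_data_files schedule under the hood.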

The layout-driven query

For almost any analytical table, roughly 90% of practical query performance comes from three decisions: the partition column, the clustering column, and whether compaction runs. A query that is slow on a well-laid-out table almost never gets faster by adding indexes; it gets faster by fixing the layout.

Section 11

Open table formats — Delta Lake, Iceberg, Hudi

A single Parquet file is a file. A table is the claim that a specific set of files, taken together with a schema and a history, constitute a logical unit. The open table formats — Delta Lake, Apache Iceberg, Apache Hudi — are the three widely-adopted standards for making that claim.

Delta Lake

Delta Lake was created at Databricks and open-sourced in 2019. Its metadata is a sequence of JSON log files (_delta_log/) that record every commit — files added, files removed, schema changes, statistics. The log is append-only, ACID-managed via optimistic concurrency control, and replayable from scratch. Delta's tight integration with Spark was historically its main advantage; as of 2024, Delta is also natively readable by Snowflake, BigQuery, DuckDB, Trino, and many more.
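The log-replay idea is small enough to sketch. A toy version with simplified action records — real Delta commits also carry schema, statistics, and protocol fields — showing both replay and the time-travel read it enables:

```python
log = [  # one entry per committed version of the table
    [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}],    # version 0
    [{"add": "part-2.parquet"}],                               # version 1
    [{"remove": "part-0.parquet"}, {"add": "part-3.parquet"}], # version 2
]

def live_files(log, as_of_version=None):
    """Replay commits 0..n to compute the live file set at version n."""
    files = set()
    last = len(log) - 1 if as_of_version is None else as_of_version
    for actions in log[: last + 1]:
        for a in actions:
            if "add" in a:
                files.add(a["add"])
            else:
                files.discard(a["remove"])
    return files

assert live_files(log) == {"part-1.parquet", "part-2.parquet", "part-3.parquet"}
# time travel: the table exactly as it stood at version 0
assert live_files(log, as_of_version=0) == {"part-0.parquet", "part-1.parquet"}
```

Note that version 2's "update" removed a file and added its rewritten replacement — data files are never modified in place, which is what makes replaying to any historical version possible.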

Apache Iceberg

Apache Iceberg originated at Netflix (2017) to solve the Hive-metastore problems that had become acute at their scale. Its metadata is more elaborate than Delta's: a table-metadata file (JSON) records the current state and snapshot history; each snapshot points at a manifest list, which in turn points at manifests carrying per-file statistics — the lists and manifests stored as Avro. The layered design makes Iceberg's commits cheap even for tables with millions of files, and its schema and partition evolution are more flexible than Delta's. Iceberg has become the de-facto standard for multi-engine interoperability, with first-class support across Snowflake, BigQuery, Databricks, Trino, Flink, and most others.

Apache Hudi

Apache Hudi (Hadoop Upserts Deletes and Incrementals) originated at Uber (2016) with a specific workload in mind: very frequent upserts and deletes, incremental consumption by downstream pipelines. It supports two table types — Copy-on-Write, like Delta and Iceberg, and Merge-on-Read, which keeps recent changes in log files and merges on query. Hudi is less universally adopted than Delta or Iceberg, but it is still the right choice when the workload is dominated by streaming upserts.

Convergence and interoperability

The three formats solve the same problem with different shapes, and the 2023–2025 period saw a wave of interoperability: Apache XTable (née OneTable) translates metadata between all three; Databricks' Delta UniForm publishes Iceberg metadata alongside Delta; Snowflake's Polaris and Databricks' Unity Catalog both speak Iceberg natively. The practical consequence: picking any one of the three no longer locks a team out of the others.

The choice for most teams

If the stack is Databricks-first, use Delta; the integration is tightest and the tooling is most mature. Otherwise, use Iceberg; it has the broadest multi-engine support and vendor-neutral Apache governance. Use Hudi only if the dominant workload is high-frequency upserts from streaming sources.

Section 12

ACID on object storage, and why it was a hard problem

Object storage offers no multi-object transactions, no atomic rename, and charges per request — and for most of its history it was only eventually consistent. Turning that into an ACID-compliant transactional store — which is exactly what the open table formats did — is one of the more underappreciated engineering achievements of the last decade.

The classical ACID properties

Atomicity: a transaction either commits in full or leaves no trace. Consistency: the table never appears in a state that violates its schema or constraints. Isolation: concurrent transactions appear to run one at a time. Durability: once a transaction is acknowledged, it survives failure. OLTP databases have delivered all four for fifty years; delivering them on object storage took more than a decade after S3 launched, and it is the core contribution of the table formats.

Optimistic concurrency on immutable files

The shared trick across Delta, Iceberg, and Hudi: data files are immutable — once written, they are never modified. Updates and deletes produce new files; the "version" of the table is a pointer to a specific set of them. A writer prepares its change, then attempts to commit by atomically swapping the pointer; if another writer beat it to the commit, it retries. This optimistic concurrency control works because the pointer swap can be made atomic — via a catalog service, an external lock table, or, since S3 added conditional writes in 2024, the object store itself.
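The commit protocol can be sketched as a compare-and-swap loop on the version pointer — a single-process toy in which a lock stands in for the store's atomic-swap primitive:

```python
import threading

class TablePointer:
    """Toy stand-in for the atomic 'current version' pointer."""
    def __init__(self):
        self.version = 0
        self._lock = threading.Lock()  # stands in for put-if-absent

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self.version != expected:
                return False           # someone else committed first
            self.version = new
            return True

def commit(pointer, prepare, max_retries=5):
    """Optimistic commit: prepare against a snapshot, CAS the pointer, retry on conflict."""
    for _ in range(max_retries):
        snapshot = pointer.version     # read the current version
        prepare(snapshot)              # write new immutable data files
        if pointer.compare_and_swap(snapshot, snapshot + 1):
            return snapshot + 1        # committed
        # lost the race: re-read, re-validate, retry
    raise RuntimeError("too many concurrent commits")

ptr = TablePointer()
v1 = commit(ptr, lambda snap: None)
v2 = commit(ptr, lambda snap: None)
print(v1, v2)  # 1 2
```

The key property: losing the race costs only a retry of the cheap metadata step — the expensive data files already written remain valid, because they are immutable and unreferenced until a commit points at them.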

Time travel

Because old commits are never deleted until an explicit cleanup, every historical version of the table is queryable. Time travel (SELECT ... FROM orders TIMESTAMP AS OF '2026-04-01') is the reproducibility property ML depends on — retrain the exact model on the exact data, six months later, with zero ambiguity. It is also the primary tool for debugging data-quality incidents: diff the versions before and after the bad commit to find what changed.

The isolation level is snapshot

All three formats implement snapshot isolation: a transaction sees a consistent view of the table as of its start time, even if other transactions commit in the meantime. This is weaker than serialisable isolation but strong enough for almost every analytical workload, and the performance is dramatically better.

The surprising robustness

The combination of immutable files, optimistic concurrency, and snapshot isolation turns out to scale to very large tables — hundreds of millions of files, petabytes of data, hundreds of concurrent readers and a handful of concurrent writers — with correctness guarantees that the older warehouse appliances also had, but at a fraction of the cost. This, more than any other single technical fact, is why the lakehouse architecture has become dominant.

Section 13

Dimensional modelling, still the most useful thing Kimball ever wrote

Storage format is what; schema shape is how. Regardless of whether the table lives in a warehouse or a lakehouse, the way its columns are organised — which columns are facts, which are dimensions, which are keys — determines whether analytical queries are simple or tortured.

Facts and dimensions

Fact tables record the business events: orders, clicks, sensor readings, payments. They are narrow (a dozen columns), deep (billions of rows), and mostly numeric. Dimension tables describe the entities referenced by facts: users, products, dates, regions. They are wide (dozens to hundreds of columns), shallow (thousands to millions of rows), and mostly text. A query almost always joins one fact table to several dimension tables.

The star schema

The canonical layout: a central fact table surrounded by dimension tables joined by foreign keys. It is called a star because the diagram looks like one. A sale-fact table keyed by (date_id, store_id, product_id, customer_id) joined to dim_date, dim_store, dim_product, dim_customer is, thirty years on, still the most productive way to model a retail warehouse.
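A miniature star schema can be stood up in a few lines of SQLite; the table and column names here are illustrative:

```python
import sqlite3

# One fact table, two dimensions — the smallest possible star.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
    CREATE TABLE fact_sales  (date_id INTEGER, product_id INTEGER, amount REAL);
""")
db.executemany("INSERT INTO dim_product VALUES (?,?,?)",
               [(1, "widget", "hardware"), (2, "ebook", "media")])
db.executemany("INSERT INTO dim_date VALUES (?,?,?)",
               [(10, "2026-04-01", "2026-04"), (11, "2026-04-02", "2026-04")])
db.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
               [(10, 1, 9.5), (10, 2, 4.0), (11, 1, 9.5)])

# The canonical star query: fact joined to dimensions, grouped by attributes.
rows = db.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_date d    USING (date_id)
    GROUP BY p.category, d.month
    ORDER BY p.category
""").fetchall()
print(rows)  # [('hardware', '2026-04', 19.0), ('media', '2026-04', 4.0)]
```

The shape of that query — one fact table, several dimension joins, a GROUP BY over dimension attributes — is the signature of the star schema, and it is exactly what columnar engines optimise for.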

Snowflake, and why most teams do not use it

In a snowflake schema (no relation to the product), dimension tables are themselves normalised — dim_product splits into dim_product, dim_category, dim_supplier. It looks cleaner on a diagram and queries worse: each extra join adds a shuffle. In analytical systems where disk is cheap, the denormalised star wins; most modern warehouses recommend it.

Slowly-changing dimensions

A customer moves from France to Germany; does the old sale become a German sale? The answer depends on the business, and the modelling technique that expresses the choices is slowly-changing dimensions (SCD). SCD Type 1 overwrites; SCD Type 2 keeps history with effective-date ranges; SCD Type 3 keeps a previous-value column. Type 2 is the right choice for almost any dimension where history matters for analytics.
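An SCD Type 2 update can be sketched in plain Python; the schema and field names below are hypothetical:

```python
from datetime import date

# SCD Type 2: history kept as rows with effective-date ranges.
# valid_to of None marks the currently-effective row.
dim_customer = [
    {"customer_id": 42, "country": "FR",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def scd2_update(rows, key, new_country, as_of):
    """The Type 2 move: close the current row, open a new one."""
    for row in rows:
        if row["customer_id"] == key and row["valid_to"] is None:
            row["valid_to"] = as_of                    # close history
    rows.append({"customer_id": key, "country": new_country,
                 "valid_from": as_of, "valid_to": None})

scd2_update(dim_customer, 42, "DE", date(2026, 3, 15))
current = [r for r in dim_customer if r["valid_to"] is None]
print(current[0]["country"])  # DE
print(len(dim_customer))      # 2  (history preserved)
```

Type 1, by contrast, would simply overwrite the country field in place — cheaper, but the January sale would silently become a German sale.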

The one big table and the wide-table renaissance

Modern columnar engines compress sparse wide tables so well that many teams have abandoned star schemas in favour of a single denormalised wide table — every sale joined to every dimension at write time, stored once. This loses the update efficiency of dimensions but simplifies queries and is often faster. The choice between modelled star and denormalised wide depends on whether dimensions change often and whether the team has the discipline to maintain a model.

The still-valid advice

The exact physical shape matters less than the discipline of naming facts, dimensions, and grain explicitly. A team that can answer "at what grain is this fact table, and which dimensions describe it?" has already done the hardest part of dimensional modelling, regardless of whether the answer ends up as a star schema or a one-big-table.

Section 14

Separating storage from compute, the architectural move that changed everything

For most of the history of databases, storage and compute were welded together — data lived on the machines that queried it, scaling one meant scaling the other. The cloud era broke the welding; the consequences have propagated through almost every decision in this chapter.

The old coupling

A classical warehouse appliance — Teradata, Netezza, original Redshift — stored its data on the local disks of the compute nodes. Adding capacity meant adding a node; adding compute meant moving data to rebalance; shrinking a cluster was almost impossible without downtime. The capacity of the cluster and the size of the data were the same number, which is fine when both grow together and punishing when they do not.

The decoupling move

Snowflake (2014) and BigQuery (announced 2010, generally available 2012) shared the critical design choice: data lives permanently in object storage (S3, GCS); compute is a stateless, elastic pool of virtual warehouses that mount the storage on demand. Ten virtual warehouses can query the same table simultaneously without contention. A warehouse can be scaled up for a big ETL job and scaled down the moment it finishes. The cost of storage (pennies per GB-month) is decoupled from the cost of compute (dollars per node-hour).

What it unlocked

Four things followed. Isolation by workload: finance, analytics, and ML each get their own compute pool, sized to their own workload. Pay-per-query economics: idle compute costs nothing. Multi-engine access: the same files can be queried by Spark, DuckDB, and Snowflake, depending on which is best for the job. The lakehouse itself: the whole architecture of Section 5 is an outcome of this decoupling.

The catch: caching and locality

The weakness of the decoupled model is that every query reads from object storage, which has 10–100 ms latency per request — much slower than a local disk. Modern warehouses compensate with aggressive local caching on the compute nodes: the first query on a file is slow, subsequent queries on the same file are fast. Designing for a good cache-hit rate — keeping queries against a compact, recent slice of the data — is the main performance tuning concern of the modern era.

The architectural consequence

Almost every data-platform decision made since 2015 that looked modern and strange — open table formats, the lakehouse, Iceberg's multi-engine claim, the ability to run ten query engines against one copy of the data — is an implication of storage and compute having come apart. Understanding the decoupling is the single most useful way to reason about why the architectures of this chapter look the way they do.

Section 15

S3, GCS, Azure Blob — the substrate of modern data platforms

Underneath every warehouse, every lakehouse, every ML training job that reads a dataset, sits cloud object storage. Understanding it as a primitive — with its specific latency, consistency, and cost profile — is what lets an engineer reason about what everything built on top of it can and cannot do.

The object-storage model

A bucket is a flat namespace of objects. An object is a blob of bytes with a key, some metadata, and a content hash. There is no filesystem hierarchy, no rename, no partial update — an object is written once and either replaced whole or deleted, though ranged GETs allow reading a byte span, which is exactly what columnar readers exploit. The data plane exposes five operations: PUT, GET, HEAD, DELETE, and LIST. The semantics are simple enough that the system can guarantee eleven nines of durability and near-unlimited scale, and restrictive enough that building a transactional table on top of it required a decade.
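The object model fits in a toy class — an in-memory sketch of the data-plane operations, not a client for any real store:

```python
# A toy object store: a flat namespace, whole-object writes, prefix listing.
class Bucket:
    def __init__(self):
        self._objects = {}                      # key -> bytes; no hierarchy

    def put(self, key, data: bytes):
        self._objects[key] = data               # replace whole, never modify

    def get(self, key) -> bytes:
        return self._objects[key]

    def head(self, key) -> dict:
        return {"key": key, "size": len(self._objects[key])}

    def delete(self, key):
        self._objects.pop(key, None)

    def list(self, prefix=""):
        # "Directories" are an illusion: just keys sharing a prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))

b = Bucket()
b.put("sales/date=2026-04-01/part-000.parquet", b"...")
b.put("sales/date=2026-04-02/part-000.parquet", b"...")
print(b.list(prefix="sales/date=2026-04-01/"))
```

The slashes in the keys are pure convention — which is why "partition folders" in object storage are really just key prefixes, and why LIST-by-prefix is the operation that table-format metadata exists to avoid.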

The three major providers

Amazon S3 (2006) invented the category and remains the market and technical reference. Google Cloud Storage is broadly equivalent with a cleaner API and tighter integration with BigQuery. Azure Blob Storage is the Microsoft equivalent, with the same primitives and a separate Data Lake Storage Gen2 layer for hierarchical access. A dataset written to one can be made accessible to engines running on the others, usually via a cross-cloud read path or a replication job.

Consistency, finally

For most of S3's history it was eventually consistent — a PUT might not be visible to an immediate GET. In December 2020, S3 became strongly consistent: every GET, HEAD, and LIST reflects the latest PUT or DELETE, without any architectural change visible to clients. GCS and Azure Blob already offered comparable guarantees. This single change retroactively simplified every data pipeline ever written against object storage, and is one of the quieter large wins of the last decade.

Cost — the one dimension people get wrong

Storage itself is cheap — roughly $0.02 per GB-month in 2026. What tends to surprise first-time operators is request cost ($0.005 per 1000 PUTs, $0.0004 per 1000 GETs), egress cost ($0.09 per GB moved out of a region), and cross-region replication. A badly-designed pipeline that writes millions of small files, or reads entire datasets across regions for every query, produces bills that dwarf storage. This is why Section 10's "files should be 128 MB to 1 GB" rule is not aesthetics: it is arithmetic.
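The arithmetic is worth doing once. A back-of-envelope sketch using the list prices quoted above (treated as assumptions; check current pricing):

```python
# Approximate list prices, per the figures in the text (assumptions).
PUT_PER_1000 = 0.005
GET_PER_1000 = 0.0004
STORAGE_PER_GB_MONTH = 0.02

def monthly_cost(total_gb, file_size_mb, reads_per_file_per_month):
    """Storage plus request cost for one month, in dollars."""
    n_files = total_gb * 1024 / file_size_mb
    storage = total_gb * STORAGE_PER_GB_MONTH
    puts = n_files / 1000 * PUT_PER_1000
    gets = n_files * reads_per_file_per_month / 1000 * GET_PER_1000
    return round(storage + puts + gets, 2)

# 1 TB as 1 MB files versus the same data as 512 MB files,
# each file read 100 times a month:
print(monthly_cost(1024, 1, 100))    # small files: requests dominate
print(monthly_cost(1024, 512, 100))  # right-sized files: storage dominates
```

With a million 1 MB files the request charges are roughly twice the storage bill; at 512 MB per file they vanish into rounding. The 128 MB-to-1 GB rule is that arithmetic, applied.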

Storage classes and lifecycle

Each provider offers a ladder: Standard → Infrequent Access → Glacier / Archive, trading availability for lower per-GB cost. A lifecycle policy moves objects down the ladder automatically — after thirty days to Standard-IA, after a year to Glacier, after seven years to deletion. For long-horizon data — audit logs, ML training sets that may be retrained on — the difference between "store everything forever in Standard" and "store everything forever with a lifecycle" is the difference between an expensive platform and a sustainable one.
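The ladder described above can be written down as a lifecycle rule. The sketch below follows the field names of the S3 lifecycle API as used by boto3's put_bucket_lifecycle_configuration; treat it as illustrative and verify against current documentation:

```python
# The thirty-days / one-year / seven-years ladder from the text, as an
# S3-style lifecycle rule (field names follow the S3 lifecycle API;
# a sketch, not a tested production policy).
lifecycle = {
    "Rules": [{
        "ID": "audit-logs-ladder",
        "Status": "Enabled",
        "Filter": {"Prefix": "audit-logs/"},
        "Transitions": [
            {"Days": 30,  "StorageClass": "STANDARD_IA"},  # infrequent access
            {"Days": 365, "StorageClass": "GLACIER"},      # archive
        ],
        "Expiration": {"Days": 7 * 365},                   # delete after ~7 years
    }]
}

# With boto3 this would be applied roughly as (not executed here):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["ID"])
```

Once the rule is attached, the downward moves happen server-side with no pipeline involvement — which is the whole point: the sustainable platform is the one where aging data demotes itself.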

The mental image to hold

Every table in the modern platform is ultimately a pile of Parquet files in an S3 or GCS or Azure Blob bucket, with a small metadata log next to them. Everything above — the warehouses, the catalogues, the query engines — is about making that pile act like a first-class transactional table. The pile itself is where durability, scale, and cost live.

Section 16

Access, encryption, and cost — the non-technical edges of storage

A data platform that computes the right answer, fast, and exposes the wrong data to the wrong people, is a disaster. A platform that is correct, fast, and secure but costs five times what it should is a different kind of disaster. Storage is where both categories of failure most commonly land.

Access control

The dominant model is role-based access control (RBAC): users inherit permissions from roles; roles grant permissions on objects (tables, views, columns, rows). Every modern warehouse and lakehouse catalogue — Snowflake, BigQuery, Databricks Unity Catalog — implements RBAC with similar grammar: GRANT SELECT ON table TO role. Fine-grained controls — column masking for PII, row-level security for tenant isolation — are now standard.

Encryption — at rest and in transit

Every major object store encrypts data at rest by default, using provider-managed keys. Customer-managed keys (CMK) add the option of encrypting with a key the customer controls, so revoking the key renders the data unreadable — useful for tenant isolation and for regulatory requirements that demand it. Encryption in transit is universal (TLS 1.2+) and does not require configuration. The remaining question is which data, if any, needs client-side encryption — encrypted before it reaches the storage provider — for workloads where the provider itself is outside the trust boundary.

Cost controls

The four axes of cost on a modern platform: storage (cheap but cumulative), compute (expensive, pay-per-query or pay-per-hour), egress (expensive, especially cross-cloud), and request overhead (small per request, catastrophic at small-file scale). The controls: reservation pricing for predictable compute; lifecycle policies for aging storage; query governance (resource monitors, query timeouts, workload management) to cap runaway jobs; and plain measurement — a FinOps dashboard that surfaces cost per team, per table, per query.

Multi-tenancy

A platform shared by multiple teams, products, or external customers multiplies every governance challenge. The common architectural answer: one catalogue per organisation; one schema (database) per team; row-level security enforcing customer boundaries inside tables. A heavier variant is one physical warehouse per tenant — simpler to reason about but more expensive and harder to evolve.

The boring part is the hard part

Access, encryption, and cost are not technically deep compared to the storage-format choices earlier in this chapter. They are, in practice, the part that gets platforms shut down, audited, or quietly abandoned because they became too expensive. Investing in them is unglamorous and disproportionately rewarded.

Section 17

Where storage compounds in machine learning

An ML system is a read workload against a storage system. Training scans enormous slices of the warehouse; feature serving reads tiny rows at low latency; evaluation reproduces historical states. Each of these demands something specific from the storage layer, and each has pathological failure modes when the layer is wrong.

Training reads columnar

A training job on a billion-row table reads a handful of columns, streams them through a model, and repeats. This is exactly the workload Parquet and columnar engines are optimised for. A team training against a CSV or a row-oriented PostgreSQL table is paying an order-of-magnitude penalty for the privilege; moving to Parquet-backed tables is, in practice, the highest-leverage storage change most ML teams ever make.
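The order-of-magnitude claim is just geometry. A pure-Python illustration with made-up sizes, not a Parquet benchmark:

```python
# Bytes touched when training reads 3 of 50 columns (illustrative sizes).
N_ROWS, N_COLS, BYTES_PER_VALUE = 1_000_000_000, 50, 8

def bytes_scanned_row_oriented(cols_needed):
    # Row storage interleaves the columns: every row is read in full,
    # however few columns the job actually needs.
    return N_ROWS * N_COLS * BYTES_PER_VALUE

def bytes_scanned_columnar(cols_needed):
    # Column storage keeps each column contiguous: read only what you need.
    return N_ROWS * cols_needed * BYTES_PER_VALUE

row_gb = bytes_scanned_row_oriented(3) / 1e9
col_gb = bytes_scanned_columnar(3) / 1e9
print(f"row-oriented: {row_gb:.0f} GB, columnar: {col_gb:.0f} GB")
# row-oriented: 400 GB, columnar: 24 GB — before compression widens the gap
```

Per-column compression and encoding then widen the gap further, since a homogeneous column compresses far better than interleaved rows.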

Feature stores sit on a different workload

Serving a model in production needs features for a specific entity at a specific moment: user 42's last-30-day spend, right now, in under ten milliseconds. This is row-oriented, low-latency, high-QPS — the opposite of the training workload. A feature store (Feast, Tecton, Databricks Feature Store) therefore maintains two copies of each feature — an offline columnar store for training and an online row-oriented store (Redis, DynamoDB, CockroachDB) for serving — plus an ingest pipeline that keeps them consistent.

Time travel, for reproducibility

Training produces a model; the model is a function of the exact dataset it saw. The table-format time-travel feature (Section 12) is, for ML, the reproducibility property. "Retrain the model on the state of the warehouse as of 2026-04-01" is a one-line query against an Iceberg or Delta table; it is an archaeological dig against a traditional warehouse. Every experiment's provenance record should include the table version read.

Point-in-time correctness

A subtler trap: joining current dimensions onto historical facts. If a customer moved countries in March, a feature computed in April must not use the current country for a January event — or the model learns from information it could not have known at the time. Point-in-time joins, supported natively by modern feature stores and by any warehouse with SCD Type 2 dimensions, prevent this form of temporal leakage and are one of the most common sources of silent bugs.
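A point-in-time join can be sketched in plain Python — hypothetical schema, with SCD Type 2 validity treated as a half-open interval [valid_from, valid_to):

```python
from datetime import date

# SCD Type 2 dimension rows: the customer moved FR -> DE on 2026-03-15.
dim_customer = [
    {"customer_id": 42, "country": "FR",
     "valid_from": date(2020, 1, 1), "valid_to": date(2026, 3, 15)},
    {"customer_id": 42, "country": "DE",
     "valid_from": date(2026, 3, 15), "valid_to": None},
]
facts = [
    {"customer_id": 42, "event_date": date(2026, 1, 10), "amount": 30.0},
    {"customer_id": 42, "event_date": date(2026, 4, 2),  "amount": 55.0},
]

def point_in_time_join(facts, dim):
    """Each fact picks the dimension row that was valid at its event time."""
    out = []
    for f in facts:
        for d in dim:
            if (d["customer_id"] == f["customer_id"]
                    and d["valid_from"] <= f["event_date"]
                    and (d["valid_to"] is None or f["event_date"] < d["valid_to"])):
                out.append({**f, "country": d["country"]})
    return out

joined = point_in_time_join(facts, dim_customer)
print([(r["event_date"].isoformat(), r["country"]) for r in joined])
# [('2026-01-10', 'FR'), ('2026-04-02', 'DE')]
```

A naive join to the current dimension would label the January event "DE" — information the model could not have known at the time, which is exactly the temporal leakage the text describes.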

The cost of a bad storage choice

A typical retraining job that reads from a well-laid-out Parquet lakehouse completes in minutes and costs cents. The same job against a badly-modelled JSON pile can take hours and cost thousands, and is often why ML teams talk about "our training is too expensive" — the model is not the problem, the storage is. The investment in storage discipline pays back, literally, on every training run for the life of the project.

The through-line

Machine learning looks like an algorithmic discipline and is, in production, a storage discipline wearing an algorithmic mask. The shape of the tables, the format of the files, the partitioning of the folders, the version log of the lakehouse — these determine how fast training is, how reproducible experiments are, how cheap the platform is to run, and how fast the next model can be shipped. Every chapter in the rest of Part III builds on top of these choices; getting them right early is the single most durable compounding investment a data team makes.

Further reading

Where to go next

Storage is a field with a small canonical bibliography (the warehouse bibles, the Dremel and C-Store papers, the lakehouse paper) and a much larger practical one (format specs, engine docs, operational playbooks). The list below picks the references that repay serious reading, the specs worth knowing in detail, and the project documentation most working engineers return to.

The canonical books

Foundational papers

Columnar file-format specifications

Open table formats — project documentation

Warehouses and query engines

Cloud object storage

ML-adjacent storage topics

This page is the second chapter of Part III: Data Engineering & Systems. The next — Data Pipelines & Orchestration — covers how bytes actually get from the ingestion sources of Chapter 01 into the storage layouts described here, and how they get transformed along the way: Airflow, Prefect, Dagster, dbt, the batch-versus-stream trade-off, and the discipline of making a pipeline whose failure modes are visible. After that come streaming and real-time systems, distributed compute, the cloud platforms everything runs on, and finally the governance layer that keeps it all auditable.