Part III · Data Engineering & Systems · Chapter 03

Data pipelines, and the orchestration of everything that moves.

Data does not stay where it lands. It arrives from sources, lands in storage, gets transformed, gets joined, gets copied, gets aggregated, gets served back to dashboards and models and operational systems — and every one of those arrows is a pipeline, and every pipeline has to run on schedule, notice its own failures, and be safe to re-run. The tools that express the arrows — Airflow, Prefect, Dagster, dbt — and the patterns that keep them correct — idempotence, incremental loads, data tests, observability — are what this chapter covers. It sits between the ingestion of Chapter 01 and the streaming of Chapter 04; it is the layer most working data engineers spend most of their time in.

How to read this chapter

The first three sections frame the problem: why pipelines are the nervous system of a data platform, the ETL-versus-ELT shift that has reshaped how transformation is done, and the batch-versus-stream divide that the rest of Part III will return to repeatedly. Sections four through seven are the orchestration story: the DAG-plus-scheduler abstraction, Apache Airflow as the incumbent, Prefect and Dagster as the modern contenders, and dbt as the transformation layer that made analytics engineering a distinct discipline. Sections eight through ten are the correctness primitives: idempotence, backfills, and the testing practices that separate a pipeline from a liability. Sections eleven and twelve cover observability and schema/contract management — how to know the pipeline is working and how to keep upstream and downstream in agreement. Sections thirteen and fourteen bring in change-data-capture and reverse-ETL: moving data out of warehouses back into operational systems, and the emerging "operational analytics" loop. Sections fifteen and sixteen are the ML-specific story: feature, training, and inference pipelines, and the operational disciplines (secrets, retries, SLAs) that production demands. The final section connects all of it to why pipeline discipline is, in practice, the load-bearing determinant of whether an ML team can ship.

Conventions: code snippets are Python where illustrative, SQL when the topic is transformation, YAML when the topic is configuration. References to "the warehouse" assume the storage layer from Chapter 02 — Parquet files in object storage, an open table format on top, a query engine (Snowflake, BigQuery, Databricks, DuckDB) reading from it. Vendor names are used where there is no good generic term; the choice of tool is almost always less important than the principles the chapter tries to articulate.

Contents

  1. Why pipelines are the nervous system · The layer data engineers live in
  2. ETL versus ELT · The great transformation flip
  3. Batch versus stream · Windows, latency, correctness
  4. Orchestration as a discipline · DAGs, schedulers, dependencies
  5. Apache Airflow · The incumbent
  6. Prefect and Dagster · The modern orchestrators
  7. dbt and the analytics-engineering layer · SQL as software
  8. Idempotence and determinism · Re-run without remorse
  9. Backfills and incremental loads · Catching up without double-counting
  10. Testing pipelines and data · Unit tests, data tests, assertions
  11. Observability and lineage · Knowing what ran, when, and why
  12. Schemas, contracts, and evolution · Keeping upstream and downstream honest
  13. Change data capture · Moving deltas without full reloads
  14. Reverse-ETL and the operational loop · Warehouse out to the business
  15. Feature, training, and inference pipelines · The three ML pipelines
  16. Secrets, retries, SLAs, costs · The operational surface
  17. Where it compounds in ML · Why pipeline discipline decides shipping

Section 01

Pipelines are the nervous system of a data platform

If storage is the substrate, pipelines are what happens on top of it — the set of moving processes that turn raw bytes into useful tables, that keep those tables fresh, and that get derived data to every system downstream that asks for it. More working hours of data engineering are spent inside pipelines than anywhere else, and more production incidents originate there too.

What a pipeline actually is

A pipeline is a directed chain of computations that turns one or more data inputs into one or more data outputs on a schedule or in response to an event. The inputs might be raw logs in object storage, rows in a Postgres database, events arriving on Kafka, or files appearing in a vendor's SFTP drop; the outputs are typically well-modelled tables in a warehouse, feature tables for a model, events on another topic, or extracts sent back out to a business system. The computations in between — filtering, joining, aggregating, enriching, validating — are what the pipeline does; the schedule and the dependencies and the failure handling are what the pipeline is.

Why pipelines deserve their own discipline

A pipeline is not a one-off query, and is not a long-running service, and is not a batch script. It is its own genre of software, with failure modes that neither analytical code nor application code fully share. Pipelines have to be safe to re-run after a partial failure, robust to late-arriving or malformed data, observable enough that a human can diagnose a stuck run at two in the morning, and cheap enough to run daily — or hourly, or continuously — for years. Those properties do not come for free; they are the result of a set of disciplines (idempotence, incremental design, testing, observability) that the rest of this chapter is about.

The three audiences a pipeline serves

Every pipeline is, implicitly, serving three audiences. The analyst, who wants the output table to be accurate, fresh, and well-named. The operator, who has to diagnose it when it breaks at three a.m. and wants logs, lineage, and a retry button. And the downstream engineer or model, who consumes the output and depends on its schema, its semantics, and its freshness guarantees. A pipeline that is written for only one of the three — fast for the analyst, opaque to the operator — is the most common failure mode in this entire field.

The load-bearing discipline

If an ML team ships late, the reason is almost never the model. It is almost always the pipelines — a feature that is not being computed correctly, a training table that is stale, a serving path that drifts from the training path, a daily refresh that quietly broke two weeks ago. Pipeline discipline is the load-bearing determinant of whether any data-dependent system ships on time.

Section 02

ETL versus ELT, and the great transformation flip

The phrase extract, transform, load has been the default description of a data pipeline for thirty years; for most of those thirty years, the letters were in the wrong order. The flip to extract, load, transform is not a branding exercise — it is the most consequential change in how analytical data is processed since the warehouse was invented, and it is the premise of every modern data stack.

The ETL world, and why it existed

In the classical ETL regime, data was pulled from sources, transformed in flight by a dedicated tool — Informatica, IBM DataStage, SSIS, Talend, Ab Initio — and only the finished, cleaned, conformed rows were loaded into the warehouse. The reason was economic: warehouse storage and warehouse compute were expensive, so nothing that could be discarded upstream should be allowed to land. The consequence was that the ETL tool became the centre of gravity, with its own proprietary expression language, its own scheduler, and its own team.

The ELT inversion

Cheap object storage and separated compute changed the economics. If loading raw data into the warehouse costs almost nothing, and if the warehouse's SQL engine is the most capable compute you own, the rational move is to load first and transform later, in-warehouse, in SQL. This is the ELT shift. Transformation moves from a specialised ETL tool into the warehouse itself, expressed in SQL, and the "T" becomes a dbt project rather than an Informatica mapping.

What ELT buys, and what it costs

The benefits are real: a single SQL dialect for transformations, raw history preserved for re-derivation, and one less system to operate. The costs are also real: warehouse bills become a line item engineers look at weekly rather than yearly; governance of raw PII is harder when everything lands in one place; and transformation logic that was once centralised in an ETL tool is now distributed across hundreds of dbt models maintained by many teams. The modern stack is almost entirely ELT; the operational discipline that used to live in ETL tools has had to be rebuilt, in dbt and in orchestrators, more or less from scratch.

The honest middle ground

In practice, most production systems are "EtLT": a light in-flight transform (parsing, deduplication, PII masking) followed by heavy in-warehouse transforms. Pure ELT is more an ideal than a policy; pure ETL is a legacy the industry has mostly left behind. Knowing which pieces of the transform belong upstream and which belong downstream is the working judgement call data engineers get paid for.

Section 03

Batch versus stream, and the latency dial everything else hangs on

Every pipeline has to answer one architectural question before any other: is it processing finite chunks of data on a schedule, or an unbounded stream of events as they arrive? The answer determines the tools, the failure modes, the cost profile, the correctness guarantees, and often the team that owns it. Most of the interesting difficulty in modern data systems is that real platforms contain both.

Batch, as a mental model

A batch pipeline processes a bounded, named collection of input data — "yesterday's orders", "last hour's click-stream", "every row written since the last run" — and produces a bounded output. It runs on a schedule, reads from storage, writes to storage, and has a clear notion of what "done" means. Batch systems are the incumbent paradigm, and almost every analytical pipeline still works this way; the tooling (Airflow, dbt, Spark, warehouse SQL) is mature, the reasoning is tractable, and the cost is predictable.

Stream, as a mental model

A stream pipeline processes an unbounded sequence of events one at a time — or in micro-batches of a few seconds — and produces a continuously updating output. "Done" is not a state a streaming pipeline is ever in; it runs forever, and its correctness is about invariants over time rather than about a single completed run. Stream systems are more complex: they have to reason about event-time versus processing-time, about windowing, about watermarks, about exactly-once delivery, about what happens when an event arrives late.

Where each belongs

Batch belongs wherever the consumer can tolerate an hour of staleness: almost every dashboard, almost every training job, almost every periodic report. Stream belongs wherever the consumer cannot: fraud detection, real-time personalisation, monitoring, operational alerting. The mistake to avoid is using streaming where batch would do, which is the most common over-engineering in the field; the opposite mistake — trying to get real-time behaviour out of a daily batch job — is more obvious and usually caught sooner.

The unified-API promise

Systems like Apache Beam, Flink's Table API, and Spark Structured Streaming all try to present batch and stream as special cases of a single programming model. The abstraction is useful and the convergence is real, but the operational differences — cost, failure handling, latency — do not go away because the API is shared. The architectural choice is still a choice.

Section 04

Orchestration, the DAG as the unit of thinking

A single transformation is a pipeline in only the most trivial sense. A real pipeline is a graph of transformations — one task's output is another task's input, and another's, and so on — that has to run in the right order, at the right time, with dependencies respected. The tool that plans and runs that graph is an orchestrator, and understanding the abstraction matters more than understanding any particular tool.

The DAG abstraction

Every orchestrator in the field expresses pipelines as a directed acyclic graph, a DAG: nodes are tasks, edges are dependencies. The DAG is acyclic because a task cannot depend on itself, directly or transitively; it is directed because ordering matters. The scheduler's job is, given a DAG and a schedule, to run each task after its upstream dependencies have succeeded, to retry it on transient failure, to surface it to a human on durable failure, and to record the result so the next run can decide whether to re-run or skip.
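The scheduler's contract is easier to hold in the head with a toy in hand. A deliberately minimal sketch (plain Python, no orchestrator; the task and dependency names are invented) of "run each task after its upstreams succeed, and never run anything downstream of a failure":

```python
# A toy DAG runner: nodes are tasks, edges are dependencies.
# It runs each task once all its upstreams have succeeded, retries nothing,
# and refuses to run anything downstream of a failure.

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: [upstream names]}."""
    done, failed = set(), set()
    remaining = dict(deps)
    while remaining:
        # Ready: every upstream has succeeded.  Blocked: some upstream failed.
        ready = [t for t, ups in remaining.items() if all(u in done for u in ups)]
        blocked = [t for t, ups in remaining.items() if any(u in failed for u in ups)]
        for t in blocked:                 # skip, and propagate the skip downstream
            failed.add(t)
            remaining.pop(t)
        if not ready:
            break                         # nothing runnable: stop, surface to a human
        for t in ready:
            try:
                tasks[t]()
                done.add(t)
            except Exception:
                failed.add(t)
            remaining.pop(t)
    return done, failed

results = []
tasks = {
    "extract":   lambda: results.append("extract"),
    "transform": lambda: results.append("transform"),
    "load":      lambda: results.append("load"),
}
deps = {"extract": [], "transform": ["extract"], "load": ["transform"]}
done, failed = run_dag(tasks, deps)
```

A real orchestrator adds persistence, retries, schedules, and a UI on top of this loop, but the core contract is the one sketched here.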

What an orchestrator actually provides

Every serious orchestrator provides roughly the same set of capabilities: a way to express the DAG in code, a scheduler that fires runs on time, an executor that runs tasks on workers, a state database that records what has succeeded and failed, a retry and backoff mechanism, a UI for operators to see runs, and a backfill tool for re-running historical windows. The differences between orchestrators are in how pleasant each of those is to work with, not in which exist.

What a scheduler is not

An orchestrator is not cron, though it often replaces cron. Cron is a timer: it fires a command on a schedule and does not know whether the command succeeded, whether its inputs are ready, whether its outputs landed, or whether any downstream task should react. An orchestrator is a state machine for the whole graph: it knows that task B depends on task A, that task A wrote a file, that task B reads it, and that if A fails, B should not run. The distance from cron to a real orchestrator is the distance from "a command ran at 2 a.m." to "yesterday's analytics are correct".

The canonical failure question

When an orchestrator run fails, the question that separates good pipelines from bad is what did it partially write? A well-designed pipeline writes nothing or writes completely — that is, each task is atomic from the perspective of its output. A badly-designed pipeline leaves the warehouse in a half-updated state, and the operator has to untangle it by hand. The discipline of idempotence in section eight exists to make this question answerable in one word: "nothing".
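The standard way to get "writes nothing or writes completely" for file outputs is to write to a temporary path and rename into place; os.replace is atomic within a single filesystem. A sketch, with invented file and row names:

```python
import json, os, tempfile

def publish_atomically(rows, dest_path):
    """Write so a reader sees either the old file or the new one, never a
    partial write: temp file in the same directory, fsync, atomic rename."""
    directory = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
            f.flush()
            os.fsync(f.fileno())          # durable before it becomes visible
        os.replace(tmp, dest_path)        # atomic within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise

with tempfile.TemporaryDirectory() as scratch:
    dest = os.path.join(scratch, "orders.jsonl")
    publish_atomically([{"id": 1}, {"id": 2}], dest)
    publish_atomically([{"id": 1}, {"id": 2}], dest)   # re-run: same file, no halves
    with open(dest) as f:
        n = sum(1 for _ in f)
```

The same publish-then-swap shape appears at every level of the stack: staging tables swapped into place, partitions overwritten wholesale, table-format commits made atomic by the metadata layer.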

Section 05

Airflow, the incumbent and the default

Apache Airflow, born at Airbnb in 2014 and an Apache project since 2016, is the orchestrator most working data teams encounter first. Its DAGs-as-Python-code model defined the shape of the field, its plugin ecosystem is larger than every competitor's combined, and its operational quirks are, for better and worse, the baseline expectations every newer tool is measured against.

DAGs as Python

An Airflow DAG is a Python file that constructs Operator objects and wires them together with shift operators (a >> b). Each operator represents a task — run a SQL statement, call a Python function, run a Kubernetes pod, trigger an HTTP webhook — and Airflow's scheduler walks the graph, runs the tasks, and records the results. The model is deliberately code-first: DAGs are version-controlled, code-reviewed, and tested like any other Python, which is how a pipeline stops being a spreadsheet full of cron entries and starts being software.
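In the TaskFlow idiom that Airflow 2 made the default, a small DAG looks roughly like this (a sketch, not a production DAG; the DAG name, schedule, and data are invented):

```python
# A minimal Airflow 2+ DAG in the TaskFlow idiom -- a sketch, not a
# production deployment.  The DAG name, schedule, and rows are invented.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False)
def daily_orders():
    @task
    def extract():
        return [{"order_id": 1, "amount": 42.0}]   # stand-in for a real source

    @task
    def transform(rows):
        return [r for r in rows if r["amount"] > 0]

    @task
    def load(rows):
        print(f"would load {len(rows)} rows")      # stand-in for a warehouse write

    load(transform(extract()))                     # the call graph IS the DAG

daily_orders()
```

The function calls at the bottom declare the edges; Airflow records each task's result and passes the small return values between tasks via XComs, which is why XComs are for metadata-sized payloads and not for data.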

The Airflow stack

A production Airflow deployment is four processes plus a database: a scheduler that decides what to run, a webserver that renders the UI, a set of workers (under the Celery, Kubernetes, or Local executor) that actually execute tasks, an optional triggerer for deferrable async tasks, and a metadata database (usually Postgres) that holds all DAG and task state. Operating all of it is a non-trivial investment, which is why managed offerings — Astronomer, AWS MWAA, Google Cloud Composer — exist.

Airflow's strengths and sharp edges

The strengths are the ecosystem (hundreds of providers: Snowflake, BigQuery, Databricks, S3, Kafka, dbt, you name it), the maturity of the scheduler, and the weight of existing deployments, which matters for hiring. The sharp edges are real: Python-level import time is scheduler-level cost (a slow DAG file slows the whole cluster), its scheduling model was originally built for daily batch and shows that heritage, and its data-passing primitive (XComs) is designed for metadata, not for data. Teams adopt Airflow because it is the safe choice; they migrate off it, when they do, because of those edges.

Airflow 2 and beyond

Airflow 2 (2020) rewrote the scheduler for horizontal scale and added the TaskFlow API, which made Python-function-as-task the default idiom; subsequent releases added dataset-aware scheduling, setup-and-teardown tasks, and dynamic task mapping. Airflow 3 (2025) is a larger break: a stable task execution API, decoupled workers, and first-class support for event-driven scheduling. New teams starting today should start on Airflow 3 if they start on Airflow at all.

Section 06

Prefect and Dagster, the modern orchestrators

The two most-discussed alternatives to Airflow start from different critiques of it. Prefect objects to Airflow's rigidity — schedules, top-level DAG definitions, the strict separation between DAG-definition time and runtime — and rebuilds orchestration around dynamic Python. Dagster objects to Airflow's task-centric model and rebuilds orchestration around assets: the data objects being produced, not the tasks that produce them. Both are worth understanding even if the deployment ends up on Airflow.

Prefect's flow-first model

Prefect's abstraction is the flow, a Python function decorated with @flow whose body can call @task-decorated functions. The DAG is constructed at runtime by the Python execution itself, which means control flow (if, for, try) behaves exactly as in normal code. Prefect's control plane (Prefect Cloud or a self-hosted Prefect server) handles scheduling, state, and observation; its workers pull work from the plane, so firewall-crossing deployments are easier than with Airflow's push model. It is the lightest-weight of the three to get running and the closest in feel to writing ordinary Python.

Dagster's asset-first model

Dagster inverts the unit of thinking: instead of describing tasks that happen to produce data, you describe software-defined assets — "the daily orders table", "the user-features table", "the trained model artifact" — and Dagster figures out the task graph to keep them current. The asset graph is the object the UI exposes, the lineage it shows, and the thing a developer reasons about. For data platforms where the important question is "what tables do we have, when were they last updated, and how were they derived?", Dagster's model is the sharpest fit.

How to choose between the three

The honest answer is that any of the three can run any pipeline, and the more interesting question is which fits the team's mental model. Teams with a large existing Airflow footprint, a cron-replacement mindset, and many heterogeneous tasks usually stay on Airflow. Teams writing mostly Python, iterating rapidly, and wanting the lightest operational load pick Prefect. Teams whose platform is dominated by tables and whose engineers think in terms of data products pick Dagster. On a green field, Dagster is the most opinionated and Airflow the most boring; both are defensible defaults for different teams.

The less-discussed alternatives

Beyond these three, Temporal (workflow-as-code with durable execution, aimed more at application workflows than analytics), Argo Workflows (Kubernetes-native, YAML-defined), and Kestra (YAML + plugins, declarative) are the other serious options. Cloud-native managed services — AWS Step Functions, Google Workflows, Azure Data Factory — fit narrower niches but are sometimes the pragmatic right answer inside a single cloud. None of them displace Airflow at the centre of the market yet.

Section 07

dbt, and the birth of analytics engineering

dbt — the data build tool — is not an orchestrator and is not a storage layer. It is a SQL-first templating and dependency-resolution framework that runs inside a warehouse, and it is, almost single-handedly, responsible for the emergence of analytics engineering as a distinct role. For any team whose warehouse is the centre of gravity, dbt is closer to a required skill than a choice of tool.

What dbt actually does

A dbt project is a directory of SQL files, each of which is a SELECT statement representing a model — a table or a view to be built in the warehouse. dbt parses the SQL, detects the dependencies between models using the ref() and source() macros, plans the execution order, materialises each model using the strategy specified (table, view, incremental, snapshot), and runs data tests against the results. That is most of the tool. The rest is Jinja templating, documentation generation, and a growing governance surface.
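Two model files, in sketch form (project, source, and column names invented), showing the two macros that carry the dependency graph:

```sql
-- Two model files from a sketch project; names are invented.

-- models/staging/stg_orders.sql: source() points dbt at the raw table
-- and registers the dependency edge to it.
select
    order_id,
    customer_id,
    cast(amount as numeric) as amount,
    cast(created_at as date) as order_date
from {{ source('shop', 'raw_orders') }}
where order_id is not null

-- models/marts/daily_revenue.sql: ref() makes this model depend on
-- stg_orders, so dbt builds the two in the right order.
select order_date, sum(amount) as revenue
from {{ ref('stg_orders') }}
group by order_date
```

Running the project builds stg_orders before daily_revenue because ref() declared the edge; the tests declared against either model run as part of the same invocation.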

Why it changed analytical work

Before dbt, the transformation layer was either a pile of stored procedures, an ETL tool's proprietary graph, or a set of ad-hoc SQL scripts orchestrated by Airflow. dbt imposed three disciplines that the field had largely lacked: transformations are version-controlled SQL files, not UI artefacts; dependencies are declarative and machine-readable, not kept in a human's head; and tests are part of the transformation, not a separate QA activity. The result is that an analytical warehouse now has a software-engineering shape to it — pull requests, code review, CI, lineage — which is what made analytics engineer a title worth holding.

dbt Core, dbt Cloud, and where this is going

dbt Core is the open-source command-line tool and is the load-bearing piece; dbt Cloud is the hosted layer that adds scheduling, an IDE, access control, and the semantic-layer features (MetricFlow, dbt Mesh for multi-project setups). Licensing tensions around dbt Labs' Cloud-only features and the emergence of SQLMesh (an independent, fully open-source alternative that can also run existing dbt projects) are the current political weather in this part of the ecosystem. For most teams the right answer is still dbt Core plus either dbt Cloud or a self-hosted Airflow/Prefect/Dagster runner; for teams uneasy about the licensing direction, SQLMesh is worth a serious look.

The ELT pattern, in one sentence

Ingestion tool (Fivetran, Airbyte, a custom Python task) loads raw data into the warehouse; dbt transforms it into clean staging tables, then into conformed marts; the orchestrator fires dbt on a schedule, runs its tests, and surfaces failures. That pattern is the default data-stack shape as of the mid-2020s and is what most new analytical pipelines look like.

Section 08

Idempotence, the single most important property of a pipeline

If a pipeline has exactly one property, make it idempotent: running it twice on the same inputs produces the same outputs as running it once. Everything else — retries, backfills, disaster recovery, operator sanity — follows from this property, and nothing else usefully substitutes for it.

What idempotence means, precisely

A task is idempotent if, given the same inputs, running it any number of times leaves the system in the same state as running it exactly once. For a pipeline task, that usually means: an append-only task must not append duplicates; an upsert task must converge to the same row state; a transform task must overwrite its output rather than append to it. The operational pay-off is enormous: any failed run can simply be re-executed, and the state will be correct, no matter how many times it failed or how partially it completed.

How to build idempotent tasks

Three patterns cover most cases. Overwrite by partition: every task writes to a specific, parameterised output partition (typically by date), and re-running the task overwrites the partition atomically. Merge by key: every task writes into a target table keyed on a stable identifier, using a MERGE or INSERT … ON CONFLICT statement, so repeated runs converge on the same rows. Idempotent receipts: each outgoing side-effect (email, webhook, payment) is tagged with a deterministic idempotency key, so the downstream system can deduplicate.
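The merge-by-key pattern in miniature, using SQLite's INSERT … ON CONFLICT as a stand-in for a warehouse MERGE (table and column names invented): running the same load any number of times converges on the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (user_id integer primary key, email text)")

def load_batch(rows):
    """Merge by key: repeated runs converge on the same row state."""
    conn.executemany(
        """insert into users (user_id, email) values (?, ?)
           on conflict (user_id) do update set email = excluded.email""",
        rows,
    )
    conn.commit()

batch = [(1, "a@example.com"), (2, "b@example.com")]
load_batch(batch)
load_batch(batch)                      # a retry after partial failure: no duplicates
load_batch([(2, "b2@example.com")])    # a corrected value converges, not accumulates

count = conn.execute("select count(*) from users").fetchone()[0]
email = conn.execute("select email from users where user_id = 2").fetchone()[0]
```

The warehouse equivalent is a MERGE statement keyed on the same stable identifier; the property being bought is identical.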

Determinism, the adjacent virtue

Idempotence requires that inputs be the same; determinism requires that the computation be the same. The two together let you rebuild yesterday's output from yesterday's inputs byte-for-byte. Common determinism killers include unseeded randomness, CURRENT_TIMESTAMP calls inside transformations, references to mutable external state, and locale-dependent sorting. Every one of them is a bug in a production pipeline, and every one of them is easy to miss on first pass.
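The cheapest defence against the clock-shaped killers is to pass the run's logical timestamp in as a parameter rather than reading the wall clock inside the transform. A sketch (function and field names invented):

```python
import hashlib, json
from datetime import datetime, timezone

def label_rows(rows, run_ts):
    """Deterministic transform: the sort order is explicit, and the stamp is
    the run's logical time, not datetime.now(), so re-runs match byte-for-byte."""
    return [dict(r, processed_at=run_ts.isoformat())
            for r in sorted(rows, key=lambda r: r["id"])]

def fingerprint(rows):
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

run_ts = datetime(2026, 4, 21, tzinfo=timezone.utc)
rows = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
first = fingerprint(label_rows(rows, run_ts))
second = fingerprint(label_rows(list(reversed(rows)), run_ts))  # same inputs, any order
```

Hashing the output, as here, is also a cheap way to test determinism in CI: run the transform twice and assert the fingerprints match.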

The single-sentence rule

If you cannot re-run today's pipeline tomorrow, starting from scratch, and expect the same output, the pipeline is not idempotent and every other discipline in this chapter rests on loose gravel. Idempotence is not an optimisation; it is the invariant that makes all the rest of pipeline engineering tractable.

Section 09

Backfills, and the arithmetic of catching up

Eventually, a pipeline's logic changes — a bug is fixed, a new column is added, a definition is revised — and yesterday's output becomes wrong by today's definition. The process of recomputing history under the new logic is a backfill, and the ease with which a pipeline can be backfilled is the clearest practical test of whether it was designed well.

The full-load / incremental-load / backfill triangle

Three modes matter. A full load recomputes the output from all the history available; simple, correct, slow, expensive. An incremental load processes only the data that has arrived since the last run, using a bookmark such as a timestamp or an auto-incrementing ID; fast, cheap, subtle. A backfill is a full or partial recomputation after the fact, typically over a named historical window. A well-designed pipeline supports all three, and a common mistake is designing only for the first two.

Partitioned writes, and why they exist

The operational trick that makes backfills tractable is partitioning the output by the processing date (or event date). Each daily run writes into its own dt=2026-04-21 partition; a backfill of a two-week window overwrites fourteen partitions and leaves the rest alone. Both Airflow and Dagster have first-class support for this pattern (Airflow's logical date templating, Dagster's partitioned assets), and both dbt (incremental with partition_by) and Spark (dynamic partition overwrite) cooperate natively. The pattern is what makes idempotence from section eight actually usable.
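The pattern in miniature, with plain directories standing in for warehouse partitions (paths and names invented; a real table format does the final swap atomically):

```python
import json, os, shutil, tempfile

def write_partition(root, dt, rows):
    """Each run owns exactly one dt=... directory and replaces it wholesale,
    so re-running a date (a backfill) can never double-count."""
    final = os.path.join(root, f"dt={dt}")
    staging = final + ".staging"
    os.makedirs(staging, exist_ok=True)
    with open(os.path.join(staging, "part-0.jsonl"), "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    if os.path.exists(final):
        shutil.rmtree(final)              # replace, never append; a lakehouse
    os.replace(staging, final)            # table format does this swap atomically

with tempfile.TemporaryDirectory() as root:
    write_partition(root, "2026-04-20", [{"orders": 10}])
    write_partition(root, "2026-04-21", [{"orders": 12}])
    write_partition(root, "2026-04-21", [{"orders": 12}])   # backfill re-run
    partitions = sorted(os.listdir(root))
    with open(os.path.join(root, "dt=2026-04-21", "part-0.jsonl")) as f:
        n = sum(1 for _ in f)
```

A two-week backfill is then fourteen independent calls to write_partition, each safe to repeat, each touching nothing outside its own date.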

Catching up without double-counting

The subtle failure in incremental designs is that the bookmark itself can drift. If the bookmark is "the maximum event time seen so far", late-arriving events will be silently missed. If the bookmark is "the last successful run's end time", a re-run after a bug fix will skip the very rows the fix was for. The defensive pattern is to parameterise every incremental pipeline by a window (start-timestamp, end-timestamp) rather than by a single bookmark, and to record that window explicitly. The cost is minor; the payoff is that a backfill is "re-run these dates" rather than "surgically edit the bookmark table".
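A window-parameterised run, in sketch form (the source rows and manifest shape are invented): each run names the window it covers and records it, so a backfill is just a re-run of known windows.

```python
from datetime import date, timedelta

# Stand-ins: a source table of (event_date, amount), a date-keyed output,
# and a manifest that records every run's window.
events = [(date(2026, 4, 19), 10), (date(2026, 4, 20), 20),
          (date(2026, 4, 20), 5),  (date(2026, 4, 21), 7)]
output, manifest = {}, []

def run_window(start, end):
    """Process [start, end): the window is explicit and recorded, so a backfill
    is 're-run these windows', never 'surgically edit the bookmark'."""
    day = start
    while day < end:
        output[day] = sum(a for (d, a) in events if d == day)  # overwrite-by-date
        day += timedelta(days=1)
    manifest.append((start, end))

run_window(date(2026, 4, 19), date(2026, 4, 21))   # the normal daily catch-up
run_window(date(2026, 4, 20), date(2026, 4, 21))   # backfill one day: safe to repeat
```

The manifest list here is the toy version of the runs table the next paragraph argues for: it answers "which dates did this run cover" without archaeology.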

The rule of threes

If you cannot (1) describe which dates a given run covers, (2) re-run a single date independently of the others, and (3) count the number of rows that a given run added, then the pipeline is not really incremental; it is accidentally incremental, and it will eventually bite. Every production incremental pipeline should have a runs or manifest table that answers the first question explicitly.

Section 10

Testing, of the code and of the data

Pipelines fail in two ways: the code is wrong, or the data is wrong. Software engineering has spent decades on the first failure mode and has good tools for it; the second is more recent, more particular to data work, and is where most production incidents actually originate. A serious pipeline has tests for both.

Unit tests for the transformation

Every piece of non-trivial transformation logic — a SQL model, a Python transform, a UDF — should be tested with small synthetic inputs and expected outputs. Tools like dbt-unit-testing, SQLMesh's built-in unit tests, and pytest plus pandas/polars/DuckDB make this routine. The test runs in CI, on small in-memory fixtures, and proves that the code behaves on known cases. If this is the only kind of testing a pipeline has, it is better than most pipelines in the wild, but it is not enough.
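In miniature, with a pure transform and a synthetic fixture (both invented), in the pytest style of plain functions and bare asserts:

```python
def dedupe_latest(rows):
    """Keep the latest row per order_id: the kind of small, pure transform
    worth pinning down with synthetic fixtures in CI."""
    latest = {}
    for row in rows:
        key = row["order_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: r["order_id"])

def test_dedupe_keeps_latest():
    fixture = [
        {"order_id": 1, "updated_at": "2026-04-20", "status": "pending"},
        {"order_id": 1, "updated_at": "2026-04-21", "status": "shipped"},
        {"order_id": 2, "updated_at": "2026-04-19", "status": "pending"},
    ]
    out = dedupe_latest(fixture)
    assert [r["status"] for r in out] == ["shipped", "pending"]

test_dedupe_keeps_latest()   # pytest would collect this; here we call it directly
```

The same shape works for SQL: build the fixture as a small in-memory table, run the model's SELECT against it, compare to the expected rows.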

Data tests against the live output

The other kind of test runs against the actual output of a pipeline, after the pipeline runs. dbt's built-in tests — unique, not_null, accepted_values, relationships — are the canonical starting point; tools like Great Expectations, Soda Core, and elementary cover a wider vocabulary (row-count anomalies, distributional checks, freshness assertions, column-level stats). A data test is a guardrail, not a logic check: it asserts that the data coming out of the pipeline meets the invariants the downstream assumes, and fails the run (or blocks the publish) when it doesn't.

Freshness, volume, and anomaly checks

Beyond per-column tests, three pipeline-level invariants are worth encoding explicitly. Freshness: the output was written within the last N hours. Volume: the run wrote a row count between expected lower and upper bounds — the most common failure a data platform sees is "the pipeline ran successfully and produced zero rows". Anomaly: a headline metric (sum of revenue, count of distinct users) is within historical variance. These three checks, together, catch most of the incidents that per-column tests miss.
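The three checks are small enough to sketch directly (thresholds, values, and the crude anomaly rule are all invented for illustration):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_written_at, max_age_hours, now):
    """Freshness: the output was written within the last N hours."""
    return now - last_written_at <= timedelta(hours=max_age_hours)

def check_volume(row_count, lo, hi):
    """Volume: 'ran successfully and wrote zero rows' fails here, loudly."""
    return lo <= row_count <= hi

def check_anomaly(metric, history, tolerance=0.5):
    """Anomaly (crudely): metric within +/- tolerance of the historical mean."""
    mean = sum(history) / len(history)
    return abs(metric - mean) <= tolerance * mean

now = datetime(2026, 4, 21, 6, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2026, 4, 21, 4, 30, tzinfo=timezone.utc), 6, now)
volume_ok = check_volume(0, lo=1_000, hi=5_000_000)        # zero rows: caught
anomaly_ok = check_anomaly(98_000, history=[100_000, 102_000, 99_000])
```

Production tools replace the crude mean-and-tolerance rule with learned seasonal baselines, but the three questions being asked are exactly these.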

The cultural prerequisite

Tests only matter if someone cares when they fail. A test that always fails, or fails and is always ignored, is worse than no test: it trains the on-call to mute the alarm. The practical rule is that every test should be actionable — its failure either blocks publication of the output or creates a ticket a human will look at within the working day. Tests that fit neither category should be deleted.

Section 11

Observability, and the map from every byte to the row it became

Knowing that a pipeline ran is a low bar; knowing which pipeline produced which row, with which code, from which inputs, at which cost, is the bar that production discipline actually has to clear. The category of tooling that tries to answer those questions is data observability, and it has grown into a distinct market and a distinct set of practices over the past five years.

The three observability questions

A data observability system tries to answer three questions at any moment. Is the data fresh? — when was the table last updated, and is that within SLA. Is the data correct? — row counts, distributions, schemas, null rates look as expected. Where did the data come from? — the lineage graph upstream of this table, and the lineage graph downstream. Tools like Monte Carlo, Bigeye, Datafold, Acceldata, and the open-source elementary and OpenLineage projects each pick some blend of these.

Lineage, as a platform capability

Lineage is the graph that records, for every derived table, the chain of transformations and inputs that produced it. Column-level lineage — which input columns contributed to which output columns — is more useful than table-level lineage and strictly harder to produce; both are more useful than no lineage at all. Lineage earns its keep in three moments: diagnosing an incident ("what broke upstream of this bad dashboard"), planning a change ("what breaks if I drop this column"), and answering a compliance question ("where did this person's data propagate to").

Metrics, logs, and the observability stack

Underneath the data-observability category sit the same primitives as any other service: structured logs emitted by tasks, metrics (run duration, row counts, bytes processed, cost) emitted to a time-series system, and traces that link a pipeline run to the downstream assets it produced. OpenLineage is the emerging open standard for emitting those events; most orchestrators and dbt have native integrations. A team that invests early in shipping lineage events to a catalogue gets observability, impact analysis, and compliance for roughly the cost of shipping one kind of event.

The catalogue convergence

Data observability, lineage, and cataloguing are collapsing into a single product category. DataHub, OpenMetadata, Atlan, Collibra, Unity Catalog, and others are each some blend of "a place where tables are documented" and "a place where runs are tracked". The specific tool matters less than having one; an ML or data team without any catalogue eventually turns its Slack into one, badly.

Section 12

Schemas and contracts, the handshake between producer and consumer

A pipeline's output is a promise to its consumers. The formalisation of that promise — the set of columns, their types, their semantics, the guarantees about nulls and uniqueness — is a schema, and its rules of change over time are the contract. Teams that manage schemas and contracts well have smooth data platforms; teams that don't manage them end up with Slack channels full of "why did this break?".

Schema evolution, as a real problem

Schemas change: columns are added, types widened, nulls permitted where they weren't, names corrected. The consumer's tolerance for these changes differs by case. Adding a nullable column at the end is almost always safe; renaming a column is almost always not; changing a column's type is somewhere in between. Formats like Avro, Protobuf, and Parquet have well-defined compatibility rules (backward, forward, full) that can be enforced by a schema registry; open table formats (Iceberg, Delta) enforce evolution at the table level. The discipline is to declare the policy, not to hope that nothing breaks.
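
Declaring the policy can be as small as a check in CI. The sketch below is an illustrative consumer-compatibility rule (can consumers built against the old schema keep reading tables produced under the new one?); it is not Avro's or Protobuf's actual resolution algorithm, and the schemas are invented:

```python
def safe_for_consumers(old, new):
    """Return violations of a simple compatibility policy; empty means
    the change is safe for existing consumers. Illustrative only; real
    formats (Avro, Protobuf) have their own resolution rules, including
    permitted type widenings this sketch treats as breaking."""
    problems = []
    for col, spec in old.items():
        if col not in new:
            problems.append(f"removed column consumers may depend on: {col}")
        elif new[col]["type"] != spec["type"]:
            problems.append(
                f"type changed: {col} {spec['type']} -> {new[col]['type']}")
    return problems

v1 = {"user_id": {"type": "long"}, "email": {"type": "string"}}
v2_add = {**v1, "plan": {"type": "string", "nullable": True}}          # safe
v2_rename = {"user_id": {"type": "long"}, "mail": {"type": "string"}}  # breaks
```

Note how the three cases from the paragraph fall out: the added nullable column produces no violations, the rename produces one, and a type change would produce one that a real registry might instead allow as a widening.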

Data contracts

The term data contract is the 2020s formalisation of the idea that upstream producers owe downstream consumers a stable, tested interface. In practice a data contract is a declarative spec — a YAML file, a dbt model specification, a Protobuf schema — that says: this table will have these columns with these types, these quality guarantees, this freshness SLA, and will not break without a versioned, announced change. The engineering content is the enforcement: CI checks the contract on every PR, the orchestrator fails the run when the contract is violated, and the consumer depends on a stable interface rather than on whatever the upstream happens to be emitting this week.
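
A minimal sketch of that enforcement, assuming a contract declared as a dict (the shape a YAML file would parse to) and a checker the orchestrator runs before publishing; the table name, columns, and thresholds are all invented:

```python
from datetime import datetime, timedelta, timezone

# A data contract as a plain dict; a YAML file would parse to this shape.
CONTRACT = {
    "table": "marts.revenue_daily",
    "columns": {"day": "date", "revenue": "decimal", "currency": "string"},
    "not_null": ["day", "revenue"],
    "freshness_hours": 26,
}

def check_contract(contract, observed_columns, sample_rows, last_updated):
    """Return violations; an orchestrator would fail the run if non-empty,
    and CI would run the same check against a staging build on every PR."""
    violations = []
    for col, typ in contract["columns"].items():
        if observed_columns.get(col) != typ:
            violations.append(
                f"column {col}: expected {typ}, got {observed_columns.get(col)}")
    for col in contract["not_null"]:
        if any(row.get(col) is None for row in sample_rows):
            violations.append(f"null in guaranteed-not-null column: {col}")
    if datetime.now(timezone.utc) - last_updated > timedelta(
            hours=contract["freshness_hours"]):
        violations.append("freshness SLA missed")
    return violations
```

The point of the sketch is the division of labour: the contract is declarative data, and the same check runs in CI (against a staging build) and in production (gating the publish step).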

Producers and consumers, and where the line sits

The deepest argument the data-contracts discussion has produced is about who owns the contract. In one view — the Zhamak Dehghani data mesh view — each domain team owns its data products and their contracts outright, and the platform team provides the infrastructure. In another — the more traditional view — a central data-platform team owns contracts because only they see the whole graph. Most real organisations land on a hybrid, and the real work is the governance that makes either model operate, not the choice between them.

The honest baseline

Most data platforms have no explicit contracts and a great deal of implicit ones. Moving to explicit contracts is usually worth doing, is usually a multi-quarter project, and is usually faster if begun with the most-depended-on two or three tables and the teams that own them. Trying to contract the whole warehouse at once is the most common way the effort dies.

Section 13

Change data capture, moving deltas instead of snapshots

The naïve way to replicate a source database into a warehouse is to snapshot it nightly. This works for small tables and falls over for anything else. Change data capture — CDC — is the collection of techniques for streaming only the changes to a source, in the order they happened, with enough fidelity to rebuild the source's state downstream.

The three CDC strategies

CDC is usually implemented in one of three ways. Query-based CDC polls the source for rows whose updated_at exceeds the last seen watermark; simple, and missing anything that doesn't update that column (notably deletes). Trigger-based CDC installs database triggers that write every change to an audit table; thorough, and usually unwelcome to the DBA. Log-based CDC tails the database's write-ahead log — Postgres logical replication slots, MySQL binlogs, SQL Server change tracking — and emits an event per row change. Log-based is the high-fidelity, low-impact choice and is what tools like Debezium do; most serious setups end up there.
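
The watermark logic of query-based CDC fits in a few lines. A sketch against an in-memory stand-in for the source table; a real implementation would run `SELECT ... WHERE updated_at > %s` against the database, and the rows here are illustrative:

```python
from datetime import datetime

# In-memory stand-in for a source table with an updated_at column.
SOURCE = [
    {"id": 1, "email": "a@x.com", "updated_at": datetime(2024, 1, 1, 10)},
    {"id": 2, "email": "b@x.com", "updated_at": datetime(2024, 1, 1, 11)},
    {"id": 1, "email": "a@y.com", "updated_at": datetime(2024, 1, 2, 9)},
]

def poll_changes(watermark):
    """Fetch rows changed since the watermark, plus the new watermark.
    Note the blind spot the text describes: a hard-deleted row simply
    stops appearing, so deletes are invisible to this strategy."""
    changed = [r for r in SOURCE if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark
```

Carrying the watermark forward only when rows were actually seen is what makes the poll safe to re-run after a failure.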

Tools and the streaming / batch bridge

CDC is where streaming and batch meet. Debezium reads database logs and writes events to Kafka; downstream, either a streaming engine keeps a live copy of the source tables updated, or a batch pipeline periodically reads the events and applies them to a warehouse using MERGE. Fivetran, Airbyte, and Hevo all offer managed CDC connectors that hide the log-reading and produce daily or hourly merged tables. Which end of the latency spectrum fits depends on the consumer, but the underlying event stream is the same.
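
The apply step can be sketched as the in-memory equivalent of a warehouse MERGE keyed on the primary key. The event shape below loosely mirrors a Debezium-style feed (`op` codes for create, update, delete; an `after` image; a log sequence number), but the events themselves are invented:

```python
def apply_events(target, events):
    """Apply a batch of CDC events to a target table (dict keyed on the
    primary key); the in-memory equivalent of MERGE."""
    for ev in sorted(events, key=lambda e: e["lsn"]):  # apply in log order
        key = ev["key"]
        if ev["op"] == "d":        # delete: drop the row if present
            target.pop(key, None)
        else:                      # 'c' create / 'u' update both upsert
            target[key] = ev["after"]
    return target

table = {1: {"email": "a@x.com"}}
events = [
    {"lsn": 11, "op": "u", "key": 1, "after": {"email": "a@y.com"}},
    {"lsn": 12, "op": "c", "key": 2, "after": {"email": "b@x.com"}},
    {"lsn": 13, "op": "d", "key": 1, "after": None},
]
```

Whether the batch arrives hourly from Kafka or daily from a managed connector, the apply logic is the same; only the cadence differs.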

What CDC does not solve

Change-data-capture gives you deltas; it does not automatically give you a correctly-modelled target table. Handling deletes, reconstructing slowly-changing dimensions, deduplicating re-delivered events, and reconciling source-side transactional boundaries are all still your job. A common failure is treating CDC as a finished pipeline; it is in fact the ingestion half of a pipeline, and the modelling half still has to be built and tested like any other transformation.

Exactly-once, in practice

Every CDC provider claims some form of exactly-once or at-least-once delivery; in practice all deliver at-least-once plus deduplication-is-your-problem. The operational rule is that every CDC event should carry a monotonically increasing log-sequence number, and every consumer should deduplicate on it. The moment a consumer pretends the stream is exactly-once, it acquires a silent-duplicate bug waiting for a network blip.
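
That operational rule, as a minimal sketch; it assumes the stream arrives in log order (which log-based CDC provides within a partition), and a real consumer would persist the watermark transactionally alongside its writes:

```python
class DedupingConsumer:
    """At-least-once delivery plus consumer-side dedup: keep the highest
    log-sequence number applied, drop anything at or below it."""

    def __init__(self):
        self.applied_lsn = -1
        self.applied = []

    def handle(self, event):
        """Apply the event once; return False for re-delivered duplicates."""
        if event["lsn"] <= self.applied_lsn:
            return False
        self.applied.append(event)
        self.applied_lsn = event["lsn"]
        return True
```

A network blip that re-delivers the last few events now costs nothing; the consumer that skipped this check would apply them twice.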

Section 14

Reverse-ETL, and the closing of the analytics loop

For most of warehouse history, data flowed one way: from operational systems into the warehouse, where it was analysed and never seen again by the operational stack. Reverse-ETL is the recent practice of sending data back out — from the warehouse to Salesforce, Marketo, Zendesk, the product's own database — and it has collapsed what used to be a one-way mirror into an operational loop.

Why the loop closed

Two conditions made reverse-ETL possible. First, the warehouse became the richest repository of modelled customer data inside most companies — richer than any single operational system because it joined across all of them. Second, the lakehouse and ELT discipline meant the warehouse was fresh enough, and its schemas stable enough, that operational systems could reasonably depend on it. Reverse-ETL tools — Census, Hightouch, and the reverse-ETL features inside Fivetran — are the piping that sends specific warehouse tables into specific SaaS destinations on a schedule.

Operational analytics, as a pattern

The use case that drove reverse-ETL is operational analytics: the warehouse computes which customers are at risk of churning, or which leads are most valuable, or which users are active this week, and those computed segments are pushed to the tools that marketing, sales, and support actually use. The warehouse becomes the system of record for cross-system metrics; the SaaS tools become activation layers. The pattern works when the warehouse's latency and the consumer's tolerance match; it fails when a team tries to use reverse-ETL to drive real-time operational behaviour for which streaming would be the right tool.

The governance dimension

Once the warehouse is writing into operational systems, its PII-handling story becomes an obligation rather than a preference. Email addresses, phone numbers, and identifiers flow back out of the warehouse into tools the warehouse team may not control; consent flags have to be honoured, data-subject-access-requests have to propagate both ways, and the lineage graph from section eleven now has real audit consequences. Reverse-ETL is the point at which analytical discipline and operational risk management stop being separate concerns.

Not every table should leave

The attractive thing about reverse-ETL is also its hazard: if it is easy to push any warehouse table anywhere, teams will. Most organisations benefit from a small, well-governed list of tables whose destinations are reviewed — customer segments, lifecycle states, computed attributes — and a firm default of no for everything else. The cost of an accidental push of a raw join into a CRM is higher than the cost of waiting a day for review.

Section 15

Feature, training, and inference pipelines, the three ML pipelines

A production ML system is not one pipeline; it is three, and getting the relationship between them right is what separates "we have a model" from "we have a model in production". The three are the feature pipeline, the training pipeline, and the inference pipeline, and each has its own cadence, its own correctness concerns, and its own failure modes.

The feature pipeline

The feature pipeline computes model inputs — derived columns like rolling averages, session-level aggregates, user lifetime statistics — from the warehouse and writes them to a feature store (Feast, Tecton, Databricks Feature Store) or a feature table. Its correctness concern is point-in-time: training features must reflect only what was known at the time of each example, and serving features must reflect the same definitions. Time-travel bugs here — where training accidentally sees the future — are the single most common cause of models that work in notebooks and not in production.
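
The point-in-time rule is simple to state in code: for each example, take the latest feature value valid at or before the example's timestamp, never a later one. A minimal sketch over an invented feature history:

```python
from bisect import bisect_right
from datetime import datetime

# user_id -> list of (valid_from, value), sorted by valid_from.
# Data is illustrative.
FEATURE_HISTORY = {
    "u1": [(datetime(2024, 1, 1), 0.2), (datetime(2024, 1, 10), 0.9)],
}

def feature_as_of(user_id, ts):
    """Latest feature value with valid_from <= ts, else None.
    Using anything later than ts is exactly the time-travel bug."""
    history = FEATURE_HISTORY.get(user_id, [])
    times = [t for t, _ in history]
    i = bisect_right(times, ts)   # number of entries valid at ts
    return history[i - 1][1] if i else None
```

A training example timestamped 5 January must see 0.2 even though 0.9 exists in the table; a join that ignores `valid_from` would leak the future value into training.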

The training pipeline

The training pipeline consumes historical features and labels and produces a model artifact. It runs less often than the other two — weekly, monthly, or on some trigger — but its run is expensive and its output is versioned. Orchestrators (Airflow, Kubeflow Pipelines, Metaflow, MLflow's pipeline features) handle the scheduling; the artifacts go to a model registry; the evaluation suite gates promotion to production. The pipeline's outputs are not only the model but also the evaluation report, the training-data manifest, and the configuration — all of which are needed to reproduce the run months later.

The inference pipeline

The inference pipeline consumes live features and a trained model and produces predictions. It may be a batch scoring job that runs nightly, a streaming consumer that scores events as they arrive, or an online service that answers per-request in milliseconds. Its correctness concern is training-serving skew: the features computed at inference time must match the features the model was trained on, bit-for-bit. A feature store exists largely to solve this problem by making the training and inference paths share code and definitions.
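
The shared-definition idea can be sketched without a feature store at all: one function owns the arithmetic, and both the training and serving paths call it. The feature and the helper names are invented for illustration:

```python
from datetime import timedelta

def order_count_7d(order_timestamps, now):
    """Count of orders in the trailing 7 days; the single source of truth
    for this feature's definition."""
    cutoff = now - timedelta(days=7)
    return sum(1 for t in order_timestamps if cutoff < t <= now)

def training_row(history, label_time):
    # Training path: evaluated at the label's historical timestamp.
    return {"order_count_7d": order_count_7d(history, label_time)}

def serving_features(history, now):
    # Serving path: evaluated at request time, same definition.
    return {"order_count_7d": order_count_7d(history, now)}
```

Skew creeps in the moment someone reimplements the window in serving-side SQL with `>=` instead of `>`; sharing the function (or, at scale, the feature store's definition) removes that class of bug.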

The three-cadence model

Features move at warehouse cadence (hourly to daily); models move at experiment cadence (weeks to months); predictions move at product cadence (milliseconds to minutes). The gaps between the three are where most ML incidents live — a feature that changed meaning without the model being retrained, a model that is older than its feature schema, a prediction path that silently diverged from the training path. Explicit pipelines at each cadence, with contracts between them, are how production ML systems stay honest.

Section 16

Secrets, retries, SLAs, and cost, the operational surface

Idempotence and testing make a pipeline correct. Running it in production means also making it operable: credentials managed safely, failures retried sensibly, latencies tracked against commitments, bills kept from running away. These concerns are not glamorous, and yet they are where the largest class of production incidents actually originates.

Secrets, credentials, and the connection sprawl

Every pipeline needs credentials: database passwords, cloud API keys, SaaS tokens, private-registry credentials. The rule is that they never live in code and never live in the orchestrator's plain-text configuration. A secrets manager — AWS Secrets Manager, Google Secret Manager, HashiCorp Vault, the orchestrator's native secret backend — is the only durable answer. The adjacent discipline is credential rotation: a production pipeline that does not tolerate a rotated credential will, eventually, go down at the worst moment because someone rotated it.
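
One way to make "never in code, never in plain-text config" mechanical is to let config carry only references and resolve them through a backend at startup. A sketch with an invented `secret://` convention and an environment-variable backend standing in for a real secrets manager:

```python
import os

def resolve_config(config, get_secret=os.environ.__getitem__):
    """Resolve 'secret://NAME' references through a pluggable backend and
    refuse configs that carry secrets inline. The env-var backend here
    stands in for AWS Secrets Manager / Vault; the 'secret://' scheme and
    the key heuristic are illustrative conventions, not a real standard."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith("secret://"):
            resolved[key] = get_secret(value[len("secret://"):])
        elif "password" in key or "token" in key:
            raise ValueError(f"plaintext secret in config: {key}")
        else:
            resolved[key] = value
    return resolved
```

Because `get_secret` is injectable, swapping the environment backend for a call into a real secrets manager changes one argument, and rotation becomes a matter of re-resolving at the next run rather than redeploying.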

Retries, backoff, and transient-versus-permanent

Most pipeline failures are transient: a network blip, a queue pause, a temporary API rate limit. A well-configured retry policy — exponential backoff, a retry count, a per-error-type classifier — resolves most of them without human intervention. The failure mode to engineer around is the permanent error — a malformed input, a logic bug, a missing credential — being retried forever. Distinguishing the two is what makes the on-call shift bearable. Most orchestrators (Airflow, Prefect, Dagster, Temporal) have first-class retry configuration; using it well is more than setting retries=3.
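
The transient-versus-permanent distinction is worth seeing in code. A minimal sketch, with the error taxonomy and delays invented for illustration (an orchestrator's native retry config is the production equivalent):

```python
import time

class TransientError(Exception):
    """A failure worth retrying: network blip, rate limit, queue pause."""

class PermanentError(Exception):
    """A failure retrying cannot fix: bad input, logic bug, missing credential."""

def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff; surface permanent
    ones immediately. `sleep` is injectable so tests need not wait."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PermanentError:
            raise                      # retrying a logic bug helps no one
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

The per-error-type classifier is the part `retries=3` omits: without it, a malformed input gets the same three retries as a rate limit, and the on-call gets paged an hour later than they should have been.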

SLAs, SLOs, and cost tracking

Every pipeline that downstream consumers depend on should have an explicit freshness commitment — an SLA — and an internal target — an SLO — that is stricter than the SLA. Orchestrators can emit SLA-miss events, which the on-call gets paged for before the customer-visible SLA is actually broken. The cost dimension is similar: each run emits bytes-scanned, compute-time, and dollar metrics to an observability system, and a budget alert fires when any pipeline's cost drifts. In a warehouse-centred architecture where every query has a price, cost observability is not optional.
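
The SLA/SLO relationship is small enough to sketch directly; the 24-hour SLA and 18-hour SLO below are illustrative numbers, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=24)   # the external freshness commitment
SLO = timedelta(hours=18)   # the stricter internal target

def freshness_status(last_updated, now):
    """Classify a table's freshness; an SLO miss pages the on-call while
    the externally visible SLA still holds."""
    age = now - last_updated
    if age > SLA:
        return "sla_breached"
    if age > SLO:
        return "slo_missed"   # page now, before the SLA is publicly broken
    return "fresh"
```

The six-hour gap between the two thresholds is the on-call's budget for fixing the pipeline before any consumer notices; the cost metrics work the same way, with a budget threshold sitting below the hard limit.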

The on-call checklist

When a pipeline wakes someone up, they should be able to answer four questions from the UI in under a minute: what ran, what failed, what data was touched, who depends on the output. A platform where those answers take an hour to assemble from logs, lineage, and Slack archaeology is not operable; a platform where they are one click away from any run is the observable platform this chapter has been building toward.

Section 17

Where pipelines compound into the whole ML operation

The modelling work that makes machine learning look interesting from the outside sits on top of the pipeline discipline described in this chapter. Every property an ML system has in production — freshness, reproducibility, auditability, speed of iteration, cost — is inherited, transitively, from the quality of the pipelines underneath it.

Reproducibility is a pipeline property

A model is reproducible when its inputs are reproducible, which means the feature pipeline that produced them is reproducible, which means the upstream warehouse models are reproducible, which means the raw ingestion pipelines are reproducible. Every link in that chain has to be idempotent, deterministic, and version-stamped for the end-to-end claim to hold. Teams that treat reproducibility as an MLflow configuration setting usually discover, a quarter in, that the real issue is a CDC feed whose schema drifted silently, or a dbt model whose definition changed without a contract break.

Shipping speed is a pipeline property

The single biggest determinant of how fast a data scientist can iterate is how fast the pipeline graph produces a usable training table. Teams with hour-scale pipeline turnarounds ship models in weeks; teams with day-scale turnarounds ship in quarters; teams without a functional pipeline graph do not ship at all. The investment in orchestration, testing, and observability does not look like velocity investment — it looks like infrastructure — but it is the largest single velocity lever the field has.

Trust is a pipeline property

The adoption of any data-derived decision — a model's score, a dashboard's headline metric, a reverse-ETL-driven campaign — is bounded by the recipient's confidence in the pipeline that produced it. Confidence is earned by freshness SLAs being met, tests passing, lineage being legible, and incidents being rare and well-communicated. A team that is on top of these wins adoption; a team that isn't finds every interesting output eventually second-guessed.

The through-line

Pipelines are the most load-bearing and least glamorous part of a data or ML platform. The tools — Airflow, Prefect, Dagster, dbt — are less important than the disciplines: idempotence, incremental design, testing, observability, contracts. Every chapter that follows in Part III — streaming, distributed compute, cloud infrastructure, governance — sits on top of the pipeline layer this chapter describes. Getting it right early is how everything above it stays shippable.

Further reading

Where to go next

Pipelines are a field with a thin canonical bibliography and a very large practical one. The list below leans on the books worth reading cover to cover, the foundational papers, the official documentation of the main orchestrators and transformation tools, and the operational references that working data engineers return to once the pipeline is no longer being written but being run.

The canonical books

Foundational and influential papers

Orchestrators — official documentation

Transformation — dbt and SQLMesh

Ingestion, CDC, and streaming

Testing, observability, and lineage

ML-adjacent pipeline topics

This page is the third chapter of Part III: Data Engineering & Systems. The next — Streaming & Real-Time Systems — takes the batch-versus-stream distinction of section three and opens up the stream side: Kafka in depth, Flink, Spark Structured Streaming, event-time semantics, exactly-once delivery, and the operational differences streaming platforms impose on a team. After that come distributed compute, the cloud platforms everything runs on, and finally the governance layer that keeps it all auditable.