Part III · Data Engineering & Systems · Chapter 04

Streaming, and the data that refuses to wait.

Not every piece of data can be held for the nightly batch. Payments clear in seconds, fraud has to be caught in milliseconds, a user's next recommendation is wanted before they scroll away, a sensor emits a reading that matters only while it is fresh. Streaming is the architectural answer: an unbounded, ordered flow of events rather than a bounded file; a processor that updates continuously rather than once a day; a set of correctness guarantees — exactly-once, event-time, watermarks — that batch never had to think about. The tools that carry the events — Kafka, Pulsar, Kinesis — and the engines that compute over them — Flink, Spark Structured Streaming, Kafka Streams — are the core of this chapter. So is the harder part: the semantics that make streaming correct rather than just fast.

How to read this chapter

The first three sections frame the subject: why streaming exists at all given how well batch works, why the append-only log is the primitive the whole field is built on, and how stream processing actually differs from the batch/stream sketch in Chapter 03. Sections four through six are the transport layer — Apache Kafka as the dominant broker, its ecosystem (Connect, Streams, Schema Registry), and the alternatives (Pulsar, Kinesis, Redpanda, NATS) that matter in different niches. Sections seven through nine move from transport to computation: what stream processing actually is, Apache Flink as the reference engine, and Spark Structured Streaming as the micro-batch alternative. Sections ten through thirteen are the semantic core of the chapter — event time versus processing time, windowing, state and checkpoints, and the three delivery guarantees (at-most-once, at-least-once, exactly-once) — the ideas that separate a correct streaming system from one that is merely live. Sections fourteen through sixteen widen the frame: event-driven architectures beyond analytics, streaming ML (online features, real-time inference), and the operational reality of running streams at scale (lag, backpressure, ordering). The final section connects all of it to why streaming infrastructure is increasingly on the critical path for modern ML.

Conventions: vendor names are used where there is no good generic term — Kafka specifically, brokers generically; Flink specifically, stream engines generically. References to "the warehouse" still assume the storage layer from Chapter 02, and the pipelines from Chapter 03 are assumed as context for how batch and stream interact in a real platform. The treatment of semantics (event time, watermarks, exactly-once) is deliberately careful: these are the topics where intuition from batch tends to mislead, and where most production streaming incidents originate.

Contents

  1. Why streaming exists at all · The latency pressure
  2. Events and the append-only log · The primitive
  3. Stream processing versus batch, revisited · What actually changes
  4. Apache Kafka · The dominant broker
  5. The Kafka ecosystem · Connect, Streams, Schema Registry
  6. Pulsar, Kinesis, Redpanda, NATS · The alternatives that matter
  7. What stream processing actually is · Stateless versus stateful
  8. Apache Flink · The reference stream engine
  9. Spark Structured Streaming · Micro-batching and the unified API
  10. Event time, processing time, watermarks · The semantic core
  11. Windowing · Tumbling, sliding, session
  12. State and checkpoints · Keeping memory across events
  13. Delivery semantics · At-most, at-least, exactly-once
  14. Event-driven architectures · Beyond analytics
  15. Streaming ML · Online features and real-time inference
  16. Operating streaming systems · Lag, backpressure, ordering
  17. Where it compounds in ML · Why streaming is on the critical path

Section 01

Streaming exists because some data will not wait until tomorrow

Batch is the default, and for most analytical work it is also the right answer. Streaming is the architecture you adopt when the default stops working — when the gap between an event happening in the world and a decision being made about it has to collapse from hours to seconds, and the whole shape of the system has to change to make that collapse possible.

The latency pressure

A surprising amount of software is a latency argument. Fraud detection that fires after yesterday's batch has run is useless; a recommendation that reflects a user's previous session is worth a tenth of one that reflects their current one; an alert on a failing production line that arrives an hour late is a postmortem, not an alert. Each of these is a case where the value of the data decays faster than a batch pipeline can deliver it, and each of them pushes the underlying architecture toward streaming. The question is rarely "should this be real-time?" in the abstract; it is "what is it worth to compute this in one hundred milliseconds instead of one hour?" — and for a narrow but important set of use cases the answer is: a lot.

What streaming actually is

A streaming system has three properties that distinguish it from a batch pipeline. Data arrives as an unbounded sequence of events rather than as a finite file. Processing runs continuously rather than on a schedule. And the system's correctness is defined in terms of invariants over time — "the running total is accurate within five seconds of the latest event", "no event is lost", "each event is counted exactly once" — rather than in terms of a completed batch's output matching an expected result. All three together are what make streaming a different discipline rather than a faster batch pipeline.

Why it is harder

Every concern that batch handles implicitly becomes explicit in a streaming system. Ordering: events arrive out of order, and the correct handling depends on the semantics the consumer expects. Time: is a count of "events in the last hour" based on when the event happened, when it arrived, or when it was processed? Failure: a stream runs forever, which means a worker crash cannot mean "restart the job" — it has to mean "resume from where the worker left off, without losing or duplicating events". State: most interesting stream computations are stateful (counts, joins, aggregates), which means there has to be a recovery story for the state itself. The rest of this chapter is the systematic answer to each of those problems.

The wrong default

The single most common streaming mistake is adopting it where batch would do. Streaming adds operational load, debugging difficulty, and cost; it earns its place only where the latency requirement genuinely demands it. A reasonable default is: batch first, streaming when batch no longer fits. Teams that invert this — stream first, batch only where stream is clearly overkill — spend a large fraction of their engineering budget on problems their use case never actually had.

Section 02

The append-only log, the primitive everything else is built on

The conceptual core of modern streaming is not a queue, a database, or a pub/sub bus — it is the log. An ordered, append-only, immutable sequence of records, partitioned for scale and replicated for durability. Most of what makes streaming systems behave the way they do is a direct consequence of this single data structure.

What an event is

An event is a fact: it records that something happened at a specific time, it has a key (a user, an order, a sensor), it carries a payload, and it is immutable once written. "User 42 clicked product 7 at 2026-04-21T13:02:11Z" is an event; so is "Order 9001 was paid" and "Sensor A reported 72°F". The discipline is to represent what happened, not the resulting state — the state is a view of the event stream, not the stream itself. Jay Kreps's 2013 essay "The Log" is the canonical articulation of this inversion, and it is the conceptual step most batch engineers have to make to think clearly about streaming.

Why append-only and immutable

Once written, an event is never modified. Corrections are themselves new events ("order 9001 was refunded"), and the derived state adjusts accordingly. The immutability property is what makes a log a useful foundation: it can be replayed, re-processed by a different consumer, re-read from the beginning by a new model, and audited — properties that an in-place-mutated database table does not naturally provide. The cost is that the log grows; retention policies and compacted topics (see below) are the answer.

Partitioning, offsets, and replicated durability

A log at scale is not a single sequence; it is a set of partitions, each of which is its own ordered sequence. Events are routed to partitions by a key, so all events for the same user land on the same partition in order. Each event in a partition has a monotonically increasing offset; consumers track where they are by remembering the last offset they processed. Partitions are replicated across brokers so that the loss of any single broker does not lose events. These three properties — partitioned for throughput, offset-addressed for resumability, replicated for durability — together are the thing that makes a log usable as the backbone of a streaming platform.
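The three properties can be modelled in a few lines. The sketch below is a toy in-memory log, not Kafka: the partition count, the md5-based router, and the tuple records are all illustrative, but the invariants — same key, same partition; offsets as resume points; sequential reads only — are the ones described above.

```python
# Toy model of a partitioned, offset-addressed log (illustrative only).
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]  # each inner list is one ordered log

def partition_for(key: str) -> int:
    # Stable hash, so "user-42" always maps to the same partition.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def produce(key: str, value: str) -> tuple[int, int]:
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1  # (partition, offset) of the appended event

def consume(partition: int, from_offset: int) -> list[tuple[str, str]]:
    # Sequential read from an offset -- the only read pattern a log offers.
    return partitions[partition][from_offset:]

# All events for one key stay ordered within one partition.
p1, o1 = produce("user-42", "clicked product 7")
p2, o2 = produce("user-42", "added to cart")
assert p1 == p2 and o2 == o1 + 1
```

A consumer that remembers `(partition, offset)` pairs across restarts has everything it needs to resume without loss — which is exactly the contract real brokers expose.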

Kafka as the reference implementation

The log abstraction predates Kafka, but Kafka is what made it operationally mainstream. "A distributed, replicated, partitioned, append-only commit log" is both a description of Kafka and a description of the abstraction the rest of this chapter will take as given. Alternatives (Pulsar, Kinesis, Redpanda) implement the same abstraction with different trade-offs; the vocabulary is substantially shared.

Section 03

Stream processing versus batch, the differences that actually matter

Chapter 03 opened the batch/stream distinction; this chapter has to close it. The useful view is not that stream is "batch but faster", but that several properties — the shape of the input, the definition of correctness, the cost profile, the failure handling — change together when you cross the line.

Bounded versus unbounded input

A batch job reads a bounded dataset — yesterday's Parquet files, this hour's partition, the Postgres table as of now — and produces a bounded output. A streaming job reads an unbounded input and produces an unbounded output. The practical consequence is that "done" is a state a batch job is in and a stream job never is. Every streaming system's vocabulary — windows, watermarks, checkpoints — exists so that a job whose input is infinite can still produce useful finite answers along the way.

Different failure models

When a batch job fails, the operational move is "re-run from the top". That is not available to a streaming job. The state has to survive the failure, the in-flight events have to be replayed without being processed twice, and the output has to remain correct after the restart. Every serious stream processor solves this with some variant of the same pattern: periodically snapshot the state, record the input offsets at the snapshot boundary, and on restart roll back to the most recent snapshot and replay from the recorded offsets. The difference in failure handling is one of the largest practical differences between the two worlds.

Costs and latencies are differently shaped

A batch pipeline's cost is proportional to the data it processes and easy to predict: a nightly job processes a day's worth of data, so the bill is the per-unit cost times one day of data. A streaming pipeline's cost is proportional to wall-clock time — it runs continuously, paying for compute whether or not events are arriving — and to the state it maintains. For workloads with high event rates the stream can be cheaper; for workloads with low rates it can be surprisingly expensive. The latency shape is the inverse: batch latency is its own run duration plus the schedule gap (typically hours), streaming latency is measured in seconds or milliseconds.

The honest middle ground

Real platforms are almost always hybrid. The warehouse is batch; the event bus is streaming; a CDC feed bridges between them. A reasonable design principle is to use batch for everything the consumer can tolerate it for, and streaming only for the specific data whose value decays faster than batch can deliver. Chapter 03's pipeline discipline applies to both; what this chapter adds is the semantics that only the stream side needs.

Section 04

Apache Kafka, the broker the whole field assumes

Kafka is, for streaming, what Parquet is for storage: the default. Written at LinkedIn, open-sourced in 2011, now the Apache project at the centre of most event-driven architectures — it is the single technology whose vocabulary the rest of streaming has adopted. An understanding of what Kafka actually is, and is not, is the load-bearing piece of this chapter.

Brokers, topics, partitions, replicas

A Kafka cluster is a set of brokers — servers that store and serve events. Events are written to topics, which are logical names; each topic is split into one or more partitions, each of which is an ordered, append-only log stored on disk and replicated across brokers. Producers write events to a topic; the partition is chosen by hashing a key (or round-robin if no key). Consumers read from one or more partitions, tracking their position (the offset) so that a restart resumes without loss. Every higher-level concept in Kafka — consumer groups, transactions, compacted topics — is built on this substrate.

Producers and consumers, and what each guarantees

A Kafka producer writes batches of records to a partition; its knobs (acks, retries, enable.idempotence) control what happens on network failure and how strong the delivery guarantee is. A Kafka consumer reads from partitions in order, processes events, and commits its offsets back to the broker so that a restart knows where to resume. Consumer groups let multiple consumers share the work of reading a topic: each partition is assigned to exactly one consumer in the group, and adding consumers up to the partition count is how a pipeline horizontally scales.
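The consumer-group invariant — every partition owned by exactly one consumer in the group — can be sketched as a plain assignment function. The round-robin strategy below is illustrative; real brokers negotiate assignments through a rebalance protocol.

```python
# Sketch of consumer-group partition assignment (illustrative round-robin,
# not the broker's actual rebalance protocol).

def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Partition i goes to consumer i mod group size: each partition has
        # exactly one owner, and adding consumers spreads the partitions out.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign(list(range(6)), ["c1", "c2"])
owned = sorted(p for ps in a.values() for p in ps)
assert owned == [0, 1, 2, 3, 4, 5]  # every partition appears exactly once
```

The scaling ceiling falls directly out of the invariant: with six partitions, a seventh consumer in the group would own nothing, which is why partition count is chosen with future parallelism in mind.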

What Kafka is and is not

Kafka is a distributed log; it is not a database, and not a message queue in the classical RabbitMQ sense. It does not provide per-message acknowledgements (it provides per-offset commits), does not provide random access to events (it provides sequential reads from an offset), and does not provide complex routing (it provides topics and partitions, and the routing is the consumer's problem). The result is very high throughput, very durable storage, and a simple-but-not-easy programming model. Tools built around Kafka — Streams, ksqlDB, Flink connectors — provide the higher-level patterns; Kafka itself stays minimal.

KRaft and the end of ZooKeeper

Until recently, a Kafka cluster required a separate ZooKeeper ensemble for metadata coordination; Kafka 2.8 introduced KRaft, a Raft-based quorum protocol that moved metadata into the brokers themselves, and Kafka 3.x made it the default. Operational simplification is real; new deployments should start on KRaft without looking back.

Section 05

The Kafka ecosystem, Connect, Streams, Schema Registry, ksqlDB

Kafka itself is a log. The operational value comes from the ecosystem that has grown around it — the connector framework that moves data in and out, the stream-processing library that runs in-process, the schema registry that keeps producers and consumers honest, and the SQL-style query layer that sits over all of it.

Kafka Connect

Kafka Connect is a framework for moving data between Kafka and external systems — databases, object storage, warehouses, search indexes, SaaS applications — via declarative JSON configuration rather than bespoke code. It runs as its own process (a Connect cluster), hosts source connectors that emit data into Kafka and sink connectors that write data out, and handles offset tracking, at-least-once delivery, and failure recovery generically. Hundreds of community and vendor-supplied connectors exist; the Debezium CDC connectors (section 13 of Chapter 03) are among the most consequential.

Kafka Streams and ksqlDB

Kafka Streams is a Java/Scala library — not a separate cluster — that embeds a stream processor inside an application. You write transformations using a DSL (map, filter, join, windowedBy), and the library handles partition assignment, state store management (RocksDB-backed), and checkpointing using Kafka itself. ksqlDB sits on top and lets you express the same patterns as SQL. Neither competes directly with Flink; both are the right choice when the processing logic belongs with the application that already uses Kafka, rather than in a separate cluster.

Schema Registry and Avro/Protobuf discipline

An event's payload is bytes; making those bytes self-describing and evolvable is what a schema registry does. Confluent's Schema Registry is the canonical implementation: producers register Avro, Protobuf, or JSON Schema definitions with the registry, publish events carrying a schema ID, and the registry enforces compatibility rules (backward, forward, full) on every evolution. Consumers look up schemas by ID and deserialise safely. A Kafka cluster without a schema registry is a Kafka cluster where producers silently break consumers; the registry is not optional for production platforms.

Confluent, Strimzi, and the distribution question

Apache Kafka is the upstream project; most production deployments run either a managed service (Confluent Cloud, AWS MSK, Aiven, Redpanda Cloud) or an operator-based self-hosted cluster (Strimzi on Kubernetes). Running Kafka on bare VMs without either is possible and increasingly rare; the operational discipline involved — broker rolling, partition balancing, rack-aware replication, retention policy — is enough that most teams adopt the operator or the managed service rather than build it from scratch.

Section 06

Pulsar, Kinesis, Redpanda, NATS, the alternatives that matter

Kafka is the default but not the only choice. Four alternatives show up often enough to matter: Pulsar and Kinesis in the broker slot, Redpanda as a Kafka-protocol-compatible rewrite, and NATS in the lighter-weight messaging niche. Each is a reasonable answer in specific circumstances.

Apache Pulsar

Pulsar, originally built at Yahoo, separates the serving layer (brokers) from the storage layer (Apache BookKeeper), which makes scaling storage independent of scaling throughput. It has first-class multi-tenancy, geo-replication built in rather than bolted on, and native support for both queue-like and log-like consumption. For organisations that need heavy multi-tenant isolation or geo-replication, Pulsar's architecture is a genuinely better fit than Kafka's; for most other cases the larger Kafka ecosystem wins.

AWS Kinesis

Kinesis is Amazon's managed streaming service, with similar partitioned-log semantics to Kafka but a simpler (and narrower) API. Two products worth distinguishing: Kinesis Data Streams, the broker-equivalent, and Kinesis Data Firehose, a no-code pipeline that lands events into S3/Redshift/OpenSearch. On AWS-native stacks Kinesis is often the pragmatic choice; on anything multi-cloud or self-hosted, Kafka is usually preferred because the ecosystem is portable.

Redpanda, and the Kafka-compatible rewrite

Redpanda is a C++ reimplementation of the Kafka protocol that runs as a single binary, without JVM and without ZooKeeper. It uses the same client libraries, the same Schema Registry, and the same Kafka Connect ecosystem; from the application's perspective it is Kafka. What it changes is operations: lower memory, tighter tail latency, simpler deployment. For teams that want Kafka-the-protocol without JVM-the-tax, it is the obvious alternative.

NATS, and the lightweight end of the spectrum

NATS is a different category: a lightweight pub/sub messaging system with a persistent-log add-on (JetStream). Its strength is latency and simplicity — microseconds rather than milliseconds, a binary that runs on a Raspberry Pi. It does not replace Kafka for heavy analytical streams, but for microservice messaging, request/reply, and IoT-scale deployments, it is often a better fit.

The convergence

Pulsar, Kinesis, and Redpanda have converged on the same log-first model, and the Kafka protocol has become a lingua franca: Redpanda speaks it natively, Pulsar offers a Kafka-protocol compatibility layer, and Kinesis keeps its own API but the same partitioned-log semantics. The selection is now more about operations, cost, and cloud fit than about fundamentally different semantics; NATS remains the deliberate exception at the lightweight end of the spectrum.

Section 07

Stream processing, stateless, stateful, and the join

A broker moves events; a stream processor computes over them. The set of computations stream processors support is smaller than a batch SQL engine's but includes the operations that most real-time applications need: per-event transforms, aggregations over windows, joins between streams, and stateful pattern detection. Understanding the three shapes of computation is how the rest of the chapter becomes legible.

Stateless transforms

The simplest operations — map, filter, flatMap — are stateless: the output of any single event depends only on that event's contents. Enriching an event with a lookup from a static table, filtering events by type, parsing a raw payload into typed fields: all fall into this category. Stateless operators are easy to scale (each event is independent) and easy to reason about (a failure loses nothing except the in-flight event, which is replayed from the log). The majority of production stream code is stateless; the part that isn't is where the interesting work is.
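The three examples above are just function composition over an event iterator. A sketch with generator pipelines — the payload shape and the static "region" enrichment are invented for illustration:

```python
# Stateless transforms as plain functions over a stream of events: each
# output depends only on the event in hand, so any event can be reprocessed
# independently after a failure.
import json

raw = ['{"type": "click", "user": 42}', '{"type": "ping"}']

parsed = (json.loads(line) for line in raw)              # map: parse payload
clicks = (e for e in parsed if e["type"] == "click")     # filter: by type
enriched = ({**e, "region": "eu"} for e in clicks)       # map: enrich from a static lookup

out = list(enriched)
# out == [{"type": "click", "user": 42, "region": "eu"}]
```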

Stateful aggregations and joins

An aggregation — a running count, a rolling average, a top-K — needs to remember something across events. So does a join between two streams (enrich a clickstream event with the user's current profile) or between a stream and a table. A stream processor with state stores (backed by RocksDB or a distributed key-value store) and a checkpointing mechanism (see section twelve) is what makes these tractable. The state is a first-class concern: it has a size, a memory footprint, a recovery story, and a cost.
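As a minimal illustration, a running count per key is a complete (if tiny) stateful operator — a dict standing in for the RocksDB-backed state store a real engine would use, with no recovery story yet:

```python
# A minimal stateful operator: the state is whatever must be remembered
# between events -- here, the current count per key.
from collections import defaultdict

class RunningCount:
    def __init__(self):
        self.state: dict[str, int] = defaultdict(int)  # the operator's state

    def process(self, key: str) -> tuple[str, int]:
        self.state[key] += 1
        return key, self.state[key]  # emit the updated count downstream

op = RunningCount()
results = [op.process(k) for k in ["a", "b", "a", "a"]]
# results == [("a", 1), ("b", 1), ("a", 2), ("a", 3)]
```

Everything the section names as a first-class concern — size, footprint, recovery, cost — attaches to that `state` dict the moment the key space grows or the process can crash.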

Pattern detection and CEP

Complex event processing (CEP) is the oldest genre of streaming: detect patterns ("three failed logins then a successful one", "price rose more than 5% then reversed") across a stream. Flink's CEP library, Kafka Streams' sessionised patterns, and specialised engines (Esper, Drools) live here. It is a narrower use case than aggregation and join but the right tool when the question is truly "has this sequence of events occurred?".
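The "three failed logins then a successful one" pattern quoted above reduces to a small per-key state machine. CEP libraries express such patterns declaratively; a hand-rolled sketch shows the state involved:

```python
# Per-user state machine for "three failed logins then a success".
from collections import defaultdict

class LoginPattern:
    def __init__(self):
        self.fail_streak = defaultdict(int)  # consecutive failures per user

    def process(self, user: str, ok: bool) -> bool:
        """Return True when the suspicious pattern completes for this user."""
        if not ok:
            self.fail_streak[user] += 1
            return False
        matched = self.fail_streak[user] >= 3
        self.fail_streak[user] = 0  # a success resets the streak
        return matched

det = LoginPattern()
events = [("u1", False), ("u1", False), ("u1", False), ("u1", True)]
assert [det.process(u, ok) for u, ok in events] == [False, False, False, True]
```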

The stateful-or-not question

Before choosing an engine, decide whether the processing is stateful. Stateless computations run happily on Kafka Streams, AWS Lambda, a plain consumer loop — the engine choice hardly matters. Stateful computations at scale are where the serious differences between Flink, Spark Structured Streaming, and Kafka Streams actually show up. The architectural question is not "which stream engine?" but "what state, with what size, with what recovery guarantees?".

Section 09

Spark Structured Streaming, the micro-batch alternative

Spark Structured Streaming, introduced in Spark 2.x as the successor to the older DStream API, approaches streaming differently from Flink: as the limiting case of batch. A continuously-growing table is queried repeatedly in small batches, and the same DataFrame API works on both bounded and unbounded input. It is the right answer for teams whose platform is already Spark-centric and whose latency tolerance is in seconds rather than milliseconds.

The "table grows over time" abstraction

The mental model is: imagine an input table that new rows keep being appended to. A streaming query is a query against that table whose result is continuously updated. Write the query once using the DataFrame or SQL API, tell Spark where to read from (Kafka, files, Delta Lake) and where to write to (Delta, Kafka, a warehouse), and Spark runs the query in micro-batches every few hundred milliseconds. The query author, in principle, does not have to think about batches at all; in practice, tuning the micro-batch interval and the trigger is most of the operational work.
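The growing-table model can be simulated in a few lines — a list standing in for the input table, a count-per-key query re-run at each micro-batch. Spark computes the result incrementally with state rather than from scratch; what the sketch shows is the semantics, not the execution:

```python
# Toy rendering of "input table grows, standing query re-runs per micro-batch".
from collections import Counter

input_table: list[str] = []          # the unbounded, ever-growing input

def run_query() -> Counter:          # the standing query: count per key
    return Counter(input_table)

def micro_batch(new_rows: list[str]) -> Counter:
    input_table.extend(new_rows)     # new rows arrive during this interval
    return run_query()               # result updated as of this batch

r1 = micro_batch(["a", "b"])
r2 = micro_batch(["a"])
assert r1 == Counter({"a": 1, "b": 1})
assert r2 == Counter({"a": 2, "b": 1})   # result reflects the whole table so far
```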

Micro-batch versus continuous modes

The default execution mode processes events in micro-batches — typically 100 ms to a few seconds — which gives high throughput and moderate latency. A continuous processing mode exists for single-digit-millisecond latency but supports only a subset of operations. For the large majority of Structured Streaming deployments, the micro-batch default is what is actually used; for anything truly latency-sensitive, Flink is more often the right tool.

Integration with Delta and the lakehouse

Where Spark Structured Streaming is unambiguously the right choice is inside a lakehouse. Writing micro-batches into Delta Lake (or Iceberg, or Hudi) gives transactional streaming writes to the same tables that batch jobs and dashboards read; the "unified batch-and-stream" promise of the Dataflow Model becomes concrete. The Databricks-centric "medallion architecture" — bronze/silver/gold Delta tables built by mixed batch and streaming jobs — is built on this pattern, and for Spark-centric platforms it is a clean, well-integrated model.

How the choice usually falls out

Flink and Spark Structured Streaming are the two serious choices for general-purpose stream processing, and for most teams the decision falls out of the existing platform. Databricks or a heavy Spark footprint → Spark Structured Streaming. Kubernetes-native streaming team with latency-sensitive work → Flink. New greenfield platform with neither as an incumbent → pick based on whether the workload is closer to "continuous ETL into a lakehouse" (Spark) or "low-latency event-at-a-time processing" (Flink).

Section 10

Event time, processing time, and watermarks, the semantic core

The single idea that most batch engineers stumble on when they move to streaming is that there are two clocks. When the event happened is one thing; when the system saw it is another; the lag between them is the subject of most production streaming bugs. The vocabulary for handling it — event time, processing time, watermarks — is the semantic core of the field.

Event time versus processing time

Event time is the timestamp at which the event occurred in the real world — the user clicked, the sensor read, the order was placed. Processing time is when the stream processor actually handled the event. In a well-connected system with no backpressure, the two are close; under network problems, batching at the source, or simple geographic distance, they can differ by seconds, minutes, or (for mobile apps that buffer events offline) days. A correct streaming computation almost always wants event time; a misleading one almost always accidentally uses processing time.

What a watermark actually is

If events can arrive out of order, when is a window "complete"? A watermark is the answer: a monotonically advancing heuristic from the stream processor that says "I have seen all events with event-time up to t; any event arriving now with event-time less than t is late". The watermark trails the latest event time by some margin (configurable) to tolerate lateness; it advances as new events arrive. Window computations wait for the watermark to cross the end of the window, then emit the result. Every streaming system that claims event-time correctness implements some version of this pattern.
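One common heuristic — watermark equals the maximum event time seen so far, minus a fixed out-of-orderness bound — can be sketched directly. Timestamps are plain floats and the five-second bound is chosen for illustration:

```python
# Watermark tracker under the "max event time minus fixed bound" heuristic.

class Watermark:
    def __init__(self, max_out_of_orderness: float):
        self.bound = max_out_of_orderness
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> float:
        # Monotonic: the watermark only ever advances.
        self.max_event_time = max(self.max_event_time, event_time)
        return self.current()

    def current(self) -> float:
        return self.max_event_time - self.bound

    def is_late(self, event_time: float) -> bool:
        # Late = arrives carrying an event time the watermark has passed.
        return event_time < self.current()

wm = Watermark(max_out_of_orderness=5.0)
wm.observe(100.0)            # watermark is now 95.0
assert not wm.is_late(96.0)  # within the tolerated disorder
assert wm.is_late(94.0)      # behind the watermark: late
```

A window ending at t = 95 would fire at this point, which is exactly the trade the bound encodes: a larger bound tolerates more disorder at the cost of later results.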

Late events, allowed lateness, and accumulation modes

What if an event arrives after its window has been emitted? Three options. Drop it — simple, lossy. Update the window's result with the late event (requires either an output sink that supports upserts or a downstream consumer that handles retractions). Side-output it to a dead-letter stream for later reconciliation. The choice is architectural, not automatic; Flink's allowedLateness and Beam's trigger/accumulation model make the policy explicit, which is the right design.

The rule the field learned the hard way

Every aggregation should be over event time, not processing time, unless the use case explicitly wants "what the system has seen so far", which is a much narrower question. The Dataflow Model's vocabulary (event time, watermarks, triggers, accumulation) is now the lingua franca of serious streaming; learning it once pays back across Flink, Spark, Beam, and every future engine in the same tradition.

Section 11

Windowing, tumbling, sliding, session

A window is how an unbounded stream gets chopped into bounded pieces for aggregation. The three canonical shapes — tumbling, sliding, and session — cover most use cases; the choice between them is the difference between an aggregation that matches the business question and one that subtly doesn't.

Tumbling windows

Tumbling windows are fixed-size, non-overlapping, contiguous. "Hourly count of events", "daily active users", "five-minute throughput" — each is a tumbling window. They are the simplest to reason about and the cheapest to compute; each event belongs to exactly one window, and the state per key is bounded by the window duration. Use them whenever the question genuinely is "how many in this bucket?".

Sliding windows

Sliding windows overlap: a five-minute window that advances every minute produces a new result every minute, each covering the last five. They are what you want when the question is "the current rolling average" or "how many events in the last hour, updated every minute". The state cost is higher — each event belongs to multiple windows — and the emission rate is higher, which matters when the downstream is a database or an alerting system.
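Window assignment for the two time-driven shapes is simple arithmetic. A sketch with integer-second timestamps — the half-open [start, end) convention is an assumption here, though it is the common one:

```python
# Which window(s) does an event with timestamp ts belong to?

def tumbling(ts: int, size: int) -> tuple[int, int]:
    # Exactly one window: snap ts down to the window boundary.
    start = ts - (ts % size)
    return (start, start + size)

def sliding(ts: int, size: int, step: int) -> list[tuple[int, int]]:
    # size / step overlapping windows cover any given instant.
    windows = []
    start = ts - (ts % step)          # latest window start at or before ts
    while start > ts - size:          # every window whose span still covers ts
        windows.append((start, start + size))
        start -= step
    return windows

assert tumbling(125, 60) == (120, 180)       # 125 falls in the [120, 180) hour... er, minute
assert len(sliding(7, 5, 1)) == 5            # a 5s window sliding by 1s: 5 owners
```

The state-cost difference between the two shapes is visible in the return types: one window per event versus size/step windows per event.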

Session windows

Session windows are event-driven rather than time-driven: a session opens at an event and extends as long as subsequent events arrive within a gap threshold, closing when the gap is exceeded. "User session on a website" is the canonical case — you do not know up front how long any given user's session will be, and a tumbling window would artificially split it. Sessions are more expensive and more complex, but for user-behaviour analytics they are the correct shape.
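The gap logic reduces to a small merging pass. A single-key sketch over sorted timestamps — real engines handle out-of-order arrival by merging open session windows incrementally, which this deliberately omits:

```python
# Gap-based sessionisation: an event within `gap` of the open session's
# last event extends it; otherwise the session closes and a new one opens.

def sessionize(timestamps: list[int], gap: int) -> list[tuple[int, int]]:
    sessions: list[tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            start, _ = sessions[-1]
            sessions[-1] = (start, ts)       # extend the open session
        else:
            sessions.append((ts, ts))        # gap exceeded: new session
    return sessions

# Events at t=1,2,3, a quiet spell, then t=50,52 -> two sessions.
assert sessionize([1, 2, 3, 50, 52], gap=10) == [(1, 3), (50, 52)]
```

Note that a tumbling window with any fixed size would have split at least one of these sessions arbitrarily — the point of the session shape.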

Global windows and custom triggers

Beyond the three standard shapes there is a global window (one window covering the entire stream, used with custom triggers) and various engine-specific constructions (custom merging windows, per-record triggers, and the like). For most production work, the three above cover the territory; when the question does not fit any of them, the answer is usually a different question, not a bespoke window.

Section 12

State and checkpoints, the memory streams carry forward

Aggregations need memory. Joins need memory. Pattern detection needs memory. The discipline of handling that memory — how it lives, how big it gets, how it survives a failure — is where most of the engineering weight of a serious streaming system lands.

What "state" actually is

In a stream processor, state is whatever the operator has to remember between events to produce correct output. Examples: the current count per key for a windowed aggregate; the most recent row per key for a stream-table join; the sequence of recent events for a CEP pattern; the offset position in the input stream. Each stream processor exposes state as first-class (Flink's ValueState / ListState / MapState; Kafka Streams' state stores; Spark's stateful DataFrames), and each provides a local-plus-durable layout: a fast backend (in-memory or RocksDB) for reads and writes, a durable snapshot (S3, HDFS) for recovery.

Checkpointing and the Chandy–Lamport trick

A checkpoint is a coordinated snapshot of the state of every operator plus the input offsets at a shared barrier, written durably so that recovery can roll back to a globally consistent point. Flink's implementation is based on the Chandy–Lamport distributed-snapshot algorithm (injecting barriers into the data stream that flush state along with them); Spark's is simpler (micro-batch boundaries are natural checkpoint boundaries). Either way, on failure the runtime restores the state, rewinds the input to the recorded offsets, and resumes — a guarantee that no event was lost and (with care) none was processed twice.
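The snapshot-and-replay pattern fits in a short sketch: a counting job over an in-memory "log", a checkpoint that pairs state with the input offset, and a restore-then-replay after a simulated crash. The barrier coordination across operators is the part a single-operator toy cannot show:

```python
# Checkpoint = (state, offset) snapshotted together; recovery = restore
# both and replay from the recorded offset.
import copy

log = ["a", "b", "a", "c", "a"]          # the durable input log

class CountingJob:
    def __init__(self):
        self.state: dict[str, int] = {}
        self.offset = 0

    def process_next(self):
        key = log[self.offset]
        self.state[key] = self.state.get(key, 0) + 1
        self.offset += 1

    def checkpoint(self) -> tuple[dict, int]:
        return copy.deepcopy(self.state), self.offset

    def restore(self, snapshot):
        self.state, self.offset = copy.deepcopy(snapshot[0]), snapshot[1]

job = CountingJob()
for _ in range(3):
    job.process_next()
snap = job.checkpoint()                  # state {"a": 2, "b": 1} at offset 3
job.process_next()                       # ...then the worker "crashes"
job.restore(snap)                        # roll back to the snapshot
while job.offset < len(log):             # replay from the recorded offset
    job.process_next()
assert job.state == {"a": 3, "b": 1, "c": 1}   # no loss, no double-count
```

Because state and offset travel together, the event processed after the checkpoint is counted exactly once in the restored result — the guarantee the section describes.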

State size and retention

State is not free. A stateful operator's memory, disk, and checkpoint-storage costs scale with the number of keys and the depth of history kept per key. State that grows unboundedly is the most common cause of stream-job resource blowups; TTL (time-to-live) settings on state, state compaction, and explicit expiry are the operational answers. A useful rule is that every piece of state should have an explicit retention policy — written down, not implicit — so that a reader of the job can predict its memory footprint a year from now.
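What an explicit retention policy looks like in miniature — a sketch (plain Python, with an injectable clock so the behaviour is testable; real engines provide this as configuration, e.g. state TTL settings) of keyed state that lazily expires entries older than its TTL:

```python
import time

class TtlState:
    """Toy keyed state with an explicit retention policy: entries older
    than ttl_seconds are dropped, so the footprint stays predictable."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}               # key -> (value, last_write_time)

    def put(self, key, value):
        self.entries[key] = (value, self.clock())

    def get(self, key):
        hit = self.entries.get(key)
        if hit is None:
            return None
        value, written = hit
        if self.clock() - written > self.ttl:
            del self.entries[key]       # lazy expiry on read
            return None
        return value
```

Written this way, a reader of the job can answer "how big can this get?" from the TTL and the key cardinality — which is the whole argument for making retention explicit.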

Savepoints and upgrades

Checkpoints are automatic and meant for recovery; savepoints are manual and meant for upgrades — a developer triggers a savepoint, changes the job, and restarts from the savepoint without losing state. Every serious streaming deployment has a savepoint discipline; teams that lack one eventually find themselves unable to deploy a code change without dropping hours of aggregations on the floor.

Section 13

At-most-once, at-least-once, exactly-once, the three delivery guarantees

No streaming conversation is complete without this triad. The three delivery guarantees are defined precisely enough that a production engineer should be able to name which one any given pipeline provides and why. Getting this wrong is the most common source of "the numbers are off by a little and we don't know why" incidents in streaming platforms.

At-most-once

At-most-once delivery means an event is delivered either zero times or one time, never more. It is the easy guarantee: the producer fires and forgets, the consumer commits its offset before processing, and any failure loses events. Use it only where loss is cheaper than duplication — telemetry in aggregate, UI pings where a missing event does not materially affect the dashboard, logging where sampling is already the point.
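The whole guarantee lives in the ordering of two lines. A toy consumer loop (plain Python; `committed` stands in for the broker's offset store) that commits before processing — a crash mid-process loses the event, and a restart resumes past it:

```python
def consume_at_most_once(events, process, committed):
    """Toy at-most-once consumer: commit the offset *before* processing,
    so a crash mid-process drops the event rather than redelivering it."""
    for offset, event in enumerate(events):
        committed["offset"] = offset + 1   # commit first
        process(event)                     # a crash here loses the event
```

Swap those two lines and the same loop becomes at-least-once — which is why the triad is best understood as a property of where the commit sits, not of the broker.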

At-least-once

At-least-once delivery means an event is delivered one or more times. The producer retries on failure, the consumer commits its offset after processing, and duplicates can occur. This is the default for most production stream pipelines, because duplicate handling is usually tractable at the consumer (idempotent writes, deduplication by event ID) and loss is not. If a pipeline is "exactly-once" by reputation but at-least-once by architecture, the deduplication is happening somewhere downstream; know where.
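The same toy loop with the commit after processing, plus the consumer-side answer to the duplicates it produces — deduplication by event ID (plain Python; `DedupSink` stands in for an idempotent sink):

```python
def consume_at_least_once(events, handle, committed, start=0):
    """Toy at-least-once consumer: commit the offset *after* processing,
    so a crash before the commit replays the event on restart."""
    for offset in range(start, len(events)):
        handle(events[offset])             # a crash here means redelivery
        committed["offset"] = offset + 1

class DedupSink:
    """Idempotent consumer: drops redeliveries by event ID."""
    def __init__(self):
        self.seen_ids, self.applied = set(), []

    def handle(self, event):
        if event["id"] in self.seen_ids:
            return                         # duplicate from a replay: ignore
        self.seen_ids.add(event["id"])
        self.applied.append(event["value"])
```

In production the `seen_ids` set itself needs a retention policy (it is state, per Section 12), which is why deduplication windows are usually bounded.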

Exactly-once, and what it really means

Exactly-once is an end-to-end property and a much stronger claim. Within a closed Kafka/Flink or Kafka/Kafka-Streams system, exactly-once is achievable using transactional writes plus idempotent producers plus coordinated checkpoints: the broker records a produce-and-commit as a single atomic unit, and a failure either rolls back both or commits both. Across heterogeneous systems — Kafka to a database, Kafka to an HTTP API — exactly-once is harder and usually provided via idempotent writes at the sink (a transactional outbox, an idempotency key per event) rather than by the broker. Marketing claims of "exactly-once" almost always mean "exactly-once within the bounded set of systems we ship"; reading the fine print is a useful habit.
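A sketch of the idempotent-sink approach across a system boundary, with SQLite standing in for the sink database (the table and function names are illustrative, not any library's API): the processed result and the consumed offset are committed in one transaction, so a replay after a crash sees the stored offset and skips already-applied events.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE totals (acct TEXT PRIMARY KEY, total INTEGER)")
db.execute("CREATE TABLE progress (part INTEGER PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO progress VALUES (0, 0)")
db.commit()

def apply_event(part, offset, acct, amount):
    """Exactly-once at the sink: skip replays, then commit the result
    and the new offset as a single atomic transaction."""
    (next_offset,) = db.execute(
        "SELECT next_offset FROM progress WHERE part = ?", (part,)
    ).fetchone()
    if offset < next_offset:
        return                           # replayed event: already applied
    with db:                             # result + offset commit atomically
        db.execute(
            "INSERT INTO totals VALUES (?, ?) "
            "ON CONFLICT(acct) DO UPDATE SET total = total + excluded.total",
            (acct, amount),
        )
        db.execute("UPDATE progress SET next_offset = ? WHERE part = ?",
                   (offset + 1, part))
```

This is the essence of what a transactional outbox or an idempotency key buys: the deduplication decision and the side effect live inside the same transaction, so neither can happen without the other.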

The practical default

Most production streams should be at-least-once with deduplication at the consumer. Exactly-once within Kafka is worth turning on when available and the cost is tolerable; exactly-once across system boundaries is worth engineering for when the downstream cannot handle duplicates (payments, billing, critical counters). Ignoring the question and hoping duplicates do not happen is the most common wrong answer.

Section 14

Event-driven architectures, beyond analytics

Streaming is not only an analytical technique. Used at the application level, the same log primitive restructures how microservices communicate, how state is persisted, and how cross-service correctness is achieved. The pattern has a name — event-driven architecture — and its use of Kafka (or equivalents) as a durable integration bus is one of the most consequential architectural shifts of the last decade.

Event-driven, event-sourced, event-streamed

Three related but distinct ideas. An event-driven service reacts to events rather than being called directly — a downstream service subscribes to "order placed" events and reacts, rather than the upstream calling it. Event sourcing goes further: the service persists its state by writing events to a log, and the current state is a fold over the log; the log, not the database, is the system of record. Event-streamed architectures use a shared durable log (Kafka) as the integration substrate between otherwise independent services. Most production systems adopt one or two of the three; all three together is a stronger commitment.

The outbox pattern and dual-write problem

The hardest problem event-driven systems face is the dual-write: a service commits a change to its database and must also publish an event to Kafka, and if either succeeds without the other the system diverges. The outbox pattern is the canonical answer: the service writes the event to an outbox table in the same transaction as the database change, and a separate CDC or polling process forwards outbox rows to Kafka. The event-publish becomes a consequence of the database commit rather than a separate step, and the consistency problem goes away.
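The pattern in miniature, with SQLite standing in for the service's database and a callback standing in for the Kafka producer (all names here are illustrative): the domain change and the outbox row commit together, and a separate relay forwards outbox rows afterwards.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
           " topic TEXT, payload TEXT)")
db.commit()

def place_order(order_id):
    """The domain change and the event are one atomic commit:
    no dual-write, no divergence."""
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders", f'{{"order_id": {order_id}, "event": "placed"}}'))

def relay(publish):
    """The CDC/polling relay: forward outbox rows, then clear them."""
    rows = db.execute("SELECT id, topic, payload FROM outbox").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)            # e.g. a Kafka produce
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()
```

Note that the relay itself is only at-least-once (it can crash between publish and delete), which is why outbox events should carry an ID that downstream consumers can deduplicate on.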

CQRS, sagas, and long-running workflows

On top of an event-driven substrate, two patterns recur. CQRS (Command Query Responsibility Segregation) separates writes (commands that produce events) from reads (queries against derived views), so each can be optimised independently. Sagas are long-running distributed transactions composed of events: a series of local commits with compensating events on failure, replacing the classical two-phase commit that distributed systems cannot actually afford. Both patterns predate Kafka and are more consequential with it.

The maturity curve

Event-driven architectures repay a team that has already mastered synchronous microservices; they are painful for a team that hasn't. Adopting Kafka because "microservices" without understanding eventual consistency, idempotency, and log-based integration usually produces a distributed monolith with extra hops. The order is: get the synchronous version right, then move the integrations that genuinely benefit from the log into the log.

Section 15

Streaming ML, online features, real-time scoring, online learning

Machine learning historically ran on batch-scale training and batch-scale scoring. Streaming adds three things: features that are fresh to the second, inference that runs as events arrive, and — more rarely — models that update themselves continuously. The first two are table stakes for modern ML systems; the third is still an active research area as much as a production technique.

Online features

Many of the best features are very recent: a user's actions in the last thirty seconds, a sensor's readings in the last minute, the running mean of a transaction amount over the last hour. Computing these from a daily batch pipeline throws away most of their value. Streaming feature pipelines — a Flink or Spark Streaming job that reads events from Kafka, aggregates per key, and writes to a feature store (Tecton, Feast, Redis, DynamoDB) — produce features that are fresh when the model needs them. The hard part is not the pipeline; it is keeping the training path and the serving path computing the same features from the same definitions, which is what a feature store exists to enforce.
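A sketch of such a pipeline's core, in plain Python with a dict standing in for the feature store (and assuming in-order events for simplicity — the real job also has to handle lateness): a per-key rolling mean over the last hour of event time, rewritten to the store on every event.

```python
from collections import defaultdict, deque

class RollingFeature:
    """Toy online feature: per-key mean of transaction amounts over the
    last hour of event time, freshest-event-included."""
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = defaultdict(deque)   # key -> (event_time, amount)
        self.feature_store = {}            # stand-in for Redis/DynamoDB

    def update(self, key, event_time, amount):
        q = self.events[key]
        q.append((event_time, amount))
        # Expire readings that fell out of the window.
        while q and q[0][0] <= event_time - self.window:
            q.popleft()
        self.feature_store[key] = sum(a for _, a in q) / len(q)
```

The training path needs a batch implementation of exactly this definition over historical events — keeping those two implementations identical is the feature-store problem the next paragraphs return to.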

Real-time inference

Scoring at event arrival is either a direct consumer of the event stream — a Flink or Kafka Streams job that joins events with a feature lookup and emits predictions — or a synchronous service called from a stream processor. The architectural choice depends on whether the prediction is consumed by another stream (stay in the stream, avoid synchronous calls) or by a synchronous caller (HTTP request-response pattern, model behind a gateway). The pattern to avoid is calling a synchronous HTTP service from inside a high-throughput stream without backpressure or isolation; it is the most common way a streaming ML system produces its own incident.

Online learning and the realistic version

True online learning — a model that updates its weights as events arrive — is niche. It works for bandit-style personalisation, for anomaly detection where concept drift is the norm, and for specialised cases like fraud models with very high event rates. For most ML, the production reality is frequent batch retraining — the model is retrained every hour or every day on recent data and redeployed — which captures most of the benefit of online learning without the operational complexity. If someone proposes online learning, the right first question is "would hourly retraining be enough?".

Training-serving skew, streaming edition

The training-serving skew problem from Chapter 03 becomes sharper in streaming ML. If the batch training pipeline computes a feature from a window of historical events, and the streaming serving pipeline computes it from a window of live events, the two can silently diverge — watermark differences, event-time versus processing-time mismatches, late events handled differently. A rigorous feature store with a single definition shared between batch backfill and stream serving is the only reliable defence; ad-hoc reimplementation of features in two languages is the canonical production bug in this space.

Section 16

Operating streaming systems, lag, backpressure, ordering

A streaming system that works in development will eventually fail in production on issues a batch system never had: consumer lag creeping, a slow consumer pushing back on the whole graph, events arriving out of order in a way that breaks an assumption. The operational vocabulary — lag, backpressure, rebalance, skew, poison pills — is worth owning before running a stream at production scale.

Consumer lag, and what it actually means

Consumer lag is the gap, in events or seconds, between the latest offset published to a partition and the latest offset the consumer has committed. A steady lag of "zero or small and stable" is healthy; lag that grows monotonically is the pipeline falling behind the producer and will not recover without intervention. Every Kafka observability stack (Burrow, Cruise Control, the broker's JMX metrics, Confluent Control Center) centres on this metric; an alert on "lag above threshold for N minutes" is the single most important alarm a streaming platform carries.
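The metric itself is a subtraction per partition. A sketch (plain Python; the offset maps stand in for what the broker's admin API and the consumer group's commits report):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Lag per partition: latest offset published minus latest offset
    committed. Small and stable is healthy; monotonic growth means the
    consumer is falling behind the producer."""
    return {
        partition: latest_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in latest_offsets
    }

lag = consumer_lag({0: 1000, 1: 500}, {0: 990, 1: 500})
```

The alert belongs on the trend, not the snapshot — a lag of ten thousand that is shrinking is a recovering consumer; a lag of one hundred that grows every minute is an incident in progress.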

Backpressure and skew

Backpressure is the mechanism by which a slow downstream operator signals an upstream one to slow down. Flink and Kafka Streams both implement it natively; problems start when backpressure is not respected (a synchronous HTTP call in a Flink map, a database sink without batching). Partition skew — one partition receiving far more data than its siblings because of a bad hash key — is a sibling problem: if one key is hot (one user, one product, one sensor), the partition that key maps to becomes the bottleneck for the whole job. The fix is almost always to change the partitioning key; occasionally a key salting trick is warranted.
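The salting trick in miniature (plain Python; the function name and salt count are illustrative): a known hot key is spread across several sub-keys so its load lands on several partitions, at the cost of a second aggregation step that merges the per-salt partial results back together.

```python
import random

def salted_key(key, hot_keys, salts=8):
    """Toy key salting: fan a hot key out across `salts` sub-keys.
    A downstream step must re-aggregate across the salted variants."""
    if key in hot_keys:
        return f"{key}#{random.randrange(salts)}"
    return key
```

Salting is a last resort precisely because of that second step: it trades a hot partition for extra pipeline complexity, which is why changing the partitioning key outright is almost always the better fix.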

Rebalances, poison pills, and the rest of the zoo

When a consumer joins or leaves a group, Kafka rebalances the partitions across the remaining consumers. A rebalance storm — a consumer flapping in and out of the group — stalls the whole group while rebalances complete. Poison pills — events that crash the consumer on parse — halt an entire partition until handled, because the consumer cannot advance past a record it cannot process. Both have well-known operational mitigations (static consumer group membership, dead-letter queues for unparseable messages); both appear, in the same forms, across every stream processor in the field.
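The dead-letter mitigation for poison pills fits in a few lines. A sketch (plain Python, JSON parse failure standing in for any deserialisation error): unparseable records are parked on a dead-letter queue so the partition keeps advancing instead of halting on them.

```python
import json

def consume_with_dlq(messages, handle, dead_letters):
    """Toy poison-pill handling: a record that fails to parse goes to a
    dead-letter queue rather than halting the partition."""
    for raw in messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            dead_letters.append(raw)   # park the poison pill for later triage
            continue
        handle(event)
```

The dead-letter topic needs an owner and a triage process; a DLQ nobody reads is just a slower way of losing events.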

The golden dashboard

Every streaming platform deserves a single dashboard with five numbers: consumer lag per topic, events-per-second in and out per topic, end-to-end latency from producer timestamp to sink, checkpoint duration, and state size. If those five are visible and alarmed, 90% of streaming incidents get caught before anyone downstream notices. If any of them is not instrumented, that is where the next incident will come from.

Section 17

Where streaming compounds into the ML operation

Streaming is moving steadily from "a capability some ML teams need" to "a substrate most production ML touches". The trajectory traces the familiar arc of storage and pipelines: initially a specialist concern, eventually a default assumption. Understanding why helps a team decide when to invest.

Freshness is becoming a competitive axis

Five years ago, a recommendation system trained nightly and scored daily was state of the art. Today a recommendation system whose features are not fresh to the last event is visibly behind. The competitive pressure has pushed more of the feature pipeline, more of the inference path, and more of the training data preparation into streaming — not because streaming is intrinsically better, but because the product demands the latency. For teams whose ML products have a latency dimension (personalisation, ranking, fraud, ops-monitoring), streaming has stopped being optional.

The same primitive serves many purposes

The Kafka log that carries ML features also carries application events, CDC from databases, audit events, and sometimes the model predictions themselves. A team that has built the discipline for one — schema registry, contracts, retention, observability — has mostly built it for the others. The payback from investing in streaming infrastructure is highest when multiple teams share the same event substrate rather than each running its own.

The discipline carries

Almost everything this chapter argued about streaming — event-time semantics, exactly-once guarantees, state and checkpoints, watermarks, operational vocabulary — is more intricate than its batch equivalent but is genuinely learnable. Teams that treat streaming as a harder, specialised branch of pipelines (rather than as a different field) tend to produce durable systems; teams that treat it as "a queue we subscribe to" tend to produce incidents. The discipline generalises, and the investment compounds.

The through-line

Streaming extends the pipeline discipline of Chapter 03 with time semantics, state, and delivery guarantees that batch did not have to think about. The tools — Kafka, Flink, Spark Streaming, Kafka Streams — are less important than the primitives: the log, the watermark, the checkpoint, the exactly-once boundary. For modern ML, an increasing share of the platform sits on top of these primitives. The rest of Part III — distributed compute, cloud platforms, governance — builds further upward; the chapters to come assume that bytes can both sit still in storage and flow forever across a log, and that a correct platform knows what to do with both.

Further reading

Where to go next

Streaming has a small set of canonical essays and papers that reward careful reading, a substantial set of official documentation from the load-bearing projects, and a growing literature on operational practice. The list below picks the references working streaming engineers actually return to, starting with the foundational ideas and ending with the vendor documentation and ops-focused writing.

The canonical essays and books

Foundational papers

Apache Kafka — official documentation

Stream engines — official documentation

Alternatives and adjacent tools

ML-adjacent streaming topics

This page is the fourth chapter of Part III: Data Engineering & Systems. The next — Distributed Computing — steps back from the streaming/batch distinction to the substrate underneath both: Spark and the shuffle model, MapReduce's surviving ideas, the partitioned parallel execution model that every large-data processor is built on. After that come the cloud platforms everything runs on and the governance layer that keeps it all auditable.