Part III · Data Engineering & Systems · Chapter 04

Streaming, and the data that refuses to wait.

Not every piece of data can be held for the nightly batch. Payments clear in seconds, fraud has to be caught in milliseconds, a user's next recommendation is wanted before they scroll away, a sensor emits a reading that matters only while it is fresh. Streaming is the architectural answer: an unbounded, ordered flow of events rather than a bounded file; a processor that updates continuously rather than once a day; a set of correctness guarantees — exactly-once, event-time, watermarks — that batch never had to think about. The tools that carry the events — Kafka, Pulsar, Kinesis — and the engines that compute over them — Flink, Spark Structured Streaming, Kafka Streams — are the core of this chapter. So is the harder part: the semantics that make streaming correct rather than just fast.

How to read this chapter

The first three sections frame the subject: why streaming exists at all given how well batch works, why the append-only log is the primitive the whole field is built on, and how stream processing actually differs from the batch/stream sketch in Chapter 03. Sections four through six are the transport layer — Apache Kafka as the dominant broker, its ecosystem (Connect, Streams, Schema Registry), and the alternatives (Pulsar, Kinesis, Redpanda, NATS) that matter in different niches. Sections seven through nine move from transport to computation: what stream processing actually is, Apache Flink as the reference engine, and Spark Structured Streaming as the micro-batch alternative. Sections ten through thirteen are the semantic core of the chapter — event time versus processing time, windowing, state and checkpoints, and the three delivery guarantees (at-most-once, at-least-once, exactly-once) — the ideas that separate a correct streaming system from one that is merely live. Sections fourteen through sixteen widen the frame: event-driven architectures beyond analytics, streaming ML (online features, real-time inference), and the operational reality of running streams at scale (lag, backpressure, ordering). The final section connects all of it to why streaming infrastructure is increasingly on the critical path for modern ML.

Conventions: vendor names are used where there is no good generic term — Kafka specifically, brokers generically; Flink specifically, stream engines generically. References to "the warehouse" still assume the storage layer from Chapter 02, and the pipelines from Chapter 03 are assumed as context for how batch and stream interact in a real platform. The treatment of semantics (event time, watermarks, exactly-once) is deliberately careful: these are the topics where intuition from batch tends to mislead, and where most production streaming incidents originate.

Contents

  1. Why streaming exists at all · The latency pressure
  2. Events and the append-only log · The primitive
  3. Stream processing versus batch, revisited · What actually changes
  4. Apache Kafka · The dominant broker
  5. The Kafka ecosystem · Connect, Streams, Schema Registry
  6. Pulsar, Kinesis, Redpanda, NATS · The alternatives that matter
  7. What stream processing actually is · Stateless versus stateful
  8. Apache Flink · The reference stream engine
  9. Spark Structured Streaming · Micro-batching and the unified API
  10. Event time, processing time, watermarks · The semantic core
  11. Windowing · Tumbling, sliding, session
  12. State and checkpoints · Keeping memory across events
  13. Delivery semantics · At-most, at-least, exactly-once
  14. Event-driven architectures · Beyond analytics
  15. Streaming ML · Online features and real-time inference
  16. Operating streaming systems · Lag, backpressure, ordering
  17. Where it compounds in ML · Why streaming is on the critical path

Section 01

Streaming exists because some data will not wait until tomorrow

Batch is the default, and for most analytical work it is also the right answer. Streaming is the architecture you adopt when the default stops working — when the gap between an event happening in the world and a decision being made about it has to collapse from hours to seconds, and the whole shape of the system has to change to make that collapse possible.

The latency pressure

A surprising amount of software is a latency argument. Fraud detection that fires after yesterday's batch has run is useless; a recommendation that reflects a user's previous session is worth a tenth of one that reflects their current one; an alert on a failing production line that arrives an hour late is a postmortem, not an alert. Each of these is a case where the value of the data decays faster than a batch pipeline can deliver it, and each of them pushes the underlying architecture toward streaming. The question is rarely "should this be real-time?" in the abstract; it is "what is it worth to compute this in one hundred milliseconds instead of one hour?" — and for a narrow but important set of use cases the answer is: a lot.

What streaming actually is

A streaming system has three properties that distinguish it from a batch pipeline. Data arrives as an unbounded sequence of events rather than as a finite file. Processing runs continuously rather than on a schedule. And the system's correctness is defined in terms of invariants over time — "the running total is accurate within five seconds of the latest event", "no event is lost", "each event is counted exactly once" — rather than in terms of a completed batch's output matching an expected result. All three together are what make streaming a different discipline rather than a faster batch pipeline.

Why it is harder

Every concern that batch handles implicitly becomes explicit in a streaming system. Ordering: events arrive out of order, and the correct handling depends on the semantics the consumer expects. Time: is a count of "events in the last hour" based on when the event happened, when it arrived, or when it was processed? Failure: a stream runs forever, which means a worker crash cannot mean "restart the job" — it has to mean "resume from where the worker left off, without losing or duplicating events". State: most interesting stream computations are stateful (counts, joins, aggregates), which means there has to be a recovery story for the state itself. The rest of this chapter is the systematic answer to each of those problems.

The wrong default

The single most common streaming mistake is adopting it where batch would do. Streaming adds operational load, debugging difficulty, and cost; it earns its place only where the latency requirement genuinely demands it. A reasonable default is: batch first, streaming when batch no longer fits. Teams that invert this — stream first, batch only where stream is clearly overkill — spend a large fraction of their engineering budget on problems their use case never actually had.

Section 02

The append-only log, the primitive everything else is built on

The conceptual core of modern streaming is not a queue, a database, or a pub/sub bus — it is the log. An ordered, append-only, immutable sequence of records, partitioned for scale and replicated for durability. Most of what makes streaming systems behave the way they do is a direct consequence of this single data structure.

What an event is

An event is a fact: it records that something happened at a specific time, it has a key (a user, an order, a sensor), it carries a payload, and it is immutable once written. "User 42 clicked product 7 at 2026-04-21T13:02:11Z" is an event; so is "Order 9001 was paid" and "Sensor A reported 72°F". The discipline is to represent what happened, not the resulting state — the state is a view of the event stream, not the stream itself. Jay Kreps's 2013 essay "The Log" is the canonical articulation of this inversion, and it is the conceptual step most batch engineers have to make to think clearly about streaming.

Why append-only and immutable

Once written, an event is never modified. Corrections are themselves new events ("order 9001 was refunded"), and the derived state adjusts accordingly. The immutability property is what makes a log a useful foundation: it can be replayed, re-processed by a different consumer, re-read from the beginning by a new model, and audited — properties that an in-place-mutated database table does not naturally provide. The cost is that the log grows; retention policies and compacted topics (see below) are the answer.

Partitioning, offsets, and replicated durability

A log at scale is not a single sequence; it is a set of partitions, each of which is its own ordered sequence. Events are routed to partitions by a key, so all events for the same user land on the same partition in order. Each event in a partition has a monotonically increasing offset; consumers track where they are by remembering the last offset they processed. Partitions are replicated across brokers so that the loss of any single broker does not lose events. These three properties — partitioned for throughput, offset-addressed for resumability, replicated for durability — together are the thing that makes a log usable as the backbone of a streaming platform.
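The three properties can be modelled in a few lines. The sketch below is a toy in-memory log, not Kafka: the partition count, the md5-based router, and the tuple records are all illustrative, but the invariants — same key, same partition; offsets as resume points; sequential reads only — are the ones described above.

```python
# Toy model of a partitioned, offset-addressed log (illustrative only).
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]  # each inner list is one ordered log

def partition_for(key: str) -> int:
    # Stable hash, so "user-42" always maps to the same partition.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def produce(key: str, value: str) -> tuple[int, int]:
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1  # (partition, offset) of the appended event

def consume(partition: int, from_offset: int) -> list[tuple[str, str]]:
    # Sequential read from an offset -- the only read pattern a log offers.
    return partitions[partition][from_offset:]

# All events for one key stay ordered within one partition.
p1, o1 = produce("user-42", "clicked product 7")
p2, o2 = produce("user-42", "added to cart")
assert p1 == p2 and o2 == o1 + 1
```

A consumer that remembers `(partition, offset)` pairs across restarts has everything it needs to resume without loss — which is exactly the contract real brokers expose.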

Kafka as the reference implementation

The log abstraction predates Kafka, but Kafka is what made it operationally mainstream. "A distributed, replicated, partitioned, append-only commit log" is both a description of Kafka and a description of the abstraction the rest of this chapter will take as given. Alternatives (Pulsar, Kinesis, Redpanda) implement the same abstraction with different trade-offs; the vocabulary is substantially shared.

Section 03

Stream processing versus batch, the differences that actually matter

Chapter 03 opened the batch/stream distinction; this chapter has to close it. The useful view is not that stream is "batch but faster", but that several properties — the shape of the input, the definition of correctness, the cost profile, the failure handling — change together when you cross the line.

Bounded versus unbounded input

A batch job reads a bounded dataset — yesterday's Parquet files, this hour's partition, the Postgres table as of now — and produces a bounded output. A streaming job reads an unbounded input and produces an unbounded output. The practical consequence is that "done" is a state a batch job is in and a stream job never is. Every streaming system's vocabulary — windows, watermarks, checkpoints — exists so that a job whose input is infinite can still produce useful finite answers along the way.

Different failure models

When a batch job fails, the operational move is "re-run from the top". That is not available to a streaming job. The state has to survive the failure, the in-flight events have to be replayed without being processed twice, and the output has to remain correct after the restart. Every serious stream processor solves this with some variant of the same pattern: periodically snapshot the state, record the input offsets at the snapshot boundary, and on restart roll back to the most recent snapshot and replay from the recorded offsets. The difference in failure handling is one of the largest practical differences between the two worlds.

Costs and latencies are differently shaped

A batch pipeline's cost is proportional to the data it processes and easy to predict: a nightly job processes a day's worth of data, so the bill is the per-unit cost times one day of data. A streaming pipeline's cost is proportional to wall-clock time — it runs continuously, paying for compute whether or not events are arriving — and to the state it maintains. For workloads with high event rates the stream can be cheaper; for workloads with low rates it can be surprisingly expensive. The latency shape is the inverse: batch latency is its own run duration plus the schedule gap (typically hours), streaming latency is measured in seconds or milliseconds.

The honest middle ground

Real platforms are almost always hybrid. The warehouse is batch; the event bus is streaming; a CDC feed bridges between them. A reasonable design principle is to use batch for everything the consumer can tolerate it for, and streaming only for the specific data whose value decays faster than batch can deliver. Chapter 03's pipeline discipline applies to both; what this chapter adds is the semantics that only the stream side needs.

Section 04

Apache Kafka, the broker the whole field assumes

Kafka is, for streaming, what Parquet is for storage: the default. Written at LinkedIn, open-sourced in 2011, now the Apache project at the centre of most event-driven architectures — it is the single technology whose vocabulary the rest of streaming has adopted. An understanding of what Kafka actually is, and is not, is the load-bearing piece of this chapter.

Brokers, topics, partitions, replicas

A Kafka cluster is a set of brokers — servers that store and serve events. Events are written to topics, which are logical names; each topic is split into one or more partitions, each of which is an ordered, append-only log stored on disk and replicated across brokers. Producers write events to a topic; the partition is chosen by hashing a key (or round-robin if no key). Consumers read from one or more partitions, tracking their position (the offset) so that a restart resumes without loss. Every higher-level concept in Kafka — consumer groups, transactions, compacted topics — is built on this substrate.

Producers and consumers, and what each guarantees

A Kafka producer writes batches of records to a partition; its knobs (acks, retries, enable.idempotence) control what happens on network failure and how strong the delivery guarantee is. A Kafka consumer reads from partitions in order, processes events, and commits its offsets back to the broker so that a restart knows where to resume. Consumer groups let multiple consumers share the work of reading a topic: each partition is assigned to exactly one consumer in the group, and adding consumers up to the partition count is how a pipeline horizontally scales.
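The consumer-group invariant — every partition owned by exactly one consumer in the group — can be sketched as a plain assignment function. The round-robin strategy below is illustrative; real brokers negotiate assignments through a rebalance protocol.

```python
# Sketch of consumer-group partition assignment (illustrative round-robin,
# not the broker's actual rebalance protocol).

def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Partition i goes to consumer i mod group size: each partition has
        # exactly one owner, and adding consumers spreads the partitions out.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign(list(range(6)), ["c1", "c2"])
owned = sorted(p for ps in a.values() for p in ps)
assert owned == [0, 1, 2, 3, 4, 5]  # every partition appears exactly once
```

The scaling ceiling falls directly out of the invariant: with six partitions, a seventh consumer in the group would own nothing, which is why partition count is chosen with future parallelism in mind.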

What Kafka is and is not

Kafka is a distributed log; it is not a database, and not a message queue in the classical RabbitMQ sense. It does not provide per-message acknowledgements (it provides per-offset commits), does not provide random access to events (it provides sequential reads from an offset), and does not provide complex routing (it provides topics and partitions, and the routing is the consumer's problem). The result is very high throughput, very durable storage, and a simple-but-not-easy programming model. Tools built around Kafka — Streams, ksqlDB, Flink connectors — provide the higher-level patterns; Kafka itself stays minimal.

KRaft and the end of ZooKeeper

Until recently, a Kafka cluster required a separate ZooKeeper ensemble for metadata coordination; Kafka 2.8 introduced KRaft, a Raft-based quorum protocol that moved metadata into the brokers themselves, and Kafka 3.x made it the default. Operational simplification is real; new deployments should start on KRaft without looking back.

Section 05

The Kafka ecosystem, Connect, Streams, Schema Registry, ksqlDB

Kafka itself is a log. The operational value comes from the ecosystem that has grown around it — the connector framework that moves data in and out, the stream-processing library that runs in-process, the schema registry that keeps producers and consumers honest, and the SQL-style query layer that sits over all of it.

Kafka Connect

Kafka Connect is a framework for moving data between Kafka and external systems — databases, object storage, warehouses, search indexes, SaaS applications — via declarative JSON configuration rather than bespoke code. It runs as its own process (a Connect cluster), hosts source connectors that emit data into Kafka and sink connectors that write data out, and handles offset tracking, at-least-once delivery, and failure recovery generically. Hundreds of community and vendor-supplied connectors exist; the Debezium CDC connectors (section 13 of Chapter 03) are among the most consequential.

Kafka Streams and ksqlDB

Kafka Streams is a Java/Scala library — not a separate cluster — that embeds a stream processor inside an application. You write transformations using a DSL (map, filter, join, windowedBy), and the library handles partition assignment, state store management (RocksDB-backed), and checkpointing using Kafka itself. ksqlDB sits on top and lets you express the same patterns as SQL. Neither competes directly with Flink; both are the right choice when the processing logic belongs with the application that already uses Kafka, rather than in a separate cluster.

Schema Registry and Avro/Protobuf discipline

An event's payload is bytes; making those bytes self-describing and evolvable is what a schema registry does. Confluent's Schema Registry is the canonical implementation: producers register Avro, Protobuf, or JSON Schema definitions with the registry, publish events carrying a schema ID, and the registry enforces compatibility rules (backward, forward, full) on every evolution. Consumers look up schemas by ID and deserialise safely. A Kafka cluster without a schema registry is a Kafka cluster where producers silently break consumers; the registry is not optional for production platforms.

Confluent, Strimzi, and the distribution question

Apache Kafka is the upstream project; most production deployments run either a managed service (Confluent Cloud, AWS MSK, Aiven, Redpanda Cloud) or an operator-based self-hosted cluster (Strimzi on Kubernetes). Running Kafka on bare VMs without either is possible and increasingly rare; the operational discipline involved — broker rolling, partition balancing, rack-aware replication, retention policy — is enough that most teams adopt the operator or the managed service rather than build it from scratch.

Section 06

Pulsar, Kinesis, Redpanda, NATS, the alternatives that matter

Kafka is the default but not the only choice. Four alternatives show up often enough to matter: Pulsar and Kinesis in the broker slot, Redpanda as a Kafka-protocol-compatible rewrite, and NATS in the lighter-weight messaging niche. Each is a reasonable answer in specific circumstances.

Apache Pulsar

Pulsar, originally built at Yahoo, separates the serving layer (brokers) from the storage layer (Apache BookKeeper), which makes scaling storage independent of scaling throughput. It has first-class multi-tenancy, geo-replication built in rather than bolted on, and native support for both queue-like and log-like consumption. For organisations that need heavy multi-tenant isolation or geo-replication, Pulsar's architecture is a genuinely better fit than Kafka's; for most other cases the larger Kafka ecosystem wins.

AWS Kinesis

Kinesis is Amazon's managed streaming service, with similar partitioned-log semantics to Kafka but a simpler (and narrower) API. Two products worth distinguishing: Kinesis Data Streams, the broker-equivalent, and Kinesis Data Firehose, a no-code pipeline that lands events into S3/Redshift/OpenSearch. On AWS-native stacks Kinesis is often the pragmatic choice; on anything multi-cloud or self-hosted, Kafka is usually preferred because the ecosystem is portable.

Redpanda, and the Kafka-compatible rewrite

Redpanda is a C++ reimplementation of the Kafka protocol that runs as a single binary, without JVM and without ZooKeeper. It uses the same client libraries, the same Schema Registry, and the same Kafka Connect ecosystem; from the application's perspective it is Kafka. What it changes is operations: lower memory, tighter tail latency, simpler deployment. For teams that want Kafka-the-protocol without JVM-the-tax, it is the obvious alternative.

NATS, and the lightweight end of the spectrum

NATS is a different category: a lightweight pub/sub messaging system with a persistent-log add-on (JetStream). Its strength is latency and simplicity — microseconds rather than milliseconds, a binary that runs on a Raspberry Pi. It does not replace Kafka for heavy analytical streams, but for microservice messaging, request/reply, and IoT-scale deployments, it is often a better fit.

The convergence

Pulsar, Kinesis, and Redpanda have converged on the same log-first model, and the Kafka protocol has become a lingua franca: Redpanda speaks it natively, Pulsar offers a Kafka-protocol compatibility layer, and Kinesis keeps its own API but the same partitioned-log semantics. The selection is now more about operations, cost, and cloud fit than about fundamentally different semantics; NATS remains the deliberate exception at the lightweight end of the spectrum.

Section 07

Stream processing, stateless, stateful, and the join

A broker moves events; a stream processor computes over them. The set of computations stream processors support is smaller than a batch SQL engine's but includes the operations that most real-time applications need: per-event transforms, aggregations over windows, joins between streams, and stateful pattern detection. Understanding the three shapes of computation is how the rest of the chapter becomes legible.

Stateless transforms

The simplest operations — map, filter, flatMap — are stateless: the output of any single event depends only on that event's contents. Enriching an event with a lookup from a static table, filtering events by type, parsing a raw payload into typed fields: all fall into this category. Stateless operators are easy to scale (each event is independent) and easy to reason about (a failure loses nothing except the in-flight event, which is replayed from the log). The majority of production stream code is stateless; the part that isn't is where the interesting work is.
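The three examples above are just function composition over an event iterator. A sketch with generator pipelines — the payload shape and the static "region" enrichment are invented for illustration:

```python
# Stateless transforms as plain functions over a stream of events: each
# output depends only on the event in hand, so any event can be reprocessed
# independently after a failure.
import json

raw = ['{"type": "click", "user": 42}', '{"type": "ping"}']

parsed = (json.loads(line) for line in raw)              # map: parse payload
clicks = (e for e in parsed if e["type"] == "click")     # filter: by type
enriched = ({**e, "region": "eu"} for e in clicks)       # map: enrich from a static lookup

out = list(enriched)
# out == [{"type": "click", "user": 42, "region": "eu"}]
```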

Stateful aggregations and joins

An aggregation — a running count, a rolling average, a top-K — needs to remember something across events. So does a join between two streams (enrich a clickstream event with the user's current profile) or between a stream and a table. A stream processor with state stores (backed by RocksDB or a distributed key-value store) and a checkpointing mechanism (see section twelve) is what makes these tractable. The state is a first-class concern: it has a size, a memory footprint, a recovery story, and a cost.
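As a minimal illustration, a running count per key is a complete (if tiny) stateful operator — a dict standing in for the RocksDB-backed state store a real engine would use, with no recovery story yet:

```python
# A minimal stateful operator: the state is whatever must be remembered
# between events -- here, the current count per key.
from collections import defaultdict

class RunningCount:
    def __init__(self):
        self.state: dict[str, int] = defaultdict(int)  # the operator's state

    def process(self, key: str) -> tuple[str, int]:
        self.state[key] += 1
        return key, self.state[key]  # emit the updated count downstream

op = RunningCount()
results = [op.process(k) for k in ["a", "b", "a", "a"]]
# results == [("a", 1), ("b", 1), ("a", 2), ("a", 3)]
```

Everything the section names as a first-class concern — size, footprint, recovery, cost — attaches to that `state` dict the moment the key space grows or the process can crash.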

Pattern detection and CEP

Complex event processing (CEP) is the oldest genre of streaming: detect patterns ("three failed logins then a successful one", "price rose more than 5% then reversed") across a stream. Flink's CEP library, Kafka Streams' sessionised patterns, and specialised engines (Esper, Drools) live here. It is a narrower use case than aggregation and join but the right tool when the question is truly "has this sequence of events occurred?".
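The "three failed logins then a successful one" pattern quoted above reduces to a small per-key state machine. CEP libraries express such patterns declaratively; a hand-rolled sketch shows the state involved:

```python
# Per-user state machine for "three failed logins then a success".
from collections import defaultdict

class LoginPattern:
    def __init__(self):
        self.fail_streak = defaultdict(int)  # consecutive failures per user

    def process(self, user: str, ok: bool) -> bool:
        """Return True when the suspicious pattern completes for this user."""
        if not ok:
            self.fail_streak[user] += 1
            return False
        matched = self.fail_streak[user] >= 3
        self.fail_streak[user] = 0  # a success resets the streak
        return matched

det = LoginPattern()
events = [("u1", False), ("u1", False), ("u1", False), ("u1", True)]
assert [det.process(u, ok) for u, ok in events] == [False, False, False, True]
```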

The stateful-or-not question

Before choosing an engine, decide whether the processing is stateful. Stateless computations run happily on Kafka Streams, AWS Lambda, a plain consumer loop — the engine choice hardly matters. Stateful computations at scale are where the serious differences between Flink, Spark Structured Streaming, and Kafka Streams actually show up. The architectural question is not "which stream engine?" but "what state, with what size, with what recovery guarantees?".

Section 09

Spark Structured Streaming, the micro-batch alternative

Spark Structured Streaming, introduced in Spark 2.x as the successor to the older DStream API, approaches streaming differently from Flink: as the limiting case of batch. A continuously-growing table is queried repeatedly in small batches, and the same DataFrame API works on both bounded and unbounded input. It is the right answer for teams whose platform is already Spark-centric and whose latency tolerance is in seconds rather than milliseconds.

The "table grows over time" abstraction

The mental model is: imagine an input table that new rows keep being appended to. A streaming query is a query against that table whose result is continuously updated. Write the query once using the DataFrame or SQL API, tell Spark where to read from (Kafka, files, Delta Lake) and where to write to (Delta, Kafka, a warehouse), and Spark runs the query in micro-batches every few hundred milliseconds. The query author, in principle, does not have to think about batches at all; in practice, tuning the micro-batch interval and the trigger is most of the operational work.
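The growing-table model can be simulated in a few lines — a list standing in for the input table, a count-per-key query re-run at each micro-batch. Spark computes the result incrementally with state rather than from scratch; what the sketch shows is the semantics, not the execution:

```python
# Toy rendering of "input table grows, standing query re-runs per micro-batch".
from collections import Counter

input_table: list[str] = []          # the unbounded, ever-growing input

def run_query() -> Counter:          # the standing query: count per key
    return Counter(input_table)

def micro_batch(new_rows: list[str]) -> Counter:
    input_table.extend(new_rows)     # new rows arrive during this interval
    return run_query()               # result updated as of this batch

r1 = micro_batch(["a", "b"])
r2 = micro_batch(["a"])
assert r1 == Counter({"a": 1, "b": 1})
assert r2 == Counter({"a": 2, "b": 1})   # result reflects the whole table so far
```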

Micro-batch versus continuous modes

The default execution mode processes events in micro-batches — typically 100 ms to a few seconds — which gives high throughput and moderate latency. A continuous processing mode exists for single-digit-millisecond latency but supports only a subset of operations. For the large majority of Structured Streaming deployments, the micro-batch default is what is actually used; for anything truly latency-sensitive, Flink is more often the right tool.

Integration with Delta and the lakehouse

Where Spark Structured Streaming is unambiguously the right choice is inside a lakehouse. Writing micro-batches into Delta Lake (or Iceberg, or Hudi) gives transactional streaming writes to the same tables that batch jobs and dashboards read; the "unified batch-and-stream" promise of the Dataflow Model becomes concrete. The Databricks-centric "medallion architecture" — bronze/silver/gold Delta tables built by mixed batch and streaming jobs — is built on this pattern, and for Spark-centric platforms it is a clean, well-integrated model.

How the choice usually falls out

Flink and Spark Structured Streaming are the two serious choices for general-purpose stream processing, and for most teams the decision falls out of the existing platform. Databricks or a heavy Spark footprint → Spark Structured Streaming. Kubernetes-native streaming team with latency-sensitive work → Flink. New greenfield platform with neither as an incumbent → pick based on whether the workload is closer to "continuous ETL into a lakehouse" (Spark) or "low-latency event-at-a-time processing" (Flink).

Section 10

Event time, processing time, and watermarks, the semantic core

The single idea that most batch engineers stumble on when they move to streaming is that there are two clocks. When the event happened is one thing; when the system saw it is another; the lag between them is the subject of most production streaming bugs. The vocabulary for handling it — event time, processing time, watermarks — is the semantic core of the field.

Event time versus processing time

Event time is the timestamp at which the event occurred in the real world — the user clicked, the sensor read, the order was placed. Processing time is when the stream processor actually handled the event. In a well-connected system with no backpressure, the two are close; under network problems, batching at the source, or simple geographic distance, they can differ by seconds, minutes, or (for mobile apps that buffer events offline) days. A correct streaming computation almost always wants event time; a misleading one almost always accidentally uses processing time.

What a watermark actually is

If events can arrive out of order, when is a window "complete"? A watermark is the answer: a monotonically advancing heuristic from the stream processor that says "I have seen all events with event-time up to t; any event arriving now with event-time less than t is late". The watermark trails the latest event time by some margin (configurable) to tolerate lateness; it advances as new events arrive. Window computations wait for the watermark to cross the end of the window, then emit the result. Every streaming system that claims event-time correctness implements some version of this pattern.
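One common heuristic — watermark equals the maximum event time seen so far, minus a fixed out-of-orderness bound — can be sketched directly. Timestamps are plain floats and the five-second bound is chosen for illustration:

```python
# Watermark tracker under the "max event time minus fixed bound" heuristic.

class Watermark:
    def __init__(self, max_out_of_orderness: float):
        self.bound = max_out_of_orderness
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> float:
        # Monotonic: the watermark only ever advances.
        self.max_event_time = max(self.max_event_time, event_time)
        return self.current()

    def current(self) -> float:
        return self.max_event_time - self.bound

    def is_late(self, event_time: float) -> bool:
        # Late = arrives carrying an event time the watermark has passed.
        return event_time < self.current()

wm = Watermark(max_out_of_orderness=5.0)
wm.observe(100.0)            # watermark is now 95.0
assert not wm.is_late(96.0)  # within the tolerated disorder
assert wm.is_late(94.0)      # behind the watermark: late
```

A window ending at t = 95 would fire at this point, which is exactly the trade the bound encodes: a larger bound tolerates more disorder at the cost of later results.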

Late events, allowed lateness, and accumulation modes

What if an event arrives after its window has been emitted? Three options. Drop it — simple, lossy. Update the window's result with the late event (requires either an output sink that supports upserts or a downstream consumer that handles retractions). Side-output it to a dead-letter stream for later reconciliation. The choice is architectural, not automatic; Flink's allowedLateness and Beam's trigger/accumulation model make the policy explicit, which is the right design.

The rule the field learned the hard way

Every aggregation should be over event time, not processing time, unless the use case explicitly wants "what the system has seen so far", which is a much narrower question. The Dataflow Model's vocabulary (event time, watermarks, triggers, accumulation) is now the lingua franca of serious streaming; learning it once pays back across Flink, Spark, Beam, and every future engine in the same tradition.

Section 11

Windowing, tumbling, sliding, session

A window is how an unbounded stream gets chopped into bounded pieces for aggregation. The three canonical shapes — tumbling, sliding, and session — cover most use cases; the choice between them is the difference between an aggregation that matches the business question and one that subtly doesn't.

Tumbling windows

Tumbling windows are fixed-size, non-overlapping, contiguous. "Hourly count of events", "daily active users", "five-minute throughput" — each is a tumbling window. They are the simplest to reason about and the cheapest to compute; each event belongs to exactly one window, and the state per key is bounded by the window duration. Use them whenever the question genuinely is "how many in this bucket?".

Sliding windows

Sliding windows overlap: a five-minute window that advances every minute produces a new result every minute, each covering the last five. They are what you want when the question is "the current rolling average" or "how many events in the last hour, updated every minute". The state cost is higher — each event belongs to multiple windows — and the emission rate is higher, which matters when the downstream is a database or an alerting system.
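Window assignment for the two time-driven shapes is simple arithmetic. A sketch with integer-second timestamps — the half-open [start, end) convention is an assumption here, though it is the common one:

```python
# Which window(s) does an event with timestamp ts belong to?

def tumbling(ts: int, size: int) -> tuple[int, int]:
    # Exactly one window: snap ts down to the window boundary.
    start = ts - (ts % size)
    return (start, start + size)

def sliding(ts: int, size: int, step: int) -> list[tuple[int, int]]:
    # size / step overlapping windows cover any given instant.
    windows = []
    start = ts - (ts % step)          # latest window start at or before ts
    while start > ts - size:          # every window whose span still covers ts
        windows.append((start, start + size))
        start -= step
    return windows

assert tumbling(125, 60) == (120, 180)       # 125 falls in the [120, 180) hour... er, minute
assert len(sliding(7, 5, 1)) == 5            # a 5s window sliding by 1s: 5 owners
```

The state-cost difference between the two shapes is visible in the return types: one window per event versus size/step windows per event.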

Session windows

Session windows are event-driven rather than time-driven: a session opens at an event and extends as long as subsequent events arrive within a gap threshold, closing when the gap is exceeded. "User session on a website" is the canonical case — you do not know up front how long any given user's session will be, and a tumbling window would artificially split it. Sessions are more expensive and more complex, but for user-behaviour analytics they are the correct shape.
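The gap logic reduces to a small merging pass. A single-key sketch over sorted timestamps — real engines handle out-of-order arrival by merging open session windows incrementally, which this deliberately omits:

```python
# Gap-based sessionisation: an event within `gap` of the open session's
# last event extends it; otherwise the session closes and a new one opens.

def sessionize(timestamps: list[int], gap: int) -> list[tuple[int, int]]:
    sessions: list[tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            start, _ = sessions[-1]
            sessions[-1] = (start, ts)       # extend the open session
        else:
            sessions.append((ts, ts))        # gap exceeded: new session
    return sessions

# Events at t=1,2,3, a quiet spell, then t=50,52 -> two sessions.
assert sessionize([1, 2, 3, 50, 52], gap=10) == [(1, 3), (50, 52)]
```

Note that a tumbling window with any fixed size would have split at least one of these sessions arbitrarily — the point of the session shape.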

Global windows and custom triggers

Beyond the three standard shapes there is a global window (one window covering the entire stream, used with custom triggers) and various engine-specific constructions (custom merging windows, per-record triggers, and the like). For most production work, the three above cover the territory; when the question does not fit any of them, the answer is usually a different question, not a bespoke window.

Section 12

State and checkpoints, the memory streams carry forward

Aggregations need memory. Joins need memory. Pattern detection needs memory. The discipline of handling that memory — how it lives, how big it gets, how it survives a failure — is where most of the engineering weight of a serious streaming system lands.

What "state" actually is

In a stream processor, state is whatever the operator has to remember between events to produce correct output. Examples: the current count per key for a windowed aggregate; the most recent row per key for a stream-table join; the sequence of recent events for a CEP pattern; the offset position in the input stream. Each stream processor exposes state as first-class (Flink's ValueState / ListState / MapState; Kafka Streams' state stores; Spark's stateful DataFrames), and each provides a local-plus-durable layout: a fast backend (in-memory or RocksDB) for reads and writes, a durable snapshot (S3, HDFS) for recovery.

Checkpointing and the Chandy–Lamport trick

A checkpoint is a coordinated snapshot of the state of every operator plus the input offsets at a shared barrier, written durably so that recovery can roll back to a globally consistent point. Flink's implementation is based on the Chandy–Lamport distributed-snapshot algorithm (injecting barriers into the data stream that flush state along with them); Spark's is simpler (micro-batch boundaries are natural checkpoint boundaries). Either way, on failure the runtime restores the state, rewinds the input to the recorded offsets, and resumes — a guarantee that no event was lost and (with care) none was processed twice.
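The snapshot-and-replay pattern fits in a short sketch: a counting job over an in-memory "log", a checkpoint that pairs state with the input offset, and a restore-then-replay after a simulated crash. The barrier coordination across operators is the part a single-operator toy cannot show:

```python
# Checkpoint = (state, offset) snapshotted together; recovery = restore
# both and replay from the recorded offset.
import copy

log = ["a", "b", "a", "c", "a"]          # the durable input log

class CountingJob:
    def __init__(self):
        self.state: dict[str, int] = {}
        self.offset = 0

    def process_next(self):
        key = log[self.offset]
        self.state[key] = self.state.get(key, 0) + 1
        self.offset += 1

    def checkpoint(self) -> tuple[dict, int]:
        return copy.deepcopy(self.state), self.offset

    def restore(self, snapshot):
        self.state, self.offset = copy.deepcopy(snapshot[0]), snapshot[1]

job = CountingJob()
for _ in range(3):
    job.process_next()
snap = job.checkpoint()                  # state {"a": 2, "b": 1} at offset 3
job.process_next()                       # ...then the worker "crashes"
job.restore(snap)                        # roll back to the snapshot
while job.offset < len(log):             # replay from the recorded offset
    job.process_next()
assert job.state == {"a": 3, "b": 1, "c": 1}   # no loss, no double-count
```

Because state and offset travel together, the event processed after the checkpoint is counted exactly once in the restored result — the guarantee the section describes.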

State size and retention

State is not free. A stateful operator's memory, disk, and checkpoint-storage costs scale with the number of keys and the depth of history kept per key. State that grows unboundedly is the most common cause of stream-job resource blowups; TTL (time-to-live) settings on state, state compaction, and explicit expiry are the operational answers. A useful rule is that every piece of state should have an explicit retention policy — written down, not implicit — so that a reader of the job can predict its memory footprint a year from now.
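What an explicit retention policy looks like in miniature — a sketch (plain Python, with an injectable clock so the behaviour is testable; real engines provide this as configuration, e.g. state TTL settings) of keyed state that lazily expires entries older than its TTL:

```python
import time

class TtlState:
    """Toy keyed state with an explicit retention policy: entries older
    than ttl_seconds are dropped, so the footprint stays predictable."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}               # key -> (value, last_write_time)

    def put(self, key, value):
        self.entries[key] = (value, self.clock())

    def get(self, key):
        hit = self.entries.get(key)
        if hit is None:
            return None
        value, written = hit
        if self.clock() - written > self.ttl:
            del self.entries[key]       # lazy expiry on read
            return None
        return value
```

Written this way, a reader of the job can answer "how big can this get?" from the TTL and the key cardinality — which is the whole argument for making retention explicit.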

Savepoints and upgrades

Checkpoints are automatic and meant for recovery; savepoints are manual and meant for upgrades — a developer triggers a savepoint, changes the job, and restarts from the savepoint without losing state. Every serious streaming deployment has a savepoint discipline; teams that lack one eventually find themselves unable to deploy a code change without dropping hours of aggregations on the floor.

Section 13

At-most-once, at-least-once, exactly-once, the three delivery guarantees

No streaming conversation is complete without this triad. The three delivery guarantees are defined precisely enough that a production engineer should be able to name which one any given pipeline provides and why. Getting this wrong is the most common source of "the numbers are off by a little and we don't know why" incidents in streaming platforms.

At-most-once

At-most-once delivery means an event is delivered either zero times or one time, never more. It is the easy guarantee: the producer fires and forgets, the consumer commits its offset before processing, and any failure loses events. Use it only where loss is cheaper than duplication — telemetry in aggregate, UI pings where a missing event does not materially affect the dashboard, logging where sampling is already the point.
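The whole guarantee lives in the ordering of two lines. A toy consumer loop (plain Python; `committed` stands in for the broker's offset store) that commits before processing — a crash mid-process loses the event, and a restart resumes past it:

```python
def consume_at_most_once(events, process, committed):
    """Toy at-most-once consumer: commit the offset *before* processing,
    so a crash mid-process drops the event rather than redelivering it."""
    for offset, event in enumerate(events):
        committed["offset"] = offset + 1   # commit first
        process(event)                     # a crash here loses the event
```

Swap those two lines and the same loop becomes at-least-once — which is why the triad is best understood as a property of where the commit sits, not of the broker.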

At-least-once

At-least-once delivery means an event is delivered one or more times. The producer retries on failure, the consumer commits its offset after processing, and duplicates can occur. This is the default for most production stream pipelines, because duplicate handling is usually tractable at the consumer (idempotent writes, deduplication by event ID) and loss is not. If a pipeline is "exactly-once" by reputation but at-least-once by architecture, the deduplication is happening somewhere downstream; know where.
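The same toy loop with the commit after processing, plus the consumer-side answer to the duplicates it produces — deduplication by event ID (plain Python; `DedupSink` stands in for an idempotent sink):

```python
def consume_at_least_once(events, handle, committed, start=0):
    """Toy at-least-once consumer: commit the offset *after* processing,
    so a crash before the commit replays the event on restart."""
    for offset in range(start, len(events)):
        handle(events[offset])             # a crash here means redelivery
        committed["offset"] = offset + 1

class DedupSink:
    """Idempotent consumer: drops redeliveries by event ID."""
    def __init__(self):
        self.seen_ids, self.applied = set(), []

    def handle(self, event):
        if event["id"] in self.seen_ids:
            return                         # duplicate from a replay: ignore
        self.seen_ids.add(event["id"])
        self.applied.append(event["value"])
```

In production the `seen_ids` set itself needs a retention policy (it is state, per Section 12), which is why deduplication windows are usually bounded.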

Exactly-once, and what it really means

Exactly-once is an end-to-end property and a much stronger claim. Within a closed Kafka/Flink or Kafka/Kafka-Streams system, exactly-once is achievable using transactional writes plus idempotent producers plus coordinated checkpoints: the broker records a produce-and-commit as a single atomic unit, and a failure either rolls back both or commits both. Across heterogeneous systems — Kafka to a database, Kafka to an HTTP API — exactly-once is harder and usually provided via idempotent writes at the sink (a transactional outbox, an idempotency key per event) rather than by the broker. Marketing claims of "exactly-once" almost always mean "exactly-once within the bounded set of systems we ship"; reading the fine print is a useful habit.
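A sketch of the idempotent-sink approach across a system boundary, with SQLite standing in for the sink database (the table and function names are illustrative, not any library's API): the processed result and the consumed offset are committed in one transaction, so a replay after a crash sees the stored offset and skips already-applied events.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE totals (acct TEXT PRIMARY KEY, total INTEGER)")
db.execute("CREATE TABLE progress (part INTEGER PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO progress VALUES (0, 0)")
db.commit()

def apply_event(part, offset, acct, amount):
    """Exactly-once at the sink: skip replays, then commit the result
    and the new offset as a single atomic transaction."""
    (next_offset,) = db.execute(
        "SELECT next_offset FROM progress WHERE part = ?", (part,)
    ).fetchone()
    if offset < next_offset:
        return                           # replayed event: already applied
    with db:                             # result + offset commit atomically
        db.execute(
            "INSERT INTO totals VALUES (?, ?) "
            "ON CONFLICT(acct) DO UPDATE SET total = total + excluded.total",
            (acct, amount),
        )
        db.execute("UPDATE progress SET next_offset = ? WHERE part = ?",
                   (offset + 1, part))
```

This is the essence of what a transactional outbox or an idempotency key buys: the deduplication decision and the side effect live inside the same transaction, so neither can happen without the other.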

The practical default

Most production streams should be at-least-once with deduplication at the consumer. Exactly-once within Kafka is worth turning on when available and the cost is tolerable; exactly-once across system boundaries is worth engineering for when the downstream cannot handle duplicates (payments, billing, critical counters). Ignoring the question and hoping duplicates do not happen is the most common wrong answer.

Section 14

Event-driven architectures, beyond analytics

Streaming is not only an analytical technique. Used at the application level, the same log primitive restructures how microservices communicate, how state is persisted, and how cross-service correctness is achieved. The pattern has a name — event-driven architecture — and its use of Kafka (or equivalents) as a durable integration bus is one of the most consequential architectural shifts of the last decade.

Event-driven, event-sourced, event-streamed

Three related but distinct ideas. An event-driven service reacts to events rather than being called directly — a downstream service subscribes to "order placed" events and reacts, rather than the upstream calling it. Event sourcing goes further: the service persists its state by writing events to a log, and the current state is a fold over the log; the log, not the database, is the system of record. Event-streamed architectures use a shared durable log (Kafka) as the integration substrate between otherwise independent services. Most production systems adopt one or two of the three; all three together is a stronger commitment.

The outbox pattern and dual-write problem

The hardest problem event-driven systems face is the dual-write: a service commits a change to its database and must also publish an event to Kafka, and if either succeeds without the other the system diverges. The outbox pattern is the canonical answer: the service writes the event to an outbox table in the same transaction as the database change, and a separate CDC or polling process forwards outbox rows to Kafka. The event-publish becomes a consequence of the database commit rather than a separate step, and the consistency problem goes away.
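The pattern in miniature, with SQLite standing in for the service's database and a callback standing in for the Kafka producer (all names here are illustrative): the domain change and the outbox row commit together, and a separate relay forwards outbox rows afterwards.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
           " topic TEXT, payload TEXT)")
db.commit()

def place_order(order_id):
    """The domain change and the event are one atomic commit:
    no dual-write, no divergence."""
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders", f'{{"order_id": {order_id}, "event": "placed"}}'))

def relay(publish):
    """The CDC/polling relay: forward outbox rows, then clear them."""
    rows = db.execute("SELECT id, topic, payload FROM outbox").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)            # e.g. a Kafka produce
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()
```

Note that the relay itself is only at-least-once (it can crash between publish and delete), which is why outbox events should carry an ID that downstream consumers can deduplicate on.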

CQRS, sagas, and long-running workflows

On top of an event-driven substrate, two patterns recur. CQRS (Command Query Responsibility Segregation) separates writes (commands that produce events) from reads (queries against derived views), so each can be optimised independently. Sagas are long-running distributed transactions composed of events: a series of local commits with compensating events on failure, replacing the classical two-phase commit that distributed systems cannot actually afford. Both patterns predate Kafka and are more consequential with it.

The maturity curve

Event-driven architectures repay a team that has already mastered synchronous microservices; they are painful for a team that hasn't. Adopting Kafka because "microservices" without understanding eventual consistency, idempotency, and log-based integration usually produces a distributed monolith with extra hops. The order is: get the synchronous version right, then move the integrations that genuinely benefit from the log into the log.

Section 15

Streaming ML, online features, real-time scoring, online learning

Machine learning historically ran on batch-scale training and batch-scale scoring. Streaming adds three things: features that are fresh to the second, inference that runs as events arrive, and — more rarely — models that update themselves continuously. The first two are table stakes for modern ML systems; the third is still an active research area as much as a production technique.

Online features

Many of the best features are very recent: a user's actions in the last thirty seconds, a sensor's readings in the last minute, the running mean of a transaction amount over the last hour. Computing these from a daily batch pipeline throws away most of their value. Streaming feature pipelines — a Flink or Spark Streaming job that reads events from Kafka, aggregates per key, and writes to a feature store (Tecton, Feast, Redis, DynamoDB) — produce features that are fresh when the model needs them. The hard part is not the pipeline; it is keeping the training path and the serving path computing the same features from the same definitions, which is what a feature store exists to enforce.
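A sketch of such a pipeline's core, in plain Python with a dict standing in for the feature store (and assuming in-order events for simplicity — the real job also has to handle lateness): a per-key rolling mean over the last hour of event time, rewritten to the store on every event.

```python
from collections import defaultdict, deque

class RollingFeature:
    """Toy online feature: per-key mean of transaction amounts over the
    last hour of event time, freshest-event-included."""
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = defaultdict(deque)   # key -> (event_time, amount)
        self.feature_store = {}            # stand-in for Redis/DynamoDB

    def update(self, key, event_time, amount):
        q = self.events[key]
        q.append((event_time, amount))
        # Expire readings that fell out of the window.
        while q and q[0][0] <= event_time - self.window:
            q.popleft()
        self.feature_store[key] = sum(a for _, a in q) / len(q)
```

The training path needs a batch implementation of exactly this definition over historical events — keeping those two implementations identical is the feature-store problem the next paragraphs return to.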

Real-time inference

Scoring at event arrival is either a direct consumer of the event stream — a Flink or Kafka Streams job that joins events with a feature lookup and emits predictions — or a synchronous service called from a stream processor. The architectural choice depends on whether the prediction is consumed by another stream (stay in the stream, avoid synchronous calls) or by a synchronous caller (HTTP request-response pattern, model behind a gateway). The pattern to avoid is calling a synchronous HTTP service from inside a high-throughput stream without backpressure or isolation; it is the most common way a streaming ML system produces its own incident.

Online learning and the realistic version

True online learning — a model that updates its weights as events arrive — is niche. It works for bandit-style personalisation, for anomaly detection where concept drift is the norm, and for specialised cases like fraud models with very high event rates. For most ML, the production reality is frequent batch retraining — the model is retrained every hour or every day on recent data and redeployed — which captures most of the benefit of online learning without the operational complexity. If someone proposes online learning, the right first question is "would hourly retraining be enough?".

Training-serving skew, streaming edition

The training-serving skew problem from Chapter 03 becomes sharper in streaming ML. If the batch training pipeline computes a feature from a window of historical events, and the streaming serving pipeline computes it from a window of live events, the two can silently diverge — watermark differences, event-time versus processing-time mismatches, late events handled differently. A rigorous feature store with a single definition shared between batch backfill and stream serving is the only reliable defence; ad-hoc reimplementation of features in two languages is the canonical production bug in this space.

Section 16

Operating streaming systems, lag, backpressure, ordering

A streaming system that works in development will eventually fail in production on issues a batch system never had: consumer lag creeping, a slow consumer pushing back on the whole graph, events arriving out of order in a way that breaks an assumption. The operational vocabulary — lag, backpressure, rebalance, skew, poison pills — is worth owning before running a stream at production scale.

Consumer lag, and what it actually means

Consumer lag is the gap, in events or seconds, between the latest offset published to a partition and the latest offset the consumer has committed. A steady lag of "zero or small and stable" is healthy; lag that grows monotonically is the pipeline falling behind the producer and will not recover without intervention. Every Kafka observability stack (Burrow, Cruise Control, the broker's JMX metrics, Confluent Control Center) centres on this metric; an alert on "lag above threshold for N minutes" is the single most important alarm a streaming platform carries.
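The metric itself is a subtraction per partition. A sketch (plain Python; the offset maps stand in for what the broker's admin API and the consumer group's commits report):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Lag per partition: latest offset published minus latest offset
    committed. Small and stable is healthy; monotonic growth means the
    consumer is falling behind the producer."""
    return {
        partition: latest_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in latest_offsets
    }

lag = consumer_lag({0: 1000, 1: 500}, {0: 990, 1: 500})
```

The alert belongs on the trend, not the snapshot — a lag of ten thousand that is shrinking is a recovering consumer; a lag of one hundred that grows every minute is an incident in progress.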

Backpressure and skew

Backpressure is the mechanism by which a slow downstream operator signals an upstream one to slow down. Flink and Kafka Streams both implement it natively; problems start when backpressure is not respected (a synchronous HTTP call in a Flink map, a database sink without batching). Partition skew — one partition receiving far more data than its siblings because of a bad hash key — is a sibling problem: if one key is hot (one user, one product, one sensor), the partition that key maps to becomes the bottleneck for the whole job. The fix is almost always to change the partitioning key; occasionally a key salting trick is warranted.
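The salting trick in miniature (plain Python; the function name and salt count are illustrative): a known hot key is spread across several sub-keys so its load lands on several partitions, at the cost of a second aggregation step that merges the per-salt partial results back together.

```python
import random

def salted_key(key, hot_keys, salts=8):
    """Toy key salting: fan a hot key out across `salts` sub-keys.
    A downstream step must re-aggregate across the salted variants."""
    if key in hot_keys:
        return f"{key}#{random.randrange(salts)}"
    return key
```

Salting is a last resort precisely because of that second step: it trades a hot partition for extra pipeline complexity, which is why changing the partitioning key outright is almost always the better fix.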

Rebalances, poison pills, and the rest of the zoo

When a consumer joins or leaves a group, Kafka rebalances the partitions across the remaining consumers. A rebalance storm — a consumer flapping in and out of the group — stalls the whole group while rebalances complete. Poison pills — events that crash the consumer on parse — halt an entire partition until handled, because the consumer cannot advance past a record it cannot process. Both have well-known operational mitigations (static consumer group membership, dead-letter queues for unparseable messages); both appear, in the same forms, across every stream processor in the field.
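The dead-letter mitigation for poison pills fits in a few lines. A sketch (plain Python, JSON parse failure standing in for any deserialisation error): unparseable records are parked on a dead-letter queue so the partition keeps advancing instead of halting on them.

```python
import json

def consume_with_dlq(messages, handle, dead_letters):
    """Toy poison-pill handling: a record that fails to parse goes to a
    dead-letter queue rather than halting the partition."""
    for raw in messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            dead_letters.append(raw)   # park the poison pill for later triage
            continue
        handle(event)
```

The dead-letter topic needs an owner and a triage process; a DLQ nobody reads is just a slower way of losing events.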

The golden dashboard

Every streaming platform deserves a single dashboard with five numbers: consumer lag per topic, events-per-second in and out per topic, end-to-end latency from producer timestamp to sink, checkpoint duration, and state size. If those five are visible and alarmed, 90% of streaming incidents get caught before anyone downstream notices. If any of them is not instrumented, that is where the next incident will come from.

Section 17

Where streaming compounds into the ML operation

Streaming is moving steadily from "a capability some ML teams need" to "a substrate most production ML touches". The trajectory traces the familiar arc of storage and pipelines: initially a specialist concern, eventually a default assumption. Understanding why helps a team decide when to invest.

Freshness is becoming a competitive axis

Five years ago, a recommendation system trained nightly and scored daily was state of the art. Today a recommendation system whose features are not fresh to the last event is visibly behind. The competitive pressure has pushed more of the feature pipeline, more of the inference path, and more of the training data preparation into streaming — not because streaming is intrinsically better, but because the product demands the latency. For teams whose ML products have a latency dimension (personalisation, ranking, fraud, ops-monitoring), streaming has stopped being optional.

The same primitive serves many purposes

The Kafka log that carries ML features also carries application events, CDC from databases, audit events, and sometimes the model predictions themselves. A team that has built the discipline for one — schema registry, contracts, retention, observability — has mostly built it for the others. The payback from investing in streaming infrastructure is highest when multiple teams share the same event substrate rather than each running its own.

The discipline carries

Almost everything this chapter argued about streaming — event-time semantics, exactly-once guarantees, state and checkpoints, watermarks, operational vocabulary — is more intricate than its batch equivalent but is genuinely learnable. Teams that treat streaming as a harder, specialised branch of pipelines (rather than as a different field) tend to produce durable systems; teams that treat it as "a queue we subscribe to" tend to produce incidents. The discipline generalises, and the investment compounds.

The through-line

Streaming extends the pipeline discipline of Chapter 03 with time semantics, state, and delivery guarantees that batch did not have to think about. The tools — Kafka, Flink, Spark Streaming, Kafka Streams — are less important than the primitives: the log, the watermark, the checkpoint, the exactly-once boundary. For modern ML, an increasing share of the platform sits on top of these primitives. The rest of Part III — distributed compute, cloud platforms, governance — builds further upward; the chapters to come assume that bytes can both sit still in storage and flow forever across a log, and that a correct platform knows what to do with both.

Further reading

Where to go next

Streaming has a small set of canonical essays and papers that reward careful reading, a substantial set of official documentation from the load-bearing projects, and a growing literature on operational practice. The list below picks the references working streaming engineers actually return to, starting with the foundational ideas and ending with the vendor documentation and ops-focused writing.

The canonical essays and books

Foundational papers

Apache Kafka — official documentation

Stream engines — official documentation

Alternatives and adjacent tools

ML-adjacent streaming topics

This page is the fourth chapter of Part III: Data Engineering & Systems. The next — Distributed Computing — steps back from the streaming/batch distinction to the substrate underneath both: Spark and the shuffle model, MapReduce's surviving ideas, the partitioned parallel execution model that every large-data processor is built on. After that come the cloud platforms everything runs on and the governance layer that keeps it all auditable.