Part III · Data Engineering & Systems · Chapter 06

Cloud platforms, and the rented infrastructure almost every modern ML team runs on.

Almost no team builds data and ML infrastructure from bare metal anymore. The default has become a menu of managed services — virtual machines, object stores, data warehouses, streaming pipelines, container clusters, GPU instances, ML platforms — rented from AWS, GCP, or Azure. The three hyperscalers offer broadly overlapping catalogues wrapped in different names and different design sensibilities; the engineering question is rarely "which cloud?" in the abstract and almost always "which services, at what cost, with what isolation, what identity model, what disaster-recovery story". This chapter maps the cloud landscape through that lens — the five primitives (compute, storage, network, identity, observability) that the rest of the catalogue is built on, the data and ML services that sit on top, and the operational practices (regions, cost, infrastructure-as-code, multi-cloud) that keep a platform honest once it is running.

How to read this chapter

The first two sections are orientation: why cloud platforms won, and the five primitives (compute, storage, network, identity, observability) that every higher-level service is assembled from. Sections three through six drill into those primitives: compute (VMs, containers, serverless, managed Kubernetes, GPU instances), storage (object, block, file, archive), networking (VPC, subnets, peering, load balancing, CDN), and identity and secrets (IAM, service accounts, KMS, Secrets Manager). Sections seven through eleven are the data-and-ML service catalogue: cloud data warehouses (BigQuery, Snowflake, Redshift, Synapse); lake and lakehouse services (S3/GCS/ADLS plus Glue, Dataproc, Synapse, Databricks); managed streaming (Kinesis, Pub/Sub, Event Hubs, MSK); managed orchestration (MWAA, Cloud Composer, Data Factory); and the full-surface ML platforms (SageMaker, Vertex AI, Azure Machine Learning). Section twelve is accelerators — GPU fleets, TPUs, spot/preemptible capacity — which are the hardware substrate Part IV will take for granted. Sections thirteen through sixteen are the operational layer: regions, zones, and disaster recovery; cost and FinOps; infrastructure as code (Terraform, Pulumi, CloudFormation); and multi-cloud, hybrid, and on-prem realities. Section seventeen closes with where the cloud compounds for ML, from experiment scale to inference economics.

Conventions: where a service exists in roughly equivalent form across the three clouds, all three names are given (AWS / GCP / Azure), with the AWS name first because the AWS catalogue is the oldest and usually the most-referenced. Where a single provider dominates a category — TPUs on GCP, Azure OpenAI for the frontier APIs, S3 as the reference object store — the chapter names the provider directly. "The cloud" without qualification means the hyperscaler public cloud; references to "the warehouse" still mean Chapter 02's abstraction, now running on a managed service; references to "the pipeline" still mean Chapter 03's orchestrator, now running on MWAA or Composer.

Contents

  1. Why cloud platforms won · Economics, elasticity, capability
  2. The five primitives · Compute, storage, network, identity, observability
  3. Compute services · VMs, containers, serverless, Kubernetes
  4. Storage services · Object, block, file, archive
  5. Networking · VPC, peering, load balancing, CDN
  6. Identity, access, and secrets · IAM, KMS, service accounts
  7. Cloud data warehouses · BigQuery, Snowflake, Redshift, Synapse
  8. Lake and lakehouse services · S3/GCS/ADLS plus the engines on top
  9. Managed streaming · Kinesis, Pub/Sub, Event Hubs, MSK
  10. Managed orchestration · MWAA, Composer, Data Factory
  11. ML platforms · SageMaker, Vertex AI, Azure ML
  12. Accelerators and training capacity · GPU fleets, TPUs, spot
  13. Regions, zones, and disaster recovery · Availability and blast radius
  14. Cost and FinOps · Pricing models, egress, budgets
  15. Infrastructure as code · Terraform, Pulumi, CloudFormation
  16. Multi-cloud, hybrid, on-prem · When one cloud is not the answer
  17. Where it compounds in ML · Experiments, training, serving economics

Section 01

Cloud platforms won because renting a data centre by the minute is a better deal than owning one

Twenty years ago, a serious data or ML project started with a purchase order for servers, a rack in a colocation facility, and three months of lead time. Today it starts with a credit card and an API call. The migration from owned infrastructure to rented infrastructure is not yet universal, but it is the overwhelming default for new work, and understanding why is the right starting point for any account of the cloud.

Elasticity is the real product

Cloud infrastructure is sometimes framed as "cheaper than on-prem" and sometimes as "more expensive than on-prem"; both are true under different workloads and neither is the real point. The real product is elasticity: the ability to provision a hundred GPUs for six hours, a thousand cores for an overnight batch job, a terabyte of storage for a week, and pay only for the minutes actually used. No owned facility can match that responsiveness because buying and physically installing servers is measured in months, not minutes, and once they are installed they are paid for whether or not they are doing useful work. For workloads with real variability — training spikes, quarterly rollups, product launches — that elasticity is worth a substantial premium on the per-hour unit cost.

Capability, not just capacity

The other half of the argument is capability. A small team on AWS or GCP has immediate access to managed databases, global object storage, real-time stream services, managed Kubernetes, GPU fleets, TPU pods, managed Spark, managed feature stores, managed vector databases, and a long list of services that would each have been a multi-month engineering project to build and operate in-house. The cloud providers have spread the fixed cost of building those services across millions of customers; renting access to them is almost always cheaper, and almost always faster, than building equivalents. The strategic question for most teams is not whether to use managed services but which ones, and where to hold the line against lock-in.

What cloud does not fix

Cloud removes the problem of buying servers; it does not remove the problem of designing systems. Most cloud-native disasters have nothing to do with the cloud provider and everything to do with the customer's architecture — an over-permissive IAM policy, a cross-region data-transfer bill, a runaway autoscaler, a region outage taking down a single-region deployment. The failure modes change, and the tools for addressing them change, but the underlying engineering discipline — careful identity, clear boundaries, observable behaviour, tested recovery — is at least as important on the cloud as on owned hardware, not less. This chapter spends most of its effort on those engineering decisions; the services are the easy part.

"Cloud" is a noun for a specific thing

When this chapter says "cloud", it means public-cloud infrastructure rented from a hyperscaler — AWS, GCP, Azure, and a small group of secondary providers (Oracle, IBM, Alibaba). It does not mean "any server not in the office". Private clouds, managed-service vendors, and colocated hardware are adjacent to the topic but governed by different economics and different design practices; they come up in Section 16.

Section 02

The five primitives, the layers every higher-level service is built from

Strip any cloud back to its foundations and five primitives appear in every catalogue: compute, storage, network, identity, observability. Every higher-level service — a warehouse, a streaming platform, a training job — is a curated combination of these five, wrapped in an API, operated by the provider. The rest of the chapter expands on the four most visible ones; this section names all five together because designs that consider only four of them tend to produce systems that break along the fifth.

Compute, storage, network

Compute is the ability to run code — a virtual machine, a container, a function invocation, a Spark job, a training run. Storage is the ability to persist bytes — object stores for unstructured data, block volumes for VM disks, file systems for shared mounts, managed databases for structured data. Network is the substrate connecting them — virtual private networks, load balancers, DNS, peering links, content delivery networks, and the public internet as an edge. These three are the visible infrastructure and are what most "cloud 101" resources focus on.

Identity and observability

The other two primitives are less visible and at least as important. Identity — who or what is making a request, what they are allowed to do, which secrets they can unlock — is the layer that governs every other service; a cloud architecture with the wrong identity model is a security incident waiting to happen and will almost certainly ship one. Observability — metrics, logs, traces, and the billing system itself as a signal — is what tells you whether anything you built is working, and whether it is costing what you expected it to cost. Both are easy to underbuild early and very expensive to retrofit; both come up repeatedly in the rest of the chapter.

The catalogue is assembled from these

Once the primitives are clear, the rest of the cloud catalogue is much less intimidating. A managed data warehouse is compute + storage + identity + observability, bundled with a query engine. A managed training job is compute (GPU) + storage + network + identity, bundled with a scheduler. A managed streaming service is compute + storage + network + identity, bundled with a log abstraction. When evaluating a new service the useful question is usually "which primitive is this actually commoditising, and what does the bundle charge me for?", not "what is this service?" in the abstract.

The primitives have generic names, the services do not

The generic name "object store" has three vendor incarnations: Amazon S3, Google Cloud Storage, Azure Blob Storage. "Virtual machine" is EC2, Compute Engine, Azure VMs. "Identity" is IAM, Cloud IAM, Entra ID (formerly Azure AD). The generic-to-branded translation is most of the cognitive tax of moving between clouds.

Section 03

Compute, from a single VM to a managed Kubernetes fleet to a serverless function

Compute is the broadest and oldest of the cloud primitives, and it comes in a ladder of abstractions: raw virtual machines at the bottom, where you control the OS; containers and managed Kubernetes in the middle, where you control the image; serverless functions at the top, where you control only the code. Different workloads sit at different rungs, and one of the more consequential design decisions of any cloud platform is how much of the ladder the team actually wants to climb.

Virtual machines

The original cloud compute service and still the lowest common denominator. AWS EC2, GCP Compute Engine, Azure Virtual Machines — each a multiplexed hypervisor exposing VMs by instance family (general-purpose, compute-optimised, memory-optimised, GPU, storage-optimised). Instance families come and go; the important thing is the cost-per-vCPU-hour, the attached storage, the network bandwidth, and (for GPUs) the accelerator generation. For workloads that do not fit the container or serverless shapes — Windows applications, custom kernels, HPC with specialised networking — VMs remain the right answer; for everything else, higher rungs of the ladder are usually cheaper and less work.

Containers and managed Kubernetes

Most modern cloud workloads run as containers — processes packaged with their dependencies into portable images — rather than as VMs. The container-running services come in two flavours. Simpler ones (AWS ECS / Fargate, Google Cloud Run, Azure Container Apps) hide the cluster entirely and run a container with a URL and an autoscaler. Managed Kubernetes — EKS, GKE, AKS — exposes a full Kubernetes API and lets the team run its own workloads with node pools, autoscalers, networking policies, and all the mechanisms from Chapter 05. GKE's reputation for operational quality is not accidental; Kubernetes itself is a Google project. For anything that looks like a microservice fleet, a Spark cluster, a Ray cluster, or an ML serving platform, Kubernetes is the de facto substrate.

Serverless functions and batch

At the top of the ladder, the provider runs the machines entirely. AWS Lambda, Google Cloud Functions, Azure Functions: deploy a function, pay per invocation and per millisecond, no server in sight. The function model is ideal for event-driven glue (respond to S3 writes, handle webhooks, run lightweight APIs) and poor for anything requiring long-lived state or specialised hardware. A middle category — AWS Batch, GCP Batch, Azure Batch — handles large one-off jobs (simulation sweeps, large rendering, bulk data processing) by scheduling containers onto temporary VM fleets without forcing the team to run the cluster itself. For ML practitioners specifically, managed training jobs (Section 11) are a purpose-built variant of this batch idea.
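
As a sketch of the event-driven glue pattern, here is a minimal Lambda-style handler reacting to S3 object-created notifications. The bucket and key locations follow the standard S3 event payload shape; the handler body is a placeholder for whatever downstream work the function would trigger:

```python
import urllib.parse

def handler(event, context=None):
    """Minimal Lambda-style handler for S3 object-created events.

    Parses bucket and key out of the standard S3 notification payload
    and returns them; a real function would kick off downstream work
    (enqueue a job, update a catalogue entry, call a pipeline API).
    """
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append({"bucket": bucket, "key": key})
    return results
```

The per-invocation billing model means this function costs nothing while no objects arrive — exactly the workload shape serverless is built for.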

Climb the ladder, do not skip it

The most common compute mistake on the cloud is picking the wrong rung. VMs for a simple web API (too much ops), serverless for a steady-state service (cold starts, limits), Kubernetes for a two-person team (the cluster itself becomes a full-time job). A useful default: start at the highest rung that fits the workload, drop a rung only when a concrete requirement forces it.

Section 04

Storage, object, block, file, and the archive tier underneath them

Cloud storage divides into a small number of distinct services, each with a different consistency model, access pattern, and price. Treating them interchangeably is where most storage-cost disasters begin; matching the workload to the right service is where most data platforms earn their efficiency.

Object storage is the foundation

Object storage — S3, GCS, Azure Blob — is the single most load-bearing service in the cloud. Flat namespace of keys, HTTP-accessible, eleven nines (99.999999999%) of durability, regional or multi-regional, with lifecycle policies and strong read-after-write consistency. S3 is the reference implementation in the sense that its API has become the de facto standard most tooling speaks, and that almost every analytics engine and ML framework can read and write it directly. It is cheap enough that it has become the substrate for data lakes (Chapter 02), for streaming archives, for model artefact stores, and for anything else that is written once and read many times. If there is a single service whose documentation every cloud engineer should have read, it is this one.
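
Because the namespace is flat, a lake's "directory" layout is purely a key-naming convention. A small sketch of the usual Hive-style partitioned layout — the zone and dataset names and the helper itself are illustrative, not a standard API:

```python
from datetime import date

def lake_key(zone: str, dataset: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned object key.

    Object stores have no real directories -- the slashes are just
    characters in the key -- but engines such as Spark, Athena, and
    BigQuery external tables all understand this prefix convention
    and use it to prune partitions at query time.
    """
    return (f"{zone}/{dataset}/year={d.year}/month={d.month:02d}/"
            f"day={d.day:02d}/{filename}")
```

A consistent key scheme like this is what makes lifecycle rules, partition pruning, and IAM prefix scoping (Section 06) possible later.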

Block and file

Block storage — EBS, Persistent Disk, Azure Managed Disks — is what a VM's root and data volumes are carved from. Feels like a local disk, is actually a networked SAN, comes in multiple performance tiers (standard, SSD, NVMe provisioned-IOPS). It is the right storage for anything that wants POSIX semantics for a single instance. File storage — EFS, Filestore, Azure Files — adds a shared mount across many instances, usable as an NFS endpoint; it is the right choice when multiple VMs or pods need to share a filesystem, at noticeably higher cost than block. For most data-platform use cases, object storage is cheaper and more durable than either; block and file matter when a specific tool insists on them.

Archive and lifecycle

Below the hot storage tiers sit the archive tiers — S3 Glacier and Glacier Deep Archive, GCS Archive, Azure Archive. They trade a factor of ten to fifty on storage cost for a retrieval latency measured in hours and a per-retrieval charge. Data that has to be kept for compliance but will almost never be read lives here; the usual practice is a lifecycle rule that automatically transitions objects from standard storage to cold storage to archive as they age. Sensible lifecycle rules are the single highest-leverage thing a data engineer can do for cloud storage cost, and one of the most commonly missed.
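
As an illustration, a lifecycle configuration in the shape S3's API accepts — the dict mirrors what boto3's put_bucket_lifecycle_configuration takes, while the prefix, day counts, and retention period are placeholder choices to adapt:

```python
# Illustrative S3-style lifecycle configuration: transition ageing
# objects to cheaper tiers, then expire them. The prefix, day counts,
# and retention period below are placeholders, not recommendations.
lifecycle = {
    "Rules": [
        {
            "ID": "age-out-raw-events",
            "Filter": {"Prefix": "raw/events/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 180, "StorageClass": "GLACIER"},     # archive tier
            ],
            "Expiration": {"Days": 2555},  # roughly 7 years, then delete
        }
    ]
}
```

One such rule, applied bucket-wide, often cuts the storage line of the bill by more than any amount of manual cleanup ever will.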

Storage is cheap; egress is not

The list price of cloud object storage looks small. The list price of egress — data leaving the cloud provider's network — does not. A query run in the wrong region, a training job downloading datasets cross-cloud, an analytics tool pulling data back to a laptop — all of them can generate bills that dwarf the underlying storage cost. Egress discipline is the other half of storage economics, and it is the subject of its own subsection in Section 14.
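
A back-of-envelope comparison makes the asymmetry concrete. The per-GB prices below are placeholders in the historical range for standard-tier storage and internet egress, not quoted rates:

```python
def monthly_storage_cost_usd(gb: float, price_per_gb_month: float = 0.023) -> float:
    """Standard-tier object storage for one month. 0.023 USD/GB-month
    is a placeholder near historical standard-tier pricing."""
    return gb * price_per_gb_month

def egress_cost_usd(gb: float, price_per_gb: float = 0.09) -> float:
    """Internet egress. 0.09 USD/GB is a placeholder in the range the
    hyperscalers have historically charged; check the current sheet."""
    return gb * price_per_gb

# Pulling a 5 TB dataset out of the cloud once costs several times
# what it costs to simply store it for a whole month -- the asymmetry
# this section warns about.
```

Run the same arithmetic before any cross-region or cross-cloud copy and the "free to read my own data" intuition dies quickly.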

Section 05

Networking, the virtual data centre you rent before you rent anything else

Before a single VM or container runs, there is a virtual network to design. Cloud networking is a full-surface version of what a physical data centre provides — private address spaces, subnets, routing, firewalls, load balancers, DNS, VPNs — exposed as software and billed per resource. Mistakes here tend to be corrosive: visible in the bill, hard to unwind, and load-bearing for every security decision that follows.

VPCs, subnets, routing, and firewalls

Every cloud project starts inside a Virtual Private Cloud — AWS VPC, GCP VPC, Azure VNet — a software-defined private network with its own address range. VPCs are divided into subnets per availability zone, with route tables controlling how traffic reaches other subnets, the internet, or other VPCs. Security groups (AWS) or firewall rules (GCP, Azure) define which packets are allowed where. Done well, this layer gives clean isolation between environments; done poorly, it produces either a flat network where every service is reachable from every other, or a tangle of overlapping rules no one fully understands. Getting this right early is cheaper than fixing it later.
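
Subnet planning is ordinary CIDR arithmetic, which the standard library can sketch. The /16 VPC and /20-per-AZ sizes below are common defaults, not requirements:

```python
import ipaddress

def plan_subnets(vpc_cidr, zones, new_prefix=20):
    """Carve a VPC address range into one subnet per availability zone.

    Mirrors the usual AWS-style layout (one subnet per AZ inside a
    VPC). Each /20 holds ~4k addresses; adjust new_prefix to taste.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)
    return {zone: str(next(blocks)) for zone in zones}
```

Planning the address space up front — leaving room for peered VPCs and on-prem ranges that must not overlap — is one of the few cloud decisions that is genuinely hard to change later.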

Load balancers, DNS, and CDNs

Once traffic is inside the VPC, load balancers (AWS ELB/ALB/NLB, GCP Cloud Load Balancing, Azure Load Balancer/App Gateway) distribute it across backend instances. DNS services (Route 53, Cloud DNS, Azure DNS) map names to load-balancer addresses and to managed services. On the public edge, content delivery networks (CloudFront, Cloud CDN, Azure Front Door) cache responses close to end users; they also terminate TLS, enforce WAF rules, and absorb denial-of-service traffic. For anything with user-facing latency, the CDN is a first-order component, not a luxury.

Private connectivity and peering

Two patterns come up repeatedly as a cloud footprint grows. VPC peering and Transit Gateway-style hubs connect VPCs to each other within a region or across regions, keeping traffic on the provider's backbone rather than the public internet. Private endpoints (AWS PrivateLink, GCP Private Service Connect, Azure Private Link) expose managed services inside the VPC's address space, so a call to the storage or warehouse service never leaves the private network. Hybrid connections to on-prem data centres use dedicated-interconnect products — AWS Direct Connect, GCP Interconnect, Azure ExpressRoute — which are more expensive than VPN but offer deterministic bandwidth and latency. These are the pieces that make a real cloud estate hang together with its outside world.

The bill as an architecture diagram

An under-appreciated truth: the cloud bill is one of the best available diagrams of the network. A sudden spike in inter-region transfer, a surge in NAT gateway charges, a new appearance of cross-AZ traffic — each of these is a signal about the actual topology. Monthly cost review, read with a networking eye, catches architectural drift that no design review will.
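
In practice this review is a small aggregation over billing line items. A toy sketch — the dict shape is simplified, but real exports such as the AWS Cost and Usage Report carry the same information under longer column names:

```python
def cost_by_category(line_items):
    """Roll billing line items up by (service, category) and rank by
    spend -- the 'read the bill with a networking eye' step."""
    totals = {}
    for item in line_items:
        key = (item["service"], item["category"])
        totals[key] = totals.get(key, 0.0) + item["cost"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical month of line items for one account.
bill = [
    {"service": "EC2", "category": "compute", "cost": 410.0},
    {"service": "EC2", "category": "inter-region transfer", "cost": 935.0},
    {"service": "S3", "category": "storage", "cost": 120.0},
]
# The transfer line outranking compute is the architectural signal:
# something is talking across regions that probably should not be.
```
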

Section 06

Identity, the layer that decides what every other service is allowed to do

Identity is the cross-cutting primitive. Every API call in the cloud — a VM launching, a function reading from S3, a training job writing a model artefact — carries a principal, and every service answers two questions about it: who are you and what are you allowed to do. The design of that system is often the single most consequential security decision a cloud team makes.

IAM across the three clouds

AWS IAM, Google Cloud IAM, and Microsoft Entra ID (formerly Azure AD, with Azure RBAC for resource permissions) share a conceptual model: principals (users, service accounts, workload identities) are granted permissions to perform actions on resources, often scoped by conditions. The details differ: AWS favours a powerful-but-verbose JSON policy language with explicit allow/deny; GCP leans on predefined and custom roles applied at the project/folder/organisation level; Azure blends RBAC with Entra ID groups and conditional access. Each one, well-used, supports least privilege — a principal gets the narrowest set of permissions it actually needs — which is the discipline most breaches trace back to when it is missing.
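
Least privilege is concrete: a policy granting read access to one prefix of one bucket and nothing else. The JSON below follows IAM's policy grammar; the bucket and prefix names are illustrative:

```python
import json

def read_only_prefix_policy(bucket: str, prefix: str) -> str:
    """Produce an AWS-style least-privilege policy document: read one
    prefix of one bucket, list only under that prefix, nothing else."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read objects under the prefix only.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # Listing is a bucket-level action, so it is scoped
                # to the prefix via a condition instead.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
        ],
    }
    return json.dumps(policy, indent=2)
```

The discipline is in what is absent: no wildcard actions, no write permissions, no bucket-wide resource ARN.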

Service accounts, workload identity, and short-lived credentials

Services should not carry long-lived passwords. The modern pattern is workload identity: the workload presents proof of what it is (a pod running with a specific Kubernetes service account, a VM running with a specific instance profile, a GitHub Actions runner with an OIDC token) and the cloud identity system exchanges that proof for a short-lived credential scoped to the exact resources it needs. AWS's IAM Roles for Service Accounts (IRSA) on EKS, GCP's Workload Identity Federation, Azure's Workload Identity for AKS are all variants of the same idea. Done correctly, there are no API keys checked into code anywhere in the system; done incorrectly, there are access keys in environment variables that nobody is sure who issued. The gap between these two states is the gap most cloud security reviews focus on.

KMS and secrets management

Two adjacent services handle the other half of the problem. Key management — AWS KMS, Cloud KMS, Azure Key Vault — holds cryptographic keys and exposes envelope-encryption APIs, with all key use logged and auditable. Storage services, databases, and managed pipelines can be configured to encrypt at rest using customer-managed keys, which is often a compliance requirement. Secrets management — AWS Secrets Manager and Parameter Store, GCP Secret Manager, Azure Key Vault — stores rotating secrets (database passwords, API keys, OAuth tokens) and hands them to workloads via IAM-gated API calls rather than via files on disk. Using these services instead of rolling secrets into config is a one-time effort with very large downstream payoff.

The zero-trust default

The operational standard for modern cloud platforms is something like zero trust: assume no network location is inherently safe, authenticate and authorise every request, scope every credential to the minimum required, and make every access auditable. The implementation details vary; the principle does not. Platforms that ship this from day one rarely have catastrophic breaches; platforms that retrofit it after an incident usually do.

Section 07

Cloud data warehouses, the shape analytics took once storage and compute split

Chapter 02 introduced the cloud data warehouse as the modern analytical-storage default. This section grounds that abstraction in the specific services: BigQuery (GCP), Snowflake (multi-cloud), Redshift (AWS), and Azure Synapse Analytics — four products that now do the overwhelming majority of serious analytical SQL.

BigQuery

BigQuery is the purest expression of the serverless warehouse. No clusters to provision; queries run against managed compute pools, billed either per terabyte scanned (on-demand) or by reserved slots. Tables are stored in BigQuery's proprietary columnar format (Capacitor), with native Parquet/Iceberg/Hive-metadata integration through BigQuery external tables and BigLake. The combination — zero operational overhead, petabyte-scale queries in seconds, a deep ML-adjacent feature set (BigQuery ML, vector search, streaming ingestion) — makes it the obvious default on GCP and a genuine candidate on other clouds.
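
On-demand pricing makes query cost a one-line calculation. The per-TiB rate below matches BigQuery's published on-demand price at the time of writing, but treat it as an assumption to verify against the current price sheet:

```python
def scan_cost_usd(bytes_scanned: int, price_per_tib: float = 6.25) -> float:
    """On-demand query cost at a per-TiB-scanned rate.

    The lever this exposes: partitioning and clustering reduce
    bytes_scanned, which reduces cost linearly -- a well-partitioned
    table can make the same logical query 100x cheaper.
    """
    return bytes_scanned / 2**40 * price_per_tib
```

The same arithmetic is how teams decide when to switch from on-demand to reserved slots: once steady monthly scan volume exceeds what the flat slot commitment would cost, the reservation wins.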

Snowflake and Redshift

Snowflake separates storage (on the underlying cloud's object store) from compute (virtual warehouses, each an independently-sized compute cluster that can be resized or suspended per workload). Its identity as a cross-cloud vendor — runs on AWS, GCP, and Azure with a uniform experience — is a significant part of its appeal; so is the time-travel, zero-copy cloning, and data-sharing feature set. Amazon Redshift was the first cloud MPP warehouse and has since been extensively redesigned around separation of storage and compute (Redshift RA3 nodes, Redshift Serverless) and tighter integration with the AWS analytics catalogue (Glue, Lake Formation, Athena). Azure Synapse Analytics plays a similar role on Azure, bundling a dedicated SQL pool, serverless SQL on the lake, and Spark pools into one product surface.

How to choose

The choice among these four is usually dominated by which cloud the team is already committed to. BigQuery is the obvious answer on GCP; Redshift and Snowflake both fit AWS, with Snowflake's multi-cloud story and Redshift's AWS-native tooling pulling in opposite directions; Synapse is the natural Azure default, with Snowflake again a common alternative. The technical differences matter more for very large workloads: streaming ingestion favours BigQuery, time-travel and data-sharing favour Snowflake, tight S3/Glue integration favours Redshift, tight Microsoft-stack integration favours Synapse. All four are mature enough that most teams will never hit a capability limit; the ergonomics of the cloud the team already runs on usually decides.

DuckDB and the single-node alternative

For teams whose data is not actually big — and many teams have more data than Pandas can handle but less than a cloud warehouse is worth — a single-node engine like DuckDB on a large VM is increasingly competitive. The cloud warehouse earns its place once the dataset, concurrent user count, or orchestration complexity exceeds what one beefy machine can do, and not earlier.

Section 08

Lake and lakehouse services, the object store plus the engines that query it

A data lake is an object store plus a catalogue plus a set of engines that know how to read the files. The cloud providers ship all three as managed services — including managed versions of the Spark, Flink, and Trino engines that actually run the queries — so "build a lakehouse" has gone from a multi-quarter engineering project to a set of Terraform resources.

The storage layer

The raw substrate is the object store from Section 04 — S3, GCS, ADLS Gen2 — storing Parquet, ORC, or Avro (Chapter 02) organised as Iceberg, Delta Lake, or Hudi tables. On top of that sits a managed catalogue: AWS Glue Data Catalog, GCP Dataplex, Microsoft Purview, and Databricks Unity Catalog. The catalogue holds schemas, partitions, and statistics, and is what engines query to plan reads without having to list every file. For many teams the catalogue becomes the most important piece of the stack over time, because it is the one that every engine — warehouse, notebook, pipeline, BI tool — depends on for metadata.

Managed analytics engines

On top of lake storage, the clouds ship a catalogue of managed engines. AWS Athena runs Trino against S3 with no cluster at all; pay per terabyte scanned, queries return in seconds for most analytical shapes. AWS EMR and Google Cloud Dataproc are managed Spark/Hadoop/Flink clusters that spin up on demand. Azure Synapse Spark pools and Databricks (on all three clouds) run Spark with heavy-weight IDE and catalogue integration. For most data teams, these managed engines have displaced the practice of operating long-running clusters directly on VMs; the provider handles patching, autoscaling, and most of the tuning.

The lakehouse convergence

The "lakehouse" pattern — warehouse-style tables stored in open formats on the lake, queryable by both warehouse engines and Spark/Python directly — has become the ambient assumption for new platforms. BigQuery's BigLake and Iceberg tables, Snowflake's Iceberg tables, Databricks's Delta Lake and Unity Catalog, and AWS's S3 Tables / Iceberg-first Glue catalogue all point in the same direction: a single storage layer, multiple engines on top, and queries that do not care which engine runs them. The architectural consequence is that choosing "warehouse" versus "lake" has become a softer decision than it was five years ago; in practice most platforms end up with both, reading the same tables.

Table format is the durable commitment

Engines come and go; storage formats are forever, or close enough. A lake built on Iceberg or Delta tables today will still be queryable in ten years; a lake built as a pile of CSVs will be readable but unpleasant. The table-format decision is the one worth getting right early; the choice of engine on top can always be revisited.

Section 09

Managed streaming, the same log abstraction with a service contract wrapped around it

Chapter 04 described the streaming world around the partitioned append-only log. Cloud providers sell that abstraction as a service: AWS Kinesis and MSK, GCP Pub/Sub, Azure Event Hubs. The underlying primitives are familiar; the operational shape is very different from running Kafka yourself, and the choice between "managed Kafka" and "managed proprietary service" is one most streaming platforms eventually make.

Managed Kafka

The cloud-native way to run Kafka without operating it. AWS MSK (Managed Streaming for Apache Kafka) runs a real Kafka cluster on your behalf, with the same wire protocol, the same client libraries, and the same semantic model as self-hosted Kafka. Confluent Cloud (on all three clouds) is the vendor-managed alternative, with extra features on top (tiered storage, Flink, data governance). Azure Event Hubs for Kafka exposes a Kafka-compatible endpoint on top of Event Hubs. For teams already standardised on Kafka — connectors, schemas, tooling — managed Kafka is a near-drop-in replacement for running it yourself.

Provider-native streaming

The cloud providers also ship their own streaming services with simpler semantics. AWS Kinesis Data Streams: partitioned, ordered-by-shard, paid by shard-hour and throughput, with Kinesis Data Firehose for buffered delivery and Kinesis Data Analytics (now Managed Service for Apache Flink) for processing. GCP Pub/Sub: partition-free (internally scaled), at-least-once delivery with optional exactly-once, minutes to deploy, natural fit for Cloud Functions and Dataflow. Azure Event Hubs: a partitioned, Kafka-adjacent service, with Event Grid and Service Bus as close cousins covering different messaging shapes. The provider-native services trade Kafka's raw power and portability for simpler operations and tight integration with the rest of the cloud catalogue.
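
The partition-key-to-shard mapping underneath Kinesis can be sketched in a few lines. This simplification assumes the shards split the 128-bit MD5 keyspace into equal contiguous ranges, which holds for a freshly created stream (resharding changes the ranges):

```python
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a record to a shard, Kinesis-style (simplified): MD5 the
    partition key into a 128-bit integer, then find which shard's
    contiguous hash range it falls into. Records sharing a partition
    key always land on the same shard, which is what gives per-key
    ordering."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return h * num_shards // 2**128
```

The practical consequence is the same one Chapter 04 drew for Kafka: a skewed partition key produces a hot shard, and no amount of added shards fixes it.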

Stream processing on top

The compute side of streaming — what Chapter 04 covered with Flink, Spark Structured Streaming, and Kafka Streams — also comes in managed form. AWS Managed Service for Apache Flink, GCP Dataflow (running Apache Beam, which in turn runs on Flink or Spark), Azure Stream Analytics, Databricks Structured Streaming. These take the operational load of the processor off the team and leave the application semantics (watermarks, windows, state) squarely in the team's hands. A typical modern streaming platform is a managed log plus a managed processor plus the team's business logic; the surface area that used to demand a dedicated team is now closer to a dedicated person.
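
The application semantics the team keeps can be stripped down to a toy: a tumbling-window count, the core of what the managed processors execute, minus the hard parts (state backends, watermarks, late-data handling) that the service still leaves the team to configure:

```python
def tumbling_counts(events, window_seconds=60):
    """Toy tumbling-window count over (timestamp, key) pairs.

    Each event lands in the window whose start is its timestamp
    rounded down to the window size; counts accumulate per
    (window_start, key). Real engines do this incrementally over an
    unbounded stream with managed state; the windowing arithmetic is
    the same.
    """
    windows = {}
    for ts, key in events:
        start = ts - ts % window_seconds
        windows[(start, key)] = windows.get((start, key), 0) + 1
    return windows
```
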

Portability is a real consideration

Kafka's wire protocol is an open standard; Pub/Sub's, Kinesis's, and Event Hubs's native APIs are not. That asymmetry matters in two scenarios: multi-cloud deployments, and switching providers. Teams that consider either of those a realistic possibility often pay the small extra complexity to standardise on Kafka (self-hosted, MSK, or Confluent) rather than lock their producers and consumers to a provider-native API.

Section 10

Managed orchestration, somebody else runs Airflow for you

Chapter 03 introduced Airflow, Prefect, and Dagster as the dominant orchestrators. Each cloud provider ships at least one managed version, which removes the question of "how do I run Airflow in production?" in exchange for provider-specific opinions about where DAGs live, how secrets are handled, and how the UI is secured.

Managed Airflow

The closest thing to a default across all three clouds. AWS Managed Workflows for Apache Airflow (MWAA), Google Cloud Composer, and — less commonly — Astronomer's managed service (multi-cloud) all run real Airflow on managed Kubernetes, with DAGs stored in an object-store bucket, secrets delegated to the cloud secrets manager, and identity wired through the cloud's IAM. The developer experience is the same Airflow one: Python DAG files, operators, tasks, scheduler, web UI. The operational delta is enormous: no scheduler to babysit, no worker autoscaler to maintain, no upgrade cycle to plan. For most teams that picked Airflow, the question is no longer "run it ourselves or use a managed service" but "which managed service, on which cloud".
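Underneath every managed wrapper, the scheduler's core job is unchanged: resolve the DAG and run tasks in dependency order. A minimal sketch using Python's standard-library graphlib (task names hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract -> transform -> two independent sinks.
# Each key's value set lists the tasks it depends on.
dag = {
    "transform": {"extract"},
    "load_warehouse": {"transform"},
    "publish_metrics": {"transform"},
}

order = list(TopologicalSorter(dag).static_order())

# The scheduler may interleave the two sinks, but the dependency
# constraints always hold.
assert order.index("extract") < order.index("transform")
assert order.index("transform") < order.index("load_warehouse")
assert order.index("transform") < order.index("publish_metrics")
```

What MWAA and Composer add is everything around this loop: retries, worker pools, secrets, identity, and the UI — the parts that are expensive to run well.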

Azure Data Factory and the visual lineage

Azure's primary orchestration product is Azure Data Factory (with its Synapse-integrated variant, Synapse Pipelines), which predates the Airflow-everywhere era and has a different design sensibility: visual DAG builder, strong built-in connectors to Azure services, declarative pipelines, and a UI-first authoring experience. Teams used to the Airflow model find it unfamiliar; teams coming from SSIS or from the Microsoft analytics stack tend to find it natural. Azure also now offers Managed Airflow for Data Factory for teams that want the familiar Airflow model on Azure.

Dagster Cloud and Prefect Cloud

The newer orchestrators come with their own hosted products. Dagster Cloud (runs on AWS infrastructure, deployable into the team's own cloud) and Prefect Cloud offer managed control planes with local or cloud execution. These are not provider services in the same sense as MWAA or Composer — they are independent SaaS vendors — but they are the natural managed options for teams that have chosen those orchestrators rather than Airflow. They also tend to move faster on features than the cloud-provider variants, which are constrained by each cloud's broader release cadence.

Orchestration is the "glue" and glue has to be boring

Whatever orchestrator is chosen, the production discipline is the same: DAGs in version control, secrets in the cloud secrets manager, tasks scoped to narrow IAM roles, observability through the cloud's metrics/logs stack, alerting on run failures. The service choice matters less than the consistency of these practices; an unmanaged Airflow run well is usually better than a managed one run carelessly.

Section 11

ML platforms, SageMaker, Vertex AI, Azure Machine Learning, and what the clouds bundle for ML

Each hyperscaler now ships an end-to-end ML platform: notebooks, managed training jobs, hyperparameter tuning, experiment tracking, a feature store, a model registry, managed serving, and a pipeline DSL to stitch them together. They are not the only way to do ML on the cloud — many teams compose the individual primitives themselves — but they are the default the clouds' own documentation assumes, and worth knowing on their own terms.

Amazon SageMaker

The oldest of the three, and the broadest in catalogue. SageMaker Studio (the IDE), SageMaker Training (managed training jobs on CPU/GPU instances, including distributed training), SageMaker Experiments, SageMaker Pipelines, SageMaker Feature Store, SageMaker Model Registry, SageMaker Endpoints for serving, plus a rotating menu of higher-level products (Autopilot, JumpStart, HyperPod for large-model training). The catalogue is encyclopedic to the point of being confusing; the compensation is depth — anything that can be done with containers on AWS can probably be done inside SageMaker with fewer moving parts.

Google Vertex AI

Google's consolidation of its earlier AI Platform and AutoML products. Vertex AI Workbench for notebooks, Vertex Training, Vertex Pipelines (on Kubeflow Pipelines), Vertex Feature Store, Vertex Model Registry, Vertex Endpoints for serving, and a tight integration with BigQuery for data and with Google's foundation models through Vertex AI Model Garden. The design tends to favour convention over configuration more than SageMaker does; the TPU integration is unique among the three clouds; the tie-in to Google's research models (Gemini, Imagen) is the main reason many teams on other clouds still use Vertex selectively.

Azure Machine Learning and Azure AI

Azure Machine Learning covers broadly the same surface as SageMaker and Vertex — workspaces, compute targets, training jobs, pipelines, model registry, endpoints — with closer integration to Microsoft's broader enterprise stack (Purview, Synapse, Power BI). The newer Azure AI Foundry layer sits above it, aimed specifically at generative-AI application development and tightly wrapped around Azure OpenAI. The two together are the strategy: Azure ML as the classical-ML platform, Azure AI for the generative-AI surface.

Platform or primitives?

There is a real trade-off between adopting an end-to-end platform (faster to start, easier to hire for, opinionated in helpful ways) and composing the primitives yourself with Kubernetes + Kubeflow / MLflow / Airflow / vLLM (slower to start, more control, more portable). Sophisticated teams often end up doing both — the platform for most of the team, custom primitives for the pieces that justify the engineering investment. The choice is rarely permanent either way.

Section 12

Accelerators, the hardware that turned ML into a cloud workload

The economics of cloud ML run almost entirely on specialised accelerators. GPUs — NVIDIA's A100, H100, H200, B100/B200 generations — dominate; TPUs (Google) are the credible alternative; AWS Trainium and Inferentia, and Azure Maia, are the hyperscalers' attempts to reduce their NVIDIA exposure. Understanding what each cloud actually offers, and how capacity is obtained, is most of what "doing ML on the cloud" reduces to.

GPU instance families

AWS names its GPU VMs by letter prefix (the P-series for NVIDIA training, G-series for inference and graphics, with P4/P5/P6 tracking the A100/H100/B-series generations). GCP's A2 and A3 families track the A100 and H100 respectively, with G2 for inference. Azure's ND-series and NC-series play the same role. Every one of them rents the same underlying NVIDIA silicon; the cost differences are typically small and driven by committed-use discounts and spot/preemptible pricing rather than by the hardware itself. For distributed training, the network side matters as much as the GPUs: AWS EFA, GCP's GPU-optimised A3 networking, and Azure's InfiniBand fabric are the pieces that make multi-node all-reduces fast enough to run at scale.
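A back-of-envelope for why that interconnect matters: a bandwidth-optimal ring all-reduce moves roughly 2(n−1)/n of the gradient size per GPU per step. A sketch, with illustrative numbers (latency and compute-communication overlap ignored):

```python
def ring_allreduce_seconds(grad_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Lower-bound time for one ring all-reduce: each GPU sends and
    receives 2*(n-1)/n of the gradient payload over its link.
    A rough estimate, not a benchmark."""
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return bytes_on_wire / (link_gbps * 1e9 / 8)  # Gb/s -> bytes/s

# A 7B-parameter model's fp16 gradients are ~14 GB; across 8 GPUs on a
# 400 Gb/s fabric the all-reduce floor is about half a second per step.
t = ring_allreduce_seconds(14e9, 8, 400)
assert 0.4 < t < 0.7
```

Halve the link bandwidth and the floor doubles, which is why the EFA/InfiniBand column of the spec sheet is as load-bearing as the GPU column.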

TPUs and cloud-proprietary accelerators

TPUs are Google's custom tensor-processing units; they come in per-chip VMs (Cloud TPU VMs) and in larger interconnected pods. For workloads with well-optimised JAX or TensorFlow implementations, they can deliver substantially better price-performance than GPUs on equivalent problems; for arbitrary PyTorch code the story is more mixed, though PyTorch/XLA has improved that meaningfully. AWS Trainium and Inferentia chips are the AWS-side equivalents, accessed through the Neuron SDK; their ecosystem is narrower than CUDA's but improving. Teams at the large end of the capacity spectrum — frontier-model training runs — routinely mix accelerator families to hedge against availability and price.

Spot, reservations, and the capacity market

GPU capacity is scarce enough that obtaining it has become its own discipline. On-demand pricing is the nominal rate; spot (AWS) / preemptible (GCP) / low-priority (Azure) instances are the same hardware at 60–90% discount, with the catch that the provider can reclaim them with short notice — ideal for checkpoint-able training jobs and inference autoscaling, wrong for interactive or tight-deadline work. Capacity reservations and committed-use discounts trade flexibility for guaranteed availability and a 30–50% discount over three years. For teams training large models, the actual cost is often dominated by a contract negotiation rather than by per-hour list prices; for teams doing inference at scale, the cost is dominated by how cleanly the workload can autoscale into spot capacity.

Capacity is the scarce resource

A running theme across the 2023–2026 period: getting the hardware you want, when you want it, is harder than paying for it. Training clusters are often shaped by what capacity can actually be acquired, not what the research plan calls for; multi-cloud strategies are often pursued specifically to hedge GPU availability. This is an inconvenient fact about cloud ML that the marketing does not emphasise.

Section 13

Regions and zones, the geography that decides what survives what failure

Every cloud provider organises its hardware geographically: regions are continents-to-country-scale groupings, availability zones are independently-powered data centres inside a region. Knowing where a workload lives is most of what decides its latency, its durability, its compliance posture, and its disaster-recovery profile.

Regions and availability zones

A region (e.g., us-east-1, europe-west4, eastus2) is a set of data centres close enough to support synchronous replication between them. A zone inside a region is one of those data centres — independently powered, independently cooled, independently networked, so that the failure of one does not take down another. "Multi-AZ" is the cheapest durability upgrade available: most managed databases, caches, and storage services offer it as a configuration option, often at modest cost, and it protects against the single most common cloud failure (a single data-centre outage). Multi-AZ should be the default for anything load-bearing; single-AZ is a conscious choice to accept that risk in exchange for lower cost.
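The value of multi-AZ falls out of simple probability, under the idealised assumption that zones fail independently (correlated failures make real numbers worse):

```python
def annual_outage_hours(zone_availability: float, zones: int) -> float:
    """Expected outage hours per year when the service survives as long
    as at least one zone is up and zones fail independently."""
    p_all_down = (1 - zone_availability) ** zones
    return p_all_down * 24 * 365

single = annual_outage_hours(0.999, 1)  # ~8.8 hours/year down
multi = annual_outage_hours(0.999, 2)   # well under a minute/year
assert multi < single / 100
```

Three nines per zone becomes roughly six nines across two — which is why multi-AZ is the cheapest durability upgrade on the menu.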

Cross-region and edge

Crossing region boundaries is a different story. Regions are independently operated and can (and occasionally do) fail entirely; a serious outage of a single region is the disaster the disaster-recovery plan has to survive. Cross-region strategies range from active-active (the workload runs in two regions, traffic is distributed, a region loss is invisible to users) through warm-standby (the second region is on but receives no live traffic) to cold-standby (the second region is configured but not running). Each step down the ladder is cheaper and slower to recover; the right choice depends on recovery-time-objective (RTO) and recovery-point-objective (RPO) targets. On top of all of this, edge locations — CDNs, edge compute (CloudFront Functions, Cloudflare Workers) — push latency-sensitive work closer to users but are usually not the place for durable state.
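The RTO-to-tier mapping can be written down explicitly. The thresholds below are illustrative, not vendor guidance — the honest numbers come from measuring your own failover:

```python
def dr_tier(rto_minutes: float) -> str:
    """Cheapest cross-region tier that can plausibly meet a given
    recovery-time objective (thresholds are illustrative)."""
    if rto_minutes < 5:
        return "active-active"   # region loss invisible to users
    if rto_minutes < 60:
        return "warm-standby"    # second region on, no live traffic
    return "cold-standby"        # configured but not running

assert dr_tier(2) == "active-active"
assert dr_tier(30) == "warm-standby"
assert dr_tier(240) == "cold-standby"
```

RPO constrains the replication side the same way: a near-zero RPO forces continuous cross-region replication regardless of which compute tier is chosen.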

Data residency and compliance

Region choice is also a compliance decision. GDPR, India's DPDP Act, Australia's Privacy Act, and US sector-specific rules and audit regimes (HIPAA, FedRAMP, SOC 2) all impose constraints on where data can be stored and processed and how it must be moved. Cloud providers publish per-region compliance certifications; many maintain sovereign or government-specific regions (AWS GovCloud, Azure Government, Google Distributed Cloud Sovereign) for workloads that require stricter controls. For teams operating in regulated industries, this layer often decides the overall architecture before any engineering question is asked.

Test the failover

A disaster-recovery plan that has never been exercised is a plan that does not work. Running a full region-loss drill at least annually is the only reliable way to find the dependency on a misconfigured IAM policy, the hard-coded region name, the staging-only DNS entry — all of which are exactly the problems that will surface during the actual outage. Teams that do this routinely have uneventful region failures; teams that do not have spectacular ones.

Section 14

Cost, the cloud's unavoidable engineering problem

The cloud bill is a real engineering artefact. It reflects every design decision, every forgotten instance, every cross-region transfer, every unreviewed service. The discipline of managing it — often called FinOps — is less a finance role than a software engineering practice, and it is one most teams learn the expensive way.

Pricing models

Every cloud service charges along one or more of a small set of axes: per hour of compute, per GB of storage, per GB of data transferred, per request, per query's bytes scanned, per provisioned throughput. Knowing which axes a given service bills on is the first step toward controlling it. On-demand pricing is the default and most expensive. Committed-use discounts (AWS Savings Plans, GCP Committed Use Discounts, Azure Reservations) trade a 1- or 3-year commitment for a 30–50% discount and are appropriate for steady-state workloads. Spot / preemptible pricing (Section 12) gives a further 60–90% discount on compute in exchange for eviction risk. The mix of these three is what decides whether a given cloud estate costs 2x or 0.5x what the naive list price would suggest.
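The effect of blending the three tiers is easy to compute. A sketch with illustrative fractions and discounts — committed-use covering the steady base, spot covering the interruptible tail, on-demand covering the rest:

```python
def blended_hourly_cost(list_rate, committed_frac, committed_discount,
                        spot_frac, spot_discount):
    """Blended effective rate across committed, spot, and on-demand
    tiers. Fractions and discounts here are illustrative."""
    on_demand_frac = 1 - committed_frac - spot_frac
    return list_rate * (committed_frac * (1 - committed_discount)
                        + spot_frac * (1 - spot_discount)
                        + on_demand_frac)

# 60% committed at 40% off, 30% spot at 70% off, 10% on-demand:
# the estate runs at 55% of the naive list price.
rate = blended_hourly_cost(10.0, 0.60, 0.40, 0.30, 0.70)
assert abs(rate - 5.5) < 1e-9
```

Shifting the mix is usually the single largest cost lever available without touching the architecture at all.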

Egress, the silent multiplier

Data leaving the cloud — egress — is the single most punitive charge in the standard cloud pricing sheet. Transferring 1 TB between clouds, or out to the public internet, routinely costs $70–$100 on top of the storage cost. This is why cross-cloud architectures are rare outside specific scenarios, why "bring your analytics to the data" is the default rather than the inverse, why CDNs pay for themselves on high-traffic websites, and why cross-region database replication gets expensive fast. Reading the egress page of the provider's pricing sheet before designing any distributed architecture is worth doing once, explicitly, for every architect on the team.
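The arithmetic behind that range is worth doing once explicitly, at a typical (illustrative) internet-egress rate — actual rates vary by provider, region, and volume tier:

```python
def egress_cost_usd(gb_out: float, rate_per_gb: float = 0.09) -> float:
    """Internet egress cost at an illustrative ~$0.09/GB hyperscaler
    rate; check the provider's pricing page for the real tiered rates."""
    return gb_out * rate_per_gb

# 1 TB out of the cloud lands squarely in the $70-$100 range the
# pricing sheets imply.
cost = egress_cost_usd(1024)
assert 70 <= cost <= 100
```

Scale it up and the design pressure becomes obvious: replicating 100 TB across clouds monthly is roughly a six-figure annual line item before any compute is paid for.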

FinOps as a practice

The working discipline has three pieces. Cost attribution: tag every resource with team, project, and environment so the bill can be sliced by the people who control the spend. Cost observability: dashboards on weekly spend by service and by tag, alerts on anomalies, the bill read as a time series alongside the engineering change log. Cost engineering: storage-tier lifecycles, right-sizing of compute, autoscaling policies, spot-capable workloads, consolidation of underused resources. The results compound: a team with these three practices in place typically runs on 40–60% of the bill it would have without them, for the same workload.
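The attribution piece is mechanically simple — group the billing export by tags and surface the untagged remainder rather than dropping it. A sketch, with a hypothetical tag schema:

```python
from collections import defaultdict

def attribute_costs(line_items):
    """Slice a billing export by (team, env) tags. Untagged spend is
    reported explicitly rather than silently lost."""
    totals = defaultdict(float)
    for item in line_items:
        key = (item.get("team", "UNTAGGED"), item.get("env", "UNTAGGED"))
        totals[key] += item["cost"]
    return dict(totals)

bill = [
    {"team": "ml", "env": "prod", "cost": 120.0},
    {"team": "ml", "env": "dev", "cost": 30.0},
    {"cost": 55.0},  # forgotten tags: the usual FinOps culprit
]
totals = attribute_costs(bill)
assert totals[("ml", "prod")] == 120.0
assert totals[("UNTAGGED", "UNTAGGED")] == 55.0
```

The size of the UNTAGGED bucket is itself a useful metric: a tag-enforcement policy is working when that number trends toward zero.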

The bill is the product owner

Engineering teams that treat the cloud bill as an ops concern — finance deals with it — produce the expensive cloud estates. Teams that treat it as a first-class product signal — every architectural choice has a dollar number attached, every launch has a cost forecast, every service's cost trend is reviewed monthly — produce the efficient ones. The difference is almost entirely cultural, not technical.

Section 15

Infrastructure as code, the click-ops to version-control migration every cloud team eventually makes

Clicking around the cloud console is a fine way to learn a service; it is a terrible way to run one. Infrastructure as code (IaC) — declaring cloud resources in text files, version-controlled, code-reviewed, applied through pipelines — has become the default operational posture for serious cloud estates. It is the engineering analogue of what Chapter 03's version control and dbt did for data pipelines: make the infrastructure reproducible, auditable, and diffable.

The tools

Terraform (HashiCorp, now under the BSL licence with an open-source fork OpenTofu) is the dominant cross-cloud choice — a declarative HCL language, provider plugins for AWS/GCP/Azure and a long list of SaaS services, a state file that tracks what is deployed, and a plan/apply workflow that previews changes before making them. Pulumi covers similar ground with general-purpose languages (Python, TypeScript, Go, C#) instead of HCL. Each cloud also ships its own native tool: AWS CloudFormation and CDK, Google Cloud Deployment Manager (now deprecated in favour of the Terraform-based Infrastructure Manager), Azure Resource Manager and Bicep. For multi-cloud teams, Terraform/OpenTofu is the default; for single-cloud teams, the native tool is often a better fit and better integrated.

State, modules, and drift

Three issues come up repeatedly in IaC practice. State — the record of what is currently deployed — has to live somewhere remote and locked (Terraform Cloud, AWS S3 + DynamoDB, Azure Storage with blob leases); local state files on laptops are the fast path to a catastrophe when two engineers apply at once. Modules — reusable collections of resources — are how an IaC codebase stays readable past the first few hundred lines; good module design is closer to API design than to scripting. Drift — the gap between the declared state and the actual cloud state, introduced when someone clicks something in the console — is the constant antagonist; policies that detect and block drift (and culture that discourages console changes in production) are the way to keep the IaC honest.
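Drift detection reduces to a three-way diff between the declared resources and what the cloud API reports. A sketch, with hypothetical resource names, of the comparison a `terraform plan` performs at much larger scale:

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Compare declared IaC resources against the live cloud state.
    Returns the three shapes drift takes: declared-but-missing,
    live-but-unmanaged, and changed in place."""
    return {
        "missing": sorted(declared.keys() - actual.keys()),
        "unmanaged": sorted(actual.keys() - declared.keys()),
        "changed": sorted(k for k in declared.keys() & actual.keys()
                          if declared[k] != actual[k]),
    }

declared = {"bucket-logs": {"versioning": True},
            "vm-api": {"size": "m5.large"}}
actual = {"bucket-logs": {"versioning": False},      # someone clicked
          "vm-console-test": {"size": "t3.micro"}}   # console-created
drift = detect_drift(declared, actual)
assert drift == {"missing": ["vm-api"],
                 "unmanaged": ["vm-console-test"],
                 "changed": ["bucket-logs"]}
```

A scheduled job that runs this diff and alerts on a non-empty result is the minimal version of the drift-blocking policies mentioned above.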

Pipelines for infrastructure

The mature version of IaC runs through a CI/CD pipeline, not from an engineer's laptop. A merge to main triggers a plan that is reviewed; the apply runs in a CI runner with narrow IAM credentials; a post-apply check confirms the cloud matches the declaration; the whole thing is visible in the same PR workflow the application code uses. Tools like Atlantis, Terraform Cloud, Env0, and Spacelift wrap this pattern. The effect is to turn infrastructure into code in the operational sense as well as the declarative one: reviewed, tested, blameable, reversible.

Write it down, then never click again

The most valuable property of IaC is the audit trail. "Who changed this, when, and why" has a precise answer for every resource in the git log. When an incident happens, the first question — what changed — is instantly answerable. Teams that adopt this discipline rarely go back; teams that skip it usually end up with estates that nobody fully understands, including the people who built them.

Section 16

Multi-cloud, hybrid, and on-prem, the architectures that do not fit on one hyperscaler

Most teams run on one cloud, and most should. A meaningful minority — for reasons of capacity, cost, compliance, or acquisition — run on more than one, or on a mix of cloud and owned hardware. The patterns for each are established enough to name; the trade-offs are important enough to think through once.

Single cloud is the right default

A single cloud is where cloud economics work best: identity is one system, networking is one system, the services are tightly integrated, the billing is one invoice, the engineers only have to learn one set of names. Most of the failure modes that "multi-cloud" is supposed to hedge against — provider outages, price increases, policy changes — are smaller in practice than the cost of the parallel implementation. The honest default for a new team is to commit to one cloud, use its managed services deeply, and revisit the decision only when something specific forces it. Most teams that said they were "multi-cloud" in 2020 had meaningful workloads on exactly one.

Legitimate multi-cloud patterns

Real multi-cloud exists and has real reasons. GPU capacity is one: large training workloads routinely span clouds because no single provider has enough hardware in the right region at the right time. Specific services are another: Vertex for Google's models, Azure OpenAI for the Microsoft-hosted frontier models, AWS for the deepest managed-service catalogue. Compliance-driven residency is a third: European data in European regions, Chinese data in Chinese regions, each on whichever provider has the right certifications. And acquisitions: the acquiring company runs on one cloud, the acquired team ran on another, and the migration is a multi-year project. The common pattern in all of these is workload-level multi-cloud: each workload lives on one cloud; different workloads live on different clouds. "One application spanning clouds" is much rarer and much harder, and should be avoided unless there is a specific reason.

Hybrid and on-prem

Hybrid means cloud plus owned infrastructure in the same architecture. The common patterns: a cloud front end in front of on-prem mainframes; on-prem data centres for regulated data connected to the cloud for analytics; cloud bursting for workloads that exceed on-prem capacity. Tools like AWS Outposts, Google Distributed Cloud, and Azure Arc run cloud APIs on customer-owned hardware. On-prem ML, meanwhile, is having a small revival for narrow reasons: a few teams can genuinely amortise the cost of their own GPU fleet against the hyperscaler list price, and a few have latency or sovereignty constraints that force the decision. For most teams, though, both hybrid and on-prem add complexity that the cloud was supposed to eliminate; the burden of proof is on the architecture that chooses them.

Portability is not free

Any architecture designed to be "cloud-neutral" pays for that portability every day in lost leverage over the cloud-native services. Some of that tax is worth it — standardising on Kubernetes, Kafka, Postgres, S3-API-compatible storage — and some is not. Teams that commit fully to a single cloud's deep services generally ship faster; teams that hedge generally ship slower but preserve optionality. The right trade-off is specific to each team, not a general principle.

Section 17

Where the cloud compounds in ML, experiments, training economics, and the inference bill

Every layer of ML practice — experiment throughput, training capacity, serving economics, evaluation at scale — is shaped by the cloud platform it runs on. Teams that treat the cloud as a substrate and engineer carefully against it move much faster than teams that use it casually; the compounding over a year or two is dramatic.

Experiment throughput as the compounding variable

The number of experiments a team can run per week is one of the strongest predictors of ML progress, and it is almost entirely a function of infrastructure. A platform that can spin up 20 parallel training jobs in 10 minutes, with good checkpoint-restart, good spot-capacity handling, and tight integration to the experiment tracker, produces a team that tries ten ideas per week. A platform that takes an hour per job, breaks on spot preemption, and has no shared dashboards produces a team that tries one. The scaling laws in Chapter 05 Section 17 describe the ceiling; infrastructure quality decides how fast a team actually approaches it.

Training economics

For serious training workloads, cloud platform design becomes indistinguishable from training economics. Which accelerators can actually be obtained in a given region. What the spot availability looks like. Whether checkpointing is cheap enough to absorb preemption. Whether the interconnect supports the required collective throughput. How committed-use discounts map to the team's training cadence. Whether egress between the dataset store and the training cluster is even priced correctly. These are not incidental concerns; for a team spending six or seven figures a month on training, they are the difference between the work being possible and the work not being possible.

Inference at scale

Serving is the other half, and it has become the larger half in the LLM era. Cloud-side, this is a matter of GPU instance pricing for inference, autoscaling policies that respect cold-start latency, multi-region deployments for global latency, and careful egress modelling for responses that may be many tokens long. Provider-specific services — AWS Bedrock, GCP Vertex AI Model Garden, Azure OpenAI — bundle all of this behind a managed endpoint and charge per million tokens, which for many teams is cheaper than running the infrastructure themselves. For teams that do run their own serving (privacy, latency, cost, or fine-tuning requirements), the cloud primitives from Sections 3, 5, 6, 12, and 13 are the kit the platform is built from.
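The managed-endpoint-versus-own-fleet decision comes down to a utilisation break-even. A sketch with illustrative rates (real per-token and per-GPU-hour prices vary widely by model and provider):

```python
def managed_endpoint_cost(requests, tokens_per_request, usd_per_m_tokens):
    """Monthly cost of a per-token managed endpoint (rate illustrative)."""
    return requests * tokens_per_request * usd_per_m_tokens / 1e6

def self_hosted_cost(gpu_hours, usd_per_gpu_hour):
    """Monthly cost of a self-hosted serving fleet (rate illustrative)."""
    return gpu_hours * usd_per_gpu_hour

# 2M requests/month at ~1,500 tokens each, $3 per million tokens,
# versus two GPUs running around the clock at $4/hour.
api = managed_endpoint_cost(2_000_000, 1_500, 3.0)  # $9,000/month
fleet = self_hosted_cost(2 * 24 * 30, 4.0)          # $5,760/month
assert fleet < api  # at this utilisation, self-hosting is cheaper
```

Drop the traffic by 5x and the inequality flips, which is why low-volume teams almost always start on the managed endpoint and revisit the maths as usage grows.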

The cloud is the substrate; the discipline is the work

Nothing in this chapter replaces the need for good engineering. The cloud makes it possible for a small team to wield infrastructure that would have required an army a decade ago; it does not make that infrastructure design itself. The next chapter — on data quality, governance, lineage, and cataloguing — is the final piece of Part III, the layer that keeps all of this machinery auditable and trustworthy as the platform scales up.

Further reading

Where to go next

The cloud providers publish their own documentation at a scale no single book can match; the right reading list is a small set of anchor books and frameworks, plus the official docs and architecture references for each hyperscaler. The list below picks the references that repay re-reading — the Well-Architected Frameworks, the canonical cloud books, the IaC and FinOps literature, and the ML-platform docs that Section 11 is written against.

The canonical books

Well-Architected frameworks

AWS — official documentation

Google Cloud — official documentation

Microsoft Azure — official documentation

Infrastructure as code and multi-cloud

Cost, FinOps, and security

This page is the sixth chapter of Part III: Data Engineering & Systems. The seventh and final chapter of the part — Data Quality, Governance, & Metadata — turns from the systems that store and move data to the systems that keep it trustworthy: data contracts, lineage tracking, catalogues, data observability, and the policy apparatus (access controls, PII handling, regulatory compliance) that keeps a platform auditable as it grows. After that, Part IV begins with Classical Machine Learning.