A neural network is, viewed from far enough away, a very simple object: a composition of affine transformations and pointwise nonlinearities whose parameters are chosen to minimise a differentiable loss on a dataset. Viewed from close up, it is the most consequential computational object of the early twenty-first century — the substrate on which modern computer vision, speech recognition, machine translation, protein structure prediction, and large language models are all built. The distance between those two views is the distance this chapter tries to close. Part IV has taught us a catalogue of classical models, each with its own training algorithm, its own inductive bias, its own representational grammar. The neural network unifies almost all of those training algorithms under a single umbrella — gradient descent on a differentiable loss — and replaces the hand-crafted representational grammar with a representational grammar that is itself learned from data. This chapter is the smallest complete unit of the neural-network idea: a single neuron (the perceptron), an architecture for composing them (the multilayer perceptron), a mechanism for learning them (backpropagation), a gallery of the nonlinearities that make the composition work (activation functions), and the theoretical results (universal approximation) that say why the whole machinery should work in principle. The next six chapters of Part V scale the idea up: training tricks that turn a theoretically-viable network into a practically-trainable one, regularisation that turns a trainable network into a generalising one, and the architectural variations (convnets, RNNs, attention, transformers) that exploit the structure of specific data types.
Sections one through four are the historical-conceptual spine. Section one motivates the whole enterprise: why we are bothering to learn features rather than engineering them, where the neural-network idea came from, and the two "AI winters" whose memory still shapes how the field thinks about hype cycles. Section two is the perceptron — Rosenblatt's 1958 model of a single artificial neuron, its geometric interpretation as a linear classifier, and the Minsky–Papert 1969 XOR result that killed it for two decades. Section three assembles perceptrons into layers: the multilayer perceptron (MLP), the matrix form of a forward pass, and the feed-forward architecture that is the simplest neural network one can draw. Section four is the universal approximation theorem: the Cybenko 1989 / Hornik 1991 result that a single hidden layer of sufficient width can approximate any continuous function on a compact set — and the sharper distinction between what universal approximators can represent and what they can learn efficiently.
Sections five through nine are the mechanics of learning. Section five introduces the loss function — mean squared error for regression, cross-entropy for classification — and the interpretation of training as minimising an expectation over the training distribution. Section six is gradient descent in its simplest form, the update rule that makes every subsequent neural-network algorithm possible. Section seven is the core algorithmic contribution of the field: backpropagation, the efficient reverse-mode computation of gradients through a computation graph, derived here first for a scalar loss and then generalised to tensors. Section eight is the chain rule as a graph — the reframing of backprop in modern automatic-differentiation terms that underlies PyTorch, JAX, and TensorFlow. Section nine is the activation function menagerie — sigmoid, tanh, ReLU, leaky ReLU, GELU, Swish — and the reason the transition from sigmoid to ReLU in 2011 was the single most important architectural change in the deep-learning revival.
Sections ten through fourteen are the practical knobs that turn an MLP from a toy into a real model. Section ten is weight initialisation: Xavier/Glorot and He schemes, and why initialising at zero is catastrophic. Section eleven introduces stochastic gradient descent, mini-batches, and the stochastic-vs-batch trade-off that motivates everything in Chapter 02's optimizer zoo. Section twelve is the vanishing and exploding gradient problem — the reason deep networks were considered impossible to train for two decades, and the three-pronged fix (ReLU, careful initialisation, later residual connections) that solved it. Section thirteen is output layers and task heads: softmax for multiclass classification, linear heads for regression, sigmoid for multi-label, and the mapping between loss function and the final nonlinearity. Section fourteen is the capacity picture: width, depth, overparameterisation, and the surprising modern observation that over-parameterised networks generalise better than classical theory predicts.
Sections fifteen through eighteen zoom out. Section fifteen surveys the history: perceptrons (1958), the first AI winter (1969), backprop's re-invention (Rumelhart, Hinton, Williams 1986), the second AI winter, and the 2006–2012 deep-learning revival that culminated in AlexNet. Section sixteen is a worked example: training a small MLP on MNIST end-to-end, using NumPy only, to concretise every idea in the chapter. Section seventeen is the representation-learning perspective — the reframing of neural networks as feature learners rather than classifiers, and the way that perspective connects Part IV (where features were engineered by hand) to Part V (where features are learned from data). Section eighteen places neural-network fundamentals inside the broader Part V arc: what the rest of Part V (optimizers, regularisation, convnets, RNNs, attention, transfer learning) builds on top of the foundation you have just read.
The central bet of deep learning is that a model can learn useful features from raw data if given enough capacity, enough data, and enough compute. The four previous parts of this Compendium have treated features as something a human engineer provides; Part V takes the opposite view. Understanding why that view eventually won out — after two ice ages in which it plainly did not — is the right place to start.
The models of Part IV are almost all of the form "take a fixed feature vector x, apply a simple parametric function f(x; θ), optimise θ on some loss." The fixed feature vector is doing most of the work. In a classical machine-learning pipeline a human chooses how to extract colour histograms from an image, MFCCs from audio, bag-of-words from text, term-frequency weights, n-grams, part-of-speech tags, sentiment lexicons, date-part decompositions, and so on. The classifier on top — logistic regression, SVM, gradient boosting — is modest in comparison. When the features are good, classical ML works beautifully; when the features are bad, no amount of tuning rescues the model. For rich, high-dimensional, weakly-structured data — images, raw audio, text, graphs, video — nobody is very good at designing features. The feature-engineering bottleneck is the reason that, before 2012, computer-vision systems sat in the high-twenty-percent error band on ImageNet for a decade.
The neural-network answer to this bottleneck is to collapse feature-engineering and classification into a single end-to-end differentiable pipeline. The early layers learn low-level features (edges in an image, phonemes in audio, characters in text); the middle layers learn mid-level features (textures, syllables, words); the late layers learn task-specific abstractions (object parts, phrases, concepts). Nothing in this architecture is hand-designed at the feature level: the network chooses the features that are useful for minimising the training loss. The term of art for this is representation learning, and it is the thing that makes deep learning qualitatively different from the methods in Part IV. The Bengio–Courville–Vincent 2013 review Representation Learning: A Review and New Perspectives (IEEE TPAMI) is the definitive early statement of the idea.
The history of neural networks has been shaped by two long pauses. The first "AI winter" followed Minsky and Papert's 1969 book Perceptrons, which proved that a single-layer perceptron cannot compute XOR — a result the community at the time read as a proof that neural networks in general were limited. Research funding evaporated; the field moved to symbolic AI. The second winter followed the expert-systems crash of the late 1980s. Through the 1990s and early 2000s, neural networks were a backwater while SVMs and graphical models took centre stage. The revival began around 2006 with Hinton's deep-belief networks, accelerated through unsupervised pre-training work in the late 2000s, and broke through definitively in 2012 when Krizhevsky, Sutskever, and Hinton's AlexNet won the ImageNet Large-Scale Visual Recognition Challenge by a margin so large (16.4% top-5 error versus the previous year's 25.8%) that the community recognised it as an inflection point. The factors were compute (GPUs), data (ImageNet's 1.2 million labelled images), and engineering (ReLU, dropout, data augmentation). Since 2012 the pattern has repeated across modality after modality: speech (2013), machine translation (2014), image generation (2014), board games (2016), protein structure (2020), language (2022).
It is not true that neural networks have won on every problem. On small tabular datasets (say, less than a million rows, moderate dimensionality, no strong spatial or sequential structure) gradient-boosted trees still tend to match or beat neural networks — see Grinsztajn, Oyallon, and Varoquaux's 2022 Why do tree-based models still outperform deep learning on tabular data? for a careful head-to-head. On problems with strong prior knowledge that can be encoded as a small hand-designed feature set (much of traditional signal processing, many engineering and scientific problems), classical pipelines remain competitive. And on problems where interpretability matters more than accuracy (regulatory, medical, legal), a well-fit logistic regression is often preferable on deployment grounds regardless of whether a neural network would be 2% more accurate. Part V is about the class of problems where deep learning does win — which is broad, growing, and increasingly dominant — but the practitioner's job is still to pick the right tool per problem.
The perceptron is the smallest non-trivial neural network: a single artificial neuron, introduced by Frank Rosenblatt in 1958 as a computational model of biological neurons and a working binary classifier. It is the building block that everything else in this chapter composes, and the model whose limitations — proven definitively by Minsky and Papert in 1969 — set the agenda for the rest of the field.
A perceptron computes a weighted sum of its inputs, compares the sum to a threshold, and outputs a binary label. Given input vector x ∈ ℝⁿ, weight vector w ∈ ℝⁿ, and bias b ∈ ℝ, the perceptron outputs y = sign(wᵀx + b), where sign returns +1 for positive arguments and −1 otherwise. The bias b shifts the decision threshold so that the separating hyperplane does not have to pass through the origin; it is equivalent to appending a constant 1 to x and folding b into w, and later sections use both conventions interchangeably.
Rosenblatt's perceptron learning rule is the first example of what we now call iterative, mistake-driven learning. Starting from arbitrary initial weights, for each training example (xᵢ, yᵢ), if the model classifies xᵢ correctly do nothing; if it misclassifies, update w ← w + η yᵢ xᵢ and b ← b + η yᵢ, where η > 0 is a learning rate. The update nudges the weight vector in the direction that would have classified the current example correctly. Rosenblatt's perceptron convergence theorem guarantees that if the data are linearly separable — that is, if there exists any hyperplane that separates the positive and negative examples — then the rule converges to a separating hyperplane after at most (R/γ)² updates, where R is the largest example norm and γ is the margin achieved by the best separating hyperplane. The theorem is surprisingly satisfying: it says the learning rule works, it says how many updates it takes, and it fails to say anything about what happens when the data are not linearly separable.
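The rule fits in a few lines. A minimal NumPy sketch (function and variable names are illustrative), trained here on the linearly separable AND function over ±1 inputs:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Rosenblatt's mistake-driven rule on labels y ∈ {−1, +1}."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
                w += lr * yi * xi           # nudge w toward the correct side
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                   # a full clean pass: data separated
            break
    return w, b

# AND over ±1 inputs is linearly separable, so the rule converges
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
preds = np.sign(X @ w + b)
```

Replacing AND with the four XOR examples would make the inner loop update forever (until max_epochs), which is exactly the failure mode Section 02 discusses.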
A perceptron's decision boundary is the hyperplane wᵀx + b = 0. Points on one side are labelled +1; points on the other side are labelled −1. The margin of a correctly classified point xᵢ is its signed distance yᵢ(wᵀxᵢ + b)/‖w‖ to the hyperplane. The perceptron, unlike the SVM of Chapter 07 of Part IV, does not try to maximise the margin — it stops as soon as it finds any separating hyperplane. This is the difference between a feasibility algorithm (which finds any solution) and an optimality algorithm (which finds the best solution), and it is the root cause of the perceptron's sensitivity to the order in which training examples are presented.
A perceptron can learn AND, OR, and NOT. It cannot learn XOR. The four XOR examples (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 are not linearly separable: no line in the plane puts the two 1s on one side and the two 0s on the other. Minsky and Papert's 1969 Perceptrons generalised this observation to a family of impossibility results, showing that a single-layer perceptron cannot compute several natural Boolean and geometric functions — including parity and connectedness. Read sympathetically, the book was an argument for multilayer networks; read pessimistically, as it was at the time, it was a proof that neural networks were fundamentally limited. The pessimistic reading won, and the field went dormant.
A perceptron can also be understood as logistic regression with the threshold replaced by a sign function. Swap sign(wᵀx + b) for the sigmoid σ(wᵀx + b) and you have a smooth, probabilistic classifier trainable by gradient descent on a cross-entropy loss; this is the logistic neuron that occupies the output of a binary-classification neural network. The perceptron's influence is pervasive: every dense layer in a modern network is a vectorised batch of perceptron-like units, differing from Rosenblatt's original only in the choice of activation function (sigmoid, tanh, ReLU rather than a hard threshold) and the use of gradient-based learning on a smooth loss rather than mistake-driven updates on a sign.
Stack two or more perceptron-like layers and you have a multilayer perceptron (MLP), the simplest useful neural-network architecture and the object most of this chapter's theory and mechanics apply to. The MLP solves the XOR problem in ten lines of code and every deeper architecture in Part V is, at some level, a specialisation of it.
An MLP is organised into layers. The input layer is the feature vector x ∈ ℝ^(n₀); it has no parameters and no computation. Each hidden layer ℓ applies an affine transformation z^(ℓ) = W^(ℓ) a^(ℓ−1) + b^(ℓ), where a^(ℓ−1) is the previous layer's output, W^(ℓ) ∈ ℝ^(n_ℓ × n_(ℓ−1)) is a matrix of weights, and b^(ℓ) ∈ ℝ^(n_ℓ) is a bias vector. It then applies a pointwise nonlinearity a^(ℓ) = φ(z^(ℓ)), typically ReLU, which produces the layer's output. The output layer applies one last affine transformation and, optionally, a task-specific nonlinearity (softmax for multiclass classification, sigmoid for multi-label, identity for regression). The set of parameters θ = {W^(1), b^(1), W^(2), b^(2), …} is what the network learns.
The entire forward pass of an MLP is a sequence of affine → nonlinearity → affine → nonlinearity steps. Vectorised across a mini-batch of B examples, A^(ℓ−1) ∈ ℝ^(B × n_(ℓ−1)) becomes a matrix and the affine transformation becomes Z^(ℓ) = A^(ℓ−1) W^(ℓ)ᵀ + b^(ℓ), with the bias broadcast along the batch axis. This formulation is the reason neural networks run efficiently on GPUs: the bulk of the computation is dense matrix multiplication, which GPUs execute at rates of tens to hundreds of teraflops. For a network with layer widths 784 → 256 → 128 → 10 (the canonical MNIST MLP), a forward pass on a batch of 128 examples is three GEMM operations of shapes (128, 784) × (784, 256), (128, 256) × (256, 128), and (128, 128) × (128, 10), each of which completes in microseconds on a modern GPU.
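The batched forward pass is equally compact. A sketch of that 784 → 256 → 128 → 10 network in NumPy (the initialisation scale here is arbitrary; Section 10 treats initialisation properly):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, params):
    """Forward pass through an MLP: affine → ReLU for each hidden layer,
    final affine left linear (a task head such as softmax would go on top)."""
    A = X
    for i, (W, b) in enumerate(params):
        Z = A @ W.T + b                     # (B, n_in) @ (n_in, n_out); bias broadcasts
        A = relu(Z) if i < len(params) - 1 else Z
    return A

rng = np.random.default_rng(0)
widths = [784, 256, 128, 10]                # the canonical MNIST MLP
params = [(rng.standard_normal((n_out, n_in)) * 0.01, np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]

X = rng.standard_normal((128, 784))         # a mini-batch of B = 128 examples
logits = forward(X, params)
```

The three `A @ W.T` products are exactly the three GEMMs listed above.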
A network's depth is the number of layers; its width is the number of units per layer. An MLP with one hidden layer and unlimited width is already a universal approximator (Section 04), but the expressivity theorems do not say that a single wide layer is the efficient representation — and in practice it is not. Depth gives exponential gains in expressiveness for certain function families (Telgarsky 2016), lets the network learn compositional features (low-level at the bottom, abstract at the top), and interacts with optimisation in useful ways. Width determines the number of distinct features a layer can represent at once; modern networks are much wider than the textbook MLPs of the 1990s. The depth-versus-width trade-off has no closed-form answer; empirically, moderate depths (10–100 layers in a convnet, 20–100 in a transformer) with large widths (hundreds to thousands of units per layer) are the norm.
The XOR example made concrete: two hidden units with ReLU activations can represent XOR as an OR-of-two-ANDs, and the composition is a two-layer MLP whose representational capacity strictly exceeds that of any single-layer perceptron. The same argument generalises: each hidden layer takes the previous layer's representation and re-expresses it in a new basis, and nonlinear compositions of such re-expressions can express functions that no single affine transformation can. This is the hand-wavy but essentially correct story of why depth works; the next section gives a sharper version.
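The construction is concrete enough to write down. A hand-set network (weights chosen by hand, not learned) with two ReLU hidden units computes XOR exactly:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hidden units: h1 = ReLU(x1 + x2)       (an OR-like unit)
#               h2 = ReLU(x1 + x2 − 1)   (an AND-like unit)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
# Output: y = h1 − 2·h2 — "OR minus twice AND", which is XOR on {0, 1} inputs
W2 = np.array([[1.0, -2.0]])
b2 = np.array([0.0])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
H = relu(X @ W1.T + b1)
xor_out = (H @ W2.T + b2).ravel()
```

No single affine map can produce this output pattern; one hidden layer of two units suffices.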
The universal approximation theorem is the formal statement that neural networks are, in principle, flexible enough to represent any reasonable function. It is a necessary but not sufficient foundation for the field: it tells us that the hypothesis class is rich, but not that our training procedures can find the right hypothesis inside it.
The canonical statement, due independently to George Cybenko in 1989 (Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems) and Kurt Hornik, Maxwell Stinchcombe, and Halbert White in 1989 (Multilayer Feedforward Networks are Universal Approximators, Neural Networks), goes as follows. Let φ be a continuous non-constant non-polynomial activation function — a sigmoid qualifies, so does tanh, so does ReLU (Cybenko's original proof covered sigmoidal activations; the general non-polynomial condition is due to Leshno et al., 1993). Let K ⊂ ℝⁿ be any compact set and f : K → ℝ any continuous function. Then for any ε > 0, there exists a single-hidden-layer network of the form F(x) = ∑ᵢ αᵢ φ(wᵢᵀx + bᵢ) with finitely many hidden units that satisfies supₓ∈K |F(x) − f(x)| < ε. In words: any continuous function on a bounded region can be approximated arbitrarily well by a sufficiently wide single-hidden-layer MLP.
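A quick numerical illustration — not the theorem's construction: here the hidden weights wᵢ, bᵢ are random and only the output weights αᵢ are fitted, by least squares — shows a wide tanh layer approximating sin on [−π, π]:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 500

# Fixed random hidden layer: features φ(wᵢ·x + bᵢ) with φ = tanh
w = rng.uniform(-4.0, 4.0, n_hidden)
b = rng.uniform(-4.0, 4.0, n_hidden)

x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

Phi = np.tanh(np.outer(x, w) + b)       # (400, 500) matrix of hidden activations
alpha, *_ = np.linalg.lstsq(Phi, target, rcond=None)   # fit output weights αᵢ only
approx = Phi @ alpha
max_err = np.max(np.abs(approx - target))
```

That even randomly chosen hidden units span the target closely is a hint of how rich the hypothesis class F(x) = ∑ᵢ αᵢ φ(wᵢᵀx + bᵢ) is; the theorem guarantees far more, with the hidden weights also chosen freely.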
Three gaps between the theorem's promise and practice are worth remembering. First, the theorem is an existence result — it does not give a construction. It says some set of weights achieves the approximation but does not say how to find them. Second, the required width may be enormous: for worst-case target functions the number of hidden units grows exponentially with input dimension, making a shallow universal approximator wildly parameter-inefficient compared to a deep network representing the same function. Third, the theorem is about approximation on a compact set — functions that blow up at infinity, or functions defined on all of ℝⁿ, require a more careful treatment. None of these caveats matter much for practitioners, but they are the reason the modern literature has moved on to sharper questions.
The modern distinction is between what a network could represent and what a training procedure can find. A depth-k MLP with ReLU activations can represent piecewise-linear functions with up to a number of linear regions that grows exponentially in k (Montúfar et al., 2014), which vastly outstrips the expressive capacity of a single-hidden-layer network of the same total parameter count. But finding the right parameters inside that huge hypothesis class is a non-convex optimisation problem, and the existence of a good solution is not a guarantee that SGD will find it. The empirical fact that SGD often does find good solutions — despite the loss landscape being riddled with saddle points and local minima — is one of the great unsolved mysteries of deep learning, and the subject of a large literature on the "loss landscape" (Li et al. 2018) and "lottery-ticket" hypothesis (Frankle and Carbin 2019).
A number of theoretical results sharpen the "depth helps" intuition. Matus Telgarsky's 2016 paper Benefits of depth in neural networks exhibits explicit functions computable by a depth-k network with O(k) units that require any shallow network to have exponentially many units to approximate. Eldan and Shamir (2016) show a similar gap between depth-3 and depth-2 networks. These depth-separation results justify the design choice of using many narrow layers rather than one very wide one, and they are the theoretical counterpart to the empirical observation that modern architectures (convnets, ResNets, transformers) are dozens or hundreds of layers deep.
A neural network is trained by minimising a scalar loss function that measures how badly the network's predictions match the training labels. The choice of loss function is not a detail: it determines the implicit probabilistic model, the output-layer nonlinearity, and the shape of the gradients that drive learning.
For a regression network that outputs a scalar ŷ, the canonical loss is mean squared error: L = (1/N) ∑ᵢ (ŷᵢ − yᵢ)². MSE is the negative log-likelihood of a Gaussian likelihood model y ∼ 𝒩(ŷ, σ²) with fixed variance, so minimising MSE is maximum-likelihood estimation under a Gaussian-noise assumption. The gradient of MSE with respect to ŷᵢ is (2/N)(ŷᵢ − yᵢ), proportional to the residual — large errors produce proportionally large gradients, which makes MSE sensitive to outliers but well-behaved on clean data. Chapter 01 of Part IV treated MSE exhaustively in the linear-regression context; here we inherit all of that theory and stack a neural network in front of it.
For a binary classifier that outputs a probability p̂ = σ(z) ∈ (0,1), the canonical loss is binary cross-entropy: L = −(1/N) ∑ᵢ [yᵢ log p̂ᵢ + (1 − yᵢ) log(1 − p̂ᵢ)]. It is the negative log-likelihood of a Bernoulli model y ∼ Bern(p̂), so minimising BCE is maximum likelihood under a binary observation model. A crucial practical fact: the gradient of BCE with respect to the pre-activation z — the "logit" before the sigmoid — is the elegant (p̂ − y). This is the same form as the MSE residual, which is not a coincidence: both are instances of the exponential-family gradient formula ∂L/∂z = p̂ − y that makes generalised linear models and their neural-network extensions trivial to backpropagate. Implement BCE using the numerically stable binary_cross_entropy_with_logits routine (in PyTorch) or its equivalents; hand-rolling −log(σ(z)) loses precision for confident predictions.
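The stable formulation is worth seeing explicitly. For y ∈ {0, 1} the per-example BCE on a logit z simplifies algebraically to max(z, 0) − z·y + log(1 + e^(−|z|)), which never exponentiates a large positive number; a sketch comparing it with the naive composition:

```python
import numpy as np

def bce_with_logits(z, y):
    """Stable binary cross-entropy on logits: max(z,0) − z·y + log(1 + e^(−|z|))."""
    return np.mean(np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z))))

def bce_naive(z, y):
    """Direct composition −[y log σ(z) + (1−y) log(1−σ(z))]; loses precision
    (and eventually returns inf) for confident predictions."""
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z = np.array([-2.0, 0.5, 3.0])
y = np.array([0.0, 1.0, 1.0])
moderate_stable = bce_with_logits(z, y)
moderate_naive = bce_naive(z, y)      # agrees at moderate logits

# At z = 40 with y = 0, σ(z) rounds to exactly 1.0 in float64,
# so the naive log(1 − p) is log(0) and the loss blows up to inf
confident = bce_with_logits(np.array([40.0]), np.array([0.0]))   # ≈ 40, finite
broken = bce_naive(np.array([40.0]), np.array([0.0]))
```

This is the same identity the framework routines implement internally.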
For a K-class classifier that outputs a probability vector p̂ = softmax(z) ∈ Δᴷ, the canonical loss is categorical cross-entropy: L = −(1/N) ∑ᵢ ∑ₖ yᵢₖ log p̂ᵢₖ, where yᵢₖ is the one-hot-encoded true label. Equivalently, L = −(1/N) ∑ᵢ log p̂ᵢ,ᶜ(ᵢ), where c(i) is the true class index. The softmax is softmax(z)ₖ = exp(zₖ) / ∑ⱼ exp(zⱼ). As in the binary case, the gradient of softmax-cross-entropy with respect to the logits is the clean ∂L/∂zₖ = p̂ₖ − yₖ. Use the fused softmax-cross-entropy loss (cross_entropy / CrossEntropyLoss in PyTorch, or its equivalents) rather than composing softmax and negative-log-likelihood separately; the fused version computes log-softmax via the log-sum-exp trick and avoids the log-of-zero underflow that the naive composition can produce.
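The p̂ − y gradient can be verified numerically. A sketch (single example, central finite differences, nothing beyond NumPy assumed):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, c):
    """L = −log p̂_c for logits z and true class index c."""
    return -np.log(softmax(z)[c])

z = np.array([2.0, -1.0, 0.5])
c = 0
p_hat = softmax(z)
y_onehot = np.eye(3)[c]
analytic = p_hat - y_onehot              # the claimed ∂L/∂z = p̂ − y

# Central finite difference, one logit at a time
eps = 1e-6
numeric = np.zeros(3)
for k in range(3):
    dz = np.zeros(3)
    dz[k] = eps
    numeric[k] = (cross_entropy(z + dz, c) - cross_entropy(z - dz, c)) / (2 * eps)
```

The agreement between `analytic` and `numeric` is the exponential-family gradient formula in action.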
Almost every loss function in deep learning is a negative log-likelihood of some probabilistic model. MSE is NLL of a Gaussian; BCE is NLL of a Bernoulli; cross-entropy is NLL of a categorical; Poisson regression is NLL of a Poisson; negative-binomial regression for count data is NLL of a negative binomial. This framing is useful because it tells you the right output-layer nonlinearity (the one that maps the pre-activation to the natural parameter of the distribution), it predicts the gradient form (∂L/∂z = p̂ − y for any exponential-family likelihood), and it suggests principled extensions — Huber loss as a robust-regression likelihood, focal loss as a reweighted cross-entropy for imbalanced classes, contrastive losses as a negative-log-likelihood of a ranked-pair model.
Beyond the three canonical cases, a catalogue of task-specific losses has accumulated. Huber loss, also called smooth L1, combines MSE's good behaviour near zero with L1's robustness to outliers: it is quadratic for residuals below a threshold and linear beyond. Hinge loss, the SVM objective, is max(0, 1 − yz); it is used in structured-output and margin-based classifiers. Focal loss (Lin et al. 2017), for imbalanced classification, reweights cross-entropy to downweight confidently-correct examples and focus gradient on the hard cases. Contrastive and triplet losses, central to metric learning and self-supervised representation learning, express the objective as "pull similar pairs together, push dissimilar pairs apart" rather than as a per-example classification. CTC loss, for unaligned sequence-to-sequence problems like speech recognition, marginalises over all possible alignments between input and output sequences. The shared pattern: pick a probabilistic model that captures the structure of the task, write down its NLL, and use automatic differentiation to compute its gradient.
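Huber, the simplest of these, is a two-line function. A sketch (the threshold parameter, conventionally called delta, defaults to 1 here):

```python
import numpy as np

def huber(residual, delta=1.0):
    """Smooth L1: quadratic for |r| ≤ delta, linear beyond, with matching
    value and slope at the threshold."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

residuals = np.array([0.1, 0.5, 2.0, 10.0])
losses = huber(residuals)   # small residuals penalised quadratically, large ones linearly
```

The −0.5·delta² offset in the linear branch is what makes the two pieces meet continuously with matching derivative.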
Once a network has a loss, we need an algorithm that reduces it. The algorithm is gradient descent: iteratively move the parameters in the direction of steepest descent of the loss. Everything in modern deep learning — Adam, learning-rate schedules, batch normalisation, residual connections — is a modification of this core idea.
Given parameters θ ∈ ℝᵖ and a differentiable loss L(θ), the gradient descent update is θ ← θ − η ∇L(θ), where ∇L(θ) is the gradient (the vector of partial derivatives of L with respect to each parameter) and η > 0 is the learning rate. The gradient points in the direction of steepest ascent; subtracting η ∇L moves θ in the direction of steepest descent. Repeat until a stopping criterion is met — a fixed number of iterations, a threshold on the training loss, or no progress on a validation set. The elegance of the rule conceals a menagerie of practical questions: how to choose η, when to decay it, what to do on non-convex loss surfaces, how to compute ∇L for a complicated loss function.
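The update rule is a few lines of code. A minimal sketch (names illustrative) on a small quadratic loss whose minimiser is known in closed form:

```python
import numpy as np

def gradient_descent(grad, theta0, lr, steps):
    """Plain gradient descent: θ ← θ − η ∇L(θ), repeated for a fixed budget."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# L(θ) = (θ₀ − 3)² + 2(θ₁ + 1)², minimised at θ* = (3, −1)
grad_L = lambda t: np.array([2.0 * (t[0] - 3.0), 4.0 * (t[1] + 1.0)])
theta_star = gradient_descent(grad_L, [0.0, 0.0], lr=0.1, steps=200)
```

Raising lr above 0.5 here makes the second coordinate's update factor exceed 1 in magnitude and the iterates diverge — the "too large" failure mode described in the next paragraph.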
The learning rate is the single most important hyperparameter in deep learning. Too small and training is glacially slow; too large and it diverges. The optimal value depends on the curvature of the loss surface (captured by the Hessian eigenspectrum) and on the scale of the gradients. For convex quadratic losses, the theory is clean: the optimal learning rate is 2 / (λₘᵢₙ + λₘₐₓ) of the Hessian eigenvalues, and convergence rate is governed by the condition number λₘₐₓ / λₘᵢₙ. For neural networks, the loss is non-convex and the Hessian is intractable to compute, so practitioners treat η as a hyperparameter to search — typically over a logarithmic grid from 10⁻⁵ to 10⁻¹. Modern recipes combine a learning-rate schedule (warmup followed by cosine decay, for example) with an adaptive optimiser (Adam or AdamW) that rescales the per-parameter effective learning rate based on running statistics of the gradients. Chapter 02 treats these schedules in detail.
For a convex, differentiable loss with an L-Lipschitz gradient, gradient descent with step size η = 1/L converges to the global minimum at rate O(1/t) after t iterations. For μ-strongly convex losses the rate improves to O((1 − μ/L)ᵗ), which is exponential. These results give the classical ML practitioner (logistic regression, linear regression, SVM) a guarantee that training will make continuous, predictable progress. The results also tell them to normalise features — the condition number L/μ of the loss depends on the scale and correlation of the features, and unnormalised features can produce ill-conditioned losses that converge orders of magnitude more slowly than they would otherwise.
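The conditioning effect is easy to see numerically. A sketch (assumptions: a diagonal quadratic loss L(θ) = ½ ∑ λᵢθᵢ² with prescribed Hessian eigenvalues, step size η = 1/λ_max) counting the iterations needed as the condition number grows:

```python
import numpy as np

def gd_iters_to_converge(eigs, tol=1e-6, max_iters=100_000):
    """Iterations of gradient descent on L(θ) = ½ Σ λᵢ θᵢ², step η = 1/λ_max,
    until ‖θ‖ drops below tol, starting from θ = (1, …, 1)."""
    lam = np.asarray(eigs, dtype=float)
    theta = np.ones_like(lam)
    eta = 1.0 / lam.max()
    for t in range(max_iters):
        if np.linalg.norm(theta) < tol:
            return t
        theta = theta - eta * lam * theta   # ∇L(θ) = (λ₁θ₁, …, λₚθₚ)
    return max_iters

well = gd_iters_to_converge([1.0, 1.0])     # condition number L/μ = 1
ill = gd_iters_to_converge([1.0, 100.0])    # condition number L/μ = 100
```

With condition number 1 the minimum is reached in a single step; with condition number 100 the slow eigendirection shrinks by only a factor (1 − 1/100) per step, and convergence takes over a thousand iterations — the O((1 − μ/L)ᵗ) rate made concrete.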
A neural network's loss is non-convex: it has local minima, saddle points, flat plateaus, and valleys that gradient descent can get stuck in. Classical optimisation theory was pessimistic about training such losses, and for a long time the field believed that training deep networks was intractable for this reason. The modern view is more nuanced. Dauphin et al. (2014) showed that the stationary points encountered in high-dimensional non-convex optimisation are overwhelmingly saddle points rather than local minima, and that SGD's stochastic noise tends to escape saddles efficiently. Choromanska et al. (2015) showed that, for large networks, most local minima have losses close to that of the global minimum. Empirically: well-tuned SGD on a well-initialised network reliably finds solutions whose training loss is near zero and whose test loss is near the state of the art. The theoretical picture is still incomplete — "why does SGD work on neural networks?" is an open research question — but the empirical picture is clear, and the rest of Part V builds on the assumption that it continues to.
Gradient descent needs gradients. Computing the gradient of a neural network's loss by naive finite differences would take one forward pass per parameter — prohibitive for a million-parameter network. Backpropagation, the reverse-mode accumulation of the chain rule through the network's computation graph, computes the full gradient in a single forward pass and a single backward pass. It is the single most important algorithm in deep learning.
Consider the simplest two-layer MLP: z¹ = W¹ x + b¹, a¹ = φ(z¹), z² = W² a¹ + b², ŷ = softmax(z²), L = crossentropy(ŷ, y). The gradient of L with respect to W¹ — a tensor with the same shape as W¹ — can in principle be computed by writing out L as a nested function of W¹ and differentiating symbolically, but the expression would be unwieldy. Backpropagation replaces the symbolic manipulation with an algorithmic procedure: work through the reverse of the forward pass, carrying a gradient with respect to the current layer's output, and use the chain rule to transform it into the gradient with respect to the current layer's inputs and parameters.
Define δ^(ℓ) = ∂L/∂z^(ℓ), the gradient of the loss with respect to the pre-activation of layer ℓ. For the output layer of a softmax-cross-entropy classifier, δ^(L) = ŷ − y (the elegant exponential-family form). For an interior layer ℓ, the chain rule gives δ^(ℓ) = (W^(ℓ+1)ᵀ δ^(ℓ+1)) ⊙ φ′(z^(ℓ)), where ⊙ denotes elementwise multiplication and φ′ is the derivative of the activation function. The gradients with respect to the layer's parameters are then ∂L/∂W^(ℓ) = δ^(ℓ) a^(ℓ−1)ᵀ and ∂L/∂b^(ℓ) = δ^(ℓ). The algorithm is: (1) forward pass, cache every z^(ℓ) and a^(ℓ); (2) compute δ^(L) at the output; (3) propagate backwards through the layers using the recursion; (4) at each layer, accumulate the parameter gradients. The cost of the backward pass is the same order as the cost of the forward pass — two matrix multiplications per layer — which is why neural-network training is feasible at scale.
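The recursion above can be written out directly. A sketch for a tiny 4 → 5 → 3 network on a single example (shapes and names illustrative), which also spot-checks one backprop gradient against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, y_onehot, params):
    """Forward pass for one example, caching every z and a for the backward pass."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    e = np.exp(z2 - z2.max())
    p = e / e.sum()                          # softmax
    loss = -np.log(p @ y_onehot)             # cross-entropy
    return loss, (x, z1, a1, p)

def backward(y_onehot, params, cache):
    """The recursion: δ at the output, propagate backwards, accumulate gradients."""
    W1, b1, W2, b2 = params
    x, z1, a1, p = cache
    d2 = p - y_onehot                        # δ⁽²⁾ = ŷ − y
    dW2, db2 = np.outer(d2, a1), d2          # ∂L/∂W⁽²⁾ = δ⁽²⁾ a⁽¹⁾ᵀ
    d1 = (W2.T @ d2) * (z1 > 0)              # δ⁽¹⁾ = (W⁽²⁾ᵀ δ⁽²⁾) ⊙ φ′(z⁽¹⁾), ReLU
    dW1, db1 = np.outer(d1, x), d1
    return dW1, db1, dW2, db2

params = [rng.standard_normal((5, 4)), rng.standard_normal(5),
          rng.standard_normal((3, 5)), rng.standard_normal(3)]
x, y_onehot = rng.standard_normal(4), np.eye(3)[1]

loss, cache = forward(x, y_onehot, params)
dW1, db1, dW2, db2 = backward(y_onehot, params, cache)

# Spot-check ∂L/∂W⁽¹⁾[0,0] against a central finite difference
eps = 1e-6
perturbed = [q.copy() for q in params]
perturbed[0][0, 0] += eps
loss_plus, _ = forward(x, y_onehot, perturbed)
perturbed[0][0, 0] -= 2 * eps
loss_minus, _ = forward(x, y_onehot, perturbed)
fd = (loss_plus - loss_minus) / (2 * eps)
```

The finite-difference check is the standard way to debug a hand-written backward pass: one forward pass per probed parameter, against backprop's one backward pass for all of them.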
The chain rule can be applied in two directions. Forward-mode automatic differentiation propagates directional derivatives alongside values, computing ∂y/∂x for one chosen input x in a single pass over the graph; for a network with P parameters, computing the full gradient this way takes P forward passes. Reverse-mode autodiff, which backpropagation is, computes ∂L/∂xᵢ for all inputs xᵢ in a single backward pass — exactly one, regardless of how many parameters there are. For scalar-valued losses with many parameters (the neural-network case), reverse mode is cheaper by a factor of P; for vector-valued outputs depending on few inputs (certain scientific-computing problems), forward mode wins. Neural networks have billions of parameters and a single scalar loss, which makes reverse-mode the obvious choice.
An intuitive way to read backpropagation is as credit assignment: given the error the network made at the output, how much of that error should be attributed to each parameter deep inside the network? The algorithm apportions blame proportionally to how much each parameter contributed to each output-layer prediction, weighted by the downstream activations and the downstream nonlinearities' derivatives. This framing is useful when thinking about failure modes: if a parameter deep in the network receives a tiny gradient because the activations above it saturated the sigmoid, it will not update — this is the vanishing gradient problem of Section 12. If a parameter receives a gradient that explodes because the weights above it are too large, it will diverge — the exploding gradient problem. The design of modern architectures (ReLU activations, residual connections, layer normalisation) is largely about keeping the gradient signal well-behaved as it is propagated backward through many layers.
Backpropagation was independently discovered several times through the 1960s and 1970s — Seppo Linnainmaa's 1970 master's thesis derived it as a special case of reverse-mode AD, Paul Werbos's 1974 PhD thesis applied it to neural networks explicitly — but it was Rumelhart, Hinton, and Williams's 1986 Nature paper Learning representations by back-propagating errors that brought it to the community's attention, with empirical demonstrations that the algorithm could train multilayer networks to solve XOR-like problems and learn useful features. The paper is short, elegant, and one of the founding documents of modern deep learning.
Modern deep learning frameworks generalise backpropagation into a fully automatic system for computing gradients. Rather than hand-deriving the backward pass for each architecture, the framework traces the forward pass as a directed acyclic graph, associates a gradient function with each elementary operation, and composes them automatically. This automation is the reason researchers can prototype a new architecture in an afternoon and why the transition from hand-written backprop code (1990s) to autodiff frameworks (2015 onwards) was another deep-learning inflection point.
A computation graph is a DAG whose nodes are tensors and elementary operations (matrix multiplication, addition, elementwise nonlinearity, reduction) and whose edges record which tensors feed which operations. Running a forward pass on a network builds the graph by recording each operation along with its inputs and outputs; this is called tracing. For each operation, the framework knows how to compute both the forward value (the output given the inputs) and the vector-Jacobian product (the gradient of a downstream scalar loss with respect to the inputs, given the gradient with respect to the output). The backward pass is then a topological traversal of the graph in reverse order, applying vector-Jacobian products at each node. This framework is fully general: any differentiable computation — neural network, physics simulation, discrete-time recurrence, custom loss function — can be differentiated without hand-deriving gradients.
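The tracing-plus-VJP idea fits in a page. Below is a deliberately minimal scalar reverse-mode engine, an illustration of the mechanism rather than any real framework's API, supporting only addition and multiplication:

```python
# A minimal scalar reverse-mode autodiff engine (a sketch of the idea only).
class Var:
    def __init__(self, value, parents=()):
        # `parents` pairs each input with its local derivative (the VJP rule).
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value,
                   [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(loss):
    # Topologically order the traced graph, then traverse it in reverse,
    # applying each node's local derivatives (the chain rule).
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(loss)
    loss.grad = 1.0
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += local * v.grad

x, y = Var(3.0), Var(4.0)
loss = x * y + x        # d(loss)/dx = y + 1 = 5, d(loss)/dy = x = 3
backward(loss)
```

Adding an operation means adding one method with its local-derivative rule; everything else — tracing, ordering, accumulation — is unchanged, which is exactly why frameworks scale to hundreds of operations.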
Frameworks split into two camps on when the graph is built. Static graphs (TensorFlow 1.x, Theano) compile the graph once, ahead of time, and execute it repeatedly; this allows aggressive optimisation (kernel fusion, memory planning, XLA compilation) but makes control flow and data-dependent shapes awkward to express. Dynamic graphs (PyTorch, JAX in some modes) build the graph afresh on every forward pass; this is slower per step but allows arbitrary Python control flow (if-statements, loops, recursion) and makes debugging trivial — an error in the forward pass surfaces at a natural Python line, not inside a compiled kernel. Modern frameworks blur the distinction: TensorFlow 2 defaults to dynamic ("eager") execution, PyTorch offers torch.compile for ahead-of-time tracing and fusion, JAX separates pure-function tracing from execution in a way that gets most of both benefits. The empirical outcome has been that dynamic graphs won the research-usability war in the late 2010s, and the frameworks have converged on a style that is dynamic by default and static on demand.
The PyTorch autograd engine is the reference implementation of dynamic reverse-mode AD and a useful concretisation of the ideas. Every tensor carries a flag requires_grad; when an operation is performed on such tensors, the result is annotated with a backward function describing how to propagate gradients through that operation. Calling loss.backward() walks the graph in reverse topological order and accumulates gradients into tensor.grad for every leaf that requires gradients. The optimiser (torch.optim.SGD, AdamW, etc.) then reads tensor.grad, applies its update rule, and zeroes the gradient buffer for the next iteration. The user never writes a backward pass by hand; they write only the forward pass, and gradients come free.
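A minimal training step in this style might look as follows (a toy model with illustrative shapes; the essential point is that only the forward pass is written by hand):

```python
import torch

# Toy model and data; shapes are illustrative.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 4)                 # a mini-batch of 16 examples
y = torch.randint(0, 3, (16,))         # integer class labels

loss = torch.nn.functional.cross_entropy(model(x), y)
opt.zero_grad()                        # clear the last step's .grad buffers
loss.backward()                        # reverse traversal; fills p.grad
opt.step()                             # p <- p - lr * p.grad

# Every parameter that requires grad now has a populated .grad tensor.
grads_exist = all(p.grad is not None for p in model.parameters())
```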
JAX takes a different tack. Rather than tracking gradients inside tensor objects, it exposes jax.grad(f) as a higher-order function that takes a pure function f and returns a new function that computes its gradient. This is more mathematically clean — gradients are a property of functions, not of tensors — and plays well with other JAX transformations (jax.jit for compilation, jax.vmap for batching, jax.pmap for parallelism). It enforces a discipline of functional programming (no in-place mutation, no hidden state) that many researchers find congenial, particularly for scientific-computing and RL workloads where the program structure is more complicated than a feed-forward network.
Because autodiff composes, gradients of gradients are computable: for a scalar function of a scalar argument, jax.grad(jax.grad(f)) returns the second derivative, and differentiating the inner product of a gradient with a fixed vector yields Hessian-vector products. This ability to differentiate through a gradient step is the foundation of meta-learning (Finn et al.'s 2017 MAML, for example), hyperparameter optimisation by backprop (Maclaurin et al. 2015), and implicit-differentiation-based algorithms for bilevel optimisation. The capability is underused by most practitioners but is central to a growing slice of the research literature.
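A small JAX sketch of both constructions — second derivatives by composing grad, and Hessian-vector products by differentiating a gradient-vector inner product (the function f here is a toy example):

```python
import jax
import jax.numpy as jnp

def f(x):                         # a toy scalar loss
    return jnp.sum(x ** 3)

# Scalar argument: composing grad twice gives the second derivative.
d2 = jax.grad(jax.grad(lambda t: t ** 3))     # d2(t) = 6t

# Vector argument: a Hessian-vector product comes from differentiating
# the inner product of the gradient with a fixed direction v.
def hvp(f, x, v):
    return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)

x = jnp.array([1.0, 2.0])
v = jnp.array([1.0, 1.0])
out = hvp(f, x, v)    # Hessian of sum(x^3) is diag(6x), so out = 6 * x * v
```

Note that the Hessian itself is never materialised; the HVP costs only a constant multiple of one gradient evaluation, which is what makes second-order information usable at neural-network scale.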
The activation function is the pointwise nonlinearity that turns a stack of linear layers into a universal function approximator. Without it, an MLP of any depth collapses to a single linear layer. The history of deep learning tracks, in large part, the history of which activation functions are in favour: sigmoids gave way to tanh, tanh gave way to ReLU, ReLU gave way in some domains to GELU and Swish, and each transition was a meaningful engineering improvement.
The sigmoid σ(z) = 1/(1+e⁻ᶻ) and the hyperbolic tangent tanh(z) are the classical activation functions of the 1980s and 1990s. Both are smooth, monotonic, and bounded — sigmoid to (0,1), tanh to (−1,1) — and both are differentiable everywhere. The sigmoid has a natural probabilistic interpretation (it is the logistic CDF), and a network of sigmoid units can be read as a stack of logistic-regression classifiers. The price of their smoothness, however, is saturation: for large |z|, both functions flatten out and their derivatives go to zero. When derivatives of a network's many layers are multiplied together during backpropagation, the product shrinks exponentially — the vanishing gradient problem of Section 12. This is the single largest reason why training deep networks was considered intractable through the 1990s and early 2000s.
The rectified linear unit ReLU(z) = max(0, z), popularised by Glorot, Bordes, and Bengio's 2011 Deep Sparse Rectifier Neural Networks and Krizhevsky et al.'s 2012 AlexNet paper, was the architectural change that made deep networks trainable. It is piecewise linear (the derivative is exactly 0 for z < 0 and exactly 1 for z > 0), unbounded above, and computationally trivial. The key property for deep-network training is that ReLU does not saturate on the positive side: for large positive z, the gradient is 1, so backpropagated gradients do not shrink multiplicatively through layers. The biological-plausibility arguments for ReLU are thin (real neurons do saturate), but the engineering arguments are overwhelming: training speed, gradient stability, and sparsity of activations (about half the units output zero at any time, which has both regularising and computational consequences).
ReLU has a well-known failure mode: dead ReLUs. If a unit's pre-activation becomes negative for all training examples, its gradient is zero everywhere and it never updates again — effectively dropping out of the network. This can happen if the bias drifts strongly negative early in training. Leaky ReLU replaces the zero on the negative side with a small slope αz (typically α = 0.01), giving the dead unit a gradient to climb back out. Parametric ReLU (He et al. 2015) lets α be learned. ELU (Clevert, Unterthiner, and Hochreiter 2015) smoothly saturates on the negative side with α(eᶻ − 1), producing mean activations closer to zero and slightly better training dynamics. In practice, the difference between ReLU and leaky ReLU on most modern architectures is small; dead ReLUs are less common when combined with good initialisation (He) and batch normalisation.
Two smooth ReLU-like activations have dominated in recent years. GELU (Hendrycks and Gimpel 2016), GELU(z) = z · Φ(z) where Φ is the standard normal CDF, is the activation used in BERT, GPT, and most transformer architectures; it is approximately ReLU for large |z| but has a smooth transition near zero with a slight dip for small negative inputs. Swish or SiLU (Ramachandran, Zoph, and Le 2017), Swish(z) = z · σ(z), has a similar shape and similar empirical performance. Both are slightly more expensive to compute than ReLU but give a small, consistent improvement on large-scale architectures. The history lesson: architectural choices that seem like minor engineering tweaks (sigmoid → ReLU → GELU) can unlock dramatic performance gains when combined with the scale and data of modern deep learning.
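The activations discussed above, written directly from their definitions (scalar versions for clarity; the exact GELU uses the normal CDF via erf rather than the common tanh approximation):

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, a=0.01):
    return z if z > 0 else a * z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def silu(z):                      # Swish / SiLU: z * sigma(z)
    return z * sigmoid(z)

def gelu(z):                      # exact GELU: z * Phi(z), Phi the N(0,1) CDF
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Evaluating these confirms the shapes described in the text: gelu and silu agree with ReLU for large |z|, pass through zero at the origin, and dip slightly negative for small negative inputs where ReLU is exactly zero.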
A handful of activations exist for specific purposes. Softmax is the multi-class probability activation for the final layer of a classifier. Sigmoid remains the right choice for the final layer of a multi-label classifier, where each class is an independent binary decision. Softplus log(1 + eᶻ) is a smooth ReLU approximation that is sometimes useful when gradient smoothness matters. Gated linear units like GLU (Dauphin et al. 2016) multiply one linear projection by the sigmoid of another, encoding a multiplicative interaction that is particularly useful in sequence models and transformers (where the gated variant SwiGLU has become the standard in recent LLMs like LLaMA and PaLM). The general principle: the activation function is a design knob, and different parts of different architectures often benefit from different choices.
A neural network's parameters have to start somewhere. The choice of starting distribution is not a detail — bad initialisation can make a perfectly reasonable network untrainable, while good initialisation makes training faster and more reliable.
The naive choice — set all weights to zero — is catastrophic. With zero weights, every hidden unit in a layer computes the same output, receives the same gradient during backprop, and updates identically. The network can never break the symmetry: all units in a layer remain identical throughout training, effectively collapsing the layer to a single unit. The same symmetry problem afflicts initialising all weights to the same non-zero constant. To learn anything useful, hidden units must start out different from each other — so initial weights must be drawn from a probability distribution with some spread.
Xavier Glorot and Yoshua Bengio's 2010 paper Understanding the difficulty of training deep feedforward neural networks derived the first principled initialisation scheme. The idea: for a linear layer with n_in inputs and n_out outputs, if the inputs have unit variance and the weights are i.i.d. with zero mean and variance σ², then the outputs have variance n_in · σ². To keep the variance constant as signals propagate forward through the network, set σ² = 1/n_in. To also keep the variance of the backward gradient signal constant, set σ² = 1/n_out. As a compromise, Xavier initialisation sets σ² = 2/(n_in + n_out). With sigmoid or tanh activations, this keeps activations and gradients from blowing up or vanishing through the network's depth, which was the limiting factor in pre-ReLU era training.
ReLU zeros out half the units on average, which halves the variance of the post-activation signal. Kaiming He et al.'s 2015 Delving Deep into Rectifiers paper corrected Xavier to account for this: for ReLU, set σ² = 2/n_in. This is the default in every modern deep-learning framework when you create a ReLU-activated dense or convolutional layer, and it is the reason networks with hundreds of layers can be trained from random initialisation today. The He derivation is a model for how to reason about initialisation in general: pick an activation function, compute the variance-transformation factor it induces, and set the initial weight variance to cancel it.
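The variance argument is easy to verify empirically. The sketch below (width and depth are illustrative) pushes a random signal through a deep ReLU stack, once with He-scaled weights and once with deliberately under-scaled weights:

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 512, 30                     # illustrative sizes

def final_std(sigma2):
    """Push a unit-variance signal through `depth` ReLU layers whose
    weights have variance `sigma2`, and report the output's std."""
    a = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma2), (width, width))
        a = np.maximum(0.0, W @ a)         # ReLU halves the signal's power
    return a.std()

he    = final_std(2.0 / width)   # He scaling: signal magnitude preserved
naive = final_std(0.5 / width)   # too small: signal decays geometrically
```

With He scaling the output magnitude stays order one after 30 layers; with the under-scaled weights the per-layer variance factor is 0.25, so the signal shrinks by roughly 0.5³⁰ in standard deviation and is numerically dead.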
Two more specialised schemes deserve mention. Orthogonal initialisation (Saxe, McClelland, and Ganguli 2014) initialises weight matrices as random orthogonal matrices (obtained via a QR decomposition of a Gaussian matrix); this preserves signal norms exactly, and is particularly useful for recurrent networks where repeated application of the same weight matrix can amplify or attenuate signals dramatically. Identity-like initialisation appears in residual networks, where each block is initialised so that the residual branch is close to zero and the block acts like the identity at the start of training; this lets information flow through an arbitrarily deep residual stack before the residual branches learn to add useful corrections.
For feed-forward and convolutional networks trained from scratch, He or Xavier initialisation remains the default and works well. For large transformer models, specific initialisation schemes — the "GPT-2" style of scaling the output projection of each residual block by 1/√L, where L is the depth, to stabilise the variance of the residual stream as depth grows — have become standard. For any network fine-tuned from a pretrained checkpoint, the initialisation question is replaced by the pretraining question: you start from whatever the pretrained weights were. The vast majority of deep-learning models deployed in 2026 start from some form of pretrained initialisation, which is the topic of Chapter 07 of Part V.
In PyTorch, nn.Linear and nn.Conv2d default to a Kaiming-uniform initialisation that works for ReLU networks; override with nn.init.xavier_uniform_ or nn.init.kaiming_normal_ as needed. In JAX/Flax, initialisers are explicit arguments to layer constructors. For a production recipe: He for ReLU/variant layers, Xavier for tanh/sigmoid layers, identity-style for residual blocks in a very deep stack, and 1/√L output scaling for transformer residual blocks. These recipes matter less when the network is shallow and more when it is very deep; for modern architectures with hundreds or thousands of layers, they are the difference between a trainable network and a brick.
Gradient descent in its textbook form computes the gradient over the entire training set before each update. For modern datasets of millions or billions of examples, this is computationally absurd. The practical compromise is stochastic gradient descent: approximate the full-batch gradient with the gradient of a small random mini-batch. Each SGD update is noisier than a full-batch step but vastly cheaper, so SGD makes dramatically faster progress per unit of compute — and its stochastic noise turns out to have surprisingly beneficial regularisation properties.
Three sizes of gradient estimate are in common use. Full-batch gradient descent uses all N training examples to compute each gradient; it is deterministic, converges smoothly on convex problems, and is impossibly slow on any realistic modern dataset. Online SGD uses one example at a time, updating the parameters after every single example; it is maximally stochastic, makes fast initial progress, and has theoretical guarantees from the 1950s (Robbins and Monro 1951). Mini-batch SGD, the modern default, uses batches of 32 to 8192 examples; it trades off the low variance of full-batch with the fast iteration of online, and it fits naturally into the GPU's parallel-processing model (one mini-batch fits in GPU memory, all examples in the batch are processed in parallel). Almost every network in Part V is trained with mini-batch SGD or one of its adaptive descendants.
Formally, SGD does not follow the gradient of the empirical loss L(θ) = (1/N) ∑ᵢ ℓ(fθ(xᵢ), yᵢ) exactly. The stochastic gradient g̃ at step t is the average gradient over a random mini-batch, which is an unbiased but noisy estimator of the full gradient: E[g̃] = ∇L(θ). The noise g̃ − ∇L(θ) has variance that scales as 1/B for batch size B, so small batches are noisier than large ones; this noise is not a bug but a feature, for reasons in the next subsection.
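A minimal mini-batch SGD loop on a toy linear-regression problem (all sizes illustrative) shows the estimator in action: each per-step gradient is noisy, but its expectation points downhill and the iterates converge:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 2000, 5, 32                        # dataset size, dims, batch size
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.01 * rng.normal(size=N)   # near-noiseless linear data

w, lr = np.zeros(d), 0.1
for step in range(500):
    idx = rng.integers(0, N, size=B)         # sample a random mini-batch
    xb, yb = X[idx], y[idx]
    g = 2.0 / B * xb.T @ (xb @ w - yb)       # gradient of mini-batch MSE
    w -= lr * g                              # SGD update

# The noisy mini-batch gradients still drive w close to w_true.
```

Each step touches 32 of the 2000 examples, so 500 steps cost about 8 full-data passes; full-batch gradient descent would spend one such pass per single update.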
One of the defining empirical observations of deep learning is that SGD often generalises better than full-batch optimisation of the same objective — sometimes dramatically so. The theoretical picture is still incomplete, but several partial explanations have emerged. Keskar et al.'s 2017 On Large-Batch Training for Deep Learning showed empirically that large batches produce "sharper" minima (high curvature, narrow basins) while small batches produce "flatter" minima (low curvature, wide basins), and flat minima tend to generalise better. Mandt, Hoffman, and Blei's 2017 Stochastic Gradient Descent as Approximate Bayesian Inference interpreted constant-learning-rate SGD as approximate sampling from a Bayesian posterior, with the noise scale (set jointly by the learning rate and the batch size) playing the role of a temperature. The practical consequence: batch size is a regularisation knob as well as a compute knob, and very large batches often need a learning-rate adjustment or explicit regularisation to recover the small-batch generalisation.
If you scale the batch size up to use more hardware, you should generally scale the learning rate up too — the linear scaling rule (Goyal et al. 2017, from Facebook's 1-hour ImageNet paper) says that doubling the batch size should roughly double the learning rate, because the variance of the gradient estimate halves. The rule holds for batch sizes up to some architecture-dependent limit (typically a few thousand for convnets on ImageNet, up to tens of thousands for large transformers); beyond that point, the rule breaks down and training either diverges or stops generalising. Modern large-scale training uses the linear scaling rule with a warmup period (gradually increasing the learning rate from zero over the first several hundred steps) to prevent divergence at the start of training.
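The two rules combine into a tiny schedule helper. This is a sketch under stated assumptions (linear warmup, an arbitrary base batch of 256), not a production scheduler:

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * batch / base_batch

def lr_at_step(step, peak_lr, warmup_steps):
    """Linear warmup from zero to peak_lr over the first warmup_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# 4x the batch -> 4x the learning rate, reached only after warmup.
peak = scaled_lr(0.1, base_batch=256, batch=1024)
```

In practice the post-warmup phase is then handed to a decay schedule (cosine or step decay), which belongs to the optimiser discussion of Chapter 02.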
SGD with a fixed learning rate is the baseline, but modern practice layers several refinements on top. Momentum maintains a running exponential-moving-average of past gradients and updates in the direction of the accumulated velocity rather than the instantaneous gradient. Nesterov momentum refines this with a lookahead step. Adaptive methods (Adagrad, RMSprop, Adam, AdamW) maintain per-parameter running statistics of gradient magnitudes and scale the effective learning rate inversely, so that parameters with small but consistent gradients update as quickly as parameters with large gradients. These are the topic of Chapter 02 of Part V; what matters at this stage is that all of them are refinements of the SGD core — a step in the direction of a stochastic gradient estimate — rather than departures from it.
The reason deep neural networks were considered untrainable for two decades is a numerical problem: gradients that, propagated backward through many layers, either shrink to zero or blow up to infinity. The story of how the field got over this problem — through activation-function changes, better initialisation, and architectural innovations like residual connections — is the story of the deep-learning revival.
A gradient propagating backward through a chain of layers is, mathematically, a product of Jacobian matrices — one per layer. Each Jacobian is the product of a weight matrix and a diagonal matrix of activation derivatives. The spectral norm of this product grows or shrinks multiplicatively with depth. If each layer's Jacobian has spectral norm < 1, the product shrinks exponentially: a 50-layer network with per-layer norm 0.9 attenuates the gradient by a factor of 0.9⁵⁰ ≈ 0.005. If the per-layer norm is > 1, the product grows exponentially: a norm of 1.1 gives 1.1⁵⁰ ≈ 117. The first case is vanishing gradients — deep-layer parameters receive essentially no training signal. The second is exploding gradients — the optimiser takes enormous steps and training diverges.
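The multiplicative effect is easy to simulate: push a unit gradient backward through a stack of random Jacobians whose typical per-layer gain sits slightly below or slightly above 1 (dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_norm(scale, depth=50, dim=256):
    """Propagate a unit gradient backward through `depth` random layer
    Jacobians whose typical per-layer gain is `scale`."""
    g = rng.normal(size=dim)
    g /= np.linalg.norm(g)
    for _ in range(depth):
        # Entries scaled so that |J^T g| is approximately scale * |g|.
        J = rng.normal(0.0, scale / np.sqrt(dim), (dim, dim))
        g = J.T @ g
    return np.linalg.norm(g)

vanish  = backprop_norm(0.9)   # shrinks on the order of 0.9**50 ~ 5e-3
explode = backprop_norm(1.1)   # grows on the order of 1.1**50 ~ 1e2
```

A per-layer gain just 10% off unity in either direction changes the gradient norm by two orders of magnitude over 50 layers, which is the whole problem in one number.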
The classical version of the problem is with sigmoid or tanh activations. The derivative of the sigmoid is σ(z)(1 − σ(z)), which peaks at 0.25 when z = 0 and shrinks rapidly as |z| grows. When a sigmoid unit saturates (its input is strongly positive or strongly negative), its derivative is tiny, and the gradient passing through it is attenuated. Stack many sigmoid layers and the compounded attenuation reliably kills the gradient in the first few layers — the reason Hinton's 2006 deep-belief-network approach (layerwise pre-training) was necessary before direct backpropagation became viable.
ReLU's derivative is exactly 1 wherever the unit is active and exactly 0 where it is not. The "active" half of a ReLU does not attenuate gradients at all, and a typical ReLU network has about half its units active per layer. The gradient signal therefore propagates through the active half without the multiplicative shrinkage that sigmoids impose. This is the primary reason the switch from sigmoid to ReLU around 2011 unlocked much deeper networks, and it is the first of the three prongs of the solution.
The second prong is initialisation. He initialisation (Section 10) sets the initial weight variance to 2/n_in, which makes each layer's expected spectral norm approximately 1 at the start of training. Under these conditions the gradient magnitude is preserved (in expectation) as it propagates through layers, and the network trains stably even at 20–30 layers deep without any further tricks.
For networks deeper than about 20 layers, He initialisation alone is not enough — small deviations from unit norm compound over many layers. The third prong is architectural: residual connections (He et al.'s 2015 ResNet paper) add skip paths that let gradients flow directly from output to deep layers without passing through many Jacobians; layer normalisation (Ba, Kiros, and Hinton 2016) and batch normalisation (Ioffe and Szegedy 2015) rescale activations within each layer to unit variance, which keeps the Jacobians well-conditioned throughout training. With all three prongs combined — ReLU, He init, residual connections, normalisation — networks with hundreds or even thousands of layers train reliably. ResNet-152 and GPT-4-scale models with many dozens of residual blocks are the proof that the vanishing-gradient problem, once an absolute ceiling, is in 2026 a solved engineering issue.
The exploding-gradient variant — gradients that grow rather than shrink — is most common in recurrent networks (Chapter 05) and in transformer training at large learning rates. The standard fix is gradient clipping: if the norm of the full gradient vector exceeds some threshold τ, rescale the gradient to have norm exactly τ. Gradient clipping is a hard cap on step size that prevents a single bad update from ruining training; it is almost universal in LLM training (where a few anomalous examples can cause huge gradient spikes) and common in RL.
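Global-norm clipping is a few lines. The sketch below mirrors the usual framework semantics (PyTorch's clip_grad_norm_, for instance) in plain NumPy:

```python
import numpy as np

def clip_by_global_norm(grads, tau):
    """Rescale a list of gradient arrays so their joint norm is at most tau.
    Direction is preserved; only the step length is capped."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > tau:
        grads = [g * (tau / total) for g in grads]
    return grads, total

# Two parameter groups with a joint gradient norm of sqrt(9+16+144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, tau=1.0)
```

Clipping is applied to the concatenation of all parameter gradients, not per-tensor; per-tensor clipping would change the gradient's direction as well as its length.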
The last layer of a neural network is where the architecture meets the task. A single choice of output nonlinearity and matched loss function determines whether the network solves a regression, binary-classification, multiclass-classification, multi-label, or ranked-output problem.
For regression problems, the output layer is a simple affine transformation: ŷ = W^(L) a^(L−1) + b^(L), no nonlinearity. The loss is mean squared error (or Huber for robust regression), and the interpretation is that ŷ is the predicted mean of a Gaussian likelihood. For multi-output regression, extend the output dimension to d and apply MSE independently to each component; for heteroscedastic regression, predict both a mean and a variance and use the Gaussian NLL as the loss directly.
For binary classification, the output is a single logit z and the predicted probability is p̂ = σ(z); the loss is binary cross-entropy. For multi-label classification — where each example can belong to any subset of K classes simultaneously — the output is a vector of K logits and each component passes through its own sigmoid, producing K independent binary probabilities. The loss is the sum of K binary cross-entropies. This is the correct setup for problems like image tagging, document topic assignment, and multi-label medical diagnosis, where classes are not mutually exclusive.
For mutually-exclusive multiclass classification — the canonical image-classification setup — the output is a vector of K logits z ∈ ℝᴷ and the predicted probability vector is p̂ = softmax(z). The softmax is softmax(z)ₖ = exp(zₖ) / ∑ⱼ exp(zⱼ), which produces a probability distribution over the K classes. The loss is categorical cross-entropy. One subtle but important point: the softmax is shift-invariant — adding a constant to every logit does not change the output probabilities. This means the logits are only meaningful up to an additive constant, and a well-trained network's logits typically have a consistent scale but an arbitrary mean. For numerical stability, subtract the maximum logit before exponentiating: softmax(z) = exp(z − max(z)) / ∑ⱼ exp(zⱼ − max(z)).
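The max-subtraction trick and the shift invariance in code (a minimal sketch for a single logit vector):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(z) would overflow
p = softmax(z)

# Shift invariance: adding a constant to every logit changes nothing.
p_shifted = softmax(z - 500.0)
```

The naive version would compute exp(1000) = inf and return NaNs; the shifted version exponentiates only values in (−∞, 0] and is exact up to rounding.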
For ranking problems (information retrieval, recommendation), the natural output is a scalar score per candidate, and the training objective is a pairwise or listwise ranking loss (RankNet, LambdaMART-style, ListNet). For ordinal regression (problems where the classes have a natural order — for example, severity ratings), the output can be formulated as a cumulative-probability parameterisation, which respects the order. For structured outputs (sequence tagging with BIO labels, dependency parsing), the output is a structured prediction and the loss is typically a structured log-loss under a CRF or similar graphical-model formalism.
Some tasks have unusual output shapes. Semantic segmentation outputs a per-pixel softmax, so the output is a tensor of shape (H, W, K). Object detection outputs bounding-box coordinates (continuous regression) plus class labels (categorical classification) plus an objectness score (binary classification), all jointly. Language modelling outputs a softmax over the full vocabulary (tens of thousands of classes), where the computationally expensive dot-product-with-embedding operation is the main bottleneck. The pattern — pick an output nonlinearity that maps into the natural parameter space of a probabilistic model, pick the NLL as the loss, and let autodiff do the rest — generalises to essentially every task in the deep-learning zoo.
The standard framework loss functions (PyTorch's BCEWithLogitsLoss and CrossEntropyLoss, for example) combine the output nonlinearity and the loss into a single numerically-stable operation; use them.
A central empirical fact of modern deep learning is that very large, very overparameterised networks train well and generalise well — a fact that classical statistical-learning theory did not predict and, for a long time, said should not happen. The modern understanding of capacity in deep networks departs from the classical VC-dimension story in ways worth knowing about.
Classical statistical learning theory says that a model's generalisation gap is controlled by its capacity — VC dimension, Rademacher complexity, effective number of parameters. The prescription is U-shaped: too little capacity and the model underfits (high bias), too much and it overfits (high variance), and the sweet spot lies somewhere between. For a linear model with p parameters fit on n examples, the classical prescription is roughly "pick p ≪ n"; for a decision tree, "prune the tree to the depth that minimises CV loss"; for an SVM, "tune the regularisation parameter to minimise CV loss." The U-shaped curve is the central object of classical evaluation.
For deep neural networks, the U-shaped curve is wrong. As you increase a network's capacity (more parameters, more units per layer, more layers), the test error first decreases to a minimum and then rises — but past a critical overparameterisation threshold at which the network can interpolate the training set exactly, the test error decreases again, often below the first minimum. This is the double descent phenomenon documented systematically by Belkin, Hsu, Ma, and Mandal in their 2019 paper Reconciling modern machine learning practice and the classical bias-variance trade-off. Nakkiran et al.'s 2020 Deep double descent: Where bigger models and more data hurt documents the same behaviour across a wide range of deep architectures and datasets. The empirical prescription is the opposite of the classical one: you want a heavily overparameterised network, with enough capacity to memorise the training set, because that regime generalises better than the one at the classical "right" capacity.
Why does overparameterisation help? One partial answer is the lottery-ticket hypothesis (Frankle and Carbin 2019): a randomly initialised overparameterised network contains, with high probability, a small sub-network (a "winning ticket") that, trained in isolation, would match the full network's performance; the overparameterisation's role is to make it statistically likely that at least one such ticket exists. Another is the neural tangent kernel perspective (Jacot, Gabriel, and Hongler 2018): in the infinite-width limit, training dynamics become equivalent to kernel regression with a specific kernel, and the resulting estimator generalises well for structural reasons. A third, more empirical, is that SGD on an overparameterised network implicitly regularises toward low-complexity solutions (low-norm, low-rank, flat minima), and the overparameterisation gives SGD enough freedom to find such solutions.
Depth matters for expressive power — the depth-separation results of Section 04 say a deep network can represent functions that would need exponentially many units in a shallow network. In practice, the benefit of depth is moderated by diminishing returns and by the training difficulties of very deep networks. Convnets gain rapidly with depth up to a few dozen layers (with residual connections extending the benefit to 100+ layers); transformers benefit from depth into the dozens of blocks for smaller models and through low hundreds for frontier models. Past some depth, the marginal gain from additional layers is much smaller than the marginal gain from additional width — which is one reason recent large language models have tended to widen faster than they have deepened.
The relationship between model size, data size, compute, and performance has become quantitative. Kaplan et al.'s 2020 Scaling Laws for Neural Language Models showed that test loss falls as a power law in each of model size N, dataset size D, and compute budget C, with exponents that are relatively stable across architecture and task. Hoffmann et al.'s 2022 Training Compute-Optimal Large Language Models (the "Chinchilla" paper) refined the model/data balance, showing that the compute-optimal recipe is roughly 20 training tokens per parameter, far more data per parameter than Kaplan et al.'s analysis had implied. The practical upshot: for a given compute budget, bigger is better, but only up to the ratio at which the data budget supports the parameter count. These scaling laws are the reason frontier-model training is plannable: given a compute budget, you can predict the optimal model size, the required dataset size, and the approximate test loss.
The neural-network idea is older than almost every other idea in this Compendium — older than the transistor, older than the stored-program computer, older than information theory. Its history is a cycle of enthusiasm, disillusionment, and eventual breakthrough, and understanding the cycle is useful both for historical literacy and for navigating the hype cycles of the present.
The first mathematical model of a neuron is Warren McCulloch and Walter Pitts's 1943 A Logical Calculus of the Ideas Immanent in Nervous Activity. Their unit is a binary threshold neuron — weighted sum of binary inputs, fire if the sum exceeds a threshold — and their paper proves that networks of such units can compute arbitrary Boolean functions. The McCulloch–Pitts neuron is simpler than the Rosenblatt perceptron (no learning rule, binary weights) but establishes the core abstraction: a neuron as a weighted-sum-plus-threshold.
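The abstraction is small enough to write down in a few lines. A minimal sketch (function names and the hand-picked weights are illustrative; the McCulloch–Pitts model itself prescribes no learning rule, so the weights must be chosen by hand):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (1) iff the weighted sum of
    binary inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Boolean gates as threshold units, with weights set by hand.
AND = lambda a, b: mp_neuron([a, b], [1, 1], threshold=2)
OR = lambda a, b: mp_neuron([a, b], [1, 1], threshold=1)
```

Networks of such units suffice for arbitrary Boolean functions, which is exactly the claim of the 1943 paper.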
Frank Rosenblatt's 1958 perceptron paper The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain adds the learning rule and the convergence theorem of Section 02. Rosenblatt built physical perceptron machines (the Mark I Perceptron at Cornell) that could learn to classify shapes from a 20×20 pixel camera. The 1960s were a period of heady promise: the New York Times in 1958 quoted Navy officials predicting perceptrons that could "walk, talk, see, write, reproduce itself and be conscious of its existence." The reality was more modest — perceptrons could learn linearly separable tasks and nothing more — and the gap between promise and reality was eventually documented brutally by Minsky and Papert.
Minsky and Papert's 1969 Perceptrons proved the XOR result of Section 02 and generalised it to a suite of functions a single-layer perceptron cannot learn. The book's mathematical content does not actually say that multilayer networks are limited, but its tone was widely read as condemning neural networks generally, and funding for the field dried up. Through the 1970s and early 1980s, neural-network research continued quietly — the backpropagation algorithm was independently derived several times during this period (Linnainmaa 1970, Werbos 1974) — but the field was out of fashion.
Rumelhart, Hinton, and Williams's 1986 Nature paper Learning representations by back-propagating errors revived the field by showing that multilayer networks could be trained by gradient descent on a smooth loss, and that the features learned in hidden layers could be interpreted as useful representations of the input. The late 1980s and early 1990s saw real applications — LeCun's 1989 convolutional network for zip-code digit recognition, Waibel's time-delay neural networks for speech, Pomerleau's 1989 ALVINN for autonomous driving. But the networks of the era were small (tens of thousands of parameters), trained on small datasets, and slower than the SVMs and graphical models that rose to prominence through the 1990s. The second wave peaked, then receded.
Through the late 1990s and early 2000s, neural networks were considered obsolete in most of academic machine learning. Statistical-learning-theory arguments favoured SVMs (convex, theoretically understood); graphical-model arguments favoured Bayesian networks and HMMs; the bias–variance trade-off said small, well-regularised models would beat large, hard-to-train neural networks. The dedicated handful of researchers who kept working on neural networks — Hinton at Toronto, LeCun at NYU, Bengio at Montreal — became the "Canadian mafia" whose work would define the eventual revival.
Hinton's 2006 paper A Fast Learning Algorithm for Deep Belief Nets introduced layerwise pretraining — a way to initialise the weights of a deep network using an unsupervised generative model of the data before fine-tuning with backpropagation — that made training networks of six or seven layers feasible. This was the proof-of-concept that deep networks could be trained, even if the specific algorithm (DBN-based pretraining) would later be superseded by better initialisation schemes and ReLUs. The late-2000s work on unsupervised pretraining (Bengio, Lamblin, Popovici, Larochelle 2007), sparse coding, and contractive autoencoders set the stage.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton's 2012 ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) won the ImageNet Large-Scale Visual Recognition Challenge by a margin so large it could not be dismissed. AlexNet's innovations — ReLU activations, dropout regularisation, data augmentation, GPU training, a deep convolutional architecture — were each important individually, but the combination and the scale of the dataset and the compute were what tipped the field. Every major AI system of the 2010s and 2020s traces its lineage back to this moment. The "deep-learning era" begins in September 2012.
The decade after AlexNet is a cascade of breakthroughs: VGG, ResNet, Inception, BatchNorm, Adam, word2vec, GloVe, seq2seq, attention, Transformer, BERT, GPT-1, GPT-2, GPT-3, AlphaGo, AlphaFold, DALL-E, Stable Diffusion, ChatGPT, GPT-4. Each is an instance of scaling up the same core recipe — gradient descent on a differentiable loss, over a large-enough dataset, with a carefully designed architecture — to a new problem domain or a new scale. Subsequent chapters of Part V cover this progression chapter by chapter.
To make every idea in the chapter concrete, this section walks through a minimal implementation of a two-hidden-layer MLP for MNIST digit classification, using NumPy only, with forward pass, backward pass, and mini-batch SGD written by hand. The example is the Hello, World! of neural networks and ties together loss, activation, backprop, initialisation, and SGD in a single executable artefact.
MNIST is 60,000 training and 10,000 test images of handwritten digits, each a 28×28 grayscale array with a label from {0, 1, …, 9}. We flatten each image into a 784-dimensional vector. Our network has architecture 784 → 128 → 64 → 10: input of size 784, two hidden layers of sizes 128 and 64 with ReLU activations, and a 10-class softmax output. Total parameters: about 110,000, which is small by modern standards but enough to reach 97–98% test accuracy.
Given a batch X ∈ ℝᴮ×⁷⁸⁴, compute Z¹ = X W¹ᵀ + b¹, A¹ = ReLU(Z¹), Z² = A¹ W²ᵀ + b², A² = ReLU(Z²), Z³ = A² W³ᵀ + b³, P = softmax(Z³). The loss is L = −(1/B) ∑ᵢ log Pᵢ,ᶜ(ᵢ). Implementing this in NumPy is about fifteen lines: one line per matmul, one per bias-add, one per nonlinearity, one for the softmax (with the numerical-stability trick of subtracting the max), and one for the cross-entropy reduction.
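Those fifteen-odd lines can be sketched as follows. The params dict mapping "W1".."W3" / "b1".."b3" to arrays is a naming convention assumed for this sketch, with each Wᵢ of shape (n_out, n_in) so that Z = X @ W.T + b matches the equations above:

```python
import numpy as np

def forward(X, params):
    """Forward pass for the 784 -> 128 -> 64 -> 10 network."""
    Z1 = X @ params["W1"].T + params["b1"]     # (B, 128) pre-activation
    A1 = np.maximum(Z1, 0.0)                   # ReLU
    Z2 = A1 @ params["W2"].T + params["b2"]    # (B, 64)
    A2 = np.maximum(Z2, 0.0)
    Z3 = A2 @ params["W3"].T + params["b3"]    # (B, 10) logits
    E = np.exp(Z3 - Z3.max(axis=1, keepdims=True))  # subtract max for stability
    P = E / E.sum(axis=1, keepdims=True)       # softmax probabilities
    return Z1, A1, Z2, A2, P

def cross_entropy(P, y):
    """Mean negative log-probability of the true labels y (integer classes)."""
    B = y.shape[0]
    return -np.log(P[np.arange(B), y] + 1e-12).mean()
```

Caching the intermediates Z¹, A¹, Z², A² in the return value is deliberate: the backward pass needs every one of them.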
The backward pass computes parameter gradients in reverse order. At the output, δ³ = (P − Y)/B where Y is the one-hot label matrix. Parameter gradients at the output: ∂L/∂W³ = δ³ᵀ A², ∂L/∂b³ = ∑ δ³. Propagate to the hidden layer: δ² = (δ³ W³) ⊙ 1[Z² > 0], where the indicator is the derivative of ReLU. Parameter gradients: ∂L/∂W² = δ²ᵀ A¹, ∂L/∂b² = ∑ δ². Propagate one more time: δ¹ = (δ² W²) ⊙ 1[Z¹ > 0], ∂L/∂W¹ = δ¹ᵀ X, ∂L/∂b¹ = ∑ δ¹. Another fifteen lines. The entire training loop is now just: sample a mini-batch, forward pass, compute loss, backward pass, update parameters (W ← W − η ∇W), repeat.
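The same equations in NumPy, assuming the cached intermediates Z¹, A¹, Z², A², P from a forward pass and a params dict holding the weight matrices under the names W1..W3 (a convention of this sketch):

```python
import numpy as np

def backward(X, y, Z1, A1, Z2, A2, P, params):
    """Backward pass: deltas flow from the softmax output back
    through both ReLU layers, following the equations above."""
    B = X.shape[0]
    Y = np.zeros_like(P)
    Y[np.arange(B), y] = 1.0                    # one-hot label matrix
    d3 = (P - Y) / B                            # delta at the output, (B, 10)
    grads = {"W3": d3.T @ A2, "b3": d3.sum(axis=0)}
    d2 = (d3 @ params["W3"]) * (Z2 > 0)         # ReLU derivative as a 0/1 mask
    grads["W2"], grads["b2"] = d2.T @ A1, d2.sum(axis=0)
    d1 = (d2 @ params["W2"]) * (Z1 > 0)
    grads["W1"], grads["b1"] = d1.T @ X, d1.sum(axis=0)
    return grads
```

Note the pattern each layer repeats: a matmul with the downstream weight matrix to propagate the delta, an elementwise mask for the nonlinearity, and an outer product with the cached activation for the weight gradient.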
Initialise weights from N(0, 2/n_in) (He initialisation); initialise biases to zero. Use SGD with learning rate 0.1 (or Adam at 10⁻³ if you prefer), batch size 128, and 10 epochs through the training set. After about two minutes on a laptop CPU the test accuracy should settle around 97.5%, with training loss near zero and training accuracy at 99.5%. The gap between train and test accuracy is the empirical overfitting; dropout (next chapter) and weight decay (also next chapter) can narrow it, pushing test accuracy above 98%.
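Putting the recipe together, here is a minimal sketch of He initialisation and one epoch of mini-batch SGD. The forward and backward callables are the passes described earlier in this section, passed in as arguments; all names and the params-dict layout are conventions of this sketch:

```python
import numpy as np

def init_params(sizes=(784, 128, 64, 10), seed=0):
    """He initialisation: W ~ N(0, 2/n_in), biases zero."""
    rng = np.random.default_rng(seed)
    params = {}
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
        params[f"W{i}"] = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        params[f"b{i}"] = np.zeros(n_out)
    return params

def sgd_epoch(X, y, params, forward, backward, lr=0.1, batch=128, seed=0):
    """One epoch: shuffle, slice mini-batches, update W <- W - lr * grad."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        Z1, A1, Z2, A2, P = forward(X[idx], params)
        grads = backward(X[idx], y[idx], Z1, A1, Z2, A2, P, params)
        for k in params:
            params[k] -= lr * grads[k]   # in-place SGD step
    return params
```

Ten calls to sgd_epoch over the MNIST training arrays, with a held-out accuracy check after each, is the entire training script.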
Writing out the backward pass by hand for a simple MLP is the single most clarifying exercise in deep learning. It shows that backpropagation is not magic — it is a mechanical application of the chain rule. It shows that SGD is not mysterious — it is parameter updates proportional to the negative gradient. It shows that every hyperparameter (learning rate, batch size, architecture width and depth, initialisation scale) has a concrete effect visible in the training trace. Once you have written the two-layer MLP in NumPy, every subsequent framework (PyTorch, JAX) is just a productivity layer on top of the same computation. The Deep Learning textbook by Goodfellow, Bengio, and Courville (Chapters 6 and 8) and Michael Nielsen's free online book Neural Networks and Deep Learning walk through essentially this example in more detail; Andrej Karpathy's micrograd and makemore series recreate the same ideas from scratch on video and are the best practical introduction available.
The deepest idea in deep learning is not any specific architecture or training algorithm — it is the reframing of the neural network as a representation learner rather than a classifier. The features learned in the hidden layers are, in a very real sense, more interesting than the final prediction.
The hidden layers of a trained network are not arbitrary numerical intermediates. Each hidden layer is an embedding — a point in a vector space — of the input, chosen by the training procedure to be useful for the final task. Early layers of a convnet trained on ImageNet learn edge detectors and colour-blob detectors similar to those found in the primary visual cortex; middle layers learn textures, shapes, and object parts; late layers learn whole-object categories. Early layers of a BERT-style language model learn word-level features; middle layers learn syntactic structure; late layers learn semantic and task-specific features. This hierarchical feature emergence, documented extensively in the interpretability literature (Olah, Mordvintsev, Schubert 2017; Cammarata, Carter, Goh et al. 2020), is not an artefact of any specific training recipe — it appears robustly across architectures, datasets, and tasks.
The practical consequence is that the features learned for one task are useful for many others. A convnet pretrained on ImageNet contains, in its intermediate layers, a feature extractor that is useful for medical imaging, satellite imagery, art classification, and thousands of other tasks, even though it was never trained on any of them. A language model pretrained on a large text corpus contains a feature extractor that is useful for sentiment classification, named-entity recognition, question answering, and translation, even though it was never trained on any of those tasks specifically. Transfer learning — freezing or fine-tuning a pretrained network on a new downstream task — is the workhorse deployment pattern of modern deep learning, and the subject of Chapter 07 of Part V.
The representations learned by a network depend on the training signal. Supervised classification yields representations useful for classification-like tasks. Unsupervised or self-supervised training — predicting masked words from context (BERT), predicting the next token (GPT), contrasting augmented views of the same image (SimCLR, MoCo), reconstructing masked patches (MAE) — yields representations useful for a very broad range of downstream tasks precisely because the self-supervised objective is more generic than any single supervised task. The quality of self-supervised representations scales dramatically with data and compute, which is why foundation models pretrained with self-supervision have become the dominant paradigm; Chapter 07 treats the topic in detail.
A late-layer representation of a neural network is a vector in a high-dimensional space (typically 256 to 4096 dimensions). The geometry of this space encodes semantic relationships: similar inputs map to nearby points; categorically different inputs map to distant points; transformations in input space (for example, translating an object in an image) often correspond to simple transformations in embedding space. The seminal example is the word-embedding arithmetic of word2vec (Mikolov et al. 2013): vec("king") − vec("man") + vec("woman") ≈ vec("queen"). Similar analogical structure has been demonstrated in image, audio, and multimodal embeddings. The embeddings are, in a rigorous sense, the "meaning" the network has extracted from the input.
Part IV's feature-engineering chapter taught you to construct features by hand — polynomial terms, interaction effects, target encoding, text vectorisation, PCA. Part V's neural networks learn these features automatically from data. The techniques do not conflict; they are complementary. Classical ML on top of hand-engineered features remains the right choice for small data, highly structured problems, and regulatory-sensitive applications. Deep learning on learned representations is the right choice for rich, high-dimensional, weakly-structured data where enough training data exists to support the learning. In many industrial settings, the two approaches are combined: a pretrained neural network produces embeddings, and a gradient-boosted tree classifier on top of those embeddings handles the final decision. This hybrid pattern is increasingly common and often outperforms either approach alone.
This chapter has introduced the smallest complete unit of the deep-learning idea. The remaining chapters of Part V — and much of Parts VI through XVIII — are elaborations and scalings of it. This closing section is a roadmap of what builds on top.
The vanilla SGD of Section 11, applied to the He-initialised MLP of Section 16, trains a small network well. It does not train a hundred-layer ResNet or a billion-parameter transformer. Chapter 02 covers the training innovations that scale the foundation: adaptive optimisers (Adam, AdamW, AdaGrad), learning-rate schedules (cosine decay, warmup, one-cycle), batch-normalisation and its variants, gradient clipping, mixed-precision training, and the hardware-software co-design (distributed data-parallel, ZeRO sharding, tensor parallelism) that turns a single-GPU training run into a thousand-GPU cluster job.
A network with 110,000 parameters trained on 60,000 MNIST examples overfits the training set substantially — the gap between 99.5% train and 97.5% test accuracy is the empirical overfitting of our worked example. Chapter 03 covers the techniques that close this gap: dropout, weight decay (L2 regularisation), data augmentation, early stopping, label smoothing, mixup, and the implicit regularisation of SGD itself. Regularisation is what turns a trainable network into one that deploys.
Applied to image data, an MLP's fully-connected architecture wastes parameters: every pixel is treated as independent, every position is a different feature. A convolutional architecture imposes the prior of translation invariance and local structure — the idea that the same feature detector is useful in every position of the image — and reduces parameter count by orders of magnitude while improving generalisation. Chapter 04 covers convolutions, pooling, receptive fields, classic architectures (LeNet, AlexNet, VGG, ResNet, DenseNet, EfficientNet, Vision Transformer), and the domain-specific design patterns of modern computer vision.
Applied to sequential data, an MLP cannot naturally handle variable-length input or capture temporal dependencies. Recurrent neural networks (RNNs), Long Short-Term Memory cells (LSTMs, Hochreiter and Schmidhuber 1997), and Gated Recurrent Units (GRUs, Cho et al. 2014) impose the prior that each step's output depends on a running hidden state. Chapter 05 covers these architectures, the sequence-to-sequence encoder–decoder framework that made modern machine translation possible, and the limits of recurrence that motivated the attention mechanisms of Chapter 06.
Attention — specifically self-attention — is the architectural innovation that unlocked modern sequence modelling. It replaces the recurrent step-by-step processing of an RNN with a constant-time (per-token) global-context computation, making it possible to train models with much longer effective context windows and much better parallelisability. Chapter 06 covers attention's mathematical formulation, its geometric interpretation, multi-head attention, cross-attention, and the path from attention to the Transformer architecture that defines Chapter 04 of Part VI.
The single most practically important development of the 2015–2022 period was the transfer-learning paradigm: pretrain a large network on a huge unlabelled corpus, then fine-tune it on a small labelled target task. Chapter 07 covers supervised and self-supervised pretraining objectives, fine-tuning strategies, parameter-efficient adaptation methods (LoRA, adapters, prefix tuning), and the emergence of foundation models as the shared substrate for modern AI systems.
Part VI takes up large language models, which are a specific elaboration of the transformer architecture of Chapter 06 applied to text at massive scale. Part VII covers computer vision, which marries convnets and transformers for images and video. Parts VIII onwards cover generative models, reinforcement learning, multimodal systems, graphs, recommender systems, time-series, simulation, MLOps, safety and alignment. Every one of those topics is, at the architectural core, a deep neural network trained by SGD on a differentiable loss — the core object of the chapter you have just read.
The neural-network literature is enormous and its rate of expansion since 2012 has been larger than any other field in computer science. The list below is built around a small set of anchor textbooks that cover the whole foundation coherently, followed by the specific historical papers whose ideas appear named throughout the chapter, the modern extensions that shaped contemporary practice, and the software frameworks that make the mathematics executable. Everything on this list is a primary source for something referenced above; the tiered structure is designed to let a reader start at the top and get a full working picture of Chapter 01's material in a few weeks of focused reading.
PyTorch: its nn.Module abstraction is the standard way to organise networks; the ecosystem of libraries (torchvision, torchaudio, torchtext, HuggingFace transformers) covers most of deep learning in 2026. The official tutorials are excellent. If you are learning deep learning in 2026, learn PyTorch first.
JAX: its grad, jit, vmap, and pmap primitives compose cleanly, making it particularly strong for scientific computing, RL, and research at scale. The Flax and Haiku libraries provide neural-network abstractions on top of raw JAX. Used heavily at Google, Anthropic, DeepMind, and for a growing share of academic research.
This page is Chapter 01 of Part V: Deep Learning Foundations, the opening chapter of the part. The previous four parts of this Compendium — mathematical foundations, programming practice, data infrastructure, classical machine learning — have all been prerequisites for this one. The vocabulary of Part IV (loss functions, gradient descent, bias–variance, regularisation, evaluation protocols) carries over directly; the engineering of Part III (pipelines, data quality, compute infrastructure) becomes load-bearing; the programming of Part II (NumPy, scientific computing, software engineering) is the fluency you need to write a training loop; the mathematics of Part I (linear algebra, calculus, probability, optimisation) is the language in which every architecture in Part V is expressed. The next six chapters of Part V build on this foundation: Chapter 02 scales training to deep networks, Chapter 03 adds the regularisation that makes them generalise, Chapters 04 through 06 specialise the architecture to images, sequences, and attention-based processing, and Chapter 07 introduces the transfer-learning paradigm that has become the default deployment pattern of modern deep learning. Parts VI through XVIII apply all of this to specific problem domains. Welcome to the deep-learning era.