Part I · Mathematical Foundations · Chapter 02

Calculus, the language of change.

If linear algebra tells neural networks what to compute, calculus tells them how to get better. Every gradient, every backward pass, every diffusion step, every physics-informed constraint is a statement about derivatives. This chapter rebuilds the calculus you remember — limits, derivatives, integrals — then pushes into the parts ML actually uses: the gradient and Hessian, the chain rule in matrix form, automatic differentiation, and the differential equations that animate neural ODEs, diffusion models, and physics-informed networks.

How to read this chapter

Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the diagrams exist to make the prose less abstract; the equations exist to make it precise. If an identity looks scary, look at the picture next to it first — almost all of multivariable calculus is a statement about either tangent planes or tiny volumes.

Notation: scalars are italic ($x$, $t$, $\lambda$); vectors are bold lowercase ($\mathbf{x}$, $\mathbf{v}$); matrices are capitals ($A$, $H$). A function of several variables is written $f(\mathbf{x})$ where $\mathbf{x} \in \mathbb{R}^n$. Partial derivatives are $\partial f / \partial x_i$ or $\partial_i f$. The gradient is $\nabla f$, the Hessian $\nabla^2 f$, and the Jacobian of a vector-valued map is $J$. A function of time is $x(t)$; its time derivative is $\dot x$. We use $\mathrm{d}$ for differentials and $\mathrm{D}$ for the total derivative when the distinction matters.

Contents

  1. What calculus is for, in ML · Motivation
  2. Derivatives in one variable · The core idea
  3. Taylor series and local approximation · Linearize anything
  4. Partial derivatives and the gradient · Going multi-d
  5. The Jacobian and matrix calculus · Derivatives of vectors
  6. The chain rule and backprop · How ML actually trains
  7. The Hessian and second-order geometry · Curvature
  8. Vector calculus: div, curl, Laplacian · Fields and flows
  9. Integration and change of variables · Areas, volumes, densities
  10. Critical points: minima, maxima, saddles · Loss landscapes
  11. Automatic differentiation · How frameworks compute gradients
  12. Ordinary differential equations · The zoo
  13. Linear ODEs and the matrix exponential · The one case you can solve
  14. Numerical ODE solvers · Euler to Runge–Kutta
  15. Stability and phase portraits · Long-time behaviour
  16. Partial differential equations · Heat, wave, Laplace
  17. Numerical PDEs and finite differences · Discretization
  18. Where it shows up in ML · Payoff

Section 01

What calculus is for, in machine learning

Linear algebra describes what a neural network computes. Calculus describes how it learns. Every training step is a statement about derivatives; every generative model built since 2020 is a statement about differential equations.

The single most important equation in modern ML is the one at the bottom of the backward pass:

$$ \theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_{\theta}\, \mathcal{L}(\theta_t). $$

"Adjust the parameters a little in the direction that decreases the loss fastest." The whole apparatus of deep learning — automatic differentiation frameworks, GPUs, the distinction between forward and backward passes — exists to evaluate that gradient efficiently at scale. Calculus is the theory behind the arrow.

But gradients are only the entry point. Once you start reading modern ML papers you meet differential equations everywhere. Neural ODEs treat a deep network as the solution to an ordinary differential equation, with the "depth" replaced by a continuous time variable. Diffusion models — the engine behind image and video generation — are a stochastic differential equation run forwards to add noise, and another run backwards to remove it. Physics-informed neural networks train by minimising the residual of a PDE. Even the humble softmax is the gradient of the log-sum-exp function.

Key idea

Calculus in ML splits into two jobs. Differential calculus tells us how to compute the change in a loss with respect to millions of parameters — that is training. Differential equations tell us how a system evolves over time or space — that is generative modelling, continuous-time networks, and physics-informed learning. This chapter teaches both.

The plan. Sections 02–03 are a fast refresher on single-variable calculus — derivatives, Taylor series — calibrated at the level a working ML engineer actually uses. Sections 04–11 are the multivariable core: gradients, Jacobians, the chain rule, Hessians, vector calculus, integration, critical points, and a careful look at automatic differentiation. Sections 12–17 are differential equations: ODEs, their analytical and numerical solution, stability, and the canonical partial differential equations. Section 18 threads the ML connections explicitly.

Section 02

Derivatives in one variable

The derivative is the cleanest idea in mathematics: the slope of a curve at a point, made precise by a limit. Everything that follows is an elaboration.

Given a function $f : \mathbb{R} \to \mathbb{R}$, the derivative of $f$ at $x$ is defined as

$$ f'(x) \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}, $$

provided the limit exists. Geometrically, $(f(x+h) - f(x))/h$ is the slope of the secant line through $(x, f(x))$ and $(x+h, f(x+h))$; as $h$ shrinks to zero the secant becomes the tangent, and its slope becomes $f'(x)$. The derivative answers a physical question — "how fast is $f$ changing right here?" — and a geometric one — "what is the slope of the curve here?" — with the same number.
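The limit definition is directly checkable: fix $x$, shrink $h$, and watch the secant slope settle on the tangent slope. A quick numerical sketch (the function and point are chosen arbitrarily):

```python
import math

def secant_slope(f, x, h):
    # Slope of the secant line through (x, f(x)) and (x+h, f(x+h)).
    return (f(x + h) - f(x)) / h

# Secant slopes of sin at x = 1 approach the true derivative cos(1).
for h in (1e-1, 1e-3, 1e-5):
    print(h, secant_slope(math.sin, 1.0, h))

assert abs(secant_slope(math.sin, 1.0, 1e-5) - math.cos(1.0)) < 1e-4
```

The error of the secant slope shrinks linearly in $h$, which is the first Taylor remainder in disguise.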

Rules you will use constantly

Three rules generate almost every derivative you need:

$$ (fg)' = f'g + fg', \qquad \left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}, \qquad (f \circ g)'(x) = f'(g(x))\, g'(x). $$

The last is the chain rule. It is the most important rule in this chapter — the rest of calculus is mostly about generalising it to multiple variables, vectors, and, eventually, to the computation graphs that define a neural network.

Higher derivatives and smoothness

Differentiating again gives $f''$, then $f'''$, and so on. A function whose first $k$ derivatives exist and are continuous is said to be of class $C^k$; if all derivatives of every order exist, $C^\infty$. Activation functions fall somewhere on this scale — ReLU is $C^0$, GELU and Swish are $C^\infty$ — and the difference matters: second-order optimisers, Newton's method, and Hessian-based analysis all need $f \in C^2$ at least.

A caution

The limit in the definition may not exist. $f(x) = |x|$ has no derivative at $0$ — its graph has a corner. ReLU has the same problem at the origin, which is why deep-learning frameworks use sub-gradients there: any value in $[0,1]$ is legal, and by convention most libraries pick $0$. The distinction between "differentiable" and "sub-differentiable" is one of the rare places where pure-mathematics pickiness leaks into engineering.
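The corner is visible numerically: the two one-sided difference quotients of ReLU at the origin converge to different values. A minimal check:

```python
def relu(x):
    return max(x, 0.0)

h = 1e-6
right = (relu(0.0 + h) - relu(0.0)) / h     # slope approaching from the right: 1
left = (relu(0.0 - h) - relu(0.0)) / (-h)   # slope approaching from the left: 0

# The one-sided limits disagree, so relu has no derivative at 0;
# any value in [0, 1] is a valid subgradient there.
assert abs(right - 1.0) < 1e-9 and abs(left) < 1e-9
```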

The Fundamental Theorem of Calculus

Integration and differentiation are inverse operations. If $F$ is an antiderivative of $f$ — that is, $F'(x) = f(x)$ — then

$$ \int_a^b f(x)\, \mathrm{d}x \;=\; F(b) - F(a). $$

This is the Fundamental Theorem of Calculus. It links the two operations that Newton and Leibniz discovered independently, and it is what makes calculus useful: to compute an area (an integral), you find an antiderivative (a differentiation problem in reverse). Every closed-form expectation, every normalising constant, every kernel-density integral you evaluate is a use of this theorem.

Section 03

Taylor series and local approximation

A function that is complicated globally is often simple locally. Taylor's theorem makes that intuition precise: near any point, a smooth function looks like a polynomial of as high a degree as you care to compute.

The Taylor expansion of $f$ about a point $a$ is

$$ f(x) \;=\; f(a) \;+\; f'(a)\,(x-a) \;+\; \frac{f''(a)}{2!}\,(x-a)^2 \;+\; \frac{f'''(a)}{3!}\,(x-a)^3 \;+\; \cdots $$

Truncating at the first-order term gives the tangent-line approximation $f(x) \approx f(a) + f'(a)(x-a)$ — the linearisation that underlies gradient descent. Truncating at the second-order term gives the quadratic approximation, which underlies Newton's method. Modern optimisers all sit on this ladder: first-order methods use only the gradient, second-order methods use the Hessian.

Taylor with remainder

When you truncate, the missing tail has a name. Lagrange's form of the remainder says that for some $\xi$ between $a$ and $x$,

$$ f(x) \;=\; \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k \;+\; \underbrace{\frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{n+1}}_{\text{remainder}}. $$

The remainder tells you exactly how much precision you give up. If $|f^{(n+1)}|$ is bounded, increasing the expansion order by one shrinks the error by another factor of $(x-a)$. This is why Taylor-based analyses of optimisation algorithms always talk about order of convergence: an order-$k$ method has error shrinking like the $k$th power of the step.
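The order-of-convergence claim is easy to verify empirically: halve the step and the order-$n$ truncation error of $e^x$ about $0$ should fall by roughly $2^{n+1}$. A sketch (the evaluation point $0.1$ is illustrative):

```python
import math

def taylor_exp(x, n):
    # Order-n Taylor polynomial of e^x about 0.
    return sum(x**k / math.factorial(k) for k in range(n + 1))

# Halving the step cuts an order-n truncation error by about 2^(n+1),
# because the remainder scales like x^(n+1).
for n in (1, 2, 3):
    e1 = abs(math.exp(0.1) - taylor_exp(0.1, n))
    e2 = abs(math.exp(0.05) - taylor_exp(0.05, n))
    assert abs(e1 / e2 - 2 ** (n + 1)) < 0.5
```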

Canonical series worth memorising

$$ e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}, \qquad \ln(1+x) = \sum_{k=1}^{\infty} (-1)^{k+1} \frac{x^k}{k}, \qquad \frac{1}{1-x} = \sum_{k=0}^{\infty} x^k. $$

These three show up everywhere — $e^x$ in softmaxes and Gaussians, $\ln(1+x)$ in cross-entropy losses (where the $-\ln(1 - p)$ near $p=1$ makes the loss explode), and the geometric series in discounted returns from reinforcement learning. Recognising which series you are secretly manipulating often simplifies a long algebraic argument to a paragraph.

Key idea

Taylor's theorem is the bridge from calculus to optimisation. First-order methods (gradient descent, momentum, Adam) approximate the loss by a tangent plane. Second-order methods (Newton, L-BFGS) approximate it by a paraboloid. The cost-accuracy trade-off between them is nothing more than the trade-off between the first and second Taylor terms.

Section 04

Partial derivatives and the gradient

A function $f : \mathbb{R}^n \to \mathbb{R}$ has not one derivative but $n$ — one for each axis. Bundle them into a vector and you get the gradient, the single most important object in applied multivariable calculus.

Fix all inputs but one. The partial derivative of $f$ with respect to $x_i$ is the ordinary one-variable derivative you get when every other input is held constant:

$$ \frac{\partial f}{\partial x_i}(\mathbf{x}) \;=\; \lim_{h \to 0} \frac{f(\mathbf{x} + h\, \mathbf{e}_i) - f(\mathbf{x})}{h}, $$

where $\mathbf{e}_i$ is the $i$th standard basis vector. The partial measures how $f$ responds to a tiny push along axis $i$ only.

The gradient

Collect all the partials into a column vector:

$$ \nabla f(\mathbf{x}) \;=\; \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \\ \vdots \\ \partial f / \partial x_n \end{bmatrix}. $$

The gradient $\nabla f$ is a vector field on $\mathbb{R}^n$ — a vector attached to every point. Two geometric facts justify the machinery that follows. First, the gradient points in the direction of steepest ascent of $f$; its negative points the way downhill. Second, the gradient is always orthogonal to the level set $\{ \mathbf{x} : f(\mathbf{x}) = c \}$ — level curves in two dimensions, level surfaces in three, level hypersurfaces in general.

Level curves of $f(x,y) = \tfrac{1}{2}x^2 + y^2$ (concentric ellipses) together with the gradient $\nabla f$ at several points. Each gradient vector is perpendicular to the level curve through its tail and points in the direction where $f$ grows fastest.

The directional derivative

The rate of change of $f$ along an arbitrary unit vector $\mathbf{u}$ is captured by the directional derivative

$$ D_{\mathbf{u}} f(\mathbf{x}) \;=\; \lim_{h \to 0} \frac{f(\mathbf{x} + h \mathbf{u}) - f(\mathbf{x})}{h} \;=\; \nabla f(\mathbf{x}) \cdot \mathbf{u}. $$

The dot product is maximised when $\mathbf{u}$ aligns with $\nabla f$ — which is exactly the claim that $\nabla f$ points uphill. It is minimised in the opposite direction, where $D_{-\nabla f / \|\nabla f\|} f = -\|\nabla f\|$. That last number is the fastest rate at which you can decrease $f$ from $\mathbf{x}$, and it is what gradient descent cashes in on at every step.
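Both facts, the dot-product formula and the steepest-ascent bound, can be sanity-checked with finite differences. A sketch using the ellipse function from the level-curve figure (the point and direction are arbitrary):

```python
import math

# f(x, y) = x^2/2 + y^2, with gradient (x, 2y).
def f(x, y):
    return 0.5 * x * x + y * y

x0, y0 = 1.0, 2.0
grad = (x0, 2 * y0)                 # analytic gradient: (1, 4)

u = (3 / 5, 4 / 5)                  # a unit vector
h = 1e-6
numeric = (f(x0 + h * u[0], y0 + h * u[1]) - f(x0, y0)) / h
analytic = grad[0] * u[0] + grad[1] * u[1]   # directional derivative = grad . u

assert abs(numeric - analytic) < 1e-4
# No direction can beat the gradient norm: D_u f <= ||grad f||.
assert analytic <= math.hypot(*grad) + 1e-12
```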

Section 05

The Jacobian and matrix calculus

When a function maps vectors to vectors rather than vectors to scalars, its derivative is not a vector but a matrix. Learning to read this matrix — the Jacobian — is what "matrix calculus" actually means.

Consider a map $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, written $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x}))$. Each output has $n$ partial derivatives; collect them all and you get an $m \times n$ matrix,

$$ J(\mathbf{x}) \;=\; \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \;=\; \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}. $$

This is the Jacobian matrix of $\mathbf{f}$ at $\mathbf{x}$. Its rows are the gradients of the individual output coordinates; its columns are how the whole output vector responds to a push along each input axis. Geometrically, $J$ is the best linear approximation to $\mathbf{f}$ near $\mathbf{x}$:

$$ \mathbf{f}(\mathbf{x} + \mathbf{h}) \;\approx\; \mathbf{f}(\mathbf{x}) \;+\; J(\mathbf{x}) \mathbf{h}. $$

Two special cases

When $m = 1$, the Jacobian is a single row — the gradient transposed, $J = (\nabla f)^\top$. When $n = 1$, it is a single column — the velocity vector $\mathbf{f}'(t)$ of a parameterised curve. The gradient and the velocity are the same idea on opposite sides of the input/output asymmetry.

Matrix-calculus identities you actually need

The rules below cover maybe 90% of the matrix-calculus you hit reading ML papers. Treat them as reflexes.

$$ \frac{\partial}{\partial \mathbf{x}} \left( \mathbf{a}^\top \mathbf{x} \right) = \mathbf{a}, \qquad \frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^\top A \mathbf{x} \right) = (A + A^\top)\, \mathbf{x}, $$ $$ \frac{\partial}{\partial \mathbf{x}} \left\| \mathbf{x} \right\|_2^2 = 2\mathbf{x}, \qquad \frac{\partial}{\partial A} \operatorname{tr}(A^\top B) = B, \qquad \frac{\partial}{\partial A} \operatorname{tr}(A B A^\top) = A(B + B^\top). $$

Two conventions coexist in the literature. In denominator layout (common in applied statistics and this compendium), the derivative of a scalar with respect to a column vector is a column vector. In numerator layout (common in matrix-calculus cheat sheets), it is a row. If you find yourself confused by a sign or a transpose, check which convention the paper is using. The Matrix Cookbook (section 18) tabulates both.
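A habit worth building: whenever an identity or layout convention is in doubt, check it against finite differences. A sketch for the quadratic-form rule in denominator layout:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

def f(x):
    # The quadratic form x' A x.
    return x @ A @ x

# Central finite differences approximate the gradient component by component.
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
analytic = (A + A.T) @ x    # the identity d(x'Ax)/dx = (A + A')x

assert np.allclose(numeric, analytic, atol=1e-5)
```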

Key idea

The Jacobian is the linear map that sits at the heart of the chain rule. Every backward pass in every neural network is a product of Jacobians — usually with the caveat that the framework never materialises them explicitly, multiplying by them through vector–Jacobian products instead.

Section 06

The chain rule, and what backprop really is

The one-variable chain rule generalises to vectors as matrix multiplication: the Jacobian of a composition is the product of the Jacobians. Training a neural network is this identity, applied to a long chain of matrices, billions of times.

If $\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^p$ and $\mathbf{f} : \mathbb{R}^p \to \mathbb{R}^m$ are differentiable, then the composition $\mathbf{f} \circ \mathbf{g} : \mathbb{R}^n \to \mathbb{R}^m$ has Jacobian

$$ J_{\mathbf{f} \circ \mathbf{g}}(\mathbf{x}) \;=\; J_{\mathbf{f}}\bigl(\mathbf{g}(\mathbf{x})\bigr) \,\cdot\, J_{\mathbf{g}}(\mathbf{x}). $$

That's it. All of backprop is that line applied to a chain of $L$ functions (layers). The derivative of a scalar loss $\mathcal{L}$ with respect to the input of layer $\ell$ is

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{h}_\ell} \;=\; \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \, J_L \, J_{L-1} \, \cdots \, J_{\ell+1}. $$

Why the reverse pass is efficient

The two ends of the computation graph look asymmetric. The forward pass starts with an input of size $n$ and ends with a scalar loss. Naively, a derivative of a scalar with respect to $n$ inputs would cost $n$ evaluations — ruinous when $n$ is a billion parameters. The trick, reverse-mode automatic differentiation, is to multiply the Jacobians from the left (the scalar side) rather than the right.

At every step we maintain a small vector $\mathbf{v}^\top = \partial \mathcal{L} / \partial \mathbf{h}_\ell$ (one number per neuron), not the full Jacobian $J_\ell$. To propagate one layer back, the framework computes $\mathbf{v}^\top J_\ell$ — a vector–Jacobian product — without ever forming $J_\ell$. The result is that a gradient of a scalar with respect to $n$ parameters costs roughly the same as one forward pass, regardless of $n$. That single asymmetry is why deep learning is computationally possible.
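The cost asymmetry is easy to reproduce with plain matrices standing in for the layer Jacobians: sweeping a row vector through the chain keeps only a vector at every step, while the naive route materialises a full matrix product. A sketch (the sizes and random matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
Js = [rng.standard_normal((n, n)) for _ in range(5)]   # stand-ins for layer Jacobians
v = rng.standard_normal(n)                             # dL/dh at the loss end

# Reverse mode: only ever hold a vector; each step is one vector-matrix product.
acc = v
for J in Js:
    acc = acc @ J

# Naive: form the full Jacobian chain first (matrix-matrix products, ~n times costlier).
full = np.eye(n)
for J in Js:
    full = full @ J

assert np.allclose(acc, v @ full)
```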

Forward vs reverse mode

Forward-mode AD does the opposite: it carries a Jacobian–vector product forward, which is efficient when the input is low-dimensional and the output is large. It is the right choice for sensitivity analysis of ODE solvers with a few parameters and many observations. Reverse-mode is the right choice when training a model — one scalar loss, many parameters.

A caution

The chain rule requires that the intermediate values $\mathbf{h}_\ell$ (the activations) be available on the backward pass. Frameworks either cache them (memory cost $O(L)$) or recompute them on the fly (compute cost $2\times$ the forward pass). This trade-off is the whole subject of gradient checkpointing, which is why training a 70B-parameter model on a single GPU is even conceivable.

Section 07

The Hessian and second-order geometry

The gradient tells you the slope of a loss landscape. The Hessian tells you its curvature. For a function of $n$ variables, the Hessian is an $n \times n$ symmetric matrix — and all of second-order optimisation, stability analysis, and the geometry of saddle points lives inside its eigenvalues.

For a twice-differentiable $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian is

$$ \nabla^2 f(\mathbf{x}) \;=\; H(\mathbf{x}) \;=\; \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}. $$

If $f \in C^2$, mixed partials commute — this is Clairaut's theorem — so $H$ is symmetric. The second-order Taylor expansion around $\mathbf{x}$ is then

$$ f(\mathbf{x} + \mathbf{h}) \;\approx\; f(\mathbf{x}) \;+\; \nabla f(\mathbf{x})^\top \mathbf{h} \;+\; \tfrac{1}{2}\, \mathbf{h}^\top H(\mathbf{x})\, \mathbf{h}. $$

The quadratic form $\mathbf{h}^\top H \mathbf{h}$ is the piece that determines local shape. Diagonalise $H = Q \Lambda Q^\top$ (spectral theorem — see the linear algebra chapter) and the eigenvalues $\lambda_1, \ldots, \lambda_n$ tell you everything: all positive and the surface curves up in every direction (a bowl); all negative and it curves down (a dome); mixed signs give a saddle; and the spread $\lambda_{\max}/\lambda_{\min}$, the condition number, governs how slowly gradient descent crawls along the flat directions.

Newton's method

Minimising the quadratic approximation gives Newton's step

$$ \mathbf{x}_{t+1} \;=\; \mathbf{x}_t \;-\; H(\mathbf{x}_t)^{-1} \nabla f(\mathbf{x}_t). $$

If $f$ is quadratic, this lands on the exact minimum in one step. Far from the minimum, or when $H$ is indefinite, Newton's method can diverge wildly — which is why deep-learning practice instead uses quasi-Newton variants (L-BFGS), diagonal approximations (Adam, RMSProp), or Gauss–Newton (most nonlinear least-squares solvers, all modern reinforcement-learning natural-gradient methods).
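The one-step claim is worth seeing concretely. A sketch on a strictly convex quadratic $f(\mathbf{x}) = \tfrac12 \mathbf{x}^\top H \mathbf{x} - \mathbf{b}^\top \mathbf{x}$, whose minimiser is $H^{-1}\mathbf{b}$ (the numbers are arbitrary):

```python
import numpy as np

# A positive-definite Hessian and a linear term.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([10.0, -7.0])               # arbitrary starting point
grad = H @ x - b                          # gradient of the quadratic
x_next = x - np.linalg.solve(H, grad)     # Newton step: x - H^{-1} grad

# One step lands exactly on the minimiser H^{-1} b.
assert np.allclose(x_next, np.linalg.solve(H, b))
```

Note the `solve` rather than an explicit inverse: in practice Newton steps are computed by solving the linear system, never by forming $H^{-1}$.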

Key idea

In high dimensions, almost every critical point of a deep network loss is a saddle, not a minimum — the probability that all $n$ eigenvalues share a sign shrinks as $2^{-n}$. SGD's effectiveness is partly that noise tends to escape saddles in directions of negative curvature, which is what the Hessian's negative eigenvalues pick out.

Section 08

Vector calculus: divergence, curl, Laplacian

Once the objects have more than one scalar component, calculus grows three new operators. They are geometric statements about vector fields — how much flux leaves a point, how much it swirls, how much it deviates from its neighbourhood average.

A vector field is a map $\mathbf{F} : \mathbb{R}^n \to \mathbb{R}^n$ — a vector attached to every point of space. The wind above a weather map, the gradient of a loss, the velocity of particles in a diffusion — all are vector fields. Three operators built out of partial derivatives describe their local behaviour.

Divergence

The divergence of a vector field $\mathbf{F} = (F_1, \ldots, F_n)$ is the scalar

$$ \nabla \cdot \mathbf{F} \;=\; \sum_{i=1}^n \frac{\partial F_i}{\partial x_i}. $$

It measures the net outflow rate per unit volume — "how much is this point a source?" Positive divergence at $\mathbf{x}$ means fluid is expanding outward there; negative means it is converging. In ML it shows up directly in the continuity equation of probability flow, $\partial_t \rho + \nabla \cdot (\rho \mathbf{v}) = 0$, which is what diffusion-model sampling numerically integrates.

Curl

In three dimensions, the curl is a vector,

$$ \nabla \times \mathbf{F} \;=\; \left( \partial_y F_z - \partial_z F_y,\; \partial_z F_x - \partial_x F_z,\; \partial_x F_y - \partial_y F_x \right). $$

It measures how much the field rotates locally — the axis-and-magnitude of the infinitesimal rotation you would feel dropped into the flow. A field is irrotational if its curl is zero everywhere; irrotational smooth fields are locally the gradient of some scalar, a fact called the Poincaré lemma. Gradients, in particular, always have zero curl — which is why "is this vector field a gradient?" reduces to checking that $\nabla \times \mathbf{F} = \mathbf{0}$.
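The "gradients have zero curl" fact can be checked numerically: write down the gradient field of any smooth scalar by hand and verify that its finite-difference curl vanishes. A sketch (the particular $f$ is arbitrary):

```python
import math

# f(x, y, z) = x*y + sin(z); the field F below is its gradient, so curl F = 0.
h = 1e-5

def partial(g, i, p):
    # Central difference of g in coordinate i at point p.
    q = list(p); q[i] += h
    r = list(p); r[i] -= h
    return (g(*q) - g(*r)) / (2 * h)

F = [lambda x, y, z: y,               # df/dx
     lambda x, y, z: x,               # df/dy
     lambda x, y, z: math.cos(z)]     # df/dz

p = (0.3, -1.1, 0.8)
curl = (partial(F[2], 1, p) - partial(F[1], 2, p),
        partial(F[0], 2, p) - partial(F[2], 0, p),
        partial(F[1], 0, p) - partial(F[0], 1, p))
assert all(abs(c) < 1e-8 for c in curl)
```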

The Laplacian

The Laplacian of a scalar field $f$ is the divergence of its gradient:

$$ \Delta f \;=\; \nabla \cdot \nabla f \;=\; \sum_{i=1}^n \frac{\partial^2 f}{\partial x_i^2}. $$

It measures the difference between $f$ at a point and the average of $f$ over a small sphere around that point. Positive Laplacian means the point is a local minimum-ish dip; negative means a bump. The Laplacian is the operator that appears in nearly every PDE in physics: the heat equation $\partial_t u = \Delta u$, the wave equation $\partial_{tt} u = c^2 \Delta u$, Laplace's equation $\Delta u = 0$, the Schrödinger equation. It is also the trace of the Hessian, which is why second-order statistics of loss landscapes look like diffusion.
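The averaging interpretation is exactly what the standard five-point finite-difference stencil computes. A sketch for a quadratic, where the stencil is exact up to rounding:

```python
def laplacian_5pt(f, x, y, h=1e-3):
    # 4/h^2 times (average of f over the 4 axis neighbours minus f at the centre).
    avg = (f(x + h, y) + f(x - h, y) + f(x, y + h) + f(x, y - h)) / 4
    return 4 * (avg - f(x, y)) / h**2

f = lambda x, y: x**2 + 3 * y**2      # Laplacian is 2 + 6 = 8 everywhere
assert abs(laplacian_5pt(f, 0.7, -1.2) - 8.0) < 1e-4
```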

Two theorems to know by name

Green / Stokes' theorem equates an integral of curl over a surface to a line integral around its boundary. Gauss's divergence theorem equates the integral of divergence over a volume to the flux through its boundary. Both are "the integral of a derivative equals a boundary term" — the multivariable Fundamental Theorem of Calculus — and both show up in normalising-flow Jacobians and physics-informed training objectives.

Section 09

Integration and change of variables

Multivariable integration is what lets us talk about total mass, expected values, and normalising constants. Two mechanics — iterated integration and the change-of-variables formula — are the only ones most ML work ever uses.

For a function $f$ of two variables over a rectangle, Fubini's theorem says the double integral equals an iterated integral:

$$ \iint_{[a,b]\times[c,d]} f(x,y)\,\mathrm{d}A \;=\; \int_a^b \int_c^d f(x,y)\, \mathrm{d}y\, \mathrm{d}x. $$

You integrate one variable at a time, treating the other as constant. Nearly every multivariable integral in practice either (a) factorises — so each piece is a one-variable problem — or (b) is intractable and handled by Monte Carlo sampling. In ML, the latter is overwhelmingly common: expectations over high-dimensional posteriors are the whole reason variational inference and MCMC exist.

The change of variables formula

A map $\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^n$ with Jacobian $J_{\mathbf{g}}(\mathbf{u})$ transforms integrals according to

$$ \int_{\mathbf{g}(U)} f(\mathbf{x}) \, \mathrm{d}\mathbf{x} \;=\; \int_U f\bigl(\mathbf{g}(\mathbf{u})\bigr) \, \left| \det J_{\mathbf{g}}(\mathbf{u}) \right| \, \mathrm{d}\mathbf{u}. $$

The $|\det J|$ term corrects for the local stretching of volumes under $\mathbf{g}$. This is the identity that makes normalising flows work: if $\mathbf{x} = \mathbf{g}(\mathbf{u})$ transforms a simple base density $p_{\mathbf{u}}$ into a target $p_{\mathbf{x}}$, the two densities are related by

$$ p_{\mathbf{x}}(\mathbf{x}) \;=\; p_{\mathbf{u}}\bigl(\mathbf{g}^{-1}(\mathbf{x})\bigr) \, \left| \det J_{\mathbf{g}^{-1}}(\mathbf{x}) \right|. $$

The entire engineering problem of building a normalising flow is engineering $\mathbf{g}$ so that $\det J_{\mathbf{g}}$ is cheap to compute — either triangular (autoregressive flows) or block-triangular (coupling layers). The calculus is just a log-Jacobian away; the architecture is the art.
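For a one-dimensional affine flow the whole formula fits in a few lines. A sketch (the map $g(u) = 2u + 1$ and the standard-normal base are illustrative): the transformed density should coincide with a Gaussian of mean $1$ and standard deviation $2$.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Flow x = g(u) = 2u + 1 on a standard normal base:
# g^{-1}(x) = (x - 1)/2 and |det J_{g^{-1}}| = 1/2, so
# p_x(x) = p_u((x - 1)/2) * 1/2, which is exactly N(1, 2^2).
for x in (-1.0, 0.5, 3.0):
    flow = normal_pdf((x - 1) / 2) * 0.5
    assert abs(flow - normal_pdf(x, mu=1.0, sigma=2.0)) < 1e-12
```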

Polar and spherical

Two coordinate changes appear over and over: polar in $\mathbb{R}^2$ (with Jacobian $r$) and spherical in $\mathbb{R}^n$ (with Jacobian involving $r^{n-1}$). The famous computation of the Gaussian normalising constant $\int e^{-x^2/2}\, \mathrm{d}x = \sqrt{2\pi}$ is a change to polar coordinates applied to the product of two copies. It is a rite of passage, and worth sitting down and doing once by hand.
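Before doing the polar-coordinate computation by hand, it is reassuring to confirm the answer numerically. A crude Riemann-sum sketch (grid and truncation chosen loosely; the Gaussian tail beyond the cutoff is negligible):

```python
import math

# Riemann sum for the Gaussian integral over [-10, 10] with step 1e-3.
dx = 1e-3
total = sum(math.exp(-0.5 * (k * dx) ** 2) for k in range(-10_000, 10_001)) * dx
assert abs(total - math.sqrt(2 * math.pi)) < 1e-6
```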

Section 10

Critical points: minima, maxima, and saddles

Find where the gradient vanishes; then use the Hessian to classify. The three-line procedure is the whole story in one variable, and with a few caveats, it survives into arbitrary dimensions — which is good news, because "find the minimum" is what every training loop in ML does.

A critical point of $f : \mathbb{R}^n \to \mathbb{R}$ is a point $\mathbf{x}^*$ where $\nabla f(\mathbf{x}^*) = \mathbf{0}$. The first-order condition says every local extremum is a critical point; the converse is false, which is why second-order tests exist.

Given that $\nabla f = \mathbf{0}$, the Hessian $H$ tells you the local geometry (section 07). If $H$ is positive definite, you are at a local minimum; negative definite, a local maximum; indefinite, a saddle; singular, inconclusive. In one variable this reduces to the familiar "$f''(x^*) > 0 \Rightarrow $ minimum" test.

Saddles dominate in high dimensions

A standard heuristic: if you pick a random symmetric matrix, the probability that all $n$ of its eigenvalues have the same sign shrinks exponentially with $n$. Deep-network loss landscapes are therefore riddled with saddle points — far more than with true local minima. This is a feature, not a bug. Noisy gradient methods escape saddles readily; they struggle more with poorly conditioned minima, which is why preconditioners like Adam and RMSProp exist.

Constrained optimisation and Lagrange multipliers

When you need to minimise $f$ subject to $g(\mathbf{x}) = 0$, the condition "gradient vanishes" is replaced by "gradient is parallel to the constraint gradient":

$$ \nabla f(\mathbf{x}) \;=\; \lambda \, \nabla g(\mathbf{x}), \qquad g(\mathbf{x}) = 0. $$

$\lambda$ is the Lagrange multiplier. For inequality constraints, the Karush–Kuhn–Tucker (KKT) conditions generalise this — the chapter on optimisation theory covers them in depth. For now, note that every time you train a model with a hard constraint (weight-norm clipping, orthogonality, probability simplex) a Lagrange multiplier is lurking in the update rule.
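The parallel-gradients condition is visible on the textbook example of minimising $x^2 + y^2$ subject to $x + y = 1$, whose solution $(\tfrac12, \tfrac12)$ with $\lambda = 1$ can be derived by hand:

```python
# At the constrained minimum (1/2, 1/2):
# grad f = (2x, 2y) = (1, 1) is parallel to grad g = (1, 1), with lambda = 1.
x, y = 0.5, 0.5
grad_f = (2 * x, 2 * y)
grad_g = (1.0, 1.0)
lam = grad_f[0] / grad_g[0]

assert grad_f == (lam * grad_g[0], lam * grad_g[1])   # gradients parallel
assert abs(x + y - 1) < 1e-12                          # constraint satisfied
```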

The optimiser's bestiary

Gradient descent only ever moves to critical points. Momentum accelerates along flat directions of the Hessian and damps oscillations in sharp ones. Adam is diagonal second-order information estimated on the fly. Natural gradient uses the Fisher information matrix (a cousin of the Hessian). Every modern optimiser is a rearrangement of the first two terms of a Taylor expansion — with a well-chosen matrix between them.

Section 11

Automatic differentiation

Frameworks like PyTorch and JAX compute exact derivatives — not symbolic formulas, not finite-difference approximations. The mechanism is called automatic differentiation, and once you know how it works it stops feeling like magic.

Three techniques are sometimes confused. Symbolic differentiation manipulates formulas and suffers from expression swell. Numerical differentiation approximates with finite differences and suffers from rounding and truncation error. Automatic differentiation applies the chain rule to the program itself, one primitive at a time, and is exact to floating-point precision. Frameworks do the third.

The computation graph

Every time you write y = f(x) in a framework, it records the operations as nodes of a directed acyclic graph. Each primitive knows its own local derivative — for $z = \sin(y)$, the local derivative is $\cos(y)$; for $z = x \cdot w$ it is $w$ with respect to $x$ and $x$ with respect to $w$. Reverse-mode AD walks the graph backward, multiplying local derivatives together, to assemble the full chain-rule product.

Dual numbers (forward mode)

Forward-mode AD has a clean algebraic realisation. Extend the reals to $\mathbb{R}[\varepsilon]$ where $\varepsilon^2 = 0$. For any analytic $f$,

$$ f(x + \varepsilon\, \dot x) \;=\; f(x) \;+\; \varepsilon\, f'(x)\, \dot x. $$

Everything after the $\varepsilon$ is exactly the derivative — the machinery is a generalisation of complex numbers, with $\varepsilon^2 = 0$ instead of $i^2 = -1$. Implementing forward-mode AD is no more than overloading arithmetic on pairs $(x, \dot x)$.
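That overloading takes only a few lines. A minimal sketch of a dual-number class, with $\sin$ lifted by hand as one primitive (the class and helper are ours, not any library's):

```python
import math

class Dual:
    """Number a + b*eps with eps^2 = 0; the b slot carries the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0.
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def sin(x):
    # A primitive lifted to duals: f(a + b eps) = f(a) + f'(a) b eps.
    return Dual(math.sin(x.a), math.cos(x.a) * x.b)

# Differentiate f(x) = x^2 sin(x) at x = 1 by seeding the eps slot with 1.
x = Dual(1.0, 1.0)
y = x * x * sin(x)
expected = 2 * math.sin(1.0) + math.cos(1.0)   # product rule by hand
assert abs(y.b - expected) < 1e-12
```

The seed $\dot x = 1$ makes the eps slot the derivative with respect to $x$; seeding a different input instead differentiates with respect to that one.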

Reverse mode: the vjp

Reverse-mode AD is not so clean algebraically, but it is what actually trains neural networks. For every primitive $\mathbf{y} = \mathbf{g}(\mathbf{x})$ it needs one function: given the adjoint $\bar{\mathbf{y}} = \partial \mathcal{L} / \partial \mathbf{y}$ arriving from downstream, compute the upstream adjoint $\bar{\mathbf{x}} = J_{\mathbf{g}}^\top \bar{\mathbf{y}}$ (as a row vector, $\bar{\mathbf{x}}^\top = \bar{\mathbf{y}}^\top J_{\mathbf{g}}$). This is the vector–Jacobian product, or vjp. A framework ships hand-written vjps for every primitive; user code that composes them inherits the chain rule for free.
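For a linear primitive $\mathbf{y} = W\mathbf{x}$, the Jacobian is $W$ itself and the vjp is a transposed multiply. A sketch checking a hand-written vjp against finite differences (the names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)

# Primitive y = W x has Jacobian J = W; its vjp maps ybar to W' ybar.
def vjp_matvec(ybar):
    return W.T @ ybar

ybar = rng.standard_normal(3)       # adjoint arriving from downstream
xbar = vjp_matvec(ybar)

# Check against finite differences of the scalar ybar . (W x).
h = 1e-6
loss = lambda x: ybar @ (W @ x)
numeric = np.array([(loss(x + h * e) - loss(x - h * e)) / (2 * h) for e in np.eye(5)])
assert np.allclose(xbar, numeric, atol=1e-5)
```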

Key idea

AD is a programming-language story as much as a calculus one. JAX's grad, jvp, and vjp expose the whole apparatus directly. PyTorch hides it behind .backward(). Once you see that backprop is a closed-form recursion over a DAG, the engineering choices — checkpointing, mixed precision, activation offloading — stop being mysterious.

Section 12

Ordinary differential equations

An ODE specifies how a quantity changes with respect to one variable — usually time. Given the change, you want the quantity. Solving means integrating, but analytical integration is rare; numerical integration is the real tool.

A first-order ordinary differential equation is a relation

$$ \dot{\mathbf{x}}(t) \;=\; \mathbf{f}\bigl(\mathbf{x}(t),\, t\bigr), \qquad \mathbf{x}(t_0) = \mathbf{x}_0. $$

The pair — equation plus initial condition — is an initial value problem (IVP). Picard's theorem guarantees that if $\mathbf{f}$ is Lipschitz in $\mathbf{x}$ on a neighbourhood of $(t_0, \mathbf{x}_0)$, a unique solution exists locally. In practice every well-posed dynamical system you meet in ML satisfies this.

The basic zoo

A few one-dimensional types yield to pen and paper. Separable equations, $\dot x = f(x)\,g(t)$, solve by dividing through by $f(x)$ and integrating each side. Linear first-order equations, $\dot x + p(t)\,x = q(t)$, solve with the integrating factor $e^{\int p}$. Exponential growth $\dot x = \lambda x$ and the logistic equation are the standard worked examples; almost everything else is numerical.

Second-order and higher

A second-order ODE like $\ddot{x} + 2\gamma \dot x + \omega_0^2 x = 0$ — the damped harmonic oscillator — lives in the imagination of every physicist and underlies every result in stability analysis you will ever read. It rewrites as a first-order system in two variables:

$$ \frac{\mathrm{d}}{\mathrm{d}t} \begin{pmatrix} x \\ \dot x \end{pmatrix} \;=\; \begin{pmatrix} 0 & 1 \\ -\omega_0^2 & -2\gamma \end{pmatrix} \begin{pmatrix} x \\ \dot x \end{pmatrix}. $$

This reduction — any $n$th-order ODE becomes a first-order system in $n$ variables — is the reason practitioners can focus entirely on first-order systems $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}, t)$. It is also exactly what a neural ODE is: a parameterised first-order vector field.

Flows and semigroups

The solution map $\Phi^t : \mathbf{x}_0 \mapsto \mathbf{x}(t)$ of an autonomous ODE is a one-parameter flow: $\Phi^{s+t} = \Phi^s \circ \Phi^t$. In a neural ODE, this says that doubling the integration time composes the network with itself — which is why neural ODEs behave a lot like deep residual networks, and why a trained ResNet of depth $L$ can often be read as a Euler discretisation of an ODE with $L$ steps.

Section 13

Linear ODEs and the matrix exponential

The one class of differential equations you can always solve in closed form is the linear one. The answer involves a matrix exponential — and the matrix exponential turns out to be the right object for talking about continuous-time dynamics in general.

A linear, time-invariant ODE system is

$$ \dot{\mathbf{x}}(t) \;=\; A \mathbf{x}(t), \qquad \mathbf{x}(0) = \mathbf{x}_0, $$

where $A \in \mathbb{R}^{n \times n}$. The solution is

$$ \mathbf{x}(t) \;=\; e^{t A}\, \mathbf{x}_0, $$

where the matrix exponential is defined by the same Taylor series as the scalar exponential,

$$ e^{tA} \;=\; \sum_{k=0}^{\infty} \frac{t^k}{k!} A^k. $$

The series converges for any square $A$. Practically, you compute it by diagonalising: if $A = Q \Lambda Q^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$, then $e^{tA} = Q\, \operatorname{diag}(e^{t\lambda_1},\ldots,e^{t\lambda_n})\, Q^{-1}$. When $A$ is not diagonalisable, use a Jordan form or a Padé approximation — in practice you call a library routine such as scipy.linalg.expm.
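Both routes to $e^{tA}$, the truncated Taylor series and the eigendecomposition, fit in a few lines of NumPy. A sketch (not production code; for ill-conditioned $A$ use scipy.linalg.expm) on the rotation generator, where the answer is known in closed form:

```python
import numpy as np

def expm_series(A, terms=30):
    """Matrix exponential via the truncated Taylor series sum_k A^k / k!."""
    E, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        E = E + term
    return E

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])       # rotation generator: A^2 = -I

# Diagonalise (complex eigendecomposition) and exponentiate the eigenvalues.
lam, Q = np.linalg.eig(A)
E_diag = (Q @ np.diag(np.exp(lam)) @ np.linalg.inv(Q)).real

E_ser = expm_series(A)
assert np.allclose(E_ser, E_diag)

# For this A, e^{tA} = cos(t) I + sin(t) A: rotation by t radians (here t = 1).
assert np.allclose(E_ser, [[np.cos(1), np.sin(1)], [-np.sin(1), np.cos(1)]])
```

The two computations agree to machine precision; the series needs care only when $\|tA\|$ is large.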

Eigenvalues set the long-time behaviour

The eigenvalues of $A$ tell you everything about the qualitative behaviour of solutions. If every eigenvalue has strictly negative real part, every trajectory decays to zero (stable equilibrium). If any eigenvalue has positive real part, there exist trajectories that blow up (unstable). Pure-imaginary eigenvalues give bounded oscillations (a frictionless harmonic oscillator). This is exactly the framework that Kalman filters, linear state-space models (S4, Mamba), and control theory all rely on.

Inhomogeneous linear ODEs

Adding a forcing term $\mathbf{u}(t)$ gives

$$ \dot{\mathbf{x}} \;=\; A \mathbf{x} \;+\; B \mathbf{u}(t). $$

The general solution is a variation-of-constants formula,

$$ \mathbf{x}(t) \;=\; e^{tA} \mathbf{x}_0 \;+\; \int_0^t e^{(t-s)A}\, B\, \mathbf{u}(s)\, \mathrm{d}s. $$

The integral is a continuous convolution of the input $\mathbf{u}$ with the impulse response $e^{tA} B$. Read that sentence again: a linear continuous-time system is a convolution with its impulse response. Every convolutional architecture, from WaveNet to S4 to Mamba, is trying to learn a discrete analogue of this kernel.
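The discrete analogue is easiest to see in one dimension. Under the assumption that $\mathbf{u}$ is held constant over each step of length $h$ (the zero-order hold used by S4-style state-space models), the convolution integral evaluates in closed form and gives an exact recurrence; the sketch below checks it against the variation-of-constants formula:

```python
import numpy as np

# Stable scalar system x' = a x + b u, with constant input u = 1.
a, b, h = -2.0, 1.0, 0.1

# Zero-order-hold discretisation (exact when u is constant on each step):
#   x_{k+1} = e^{ha} x_k + ((e^{ha} - 1)/a) b u_k
abar = np.exp(h * a)
bbar = (abar - 1.0) / a * b

x = 0.0
for _ in range(50):              # integrate to t = 5
    x = abar * x + bbar * 1.0

# Variation-of-constants closed form: x(t) = (e^{ta} - 1) b / a.
t = h * 50
assert np.isclose(x, (np.exp(a * t) - 1) * b / a)
```

The pair $(\bar a, \bar b) = (e^{ha},\, (e^{ha}-1)b/a)$ is exactly the discrete convolution kernel the prose describes.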

Section 14

Numerical ODE solvers

Outside the linear case, ODEs are almost never solvable in closed form. Numerical integration discretises time into finite steps and approximates the trajectory — and once you see the three canonical schemes, almost every solver you meet is a variation.

Given $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}, t)$ and a step size $h$, a one-step solver produces $\mathbf{x}_{n+1}$ from $\mathbf{x}_n$ and $t_n$.

Explicit Euler

$$ \mathbf{x}_{n+1} \;=\; \mathbf{x}_n \;+\; h\, \mathbf{f}(\mathbf{x}_n, t_n). $$

Linearise locally and step. Error per step is $O(h^2)$, global error is $O(h)$; the method is first-order accurate. Fine for prototyping, unusable for anything with fast dynamics because it demands vanishingly small $h$ to stay stable.

Implicit Euler

$$ \mathbf{x}_{n+1} \;=\; \mathbf{x}_n \;+\; h\, \mathbf{f}(\mathbf{x}_{n+1}, t_{n+1}). $$

$\mathbf{x}_{n+1}$ appears on both sides — you solve an equation at every step — but the trade is worth it for stiff systems, where fast and slow time scales coexist. Implicit methods remain stable at step sizes that would make explicit methods diverge.
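The trade is visible on a one-line linear problem. For $\dot x = \lambda x$ the explicit update multiplies by $1 + h\lambda$ while the implicit update divides by $1 - h\lambda$; with $h\lambda = -5$, far outside the explicit stability region, only the implicit method survives. A sketch, not a general stiff solver:

```python
import numpy as np

# Stiff linear test problem x' = -50 x, x(0) = 1, with a step size
# well above the explicit stability limit h < 2/50 = 0.04.
lam, h = -50.0, 0.1

x_exp = 1.0   # explicit Euler: x_{n+1} = (1 + h*lam) x_n, factor |1 - 5| = 4
x_imp = 1.0   # implicit Euler: x_{n+1} = x_n / (1 - h*lam), factor 1/6
for _ in range(20):
    x_exp = (1 + h * lam) * x_exp
    x_imp = x_imp / (1 - h * lam)

assert abs(x_exp) > 1e10          # explicit Euler blows up
assert 0 < x_imp < 1e-10          # implicit Euler decays, like the true solution
```

For a nonlinear $\mathbf{f}$ the implicit update requires a Newton solve at every step; for a linear one it is just this division (a linear solve in the vector case).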

Runge–Kutta 4

The workhorse explicit method. In four stages:

$$ \begin{aligned} \mathbf{k}_1 &= \mathbf{f}(\mathbf{x}_n, t_n), \\ \mathbf{k}_2 &= \mathbf{f}(\mathbf{x}_n + \tfrac{h}{2}\mathbf{k}_1,\, t_n + \tfrac{h}{2}), \\ \mathbf{k}_3 &= \mathbf{f}(\mathbf{x}_n + \tfrac{h}{2}\mathbf{k}_2,\, t_n + \tfrac{h}{2}), \\ \mathbf{k}_4 &= \mathbf{f}(\mathbf{x}_n + h \mathbf{k}_3,\, t_n + h), \\ \mathbf{x}_{n+1} &= \mathbf{x}_n + \tfrac{h}{6}\bigl(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\bigr). \end{aligned} $$

Global error is $O(h^4)$ — each halving of $h$ cuts the error by a factor of sixteen. It is the default choice when you do not know better, and it is the fixed-step workhorse inside many neural-ODE implementations.
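The order claims are checkable in a few lines. A sketch comparing explicit Euler and RK4 on $\dot x = -x$ (exact solution $e^{-t}$): halving $h$ should cut Euler's error by about $2$ and RK4's by about $16$.

```python
import numpy as np

def euler_step(f, x, t, h):
    return x + h * f(x, t)

def rk4_step(f, x, t, h):
    k1 = f(x, t)
    k2 = f(x + h/2 * k1, t + h/2)
    k3 = f(x + h/2 * k2, t + h/2)
    k4 = f(x + h * k3, t + h)
    return x + h/6 * (k1 + 2*k2 + 2*k3 + k4)

def solve(step, f, x0, t1, n):
    x, h = x0, t1 / n
    for i in range(n):
        x = step(f, x, i * h, h)
    return x

f = lambda x, t: -x                      # exact solution: e^{-t}
exact = np.exp(-1.0)

err = {(s, n): abs(solve(s, f, 1.0, 1.0, n) - exact)
       for s in (euler_step, rk4_step) for n in (100, 200)}

euler_ratio = err[euler_step, 100] / err[euler_step, 200]   # ~ 2^1
rk4_ratio = err[rk4_step, 100] / err[rk4_step, 200]         # ~ 2^4
assert 1.6 < euler_ratio < 2.5
assert 12.0 < rk4_ratio < 20.0
```

Measuring the error ratio under step halving like this is the standard way to verify a solver's order of accuracy empirically.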

Adaptive step sizes

A real solver runs two methods of different order side by side — an embedded pair of orders 4 and 5, say — takes their difference as an error estimate, and shrinks $h$ when the estimate grows too large. Dormand–Prince is the classic choice; SciPy's solve_ivp defaults to it, and JAX's diffrax ships it as Dopri5. This is why modern neural-ODE training does not ask you to pick a step size: the adaptive solver picks one per example per evaluation.
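The control loop itself is simple enough to sketch. Below is a toy step-doubling adaptive Euler (the helper name `adaptive_euler` and all tuning constants are illustrative): real solvers use an embedded Runge–Kutta pair rather than two Euler evaluations, but the accept/shrink/grow logic has the same shape. The exponent $1/2$ in the gain reflects Euler's $O(h^2)$ local error.

```python
import numpy as np

def adaptive_euler(f, x, t, t_end, h=0.1, tol=1e-4):
    """Toy adaptive integrator: one full Euler step vs two half steps."""
    steps = 0
    while t < t_end:
        h = min(h, t_end - t)
        full = x + h * f(x, t)
        half = x + h/2 * f(x, t)
        two_half = half + h/2 * f(half, t + h/2)
        err = abs(two_half - full)           # local error estimate
        if err < tol:                        # accept the more accurate result
            x, t = two_half, t + h
            steps += 1
        # Grow or shrink h toward the error target (clamped for safety).
        h *= min(2.0, max(0.2, 0.9 * (tol / (err + 1e-15)) ** 0.5))
    return x, steps

f = lambda x, t: -x
x_end, n = adaptive_euler(f, 1.0, 0.0, 5.0)
assert abs(x_end - np.exp(-5.0)) < 1e-2
```

Notice that the step size grows automatically as the solution flattens out, which is exactly the behaviour that makes adaptive solvers economical.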

Stiffness and why it matters

A system is stiff when its Jacobian has eigenvalues spread over many orders of magnitude. Explicit solvers must then take steps limited by the fastest mode, even while tracking only the slowest — a crippling cost. Stiff implicit solvers (BDF, Rosenbrock) are the right tool. Chemical kinetics, many trained neural ODEs, and large-scale dynamical-system simulations are stiff in practice.

Section 15

Stability and phase portraits

Before computing a solution, ask what the solution must look like. The qualitative theory of ODEs — fixed points, stability, phase portraits — is the geometric side of the subject, and it is how physicists and control engineers actually think about dynamics.

A fixed point of $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x})$ is an $\mathbf{x}^*$ where $\mathbf{f}(\mathbf{x}^*) = \mathbf{0}$. Near a fixed point, the dynamics linearise:

$$ \dot{\mathbf{h}} \;\approx\; J_{\mathbf{f}}(\mathbf{x}^*)\, \mathbf{h}, \qquad \mathbf{h} = \mathbf{x} - \mathbf{x}^*. $$

The eigenvalues of the Jacobian at the fixed point classify it. All eigenvalues with negative real part: asymptotically stable; trajectories nearby spiral or slide into $\mathbf{x}^*$. Any eigenvalue with positive real part: unstable. Pure imaginary: the linearisation is a centre, and you need nonlinear information to decide.

Phase portrait of a stable spiral: $\dot x = -0.1 x - y$, $\dot y = x - 0.1 y$. The eigenvalues of the linearisation at the origin are $-0.1 \pm i$, giving damped rotation. Every trajectory spirals in to the fixed point at the origin.
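The claim is easy to verify numerically: integrate the system with small Euler steps and check that the radius decays like $e^{-0.1 t}$ while the point rotates. A sketch in plain NumPy:

```python
import numpy as np

# The stable spiral x' = -0.1x - y, y' = x - 0.1y: eigenvalues -0.1 +/- i,
# so the radius decays like e^{-0.1 t} while the trajectory rotates.
A = np.array([[-0.1, -1.0],
              [ 1.0, -0.1]])

h, steps = 1e-3, 10_000          # integrate to t = 10
z = np.array([1.0, 0.0])
for _ in range(steps):
    z = z + h * A @ z

r = np.linalg.norm(z)
assert np.isclose(r, np.exp(-0.1 * 10.0), rtol=1e-2)   # radius ~ e^{-1}
```

The same run doubles as a Lyapunov check for the next subsection: $V(x, y) = x^2 + y^2$ satisfies $\dot V = -0.2\,V \leq 0$ along these trajectories.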

Lyapunov's direct method

Sometimes the linearisation is not enough. Lyapunov's idea: find a scalar function $V(\mathbf{x})$ that is positive near the fixed point and decreases along trajectories. If $V(\mathbf{x}^*) = 0$, $V(\mathbf{x}) > 0$ for $\mathbf{x} \neq \mathbf{x}^*$, and $\dot V = \nabla V \cdot \mathbf{f} \leq 0$, the fixed point is stable. If the inequality is strict, it is asymptotically stable. For mechanical systems the total energy is almost always a Lyapunov function; for ML systems a loss or potential often is.

Bifurcations

As a parameter varies, the structure of the phase portrait can change qualitatively — fixed points can collide, split, or change stability. The point where the change happens is a bifurcation. The names to know are saddle–node (two fixed points collide and annihilate), pitchfork (symmetry-breaking), and Hopf (a fixed point destabilises and spawns a limit cycle). Training deep networks exhibits analogues of all three in mean-field limits, and the Hopf bifurcation is the canonical route to oscillations in excitable systems.

Section 16

Partial differential equations

When the unknown depends on more than one variable, you get a partial differential equation. Three canonical linear PDEs — heat, wave, Laplace — cover most of physics and nearly all of the intuition you need.

A PDE involves partial derivatives of an unknown function with respect to more than one independent variable. The three canonical second-order linear PDEs, which lie at the foundation of the subject, all involve the Laplacian $\Delta$:

$$ \underbrace{\partial_t u \;=\; \alpha\, \Delta u}_{\text{heat / diffusion}}, \qquad \underbrace{\partial_{tt} u \;=\; c^2\, \Delta u}_{\text{wave}}, \qquad \underbrace{\Delta u \;=\; 0}_{\text{Laplace}}. $$

They are three representatives of three fundamentally different behaviours.

Heat (parabolic)

The heat equation describes diffusion. Initial heat distributions smooth out as time goes on — discontinuities vanish instantly, fine features decay exponentially fast. In probability, it is exactly the evolution of a Gaussian density under Brownian motion, and for diffusion models, it is the evolution that turns a sample into pure noise.

On the real line, the heat equation with initial condition $u(x, 0) = u_0(x)$ has the explicit solution

$$ u(x, t) \;=\; \int_{-\infty}^{\infty} G(x - y, t)\, u_0(y)\, \mathrm{d}y, \qquad G(x, t) = \frac{1}{\sqrt{4\pi \alpha t}} e^{-x^2 / (4\alpha t)}. $$

The kernel $G$ is the Gaussian heat kernel; the solution is the initial condition smoothed by a Gaussian of width $\sqrt{2\alpha t}$. The connection to probability — Gaussian densities are the fundamental solutions of diffusion — is why every generative-diffusion paper pairs the forward SDE with a Gaussian noise schedule.
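A direct check in NumPy: convolve a discontinuous initial condition with the heat kernel and confirm that total heat is conserved while the jump is smoothed away. The grid spacing and domain size below are arbitrary illustrative choices.

```python
import numpy as np

alpha, t = 1.0, 0.5
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

u0 = (np.abs(x) < 1.0).astype(float)               # a step: heat on [-1, 1]
G = np.exp(-x**2 / (4*alpha*t)) / np.sqrt(4*np.pi*alpha*t)   # heat kernel

u = np.convolve(u0, G, mode="same") * dx           # u(., t) = G * u0

# Diffusion conserves total heat and destroys the discontinuity.
assert np.isclose(u.sum() * dx, u0.sum() * dx, rtol=1e-3)
assert np.abs(np.diff(u)).max() < 0.05 * np.abs(np.diff(u0)).max()
```

The second assertion is the smoothing property in miniature: the unit jump in $u_0$ becomes a gentle slope after time $t$.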

Wave (hyperbolic)

The wave equation preserves detail: solutions oscillate forever and carry information at a finite speed $c$. Its general solution in one dimension is a sum of right- and left-moving waves $f(x - ct)$ and $g(x + ct)$ — d'Alembert's formula. This hyperbolic character is what makes acoustic and electromagnetic signals possible, and why training neural-network surrogates for wave-like physics is genuinely harder than for diffusive physics.

Laplace / Poisson (elliptic)

Solutions of $\Delta u = 0$ (harmonic functions) are extraordinarily smooth — determined everywhere inside a domain by their boundary values. The Poisson equation $\Delta u = f$ is the workhorse of electrostatics, gravitation, and image inpainting. In ML, it pops up in graph Laplacian regularisers and in solving for potentials in physics-informed networks.

Key idea

The three canonical PDEs correspond to three styles of behaviour: parabolic (smooths and spreads), hyperbolic (preserves and propagates), elliptic (is equilibrium). Nearly every second-order linear PDE you will meet is one of these three with variable coefficients, and the qualitative behaviour is inherited from the classification. The technical term is characteristics: the curves along which information propagates. Hyperbolic equations have real characteristics; elliptic equations have none, which is why their solutions respond globally to boundary data.

Section 17

Numerical PDEs and finite differences

Solving a PDE on a computer means discretising space and time into grids. Three families dominate: finite differences, finite elements, and spectral methods. Physics-informed neural networks are an emerging fourth, which replaces the grid with a network.

The finite-difference approach replaces derivatives by grid differences. For the 1D heat equation $\partial_t u = \alpha\, \partial_{xx} u$, pick step sizes $\Delta x$ and $\Delta t$, denote $u_i^n \approx u(i \Delta x, n \Delta t)$, and approximate

$$ \frac{u_i^{n+1} - u_i^n}{\Delta t} \;=\; \alpha\, \frac{u_{i+1}^n - 2 u_i^n + u_{i-1}^n}{(\Delta x)^2}. $$

Solve for $u_i^{n+1}$ and you have an explicit scheme. It is $O(\Delta t) + O(\Delta x^2)$ accurate, and stable only when $\alpha \Delta t / (\Delta x)^2 \leq 1/2$ — a CFL-type stability condition. Every time a practitioner halves $\Delta x$, the stability bound forces $\Delta t$ to shrink by a factor of four. This is why implicit schemes (Crank–Nicolson, ADI) are the default for serious work.
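The stability bound is easy to watch in action. The toy sketch below (the helper `heat_step` is illustrative) steps a spike of heat with $r = \alpha\,\Delta t/(\Delta x)^2$ on either side of the threshold $1/2$, with fixed zero boundary values:

```python
import numpy as np

def heat_step(u, r):
    """One explicit finite-difference step of u_t = alpha*u_xx; r = alpha*dt/dx^2."""
    un = u.copy()
    un[1:-1] = u[1:-1] + r * (u[2:] - 2*u[1:-1] + u[:-2])
    return un                            # endpoints stay fixed (Dirichlet)

u0 = np.zeros(101)
u0[50] = 1.0                             # a spike of heat in the middle

u_stable = u0.copy()
u_unstable = u0.copy()
for _ in range(200):
    u_stable = heat_step(u_stable, 0.4)      # r <= 1/2: the spike diffuses
    u_unstable = heat_step(u_unstable, 0.6)  # r > 1/2: oscillates and blows up

assert u_stable.max() < 0.1 * u0.max()
assert np.abs(u_unstable).max() > 1e3
```

At $r = 0.6$ the highest spatial mode is amplified by $|1 - 4r| = 1.4$ per step, so the grid-scale oscillation grows exponentially; at $r = 0.4$ every mode is damped.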

Finite elements and spectral methods

Finite elements (FEM) represent $u$ as a combination of basis functions — hat functions in 1D, triangles in 2D, tetrahedra in 3D — and minimise a residual in a variational form. FEM handles complex geometries that finite differences cannot; it is the dominant tool in engineering simulation. Spectral methods expand $u$ in a Fourier or Chebyshev basis; they are exponentially accurate on simple geometries but brittle under non-smooth data. Operational weather models combine pieces of all three families.

Physics-informed neural networks

A PINN represents the unknown solution as a neural network $u_\theta(\mathbf{x}, t)$ and trains the network by minimising the residual of the PDE at sample collocation points, plus a penalty on boundary and initial conditions:

$$ \mathcal{L}(\theta) \;=\; \underbrace{\mathbb{E}\left[(\partial_t u_\theta - \alpha \Delta u_\theta)^2\right]}_{\text{PDE residual}} \;+\; \lambda_{\partial}\, \mathbb{E}\left[(u_\theta - u_{\text{bc}})^2\right] \;+\; \lambda_0\, \mathbb{E}\left[(u_\theta - u_0)^2\right]. $$

The partial derivatives inside the first term are computed by automatic differentiation — exactly the same mechanism that computes gradients for training. PINNs replace the computational mesh of classical solvers with a network-parameterised function, and the stability trade-off with an optimisation problem. When they work they are beautiful; when they fail they do so mysteriously, and most of the current research in the area is trying to make them work more often.
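The residual term is simple enough to write out by hand. The sketch below uses a one-hidden-layer tanh network for $u_\theta$ with the derivatives $\partial_t u_\theta$ and $\partial_{xx} u_\theta$ derived analytically in NumPy, standing in for the autodiff a real PINN in PyTorch or JAX would use; the weights are random and untrained, and the helper name `network_derivs` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 16
a, w, v, b = rng.normal(size=(4, m))     # untrained weights, for illustration

def network_derivs(x, t):
    """u(x,t) = a . tanh(w x + v t + b), with hand-derived derivatives."""
    z = np.tanh(np.outer(x, w) + np.outer(t, v) + b)    # shape (n, m)
    s = 1.0 - z**2                                      # sech^2 = tanh'
    u    = z @ a
    u_t  = (s * v) @ a                                  # d/dt of each tanh
    u_xx = (-2.0 * z * s * w**2) @ a                    # d^2/dx^2 of each tanh
    return u, u_t, u_xx

alpha = 1.0
x = rng.uniform(-1, 1, 256)
t = rng.uniform(0, 1, 256)                # random collocation points

u, u_t, u_xx = network_derivs(x, t)
residual_loss = np.mean((u_t - alpha * u_xx) ** 2)      # the PDE-residual term

# Sanity-check the hand derivative against a central finite difference.
eps = 1e-5
fd_u_t = (network_derivs(x, t + eps)[0] - network_derivs(x, t - eps)[0]) / (2 * eps)
assert np.allclose(fd_u_t, u_t, atol=1e-6)
```

Training would add the boundary and initial penalty terms from the loss above and differentiate the whole thing with respect to the weights — which is exactly where autodiff replaces this hand derivation.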

Neural operators

A step beyond PINNs: rather than learning one solution to one PDE, learn the operator that maps initial/boundary data to solution. Fourier Neural Operators (FNO) and DeepONet are the two dominant families. They are the modern way to build surrogate solvers for families of PDEs — weather, subsurface flow, turbulence — at orders-of-magnitude lower cost than conventional numerics.

Section 18

Where it shows up in ML

Every piece of this chapter has a direct analogue in modern machine learning. Here is the map from the mathematics above to the frameworks, papers, and techniques you will meet again and again.

Read carefully enough and every modern ML paper is a piece of applied calculus. The abstractions in the next chapters — optimisation, probability, statistics — are built directly on top of what we have developed here. If a later paper mentions a Jacobian, a Hessian, an ODE solver, or a PDE residual and the reference feels abstract, come back to this chapter and the connection will usually make itself plain within a page or two.

Further reading

Where to go next

Calculus is a subject with several good entry points. The books below span levels — from a first rigorous encounter to the research frontier of neural differential equations. Pair a textbook with a lecture series and a problem set; the combination works better than any single source.

Calculus textbooks

Differential equations textbooks

Free video courses

References, cheat sheets, and papers

This page is the second chapter of Part I: Mathematical Foundations. Up next: optimisation theory — where the gradients of this chapter meet the problems they were invented to solve — followed by probability, statistics, and the information-theoretic tools that power modern generative models.