Part I · Mathematical Foundations · Chapter 02

Calculus, the language of change.

If linear algebra tells neural networks what to compute, calculus tells them how to get better. Every gradient, every backward pass, every diffusion step, every physics-informed constraint is a statement about derivatives. This chapter rebuilds the calculus you remember — limits, derivatives, integrals — then pushes into the parts ML actually uses: the gradient and Hessian, the chain rule in matrix form, automatic differentiation, and the differential equations that animate neural ODEs, diffusion models, and physics-informed networks.

How to read this chapter

Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the diagrams exist to make the prose less abstract; the equations exist to make it precise. If an identity looks scary, look at the picture next to it first — almost all of multivariable calculus is a statement about either tangent planes or tiny volumes.

Notation: scalars are italic ($x$, $t$, $\lambda$); vectors are bold lowercase ($\mathbf{x}$, $\mathbf{v}$); matrices are capitals ($A$, $H$). A function of several variables is written $f(\mathbf{x})$ where $\mathbf{x} \in \mathbb{R}^n$. Partial derivatives are $\partial f / \partial x_i$ or $\partial_i f$. The gradient is $\nabla f$, the Hessian $\nabla^2 f$, and the Jacobian of a vector-valued map is $J$. A function of time is $x(t)$; its time derivative is $\dot x$. We use $\mathrm{d}$ for differentials and $\mathrm{D}$ for the total derivative when the distinction matters.

Contents

  1. What calculus is for, in ML · Motivation
  2. Derivatives in one variable · The core idea
  3. Taylor series and local approximation · Linearize anything
  4. Partial derivatives and the gradient · Going multi-d
  5. The Jacobian and matrix calculus · Derivatives of vectors
  6. The chain rule and backprop · How ML actually trains
  7. The Hessian and second-order geometry · Curvature
  8. Vector calculus: div, curl, Laplacian · Fields and flows
  9. Integration and change of variables · Areas, volumes, densities
  10. Critical points: minima, maxima, saddles · Loss landscapes
  11. Automatic differentiation · How frameworks compute gradients
  12. Ordinary differential equations · The zoo
  13. Linear ODEs and the matrix exponential · The one case you can solve
  14. Numerical ODE solvers · Euler to Runge–Kutta
  15. Stability and phase portraits · Long-time behaviour
  16. Partial differential equations · Heat, wave, Laplace
  17. Numerical PDEs and finite differences · Discretization
  18. Where it shows up in ML · Payoff

Section 01

What calculus is for, in machine learning

Linear algebra describes what a neural network computes. Calculus describes how it learns. Every training step is a statement about derivatives; every generative model built since 2020 is a statement about differential equations.

The single most important equation in modern ML is the one at the bottom of the backward pass:

$$ \theta_{t+1} \;=\; \theta_t \;-\; \eta\, \nabla_{\theta}\, \mathcal{L}(\theta_t). $$

"Adjust the parameters a little in the direction that decreases the loss fastest." The whole apparatus of deep learning — automatic differentiation frameworks, GPUs, the distinction between forward and backward passes — exists to evaluate that gradient efficiently at scale. Calculus is the theory behind the arrow.

But gradients are only the entry point. Once you start reading modern ML papers you meet differential equations everywhere. Neural ODEs treat a deep network as the solution to an ordinary differential equation, with the "depth" replaced by a continuous time variable. Diffusion models — the engine behind image and video generation — are a stochastic differential equation run forwards to add noise, and another run backwards to remove it. Physics-informed neural networks train by minimising the residual of a PDE. Even the humble softmax is the gradient of the log-sum-exp function.

Key idea

Calculus in ML splits into two jobs. Differential calculus tells us how to compute the change in a loss with respect to millions of parameters — that is training. Differential equations tell us how a system evolves over time or space — that is generative modelling, continuous-time networks, and physics-informed learning. This chapter teaches both.

The plan. Sections 02–03 are a fast refresher on single-variable calculus — derivatives, Taylor series — calibrated at the level a working ML engineer actually uses. Sections 04–11 are the multivariable core: gradients, Jacobians, the chain rule, Hessians, vector calculus, integration, critical points, and a careful look at automatic differentiation. Sections 12–17 are differential equations: ODEs, their analytical and numerical solution, stability, and the canonical partial differential equations. Section 18 threads the ML connections explicitly.

Section 02

Derivatives in one variable

The derivative is the cleanest idea in mathematics: the slope of a curve at a point, made precise by a limit. Everything that follows is an elaboration.

Given a function $f : \mathbb{R} \to \mathbb{R}$, the derivative of $f$ at $x$ is defined as

$$ f'(x) \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}, $$

provided the limit exists. Geometrically, $(f(x+h) - f(x))/h$ is the slope of the secant line through $(x, f(x))$ and $(x+h, f(x+h))$; as $h$ shrinks to zero the secant becomes the tangent, and its slope becomes $f'(x)$. The derivative answers a physical question — "how fast is $f$ changing right here?" — and a geometric one — "what is the slope of the curve here?" — with the same number.
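The limit definition is directly checkable: fix $x$, shrink $h$, and watch the secant slope settle on the tangent slope. A quick numerical sketch (the function and point are chosen arbitrarily):

```python
import math

def secant_slope(f, x, h):
    # Slope of the secant line through (x, f(x)) and (x+h, f(x+h)).
    return (f(x + h) - f(x)) / h

# Secant slopes of sin at x = 1 approach the true derivative cos(1).
for h in (1e-1, 1e-3, 1e-5):
    print(h, secant_slope(math.sin, 1.0, h))

assert abs(secant_slope(math.sin, 1.0, 1e-5) - math.cos(1.0)) < 1e-4
```

The error of the secant slope shrinks linearly in $h$, which is the first Taylor remainder in disguise.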

Rules you will use constantly

Three rules generate almost every derivative you need:

$$ (fg)' = f'g + fg', \qquad \left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}, \qquad (f \circ g)'(x) = f'(g(x))\, g'(x). $$

The last is the chain rule. It is the most important rule in this chapter — the rest of calculus is mostly about generalising it to multiple variables, vectors, and, eventually, to the computation graphs that define a neural network.

Higher derivatives and smoothness

Differentiating again gives $f''$, then $f'''$, and so on. A function whose first $k$ derivatives exist and are continuous is said to be of class $C^k$; if all derivatives of every order exist, $C^\infty$. Activation functions fall somewhere on this scale — ReLU is $C^0$, GELU and Swish are $C^\infty$ — and the difference matters: second-order optimisers, Newton's method, and Hessian-based analysis all need $f \in C^2$ at least.

A caution

The limit in the definition may not exist. $f(x) = |x|$ has no derivative at $0$ — its graph has a corner. ReLU has the same problem at the origin, which is why deep-learning frameworks use sub-gradients there: any value in $[0,1]$ is legal, and by convention most libraries pick $0$. The distinction between "differentiable" and "sub-differentiable" is one of the rare places where pure-mathematics pickiness leaks into engineering.
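The corner is visible numerically: the two one-sided difference quotients of ReLU at the origin converge to different values. A minimal check:

```python
def relu(x):
    return max(x, 0.0)

h = 1e-6
right = (relu(0.0 + h) - relu(0.0)) / h     # slope approaching from the right: 1
left = (relu(0.0 - h) - relu(0.0)) / (-h)   # slope approaching from the left: 0

# The one-sided limits disagree, so relu has no derivative at 0;
# any value in [0, 1] is a valid subgradient there.
assert abs(right - 1.0) < 1e-9 and abs(left) < 1e-9
```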

The Fundamental Theorem of Calculus

Integration and differentiation are inverse operations. If $F$ is an antiderivative of $f$ — that is, $F'(x) = f(x)$ — then

$$ \int_a^b f(x)\, \mathrm{d}x \;=\; F(b) - F(a). $$

This is the Fundamental Theorem of Calculus. It links the two operations that Newton and Leibniz discovered independently, and it is what makes calculus useful: to compute an area (an integral), you find an antiderivative (a differentiation problem in reverse). Every closed-form expectation, every normalising constant, every kernel-density integral you evaluate is a use of this theorem.

Section 03

Taylor series and local approximation

A function that is complicated globally is often simple locally. Taylor's theorem makes that intuition precise: near any point, a smooth function looks like a polynomial of as high a degree as you care to compute.

The Taylor expansion of $f$ about a point $a$ is

$$ f(x) \;=\; f(a) \;+\; f'(a)\,(x-a) \;+\; \frac{f''(a)}{2!}\,(x-a)^2 \;+\; \frac{f'''(a)}{3!}\,(x-a)^3 \;+\; \cdots $$

Truncating at the first-order term gives the tangent-line approximation $f(x) \approx f(a) + f'(a)(x-a)$ — the linearisation that underlies gradient descent. Truncating at the second-order term gives the quadratic approximation, which underlies Newton's method. Modern optimisers all sit on this ladder: first-order methods use only the gradient, second-order methods use the Hessian.

Taylor with remainder

When you truncate, the missing tail has a name. Lagrange's form of the remainder says that for some $\xi$ between $a$ and $x$,

$$ f(x) \;=\; \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k \;+\; \underbrace{\frac{f^{(n+1)}(\xi)}{(n+1)!}(x-a)^{n+1}}_{\text{remainder}}. $$

The remainder tells you exactly how much precision you give up. If $|f^{(n+1)}|$ is bounded, increasing the expansion order by one shrinks the error by another factor of $(x-a)$. This is why Taylor-based analyses of optimisation algorithms always talk about order of convergence: an order-$k$ method has error shrinking like the $k$th power of the step.
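The order-of-convergence claim is easy to verify empirically: halve the step and the order-$n$ truncation error of $e^x$ about $0$ should fall by roughly $2^{n+1}$. A sketch (the evaluation point $0.1$ is illustrative):

```python
import math

def taylor_exp(x, n):
    # Order-n Taylor polynomial of e^x about 0.
    return sum(x**k / math.factorial(k) for k in range(n + 1))

# Halving the step cuts an order-n truncation error by about 2^(n+1),
# because the remainder scales like x^(n+1).
for n in (1, 2, 3):
    e1 = abs(math.exp(0.1) - taylor_exp(0.1, n))
    e2 = abs(math.exp(0.05) - taylor_exp(0.05, n))
    assert abs(e1 / e2 - 2 ** (n + 1)) < 0.5
```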

Canonical series worth memorising

$$ e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}, \qquad \ln(1+x) = \sum_{k=1}^{\infty} (-1)^{k+1} \frac{x^k}{k}, \qquad \frac{1}{1-x} = \sum_{k=0}^{\infty} x^k. $$

These three show up everywhere — $e^x$ in softmaxes and Gaussians, $\ln(1+x)$ in cross-entropy losses (where the $-\ln(1 - p)$ near $p=1$ makes the loss explode), and the geometric series in discounted returns from reinforcement learning. Recognising which series you are secretly manipulating often simplifies a long algebraic argument to a paragraph.

Key idea

Taylor's theorem is the bridge from calculus to optimisation. First-order methods (gradient descent, momentum, Adam) approximate the loss by a tangent plane. Second-order methods (Newton, L-BFGS) approximate it by a paraboloid. The cost-accuracy trade-off between them is nothing more than the trade-off between the first and second Taylor terms.

Section 04

Partial derivatives and the gradient

A function $f : \mathbb{R}^n \to \mathbb{R}$ has not one derivative but $n$ — one for each axis. Bundle them into a vector and you get the gradient, the single most important object in applied multivariable calculus.

Fix all inputs but one. The partial derivative of $f$ with respect to $x_i$ is the ordinary one-variable derivative you get when every other input is held constant:

$$ \frac{\partial f}{\partial x_i}(\mathbf{x}) \;=\; \lim_{h \to 0} \frac{f(\mathbf{x} + h\, \mathbf{e}_i) - f(\mathbf{x})}{h}, $$

where $\mathbf{e}_i$ is the $i$th standard basis vector. The partial measures how $f$ responds to a tiny push along axis $i$ only.

The gradient

Collect all the partials into a column vector:

$$ \nabla f(\mathbf{x}) \;=\; \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \\ \vdots \\ \partial f / \partial x_n \end{bmatrix}. $$

The gradient $\nabla f$ is a vector field on $\mathbb{R}^n$ — a vector attached to every point. Two geometric facts justify the machinery that follows. First, the gradient points in the direction of steepest ascent of $f$; its negative points the way downhill. Second, the gradient is always orthogonal to the level set $\{ \mathbf{x} : f(\mathbf{x}) = c \}$ — level curves in two dimensions, level surfaces in three, level hypersurfaces in general.

Level curves of $f(x,y) = \tfrac{1}{2}x^2 + y^2$ (concentric ellipses) together with the gradient $\nabla f$ at several points. Each gradient vector is perpendicular to the level curve through its tail and points in the direction where $f$ grows fastest.

The directional derivative

The rate of change of $f$ along an arbitrary unit vector $\mathbf{u}$ is captured by the directional derivative

$$ D_{\mathbf{u}} f(\mathbf{x}) \;=\; \lim_{h \to 0} \frac{f(\mathbf{x} + h \mathbf{u}) - f(\mathbf{x})}{h} \;=\; \nabla f(\mathbf{x}) \cdot \mathbf{u}. $$

The dot product is maximised when $\mathbf{u}$ aligns with $\nabla f$ — which is exactly the claim that $\nabla f$ points uphill. It is minimised in the opposite direction, where $D_{-\nabla f / \|\nabla f\|} f = -\|\nabla f\|$. That last number is the fastest rate at which you can decrease $f$ from $\mathbf{x}$, and it is what gradient descent cashes in on at every step.
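Both facts, the dot-product formula and the steepest-ascent bound, can be sanity-checked with finite differences. A sketch using the ellipse function from the level-curve figure (the point and direction are arbitrary):

```python
import math

# f(x, y) = x^2/2 + y^2, with gradient (x, 2y).
def f(x, y):
    return 0.5 * x * x + y * y

x0, y0 = 1.0, 2.0
grad = (x0, 2 * y0)                 # analytic gradient: (1, 4)

u = (3 / 5, 4 / 5)                  # a unit vector
h = 1e-6
numeric = (f(x0 + h * u[0], y0 + h * u[1]) - f(x0, y0)) / h
analytic = grad[0] * u[0] + grad[1] * u[1]   # directional derivative = grad . u

assert abs(numeric - analytic) < 1e-4
# No direction can beat the gradient norm: D_u f <= ||grad f||.
assert analytic <= math.hypot(*grad) + 1e-12
```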

Section 05

The Jacobian and matrix calculus

When a function maps vectors to vectors rather than vectors to scalars, its derivative is not a vector but a matrix. Learning to read this matrix — the Jacobian — is what "matrix calculus" actually means.

Consider a map $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, written $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x}))$. Each output has $n$ partial derivatives; collect them all and you get an $m \times n$ matrix,

$$ J(\mathbf{x}) \;=\; \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \;=\; \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}. $$

This is the Jacobian matrix of $\mathbf{f}$ at $\mathbf{x}$. Its rows are the gradients of the individual output coordinates; its columns are how the whole output vector responds to a push along each input axis. Geometrically, $J$ is the best linear approximation to $\mathbf{f}$ near $\mathbf{x}$:

$$ \mathbf{f}(\mathbf{x} + \mathbf{h}) \;\approx\; \mathbf{f}(\mathbf{x}) \;+\; J(\mathbf{x}) \mathbf{h}. $$

Two special cases

When $m = 1$, the Jacobian is a single row — the gradient transposed, $J = (\nabla f)^\top$. When $n = 1$, it is a single column — the velocity vector $\mathbf{f}'(t)$ of a parameterised curve. The gradient and the velocity are the same idea on opposite sides of the input/output asymmetry.

Matrix-calculus identities you actually need

The rules below cover maybe 90% of the matrix-calculus you hit reading ML papers. Treat them as reflexes.

$$ \frac{\partial}{\partial \mathbf{x}} \left( \mathbf{a}^\top \mathbf{x} \right) = \mathbf{a}, \qquad \frac{\partial}{\partial \mathbf{x}} \left( \mathbf{x}^\top A \mathbf{x} \right) = (A + A^\top)\, \mathbf{x}, $$ $$ \frac{\partial}{\partial \mathbf{x}} \left\| \mathbf{x} \right\|_2^2 = 2\mathbf{x}, \qquad \frac{\partial}{\partial A} \operatorname{tr}(A^\top B) = B, \qquad \frac{\partial}{\partial A} \operatorname{tr}(A B A^\top) = A(B + B^\top). $$

Two conventions coexist in the literature. In denominator layout (common in applied statistics and this compendium), the derivative of a scalar with respect to a column vector is a column vector. In numerator layout (common in matrix-calculus cheat sheets), it is a row. If you find yourself confused by a sign or a transpose, check which convention the paper is using. The Matrix Cookbook (section 18) tabulates both.
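A habit worth building: whenever an identity or layout convention is in doubt, check it against finite differences. A sketch for the quadratic-form rule in denominator layout:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

def f(x):
    # The quadratic form x' A x.
    return x @ A @ x

# Central finite differences approximate the gradient component by component.
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
analytic = (A + A.T) @ x    # the identity d(x'Ax)/dx = (A + A')x

assert np.allclose(numeric, analytic, atol=1e-5)
```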

Key idea

The Jacobian is the linear map that sits at the heart of the chain rule. Every backward pass in every neural network is a product of Jacobians — usually with the caveat that the framework never materialises them explicitly, multiplying by them through vector–Jacobian products instead.

Section 06

The chain rule, and what backprop really is

The one-variable chain rule generalises to vectors as matrix multiplication: the Jacobian of a composition is the product of the Jacobians. Training a neural network is this identity, applied to a long chain of matrices, billions of times.

If $\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^p$ and $\mathbf{f} : \mathbb{R}^p \to \mathbb{R}^m$ are differentiable, then the composition $\mathbf{f} \circ \mathbf{g} : \mathbb{R}^n \to \mathbb{R}^m$ has Jacobian

$$ J_{\mathbf{f} \circ \mathbf{g}}(\mathbf{x}) \;=\; J_{\mathbf{f}}\bigl(\mathbf{g}(\mathbf{x})\bigr) \,\cdot\, J_{\mathbf{g}}(\mathbf{x}). $$

That's it. All of backprop is that line applied to a chain of $L$ functions (layers). The derivative of a scalar loss $\mathcal{L}$ with respect to the input of layer $\ell$ is

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{h}_\ell} \;=\; \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \, J_L \, J_{L-1} \, \cdots \, J_{\ell+1}. $$

Why the reverse pass is efficient

The two ends of the computation graph look asymmetric. The forward pass starts with an input of size $n$ and ends with a scalar loss. Naively, a derivative of a scalar with respect to $n$ inputs would cost $n$ evaluations — ruinous when $n$ is a billion parameters. The trick, reverse-mode automatic differentiation, is to multiply the Jacobians from the left (the scalar side) rather than the right.

At every step we maintain a small vector $\mathbf{v}^\top = \partial \mathcal{L} / \partial \mathbf{h}_\ell$ (one number per neuron), not the full Jacobian $J_\ell$. To propagate one layer back, the framework computes $\mathbf{v}^\top J_\ell$ — a vector–Jacobian product — without ever forming $J_\ell$. The result is that a gradient of a scalar with respect to $n$ parameters costs roughly the same as one forward pass, regardless of $n$. That single asymmetry is why deep learning is computationally possible.
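The cost asymmetry is easy to reproduce with plain matrices standing in for the layer Jacobians: sweeping a row vector through the chain keeps only a vector at every step, while the naive route materialises a full matrix product. A sketch (the sizes and random matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
Js = [rng.standard_normal((n, n)) for _ in range(5)]   # stand-ins for layer Jacobians
v = rng.standard_normal(n)                             # dL/dh at the loss end

# Reverse mode: only ever hold a vector; each step is one vector-matrix product.
acc = v
for J in Js:
    acc = acc @ J

# Naive: form the full Jacobian chain first (matrix-matrix products, ~n times costlier).
full = np.eye(n)
for J in Js:
    full = full @ J

assert np.allclose(acc, v @ full)
```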

Forward vs reverse mode

Forward-mode AD does the opposite: it carries a Jacobian–vector product forward, which is efficient when the input is low-dimensional and the output is large. It is the right choice for sensitivity analysis of ODE solvers with a few parameters and many observations. Reverse-mode is the right choice when training a model — one scalar loss, many parameters.

A caution

The chain rule requires that the intermediate values $\mathbf{h}_\ell$ (the activations) be available on the backward pass. Frameworks either cache them (memory cost $O(L)$) or recompute them on the fly (compute cost $2\times$ the forward pass). This trade-off is the whole subject of gradient checkpointing, which is why training a 70B-parameter model on a single GPU is even conceivable.

Section 07

The Hessian and second-order geometry

The gradient tells you the slope of a loss landscape. The Hessian tells you its curvature. For a function of $n$ variables, the Hessian is an $n \times n$ symmetric matrix — and all of second-order optimisation, stability analysis, and the geometry of saddle points lives inside its eigenvalues.

For a twice-differentiable $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian is

$$ \nabla^2 f(\mathbf{x}) \;=\; H(\mathbf{x}) \;=\; \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}. $$

If $f \in C^2$, mixed partials commute — this is Clairaut's theorem — so $H$ is symmetric. The second-order Taylor expansion around $\mathbf{x}$ is then

$$ f(\mathbf{x} + \mathbf{h}) \;\approx\; f(\mathbf{x}) \;+\; \nabla f(\mathbf{x})^\top \mathbf{h} \;+\; \tfrac{1}{2}\, \mathbf{h}^\top H(\mathbf{x})\, \mathbf{h}. $$

The quadratic form $\mathbf{h}^\top H \mathbf{h}$ is the piece that determines local shape. Diagonalise $H = Q \Lambda Q^\top$ (spectral theorem — see the linear algebra chapter) and the eigenvalues $\lambda_1, \ldots, \lambda_n$ tell you everything: all positive and the surface curves up in every direction (a bowl); all negative and it curves down (a dome); mixed signs give a saddle; and the spread $\lambda_{\max}/\lambda_{\min}$, the condition number, governs how slowly gradient descent crawls along the flat directions.

Newton's method

Minimising the quadratic approximation gives Newton's step

$$ \mathbf{x}_{t+1} \;=\; \mathbf{x}_t \;-\; H(\mathbf{x}_t)^{-1} \nabla f(\mathbf{x}_t). $$

If $f$ is quadratic, this lands on the exact minimum in one step. Far from the minimum, or when $H$ is indefinite, Newton's method can diverge wildly — which is why deep-learning practice instead uses quasi-Newton variants (L-BFGS), diagonal approximations (Adam, RMSProp), or Gauss–Newton (most nonlinear least-squares solvers, all modern reinforcement-learning natural-gradient methods).
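The one-step claim is worth seeing concretely. A sketch on a strictly convex quadratic $f(\mathbf{x}) = \tfrac12 \mathbf{x}^\top H \mathbf{x} - \mathbf{b}^\top \mathbf{x}$, whose minimiser is $H^{-1}\mathbf{b}$ (the numbers are arbitrary):

```python
import numpy as np

# A positive-definite Hessian and a linear term.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([10.0, -7.0])               # arbitrary starting point
grad = H @ x - b                          # gradient of the quadratic
x_next = x - np.linalg.solve(H, grad)     # Newton step: x - H^{-1} grad

# One step lands exactly on the minimiser H^{-1} b.
assert np.allclose(x_next, np.linalg.solve(H, b))
```

Note the `solve` rather than an explicit inverse: in practice Newton steps are computed by solving the linear system, never by forming $H^{-1}$.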

Key idea

In high dimensions, almost every critical point of a deep network loss is a saddle, not a minimum — the probability that all $n$ eigenvalues share a sign shrinks as $2^{-n}$. SGD's effectiveness is partly that noise tends to escape saddles in directions of negative curvature, which is what the Hessian's negative eigenvalues pick out.

Section 08

Vector calculus: divergence, curl, Laplacian

Once the objects have more than one scalar component, calculus grows three new operators. They are geometric statements about vector fields — how much flux leaves a point, how much it swirls, how much it deviates from its neighbourhood average.

A vector field is a map $\mathbf{F} : \mathbb{R}^n \to \mathbb{R}^n$ — a vector attached to every point of space. The wind above a weather map, the gradient of a loss, the velocity of particles in a diffusion — all are vector fields. Three operators built out of partial derivatives describe their local behaviour.

Divergence

The divergence of a vector field $\mathbf{F} = (F_1, \ldots, F_n)$ is the scalar

$$ \nabla \cdot \mathbf{F} \;=\; \sum_{i=1}^n \frac{\partial F_i}{\partial x_i}. $$

It measures the net outflow rate per unit volume — "how much is this point a source?" Positive divergence at $\mathbf{x}$ means fluid is expanding outward there; negative means it is converging. In ML it shows up directly in the continuity equation of probability flow, $\partial_t \rho + \nabla \cdot (\rho \mathbf{v}) = 0$, which is what diffusion-model sampling numerically integrates.

Curl

In three dimensions, the curl is a vector,

$$ \nabla \times \mathbf{F} \;=\; \left( \partial_y F_z - \partial_z F_y,\; \partial_z F_x - \partial_x F_z,\; \partial_x F_y - \partial_y F_x \right). $$

It measures how much the field rotates locally — the axis-and-magnitude of the infinitesimal rotation you would feel dropped into the flow. A field is irrotational if its curl is zero everywhere; irrotational smooth fields are locally the gradient of some scalar, a fact called the Poincaré lemma. Gradients, in particular, always have zero curl — which is why "is this vector field a gradient?" reduces to checking that $\nabla \times \mathbf{F} = \mathbf{0}$.
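The "gradients have zero curl" fact can be checked numerically: write down the gradient field of any smooth scalar by hand and verify that its finite-difference curl vanishes. A sketch (the particular $f$ is arbitrary):

```python
import math

# f(x, y, z) = x*y + sin(z); the field F below is its gradient, so curl F = 0.
h = 1e-5

def partial(g, i, p):
    # Central difference of g in coordinate i at point p.
    q = list(p); q[i] += h
    r = list(p); r[i] -= h
    return (g(*q) - g(*r)) / (2 * h)

F = [lambda x, y, z: y,               # df/dx
     lambda x, y, z: x,               # df/dy
     lambda x, y, z: math.cos(z)]     # df/dz

p = (0.3, -1.1, 0.8)
curl = (partial(F[2], 1, p) - partial(F[1], 2, p),
        partial(F[0], 2, p) - partial(F[2], 0, p),
        partial(F[1], 0, p) - partial(F[0], 1, p))
assert all(abs(c) < 1e-8 for c in curl)
```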

The Laplacian

The Laplacian of a scalar field $f$ is the divergence of its gradient:

$$ \Delta f \;=\; \nabla \cdot \nabla f \;=\; \sum_{i=1}^n \frac{\partial^2 f}{\partial x_i^2}. $$

It measures the difference between $f$ at a point and the average of $f$ over a small sphere around that point. Positive Laplacian means the point is a local minimum-ish dip; negative means a bump. The Laplacian is the operator that appears in nearly every PDE in physics: the heat equation $\partial_t u = \Delta u$, the wave equation $\partial_{tt} u = c^2 \Delta u$, Laplace's equation $\Delta u = 0$, the Schrödinger equation. It is also the trace of the Hessian, which is why second-order statistics of loss landscapes look like diffusion.
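The averaging interpretation is exactly what the standard five-point finite-difference stencil computes. A sketch for a quadratic, where the stencil is exact up to rounding:

```python
def laplacian_5pt(f, x, y, h=1e-3):
    # 4/h^2 times (average of f over the 4 axis neighbours minus f at the centre).
    avg = (f(x + h, y) + f(x - h, y) + f(x, y + h) + f(x, y - h)) / 4
    return 4 * (avg - f(x, y)) / h**2

f = lambda x, y: x**2 + 3 * y**2      # Laplacian is 2 + 6 = 8 everywhere
assert abs(laplacian_5pt(f, 0.7, -1.2) - 8.0) < 1e-4
```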

Two theorems to know by name

Green / Stokes' theorem equates an integral of curl over a surface to a line integral around its boundary. Gauss's divergence theorem equates the integral of divergence over a volume to the flux through its boundary. Both are "the integral of a derivative equals a boundary term" — the multivariable Fundamental Theorem of Calculus — and both show up in normalising-flow Jacobians and physics-informed training objectives.

Section 09

Integration and change of variables

Multivariable integration is what lets us talk about total mass, expected values, and normalising constants. Two mechanics — iterated integration and the change-of-variables formula — are the only ones most ML work ever uses.

For a function $f$ of two variables over a rectangle, Fubini's theorem says the double integral equals an iterated integral:

$$ \iint_{[a,b]\times[c,d]} f(x,y)\,\mathrm{d}A \;=\; \int_a^b \int_c^d f(x,y)\, \mathrm{d}y\, \mathrm{d}x. $$

You integrate one variable at a time, treating the other as constant. Nearly every multivariable integral in practice either (a) factorises — so each piece is a one-variable problem — or (b) is intractable and handled by Monte Carlo sampling. In ML, the latter is overwhelmingly common: expectations over high-dimensional posteriors are the whole reason variational inference and MCMC exist.

The change of variables formula

A map $\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^n$ with Jacobian $J_{\mathbf{g}}(\mathbf{u})$ transforms integrals according to

$$ \int_{\mathbf{g}(U)} f(\mathbf{x}) \, \mathrm{d}\mathbf{x} \;=\; \int_U f\bigl(\mathbf{g}(\mathbf{u})\bigr) \, \left| \det J_{\mathbf{g}}(\mathbf{u}) \right| \, \mathrm{d}\mathbf{u}. $$

The $|\det J|$ term corrects for the local stretching of volumes under $\mathbf{g}$. This is the identity that makes normalising flows work: if $\mathbf{x} = \mathbf{g}(\mathbf{u})$ transforms a simple base density $p_{\mathbf{u}}$ into a target $p_{\mathbf{x}}$, the two densities are related by

$$ p_{\mathbf{x}}(\mathbf{x}) \;=\; p_{\mathbf{u}}\bigl(\mathbf{g}^{-1}(\mathbf{x})\bigr) \, \left| \det J_{\mathbf{g}^{-1}}(\mathbf{x}) \right|. $$

The entire engineering problem of building a normalising flow is engineering $\mathbf{g}$ so that $\det J_{\mathbf{g}}$ is cheap to compute — either triangular (autoregressive flows) or block-triangular (coupling layers). The calculus is just a log-Jacobian away; the architecture is the art.
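For a one-dimensional affine flow the whole formula fits in a few lines. A sketch (the map $g(u) = 2u + 1$ and the standard-normal base are illustrative): the transformed density should coincide with a Gaussian of mean $1$ and standard deviation $2$.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Flow x = g(u) = 2u + 1 on a standard normal base:
# g^{-1}(x) = (x - 1)/2 and |det J_{g^{-1}}| = 1/2, so
# p_x(x) = p_u((x - 1)/2) * 1/2, which is exactly N(1, 2^2).
for x in (-1.0, 0.5, 3.0):
    flow = normal_pdf((x - 1) / 2) * 0.5
    assert abs(flow - normal_pdf(x, mu=1.0, sigma=2.0)) < 1e-12
```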

Polar and spherical

Two coordinate changes appear over and over: polar in $\mathbb{R}^2$ (with Jacobian $r$) and spherical in $\mathbb{R}^n$ (with Jacobian involving $r^{n-1}$). The famous computation of the Gaussian normalising constant $\int e^{-x^2/2}\, \mathrm{d}x = \sqrt{2\pi}$ is a change to polar coordinates applied to the product of two copies. It is a rite of passage, and worth sitting down and doing once by hand.
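Before doing the polar-coordinate computation by hand, it is reassuring to confirm the answer numerically. A crude Riemann-sum sketch (grid and truncation chosen loosely; the Gaussian tail beyond the cutoff is negligible):

```python
import math

# Riemann sum for the Gaussian integral over [-10, 10] with step 1e-3.
dx = 1e-3
total = sum(math.exp(-0.5 * (k * dx) ** 2) for k in range(-10_000, 10_001)) * dx
assert abs(total - math.sqrt(2 * math.pi)) < 1e-6
```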

Section 10

Critical points: minima, maxima, and saddles

Find where the gradient vanishes; then use the Hessian to classify. The three-line procedure is the whole story in one variable, and with a few caveats, it survives into arbitrary dimensions — which is good news, because "find the minimum" is what every training loop in ML does.

A critical point of $f : \mathbb{R}^n \to \mathbb{R}$ is a point $\mathbf{x}^*$ where $\nabla f(\mathbf{x}^*) = \mathbf{0}$. The first-order condition says every local extremum is a critical point; the converse is false, which is why second-order tests exist.

Given that $\nabla f = \mathbf{0}$, the Hessian $H$ tells you the local geometry (section 07). If $H$ is positive definite, you are at a local minimum; negative definite, a local maximum; indefinite, a saddle; singular, inconclusive. In one variable this reduces to the familiar "$f''(x^*) > 0 \Rightarrow $ minimum" test.

Saddles dominate in high dimensions

A standard heuristic: if you pick a random symmetric matrix, the probability that all $n$ of its eigenvalues have the same sign shrinks exponentially with $n$. Deep-network loss landscapes are therefore riddled with saddle points — far more than with true local minima. This is a feature, not a bug. Noisy gradient methods escape saddles readily; they struggle more with poorly conditioned minima, which is why preconditioners like Adam and RMSProp exist.

Constrained optimisation and Lagrange multipliers

When you need to minimise $f$ subject to $g(\mathbf{x}) = 0$, the condition "gradient vanishes" is replaced by "gradient is parallel to the constraint gradient":

$$ \nabla f(\mathbf{x}) \;=\; \lambda \, \nabla g(\mathbf{x}), \qquad g(\mathbf{x}) = 0. $$

$\lambda$ is the Lagrange multiplier. For inequality constraints, the Karush–Kuhn–Tucker (KKT) conditions generalise this — the chapter on optimisation theory covers them in depth. For now, note that every time you train a model with a hard constraint (weight-norm clipping, orthogonality, probability simplex) a Lagrange multiplier is lurking in the update rule.
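The parallel-gradients condition is visible on the textbook example of minimising $x^2 + y^2$ subject to $x + y = 1$, whose solution $(\tfrac12, \tfrac12)$ with $\lambda = 1$ can be derived by hand:

```python
# At the constrained minimum (1/2, 1/2):
# grad f = (2x, 2y) = (1, 1) is parallel to grad g = (1, 1), with lambda = 1.
x, y = 0.5, 0.5
grad_f = (2 * x, 2 * y)
grad_g = (1.0, 1.0)
lam = grad_f[0] / grad_g[0]

assert grad_f == (lam * grad_g[0], lam * grad_g[1])   # gradients parallel
assert abs(x + y - 1) < 1e-12                          # constraint satisfied
```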

The optimiser's bestiary

Gradient descent only ever moves to critical points. Momentum accelerates along flat directions of the Hessian and damps oscillations in sharp ones. Adam is diagonal second-order information estimated on the fly. Natural gradient uses the Fisher information matrix (a cousin of the Hessian). Every modern optimiser is a rearrangement of the first two terms of a Taylor expansion — with a well-chosen matrix between them.

Section 11

Automatic differentiation

Frameworks like PyTorch and JAX compute exact derivatives — not symbolic formulas, not finite-difference approximations. The mechanism is called automatic differentiation, and once you know how it works it stops feeling like magic.

Three techniques are sometimes confused. Symbolic differentiation manipulates formulas and suffers from expression swell. Numerical differentiation approximates with finite differences and suffers from rounding and truncation error. Automatic differentiation applies the chain rule to the program itself, one primitive at a time, and is exact to floating-point precision. Frameworks do the third.

The computation graph

Every time you write y = f(x) in a framework, it records the operations as nodes of a directed acyclic graph. Each primitive knows its own local derivative — for $z = \sin(y)$, the local derivative is $\cos(y)$; for $z = x \cdot w$ it is $w$ with respect to $x$ and $x$ with respect to $w$. Reverse-mode AD walks the graph backward, multiplying local derivatives together, to assemble the full chain-rule product.

Dual numbers (forward mode)

Forward-mode AD has a clean algebraic realisation. Extend the reals to $\mathbb{R}[\varepsilon]$ where $\varepsilon^2 = 0$. For any analytic $f$,

$$ f(x + \varepsilon\, \dot x) \;=\; f(x) \;+\; \varepsilon\, f'(x)\, \dot x. $$

Everything after the $\varepsilon$ is exactly the derivative — the machinery is a generalisation of complex numbers, with $\varepsilon^2 = 0$ instead of $i^2 = -1$. Implementing forward-mode AD is no more than overloading arithmetic on pairs $(x, \dot x)$.
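That overloading takes only a few lines. A minimal sketch of a dual-number class, with $\sin$ lifted by hand as one primitive (the class and helper are ours, not any library's):

```python
import math

class Dual:
    """Number a + b*eps with eps^2 = 0; the b slot carries the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0.
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def sin(x):
    # A primitive lifted to duals: f(a + b eps) = f(a) + f'(a) b eps.
    return Dual(math.sin(x.a), math.cos(x.a) * x.b)

# Differentiate f(x) = x^2 sin(x) at x = 1 by seeding the eps slot with 1.
x = Dual(1.0, 1.0)
y = x * x * sin(x)
expected = 2 * math.sin(1.0) + math.cos(1.0)   # product rule by hand
assert abs(y.b - expected) < 1e-12
```

The seed $\dot x = 1$ makes the eps slot the derivative with respect to $x$; seeding a different input instead differentiates with respect to that one.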

Reverse mode: the vjp

Reverse-mode AD is not so clean algebraically, but it is what actually trains neural networks. For every primitive $\mathbf{y} = \mathbf{g}(\mathbf{x})$ it needs one function: given the adjoint $\bar{\mathbf{y}} = \partial \mathcal{L} / \partial \mathbf{y}$ arriving from downstream, compute the upstream adjoint $\bar{\mathbf{x}} = J_{\mathbf{g}}^\top \bar{\mathbf{y}}$ (as a row vector, $\bar{\mathbf{x}}^\top = \bar{\mathbf{y}}^\top J_{\mathbf{g}}$). This is the vector–Jacobian product, or vjp. A framework ships hand-written vjps for every primitive; user code that composes them inherits the chain rule for free.
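For a linear primitive $\mathbf{y} = W\mathbf{x}$, the Jacobian is $W$ itself and the vjp is a transposed multiply. A sketch checking a hand-written vjp against finite differences (the names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)

# Primitive y = W x has Jacobian J = W; its vjp maps ybar to W' ybar.
def vjp_matvec(ybar):
    return W.T @ ybar

ybar = rng.standard_normal(3)       # adjoint arriving from downstream
xbar = vjp_matvec(ybar)

# Check against finite differences of the scalar ybar . (W x).
h = 1e-6
loss = lambda x: ybar @ (W @ x)
numeric = np.array([(loss(x + h * e) - loss(x - h * e)) / (2 * h) for e in np.eye(5)])
assert np.allclose(xbar, numeric, atol=1e-5)
```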

Key idea

AD is a programming-language story as much as a calculus one. JAX's grad, jvp, and vjp expose the whole apparatus directly. PyTorch hides it behind .backward(). Once you see that backprop is a closed-form recursion over a DAG, the engineering choices — checkpointing, mixed precision, activation offloading — stop being mysterious.

Section 12

Ordinary differential equations

An ODE specifies how a quantity changes with respect to one variable — usually time. Given the change, you want the quantity. Solving means integrating, but analytical integration is rare; numerical integration is the real tool.

A first-order ordinary differential equation is a relation

$$ \dot{\mathbf{x}}(t) \;=\; \mathbf{f}\bigl(\mathbf{x}(t),\, t\bigr), \qquad \mathbf{x}(t_0) = \mathbf{x}_0. $$

The pair — equation plus initial condition — is an initial value problem (IVP). Picard's theorem guarantees that if $\mathbf{f}$ is Lipschitz in $\mathbf{x}$ on a neighbourhood of $(t_0, \mathbf{x}_0)$, a unique solution exists locally. In practice every well-posed dynamical system you meet in ML satisfies this.

The basic zoo

A few one-dimensional types yield to pen and paper. Separable equations, $\dot x = f(x)\,g(t)$, solve by dividing through by $f(x)$ and integrating each side. Linear first-order equations, $\dot x + p(t)\,x = q(t)$, solve with the integrating factor $e^{\int p}$. Exponential growth $\dot x = \lambda x$ and the logistic equation are the standard worked examples; almost everything else is numerical.

Second-order and higher

A second-order ODE like $\ddot{x} + 2\gamma \dot x + \omega_0^2 x = 0$ — the damped harmonic oscillator — lives in the imagination of every physicist and underlies every result in stability analysis you will ever read. It rewrites as a first-order system in two variables:

$$ \frac{\mathrm{d}}{\mathrm{d}t} \begin{pmatrix} x \\ \dot x \end{pmatrix} \;=\; \begin{pmatrix} 0 & 1 \\ -\omega_0^2 & -2\gamma \end{pmatrix} \begin{pmatrix} x \\ \dot x \end{pmatrix}. $$

This reduction — any $n$th-order ODE becomes a first-order system in $n$ variables — is the reason practitioners can focus entirely on first-order systems $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}, t)$. It is also exactly what a neural ODE is: a parameterised first-order vector field.

Flows and semigroups

The solution map $\Phi^t : \mathbf{x}_0 \mapsto \mathbf{x}(t)$ of an autonomous ODE is a one-parameter flow: $\Phi^{s+t} = \Phi^s \circ \Phi^t$. In a neural ODE, this says that doubling the integration time composes the network with itself — which is why neural ODEs behave a lot like deep residual networks, and why a trained ResNet of depth $L$ can often be read as a Euler discretisation of an ODE with $L$ steps.

Section 13

Linear ODEs and the matrix exponential

The one class of differential equations you can always solve in closed form is the linear one. The answer involves a matrix exponential — and the matrix exponential turns out to be the right object for talking about continuous-time dynamics in general.

A linear, time-invariant ODE system is

$$ \dot{\mathbf{x}}(t) \;=\; A \mathbf{x}(t), \qquad \mathbf{x}(0) = \mathbf{x}_0, $$

where $A \in \mathbb{R}^{n \times n}$. The solution is

$$ \mathbf{x}(t) \;=\; e^{t A}\, \mathbf{x}_0, $$

where the matrix exponential is defined by the same Taylor series as the scalar exponential,

$$ e^{tA} \;=\; \sum_{k=0}^{\infty} \frac{t^k}{k!} A^k. $$

The series converges for any square $A$. Practically, you compute it by diagonalising: if $A = Q \Lambda Q^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$, then $e^{tA} = Q\, \operatorname{diag}(e^{t\lambda_1},\ldots,e^{t\lambda_n})\, Q^{-1}$. When $A$ is not diagonalisable, use a Jordan form or a Padé approximation — in practice you call a library routine such as scipy.linalg.expm.
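Both routes to $e^{tA}$, the truncated Taylor series and the eigendecomposition, fit in a few lines of NumPy. A sketch (not production code; for ill-conditioned $A$ use scipy.linalg.expm) on the rotation generator, where the answer is known in closed form:

```python
import numpy as np

def expm_series(A, terms=30):
    """Matrix exponential via the truncated Taylor series sum_k A^k / k!."""
    E, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        E = E + term
    return E

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])       # rotation generator: A^2 = -I

# Diagonalise (complex eigendecomposition) and exponentiate the eigenvalues.
lam, Q = np.linalg.eig(A)
E_diag = (Q @ np.diag(np.exp(lam)) @ np.linalg.inv(Q)).real

E_ser = expm_series(A)
assert np.allclose(E_ser, E_diag)

# For this A, e^{tA} = cos(t) I + sin(t) A: rotation by t radians (here t = 1).
assert np.allclose(E_ser, [[np.cos(1), np.sin(1)], [-np.sin(1), np.cos(1)]])
```

The two computations agree to machine precision; the series needs care only when $\|tA\|$ is large.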

Eigenvalues set the long-time behaviour

The eigenvalues of $A$ tell you everything about the qualitative behaviour of solutions. If every eigenvalue has strictly negative real part, every trajectory decays to zero (stable equilibrium). If any eigenvalue has positive real part, there exist trajectories that blow up (unstable). Pure-imaginary eigenvalues give bounded oscillations (a frictionless harmonic oscillator). This is exactly the framework that Kalman filters, linear state-space models (S4, Mamba), and control theory all rely on.

Inhomogeneous linear ODEs

Adding a forcing term $\mathbf{u}(t)$ gives

$$ \dot{\mathbf{x}} \;=\; A \mathbf{x} \;+\; B \mathbf{u}(t). $$

The general solution is a variation-of-constants formula,

$$ \mathbf{x}(t) \;=\; e^{tA} \mathbf{x}_0 \;+\; \int_0^t e^{(t-s)A}\, B\, \mathbf{u}(s)\, \mathrm{d}s. $$

The integral is a continuous convolution of the input $\mathbf{u}$ with the impulse response $e^{tA} B$. Read that sentence again: a linear continuous-time system is a convolution with its impulse response. Every convolutional architecture, from WaveNet to S4 to Mamba, is trying to learn a discrete analogue of this kernel.
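The discrete analogue is easiest to see in one dimension. Under the assumption that $\mathbf{u}$ is held constant over each step of length $h$ (the zero-order hold used by S4-style state-space models), the convolution integral evaluates in closed form and gives an exact recurrence; the sketch below checks it against the variation-of-constants formula:

```python
import numpy as np

# Stable scalar system x' = a x + b u, with constant input u = 1.
a, b, h = -2.0, 1.0, 0.1

# Zero-order-hold discretisation (exact when u is constant on each step):
#   x_{k+1} = e^{ha} x_k + ((e^{ha} - 1)/a) b u_k
abar = np.exp(h * a)
bbar = (abar - 1.0) / a * b

x = 0.0
for _ in range(50):              # integrate to t = 5
    x = abar * x + bbar * 1.0

# Variation-of-constants closed form: x(t) = (e^{ta} - 1) b / a.
t = h * 50
assert np.isclose(x, (np.exp(a * t) - 1) * b / a)
```

The pair $(\bar a, \bar b) = (e^{ha},\, (e^{ha}-1)b/a)$ is exactly the discrete convolution kernel the prose describes.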

Section 14

Numerical ODE solvers

Outside the linear case, ODEs are almost never solvable in closed form. Numerical integration discretises time into finite steps and approximates the trajectory — and once you see the three canonical schemes, almost every solver you meet is a variation.

Given $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}, t)$ and a step size $h$, a one-step solver produces $\mathbf{x}_{n+1}$ from $\mathbf{x}_n$ and $t_n$.

Explicit Euler

$$ \mathbf{x}_{n+1} \;=\; \mathbf{x}_n \;+\; h\, \mathbf{f}(\mathbf{x}_n, t_n). $$

Linearise locally and step. Error per step is $O(h^2)$, global error is $O(h)$; the method is first-order accurate. Fine for prototyping, unusable for anything with fast dynamics because it demands vanishingly small $h$ to stay stable.

Implicit Euler

$$ \mathbf{x}_{n+1} \;=\; \mathbf{x}_n \;+\; h\, \mathbf{f}(\mathbf{x}_{n+1}, t_{n+1}). $$

$\mathbf{x}_{n+1}$ appears on both sides — you solve an equation at every step — but the trade is worth it for stiff systems, where fast and slow time scales coexist. Implicit methods remain stable at step sizes that would make explicit methods diverge.
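The trade is visible on a one-line linear problem. For $\dot x = \lambda x$ the explicit update multiplies by $1 + h\lambda$ while the implicit update divides by $1 - h\lambda$; with $h\lambda = -5$, far outside the explicit stability region, only the implicit method survives. A sketch, not a general stiff solver:

```python
import numpy as np

# Stiff linear test problem x' = -50 x, x(0) = 1, with a step size
# well above the explicit stability limit h < 2/50 = 0.04.
lam, h = -50.0, 0.1

x_exp = 1.0   # explicit Euler: x_{n+1} = (1 + h*lam) x_n, factor |1 - 5| = 4
x_imp = 1.0   # implicit Euler: x_{n+1} = x_n / (1 - h*lam), factor 1/6
for _ in range(20):
    x_exp = (1 + h * lam) * x_exp
    x_imp = x_imp / (1 - h * lam)

assert abs(x_exp) > 1e10          # explicit Euler blows up
assert 0 < x_imp < 1e-10          # implicit Euler decays, like the true solution
```

For a nonlinear $\mathbf{f}$ the implicit update requires a Newton solve at every step; for a linear one it is just this division (a linear solve in the vector case).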

Runge–Kutta 4

The workhorse explicit method. In four stages:

$$ \begin{aligned} \mathbf{k}_1 &= \mathbf{f}(\mathbf{x}_n, t_n), \\ \mathbf{k}_2 &= \mathbf{f}(\mathbf{x}_n + \tfrac{h}{2}\mathbf{k}_1,\, t_n + \tfrac{h}{2}), \\ \mathbf{k}_3 &= \mathbf{f}(\mathbf{x}_n + \tfrac{h}{2}\mathbf{k}_2,\, t_n + \tfrac{h}{2}), \\ \mathbf{k}_4 &= \mathbf{f}(\mathbf{x}_n + h \mathbf{k}_3,\, t_n + h), \\ \mathbf{x}_{n+1} &= \mathbf{x}_n + \tfrac{h}{6}\bigl(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\bigr). \end{aligned} $$

Global error is $O(h^4)$ — each halving of $h$ cuts the error by a factor of sixteen. It is the default choice when you do not know better, and it is the fixed-step workhorse inside many neural-ODE implementations.
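The order claims are checkable in a few lines. A sketch comparing explicit Euler and RK4 on $\dot x = -x$ (exact solution $e^{-t}$): halving $h$ should cut Euler's error by about $2$ and RK4's by about $16$.

```python
import numpy as np

def euler_step(f, x, t, h):
    return x + h * f(x, t)

def rk4_step(f, x, t, h):
    k1 = f(x, t)
    k2 = f(x + h/2 * k1, t + h/2)
    k3 = f(x + h/2 * k2, t + h/2)
    k4 = f(x + h * k3, t + h)
    return x + h/6 * (k1 + 2*k2 + 2*k3 + k4)

def solve(step, f, x0, t1, n):
    x, h = x0, t1 / n
    for i in range(n):
        x = step(f, x, i * h, h)
    return x

f = lambda x, t: -x                      # exact solution: e^{-t}
exact = np.exp(-1.0)

err = {(s, n): abs(solve(s, f, 1.0, 1.0, n) - exact)
       for s in (euler_step, rk4_step) for n in (100, 200)}

euler_ratio = err[euler_step, 100] / err[euler_step, 200]   # ~ 2^1
rk4_ratio = err[rk4_step, 100] / err[rk4_step, 200]         # ~ 2^4
assert 1.6 < euler_ratio < 2.5
assert 12.0 < rk4_ratio < 20.0
```

Measuring the error ratio under step halving like this is the standard way to verify a solver's order of accuracy empirically.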

Adaptive step sizes

A real solver runs two methods of different order side by side — an embedded pair of orders 4 and 5, say — takes their difference as an error estimate, and shrinks $h$ when the estimate grows too large. Dormand–Prince is the classic choice; SciPy's solve_ivp defaults to it, and JAX's diffrax ships it as Dopri5. This is why modern neural-ODE training does not ask you to pick a step size: the adaptive solver picks one per example per evaluation.
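The control loop itself is simple enough to sketch. Below is a toy step-doubling adaptive Euler (the helper name `adaptive_euler` and all tuning constants are illustrative): real solvers use an embedded Runge–Kutta pair rather than two Euler evaluations, but the accept/shrink/grow logic has the same shape. The exponent $1/2$ in the gain reflects Euler's $O(h^2)$ local error.

```python
import numpy as np

def adaptive_euler(f, x, t, t_end, h=0.1, tol=1e-4):
    """Toy adaptive integrator: one full Euler step vs two half steps."""
    steps = 0
    while t < t_end:
        h = min(h, t_end - t)
        full = x + h * f(x, t)
        half = x + h/2 * f(x, t)
        two_half = half + h/2 * f(half, t + h/2)
        err = abs(two_half - full)           # local error estimate
        if err < tol:                        # accept the more accurate result
            x, t = two_half, t + h
            steps += 1
        # Grow or shrink h toward the error target (clamped for safety).
        h *= min(2.0, max(0.2, 0.9 * (tol / (err + 1e-15)) ** 0.5))
    return x, steps

f = lambda x, t: -x
x_end, n = adaptive_euler(f, 1.0, 0.0, 5.0)
assert abs(x_end - np.exp(-5.0)) < 1e-2
```

Notice that the step size grows automatically as the solution flattens out, which is exactly the behaviour that makes adaptive solvers economical.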

Stiffness and why it matters

A system is stiff when its Jacobian has eigenvalues spread over many orders of magnitude. Explicit solvers must then take steps limited by the fastest mode, even while tracking only the slowest — a crippling cost. Stiff implicit solvers (BDF, Rosenbrock) are the right tool. Chemical kinetics, many trained neural ODEs, and large-scale dynamical-system simulations are stiff in practice.

Section 15

Stability and phase portraits

Before computing a solution, ask what the solution must look like. The qualitative theory of ODEs — fixed points, stability, phase portraits — is the geometric side of the subject, and it is how physicists and control engineers actually think about dynamics.

A fixed point of $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x})$ is an $\mathbf{x}^*$ where $\mathbf{f}(\mathbf{x}^*) = \mathbf{0}$. Near a fixed point, the dynamics linearise:

$$ \dot{\mathbf{h}} \;\approx\; J_{\mathbf{f}}(\mathbf{x}^*)\, \mathbf{h}, \qquad \mathbf{h} = \mathbf{x} - \mathbf{x}^*. $$

The eigenvalues of the Jacobian at the fixed point classify it. All eigenvalues with negative real part: asymptotically stable; trajectories nearby spiral or slide into $\mathbf{x}^*$. Any eigenvalue with positive real part: unstable. Pure imaginary: the linearisation is a centre, and you need nonlinear information to decide.

Phase portrait of a stable spiral: $\dot x = -0.1 x - y$, $\dot y = x - 0.1 y$. The eigenvalues of the linearisation at the origin are $-0.1 \pm i$, giving damped rotation. Every trajectory spirals in to the fixed point at the origin.
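The claim is easy to verify numerically: integrate the system with small Euler steps and check that the radius decays like $e^{-0.1 t}$ while the point rotates. A sketch in plain NumPy:

```python
import numpy as np

# The stable spiral x' = -0.1x - y, y' = x - 0.1y: eigenvalues -0.1 +/- i,
# so the radius decays like e^{-0.1 t} while the trajectory rotates.
A = np.array([[-0.1, -1.0],
              [ 1.0, -0.1]])

h, steps = 1e-3, 10_000          # integrate to t = 10
z = np.array([1.0, 0.0])
for _ in range(steps):
    z = z + h * A @ z

r = np.linalg.norm(z)
assert np.isclose(r, np.exp(-0.1 * 10.0), rtol=1e-2)   # radius ~ e^{-1}
```

The same run doubles as a Lyapunov check for the next subsection: $V(x, y) = x^2 + y^2$ satisfies $\dot V = -0.2\,V \leq 0$ along these trajectories.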

Lyapunov's direct method

Sometimes the linearisation is not enough. Lyapunov's idea: find a scalar function $V(\mathbf{x})$ that is positive near the fixed point and decreases along trajectories. If $V(\mathbf{x}^*) = 0$, $V(\mathbf{x}) > 0$ for $\mathbf{x} \neq \mathbf{x}^*$, and $\dot V = \nabla V \cdot \mathbf{f} \leq 0$, the fixed point is stable. If the inequality is strict, it is asymptotically stable. For mechanical systems the total energy is almost always a Lyapunov function; for ML systems a loss or potential often is.

Bifurcations

As a parameter varies, the structure of the phase portrait can change qualitatively — fixed points can collide, split, or change stability. The point where the change happens is a bifurcation. The names to know are saddle–node (two fixed points collide and annihilate), pitchfork (symmetry-breaking), and Hopf (a fixed point destabilises and spawns a limit cycle). Training deep networks exhibits analogues of all three in mean-field limits, and the Hopf bifurcation is the canonical route to oscillations in excitable systems.

Section 16

Partial differential equations

When the unknown depends on more than one variable, you get a partial differential equation. Three canonical linear PDEs — heat, wave, Laplace — cover most of physics and nearly all of the intuition you need.

A PDE involves partial derivatives of an unknown function with respect to more than one independent variable. The three canonical second-order linear PDEs, which lie at the foundation of the subject, all involve the Laplacian $\Delta$:

$$ \underbrace{\partial_t u \;=\; \alpha\, \Delta u}_{\text{heat / diffusion}}, \qquad \underbrace{\partial_{tt} u \;=\; c^2\, \Delta u}_{\text{wave}}, \qquad \underbrace{\Delta u \;=\; 0}_{\text{Laplace}}. $$

They are three representatives of three fundamentally different behaviours.

Heat (parabolic)

The heat equation describes diffusion. Initial heat distributions smooth out as time goes on — discontinuities vanish instantly, fine features decay exponentially fast. In probability, it is exactly the evolution of a Gaussian density under Brownian motion, and for diffusion models, it is the evolution that turns a sample into pure noise.

On the real line, the heat equation with initial condition $u(x, 0) = u_0(x)$ has the explicit solution

$$ u(x, t) \;=\; \int_{-\infty}^{\infty} G(x - y, t)\, u_0(y)\, \mathrm{d}y, \qquad G(x, t) = \frac{1}{\sqrt{4\pi \alpha t}} e^{-x^2 / (4\alpha t)}. $$

The kernel $G$ is the Gaussian heat kernel; the solution is the initial condition smoothed by a Gaussian of width $\sqrt{2\alpha t}$. The connection to probability — Gaussian densities are the fundamental solutions of diffusion — is why every generative-diffusion paper pairs the forward SDE with a Gaussian noise schedule.
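A direct check in NumPy: convolve a discontinuous initial condition with the heat kernel and confirm that total heat is conserved while the jump is smoothed away. The grid spacing and domain size below are arbitrary illustrative choices.

```python
import numpy as np

alpha, t = 1.0, 0.5
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

u0 = (np.abs(x) < 1.0).astype(float)               # a step: heat on [-1, 1]
G = np.exp(-x**2 / (4*alpha*t)) / np.sqrt(4*np.pi*alpha*t)   # heat kernel

u = np.convolve(u0, G, mode="same") * dx           # u(., t) = G * u0

# Diffusion conserves total heat and destroys the discontinuity.
assert np.isclose(u.sum() * dx, u0.sum() * dx, rtol=1e-3)
assert np.abs(np.diff(u)).max() < 0.05 * np.abs(np.diff(u0)).max()
```

The second assertion is the smoothing property in miniature: the unit jump in $u_0$ becomes a gentle slope after time $t$.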

Wave (hyperbolic)

The wave equation preserves detail: solutions oscillate forever and carry information at a finite speed $c$. Its general solution in one dimension is a sum of right- and left-moving waves $f(x - ct)$ and $g(x + ct)$ — d'Alembert's formula. This hyperbolic character is what makes acoustic and electromagnetic signals possible, and why training neural-network surrogates for wave-like physics is genuinely harder than for diffusive physics.

Laplace / Poisson (elliptic)

Solutions of $\Delta u = 0$ (harmonic functions) are extraordinarily smooth — determined everywhere inside a domain by their boundary values. The Poisson equation $\Delta u = f$ is the workhorse of electrostatics, gravitation, and image inpainting. In ML, it pops up in graph Laplacian regularisers and in solving for potentials in physics-informed networks.

Key idea

The three canonical PDEs correspond to three styles of behaviour: parabolic (smooths and spreads), hyperbolic (preserves and propagates), elliptic (is equilibrium). Nearly every second-order linear PDE you will meet is one of these three with variable coefficients, and the qualitative behaviour is inherited from the classification. The technical term is characteristics: the curves along which information propagates. Hyperbolic equations have real characteristics; elliptic equations have none, which is why their solutions respond globally to boundary data.

Section 17

Numerical PDEs and finite differences

Solving a PDE on a computer means discretising space and time into grids. Three families dominate: finite differences, finite elements, and spectral methods. Physics-informed neural networks are an emerging fourth, which replaces the grid with a network.

The finite-difference approach replaces derivatives by grid differences. For the 1D heat equation $\partial_t u = \alpha\, \partial_{xx} u$, pick step sizes $\Delta x$ and $\Delta t$, denote $u_i^n \approx u(i \Delta x, n \Delta t)$, and approximate

$$ \frac{u_i^{n+1} - u_i^n}{\Delta t} \;=\; \alpha\, \frac{u_{i+1}^n - 2 u_i^n + u_{i-1}^n}{(\Delta x)^2}. $$

Solve for $u_i^{n+1}$ and you have an explicit scheme. It is $O(\Delta t) + O(\Delta x^2)$ accurate, and stable only when $\alpha \Delta t / (\Delta x)^2 \leq 1/2$ — a CFL-type stability condition. Every time a practitioner halves $\Delta x$, the stability bound forces $\Delta t$ to shrink by a factor of four. This is why implicit schemes (Crank–Nicolson, ADI) are the default for serious work.
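The stability bound is easy to watch in action. The toy sketch below (the helper `heat_step` is illustrative) steps a spike of heat with $r = \alpha\,\Delta t/(\Delta x)^2$ on either side of the threshold $1/2$, with fixed zero boundary values:

```python
import numpy as np

def heat_step(u, r):
    """One explicit finite-difference step of u_t = alpha*u_xx; r = alpha*dt/dx^2."""
    un = u.copy()
    un[1:-1] = u[1:-1] + r * (u[2:] - 2*u[1:-1] + u[:-2])
    return un                            # endpoints stay fixed (Dirichlet)

u0 = np.zeros(101)
u0[50] = 1.0                             # a spike of heat in the middle

u_stable = u0.copy()
u_unstable = u0.copy()
for _ in range(200):
    u_stable = heat_step(u_stable, 0.4)      # r <= 1/2: the spike diffuses
    u_unstable = heat_step(u_unstable, 0.6)  # r > 1/2: oscillates and blows up

assert u_stable.max() < 0.1 * u0.max()
assert np.abs(u_unstable).max() > 1e3
```

At $r = 0.6$ the highest spatial mode is amplified by $|1 - 4r| = 1.4$ per step, so the grid-scale oscillation grows exponentially; at $r = 0.4$ every mode is damped.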

Finite elements and spectral methods

Finite elements (FEM) represent $u$ as a combination of basis functions — hat functions in 1D, triangles in 2D, tetrahedra in 3D — and minimise a residual in a variational form. FEM handles complex geometries that finite differences cannot; it is the dominant tool in engineering simulation. Spectral methods expand $u$ in a Fourier or Chebyshev basis; they are exponentially accurate on simple geometries but brittle under non-smooth data. Operational weather models combine pieces of all three families.

Physics-informed neural networks

A PINN represents the unknown solution as a neural network $u_\theta(\mathbf{x}, t)$ and trains the network by minimising the residual of the PDE at sample collocation points, plus a penalty on boundary and initial conditions:

$$ \mathcal{L}(\theta) \;=\; \underbrace{\mathbb{E}\left[(\partial_t u_\theta - \alpha \Delta u_\theta)^2\right]}_{\text{PDE residual}} \;+\; \lambda_{\partial}\, \mathbb{E}\left[(u_\theta - u_{\text{bc}})^2\right] \;+\; \lambda_0\, \mathbb{E}\left[(u_\theta - u_0)^2\right]. $$

The partial derivatives inside the first term are computed by automatic differentiation — exactly the same mechanism that computes gradients for training. PINNs replace the computational mesh of classical solvers with a network-parameterised function, and the stability trade-off with an optimisation problem. When they work they are beautiful; when they fail they do so mysteriously, and most of the current research in the area is trying to make them work more often.
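The residual term is simple enough to write out by hand. The sketch below uses a one-hidden-layer tanh network for $u_\theta$ with the derivatives $\partial_t u_\theta$ and $\partial_{xx} u_\theta$ derived analytically in NumPy, standing in for the autodiff a real PINN in PyTorch or JAX would use; the weights are random and untrained, and the helper name `network_derivs` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 16
a, w, v, b = rng.normal(size=(4, m))     # untrained weights, for illustration

def network_derivs(x, t):
    """u(x,t) = a . tanh(w x + v t + b), with hand-derived derivatives."""
    z = np.tanh(np.outer(x, w) + np.outer(t, v) + b)    # shape (n, m)
    s = 1.0 - z**2                                      # sech^2 = tanh'
    u    = z @ a
    u_t  = (s * v) @ a                                  # d/dt of each tanh
    u_xx = (-2.0 * z * s * w**2) @ a                    # d^2/dx^2 of each tanh
    return u, u_t, u_xx

alpha = 1.0
x = rng.uniform(-1, 1, 256)
t = rng.uniform(0, 1, 256)                # random collocation points

u, u_t, u_xx = network_derivs(x, t)
residual_loss = np.mean((u_t - alpha * u_xx) ** 2)      # the PDE-residual term

# Sanity-check the hand derivative against a central finite difference.
eps = 1e-5
fd_u_t = (network_derivs(x, t + eps)[0] - network_derivs(x, t - eps)[0]) / (2 * eps)
assert np.allclose(fd_u_t, u_t, atol=1e-6)
```

Training would add the boundary and initial penalty terms from the loss above and differentiate the whole thing with respect to the weights — which is exactly where autodiff replaces this hand derivation.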

Neural operators

A step beyond PINNs: rather than learning one solution to one PDE, learn the operator that maps initial/boundary data to solution. Fourier Neural Operators (FNO) and DeepONet are the two dominant families. They are the modern way to build surrogate solvers for families of PDEs — weather, subsurface flow, turbulence — at orders-of-magnitude lower cost than conventional numerics.

Section 18

Where it shows up in ML

Every piece of this chapter has a direct analogue in modern machine learning. Here is the map from the mathematics above to the frameworks, papers, and techniques you will meet again and again.

Read carefully enough and every modern ML paper is a piece of applied calculus. The abstractions in the next chapters — optimisation, probability, statistics — are built directly on top of what we have developed here. If a later paper mentions a Jacobian, a Hessian, an ODE solver, or a PDE residual and the reference feels abstract, come back to this chapter and the connection will usually make itself plain within a page or two.

Further reading

Where to go next

Calculus is a subject with several good entry points. The books below span levels — from a first rigorous encounter to the research frontier of neural differential equations. Pair a textbook with a lecture series and a problem set; the combination works better than any single source.

Calculus textbooks

Differential equations textbooks

Free video courses

References, cheat sheets, and papers

This page is the second chapter of Part I: Mathematical Foundations. Up next: optimisation theory — where the gradients of this chapter meet the problems they were invented to solve — followed by probability, statistics, and the information-theoretic tools that power modern generative models.