Most learning algorithms in this book — linear regression, logistic regression, principal component analysis, k-means — are designed for the case where the data lives in an ordinary Euclidean space and the interesting structure is linear. Real data is almost never linear. The kernel trick, due to Aronszajn in 1950 as an abstract mathematical result and to Boser, Guyon, and Vapnik in 1992 as an algorithmic one, is the single most elegant extension of linear methods to non-linear settings: rather than mapping the data into some high-dimensional feature space and computing inner products there, the kernel computes those inner products directly — in time that does not grow with the dimension of the feature space — without ever materialising the feature map. A huge swath of classical machine learning becomes an exercise in choosing and combining kernels: the method is the same, only the similarity function changes. Support vector machines, the chapter's name-bearing technique, are the most famous application: the maximum-margin classifier that dominated academic benchmarks for a decade, defined much of the vocabulary of modern statistical learning, and trained a generation of learning theorists. Alongside SVMs, the kernel framework supports kernel ridge regression, kernel PCA, kernel clustering, Gaussian processes (Bayesian kernel regression), and a dozen more specialised methods, all sharing the same mathematical core. The catch is scaling: kernel methods with n training points require O(n²) memory and O(n³) training time, which broke them decisively for large modern datasets and motivated the Nyström and random-features approximations that keep them relevant today.
Section one motivates the whole enterprise by taking a linear method — the perceptron or ridge regression — and showing how its behaviour changes when the data is first lifted into a non-linear feature space. Section two makes that lifting concrete: a feature map φ: X → H replaces each input with a vector in some (possibly infinite-dimensional) Hilbert space, and a linear method applied to φ(x) becomes a non-linear method in x. Section three is the central trick: any algorithm whose formulation depends only on inner products ⟨φ(x), φ(x')⟩ can be rewritten using a kernel function k(x, x') that computes those inner products directly, without ever evaluating φ. Mercer's theorem tells us which functions k qualify as kernels (positive semi-definite ones), and the kernel matrix K with Kᵢⱼ = k(xᵢ, xⱼ) is the only thing the algorithm ever sees about the data. Section four introduces reproducing kernel Hilbert spaces — the functional-analytic framework that makes the whole construction rigorous — and the representer theorem, which says every kernel-method solution lies in the span of the training points. Section five is a tour of the kernel zoo: linear, polynomial, RBF/Gaussian, Laplacian, sigmoid, and the specialised kernels for strings, graphs, and sets. Section six explains the operations (addition, multiplication, composition) that let you build new kernels from old ones.
Sections seven through twelve develop the flagship application: support vector machines. Section seven presents the maximum-margin classifier — Vapnik's foundational idea that, among the many hyperplanes that separate two classes, the one to pick is the one that leaves the widest margin on either side. Section eight introduces soft-margin SVMs via slack variables and the regularisation constant C, which is what lets the method handle non-separable data. Section nine develops the dual formulation, where the primal problem over the weight vector is re-expressed as a quadratic program over dual variables that weight the training points. The dual is what makes SVMs work with kernels, and the support vectors — the small subset of training points with non-zero dual variables — are both the sparsity story and the mechanism by which SVMs generalise despite enormous feature spaces. Section ten applies the kernel trick to the dual, producing the kernel SVM — the method that actually gets implemented. Section eleven generalises the machinery to regression: the epsilon-insensitive loss, support vector regression, and when to prefer it over kernel ridge. Section twelve covers the two standard routes to multiclass SVMs (one-vs-one, one-vs-all), and notes the Crammer–Singer unified formulation.
Sections thirteen through eighteen survey the kernel-methods ecosystem beyond SVMs and bring the chapter home. Section thirteen is kernel ridge regression — linear ridge with the kernel trick applied — the closed-form analogue of SVR and the cleanest possible example of kernel thinking. Section fourteen touches on the other kernelised classics: kernel PCA (already met in Chapter 05), kernel k-means, kernel Fisher discriminant, one-class SVM for anomaly detection, and spectral clustering viewed as a kernel method. Section fifteen is Gaussian processes — the Bayesian formulation of kernel regression, with calibrated posteriors, hyperparameter learning via the marginal likelihood, and the close relationship between the GP mean predictor and kernel ridge regression. Section sixteen is the hardest practical question: scaling. Kernel methods are O(n²) in memory and O(n³) in time; the Nyström approximation, random Fourier features, inducing-point sparse GPs, and modern GPU libraries (GPyTorch, KeOps, Falkon) are the techniques that push the usable range from tens of thousands to tens of millions of points. Section seventeen closes with practical guidance: which kernel to try first, how to tune hyperparameters, when to prefer a random-features approximation over a full kernel, and the common traps. Section eighteen places kernel methods in the modern ML landscape: the neural-network–kernel equivalence in the infinite-width limit, the neural tangent kernel, kernel methods as the gold-standard baseline against which every new deep model is still measured, and the continuing relevance of Gaussian processes in Bayesian optimisation, scientific ML, and anywhere calibrated uncertainty matters.
Linear methods are the workhorses of classical statistics and machine learning because they are convex, easy to optimise, and well understood. They fail — often catastrophically — on problems where the interesting structure is not linear. The tempting fix is to engineer non-linear features by hand: add squared terms, pairwise products, polynomials, trigonometric functions. This works for small problems and becomes intractable as the number of features grows. Kernel methods are the elegant alternative: use a linear algorithm in a space where the data has been non-linearly transformed, but never actually compute the transformation. The inner product in the transformed space — which is all a linear algorithm needs — is computed directly by a kernel function.
The prototypical example is the XOR problem: four points in the plane, labels (+1, −1, −1, +1) at (1,1), (1,−1), (−1,1), (−1,−1). No line separates the positives from the negatives, so no linear classifier works. Lift the points into three dimensions with the feature map φ(x₁, x₂) = (x₁², x₂², √2·x₁·x₂). The XOR points now live at (1, 1, √2), (1, 1, −√2), (1, 1, −√2), (1, 1, √2), and the hyperplane z = 0 (with z the third coordinate) separates them perfectly. A linear classifier in three dimensions solves the problem that no linear classifier in two dimensions could.
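The lift is easy to check numerically. A minimal sketch (numpy only; the feature map is the one from the text, the variable names are illustrative):

```python
import numpy as np

# The four XOR points and their labels, as in the text.
X = np.array([(1, 1), (1, -1), (-1, 1), (-1, -1)], dtype=float)
y = np.array([+1, -1, -1, +1])

def phi(x):
    """Degree-2 feature lift phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

Z = np.array([phi(x) for x in X])

# In the lifted space the hyperplane z = 0 (normal w = (0, 0, 1), b = 0)
# classifies all four points: the sign of the third coordinate matches y.
predictions = np.sign(Z[:, 2])
```

Running this, `predictions` agrees with `y` on all four points, confirming that a linear rule in three dimensions solves the two-dimensional XOR.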
Explicit feature maps become impractical quickly. A degree-d polynomial expansion of a p-dimensional input has O(p^d) features. For p=1000 and d=5 that is trillions of features, far too many to materialise. Even worse, some useful feature maps are infinite-dimensional: the Gaussian RBF map, derived later, lifts each point into a countably infinite-dimensional Hilbert space. No algorithm that works with explicit feature vectors can handle this.
The observation that makes everything work: a huge family of linear algorithms — ridge regression, logistic regression, PCA, k-means, the perceptron, and the support vector machine — touch the data only through inner products ⟨xᵢ, xⱼ⟩. None of them ever uses a raw coordinate of xᵢ; only dot products. If we replace every occurrence of ⟨xᵢ, xⱼ⟩ with a kernel function k(xᵢ, xⱼ) that computes ⟨φ(xᵢ), φ(xⱼ)⟩ directly, the algorithm runs in the lifted space without our ever having lifted the points. The kernel is the only thing that matters.
We gain expressivity: an RBF-kernel SVM can represent any continuous decision boundary, an RBF-kernel Gaussian process can represent any smooth function. We gain the ability to work with data whose natural representation is not a vector at all — strings, graphs, trees, distributions — by defining a kernel that measures similarity between such objects. We gain rich theoretical tools: convergence guarantees, generalisation bounds, and explicit relationships between the kernel, the hypothesis space, and the capacity of the model.
We pay in computation. Kernel methods with n training points require an n×n kernel matrix — O(n²) memory — and the training of most kernel methods is O(n³) in time. For n=100 000 this is already painful; for n=10⁶ it is infeasible without the approximations of section 16. This scaling is the single reason that, when deep learning arrived with linear-in-n minibatch SGD, kernel methods lost ground in the commodity-compute regime. They remain the gold standard in small-to-medium data problems and in any setting where calibrated uncertainty, interpretability, or rigour matters more than raw size.
Sections 2–6 develop the representation: feature maps, the kernel trick, reproducing kernel Hilbert spaces, the kernels you use in practice, and how to build new ones. Sections 7–12 develop SVMs — the flagship algorithm — from linear max-margin classifiers to the kernelised multiclass versions. Sections 13–15 cover kernel ridge regression, the other kernelised classics, and Gaussian processes. Section 16 is about scaling — without which modern kernel methods are a museum piece. Sections 17–18 close with practice and where the ideas live today.
Before introducing the kernel trick, we need to be precise about what a feature map is and what an inner product is in the generality we need. A feature map is a function that takes each input and produces a vector in some — possibly infinite-dimensional — Hilbert space. The inner product in that space, evaluated on the images of pairs of inputs under the feature map, will end up being the kernel. This section pins down the setting.
A feature map is a function φ: X → H from the input space X (which can be anything — vectors, strings, graphs, images) to a Hilbert space H with inner product ⟨·, ·⟩. A Hilbert space is a complete inner-product space — intuitively, a possibly-infinite-dimensional vector space on which angles and lengths are well-defined. R^d with the ordinary dot product is the finite-dimensional example; the space of square-integrable functions on R with the integral inner product is the canonical infinite-dimensional example.
For input x = (x₁, x₂) ∈ R², define φ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁·x₂). The image lives in R⁶. The inner product in R⁶ is
⟨φ(x), φ(x')⟩ = 1 + 2·x₁·x₁' + 2·x₂·x₂' + x₁²·(x₁')² + x₂²·(x₂')² + 2·x₁·x₂·x₁'·x₂'.
With some algebra, this equals (1 + x₁·x₁' + x₂·x₂')² = (1 + ⟨x, x'⟩)². The inner product in the six-dimensional feature space is computed by squaring a one-line scalar expression in the two-dimensional input space. This is the degree-2 polynomial kernel, and it already shows the trick in miniature: the feature vector has 6 entries, while the kernel computation is a two-dimensional dot product, an addition, and a squaring.
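The identity can be verified numerically in a few lines. A sketch (numpy; random inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

def phi(v):
    # The explicit six-dimensional feature map from the text.
    v1, v2 = v
    s = np.sqrt(2)
    return np.array([1, s * v1, s * v2, v1**2, v2**2, s * v1 * v2])

explicit = phi(x) @ phi(xp)      # inner product in R^6
kernel = (1 + x @ xp) ** 2       # one-line kernel evaluation

assert np.isclose(explicit, kernel)
```

The assertion passes for any pair of inputs: the six-dimensional inner product and the squared scalar expression are the same number.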
A linear classifier in a Hilbert space H is a function f(x) = ⟨w, φ(x)⟩ + b for some w in H and scalar b. Training finds the w that minimises some loss. For many loss functions — hinge loss (SVM), squared loss (ridge), log loss (logistic regression) — the optimiser's w lies in the span of the training images: w = Σᵢ αᵢ φ(xᵢ). This is the representer theorem, formalised in section 4. Substituting back:
f(x) = Σᵢ αᵢ ⟨φ(xᵢ), φ(x)⟩ + b = Σᵢ αᵢ k(xᵢ, x) + b,
so the prediction for a new input x depends on the training data only through the kernel evaluated between x and each training point. The whole method, from training to prediction, has vanished into kernel evaluations.
It is useful to read the kernel as a similarity function. ⟨φ(x), φ(x')⟩ is large when φ(x) and φ(x') point in similar directions — it is a measure of how much the two inputs have in common in the feature space. Designing a good kernel is designing a good measure of similarity. For points in R^d, the Gaussian RBF kernel k(x, x') = exp(−γ‖x−x'‖²) is large when the two points are close and small when they are far — which is why it works so well as a default.
The kernel trick is the observation that we do not need to know the feature map φ explicitly — we need only the inner product ⟨φ(x), φ(x')⟩, which is the kernel k(x, x'). The deep question is: which functions k(x, x') are inner products of some feature map into some Hilbert space? The answer is given by Mercer's theorem and its modern generalisations, and it is essentially: exactly the positive semi-definite functions.
A function k: X × X → R is a positive semi-definite kernel — sometimes just "PSD" — if for every finite set of inputs x₁, …, xₙ the Gram matrix (or kernel matrix) K with entries Kᵢⱼ = k(xᵢ, xⱼ) is symmetric and positive semi-definite: vᵀKv ≥ 0 for every vector v. Equivalently, all eigenvalues of K are non-negative, for every finite n and every choice of inputs.
The result that makes everything work: every PSD kernel k corresponds to some feature map φ into some Hilbert space H such that k(x, x') = ⟨φ(x), φ(x')⟩. The Hilbert space is unique up to isomorphism and is called the reproducing kernel Hilbert space (RKHS) of the kernel. The theorem does not give us φ explicitly — it proves existence and constructs H as the closure of finite linear combinations of kernel evaluations — but it guarantees that any PSD kernel is legitimately the inner product of some feature map, which is all we need to apply the kernel trick.
Mercer's theorem (1909) is the classical result that predated the kernel-methods literature by ninety years: under mild conditions, a continuous PSD kernel k(x, x') on a compact domain admits an eigendecomposition
k(x, x') = Σ_i λᵢ · eᵢ(x) · eᵢ(x'),
where λᵢ ≥ 0 are eigenvalues and eᵢ are orthonormal eigenfunctions. The feature map is φ(x) = (√λ₁ · e₁(x), √λ₂ · e₂(x), …), and the feature space is ℓ². This is where infinite-dimensional feature spaces first appear in the theory — the RBF kernel, for instance, has a countably infinite eigendecomposition and a feature space that is genuinely infinite-dimensional.
To apply the kernel trick to an existing linear algorithm: (i) write the algorithm entirely in terms of inner products between training inputs; (ii) replace every inner product ⟨xᵢ, xⱼ⟩ with k(xᵢ, xⱼ); (iii) verify that the algorithm still makes sense — typically it does, because the only way it ever touched the raw data was through inner products. The kernel matrix K becomes the central object: it holds all the information the algorithm has about the data. Many algorithms (kernel ridge, kernel PCA, Gaussian processes) require factorising or inverting K — hence the O(n³) scaling.
Two main routes. The first: construct an explicit feature map φ and show that k(x, x') = ⟨φ(x), φ(x')⟩. The second, more commonly useful in practice: use the closure properties of section 6 (sum, product, scalar multiplication) to build k from known PSD kernels; the result is guaranteed PSD. For a function pulled from thin air, the default is to compute the Gram matrix on a sample of data and check its eigenvalues.
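The empirical check mentioned above is easy to implement. A sketch (numpy; the candidate "similarity" used as a counterexample is illustrative):

```python
import numpy as np

def is_psd_on_sample(k, X, tol=1e-8):
    """Empirical PSD check: smallest eigenvalue of the Gram matrix on a sample."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))

# The RBF kernel passes the check on any sample.
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))

# Negative squared distance looks like a similarity but is not PSD:
# its Gram matrix has zero trace and non-zero entries, so it must
# have a negative eigenvalue.
notk = lambda x, xp: -np.sum((x - xp) ** 2)
```

On this sample, `is_psd_on_sample(rbf, X)` returns True and `is_psd_on_sample(notk, X)` returns False. Passing the check on one sample is of course necessary, not sufficient, for being a kernel.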
The feature-space view of kernel methods is concrete but heavy — working with a possibly-infinite-dimensional feature vector is clumsy. The function-space view is cleaner: think of the kernel not as an inner product between feature vectors but as an inner product between functions in a particular space of functions — the reproducing kernel Hilbert space. This view makes regularisation theory, generalisation bounds, and the connection to Gaussian processes transparent, and the representer theorem falls out as a direct consequence.
Given a PSD kernel k on a space X, the RKHS H_k is a Hilbert space of functions f: X → R with the following property: for every x ∈ X, the function k(·, x) is in H_k, and for every f in H_k, the reproducing property
f(x) = ⟨f, k(·, x)⟩_{H_k}
holds. Pointwise evaluation of f at x is computed by taking the inner product of f with the kernel "bump" at x. Every RKHS has a unique kernel, and every PSD kernel has a unique RKHS (the Moore–Aronszajn theorem): the space and the kernel determine each other, two descriptions of the same mathematical structure.
The theorem that makes kernel methods computable. Given a set of training inputs x₁, …, xₙ, a strictly increasing regulariser Ω: [0, ∞) → R, and a loss L that depends only on function values at the training points, the minimiser of
min_{f ∈ H_k} L(f(x₁), …, f(xₙ), y) + Ω(‖f‖_{H_k})
lies in the finite-dimensional subspace spanned by k(·, x₁), …, k(·, xₙ). That is, f*(x) = Σᵢ αᵢ k(xᵢ, x) for some coefficients α. The proof is three lines: by the reproducing property, any component of f orthogonal to that span leaves every training value f(xᵢ) unchanged while increasing the regulariser.
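Kernel ridge regression is the cleanest instance of the representer form: squared loss plus the RKHS norm gives the closed-form coefficients α = (K + λI)⁻¹y, and predictions are f(x) = Σᵢ αᵢ k(xᵢ, x). A minimal sketch (numpy; the toy sine dataset and hyperparameter values are illustrative):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, z) = exp(-gamma * ||x - z||^2).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

lam = 1e-2
K = rbf(X, X)
# Solve (K + lam * I) alpha = y: the kernel ridge coefficients.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Representer form of the predictor: f(x) = sum_i alpha_i k(x_i, x).
X_test = np.array([[0.5]])
f = rbf(X_test, X) @ alpha
```

By construction α satisfies Kα + λα = y exactly, and the predictor is a weighted sum of kernel evaluations against the training points, exactly as the theorem promises.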
The RKHS norm ‖f‖_{H_k} is the canonical regulariser for kernel methods: it penalises functions that vary rapidly in the metric the kernel implicitly defines. A smooth kernel (like the RBF) makes the RKHS consist of smooth functions; a non-smooth kernel (like the Laplacian) makes it consist of rougher functions. Choosing the kernel is choosing the inductive bias — the hypothesis space and its implicit complexity ordering. Kernel ridge regression, kernel logistic regression, and the soft-margin SVM are all particular choices of loss paired with the RKHS norm as the regulariser.
A Gaussian process with covariance function k is intimately tied to the RKHS of k — though, strictly speaking, its sample paths almost surely fall just outside H_k, lying in a slightly larger function space (Driscoll's theorem). The kernel ridge-regression predictor equals the posterior mean of the GP with the same kernel. The GP is the Bayesian twin of kernel ridge: same mean predictor, plus a calibrated posterior variance. This connection motivates section 15 in full detail.
In principle there are uncountably many PSD kernels. In practice, five or six dominate — the linear kernel for baseline sanity checks, the RBF kernel as a nearly universal default, the polynomial kernel for interactions of known degree, the Laplacian for heavy-tailed behaviour, and a handful of specialised kernels for non-vector inputs (strings, graphs, sets, probability distributions). Knowing the standard kernels and when each is appropriate is most of what practical kernel design is about.
k(x, x') = ⟨x, x'⟩. The feature map is the identity. A kernel SVM with a linear kernel is just a linear SVM. The linear kernel is useful when the data is high-dimensional and already near-linearly separable — common in sparse, high-dimensional problems like text (bag-of-words vectors) or genomics. Linear kernels admit O(n) or O(nd) algorithms (LibLinear, SVMlin) that dominate kernel-SVM implementations when the kernel is the identity.
k(x, x') = (⟨x, x'⟩ + c)^d, with c ≥ 0 a constant and d a positive integer. The feature space is the space of all monomials of degree up to d (exactly d when c = 0). A polynomial kernel with d=2 encodes pairwise interactions; d=3 encodes three-way interactions; and so on. The number of features in the feature space grows as O(p^d), so for even modest d and p the kernel computes inner products in a trillion-dimensional space in a single line of arithmetic.
k(x, x') = exp(−γ · ‖x − x'‖²), sometimes written with γ = 1/(2σ²). The most widely used kernel in practice, for good reason. Its feature space is infinite-dimensional (countable orthonormal basis of Hermite polynomials). It is universal: an RBF-kernel SVM can approximate any continuous function on a compact domain to arbitrary accuracy, and Gaussian processes with the RBF kernel produce posteriors over arbitrary smooth functions. The bandwidth γ is the key hyperparameter: small γ makes the kernel smooth and long-range, large γ makes it peaky and short-range. Cross-validate or use the median heuristic (γ ≈ 1 / median(‖xᵢ − xⱼ‖²)) as a starting point.
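The median heuristic mentioned above is a few lines of numpy. A sketch (the sample data is illustrative; only the off-diagonal pairwise distances enter the median):

```python
import numpy as np

def median_heuristic_gamma(X):
    """gamma ~ 1 / median of pairwise squared distances, as a starting point."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Take the median over distinct pairs only (exclude the zero diagonal).
    med = np.median(d2[np.triu_indices(len(X), k=1)])
    return 1.0 / med

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
gamma = median_heuristic_gamma(X)
```

For standard-normal data in 5 dimensions the typical squared distance is around 2·5 = 10, so the heuristic returns γ of roughly 0.1; it is a starting point for cross-validation, not a final answer.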
k(x, x') = exp(−γ · ‖x − x'‖₁), with the L¹ norm in the exponent. The sample paths are less smooth than RBF's, which can be a feature when the target function has non-smooth behaviour (edges, step changes). The Laplacian kernel is also the covariance of the Ornstein–Uhlenbeck process (in one dimension).
The Matérn kernel with smoothness parameter ν is a parametric family that interpolates between the Laplacian (ν=1/2) and the RBF (ν → ∞). The common values ν=3/2 and ν=5/2 are the standards in spatial statistics and Bayesian optimisation — they give smoother sample paths than Laplacian but less restrictive behaviour than RBF. When you care about the smoothness of the learned function, the Matérn kernel is the principled choice.
k(x, x') = tanh(α · ⟨x, x'⟩ + c). Famously not always PSD — it is PSD only for certain parameter ranges — but historically important because it gave kernel SVMs a formal resemblance to two-layer perceptrons. In practice the RBF kernel is a better universal non-linearity and the sigmoid kernel is rarely used today.
The real flexibility of kernel methods is in the specialised kernels for non-vector inputs. The string kernel measures the similarity of two strings by counting shared subsequences; it is the backbone of protein-sequence classification and text classification. The graph kernel family — Weisfeiler–Lehman, random-walk, graphlet — measures similarity between graphs and was state of the art for molecular property prediction before graph neural networks. The set kernel measures the similarity of two sets by aggregating pairwise element kernels. The Fisher kernel measures similarity in terms of the score gradients of a probabilistic model. For any kind of structured input, there is usually a published kernel that captures the natural notion of similarity.
When starting a new problem: try the linear kernel first as a baseline. If the problem is non-linear, try the RBF kernel with the median-distance heuristic for γ. If the RBF kernel works, tune γ and C via cross-validation. Use polynomial kernels only when you have a reason to think the interaction order matters. Use specialised kernels when the input is not a vector.
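That workflow can be sketched with scikit-learn (assumed available; the synthetic dataset and the grid values are illustrative, with the C grid taken from the guidance later in the chapter):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: linear kernel as a baseline.
linear = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X_tr, y_tr)

# Step 2: RBF kernel, tuning C and gamma on a logarithmic grid.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    {"svc__C": [0.01, 0.1, 1, 10, 100],
     "svc__gamma": ["scale", 0.01, 0.1, 1.0]},
    cv=5,
)
grid.fit(X_tr, y_tr)
```

Comparing `linear.score(X_te, y_te)` against `grid.score(X_te, y_te)` tells you whether the problem rewards a non-linear boundary at all; if the linear baseline already wins, stop there.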
The set of PSD kernels is closed under a surprisingly rich collection of operations. This closure is what lets you compose simple kernels — each capturing one aspect of similarity — into a custom kernel tailored to the problem, without having to re-derive positive-definiteness from first principles. Mastering these operations is the practical craft of kernel design.
Let k₁ and k₂ be PSD kernels on X. Then:
(i) k(x, x') = k₁(x, x') + k₂(x, x') is PSD — sums of kernels are kernels.
(ii) k(x, x') = α · k₁(x, x') for α ≥ 0 is PSD — non-negative scalar multiples are kernels.
(iii) k(x, x') = k₁(x, x') · k₂(x, x') is PSD — products of kernels are kernels. This generalises: any polynomial with non-negative coefficients applied entrywise to a kernel is a kernel, and by Taylor expansion so is the exponential, so exp(k₁(x, x')) is a kernel.
(iv) k(x, x') = f(x) · k₁(x, x') · f(x') for any function f: X → R is PSD — this is "kernel conjugation" and is how one normalises a kernel to have unit diagonal.
(v) If g: X → X' is any function and k' is PSD on X', then k(x, x') = k'(g(x), g(x')) is PSD — composition with an arbitrary function preserves positive-definiteness. This is how the feature-engineering mindset fits into kernels.
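Closure properties (i)–(iii) can be spot-checked numerically on a sample: the minimum eigenvalue of each combined Gram matrix should be non-negative up to floating-point error. A sketch (numpy; the sample and the two base kernels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 4))

def gram(k):
    return np.array([[k(x, xp) for xp in X] for x in X])

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

k1 = lambda x, xp: x @ xp                               # linear kernel
k2 = lambda x, xp: np.exp(-0.5 * np.sum((x - xp)**2))   # RBF kernel

K1, K2 = gram(k1), gram(k2)
tol = -1e-8
assert min_eig(K1 + K2) >= tol     # (i)   sums of kernels are kernels
assert min_eig(3.0 * K1) >= tol    # (ii)  non-negative scalar multiples
assert min_eig(K1 * K2) >= tol     # (iii) entrywise (Schur) products
```

The product case is the least obvious of the three: it is the Schur product theorem, which guarantees that the entrywise product of two PSD matrices is PSD.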
A satisfying little derivation. Start with the linear kernel k₀(x, x') = ⟨x, x'⟩. Raise to any power: k₀^n is a kernel (closure under products). Exponentiate: exp(k₀) is a kernel (closure under polynomials with non-negative coefficients, and the Taylor series of exp has all non-negative coefficients). Now conjugate by f(x) = exp(−γ · ‖x‖² / 2): the result is
exp(−γ ‖x‖² / 2) · exp(γ ⟨x, x'⟩) · exp(−γ ‖x'‖² / 2) = exp(−γ · ‖x − x'‖² / 2).
The RBF kernel emerges as a composition of linear-kernel operations with closure rules. The same kind of derivation shows that any "product kernel" — a kernel on X × Y defined by k((x, y), (x', y')) = k_X(x, x') · k_Y(y, y') — is PSD, which is the main tool for building kernels on structured inputs.
When several candidate kernels are available, multiple kernel learning (Lanckriet et al. 2004, Bach 2008) learns a non-negative weighted combination k(x, x') = Σ_m μ_m · k_m(x, x') directly from the data, choosing the weights to minimise a loss with a sparsity penalty. MKL was the reigning technique for automatic kernel design from about 2005 to 2015 and is still the right approach when the problem naturally admits several candidate similarity measures (multi-modal data, heterogeneous features).
The modern extension: let the feature extractor inside the kernel be a neural network. Deep kernels (Wilson et al. 2016) parametrise the kernel as k(x, x') = k_base(φ_θ(x), φ_θ(x')) where φ_θ is a neural network and k_base is a standard kernel (typically RBF). The neural network learns the representation; the kernel reasons in it. This is the natural bridge between kernel methods and deep learning and underpins deep kernel learning in modern Gaussian-process libraries.
Support vector machines start from a simple question: given linearly separable data, which of the infinitely many separating hyperplanes should we pick? The maximum-margin answer — the hyperplane that leaves the widest possible gap on either side — turns out to have strong generalisation properties and beautiful mathematical structure. Vapnik's insight was that the margin is not just an aesthetic choice but a complexity measure, and a widely separated hyperplane generalises better than a narrowly separated one almost independently of the dimensionality of the input space.
Given training data {(xᵢ, yᵢ)} with yᵢ ∈ {−1, +1}, a linear classifier is f(x) = wᵀx + b, predicting the sign of f. The data is linearly separable if some w and b achieve yᵢ · (wᵀxᵢ + b) > 0 for all i; rescaling w and b then lets us demand yᵢ · (wᵀxᵢ + b) ≥ 1. The margin is the distance from the hyperplane to the nearest training point; for this normalisation it equals 1 / ‖w‖. Maximising the margin is minimising ‖w‖, leading to the quadratic program
min_{w, b} (1/2) ‖w‖² subject to yᵢ · (wᵀxᵢ + b) ≥ 1 for all i.
This is the hard-margin SVM. It has a unique solution when the data is separable, and no solution when it is not — a limitation that section 8 addresses.
A hyperplane with a wide margin is robust: small perturbations of the training data do not flip its predictions near the boundary. It also generalises better, in a precise VC-theoretic sense: the VC-dimension of maximum-margin hyperplanes depends on the margin, not on the ambient dimension. This is the foundational result that justifies why SVMs can work in very high-dimensional feature spaces (or infinite-dimensional ones, via kernels) without overfitting — the maximum-margin constraint itself provides capacity control.
The optimal hyperplane is the perpendicular bisector of the line segment between the nearest positive and negative training points, adjusted for multiple nearest points. The points that lie exactly on the margin — for which yᵢ · (wᵀxᵢ + b) = 1 — are called support vectors. Moving any non-support vector has no effect on the optimal hyperplane; moving a support vector does. The solution depends only on a small subset of the training data — a sparsity property that becomes algorithmic leverage once we move to the dual formulation in section 9.
The maximum-margin hyperplane was described by Vapnik and Chervonenkis in a 1964 paper; the soft-margin extension was published in 1995 by Cortes and Vapnik; the kernel version by Boser, Guyon, and Vapnik in 1992. The SVM dominated academic benchmarks throughout the late 1990s and early 2000s, helped establish statistical learning theory as a field, and produced the conceptual machinery (margin, capacity, structural risk minimisation, sparse dual representation) that still underpins how we think about generalisation.
Real data is almost never linearly separable: there is always noise, mis-labelled points, or genuine overlap between classes. The hard-margin SVM has no solution on such data. The soft-margin SVM, introduced by Cortes and Vapnik in 1995, fixes this by allowing training points to violate the margin at a penalty, trading off margin width against training error. The strength of the trade-off is controlled by a single hyperparameter C, which is almost the only knob the user has to turn.
Introduce non-negative slack variables ξᵢ ≥ 0 measuring by how much each training point is allowed to violate the margin. The optimisation becomes
min_{w, b, ξ} (1/2) ‖w‖² + C · Σᵢ ξᵢ subject to yᵢ · (wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0.
A point with ξᵢ = 0 is correctly classified outside the margin; 0 < ξᵢ ≤ 1 means it is inside the margin but still correctly classified; ξᵢ > 1 means it is misclassified. The penalty C · Σ ξᵢ is added to the objective. Large C makes violations expensive — the method tries to classify every point correctly, yielding a narrow margin that risks overfitting. Small C makes violations cheap — the method tolerates many violations, yielding a wide margin that may underfit.
The soft-margin SVM has a clean loss-minimisation reformulation. Define the hinge loss
L(yᵢ, f(xᵢ)) = max(0, 1 − yᵢ · f(xᵢ)).
Then the SVM minimises (1/n) · Σᵢ L(yᵢ, f(xᵢ)) + λ · ‖w‖² with λ = 1/(nC). The SVM is hinge-loss + L2 regularisation, structurally identical in form to logistic regression (log-loss + L2) and ridge regression (squared-loss + L2). The only thing that makes SVMs distinctive is the hinge loss — which produces sparsity in the dual and the clean "support vectors" story.
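The hinge loss is one line of code, and its value at a training point is exactly the optimal slack ξᵢ from the constrained formulation. A sketch (numpy; the decision values are illustrative):

```python
import numpy as np

def hinge(y, f):
    # Hinge loss max(0, 1 - y*f); at the optimum this equals the slack xi_i.
    return np.maximum(0.0, 1.0 - y * f)

y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.5, 0.5, -3.0])   # decision values w^T x + b

losses = hinge(y, f)   # 0 (outside margin), 0.5 and 1.5 (violations), 0
```

The four values trace the three regimes from the slack discussion: zero loss outside the margin, loss in (0, 1] inside the margin but correctly classified, and loss above 1 for a misclassified point.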
C is the single most important hyperparameter for a soft-margin SVM. Cross-validate over a logarithmic grid: C ∈ {0.01, 0.1, 1, 10, 100}. For most problems, the optimal C will be one of these; fine-tune around the best if it matters. The intuition: C is the inverse of the regularisation strength. Large C = weak regularisation = tight fit. Small C = strong regularisation = smooth boundary. If the data is noisy, use small C; if it is clean, a larger one.
Structurally, SVMs and logistic regression are twins — the same regulariser, different losses. Hinge loss produces sparse dual solutions (support vectors); log loss produces probabilistic predictions but no sparsity. If you want calibrated probabilities, use logistic regression (or Platt-scale the SVM outputs post-hoc). If you want a sparse representation and strong margin-based generalisation guarantees, use SVM. Empirically the two tend to perform similarly on most classification tasks; the choice is often driven by downstream needs rather than accuracy.
The soft-margin SVM is a convex quadratic program in the primal — solvable directly by interior-point methods or specialised solvers. But its dual formulation is where all the interesting mathematics happens: the dual variables αᵢ weight the training points, the solution is sparse (most αᵢ are zero), the data appears only through inner products, and the kernel trick slots in effortlessly. Every kernel-SVM implementation solves the dual, not the primal.
Apply Lagrangian duality to the soft-margin primal. Introduce multipliers αᵢ ≥ 0 for the margin constraints and μᵢ ≥ 0 for the slack-non-negativity constraints. Stationarity over w gives w = Σᵢ αᵢ yᵢ xᵢ. Stationarity over b gives Σᵢ αᵢ yᵢ = 0. Eliminating w, b, and the slacks yields the dual
max_α Σᵢ αᵢ − (1/2) Σᵢⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩ subject to 0 ≤ αᵢ ≤ C, Σᵢ αᵢ yᵢ = 0.
This is a quadratic program in n variables. The data enters only through the Gram matrix G with Gᵢⱼ = yᵢ yⱼ ⟨xᵢ, xⱼ⟩, so the dual depends on the training inputs only through inner products. Replace ⟨xᵢ, xⱼ⟩ with k(xᵢ, xⱼ) and you have the kernel SVM of section 10.
The Karush–Kuhn–Tucker conditions characterise the optimum. Each training point falls into one of three categories:
αᵢ = 0 — the point is correctly classified outside the margin and does not participate in the solution.
0 < αᵢ < C — the point lies exactly on the margin (yᵢ·(wᵀxᵢ+b) = 1) and is an unbounded support vector. These points determine the margin boundary.
αᵢ = C — the point has positive slack: it is inside the margin or misclassified. These are the bounded support vectors.
The final weight vector is w = Σ_{i ∈ SV} αᵢ yᵢ xᵢ with the sum over support vectors only. Typically a small fraction of training points are support vectors — SVMs are sparse in the dual. This is what makes them memory-efficient at prediction time: to predict on a new point x, compute f(x) = Σ_{i ∈ SV} αᵢ yᵢ k(xᵢ, x) + b, a sum over support vectors only.
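The prediction rule is a handful of lines. A minimal numpy sketch: the kernel, support vectors, dual coefficients, and bias below are hand-set for illustration, not the output of a real solver.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian/RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, sv_x, sv_y, sv_alpha, b, kernel=rbf_kernel):
    """Decision value f(x) = sum_i alpha_i y_i k(x_i, x) + b over support vectors only."""
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(sv_alpha, sv_y, sv_x)) + b

# Toy illustration: two hand-set support vectors (values are made up).
sv_x = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sv_y = [+1, -1]
sv_alpha = [0.5, 0.5]
b = 0.0

f_near_pos = svm_predict(np.array([0.1, 0.0]), sv_x, sv_y, sv_alpha, b)  # near the +1 SV
f_near_neg = svm_predict(np.array([2.1, 2.0]), sv_x, sv_y, sv_alpha, b)  # near the -1 SV
```

The cost of each prediction is one kernel evaluation per support vector, which is why dual sparsity translates directly into prediction speed.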
Training an SVM is a quadratic program in n variables with a dense kernel matrix. The classical algorithm is Sequential Minimal Optimization (Platt 1998): iteratively pick a pair of dual variables, solve the two-variable QP analytically, and update. SMO has no outer QP solver and is straightforward to implement; libsvm and scikit-learn's SVC both use SMO variants. For very large n, LaSVM (online training), bundle methods, and stochastic-subgradient approaches (Pegasos) are faster.
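To make the stochastic-subgradient route concrete, here is a minimal Pegasos-style sketch in numpy: no bias term, no optional projection step, toy data, and all constants chosen for illustration rather than taken from the original paper's experiments.

```python
import numpy as np

def pegasos_train(X, y, lam=0.01, n_iters=2000, seed=0):
    """Stochastic subgradient descent on the primal SVM objective
    (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i w.x_i).  Sketch without bias term."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)             # decreasing step size 1/(lam * t)
        if y[i] * (X[i] @ w) < 1:         # point violates the margin: hinge is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                             # hinge inactive: only the L2 term contributes
            w = (1 - eta * lam) * w
    return w

# Linearly separable toy data: label is the sign of the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X[:, 0] += np.where(rng.random(200) < 0.5, 3.0, -3.0)
y = np.sign(X[:, 0])

w = pegasos_train(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

Each step touches a single training point, which is what makes the approach attractive when n is large and the kernel matrix cannot be formed.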
Training an SVM is empirically O(n²) to O(n³) depending on the kernel and the solver. Memory is O(n²) for the kernel matrix, which for n = 10⁵ is already 80 GB at double precision. Prediction is O(|SV|) — bounded support vectors are proportional to training error, unbounded ones grow slowly with n. These costs are what make scaling kernel methods to modern dataset sizes genuinely hard (section 16).
Because the SVM dual depends on the training inputs only through inner products, the kernel trick applies immediately: replace ⟨xᵢ, xⱼ⟩ with k(xᵢ, xⱼ), and the entire method now runs in the feature space implicitly defined by the kernel. The linear maximum-margin classifier becomes a non-linear maximum-margin classifier in the feature space, which traces out a non-linear boundary in the input space. This combination of support vectors, maximum margin, and kernel is what makes the SVM what it is, and it all falls out of a trivial substitution.
The kernel-SVM dual is identical to the linear dual with k(xᵢ, xⱼ) in place of the inner product:
max_α Σᵢ αᵢ − (1/2) Σᵢⱼ αᵢ αⱼ yᵢ yⱼ k(xᵢ, xⱼ) subject to 0 ≤ αᵢ ≤ C, Σᵢ αᵢ yᵢ = 0.
The prediction is f(x) = Σᵢ αᵢ yᵢ k(xᵢ, x) + b. The weight vector w is never explicitly computed — for an infinite-dimensional kernel like the RBF, it cannot be. The entire algorithm runs on the n × n kernel matrix K, and prediction uses only the kernel evaluated between the new point and each support vector.
In the RBF kernel case, the prediction function is a weighted sum of Gaussians centred at the support vectors. The decision boundary f(x) = 0 is a level set of this sum — a curved, locally adaptive surface in the input space. The curvature is controlled by γ (bandwidth); the complexity is controlled by C (misclassification penalty). Together these two hyperparameters span the whole bias–variance spectrum for kernel SVMs.
Kernel SVMs dominate small-to-medium classification problems when carefully tuned. They produce sparse, interpretable decision functions (a sum over support vectors); they generalise well thanks to the maximum-margin principle; they work on any input domain for which a kernel can be defined (images, text, graphs, sequences). They were the state-of-the-art vision and NLP classifier through the late 2000s — until deep learning scaled past them in vision (AlexNet 2012) and then everywhere else.
Kernel SVMs do not scale. The kernel matrix is 8n² bytes at double precision (so 80 GB for n = 10⁵), and training is O(n²) to O(n³). On datasets of millions of training points, full kernel SVMs are infeasible, which is why modern practice uses either (i) a linear SVM (on high-dimensional sparse features like text), (ii) a kernel approximation (Nyström or random Fourier features, section 16), or (iii) a different model entirely (ensemble trees, deep networks).
SVMs produce distances to the hyperplane, not probabilities. If you need calibrated probabilities, Platt scaling (Platt 1999) fits a one-dimensional logistic regression to the SVM outputs on a held-out set to produce P(y=1 | x) ≈ σ(A · f(x) + B). Isotonic regression is a non-parametric alternative. Both are post-hoc calibrations that do not change the SVM itself. See Chapter 09 (Model Evaluation & Selection) for the broader calibration picture.
The same max-margin + kernel trick machinery extends to regression. The key modification is the ε-insensitive loss: instead of penalising all prediction errors, penalise only those larger than ε. The result is a regression method whose solution is sparse (a subset of training points are "support vectors"), whose flexibility comes from the kernel, and whose regularisation is the same RKHS norm as in classification. Support vector regression is less widely used than kernel ridge regression — the closed form of kernel ridge is appealing — but it has one advantage: genuine sparsity at the solution.
Define the ε-insensitive loss as
L_ε(y, f(x)) = max(0, |y − f(x)| − ε).
Errors smaller than ε incur zero loss; errors larger than ε are penalised linearly. The parameter ε controls the width of the "tube" around the predictor inside which no penalty is paid, and is a regularisation knob complementary to C. The loss is convex but non-differentiable at its corners, giving rise to the sparse dual structure.
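In code the loss is one line; the two example evaluations below use assumed values of ε, the target, and the prediction.

```python
def eps_insensitive_loss(y, f, eps=0.5):
    """L_eps(y, f) = max(0, |y - f| - eps): zero inside the tube, linear outside."""
    return max(0.0, abs(y - f) - eps)

inside = eps_insensitive_loss(1.0, 1.2)   # |error| = 0.2 < eps, so no penalty
outside = eps_insensitive_loss(1.0, 2.0)  # |error| = 1.0, penalised by 1.0 - 0.5
```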
Primal:
min_{w, b, ξ, ξ*} (1/2) ‖w‖² + C · Σᵢ (ξᵢ + ξᵢ*) s.t. yᵢ − wᵀxᵢ − b ≤ ε + ξᵢ, wᵀxᵢ + b − yᵢ ≤ ε + ξᵢ*, ξᵢ, ξᵢ* ≥ 0.
Two slacks (upper and lower) per point reflect the symmetry of the loss. The dual is a QP in 2n variables, with prediction f(x) = Σᵢ (αᵢ − αᵢ*) · k(xᵢ, x) + b. Training points with αᵢ − αᵢ* = 0 are inside the ε-tube and do not participate; those with non-zero coefficients are the support vectors, typically a small fraction of the training data.
Three knobs: the kernel bandwidth (e.g. γ for RBF), the regularisation C, and the tube width ε. Common heuristics: set ε proportional to the expected noise standard deviation; set C to at least 10× the data range; tune γ by the median-distance heuristic. All three interact, and a three-way grid search is standard for serious SVR work.
Kernel ridge regression (next section) is convex, has a closed-form solution, uses all training points as non-zero coefficients (no sparsity), and has squared-loss semantics. SVR is also convex, uses the ε-insensitive loss, has a sparse dual solution, and requires iterative optimisation. On most problems the predictive accuracy is similar; choose kernel ridge if you want simplicity and closed forms, choose SVR if you need the sparse prediction structure (e.g., deploying a model where prediction time scales with the number of support vectors). Gaussian-process regression (section 15) subsumes both when you can afford the extra computation.
SVMs are natively binary: the max-margin formulation is defined between two classes, and extending it to more than two classes requires either reducing the multiclass problem to many binary ones or extending the formulation directly. Both routes are well understood, and in practice the reduction-based methods dominate because they work with any binary-SVM library. The direct formulation (Crammer–Singer) is conceptually cleaner but less widely supported.
Train K binary SVMs, each separating one class from all others. At prediction time, pass a new point through all K classifiers and predict the class whose classifier returns the highest score. Simple, fast, and used in most production systems. The trained classifiers are independent — they can be trained in parallel — which is part of why the approach scales. The weakness: the binary sub-problems are imbalanced (one class vs K−1), which can hurt calibration.
Train K·(K−1)/2 binary classifiers, one per pair of classes. At prediction time, apply all classifiers and vote. For K = 10 classes this is 45 classifiers; for K = 1000 it is roughly half a million — so one-vs-one scales poorly in the number of classes, though each sub-problem uses only the data of two classes and is therefore fast. The scikit-learn SVC default is one-vs-one; LibSVM uses it too. For small-to-medium K, one-vs-one is often slightly more accurate than one-vs-all.
The unified multiclass formulation (Crammer & Singer 2001) generalises the margin concept to multiple classes in one shot: learn K weight vectors simultaneously with the constraint that the correct-class score exceeds each incorrect-class score by a margin. Elegant and theoretically principled, but the QP is harder to solve than a batch of one-vs-all sub-problems and implementations are rare. Most practitioners never need to know about it.
A different reduction approach: encode each class as a binary string of length L, train L binary classifiers (one per bit), and at prediction time decode to the nearest class codeword. If the codewords are chosen to have large Hamming distance, the scheme has an error-correcting property: it can withstand misclassifications in individual bits. Dietterich & Bakiri's 1995 paper introduced the idea. ECOC is used occasionally for very-many-class problems but is mostly a historical curiosity.
Use whatever multiclass strategy your SVM library defaults to — one-vs-one for scikit-learn, one-vs-all for some others. For small K (up to 10–20) both are fine; for larger K prefer one-vs-all for training speed, at the mild cost of calibration quality. If probabilistic outputs matter, pair one-vs-all with Platt scaling or use a softmax layer on the raw SVM scores. If accuracy matters most and compute allows, run both strategies and validate.
Ridge regression (Chapter 01) minimises squared loss plus an L² penalty on the weight vector. Kernelising it is the cleanest possible example of the kernel trick: derive the solution in closed form, recognise that the data appears only through inner products, and substitute a kernel. The result — kernel ridge regression — is a predictor of the form f(x) = Σᵢ αᵢ k(xᵢ, x) with coefficients given by a single matrix inversion. There is no iterative optimisation, no quadratic program, and the solution is equivalent to Gaussian-process regression's posterior mean.
The ridge problem is min_w (1/2) ‖y − Xw‖² + (λ/2) ‖w‖², with closed form w = (XᵀX + λI)⁻¹ Xᵀy. Prediction: f(x) = xᵀw. Apply the matrix identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹ and use XXᵀ = K (the Gram matrix with ordinary inner products). We get
f(x) = k(x)ᵀ (K + λI)⁻¹ y,
where k(x) is the vector (k(x₁, x), …, k(xₙ, x)). Replace the ordinary inner products with any PSD kernel and the formula still holds — kernel ridge regression is a one-line closed form.
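The one-line closed form translates directly into numpy. A sketch, assuming an RBF kernel and illustrative values of λ and γ:

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=1e-3, gamma=1.0):
    """Coefficients alpha = (K + lam I)^{-1} y, so that f(x) = k(x)^T alpha."""
    K = rbf_gram(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(Xnew, X, alpha, gamma=1.0):
    """f(x) = k(x)^T alpha, evaluated for each row of Xnew."""
    return rbf_gram(Xnew, X, gamma) @ alpha

# Fit a noiseless sine; with small lam the predictor nearly interpolates.
X = np.linspace(0, 2 * np.pi, 30)[:, None]
y = np.sin(X).ravel()
alpha = krr_fit(X, y)
train_err = np.max(np.abs(krr_predict(X, X, alpha) - y))
```

Note that np.linalg.solve is used rather than an explicit inverse: the matrix inversion in the formula is a mathematical statement, not an implementation instruction.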
Kernel ridge is the pedagogically cleanest kernel method: no dual, no QP, just a matrix inverse. It is the mean predictor of a Gaussian process with the same kernel and noise λ — Bayesian and frequentist views give identical point predictions, differing only in the posterior-variance story. It generalises well for the same reason maximum-margin SVMs do: the RKHS-norm regulariser controls capacity. For small-to-medium datasets with clean kernel structure, kernel ridge is as strong as anything else in this chapter and simpler than everything except the linear baseline.
Training: compute the n×n kernel matrix (O(n²) memory), Cholesky-factorise K + λI (O(n³) time). Prediction: compute the n-vector k(x) and take a dot product (O(n) time per new point). For n up to about 10⁴ this runs on a laptop; past that point the kernel approximations of section 16 become necessary. When you have 10² or 10³ points and a good kernel, kernel ridge is often the first thing to try.
Kernel ridge is linear ridge regression in the feature space defined by the kernel. It is the dual of linear ridge regression, expressed in n variables instead of d — so whenever n < d (the over-parameterised regime), the dual formulation is cheaper. Its posterior-mean relationship with Gaussian processes is deep: kernel ridge is what you get when you only want the mean, Gaussian processes are what you get when you also want the variance (next two sections).
Once the pattern is clear — take a linear method, rewrite in terms of inner products, substitute the kernel — the list of kernelised classics is long. Several of them are already familiar: kernel PCA appeared in Chapter 05, spectral clustering in Chapter 04. This section is a quick tour of the rest, so that you recognise them and know when each is appropriate.
PCA on the feature-space vectors φ(xᵢ). The covariance operator in the feature space cannot be formed, but its eigenvectors can be expanded in the feature vectors, and the expansion coefficients are the eigenvectors of the centred kernel matrix. Kernel PCA produces non-linear dimensionality reductions; it was one of the first methods that could handle non-linear manifolds and is still a useful tool when the manifold structure is rich but smooth. Covered in detail in Chapter 05.
K-means with Euclidean distance in the feature space — equivalently, with distances derived from a kernel. It is the kernel counterpart of the algorithm in Chapter 04 and produces clusters that are non-linearly separable in the input space. Kernel k-means is closely related to spectral clustering: the relaxation of the kernel-k-means objective gives the spectral-clustering eigenvalue problem, which is another way to see why spectral clustering "sees" non-linear structure.
Fisher linear discriminant (Chapter 02) projects data onto the direction that maximises the ratio of between-class to within-class variance. Kernelising it gives the kernel Fisher discriminant, a non-linear binary classifier that performs similarly to a kernel SVM but with simpler machinery (a generalised eigenvalue problem instead of a QP). Historically significant, less widely used today.
A one-class SVM (Schölkopf et al. 2001) learns a boundary that encloses a fraction ν of the training data — a density-free anomaly detector. The problem is solved as a kernel SVM with a single "class" and a modified primal. Anomalies (points near the boundary or outside it) are those with small decision-function value. One-class SVMs are a classical tool in novelty detection, fraud detection, and unsupervised outlier-finding.
Spectral clustering, Laplacian eigenmaps (Chapter 05), diffusion maps — all can be understood as kernel methods where the kernel is the normalised Laplacian of a similarity graph. This unifying view relates the manifold-learning literature to the kernel-methods literature and makes it easy to see when one set of techniques can be transposed to the other.
Kernel ICA (Bach & Jordan 2002) replaces the non-Gaussianity measure in ICA with a kernel-based statistical independence measure. It produces non-linear independent components and is one of the roots of modern kernel-based tests of independence (HSIC, the Hilbert–Schmidt Independence Criterion) that are still used in causal-discovery and feature-selection work.
Treat a probability distribution as a single point in an RKHS by embedding it as the mean feature map. Two distributions can then be compared by the distance between their mean embeddings — the Maximum Mean Discrepancy (MMD). MMD is the kernel version of a two-sample test, the basis of kernel methods for generative model evaluation, and the building block of several modern methods for distribution comparison (Gretton et al. 2012). The machinery is simple, general, and the bridge between kernel methods and probability.
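The unbiased MMD² estimator (Gretton et al. 2012) is a few lines of numpy. A sketch with an RBF kernel; the bandwidth here is fixed arbitrarily, where practice would use the median-distance heuristic.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of MMD^2 between samples X and Y under an RBF kernel."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    # Unbiased estimator: drop the diagonal of the within-sample Gram matrices.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
# Same distribution: MMD^2 should be near zero.  Shifted distribution: clearly positive.
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(2.0, 1.0, size=(500, 2)))
```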
A Gaussian process is a distribution over functions: a prior that says "I think the function I'm trying to learn is smooth, with correlation structure given by the kernel k." Conditioning this prior on observed data produces a posterior — also a Gaussian process — whose mean is the kernel-ridge-regression predictor and whose variance is a calibrated uncertainty estimate. GPs are kernel methods with uncertainty quantification for free, and they remain the default choice in any regime where "how sure is the model?" matters as much as "what does it predict?".
A stochastic process f is a Gaussian process with mean function m and covariance function (kernel) k — written f ~ GP(m, k) — if for every finite set of inputs x₁, …, xₙ the vector (f(x₁), …, f(xₙ)) has a multivariate Gaussian distribution with mean (m(x₁), …, m(xₙ)) and covariance matrix K. A GP is to a Gaussian random variable what a function is to a number: a generalisation from points to whole function values. The mean function is often set to zero for modelling convenience; the covariance function is the kernel and determines the properties of the sample paths.
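The finite-dimensional definition makes it easy to draw sample paths from the prior: evaluate K on a grid of inputs, factorise, and multiply by standard normals. A sketch assuming a unit-lengthscale RBF kernel, with a small jitter added for numerical stability:

```python
import numpy as np

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-6, seed=0):
    """Draw sample paths f ~ GP(0, k) at the inputs x via a Cholesky factor of K."""
    K = kernel(x[:, None], x[None, :])
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))   # jitter keeps K numerically PD
    rng = np.random.default_rng(seed)
    return L @ rng.normal(size=(len(x), n_samples))       # each column is one sample path

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)            # unit-lengthscale RBF kernel
x = np.linspace(0, 5, 50)
paths = sample_gp_prior(x, rbf)                           # (50, 3): three smooth functions
```

Changing the kernel changes the character of the sample paths, which is the most direct way to see what a covariance function "means".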
Given training data (X, y) and a noise model yᵢ = f(xᵢ) + εᵢ, εᵢ ~ N(0, σ²), the posterior over f at a new point x* is Gaussian with mean
μ(x*) = k(x*)ᵀ (K + σ²I)⁻¹ y
and variance
σ²(x*) = k(x*, x*) − k(x*)ᵀ (K + σ²I)⁻¹ k(x*).
The posterior mean is identical to the kernel-ridge predictor with λ = σ². The variance is the gift: a calibrated measure of how confident the model is at each prediction, larger far from training data (which is what you would want) and smaller where training data is abundant.
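Both posterior formulas are direct numpy. A sketch with an assumed unit-lengthscale RBF kernel and illustrative noise level; the two test inputs are chosen so one sits amid the training data and one far from it.

```python
import numpy as np

def gp_posterior(X, y, Xs, kernel, noise_var=0.01):
    """Posterior mean and variance of a zero-mean GP at test inputs Xs."""
    K = kernel(X[:, None], X[None, :])               # train-train covariances
    Ks = kernel(Xs[:, None], X[None, :])             # test-train cross-covariances k(x*)
    kss = kernel(Xs, Xs)                             # prior variance k(x*, x*) per test point
    M = K + noise_var * np.eye(len(X))
    mean = Ks @ np.linalg.solve(M, y)
    var = kss - np.sum(Ks * np.linalg.solve(M, Ks.T).T, axis=1)
    return mean, var

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)       # unit-lengthscale RBF
X = np.linspace(0, 5, 10)
y = np.sin(X)
Xs = np.array([2.5, 20.0])                           # one point amid the data, one far away
mean, var = gp_posterior(X, y, Xs, rbf)
```

The far-away point recovers the prior variance k(x*, x*), which is exactly the "larger far from training data" behaviour described above.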
GPs have hyperparameters — kernel lengthscales, signal variance, noise variance — that are learned by maximising the log marginal likelihood
log p(y | X, θ) = −(1/2) yᵀ(K_θ + σ²I)⁻¹y − (1/2) log|K_θ + σ²I| − (n/2) log 2π.
The first term is a data-fit term; the second is a complexity penalty (Occam's razor in Bayesian form); the third is a constant. Optimising this is how GPs avoid overfitting despite their apparent flexibility — the complexity term automatically pushes back against over-flexible kernels.
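The three terms can be computed stably through a Cholesky factor, since log|M| = 2 Σᵢ log Lᵢᵢ. A sketch evaluating the expression at a few assumed noise levels (a real implementation would optimise θ with gradients):

```python
import numpy as np

def gp_log_marginal_likelihood(X, y, kernel, noise_var):
    """log p(y|X) = -1/2 y^T M^{-1} y - 1/2 log|M| - n/2 log 2pi, with M = K + sigma^2 I."""
    n = len(X)
    M = kernel(X[:, None], X[None, :]) + noise_var * np.eye(n)
    L = np.linalg.cholesky(M)                        # M = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                      # how well the model explains y
    complexity = -np.sum(np.log(np.diag(L)))         # -1/2 log|M|: Occam penalty
    const = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + const

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
X = np.linspace(0, 5, 20)
y = np.sin(X)
# Evaluate at a few candidate noise variances; hyperparameter learning would
# maximise this quantity over all kernel parameters jointly.
lls = {s2: gp_log_marginal_likelihood(X, y, rbf, s2) for s2 in (1e-4, 1e-2, 1.0)}
```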
Calibrated uncertainty is the superpower. GPs dominate Bayesian optimisation (section 18), active learning, and experimental design — applications where knowing what the model does not know is the whole point. They are the default in spatial statistics (kriging) and the probabilistic-modelling literature of geostatistics and engineering. They work exceptionally well with small datasets and noisy observations, exactly the regimes where deep networks struggle.
GP training is O(n³) and memory is O(n²), the same as any kernel method. Past a few thousand training points, approximations are mandatory. Sparse variational GPs (Titsias 2009) represent the posterior using a small set of inducing points and reduce the cost to O(m²n) where m ≪ n. Modern GPU libraries (GPyTorch, GPflow) make GPs with tens of thousands of points routine; specialised techniques (deep kernel learning, hierarchical GPs) push the frontier further. The scaling story is the same as for SVMs and is covered in full in the next section.
Every full kernel method costs O(n²) in memory and O(n³) in time, which breaks decisively for n past about 10⁵. The fix is to approximate either the kernel matrix or the feature map with something low-rank or random, reducing the cost to linear or near-linear in n. Three approaches dominate: Nyström approximation, random Fourier features, and sparse-inducing-point methods. Each has its niche; all are essential tools in the modern kernel-methods kit.
Choose a subsample of m < n training points and use them to build a low-rank approximation of the full kernel matrix:
K ≈ K_{nm} K_{mm}⁻¹ K_{mn},
where K_{mm} is the kernel matrix on the subsample and K_{nm} is the cross-kernel between full data and subsample. Plugging this into a kernel method reduces cost from O(n²) to O(nm) memory and O(nm²) time. Typical values: m = 1000–10000, which gives excellent accuracy on most problems. The Nyström approximation is the default low-rank approach when the kernel matrix is accessible and subsample selection is easy (uniform random or leverage-score sampling both work).
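The approximation is two lines of numpy once the sub-blocks are formed. A sketch on small 1-D toy data, with a small jitter added to K_{mm} for numerical stability; the data size, m, and kernel bandwidth are all illustrative.

```python
import numpy as np

def rbf_gram(A, B, gamma=0.5):
    """Pairwise RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def nystrom(K_nm, K_mm, jitter=1e-6):
    """Low-rank Nystrom approximation K ~= K_nm K_mm^{-1} K_nm^T."""
    m = len(K_mm)
    return K_nm @ np.linalg.solve(K_mm + jitter * np.eye(m), K_nm.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))                    # small 1-D toy problem
idx = rng.choice(300, size=100, replace=False)   # uniform random subsample, m = 100
K = rbf_gram(X, X)                               # exact kernel matrix (for comparison only)
K_hat = nystrom(rbf_gram(X, X[idx]), rbf_gram(X[idx], X[idx]))
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

In a real large-n setting the exact K is never formed; it appears here only to measure the approximation error.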
Rahimi & Recht's 2007 paper: any translation-invariant kernel k(x − x') is the inverse Fourier transform of a non-negative measure (Bochner's theorem). Sample D random frequencies from that measure, and the Monte Carlo estimate of the integral gives an explicit D-dimensional feature map z(x) such that z(x)ᵀz(x') ≈ k(x, x'). Train a linear model (linear SVM, linear ridge) on the explicit features z(x), and you get a kernel method at linear-time cost. D = 1000–10000 approximates the RBF kernel to high accuracy; the method scales to tens of millions of points.
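For the RBF kernel k(x, x') = exp(−γ‖x − x'‖²), Bochner's theorem gives Gaussian frequencies ω ~ N(0, 2γI). A numpy sketch of the feature map, checked against the exact kernel on toy data (D here is deliberately large so the Monte Carlo error is visibly small):

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Explicit map z with z(x).z(x') ~= exp(-gamma ||x - x'||^2) (Rahimi & Recht 2007)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # Bochner: frequencies ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=D)                  # random phase offsets
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
gamma = 0.5
Z = random_fourier_features(X, D=20000, gamma=gamma)
K_exact = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
max_err = np.max(np.abs(Z @ Z.T - K_exact))
```

Training a linear model on Z is then an ordinary d-by-D problem: the kernel never appears explicitly.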
For Gaussian processes specifically, the inducing-point framework (Quiñonero-Candela & Rasmussen 2005, Titsias 2009) represents the GP posterior using m inducing points — summaries of the training data — reducing the cost to O(nm²). The variational sparse GP (Titsias 2009) makes the inducing points learnable and gives a principled variational lower bound on the marginal likelihood. Stochastic variational GPs (Hensman et al. 2013) extend this with mini-batch training, making GPs feasible on datasets with millions of points.
GPyTorch (Gardner et al. 2018) and GPflow are the modern default GP libraries: GPU-accelerated, integrated with PyTorch/TensorFlow, supporting inducing-point and stochastic variational methods out of the box. KeOps (Charlier et al. 2021) implements GPU kernel matrix–vector products without ever materialising the matrix, making full kernel methods tractable to around 10⁶ points on a single GPU. Falkon (Rudi et al. 2017) is the fastest Nyström-based kernel ridge implementation, scaling to datasets with billions of points.
For GPs with n up to ~10⁴: full GP. For n up to ~10⁶: sparse variational GP (GPyTorch, GPflow). For kernel ridge or SVM with n up to ~10⁷: Nyström (Falkon) or random Fourier features (fastfood, LibLinear on the explicit features). For even larger scales: the kernel method is probably not the right tool — use a neural network or an ensemble.
Kernel methods are the easiest family in this book to use badly. The math is elegant, the libraries are solid, and yet a hastily tuned SVM or GP will usually underperform a moderately tuned boosted-tree baseline, while a well-tuned kernel method will often match or beat it. The practical workflow is about three things: pick the right kernel, tune the two or three hyperparameters carefully, and diagnose failures early.
Start with the linear kernel as a baseline sanity check. If the data is not linearly separable, try the RBF kernel — it is the universal default, works well on most tabular data, and has two tunable hyperparameters (bandwidth γ, regularisation C). If the problem is intrinsically polynomial (e.g., you know the interactions of interest are quadratic), try a polynomial kernel with small degree. Use specialised kernels (string, graph, Fisher) when the input is not a vector.
Kernels are distance-sensitive. Without input normalisation, features with larger scales dominate. The default pre-processing is to standardise each feature to zero mean and unit variance, or to min-max scale to [0, 1]. For the RBF kernel this is essentially mandatory: without it, the bandwidth γ interacts unpredictably with the feature scales and tuning becomes very difficult.
The RBF kernel has two knobs — C and γ — and their interaction is strong, so grid-search over a logarithmic grid is the norm. A reasonable default grid: C ∈ {0.1, 1, 10, 100}, γ ∈ {0.01, 0.1, 1, 10} scaled by the median-distance heuristic. Five-fold cross-validation on this 16-point grid is the pragmatic default and will find a near-optimal setting in 80 model fits. For serious work, Bayesian optimisation over the hyperparameters often finds better settings in fewer fits.
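The grid-search-with-CV loop can be sketched in a few lines. To stay self-contained, this sketch tunes kernel ridge regression's (γ, λ) rather than SVC's (C, γ); the workflow is identical, and every constant in it is illustrative.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def cv_error(X, y, gamma, lam, n_folds=5):
    """Mean squared validation error of kernel ridge over n_folds folds."""
    errs = []
    for f in np.array_split(np.arange(len(X)), n_folds):
        mask = np.ones(len(X), bool)
        mask[f] = False                                  # hold out fold f
        Xtr, ytr = X[mask], y[mask]
        alpha = np.linalg.solve(rbf_gram(Xtr, Xtr, gamma) + lam * np.eye(len(Xtr)), ytr)
        pred = rbf_gram(X[f], Xtr, gamma) @ alpha
        errs.append(np.mean((pred - y[f]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# Logarithmic grid over the two knobs, as in the text.
grid = [(g, l) for g in (0.01, 0.1, 1, 10) for l in (1e-3, 1e-2, 1e-1, 1.0)]
best = min(grid, key=lambda p: cv_error(X, y, *p))
```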
The median-distance heuristic: compute the pairwise distances ‖xᵢ − xⱼ‖ on a random subsample of the data and take the median, then set γ = 1 / (median)². This is almost always within a factor of 2 of the optimal bandwidth and is the right starting point for cross-validation.
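The heuristic in numpy, as a sketch (subsample size and toy data are illustrative):

```python
import numpy as np

def median_heuristic_gamma(X, n_sub=500, seed=0):
    """gamma = 1 / median(||x_i - x_j||)^2 over a random subsample of the data."""
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=min(n_sub, len(X)), replace=False)]
    d2 = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=2)
    med = np.median(np.sqrt(d2[np.triu_indices(len(S), k=1)]))  # distances with i < j
    return 1.0 / med**2

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
gamma = median_heuristic_gamma(X)
```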
Un-normalised inputs — the hyperparameter you tune depends on arbitrary feature scales, and nothing generalises. An RBF γ off by 100× — the decision surface collapses to either constants (too-small γ) or nearest-neighbour-like islands (too-large γ). C too small — the method underfits and looks like a weak linear model. C too large — the method memorises the training data and has terrible test accuracy. A kernel matrix that is not positive semi-definite — a user-defined similarity function that happens not to be PSD is a silent bug that makes the QP solver misbehave.
Kernel ridge regression: small-to-medium regression problems, closed-form solution, simplest kernel method. Kernel SVM: small-to-medium classification problems, sparse solution at prediction time. SVR: same regime as SVM but for regression, when you want sparsity. Gaussian processes: when calibrated uncertainty matters — Bayesian optimisation, active learning, experimental design. Kernel PCA / kernel clustering: dimensionality reduction and clustering with non-linear structure. All of the above with Nyström or random Fourier approximations when n gets past 10⁵.
Kernel methods are not the dominant paradigm in 2026 — deep learning is — but their influence on modern ML is pervasive, and in several regimes they remain the state of the art. The vocabulary of feature maps, positive definiteness, the representer theorem, and the RKHS norm shows up constantly in contemporary theoretical work. Several modern techniques are kernel methods in disguise. And when small data, calibrated uncertainty, or theoretical rigour are the constraints, kernels are still the first thing to try.
One of the most unexpected results of the past decade: in the limit of infinite width, a gradient-trained neural network is exactly a kernel method (Jacot, Gabriel & Hongler 2018). The neural tangent kernel (NTK) is the kernel defined by the infinite-width limit of the network's feature gradients, and it predicts the behaviour of finite neural networks remarkably well — at least early in training. The NTK is the bridge from the kernel-methods literature to the theory of deep learning, and a huge amount of recent theoretical work on neural networks uses kernel-methods machinery to make claims about deep networks.
GPs are the right tool whenever "calibrated uncertainty" and "small data" are the key constraints. They dominate Bayesian optimisation — the de facto standard for hyperparameter tuning of expensive functions (model training, chemical experiments, engineering design). They are a core tool in active learning, where the decision of "which point should I label next?" is a direct function of GP posterior variance. They are the backbone of modern scientific ML in geostatistics, climate modelling, and robot control where training-data collection is expensive.
In many domains — particularly small-to-medium tabular data — a well-tuned kernel SVM or kernel ridge still matches or beats gradient-boosted trees and neural networks. This is less widely appreciated than it should be, because kernel methods do not scale to the sizes where deep networks shine. If your training set is under 10 000 points, a kernel method is almost always worth trying as a baseline, and often worth deploying.
Maximum mean discrepancy (Gretton et al. 2012) is a kernel-based two-sample test — "are these two samples drawn from the same distribution?" — that is differentiable, closed-form, and used as a regulariser in modern generative-model training. Kernel methods for independence testing (HSIC) show up in causal discovery and feature selection. The kernel-based framework for comparing distributions is one of the most useful modern bridges between classical statistics and deep-learning-style optimisation.
Even when kernels are not the tool, their vocabulary has won. Every modern deep network is doing some kind of implicit feature extraction followed by a linear operation on features — the feature-map-plus-linear-classifier architecture is the kernel-methods architecture. The notion that capacity is controlled by a norm of the learned function (not by the number of parameters) was kernel thinking that deep learning adopted without always acknowledging. Generalisation bounds based on margin, on Rademacher complexity, on the RKHS norm — all originated in the kernel-methods literature and now appear in every paper on deep-learning theory.
Kernel methods are the most mathematically elegant corner of classical machine learning — with Hilbert-space theory, convex-optimisation duality, and statistical learning theory all converging on the same family of algorithms. The references below split into anchor textbooks that define the field, foundational papers from Aronszajn's 1950 RKHS paper through Cortes–Vapnik's 1995 soft-margin SVM to Jacot et al.'s 2018 neural tangent kernel, modern extensions that carry kernel ideas into the deep-learning era, and the software documentation where all serious kernel work happens. If you only read one textbook, read Schölkopf & Smola's Learning with Kernels.
libsvm is the reference kernel-SVM implementation: scikit-learn's SVC and SVR wrap it internally; LIBLINEAR implements specialised coordinate-descent solvers for the linear-kernel case and scales to millions of samples and features. The documentation is terse but complete; A Practical Guide to Support Vector Classification (Hsu, Chang & Lin) is the single most useful ten-page document on day-to-day SVM use. If you want to understand what SVC is actually doing under the hood, this is the source.
sklearn.svm.SVC and SVR wrap libsvm; LinearSVC wraps LIBLINEAR; sklearn.kernel_ridge.KernelRidge provides the exact kernel-ridge regressor; sklearn.decomposition.KernelPCA implements kernel PCA; sklearn.kernel_approximation provides the Nyström and random-Fourier-feature explicit approximators. The User Guide sections on SVMs and kernel approximation are unusually good. For 80% of kernel-methods work in Python, this is the whole toolkit.
This page is Chapter 07 of Part IV: Classical Machine Learning. Chapter 08 turns from kernel methods — which invite us to let the kernel do the non-linear work so the algorithm can stay linear — to feature engineering and selection, which invites us to do the non-linear work by hand, carefully, so the algorithm and the features together tell a story a human can read. The two philosophies are complementary: kernels work best when we don't know what the right features are; feature engineering works best when we do. Chapter 09 then closes the classical-ML arc with model evaluation and selection — the measurement discipline that tells us which of the preceding seven chapters' techniques actually wins on any given problem.