Most learning algorithms in this book — linear regression, logistic regression, principal component analysis, k-means — are designed for the case where the data lives in an ordinary Euclidean space and the interesting structure is linear. Real data is almost never linear. The kernel trick, due to Aronszajn in 1950 as an abstract mathematical result and to Boser, Guyon, and Vapnik in 1992 as an algorithmic one, is the single most elegant extension of linear methods to non-linear settings: rather than mapping the data into some high-dimensional feature space and computing inner products there, the kernel computes those inner products directly — in time that does not grow with the dimension of the feature space — without ever materialising the feature map. A huge swath of classical machine learning becomes an exercise in choosing and combining kernels: the method is the same, only the similarity function changes. Support vector machines, the chapter's name-bearing technique, are the most famous application: the maximum-margin classifier that dominated academic benchmarks for a decade, defined much of the vocabulary of modern statistical learning, and trained a generation of learning theorists. Alongside SVMs, the kernel framework supports kernel ridge regression, kernel PCA, kernel clustering, Gaussian processes (Bayesian kernel regression), and a dozen more specialised methods, all sharing the same mathematical core. The catch is scaling: kernel methods with n training points require O(n²) memory and O(n³) training time, which broke them decisively for large modern datasets and motivated the Nyström and random-features approximations that keep them relevant today.
Section one motivates the whole enterprise by taking a linear method — the perceptron or ridge regression — and showing how its behaviour changes when the data is first lifted into a non-linear feature space. Section two makes that lifting concrete: a feature map φ: X → H replaces each input with a vector in some (possibly infinite-dimensional) Hilbert space, and a linear method applied to φ(x) becomes a non-linear method in x. Section three is the central trick: any algorithm whose formulation depends only on inner products ⟨φ(x), φ(x')⟩ can be rewritten using a kernel function k(x, x') that computes those inner products directly, without ever evaluating φ. Mercer's theorem tells us which functions k qualify as kernels (positive semi-definite ones), and the kernel matrix K with Kᵢⱼ = k(xᵢ, xⱼ) is the only thing the algorithm ever sees about the data. Section four introduces reproducing kernel Hilbert spaces — the functional-analytic framework that makes the whole construction rigorous — and the representer theorem, which says every kernel-method solution lies in the span of the training points. Section five is a tour of the kernel zoo: linear, polynomial, RBF/Gaussian, Laplacian, sigmoid, and the specialised kernels for strings, graphs, and sets. Section six explains the operations (addition, multiplication, composition) that let you build new kernels from old ones.
Sections seven through twelve develop the flagship application: support vector machines. Section seven presents the maximum-margin classifier — Vapnik's foundational idea that, among the many hyperplanes that separate two classes, the one to pick is the one that leaves the widest margin on either side. Section eight introduces soft-margin SVMs via slack variables and the regularisation constant C, which is what lets the method handle non-separable data. Section nine develops the dual formulation, where the primal problem over the weight vector is re-expressed as a quadratic program over dual variables that weight the training points. The dual is what makes SVMs work with kernels, and the support vectors — the small subset of training points with non-zero dual variables — are both the sparsity story and the mechanism by which SVMs generalise despite enormous feature spaces. Section ten applies the kernel trick to the dual, producing the kernel SVM — the method that actually gets implemented. Section eleven generalises the machinery to regression: the epsilon-insensitive loss, support vector regression, and when to prefer it over kernel ridge. Section twelve covers the two standard routes to multiclass SVMs (one-vs-one, one-vs-all), and notes the Crammer–Singer unified formulation.
Sections thirteen through eighteen survey the kernel-methods ecosystem beyond SVMs and bring the chapter home. Section thirteen is kernel ridge regression — linear ridge with the kernel trick applied — the closed-form analogue of SVR and the cleanest possible example of kernel thinking. Section fourteen touches on the other kernelised classics: kernel PCA (already met in Chapter 05), kernel k-means, kernel Fisher discriminant, one-class SVM for anomaly detection, and spectral clustering viewed as a kernel method. Section fifteen is Gaussian processes — the Bayesian formulation of kernel regression, with calibrated posteriors, hyperparameter learning via the marginal likelihood, and the close relationship between the GP mean predictor and kernel ridge regression. Section sixteen is the hardest practical question: scaling. Kernel methods are O(n²) in memory and O(n³) in time; the Nyström approximation, random Fourier features, inducing-point sparse GPs, and modern GPU libraries (GPyTorch, KeOps, Falkon) are the techniques that push the usable range from tens of thousands to tens of millions of points. Section seventeen closes with practical guidance: which kernel to try first, how to tune hyperparameters, when to prefer a random-features approximation over a full kernel, and the common traps. Section eighteen places kernel methods in the modern ML landscape: the neural-network–kernel equivalence in the infinite-width limit, the neural tangent kernel, kernel methods as the gold-standard baseline against which every new deep model is still measured, and the continuing relevance of Gaussian processes in Bayesian optimisation, scientific ML, and anywhere calibrated uncertainty matters.
Linear methods are the workhorses of classical statistics and machine learning because they are convex, easy to optimise, and well understood. They fail — often catastrophically — on problems where the interesting structure is not linear. The tempting fix is to engineer non-linear features by hand: add squared terms, pairwise products, polynomials, trigonometric functions. This works for small problems and becomes intractable as the number of features grows. Kernel methods are the elegant alternative: use a linear algorithm in a space where the data has been non-linearly transformed, but never actually compute the transformation. The inner product in the transformed space — which is all a linear algorithm needs — is computed directly by a kernel function.
The prototypical example is the XOR problem: four points in the plane, labels (+1, −1, −1, +1) at (1,1), (1,−1), (−1,1), (−1,−1). No line separates the positives from the negatives, so no linear classifier works. Lift the points into three dimensions with the feature map φ(x₁, x₂) = (x₁², x₂², √2·x₁·x₂). The XOR points now live at (1, 1, √2), (1, 1, −√2), (1, 1, −√2), (1, 1, √2), and the hyperplane z = 0 (with z the third coordinate) separates them perfectly. A linear classifier in three dimensions solves the problem that no linear classifier in two dimensions could.
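The lift is easy to check numerically. A minimal sketch (numpy only; the feature map is the one from the text, the variable names are illustrative):

```python
import numpy as np

# The four XOR points and their labels, as in the text.
X = np.array([(1, 1), (1, -1), (-1, 1), (-1, -1)], dtype=float)
y = np.array([+1, -1, -1, +1])

def phi(x):
    """Degree-2 feature lift phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

Z = np.array([phi(x) for x in X])

# In the lifted space the hyperplane z = 0 (normal w = (0, 0, 1), b = 0)
# classifies all four points: the sign of the third coordinate matches y.
predictions = np.sign(Z[:, 2])
```

Running this, `predictions` agrees with `y` on all four points, confirming that a linear rule in three dimensions solves the two-dimensional XOR.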
Explicit feature maps become impractical quickly. A degree-d polynomial expansion of a p-dimensional input has O(p^d) features. For p=1000 and d=5 that is trillions of features, far too many to materialise. Even worse, some useful feature maps are infinite-dimensional: the Gaussian RBF map, derived later, lifts each point into a countably infinite-dimensional Hilbert space. No algorithm that works with explicit feature vectors can handle this.
The observation that makes everything work: a huge family of linear algorithms — ridge regression, logistic regression, PCA, k-means, the perceptron, and the support vector machine — touch the data only through inner products ⟨xᵢ, xⱼ⟩. None of them ever uses a raw coordinate of xᵢ; only dot products. If we replace every occurrence of ⟨xᵢ, xⱼ⟩ with a kernel function k(xᵢ, xⱼ) that computes ⟨φ(xᵢ), φ(xⱼ)⟩ directly, the algorithm runs in the lifted space without our ever having lifted the points. The kernel is the only thing that matters.
We gain expressivity: an RBF-kernel SVM can represent any continuous decision boundary, an RBF-kernel Gaussian process can represent any smooth function. We gain the ability to work with data whose natural representation is not a vector at all — strings, graphs, trees, distributions — by defining a kernel that measures similarity between such objects. We gain rich theoretical tools: convergence guarantees, generalisation bounds, and explicit relationships between the kernel, the hypothesis space, and the capacity of the model.
We pay in computation. Kernel methods with n training points require an n×n kernel matrix — O(n²) memory — and the training of most kernel methods is O(n³) in time. For n=100 000 this is already painful; for n=10⁶ it is infeasible without the approximations of section 16. This scaling is the single reason that, when deep learning arrived with linear-in-n minibatch SGD, kernel methods lost ground in the commodity-compute regime. They remain the gold standard in small-to-medium data problems and in any setting where calibrated uncertainty, interpretability, or rigour matters more than raw size.
Sections 2–6 develop the representation: feature maps, the kernel trick, reproducing kernel Hilbert spaces, the kernels you use in practice, and how to build new ones. Sections 7–12 develop SVMs — the flagship algorithm — from linear max-margin classifiers to the kernelised multiclass versions. Sections 13–15 cover kernel ridge regression, the other kernelised classics, and Gaussian processes. Section 16 is about scaling — without which modern kernel methods are a museum piece. Sections 17–18 close with practice and where the ideas live today.
Before introducing the kernel trick, we need to be precise about what a feature map is and what an inner product is in the generality we need. A feature map is a function that takes each input and produces a vector in some — possibly infinite-dimensional — Hilbert space. The inner product in that space, evaluated on the images of pairs of inputs under the feature map, will end up being the kernel. This section pins down the setting.
A feature map is a function φ: X → H from the input space X (which can be anything — vectors, strings, graphs, images) to a Hilbert space H with inner product ⟨·, ·⟩. A Hilbert space is a complete inner-product space — intuitively, a possibly-infinite-dimensional vector space on which angles and lengths are well-defined. R^d with the ordinary dot product is the finite-dimensional example; the space of square-integrable functions on R with the integral inner product is the canonical infinite-dimensional example.
For input x = (x₁, x₂) ∈ R², define φ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁·x₂). The image lives in R⁶. The inner product in R⁶ is
⟨φ(x), φ(x')⟩ = 1 + 2·x₁·x₁' + 2·x₂·x₂' + x₁²·(x₁')² + x₂²·(x₂')² + 2·x₁·x₂·x₁'·x₂'.
With some algebra, this equals (1 + x₁·x₁' + x₂·x₂')² = (1 + ⟨x, x'⟩)². The inner product in the six-dimensional feature space is computed by squaring a one-line scalar expression in the two-dimensional input space. This is the degree-2 polynomial kernel, and it already shows the trick in miniature: the feature vector has 6 entries, while the kernel computation is a two-dimensional dot product, an addition, and a squaring.
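The identity can be verified numerically in a few lines. A sketch (numpy; random inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

def phi(v):
    # The explicit six-dimensional feature map from the text.
    v1, v2 = v
    s = np.sqrt(2)
    return np.array([1, s * v1, s * v2, v1**2, v2**2, s * v1 * v2])

explicit = phi(x) @ phi(xp)      # inner product in R^6
kernel = (1 + x @ xp) ** 2       # one-line kernel evaluation

assert np.isclose(explicit, kernel)
```

The assertion passes for any pair of inputs: the six-dimensional inner product and the squared scalar expression are the same number.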
A linear classifier in a Hilbert space H is a function f(x) = ⟨w, φ(x)⟩ + b for some w in H and scalar b. Training finds the w that minimises some loss. For many loss functions — hinge loss (SVM), squared loss (ridge), log loss (logistic regression) — the optimiser's w lies in the span of the training images: w = Σᵢ αᵢ φ(xᵢ). This is the representer theorem, formalised in section 4. Substituting back:
f(x) = Σᵢ αᵢ ⟨φ(xᵢ), φ(x)⟩ + b = Σᵢ αᵢ k(xᵢ, x) + b,
so the prediction for a new input x depends on the training data only through the kernel evaluated between x and each training point. The whole method, from training to prediction, has vanished into kernel evaluations.
It is useful to read the kernel as a similarity function. ⟨φ(x), φ(x')⟩ is large when φ(x) and φ(x') point in similar directions — it is a measure of how much the two inputs have in common in the feature space. Designing a good kernel is designing a good measure of similarity. For points in R^d, the Gaussian RBF kernel k(x, x') = exp(−γ‖x−x'‖²) is large when the two points are close and small when they are far — which is why it works so well as a default.
The kernel trick is the observation that we do not need to know the feature map φ explicitly — we need only the inner product ⟨φ(x), φ(x')⟩, which is the kernel k(x, x'). The deep question is: which functions k(x, x') are inner products of some feature map into some Hilbert space? The answer is given by Mercer's theorem and its modern generalisations, and it is essentially: exactly the positive semi-definite functions.
A function k: X × X → R is a positive semi-definite kernel — sometimes just "PSD" — if for every finite set of inputs x₁, …, xₙ the Gram matrix (or kernel matrix) K with entries Kᵢⱼ = k(xᵢ, xⱼ) is symmetric and positive semi-definite: vᵀKv ≥ 0 for every vector v. Equivalently, all eigenvalues of K are non-negative, for every finite n and every choice of inputs.
The result that makes everything work: every PSD kernel k corresponds to some feature map φ into some Hilbert space H such that k(x, x') = ⟨φ(x), φ(x')⟩. The Hilbert space is unique up to isomorphism and is called the reproducing kernel Hilbert space (RKHS) of the kernel. The theorem does not give us φ explicitly — it proves existence and constructs H as the closure of finite linear combinations of kernel evaluations — but it guarantees that any PSD kernel is legitimately the inner product of some feature map, which is all we need to apply the kernel trick.
Mercer's theorem (1909) is the classical result that predated the kernel-methods literature by ninety years: under mild conditions, a continuous PSD kernel k(x, x') on a compact domain admits an eigendecomposition
k(x, x') = Σ_i λᵢ · eᵢ(x) · eᵢ(x'),
where λᵢ ≥ 0 are eigenvalues and eᵢ are orthonormal eigenfunctions. The feature map is φ(x) = (√λ₁ · e₁(x), √λ₂ · e₂(x), …), and the feature space is ℓ². This is where infinite-dimensional feature spaces first appear in the theory — the RBF kernel, for instance, has a countably infinite eigendecomposition and a feature space that is genuinely infinite-dimensional.
To apply the kernel trick to an existing linear algorithm: (i) write the algorithm entirely in terms of inner products between training inputs; (ii) replace every inner product ⟨xᵢ, xⱼ⟩ with k(xᵢ, xⱼ); (iii) verify that the algorithm still makes sense — typically it does, because the only way it ever touched the raw data was through inner products. The kernel matrix K becomes the central object: it holds all the information the algorithm has about the data. Many algorithms (kernel ridge, kernel PCA, Gaussian processes) require factorising or inverting K — hence the O(n³) scaling.
Two main routes. The first: construct an explicit feature map φ and show that k(x, x') = ⟨φ(x), φ(x')⟩. The second, more commonly useful in practice: use the closure properties of section 6 (sum, product, scalar multiplication) to build k from known PSD kernels; the result is guaranteed PSD. For a function pulled from thin air, the default is to compute the Gram matrix on a sample of data and check its eigenvalues.
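The empirical check mentioned above is easy to implement. A sketch (numpy; the candidate "similarity" used as a counterexample is illustrative):

```python
import numpy as np

def is_psd_on_sample(k, X, tol=1e-8):
    """Empirical PSD check: smallest eigenvalue of the Gram matrix on a sample."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))

# The RBF kernel passes the check on any sample.
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))

# Negative squared distance looks like a similarity but is not PSD:
# its Gram matrix has zero trace and non-zero entries, so it must
# have a negative eigenvalue.
notk = lambda x, xp: -np.sum((x - xp) ** 2)
```

On this sample, `is_psd_on_sample(rbf, X)` returns True and `is_psd_on_sample(notk, X)` returns False. Passing the check on one sample is of course necessary, not sufficient, for being a kernel.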
The feature-space view of kernel methods is concrete but heavy — working with a possibly-infinite-dimensional feature vector is clumsy. The function-space view is cleaner: think of the kernel not as an inner product between feature vectors but as an inner product between functions in a particular space of functions — the reproducing kernel Hilbert space. This view makes regularisation theory, generalisation bounds, and the connection to Gaussian processes transparent, and the representer theorem falls out as a direct consequence.
Given a PSD kernel k on a space X, the RKHS H_k is a Hilbert space of functions f: X → R with the following property: for every x ∈ X, the function k(·, x) is in H_k, and for every f in H_k, the reproducing property
f(x) = ⟨f, k(·, x)⟩_{H_k}
holds. Pointwise evaluation of f at x is computed by taking the inner product of f with the kernel "bump" at x. Every RKHS has a unique kernel, and every PSD kernel has a unique RKHS (the Moore–Aronszajn theorem): the space and the kernel determine each other, two descriptions of the same mathematical structure.
The theorem that makes kernel methods computable. Given a set of training inputs x₁, …, xₙ, a strictly increasing regulariser Ω: [0, ∞) → R, and a loss L that depends only on function values at the training points, the minimiser of
min_{f ∈ H_k} L(f(x₁), …, f(xₙ), y) + Ω(‖f‖_{H_k})
lies in the finite-dimensional subspace spanned by k(·, x₁), …, k(·, xₙ). That is, f*(x) = Σᵢ αᵢ k(xᵢ, x) for some coefficients α. The proof is three lines: by the reproducing property, any component of f orthogonal to that span leaves every training value f(xᵢ) unchanged while increasing the regulariser.
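Kernel ridge regression is the cleanest instance of the representer form: squared loss plus the RKHS norm gives the closed-form coefficients α = (K + λI)⁻¹y, and predictions are f(x) = Σᵢ αᵢ k(xᵢ, x). A minimal sketch (numpy; the toy sine dataset and hyperparameter values are illustrative):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, z) = exp(-gamma * ||x - z||^2).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

lam = 1e-2
K = rbf(X, X)
# Solve (K + lam * I) alpha = y: the kernel ridge coefficients.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Representer form of the predictor: f(x) = sum_i alpha_i k(x_i, x).
X_test = np.array([[0.5]])
f = rbf(X_test, X) @ alpha
```

By construction α satisfies Kα + λα = y exactly, and the predictor is a weighted sum of kernel evaluations against the training points, exactly as the theorem promises.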
The RKHS norm ‖f‖_{H_k} is the canonical regulariser for kernel methods: it penalises functions that vary rapidly in the metric the kernel implicitly defines. A smooth kernel (like the RBF) makes the RKHS consist of smooth functions; a non-smooth kernel (like the Laplacian) makes it consist of rougher functions. Choosing the kernel is choosing the inductive bias — the hypothesis space and its implicit complexity ordering. Kernel ridge regression, kernel logistic regression, and the soft-margin SVM are all particular choices of loss paired with the RKHS norm as the regulariser.
A Gaussian process with covariance function k is intimately tied to the RKHS of k — though, strictly speaking, its sample paths almost surely fall just outside H_k, lying in a slightly larger function space (Driscoll's theorem). The kernel ridge-regression predictor equals the posterior mean of the GP with the same kernel. The GP is the Bayesian twin of kernel ridge: same mean predictor, plus a calibrated posterior variance. This connection motivates section 15 in full detail.
In principle there are uncountably many PSD kernels. In practice, five or six dominate — the linear kernel for baseline sanity checks, the RBF kernel as a nearly universal default, the polynomial kernel for interactions of known degree, the Laplacian for heavy-tailed behaviour, and a handful of specialised kernels for non-vector inputs (strings, graphs, sets, probability distributions). Knowing the standard kernels and when each is appropriate is most of what practical kernel design is about.
k(x, x') = ⟨x, x'⟩. The feature map is the identity. A kernel SVM with a linear kernel is just a linear SVM. The linear kernel is useful when the data is high-dimensional and already near-linearly separable — common in sparse, high-dimensional problems like text (bag-of-words vectors) or genomics. Linear kernels admit O(n) or O(nd) algorithms (LibLinear, SVMlin) that dominate kernel-SVM implementations when the kernel is the identity.
k(x, x') = (⟨x, x'⟩ + c)^d, with c ≥ 0 a constant and d a positive integer. The feature space is the space of all monomials of degree up to d (exactly d when c = 0). A polynomial kernel with d=2 encodes pairwise interactions; d=3 encodes three-way interactions; and so on. The number of features in the feature space grows as O(p^d), so for even modest d and p the kernel computes inner products in a trillion-dimensional space in a single line of arithmetic.
k(x, x') = exp(−γ · ‖x − x'‖²), sometimes written with γ = 1/(2σ²). The most widely used kernel in practice, for good reason. Its feature space is infinite-dimensional (countable orthonormal basis of Hermite polynomials). It is universal: an RBF-kernel SVM can approximate any continuous function on a compact domain to arbitrary accuracy, and Gaussian processes with the RBF kernel produce posteriors over arbitrary smooth functions. The bandwidth γ is the key hyperparameter: small γ makes the kernel smooth and long-range, large γ makes it peaky and short-range. Cross-validate or use the median heuristic (γ ≈ 1 / median(‖xᵢ − xⱼ‖²)) as a starting point.
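The median heuristic mentioned above is a few lines of numpy. A sketch (the sample data is illustrative; only the off-diagonal pairwise distances enter the median):

```python
import numpy as np

def median_heuristic_gamma(X):
    """gamma ~ 1 / median of pairwise squared distances, as a starting point."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Take the median over distinct pairs only (exclude the zero diagonal).
    med = np.median(d2[np.triu_indices(len(X), k=1)])
    return 1.0 / med

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
gamma = median_heuristic_gamma(X)
```

For standard-normal data in 5 dimensions the typical squared distance is around 2·5 = 10, so the heuristic returns γ of roughly 0.1; it is a starting point for cross-validation, not a final answer.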
k(x, x') = exp(−γ · ‖x − x'‖₁), with the L¹ norm in the exponent. The sample paths are less smooth than RBF's, which can be a feature when the target function has non-smooth behaviour (edges, step changes). The Laplacian kernel is also the covariance of the Ornstein–Uhlenbeck process (in one dimension).
The Matérn kernel with smoothness parameter ν is a parametric family that interpolates between the Laplacian (ν=1/2) and the RBF (ν → ∞). The common values ν=3/2 and ν=5/2 are the standards in spatial statistics and Bayesian optimisation — they give smoother sample paths than Laplacian but less restrictive behaviour than RBF. When you care about the smoothness of the learned function, the Matérn kernel is the principled choice.
k(x, x') = tanh(α · ⟨x, x'⟩ + c). Famously not always PSD — it is PSD only for certain parameter ranges — but historically important because it gave kernel SVMs a formal resemblance to two-layer perceptrons. In practice the RBF kernel is a better universal non-linearity and the sigmoid kernel is rarely used today.
The real flexibility of kernel methods is in the specialised kernels for non-vector inputs. The string kernel measures the similarity of two strings by counting shared subsequences; it is the backbone of protein-sequence classification and text classification. The graph kernel family — Weisfeiler–Lehman, random-walk, graphlet — measures similarity between graphs and was state of the art for molecular property prediction before graph neural networks. The set kernel measures the similarity of two sets by aggregating pairwise element kernels. The Fisher kernel measures similarity in terms of the score gradients of a probabilistic model. For any kind of structured input, there is usually a published kernel that captures the natural notion of similarity.
When starting a new problem: try the linear kernel first as a baseline. If the problem is non-linear, try the RBF kernel with the median-distance heuristic for γ. If the RBF kernel works, tune γ and C via cross-validation. Use polynomial kernels only when you have a reason to think the interaction order matters. Use specialised kernels when the input is not a vector.
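That workflow can be sketched with scikit-learn (assumed available; the synthetic dataset and the grid values are illustrative, with the C grid taken from the guidance later in the chapter):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: linear kernel as a baseline.
linear = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X_tr, y_tr)

# Step 2: RBF kernel, tuning C and gamma on a logarithmic grid.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    {"svc__C": [0.01, 0.1, 1, 10, 100],
     "svc__gamma": ["scale", 0.01, 0.1, 1.0]},
    cv=5,
)
grid.fit(X_tr, y_tr)
```

Comparing `linear.score(X_te, y_te)` against `grid.score(X_te, y_te)` tells you whether the problem rewards a non-linear boundary at all; if the linear baseline already wins, stop there.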
The set of PSD kernels is closed under a surprisingly rich collection of operations. This closure is what lets you compose simple kernels — each capturing one aspect of similarity — into a custom kernel tailored to the problem, without having to re-derive positive-definiteness from first principles. Mastering these operations is the practical craft of kernel design.
Let k₁ and k₂ be PSD kernels on X. Then:
(i) k(x, x') = k₁(x, x') + k₂(x, x') is PSD — sums of kernels are kernels.
(ii) k(x, x') = α · k₁(x, x') for α ≥ 0 is PSD — non-negative scalar multiples are kernels.
(iii) k(x, x') = k₁(x, x') · k₂(x, x') is PSD — products of kernels are kernels. This generalises: any polynomial with non-negative coefficients applied entrywise to a kernel is a kernel, and by Taylor expansion so is the exponential, so exp(k₁(x, x')) is a kernel.
(iv) k(x, x') = f(x) · k₁(x, x') · f(x') for any function f: X → R is PSD — this is "kernel conjugation" and is how one normalises a kernel to have unit diagonal.
(v) If g: X → X' is any function and k' is PSD on X', then k(x, x') = k'(g(x), g(x')) is PSD — composition with an arbitrary function preserves positive-definiteness. This is how the feature-engineering mindset fits into kernels.
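Closure properties (i)–(iii) can be spot-checked numerically on a sample: the minimum eigenvalue of each combined Gram matrix should be non-negative up to floating-point error. A sketch (numpy; the sample and the two base kernels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 4))

def gram(k):
    return np.array([[k(x, xp) for xp in X] for x in X])

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

k1 = lambda x, xp: x @ xp                               # linear kernel
k2 = lambda x, xp: np.exp(-0.5 * np.sum((x - xp)**2))   # RBF kernel

K1, K2 = gram(k1), gram(k2)
tol = -1e-8
assert min_eig(K1 + K2) >= tol     # (i)   sums of kernels are kernels
assert min_eig(3.0 * K1) >= tol    # (ii)  non-negative scalar multiples
assert min_eig(K1 * K2) >= tol     # (iii) entrywise (Schur) products
```

The product case is the least obvious of the three: it is the Schur product theorem, which guarantees that the entrywise product of two PSD matrices is PSD.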
A satisfying little derivation. Start with the linear kernel k₀(x, x') = ⟨x, x'⟩. Raise to any power: k₀^n is a kernel (closure under products). Exponentiate: exp(k₀) is a kernel (closure under polynomials with non-negative coefficients, and the Taylor series of exp has all non-negative coefficients). Now conjugate by f(x) = exp(−γ · ‖x‖² / 2): the result is
exp(−γ ‖x‖² / 2) · exp(γ ⟨x, x'⟩) · exp(−γ ‖x'‖² / 2) = exp(−γ · ‖x − x'‖² / 2).
The RBF kernel emerges as a composition of linear-kernel operations with closure rules. The same kind of derivation shows that any "product kernel" — a kernel on X × Y defined by k((x, y), (x', y')) = k_X(x, x') · k_Y(y, y') — is PSD, which is the main tool for building kernels on structured inputs.
When several candidate kernels are available, multiple kernel learning (Lanckriet et al. 2004, Bach 2008) learns a non-negative weighted combination k(x, x') = Σ_m μ_m · k_m(x, x') directly from the data, choosing the weights to minimise a loss with a sparsity penalty. MKL was the reigning technique for automatic kernel design from about 2005 to 2015 and is still the right approach when the problem naturally admits several candidate similarity measures (multi-modal data, heterogeneous features).
The modern extension: let the feature extractor inside the kernel be a neural network. Deep kernels (Wilson et al. 2016) parametrise the kernel as k(x, x') = k_base(φ_θ(x), φ_θ(x')) where φ_θ is a neural network and k_base is a standard kernel (typically RBF). The neural network learns the representation; the kernel reasons in it. This is the natural bridge between kernel methods and deep learning and underpins deep kernel learning in modern Gaussian-process libraries.
Support vector machines start from a simple question: given linearly separable data, which of the infinitely many separating hyperplanes should we pick? The maximum-margin answer — the hyperplane that leaves the widest possible gap on either side — turns out to have strong generalisation properties and beautiful mathematical structure. Vapnik's insight was that the margin is not just an aesthetic choice but a complexity measure, and a widely separated hyperplane generalises better than a narrowly separated one almost independently of the dimensionality of the input space.
Given training data {(xᵢ, yᵢ)} with yᵢ ∈ {−1, +1}, a linear classifier is f(x) = wᵀx + b, predicting the sign of f. The data is linearly separable if some w and b achieve yᵢ · (wᵀxᵢ + b) > 0 for all i; rescaling w and b then lets us demand yᵢ · (wᵀxᵢ + b) ≥ 1. The margin is the distance from the hyperplane to the nearest training point; for this normalisation it equals 1 / ‖w‖. Maximising the margin is minimising ‖w‖, leading to the quadratic program
min_{w, b} (1/2) ‖w‖² subject to yᵢ · (wᵀxᵢ + b) ≥ 1 for all i.
This is the hard-margin SVM. It has a unique solution when the data is separable, and no solution when it is not — a limitation that section 8 addresses.
A hyperplane with a wide margin is robust: small perturbations of the training data do not flip its predictions near the boundary. It also generalises better, in a precise VC-theoretic sense: the VC-dimension of maximum-margin hyperplanes depends on the margin, not on the ambient dimension. This is the foundational result that justifies why SVMs can work in very high-dimensional feature spaces (or infinite-dimensional ones, via kernels) without overfitting — the maximum-margin constraint itself provides capacity control.
The optimal hyperplane is the perpendicular bisector of the line segment between the nearest positive and negative training points, adjusted for multiple nearest points. The points that lie exactly on the margin — for which yᵢ · (wᵀxᵢ + b) = 1 — are called support vectors. Moving any non-support vector has no effect on the optimal hyperplane; moving a support vector does. The solution depends only on a small subset of the training data — a sparsity property that becomes algorithmic leverage once we move to the dual formulation in section 9.
The maximum-margin hyperplane was described by Vapnik and Chervonenkis in a 1964 paper; the soft-margin extension was published in 1995 by Cortes and Vapnik; the kernel version by Boser, Guyon, and Vapnik in 1992. The SVM dominated academic benchmarks throughout the late 1990s and early 2000s, helped establish statistical learning theory as a field, and produced the conceptual machinery (margin, capacity, structural risk minimisation, sparse dual representation) that still underpins how we think about generalisation.
Real data is almost never linearly separable: there is always noise, mis-labelled points, or genuine overlap between classes. The hard-margin SVM has no solution on such data. The soft-margin SVM, introduced by Cortes and Vapnik in 1995, fixes this by allowing training points to violate the margin at a penalty, trading off margin width against training error. The strength of the trade-off is controlled by a single hyperparameter C, which is almost the only knob the user has to turn.
Introduce non-negative slack variables ξᵢ ≥ 0 measuring by how much each training point is allowed to violate the margin. The optimisation becomes
min_{w, b, ξ} (1/2) ‖w‖² + C · Σᵢ ξᵢ subject to yᵢ · (wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0.
A point with ξᵢ = 0 is correctly classified outside the margin; 0 < ξᵢ ≤ 1 means it is inside the margin but still correctly classified; ξᵢ > 1 means it is misclassified. The penalty C · Σ ξᵢ is added to the objective. Large C makes violations expensive — the method tries to classify every point correctly, yielding a narrow margin that risks overfitting. Small C makes violations cheap — the method tolerates many violations, yielding a wide margin that may underfit.
The soft-margin SVM has a clean loss-minimisation reformulation. Define the hinge loss
L(yᵢ, f(xᵢ)) = max(0, 1 − yᵢ · f(xᵢ)).
Then the SVM minimises (1/n) · Σᵢ L(yᵢ, f(xᵢ)) + λ · ‖w‖² with λ = 1/(nC). The SVM is hinge-loss + L2 regularisation, structurally identical in form to logistic regression (log-loss + L2) and ridge regression (squared-loss + L2). The only thing that makes SVMs distinctive is the hinge loss — which produces sparsity in the dual and the clean "support vectors" story.
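The hinge loss is one line of code, and its value at a training point is exactly the optimal slack ξᵢ from the constrained formulation. A sketch (numpy; the decision values are illustrative):

```python
import numpy as np

def hinge(y, f):
    # Hinge loss max(0, 1 - y*f); at the optimum this equals the slack xi_i.
    return np.maximum(0.0, 1.0 - y * f)

y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.5, 0.5, -3.0])   # decision values w^T x + b

losses = hinge(y, f)   # 0 (outside margin), 0.5 and 1.5 (violations), 0
```

The four values trace the three regimes from the slack discussion: zero loss outside the margin, loss in (0, 1] inside the margin but correctly classified, and loss above 1 for a misclassified point.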
C is the single most important hyperparameter for a soft-margin SVM. Cross-validate over a logarithmic grid: C ∈ {0.01, 0.1, 1, 10, 100}. For most problems, the optimal C will be one of these; fine-tune around the best if it matters. The intuition: C is the inverse of the regularisation strength. Large C = weak regularisation = tight fit. Small C = strong regularisation = smooth boundary. If the data is noisy, use small C; if it is clean, a larger one.
Structurally, SVMs and logistic regression are twins — the same regulariser, different losses. Hinge loss produces sparse dual solutions (support vectors); log loss produces probabilistic predictions but no sparsity. If you want calibrated probabilities, use logistic regression (or Platt-scale the SVM outputs post-hoc). If you want a sparse representation and strong margin-based generalisation guarantees, use SVM. Empirically the two tend to perform similarly on most classification tasks; the choice is often driven by downstream needs rather than accuracy.
The soft-margin SVM is a convex quadratic program in the primal — solvable directly by interior-point methods or specialised solvers. But its dual formulation is where all the interesting mathematics happens: the dual variables αᵢ weight the training points, the solution is sparse (most αᵢ are zero), the data appears only through inner products, and the kernel trick slots in effortlessly. Every kernel-SVM implementation solves the dual, not the primal.
Apply Lagrangian duality to the soft-margin primal. Introduce multipliers αᵢ ≥ 0 for the margin constraints and μᵢ ≥ 0 for the slack-non-negativity constraints. Stationarity over w gives w = Σᵢ αᵢ yᵢ xᵢ. Stationarity over b gives Σᵢ αᵢ yᵢ = 0. Eliminating w, b, and the slacks yields the dual
max_α Σᵢ αᵢ − (1/2) Σᵢⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩ subject to 0 ≤ αᵢ ≤ C, Σᵢ αᵢ yᵢ = 0.
This is a quadratic program in n variables. The data enters only through the Gram matrix G with Gᵢⱼ = yᵢ yⱼ ⟨xᵢ, xⱼ⟩, so the dual depends on the training inputs only through inner products. Replace ⟨xᵢ, xⱼ⟩ with k(xᵢ, xⱼ) and you have the kernel SVM of section 10.
The Karush–Kuhn–Tucker conditions characterise the optimum. Each training point falls into one of three categories:
αᵢ = 0 — the point is correctly classified outside the margin and does not participate in the solution.
0 < αᵢ < C — the point lies exactly on the margin (yᵢ·(wᵀxᵢ+b) = 1) and is an unbounded support vector. These points determine the margin boundary.
αᵢ = C — the point has positive slack: it is inside the margin or misclassified. These are the bounded support vectors.
The final weight vector is w = Σ_{i ∈ SV} αᵢ yᵢ xᵢ with the sum over support vectors only. Typically a small fraction of training points are support vectors — SVMs are sparse in the dual. This is what makes them memory-efficient at prediction time: to predict on a new point x, compute f(x) = Σ_{i ∈ SV} αᵢ yᵢ k(xᵢ, x) + b, a sum over support vectors only.
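The prediction rule is a handful of lines. A minimal numpy sketch: the kernel, support vectors, dual coefficients, and bias below are hand-set for illustration, not the output of a real solver.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian/RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, sv_x, sv_y, sv_alpha, b, kernel=rbf_kernel):
    """Decision value f(x) = sum_i alpha_i y_i k(x_i, x) + b over support vectors only."""
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(sv_alpha, sv_y, sv_x)) + b

# Toy illustration: two hand-set support vectors (values are made up).
sv_x = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sv_y = [+1, -1]
sv_alpha = [0.5, 0.5]
b = 0.0

f_near_pos = svm_predict(np.array([0.1, 0.0]), sv_x, sv_y, sv_alpha, b)  # near the +1 SV
f_near_neg = svm_predict(np.array([2.1, 2.0]), sv_x, sv_y, sv_alpha, b)  # near the -1 SV
```

The cost of each prediction is one kernel evaluation per support vector, which is why dual sparsity translates directly into prediction speed.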
Training an SVM is a quadratic program in n variables with a dense kernel matrix. The classical algorithm is Sequential Minimal Optimization (Platt 1998): iteratively pick a pair of dual variables, solve the two-variable QP analytically, and update. SMO has no outer QP solver and is straightforward to implement; libsvm and scikit-learn's SVC both use SMO variants. For very large n, LaSVM (online training), bundle methods, and stochastic-subgradient approaches (Pegasos) are faster.
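To make the stochastic-subgradient route concrete, here is a minimal Pegasos-style sketch in numpy: no bias term, no optional projection step, toy data, and all constants chosen for illustration rather than taken from the original paper's experiments.

```python
import numpy as np

def pegasos_train(X, y, lam=0.01, n_iters=2000, seed=0):
    """Stochastic subgradient descent on the primal SVM objective
    (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i w.x_i).  Sketch without bias term."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)             # decreasing step size 1/(lam * t)
        if y[i] * (X[i] @ w) < 1:         # point violates the margin: hinge is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                             # hinge inactive: only the L2 term contributes
            w = (1 - eta * lam) * w
    return w

# Linearly separable toy data: label is the sign of the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X[:, 0] += np.where(rng.random(200) < 0.5, 3.0, -3.0)
y = np.sign(X[:, 0])

w = pegasos_train(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

Each step touches a single training point, which is what makes the approach attractive when n is large and the kernel matrix cannot be formed.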
Training an SVM is empirically O(n²) to O(n³) depending on the kernel and the solver. Memory is O(n²) for the kernel matrix, which for n = 10⁵ is already 80 GB at double precision. Prediction is O(|SV|) — bounded support vectors are proportional to training error, unbounded ones grow slowly with n. These costs are what make scaling kernel methods to modern dataset sizes genuinely hard (section 16).
Because the SVM dual depends on the training inputs only through inner products, the kernel trick applies immediately: replace ⟨xᵢ, xⱼ⟩ with k(xᵢ, xⱼ), and the entire method now runs in the feature space implicitly defined by the kernel. The linear maximum-margin classifier becomes a non-linear maximum-margin classifier in the feature space, which traces out a non-linear boundary in the input space. This combination of support vectors, maximum margin, and kernel is what makes the SVM what it is, and it all falls out of a trivial substitution.
The kernel-SVM dual is identical to the linear dual with k(xᵢ, xⱼ) in place of the inner product:
max_α Σᵢ αᵢ − (1/2) Σᵢⱼ αᵢ αⱼ yᵢ yⱼ k(xᵢ, xⱼ) subject to 0 ≤ αᵢ ≤ C, Σᵢ αᵢ yᵢ = 0.
The prediction is f(x) = Σᵢ αᵢ yᵢ k(xᵢ, x) + b. The weight vector w is never explicitly computed — for an infinite-dimensional kernel like the RBF, it cannot be. The entire algorithm runs on the n × n kernel matrix K, and prediction uses only the kernel evaluated between the new point and each support vector.
In the RBF kernel case, the prediction function is a weighted sum of Gaussians centred at the support vectors. The decision boundary f(x) = 0 is a level set of this sum — a curved, locally adaptive surface in the input space. The curvature is controlled by γ (bandwidth); the complexity is controlled by C (misclassification penalty). Together these two hyperparameters span the whole bias–variance spectrum for kernel SVMs.
Kernel SVMs dominate small-to-medium classification problems when carefully tuned. They produce sparse, interpretable decision functions (a sum over support vectors); they generalise well thanks to the maximum-margin principle; they work on any input domain for which a kernel can be defined (images, text, graphs, sequences). They were the state-of-the-art vision and NLP classifier through the late 2000s — until deep learning scaled past them in vision (AlexNet 2012) and then everywhere else.
Kernel SVMs do not scale. The kernel matrix is 8n² bytes at double precision (so 80 GB for n = 10⁵), and training is O(n²) to O(n³). On datasets of millions of training points, full kernel SVMs are infeasible, which is why modern practice uses either (i) a linear SVM (on high-dimensional sparse features like text), (ii) a kernel approximation (Nyström or random Fourier features, section 16), or (iii) a different model entirely (ensemble trees, deep networks).
SVMs produce distances to the hyperplane, not probabilities. If you need calibrated probabilities, Platt scaling (Platt 1999) fits a one-dimensional logistic regression to the SVM outputs on a held-out set to produce P(y=1 | x) ≈ σ(A · f(x) + B). Isotonic regression is a non-parametric alternative. Both are post-hoc calibrations that do not change the SVM itself. See Chapter 09 (Model Evaluation & Selection) for the broader calibration picture.
The same max-margin + kernel trick machinery extends to regression. The key modification is the ε-insensitive loss: instead of penalising all prediction errors, penalise only those larger than ε. The result is a regression method whose solution is sparse (a subset of training points are "support vectors"), whose flexibility comes from the kernel, and whose regularisation is the same RKHS norm as in classification. Support vector regression is less widely used than kernel ridge regression — the closed form of kernel ridge is appealing — but it has one advantage: genuine sparsity at the solution.
Define the ε-insensitive loss as
L_ε(y, f(x)) = max(0, |y − f(x)| − ε).
Errors smaller than ε incur zero loss; errors larger than ε are penalised linearly. The parameter ε controls the width of the "tube" around the predictor inside which no penalty is paid, and is a regularisation knob complementary to C. The loss is convex but non-differentiable at its corners, giving rise to the sparse dual structure.
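In code the loss is one line; the two example evaluations below use assumed values of ε, the target, and the prediction.

```python
def eps_insensitive_loss(y, f, eps=0.5):
    """L_eps(y, f) = max(0, |y - f| - eps): zero inside the tube, linear outside."""
    return max(0.0, abs(y - f) - eps)

inside = eps_insensitive_loss(1.0, 1.2)   # |error| = 0.2 < eps, so no penalty
outside = eps_insensitive_loss(1.0, 2.0)  # |error| = 1.0, penalised by 1.0 - 0.5
```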
Primal:
min_{w, b, ξ, ξ*} (1/2) ‖w‖² + C · Σᵢ (ξᵢ + ξᵢ*) s.t. yᵢ − wᵀxᵢ − b ≤ ε + ξᵢ, wᵀxᵢ + b − yᵢ ≤ ε + ξᵢ*, ξᵢ, ξᵢ* ≥ 0.
Two slacks (upper and lower) per point reflect the symmetry of the loss. The dual is a QP in 2n variables, with prediction f(x) = Σᵢ (αᵢ − αᵢ*) · k(xᵢ, x) + b. Training points with αᵢ − αᵢ* = 0 are inside the ε-tube and do not participate; those with non-zero coefficients are the support vectors, typically a small fraction of the training data.
Three knobs: the kernel bandwidth (e.g. γ for RBF), the regularisation C, and the tube width ε. Common heuristics: set ε proportional to the expected noise standard deviation; set C to at least 10× the data range; tune γ by the median-distance heuristic. All three interact, and a three-way grid search is standard for serious SVR work.
Kernel ridge regression (next section) is convex, has a closed-form solution, uses all training points as non-zero coefficients (no sparsity), and has squared-loss semantics. SVR is also convex, uses the ε-insensitive loss, has a sparse dual solution, and requires iterative optimisation. On most problems the predictive accuracy is similar; choose kernel ridge if you want simplicity and closed forms, choose SVR if you need the sparse prediction structure (e.g., deploying a model where prediction time scales with the number of support vectors). Gaussian-process regression (section 15) subsumes both when you can afford the extra computation.
SVMs are natively binary: the max-margin formulation is defined between two classes, and extending it to more than two classes requires either reducing the multiclass problem to many binary ones or extending the formulation directly. Both routes are well understood, and in practice the reduction-based methods dominate because they work with any binary-SVM library. The direct formulation (Crammer–Singer) is conceptually cleaner but less widely supported.
Train K binary SVMs, each separating one class from all others. At prediction time, pass a new point through all K classifiers and predict the class whose classifier returns the highest score. Simple, fast, and used in most production systems. The trained classifiers are independent — they can be trained in parallel — which is part of why the approach scales. The weakness: the binary sub-problems are imbalanced (one class vs K−1), which can hurt calibration.
Train K·(K−1)/2 binary classifiers, one per pair of classes. At prediction time, apply all classifiers and vote. For K = 10 classes this is 45 classifiers; for K = 1000 it is roughly half a million — so one-vs-one scales poorly in the number of classes, though each sub-problem uses only the data of two classes and is therefore fast. The scikit-learn SVC default is one-vs-one; LibSVM uses it too. For small-to-medium K, one-vs-one is often slightly more accurate than one-vs-all.
The unified multiclass formulation (Crammer & Singer 2001) generalises the margin concept to multiple classes in one shot: learn K weight vectors simultaneously with the constraint that the correct-class score exceeds each incorrect-class score by a margin. Elegant and theoretically principled, but the QP is harder to solve than a batch of one-vs-all sub-problems and implementations are rare. Most practitioners never need to know about it.
A different reduction approach: encode each class as a binary string of length L, train L binary classifiers (one per bit), and at prediction time decode to the nearest class codeword. If the codewords are chosen to have large Hamming distance, the scheme has an error-correcting property: it can withstand misclassifications in individual bits. Dietterich & Bakiri's 1995 paper introduced the idea. ECOC is used occasionally for very-many-class problems but is mostly a historical curiosity.
Use whatever multiclass strategy your SVM library defaults to — one-vs-one for scikit-learn, one-vs-all for some others. For small K (up to 10–20) both are fine; for larger K prefer one-vs-all for training speed, at the mild cost of calibration quality. If probabilistic outputs matter, pair one-vs-all with Platt scaling or use a softmax layer on the raw SVM scores. If accuracy matters most and compute allows, run both strategies and validate.
Ridge regression (Chapter 01) minimises squared loss plus an L² penalty on the weight vector. Kernelising it is the cleanest possible example of the kernel trick: derive the solution in closed form, recognise that the data appears only through inner products, and substitute a kernel. The result — kernel ridge regression — is a predictor of the form f(x) = Σᵢ αᵢ k(xᵢ, x) with coefficients given by a single matrix inversion. There is no iterative optimisation, no quadratic program, and the solution is equivalent to Gaussian-process regression's posterior mean.
The ridge problem is min_w (1/2) ‖y − Xw‖² + (λ/2) ‖w‖², with closed form w = (XᵀX + λI)⁻¹ Xᵀy. Prediction: f(x) = xᵀw. Apply the matrix identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹ and use XXᵀ = K (the Gram matrix with ordinary inner products). We get
f(x) = k(x)ᵀ (K + λI)⁻¹ y,
where k(x) is the vector (k(x₁, x), …, k(xₙ, x)). Replace the ordinary inner products with any PSD kernel and the formula still holds — kernel ridge regression is a one-line closed form.
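The one-line closed form translates directly into numpy. A sketch, assuming an RBF kernel and illustrative values of λ and γ:

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=1e-3, gamma=1.0):
    """Coefficients alpha = (K + lam I)^{-1} y, so that f(x) = k(x)^T alpha."""
    K = rbf_gram(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(Xnew, X, alpha, gamma=1.0):
    """f(x) = k(x)^T alpha, evaluated for each row of Xnew."""
    return rbf_gram(Xnew, X, gamma) @ alpha

# Fit a noiseless sine; with small lam the predictor nearly interpolates.
X = np.linspace(0, 2 * np.pi, 30)[:, None]
y = np.sin(X).ravel()
alpha = krr_fit(X, y)
train_err = np.max(np.abs(krr_predict(X, X, alpha) - y))
```

Note that np.linalg.solve is used rather than an explicit inverse: the matrix inversion in the formula is a mathematical statement, not an implementation instruction.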
Kernel ridge is the pedagogically cleanest kernel method: no dual, no QP, just a matrix inverse. It is the mean predictor of a Gaussian process with the same kernel and noise λ — Bayesian and frequentist views give identical point predictions, differing only in the posterior-variance story. It generalises well for the same reason maximum-margin SVMs do: the RKHS-norm regulariser controls capacity. For small-to-medium datasets with clean kernel structure, kernel ridge is as strong as anything else in this chapter and simpler than everything except the linear baseline.
Training: compute the n×n kernel matrix (O(n²) memory), Cholesky-factorise K + λI (O(n³) time). Prediction: compute the n-vector k(x) and take a dot product (O(n) time per new point). For n up to about 10⁴ this runs on a laptop; past that point the kernel approximations of section 16 become necessary. When you have 10² or 10³ points and a good kernel, kernel ridge is often the first thing to try.
Kernel ridge is linear ridge regression in the feature space defined by the kernel. It is the dual of linear ridge regression, expressed in n variables instead of d — so whenever n < d (the over-parameterised regime), the dual formulation is cheaper. Its posterior-mean relationship with Gaussian processes is deep: kernel ridge is what you get when you only want the mean, Gaussian processes are what you get when you also want the variance (next two sections).
Once the pattern is clear — take a linear method, rewrite in terms of inner products, substitute the kernel — the list of kernelised classics is long. Several of them are already familiar: kernel PCA appeared in Chapter 05, spectral clustering in Chapter 04. This section is a quick tour of the rest, so that you recognise them and know when each is appropriate.
PCA on the feature-space vectors φ(xᵢ). The covariance operator in the feature space cannot be formed, but its eigenvectors can be expanded in the feature vectors, and the expansion coefficients are the eigenvectors of the centred kernel matrix. Kernel PCA produces non-linear dimensionality reductions; it was one of the first methods that could handle non-linear manifolds and is still a useful tool when the manifold structure is rich but smooth. Covered in detail in Chapter 05.
K-means with Euclidean distance in the feature space — equivalently, with distances derived from a kernel. It is the kernel counterpart of the algorithm in Chapter 04 and produces clusters that are non-linearly separable in the input space. Kernel k-means is closely related to spectral clustering: the relaxation of the kernel-k-means objective gives the spectral-clustering eigenvalue problem, which is another way to see why spectral clustering "sees" non-linear structure.
Fisher linear discriminant (Chapter 02) projects data onto the direction that maximises the ratio of between-class to within-class variance. Kernelising it gives the kernel Fisher discriminant, a non-linear binary classifier that performs similarly to a kernel SVM but with simpler machinery (a generalised eigenvalue problem instead of a QP). Historically significant, less widely used today.
A one-class SVM (Schölkopf et al. 2001) learns a boundary that encloses a fraction ν of the training data — a density-free anomaly detector. The problem is solved as a kernel SVM with a single "class" and a modified primal. Anomalies (points near the boundary or outside it) are those with small decision-function value. One-class SVMs are a classical tool in novelty detection, fraud detection, and unsupervised outlier-finding.
Spectral clustering, Laplacian eigenmaps (Chapter 05), diffusion maps — all can be understood as kernel methods where the kernel is the normalised Laplacian of a similarity graph. This unifying view relates the manifold-learning literature to the kernel-methods literature and makes it easy to see when one set of techniques can be transposed to the other.
Kernel ICA (Bach & Jordan 2002) replaces the non-Gaussianity measure in ICA with a kernel-based statistical independence measure. It produces non-linear independent components and is one of the roots of modern kernel-based tests of independence (HSIC, the Hilbert–Schmidt Independence Criterion) that are still used in causal-discovery and feature-selection work.
Treat a probability distribution as a single point in an RKHS by embedding it as the mean feature map. Two distributions can then be compared by the distance between their mean embeddings — the Maximum Mean Discrepancy (MMD). MMD is the kernel version of a two-sample test, the basis of kernel methods for generative model evaluation, and the building block of several modern methods for distribution comparison (Gretton et al. 2012). The machinery is simple, general, and the bridge between kernel methods and probability.
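The unbiased MMD² estimator (Gretton et al. 2012) is a few lines of numpy. A sketch with an RBF kernel; the bandwidth here is fixed arbitrarily, where practice would use the median-distance heuristic.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of MMD^2 between samples X and Y under an RBF kernel."""
    def gram(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    # Unbiased estimator: drop the diagonal of the within-sample Gram matrices.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
# Same distribution: MMD^2 should be near zero.  Shifted distribution: clearly positive.
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(2.0, 1.0, size=(500, 2)))
```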
A Gaussian process is a distribution over functions: a prior that says "I think the function I'm trying to learn is smooth, with correlation structure given by the kernel k." Conditioning this prior on observed data produces a posterior — also a Gaussian process — whose mean is the kernel-ridge-regression predictor and whose variance is a calibrated uncertainty estimate. GPs are kernel methods with uncertainty quantification for free, and they remain the default choice in any regime where "how sure is the model?" matters as much as "what does it predict?".
A stochastic process f is a Gaussian process with mean function m and covariance function (kernel) k — written f ~ GP(m, k) — if for every finite set of inputs x₁, …, xₙ the vector (f(x₁), …, f(xₙ)) has a multivariate Gaussian distribution with mean (m(x₁), …, m(xₙ)) and covariance matrix K. A GP is to a Gaussian random variable what a function is to a number: a generalisation from points to whole function values. The mean function is often set to zero for modelling convenience; the covariance function is the kernel and determines the properties of the sample paths.
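The finite-dimensional definition makes it easy to draw sample paths from the prior: evaluate K on a grid of inputs, factorise, and multiply by standard normals. A sketch assuming a unit-lengthscale RBF kernel, with a small jitter added for numerical stability:

```python
import numpy as np

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-6, seed=0):
    """Draw sample paths f ~ GP(0, k) at the inputs x via a Cholesky factor of K."""
    K = kernel(x[:, None], x[None, :])
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))   # jitter keeps K numerically PD
    rng = np.random.default_rng(seed)
    return L @ rng.normal(size=(len(x), n_samples))       # each column is one sample path

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)            # unit-lengthscale RBF kernel
x = np.linspace(0, 5, 50)
paths = sample_gp_prior(x, rbf)                           # (50, 3): three smooth functions
```

Changing the kernel changes the character of the sample paths, which is the most direct way to see what a covariance function "means".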
Given training data (X, y) and a noise model yᵢ = f(xᵢ) + εᵢ, εᵢ ~ N(0, σ²), the posterior over f at a new point x* is Gaussian with mean
μ(x*) = k(x*)ᵀ (K + σ²I)⁻¹ y
and variance
σ²(x*) = k(x*, x*) − k(x*)ᵀ (K + σ²I)⁻¹ k(x*).
The posterior mean is identical to the kernel-ridge predictor with λ = σ². The variance is the gift: a calibrated measure of how confident the model is at each prediction, larger far from training data (which is what you would want) and smaller where training data is abundant.
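Both posterior formulas are direct numpy. A sketch with an assumed unit-lengthscale RBF kernel and illustrative noise level; the two test inputs are chosen so one sits amid the training data and one far from it.

```python
import numpy as np

def gp_posterior(X, y, Xs, kernel, noise_var=0.01):
    """Posterior mean and variance of a zero-mean GP at test inputs Xs."""
    K = kernel(X[:, None], X[None, :])               # train-train covariances
    Ks = kernel(Xs[:, None], X[None, :])             # test-train cross-covariances k(x*)
    kss = kernel(Xs, Xs)                             # prior variance k(x*, x*) per test point
    M = K + noise_var * np.eye(len(X))
    mean = Ks @ np.linalg.solve(M, y)
    var = kss - np.sum(Ks * np.linalg.solve(M, Ks.T).T, axis=1)
    return mean, var

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)       # unit-lengthscale RBF
X = np.linspace(0, 5, 10)
y = np.sin(X)
Xs = np.array([2.5, 20.0])                           # one point amid the data, one far away
mean, var = gp_posterior(X, y, Xs, rbf)
```

The far-away point recovers the prior variance k(x*, x*), which is exactly the "larger far from training data" behaviour described above.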
GPs have hyperparameters — kernel lengthscales, signal variance, noise variance — that are learned by maximising the log marginal likelihood
log p(y | X, θ) = −(1/2) yᵀ(K_θ + σ²I)⁻¹y − (1/2) log|K_θ + σ²I| − (n/2) log 2π.
The first term is a data-fit term; the second is a complexity penalty (Occam's razor in Bayesian form); the third is a constant. Optimising this is how GPs avoid overfitting despite their apparent flexibility — the complexity term automatically pushes back against over-flexible kernels.
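The three terms can be computed stably through a Cholesky factor, since log|M| = 2 Σᵢ log Lᵢᵢ. A sketch evaluating the expression at a few assumed noise levels (a real implementation would optimise θ with gradients):

```python
import numpy as np

def gp_log_marginal_likelihood(X, y, kernel, noise_var):
    """log p(y|X) = -1/2 y^T M^{-1} y - 1/2 log|M| - n/2 log 2pi, with M = K + sigma^2 I."""
    n = len(X)
    M = kernel(X[:, None], X[None, :]) + noise_var * np.eye(n)
    L = np.linalg.cholesky(M)                        # M = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                      # how well the model explains y
    complexity = -np.sum(np.log(np.diag(L)))         # -1/2 log|M|: Occam penalty
    const = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + const

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
X = np.linspace(0, 5, 20)
y = np.sin(X)
# Evaluate at a few candidate noise variances; hyperparameter learning would
# maximise this quantity over all kernel parameters jointly.
lls = {s2: gp_log_marginal_likelihood(X, y, rbf, s2) for s2 in (1e-4, 1e-2, 1.0)}
```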
Calibrated uncertainty is the superpower. GPs dominate Bayesian optimisation (section 18), active learning, and experimental design — applications where knowing what the model does not know is the whole point. They are the default in spatial statistics (kriging) and the probabilistic-modelling literature of geostatistics and engineering. They work exceptionally well with small datasets and noisy observations, exactly the regimes where deep networks struggle.
GP training is O(n³) and memory is O(n²), the same as any kernel method. Past a few thousand training points, approximations are mandatory. Sparse variational GPs (Titsias 2009) represent the posterior using a small set of inducing points and reduce the cost to O(m²n) where m ≪ n. Modern GPU libraries (GPyTorch, GPflow) make GPs with tens of thousands of points routine; specialised techniques (deep kernel learning, hierarchical GPs) push the frontier further. The scaling story is the same as for SVMs and is covered in full in the next section.
Every full kernel method costs O(n²) in memory and O(n³) in time, which breaks decisively for n past about 10⁵. The fix is to approximate either the kernel matrix or the feature map with something low-rank or random, reducing the cost to linear or near-linear in n. Three approaches dominate: Nyström approximation, random Fourier features, and sparse-inducing-point methods. Each has its niche; all are essential tools in the modern kernel-methods kit.
Choose a subsample of m < n training points and use them to build a low-rank approximation of the full kernel matrix:
K ≈ K_{nm} K_{mm}⁻¹ K_{mn},
where K_{mm} is the kernel matrix on the subsample and K_{nm} is the cross-kernel between full data and subsample. Plugging this into a kernel method reduces cost from O(n²) to O(nm) memory and O(nm²) time. Typical values: m = 1000–10000, which gives excellent accuracy on most problems. The Nyström approximation is the default low-rank approach when the kernel matrix is accessible and subsample selection is easy (uniform random or leverage-score sampling both work).
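The approximation is two lines of numpy once the sub-blocks are formed. A sketch on small 1-D toy data, with a small jitter added to K_{mm} for numerical stability; the data size, m, and kernel bandwidth are all illustrative.

```python
import numpy as np

def rbf_gram(A, B, gamma=0.5):
    """Pairwise RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def nystrom(K_nm, K_mm, jitter=1e-6):
    """Low-rank Nystrom approximation K ~= K_nm K_mm^{-1} K_nm^T."""
    m = len(K_mm)
    return K_nm @ np.linalg.solve(K_mm + jitter * np.eye(m), K_nm.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))                    # small 1-D toy problem
idx = rng.choice(300, size=100, replace=False)   # uniform random subsample, m = 100
K = rbf_gram(X, X)                               # exact kernel matrix (for comparison only)
K_hat = nystrom(rbf_gram(X, X[idx]), rbf_gram(X[idx], X[idx]))
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

In a real large-n setting the exact K is never formed; it appears here only to measure the approximation error.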
Rahimi & Recht's 2007 paper: any translation-invariant kernel k(x − x') is the inverse Fourier transform of a non-negative measure (Bochner's theorem). Sample D random frequencies from that measure, and the Monte Carlo estimate of the integral gives an explicit D-dimensional feature map z(x) such that z(x)ᵀz(x') ≈ k(x, x'). Train a linear model (linear SVM, linear ridge) on the explicit features z(x), and you get a kernel method at linear-time cost. D = 1000–10000 approximates the RBF kernel to high accuracy; the method scales to tens of millions of points.
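For the RBF kernel k(x, x') = exp(−γ‖x − x'‖²), Bochner's theorem gives Gaussian frequencies ω ~ N(0, 2γI). A numpy sketch of the feature map, checked against the exact kernel on toy data (D here is deliberately large so the Monte Carlo error is visibly small):

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Explicit map z with z(x).z(x') ~= exp(-gamma ||x - x'||^2) (Rahimi & Recht 2007)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # Bochner: frequencies ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=D)                  # random phase offsets
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
gamma = 0.5
Z = random_fourier_features(X, D=20000, gamma=gamma)
K_exact = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
max_err = np.max(np.abs(Z @ Z.T - K_exact))
```

Training a linear model on Z is then an ordinary d-by-D problem: the kernel never appears explicitly.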
For Gaussian processes specifically, the inducing-point framework (Quiñonero-Candela & Rasmussen 2005, Titsias 2009) represents the GP posterior using m inducing points — summaries of the training data — reducing the cost to O(nm²). The variational sparse GP (Titsias 2009) makes the inducing points learnable and gives a principled variational lower bound on the marginal likelihood. Stochastic variational GPs (Hensman et al. 2013) extend this with mini-batch training, making GPs feasible on datasets with millions of points.
GPyTorch (Gardner et al. 2018) and GPflow are the modern default GP libraries: GPU-accelerated, integrated with PyTorch/TensorFlow, supporting inducing-point and stochastic variational methods out of the box. KeOps (Charlier et al. 2021) implements GPU kernel matrix–vector products without ever materialising the matrix, making full kernel methods tractable to around 10⁶ points on a single GPU. Falkon (Rudi et al. 2017) is the fastest Nyström-based kernel ridge implementation, scaling to datasets with billions of points.
For GPs with n up to ~10⁴: full GP. For n up to ~10⁶: sparse variational GP (GPyTorch, GPflow). For kernel ridge or SVM with n up to ~10⁷: Nyström (Falkon) or random Fourier features (fastfood, LibLinear on the explicit features). For even larger scales: the kernel method is probably not the right tool — use a neural network or an ensemble.
Kernel methods are the easiest family in this book to use badly. The math is elegant, the libraries are solid, and yet a hastily tuned SVM or GP will usually underperform a moderately tuned boosted-tree baseline, while a well-tuned kernel method will often match or beat it. The practical workflow is about three things: pick the right kernel, tune the two or three hyperparameters carefully, and diagnose failures early.
Start with the linear kernel as a baseline sanity check. If the data is not linearly separable, try the RBF kernel — it is the universal default, works well on most tabular data, and has two tunable hyperparameters (bandwidth γ, regularisation C). If the problem is intrinsically polynomial (e.g., you know the interactions of interest are quadratic), try a polynomial kernel with small degree. Use specialised kernels (string, graph, Fisher) when the input is not a vector.
Kernels are distance-sensitive. Without input normalisation, features with larger scales dominate. The default pre-processing is to standardise each feature to zero mean and unit variance, or to min-max scale to [0, 1]. For the RBF kernel this is essentially mandatory: without it, the bandwidth γ interacts unpredictably with the feature scales and tuning becomes very difficult.
The RBF kernel has two knobs — C and γ — and their interaction is strong, so grid-search over a logarithmic grid is the norm. A reasonable default grid: C ∈ {0.1, 1, 10, 100}, γ ∈ {0.01, 0.1, 1, 10} scaled by the median-distance heuristic. Five-fold cross-validation on this 16-point grid is the pragmatic default and will find a near-optimal setting in 80 model fits. For serious work, Bayesian optimisation over the hyperparameters often finds better settings in fewer fits.
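The grid-search-with-CV loop can be sketched in a few lines. To stay self-contained, this sketch tunes kernel ridge regression's (γ, λ) rather than SVC's (C, γ); the workflow is identical, and every constant in it is illustrative.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def cv_error(X, y, gamma, lam, n_folds=5):
    """Mean squared validation error of kernel ridge over n_folds folds."""
    errs = []
    for f in np.array_split(np.arange(len(X)), n_folds):
        mask = np.ones(len(X), bool)
        mask[f] = False                                  # hold out fold f
        Xtr, ytr = X[mask], y[mask]
        alpha = np.linalg.solve(rbf_gram(Xtr, Xtr, gamma) + lam * np.eye(len(Xtr)), ytr)
        pred = rbf_gram(X[f], Xtr, gamma) @ alpha
        errs.append(np.mean((pred - y[f]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# Logarithmic grid over the two knobs, as in the text.
grid = [(g, l) for g in (0.01, 0.1, 1, 10) for l in (1e-3, 1e-2, 1e-1, 1.0)]
best = min(grid, key=lambda p: cv_error(X, y, *p))
```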
The median-distance heuristic: compute the pairwise distances ‖xᵢ − xⱼ‖ on a random subsample of the data and take the median, then set γ = 1 / (median)². This is almost always within a factor of 2 of the optimal bandwidth and is the right starting point for cross-validation.
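The heuristic in numpy, as a sketch (subsample size and toy data are illustrative):

```python
import numpy as np

def median_heuristic_gamma(X, n_sub=500, seed=0):
    """gamma = 1 / median(||x_i - x_j||)^2 over a random subsample of the data."""
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=min(n_sub, len(X)), replace=False)]
    d2 = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=2)
    med = np.median(np.sqrt(d2[np.triu_indices(len(S), k=1)]))  # distances with i < j
    return 1.0 / med**2

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
gamma = median_heuristic_gamma(X)
```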
Un-normalised inputs — the hyperparameter you tune depends on arbitrary feature scales, and nothing generalises. An RBF γ off by 100× — the decision surface collapses to either constants (too-small γ) or nearest-neighbour-like islands (too-large γ). C too small — the method underfits and looks like a weak linear model. C too large — the method memorises the training data and has terrible test accuracy. A kernel matrix that is not positive semi-definite — a user-defined similarity function that happens not to be PSD is a silent bug that makes the QP solver misbehave.
Kernel ridge regression: small-to-medium regression problems, closed-form solution, simplest kernel method. Kernel SVM: small-to-medium classification problems, sparse solution at prediction time. SVR: same regime as SVM but for regression, when you want sparsity. Gaussian processes: when calibrated uncertainty matters — Bayesian optimisation, active learning, experimental design. Kernel PCA / kernel clustering: dimensionality reduction and clustering with non-linear structure. All of the above with Nyström or random Fourier approximations when n gets past 10⁵.
Kernel methods are not the dominant paradigm in 2026 — deep learning is — but their influence on modern ML is pervasive, and in several regimes they remain the state of the art. The vocabulary of feature maps, positive definiteness, the representer theorem, and the RKHS norm shows up constantly in contemporary theoretical work. Several modern techniques are kernel methods in disguise. And when small data, calibrated uncertainty, or theoretical rigour are the constraints, kernels are still the first thing to try.
One of the most unexpected results of the past decade: in the limit of infinite width, a gradient-trained neural network is exactly a kernel method (Jacot, Gabriel & Hongler 2018). The neural tangent kernel (NTK) is the kernel defined by the infinite-width limit of the network's feature gradients, and it predicts the behaviour of finite neural networks remarkably well — at least early in training. The NTK is the bridge from the kernel-methods literature to the theory of deep learning, and a huge amount of recent theoretical work on neural networks uses kernel-methods machinery to make claims about deep networks.
GPs are the right tool whenever "calibrated uncertainty" and "small data" are the key constraints. They dominate Bayesian optimisation — the de facto standard for hyperparameter tuning of expensive functions (model training, chemical experiments, engineering design). They are a core tool in active learning, where the decision of "which point should I label next?" is a direct function of GP posterior variance. They are the backbone of modern scientific ML in geostatistics, climate modelling, and robot control where training-data collection is expensive.
In many domains — particularly small-to-medium tabular data — a well-tuned kernel SVM or kernel ridge still matches or beats gradient-boosted trees and neural networks. This is less widely appreciated than it should be, because kernel methods do not scale to the sizes where deep networks shine. If your training set is under 10 000 points, a kernel method is almost always worth trying as a baseline, and often worth deploying.
Maximum mean discrepancy (Gretton et al. 2012) is a kernel-based two-sample test — "are these two samples drawn from the same distribution?" — that is differentiable, closed-form, and used as a regulariser in modern generative-model training. Kernel methods for independence testing (HSIC) show up in causal discovery and feature selection. The kernel-based framework for comparing distributions is one of the most useful modern bridges between classical statistics and deep-learning-style optimisation.
Even when kernels are not the tool, their vocabulary has won. Every modern deep network is doing some kind of implicit feature extraction followed by a linear operation on features — the feature-map-plus-linear-classifier architecture is the kernel-methods architecture. The notion that capacity is controlled by a norm of the learned function (not by the number of parameters) was kernel thinking that deep learning adopted without always acknowledging. Generalisation bounds based on margin, on Rademacher complexity, on the RKHS norm — all originated in the kernel-methods literature and now appear in every paper on deep-learning theory.
Kernel methods are the most mathematically elegant corner of classical machine learning — with Hilbert-space theory, convex-optimisation duality, and statistical learning theory all converging on the same family of algorithms. The references below split into anchor textbooks that define the field, foundational papers from Aronszajn's 1950 RKHS paper through Cortes–Vapnik's 1995 soft-margin SVM to Jacot et al.'s 2018 neural tangent kernel, modern extensions that carry kernel ideas into the deep-learning era, and the software documentation where all serious kernel work happens. If you only read one textbook, read Schölkopf & Smola's Learning with Kernels.
libsvm is the reference kernel-SVM implementation: scikit-learn's SVC and SVR wrap it internally; LIBLINEAR implements specialised coordinate-descent solvers for the linear-kernel case and scales to millions of samples and features. The documentation is terse but complete; A Practical Guide to Support Vector Classification (Hsu, Chang & Lin) is the single most useful ten-page document on day-to-day SVM use. If you want to understand what SVC is actually doing under the hood, this is the source.
sklearn.svm.SVC and SVR wrap libsvm; LinearSVC wraps LIBLINEAR; sklearn.kernel_ridge.KernelRidge provides the exact kernel-ridge regressor; sklearn.decomposition.KernelPCA implements kernel PCA; sklearn.kernel_approximation provides the Nyström and random-Fourier-feature explicit approximators. The User Guide sections on SVMs and kernel approximation are unusually good. For 80% of kernel-methods work in Python, this is the whole toolkit.
This page is Chapter 07 of Part IV: Classical Machine Learning. Chapter 08 turns from kernel methods — which invite us to let the kernel do the non-linear work so the algorithm can stay linear — to feature engineering and selection, which invites us to do the non-linear work by hand, carefully, so the algorithm and the features together tell a story a human can read. The two philosophies are complementary: kernels work best when we don't know what the right features are; feature engineering works best when we do. Chapter 09 then closes the classical-ML arc with model evaluation and selection — the measurement discipline that tells us which of the preceding seven chapters' techniques actually wins on any given problem.