Almost every interesting dataset lives in a high-dimensional space — images are vectors of millions of pixels, documents are vectors of tens of thousands of word counts, gene-expression samples are vectors of twenty thousand transcript levels — and almost none of the techniques humans use to think about data scale past three dimensions. Dimensionality reduction is the discipline that bridges the gap: it replaces a high-dimensional representation with a low-dimensional one that preserves as much of the structure that matters as possible, so that clustering, classification, visualisation, storage, and retrieval all become tractable. The oldest method, principal component analysis, is older than computers — Pearson wrote down its geometry in 1901, Hotelling independently rederived it statistically in 1933 — and the field has been accumulating techniques ever since: linear (PCA, SVD, NMF, factor analysis, ICA, random projection, LDA), classical nonlinear (MDS, Isomap, LLE, Laplacian eigenmaps, spectral embedding), modern nonlinear (t-SNE, UMAP), and neural (autoencoders, their denoising and variational cousins, contrastive self-supervised encoders). This chapter develops each family, the assumptions it encodes, and the task it is best suited for. The last two sections are the practical throughline — how to choose the target dimensionality, how to evaluate a reduction, and how reductions sit at the heart of every modern ML system that handles text, images, audio, or recommendations.
Section one motivates the problem by making the curse of dimensionality concrete — the handful of geometric pathologies that make learning, distance, and density estimation misbehave as d grows, and the parallel observation that most high-dimensional data secretly lives on a far lower-dimensional surface. Section two lays out the taxonomy that the rest of the chapter inhabits: feature selection (keep a subset of original features, covered properly in Chapter 08) versus feature extraction (learn new features as functions of the old ones), and within extraction, linear versus nonlinear, unsupervised versus supervised. Sections three and four develop principal component analysis and the singular value decomposition that implements it — the canonical workhorse of the field, the method everyone should try first, and the linear-algebra machinery that turns up again and again in later chapters. Section five covers the PCA extensions that matter in practice: kernel PCA (the nonlinear generalisation via the kernel trick), sparse PCA (interpretable loadings), robust PCA (decomposition into a low-rank matrix plus a sparse outlier matrix), and incremental PCA (streaming and out-of-core training). Section six is factor analysis, PCA's probabilistic cousin, which adds a noise model and clarifies what "variance explained" actually means.
Sections seven and eight cover two more linear-algebra-based reductions with distinctive properties: non-negative matrix factorisation (parts-based decompositions, widely used for topic modelling and recommendation) and independent component analysis (which maximises non-Gaussianity instead of variance and recovers statistically independent sources, the canonical technique for blind source separation). Section nine is random projection, the surprising result that projecting onto a random low-dimensional subspace preserves pairwise distances up to a logarithmic factor — the Johnson–Lindenstrauss lemma, the basis of a family of fast approximate methods. Section ten turns to supervised reduction: linear discriminant analysis and its extensions (Fisher's LDA, regularised LDA, quadratic discriminant analysis) project data in a way that maximises between-class separation rather than total variance. Section eleven opens the nonlinear-manifold family: classical multidimensional scaling and Isomap, which both preserve distances (Euclidean for MDS, geodesic along the data manifold for Isomap). Section twelve continues the family with locally linear embedding, Laplacian eigenmaps, and spectral embedding — methods that preserve local neighbourhood structure rather than global distances.
Sections thirteen and fourteen are the methods that dominate modern visualisation of high-dimensional data: t-SNE (the probabilistic neighbourhood-embedding technique whose success rewrote the field in 2008) and UMAP (the 2018 manifold-learning method that became the default almost immediately). Section fifteen covers autoencoders — the neural-network framing of dimensionality reduction — with stops at undercomplete, denoising, sparse, contractive, and variational autoencoders, and a brief look at how the same idea animates modern contrastive and masked self-supervised learning. Section sixteen is dimensionality reduction in practice: how to choose the target dimension (scree plot, variance explained, cross-validated downstream performance), how to evaluate a reduction (reconstruction error, neighbourhood preservation, trustworthiness, continuity), and the operational recipe for slotting a reduction into a data-science workflow. Section seventeen closes with where dimensionality reduction sits inside modern ML — representation learning, embeddings as the substrate of retrieval, compression for efficient inference, and why every foundation model is, from a certain angle, an enormous nonlinear dimensionality reducer.
Real datasets have more dimensions than they need. A photograph at 1024×1024 is a vector in a space of a million dimensions, but the set of photographs of cats is not a million-dimensional cloud — it's a thin sliver, maybe a few thousand degrees of freedom wide, curving through the pixel space. Dimensionality reduction exists because that sliver is what we actually care about, and everything goes better when we can compute in a coordinate system aligned to it rather than to the raw axes.
Richard Bellman coined the phrase curse of dimensionality in 1961 to describe what happens when an otherwise sensible algorithm is run in high dimensions. The pathologies are geometric. Volume concentrates on the boundary: a hypersphere of radius 1 in d dimensions has almost all its volume in a thin shell near the surface as d → ∞. Pairwise distances concentrate: for random points, the ratio between the largest and smallest pairwise distance tends to 1, so nearest-neighbour queries become meaningless. Data becomes sparse: uniformly sampling a hypercube with n points gives a typical nearest-neighbour distance that shrinks only like n^(−1/d), so keeping a given sampling density as d grows requires exponentially many samples. Density estimation becomes impossible: histograms and kernel density estimates need exponentially many samples per dimension to maintain a fixed error.
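The distance-concentration effect is easy to see numerically. The sketch below (an illustrative simulation with an arbitrary sample size and a fixed seed) draws random points in the unit hypercube and reports the ratio of the largest to the smallest pairwise distance as d grows:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = []
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))
    # squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    D2 = np.maximum(D2[np.triu_indices(200, k=1)], 1e-12)  # guard float error
    ratios.append(float(np.sqrt(D2.max() / D2.min())))
    print(f"d={d:5d}  max/min pairwise distance ratio = {ratios[-1]:.2f}")
```

The ratio is large in two dimensions and collapses towards 1 by d = 1000 — exactly the regime where nearest-neighbour reasoning stops meaning anything.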
The consequence is that every algorithm whose workings depend on density, distance, or local neighbourhoods starts to misbehave once d is past about 50, and most of them are completely unusable past about 1000. Clustering (Chapter 04), nearest-neighbour classification, kernel density estimation, Gaussian process regression, every method that depends on "points near this query look like it" runs into trouble. Dimensionality reduction is the first-line response: shrink d until the geometry is tractable again, then run the downstream method in the reduced space.
The manifold hypothesis is the empirical observation that makes reduction work. Natural high-dimensional data — images, audio, text embeddings, gene-expression profiles — does not fill its ambient space uniformly. It concentrates on a manifold (a lower-dimensional curved surface) of much smaller intrinsic dimension. The space of all natural 256×256 grayscale images is a minuscule subset of R^65536, and the intrinsic dimension of familiar image classes is often estimated to be in the dozens, not the tens of thousands. The same is true of text: sentence embeddings with 768 dimensions have an intrinsic dimensionality of perhaps 20–50, depending on the corpus. Every dimensionality-reduction method is, one way or another, an attempt to find and parameterise this manifold.
The motivations for applying dimensionality reduction split into four loose categories. Visualisation: humans can parse 2-D and 3-D scatter plots but not 50-D ones, so projecting to 2 or 3 dimensions is the default first-look technique for any new dataset. Computation: many algorithms run in time polynomial in d, so reducing d from 10 000 to 100 turns an intractable problem into a fast one; storage and memory scale similarly. Denoising and regularisation: projecting data onto a lower-dimensional subspace discards directions of low variance, which are often noise, improving downstream predictive accuracy. Structure discovery: PCA on gene-expression data reveals cell types as clusters in the first few components; ICA on EEG recordings separates brain sources from muscle artefacts; NMF on a term-document matrix yields interpretable topics. The same mathematical operation serves all four purposes, which is why dimensionality reduction is the quiet prerequisite for most of data science.
Two choices define any reduction: the target dimension d, and the structure the projection should preserve. The second — variance, pairwise distances, local neighbourhoods, class separation, reconstructibility — determines which family of methods to reach for.
A reduction that loses the signal you cared about is worse than no reduction at all. PCA on a dataset whose class structure lies in low-variance directions will wipe out exactly what you wanted to keep. t-SNE on a dataset with no local neighbourhood structure will manufacture the illusion of clusters where none exist. Dimensionality reduction is a lossy transformation with priorities; knowing the priorities is the whole game. The rest of the chapter is a tour of which method prioritises what, and how to tell the difference.
All dimensionality-reduction methods answer the same question — "which d-dimensional summary of this data should we use?" — but they split along three axes that determine which method applies to which problem. Setting the axes out before the algorithms is useful because it makes the differences between methods feel less arbitrary.
Feature selection keeps a subset of the original features and discards the rest. The output is still interpretable in the original units — "age, income, and zipcode" instead of "all thirty fields" — and the method returns a decision for each feature: in or out. Chapter 08 develops feature selection in its own right, with filter methods (mutual information, correlation, chi-squared), wrapper methods (forward, backward, recursive feature elimination), and embedded methods (lasso, tree-based importance). Feature extraction, which is the subject of this chapter, constructs new features as functions of the old ones. The output is d new coordinates, each a mixture of the original features, and interpretability depends on how that mixture can be labelled. Selection produces subsets; extraction produces projections.
An extraction is linear if each new feature is a linear combination of the original features — the reduction is a matrix multiplication Z = X W for some loading matrix W. PCA, SVD, NMF, factor analysis, ICA, LDA, random projection, and classical MDS all fall here. Linear methods are fast (closed-form via eigendecomposition or SVD), have a clear geometric interpretation (projection onto a subspace), and are the right answer when the manifold is well-approximated by a hyperplane. They are the wrong answer when the data curves — and most real data curves past a few dozen dimensions. Nonlinear methods allow the new coordinates to be arbitrary functions of the old ones. Isomap, LLE, Laplacian eigenmaps, kernel PCA, t-SNE, UMAP, and autoencoders are nonlinear. They are more expressive but also more delicate — they have hyperparameters, they don't always converge to the same answer on repeated runs, and their outputs can be harder to interpret.
An unsupervised reduction uses only the features X and chooses a projection that optimises some criterion on X alone — variance (PCA), reconstruction (autoencoders), neighbourhood preservation (t-SNE, UMAP), independence (ICA). The resulting reduction is agnostic to any downstream task, which is both its strength and its weakness: it is reusable across tasks, but it may not preserve the class structure that a supervised model cares about. A supervised reduction uses the labels y as well, picking projections that maximise class separation (LDA) or predictiveness (partial least squares). Supervised reductions produce features that are better for prediction on y but may throw away structure useful for other tasks. In practice, the unsupervised methods are used more often because the downstream task is rarely the only thing to worry about.
Crossing the axes gives a landscape with three practical camps. Linear unsupervised — PCA, SVD, NMF, ICA, random projection — is where you should start. Fast, stable, closed-form, universally understood; if it works, stop. Nonlinear unsupervised for visualisation — t-SNE, UMAP, kernel PCA, Laplacian eigenmaps — is where you reach when the linear methods miss structure that is qualitatively obvious by eye. Nonlinear unsupervised for learned representations — autoencoders, contrastive and masked self-supervised encoders — is where you end up when the reduction needs to be differentiable, invertible, or integrated into a neural training loop. Supervised reductions (LDA, PLS) are useful in specific contexts but rarely the first tool.
Three terms recur across the chapter. The intrinsic dimension of a dataset is the smallest d at which a reduction can preserve the structure of interest without loss; it is an empirical quantity that can be estimated (maximum-likelihood estimators, correlation-dimension estimators, topological estimators) and is often much smaller than the ambient D. The embedding is the n × d output matrix of reduced coordinates — the data in its new coordinate system. The loadings or mixing matrix is the D × d matrix (for linear methods) mapping old features to new. Interpretability comes from looking at the loadings: which original features are heavily weighted in each new component, and does that weighting make sense to a domain expert?
Principal component analysis (PCA) is the single most-used dimensionality reduction in all of science. Given a centred data matrix X with n rows and D columns, PCA finds the orthogonal directions in feature space along which X varies most, and projects the data onto the top d of them. It is the canonical linear unsupervised reduction — the method you should try first, and often the only one you need.
Centre the data so each feature has mean zero. The covariance matrix is S = (1/n) X^T X, a D × D symmetric positive-semidefinite matrix whose entries capture pairwise feature covariances. Consider a unit vector w ∈ R^D; the variance of the data projected onto w is w^T S w. The first principal component is the w_1 that maximises this subject to ‖w_1‖ = 1. The solution, by the Rayleigh quotient theorem, is the eigenvector of S with the largest eigenvalue λ_1, and the maximal variance is λ_1 itself. The second principal component is the w_2 that maximises variance subject to orthogonality with w_1; it is the eigenvector corresponding to the second-largest eigenvalue. Continuing, the k-th principal component is the k-th eigenvector of S, and the total variance captured by the first d components is the sum of the first d eigenvalues divided by the sum of all eigenvalues.
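The derivation translates directly into a few lines of linear algebra: centre, form the covariance, eigendecompose, project. A minimal sketch (the synthetic anisotropic data and the choice d = 2 are illustrative):

```python
import numpy as np

def pca_eig(X, d):
    Xc = X - X.mean(axis=0)              # centre each feature
    S = (Xc.T @ Xc) / len(Xc)            # covariance matrix, D x D
    evals, evecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]      # sort descending
    W = evecs[:, order[:d]]              # top-d principal directions
    Z = Xc @ W                           # projected coordinates, n x d
    var_explained = evals[order[:d]].sum() / evals.sum()
    return Z, W, var_explained

# Anisotropic Gaussian: almost all variance lies in two of five directions.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) * np.array([5.0, 1.0, 0.5, 0.2, 0.1])
Z, W, ve = pca_eig(X, 2)
print(f"variance explained by 2 of 5 components: {ve:.3f}")
```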
PCA has a second, equivalent derivation that is sometimes more illuminating. Consider the problem: find the d-dimensional subspace of R^D that minimises the average squared distance from the data to its projection onto the subspace. The Eckart–Young theorem (1936) says the optimal subspace is exactly the span of the top d eigenvectors of S, and the minimum reconstruction error is Σ_{k=d+1}^{D} λ_k — the variance in the directions you discarded. So PCA is simultaneously the projection that preserves the most variance and the projection that has the smallest reconstruction error. These are the same optimisation in different coordinates.
Three algorithms are in practical use. Eigendecomposition of the covariance matrix: form the D × D matrix S, diagonalise it, keep the top d eigenvectors. Fast when D ≪ n, quadratic in D for forming S, cubic in D for the eigendecomposition. SVD of the data matrix: factor X = U Σ V^T with U an n × n orthogonal matrix, V a D × D orthogonal matrix, and Σ diagonal. The principal components are the columns of V, and the projections are given by U Σ. Numerically preferred because it avoids forming S explicitly, which squares the condition number and loses precision. Randomised SVD: use random projections to approximate the top d components in time O(n D d), suitable for very large matrices. Section 04 develops the SVD route in detail.
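The first two routes give identical answers, which is worth verifying once. The sketch below (random data, fixed seed) checks that the right singular vectors of the centred data match the covariance eigenvectors up to sign, and that the eigenvalues are σ_k²/n:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
order = np.argsort(evals)[::-1]

# covariance eigenvalues are the squared singular values over n
eig_match = np.allclose(s**2 / len(Xc), evals[order])
# eigenvectors agree with the right singular vectors up to a sign per direction
signs = np.sign(np.sum(Vt.T * evecs[:, order], axis=0))
vec_match = np.allclose(Vt.T * signs, evecs[:, order], atol=1e-6)
print(eig_match, vec_match)
```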
PCA is a variance-maximising method, so any feature with large scale will dominate the top components regardless of its information content. Before running PCA you should either standardise each feature to unit variance (the default, equivalent to running PCA on the correlation matrix) or have a strong reason not to. The exception: when the features are already in the same physical units and their raw variances carry information — a set of sensor readings in volts, or pixel intensities from a camera — you may prefer the covariance-matrix version. When in doubt, standardise.
Three diagnostic plots tell you what PCA has done. The scree plot shows eigenvalues in descending order — the drop from large to small eigenvalues is the elbow that guides choice of d. The cumulative-variance plot shows Σ_{k=1}^d λ_k / Σ_{k=1}^D λ_k as a function of d, and the cut at "90 % explained" is a common default. The loadings plot shows the coefficients of each component in terms of the original features, and is the tool that makes PCA interpretable: if the first component weights "height", "weight", and "foot length" equally, it represents overall size.
A common use of PCA is not as an end in itself but as preprocessing for a subsequent method: run PCA to reduce 10 000 dimensions to 50, then apply k-means, Gaussian mixtures, kernel methods, neural networks, or visualisation. The reduction makes the subsequent algorithm faster, more stable, and often more accurate (the discarded directions are usually noise). Whiten after PCA — divide each projected feature by the square root of its eigenvalue — if the downstream method is not scale-invariant, which most of them aren't.
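Whitening is a one-line operation once the SVD is in hand: divide each projected coordinate by the square root of its eigenvalue, σ_k/√n. A sketch (the synthetic data and the choice d = 3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 6)) * np.array([10.0, 3.0, 1.0, 0.5, 0.2, 0.1])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

d = 3
Z = Xc @ Vt[:d].T                     # plain PCA projection
Zw = Z / (s[:d] / np.sqrt(len(Xc)))   # whitened: divide by sqrt(eigenvalue)
print(np.var(Zw, axis=0))             # every retained direction has unit variance
```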
The singular value decomposition is the factorisation that underlies not just PCA but most of linear dimensionality reduction. It works on any matrix (real or complex, square or rectangular, full-rank or not), always exists, and is numerically stable. If you know one decomposition in numerical linear algebra, let it be the SVD.
Every n × D real matrix X admits a factorisation X = U Σ V^T where U is an n × n orthogonal matrix (U^T U = I), V is a D × D orthogonal matrix (V^T V = I), and Σ is an n × D rectangular diagonal matrix with non-negative entries σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0 (where r = rank(X)) and zeros elsewhere. The σ_i are the singular values. The columns of U are the left singular vectors, the columns of V are the right singular vectors.
X maps unit vectors in R^D to vectors in R^n. The SVD says: the map is a rotation in the input space (V^T), followed by a scaling along the coordinate axes (Σ), followed by a rotation in the output space (U). Every linear map does these three things. The singular values are the scaling factors; the right singular vectors are the input directions that get stretched independently; the left singular vectors are where those directions end up in the output.
If X is mean-centred, then X^T X = V Σ^T Σ V^T = V Σ^2 V^T, which is an eigendecomposition of the (scaled) covariance matrix. The right singular vectors V are therefore the principal components, and the squared singular values σ_k^2 are the eigenvalues of n S. The projection of X onto the first d principal components is X V_d = U_d Σ_d, the first d columns of U Σ. PCA is SVD with the covariance-computation step folded into the factorisation; every PCA implementation in every major library is actually an SVD.
The SVD gives the optimal rank-d approximation to X in both Frobenius and spectral norm. Take the first d singular values and vectors: X_d = U_d Σ_d V_d^T. The Eckart–Young theorem (1936) says X_d is the matrix of rank at most d closest to X in Frobenius norm, and the approximation error is ‖X − X_d‖_F^2 = Σ_{k=d+1}^{r} σ_k^2. This is the mathematical content of "variance explained" in PCA — discarded variance equals squared reconstruction error equals sum of squared discarded singular values.
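The Eckart–Young identity is easy to verify numerically. A sketch on a random matrix: the squared Frobenius error of the rank-d truncation equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 40))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

d = 10
X_d = (U[:, :d] * s[:d]) @ Vt[:d]         # rank-d truncation U_d Sigma_d V_d^T
err = np.linalg.norm(X - X_d, "fro") ** 2
discarded = np.sum(s[d:] ** 2)
print(err, discarded)                      # the two numbers agree
```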
The same decomposition underlies many other techniques. Latent semantic analysis is truncated SVD on a term-document matrix: the first d singular vectors give a document embedding space where semantically similar documents are close. Collaborative filtering via matrix factorisation fits a low-rank approximation to a user-item ratings matrix, with the singular structure capturing latent taste dimensions. Low-rank approximation for compression: storing X_d costs d(n + D + 1) values versus nD for X, a meaningful saving when d ≪ min(n, D). Total least squares regression uses the smallest singular vector of the augmented [X | y] matrix to find the direction of minimum variance. Pseudo-inverse for least squares: X^+ = V Σ^+ U^T where Σ^+ inverts the non-zero singular values, giving the minimum-norm least-squares solution.
Classical algorithms (Golub–Kahan bidiagonalisation, divide and conquer) compute the full SVD in O(min(n D^2, D n^2)) time, which is acceptable for min(n, D) up to a few thousand. For large matrices, truncated SVD (Lanczos methods, ARPACK) computes only the top d singular triples in O(n D d) time. For very large matrices, randomised SVD (Halko, Martinsson, Tropp, 2011) uses random projections to approximate the top d singular triples in a small, fixed number of passes over the data and is the default in modern libraries (scikit-learn's TruncatedSVD, PyTorch's svd_lowrank). The accuracy-vs-speed trade-off is controlled by the size of the random sketch and the optional power iterations.
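The randomised route fits in a dozen lines. The sketch below follows the Halko–Martinsson–Tropp recipe — random sketch, power iterations, QR, small exact SVD — with illustrative (not tuned) oversampling and iteration counts:

```python
import numpy as np

def randomized_svd(X, d, oversample=10, n_power=2, seed=0):
    rng = np.random.default_rng(seed)
    n, D = X.shape
    Omega = rng.normal(size=(D, d + oversample))  # random sketch directions
    Y = X @ Omega
    for _ in range(n_power):                      # power iterations sharpen the range
        Y = X @ (X.T @ Y)
    Q, _ = np.linalg.qr(Y)                        # orthonormal basis for range(X)
    B = Q.T @ X                                   # small (d+oversample) x D matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :d], s[:d], Vt[:d]

# Compare against the exact top singular values on low-rank-plus-noise data.
rng = np.random.default_rng(5)
X = (rng.normal(size=(500, 20)) @ rng.normal(size=(20, 300))
     + 0.01 * rng.normal(size=(500, 300)))
Ua, sa, Vta = randomized_svd(X, d=10)
s_exact = np.linalg.svd(X, compute_uv=False)[:10]
rel_err = np.max(np.abs(sa - s_exact) / s_exact)
print(f"max relative error on top-10 singular values: {rel_err:.2e}")
```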
The SVD is numerically stable, but two things can go wrong. First, forming X^T X explicitly (then eigendecomposing) squares the condition number and loses about half the floating-point precision; always SVD the data matrix directly. Second, for very skewed matrices where n ≫ D or D ≫ n, use the economy SVD (only min(n, D) singular values and the corresponding vectors) — the full SVD wastes memory storing a large orthogonal factor whose extra columns span the null space.
PCA has three classic weaknesses — it is linear, its components are dense mixtures of all features, and it is sensitive to outliers — plus a practical one: it needs the full data matrix in memory. Each weakness has its own generalisation. These four variants are what you reach for when vanilla PCA fails for a reason you can name.
Kernel PCA (Schölkopf, Smola, Müller, 1998) generalises PCA to nonlinear reductions via the kernel trick. Choose a kernel k(x, y) implicitly mapping data into a high-dimensional feature space φ(x) — Gaussian RBF exp(−γ‖x − y‖²), polynomial (x^T y + c)^p, sigmoid — and do PCA in that feature space without ever computing φ(x) explicitly. The algorithm operates on the Gram matrix K with K_{ij} = k(x_i, x_j), centres it, eigendecomposes it, and uses the top d eigenvectors as the embedding. Computational cost is O(n^3) for the eigendecomposition and O(n^2) memory for storing K, which limits it to datasets of a few thousand points. Kernel PCA is the principled nonlinear extension of PCA and underpins many later manifold methods (Isomap is kernel PCA with a geodesic-distance kernel; Laplacian eigenmaps are kernel PCA on the graph Laplacian).
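The whole algorithm is a Gram-matrix eigendecomposition. A compact sketch with an RBF kernel (the two-ring dataset and the choice γ = 0.5 are illustrative), showing that the first kernel component separates structure a linear projection cannot:

```python
import numpy as np

def kernel_pca(X, d, gamma):
    # Gram matrix for the RBF kernel exp(-gamma ||x - y||^2)
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    Kc = J @ K @ J                             # centre in feature space
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:d]
    # scale eigenvectors by sqrt(eigenvalue) to get embedded coordinates
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0))

# Two concentric rings: not linearly separable, but the leading kernel-PCA
# coordinate tracks radius and pulls the rings apart.
rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(theta), r * np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
Z = kernel_pca(X, d=2, gamma=0.5)
sep = abs(Z[:100, 0].mean() - Z[100:, 0].mean())
print(f"ring separation along the first component: {sep:.2f}")
```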
Vanilla PCA components are dense: every original feature contributes to every principal component. This is fine for prediction but dreadful for interpretation — if you want to name the third component, looking at 10 000 loadings is useless. Sparse PCA (Zou, Hastie, Tibshirani, 2006; d'Aspremont et al., 2007) adds an L1 penalty to the component search, forcing each principal component to involve only a small subset of features. The result is interpretable — the third component might now involve only "height", "weight", and "foot length" — at the cost of some variance explained. Sparse PCA trades off two hyperparameters: d (number of components) and the sparsity level per component. It is used extensively in gene-expression analysis, where component interpretability is often more valuable than last-percent variance.
Classical PCA minimises squared reconstruction error, which makes it extremely sensitive to outliers — a single wild point can rotate the leading components by tens of degrees. Robust PCA (Candès, Li, Ma, Wright, 2011) assumes the data matrix is X = L + S where L is low-rank (the clean signal) and S is sparse (the outliers), and recovers both by solving a convex program minimising the nuclear norm of L plus the L1 norm of S. When the assumptions hold, the recovery is exact with high probability. The canonical application is surveillance video: L is the static background (low rank across frames), S is the moving foreground (sparse per frame). Related techniques — the robust PCA of Ke and Kanade (L1 norm on residuals, 2005), the projection-pursuit-based ROBPCA (Hubert, Rousseeuw, Vanden Branden, 2005) — are older alternatives with different robustness models.
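The convex program can be attacked with a short alternating scheme — ADMM, with singular-value thresholding on L and entrywise soft thresholding on S. The sketch below uses a common heuristic for the penalty parameter μ and is for intuition, not a production solver:

```python
import numpy as np

def robust_pca(X, lam=None, n_iter=200):
    n, D = X.shape
    lam = lam or 1 / np.sqrt(max(n, D))       # standard lambda from the paper
    mu = n * D / (4 * np.abs(X).sum())        # common step-size heuristic
    Y = np.zeros_like(X)                      # dual variable
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # L-step: singular-value thresholding of X - S + Y/mu
        U, s, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1 / mu, 0)) @ Vt
        # S-step: entrywise soft threshold at lambda/mu
        R = X - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
        Y = Y + mu * (X - L - S)              # dual ascent
    return L, S

# Low-rank signal plus sparse large spikes: both parts are recovered.
rng = np.random.default_rng(7)
L_true = rng.normal(size=(80, 5)) @ rng.normal(size=(5, 60))
S_true = np.zeros((80, 60))
idx = rng.random((80, 60)) < 0.05
S_true[idx] = rng.normal(scale=10, size=idx.sum())
L_hat, S_hat = robust_pca(L_true + S_true)
rel = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
print(f"relative error on the low-rank part: {rel:.4f}")
```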
Classical PCA needs the full n × D data matrix in memory. For datasets too large to hold at once — streams, out-of-core training, terabyte-scale imaging data — incremental PCA updates the principal components one mini-batch at a time. The standard batch-update algorithm (Ross, Lim, Lin, Yang, 2008) merges the top d components of a new batch with the existing components by SVD on the stacked loading matrices. Scikit-learn's IncrementalPCA follows this approach. The same idea underlies streaming PCA algorithms (Mitliagkas et al., 2013; Oja's rule in its original 1982 formulation) that update components one sample at a time. The trade-off is that incremental PCA computes an approximation to the exact PCA that converges to the true components as the data stream grows; for very small streams the early components can be biased.
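Oja's rule is the simplest member of this family: a single weight vector updated one streaming sample at a time, converging to the leading principal direction. A sketch with an illustrative fixed step size and a diagonal true covariance:

```python
import numpy as np

rng = np.random.default_rng(8)
scales = np.array([3.0, 2.0, 1.0, 0.5])   # covariance diag(9, 4, 1, 0.25)
w = rng.normal(size=4)
w /= np.linalg.norm(w)
eta = 0.005                                # illustrative fixed step size
for t in range(30000):
    x = rng.normal(size=4) * scales        # one streaming sample
    y = x @ w                              # current projection
    w += eta * y * (x - y * w)             # Oja update; keeps ||w|| near 1
align = abs(w[0]) / np.linalg.norm(w)      # true top eigenvector is e_1
print(f"alignment with the true top eigenvector: {align:.3f}")
```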
A few further variants deserve mention. Probabilistic PCA (Tipping & Bishop, 1999) recasts PCA as a maximum-likelihood estimate of a latent-variable model with isotropic Gaussian noise; it is the bridge to factor analysis (Section 06) and to the Bayesian extensions (Bayesian PCA, which treats the number of components as a random variable to be inferred). Exponential-family PCA replaces the Gaussian likelihood with a distribution appropriate to the data type — Bernoulli for binary data, Poisson for counts — and is the dimensionality-reduction analogue of generalised linear models. Functional PCA operates on curves rather than vectors and decomposes a dataset of functions (spectra, growth curves, trajectories) into a set of basis functions ordered by variance explained. Each is a principled extension of the same eigenstructure reasoning.
Factor analysis predates PCA — Spearman introduced it in 1904 to study general intelligence as a latent variable underlying correlated test scores — but in modern terms it is better understood as PCA with the addition of a noise model. That difference is small mathematically and large philosophically: factor analysis is a generative probabilistic model, not a pure projection technique, and the things you can do with a generative model — likelihood comparison, missing-data imputation, model selection — are unavailable to PCA.
Factor analysis assumes each observed vector x ∈ R^D is generated as x = μ + W z + ε, where z ~ N(0, I_d) is a latent d-dimensional factor vector, W is a D × d loading matrix, and ε ~ N(0, Ψ) is Gaussian noise with diagonal covariance Ψ = diag(ψ_1, …, ψ_D). The diagonality of Ψ is the crucial modelling choice: it encodes the assumption that all correlations between observed features are explained by the shared latent factors, and what's left over is independent per-feature noise. Under this model, the marginal distribution of x is N(μ, W W^T + Ψ) — a Gaussian whose covariance is low-rank plus diagonal.
The parameters (μ, W, Ψ) are estimated by maximum likelihood, usually via the EM algorithm (Section 10 of Chapter 04 developed EM for GMMs; the same procedure applies here with z as the latent variable). The E-step computes the posterior p(z | x), which is Gaussian in closed form; the M-step re-estimates W and Ψ as weighted averages. EM converges monotonically and is usually fast — a few dozen iterations for typical problems. Unlike PCA, the fit is unique only up to a rotation of the factor axes: any orthogonal rotation W R gives the same marginal distribution of x. This is a feature, not a bug, because interpretable rotations (varimax, promax, oblimin) are used to make the factors nameable.
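The two EM steps translate directly into code. A compact sketch on data simulated from the model itself (the dimensions, noise levels, and iteration count are illustrative):

```python
import numpy as np

def factor_analysis_em(X, d, n_iter=300, seed=0):
    n, D = X.shape
    Xc = X - X.mean(axis=0)
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(D, d))
    psi = np.ones(D)
    for _ in range(n_iter):
        # E-step: p(z | x) = N(G^-1 W^T Psi^-1 x, G^-1), G = I + W^T Psi^-1 W
        Wp = W / psi[:, None]                  # Psi^-1 W
        G_inv = np.linalg.inv(np.eye(d) + W.T @ Wp)
        Ez = Xc @ Wp @ G_inv                   # posterior means, n x d
        Ezz = n * G_inv + Ez.T @ Ez            # sum of E[z z^T | x]
        # M-step: re-estimate loadings and diagonal noise
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        psi = np.maximum(
            np.mean(Xc**2, axis=0) - np.mean((Ez @ W.T) * Xc, axis=0), 1e-6)
    return W, psi

# Simulate from the model: x = W z + heteroscedastic diagonal noise.
rng = np.random.default_rng(9)
W_true = rng.normal(size=(8, 2))
psi_true = np.linspace(0.1, 0.8, 8)
Z = rng.normal(size=(1000, 2))
X = Z @ W_true.T + rng.normal(size=(1000, 8)) * np.sqrt(psi_true)
W_hat, psi_hat = factor_analysis_em(X, d=2)
# W is identified only up to rotation, but the marginal W W^T + Psi is identified.
C_hat = W_hat @ W_hat.T + np.diag(psi_hat)
C_true = W_true @ W_true.T + np.diag(psi_true)
rel = np.abs(C_hat - C_true).max() / np.abs(C_true).max()
print(f"max covariance error (relative): {rel:.3f}")
```

Note that the check compares W W^T + Ψ rather than W itself — exactly the rotational indeterminacy the text describes.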
The differences are worth knowing exactly. PCA assumes all variance is signal and finds the directions of maximum variance; it has no noise model, no likelihood, and is determined uniquely (up to sign flips) by the covariance matrix. Factor analysis separates shared variance (captured by W) from feature-specific noise (captured by Ψ); it has a likelihood, supports model comparison, is rotation-invariant in the factor space, and is appropriate when per-feature noise levels genuinely differ. When Ψ = σ² I (homoscedastic noise), factor analysis reduces to probabilistic PCA (Tipping & Bishop, 1999), which is PCA with a likelihood. When Ψ → 0, it reduces to PCA exactly. PCA is a special case of factor analysis.
Factor analysis is the right choice in three situations. Psychometrics and social science: the field where it originated, where the model of "latent traits plus item-specific measurement error" is a reasonable one. Heteroscedastic data: sensors with different noise levels, gene-expression assays with different dynamic ranges, survey questions with different reliability. Missing data: because factor analysis is a likelihood model, the EM algorithm handles missing entries gracefully, imputing them as part of the E-step. PCA on data with missing entries requires ad-hoc imputation first; factor analysis builds it in.
The rotational indeterminacy of factor analysis lets you post-process W to maximise interpretability. Varimax (Kaiser, 1958) finds an orthogonal rotation that maximises the variance of squared loadings within each factor, producing components where each feature has a large loading on few factors and small loadings on the rest. Promax and oblimin allow oblique (non-orthogonal) rotations, appropriate when factors themselves are believed to be correlated. The choice of rotation is a modelling decision, not a statistical one — different rotations give equally good fits but different stories.
Factor analysis is the simplest member of a family of latent-Gaussian linear models. Probabilistic PCA (isotropic noise). Independent component analysis (Section 08, non-Gaussian factors). Canonical correlation analysis (two sets of features, shared factors). Partial least squares (one set of features, one target, shared factors). Gaussian-process latent variable models (nonlinear factors). All of these inherit the latent-variable formulation; each adds or relaxes one assumption.
Non-negative matrix factorisation (NMF) decomposes a non-negative matrix X into a product W H of two non-negative matrices. The non-negativity constraint is what makes the method distinctive: unlike PCA, which can subtract as well as add, NMF can only build the observed data by adding up parts, which yields decompositions that look more like the way humans decompose objects into constituent pieces.
Given an n × D matrix X ≥ 0 with non-negative entries, NMF finds W (n × d) and H (d × D) both non-negative such that X ≈ W H. The rank d is the target dimensionality, chosen by the user. The objective is usually squared Frobenius error ‖X − W H‖_F^2 or the generalised Kullback–Leibler divergence, both minimised subject to non-negativity constraints. Lee and Seung's multiplicative update algorithm (2001) was the first practical NMF solver; modern implementations (scikit-learn's NMF, LibNMF) use alternating-least-squares or coordinate-descent variants that are much faster.
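The multiplicative updates for the Frobenius objective are two lines each: every entry is multiplied by a non-negative ratio, so non-negativity is preserved by construction. A sketch (random initialisation, fixed seed, illustrative iteration count) on a matrix that is exactly a non-negative product:

```python
import numpy as np

def nmf(X, d, n_iter=1000, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, D = X.shape
    W = rng.random((n, d))
    H = rng.random((d, D))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for || X - W H ||_F^2
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A genuinely rank-4 non-negative matrix is recovered to low error.
rng = np.random.default_rng(10)
W_true = rng.random((50, 4))
H_true = rng.random((4, 40))
X = W_true @ H_true
W, H = nmf(X, d=4)
rel = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel:.4f}")
```

As the text notes, different seeds land in different local minima, so practical use keeps the best of several restarts.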
The striking empirical observation of Lee and Seung's 1999 Nature paper was that NMF on a dataset of face images produces basis images (rows of H) that look like face parts — noses, mouths, eyes, forehead patches — while PCA on the same data produces basis images that look like ghostly full faces (the positive and negative parts cancelling to form wholes). The difference comes from the non-negativity constraint: without the ability to subtract, each basis element has to represent something that can meaningfully be added. This makes NMF the method of choice whenever the data is genuinely non-negative and the decomposition is expected to be additive: parts of objects, counts of events, intensities.
Three applications dominate NMF's practical use. Topic modelling on document-term matrices: the columns of H are topics (distributions over words), and the rows of W give each document's topic proportions. Arora, Ge, and Moitra's 2012 result that NMF can provably recover topic models under mild separability assumptions made this a respectable alternative to latent Dirichlet allocation. Source separation on spectrograms: audio signals' time-frequency representations are non-negative, and NMF decomposes them into instrument- or note-like components; much of the 2000s-era blind-source-separation literature is NMF-based. Recommendation systems: decomposing a user-item interaction matrix into user-preference and item-attribute factors, both constrained non-negative for interpretability.
Practical NMF adds regularisation to both W and H: an L1 penalty on H enforces sparse topics (each topic uses few words); an L2 penalty on W stabilises the reconstruction. The combination (L1 + L2, i.e. the elastic net) is the default in most libraries. Without regularisation, NMF solutions are non-unique — multiplying W by a positive diagonal matrix and H by its inverse gives an equivalent factorisation — and regularisation breaks the ambiguity by selecting a preferred scale.
NMF is non-convex: different random initialisations give different local minima, and the objective has many. Multiple restarts (five to ten, keeping the best reconstruction) is standard practice. NMF is also slower than PCA and has no closed form. It doesn't apply to data that can be legitimately negative (signed signals, residuals, correlations). And the parts-based interpretation is often overstated — on many real datasets NMF gives components that are not obviously "parts" in any semantic sense, and the interpretation ends up being as cryptic as PCA's.
Multi-way generalisations — CP decomposition, Tucker decomposition, non-negative tensor factorisation — extend NMF to tensors with three or more modes, appropriate for data indexed by multiple factors (user × item × time, voxel × time × subject). Bayesian NMF places priors on W and H and performs variational or MCMC inference, giving uncertainty estimates and automatic rank selection. Both directions have specialised but lively literatures; neither is a first stop for general dimensionality reduction.
Independent component analysis (ICA) asks a different question from PCA. PCA finds directions of maximum variance; ICA finds directions of maximum statistical independence. When the data is generated by a mixture of independent non-Gaussian sources — microphones picking up overlapping speakers, EEG sensors picking up overlapping brain signals — PCA returns the overall geometry and ICA returns the sources.
ICA assumes the observed data x ∈ R^D is a linear mixture x = A s of d independent non-Gaussian latent sources s = (s_1, …, s_d). The mixing matrix A is D × d (usually d = D; d < D is possible but less common). The task is to recover A and s from x alone — the blind source separation problem. The cocktail-party analogy is canonical: if three microphones record three speakers in a room, ICA takes the three mixed recordings and returns three unmixed channels, one per speaker.
The deep fact underlying ICA is that if the sources are independent and non-Gaussian, then linear mixing is the only operation that leaves them independent — and conversely, a linear transformation that makes a mixture independent must undo the original mixing. But if two or more of the sources are Gaussian, independence is preserved by any rotation, and the method cannot distinguish the original sources from any rotation of them. Gaussian sources are invisible to ICA.
The consequence is that ICA maximises non-Gaussianity. Each independent component is the direction in which the projected data is least Gaussian (most peaked or most kurtotic, depending on the measure). Popular non-Gaussianity measures include the fourth cumulant (kurtosis), negentropy (a natural measure derived from Shannon entropy), and fixed-point approximations thereof. The FastICA algorithm (Hyvärinen, 1999) iteratively finds one component at a time by fixed-point optimisation of a negentropy proxy; it is the default in scikit-learn.
ICA nearly always begins with a whitening step — decorrelate the features and rescale them to unit variance. This is exactly PCA followed by rescaling (PCA + whitening). After whitening, ICA reduces to finding an orthogonal rotation that maximises non-Gaussianity, a simpler problem than the original. So ICA is PCA plus one extra step: a rotation in the PCA-whitened space that aligns the axes with the most independent directions. This is why ICA and PCA are often applied together.
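A toy unmixing run makes the pipeline concrete — two hand-built non-Gaussian sources, a fixed mixing matrix, and scikit-learn's FastICA (which performs the whitening step internally); the signals and the mixing matrix here are illustrative:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),       # square wave — non-Gaussian
          np.sin(5 * t)]                # sinusoid    — non-Gaussian

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])              # mixing matrix
X = S @ A.T                             # two "microphone" recordings

# PCA-whitening happens inside; FastICA then finds the rotation
# that maximises non-Gaussianity of the projections.
ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
S_hat = ica.fit_transform(X)            # sources, up to sign and order
```

Each recovered column matches one true source only up to sign, scale, and ordering — the ambiguities discussed next.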
ICA recovers sources up to two ambiguities that cannot be resolved from the data alone. Sign: each component can be flipped, since A s = (−A)(−s). Permutation: the ordering of the components is arbitrary; ICA returns sources in no particular order. These ambiguities rarely matter in practice — domain knowledge selects which component is which (the speaker with the deep voice, the artefact at 50 Hz) — but they are worth knowing when setting up a pipeline.
The canonical applications lie in neuroscience and signal processing. EEG and MEG: decomposing multi-channel brain recordings into independent sources corresponding to brain regions and artefacts (eye blinks, muscle movement, line noise); removing the artefact components before downstream analysis is a standard preprocessing step. fMRI: spatial ICA identifies networks of coordinated brain activity. Audio source separation: classical blind-source-separation task from multiple microphones, though modern deep-learning methods have largely surpassed ICA here. Finance: recovering independent factors driving asset returns, an interpretable alternative to PCA-based factor models. ICA is not a general-purpose reduction — when the independence assumption isn't satisfied, the returned components don't mean much — but when the assumption holds, it is the right tool.
Several generalisations extend the basic ICA idea. Overcomplete ICA allows more sources than observed channels, at the cost of probabilistic inference. Nonlinear ICA drops the linearity assumption; recent results (Hyvärinen et al., 2019) show that with auxiliary variables (time indices, class labels), nonlinear ICA is identifiable, bridging to self-supervised representation learning. Sparse coding (Olshausen & Field, 1996) is closely related: it finds an overcomplete basis that sparsely reconstructs the data, with the sparsity constraint playing the role that non-Gaussianity plays in ICA. Each of these sits at the frontier where classical signal processing meets modern representation learning.
The most surprising result in dimensionality reduction is that you don't have to optimise anything to get a good projection. Projecting data onto a random low-dimensional subspace approximately preserves pairwise distances, with high probability, as long as the target dimension is at least O(log n / ε²). The Johnson–Lindenstrauss lemma turned random projection from a curiosity into a workhorse of large-scale approximate computation.
Given n points in R^D and a target distortion ε ∈ (0, 1), there exists an embedding f : R^D → R^d with d = O(ε^{−2} log n) such that for every pair x, y in the point set, (1 − ε) ‖x − y‖² ≤ ‖f(x) − f(y)‖² ≤ (1 + ε) ‖x − y‖². The result (Johnson & Lindenstrauss, 1984) has two striking features. First, d is independent of D: the target dimension depends only on the number of points and the allowed distortion, not the ambient dimensionality. Second, f can be chosen at random — sampling a matrix with Gaussian entries and scaling appropriately gives such an embedding with high probability. There is nothing to optimise.
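scikit-learn ships the bound as a helper; note that the ambient dimensionality D never appears in the call:

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum safe target dimension d for n points at distortion eps.
for n in (1_000, 100_000, 10_000_000):
    d = johnson_lindenstrauss_min_dim(n_samples=n, eps=0.1)
    print(f"n = {n:>10,}  ->  d = {d}")
```

The target dimension grows only logarithmically in n, exactly as the lemma promises.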
Four concrete constructions are widely used. Gaussian random projection: sample R ∈ R^{D × d} with i.i.d. N(0, 1/d) entries and compute Z = X R. Correct but expensive when D is large. Sparse random projection (Achlioptas, 2003): use entries in {−1, 0, +1} with probabilities (1/6, 2/3, 1/6). Faster because matrix multiplication exploits sparsity, and still satisfies the JL lemma. Very sparse random projection (Li, Hastie, Church, 2006): go to density 1/√D, giving further speed-up. Structured random projection (subsampled randomised Hadamard transforms, Count-Sketch, Fast JL): replace the random matrix with a structured one that can be applied in O(D log D) time instead of O(D d), which matters for very large D.
Random projection is the right reduction when three things are true. The data is very high-dimensional: D in the thousands to millions. You need a projection that works across many downstream tasks: random projection is task-agnostic, so one projection serves all. You can't afford to fit a model: no training step, no eigendecomposition, no tuning. The classic applications are in large-scale nearest-neighbour search (locality-sensitive hashing variants are random-projection-based) and streaming algorithms where data arrives too fast to preprocess. In scikit-learn, GaussianRandomProjection and SparseRandomProjection are drop-in replacements for PCA.
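A sketch of the drop-in usage, with a spot check that pairwise distances survive (the random Gaussian data stands in for a real high-dimensional dataset):

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10_000))      # n = 500 points in D = 10 000

# n_components='auto' picks d from the JL bound for the requested eps.
proj = SparseRandomProjection(eps=0.25, random_state=0)
Z = proj.fit_transform(X)

# Distance distortion on one pair — should sit within roughly 1 ± eps.
ratio = np.linalg.norm(Z[0] - Z[1]) / np.linalg.norm(X[0] - X[1])
```

There is no training step: the same fitted `proj` object can project any future batch with the same random matrix.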
On a fixed dataset with a fixed target d, PCA gives a lower reconstruction error than random projection, because PCA optimises for variance. But for approximate distance preservation — which is what many downstream algorithms actually need — random projection is competitive, and it has advantages PCA doesn't. It needs no data-dependent computation: you can project new data with the same random matrix, with no need to refit as data grows. It's embarrassingly parallel. It works in streaming settings. For pairwise-distance-based downstream methods (kNN, clustering, kernel methods), the ε distortion of the JL lemma translates directly into an ε error bound on the downstream method, which is often acceptable.
Random projection is the simplest example of the broader field of compressed sensing: the set of results that say sparse or low-rank signals can be recovered exactly from far fewer measurements than the ambient dimension suggests (Candès, Romberg & Tao, 2006; Donoho, 2006). Where random projection preserves pairwise distances, compressed sensing recovers the data itself from random linear measurements, provided the signal is sparse in some basis. Sketching is the algorithmic-theory term for random-projection-based data structures that support approximate queries (count queries, distinct counts, heavy hitters) on massive data streams. The Count-Min Sketch, HyperLogLog, and the sketches inside modern database systems all descend from the JL idea.
The reductions covered so far have all been unsupervised — they use the features X alone. When the downstream task is classification and class labels are available, a supervised reduction can do much better: project onto directions that separate classes rather than directions that capture total variance. Linear discriminant analysis (LDA) is the canonical supervised linear reduction, and it pays to know the one direction in which it differs from PCA.
Given data X with class labels y ∈ {1, …, C}, define the between-class scatter S_B (how far apart the class means are from the grand mean) and the within-class scatter S_W (how spread out the data is within each class). LDA finds the projection direction w that maximises the ratio (w^T S_B w) / (w^T S_W w). The solution, by the Rayleigh quotient theorem for generalised eigenvalue problems, is the top generalised eigenvector of (S_B, S_W). With C classes LDA yields at most C − 1 meaningful directions — the class means span a (C − 1)-dimensional subspace and everything else in S_B is zero.
LDA has two identities that are sometimes confused. As a reducer, LDA produces a projection of the data into a (C − 1)-dimensional space where classes are maximally separated. As a classifier, LDA (also called Fisher discriminant analysis) fits a Gaussian to each class with a shared covariance and classifies new points by the posterior. The two are connected but distinct: the LDA reduction is useful as a preprocessing step for other classifiers, while the LDA classifier stands alone. The reducer version appears in scikit-learn's LinearDiscriminantAnalysis.transform; the classifier version in .predict.
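The two identities correspond to two methods on the same fitted object; a sketch on iris (C = 3, so the reducer returns at most two directions):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)              # 3 classes -> C - 1 = 2

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

Z = lda.transform(X)       # the reducer: supervised (n, 2) projection
y_hat = lda.predict(X)     # the classifier: shared-covariance Gaussian posterior
```

`Z` can feed any downstream model; `y_hat` is a complete classifier on its own.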
The canonical diagram: two elongated class clusters stretched along one axis and separated along the perpendicular axis. PCA picks the long axis — that's where the variance is. LDA picks the perpendicular — that's where the class separation is. Running PCA on this dataset erases the class structure; running LDA preserves it exactly. The moral is that unsupervised reduction can throw away exactly what you care about. Whenever labels are available and the downstream task is classification, LDA should at least be compared to PCA.
Classical LDA inverts S_W, which is singular or ill-conditioned whenever there are more features than samples (the "D > n" regime, common in genomics and text). Regularised LDA (Friedman, 1989) shrinks S_W toward a scaled identity, S_W + λ I, before inversion, stabilising the estimate. Shrinkage LDA (Ledoit & Wolf, 2004) chooses the shrinkage parameter λ analytically to minimise the expected squared error of the covariance estimate, and is the default in scikit-learn's LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'). Diagonal LDA assumes S_W is diagonal, which works surprisingly well in very-high-dimensional settings and is equivalent to naive Bayes with Gaussian features.
Quadratic discriminant analysis (QDA) drops LDA's assumption of a shared covariance matrix and fits a separate covariance per class. The resulting decision boundaries are quadratic, not linear — ellipses, parabolas, hyperbolas — and QDA can model classes with different shapes and orientations. The cost is twice as many parameters and less regularisation-by-constraint; QDA overfits faster than LDA and needs more data. The bias-variance trade-off between LDA and QDA mirrors the trade-off between ridge and ordinary least squares: LDA is the more biased, lower-variance option, and is usually the right default. Hastie's Regularized Discriminant Analysis (1995) explicitly interpolates between the two via a convex combination of the shared and per-class covariance matrices.
LDA matters most in three situations. Very high dimensional data with labels: gene-expression classification with D = 20 000 features and a dozen samples per class — shrinkage LDA is often the strongest baseline. Visualisation with class structure: projecting into C − 1 dimensions and plotting gives a clearer picture of class separability than PCA. Before a flexible classifier: running LDA first and then a random forest or gradient-boosted model on the (C − 1)-dimensional projection is sometimes the best possible classifier when signal is weak and features are noisy.
The (C − 1)-dimensional representation is easier to visualise, store, and reason about. If it underperforms, the failure mode (a linear decision boundary, the Gaussian shared-covariance assumption) tells you what the nonlinear upgrade should be.
Two related supervised linear methods deserve brief mention. Canonical correlation analysis (CCA; Hotelling, 1936) finds pairs of projections of two sets of features X and Y that maximise their correlation — a dimensionality reduction in both spaces simultaneously. It is the natural tool for multi-modal analysis (images and captions, fMRI and gene expression). Partial least squares (PLS) is a regression method disguised as a reduction: given X and a target y, it finds projections of X that are maximally correlated with y. PLS is heavily used in chemometrics and spectroscopy and is the method to know when the D > n regression has any reasonable amount of signal.
Linear methods project along straight directions in the ambient space. When the data lies on a curved manifold — the canonical example is the Swiss roll, a 2-D sheet rolled up inside 3-D space — the right notion of "close" is not Euclidean distance but geodesic distance along the manifold. The classical manifold-learning methods, starting with multidimensional scaling and extending through Isomap, try to make the embedding preserve those geodesic distances.
Classical multidimensional scaling (MDS; Torgerson, 1952) starts from a distance matrix D between n points and asks: find an embedding Z ∈ R^{n × d} such that the pairwise Euclidean distances in Z match D as closely as possible. The classical solution is to double-centre D² (subtracting row, column, and grand means of the squared-distance matrix), giving a Gram matrix B, and then take the top d eigenvectors of B. When D is a matrix of Euclidean distances in some ambient space, classical MDS gives exactly the PCA projection of the underlying data. When D is derived from some other distance — road distances between cities, psychological similarities between stimuli — MDS produces an embedding compatible with those distances. This is the method's defining utility.
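The algorithm is short enough to write out; a sketch of the double-centring recipe in NumPy, with the sanity check that Euclidean distances in give the same distances back:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, d):
    """Embed an n x n distance matrix D into R^d by double-centring."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram matrix
    w, V = np.linalg.eigh(B)                   # eigenvalues ascending
    top = np.argsort(w)[::-1][:d]              # take the top d
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Euclidean distances from 2-D points should be reproduced exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
D = squareform(pdist(X))
Z = classical_mds(D, 2)
```

When the input distances are Euclidean, `Z` is the PCA projection up to rotation and reflection, exactly as stated above.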
Metric MDS (Kruskal, 1964) generalises classical MDS by minimising a stress function — a weighted sum of squared errors between embedded distances and target distances — via iterative optimisation rather than eigendecomposition. It accommodates non-Euclidean distances, allows weighting, and handles missing entries. Non-metric MDS (Shepard, 1962) relaxes further: it preserves the rank order of distances rather than their magnitudes. Two points that were fifth-closest in the input should remain fifth-closest in the output, regardless of the actual distance value. Non-metric MDS is appropriate when the input "distances" are really ordinal similarity judgments — psychology's classic setting.
Isomap (Tenenbaum, de Silva, Langford, 2000) was the paper that restarted interest in nonlinear dimensionality reduction. The algorithm has three steps. First, build a neighbourhood graph: connect each point to its k nearest neighbours, weighted by Euclidean distance. Second, compute geodesic distances between all pairs of points as shortest-path distances in the graph (Dijkstra or Floyd–Warshall). Third, apply classical MDS to the resulting geodesic-distance matrix. The key insight: for points on a curved manifold, the shortest path through a dense nearest-neighbour graph approximates the geodesic distance along the manifold. A Swiss roll unrolled by Isomap recovers the underlying 2-D sheet with minimal distortion, because the graph geodesic snakes along the roll rather than cutting through the empty space.
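The three steps are packaged by scikit-learn's Isomap; a sketch on the Swiss roll itself (the neighbour count and sample size are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# X is a 2-D sheet rolled up in 3-D; color is the position along the roll.
X, color = make_swiss_roll(n_samples=1500, random_state=0)

# k-NN graph -> graph shortest paths -> classical MDS, all inside Isomap.
iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)          # the unrolled sheet
```

One embedded coordinate should track `color` almost monotonically — the sign that the geodesics snaked along the roll instead of cutting across it.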
Both Isomap and the methods in Section 12 depend critically on the neighbourhood-graph construction. Too few neighbours (k too small) and the graph fragments into disconnected components, breaking the distance computation. Too many neighbours and the graph takes shortcuts across the manifold — the curse of "short-circuiting" — and the geodesic distances collapse to Euclidean. The safe range for k is usually 5–20, varying with dataset size and intrinsic dimension. Diagnostic: plot the number of connected components of the k-NN graph as a function of k; the graph should become fully connected at a small k and stay that way as k grows.
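The diagnostic is a few lines; a sketch with scikit-learn's k-NN graph and SciPy's connected-components routine (the Swiss roll and the grid of k values are illustrative):

```python
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Number of connected components of the k-NN graph as k grows:
# it should drop to 1 at a small k and stay there.
for k in (2, 5, 10, 20):
    G = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    n_comp, _ = connected_components(G, directed=False)
    print(f"k = {k:2d}  components = {n_comp}")
```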
Isomap was the first nonlinear method to produce visually convincing manifold recoveries on standard benchmarks — the Swiss roll, the rotating-object faces dataset. Its computational cost is O(n² log n) for the all-pairs shortest-path distances and O(n³) for the classical MDS eigendecomposition, which limits it to a few thousand points. The method has been superseded in visualisation by t-SNE and UMAP, but it is still useful: Isomap returns a global embedding where distances carry meaning, whereas t-SNE and UMAP do not. For tasks where distances in the reduced space should be interpretable — downstream regression, systematic visualisation of multiple datasets in the same coordinates — Isomap remains the right tool.
To run MDS or Isomap on more than a few thousand points, use landmark MDS (de Silva & Tenenbaum, 2003): sample m ≪ n points as landmarks, compute distances among the landmarks and from every point to every landmark, embed the landmarks with MDS, and triangulate the remaining points into the landmark embedding. The cost drops from O(n^3) to O(m n). This is how modern MDS implementations handle datasets of a million points — by secretly running on a few thousand landmarks and interpolating.
Isomap preserves global distances through the manifold, which is expensive and fragile (one short-circuit in the neighbourhood graph wrecks everything). A complementary family of methods gives up on global distances and preserves only the local structure: points that were neighbours in the input should stay neighbours in the output. Locally linear embedding and Laplacian eigenmaps are the two canonical members.
Locally linear embedding (LLE; Roweis & Saul, 2000) has three steps. First, build a k-NN graph (same as Isomap). Second, for each point, find the weights W_{ij} such that x_i ≈ Σ_j W_{ij} x_j — reconstructing the point as a linear combination of its neighbours, with Σ_j W_{ij} = 1. The weights are invariant to rotations, scaling, and translations of the local neighbourhood, which is what makes the method geometric. Third, find the low-dimensional embedding Z ∈ R^{n × d} that minimises the same reconstruction error, Σ_i ‖z_i − Σ_j W_{ij} z_j‖², subject to centring and unit covariance constraints. The solution is given by the bottom d + 1 eigenvectors of (I − W)^T (I − W), discarding the smallest (constant). LLE is elegant and non-iterative; it produces smooth low-dimensional embeddings when the manifold is well-approximated locally by affine patches.
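The three steps are packaged by scikit-learn; a sketch on the Swiss roll (the neighbour count is illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Reconstruction weights from 12 neighbours, then the bottom
# eigenvectors of (I - W)^T (I - W) as the embedding coordinates.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Z = lle.fit_transform(X)
err = lle.reconstruction_error_     # embedding-space reconstruction cost
```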
Laplacian eigenmaps (Belkin & Niyogi, 2003) take a similar approach with a different mathematical foundation. Build a weighted k-NN graph with edge weights W_{ij} (Gaussian kernel on distances, or just 0/1). Define the graph Laplacian L = D − W, where D is the diagonal matrix of degrees. Find the d smallest non-trivial eigenvectors of the generalised eigenvalue problem L z = λ D z; these are the embedding coordinates. The method minimises Σ_{ij} W_{ij} ‖z_i − z_j‖², which penalises embeddings where connected points are far apart — a direct "preserve local structure" objective. Laplacian eigenmaps is the unnormalised form of spectral clustering (Section 11 of Chapter 04 already covered this in the clustering context); the same Laplacian eigenvectors serve either as cluster indicators or as an embedding.
Spectral embedding is the generic name for any reduction that takes the top or bottom eigenvectors of a graph Laplacian (or similar matrix) as the embedding. Laplacian eigenmaps, normalised-cut spectral clustering, diffusion maps, and PageRank-style embeddings all fit under this umbrella. The common pattern is: construct an affinity matrix encoding the notion of "close," convert it to a Laplacian, take eigenvectors, use them as coordinates. Scikit-learn's SpectralEmbedding implements this recipe (using the normalised Laplacian by default); it's a close cousin of Laplacian eigenmaps with a cleaner interface.
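A sketch of the generic recipe via scikit-learn (the digits dataset is illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding

X, y = load_digits(return_X_y=True)          # 1797 points in 64 dimensions

# Affinity graph -> graph Laplacian -> bottom non-trivial eigenvectors.
emb = SpectralEmbedding(n_components=2, n_neighbors=10, random_state=0)
Z = emb.fit_transform(X)                     # (n, 2) coordinates
```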
Diffusion maps (Coifman & Lafon, 2006) generalise Laplacian eigenmaps by considering random walks on the similarity graph. The diffusion distance between two points is defined as the difference between the probability distributions of a random walker starting at each of them after t steps, and the diffusion map embeds points such that Euclidean distances in the embedding match diffusion distances. Tuning t gives a family of embeddings that emphasise structure at different scales — small t for fine local structure, large t for coarse global structure. Diffusion maps are less widely used than LLE or Laplacian eigenmaps but are mathematically the most principled of the spectral family, with a strong connection to the Fokker–Planck equation.
The local-neighbourhood methods share several failure modes. Sensitivity to the neighbourhood size: choice of k heavily affects the embedding, and the "right" k varies with intrinsic dimension. Non-uniform sampling: dense regions of the data dominate the embedding; sparse regions are poorly represented. Distance metric in the reduced space is meaningless: unlike Isomap, the embedded distances have no intrinsic interpretation — they only encode neighbourhood relations. No out-of-sample extension: projecting a new point requires recomputing the whole embedding, unless you use a Nyström approximation or similar trick. These limitations are the reason t-SNE and UMAP (sections 13 and 14) largely replaced LLE and Laplacian eigenmaps for practical work — though the underlying spectral-graph machinery is the same.
Two extensions of basic LLE deserve mention. Modified LLE (Zhang & Wang, 2007) addresses LLE's regularisation instability by using multiple local weight vectors per point rather than one. Hessian LLE (Donoho & Grimes, 2003) uses a Hessian-based functional in place of the Laplacian-based one, giving theoretically better recovery on isometrically embedded manifolds. Both are implemented in scikit-learn (LocallyLinearEmbedding with method='modified' or 'hessian') and are occasionally useful, though they rarely outperform UMAP in practice.
t-distributed stochastic neighbour embedding (t-SNE; van der Maaten & Hinton, 2008) is the method that reset expectations for what a 2-D visualisation of high-dimensional data could look like. It produces scatter plots where distinct populations appear as tight, well-separated clusters — the kind of picture that makes a paper submission instantly legible — and it does so by casting the embedding problem as matching probability distributions rather than matching distances.
The ancestor of t-SNE, stochastic neighbour embedding (SNE; Hinton & Roweis, 2002), frames the problem probabilistically. In the high-dimensional space, define a conditional probability p_{j|i} that point i picks point j as its neighbour, proportional to a Gaussian centred on x_i: p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²). The variance σ_i is set per-point by binary search to achieve a target perplexity — effectively the number of neighbours each point "sees." In the low-dimensional embedding, define probabilities q_{j|i} analogously (t-SNE later symmetrises both into p_{ij} and q_{ij}). The objective is to minimise the sum of KL divergences between the ps and the qs. The intuition: if i is close to j in the high-dimensional space, p_{j|i} is large, so the corresponding q must also be large, forcing i and j close in the embedding.
The innovation in t-SNE was replacing the Gaussian kernel in the embedding space with a Student's t-distribution with one degree of freedom (equivalent to a Cauchy distribution). This introduces heavy tails: moderately-far points in the embedding contribute to q_{ij} more than they would under a Gaussian. The practical consequence is that t-SNE can push dissimilar points apart without paying a high gradient cost, which produces the well-separated cluster structure that made the method famous. The heavy-tailed kernel also addresses the crowding problem: in 2-D there is not enough room to place all moderately-similar points at roughly equal distances from a focal point, and the t-distribution's long tail compensates.
t-SNE has three important hyperparameters. Perplexity: the effective number of neighbours each point considers. Small perplexity (5–15) emphasises local structure; large perplexity (50–100) emphasises broader structure. Typical ranges are 5–50, and Wattenberg, Viégas, and Johnson's 2016 Distill article "How to Use t-SNE Effectively" is the canonical warning about over-interpreting perplexity-dependent visualisations. Learning rate: usually 10–1000, set to n / 12 (Kobak & Berens, 2019) as a data-adaptive default. Initialisation: random is traditional, but PCA initialisation is far more stable across runs and usually gives better embeddings — another recommendation of Kobak and Berens.
t-SNE produces beautiful pictures, but each picture is a Rorschach blot if not interpreted with care. Cluster sizes are not meaningful: a large cluster and a small cluster in a t-SNE plot may correspond to populations of equal size in the original data — t-SNE adjusts its local scale to match each point's perplexity. Distances between clusters are not meaningful: two clusters that appear far apart may be similar in the high-dimensional space, and vice versa. Only the local neighbourhood structure survives. Different runs give different plots: t-SNE is stochastic; initialisation and random seed matter. PCA initialisation largely fixes this. Perplexity-driven structure can be an artefact: the same data can look like 3 clusters at perplexity 5, 2 clusters at perplexity 30, and 1 cluster at perplexity 100 — the method invents apparent structure at the chosen scale.
Naive t-SNE is O(n²) per iteration and scales to a few thousand points. Barnes–Hut t-SNE (van der Maaten, 2014) approximates the gradient using a quad-tree to group distant points, bringing the cost down to O(n log n) per iteration and enabling embeddings of a few hundred thousand points. FIt-SNE (Linderman et al., 2019) uses Fourier-based interpolation and drops the cost to O(n) per iteration, handling millions of points. The openTSNE library implements both and is the default for large-scale work.
Despite the caveats, t-SNE remains the right tool for one thing: exploratory 2-D visualisation of high-dimensional data when you want to see whether clusters exist and roughly what they look like. For that task, t-SNE's tendency to produce clean, visible clusters is a feature. For downstream tasks — clustering, nearest-neighbour retrieval, feeding a classifier — UMAP (Section 14) is usually preferred because it scales better, preserves more global structure, and supports out-of-sample projection of new points natively.
UMAP (McInnes, Healy, Melville, 2018) is the method that replaced t-SNE as the default 2-D visualisation tool for high-dimensional data in most domains. It produces qualitatively similar pictures to t-SNE, runs 5–10× faster on modern hardware, preserves more global structure, and supports out-of-sample projection — three practical advantages that added up to widespread adoption in the two years after the paper.
UMAP has two phases. First, construct a fuzzy topological representation of the data: build a weighted k-NN graph where edge weights come from a local-distance Gaussian (much like t-SNE's per-point σ_i), symmetrise the graph so that if i considers j a neighbour then j considers i a neighbour too (using a probabilistic t-conorm combination), and treat the result as a fuzzy simplicial set encoding the manifold's local structure. Second, optimise a low-dimensional embedding whose fuzzy simplicial set best matches the high-dimensional one, by minimising a cross-entropy loss between the two sets of edge weights. The optimisation uses stochastic gradient descent with negative sampling — the same trick used in word2vec — which is what makes UMAP so much faster than t-SNE's full-batch Barnes–Hut gradient.
Five differences matter in practice. Speed: UMAP is typically 5–10× faster than Barnes–Hut t-SNE on the same data, and the gap widens with dataset size. Global structure: UMAP preserves global relationships better than t-SNE — clusters that are similar in the high-dimensional space are placed nearby in UMAP but at arbitrary positions in t-SNE. Out-of-sample projection: UMAP's projection step can be applied to new points without re-fitting the whole embedding; t-SNE cannot. Consistency across runs: UMAP is still stochastic, but less so than t-SNE; different seeds give qualitatively similar embeddings. Embedding dimension: UMAP is designed to scale beyond 2-D without losing coherence — 10-D, 50-D, 100-D UMAP embeddings are used as preprocessing for downstream classifiers and clusterers. The combined effect: UMAP is usually the better first choice, and t-SNE is reached for only when its specific aesthetic (crisper cluster separation, tighter clumps) is wanted.
UMAP has two principal hyperparameters. n_neighbors (typical 5–50, default 15): controls the local/global trade-off. Small values emphasise fine-grained local structure; large values preserve broader topology. Analogous to perplexity in t-SNE, but UMAP is more robust to the choice. min_dist (typical 0.0–0.99, default 0.1): controls how tightly UMAP packs points. Low values produce tight clusters with visible gaps; high values produce smoother, more spread-out embeddings. For visualisation the defaults work well. For clustering downstream, Healy's recommendation is to set n_neighbors large (30–100), min_dist small (0.0), and embed to 10–50 dimensions rather than 2, then cluster with HDBSCAN. This is the UMAP+HDBSCAN pipeline that has become a workhorse of single-cell genomics.
The umap-learn Python library is the reference implementation. Usage is a three-line drop-in for any scikit-learn reducer. The method supports categorical and numerical distances (Hamming, Mahalanobis, Jaccard, cosine), supervised UMAP with labels, and parametric UMAP (where the embedding is learned by a neural network and can be applied to unseen data). For very large datasets (tens of millions of points), rapidsai/cuml offers a GPU implementation that is another order of magnitude faster.
Supervised UMAP takes both features and labels and produces an embedding where classes are separated, analogous to LDA but nonlinear and much more expressive. Semi-supervised UMAP takes partial labels and uses them to guide the embedding while respecting the unsupervised structure in the unlabelled majority. These are useful when a reduction is being prepared specifically for a downstream classification task, and they typically produce better classification accuracy than unsupervised UMAP followed by the same classifier.
UMAP's paper is unusually mathematical for a visualisation tool, grounded in fuzzy simplicial sets and algebraic topology. In practice the theory mostly justifies the design decisions rather than being actively used; what matters is that the method works well, scales well, and has a stable, well-maintained implementation. Later theoretical work (Damrich & Hamprecht, 2021; Böhm, Berens, Kobak, 2022) has shown that UMAP and t-SNE can be unified as members of a parameterised family of neighbour-embedding losses, and that their difference is a single degree of freedom in the loss, clarifying that the methods are closer than they initially seemed.
An autoencoder is a neural network trained to copy its input to its output through a low-dimensional bottleneck. The encoder maps x to a lower-dimensional code z = f(x); the decoder maps z back to a reconstruction x̂ = g(z); the network is trained to minimise ‖x − x̂‖². The bottleneck forces the encoder to preserve only the information needed to reconstruct x, which is exactly the content of dimensionality reduction. Autoencoders are the neural generalisation of PCA and the bridge to modern representation learning.
The simplest autoencoder has an encoder and decoder that are both single linear layers. With a Frobenius-norm reconstruction loss, this exactly recovers PCA: the weights of the encoder, after training, span the same subspace as the top d principal components. Replacing the linear layers with multi-layer nonlinear networks (ReLU, sigmoid, tanh activations) gives a nonlinear autoencoder, which can capture curved manifolds that PCA cannot. The trade-off: nonlinear autoencoders have no closed-form solution, training is stochastic, hyperparameters (depth, width, learning rate, regularisation) matter, and the optimisation can settle into poor local minima. When they work, they work beautifully; when they don't, diagnosing why is harder than for PCA.
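The PCA-equivalence claim can be checked numerically. The toy sketch below trains a linear autoencoder by plain full-batch gradient descent (no framework, no mini-batches — just the loss ‖X − X WₑW_d‖²) and verifies that the encoder's column space aligns with the top principal components via the cosines of the principal angles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data with variance concentrated in 2 of 5 directions.
n, D, d = 300, 5, 2
basis = np.linalg.qr(rng.normal(size=(D, D)))[0]
scales = np.array([2.0, 1.5, 0.2, 0.1, 0.05])
X = rng.normal(size=(n, D)) * scales @ basis.T
X -= X.mean(axis=0)

# Linear autoencoder x_hat = x @ We @ Wd, trained by gradient descent.
We = rng.normal(scale=0.3, size=(D, d))
Wd = rng.normal(scale=0.3, size=(d, D))
lr = 0.05
for _ in range(5000):
    R = X @ We @ Wd - X            # reconstruction residual
    We -= lr * (X.T @ R @ Wd.T) / n
    Wd -= lr * (We.T @ X.T @ R) / n

# Top-2 principal directions from the SVD of the (centred) data.
V2 = np.linalg.svd(X, full_matrices=False)[2][:d].T

# Cosines of principal angles between span(We) and span(V2).
Qe = np.linalg.qr(We)[0]
cosines = np.linalg.svd(V2.T @ Qe, compute_uv=False)
print(cosines.min())               # close to 1: the subspaces coincide
```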
Denoising autoencoders (Vincent et al., 2008) corrupt the input with noise — Gaussian noise, salt-and-pepper noise, masking random features — and train the network to reconstruct the clean original. The corruption forces the encoder to learn features robust to the noise, which empirically produces better representations than a plain autoencoder with the same architecture. This was one of the first pieces of evidence that clever training objectives matter more than clever architectures, an idea that would later dominate self-supervised learning.
A sparse autoencoder adds a sparsity penalty on the code z, forcing most of its elements to be zero most of the time. The penalty can be L1 (simple, interpretable), KL-divergence to a target activation rate (Andrew Ng's 2011 formulation), or a learned top-k gate. Sparse autoencoders learn overcomplete dictionaries — d > D — where each code element represents a specific input pattern, reminiscent of the parts-based decompositions of NMF but nonlinear. Sparse autoencoders have had a minor renaissance in 2024–2026 interpretability work on transformer models, where they are used to decompose the internal activations of LLMs into sparse, more interpretable directions.
Variational autoencoders (VAE; Kingma & Welling, 2013) replace the deterministic encoder with a probabilistic one: z ~ q(z | x) = N(μ(x), σ²(x)), with both μ and σ² learned by the encoder network. The training objective is the evidence lower bound, a sum of reconstruction error and a KL divergence between q(z | x) and a prior p(z) = N(0, I). The prior term forces the encoder to map inputs into a smooth, well-organised latent space — nearby points in code space decode to similar inputs — which is what makes VAEs useful generative models as well as reducers. The reparameterisation trick (sample ε ~ N(0, I), let z = μ + σ ⊙ ε) makes the whole thing differentiable by stochastic gradient descent. VAEs are the bridge between classical dimensionality reduction and modern generative modelling.
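A minimal NumPy sketch of the two VAE-specific pieces — the reparameterised sample and the closed-form KL term for a diagonal Gaussian against the N(0, I) prior. The encoder is mocked here by random μ and log σ² values; in a real VAE both come from the encoder network.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var, rng):
    """z = mu + sigma * eps, eps ~ N(0, I): differentiable w.r.t. mu, log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Mock encoder outputs: a batch of 4 inputs, 3-D latent space.
mu = rng.normal(size=(4, 3))
log_var = rng.normal(scale=0.5, size=(4, 3))

z = reparameterise(mu, log_var, rng)     # sampled codes, shape (4, 3)
kl = kl_to_standard_normal(mu, log_var)  # one KL value per input
print(np.all(kl >= 0))                   # KL divergence is non-negative
print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # → 0.0: q equals the prior
```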
Two further variants show up in the literature. Contractive autoencoders (Rifai et al., 2011) add a penalty on the Frobenius norm of the encoder's Jacobian, forcing the code to be locally invariant to small perturbations in the input. Adversarial autoencoders (Makhzani et al., 2015) use a GAN-style discriminator to enforce the latent distribution rather than a KL penalty, which gives more flexibility in choosing the latent prior. Both are niche tools useful in specific problems.
When is an autoencoder the right choice over PCA or UMAP? Three situations. Large data. Autoencoders train by mini-batch gradient descent and scale to hundreds of millions of points; vanilla PCA needs the full data matrix (incremental PCA relaxes this, at some cost) and UMAP is memory-bound. Task-specific reductions. Autoencoders can be trained end-to-end with a downstream task — the reduction is tuned for the task, not for general-purpose variance or neighbourhood preservation. Nonlinear manifolds with structured noise. Denoising autoencoders explicitly model the corruption; PCA and UMAP do not. When the noise is non-Gaussian and structured (masked images, corrupted tokens), autoencoders outperform.
The modern lineage of self-supervised representation learning runs directly through the autoencoder. Masked autoencoders (He et al., 2021) are the vision counterpart of BERT's masked language modelling — mask random patches of an image, train a network to reconstruct them, use the encoder as a general-purpose visual feature extractor. Contrastive methods (SimCLR, MoCo, CLIP) replace reconstruction with a contrastive objective but follow the same idea: train the encoder to produce useful low-dimensional features from unlabelled data. From this vantage, the entire pre-training paradigm of modern ML is a massive nonlinear dimensionality reduction — take high-dimensional inputs (pixels, tokens), learn a much lower-dimensional code that preserves what matters, use the code everywhere.
You have a high-dimensional dataset and the vague goal "reduce it for analysis." This section is the recipe for turning that into an actual reduced representation you can defend — choosing the target dimension, evaluating the reduction, and slotting it into a data-science workflow. The recipe collapses a lot of the chapter into a short procedure.
Centre each feature to mean zero; scale to unit variance unless the features share a natural unit and their raw scales carry meaning. Handle missing values (impute, indicator, or drop — see Chapter 09). Check for outliers — a single wild point can rotate PCA's leading components by tens of degrees. If you know there are outliers, use robust PCA or a robust scaler. Decide on the feature set: dropping features that you know are noise before reducing usually improves the reduction more than the reduction improves the features.
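In scikit-learn terms, the standard and robust variants of this preprocessing look as follows (synthetic data; the wild point is inserted deliberately to show why RobustScaler exists):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0, 0.0], scale=[100.0, 1.0, 0.01], size=(500, 3))

# Standard centring and scaling: each feature to mean 0, variance 1.
Xs = StandardScaler().fit_transform(X)
print(np.allclose(Xs.mean(axis=0), 0, atol=1e-8), np.allclose(Xs.std(axis=0), 1))

# With an outlier present, RobustScaler (median / IQR) is far less distorted.
X[0] = [1e6, 1e6, 1e6]                              # one wild point
Xr = RobustScaler().fit_transform(X)
print(np.abs(np.median(Xr, axis=0)).max() < 1e-9)   # medians stay at zero
```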
A decision tree. Need a reduction that preserves linear variance on numerical features? PCA. Non-negative data, interpretable parts? NMF. Recovering independent sources from a mixture? ICA. Labels available, task is classification? LDA (with shrinkage if D > n). Very high dimensional, need speed, task-agnostic? Random projection. Nonlinear manifold structure in moderate dimensions? Kernel PCA or Isomap. 2-D visualisation? UMAP first, t-SNE if UMAP doesn't give the aesthetic you want. Reduction feeding a clustering step? PCA to ~50 dimensions, then UMAP to 10 dimensions, then HDBSCAN. Reduction for a neural-network-driven pipeline? Autoencoder or a pre-trained encoder. None of these decisions is irrevocable — trying two or three methods and comparing is usually a good idea.
The target dimension is usually more consequential than the choice of method. For PCA-family reductions: the scree plot (eigenvalues against rank) usually has an elbow where the drop from one eigenvalue to the next becomes small; take d at the elbow. The cumulative-variance plot lets you pick a threshold — commonly 90 %, 95 %, or 99 % — and take the smallest d that clears it. A data-driven alternative: cross-validate the downstream task over different d and pick the one that maximises downstream performance. For visualisation: d = 2 for human viewing; d = 3 if rotating 3-D plots are useful; d = 10–50 if the reduction is preprocessing for a clusterer or classifier. For methods without a natural scree plot (t-SNE, UMAP, autoencoders), cross-validated downstream performance is usually the only principled choice.
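The cumulative-variance rule is a few lines with scikit-learn's PCA; the 95 % threshold below is one common choice, not a law:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 20-D data with variance concentrated in the leading directions.
X = rng.normal(size=(400, 20)) * np.linspace(3.0, 0.1, 20)

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest d whose cumulative explained variance clears the threshold.
d = int(np.searchsorted(cumvar, 0.95) + 1)
print(d, cumvar[d - 1] >= 0.95)
```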
Four complementary metrics, depending on what the reduction should preserve. Reconstruction error (‖X − X̂‖_F² / ‖X‖_F²): the fraction of variance not preserved by the reduction. Appropriate for PCA-family methods. Trustworthiness and continuity (Venna & Kaski, 2001): measures of how well local neighbourhoods in the input are preserved in the embedding and vice versa. Appropriate for nonlinear-manifold methods where reconstruction isn't defined. Downstream task performance: the only metric that matters if you know what the reduction is for. Cross-validate a classifier or clusterer on both the original and the reduced representation and compare. Human inspection: for 2-D visualisations, colour the plot by known labels (classes, continuous covariates) and check that the structure in the plot is interpretable. This is the least rigorous but often the most informative evaluation.
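Trustworthiness is available directly in scikit-learn. A minimal sketch on synthetic data, with PCA standing in for whatever reducer is under evaluation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10)) * np.linspace(2.0, 0.1, 10)

Z = PCA(n_components=2).fit_transform(X)

# Fraction of embedded-space neighbours that were also input-space
# neighbours, penalised by how far intruders moved up the ranking.
t = trustworthiness(X, Z, n_neighbors=10)
print(0.0 <= t <= 1.0)   # score lives in [0, 1]; higher is better
```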
Two practical rules. Fit the reduction on training data only. Running PCA or UMAP on train + test and then splitting is a form of leakage — the reduction sees the test features while training. Fit on train, transform test. Treat the reduction as part of the model. When evaluating via cross-validation, refit the reduction inside the CV loop. In scikit-learn, Pipeline([('pca', PCA(d)), ('clf', LogisticRegression())]) handles this correctly; treating the PCA as preprocessing outside the loop does not.
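The correct pattern, concretely (Iris is just a stand-in dataset; the point is that the scaler and the PCA are refit inside every CV fold, so no test-fold information leaks into the reduction):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Reduction lives inside the pipeline, so it is refit per CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.shape, scores.mean() > 0.8)
```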
A short list of traps. Not scaling the features. One feature with variance 1 000× the others dominates PCA. Fitting on the wrong split. Pipeline leakage. Choosing d by inspection alone. Eye-balling a 2-D visualisation tells you nothing about whether the reduction is good for a downstream task. Over-interpreting t-SNE and UMAP. Always report parameters, seed, and downstream validation. Running dimensionality reduction when the real problem is too few samples. Reduction doesn't rescue a dataset that is too small for the task; if n is tiny, get more data or simplify the model. Picking d too small. Aggressive reduction looks efficient but wipes out important variance; when in doubt, err on the side of keeping more dimensions.
Dimensionality reduction rarely appears explicitly in the architecture diagrams of modern ML systems, because it is everywhere. Every embedding layer is a trained linear reduction; every encoder-decoder architecture is a nonlinear reduction with a reconstruction objective; every pre-trained foundation model is a reduction from raw data to a semantically meaningful feature space. This closing section is a tour of where the ideas from this chapter have quietly moved into the infrastructure of contemporary ML.
Almost every modern ML system operates on embeddings — low-dimensional dense vector representations of discrete or complex inputs. Word embeddings (word2vec, GloVe) collapsed the 50 000-dimensional one-hot-encoded vocabulary of English into a 300-dimensional dense space where cosine similarity captured semantic relatedness. Sentence transformers extended the idea to sentences, then paragraphs. Vision models (CLIP, DINO) map images to embeddings that support zero-shot classification and cross-modal retrieval. Graph neural networks embed nodes. Recommender systems embed users and items. The embedding space is not an implementation detail — it is the working language of modern ML, and every embedding table is a learned dimensionality reduction.
Once data lives in an embedding space, the natural operation is nearest-neighbour retrieval: given a query vector, find the nearest database vectors. Modern retrieval stacks — semantic search, retrieval-augmented generation, recommenders — are all nearest-neighbour search over learned embeddings. The supporting algorithmic infrastructure is directly descended from dimensionality reduction: product quantisation (Jégou et al., 2011) is a structured form of vector quantisation on sub-vectors, locality-sensitive hashing (Indyk & Motwani, 1998) is random-projection-based, HNSW (Malkov & Yashunin, 2018) is a graph-based approximate nearest-neighbour method whose construction uses ideas from small-world graphs. FAISS, ScaNN, and the modern vector databases (Pinecone, Weaviate, Milvus) are engineered around these techniques.
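The random-projection idea behind LSH fits in a dozen lines: hash each vector to the sign pattern of its projections onto random hyperplanes, so that Hamming distance between hashes tracks angular distance between vectors. A toy sketch, not a production index:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(X, planes):
    """One bit per hyperplane: the sign of the projection onto it."""
    return (X @ planes.T > 0).astype(np.uint8)

dim, n_bits = 64, 16
planes = rng.normal(size=(n_bits, dim))   # random-projection hash family

x = rng.normal(size=dim)
near = x + 0.01 * rng.normal(size=dim)    # almost identical vector
far = rng.normal(size=dim)                # unrelated vector

sx, snear, sfar = (lsh_signature(v[None, :], planes)[0] for v in (x, near, far))
# Hamming distance in hash space tracks angular distance in input space:
print((sx != snear).sum() <= (sx != sfar).sum())
```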
Neural-network weights themselves can be reduced for faster inference. Weight pruning removes small-magnitude weights (a form of sparse approximation). Low-rank adaptation (LoRA; Hu et al., 2021) trains low-rank updates to the weights of a pre-trained model rather than full matrices — a direct application of Eckart–Young style low-rank approximation to the fine-tuning problem, and the technique that made fine-tuning of large language models affordable. Quantisation reduces the bit width of weights (from 32-bit float to 8-bit, 4-bit, even 1-bit), which is not literally dimensionality reduction but is the same compress-with-bounded-error pattern. All three turn the weights-as-large-matrices problem into a low-rank-or-sparse-approximation problem.
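The LoRA idea reduces to one line of linear algebra: freeze W and train only a rank-r correction. A shape-level NumPy sketch (the real method lives inside a training framework; zero-initialising one factor, so the model starts unchanged, follows the paper's convention):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8

W = rng.normal(size=(d_out, d_in))            # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(d_out, r))   # trainable low-rank factor
B = np.zeros((r, d_in))                       # zero init: delta_W = A @ B = 0

def forward(x, W, A, B):
    """LoRA-style pass: frozen W plus a rank-r trainable correction."""
    return x @ (W + A @ B).T

# Trainable parameters: r * (d_out + d_in) instead of d_out * d_in.
full, lora = d_out * d_in, r * (d_out + d_in)
print(full, lora, lora / full)                # rank-8 update is ~3% of full

x = rng.normal(size=(4, d_in))
print(np.allclose(forward(x, W, A, B), x @ W.T))  # B = 0 → unchanged at init
```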
Language models start with tokenisation — the process of turning text into integer IDs — and the vocabulary (typically 30 000–200 000 entries) followed by an embedding layer is effectively a massive compression: the space of all possible token strings is collapsed to a finite vocabulary, each entry mapped to a learned vector of a few hundred to a few thousand dimensions. Byte-pair encoding and WordPiece are, in effect, greedy agglomerative clustering algorithms on character sequences; the embedding table is a learned dimensionality reduction from one-hot vocabulary to dense semantic space. The upstream tokenisation and downstream embedding together are the most-used dimensionality-reduction pipeline in all of AI.
Diffusion models, variational autoencoders, and normalising flows are, in a loose sense, inverse dimensionality reducers. The encoder maps high-dimensional inputs into a lower-dimensional latent; the decoder (or denoiser, or flow) maps back. During training, the decoder learns to invert the reduction; during sampling, one draws from the latent prior and runs the decoder forward to synthesise a new high-dimensional sample. Stable Diffusion's "latent diffusion" paradigm explicitly runs the expensive diffusion process in a VAE-reduced latent space of a few thousand dimensions rather than on the 500 000-dimensional pixel grid, a ~100× speed-up. The reduction is load-bearing infrastructure for the generative system.
A recent vein of interpretability research uses sparse autoencoders to decompose the internal activations of transformer models into sparse, interpretable directions — "features" that correspond to human-recognisable concepts (golden-gate-bridge, legal-language, code-style). This is the feature-decomposition idea from classical ICA and sparse PCA, applied to the activations of neural networks rather than the raw data. Early results (Anthropic's 2023 and 2024 papers, OpenAI's work on GPT-4 features) suggest the technique can surface thousands or millions of interpretable features in a single model, opening a path to rigorous explanation of how these models work internally.
This closes the unsupervised-learning arc of Part IV. The next chapter, Probabilistic Graphical Models, returns to a different kind of structure in data — not the clusters or directions or manifolds of the last two chapters, but the conditional independence relations between random variables, made explicit as a graph. Bayesian networks, Markov random fields, hidden Markov models, conditional random fields — the machinery of probabilistic reasoning about structured problems. The methods borrow heavily from everything so far: the EM algorithm from Chapter 04 is the standard tool for fitting models with latent variables, the linear algebra of Chapter 05's PCA reappears as the inference machinery for Gaussian graphical models, and the supervised-versus-unsupervised split that organised Chapters 01–05 gives way to a richer partial-observation formulation that underpins much of modern structured prediction.
Dimensionality reduction spans well over a century of mathematical work — Pearson's 1901 PCA paper, Hotelling's 1933 rediscovery, Spearman's 1904 factor analysis, the full linear-algebra machinery codified by Golub and Van Loan, the manifold-learning revolution of the early 2000s, and the modern neural-representation era. The references below split into the anchor textbooks that should live on a working data scientist's shelf, the foundational papers that introduced each major algorithm, the modern extensions that carry them into 2026, and the software documentation where most of the actual day-to-day work happens.
This page is Chapter 05 of Part IV: Classical Machine Learning. The next chapter — Probabilistic Graphical Models — turns from continuous-geometric structure (clusters, manifolds, directions) to the conditional-independence structure of random variables made explicit as a graph: Bayesian networks, Markov random fields, hidden Markov models, and conditional random fields, with their inference machinery (belief propagation, variational methods, MCMC). The later chapters of Part IV round out classical machine learning with kernel methods and support vector machines (Chapter 07), feature engineering and selection (Chapter 08, the proper supervised treatment of what this chapter does for unsupervised settings), and model evaluation and selection (Chapter 09, the discipline that turns a trained model of any kind into a trustworthy deployed system).