Clustering is the first serious problem in unsupervised learning: given a dataset without labels, partition it into groups of points that are similar to each other and different from points in other groups. The problem is ancient — Linnaeus was doing a kind of clustering when he classified species in 1735, and the k-means algorithm was independently invented at least four times between 1950 and 1965 — and it is also oddly unresolved. Unlike regression and classification, where a training loss and a held-out error together pin down success, clustering has no ground truth: different algorithms carve the same data into different partitions, and there is no single criterion that picks the right one. This chapter develops the four algorithmic families that do most of the work — centroid-based (k-means and its variants), hierarchical (agglomerative and divisive), density-based (DBSCAN, HDBSCAN), and model-based (Gaussian mixtures, soft clustering via EM) — the evaluation machinery that lets you defend a choice among them, and the practical habits that separate a useful clustering from a cargo-cult one. Chapter 05's dimensionality reduction is the usual companion: cluster first in a reduced space, interpret, iterate.
The first section motivates the clustering problem and lays out the four algorithmic families the rest of the chapter will inhabit. Section two is the quiet prerequisite — similarity and distance — because every clustering algorithm is a function of a distance metric, and choosing the metric is often the most consequential decision in the whole pipeline. Sections three through five are the centroid-based family: k-means itself (Lloyd's algorithm and its Forgy/MacQueen variants), the variants that fix its common failure modes (k-means++, mini-batch k-means, kernel k-means), and the robust relatives k-medoids and k-medians. Section six covers hierarchical clustering — the agglomerative linkage family and the dendrogram as a full multi-scale picture of the dataset. Sections seven and eight are the density-based family: DBSCAN as the canonical algorithm that discovers clusters of arbitrary shape and rejects noise, and its descendants HDBSCAN and OPTICS that turn the single density threshold into a hierarchy.
Section nine introduces Gaussian mixture models as the model-based cousin of k-means — k-means is the limit of a GMM with spherical covariances and zero variance — and the natural entry point into soft clustering. Section ten develops the EM algorithm in the detail it deserves, because EM is the tool every later probabilistic model in Part V will inherit. Section eleven is spectral clustering, the linear-algebra approach that handles non-convex geometry by clustering in the eigenvectors of a Laplacian. Section twelve is the hardest and most important: how many clusters — the elbow method, silhouette, gap statistic, BIC, information criteria, and the honest acknowledgement that the question often has no single right answer. Section thirteen is the rest of evaluation, with and without ground-truth labels: ARI, NMI, purity, V-measure, and the internal indices (silhouette, Davies-Bouldin, Calinski-Harabasz). Section fourteen is stability and consensus clustering — the idea that if a clustering is real, it should survive perturbations and agree with itself across methods. Sections fifteen and sixteen are practical: high-dimensional clustering (where Euclidean distance fails and dimensionality reduction becomes almost mandatory) and the operational recipe for attacking a new clustering problem. Section seventeen closes with where clustering shows up in modern ML — as the scaffolding under vector-quantisation, tokenisation, prototype-based few-shot learning, self-supervised representation learning, and the recommendation and retrieval systems that shape the modern web.
Clustering is what learning from data looks like when the data does not come with labels. It is the defining problem of unsupervised learning and a surprisingly hard one — the objective is underdetermined, the evaluation is circular, and two reasonable algorithms will often produce genuinely different partitions of the same dataset. The chapter opens with what that unresolved condition implies for how to approach the problem.
Given a set of data points X = {x_1, …, x_n} in some feature space, partition them into clusters C_1, …, C_K such that points in the same cluster are more similar to each other than to points in different clusters. The problem sounds simple but conceals an entire hierarchy of decisions: what "similar" means (Euclidean distance, cosine, mutual information, graph adjacency), what shape clusters are allowed to take (convex blobs, arbitrary density ridges, elongated manifolds), whether each point must belong to exactly one cluster (hard clustering) or can belong to several (soft / fuzzy clustering), and — the killer — how many clusters there are. A single dataset can be partitioned into 3 meaningful clusters or 30 meaningful clusters depending on which of these questions gets answered first.
Almost every classical clustering algorithm falls into one of four families, and choosing between them is more consequential than tuning any single algorithm. Centroid-based methods (k-means, k-medoids) represent each cluster by a prototype and assign points to the nearest one; they are fast, well-understood, and assume roughly convex, equal-sized, equal-variance clusters. Hierarchical methods build a tree of nested clusters (the dendrogram), giving a multi-scale picture rather than a single partition; they are slow (O(n²) or worse) but expose structure at every granularity. Density-based methods (DBSCAN, HDBSCAN) find clusters as connected regions of high density and explicitly label low-density points as noise; they discover arbitrary shapes and need no prior K. Model-based methods (Gaussian mixtures) fit a probability distribution as a mixture of components and assign points to the component that most likely generated them; they give soft assignments and a principled path to model selection via likelihood. The rest of the chapter is a tour of each family.
Jon Kleinberg's 2002 impossibility theorem (An Impossibility Theorem for Clustering) formalised the everyday frustration: no clustering function can simultaneously satisfy three very reasonable axioms — scale invariance (rescaling the distance shouldn't change the clustering), richness (any partition should be achievable for some distance matrix), and consistency (shrinking within-cluster distances and expanding between-cluster ones shouldn't change the clustering). Any algorithm that satisfies any two will fail the third. This is not a practical obstacle — most real algorithms sacrifice richness, picking a family of achievable partitions — but it is the theoretical reason that clustering admits no universal answer. Every algorithm is biased in favour of some partitions and against others; the art is matching the bias to the problem.
This chapter is about clustering as it sits inside classical machine learning: working on vector data, returning partitions, evaluated against internal or external criteria. It does not cover topic models (soft clustering of documents via latent Dirichlet allocation; see Part V), community detection on networks (modularity, Louvain, Leiden — a large adjacent literature), deep clustering (end-to-end learning of both embeddings and cluster assignments; a mid-2010s-onward line of work), or time-series clustering as a specialised domain. The techniques developed here — distances, centroids, densities, mixtures, hierarchies — are the vocabulary all those extensions assume. Chapter 05's dimensionality reduction is the natural first extension and is almost always used in conjunction with the clustering methods in this chapter.
Every clustering algorithm is a function of a distance or similarity matrix. The choice of metric is not a preprocessing step; it is a statement about what kind of structure the clustering is meant to find. A practitioner who ignores the metric has already made a choice — usually Euclidean — and often the wrong one.
A metric on a set is a function d(x, y) satisfying non-negativity, identity of indiscernibles (d(x, y) = 0 iff x = y), symmetry, and the triangle inequality d(x, z) ≤ d(x, y) + d(y, z). The classical metrics on R^p are the Euclidean distance ‖x − y‖₂, the Manhattan (L1, taxicab) distance Σ|x_i − y_i|, and the Chebyshev (L∞, max-coord) distance max|x_i − y_i| — all special cases of the Minkowski family (Σ|x_i − y_i|^p)^(1/p). The Euclidean default comes from its closed-form relationship to the mean (the sum of squared Euclidean distances from a cluster's mean is minimised, which is exactly what k-means exploits) and to Gaussian likelihoods. The L1 default comes from its robustness to outliers. Pick Euclidean when the data is roughly Gaussian and in comparable units; pick L1 when outliers are common.
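The Minkowski family can be sketched in a few lines; a minimal illustration of how the three classical metrics relate (the names and the helper `minkowski` are ours, not a library API):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance (sum |x_i - y_i|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

euclidean = minkowski(x, y, 2)        # 5.0: the 3-4-5 triangle
manhattan = minkowski(x, y, 1)        # 7.0: sum of coordinate differences
chebyshev = np.max(np.abs(x - y))     # 4.0: the p -> infinity limit of the family
```

Note how the three disagree on the same pair of points; a clustering run with each metric would weigh that pair differently.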
Many useful notions of closeness are not metrics. Cosine similarity — cos(x, y) = x·y / (‖x‖‖y‖) — measures angle between vectors and ignores magnitude; it is the dominant similarity on text (tf-idf and embedding vectors) and retrieval (dense vector search). It is not a metric — the triangle inequality fails — but 1 − cos(x, y) is bounded in [0, 2] and is the distance surrogate used in practice. Jaccard distance 1 − |A ∩ B|/|A ∪ B| is the canonical measure on sets (bag-of-words, user-click sets) and does satisfy the triangle inequality, making it a true metric. Hamming distance counts mismatches in fixed-length strings and is the right measure for categorical data encoded as binary vectors. Edit distance (Levenshtein) on sequences is a metric but expensive to compute. The choice among these is almost always dictated by the data type rather than the algorithm.
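The non-metric similarities above are all short enough to write down directly; a sketch with hand-rolled helpers (real code would use scipy.spatial.distance, which implements all three):

```python
import numpy as np

def cosine_distance(x, y):
    # 1 - cos(x, y); bounded in [0, 2] but not a true metric (triangle inequality fails)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_distance(a, b):
    # On sets; satisfies the triangle inequality, so a true metric
    return 1.0 - len(a & b) / len(a | b)

def hamming_distance(s, t):
    # Mismatch count between fixed-length sequences
    return sum(c1 != c2 for c1, c2 in zip(s, t))

d_cos = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # orthogonal -> 1.0
d_jac = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})           # 1 - 2/4 = 0.5
d_ham = hamming_distance("10110", "10011")                           # 2 mismatches
```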
The Mahalanobis distance d(x, y) = √((x − y)ᵀ Σ⁻¹ (x − y)) generalises Euclidean distance by accounting for the covariance Σ of the data: it is the Euclidean distance after whitening (multiplying by Σ^(−1/2)), so that features with correlated variation count less. A cluster that is visibly elongated in one direction in Euclidean space is a round cluster in Mahalanobis space. The Gaussian mixture model in Section 09 uses exactly this distance in its likelihood formula — one Mahalanobis per component — and that is why GMMs handle elliptical clusters that k-means cannot. For quick work, whitening the data (standardising each feature, or doing a full PCA whitening) before running k-means is a cheap approximation.
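The equivalence between Mahalanobis distance and Euclidean-distance-after-whitening can be verified numerically; a minimal sketch (the whitening matrix is built from the eigendecomposition of the sample covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
# An elongated 2-D cluster: strongly correlated features
X = rng.multivariate_normal([0, 0], [[4.0, 1.9], [1.9, 1.0]], size=500)

Sigma = np.cov(X, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, y, Sigma_inv):
    d = x - y
    return np.sqrt(d @ Sigma_inv @ d)

# Whitening matrix W = Sigma^(-1/2), via the eigendecomposition of Sigma
vals, vecs = np.linalg.eigh(Sigma)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T

x, y = X[0], X[1]
d_maha = mahalanobis(x, y, Sigma_inv)
d_white = np.linalg.norm(W @ x - W @ y)   # Euclidean distance in whitened space
```

The two distances agree up to floating point, which is why whitening before k-means approximates a Mahalanobis clustering.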
For problems where the right metric is not obvious, metric learning learns the distance function from data. Classical methods (Xing et al., 2003; LMNN, Weinberger and Saul, 2009; ITML, Davis et al., 2007) learn a Mahalanobis matrix that pulls same-class points closer and pushes different-class points apart, given some side information about which pairs are similar. Modern deep-learning analogues (Siamese networks, triplet loss, contrastive learning) do the same thing with non-linear encoders — the embedding itself is the learned metric, and the resulting cosine distances are the effective similarity. For clustering, the relevance is that the same raw data embedded by different representation-learning pipelines can produce very different cluster structures, and the embedding stage is often more consequential than the clustering algorithm that follows.
k-means is the most-used clustering algorithm in the world. It was invented at least four times between 1950 and 1965 (Steinhaus, Lloyd, MacQueen, Forgy), it has a beautifully simple objective, and it is fast enough to run on millions of points. It is also badly behaved in several specific ways, and its failure modes organise the next few sections.
k-means partitions X into K clusters C_1, …, C_K and assigns to each cluster a centroid μ_k, minimising the total within-cluster sum of squares (WCSS, also called inertia): J = Σ_k Σ_{x ∈ C_k} ‖x − μ_k‖². For fixed cluster assignments, the optimal centroid is the mean of the cluster's points. For fixed centroids, the optimal assignment is to the nearest centroid. The k-means algorithm (Lloyd, 1957; published 1982) alternates these two steps: assign each point to its nearest centroid; recompute each centroid as the mean of its newly-assigned points; repeat until nothing changes. This is coordinate descent on J, and it is guaranteed to converge because each step strictly decreases J and there are finitely many possible assignments.
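The two alternating steps are short enough to implement from scratch; a minimal sketch of Lloyd's algorithm with Forgy initialisation (the function name and the empty-cluster guard are ours):

```python
import numpy as np

def kmeans_lloyd(X, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and mean steps until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # Forgy init: K data points
    for _ in range(n_iter):
        # Assignment step: each point to its nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    J = d2[np.arange(len(X)), labels].sum()   # the WCSS objective
    return labels, centroids, J

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids, J = kmeans_lloyd(X, K=2)
```

Each iteration can only lower J, which is why the loop terminates.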
The objective quietly encodes several strong assumptions. Spherical clusters: because WCSS penalises Euclidean distance uniformly in every direction, k-means prefers round clusters. Elliptical clusters get split. Equal variance: clusters of very different sizes or spreads are incorrectly partitioned — a tight cluster next to a diffuse one will have the diffuse one's edge points assigned to the tight one. Equal size: k-means implicitly biases toward clusters with roughly the same number of points, because large clusters drag their centroid more. Known K: the number of clusters must be supplied externally. Convex clusters only: non-convex shapes (two interlocking half-moons) cannot be captured by any Voronoi partition. Sensitive to outliers: a single extreme point in a cluster yanks the centroid toward it. Every clustering method in the rest of the chapter exists to fix one or more of these failures.
Lloyd's algorithm only finds a local minimum of J, and which local minimum depends on where the centroids start. The original Forgy initialisation picks K random points as centroids; random partitioning assigns each point to a random cluster and computes centroids. Both can produce terrible solutions when the random starts happen to land multiple centroids in the same true cluster. The standard fix is to run the algorithm R times with different random seeds and keep the solution with the lowest J. A better fix — k-means++, Section 04 — is smarter about the initialisation itself. The lesson: treat k-means as a randomised algorithm, not a deterministic one, and always run multiple restarts.
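In scikit-learn the restart strategy is one parameter; a sketch of the multiple-restart habit (parameters here are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(8, 1, (100, 2)),
               rng.normal([0, 8], 1, (100, 2))])

# n_init=20 runs Lloyd's algorithm from 20 random starts and keeps the lowest-J solution
km = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
best_J = km.inertia_          # the winning objective value across all restarts
```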
Each iteration of Lloyd's algorithm costs O(nKp) for n points in p dimensions with K clusters. The number of iterations is usually small (5–50) and largely independent of n. k-means comfortably handles millions of points on a single machine; billions need the mini-batch variant in Section 04.

Despite its limitations, k-means remains the first thing to try on almost every new clustering problem. It is fast, the output is interpretable (centroids are themselves points in feature space), the objective is a single number that you can compare across values of K, and the algorithm is one of the best-understood optimisation problems in unsupervised learning. Many real problems are well-approximated by spherical equal-variance clusters, especially after standardisation and dimensionality reduction, and a k-means baseline is the reference point against which every fancier method is measured. The chapter's strategy is to understand k-means thoroughly, then meet every other method as a specific fix for one of its failure modes.
Each common failure mode of k-means has a specific fix. k-means++ addresses bad initialisation. Mini-batch k-means addresses scale. Kernel k-means addresses non-convexity. Fuzzy c-means addresses the hard-assignment assumption. Together these variants cover most of the real-world cases where plain k-means struggles.
k-means++ (Arthur and Vassilvitskii, 2007) is a smarter initialisation scheme. Pick the first centroid uniformly at random from the data. For each subsequent centroid, sample a point with probability proportional to its squared distance from the nearest already-chosen centroid. This biases the initialisation toward spread-out centroids — a point in a region already covered by a centroid has low probability of being chosen next — and provably achieves O(log K)-approximation to the optimal J in expectation. The practical effect: fewer random restarts needed, better final solutions, and the new default in every major library (scikit-learn's KMeans uses k-means++ by default; Spark's MLlib does too). Use k-means++ always; plain random initialisation is a historical footnote.
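The D²-weighted sampling is a few lines of numpy; a minimal sketch of the seeding step (scikit-learn's built-in version is what you would use in practice):

```python
import numpy as np

def kmeanspp_init(X, K, rng):
    """k-means++ seeding: first centroid uniform, the rest D^2-weighted."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
                    axis=1)
        probs = d2 / d2.sum()                      # far points get high probability
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(6, 0.3, (50, 2))])
C = kmeanspp_init(X, K=2, rng=rng)
```

A point inside an already-covered blob has d² near zero and is almost never picked, which is exactly the spreading-out bias the algorithm wants.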
Full k-means recomputes centroids against every point at every iteration. Mini-batch k-means (Sculley, 2010) computes centroid updates from random mini-batches of size 100–1000 at each step, much like stochastic gradient descent. The quality of the final solution is slightly worse than full k-means (around 1–3% higher J on standard benchmarks), but the speed is 10–50× faster at scale, and for streaming data it is the only option. Mini-batch is the right default at n > 10^6; at smaller scales full k-means is still preferred. The scikit-learn MiniBatchKMeans class implements it with sensible defaults.
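A usage sketch of the scikit-learn implementation (batch size and n_init here are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5000, 8)), rng.normal(6, 1, (5000, 8))])

# Each update uses a random mini-batch of 256 points instead of the full dataset
mbk = MiniBatchKMeans(n_clusters=2, batch_size=256, n_init=3, random_state=0).fit(X)
labels = mbk.predict(X)
```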
Kernel k-means replaces the Euclidean distance with a kernelised distance d(x, y) = K(x, x) + K(y, y) − 2 K(x, y), effectively running k-means in an implicit high-dimensional feature space without computing coordinates there. With a Gaussian RBF kernel, this gives the algorithm the flexibility to fit non-convex clusters that plain k-means cannot represent. The cost: no explicit centroid in input space (the centroid lives in the RKHS), O(n²) kernel-matrix storage, and the usual Gaussian-kernel bandwidth choice. In practice, spectral clustering (Section 11) is the more common route to non-convex clusters, but kernel k-means is the direct generalisation and the bridge to the kernel-methods chapter.
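The kernelised distance is a one-line identity; a sketch with an RBF kernel (helper names are ours) showing how it stays bounded where Euclidean distance would grow without limit:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel; note K(x, x) = 1 for every x."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_distance_sq(x, y, k=rbf_kernel):
    # Squared distance in the implicit feature space: K(x,x) + K(y,y) - 2 K(x,y)
    return k(x, x) + k(y, y) - 2.0 * k(x, y)

d2_near = kernel_distance_sq(np.array([0.0, 0.0]), np.array([0.1, 0.0]))
d2_far = kernel_distance_sq(np.array([0.0, 0.0]), np.array([10.0, 0.0]))  # saturates near 2
```

For the RBF kernel the squared distance saturates at 2 for distant pairs, which is part of what lets the implicit feature space carve out non-convex shapes.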
Fuzzy c-means (Bezdek, 1981) allows each point to belong to every cluster with a degree of membership in [0, 1] that sums to 1. The membership of point x_i to cluster k is u_{ik} ∝ (1/d(x_i, μ_k))^(2/(m−1)), where m > 1 is the "fuzziness" parameter (m = 2 is standard). Centroids become weighted means of all points, weighted by their memberships. The algorithm is still alternating optimisation and still requires K supplied externally, but it captures the intuition that boundary points are genuinely ambiguous. For overlapping clusters, fuzzy c-means often gives a more faithful picture than a hard partition. Gaussian mixture models (Section 09) are the probabilistic version of the same idea and have largely superseded fuzzy c-means in modern practice, but the fuzzy version remains common in image segmentation and bioinformatics.
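Both update rules fit in a few lines; a sketch of the membership and centroid steps (function names are ours; the small epsilon guards points sitting exactly on a centroid):

```python
import numpy as np

def fcm_memberships(X, centroids, m=2.0):
    """Membership matrix U: u_ik proportional to (1/d_ik)^(2/(m-1)); rows sum to 1."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)
    w = d ** (-2.0 / (m - 1.0))
    return w / w.sum(axis=1, keepdims=True)

def fcm_centroids(X, U, m=2.0):
    """Centroids as membership-weighted means of all points."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]

X = np.array([[0.0], [1.0], [9.0], [10.0]])
C = np.array([[0.5], [9.5]])
U = fcm_memberships(X, C)   # the two left points lean heavily toward the left centroid
```

Alternating these two functions until U stops changing is the whole algorithm.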
Bisecting k-means starts with all points in one cluster and recursively splits the worst cluster into two via k-means until K clusters are produced — a hierarchical cousin of k-means that is often more stable than the flat version. Spherical k-means projects all points onto the unit sphere and clusters on cosine similarity, and is the algorithm inside classical topic-model initialisation and dense-embedding clustering. Constrained k-means enforces must-link and cannot-link constraints from external information (semi-supervised clustering). Each is a tactical variant; none displaces the core algorithm.
k-means uses the arithmetic mean as its prototype, which is not robust to outliers and not meaningful for non-Euclidean data. k-medoids and k-medians replace the mean with a robust alternative — an actual data point, or a coordinate-wise median — making clustering possible on data where means are either fragile or undefined.
k-medoids (Kaufman and Rousseeuw, 1987, published as PAM — Partitioning Around Medoids) replaces each cluster's centroid with a medoid: the actual data point in the cluster that minimises the sum of distances to all other points in the cluster. The medoid is a real point, not a synthetic average, which is essential when points have no meaningful average — categorical data, string data, graph-node data, any data whose distances are defined but whose arithmetic is not. The objective is J = Σ_k Σ_{x ∈ C_k} d(x, m_k) for an arbitrary distance d, not necessarily Euclidean. The classical PAM algorithm is O(K(n − K)²) per iteration, which limits it to smaller datasets; CLARA (Clustering LARge Applications) approximates by running PAM on random subsamples, and CLARANS does randomised neighbour search over medoid assignments.
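The medoid itself is easy to compute once a cluster is fixed; a sketch (the `medoid` helper is ours, quadratic in cluster size like PAM's inner loop) that also shows the robustness contrast with the mean:

```python
import numpy as np

def medoid(cluster, dist=lambda a, b: np.linalg.norm(a - b)):
    """The cluster member minimising total distance to all other members."""
    D = np.array([[dist(a, b) for b in cluster] for a in cluster])
    return cluster[D.sum(axis=1).argmin()]

# Three points near the origin plus one extreme outlier
cluster = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [100.0, 100.0]])
m = medoid(cluster)              # stays near the origin
mean = cluster.mean(axis=0)      # dragged a quarter of the way toward the outlier
```

Because `dist` is a parameter, the same helper works for any distance — edit distance on strings, Jaccard on sets — where no mean exists.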
Two reasons. First, robustness: the medoid is the geometric median of the cluster (in the L1 sense) and is insensitive to outliers in a way the arithmetic mean is not. A single point 100 standard deviations from a cluster's centre will pull the mean substantially but not the medoid. Second, interpretability: a medoid is itself a data point that can be examined, explained, described. For a customer-segmentation project, "the medoid customer of cluster 3 is customer #4172, who has these properties" is often more useful than "the centroid customer has these synthetic averaged properties". Every interpretable-clustering pipeline eventually benefits from medoids as exemplars, even if the partitioning itself was computed with k-means.
k-medians uses the coordinate-wise median as the cluster prototype and minimises the L1 sum J = Σ_k Σ_{x ∈ C_k} Σ_j |x_j − μ_{kj}|. Unlike the medoid (which is a data point) the coordinate-wise median is a synthetic point, but unlike the mean it is robust to outliers — a single extreme point in one coordinate does not move the median at all. For data with heavy-tailed feature distributions (income, response time, transaction value), k-medians frequently produces more stable clusterings than k-means without requiring the switch to arbitrary-distance medoids. It is the natural choice when the L1 distance is the right metric and the data remains in R^p.
Prefer the medoid and median variants when robustness to outliers matters, when the distance is non-Euclidean, or when the prototype must be an interpretable exemplar, provided n is not enormous. The three share the alternating-optimisation structure and the same K-choosing problem.
Affinity propagation (Frey and Dueck, 2007) is a message-passing algorithm that identifies exemplars — points that are representative of the clusters — without requiring K to be supplied. It passes responsibility and availability messages between all pairs of points and converges on a set of exemplars that double as cluster prototypes. It is elegantly parameter-light, it handles non-Euclidean distances naturally, and it finds its own cluster count (controlled indirectly by a "preference" parameter). The cost is O(n²) memory and many iterations, limiting it to moderate datasets; for those datasets, affinity propagation is often a strong alternative to k-medoids with a pre-specified cluster count.
Hierarchical clustering does not produce a single partition of the data; it produces a tree of nested partitions — a dendrogram — that shows how points merge (or split) at every scale. The tree is often more informative than any single cut of it. For small-to-medium datasets where the right number of clusters is unclear, hierarchical clustering is usually the first method to try.
Agglomerative (bottom-up) hierarchical clustering starts with every point as its own singleton cluster and repeatedly merges the two closest clusters until only one remains. Divisive (top-down) clustering starts with everything in one cluster and recursively splits it, usually via k-means with K = 2. Agglomerative is overwhelmingly more common because it is simpler to implement, builds the full dendrogram at predictable cost (O(n²) memory, O(n² log n) to O(n³) time depending on the linkage), and admits several natural definitions of cluster-to-cluster distance that the divisive form lacks. The output of either is a dendrogram — a tree where leaves are data points and internal nodes are merges, with merge heights equal to the distance at which the merge happened.
The single knob in agglomerative clustering is the linkage: how is the distance between two clusters defined? Single linkage uses the minimum distance between any two points, one from each cluster; it produces long chain-like clusters and suffers from the chaining effect, where a single bridge of outliers glues distant clusters together. Complete linkage uses the maximum distance; it produces tight compact clusters and tends to break large clusters apart. Average linkage (UPGMA) uses the mean pairwise distance; a reasonable compromise. Ward's linkage (Ward, 1963) minimises the increase in within-cluster variance at each merge, and it produces clusterings that most closely resemble k-means — Ward is often called the "k-means of hierarchical clustering" and is the default in scikit-learn. Picking the linkage is a choice about what kind of clusters you are looking for, and the four standard linkages can produce genuinely different dendrograms on the same data.
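Comparing the four linkages on the same data is one scipy call each; a sketch (the blob data is illustrative) that builds each merge table and cuts it into three flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(5, 0.5, (30, 2)),
               rng.normal([0, 5], 0.5, (30, 2))])

# Same data, four standard linkages; fcluster cuts each dendrogram into 3 clusters
flat = {}
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                       # (n-1) x 4 merge table
    flat[method] = fcluster(Z, t=3, criterion="maxclust")
```

On well-separated blobs all four agree; on real data, comparing the `flat` partitions across linkages is a cheap stability check.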
The dendrogram is the main output of hierarchical clustering. Read it by drawing a horizontal line at a chosen height: every vertical line the horizontal crosses is a cluster, and the line's leaves are the cluster's members. Small changes in the height produce small changes in the number of clusters, which is itself a diagnostic — if cluster counts jump abruptly with tiny threshold changes, the clustering is unstable. The tallest un-split branches indicate the cleanest cluster separations; very short branches at the top usually indicate weak structure. Dendrograms also reveal hierarchical structure the flat-partition methods cannot: a clustering of species that contains "carnivore" as a super-cluster of "canid" and "felid" is a genuinely useful piece of information that k-means would collapse into three unrelated partitions.
The O(n²) memory cost of hierarchical clustering makes it impractical above n ≈ 50,000 — the full pairwise distance matrix becomes a bottleneck before any algorithm runs. BIRCH (Zhang, Ramakrishnan, Livny, 1996) is a scalable hierarchical alternative that maintains a compact in-memory summary (the CF-tree) and processes data in a single pass; it is the right choice when hierarchical structure matters at scale. For most real use cases, the workflow is: sample a subset, run agglomerative clustering on the sample to understand the structure and choose K, then run k-means on the full dataset with that K. This hybrid recipe combines hierarchical clustering's exploratory strength with k-means' scale.
DBSCAN (Ester, Kriegel, Sander, Xu, 1996) is the canonical density-based clustering algorithm: it finds clusters as connected regions of high point density, allows clusters of arbitrary shape, does not require K to be specified, and explicitly labels sparse points as noise. It is the first method to reach for whenever the clusters are not going to be round.
DBSCAN takes two parameters: a distance threshold ε and a minimum point count minPts. A point is a core point if at least minPts points (including itself) lie within ε of it. A border point is within ε of a core point but has fewer than minPts neighbours itself. A noise point is neither. The algorithm picks an arbitrary unvisited point; if it is a core point, start a new cluster and recursively include all points within ε of any core point in the cluster; if not, label it as noise (it may be reassigned to a cluster later as a border point). Continue until every point has been visited. The output is a partition of the core and border points into clusters, with noise points set aside. The clusters take whatever shape the dense regions of the data happen to have.
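A usage sketch on the classic non-convex example (eps and min_samples here are illustrative choices for this synthetic data, not general defaults):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interlocking half-moons: the non-convex shape k-means cannot separate
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                  # cluster ids; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Density propagates along each moon's arc, so each moon becomes one cluster regardless of its shape.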
Three things that k-means gets wrong. Arbitrary shape: a dense S-curve or concentric rings are detected as coherent clusters, because density propagates along any path where neighbours exist. No K required: the number of clusters emerges from the data; you pick ε and minPts instead. Explicit noise: points in sparse regions are not forced into a cluster. A cluster of 20 customers who behave coherently and 5 idiosyncratic outliers is partitioned into a 20-point cluster plus 5 noise points, which is usually what you want. For geographic data, astronomical data, fraud-detection patterns, and any setting where some points just don't belong to any cluster, DBSCAN's noise-awareness is a structural advantage.
The two parameters trade off in a specific way. minPts controls how many points constitute "dense" — smaller values find more clusters (and more noise); larger values find fewer, denser clusters. The standard heuristic is minPts = 2p or 2p + 1 where p is the feature-space dimension. ε controls what counts as "close". The standard choice: plot the k-distance graph (for each point, the distance to its k-th nearest neighbour, where k = minPts), sort the distances, and look for the elbow — the knee of the curve is a reasonable ε. Getting ε right is genuinely harder than getting K right in k-means, and DBSCAN is sensitive to it: too small and everything becomes noise; too large and separate clusters bleed together.
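The k-distance graph is a short computation; a sketch (brute-force pairwise distances, fine at this scale; a KD-tree would replace it for large n):

```python
import numpy as np

def k_distance_curve(X, k):
    """Sorted distance from each point to its k-th nearest neighbour; the elbow suggests eps."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D.sort(axis=1)               # column 0 is each point's self-distance, 0
    return np.sort(D[:, k])     # k-th neighbour, self excluded by the indexing

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(4, 0.2, (100, 2))])
curve = k_distance_curve(X, k=4)   # k = minPts = 2p for p = 2 features
```

Plotting `curve` and reading off the knee is the standard way to pick ε; the flat early portion is within-cluster distances, the steep tail is sparse-region distances.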
No single ε works when cluster densities differ: the small ε that finds the dense cluster breaks the diffuse one into fragments; the large ε that holds the diffuse one together merges the dense one with everything nearby. This is the problem HDBSCAN solves.
Naive DBSCAN is O(n²) because every point needs its neighbours computed. With a spatial index (KD-tree, ball tree, R-tree), the cost drops to O(n log n) in low dimensions, which is the regime where DBSCAN is most useful. At p > 20, the curse of dimensionality kicks in (every pair of points becomes roughly equidistant) and density-based methods struggle; this is where dimensionality reduction before clustering is essentially mandatory. For low-dimensional geographic or tabular data, DBSCAN scales comfortably to millions of points and is the right default.
DBSCAN's single ε parameter is its main weakness: clusters of varying density cannot be captured. HDBSCAN and OPTICS both fix this by making density a hierarchy rather than a threshold. HDBSCAN, in particular, has become the modern default density-based clusterer.
OPTICS (Ordering Points To Identify the Clustering Structure; Ankerst, Breunig, Kriegel, Sander, 1999) runs a DBSCAN-like sweep across all values of ε simultaneously by recording a reachability distance for each point — how far out you have to look to reach the point from the already-processed part of the dataset. The resulting reachability plot, a bar chart of reachability distances in the processing order, shows clusters as valleys: deep narrow valleys are dense tight clusters; wide shallow ones are diffuse clusters. Any horizontal cut of the plot at a given reachability gives a DBSCAN-like clustering; cuts at different heights give different-density clusterings. OPTICS is rarely used directly — the reachability plot is hard to automate — but it was the first algorithm to make the density hierarchy explicit.
HDBSCAN (Campello, Moulavi, Sander, 2013) replaces OPTICS's reachability plot with a more principled condensed cluster tree and an automatic extraction rule. The algorithm: (1) transform the data by replacing each point's distance to its neighbours with a mutual reachability distance that accounts for local density; (2) build a minimum spanning tree in the transformed space; (3) compute the hierarchical clustering tree of the MST; (4) condense the tree by collapsing clusters smaller than min_cluster_size; (5) extract the flat clustering by selecting, for each path through the condensed tree, the cluster that maximises a persistence/stability measure. The output: clusters of arbitrary shape, at arbitrary (and mixed) densities, with noise points explicitly labelled, with no ε parameter at all. The single knob is min_cluster_size, which is much easier to set sensibly than ε.
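Step (1), the mutual reachability transform, can be sketched directly (a simplified illustration; real implementations build the MST without materialising the full matrix):

```python
import numpy as np

def mutual_reachability(X, min_samples=5):
    """d_mreach(a, b) = max(core_a, core_b, d(a, b)), where core_x is the
    distance from x to its min_samples-th nearest neighbour."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    core = np.sort(D, axis=1)[:, min_samples]   # per-point core distance (self at column 0)
    # Pairwise max of the two core distances and the raw distance
    return np.maximum(np.maximum(core[:, None], core[None, :]), D)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
M = mutual_reachability(X)
```

The transform inflates distances in sparse regions (where core distances are large), so the subsequent MST-based hierarchy sees sparse points as far from everything — which is how noise gets isolated.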
HDBSCAN's advantages over DBSCAN are practical rather than theoretical: it handles varying density; it has one intuitive parameter instead of two; the parameter min_cluster_size directly controls what you usually care about (minimum useful cluster size); and it returns a soft cluster membership probability in [0, 1] as a bonus. The cost is some complexity in the implementation and slightly higher runtime than DBSCAN. In practice, HDBSCAN is now the default density-based clusterer in scientific Python (hdbscan package), and DBSCAN is mostly used when there is a specific reason to fix ε externally (e.g. a known physical distance scale in the data).
The standard recipe: embed the data with UMAP, then run HDBSCAN with min_cluster_size chosen to match the smallest useful cluster; inspect the results; adjust min_cluster_size and the UMAP n_neighbors if the clusters seem wrong. This pipeline — UMAP + HDBSCAN — has become the de facto standard for exploratory unsupervised analysis, particularly in single-cell biology, NLP embeddings, and customer-analytics work.
Two other density-based methods are worth knowing. Mean-shift (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002) treats each point as a seed and iteratively shifts it toward the mode of the local density (via kernel density estimation), converging on local density maxima that become cluster centres. Clusters are then defined by which mode each point converges to. Mean-shift is elegant and parameter-light (one bandwidth), but O(n²) per iteration limits it to smaller datasets. DENCLUE is a generalisation that uses arbitrary density-estimation kernels. For most practical work, HDBSCAN has absorbed the density-based niche; mean-shift persists in image-segmentation and computer-vision applications.
A Gaussian mixture model is k-means in its full probabilistic form: each cluster is a Gaussian distribution, each point belongs to every cluster with some probability, and the model is fitted by maximising the likelihood of the data. GMMs handle elliptical clusters that k-means cannot, give proper soft assignments, and plug into a principled model-selection framework via BIC.
A Gaussian mixture model assumes the data was generated as follows: first, a hidden cluster label z ∈ {1, …, K} is drawn from a categorical distribution with probabilities π_1, …, π_K; then the observed point x is drawn from a Gaussian N(μ_{z}, Σ_{z}) specific to that cluster. The marginal density of x is then the mixture p(x) = Σ_k π_k · N(x; μ_k, Σ_k). A GMM fits this model by finding the parameters {π_k, μ_k, Σ_k} that maximise the log-likelihood Σ_i log p(x_i). Once fitted, the posterior responsibility γ_{ik} = p(z_i = k | x_i) ∝ π_k N(x_i; μ_k, Σ_k) is the probability that point i belongs to cluster k — a soft assignment that sums to 1 across clusters.
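A minimal illustration with scikit-learn (synthetic blobs as a stand-in dataset): fit the mixture, then read the responsibilities γ_{ik} off predict_proba:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

gamma = gmm.predict_proba(X)   # responsibilities gamma_{ik}, shape (300, 3)
hard = gamma.argmax(axis=1)    # MAP hard assignment, if a flat clustering is needed
```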
The covariance matrix Σ_k controls the shape of each Gaussian, and different choices make GMMs more or less flexible. Spherical (Σ_k = σ_k² I): each cluster is a round blob with its own radius; in the limit σ → 0, this is k-means. Diagonal (Σ_k diagonal): axis-aligned ellipsoids. Tied (all Σ_k equal): one Gaussian shape shared across clusters — the same assumption linear discriminant analysis makes. Full (arbitrary positive-definite Σ_k): each cluster is a general ellipsoid in any orientation; the most flexible, and the most prone to overfitting with few data points. The default in scikit-learn is "full" and it is usually right; switch to "diag" or "tied" if you are fitting a GMM to high-dimensional data with few samples.
k-means is the MAP (hard) version of a GMM with spherical equal-variance covariances and uniform mixing weights. Every improvement GMMs add is a relaxation of one of these assumptions. Unequal variances: the GMM can fit a tight cluster and a diffuse one simultaneously; k-means cannot. Elliptical clusters: a GMM with full covariance fits any ellipsoid; k-means forces spherical. Soft assignments: the responsibility γ_{ik} is a real probability, not a binary indicator, and carries uncertainty about boundary points. Unequal priors: the π_k let small clusters coexist with large ones without getting absorbed. The cost is more parameters to fit and slower convergence; the benefit is a substantially more flexible model that plugs directly into the EM framework of the next section.
A numerical caveat: with few points per component, a fitted covariance matrix can become singular and the likelihood degenerate. scikit-learn guards against this by adding a small constant to the diagonal of every covariance (reg_covar, default 1e-6). Set it larger for small datasets; 1e-4 is a reasonable bump.
The GMM is a proper probabilistic model, so standard likelihood-based model-selection criteria apply. The Bayesian Information Criterion BIC = −2 log L + k log n penalises the log-likelihood by the number of parameters k and the sample size n; picking the K (and covariance structure) that minimises BIC is a principled answer to the number-of-clusters question. The Akaike Information Criterion (AIC) uses a weaker penalty; for clustering, BIC is usually preferred because it penalises complexity more heavily and produces fewer spurious clusters. This is the main reason GMMs are often chosen over k-means even when the clusters are approximately round: they let you choose K by optimisation rather than by heuristic.
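A sketch of the BIC sweep on synthetic three-blob data (the candidate range and sizes are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=1)

candidates = list(range(1, 7))
bics = [
    GaussianMixture(n_components=k, covariance_type="full",
                    n_init=3, random_state=0).fit(X).bic(X)
    for k in candidates
]
best_k = candidates[int(np.argmin(bics))]   # K chosen by minimum BIC
```

The same loop can also range over covariance_type, letting BIC arbitrate the structure choice alongside K.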
Expectation-Maximisation is the iterative scheme that fits latent-variable models — GMMs, hidden Markov models, latent Dirichlet allocation, any model where some of the variables are hidden and their values have to be inferred along with the parameters. It is one of the most broadly useful algorithms in machine learning, and the cleanest way to understand it is through its application to the GMM.
In a GMM, we want to find the parameters θ = {π_k, μ_k, Σ_k} that maximise the log-likelihood log p(X | θ) = Σ_i log Σ_k π_k N(x_i; μ_k, Σ_k). Direct maximisation is awkward: the log of a sum admits no closed-form stationary point, and the likelihood surface is non-concave, so naive gradient ascent is fiddly and exploits none of the model's structure. The EM algorithm (Dempster, Laird, Rubin, 1977) solves this by introducing the missing cluster labels z_i and alternating between inferring them given the parameters (E-step) and re-estimating the parameters given the inferred labels (M-step). Each iteration provably increases (or leaves unchanged) the log-likelihood, and the algorithm converges to a local maximum.
E-step (Expectation): given the current parameters, compute the responsibilities — the posterior probabilities of each hidden label — γ_{ik} = π_k N(x_i; μ_k, Σ_k) / Σ_j π_j N(x_i; μ_j, Σ_j). Each point gets a probability distribution over cluster memberships. M-step (Maximisation): given the responsibilities, update the parameters as weighted sample statistics. Each cluster's mean is the responsibility-weighted mean of the data; each cluster's covariance is the responsibility-weighted empirical covariance; each cluster's prior π_k is the average responsibility. Iterate until the log-likelihood change falls below a tolerance. This is the cleanest example of the general EM recipe, and it generalises directly to other latent-variable models.
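The two steps can be written out directly in NumPy — a pedagogical sketch of full-covariance EM (random initialisation, minimal covariance regularisation), not a substitute for sklearn.mixture.GaussianMixture:

```python
import numpy as np
from sklearn.datasets import make_blobs

def fit_gmm_em(X, K, n_iter=200, tol=1e-6, seed=0):
    """Fit a full-covariance GMM by EM; returns params and log-likelihood trace."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)]           # means: random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    trace = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed stably via log-sum-exp
        log_p = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            L = np.linalg.cholesky(Sigma[k])
            z = np.linalg.solve(L, diff.T)                  # whitened residuals, (d, n)
            maha = np.sum(z ** 2, axis=0)                   # squared Mahalanobis distances
            log_det = 2.0 * np.sum(np.log(np.diag(L)))
            log_p[:, k] = np.log(pi[k]) - 0.5 * (maha + log_det + d * np.log(2 * np.pi))
        m = log_p.max(axis=1, keepdims=True)
        log_norm = (m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True)))[:, 0]
        gamma = np.exp(log_p - log_norm[:, None])           # responsibilities gamma_{ik}
        trace.append(float(log_norm.sum()))
        if len(trace) > 1 and abs(trace[-1] - trace[-2]) < tol:
            break
        # M-step: weighted sample statistics
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, np.array(trace)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
pi, mu, Sigma, ll = fit_gmm_em(X, K=3)
```

The log-likelihood trace is non-decreasing, which is the EM guarantee made visible.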
EM can be understood through the evidence lower bound (ELBO). The log-likelihood log p(X | θ) satisfies, for any distribution q(Z) over the hidden variables, log p(X | θ) ≥ E_q[log p(X, Z | θ)] − E_q[log q(Z)]. The right-hand side is the ELBO; the gap between it and the true log-likelihood is the KL divergence KL(q ‖ p(Z | X, θ)). The E-step sets q(Z) = p(Z | X, θ), making the gap zero; the M-step maximises the ELBO over θ. Each step moves the ELBO up, and each E-step tightens the bound against the true likelihood. This framing is what connects EM to variational inference, where the E-step's exact posterior is replaced by a tractable approximation.
EM has three main weaknesses. Local maxima: EM converges to the nearest local maximum, which depends on initialisation; run multiple restarts and pick the solution with the highest log-likelihood. Slow convergence near the optimum (EM is a first-order method); quasi-Newton or stochastic variants exist but complicate implementation. Requires the E-step to be tractable; for complex graphical models it is not, and variational EM (with an approximate posterior) or sampling (MCMC) takes over. For GMMs, none of these are prohibitive, but they become central in the more complex latent-variable models of Part V. The lesson: understand EM at the GMM level and you have the scaffolding for almost every probabilistic model you will see later.
Spectral clustering takes a different route to non-convex clusters: represent the data as a graph of nearest neighbours, then cluster in the space of the top eigenvectors of the graph Laplacian. It handles arbitrary shapes that k-means cannot, without the single-density assumption of DBSCAN, and it's one of the cleanest applications of linear algebra to unsupervised learning.
The first step is to build a similarity graph on the data: each point is a vertex, and edges connect similar points. The standard construction is a k-nearest-neighbour graph (each point connected to its k nearest neighbours) or a Gaussian (RBF) graph (edge weights w_{ij} = exp(−‖x_i − x_j‖² / 2σ²) for all pairs). The graph encodes the local structure of the data, and the key shift is that clustering now happens on this graph, not on the original coordinates. Two points can be close on the graph (many short paths between them) even if they are far in Euclidean distance, which is exactly what lets spectral clustering find elongated clusters and manifold structure.
Given adjacency matrix W and degree matrix D (diagonal, with D_{ii} = Σ_j W_{ij}), the unnormalised Laplacian is L = D − W. The normalised symmetric Laplacian is L_sym = I − D^(−1/2) W D^(−1/2). Both have a critical property: the multiplicity of the eigenvalue 0 equals the number of connected components of the graph, and the corresponding eigenvectors are indicator vectors on the components. If the graph has exactly K perfectly-disconnected clusters, the K eigenvectors with the smallest eigenvalues of L trivially reveal them. When the graph is not perfectly disconnected (real data is always a little noisy), those K eigenvectors still embed the clusters in a low-dimensional space where they become approximately linearly separable.
Spectral clustering (Ng, Jordan, Weiss, 2002; Shi and Malik, 2000) proceeds as follows: (1) build the similarity graph W; (2) compute the Laplacian L; (3) find the eigenvectors corresponding to the K smallest eigenvalues; (4) stack them as columns of an n × K matrix; (5) cluster the rows of that matrix with k-means. The last step is usually plain k-means, which works well because in the spectral-embedded space the clusters have become round and linearly separable even when they were arbitrarily shaped in the original data. The algorithm handles concentric rings, two moons, elongated manifolds, and any other non-convex structure that k-means alone cannot.
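A sketch of the full pipeline via scikit-learn, on the two-moons data that defeats plain k-means (sizes and the neighbour count are illustrative):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",   # step (1): k-NN similarity graph
    n_neighbors=10,
    assign_labels="kmeans",         # step (5): k-means on the spectral embedding
    random_state=0,
)
labels = sc.fit_predict(X)
ari = adjusted_rand_score(y, labels)
```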
Use spectral clustering on small-to-medium datasets (n < 10^4) with visibly non-convex clusters, or clusters on a manifold. Spectral clustering is O(n³) in the standard implementation (eigendecomposition of an n × n matrix), which limits scale. For larger problems, use Nyström approximation of the Laplacian or fall back to HDBSCAN on a UMAP embedding, which achieves similar ends through a different route.
The single most consequential parameter is the graph construction — specifically, the k in the k-NN graph or the bandwidth σ in the RBF kernel. Too sparse and the graph disconnects into spurious components (spectral clustering will pick those up as clusters); too dense and the clusters smear together. The k in the k-NN graph is typically between log n and 20; the σ for RBF is often set to the median pairwise distance or tuned by cross-validation against an internal criterion. Unlike the choice of K (which can be automated via eigengap analysis — the gap between λ_K and λ_{K+1} signals the right number), the graph-construction parameter has to be set with care.
The single hardest question in clustering is how many clusters there are. Several families of heuristics address it — elbow plots, silhouette scores, the gap statistic, information criteria — and none of them gives a clean answer. A practitioner's real answer is usually some mixture of these metrics, stability analysis, and the downstream use case.
Plot the within-cluster sum of squares J(K) (or whatever objective the algorithm uses) as a function of K, and look for a knee or elbow — the point where the curve transitions from steep to shallow. The intuition: adding a cluster beyond the true K only explains noise, so the marginal reduction in J drops substantially. The elbow is the oldest heuristic for choosing K, and it is the one practitioners reach for first. It is also notoriously subjective — on real data, the curve is often smooth with no clear knee — so it is best treated as the first step of a broader analysis rather than a final answer.
The silhouette coefficient (Rousseeuw, 1987) for a point i is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other points in its own cluster and b(i) is the average distance from i to the points in the nearest other cluster. s(i) ∈ [−1, 1]: close to 1 means well-clustered, close to 0 means boundary, negative means probably in the wrong cluster. The average silhouette across the dataset is a single number quantifying cluster quality, and the K that maximises it is a reasonable choice. Silhouette works across algorithms (not tied to a particular objective), is interpretable at the point level (silhouette plots are genuinely informative), and is the most widely-used internal metric. Its main weakness: it favours convex round clusters, so it undervalues HDBSCAN or spectral-clustering outputs on non-convex data.
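A sketch of the silhouette sweep on synthetic data (the four explicit blob centres are chosen to make the answer unambiguous):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=0.7, random_state=2)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean s(i) over all points

best_k = max(scores, key=scores.get)
```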
The gap statistic (Tibshirani, Walther, Hastie, 2001) compares the within-cluster sum of squares on the real data to the expected WCSS under a null reference distribution (usually uniform in the bounding box of the data) at each K. If W_k is the real WCSS and E[W_k] is the null expectation, the gap is log E[W_k] − log W_k. Tibshirani's rule picks the smallest K such that Gap(K) ≥ Gap(K+1) − s_{K+1}, where s_{K+1} is the Monte Carlo standard error — in effect, the smallest K at which the gap has peaked. This method is more principled than the elbow (it has a null model baked in) and often works where the elbow is ambiguous. The cost: computing E[W_k] requires Monte Carlo simulation with many null datasets.
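A compact implementation of the procedure (uniform reference in the bounding box; B and the candidate range are kept small for speed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def log_wcss(X, k, seed=0):
    return np.log(KMeans(n_clusters=k, n_init=5, random_state=seed).fit(X).inertia_)

def gap_statistic(X, k_max=6, B=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        # Monte Carlo: WCSS on B null datasets uniform in the bounding box
        ref = np.array([log_wcss(rng.uniform(lo, hi, size=X.shape), k)
                        for _ in range(B)])
        gaps.append(ref.mean() - log_wcss(X, k))
        errs.append(ref.std(ddof=1) * np.sqrt(1 + 1.0 / B))
    # Tibshirani's rule: smallest K with Gap(K) >= Gap(K+1) - s_{K+1}
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps

X, _ = make_blobs(n_samples=300, centers=[[-6, -6], [0, 6], [6, -4]],
                  cluster_std=0.5, random_state=3)
k_hat, gaps = gap_statistic(X)
```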
For model-based clustering (GMMs), choosing K is a proper model-selection problem. BIC = −2 log L + k log n (where k is the parameter count) and AIC = −2 log L + 2k both trade off fit against complexity. Pick the K minimising BIC for most work; use AIC only when you expect to have many more clusters than the data can support (AIC penalises complexity less). Both naturally handle the covariance-structure choice (spherical vs diagonal vs full) as part of the same optimisation. This is the main methodological advantage of GMMs over k-means: you don't have to guess K, you optimise it.
A practical recipe: sweep K ∈ {2, …, 20}, compute silhouette and gap at each, fit a GMM with BIC model selection, inspect the dendrogram if hierarchical clustering is feasible, and look at the top three or four candidates by eye. The right K is usually one or two of the contenders. If the metrics disagree wildly, your clusters may not exist.
Not every dataset has clusters. A well-behaved continuum — a single Gaussian blob in the middle of feature space — will be happily partitioned by k-means into arbitrary equal-area slices, and the elbow plot will be smooth, the silhouette will be weak, the BIC will decrease monotonically with K. The diagnostic: run the clustering on the data and on a null dataset (permute each feature independently to break any cluster structure), and compare. If the metrics look similar, there are no clusters. Consensus clustering (Section 14) formalises this: if small perturbations of the data change the cluster assignments dramatically, the clustering is not stable and is probably not real.
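The permutation diagnostic in a few lines, with blob data standing in for the "clustered" case (sizes and K are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=400, centers=[[-6, -6], [0, 6], [6, -4]],
                  cluster_std=0.7, random_state=4)

# null data: permute each feature independently, destroying joint structure
X_null = np.column_stack([rng.permutation(col) for col in X.T])

def kmeans_silhouette(data, k=3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

real_score = kmeans_silhouette(X)
null_score = kmeans_silhouette(X_null)   # similar scores would mean no clusters
```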
Choosing K was one evaluation problem. A separate one is: given a clustering, is it any good? The metrics split cleanly in two — internal indices use only the data and the clustering, external indices compare to known labels when you happen to have them.
Internal metrics measure properties of the clustering itself, typically in terms of compactness (points within a cluster are close) and separation (points across clusters are far). The three standards are the silhouette coefficient (already defined in Section 12), the Davies–Bouldin index, and the Calinski–Harabasz index. Davies–Bouldin averages, over clusters, the ratio of within-cluster scatter to between-cluster separation — smaller is better. Calinski–Harabasz (also called the variance-ratio criterion) is the ratio of between-cluster dispersion to within-cluster dispersion normalised by cluster count — larger is better. All three have a bias toward convex, equal-variance clusters because they use Euclidean geometry, so they systematically undervalue the solutions of DBSCAN, HDBSCAN, and spectral clustering on non-convex data.
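All three internal indices are one call each in scikit-learn (synthetic blobs for illustration; remember the direction of each score):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=400, centers=[[-6, -6], [0, 6], [6, -4]],
                  cluster_std=0.7, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)     # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)  # higher is better
```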
If you have labels — either because the dataset was labelled for a supervised problem and you want to see whether clustering rediscovers them, or because you've set up a controlled benchmark — you can measure how well the clustering agrees with the labels. The naive metric is purity: for each cluster, take the fraction of points belonging to its majority class, then average over clusters weighted by size. Purity is easy to understand but biased: you can always maximise it by producing many tiny clusters.
The two metrics to actually use are adjusted Rand index (ARI) and normalised mutual information (NMI). The Rand index counts the fraction of point pairs that are either in the same cluster in both partitions or in different clusters in both; ARI corrects for chance by subtracting the expected value under random labellings. ARI is 1 for identical clusterings, 0 for random agreement, and can be slightly negative for worse-than-random. NMI is the mutual information between cluster and label, normalised by the geometric or arithmetic mean of the two entropies; it is 1 for identical partitions and 0 for statistical independence. ARI penalises cluster-number mismatches more strongly than NMI; NMI is more forgiving when a clustering is a refinement of the labels (splits a true class into two).
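A toy example showing both properties — ARI's invariance to label renaming, and NMI's leniency toward refinements (the label vectors are made up for illustration):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0] * 50 + [1] * 50
renamed = [1] * 50 + [0] * 50              # same partition, labels swapped
split = [0] * 25 + [1] * 25 + [2] * 50     # refinement: class 0 split in two

ari_renamed = adjusted_rand_score(truth, renamed)          # exactly 1.0
ari_split = adjusted_rand_score(truth, split)
nmi_split = normalized_mutual_info_score(truth, split)     # more forgiving of the split
```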
A third class of evaluation measures how reproducible the clustering is under perturbation — different random seeds, bootstrap resamples, added noise, feature subsampling. Stable clusterings are, other things equal, more trustworthy than unstable ones. Section 14 makes this concrete with consensus clustering. For now: always report a clustering along with how sensitive it is to seed and sample.
At minimum report: the algorithm and its hyperparameters, the feature set and how it was scaled, the distance or similarity metric, K if applicable, the random seed, and one or two internal metrics. If ground truth exists, add ARI and NMI. If stability matters, add a stability score across B bootstrap runs. Include a 2-D projection (PCA or UMAP, reproducible seed) with the clusters coloured, so a reader can see what you saw.
Unsupervised results are easy to overinterpret. Two runs with different seeds can produce rather different clusterings, and if you only ever looked at one, you'd never know. Stability analysis asks how much the output changes under small perturbations — resampled data, added noise, different initialisation — and consensus clustering aggregates many runs into a single more trustworthy partition.
Suppose you run k-means with K = 5 on a dataset and you run it again on a random 80 % subsample of the same dataset. If the two results agree on most points, the five clusters are a real feature of the data. If the two results disagree on most points, the five clusters are an artefact of the particular points you happened to include, and the algorithm is fitting noise at the boundaries. Stability is therefore a proxy for signal that requires no ground truth.
The procedure: draw B subsamples of the data (either bootstrap with replacement, or uniform without replacement at 80 % size), cluster each, and compare the resulting partitions on the points they share in common. Pairwise ARI between the B partitions gives a stability distribution; its mean or median is the summary. A mean pairwise ARI above about 0.75 is usually considered stable; below 0.5 is shaky. This is also one of the best ways to pick K: choose the K that is most stable across resamples (Ben-Hur, Elisseeff, Guyon, 2002).
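A sketch of the procedure with k-means (B, the subsample fraction, and the blob data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def stability(X, k, B=8, frac=0.8, seed=0):
    """Mean pairwise ARI between clusterings of random subsamples (shared points only)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = np.full((B, n), -1)                 # -1 marks points left out of a run
    for b in range(B):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels[b, idx] = KMeans(n_clusters=k, n_init=10,
                                random_state=b).fit_predict(X[idx])
    aris = []
    for a in range(B):
        for b in range(a + 1, B):
            shared = (labels[a] >= 0) & (labels[b] >= 0)
            aris.append(adjusted_rand_score(labels[a][shared], labels[b][shared]))
    return float(np.mean(aris))

X, _ = make_blobs(n_samples=400, centers=[[-6, -6], [0, 6], [6, -4]],
                  cluster_std=0.6, random_state=5)
score = stability(X, k=3)
```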
Consensus clustering (Monti et al., 2003) formalises the idea that agreement across many runs is more trustworthy than any single run. For each pair of points, compute the fraction of bootstrap runs in which they landed in the same cluster — this is the consensus matrix, an n × n co-occurrence matrix with entries in [0, 1]. A final clustering is obtained by hierarchical clustering on 1 − consensus as a distance, or by spectral clustering on the consensus matrix treated as similarity. The consensus matrix also visualises well: sorted by the final clusters, it should have sharp high-value blocks on the diagonal and low off-diagonal values. If the blocks are fuzzy, the clustering is not stable.
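The consensus matrix and its final extraction fit in a short script (here with k-means runs differing only in their random seed; sizes are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=200, centers=[[-6, -6], [0, 6], [6, -4]],
                  cluster_std=0.7, random_state=6)
n, B, K = len(X), 20, 3

# consensus matrix: fraction of runs in which each pair lands in the same cluster
consensus = np.zeros((n, n))
for b in range(B):
    labels = KMeans(n_clusters=K, n_init=1, random_state=b).fit_predict(X)
    consensus += labels[:, None] == labels[None, :]
consensus /= B

# final partition: average-linkage hierarchical clustering on 1 - consensus
dist = 1.0 - consensus
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
final = fcluster(Z, t=K, criterion="maxclust")
agreement = adjusted_rand_score(y, final)
```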
To choose K via stability: for each candidate K, run B bootstrap clusterings, compute the mean pairwise ARI across runs at that K, and plot ARI vs K. The most stable K is often a sharper signal than the elbow or the silhouette — especially when the clusters exist but are slightly uneven in size or shape.
The most informative case: silhouette prefers one K, stability prefers another. Usually this means silhouette is rewarding a clustering that fits the data well on average but carves through a real cluster, while stability is rewarding the clustering that cleanly separates the populations but leaves some points in boundary regions. Which is right depends on the downstream task; seeing the disagreement is more valuable than either number alone.
In high dimensions, everything is far from everything else, and "far" starts to mean nothing in particular. Euclidean distance concentrates, nearest-neighbour rankings become unstable, and every clustering algorithm that depends on a distance metric becomes dubious. This section is about what breaks, why, and how to work around it.
For random points in d-dimensional space, as d → ∞ the ratio of the maximum pairwise distance to the minimum pairwise distance tends to 1. In words: every point becomes equidistant from every other point. Distance-based clustering depends on the relative ordering of distances, and in high dimensions that ordering becomes noisy. Beyond about d = 50 for typical problems, k-means and DBSCAN start to behave erratically and their outputs depend heavily on the particular features included. This is the curse of dimensionality for clustering.
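The concentration effect is easy to demonstrate empirically (uniform random points; n and the dimension ladder are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def distance_contrast(d, n=200):
    """Max/min pairwise Euclidean distance for n uniform points in d dimensions."""
    X = rng.uniform(size=(n, d))
    dists = pdist(X)
    return float(dists.max() / dists.min())

ratios = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
# the ratio collapses toward 1 as d grows: pairwise distances concentrate
```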
The most common fix is to run dimensionality reduction (covered in depth in Chapter 05) before clustering. PCA is cheap and captures variance — cluster the first 10–50 principal components rather than the 1000 raw features. UMAP and t-SNE produce 2-D or low-D embeddings specifically designed to preserve local structure; running HDBSCAN on a UMAP embedding is a strong default for exploratory work on high-dimensional data. Autoencoders learn a nonlinear low-dimensional representation that can be clustered with the same methods as the raw data. The caveat in all three cases is that the reduction is non-trivial — PCA emphasises variance, UMAP emphasises local neighbourhoods, autoencoders emphasise reconstruction — and different clusterings will appear depending on what was preserved.
Sometimes different clusters live in different subspaces of the feature space — cluster A is tight in features 1, 5, 7 but diffuse in the others; cluster B is tight in features 3, 8, 10. No single global distance metric captures both. Subspace clustering methods (CLIQUE, SUBCLU, PROCLUS) search for clusters together with the subspaces in which they exist. The combinatorial search is expensive and the outputs are harder to interpret, but for genuine high-dimensional cluster structure (common in gene-expression and text data) they outperform global methods.
In low dimensions, Euclidean is the default and other metrics make small differences. In high dimensions, switching from Euclidean to cosine distance (for direction, not magnitude) or to Mahalanobis distance (which whitens against covariance) can make the difference between a usable clustering and an unusable one. For sparse, high-dimensional data (documents, one-hot-encoded categoricals, user-item matrices), cosine is nearly always correct. For dense, high-dimensional continuous data, Mahalanobis or a learned metric is often needed. Section 02 flagged this; it becomes mandatory once d is large.
DBSCAN and HDBSCAN are sometimes sold as the answer to high-dimensional clustering, but they depend on local density, and in high dimensions every point's local density looks the same. Without reduction, they tend to put everything in one cluster or everything in noise, depending on eps. Always pre-reduce before density-based methods in more than 20 or so dimensions.
You have a dataset and the goal "find structure". This is the recipe — an ordered sequence of steps that will, nine times out of ten, get you from raw data to a defensible clustering. The tenth time is when the data genuinely has no clusters, which is itself a useful finding.
Before any clustering, answer: what is a row? What is a column? What are the units? Are there missing values, and if so, are they informative or random? Are there obvious outliers that should be handled, and how? What do the domain experts expect the structure to look like? Clustering answers that come out of a vacuum are nearly always wrong; spending an hour here saves a day later.
Standardise continuous features (z-score or min-max), encode categoricals (one-hot for low cardinality, target-encoded or embedded for high), decide how to handle missing values (imputation, indicators, or drop rows), and pick a distance metric based on feature types. If features are on wildly different scales and you haven't standardised, every distance-based clusterer is secretly clustering on the biggest-variance feature.
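A sketch of this step with scikit-learn (the toy frame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "income": [32_000.0, 75_000.0, 54_000.0, 120_000.0],
    "age": [23.0, 47.0, 35.0, 52.0],
    "region": ["north", "south", "south", "west"],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["income", "age"]),                 # z-score continuous
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),  # one-hot categorical
])
X = pre.fit_transform(df)   # ready for any distance-based clusterer
```

Without the StandardScaler step, "income" (tens of thousands) would dominate every Euclidean distance and "age" would be invisible.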
If d is more than about 50, run PCA to capture ~90 % of variance and cluster the PCA scores. For exploratory work, produce a UMAP embedding for visualisation regardless of whether you cluster on it. Keep both the raw-feature clustering and the embedded clustering and compare them.
Run three clusterers as a baseline: k-means (the prior-free default), hierarchical with Ward linkage (produces the dendrogram, lets you pick K after the fact), and HDBSCAN (finds whatever density-based structure exists without you specifying K). If any of them agrees strongly with the others, that is a sign of robust structure. If they disagree, inspect the disagreements — they tell you about the shape of your data.
Choose K (or don't): for k-means and hierarchical, sweep K ∈ {2, …, 20}, compute silhouette and gap, run bootstrap stability at each K, and pick a K supported by at least two of the three signals. For density methods, tune min_cluster_size instead of K and let the algorithm choose K.
Compute internal metrics (silhouette, DB, CH); if labels exist, ARI and NMI; report stability across bootstrap runs. Then — and this is the step everyone skips — look at the clusters. Compute per-cluster summary statistics on the original features. Plot examples. Ask a domain expert to name each cluster. If you can't name them, you don't have clusters; you have a partition.
If the clustering is going into a product — customer segmentation, anomaly detection, pre-labelling for a supervised problem — set up an evaluation harness with a downstream metric (conversion rate per segment, precision of anomaly flags, label efficiency), and monitor cluster stability over time as new data arrives. Clusters drift as the underlying population changes; re-fit on a schedule and watch for degradation.
If, across three algorithms, metrics don't agree on a good K; if bootstrap stability is low at every K; if the consensus matrix has no blocks; if domain experts look at the clusters and can't name them — accept that the data is a continuum, not a set of groups. This is a real finding. Report it. Move on to regression, density modelling, or a different framing of the problem.
Clustering rarely appears as the final product of a large ML system. It appears as a subroutine — inside vector quantisation, inside tokenisation, inside retrieval, inside self-supervised learning, inside semi-supervised labelling — and the clusters themselves are the outputs that drive other components. This section is a tour of where clustering hides in systems that don't obviously use it.
Vector quantisation (VQ) replaces a continuous vector with the index of its nearest cluster centre from a fixed codebook. This is k-means in disguise: run k-means on a training set of vectors, keep the centroids as the codebook, and at inference replace each vector with its nearest centroid index. VQ underlies product quantisation for approximate nearest-neighbour search (FAISS), and it underlies VQ-VAE and the discrete codes used by AudioLM, MusicLM, and many image generation models. The core idea is the same as Lloyd's algorithm from 1957; the context is different.
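The codebook idea in miniature, with random vectors standing in for real embeddings (16 codes, so each 8-float vector compresses to a 4-bit index):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 8)).astype(np.float32)

# train: k-means centroids become the codebook
codebook = KMeans(n_clusters=16, n_init=5, random_state=0).fit(vectors).cluster_centers_

# encode: each vector -> index of its nearest centroid
d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
codes = d2.argmin(axis=1).astype(np.uint8)

# decode: index -> centroid; the gap is the quantisation error
reconstructed = codebook[codes]
mse = float(((vectors - reconstructed) ** 2).mean())
```

Product quantisation extends this by splitting each vector into sub-blocks and learning a separate small codebook per block.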
BPE, WordPiece, and SentencePiece — the tokenisers used by every modern language model — are agglomerative clustering algorithms in disguise. They start with characters and greedily merge the most-frequent adjacent pairs into new tokens, producing a hierarchical vocabulary. The training data gets segmented into clusters of characters that occur together often. The token embeddings are then learned on top of this clustering. If you squint, training a language model is a two-stage procedure: cluster the character sequences, then learn dense representations on the cluster indices.
Large-scale retrieval systems (Google's image search, a recommender's candidate generator) pre-cluster billions of items into an inverted index based on embedding similarity, then at query time search only the top-matching clusters. This is k-means on vectors, at internet scale, often with hierarchical cluster trees or product quantisation on top. The clusters are invisible to the user; they make the system a thousand times faster.
SwAV, DeepCluster, and related self-supervised vision methods use clustering as a pretext task: cluster the unlabelled images into K groups, treat the cluster assignments as pseudo-labels, train a network to predict them, use the network's new features to re-cluster, iterate. The cluster labels are throwaway — what is learned is the representation. This is a modern application of the old idea that structure in the input, even without labels, contains signal.
Prototypical networks (Snell, Swersky, Zemel, 2017) represent each few-shot class as the mean of the support-set embeddings — a one-centroid cluster per class — and classify new points by nearest centroid. This is one iteration of k-means with fixed assignments. Despite its simplicity it is a strong baseline for few-shot classification, and it works because the embedding (from a pre-trained backbone) already places the classes in well-separated clusters.
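The mechanics fit in a few lines of NumPy, with toy 2-D "embeddings" standing in for a pre-trained backbone's output:

```python
import numpy as np

rng = np.random.default_rng(0)
# three well-separated classes, five support points each
class_centres = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
support = np.concatenate([c + 0.3 * rng.normal(size=(5, 2)) for c in class_centres])
support_y = np.repeat([0, 1, 2], 5)
query = class_centres + 0.3 * rng.normal(size=(3, 2))  # one query per class

# prototype = mean of each class's support embeddings (one-centroid clusters)
prototypes = np.stack([support[support_y == c].mean(axis=0) for c in range(3)])

# classify each query by its nearest prototype
d2 = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
pred = d2.argmin(axis=1)
```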
When you have a lot of unlabelled data and a small budget for annotation, cluster the unlabelled set first and label a few points per cluster rather than labelling a uniformly random sample. This gives the supervised model a much more representative training set, often at a fraction of the annotation cost. The same principle drives active learning: use uncertainty and cluster membership to choose the next point to label.
The next chapter, Dimensionality Reduction, is the natural companion to this one. Clustering and dimensionality reduction are both unsupervised techniques for finding structure; the difference is that clustering finds groups of points, while dimensionality reduction finds directions or manifolds along which the points vary. The two are routinely combined — reduce first, then cluster, or cluster first, then reduce for visualisation — and the boundary between them blurs for methods like spectral clustering (which is a reduction followed by a clustering) and self-organising maps (which are both at once). We'll cover PCA, factor analysis, ICA, kernel PCA, t-SNE, UMAP, and autoencoders; by the end you'll have the full toolkit for unsupervised representation learning.
Clustering is one of the oldest topics in data analysis — Lloyd's k-means paper is from 1957, Ward's hierarchical method from 1963 — and most of its foundational literature is short, readable, and still in print. The modern research frontier has moved to density-based and graph-based methods, to clustering on learned embeddings, and to consensus and stability analysis. The references below cover the textbooks to anchor the subject, the classical papers that introduced each major algorithm, the modern papers that extend them, and the software documentation where most of the day-to-day work actually happens.
Arthur and Vassilvitskii (2007), "k-means++: the advantages of careful seeding". Proves that the seeding scheme alone puts the expected cost within O(log K) of optimal, plus convincing empirical demonstrations that careful seeding makes Lloyd's iteration converge faster and to better local optima. One of the rare cases where a tiny algorithmic tweak has been universally adopted because it obviously works.

Campello, Moulavi, and Sander (2013), "Density-based clustering based on hierarchical density estimates". Introduces HDBSCAN: a density hierarchy with automatic flat-cluster extraction, and no eps. The method handles clusters of varying density — a failure case for DBSCAN — and has become the practical default in the density-based family. Pair with McInnes, Healy, and Astels' 2017 JOSS paper for the algorithmic engineering behind the widely-used Python implementation.

Tibshirani, Walther, and Hastie (2001), "Estimating the number of clusters in a data set via the gap statistic". The null-model approach to choosing K. Compares the within-cluster sum of squares on the data to the expected value under a null reference distribution, picking K at the biggest gap. More principled than the elbow method, and a de facto default when k-means is the clustering algorithm. Readable, eight pages, and still the canonical reference for the method.

Ben-Hur, Elisseeff, and Guyon (2002), "A stability based method for discovering structure in clustered data". Argues that the right K is usually the one that is most stable under resampling, and gives a practical pairwise-agreement procedure that has become standard practice. A short and instructive counterweight to the internal-metric-maximisation approach. If silhouette and gap disagree, this is the method that usually decides between them.

The UMAP documentation's guide to clustering on embeddings. Explains how to set n_neighbors and min_dist when the downstream task is clustering, and why those choices differ from the visualisation defaults. Pair with the HDBSCAN docs above for a complete exploratory-clustering workflow on high-dimensional data.

This page is Chapter 04 of Part IV: Classical Machine Learning. The next chapter — Dimensionality Reduction — is the close sibling of this one. Clustering asks "what are the groups", dimensionality reduction asks "what are the directions"; the two are routinely combined (reduce then cluster, or cluster then reduce for visualisation), and several methods live in both worlds at once — spectral clustering is a reduction followed by k-means, self-organising maps are a clustering on a low-dimensional grid.
Later chapters of Part IV extend to probabilistic graphical models (which add the structure of conditional independence), kernel methods and SVMs (where the Mahalanobis-like trick of changing the similarity metric is made rigorous), feature engineering and selection, and finally model evaluation — the discipline that turns any of these trained models, supervised or unsupervised, into a trustworthy deployment.