Part IV · Classical Machine Learning · Chapter 03

Ensembles, where many mediocre models are better than one clever one.

An ensemble is a model made of models. Instead of training a single classifier or regressor and hoping it is well-specified, the ensemble trains many base learners and combines their predictions — by averaging, by voting, by weighted sums, or by learning a meta-model on top. This sounds like a trick. In practice it is the single most reliable source of accuracy gains in classical machine learning, and it was the dominant paradigm on tabular data for two decades. The regression and classification chapters introduced the workhorse models; this chapter shows what happens when they are combined. The two central families — bagging, the variance-reducing average of bootstrapped learners, and boosting, the bias-reducing sequence of error-correcting learners — underlie Random Forests, AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost, stacking, and the practice of blending predictions from an entire model zoo. For structured/tabular problems, gradient-boosted trees routinely beat deep networks, and even where deep learning wins, ensembles of deep models often win by a little more. Chapter 02's decision trees are the building block; this chapter is where those building blocks become something more.

How to read this chapter

The first section motivates the ensemble idea and its intellectual lineage — the Condorcet jury theorem, the wisdom of crowds, the PAC-learnability of weak learners — and introduces the two vocabulary words the rest of the chapter will lean on: diversity and decorrelation. Section two develops the bias–variance decomposition specifically for ensembles, showing analytically why averaging decorrelated learners cuts variance and why boosting chips away at bias. Sections three through five cover the bagging family: the bootstrap-aggregation meta-algorithm, the Random Forest (Breiman, 2001) as the canonical instance, and the variants that followed — Extra Trees, Isolation Forests, Rotation Forests. Sections six through nine cover the boosting family, which grew out of a theoretical question about whether weak learners could be combined into strong ones: AdaBoost (Freund & Schapire, 1995) and its margin theory, gradient boosting (Friedman, 1999) as functional gradient descent on a loss, and the three modern implementations that dominate structured-data leaderboards — XGBoost, LightGBM, and CatBoost. Section ten is the practitioner's guide to boosting hyperparameters: learning rate, number of trees, depth, subsampling, regularisation, early stopping. Section eleven is the part people forget: once an ensemble is trained, how do you explain it — feature importance, permutation importance, SHAP values.

Sections twelve and thirteen cover the second main way to combine models: stacking and blending (training a meta-learner on out-of-fold predictions), and the simpler voting combinators (hard majority vote, soft probability average) that every practitioner reaches for first. Section fourteen revisits the diversity thesis in light of everything the chapter has introduced, and looks at why some ensembles help and others don't — correlated errors are the silent killer. Section fifteen addresses calibration, which ensembles handle in interestingly different ways: bagging leaves calibration roughly intact, boosting wrecks it and requires Platt scaling or isotonic regression to recover. Section sixteen is the practical section — when to reach for what on a new tabular problem, and how the ensemble ladder (logistic regression → random forest → gradient boosting → stacked ensemble) is traversed in practice. Section seventeen closes with where ensembling compounds in modern ML: deep ensembles and their calibration benefits, mixture-of-experts architectures, model soups, and the observation that much of what is now called "scaling" in deep learning is ensembling in disguise.

Contents

  1. Why ensembles work · Condorcet, wisdom of crowds, weak-to-strong learnability
  2. Bias, variance, and the two ensemble strategies · Averaging cuts variance; boosting cuts bias
  3. Bagging · Bootstrap aggregation, pasting, out-of-bag estimates
  4. Random Forests · Breiman, 2001: feature bagging on top of trees
  5. Forest variants · Extra Trees, Isolation Forests, Rotation Forests
  6. AdaBoost · The original boosting algorithm and its margins
  7. Gradient boosting · Friedman's functional gradient descent
  8. XGBoost · Regularised objective, second-order, sparsity-aware
  9. LightGBM and CatBoost · Histograms, leaf-wise growth, ordered boosting
  10. Tuning boosted trees · Learning rate, trees, depth, early stopping
  11. Feature importance and explanation · Impurity, permutation, SHAP
  12. Stacking and blending · Meta-learners on out-of-fold predictions
  13. Voting · Hard votes, soft averages, and their limits
  14. Diversity and correlated errors · Why ensembles sometimes don't help
  15. Ensembles and calibration · Bagging preserves, boosting distorts
  16. Ensembles in practice · The ladder and when to climb it
  17. Where it compounds in ML · Deep ensembles, mixtures of experts, model soups
01

Why ensembles work

An ensemble is many models voting, averaging, or stacking into a single prediction. It is a strange idea — surely the best model is better than the best model plus some worse ones — but the mathematics says otherwise whenever the models make different mistakes. The chapter opens with the intellectual lineage that makes the ensemble idea unavoidable.

The Condorcet jury theorem

The earliest mathematical statement of the ensemble idea is two and a half centuries old. The Condorcet jury theorem (Marquis de Condorcet, 1785) asks: suppose a jury of n independent voters each has probability p > 0.5 of reaching the correct verdict. What is the probability the majority reaches it? The answer, by the law of large numbers applied to a binomial, is that the majority's accuracy tends to 1 as n grows. If each voter is only slightly better than random, a big enough jury is virtually infallible. If p < 0.5, the opposite happens: the majority converges to always wrong. And if the voters are correlated — if they all share the same bias, read the same newspaper, talk before the vote — the theorem fails silently. Classifiers replace voters, bootstrap samples replace independent deliberation, and the theorem's two conditions — better-than-random, and sufficiently decorrelated — become the two obsessions of ensemble design.
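The theorem is a one-line binomial computation, which makes it easy to see the sharpness of the p > 0.5 condition. A minimal sketch (jury size n is assumed odd so the majority is well defined):

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """P(majority of n independent voters is correct), n odd, each correct w.p. p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# Slightly-better-than-random voters: the majority sharpens as the jury grows.
for n in (1, 11, 101):
    print(n, round(majority_accuracy(0.6, n), 4))
```

Running the same computation with p = 0.4 shows the mirror image: the majority converges to being almost always wrong.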

Weak learners and strong learners

In 1988 Michael Kearns posed what became the defining theoretical question of ensemble learning: is a concept class that is only weakly learnable (barely better than random) also strongly learnable (arbitrarily close to perfect)? The practical version: can you take a classifier that is 51% accurate and, by combining many copies of it cleverly, get to 99%? In 1990 Schapire answered yes, constructively, with the first boosting algorithm — an explicit procedure that takes any weak learner and turns it into a strong one. The construction was a recursive tree of weighted classifiers and was impractical. But it proved the conceptual point, and five years later Freund and Schapire's AdaBoost made the construction practical. The theoretical result — that weak and strong learnability are the same — is why boosting exists as an idea at all. It is also why the word ensemble has quietly expanded: anything that aggregates weak signals into strong ones is, in this sense, an ensemble.

Wisdom of crowds

The folk version of the same mathematics — Francis Galton's 1906 observation that the mean guess of 787 county-fair visitors at the weight of an ox was within a pound of the true weight — was named the wisdom of crowds by James Surowiecki in 2004. His four preconditions (diversity of opinion, independence, decentralisation, aggregation) read like the design principles of an ensemble method: different base learners, decorrelated training, parallel fits, and an aggregation rule. Every ensemble technique in this chapter can be classified by how it tries to satisfy the first three, because the fourth — the aggregation — is usually the easy part.

Ensembles work whenever their members are (a) better than chance and (b) decorrelated in their errors. Boosting manufactures diversity by reweighting data; bagging manufactures it by resampling; stacking learns the aggregation rule from data.

Why this chapter sits here

Chapter 01 built regression. Chapter 02 built classification. Both chapters ended with decision trees as a family of flexible, interpretable, but high-variance models — the kind of base learner the ensemble literature was built around. This chapter is where individual models become systems of models. Everything that follows — the random forest, gradient boosting, the ensemble calibration fixes, the stacking recipes — is a way of answering a single operational question: given a moderately good model, how do you make it a lot better by not making it alone?

02

Bias, variance, and the two ensemble strategies

The bias–variance decomposition introduced in Chapter 01 explains the entire strategic landscape of ensembling. Bagging reduces variance by averaging decorrelated learners; boosting reduces bias by fitting residuals sequentially. Knowing which one you need is the difference between an ensemble that helps and one that makes things worse.

The variance of an average

Suppose you fit B base learners f_1, …, f_B on bootstrapped samples of the data and average their predictions: f̄(x) = (1/B) Σ f_b(x). If the individual learners are unbiased but have variance σ², and their errors have pairwise correlation ρ, then the variance of the averaged prediction is Var(f̄) = ρσ² + (1 − ρ)σ²/B. As B → ∞, the second term vanishes and the variance is floored at ρσ². If the learners are independent (ρ = 0) the floor is zero and averaging can drive variance to zero. If they are perfectly correlated (ρ = 1) averaging does nothing. This single formula explains most of the design decisions in the bagging family: every trick — bootstrapping rows, subsampling columns, randomising split thresholds — is an attempt to drive ρ down.
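The formula Var(f̄) = ρσ² + (1 − ρ)σ²/B can be checked by simulation. The sketch below builds B estimator errors with pairwise correlation ρ by mixing one shared noise component with independent ones (a standard construction chosen for illustration, not something from the text):

```python
import random
import statistics

def avg_variance(rho: float, sigma: float, B: int,
                 trials: int = 20000, seed: int = 0) -> float:
    """Empirical variance of the mean of B estimator errors with correlation rho."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        shared = rng.gauss(0, sigma)                       # common error component
        errs = [rho**0.5 * shared + (1 - rho)**0.5 * rng.gauss(0, sigma)
                for _ in range(B)]                         # each error has variance sigma^2
        means.append(sum(errs) / B)
    return statistics.pvariance(means)

rho, sigma, B = 0.3, 1.0, 25
theory = rho * sigma**2 + (1 - rho) * sigma**2 / B         # = 0.328
empirical = avg_variance(rho, sigma, B)
print(round(empirical, 3), theory)
```

Raising B past a few dozen barely moves the result: the ρσ² floor dominates, which is exactly why the design effort goes into driving ρ down rather than B up.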

Bias is a different animal

Averaging does nothing to bias. If every f_b systematically under-predicts by 3 units, the average also under-predicts by 3 units. Reducing bias requires a fundamentally different move: fit a model, look at its errors, and fit another model that targets those errors. That is boosting. The canonical recipe: initialise F_0(x) = 0; at step m, fit f_m to the residuals (or pseudo-residuals) of F_{m−1}; set F_m = F_{m−1} + η · f_m. Each base learner has to be simple — an unpruned tree would wipe out all residuals at once — and the learning rate η (also called shrinkage) controls how aggressively each step corrects. Boosting does not aim for low variance in each base learner; it aims for low bias in the additive sum. Successive models chip away at what the ensemble still gets wrong.
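The canonical recipe fits in a few lines. The base learner below is a one-split regression stump and the target is an arbitrary smooth 1-D function; both are illustrative choices, not from the text:

```python
import math

# Synthetic 1-D regression target.
xs = [i / 10 for i in range(-30, 31)]
ys = [math.sin(x) + 0.5 * x for x in xs]

def fit_stump(xs, rs):
    """Best single-split regression stump for residuals rs (minimises SSE)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x, t=t, ml=ml, mr=mr: ml if x <= t else mr

eta, M = 0.5, 30                                   # learning rate, boosting rounds
F = [0.0] * len(xs)                                # F_0 = 0, as in the text
mse0 = sum(y * y for y in ys) / len(ys)            # loss of the initial model
for _ in range(M):
    residuals = [y - f for y, f in zip(ys, F)]     # what the ensemble still gets wrong
    stump = fit_stump(xs, residuals)
    F = [f + eta * stump(x) for f, x in zip(F, xs)]
mse = sum((y - f) ** 2 for y, f in zip(ys, F)) / len(ys)
print(round(mse0, 3), round(mse, 4))
```

Each stump alone badly underfits; the shrunken sum of thirty of them does not, which is the bias-reduction story in miniature.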

The strategic choice. If your base model overfits (high variance, low bias), bag it: a deep unpruned tree becomes a random forest. If your base model underfits (low variance, high bias), boost it: a shallow tree becomes a gradient-boosted machine. Using the wrong strategy will hurt as often as help — boosting an already-low-bias model mostly adds variance and burns compute.

Why trees are the canonical base learner

Almost every classical ensemble in this chapter uses decision trees underneath, and this is not historical accident. Trees are universal (they can represent any Boolean function), piecewise constant (so boundaries are arbitrarily flexible), fast to train, insensitive to feature scale, and — crucially — high-variance. A deep tree fits noise; a shallow tree underfits. Either failure mode is exactly what the corresponding ensemble method wants to exploit. Deep trees are the ideal bagging base learner precisely because variance is what bagging reduces. Shallow trees (depth 3–8) are the ideal boosting base learner precisely because bias is what boosting reduces. Linear models can also be boosted (see the classical L2-boosting literature), but the gains are limited: a sum of linear base learners is still linear, so the bias is fixed by the functional form, whereas trees' is not.

Ensembles and the irreducible error

One thing ensembles cannot do is beat the Bayes error — the irreducible component of the loss due to noise in P(y | x). A credit-default model cannot be better than the inherent stochasticity of whether a given borrower repays. If your single best model is close to the Bayes error, an ensemble will deliver only tiny gains. If it is far from the Bayes error, the ensemble has a lot of room. Knowing roughly where you are on that gap — via learning curves, holdout performance of the best simple model, or the gap between your model and a near-neighbour baseline — is a better guide to whether ensembling is worth the complexity than any single rule of thumb.

03

Bagging

Bootstrap aggregation — bagging — is the first of the two great ensemble families. Introduced by Leo Breiman in 1996, it is the simplest variance-reduction machine in machine learning: resample the training set many times, fit the base learner to each resample, average (or majority-vote) the predictions. It is embarrassingly parallel, nearly free conceptually, and the base ingredient of every random forest.

The bootstrap, briefly

The bootstrap (Efron, 1979) is a resampling procedure: from a training set of n examples, draw n examples with replacement to produce a new training set the same size. About 1 − (1 − 1/n)^n ≈ 63.2% of the original points appear in each bootstrap; the remaining ≈ 36.8% are out-of-bag (OOB). Bagging trains B learners, one on each bootstrap sample, and averages their predictions: f̄(x) = (1/B) Σ f_b(x) for regression, majority vote for classification, or soft-vote the class probabilities if the base learner produces them. The reason this works is the variance-of-averages formula from the previous section: as long as the bootstraps create diverse learners, the (1 − ρ)σ²/B term shrinks with B.
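The 63.2% figure is easy to confirm empirically:

```python
import random

rng = random.Random(0)
n = 10000
sample = [rng.randrange(n) for _ in range(n)]   # bootstrap: draw n indices with replacement
in_bag = len(set(sample)) / n                   # fraction of distinct points drawn
oob = 1 - in_bag                                # the out-of-bag remainder
print(round(in_bag, 3), round(oob, 3))          # close to 0.632 / 0.368
```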

Pasting, random subspaces, and random patches

Several close relatives of bagging trade off differently in the diversity/efficiency equation. Pasting (Breiman, 1999) samples without replacement, giving smaller training sets per learner but preserving point identity — useful when n is enormous. Random subspaces (Ho, 1998) bag the features instead of the rows: each base learner sees all rows but a random subset of columns. This is the technique that, stacked on top of row-bagging, becomes the random forest. Random patches bag both at once. The generalisation is that any stochastic restriction of the training data — rows, columns, both — that degrades individual learners slightly while making them different from each other is a valid bagging flavour. The engineering question is always which degradation buys the most decorrelation at the least accuracy cost.

Out-of-bag error

The 36.8% of points each bootstrap leaves out turn out to be a free validation set. The out-of-bag error is computed by predicting each training point using only the base learners that did not see it during training, then scoring the predictions against the true labels. As B grows, this estimate converges to leave-one-out cross-validation, at essentially no extra compute cost. Random forests use OOB error as their default internal evaluation metric, and OOB feature-importance scores (permute a feature, re-score, measure the drop) are the most commonly reported diagnostics from a trained forest. OOB is one of the quiet gifts of bagging: you get cross-validation for free just from the way the training set was constructed.
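A hand-rolled sketch of OOB scoring: bag simple 1-D threshold classifiers on two overlapping Gaussian classes, then score each point using only the learners whose bootstrap missed it. The data, base learner, and threshold grid are all illustrative choices, not from the text:

```python
import random

rng = random.Random(7)
n = 200
xs = [rng.gauss(0, 1) for _ in range(n // 2)] + [rng.gauss(2, 1) for _ in range(n // 2)]
ys = [0] * (n // 2) + [1] * (n // 2)

def fit_stump(idx):
    """Best threshold t and orientation s on rows idx: predict s when x > t, else 1 - s."""
    best = None
    for t in [i * 0.25 for i in range(-8, 17)]:        # candidate thresholds in [-2, 4]
        for s in (0, 1):
            err = sum((s if xs[i] > t else 1 - s) != ys[i] for i in idx)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best[1], best[2]

B = 50
stumps, bags = [], []
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]         # bootstrap sample of row indices
    stumps.append(fit_stump(idx))
    bags.append(set(idx))

wrong = scored = 0
for i in range(n):
    votes = [(s if xs[i] > t else 1 - s)               # only learners that never saw row i
             for (t, s), bag in zip(stumps, bags) if i not in bag]
    if votes:                                          # rare: a point can land in every bag
        scored += 1
        wrong += round(sum(votes) / len(votes)) != ys[i]
oob_error = wrong / scored
print(round(oob_error, 3))
```

The resulting number sits near the overlap-induced error of the two classes, and no separate validation split was ever carved out.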

What bagging doesn't fix. Bagging reduces variance but leaves bias untouched. If the base learner is systematically wrong — a depth-2 tree on a highly nonlinear target, say — bagging just averages many versions of the same wrongness. The diagnostic: train one base learner, train a bagged version, and compare the training-set residuals. If the residuals are similar, your problem is bias and bagging will not help.

When to bag

Bag when your base learner is high-variance and you can afford to train many copies of it. Decision trees are the obvious target. Neural networks can also be bagged (deep ensembles, covered in Section 17), and at the other extreme, bagging k-nearest neighbours or linear regression typically does little because those models are already low-variance. The canonical failure mode of bagging is applying it reflexively: averaging 500 logistic regressions gives you almost exactly one logistic regression, while consuming five hundred times the compute and complicating deployment.

04

Random Forests

Random Forests (Breiman, 2001) combine bagging with random feature selection at each split. They are the default ensemble method of the last quarter-century — nearly parameter-free, robust to noisy features, and very hard to make catastrophically worse. For decades a random forest was the first thing to try on a new tabular problem, and it still is.

The algorithm

A random forest trains B decision trees, each on a bootstrap sample of the rows. When growing each tree, at every split-candidate node the algorithm considers only a random subset of m features, typically m = √p for classification or m = p/3 for regression (with p total features). The trees are grown deep — usually unpruned, limited only by a minimum leaf size — because bagging wants high-variance base learners. Prediction averages the regression outputs or majority-votes the classification outputs; for probabilities, the forest averages per-class votes or per-tree estimated probabilities. Two sources of randomness make the trees decorrelated: the bootstrap (which changes which rows each tree sees) and the random feature restriction at each split (which changes which columns a tree can use). The second is the key Breiman insight; plain bagged trees were already known, and they helped, but not nearly as much.

Why the feature randomness helps

If a few features are very predictive, every unrestricted tree will use them at the top, and the bootstrap alone will produce many trees that split on the same features in similar orders. Their errors will be highly correlated — the ρ in Var(f̄) = ρσ² + (1 − ρ)σ²/B stays high — and bagging yields only modest gains. Forcing each split to choose among a random subset of features breaks this correlation: some trees cannot split on the strongest feature at the top and are forced to make their first cut on a weaker one, producing a genuinely different model. The accuracy cost per tree is small; the decorrelation benefit is large. This is the single most important idea in Breiman's 2001 paper and the reason random forests outperform plain bagged trees so reliably.

Why random forests are so robust

Random forests are famously insensitive to hyperparameters: the defaults in scikit-learn and R's randomForest work almost everywhere. The tree count B only needs to be "enough" — more trees never hurt, they just cost compute, and OOB error flattens once B is large. Tree depth matters a little but is usually fine at "fully grown". The feature count m is the only parameter with real effect, and the defaults of √p / p/3 have been re-validated on hundreds of datasets. Forests tolerate irrelevant features (they will simply rarely be chosen), mild label noise (bootstrapping dilutes it), and missing values (with imputation or surrogate splits). They do not require feature scaling, they handle mixed numerical and categorical data, and they produce reasonably well-calibrated probability estimates out of the box. This is why, despite newer methods, random forests remain the rock-solid default baseline on tabular data.

A random forest is a bag of decorrelated deep trees. The bootstrap decorrelates by rows; the random feature restriction decorrelates by columns. Trees are grown deep so each one has low bias and high variance; the averaging kills the variance; the random feature restriction keeps the averaging effective.

Limitations

Random forests have two real limitations. First, they struggle to extrapolate: predictions are bounded by the leaves' training-set averages, so a regression forest asked to predict on inputs outside the training range outputs a flat line. Second, on problems where every gain matters — Kaggle competitions, high-stakes pricing, risk models with large money at stake — they are usually beaten by a well-tuned gradient-boosted model by a small but real margin. Random forests optimise for robustness (fit-and-forget) while gradient boosting optimises for peak accuracy (when carefully tuned). On most production problems, the robustness wins; on leaderboard problems, the peak accuracy does.

05

Forest variants

The random forest idea is a template, and once it was clear that bagging + feature randomness worked, people started varying the recipe. Extra Trees push the randomness further; Isolation Forests repurpose the tree ensemble for anomaly detection; Rotation Forests rotate the feature space before training. Each shows a different thing the ensemble can be.

Extra Trees

Extremely Randomised Trees (Geurts, Ernst, Wehenkel, 2006), usually called Extra Trees, change the random forest in two ways: they abandon bootstrapping (each tree sees the full training set) and they randomise split thresholds as well as split features. Instead of searching for the best threshold for each candidate feature, at each node Extra Trees draw one random threshold per candidate feature and keep the best of those randomly cut splits. This makes each tree a worse fit to the training data, but faster to train and even more decorrelated from its siblings. Empirically Extra Trees often match random forests in accuracy while being faster, and they are the better choice when the dataset has many noisy features: since the threshold is random, the algorithm cannot overfit to the exact noise of any particular feature. The tradeoff: you give up the interpretability of "this tree split on age at 42.5" because the threshold was never chosen to be good.

Isolation Forest

Isolation Forest (Liu, Ting, Zhou, 2008) is the clever discovery that an ensemble of random, unpruned, unsupervised trees can detect anomalies. Build each tree by repeatedly splitting on a random feature at a random threshold until every point is in its own leaf, or until a depth limit. Anomalies — points in sparse regions of feature space — get isolated quickly (shallow leaves); normal points, crowded together, take many splits to separate (deep leaves). The average isolation depth across trees is the anomaly score. This is one of the most useful one-class learning algorithms in practice: no distributional assumptions, no distance metric to choose, handles high-dimensional data well, and runs in O(n log n) per tree. It is the default anomaly detector in scikit-learn and the first thing to try for fraud, intrusion detection, or any outlier-hunting problem.
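A toy one-dimensional isolation forest makes the mechanism concrete: random splits isolate points in sparse regions quickly, so a shorter average path length means "more anomalous". The full algorithm also subsamples the data and normalises depths by an expected-path-length constant c(n); this sketch compares raw average depths only:

```python
import random

rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(100)] + [10.0]   # one obvious outlier

def build(points, depth=0, limit=20):
    lo, hi = min(points), max(points)
    if len(points) <= 1 or depth >= limit or lo == hi:
        return ("leaf", depth)
    split = rng.uniform(lo, hi)                          # random threshold, random by design
    left = [p for p in points if p < split]
    right = [p for p in points if p >= split]
    if not left or not right:                            # degenerate draw: stop here
        return ("leaf", depth)
    return ("node", split, build(left, depth + 1, limit),
            build(right, depth + 1, limit))

def path_length(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, split, left, right = tree
    return path_length(left if x < split else right, x)

trees = [build(data) for _ in range(100)]

def avg_depth(x):
    return sum(path_length(t, x) for t in trees) / len(trees)

outlier_depth = avg_depth(10.0)                          # isolated in a split or two
inlier_depth = avg_depth(0.0)                            # buried deep in the crowd
print(round(outlier_depth, 2), round(inlier_depth, 2))
```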

Rotation Forests

Rotation Forests (Rodríguez, Kuncheva, Alonso, 2006) push diversity by rotating the feature space before each tree is trained. The algorithm partitions the features into random subsets, runs PCA within each subset, and uses the resulting rotated coordinates as the input to a decision tree. Because trees split axis-aligned, rotating the axes gives the tree a genuinely different inductive bias; the ensemble is diverse in a way plain bagging cannot achieve. Rotation Forests often beat random forests on smaller datasets at a higher compute cost, and they are a reminder that the two knobs of ensemble design (bag the data; diversify the learner) have many more settings than the classical recipes explore.

The family resemblance. All forest variants obey the same structural law: many weak trees, aggregated, with some randomisation that decorrelates them. The differences are in what is randomised — rows, columns, thresholds, axes, class labels — and what is aggregated — mean predictions, majority votes, depth counts, rank statistics. Once you see the pattern, you can invent your own forest for any problem whose base learner is a tree.

Forests beyond classification and regression

The random-forest template has been ported well past supervised learning. Quantile regression forests (Meinshausen, 2006) let a trained forest return any quantile of the conditional distribution, not just the mean — useful for uncertainty estimation. Survival forests handle censored time-to-event data. Causal forests (Wager and Athey, 2018) estimate conditional average treatment effects. In each case the tree structure is the same; the aggregation rule is what changes. These extensions are why forests persist: they are a flexible template, not a single algorithm.

06

AdaBoost

Boosting is the second great ensemble family, and AdaBoost (Freund and Schapire, 1995) was the first algorithm that made Kearns's weak-to-strong conjecture practical. It trains a sequence of weak classifiers on reweighted data, each one focused on the examples the previous ones got wrong, then combines them with weights of their own. It won the 2003 Gödel Prize, founded an entire literature on margin theory, and was the first boosting algorithm most practitioners ever met.

The algorithm

Initialise uniform weights w_i = 1/n over the n training examples. For m = 1, …, M: fit a weak classifier h_m to the weighted data (typically a decision stump, a depth-1 tree); compute its weighted error ε_m = Σ_i w_i · I(h_m(x_i) ≠ y_i); set the classifier's weight α_m = ½ ln((1 − ε_m) / ε_m); update the example weights w_i ← w_i · exp(α_m · I(h_m(x_i) ≠ y_i)) and renormalise. After M rounds, the final classifier is the weighted sign of the base classifiers: H(x) = sign(Σ_m α_m h_m(x)). Three things happen simultaneously: the algorithm is forced to focus on the hard examples (their weights grow), good weak learners earn bigger voting rights (α grows as error shrinks), and the ensemble error on the training set falls geometrically as long as each weak learner beats random.
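The algorithm above fits in a page of Python. The toy target is a 1-D "interval" labelling that no single stump can fit (best stump error: 1/3); three rounds of AdaBoost reach zero training error on it. The implementation uses the equivalent weight update w_i ← w_i · exp(−α_m y_i h_m(x_i)):

```python
import math

xs = list(range(9))
ys = [-1, -1, -1, 1, 1, 1, -1, -1, -1]    # positive interval in the middle
n = len(xs)

def best_stump(w):
    """Lowest weighted-error stump h(x) = s if x > t else -s, with s in {+1, -1}."""
    best = None
    for t in [x - 0.5 for x in range(10)]:
        for s in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (s if x > t else -s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

w = [1.0 / n] * n                          # uniform initial weights
ensemble = []                              # (alpha, t, s) triples
for _ in range(3):
    err, t, s = best_stump(w)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
    ensemble.append((alpha, t, s))
    w = [wi * math.exp(-alpha * y * (s if x > t else -s))   # up-weight the mistakes
         for x, y, wi in zip(xs, ys, w)]
    total = sum(w)
    w = [wi / total for wi in w]           # renormalise

def H(x):
    return 1 if sum(a * (s if x > t else -s) for a, t, s in ensemble) > 0 else -1

train_error = sum(H(x) != y for x, y in zip(xs, ys)) / n
print(train_error)
```

The weight vector after each round is worth printing: half the total mass always sits on the examples the last stump got wrong, which is what forces the next stump to attack them.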

Why the weights have that form

AdaBoost is not just a clever heuristic — it is exactly coordinate descent on an exponential loss. The function F(x) = Σ_m α_m h_m(x) can be viewed as minimising Σ_i exp(−y_i F(x_i)) greedily, one α_m h_m at a time. The weights α_m = ½ ln((1 − ε_m) / ε_m) fall out as the closed-form minimiser at each step. This loss-function perspective (Friedman, Hastie, Tibshirani, 2000) is the bridge between AdaBoost and gradient boosting: once you know boosting is loss minimisation by coordinate descent, you can swap in any loss you want — squared error, log loss, Huber, quantile — and get a different boosting algorithm for free. AdaBoost is just gradient boosting on exponential loss.
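The closed form can be verified numerically: for a weak learner with weighted error ε, the per-round exponential loss as a function of the step size a is (1 − ε)e^(−a) + εe^(a), and a grid search recovers ½ ln((1 − ε)/ε):

```python
import math

eps = 0.2                                             # weighted error of the weak learner
closed_form = 0.5 * math.log((1 - eps) / eps)         # AdaBoost's alpha

# Per-round exponential loss as a function of the step size a.
loss = lambda a: (1 - eps) * math.exp(-a) + eps * math.exp(a)
grid = [i / 10000 for i in range(1, 30000)]           # search a in (0, 3)
numeric = min(grid, key=loss)
print(round(closed_form, 4), numeric)
```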

Margin theory

AdaBoost keeps improving generalisation even after its training error hits zero, which confused everyone in the mid-1990s because it seemed to violate the classical bias–variance story. The margin theory of Schapire et al. (1998) resolved it: even after training accuracy is perfect, subsequent rounds push the margins (the signed distances from the decision boundary) outward, which, by the margin-based generalisation bounds proved in the same paper, improves generalisation. This is philosophically similar to the margin maximisation of SVMs (Chapter 07): in both cases, confident-but-correct matters, not just correct. The margin view is why AdaBoost can run for thousands of rounds without overfitting on many problems — though, as practitioners found later, not on data with label noise, where the exponential loss up-weights mislabelled examples pathologically.

AdaBoost's real weakness. The exponential loss grows unboundedly in the margin, so a mislabelled example gets exponentially up-weighted at each round, eventually dominating the model. On noisy real-world data, AdaBoost is brittle in a way random forests are not. This is the problem gradient boosting and its descendants set out to solve, by replacing the exponential loss with one (log loss, squared error) that is bounded or less aggressive in the tails.

Legacy

AdaBoost was the first boosting algorithm to dominate benchmarks, and the Viola–Jones face detector (2001) — which made real-time face detection possible on consumer cameras a decade before deep learning — was an AdaBoost classifier over Haar-like features. It has since been almost entirely supplanted by gradient boosting for practical work, but it is the algorithm every boosting story starts with, and the algorithm from which the entire margin-theory literature descends. Every later method in this chapter either generalises it (gradient boosting, XGBoost) or avoids its weaknesses (gradient boosting's better-behaved losses, CatBoost's ordered boosting).

07

Gradient boosting

Jerome Friedman's 1999 generalisation of AdaBoost — fit each new base learner to the negative gradient of the loss with respect to the current ensemble's predictions — turned boosting from a single algorithm into a framework. Any differentiable loss, any base learner, any regularisation: all become plug-in choices. Gradient boosting is the engine inside XGBoost, LightGBM, and CatBoost.

Functional gradient descent

The insight is to treat the function F(x) that maps inputs to predictions as the thing being optimised, rather than a fixed parametric form. Gradient descent in function space: at iteration m, compute the pseudo-residuals r_i = −∂L(y_i, F(x_i))/∂F(x_i) — the direction in which the loss decreases most rapidly at each training point — and then fit a base learner h_m to these pseudo-residuals with squared error. Add a step along it: F_m = F_{m−1} + η · γ_m · h_m, where γ_m is a line-search step size and η is the learning rate (shrinkage). For squared-error loss the pseudo-residuals are literally residuals y_i − F(x_i), which makes the classic "fit a tree to the residuals" recipe a special case. For log loss (binary classification), the pseudo-residuals are y_i − p_i where p_i = σ(F(x_i)). For quantile loss at level τ, they are τ for points above the current prediction and τ − 1 for points below. The algorithm is the same; only the gradient changes.
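The log-loss case can be sketched end to end: pseudo-residuals y_i − p_i, a regression stump fitted to them each round, a shrunken step added to F. The stump and the toy data are illustrative, and this omits the TreeBoost leaf re-optimisation:

```python
import math

xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
sigmoid = lambda f: 1 / (1 + math.exp(-f))

def fit_stump(rs):
    """Best single-split regression stump for pseudo-residuals rs."""
    best = None
    for t in [x + 0.5 for x in xs[:-1]]:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x, t=t, ml=ml, mr=mr: ml if x <= t else mr

def logloss(F):
    return -sum(y * math.log(sigmoid(f)) + (1 - y) * math.log(1 - sigmoid(f))
                for y, f in zip(ys, F)) / len(ys)

eta, M = 0.5, 20
F = [0.0] * len(xs)
loss0 = logloss(F)                                         # ln 2 at F = 0
for _ in range(M):
    residuals = [y - sigmoid(f) for y, f in zip(ys, F)]    # pseudo-residuals for log loss
    stump = fit_stump(residuals)
    F = [f + eta * stump(x) for f, x in zip(F, xs)]
print(round(loss0, 4), round(logloss(F), 4))
```

Swapping `residuals` for y_i − F(x_i) and `logloss` for mean squared error turns the same loop into least-squares boosting, which is the "any loss, same algorithm" point in code.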

Why trees are the canonical boosting learner

Any learner that can fit pseudo-residuals works, but shallow decision trees (usually depth 4–8) have three properties that make them ideal: they are fast, they can exploit interactions up to their depth, and they partition the input space into regions where the ensemble's prediction can be adjusted independently. That last point matters: after fitting a tree to pseudo-residuals, gradient boosting can further improve by re-optimising the leaf values directly against the original loss, not just against the pseudo-residual squared-error surrogate. This is the "TreeBoost" refinement in Friedman's paper, and it is the single biggest accuracy improvement over generic gradient boosting. Every modern boosting library does this.

Shrinkage and stochastic gradient boosting

Two regularisation ideas made gradient boosting practical. Shrinkage (Friedman, 2001) multiplies each step by a learning rate η ≪ 1, typically 0.01–0.1. Each tree makes a small contribution, more trees are needed for a given level of fit, and the resulting model generalises much better — the same idea as small learning rates in neural-network SGD. Stochastic gradient boosting (Friedman, 2002) fits each tree on a random subsample (typically 50–80%) of the training data, importing a piece of bagging into the boosting loop. This both regularises and speeds up training. The pair — low learning rate with subsampling — is the "slow learner" recipe that dominates boosting practice.

Gradient boosting is coordinate descent on a loss, performed in function space. At each step, fit a weak learner to the negative gradient of the loss, then take a small step along it. Any differentiable loss → any boosting algorithm.

Why gradient boosting wins on tabular data

Two reasons. First, tabular data is usually a mixture of numerical and categorical features with interactions of moderate order; depth-4 trees trained sequentially with residuals capture exactly those. Second, the boosting loop is an adaptive allocator of capacity: rounds where a sub-region of the input space is already well-fit produce small improvements there, and the algorithm spends more effort where error remains. A random forest, by contrast, treats every region uniformly. On the Kaggle platform between 2014 and 2022, gradient-boosted models won the overwhelming majority of tabular competitions — often the top-20 finishers were all XGBoost or LightGBM, differing only in hyperparameters and feature engineering.

08

XGBoost

Tianqi Chen's 2014 XGBoost library is the single most consequential implementation in applied ML of the last decade. It did not invent gradient boosting, but it combined a regularised objective, a second-order Taylor expansion, sparsity-aware splits, careful caching, and out-of-core training into a system that ran an order of magnitude faster and scored a few percent better than anything else. For years, "XGBoost" was a synonym for "default tabular model".

Regularised objective

XGBoost adds an explicit regularisation term to the boosting objective: L(F) = Σ_i L(y_i, F(x_i)) + Σ_m Ω(h_m), where Ω(h) = γT + ½λ Σ_j w_j² penalises the number of leaves T and the sum of squared leaf weights. The γ term discourages adding a leaf unless it reduces loss by at least γ; the λ term shrinks leaf values toward zero. This is a major conceptual step: previous gradient-boosting libraries regularised implicitly through tree-depth caps and shrinkage; XGBoost regularises the loss directly, and the resulting objective has a closed-form optimal leaf value that uses both gradient and Hessian. Every split gain and leaf value downstream follows from this objective.

Second-order approximation

At each boosting step, XGBoost approximates the loss by its second-order Taylor expansion in the new tree f: L(F + f) ≈ L(F) + Σ_i (g_i f(x_i) + ½ h_i f(x_i)²) + Ω(f), where g_i and h_i are the first and second derivatives of the loss at F(x_i). For a tree, this is optimised region-by-region, and the optimal leaf value in region R_j is w_j* = −(Σ_{i∈R_j} g_i) / (Σ_{i∈R_j} h_i + λ). The reduction in loss for a candidate split is then computable in closed form and becomes the gain used to choose split points. In scikit-learn's gradient boosting, the corresponding step was approximate and per-tree; in XGBoost, the Newton step is exact per tree and dramatically more efficient per round. The second-order method converges in fewer trees and is measurably more accurate than first-order boosting.
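The closed forms above fit in a few lines. A pure-Python sketch, with `lam` and `gamma` playing the roles of λ and γ from the regularised objective; the leaf score G²/(2(H + λ)) follows by substituting w* back into the quadratic. The toy gradients assume squared-error loss, where h_i = 1; all names are illustrative.

```python
# XGBoost-style Newton step, per the closed form in the text:
#   w* = -G / (H + lambda), and a leaf's loss reduction is G^2 / (2 (H + lambda)).
# A split's gain compares the two children against the parent, minus the
# per-leaf penalty gamma.

def leaf_value(grads, hesses, lam=1.0):
    return -sum(grads) / (sum(hesses) + lam)

def leaf_score(grads, hesses, lam=1.0):
    G, H = sum(grads), sum(hesses)
    return G * G / (2 * (H + lam))

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    parent_g = list(g_left) + list(g_right)
    parent_h = list(h_left) + list(h_right)
    return (leaf_score(g_left, h_left, lam)
            + leaf_score(g_right, h_right, lam)
            - leaf_score(parent_g, parent_h, lam)
            - gamma)

# Squared-error loss: g_i = F(x_i) - y_i, h_i = 1. Two groups with
# opposite residual sign make an obviously good split.
g = [-1.0, -1.0, 2.0, 2.0]
h = [1.0, 1.0, 1.0, 1.0]
gain = split_gain(g[:2], h[:2], g[2:], h[2:])
w_left = leaf_value(g[:2], h[:2])
```

Note how λ enters both formulas: a larger λ shrinks leaf values toward zero and deflates every gain, which is exactly the direct regularisation of the loss described above.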

Sparsity and system engineering

Two engineering features made XGBoost a phenomenon. First, sparsity-aware split finding handles missing values and one-hot-encoded features by learning a default direction for missing entries at each split, instead of imputing. Second, the data layout — block-compressed column-sorted storage — lets split-finding scan each feature in parallel once, then cache the results across all boosting rounds. This is why XGBoost on a single machine is often faster than naive boosting on a cluster: the inner loop is memory-bandwidth-bound, and Chen's implementation saturates it. Add column- and row-subsampling (imported from random forests), histogram-based splits (later), and distributed training, and you have the library that, according to Chen's 2016 paper, was used by 17 of the 29 winning Kaggle solutions in 2015.

The XGBoost recipe. Use a small learning rate (0.05 or 0.1), trees of depth 4–8, subsample rows 0.7–0.9, subsample columns 0.5–1.0, L2 regularisation λ in the 1–10 range, γ in 0–5. Turn on early stopping against a validation set with 50–100 rounds of patience. Most of the leaderboard tuning is searching these ranges carefully; the defaults are already good.

When XGBoost plateaus

Two places. On very large datasets (tens of millions of rows), exact split-finding becomes expensive and histogram-based methods are faster — which is what LightGBM and CatBoost were built for, and what XGBoost later added. On categorical features with many levels (high-cardinality IDs), XGBoost's one-hot expansion blows up and target encoding can be unstable; CatBoost's ordered target encoding is noticeably better. On both fronts the newer libraries caught up by adopting different tradeoffs, which is the subject of the next section.

09

LightGBM and CatBoost

By 2017 two further gradient-boosting libraries had come to challenge XGBoost: Microsoft's LightGBM and Yandex's CatBoost. Each made a different architectural bet. LightGBM bet on histograms and leaf-wise tree growth for speed; CatBoost bet on ordered boosting and target encoding for categorical data. Both won, in different niches, and the three libraries now form the tabular ML toolkit.

Histogram-based splits

Exact gradient-boosting split-finding sorts each feature's values at every node of every tree and scans them for the best threshold. That is O(n log n) per feature per node, and it dominates training time. Histogram-based methods bin each feature into 256 buckets once, up front, and all subsequent split searches scan only the bucket boundaries: O(n + 256) per feature per node. The loss in accuracy from binning is negligible in practice; the speedup is often 5–20×. LightGBM (Ke et al., 2017) was the first library built around this, and XGBoost later added a histogram mode. On large data, both libraries now train at roughly comparable speed; on small data, XGBoost's exact mode is marginally more accurate.
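The bin-once, scan-boundaries idea can be sketched directly. A pure-Python illustration: the gain here is the plain variance-reduction score for squared loss (unit Hessians, no λ), whereas real libraries use the gradient/Hessian gain and build histograms far more carefully; all names are illustrative.

```python
# Histogram split finding, sketched: bin the feature once up front, then every
# split search accumulates per-bin gradient sums and scans only the bucket
# boundaries instead of re-sorting raw values.

def build_bins(values, n_bins=256):
    """Quantile-ish bin edges, computed once per feature."""
    s = sorted(values)
    step = max(1, len(s) // n_bins)
    return sorted(set(s[i] for i in range(step - 1, len(s), step)))

def bin_index(v, edges):
    for i, e in enumerate(edges):        # real libraries bisect, not scan
        if v <= e:
            return i
    return len(edges)                    # overflow bin for unseen large values

def best_split_from_histogram(binned, grads, n_bins):
    g_per_bin = [0.0] * (n_bins + 1)
    n_per_bin = [0] * (n_bins + 1)
    for b, g in zip(binned, grads):
        g_per_bin[b] += g
        n_per_bin[b] += 1
    total_g, total_n = sum(grads), len(grads)
    best_gain, best_bin = 0.0, None
    gl, nl = 0.0, 0
    for b in range(n_bins):              # scan O(n_bins) boundaries only
        gl += g_per_bin[b]
        nl += n_per_bin[b]
        nr = total_n - nl
        if nl == 0 or nr == 0:
            continue
        gr = total_g - gl
        # Variance-reduction gain for squared loss (h_i = 1, lambda = 0).
        gain = gl * gl / nl + gr * gr / nr - total_g * total_g / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
grads = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]   # residuals flip sign at the gap
edges = build_bins(values)
binned = [bin_index(v, edges) for v in values]
best_bin, best_gain = best_split_from_histogram(binned, grads, len(edges))
```

The key point is in the loop: once the per-bin sums exist, the boundary scan costs the same regardless of n, which is where the 5–20× speedup comes from.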

Leaf-wise growth

The second LightGBM bet was leaf-wise (best-first) tree growth instead of level-wise (breadth-first). Classical gradient boosting grows trees level-by-level to a fixed depth; LightGBM picks the single leaf whose split would most reduce the loss and expands only that one, repeating until a node budget is hit. Leaf-wise trees can be very unbalanced — some paths deep, others shallow — and they fit training data more quickly round-for-round. This drives training loss down faster per tree but is more prone to overfitting on small datasets, which is why LightGBM exposes num_leaves as a primary tuning parameter alongside max_depth. On large datasets leaf-wise wins; on small ones it can overfit and must be tamed.

CatBoost and ordered target encoding

CatBoost (Prokhorenkova et al., 2018) addresses a different problem: what to do with high-cardinality categorical features (user IDs, zip codes, product SKUs). The classical options are one-hot encoding (exploding the feature space) or target encoding (replacing each category with the mean of the target for that category — which leaks the target into the training features and causes overfitting). CatBoost's insight is ordered target encoding: for each training example, compute the target mean using only the preceding examples in a random permutation, so the encoded value is never influenced by the row's own label. This removes the leakage entirely. CatBoost also implements ordered boosting, a generalisation that computes residuals using only earlier examples, which further cuts a subtle form of bias present in all gradient boosting. On categorical-heavy data (retail, ads, web behaviour) CatBoost often beats XGBoost and LightGBM by a real margin without any categorical preprocessing.
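Ordered target encoding is compact enough to sketch in full. A pure-Python illustration: walk a random permutation, encode each row from the history accumulated so far, and only then reveal the row's own label. The `prior` smoothing term and its placement are a simplification of CatBoost's actual smoothing formula; all names are illustrative.

```python
# Ordered target encoding, sketched: encode each row's category using the
# target statistics of *earlier* rows only (in a random permutation), so a
# row's own label can never leak into its feature.

import random

def ordered_target_encode(categories, targets, prior=0.5, seed=0):
    rng = random.Random(seed)
    order = list(range(len(categories)))
    rng.shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:                          # walk the permutation
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior) / (n + 1)   # history-only estimate
        sums[c] = s + targets[i]             # only now reveal this row's label
        counts[c] = n + 1
    return encoded

cats = ["a", "a", "a", "b", "b"]
ys = [1, 1, 1, 0, 0]
enc = ordered_target_encode(cats, ys)
```

The first occurrence of every category gets only the prior, and later occurrences drift toward the category's target mean; no row's encoding ever depends on its own label, which is the whole point.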

Which one to use. For small-to-medium numerical datasets, XGBoost is the most predictable. For large datasets, LightGBM's speed wins. For datasets with many categorical features (especially high-cardinality), CatBoost wins. Most serious tabular ML teams carry all three and tune the best of them for the problem at hand — which is itself a kind of ensemble.

Convergence of the three

In a decade, the three libraries have converged feature-for-feature: XGBoost added histogram splits and GPU training, LightGBM added categorical handling, CatBoost added regularisation tricks from XGBoost. The remaining differences are in defaults, idiosyncrasies, and speed on specific workloads. For a practitioner starting today on a tabular problem, picking one of the three essentially at random and tuning carefully outperforms picking any other algorithm. The gap between the three is usually smaller than the gap between a well-tuned boosted model and everything else.

10

Tuning boosted trees

Gradient-boosted trees have a small number of hyperparameters with big effects and a small number with small effects. Getting the big ones right — learning rate, number of trees, depth, regularisation — is most of the work. Knowing which is which saves an enormous amount of compute.

The learning rate / tree count tradeoff

The two single most important parameters are the learning rate η (shrinkage) and the number of trees M. Smaller η needs larger M and produces better models. The canonical recipe: set η small (0.01–0.05), set M large (2000–10000), and use early stopping against a validation set to pick the actual number of trees needed. Early stopping is not optional in serious boosting — it is the single biggest guard against overfitting and the reason low learning rates work. With early stopping, the training loop will halt when validation loss fails to improve for, say, 50 rounds, and you can leave M set to a wildly pessimistic number without wasting compute.

Tree depth and leaves

Depth controls the interaction order the model can represent: a depth-1 tree is a stump (no interactions), depth-2 captures pairwise interactions, depth-6 captures up to six-way. For most tabular problems, depth 4–8 is the sweet spot. Deeper trees memorise the training set faster; shallower ones underfit harder. LightGBM users adjust num_leaves instead of or in addition to depth, typically in the 31–255 range; a rough rule is num_leaves < 2^depth. Getting depth right is worth more than getting almost any other parameter right, and it is also the parameter that most commonly has a strong interaction with η and the regularisers.

Subsampling and regularisation

Row subsampling (subsample, bagging_fraction) takes a random fraction of the training data for each tree, typically 0.6–0.9. Column subsampling (colsample_bytree, feature_fraction) takes a random fraction of the features per tree, typically 0.5–1.0. Both add bagging-style variance reduction on top of boosting's bias reduction, and both are essentially free regularisers. The L2 regularisation λ (or reg_lambda) penalises large leaf values and is the next most useful knob. min_child_weight (XGBoost) or min_data_in_leaf (LightGBM) sets a minimum Hessian / row count per leaf and prevents over-fragmenting trees on noisy data.

The 90-percent recipe. Set η = 0.05, max_depth = 6, subsample = 0.8, colsample_bytree = 0.8, reg_lambda = 1. Set M = 5000 and early-stop with patience 50. Tune only max_depth, η, and (for LightGBM) num_leaves if you need more. Ninety percent of boosted-tree gains come from these. Resist the urge to tune twenty parameters in a grid.
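As a concrete starting point, the recipe above reads naturally as a parameter dict. The names follow XGBoost's scikit-learn wrapper (LightGBM spells several of them differently, e.g. bagging_fraction and feature_fraction); this is a sketch of sensible defaults, not tuned values.

```python
# The 90-percent recipe as XGBoost-style parameters. Tune max_depth,
# learning_rate, and (for LightGBM) num_leaves; leave the rest alone at first.

params = {
    "learning_rate": 0.05,      # eta: small, paired with many trees
    "n_estimators": 5000,       # M: deliberately pessimistic; early stopping trims it
    "max_depth": 6,             # interaction order; tune in the 4-8 range
    "subsample": 0.8,           # row fraction per tree
    "colsample_bytree": 0.8,    # column fraction per tree
    "reg_lambda": 1.0,          # L2 penalty on leaf values
}
early_stopping_rounds = 50      # patience against a held-out validation set
```

With the scikit-learn wrapper this becomes roughly `xgboost.XGBClassifier(**params)` fit against an `eval_set`; note that the early-stopping argument has moved between `fit` and the constructor across XGBoost versions, so check the documentation for the version you have installed.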

What to never do

Never skip early stopping — you will either stop too early by guessing M low, or overfit by guessing it high. Never tune on the test set. Never forget to scale your early-stopping-patience with your learning rate: small η means long plateaus, and low patience will stop prematurely. Never one-hot-encode a high-cardinality categorical in XGBoost when you could target-encode or use CatBoost. And never assume the defaults are bad — they usually aren't; they are the answer to a question someone very smart has thought about at length.

11

Feature importance and explanation

An ensemble of 500 trees is not readable. Feature-importance scores and example-level explanations are how a trained model gets communicated to stakeholders, audited by regulators, and debugged by its builders. Three tools matter: impurity importance, permutation importance, and SHAP values. They measure different things, and reading the wrong one can be misleading.

Impurity importance

The default importance metric in every tree-ensemble library is mean decrease in impurity: for each feature, sum the reduction in training-set impurity (Gini, entropy, squared error) contributed by splits on that feature, averaged over all trees. Fast and free — already computed during training. But it has a known failure mode: it is biased toward high-cardinality features (they have more possible splits, so they win more of them) and toward numerical features over categorical ones. On data where user-ID is an input column, impurity importance will rank it first even if it has no predictive power. Use impurity importance as a quick sanity check and never as a final audit.

Permutation importance

Permutation importance (Breiman, 2001, then Fisher, Rudin, Dominici, 2018) measures, for each feature, the drop in model performance on a held-out set when that feature's values are randomly shuffled. Randomness breaks the feature's relationship with the target; if the model relied on the feature, the score drops. Permutation importance is model-agnostic (works on any black box), unbiased with respect to cardinality, and directly interpretable ("shuffling this feature costs us 2.3% accuracy"). It is the correct tool for almost all importance questions. Caveat: when features are correlated, shuffling one of them pushes the input off the data manifold, and the permutation drop underestimates its true importance. Conditional permutation methods fix this at a higher compute cost.
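The procedure is a few lines for any black box. A pure-Python sketch scoring accuracy drops on a held-out set; the toy `model` and data are illustrative, chosen so that one column carries all the signal and the other none.

```python
# Permutation importance, sketched model-agnostically: shuffle one column of
# the held-out rows, re-score, and report the average drop. `model` is any
# callable row -> predicted label.

import random

def permutation_importance(model, rows, labels, col, n_repeats=10, seed=0):
    rng = random.Random(seed)

    def accuracy(rs):
        return sum(model(r) == y for r, y in zip(rs, labels)) / len(rs)

    base = accuracy(rows)
    drops = []
    for _ in range(n_repeats):
        shuffled_col = [r[col] for r in rows]
        rng.shuffle(shuffled_col)           # break the feature-target link
        shuffled = [r[:col] + (v,) + r[col + 1:]
                    for r, v in zip(rows, shuffled_col)]
        drops.append(base - accuracy(shuffled))
    return sum(drops) / n_repeats

# Toy model that only uses column 0: shuffling column 1 should cost nothing.
model = lambda r: 1 if r[0] > 0 else 0
rows = [(-2, 9), (-1, 9), (1, 9), (2, 9)]
labels = [0, 0, 1, 1]
imp_signal = permutation_importance(model, rows, labels, col=0)
imp_noise = permutation_importance(model, rows, labels, col=1)
```

This is also where the correlated-features caveat shows up concretely: the shuffled rows may be combinations the model never saw in training, so interpret the drops with that in mind.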

SHAP values

SHAP (SHapley Additive exPlanations; Lundberg and Lee, 2017) assigns to each feature, for each specific prediction, a value quantifying how much that feature shifted the prediction away from the baseline. The values are derived from the Shapley value of cooperative game theory — the unique attribution rule satisfying a set of consistency axioms — and they are additive: the prediction equals the baseline plus the sum of the SHAP values. TreeSHAP (Lundberg et al., 2018) computes exact SHAP values for tree ensembles in polynomial time, which turned SHAP into the de facto model-explanation standard for boosted trees. You get both global importance (summed absolute SHAP across all rows) and local explanations ("this person got denied credit because feature X contributed −0.3 to the prediction"). The latter is what regulators increasingly want.

Impurity importance is free but biased. Permutation importance is honest but global only. SHAP gives local, consistent, additive attributions and is worth the compute. For any model that will be audited or deployed to affect people, SHAP is now the default explanation method.

Partial dependence and ICE

Partial dependence plots (PDPs) show how the model's average prediction varies with one feature, marginalising over the others — a global view of a single feature's effect. Individual conditional expectation (ICE) plots show the same relationship per-example, useful when interactions mean the average effect hides heterogeneity. Both are useful diagnostics for boosted trees: they show whether the model learned the expected monotonic trend, detected a threshold, or caught a surprising nonlinear bend. Modern libraries bundle PDP and ICE into a few lines of code, and they belong in every serious model-card document.
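Both plots come from the same computation. A pure-Python sketch in which the ICE curves are the per-row prediction curves over a grid and the PDP is their pointwise mean; the toy model has a deliberate interaction so the flat PDP hides two opposite ICE lines. All names are illustrative.

```python
# PDP and ICE, sketched: vary feature `col` over `grid` for every row, predict,
# and average. The per-row curves are ICE; their mean is the PDP.

def ice_curves(model, rows, col, grid):
    """One prediction curve per row."""
    return [[model(r[:col] + (v,) + r[col + 1:]) for v in grid] for r in rows]

def partial_dependence(model, rows, col, grid):
    """Pointwise mean of the ICE curves."""
    curves = ice_curves(model, rows, col, grid)
    return [sum(c[j] for c in curves) / len(curves) for j in range(len(grid))]

# Interaction: the effect of feature 0 flips sign with feature 1.
model = lambda r: r[0] * (1 if r[1] > 0 else -1)
rows = [(0.0, 1.0), (0.0, -1.0)]
grid = [0.0, 1.0, 2.0]
curves = ice_curves(model, rows, 0, grid)
pdp = partial_dependence(model, rows, 0, grid)
```

Here the PDP is flat at zero while the two ICE curves rise and fall: exactly the heterogeneity-hiding failure the text warns about, and the reason to plot both.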

12

Stacking and blending

Bagging and boosting combine models of the same kind, trained on slightly different data. Stacking combines models of different kinds by learning a meta-model that takes their predictions as inputs. The Netflix Prize was won by a stacked ensemble of 107 models. Modern Kaggle top-tens routinely stack half a dozen boosted trees, a few neural nets, and a linear model.

Stacking (stacked generalisation)

Stacking (Wolpert, 1992) proceeds in two stages. Stage one: train K base learners f_1, …, f_K on the training set, typically heterogeneous (a gradient-boosted tree, a neural net, a logistic regression, a k-NN). Stage two: train a meta-learner g on the base learners' predictions — usually a simple linear model or another small tree ensemble — and use g(f_1(x), …, f_K(x)) as the final prediction. The critical detail: the meta-learner must be trained on out-of-fold predictions from the base learners, never on in-sample predictions. Use k-fold cross-validation: for each fold, train each base learner on the other folds and predict on the held-out fold; the concatenation of held-out predictions becomes the training input to the meta-learner. Without this, the meta-learner sees the base learners' overfit in-sample predictions and learns to trust them too much, producing a model that looks brilliant on training and collapses on test.
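The out-of-fold discipline is the part worth writing down. A pure-Python skeleton in which a base learner is a `(fit, predict)` pair of callables; the strided fold split and the toy learner (which just predicts its training fold's target mean) are illustrative.

```python
# One column of stacking meta-features: every training row's prediction comes
# from a model that never saw that row.

def kfold_indices(n, k):
    """Strided folds; real pipelines shuffle (and stratify) first."""
    return [list(range(i, n, k)) for i in range(k)]

def oof_predictions(fit, predict, X, y, k=5):
    n = len(X)
    oof = [None] * n
    for fold in kfold_indices(n, k):
        fold_set = set(fold)
        train_idx = [i for i in range(n) if i not in fold_set]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in fold:                      # predict only on the held-out fold
            oof[i] = predict(model, X[i])
    return oof

fit = lambda X, y: sum(y) / len(y)          # toy learner: training-fold mean
predict = lambda model, x: model

X = list(range(10))
y = [0.0] * 5 + [1.0] * 5
oof = oof_predictions(fit, predict, X, y, k=2)
```

Running this once per base learner and stacking the columns gives the meta-learner's training matrix; fitting the meta-learner on in-sample predictions instead is the leak the paragraph above warns against.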

Blending

Blending is stacking's lazy cousin: instead of k-fold cross-validation, hold out a single validation set, train the base learners on the training portion, produce predictions on the validation set, and fit the meta-learner to those. Simpler to implement, with cleaner meta-features (every prediction comes from one model trained on one fixed dataset, rather than from k fold-specific models), but higher variance of the meta-model (because it sees fewer training examples). Blending was the term popularised by Netflix Prize competitors, who wanted a distinction between the full-cross-validation form and the holdout form. In practice the two perform similarly; use blending when cross-validation is too expensive or when the dataset is huge.

What the meta-learner should be

Counterintuitively, the meta-learner should almost always be simple. A linear regression / logistic regression on top of the base predictions works astonishingly well — essentially learning optimal weights across models. A shallow gradient-boosted tree can catch interactions between base predictions (model A is more accurate than B when feature X is high) but has to be regularised heavily or it will overfit. Neural-net meta-learners almost never win: the base predictions are few and smooth, and a linear combination usually suffices. The reason is that most of the ensemble's value comes from the diversity of the base learners, not from the sophistication of the aggregation rule.

The Netflix Prize lesson. BellKor's Pragmatic Chaos won with a gradient-boosted meta-model over 107 base predictors including matrix factorisations, restricted Boltzmann machines, and neighbourhood models. The base models' diversity did the work. The team also discovered that maintaining 107 models in production was infeasible for a streaming service, and Netflix never deployed the winning ensemble — which is a parable about the gap between leaderboard wins and production value.

When to stack

Stack when you have several genuinely different base models with uncorrelated errors, a validation pipeline that can produce clean out-of-fold predictions, and a downstream system that can afford the compute and operational complexity of deploying many models. In practice this describes Kaggle competitions and some high-stakes offline predictions (credit risk, fraud), and it does not describe most production systems, which prefer one carefully tuned boosted model to seven stacked ones. Stacking is a peak-accuracy tool, not a robustness tool.

13

Voting

Before stacking, before boosting, the most straightforward way to combine models is to vote. Hard voting picks the majority class; soft voting averages probabilities. They are the first things to try, the last things to forget, and the baseline any cleverer combination method must beat.

Hard voting

Hard voting takes the predicted class from each of K classifiers and returns the majority. Ties are broken arbitrarily, or by the classifier with the highest individual accuracy, or by weighted vote (some classifiers count for more). The simplest possible combiner, and under the Condorcet conditions — each classifier better than random, errors reasonably decorrelated — it works. The main limitation: hard voting throws away all the confidence information. If classifier A thinks it's 99% spam and classifiers B and C each think it's 51% ham, hard voting says ham. Soft voting says spam.

Soft voting

Soft voting averages the predicted probabilities across classifiers and picks the class with the highest average: p̄(y = c | x) = (1/K) Σ_k p_k(y = c | x). Soft voting is strictly more informative than hard voting — the same classifiers' outputs determine the decision, but their confidences matter — and it is strictly better whenever the individual classifiers produce calibrated probabilities. When they do not (naive Bayes, SVM scores without Platt scaling, uncalibrated gradient-boosted models), soft voting can do worse than hard voting because badly-calibrated probabilities warp the average. Calibration before averaging is almost always worth the cost.
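The spam example from the previous paragraph, in code. A minimal pure-Python sketch of both combiners, with the classifiers' probabilities written out as dicts; all names are illustrative.

```python
# Hard vs soft voting: one confident classifier against two barely-opposed ones.

from collections import Counter

def hard_vote(labels):
    """Majority class among the predicted labels."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(prob_rows):
    """prob_rows: one dict of class -> probability per classifier."""
    classes = prob_rows[0].keys()
    avg = {c: sum(p[c] for p in prob_rows) / len(prob_rows) for c in classes}
    return max(avg, key=avg.get)

probs = [
    {"spam": 0.99, "ham": 0.01},   # A: very confident spam
    {"spam": 0.49, "ham": 0.51},   # B: barely ham
    {"spam": 0.49, "ham": 0.51},   # C: barely ham
]
hard = hard_vote([max(p, key=p.get) for p in probs])   # majority of labels
soft = soft_vote(probs)                                # average of probabilities
```

Hard voting discards A's 0.99 and returns "ham"; soft voting keeps it and returns "spam", which is the confidence-information argument made above in four lines.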

Weighted voting

Both variants admit weights: give each classifier a weight based on its validation accuracy, its reliability, or learned weights fit on held-out data. Weighted soft voting is formally equivalent to a fixed-weight linear stacking ensemble, which is why, once weights are learned rather than assigned, the distinction between "voting with tuned weights" and "stacking with a linear meta-model" disappears. In scikit-learn, the VotingClassifier and VotingRegressor with voting='soft' and tuned weights is a full-featured blending tool dressed up as a voting tool.

Voting's one big limitation. Voting assumes the constituent models' errors are roughly decorrelated. Voting five copies of the same model (or five very similar ones, like five gradient-boosted trees with nearby hyperparameters) does almost nothing. Voting a gradient-boosted tree, a random forest, a neural net, and a logistic regression usually helps measurably. Diversity is the point.

When voting is enough

Voting is the right answer when you already have several diverse models you trust, you want the simplest possible production story, and you don't have the data or the infrastructure to run out-of-fold cross-validation for stacking. Small teams, internal tools, and conservative deployment environments often favour soft voting over stacking precisely because there is no meta-model to monitor. Soft voting of a gradient-boosted tree and a deep neural network is a standard move on tabular-plus-text problems, and it usually beats either alone.

14

Diversity and correlated errors

The variance-of-an-average formula keeps reappearing because it is the single most important fact about ensembles. When ensembles help, it is because the members' errors are decorrelated. When they don't, it is almost always because the members agree on the wrong answer. Diagnosing correlation is the key skill of ensemble design.

The error-correlation matrix

Given K models and a held-out evaluation set, compute the residual (for regression) or error indicator (for classification) of each model on each example, producing an n × K matrix. The K × K correlation matrix of its columns is the error-correlation matrix of the ensemble. Two models with correlation 0.98 are essentially duplicates for ensemble purposes; adding one gains almost nothing. Two models with correlation 0.3 are genuinely complementary; averaging them will meaningfully reduce variance. This simple matrix is the first diagnostic anyone building a stacked ensemble should compute, and it is routinely skipped.
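Computing the matrix takes a dozen lines. A pure-Python sketch over residual columns; the toy residuals include a near-duplicate pair and one decorrelated model, and all names are illustrative.

```python
# Residual-correlation diagnostic: Pearson correlation between each pair of
# models' residual columns on a held-out set.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def error_correlation_matrix(residual_columns):
    K = len(residual_columns)
    return [[pearson(residual_columns[i], residual_columns[j])
             for j in range(K)] for i in range(K)]

# Three models: two near-duplicates and one genuinely different error pattern.
r1 = [1.0, -1.0, 2.0, -2.0]
r2 = [1.1, -0.9, 2.1, -1.9]   # ~ r1 shifted: redundant for ensembling
r3 = [1.0, 1.0, -1.0, -1.0]   # decorrelated: worth adding
C = error_correlation_matrix([r1, r2, r3])
```

Here r2's correlation with r1 is essentially 1.0 (it adds nothing to an average), while r3's is 0.0 (averaging it in genuinely cuts variance), which is the duplicates-versus-complements distinction in the paragraph above.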

Sources of diversity

Diversity can be manufactured in five basic ways: (1) data diversity — bootstrap, subsample, stratify, time-slice; (2) feature diversity — random subspaces, different feature engineering, different embeddings; (3) model diversity — different model families (GBM vs NN vs linear), different hyperparameters within a family; (4) loss diversity — different objectives (MAE vs MSE vs Huber), different calibration targets; (5) target diversity — different class encodings, different label smoothing, different multi-task auxiliary losses. A strong ensemble usually combines several of these. Relying on only one — say, just bagging — hits the ceiling set by that source's achievable decorrelation.

Why ensembles sometimes hurt

If one model is markedly better than the others and the others are correlated with each other but not with it, averaging drags the good model down toward the consensus. Unequal weights help, but there is a threshold below which including a weak model is simply noise. Similarly, averaging a well-calibrated classifier with a miscalibrated one warps the average's calibration. And averaging across classifiers that partition the decision space very differently can break monotonic relationships that every individual model respects. The simple answer — "always average, it can't hurt" — is wrong. Averaging without checking residual correlation and individual validation accuracy is a good way to ship an ensemble that is worse than its best member.

The two tests before shipping any ensemble: (1) Does the ensemble beat the best individual model on a held-out set by a margin that is larger than the validation noise? (2) Are the constituent models' residual correlations low enough that the improvement will survive in production? Skip these and the ensemble may be purely cargo cult.

Diversity vs accuracy

There is a tradeoff between the diversity of the base learners and their individual accuracy — a deliberately bad model added to an ensemble is diverse, but its errors drag the average down faster than its decorrelation lifts it. Kuncheva's 2003 survey documented this tradeoff across a wide family of diversity measures and found no clean way to predict ensemble gains from diversity alone. In practice, the rule is: take the best individual model, then look for models that are clearly worse but get different examples wrong. Adding a worse-but-different model helps more than adding a slightly-better-but-similar one — up to a point.

15

Ensembles and calibration

Calibration — the property that predicted probabilities match empirical frequencies — is as important for ensembles as for their members, and ensembling changes calibration in characteristic ways. Bagging tends to preserve it. Boosting distorts it. Deep ensembles improve it. Knowing which regime you're in determines whether post-hoc calibration is needed.

How bagging affects calibration

Random forests are typically well-calibrated on binary classification: averaging many trees' class-fractions produces probabilities whose reliability diagrams track the diagonal reasonably well. The intuition: each tree's leaf estimate is an unbiased estimator of the region's true class probability (modulo small-sample noise), and averaging many unbiased estimates gives an unbiased aggregate. For multiclass, forests are slightly over-confident on the modal class when class-vote aggregation is used, and the fix (averaging per-tree class probabilities rather than voting) is usually built in.

How boosting distorts calibration

Gradient-boosted trees are usually miscalibrated, often pushing probabilities toward the extremes: a GBM will say 0.01 or 0.99 for many cases where reality is more like 0.05 or 0.90. The cause is the loss function: the log loss used for classification is minimised by confident predictions when the predictions are correct, and the boosting procedure keeps increasing confidence at the margin. AdaBoost is even worse — its exponential loss pushes probabilities even more aggressively. The fix is a standard calibration wrapper (Chapter 02, Section 14): Platt scaling or isotonic regression on a held-out set. Most production boosted classifiers are calibrated this way even though their uncalibrated rank-ordering (AUC) is already excellent.

Deep ensembles and calibration

Deep neural networks trained with cross-entropy are famously miscalibrated (Guo et al., 2017): they become over-confident as they get deeper and wider, and modern ResNets or transformers produce softmax probabilities that look like 0.99 when the true probability is more like 0.7. Ensembling several independently-initialised deep networks — known as deep ensembles (Lakshminarayanan et al., 2017) — improves both accuracy and calibration, and is the simplest reliable way to get epistemic uncertainty out of a neural net. Five independently trained deep networks, averaged, beat nearly all fancier uncertainty methods (MC Dropout, variational Bayes) on held-out log-likelihood and calibration metrics. The cost is five times the inference compute.

Practical rule. If you averaged deep nets or bagged trees, calibration is probably fine — check with a reliability diagram. If you boosted, calibrate afterwards (isotonic regression on a held-out set is the default). If you stacked, check the meta-model's outputs and calibrate if needed; the meta-model often inherits miscalibration from its boosted base learners.
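The reliability-diagram check in the rule above is easy to automate. A pure-Python sketch that bins held-out predictions by confidence and compares each bin's mean predicted probability with its empirical positive rate; the toy model is deliberately over-confident, and all names are illustrative.

```python
# Reliability table, sketched: large gaps between mean predicted probability
# and empirical frequency in any bin mean a calibration pass (Platt or
# isotonic) is needed before the probabilities are used downstream.

def reliability_table(probs, labels, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)   # clamp p == 1.0 into the top bin
        bins[b].append((p, y))
    table = []
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        frac_pos = sum(y for _, y in b) / len(b)
        table.append((mean_p, frac_pos, len(b)))
    return table

# An over-confident model: says 0.85 when reality is 0.6, 0.15 when it's 0.4.
probs = [0.85] * 10 + [0.15] * 10
labels = [1] * 6 + [0] * 4 + [0] * 6 + [1] * 4
table = reliability_table(probs, labels)
```

A calibrated model's table hugs the diagonal (mean_p ≈ frac_pos in every bin); the boosted-tree pattern described above shows up as bins pushed toward 0 and 1 whose empirical rates sit closer to the middle.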

Why ensembles of calibrated models aren't automatically calibrated

Even averaging perfectly calibrated classifiers is not guaranteed to produce a perfectly calibrated average, though it is usually close. The exception is when the classifiers agree on the label but have systematic disagreements in magnitude — the average can be pulled toward the middle of a bimodal distribution of confidences, producing under-confidence. A second calibration pass after averaging is sometimes needed; it is very rarely expensive.

16

Ensembles in practice

A practitioner arriving at a new tabular problem has a well-worn ladder of models to climb. Each rung trades complexity for accuracy, and each has a reasonable chance of being the right stopping point. Knowing when to stop — and when to keep climbing — is the practical skill this chapter exists to build.

The model ladder

The canonical ladder: (1) a baseline — predict the mean for regression, the majority class for classification; (2) a simple linear or logistic regression with the obvious features; (3) a random forest with default hyperparameters; (4) a gradient-boosted tree (XGBoost / LightGBM / CatBoost) with light tuning; (5) the same GBM with serious tuning, including cross-validation and feature engineering; (6) a stacked ensemble of several GBMs and possibly a neural net; (7) a full competition-grade pipeline of dozens of models. Most problems stop somewhere between rung 3 and rung 5. Moving from rung 3 to rung 5 roughly doubles the engineering effort and typically buys 1–3% absolute accuracy. Moving from rung 5 to rung 6 quadruples it and buys 0.1–0.5%. Moving to rung 7 is for competitions.

When the ladder stops paying

The ladder has diminishing returns, and the marginal value of a rung depends on the business problem. A credit-scoring model where each 0.1% AUC is worth millions of dollars justifies rung 6. A recommender whose bottleneck is data freshness, not model accuracy, does not. A fraud system that must score transactions in under 10ms cannot afford rung 7. And a regulated medical classifier that needs explainable predictions must stop at rung 5 (one model you can SHAP-explain) rather than 6 (a black-box meta-model). Choose the rung based on the product, not the model.

The other axes

Accuracy is not the only thing ensembles can buy. They also buy uncertainty estimates (variance across ensemble members as a confidence score), robustness to distribution shift (if only some members are sensitive), and calibration (for deep ensembles especially). Sometimes the right reason to ensemble is not a 1% accuracy gain but a better-behaved confidence distribution in production. This is especially true in deep learning, where the same deep-ensembles technique that barely moves top-1 accuracy substantially improves calibration and out-of-distribution detection.

The default tabular recipe, 2026 edition: one gradient-boosted tree library of your choice (XGBoost, LightGBM, or CatBoost), tuned with Bayesian optimisation over max_depth, learning rate, and regularisation. Cross-validation for model selection. SHAP for explanation. Calibrate with isotonic regression if you need probabilities. This is the rung-5 model, and for most production systems it is the final answer.

Operational considerations

Ensembles multiply the costs of deployment: more models to serve, more versions to track, more explanations to audit. Production ML teams often decide against an ensemble that wins on accuracy because the cost of shipping and monitoring five models plus a meta-model exceeds the revenue from the 0.3% lift. Knowing when not to ensemble is as practical a skill as knowing how. The right decision is almost always "one well-tuned boosted model" for most production systems, and "a carefully stacked ensemble" for only the cases where every fraction of a percent has measurable business value.

17

Where it compounds in ML

Classical ensembling was the dominant paradigm on tabular data and has become something quieter but no less important in deep learning. Deep ensembles, mixtures of experts, and model soups are all descendants of the same idea. Much of what is now called "scaling" — training larger models, more seeds, averaging checkpoints — is ensembling in a different vocabulary.

Deep ensembles

Deep ensembles (Lakshminarayanan et al., 2017) are the simplest thing: train the same neural-network architecture several times from different random initialisations, average the predictions. The improvement is real — usually a few percent accuracy, substantial calibration gains, and much better behaviour on out-of-distribution inputs. For all the more sophisticated uncertainty-quantification techniques in deep learning (MC Dropout, variational Bayes, Laplace approximations), plain deep ensembles are often the best performer on standard benchmarks, at roughly the price of N× training and inference cost. Part of the explanation is that different random initialisations converge to different modes in the loss landscape, and averaging across modes captures a richer posterior than any single-mode approximation can.
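The recipe really is that simple: same architecture, different seeds, averaged probabilities. A minimal sketch with scikit-learn's `MLPClassifier` standing in for a deep network:

```python
# Deep-ensemble sketch: N identical networks trained from different
# random initialisations, predictions averaged at inference time.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

# Each member differs only in its random seed (initialisation + shuffling).
members = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                  random_state=seed).fit(X, y)
    for seed in range(5)
]

# Average the class probabilities across ensemble members.
probs = np.mean([m.predict_proba(X) for m in members], axis=0)
ensemble_pred = probs.argmax(axis=1)
```

The per-member spread of `predict_proba` is also the ensemble's uncertainty signal, which is where the calibration and out-of-distribution gains come from.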

Mixture of Experts

Mixture of Experts (MoE; Jacobs, Jordan, Nowlan, Hinton, 1991) is a structurally different kind of ensemble: rather than averaging all models on every input, a gating network decides which of N experts should handle each input, and only those experts run. Modern MoE models — Switch Transformer (Fedus, Zoph, Shazeer, 2021), GLaM, Mixtral — use this sparsity to grow model capacity cheaply: a 1.3-trillion-parameter MoE only activates a few billion parameters per token, costing roughly the compute of a dense model of that activated size. This is the dominant architecture for state-of-the-art language models in the 2024–2026 era. It is ensembling with a learned routing rule, and it is the largest-scale application of the idea in ML history.
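The routing mechanic can be shown in a few lines of NumPy. Everything here is a toy assumption, including the linear gate, the random "expert" matrices, and top-1 routing; real MoE layers use learned FFN experts, top-k routing, and load-balancing losses:

```python
# Sparse MoE routing sketch: a softmax gate scores the experts for each
# input, and only the top-1 expert actually runs.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
W_gate = rng.normal(size=(d, n_experts))                       # gating network
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy experts

def moe_forward(x):
    logits = x @ W_gate
    gate = np.exp(logits - logits.max())
    gate = gate / gate.sum()            # softmax over experts
    k = int(gate.argmax())              # top-1 routing: one expert per token
    return gate[k] * (x @ experts[k]), k

x = rng.normal(size=d)
out, chosen = moe_forward(x)
```

The compute saving is visible in the structure: one of `n_experts` matrix multiplies runs per input, so capacity grows with the expert count while per-token cost does not.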

Model soups and averaging weights

Model soups (Wortsman et al., 2022) take the ensemble idea in a surprising direction: rather than averaging predictions, average the weights of multiple fine-tuned models. For fine-tuned foundation models where all the checkpoints lie in the same basin of the loss landscape, the average-of-weights model performs as well as an average-of-predictions ensemble while using a single model's worth of inference compute. This is now standard practice for competition-grade fine-tuning and has been adopted in production ML pipelines, where the single weight-averaged model is strictly cheaper to serve.
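The weight-averaging step itself is trivial once the checkpoints share a parameterisation. A sketch with logistic regressions standing in for fine-tuned checkpoints (trained on different bootstrap samples to play the role of different fine-tuning runs):

```python
# Model-soup sketch: average the *weights* of several trained models,
# then serve the single averaged model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# "Checkpoints": same architecture, different bootstrap samples of the data.
checkpoints = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))
    checkpoints.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))

# The soup: one model whose weights are the mean of the checkpoints'.
soup_coef = np.mean([m.coef_ for m in checkpoints], axis=0)
soup_intercept = np.mean([m.intercept_ for m in checkpoints], axis=0)

def soup_predict(X):
    return (X @ soup_coef.T + soup_intercept > 0).astype(int).ravel()

pred = soup_predict(X)
```

For linear models this always works; for deep networks it works only when the checkpoints share a loss basin, which is the Wortsman et al. observation that makes soups practical for fine-tuning.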

The ensemble idea is one of the few ML techniques that has scaled all the way from Condorcet's jury to trillion-parameter language models. Every time the field has scaled up — from stumps to forests, from linear regressions to boosted trees, from single networks to ensembles to MoE — the ensemble idea has come along and kept providing diminishing but persistent returns. It is the ML technique least in need of reinvention.

The arc to what comes next

The remaining chapters of Part IV finish the classical-ML story: Chapter 04 on clustering (unsupervised analogues, which have their own ensembling tradition in consensus clustering and affinity propagation), Chapter 05 on dimensionality reduction, and the model-by-model chapters on support-vector machines (07), graph-based methods (08), and classical reinforcement learning (09). Ensembles are the first place in this part where a clean theoretical insight — the variance of an average, the weak-to-strong learnability theorem — became the core of entire engineering disciplines. The pattern will recur in Part V (deep learning): theoretically motivated ideas, scaled into practical systems, then absorbed back into the baseline. Every later chapter in the compendium will assume the reader understands that an ensemble of diverse models is almost always better than any single one — and that this is the single most reliably useful empirical regularity in classical machine learning.

Further reading

Where to go next

Ensemble methods sit at a happy intersection: the foundational papers are short and beautifully written, the modern library documentation is where the daily work actually happens, and the textbooks are all still worth reading. The references below are the ones that reward re-reading — the classics that introduced the ideas, the software docs that are the de facto references for deployment, and the competition writeups where the practical craft is most concentrated.

The anchor textbooks

Foundational papers

Explanation and uncertainty

Modern extensions

Software documentation

This page is Chapter 03 of Part IV: Classical Machine Learning. The next chapter — Clustering and Mixture Models — leaves supervised learning for the unsupervised cousin, where the job is not to predict a label but to discover the structure in unlabelled data: k-means and its variants, hierarchical clustering, density-based methods (DBSCAN, HDBSCAN), Gaussian mixture models, and the consensus-clustering idea that imports the ensemble philosophy of this chapter into the unsupervised side. Later chapters of Part IV extend to dimensionality reduction (the linear-algebra-heavy cousin of clustering, and the place where PCA, ICA, and manifold methods live), probabilistic graphical models, kernel methods and SVMs, feature engineering, and model evaluation and selection — the discipline that turns a trained ensemble into a trustworthy deployment.