The eight chapters before this one have produced a catalogue of ways to build a model: linear regression, logistic regression, trees and ensembles, k-means, GMMs, PCA, Bayesian networks, SVMs, engineered features, regularised coefficients, fifty-odd named algorithms in all. None of them told us how to decide whether the model we built is any good, or how to choose between two models that both look good. That is what this chapter is for. Evaluation is the quiet, unglamorous, high-stakes measurement layer underneath everything else — the part of classical machine learning that turns a fitting procedure into an honest claim. Get it wrong and every decision downstream is built on sand: the model that seemed to beat the baseline actually didn't; the hyperparameter search that chose XGBoost over logistic regression was fooled by a single lucky fold; the classifier that ships into production collapses because accuracy was the wrong summary statistic for an imbalanced problem. Get it right and everything else clicks into place. Evaluation is a coupled problem: the protocol (how you split the data), the metric (what you measure), and the uncertainty quantification (how sure you are) all have to be chosen together, because any one of them done wrong can mask the other two. This chapter walks through each layer in turn — splits, cross-validation, regression and classification metrics, probabilistic scoring, calibration, imbalanced-class problems, hyperparameter tuning as a model-selection procedure, statistical comparison of models, the overfitting/underfitting diagnosis, and the leakage failure modes that silently inflate every metric on this list if you are not careful.
Section one motivates the whole enterprise: why honest evaluation is the single most consequential discipline in classical machine learning, and why almost every failure mode in shipped ML systems is an evaluation failure in disguise. Section two introduces the generalisation-error framework — training error, validation error, test error, the bias–variance decomposition, and the learning curve as the shape-of-fit diagnostic. Section three covers holdout validation — the simple train/val/test split that is the right answer for large-enough datasets and the wrong answer for everything else — and why the three-way split is the minimum honest setup. Section four is cross-validation: k-fold, stratified, leave-one-out, leave-p-out, and the grouped variants that prevent identity leakage when rows share entities. Section five is nested cross-validation — the two-loop protocol that lets you tune hyperparameters and estimate test error honestly from the same dataset, and why "just report the inner-loop score" is the single most common evaluation cheat. Section six is time-series cross-validation: rolling origin, expanding window, blocked and purged CV for autocorrelated data, and the reason standard k-fold silently leaks the future when your rows are ordered in time.
Sections seven through thirteen are about what to measure. Section seven is regression metrics — MSE, RMSE, MAE, MAPE, Huber, quantile loss, R², adjusted R² — and the mapping between metric choice and implicit loss function. Section eight is classification metrics — accuracy, precision, recall, F1, Matthews correlation, balanced accuracy, Cohen's kappa — and the confusion matrix as the object every classification metric reduces to. Section nine is probabilistic scoring: log loss, Brier score, and the broader theory of proper scoring rules that ties evaluation back to density estimation. Section ten is ROC and precision–recall curves — the threshold-free view, ROC-AUC versus PR-AUC, when each is appropriate, and the DeLong test for AUC comparison. Section eleven is calibration: reliability diagrams, the expected-calibration-error metric, Platt scaling, isotonic regression, and why a classifier with high AUC can still be badly miscalibrated. Section twelve is imbalanced classification: stratification, resampling (SMOTE and friends), class weights, cost-sensitive learning, and the PR-curve-plus-calibration protocol that handles rare-positive problems correctly. Section thirteen covers hyperparameter tuning as the model-selection procedure it actually is — grid search, random search, Bayesian optimisation, successive halving and Hyperband, and the tuning budgets that make the results reproducible.
Sections fourteen through eighteen are the measurement-discipline topics that separate good evaluation from bad. Section fourteen is statistical comparison of models: McNemar's test, the paired t-test and its Nadeau–Bengio corrected-resampled cousin, the 5×2 cross-validation test, and the bootstrap confidence intervals that every serious paper should report alongside point estimates. Section fifteen is the overfitting / underfitting diagnosis — high training error, high test error, the bias–variance trade-off, learning curves read as diagnostic instruments, and the regularisation and capacity-control levers that each failure mode indicates. Section sixteen is the safety-critical topic: data leakage and split integrity. Target leakage, preprocessing leakage, group leakage, temporal leakage, and the single unified prescription — fit every transformation only on the training fold — that eliminates most of it. Section seventeen is the operational layer: evaluation in practice, with reporting, baseline comparisons, slice-based analysis, offline-vs-online discrepancies, and the MLOps integration that keeps evaluation honest after a model ships. Section eighteen places classical evaluation inside the broader modern-ML landscape: the way deep learning has stretched every assumption in this chapter, the reproducibility crisis in ML benchmarks, and the emerging concerns (offline-RL evaluation, LLM benchmarks, foundation-model evaluation) that classical protocols were never designed to handle.
Every modelling choice you make — which algorithm to use, which features to engineer, which hyperparameters to pick, whether to ship the model at all — is a decision made on the basis of some measurement. If the measurement is wrong, the decision is wrong. Evaluation is the single most consequential discipline in classical machine learning precisely because everything downstream of it inherits its errors.
Training a reasonable model is, on most problems, not very hard. Scikit-learn's LogisticRegression, fed features and labels, will give you something that runs. What separates a professional from an amateur is not whether the model works — it is whether the practitioner can tell honestly how well it works. A model that scores 94% accuracy in the notebook and 62% accuracy in production is a model whose evaluation was wrong. A team that cannot tell which of two candidate models is better will ship the worse one half the time. A paper that reports a 0.3% improvement over the baseline without confidence intervals has reported nothing. The asymmetry is brutal: good evaluation is invisible when it works, catastrophic when it fails, and the failure mode is usually that the practitioner did not realise they had a problem until a system was already deployed.
An evaluation is a joint choice of three things, and getting any of them wrong invalidates the other two. The protocol is how the data is split — holdout, k-fold, stratified, grouped, time-ordered — and it determines what question the evaluation is actually answering. The metric is what you measure on each split — accuracy, log loss, F1, AUC, RMSE — and it determines which dimension of model quality the answer is about. The uncertainty quantification is how you summarise variability across splits — the standard error of the CV estimate, a bootstrap confidence interval, a statistical test against a baseline — and it determines whether the difference you observed is real or an artefact of the particular random seed. This chapter works through all three and, in the practitioner-facing sections near the end, the way they interact.
An evaluation is honest when the number it produces is a good estimate of the quantity you would observe if you deployed the model on data drawn from the same distribution it was evaluated on. This is harder than it sounds. Every form of information leakage — from using the test set to choose hyperparameters, to standardising features using the full dataset's mean and variance, to splitting on rows when rows share entities — inflates the measurement above its true value. Every form of distribution mismatch — from evaluating on historical data when the model will serve live traffic, to testing on an urban sample when the model will serve rural users — breaks the "same distribution" assumption that makes evaluation meaningful at all. The job of this chapter is to name these failure modes, give them definitions precise enough to test for, and prescribe the protocols that defend against each one.
The quantity evaluation is trying to estimate is generalisation error — the expected loss of the model on a new draw from the same distribution the training data came from. Understanding why this quantity is difficult to estimate, and why it differs from training error, requires the bias–variance decomposition.
Fit a model on a dataset and the training error — the loss computed on the same data you trained on — will usually understate the generalisation error. The gap is the optimism of training error: a sufficiently flexible model can fit the training labels perfectly while learning nothing at all about the underlying distribution, making training error zero and generalisation error arbitrarily bad. A held-out test set gives an unbiased estimate of generalisation error precisely because the model has not seen it. The fundamental lesson of twentieth-century statistical learning theory is that this gap is not a bug but a structural feature of any finite-sample learning problem, and that bounding it requires explicit assumptions about the model's capacity relative to the amount of data.
For squared-error regression with true regression function f(x), fitted estimator f̂(x), and noise variance σ², the expected test error at a point x decomposes exactly as E[(y − f̂(x))²] = σ² + Bias[f̂(x)]² + Var[f̂(x)]. The three terms have distinct interpretations. Irreducible noise σ² is the part of the target that no model can predict; it depends on the data, not on the estimator. Bias measures how far the average prediction (over repeated training sets) is from the truth; high bias means the model is systematically wrong because it is not flexible enough to capture the true relationship. Variance measures how much the prediction changes as the training set changes; high variance means the model is too flexible and fits noise. The decomposition makes precise the intuition that there is a sweet spot of model complexity — too simple and bias dominates, too complex and variance dominates, with generalisation error minimised somewhere in between.
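The decomposition can be checked empirically. The sketch below (illustrative; `true_f` and the helper `bias_variance_at` are inventions for this example, not library functions) refits polynomials of increasing degree on many independent synthetic training sets and measures bias² and variance at a single evaluation point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def bias_variance_at(degree, x0=0.25, n_train=30, n_reps=500, sigma=0.3):
    # Refit a degree-d polynomial on many independent training sets and
    # measure, at the single point x0, how far the average prediction is
    # from the truth (bias^2) and how much predictions scatter (variance).
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - true_f(x0)) ** 2
    return bias2, preds.var()

for degree in (1, 3, 9):
    b2, var = bias_variance_at(degree)
    # expected test error at x0 = sigma^2 + bias^2 + variance
    print(f"degree {degree}: bias^2={b2:.4f}  variance={var:.4f}  "
          f"total={0.3**2 + b2 + var:.4f}")
```

The rigid degree-1 fit shows large bias² and small variance; the degree-9 fit reverses the pattern, exactly the trade-off the decomposition predicts.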
Plot training error and cross-validated test error as a function of training-set size (or of model complexity, or of training epochs) and the shape of the resulting curves tells you which regime you are in. High bias / underfitting: both errors converge to a value well above the irreducible-noise floor as data grows, and the two curves are close together — more data will not help; you need a more flexible model. High variance / overfitting: training error is low, test error is high, and the gap between them closes slowly as data grows — more data will help, and so will regularisation or simpler models. Well-fit: training error and test error converge close together and close to the noise floor. The learning-curve diagnosis is the single most useful plot in a model-development workflow, and scikit-learn's learning_curve function produces it in one call.
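A minimal sketch of the one-call version, on synthetic data (the dataset and model are illustrative; the `learning_curve` call is the scikit-learn API the text refers to):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Training-set sizes to evaluate, as fractions of the available training data.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  val={va:.3f}")
```

Plotting the two mean curves against `sizes`, and reading the gap and the asymptote, gives the diagnosis described above.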
The classical bias–variance story predicts that test error first decreases and then increases as you make a model more flexible, crossing a minimum at some intermediate complexity. In the deep-learning era this is not what happens. Belkin et al.'s 2019 Reconciling modern machine-learning practice and the classical bias–variance trade-off showed that test error can follow a double-descent curve: it decreases, rises near the point where the model exactly interpolates the training data, and then decreases again as the model grows beyond that interpolation threshold. The effect is real, has been observed in modern neural networks, and is one of several reasons that classical intuitions about capacity control need updating for very large models. For the classical tabular-ML methods in this chapter, the U-shape remains the right mental model.
The simplest honest evaluation protocol — split the data once into a training set and a test set, fit the model on the training set, measure loss on the test set — is also the right one for a surprising range of problems. This section gives the version you should actually use.
A two-way split — train + test — gives an unbiased estimate of generalisation error when you evaluate exactly once, at the end. But during development you need to choose between candidate models, candidate feature sets, and candidate hyperparameters, and every time you look at a test-set number you consume a little bit of that test set's independence. After fifty hyperparameter trials evaluated on the same test set, the final number is no longer an unbiased estimate of generalisation error; it is a number you have, in effect, optimised for. The standard defence is a three-way split: training data to fit the model, validation data to compare candidate configurations, and test data held strictly in reserve and evaluated once, at the very end, on the single configuration you picked. Typical proportions: 60/20/20 or 70/15/15 depending on dataset size.
Random splitting of a classification dataset produces splits whose class frequencies differ from the population by a random amount. On a 99/1 imbalanced problem, the 1% minority class can be almost entirely absent from a 20% test set by pure chance, making the evaluation useless. Stratified splitting fixes this by partitioning each class independently and recombining, guaranteeing that each split preserves the population class frequencies to within one example per class. Stratification should be the default for every classification problem; train_test_split(..., stratify=y) in scikit-learn. For regression, the analogous operation is stratifying by binned target quantiles.
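A stratified three-way split is just two stratified two-way splits in sequence. A sketch on a synthetic 90/10 problem (dataset and proportions are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)

# First carve off the test set, then split the remainder into train/validation.
# Stratifying both splits keeps the class ratio in every partition.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)  # 0.25 * 0.8 = 0.2

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, positive rate={part.mean():.3f}")
```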
Holdout validation is appropriate when the dataset is large enough that the holdout set gives a low-variance estimate of generalisation error — say, tens of thousands of examples and up, depending on the base rate and metric. With a very large dataset, a single fixed test set gives you a perfectly adequate evaluation and avoids the complexity of cross-validation. With a very small dataset — a few hundred examples — the holdout set's variance is so large that the resulting number is almost uninformative, and cross-validation (or bootstrap) becomes necessary. The typical rule of thumb: holdout for more than ~10,000 examples; k-fold for less; leave-one-out for the extreme small-data regime.
Simple random splitting assumes the rows are independent. They usually aren't. If the dataset is ordered in time and the model will predict the future, random splitting puts future rows into the training set and past rows into the test set — which inflates performance estimates because the future leaks backward into training. If the dataset has a group structure (multiple transactions per customer, multiple images per patient), splitting at the row level puts the same entity in both training and test sets, and the model memorises per-entity patterns rather than learning generalisable structure. The fix in both cases is the same: split at the right granularity. Temporal data splits by time (everything before date T to train, everything after to test); grouped data splits by group ID (GroupShuffleSplit in scikit-learn).
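The grouped case can be checked directly: after a `GroupShuffleSplit`, no group ID should appear on both sides of the boundary. A sketch on synthetic data (100 hypothetical customers with ~10 rows each):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 1000
groups = rng.integers(0, 100, n)   # e.g. 100 customers, ~10 rows each
X = rng.normal(size=(n, 5))
y = rng.integers(0, 2, n)

# Split so that every row of a given customer lands on one side of the boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

overlap = set(groups[train_idx]) & set(groups[test_idx])
print("groups shared between train and test:", len(overlap))
```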
When the dataset is small enough that a single holdout split gives a high-variance estimate of generalisation error, cross-validation averages over multiple splits to reduce that variance. K-fold cross-validation is the default protocol for the middle-data regime that most real-world tabular ML lives in.
Partition the data into k folds of approximately equal size. For each fold i, train the model on the union of the other k − 1 folds and evaluate on fold i. Average the k fold-level loss values to get the CV estimate of generalisation error. Typical choices of k are 5 and 10; larger k reduces the bias of the CV estimate (each training set is closer to the full dataset) but increases variance and compute cost. Kohavi's 1995 IJCAI paper A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection remains the canonical empirical comparison; its recommendation of 10-fold stratified CV as the default has held up remarkably well.
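The recommended default, 10-fold stratified CV, in one sketch (the breast-cancer dataset is a stand-in; note that preprocessing lives inside the pipeline so it is refit on each training fold, a point Section 16 returns to):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline is refit on each training fold, never on the test fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"10-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```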
Stratified k-fold preserves class frequencies within each fold, the same way stratified holdout does; use it for classification by default. Grouped k-fold (GroupKFold) splits on a group ID, guaranteeing that all rows for a given group end up in the same fold — the correct choice when your data has entity structure. Repeated k-fold runs k-fold multiple times with different random partitions and averages across runs, giving a lower-variance estimate at a multiplicative compute cost. Combining the three yields variants like stratified grouped repeated k-fold, which is a mouthful but is sometimes exactly the right answer: preserve class frequencies, keep groups intact, and average out partition variance.
At the k = n extreme, leave-one-out cross-validation (LOOCV) uses each single example as a one-element test fold. For n examples this means n model fits, which is expensive — but for linear models (ridge regression, linear discriminant analysis) there are closed-form formulas that compute the LOOCV error from a single fit, making it cheap in exactly the cases where it is statistically best motivated. LOOCV has very low bias but very high variance (every held-out fold contains a single, possibly-atypical example). Leave-p-out generalises to holding out every subset of size p; computationally infeasible for any realistic p, it is mostly of theoretical interest. The practical advice is: use k-fold with k = 5 or 10 in the middle-data regime, use LOOCV only when you have closed-form shortcut formulas or a very small dataset.
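The closed-form shortcut for ridge regression can be verified against the brute-force loop. For a linear smoother ŷ = Hy with H = X(XᵀX + λI)⁻¹Xᵀ, the leave-one-out residual is (yᵢ − ŷᵢ)/(1 − hᵢᵢ), exactly. A sketch on synthetic data (no intercept, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 60, 5, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(0, 0.5, n)

# Closed form: e_i^loo = (y_i - yhat_i) / (1 - h_ii),
# where H = X (X^T X + lam I)^{-1} X^T is the smoother ("hat") matrix.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - H @ y
loo_mse_closed = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: actually refit n times, leaving one example out each time.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
    errs.append((y[i] - X[i] @ beta) ** 2)
loo_mse_brute = np.mean(errs)

print(loo_mse_closed, loo_mse_brute)  # identical up to floating-point error
```

One fit instead of n: this is why LOOCV is cheap for exactly the linear-model family where it is best motivated.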
A subtle but important point: the k-fold CV estimate is not an unbiased estimate of the generalisation error of the model you trained on the full dataset. It is an estimate of the expected generalisation error of a model trained on a random (k − 1)/k-sized subsample from the underlying distribution. For k = 10 this is close to the full-model error but slightly pessimistic. For k = 2 it can differ materially. The theoretical story is in Bengio & Grandvalet's 2004 No Unbiased Estimator of the Variance of K-Fold Cross-Validation; the practical consequence is that CV error estimates should be read as point estimates with non-trivial standard error, not as population parameters.
When you use cross-validation both to choose a hyperparameter and to estimate generalisation error, the single-loop protocol systematically overestimates performance. Nested cross-validation separates the two roles with two loops, and is the only fully honest protocol for simultaneous tuning and evaluation on a single dataset.
A tempting but wrong protocol: run k-fold CV over a grid of hyperparameter values, pick the hyperparameter with the best mean CV score, and report that mean score as your generalisation-error estimate. The reported score is biased upward because the hyperparameter was chosen specifically to maximise it — you have optimised the CV score, and the CV score is no longer an unbiased estimate of anything. With large hyperparameter grids and small datasets, the bias can be many percentage points. Varma & Simon's 2006 Bias in Error Estimation When Using Cross-Validation for Model Selection documented this experimentally on cancer-genomics benchmarks and made nested CV the recommended standard for the small-n-many-features regime.
Two loops. The outer loop is a k-fold CV that partitions the data into k train/test splits; its job is to estimate generalisation error, and it must not see the test fold during any model-selection work. For each outer fold, take the outer training set and run a full model-selection protocol on it: this is the inner loop, typically a k′-fold CV over the hyperparameter grid, which returns the best hyperparameter choice for that outer training set. Fit the model with that hyperparameter on the outer training set, evaluate on the outer test fold, and record the score. Average the k outer scores. The reported number is an unbiased estimate of the generalisation error of the model-selection procedure — not of any one hyperparameter choice, but of the full pipeline "tune on inner CV, then refit".
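In scikit-learn the two loops compose directly: a `GridSearchCV` (the inner loop) is itself an estimator, so passing it to `cross_val_score` (the outer loop) yields nested CV in a few lines. A sketch, with dataset and hyperparameter grid chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 5-fold grid search over C. Outer loop: 5-fold estimate of the
# generalisation error of the whole tune-then-refit procedure.
inner = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")),
    param_grid={"svc__C": [0.1, 1, 10, 100]},
    cv=StratifiedKFold(5, shuffle=True, random_state=1))
outer_scores = cross_val_score(
    inner, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))

print(f"nested-CV accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

The reported mean is the honest estimate; `inner.best_params_` on each outer fold may differ, which is expected — the thing being evaluated is the selection procedure, not a single hyperparameter value.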
Nested CV with k = 5 outer and k′ = 5 inner folds, over a grid of 50 hyperparameters, trains 5 × 5 × 50 = 1,250 models. This is expensive but not prohibitive with scikit-learn and modest compute. On very small datasets (hundreds of examples) where you have many hyperparameters to tune, nested CV is approximately mandatory for any claim to be taken seriously. On large datasets where a single holdout split already gives a reliable test-set estimate, nested CV is overkill and a plain train / validate / test split is more practical.
Nothing in the nested protocol requires the inner loop to be grid search. It is increasingly common to plug in Bayesian hyperparameter-optimisation tools (Optuna, Hyperopt, scikit-optimize) as the inner selector — same nested-CV guarantees, dramatically better sample efficiency when the hyperparameter space is large. Section 13 returns to hyperparameter search in more detail.
Standard cross-validation randomly permutes rows, which assumes they are exchangeable. Time-ordered data is not. Every k-fold CV on a temporal dataset silently leaks future information into the training folds, and the resulting error estimates are optimistic, often badly enough to be misleading. Use one of the protocols below instead.
Sort the data by time. Choose a sequence of cutoff times t₁ < t₂ < … < tₖ. For each cutoff tᵢ, train on all data with timestamp less than tᵢ and evaluate on the data in the window (tᵢ, tᵢ₊₁]. Advance the cutoff by one window and repeat. The result is a sequence of train-on-past, test-on-future evaluations that reflect how the model would actually be used in production. This is rolling origin or forward-chaining CV; scikit-learn's TimeSeriesSplit implements it in its default form.
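A sketch of the default `TimeSeriesSplit` behaviour on 24 time-ordered observations (the data is a placeholder; only the index structure matters here):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(24)  # 24 time-ordered observations, e.g. monthly data

# Each split trains on everything before the cutoff and tests on the next window.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(t):
    print(f"train {train_idx.min():2d}-{train_idx.max():2d}  "
          f"test {test_idx.min():2d}-{test_idx.max():2d}")
    assert train_idx.max() < test_idx.min()  # the past never sees the future
```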
Two variants. In the expanding-window version, each training set is all data up to tᵢ, so training sets grow over time — the right choice when more data always helps and the underlying process is approximately stationary. In the sliding-window version, each training set is a fixed-length window ending at tᵢ — the right choice when the process is non-stationary and older data is actively misleading (regime-shift markets, fashion recommendation, social-media content). Which to use is an empirical question; try both and see which gives better out-of-sample error on the final held-out window.
Even with time-ordered splits, features computed from windows that straddle the train/test boundary can leak. If your target for row t is built from events in (t − h, t + h), then rows in the training set with timestamp in (tᵢ − h, tᵢ) share information with rows in the test set with timestamp in (tᵢ, tᵢ + h). The fix is purging: remove training examples whose label-generation window overlaps any test-set timestamp. The related trick is embargo: additionally drop training examples in a buffer immediately before the test window, preventing any residual correlation from contaminating the estimate. Marcos López de Prado's Advances in Financial Machine Learning (2018) is the canonical reference; the techniques are essential in any financial-ML setting and useful in many others.
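Purging and embargo are simple enough to sketch directly. The helper below is an illustrative invention, not a library function; it assumes integer timestamps and a label window of fixed length `label_horizon` starting at each example's timestamp:

```python
import numpy as np

def purged_train_indices(times, test_start, test_end, label_horizon, embargo):
    """Indices usable for training, given a test window [test_start, test_end].

    Drops (i) examples whose label window [t, t + label_horizon] overlaps the
    test window ('purging'), and (ii) examples inside the embargo buffer just
    before the test window. Illustrative helper, not a library function.
    """
    times = np.asarray(times)
    overlaps = (times + label_horizon >= test_start) & (times <= test_end)
    embargoed = (times >= test_start - embargo) & (times < test_start)
    return np.where(~overlaps & ~embargoed)[0]

times = np.arange(100)
train_idx = purged_train_indices(times, test_start=60, test_end=80,
                                 label_horizon=2, embargo=3)
# No surviving training example's label window reaches into the test window.
assert all((s + 2 < 60) or (s > 80) for s in times[train_idx])
print(f"kept {len(train_idx)} of {len(times)} examples")
```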
With time-series CV, the quantity being estimated is the error of predicting one step into the future at the time of the final window. This is almost always what you want, but it is explicitly not the error averaged over all future time — a model's expected error may degrade as you try to predict further into the future, and the rolling-origin estimate captures a single horizon. For multi-horizon forecasting, run separate rolling-origin CVs at each horizon of interest.
Every regression metric is an implicit loss function, and the choice of metric is a choice about which errors you want to penalise how heavily. There are perhaps a dozen metrics in common use; this section gives the map.
Mean squared error — MSE — is (1/n) Σ (yᵢ − ŷᵢ)². It is the loss implicitly minimised by OLS, and it penalises large errors quadratically: an error of 10 is 100× worse than an error of 1. Root mean squared error — RMSE — is √MSE, which has the useful property of being measured in the same units as the target. RMSE is the single most common regression metric in machine-learning practice and is the right default when errors of all sizes are approximately equally important and you have no outliers.
Mean absolute error — MAE — is (1/n) Σ |yᵢ − ŷᵢ|. It is the loss minimised by the conditional median rather than the conditional mean, and it penalises errors linearly: an error of 10 is 10× worse than an error of 1, not 100×. MAE is more robust to outliers than MSE; use it when a small number of very-large errors should not dominate the score. Huber loss (Huber 1964) is a hybrid — quadratic near zero and linear for errors above a threshold δ — that gives you OLS-like efficiency on clean data and MAE-like robustness on outliers. Quantile loss (a.k.a. pinball loss) generalises MAE to any quantile τ ∈ (0, 1), and is the loss of choice for prediction intervals and quantile regression.
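The behaviour of these losses in the presence of an outlier can be seen in a few lines. The `huber` and `pinball` helpers below are hand-rolled for transparency (scikit-learn has equivalents), and the numbers are illustrative:

```python
import numpy as np

def huber(err, delta=1.0):
    # Quadratic for |err| <= delta, linear beyond: robust to outliers.
    a = np.abs(err)
    return np.where(a <= delta, 0.5 * err**2, delta * (a - 0.5 * delta))

def pinball(y, yhat, tau):
    # Quantile (pinball) loss: asymmetric absolute error, minimised by
    # the tau-th conditional quantile. At tau = 0.5 it is half the MAE.
    err = y - yhat
    return np.mean(np.maximum(tau * err, (tau - 1) * err))

y    = np.array([1.0, 2.0, 3.0, 100.0])   # one large outlier
yhat = np.array([1.1, 1.9, 3.2,   4.0])

err = y - yhat
print("MSE  :", np.mean(err**2))        # dominated by the outlier
print("MAE  :", np.mean(np.abs(err)))   # outlier contributes linearly
print("Huber:", np.mean(huber(err)))    # quadratic centre, linear tails
print("P90  :", pinball(y, yhat, 0.9))  # penalises under-prediction 9x more
```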
RMSE's units-dependence is a problem when targets span many orders of magnitude. Mean absolute percentage error — MAPE — divides each error by the actual value, producing a unit-free score: (100/n) Σ |yᵢ − ŷᵢ|/|yᵢ|. Its well-known pathology is that it explodes when yᵢ is near zero and is not symmetric in over- vs under-prediction. Symmetric MAPE (sMAPE) fixes the asymmetry but introduces a different set of edge cases. The mean squared logarithmic error — MSLE — computes MSE on log-transformed targets, working well for positive skewed targets where proportional errors are what matter.
The coefficient of determination is R² = 1 − SSres/SStot, the fraction of target variance the model explains. It ranges from 1 (perfect) down through 0 (no better than predicting the mean) to negative infinity (arbitrarily bad on the test set — which yes, does happen). R² is the right metric for reporting explanatory power on a natural scale, and the wrong metric if targets span multiple regimes (because one high-variance regime can dominate the denominator). Adjusted R² corrects for the upward bias that R² has when you add more features: adding a useless feature cannot decrease R², but it can decrease adjusted R². Report both when comparing models of different dimensionality.
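The effect of useless features on the two statistics can be demonstrated directly. The `adjusted_r2` helper implements the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1); the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 200
X_useful = rng.normal(size=(n, 3))
y = X_useful @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, n)

def adjusted_r2(r2, n, p):
    # Penalises R^2 for the number of fitted features p.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

results = {}
for extra in (0, 20):   # add 0 or 20 pure-noise features
    X = np.hstack([X_useful, rng.normal(size=(n, extra))])
    p = X.shape[1]
    r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
    results[p] = (r2, adjusted_r2(r2, n, p))
    print(f"p={p:2d}  R2={r2:.4f}  adj R2={results[p][1]:.4f}")
```

In-sample R² can only go up as noise features are added; adjusted R² discounts the gain.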
Classification metrics all start from the confusion matrix: the 2 × 2 table (for binary classification) of actual-vs-predicted counts — true positives, false positives, false negatives, true negatives. Every standard classification metric is a function of these four numbers, and the usefulness of the metric depends on which trade-offs the function exposes.
Accuracy is (TP + TN)/(TP + FP + FN + TN) — the fraction of predictions that are correct. On balanced problems it is a reasonable default. On imbalanced problems it is catastrophically misleading: a 99/1 dataset is trivially 99% accurate by predicting "negative" every time. For any problem where base rates differ materially from 50/50, accuracy should be viewed with suspicion and replaced with or supplemented by other metrics. Balanced accuracy — the mean of per-class recall — fixes the asymmetry by reweighting classes to equal importance and is a reasonable drop-in replacement for accuracy on imbalanced problems.
Precision is TP/(TP + FP) — of the examples the classifier predicted positive, what fraction actually are. Recall (a.k.a. sensitivity, true-positive rate) is TP/(TP + FN) — of the examples that actually are positive, what fraction the classifier caught. They trade off: raise the decision threshold and precision goes up while recall goes down. F1 score is their harmonic mean, 2 · P · R/(P + R), giving a single number when both matter; Fβ generalises to any relative weighting of recall-over-precision. For an information-retrieval problem where false positives and false negatives have different costs, precision/recall/F1 is usually the right family; for a diagnostic problem with a gold-standard base rate, sensitivity and specificity (TN/(TN + FP)) are the epidemiological equivalents.
The Matthews correlation coefficient — MCC — is (TP · TN − FP · FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It ranges from −1 to +1, equals 0 for random predictions regardless of class balance, and uses all four cells of the confusion matrix symmetrically. Chicco & Jurman's 2020 The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation argued it should be the default single-number binary-classification metric. It is not as widely used as F1 but has a good theoretical case on its side.
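The 99/1 always-predict-negative pathology from Section 8's accuracy discussion makes the contrast concrete: accuracy looks excellent while every base-rate-aware metric correctly reports failure. A sketch:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

# 990 negatives, 10 positives; classifier predicts "negative" every time.
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print("accuracy         :", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1               :", f1_score(y_true, y_pred, zero_division=0))
print("MCC              :", matthews_corrcoef(y_true, y_pred))
```

Accuracy is 0.99; balanced accuracy is 0.5 (random), and F1 and MCC are both 0.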
Cohen's kappa corrects accuracy for chance agreement, producing a score that is 1 for perfect agreement, 0 for random, and can be negative for systematically-disagreeing predictions. It is the standard metric when two annotators are compared and is sometimes used as a classifier metric in the same spirit as MCC — as a base-rate-insensitive alternative to accuracy.
Extend any binary metric to multiclass problems by computing it per-class and averaging. Macro-averaging takes the unweighted mean across classes, giving equal weight to every class regardless of frequency — the right choice when every class matters equally. Micro-averaging pools the confusion-matrix counts across classes before computing the metric, giving equal weight to every example — for single-label multiclass problems micro-F1 is numerically identical to accuracy, so on imbalanced problems it inherits accuracy's blindness to rare classes. Weighted-averaging weights each class's score by its support. In scikit-learn, classification_report prints all three variants side-by-side; the right one to report depends on the problem, and "which average are we using?" should be answered explicitly in every comparison.
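A minimal sketch of how far micro and macro can disagree, on an illustrative three-class problem where the two minority classes are always misclassified:

```python
import numpy as np
from sklearn.metrics import f1_score

# Three classes with very different support: 90, 9, 1 examples.
y_true = np.array([0] * 90 + [1] * 9 + [2] * 1)
y_pred = y_true.copy()
y_pred[90:] = 0          # every minority example misclassified as class 0

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```

Micro-F1 is 0.90 (it equals accuracy here), while macro-F1 is roughly 0.32 because the two minority classes each contribute an F1 of zero.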
Many classifiers do not output a hard decision but a probability. Metrics like accuracy and F1 collapse those probabilities to 0/1 predictions and throw away the fine-grained information they contain. Proper scoring rules evaluate the probabilities directly, and are the right metric whenever the downstream system uses the probability — for thresholding, for expected-value decision-making, or for calibration.
Log loss, a.k.a. negative log-likelihood, a.k.a. cross-entropy, is −(1/n) Σ [yᵢ log p̂ᵢ + (1 − yᵢ) log(1 − p̂ᵢ)] for binary classification, and its natural multiclass generalisation. It penalises confident wrong predictions extremely heavily (log loss diverges as p̂ → 0 on a positive example) and rewards well-calibrated probabilities. Log loss is the strictly proper scoring rule that is minimised in expectation only when your predicted probability equals the true probability — which is exactly the property you want in a probability evaluation. Use log loss whenever you care about the quality of the probability estimates, not just the rank ordering.
The Brier score is (1/n) Σ (yᵢ − p̂ᵢ)² — the MSE between the predicted probability and the 0/1 label. Also a strictly proper scoring rule; unlike log loss, it is bounded (in [0, 1]) and does not diverge on confident-wrong predictions. Brier scores are particularly nice for reliability diagrams because they decompose additively into a calibration component and a resolution component (the Murphy decomposition), allowing you to separately diagnose "the probabilities are systematically biased" from "the probabilities don't discriminate between classes".
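The difference in how the two rules punish confident-wrong predictions shows up immediately on a toy example (the probability vectors are illustrative):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y = np.array([1, 1, 0, 0])
cautious  = np.array([0.70, 0.70, 0.30, 0.30])
confident = np.array([0.99, 0.01, 0.01, 0.01])  # second prediction is confidently wrong

for name, p in [("cautious", cautious), ("confident-wrong", confident)]:
    print(f"{name:16s} log loss={log_loss(y, p):.3f}  "
          f"Brier={brier_score_loss(y, p):.3f}")
```

One confidently wrong prediction drags the log loss far above the cautious forecaster's; the Brier score also rises, but stays bounded.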
The continuous ranked probability score — CRPS — generalises the Brier score from binary classification to continuous-valued probabilistic forecasting: it compares a predicted cumulative distribution function to the observed value. CRPS is the standard probabilistic-forecasting metric in meteorology, hydrology, and the adjacent disciplines that invented most of this material, and is increasingly used in probabilistic ML forecasting libraries (GluonTS, NeuralProphet). For point forecasts CRPS reduces to MAE; for a Gaussian predictive distribution it has a closed form.
A subtle practical consequence: AUC (Section 10) is a rank-only metric. Two classifiers with identical AUC can have arbitrarily different calibration — one may output probabilities like p̂ ∈ {0.49, 0.51} while the other outputs p̂ ∈ {0.05, 0.95}, with both preserving the same ordering. If you threshold at 0.5, the two models produce identical hard predictions; if you use the probability as a decision input (e.g. expected-value arithmetic for a business decision), they produce wildly different downstream outcomes. Always check log loss or Brier alongside AUC when probabilities are actually used.
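The rank-only nature of AUC is easy to demonstrate with four hand-picked predictions (a toy sketch, assuming scikit-learn's metric APIs):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y = np.array([0, 0, 1, 1])
p_timid = np.array([0.49, 0.49, 0.51, 0.51])  # barely separated probabilities
p_sharp = np.array([0.05, 0.05, 0.95, 0.95])  # same ordering, sharp probabilities

# Identical (perfect) ranking, hence identical AUC...
assert roc_auc_score(y, p_timid) == roc_auc_score(y, p_sharp) == 1.0
# ...but very different proper scores: here the sharp model's probabilities are better.
assert log_loss(y, p_sharp) < log_loss(y, p_timid)
```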
A classifier's threshold is a modelling choice — raise it and you get fewer, higher-precision positive predictions; lower it and you get more, higher-recall ones. Threshold-free evaluation sweeps the threshold across all possible values and summarises the trade-off as a curve or as the area under that curve.
The receiver operating characteristic plots true-positive rate (recall) on the y-axis against false-positive rate (1 − specificity) on the x-axis as the decision threshold varies. A random classifier traces the diagonal; a perfect classifier traces the upper-left corner. The area under the ROC curve — ROC-AUC — summarises the whole curve as a single number in [0, 1], interpretable as the probability that the classifier ranks a randomly chosen positive above a randomly chosen negative. Hanley & McNeil's 1982 The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve is the foundational reference; ROC-AUC is the standard discrimination metric in medical statistics and fraud detection.
The precision–recall curve plots precision against recall as the threshold varies. Unlike ROC, PR is strongly sensitive to class imbalance: on a 99/1 problem the ROC curve can look excellent while the PR curve is terrible, because ROC hides the many false positives the classifier must accept to identify a few true positives. Average precision — AP — summarises the PR curve as a single number: the mean of the precisions achieved at each threshold, weighted by the increase in recall from the previous threshold. Saito & Rehmsmeier's 2015 The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets is the canonical argument for using PR instead of ROC on imbalanced problems.
Rule of thumb: ROC-AUC when class balance is approximately equal and false positives and false negatives matter symmetrically. PR-AUC when the positive class is rare and you specifically care about the quality of the top-ranked predictions (search relevance, fraud detection, medical screening). In most realistic industrial problems — where the positive class is a few percent of the data and business value lives in the top of the ranked list — PR-AUC is the right choice. Report both when uncertain; they are cheap to compute.
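The ROC-looks-great-while-PR-suffers effect can be reproduced with a simulated 1%-positive problem (synthetic Gaussian scores chosen for illustration; exact numbers will vary with the seed):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 9_900, 100                                # a 1% positive class
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg),    # negatives
                         rng.normal(2.0, 1.0, n_pos)])   # positives score higher on average

roc_auc = roc_auc_score(y, scores)
avg_prec = average_precision_score(y, scores)

# ROC-AUC looks excellent; average precision exposes the false positives
# the classifier must accept among 9,900 negatives to find the 100 positives.
assert roc_auc > 0.85
assert avg_prec < roc_auc
```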
A curve is for comparison; a threshold is for shipping. Once you have chosen a classifier, you must pick a threshold — the operating point on the curve you will actually deploy. The principled way is to write down the costs: if a false positive costs cfp and a false negative costs cfn, the threshold on a calibrated probability that minimises expected cost is p̂* = cfp / (cfp + cfn); the base rate is already baked into the calibrated probability, so it does not appear in the threshold. In practice teams often pick by a target precision or a target recall — "we will operate at 90% precision and accept whatever recall that gives us" — which is a cost assignment in disguise. Document the choice.
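The cost arithmetic can be written out directly. A minimal sketch; `cost_threshold` is an illustrative helper, not a library function, and it assumes the classifier's probability is calibrated:

```python
def cost_threshold(c_fp: float, c_fn: float) -> float:
    """Expected-cost-minimising probability threshold for a calibrated classifier.

    Expected cost of flagging positive at probability p: (1 - p) * c_fp.
    Expected cost of flagging negative:                   p * c_fn.
    Flag positive whenever p exceeds the indifference point below.
    """
    return c_fp / (c_fp + c_fn)

# A false negative nine times as costly as a false positive -> threshold 0.1:
assert abs(cost_threshold(1.0, 9.0) - 0.1) < 1e-12
# Symmetric costs recover the familiar 0.5 default.
assert cost_threshold(5.0, 5.0) == 0.5
```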
The standard test for comparing two correlated AUCs on the same test set is DeLong's (DeLong, DeLong & Clarke-Pearson 1988), implemented in R's pROC package and reachable in Python through the Mann–Whitney U machinery in scipy.stats. Reporting "classifier A beats classifier B with AUC 0.86 vs 0.85" without a statistical test is no different from reporting a point estimate without uncertainty — a number that may or may not be real.
A classifier is calibrated when its predicted probabilities correspond to empirical frequencies — of the examples where it predicts 0.8, 80% really are positive. Classifiers with high AUC are often badly miscalibrated, and miscalibrated probabilities break any downstream system that uses them for decision-making.
The standard calibration diagnostic: bin predictions by probability (say, into deciles), compute the mean predicted probability and the mean empirical label rate in each bin, and plot them against each other. A perfectly calibrated classifier traces the diagonal y = x. A classifier that is systematically overconfident has its curve below the diagonal — it predicts 80% but achieves only 65%. The reliability diagram is the visual version of the expected calibration error (ECE): the binned-average absolute gap between predicted and empirical. Naeini, Cooper & Hauskrecht's 2015 Obtaining Well Calibrated Probabilities Using Bayesian Binning is the standard reference for ECE; implementations in the netcal Python package and in every serious calibration audit.
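The binned-gap computation behind ECE fits in a few lines. A sketch with equal-width bins on made-up predictions; `expected_calibration_error` is an illustrative helper, not the netcal API:

```python
import numpy as np

def expected_calibration_error(y, p, n_bins=10):
    """Support-weighted mean |mean predicted prob - empirical positive rate| per bin."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
p_overconfident = np.full(10, 0.95)   # predicts 95% everywhere; only 60% are positive

assert expected_calibration_error(y, y.astype(float)) < 1e-12   # perfect calibration
assert expected_calibration_error(y, p_overconfident) > 0.3     # a 35-point gap
```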
Consider a classifier that maps every true positive to p̂ = 0.51 and every true negative to p̂ = 0.49. Its AUC is 1.0 (perfect ranking) and its accuracy at threshold 0.5 is 100%. But its predicted probabilities are all near 0.5, so a downstream expected-value calculation will treat every prediction as maximally uncertain. Conversely, a classifier that maps positives to 0.8 and negatives to 0.1 but occasionally confuses them has lower AUC but more usable probabilities. Rank-preserving monotonic transformations leave AUC unchanged while changing calibration arbitrarily — which means fixing calibration is essentially always possible as a post-hoc step without disturbing the rank-based metrics.
Platt's 1999 Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods introduced the simplest calibration procedure: fit a logistic regression p = σ(A · s + B) mapping model score s to calibrated probability p, learning A and B from a held-out calibration set. Platt scaling is parametric and assumes a sigmoidal shape; it works remarkably well for SVMs and boosted trees, and is the default one-liner when calibration is badly off. CalibratedClassifierCV in scikit-learn implements it directly.
For classifiers where Platt's sigmoidal assumption does not fit, isotonic regression gives a non-parametric monotonic calibration curve — the best piecewise-constant monotonic function mapping scores to probabilities under squared loss. Isotonic is more flexible than Platt and more data-hungry (it can overfit with less than a few thousand calibration examples), so the choice is an empirical one: try both on a validation set, pick whichever gives lower log loss. Zadrozny & Elkan's 2002 Transforming Classifier Scores into Accurate Multiclass Probability Estimates is the definitive paper; Niculescu-Mizil & Caruana's 2005 Predicting Good Probabilities with Supervised Learning is the empirical comparison that made both techniques mainstream in ML.
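A sketch of the try-both protocol using scikit-learn's CalibratedClassifierCV on synthetic data. Naive Bayes is used here because it is a classically overconfident base model; which calibration method wins, and by how much, will vary with the data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3_000, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "raw": GaussianNB().fit(X_tr, y_tr),
    "platt": CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_tr, y_tr),
    "isotonic": CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_tr, y_tr),
}
losses = {name: log_loss(y_te, m.predict_proba(X_te)[:, 1]) for name, m in models.items()}

# Pick whichever calibration method gives the lower held-out log loss;
# on this synthetic data, calibrating the overconfident model should help.
best = min(("platt", "isotonic"), key=losses.get)
assert losses[best] <= losses["raw"]
```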
Fraud is rare. Disease is rare. Churn is not as rare as either but is often well below 50%. Imbalanced classification problems — where one class is substantially rarer than another — are the norm in industrial ML, and every part of the evaluation pipeline handles them poorly by default.
Three things, in increasing order of subtlety. First, accuracy becomes a useless metric (Section 8): predict the majority class and you win. Second, the training objective of most classifiers weighs each example equally, so the gradient is dominated by the majority class and the minority class is under-fit. Third, the ROC curve can look good (ROC-AUC averages across the whole threshold range, where the many true negatives dominate) while the PR curve shows the classifier is near-useless at identifying positives (Section 10). The first is an evaluation-side problem; the second and third interact with training.
The simplest training-side fix is to rebalance the classes. Random oversampling duplicates minority-class examples; simple but prone to overfitting on the duplicated points. Random undersampling discards majority-class examples; simple but throws away information. SMOTE (Chawla et al. 2002, Synthetic Minority Over-sampling Technique) generates synthetic minority examples by interpolating between a real minority example and one of its k-nearest-neighbour minority examples, producing new training points without duplication. The imbalanced-learn library implements SMOTE and a family of related variants (Borderline-SMOTE, ADASYN, SMOTEENN).
The cleaner fix, where the model supports it, is class weighting: multiply the loss for each class by a weight inversely proportional to the class frequency, so that the optimiser treats the classes equally. Most scikit-learn classifiers accept class_weight="balanced" as a one-line activation. More generally, cost-sensitive learning assigns different costs to each type of error (FP vs FN) rather than each class, which is strictly more expressive: a false negative on a cancer screening is worse than a false positive, and cost-sensitive learning lets you say so directly in the loss. Elkan's 2001 The Foundations of Cost-Sensitive Learning is the canonical reference.
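A minimal sketch of the one-line activation on a synthetic ~5%-positive problem (exact scores depend on the seed; the direction of the recall shift is the point):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 95% negative, 5% positive.
X, y = make_classification(n_samples=4_000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1_000, class_weight="balanced").fit(X_tr, y_tr)

# Weighting the rare class up moves the operating point toward higher recall
# (at some cost in precision) without any resampling of the training data.
assert recall_score(y_te, balanced.predict(X_te)) >= recall_score(y_te, plain.predict(X_te))
```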
An underappreciated fact: for any classifier that outputs well-calibrated probabilities, the class-imbalance problem at prediction time reduces to choosing the right threshold. You do not need to rebalance the training data — you need to threshold at the Bayes-optimal point given the class frequencies and costs (Section 10). Training on the natural class distribution and then thresholding is often the simplest, best-calibrated, and most reproducible fix. Sampling-based fixes (SMOTE, oversampling) distort the base rate in the training data, which can itself hurt calibration; use them when they empirically help, not as a reflex.
Hyperparameter tuning is just model selection in disguise: each hyperparameter value is a candidate model, and selecting the best hyperparameter is choosing among candidate models by their cross-validated score. The algorithms in this section are search strategies over a hyperparameter space, and differ mostly in how they trade off sample efficiency against implementation simplicity.
Exhaustively evaluate every combination on a discrete grid of hyperparameter values. Simple, reproducible, embarrassingly parallel, and exponentially expensive in the number of hyperparameters. Appropriate when you have at most two or three hyperparameters each with five or so levels; useless when the search space has more than about a hundred configurations unless compute is free. GridSearchCV in scikit-learn.
Sample hyperparameter configurations uniformly (or from a specified prior) and evaluate each with cross-validation. Bergstra & Bengio's 2012 Random Search for Hyper-Parameter Optimization showed that for almost any hyperparameter space where only a subset of parameters actually matter (which is the overwhelmingly common case), random search outperforms grid search dramatically at the same compute budget: random search explores many more values of the handful of important hyperparameters because it does not waste trials on combinations of unimportant ones. Random search should be the default for any search with more than two hyperparameters. RandomizedSearchCV in scikit-learn.
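A minimal RandomizedSearchCV sketch on synthetic data, sampling the regularisation strength from a log-uniform prior (the search space and trial budget here are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1_000),
    param_distributions={"C": loguniform(1e-3, 1e3)},  # sample C on a log scale
    n_iter=20,          # 20 sampled configurations, each scored by 5-fold CV
    cv=5,
    random_state=0,
).fit(X, y)

assert 1e-3 <= search.best_params_["C"] <= 1e3
assert 0.0 <= search.best_score_ <= 1.0
```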
Both grid and random search ignore the evidence from past trials. Bayesian optimisation fits a probabilistic model (usually a Gaussian process or a tree Parzen estimator) to the observed trial scores and uses it to choose the next trial that maximises an acquisition function — expected improvement, upper confidence bound, or Thompson sampling. The result is dramatically better sample efficiency on hyperparameter spaces with tens of dimensions. Snoek, Larochelle & Adams's 2012 Practical Bayesian Optimization of Machine Learning Algorithms was the paper that brought BO into mainstream ML practice; it is now the engine underneath Optuna (TPE), Hyperopt (TPE), scikit-optimize (GP), and the tuning features of most major cloud ML platforms.
A complementary family based on early stopping: evaluate many configurations at a small compute budget, kill the worst, double the budget, evaluate the survivors, and repeat. Jamieson & Talwalkar's 2016 Non-stochastic Best Arm Identification and Hyperparameter Optimization formalised the idea; Li et al.'s 2017 Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization extended it to handle the budget-vs-number-of-configs trade-off automatically. Hyperband and its Bayesian-optimisation refinement BOHB (Falkner et al. 2018) are the state of the art for expensive-to-train models where the budget-aware aspect makes a large practical difference; Optuna's ASHA pruner implements Asynchronous Successive Halving for distributed tuning.
Every tuning method above slots into the inner loop of nested cross-validation (Section 5). The outer loop gives honest generalisation-error estimates; the inner loop chooses hyperparameters. Whether the inner loop is grid search, random search, or Bayesian optimisation is an efficiency decision, not a statistical one. Report the outer-loop number as your test error; report the chosen hyperparameters (from a final inner-CV run on the full dataset) as what the shipped model uses.
You have two models with similar CV scores. Is the difference real, or could it be explained by the random variation inherent in the cross-validation protocol? The machinery of statistical hypothesis testing and confidence intervals gives principled answers — and using it should be the default, not an afterthought.
For two classifiers evaluated on the same test set, the simplest comparison is McNemar's test: form the 2 × 2 table of example-level agreement / disagreement — examples both classifiers got right, both got wrong, A-only right, B-only right — and test the null hypothesis that the two off-diagonal counts are equal. The test is exact, uses the same-test-set paired structure explicitly, and does not require any assumption about the test set's distribution. Dietterich's 1998 Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms reviewed five popular comparison tests and concluded that McNemar, when applicable, is the most powerful and best-behaved.
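The exact version of McNemar's test is a one-liner once the disagreement cells are counted. A sketch with hypothetical counts (the 30 and 14 are invented for illustration):

```python
from scipy.stats import binomtest

# Hypothetical disagreement counts for two classifiers on one shared test set:
n_a_only = 30   # examples A got right and B got wrong
n_b_only = 14   # examples B got right and A got wrong
# (The two agreement cells of the 2x2 table do not enter the test.)

# Exact McNemar: under H0, the split between the two off-diagonal cells
# is Binomial(n_a_only + n_b_only, 0.5).
p_value = binomtest(n_a_only, n_a_only + n_b_only, 0.5).pvalue

assert 0.0 < p_value < 0.05   # 30-vs-14 is lopsided enough to reject at the 5% level
```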
For comparing two methods via k-fold CV, a natural approach is the paired t-test on the per-fold score differences. The approach works in expectation but is known to have inflated Type I error because the per-fold score differences are not independent (the folds share training data). Nadeau & Bengio's 2003 Inference for the Generalization Error proposed a corrected resampled t-test that adjusts the variance estimate for the train/test overlap; it gives much better Type I error control and is the recommended replacement. For repeated cross-validation, the correction involves a term n₁/n₂ where n₁ is the size of the test fold and n₂ the size of the training fold.
A further refinement: run 5 iterations of 2-fold CV (2-fold so that training and test folds are the same size, making the variance estimate more tractable), compute the paired score difference on each of the 10 folds, and form a test statistic that Dietterich 1998 showed has better-controlled Type I error than the plain paired t-test. The 5×2 CV test is the gold standard for small-data comparisons in the classical-ML literature; it is implemented in mlxtend.
Rather than a hypothesis test, you can report a bootstrap confidence interval around the score difference: resample the test set n times with replacement, compute the score difference on each resample, and take the 2.5%/97.5% percentiles. The bootstrap approach, due to Efron 1979, is non-parametric, does not require assuming a specific variance structure, and produces a number that is directly interpretable as "with 95% confidence, A is better than B by somewhere between 0.3% and 1.2%". Report bootstrap intervals in any paper or report where a difference-of-metrics claim is being made.
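A sketch of the paired bootstrap on simulated per-example correctness indicators (the 0.86 and 0.82 accuracies are invented; in practice you resample the real test-set predictions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
# Simulated per-example 0/1 correctness for two classifiers on one test set.
correct_a = rng.random(n) < 0.86
correct_b = rng.random(n) < 0.82

# Resample test examples with replacement; recompute the paired accuracy gap.
diffs = [correct_a[idx].mean() - correct_b[idx].mean()
         for idx in (rng.integers(0, n, n) for _ in range(2_000))]
lo, hi = np.percentile(diffs, [2.5, 97.5])
point = correct_a.mean() - correct_b.mean()

assert lo <= point <= hi       # the point estimate sits inside its own interval
assert hi - lo < 0.10          # a few percentage points wide at n = 2,000
```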
The evaluation outputs of the preceding sections — CV scores, learning curves, training-vs-test gaps — are diagnostic inputs, not just reporting outputs. Reading them tells you which of two canonical failure modes your model is in, and what to do next.
Underfitting is high bias. Symptoms: both training error and test error are high, and they are close to each other. The learning curve flattens well above the irreducible-noise floor. The model is not capable of capturing the signal in the data. The remedies, in rough order of effort: add more features, add interaction or polynomial features, switch to a more flexible model (linear → tree → gradient boosting → neural network), reduce regularisation. Adding more training data will not help — the learning curve has already flattened, which means more data does not change the answer.
Overfitting is high variance. Symptoms: training error is much lower than test error, with a large persistent gap between the two curves. The model memorises training-set specifics that do not generalise. The remedies, in rough order of effort: add more training data (the classical fix — more data shrinks the variance of the estimator), add regularisation (L1, L2, dropout, early stopping), reduce model capacity (shallower trees, fewer features, simpler model class), use cross-validation more aggressively in tuning, ensemble (bagging specifically targets variance reduction). Adding more features usually makes overfitting worse, not better — the opposite of underfitting.
Plot training and cross-validated test error as a function of training-set size. If both curves have converged close together and well above the noise floor, you are underfit and more data won't help. If they have converged close together and close to the noise floor, you are well-fit and there is not much left to optimise. If they are still separating as data grows, you are overfit and more data will close the gap. The learning curve is the single most useful diagnostic plot in a modelling workflow and should be run at the start of every serious model-development effort, not at the end.
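scikit-learn's learning_curve produces the two arrays behind the plot directly. A sketch on synthetic data; the interesting part is the shape of the train and validation means, not the absolute scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1_500, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1_000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),   # 10% ... 100% of the training pool
    cv=5,
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

assert train_scores.shape == (5, 5)      # 5 training sizes x 5 CV folds
assert train_mean[0] >= val_mean[0]      # small-data regime: train above validation
```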
For any model with a tunable complexity parameter — ridge λ, lasso λ, tree max-depth, XGBoost n_estimators, neural-network training epochs — plot validation error as a function of that parameter and you see the classical U-curve: error decreases as the model becomes more capable, reaches a minimum, and rises again as overfitting takes over. The minimum is where you want to operate. The whole point of the training process is to find it. Cross-validation-with-grid-search is the standard procedure; early-stopping-on-validation-loss is the same thing done online during a gradient-descent fit.
Data leakage — the use, during training, of information that would not be available at prediction time — is the single most common cause of over-optimistic evaluation and the resulting production failures. This section catalogues the forms and prescribes the defences.
The most destructive form. A feature that is a proxy for the target, either because it is the target under another name, because it is computed from data generated after the label, or because it simply is the label, produces a model that scores spectacularly on evaluation and cannot possibly work in production. Kaufman et al.'s 2012 Leakage in Data Mining catalogued dozens of real-world examples from the KDD Cup and industry post-mortems; the single recurring pattern is that the leaky feature was "obviously" informative and the modeller did not stop to ask whether it would be available at serving time. The defence is procedural: before training, write down for every feature the exact time at which it becomes known, and verify that it is strictly before the label time.
Nearly as common and much more subtle. A standardisation that uses the full dataset's mean and variance, a mean-imputation that uses the full dataset's mean, a PCA fitted on the full dataset, a target encoding computed on the full dataset — each of these silently passes information from the test fold back into training. The correct protocol is always the same: fit transformations only on the training fold, then transform the test fold using the fitted state. The scikit-learn Pipeline object exists specifically to make this easy to do right; the most common failure mode is a notebook that preprocesses the whole dataframe before splitting, which can inflate CV scores by several percentage points without any visible red flag.
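The correct protocol is one line with a Pipeline, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, random_state=0)

# Right: the scaler is refit on each training fold inside the CV loop,
# so no test-fold statistics ever reach the model.
# Wrong (not shown): StandardScaler().fit_transform(X) on the whole
# dataframe before splitting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
scores = cross_val_score(pipe, X, y, cv=5)

assert scores.shape == (5,)
assert all(0.0 <= s <= 1.0 for s in scores)
```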
When rows share entities — multiple transactions per customer, multiple images per patient, multiple sessions per user — random row-level splitting places the same entity on both sides of the train/test divide, and the model memorises entity-specific patterns rather than generalisable ones. The resulting CV score reflects memorisation more than generalisation. The defence is to split at the group level: GroupKFold, GroupShuffleSplit, or a bespoke group-aware splitter. The ubiquity of this failure mode is underappreciated: any dataset with a customer ID, patient ID, user ID, or device ID is a candidate for group leakage, and the naive random-row split is wrong by default.
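GroupKFold's guarantee is easy to verify directly. A toy sketch with four invented entity IDs standing in for customer IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # e.g. customer IDs

n_folds = 0
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # No entity ever appears on both sides of the split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    n_folds += 1

assert n_folds == 4
```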
A close cousin of group leakage but with time as the grouping variable. A model evaluated on past data when it will predict the future has seen future-correlated features during training; its CV error is optimistic and its deployed error is much worse. Section 6 covered the time-series-CV protocols that defend against this. The slogan: if rows have timestamps, use time-aware splitting. If unsure, plot a histogram of training-fold timestamps next to test-fold timestamps and look for overlap that should not be there.
Tooling helps with only some of these defences. A Pipeline wrapped in a cross_val_score makes the preprocessing defence correct by construction; the feature-availability audit against target leakage and the discipline of choosing group- or time-aware splitters are process concerns, not tooling ones.
The preceding sections describe the statistical machinery of evaluation. The practitioner's layer adds the procedural discipline that makes evaluation honest in a team, across a project lifecycle, and after a model ships.
Every model report should contain the score of at least two baselines: a trivial baseline (predicting the majority class, predicting the mean, predicting zero) and a simple baseline (logistic regression, a decision stump, a one-nearest-neighbour, last-value-for-forecasting). If your elaborate neural network does not beat a logistic regression by a comfortable margin, you do not have a working neural network — you have a logistic regression with extra steps. Running the simple baseline takes ten minutes and protects against weeks of work on a methodology that does not outperform the obvious. Baselines are also the reference point of the relative-improvement claim, which is the only claim anyone should be making: "AUC improved from 0.82 to 0.86" means something; "AUC is 0.86" by itself does not.
A single aggregate metric hides systematic error patterns. A classifier with 90% overall accuracy can have 99% accuracy on the majority subgroup and 60% on the minority subgroup. The metric that matters for fairness, operational deployment, or real-world utility is often not the aggregate but the slice-level score. Slice the test set by demographic group, by input length, by time-of-day, by data source, by any axis the downstream user will care about, and report the metric on each slice. Barocas, Hardt & Narayanan's Fairness and Machine Learning (2019) and Google's What-If Tool documentation are good entry points; the fairlearn Python package implements the standard slice-based metrics directly.
A model's offline CV score and its online production performance can differ dramatically, even in the absence of bugs. Causes include distribution shift (online traffic differs from the training distribution), feedback effects (the model's own predictions change user behaviour in ways that re-enter the training data), serving latency constraints that force approximations, and simple implementation drift between the training and serving code paths. The industry-standard response is online A/B testing: ship the candidate model to a fraction of traffic, measure production metrics against the incumbent, and promote or revert based on the online outcome. Google's Overlapping experiment infrastructure (Tang et al. 2010) and Kohavi et al.'s Controlled Experiments on the Web are the standard references; Trustworthy Online Controlled Experiments (Kohavi, Tang & Xu 2020) is the book-length version.
Evaluation does not end at deployment. Production systems drift: input distributions shift, label distributions shift, upstream data pipelines change, ground-truth sources get revised. A deployed model should have continuous monitoring — prediction-distribution histograms, feature-distribution histograms, online performance metrics, calibration tracking — with alerts that fire when any of them drifts beyond a threshold. The MLOps layer (Evidently AI, WhyLabs, Arize, cloud-native monitoring) is the current standard tooling. Every ML system that survives past initial launch does so because someone is watching the monitoring dashboards.
The evaluation protocols in this chapter are the product of forty years of statistical-machine-learning experience on supervised tabular problems. In the deep-learning era and the foundation-model era, the classical toolkit still applies — but the problems being evaluated have stretched the toolkit's assumptions, and a set of new evaluation concerns has emerged alongside the old ones.
Most of this chapter assumes you can train your model tens of times to run cross-validation. For deep-learning models that take days or weeks to train, that is not feasible; the field has largely fallen back on a single fixed holdout split per benchmark. The result is an evaluation culture with high variance across random seeds, systematic over-fitting to the public test sets (ImageNet, GLUE, SuperGLUE), and long-running debates about how much a 0.3% improvement means. The reproducibility crisis in ML that Pineau et al.'s 2021 Improving Reproducibility in Machine Learning Research documented is largely a consequence of this — not of fraud but of inherently high-variance evaluation done at single-seed scale. The fix, where affordable, is multi-seed training and reporting distributions rather than points.
Fixed test sets, used for many years by many teams, become effectively training data through sheer exposure. A model that has been tuned for five years to do well on ImageNet's test set is overfitted to that test set in ways that no single submission-time evaluation will catch. Recht et al.'s 2019 Do ImageNet Classifiers Generalize to ImageNet? quantified the effect by collecting a new independent ImageNet-scale test set and observing systematic accuracy drops on all major architectures. The modern response is to rotate benchmarks (Dynabench, BIG-Bench), to evaluate on held-out distributions (out-of-distribution generalisation), and to treat headline benchmark numbers with appropriate skepticism.
Evaluating a large language model is not a classical-ML evaluation problem. Outputs are open-ended text, not labels; metrics based on string matching (BLEU, ROUGE) capture only a sliver of quality; human evaluation is expensive and noisy; and the models' few-shot flexibility means that "train" and "test" are not cleanly separable. The modern LLM evaluation toolkit — HELM (Liang et al. 2022), BIG-Bench, MMLU, human-preference ranking (ELO, pairwise judgments), LLM-as-a-judge, LMSYS Arena — is still actively evolving, and every element of it has known methodological problems. Classical evaluation principles still apply (calibration, imbalanced-class protocols, statistical-significance testing) but the underlying protocols have to be reinvented for each new class of task.
A parallel hard case. Evaluating a learned policy offline — without running it in the live environment — is counterfactual evaluation: what would the policy have done, had it been in control, when the data was generated by a different policy? The statistical machinery (inverse-propensity scoring, doubly-robust estimators, off-policy policy evaluation) is adapted from causal inference and has non-trivial variance characteristics. See Dudik et al.'s 2014 Doubly Robust Policy Evaluation and Optimization and the broader counterfactual-evaluation literature.
Evaluation has a bibliography more scattered than any other subfield of classical ML: the foundational papers are spread across statistics, biostatistics, meteorology, information retrieval, and machine learning, and no single textbook covers the full territory. The references below split into anchor textbooks and survey chapters, foundational papers (from Stone's 1974 cross-validation paper through Efron's bootstrap and Dietterich's comparison tests to the proper-scoring-rules literature), modern extensions that carry classical evaluation into deep learning and beyond, and the software where everyone actually does the work. If you only read one chapter of one book, read ESL Chapter 7.
DeLong's test for comparing correlated AUCs (DeLong, DeLong & Clarke-Pearson 1988) is implemented in R's pROC package and in several Python libraries; it is the standard tool for comparing classifier discrimination in the biostatistics literature and should be the standard tool in ML too. Pair with Hanley & McNeil's 1982 The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve for the foundational ROC reference.

scikit-learn's CalibratedClassifierCV wraps both Platt scaling and isotonic regression; which of Platt or isotonic wins is an empirical question decided on a held-out validation set. Pair with Kull, Silva Filho & Flach's 2017 Beta calibration for a third option that splits the difference.

sklearn.model_selection implements every splitter (KFold, StratifiedKFold, GroupKFold, TimeSeriesSplit, LeaveOneGroupOut), search procedure (GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV), and CV driver (cross_val_score, cross_validate, learning_curve) in the chapter. sklearn.metrics implements every regression and classification metric in Sections 7–10, plus calibration_curve, brier_score_loss, and log_loss. The User Guide's Cross-validation, Tuning the hyper-parameters of an estimator, and Metrics and scoring pages together cover the practical ground.

Optuna's define-by-run suggest_int / suggest_float / suggest_categorical API integrates naturally with PyTorch, TensorFlow, scikit-learn, XGBoost, and LightGBM training code. The Key Features documentation page is the best quick-start; the visualization submodule produces the hyperparameter-importance and parallel-coordinates plots that make tuning decisions legible to humans.

imbalanced-learn supplies SMOTE and its relatives together with a Pipeline variant that correctly places sampling inside the CV loop rather than outside. For any serious work on imbalanced datasets this is the first library to reach for.

For calibration tooling, CalibratedClassifierCV covers Platt and isotonic; netcal adds everything else. Pair with uncertainty-toolbox (for regression-calibration evaluation) and Fortuna (AWS's broader uncertainty-quantification library) for the adjacent calibration-beyond-classification territory.

This page is Chapter 09 of Part IV: Classical Machine Learning, and the closing chapter of Part IV. The nine chapters together form a coherent arc: supervised regression and classification as the two canonical problems; ensembles, clustering, and dimensionality reduction as the extensions in three directions; probabilistic graphical models and kernel methods as the two great mathematical frameworks; feature engineering as the engineering connective tissue; and evaluation and selection — this chapter — as the measurement discipline that makes any claim about any of them meaningful. Part V opens the Deep Learning Foundations arc, which re-derives most of the problems of classical ML in a setting where the learned features, the optimisation, and the generalisation behaviour all work differently — but where the measurement discipline in the chapter you have just read remains the anchor you keep coming back to.