The eight chapters before this one have produced a catalogue of ways to build a model: linear regression, logistic regression, trees and ensembles, k-means, GMMs, PCA, Bayesian networks, SVMs, engineered features, regularised coefficients, fifty-odd named algorithms in all. None of them told us how to decide whether the model we built is any good, or how to choose between two models that both look good. That is what this chapter is for. Evaluation is the quiet, unglamorous, high-stakes measurement layer underneath everything else — the part of classical machine learning that turns a fitting procedure into an honest claim. Get it wrong and every decision downstream is built on sand: the model that seemed to beat the baseline actually didn't; the hyperparameter search that chose XGBoost over logistic regression was fooled by a single lucky fold; the classifier that ships into production collapses because accuracy was the wrong summary statistic for an imbalanced problem. Get it right and everything else clicks into place. Evaluation is a coupled problem: the protocol (how you split the data), the metric (what you measure), and the uncertainty quantification (how sure you are) all have to be chosen together, because any one of them done wrong can mask the other two. This chapter walks through each layer in turn — splits, cross-validation, regression and classification metrics, probabilistic scoring, calibration, imbalanced-class problems, hyperparameter tuning as a model-selection procedure, statistical comparison of models, the overfitting/underfitting diagnosis, and the leakage failure modes that silently inflate every metric on this list if you are not careful.
Section one motivates the whole enterprise: why honest evaluation is the single most consequential discipline in classical machine learning, and why almost every failure mode in shipped ML systems is an evaluation failure in disguise. Section two introduces the generalisation-error framework — training error, validation error, test error, the bias–variance decomposition, and the learning curve as the shape-of-fit diagnostic. Section three covers holdout validation — the simple train/val/test split that is the right answer for large-enough datasets and the wrong answer for everything else — and why the three-way split is the minimum honest setup. Section four is cross-validation: k-fold, stratified, leave-one-out, leave-p-out, and the grouped variants that prevent identity leakage when rows share entities. Section five is nested cross-validation — the two-loop protocol that lets you tune hyperparameters and estimate test error honestly from the same dataset, and why "just report the inner-loop score" is the single most common evaluation cheat. Section six is time-series cross-validation: rolling origin, expanding window, blocked and purged CV for autocorrelated data, and the reason standard k-fold silently leaks the future when your rows are ordered in time.
Sections seven through thirteen are about what to measure. Section seven is regression metrics — MSE, RMSE, MAE, MAPE, Huber, quantile loss, R², adjusted R² — and the mapping between metric choice and implicit loss function. Section eight is classification metrics — accuracy, precision, recall, F1, Matthews correlation, balanced accuracy, Cohen's kappa — and the confusion matrix as the object every classification metric reduces to. Section nine is probabilistic scoring: log loss, Brier score, and the broader theory of proper scoring rules that ties evaluation back to density estimation. Section ten is ROC and precision–recall curves — the threshold-free view, ROC-AUC versus PR-AUC, when each is appropriate, and the DeLong test for AUC comparison. Section eleven is calibration: reliability diagrams, the expected-calibration-error metric, Platt scaling, isotonic regression, and why a classifier with high AUC can still be badly miscalibrated. Section twelve is imbalanced classification: stratification, resampling (SMOTE and friends), class weights, cost-sensitive learning, and the PR-curve-plus-calibration protocol that handles rare-positive problems correctly. Section thirteen covers hyperparameter tuning as the model-selection procedure it actually is — grid search, random search, Bayesian optimisation, successive halving and Hyperband, and the tuning budgets that make the results reproducible.
Sections fourteen through eighteen are the measurement-discipline topics that separate good evaluation from bad. Section fourteen is statistical comparison of models: McNemar's test, the paired t-test and its Nadeau–Bengio corrected-resampled cousin, the 5×2 cross-validation test, and the bootstrap confidence intervals that every serious paper should report alongside point estimates. Section fifteen is the overfitting / underfitting diagnosis — high training error, high test error, the bias–variance trade-off, learning curves read as diagnostic instruments, and the regularisation and capacity-control levers that each failure mode indicates. Section sixteen is the safety-critical topic: data leakage and split integrity. Target leakage, preprocessing leakage, group leakage, temporal leakage, and the single unified prescription — fit every transformation only on the training fold — that eliminates most of it. Section seventeen is the operational layer: evaluation in practice, with reporting, baseline comparisons, slice-based analysis, offline-vs-online discrepancies, and the MLOps integration that keeps evaluation honest after a model ships. Section eighteen places classical evaluation inside the broader modern-ML landscape: the way deep learning has stretched every assumption in this chapter, the reproducibility crisis in ML benchmarks, and the emerging concerns (offline-RL evaluation, LLM benchmarks, foundation-model evaluation) that classical protocols were never designed to handle.
Every modelling choice you make — which algorithm to use, which features to engineer, which hyperparameters to pick, whether to ship the model at all — is a decision made on the basis of some measurement. If the measurement is wrong, the decision is wrong. Evaluation is the single most consequential discipline in classical machine learning precisely because everything downstream of it inherits its errors.
Training a reasonable model is, on most problems, not very hard. Scikit-learn's LogisticRegression, fed features and labels, will give you something that runs. What separates a professional from an amateur is not whether the model works — it is whether the practitioner can tell honestly how well it works. A model that scores 94% accuracy in the notebook and 62% accuracy in production is a model whose evaluation was wrong. A team that cannot tell which of two candidate models is better will ship the worse one half the time. A paper that reports a 0.3% improvement over the baseline without confidence intervals has reported nothing. The asymmetry is brutal: good evaluation is invisible when it works, catastrophic when it fails, and the failure mode is usually that the practitioner did not realise they had a problem until a system was already deployed.
An evaluation is a joint choice of three things, and getting any of them wrong invalidates the other two. The protocol is how the data is split — holdout, k-fold, stratified, grouped, time-ordered — and it determines what question the evaluation is actually answering. The metric is what you measure on each split — accuracy, log loss, F1, AUC, RMSE — and it determines which dimension of model quality the answer is about. The uncertainty quantification is how you summarise variability across splits — the standard error of the CV estimate, a bootstrap confidence interval, a statistical test against a baseline — and it determines whether the difference you observed is real or an artefact of the particular random seed. This chapter works through all three and, in the practitioner-facing sections near the end, the way they interact.
An evaluation is honest when the number it produces is a good estimate of the quantity you would observe if you deployed the model on data drawn from the same distribution it was evaluated on. This is harder than it sounds. Every form of information leakage — from using the test set to choose hyperparameters, to standardising features using the full dataset's mean and variance, to splitting on rows when rows share entities — inflates the measurement above its true value. Every form of distribution mismatch — from evaluating on historical data when the model will serve live traffic, to testing on an urban sample when the model will serve rural users — breaks the "same distribution" assumption that makes evaluation meaningful at all. The job of this chapter is to name these failure modes, give them definitions precise enough to test for, and prescribe the protocols that defend against each one.
The quantity evaluation is trying to estimate is generalisation error — the expected loss of the model on a new draw from the same distribution the training data came from. Understanding why this quantity is difficult to estimate, and why it differs from training error, requires the bias–variance decomposition.
Fit a model on a dataset and the training error — the loss computed on the same data you trained on — will usually understate the generalisation error. The gap is the optimism of training error: a sufficiently flexible model can fit the training labels perfectly while learning nothing at all about the underlying distribution, making training error zero and generalisation error arbitrarily bad. A held-out test set gives an unbiased estimate of generalisation error precisely because the model has not seen it. The fundamental lesson of twentieth-century statistical learning theory is that this gap is not a bug but a structural feature of any finite-sample learning problem, and that bounding it requires explicit assumptions about the model's capacity relative to the amount of data.
For squared-error regression with true regression function f(x), fitted estimator f̂(x), and noise variance σ², the expected test error at a point x decomposes exactly as E[(y − f̂(x))²] = σ² + Bias[f̂(x)]² + Var[f̂(x)]. The three terms have distinct interpretations. Irreducible noise σ² is the part of the target that no model can predict; it depends on the data, not on the estimator. Bias measures how far the average prediction (over repeated training sets) is from the truth; high bias means the model is systematically wrong because it is not flexible enough to capture the true relationship. Variance measures how much the prediction changes as the training set changes; high variance means the model is too flexible and fits noise. The decomposition makes precise the intuition that there is a sweet spot of model complexity — too simple and bias dominates, too complex and variance dominates, with generalisation error minimised somewhere in between.
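The decomposition can be checked empirically. The sketch below (illustrative; `true_f` and the helper `bias_variance_at` are inventions for this example, not library functions) refits polynomials of increasing degree on many independent synthetic training sets and measures bias² and variance at a single evaluation point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def bias_variance_at(degree, x0=0.25, n_train=30, n_reps=500, sigma=0.3):
    # Refit a degree-d polynomial on many independent training sets and
    # measure, at the single point x0, how far the average prediction is
    # from the truth (bias^2) and how much predictions scatter (variance).
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - true_f(x0)) ** 2
    return bias2, preds.var()

for degree in (1, 3, 9):
    b2, var = bias_variance_at(degree)
    # expected test error at x0 = sigma^2 + bias^2 + variance
    print(f"degree {degree}: bias^2={b2:.4f}  variance={var:.4f}  "
          f"total={0.3**2 + b2 + var:.4f}")
```

The rigid degree-1 fit shows large bias² and small variance; the degree-9 fit reverses the pattern, exactly the trade-off the decomposition predicts.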
Plot training error and cross-validated test error as a function of training-set size (or of model complexity, or of training epochs) and the shape of the resulting curves tells you which regime you are in. High bias / underfitting: both errors converge to a value well above the irreducible-noise floor as data grows, and the two curves are close together — more data will not help; you need a more flexible model. High variance / overfitting: training error is low, test error is high, and the gap between them closes slowly as data grows — more data will help, and so will regularisation or simpler models. Well-fit: training error and test error converge close together and close to the noise floor. The learning-curve diagnosis is the single most useful plot in a model-development workflow, and scikit-learn's learning_curve function produces it in one call.
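A minimal sketch of the one-call version, on synthetic data (the dataset and model are illustrative; the `learning_curve` call is the scikit-learn API the text refers to):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Training-set sizes to evaluate, as fractions of the available training data.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  val={va:.3f}")
```

Plotting the two mean curves against `sizes`, and reading the gap and the asymptote, gives the diagnosis described above.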
The classical bias–variance story predicts that test error first decreases and then increases as you make a model more flexible, crossing a minimum at some intermediate complexity. In the deep-learning era this is not what happens. Belkin et al.'s 2019 Reconciling modern machine-learning practice and the classical bias–variance trade-off showed that test error can follow a double-descent curve: it decreases, rises near the point where the model exactly interpolates the training data, and then decreases again as the model grows beyond that interpolation threshold. The effect is real, has been observed in modern neural networks, and is one of several reasons that classical intuitions about capacity control need updating for very large models. For the classical tabular-ML methods in this chapter, the U-shape remains the right mental model.
The simplest honest evaluation protocol — split the data once into a training set and a test set, fit the model on the training set, measure loss on the test set — is also the right one for a surprising range of problems. This section gives the version you should actually use.
A two-way split — train + test — gives an unbiased estimate of generalisation error when you evaluate exactly once, at the end. But during development you need to choose between candidate models, candidate feature sets, and candidate hyperparameters, and every time you look at a test-set number you consume a little bit of that test set's independence. After fifty hyperparameter trials evaluated on the same test set, the final number is no longer an unbiased estimate of generalisation error; it is a number you have, in effect, optimised for. The standard defence is a three-way split: training data to fit the model, validation data to compare candidate configurations, and test data held strictly in reserve and evaluated once, at the very end, on the single configuration you picked. Typical proportions: 60/20/20 or 70/15/15 depending on dataset size.
Random splitting of a classification dataset produces splits whose class frequencies differ from the population by a random amount. On a 99/1 imbalanced problem, the 1% minority class can be almost entirely absent from a 20% test set by pure chance, making the evaluation useless. Stratified splitting fixes this by partitioning each class independently and recombining, guaranteeing that each split preserves the population class frequencies to within one example per class. Stratification should be the default for every classification problem; train_test_split(..., stratify=y) in scikit-learn. For regression, the analogous operation is stratifying by binned target quantiles.
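A stratified three-way split is just two stratified two-way splits in sequence. A sketch on a synthetic 90/10 problem (dataset and proportions are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)

# First carve off the test set, then split the remainder into train/validation.
# Stratifying both splits keeps the class ratio in every partition.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)  # 0.25 * 0.8 = 0.2

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, positive rate={part.mean():.3f}")
```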
Holdout validation is appropriate when the dataset is large enough that the holdout set gives a low-variance estimate of generalisation error — say, tens of thousands of examples and up, depending on the base rate and metric. With a very large dataset, a single fixed test set gives you a perfectly adequate evaluation and avoids the complexity of cross-validation. With a very small dataset — a few hundred examples — the holdout set's variance is so large that the resulting number is almost uninformative, and cross-validation (or bootstrap) becomes necessary. The typical rule of thumb: holdout for more than ~10,000 examples; k-fold for less; leave-one-out for the extreme small-data regime.
Simple random splitting assumes the rows are independent. They usually aren't. If the dataset is ordered in time and the model will predict the future, random splitting puts future rows into the training set and past rows into the test set — which inflates performance estimates because the future leaks backward into training. If the dataset has a group structure (multiple transactions per customer, multiple images per patient), splitting at the row level puts the same entity in both training and test sets, and the model memorises per-entity patterns rather than learning generalisable structure. The fix in both cases is the same: split at the right granularity. Temporal data splits by time (everything before date T to train, everything after to test); grouped data splits by group ID (GroupShuffleSplit in scikit-learn).
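The grouped case can be checked directly: after a `GroupShuffleSplit`, no group ID should appear on both sides of the boundary. A sketch on synthetic data (100 hypothetical customers with ~10 rows each):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 1000
groups = rng.integers(0, 100, n)   # e.g. 100 customers, ~10 rows each
X = rng.normal(size=(n, 5))
y = rng.integers(0, 2, n)

# Split so that every row of a given customer lands on one side of the boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

overlap = set(groups[train_idx]) & set(groups[test_idx])
print("groups shared between train and test:", len(overlap))
```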
When the dataset is small enough that a single holdout split gives a high-variance estimate of generalisation error, cross-validation averages over multiple splits to reduce that variance. K-fold cross-validation is the default protocol for the middle-data regime that most real-world tabular ML lives in.
Partition the data into k folds of approximately equal size. For each fold i, train the model on the union of the other k − 1 folds and evaluate on fold i. Average the k fold-level loss values to get the CV estimate of generalisation error. Typical choices of k are 5 and 10; larger k reduces the bias of the CV estimate (each training set is closer to the full dataset) but increases variance and compute cost. Kohavi's 1995 IJCAI paper A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection remains the canonical empirical comparison; its recommendation of 10-fold stratified CV as the default has held up remarkably well.
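The recommended default, 10-fold stratified CV, in one sketch (the breast-cancer dataset is a stand-in; note that preprocessing lives inside the pipeline so it is refit on each training fold, a point Section 16 returns to):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline is refit on each training fold, never on the test fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"10-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```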
Stratified k-fold preserves class frequencies within each fold, the same way stratified holdout does; use it for classification by default. Grouped k-fold (GroupKFold) splits on a group ID, guaranteeing that all rows for a given group end up in the same fold — the correct choice when your data has entity structure. Repeated k-fold runs k-fold multiple times with different random partitions and averages across runs, giving a lower-variance estimate at a multiplicative compute cost. Combining the three yields variants like stratified grouped repeated k-fold, which is a mouthful but is sometimes exactly the right answer: preserve class frequencies, keep groups intact, and average out partition variance.
At the k = n extreme, leave-one-out cross-validation (LOOCV) uses each single example as a one-element test fold. For n examples this means n model fits, which is expensive — but for linear models (ridge regression, linear discriminant analysis) there are closed-form formulas that compute the LOOCV error from a single fit, making it cheap in exactly the cases where it is statistically best motivated. LOOCV has very low bias but very high variance (every held-out fold contains a single, possibly-atypical example). Leave-p-out generalises to holding out every subset of size p; computationally infeasible for any realistic p, it is mostly of theoretical interest. The practical advice is: use k-fold with k = 5 or 10 in the middle-data regime, use LOOCV only when you have closed-form shortcut formulas or a very small dataset.
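The closed-form shortcut for ridge regression can be verified against the brute-force loop. For a linear smoother ŷ = Hy with H = X(XᵀX + λI)⁻¹Xᵀ, the leave-one-out residual is (yᵢ − ŷᵢ)/(1 − hᵢᵢ), exactly. A sketch on synthetic data (no intercept, for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 60, 5, 1.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(0, 0.5, n)

# Closed form: e_i^loo = (y_i - yhat_i) / (1 - h_ii),
# where H = X (X^T X + lam I)^{-1} X^T is the smoother ("hat") matrix.
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - H @ y
loo_mse_closed = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: actually refit n times, leaving one example out each time.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    Xi, yi = X[mask], y[mask]
    beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
    errs.append((y[i] - X[i] @ beta) ** 2)
loo_mse_brute = np.mean(errs)

print(loo_mse_closed, loo_mse_brute)  # identical up to floating-point error
```

One fit instead of n: this is why LOOCV is cheap for exactly the linear-model family where it is best motivated.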
A subtle but important point: the k-fold CV estimate is not an unbiased estimate of the generalisation error of the model you trained on the full dataset. It is an estimate of the expected generalisation error of a model trained on a random (k − 1)/k-sized subsample from the underlying distribution. For k = 10 this is close to the full-model error but slightly pessimistic. For k = 2 it can differ materially. The theoretical story is in Bengio & Grandvalet's 2004 No Unbiased Estimator of the Variance of K-Fold Cross-Validation; the practical consequence is that CV error estimates should be read as point estimates with non-trivial standard error, not as population parameters.
When you use cross-validation both to choose a hyperparameter and to estimate generalisation error, the single-loop protocol systematically overestimates performance. Nested cross-validation separates the two roles with two loops, and is the only fully honest protocol for simultaneous tuning and evaluation on a single dataset.
A tempting but wrong protocol: run k-fold CV over a grid of hyperparameter values, pick the hyperparameter with the best mean CV score, and report that mean score as your generalisation-error estimate. The reported score is biased upward because the hyperparameter was chosen specifically to maximise it — you have optimised the CV score, and the CV score is no longer an unbiased estimate of anything. With large hyperparameter grids and small datasets, the bias can be many percentage points. Varma & Simon's 2006 Bias in Error Estimation When Using Cross-Validation for Model Selection documented this experimentally on cancer-genomics benchmarks and made nested CV the recommended standard for the small-n-many-features regime.
Two loops. The outer loop is a k-fold CV that partitions the data into k train/test splits; its job is to estimate generalisation error, and it must not see the test fold during any model-selection work. For each outer fold, take the outer training set and run a full model-selection protocol on it: this is the inner loop, typically a k′-fold CV over the hyperparameter grid, which returns the best hyperparameter choice for that outer training set. Fit the model with that hyperparameter on the outer training set, evaluate on the outer test fold, and record the score. Average the k outer scores. The reported number is an unbiased estimate of the generalisation error of the model-selection procedure — not of any one hyperparameter choice, but of the full pipeline "tune on inner CV, then refit".
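In scikit-learn the two loops compose directly: a `GridSearchCV` (the inner loop) is itself an estimator, so passing it to `cross_val_score` (the outer loop) yields nested CV in a few lines. A sketch, with dataset and hyperparameter grid chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: 5-fold grid search over C. Outer loop: 5-fold estimate of the
# generalisation error of the whole tune-then-refit procedure.
inner = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")),
    param_grid={"svc__C": [0.1, 1, 10, 100]},
    cv=StratifiedKFold(5, shuffle=True, random_state=1))
outer_scores = cross_val_score(
    inner, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))

print(f"nested-CV accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

The reported mean is the honest estimate; `inner.best_params_` on each outer fold may differ, which is expected — the thing being evaluated is the selection procedure, not a single hyperparameter value.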
Nested CV with k = 5 outer and k′ = 5 inner folds, over a grid of 50 hyperparameters, trains 5 × 5 × 50 = 1,250 models. This is expensive but not prohibitive with scikit-learn and modest compute. On very small datasets (hundreds of examples) where you have many hyperparameters to tune, nested CV is approximately mandatory for any claim to be taken seriously. On large datasets where a single holdout split already gives a reliable test-set estimate, nested CV is overkill and a plain train / validate / test split is more practical.
Nothing in the nested protocol requires the inner loop to be grid search. It is increasingly common to plug in Bayesian hyperparameter-optimisation tools (Optuna, Hyperopt, scikit-optimize) as the inner selector — same nested-CV guarantees, dramatically better sample efficiency when the hyperparameter space is large. Section 13 returns to hyperparameter search in more detail.
Standard cross-validation randomly permutes rows, which assumes they are exchangeable. Time-ordered data is not. Every k-fold CV on a temporal dataset silently leaks future information into the training folds, and the resulting error estimates are optimistic, often badly enough to be misleading. Use one of the protocols below instead.
Sort the data by time. Choose a sequence of cutoff times t₁ < t₂ < … < tₖ. For each cutoff tᵢ, train on all data with timestamp less than tᵢ and evaluate on the data in the window (tᵢ, tᵢ₊₁]. Advance the cutoff by one window and repeat. The result is a sequence of train-on-past, test-on-future evaluations that reflect how the model would actually be used in production. This is rolling origin or forward-chaining CV; scikit-learn's TimeSeriesSplit implements it in its default form.
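A sketch of the default `TimeSeriesSplit` behaviour on 24 time-ordered observations (the data is a placeholder; only the index structure matters here):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(24)  # 24 time-ordered observations, e.g. monthly data

# Each split trains on everything before the cutoff and tests on the next window.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(t):
    print(f"train {train_idx.min():2d}-{train_idx.max():2d}  "
          f"test {test_idx.min():2d}-{test_idx.max():2d}")
    assert train_idx.max() < test_idx.min()  # the past never sees the future
```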
Two variants. In the expanding-window version, each training set is all data up to tᵢ, so training sets grow over time — the right choice when more data always helps and the underlying process is approximately stationary. In the sliding-window version, each training set is a fixed-length window ending at tᵢ — the right choice when the process is non-stationary and older data is actively misleading (regime-shift markets, fashion recommendation, social-media content). Which to use is an empirical question; try both and see which gives better out-of-sample error on the final held-out window.
Even with time-ordered splits, features computed from windows that straddle the train/test boundary can leak. If your target for row t is built from events in (t − h, t + h), then rows in the training set with timestamp in (tᵢ − h, tᵢ) share information with rows in the test set with timestamp in (tᵢ, tᵢ + h). The fix is purging: remove training examples whose label-generation window overlaps any test-set timestamp. The related trick is embargo: additionally drop training examples in a buffer immediately before the test window, preventing any residual correlation from contaminating the estimate. Marcos López de Prado's Advances in Financial Machine Learning (2018) is the canonical reference; the techniques are essential in any financial-ML setting and useful in many others.
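Purging and embargo are simple enough to sketch directly. The helper below is an illustrative invention, not a library function; it assumes integer timestamps and a label window of fixed length `label_horizon` starting at each example's timestamp:

```python
import numpy as np

def purged_train_indices(times, test_start, test_end, label_horizon, embargo):
    """Indices usable for training, given a test window [test_start, test_end].

    Drops (i) examples whose label window [t, t + label_horizon] overlaps the
    test window ('purging'), and (ii) examples inside the embargo buffer just
    before the test window. Illustrative helper, not a library function.
    """
    times = np.asarray(times)
    overlaps = (times + label_horizon >= test_start) & (times <= test_end)
    embargoed = (times >= test_start - embargo) & (times < test_start)
    return np.where(~overlaps & ~embargoed)[0]

times = np.arange(100)
train_idx = purged_train_indices(times, test_start=60, test_end=80,
                                 label_horizon=2, embargo=3)
# No surviving training example's label window reaches into the test window.
assert all((s + 2 < 60) or (s > 80) for s in times[train_idx])
print(f"kept {len(train_idx)} of {len(times)} examples")
```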
With time-series CV, the quantity being estimated is the error of predicting one step into the future at the time of the final window. This is almost always what you want, but it is explicitly not the error averaged over all future time — a model's expected error may degrade as you try to predict further into the future, and the rolling-origin estimate captures a single horizon. For multi-horizon forecasting, run separate rolling-origin CVs at each horizon of interest.
Every regression metric is an implicit loss function, and the choice of metric is a choice about which errors you want to penalise how heavily. There are perhaps a dozen metrics in common use; this section gives the map.
Mean squared error — MSE — is (1/n) Σ (yᵢ − ŷᵢ)². It is the loss implicitly minimised by OLS, and it penalises large errors quadratically: an error of 10 is 100× worse than an error of 1. Root mean squared error — RMSE — is √MSE, which has the useful property of being measured in the same units as the target. RMSE is the single most common regression metric in machine-learning practice and is the right default when errors of all sizes are approximately equally important and you have no outliers.
Mean absolute error — MAE — is (1/n) Σ |yᵢ − ŷᵢ|. It is the loss minimised by the conditional median rather than the conditional mean, and it penalises errors linearly: an error of 10 is 10× worse than an error of 1, not 100×. MAE is more robust to outliers than MSE; use it when a small number of very-large errors should not dominate the score. Huber loss (Huber 1964) is a hybrid — quadratic near zero and linear for errors above a threshold δ — that gives you OLS-like efficiency on clean data and MAE-like robustness on outliers. Quantile loss (a.k.a. pinball loss) generalises MAE to any quantile τ ∈ (0, 1), and is the loss of choice for prediction intervals and quantile regression.
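The behaviour of these losses in the presence of an outlier can be seen in a few lines. The `huber` and `pinball` helpers below are hand-rolled for transparency (scikit-learn has equivalents), and the numbers are illustrative:

```python
import numpy as np

def huber(err, delta=1.0):
    # Quadratic for |err| <= delta, linear beyond: robust to outliers.
    a = np.abs(err)
    return np.where(a <= delta, 0.5 * err**2, delta * (a - 0.5 * delta))

def pinball(y, yhat, tau):
    # Quantile (pinball) loss: asymmetric absolute error, minimised by
    # the tau-th conditional quantile. At tau = 0.5 it is half the MAE.
    err = y - yhat
    return np.mean(np.maximum(tau * err, (tau - 1) * err))

y    = np.array([1.0, 2.0, 3.0, 100.0])   # one large outlier
yhat = np.array([1.1, 1.9, 3.2,   4.0])

err = y - yhat
print("MSE  :", np.mean(err**2))        # dominated by the outlier
print("MAE  :", np.mean(np.abs(err)))   # outlier contributes linearly
print("Huber:", np.mean(huber(err)))    # quadratic centre, linear tails
print("P90  :", pinball(y, yhat, 0.9))  # penalises under-prediction 9x more
```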
RMSE's units-dependence is a problem when targets span many orders of magnitude. Mean absolute percentage error — MAPE — divides each error by the actual value, producing a unit-free score: (100/n) Σ |yᵢ − ŷᵢ|/|yᵢ|. Its well-known pathology is that it explodes when yᵢ is near zero and is not symmetric in over- vs under-prediction. Symmetric MAPE (sMAPE) fixes the asymmetry but introduces a different set of edge cases. The mean squared logarithmic error — MSLE — computes MSE on log-transformed targets, working well for positive skewed targets where proportional errors are what matter.
The coefficient of determination is R² = 1 − SSres/SStot, the fraction of target variance the model explains. It ranges from 1 (perfect) down through 0 (no better than predicting the mean) to negative infinity (arbitrarily bad on the test set — which yes, does happen). R² is the right metric for reporting explanatory power on a natural scale, and the wrong metric if targets span multiple regimes (because one high-variance regime can dominate the denominator). Adjusted R² corrects for the upward bias that R² has when you add more features: adding a useless feature cannot decrease R², but it can decrease adjusted R². Report both when comparing models of different dimensionality.
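The effect of useless features on the two statistics can be demonstrated directly. The `adjusted_r2` helper implements the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1); the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 200
X_useful = rng.normal(size=(n, 3))
y = X_useful @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, n)

def adjusted_r2(r2, n, p):
    # Penalises R^2 for the number of fitted features p.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

results = {}
for extra in (0, 20):   # add 0 or 20 pure-noise features
    X = np.hstack([X_useful, rng.normal(size=(n, extra))])
    p = X.shape[1]
    r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
    results[p] = (r2, adjusted_r2(r2, n, p))
    print(f"p={p:2d}  R2={r2:.4f}  adj R2={results[p][1]:.4f}")
```

In-sample R² can only go up as noise features are added; adjusted R² discounts the gain.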
Classification metrics all start from the confusion matrix: the 2 × 2 table (for binary classification) of actual-vs-predicted counts — true positives, false positives, false negatives, true negatives. Every standard classification metric is a function of these four numbers, and the usefulness of the metric depends on which trade-offs the function exposes.
Accuracy is (TP + TN)/(TP + FP + FN + TN) — the fraction of predictions that are correct. On balanced problems it is a reasonable default. On imbalanced problems it is catastrophically misleading: a 99/1 dataset is trivially 99% accurate by predicting "negative" every time. For any problem where base rates differ materially from 50/50, accuracy should be viewed with suspicion and replaced with or supplemented by other metrics. Balanced accuracy — the mean of per-class recall — fixes the asymmetry by reweighting classes to equal importance and is a reasonable drop-in replacement for accuracy on imbalanced problems.
Precision is TP/(TP + FP) — of the examples the classifier predicted positive, what fraction actually are. Recall (a.k.a. sensitivity, true-positive rate) is TP/(TP + FN) — of the examples that actually are positive, what fraction the classifier caught. They trade off: raise the decision threshold and precision goes up while recall goes down. F1 score is their harmonic mean, 2 · P · R/(P + R), giving a single number when both matter; Fβ generalises to any relative weighting of recall-over-precision. For an information-retrieval problem where false positives and false negatives have different costs, precision/recall/F1 is usually the right family; for a diagnostic problem with a gold-standard base rate, sensitivity and specificity (TN/(TN + FP)) are the epidemiological equivalents.
The Matthews correlation coefficient — MCC — is (TP · TN − FP · FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It ranges from −1 to +1, equals 0 for random predictions regardless of class balance, and uses all four cells of the confusion matrix symmetrically. Chicco & Jurman's 2020 The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation argued it should be the default single-number binary-classification metric. It is not as widely used as F1 but has a good theoretical case on its side.
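The 99/1 always-predict-negative pathology from Section 8's accuracy discussion makes the contrast concrete: accuracy looks excellent while every base-rate-aware metric correctly reports failure. A sketch:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

# 990 negatives, 10 positives; classifier predicts "negative" every time.
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print("accuracy         :", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1               :", f1_score(y_true, y_pred, zero_division=0))
print("MCC              :", matthews_corrcoef(y_true, y_pred))
```

Accuracy is 0.99; balanced accuracy is 0.5 (random), and F1 and MCC are both 0.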
Cohen's kappa corrects accuracy for chance agreement, producing a score that is 1 for perfect agreement, 0 for random, and can be negative for systematically-disagreeing predictions. It is the standard metric when two annotators are compared and is sometimes used as a classifier metric in the same spirit as MCC — as a base-rate-insensitive alternative to accuracy.
Extend any binary metric to multiclass problems by computing it per-class and averaging. Macro-averaging takes the unweighted mean across classes, giving equal weight to every class regardless of frequency — the right choice when every class matters equally. Micro-averaging pools the confusion-matrix counts across classes before computing the metric, giving equal weight to every example — for single-label multiclass problems micro-F1 is numerically identical to accuracy, so on imbalanced problems it inherits accuracy's blindness to rare classes. Weighted-averaging weights each class's score by its support. In scikit-learn, classification_report prints all three variants side-by-side; the right one to report depends on the problem, and "which average are we using?" should be answered explicitly in every comparison.
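A minimal sketch of how far micro and macro can disagree, on an illustrative three-class problem where the two minority classes are always misclassified:

```python
import numpy as np
from sklearn.metrics import f1_score

# Three classes with very different support: 90, 9, 1 examples.
y_true = np.array([0] * 90 + [1] * 9 + [2] * 1)
y_pred = y_true.copy()
y_pred[90:] = 0          # every minority example misclassified as class 0

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```

Micro-F1 is 0.90 (it equals accuracy here), while macro-F1 is roughly 0.32 because the two minority classes each contribute an F1 of zero.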
Many classifiers do not output a hard decision but a probability. Metrics like accuracy and F1 collapse those probabilities to 0/1 predictions and throw away the fine-grained information they contain. Proper scoring rules evaluate the probabilities directly, and are the right metric whenever the downstream system uses the probability — for thresholding, for expected-value decision-making, or for calibration.
Log loss, a.k.a. negative log-likelihood, a.k.a. cross-entropy, is −(1/n) Σ [yᵢ log p̂ᵢ + (1 − yᵢ) log(1 − p̂ᵢ)] for binary classification, and its natural multiclass generalisation. It penalises confident wrong predictions extremely heavily (log loss diverges as p̂ → 0 on a positive example) and rewards well-calibrated probabilities. Log loss is the strictly proper scoring rule that is minimised in expectation only when your predicted probability equals the true probability — which is exactly the property you want in a probability evaluation. Use log loss whenever you care about the quality of the probability estimates, not just the rank ordering.
The Brier score is (1/n) Σ (yᵢ − p̂ᵢ)² — the MSE between the predicted probability and the 0/1 label. Also a strictly proper scoring rule; unlike log loss, it is bounded (in [0, 1]) and does not diverge on confident-wrong predictions. Brier scores are particularly nice for reliability diagrams because they decompose additively into a calibration component and a resolution component (the Murphy decomposition), allowing you to separately diagnose "the probabilities are systematically biased" from "the probabilities don't discriminate between classes".
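The difference in how the two rules punish confident-wrong predictions shows up immediately on a toy example (the probability vectors are illustrative):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y = np.array([1, 1, 0, 0])
cautious  = np.array([0.70, 0.70, 0.30, 0.30])
confident = np.array([0.99, 0.01, 0.01, 0.01])  # second prediction is confidently wrong

for name, p in [("cautious", cautious), ("confident-wrong", confident)]:
    print(f"{name:16s} log loss={log_loss(y, p):.3f}  "
          f"Brier={brier_score_loss(y, p):.3f}")
```

One confidently wrong prediction drags the log loss far above the cautious forecaster's; the Brier score also rises, but stays bounded.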
The continuous ranked probability score — CRPS — generalises the Brier score from binary classification to continuous-valued probabilistic forecasting: it compares a predicted cumulative distribution function to the observed value. CRPS is the standard probabilistic-forecasting metric in meteorology, hydrology, and the adjacent disciplines that invented most of this material, and is increasingly used in probabilistic ML forecasting libraries (GluonTS, NeuralProphet). For point forecasts CRPS reduces to MAE; for a Gaussian predictive distribution it has a closed form.
A subtle practical consequence: AUC (Section 10) is a rank-only metric. Two classifiers with identical AUC can have arbitrarily different calibration — one may output probabilities like p̂ ∈ {0.49, 0.51} while the other outputs p̂ ∈ {0.05, 0.95}, with both preserving the same ordering. If you threshold at 0.5, the two models produce identical hard predictions; if you use the probability as a decision input (e.g. expected-value arithmetic for a business decision), they produce wildly different downstream outcomes. Always check log loss or Brier alongside AUC when probabilities are actually used.
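The rank-only nature of AUC is easy to demonstrate with four hand-picked predictions (a toy sketch, assuming scikit-learn's metric APIs):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y = np.array([0, 0, 1, 1])
p_timid = np.array([0.49, 0.49, 0.51, 0.51])  # barely separated probabilities
p_sharp = np.array([0.05, 0.05, 0.95, 0.95])  # same ordering, sharp probabilities

# Identical (perfect) ranking, hence identical AUC...
assert roc_auc_score(y, p_timid) == roc_auc_score(y, p_sharp) == 1.0
# ...but very different proper scores: here the sharp model's probabilities are better.
assert log_loss(y, p_sharp) < log_loss(y, p_timid)
```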
A classifier's threshold is a modelling choice — raise it and you get fewer, higher-precision positive predictions; lower it and you get more, higher-recall ones. Threshold-free evaluation sweeps the threshold across all possible values and summarises the trade-off as a curve or as the area under that curve.
The receiver operating characteristic plots true-positive rate (recall) on the y-axis against false-positive rate (1 − specificity) on the x-axis as the decision threshold varies. A random classifier traces the diagonal; a perfect classifier traces the upper-left corner. The area under the ROC curve — ROC-AUC — summarises the whole curve as a single number in [0, 1], interpretable as the probability that the classifier ranks a randomly chosen positive above a randomly chosen negative. Hanley & McNeil's 1982 The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve is the foundational reference; ROC-AUC is the standard discrimination metric in medical statistics and fraud detection.
The precision–recall curve plots precision against recall as the threshold varies. Unlike ROC, PR is strongly sensitive to class imbalance: on a 99/1 problem the ROC curve can look excellent while the PR curve is terrible, because ROC hides the many false positives the classifier must accept to identify a few true positives. Average precision — AP — summarises the PR curve as a single number: the mean of the precisions achieved at each threshold, weighted by the increase in recall from the previous threshold. Saito & Rehmsmeier's 2015 The Precision–Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets is the canonical argument for using PR instead of ROC on imbalanced problems.
Rule of thumb: ROC-AUC when class balance is approximately equal and false positives and false negatives matter symmetrically. PR-AUC when the positive class is rare and you specifically care about the quality of the top-ranked predictions (search relevance, fraud detection, medical screening). In most realistic industrial problems — where the positive class is a few percent of the data and business value lives in the top of the ranked list — PR-AUC is the right choice. Report both when uncertain; they are cheap to compute.
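The ROC-looks-great-while-PR-suffers effect can be reproduced with a simulated 1%-positive problem (synthetic Gaussian scores chosen for illustration; exact numbers will vary with the seed):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 9_900, 100                                # a 1% positive class
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg),    # negatives
                         rng.normal(2.0, 1.0, n_pos)])   # positives score higher on average

roc_auc = roc_auc_score(y, scores)
avg_prec = average_precision_score(y, scores)

# ROC-AUC looks excellent; average precision exposes the false positives
# the classifier must accept among 9,900 negatives to find the 100 positives.
assert roc_auc > 0.85
assert avg_prec < roc_auc
```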
A curve is for comparison; a threshold is for shipping. Once you have chosen a classifier, you must pick a threshold — the operating point on the curve you will actually deploy. The principled way is to write down the costs: if a false positive costs cfp and a false negative costs cfn, the threshold on a calibrated probability that minimises expected cost is p̂* = cfp / (cfp + cfn); the base rate is already baked into the calibrated probability, so it does not appear in the threshold. In practice teams often pick by a target precision or a target recall — "we will operate at 90% precision and accept whatever recall that gives us" — which is a cost assignment in disguise. Document the choice.
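The cost arithmetic can be written out directly. A minimal sketch; `cost_threshold` is an illustrative helper, not a library function, and it assumes the classifier's probability is calibrated:

```python
def cost_threshold(c_fp: float, c_fn: float) -> float:
    """Expected-cost-minimising probability threshold for a calibrated classifier.

    Expected cost of flagging positive at probability p: (1 - p) * c_fp.
    Expected cost of flagging negative:                   p * c_fn.
    Flag positive whenever p exceeds the indifference point below.
    """
    return c_fp / (c_fp + c_fn)

# A false negative nine times as costly as a false positive -> threshold 0.1:
assert abs(cost_threshold(1.0, 9.0) - 0.1) < 1e-12
# Symmetric costs recover the familiar 0.5 default.
assert cost_threshold(5.0, 5.0) == 0.5
```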
The standard test for comparing two correlated AUCs on the same test set is DeLong's (DeLong, DeLong & Clarke-Pearson 1988), implemented in R's pROC package and reachable in Python through the Mann–Whitney U machinery in scipy.stats. Reporting "classifier A beats classifier B with AUC 0.86 vs 0.85" without a statistical test is no different from reporting a point estimate without uncertainty — a number that may or may not be real.
A classifier is calibrated when its predicted probabilities correspond to empirical frequencies — of the examples where it predicts 0.8, 80% really are positive. Classifiers with high AUC are often badly miscalibrated, and miscalibrated probabilities break any downstream system that uses them for decision-making.
The standard calibration diagnostic: bin predictions by probability (say, into deciles), compute the mean predicted probability and the mean empirical label rate in each bin, and plot them against each other. A perfectly calibrated classifier traces the diagonal y = x. A classifier that is systematically overconfident has its curve below the diagonal — it predicts 80% but achieves only 65%. The reliability diagram is the visual version of the expected calibration error (ECE): the binned-average absolute gap between predicted and empirical. Naeini, Cooper & Hauskrecht's 2015 Obtaining Well Calibrated Probabilities Using Bayesian Binning is the standard reference for ECE; implementations in the netcal Python package and in every serious calibration audit.
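The binned-gap computation behind ECE fits in a few lines. A sketch with equal-width bins on made-up predictions; `expected_calibration_error` is an illustrative helper, not the netcal API:

```python
import numpy as np

def expected_calibration_error(y, p, n_bins=10):
    """Support-weighted mean |mean predicted prob - empirical positive rate| per bin."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
p_overconfident = np.full(10, 0.95)   # predicts 95% everywhere; only 60% are positive

assert expected_calibration_error(y, y.astype(float)) < 1e-12   # perfect calibration
assert expected_calibration_error(y, p_overconfident) > 0.3     # a 35-point gap
```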
Consider a classifier that maps every true positive to p̂ = 0.51 and every true negative to p̂ = 0.49. Its AUC is 1.0 (perfect ranking) and its accuracy at threshold 0.5 is 100%. But its predicted probabilities are all near 0.5, so a downstream expected-value calculation will treat every prediction as maximally uncertain. Conversely, a classifier that maps positives to 0.8 and negatives to 0.1 but occasionally confuses them has lower AUC but more usable probabilities. Rank-preserving monotonic transformations leave AUC unchanged while changing calibration arbitrarily — which means fixing calibration is essentially always possible as a post-hoc step without disturbing the rank-based metrics.
Platt's 1999 Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods introduced the simplest calibration procedure: fit a logistic regression p = σ(A · s + B) mapping model score s to calibrated probability p, learning A and B from a held-out calibration set. Platt scaling is parametric and assumes a sigmoidal shape; it works remarkably well for SVMs and boosted trees, and is the default one-liner when calibration is badly off. CalibratedClassifierCV in scikit-learn implements it directly.
For classifiers where Platt's sigmoidal assumption does not fit, isotonic regression gives a non-parametric monotonic calibration curve — the best piecewise-constant monotonic function mapping scores to probabilities under squared loss. Isotonic is more flexible than Platt and more data-hungry (it can overfit with less than a few thousand calibration examples), so the choice is an empirical one: try both on a validation set, pick whichever gives lower log loss. Zadrozny & Elkan's 2002 Transforming Classifier Scores into Accurate Multiclass Probability Estimates is the definitive paper; Niculescu-Mizil & Caruana's 2005 Predicting Good Probabilities with Supervised Learning is the empirical comparison that made both techniques mainstream in ML.
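A sketch of the try-both protocol using scikit-learn's CalibratedClassifierCV on synthetic data. Naive Bayes is used here because it is a classically overconfident base model; which calibration method wins, and by how much, will vary with the data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3_000, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "raw": GaussianNB().fit(X_tr, y_tr),
    "platt": CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_tr, y_tr),
    "isotonic": CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_tr, y_tr),
}
losses = {name: log_loss(y_te, m.predict_proba(X_te)[:, 1]) for name, m in models.items()}

# Pick whichever calibration method gives the lower held-out log loss;
# on this synthetic data, calibrating the overconfident model should help.
best = min(("platt", "isotonic"), key=losses.get)
assert losses[best] <= losses["raw"]
```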
Fraud is rare. Disease is rare. Churn is not as rare as either but is often well below 50%. Imbalanced classification problems — where one class is substantially rarer than another — are the norm in industrial ML, and every part of the evaluation pipeline handles them poorly by default.
Three things, in increasing order of subtlety. First, accuracy becomes a useless metric (Section 8): predict the majority class and you win. Second, the training objective of most classifiers weighs each example equally, so the gradient is dominated by the majority class and the minority class is under-fit. Third, the ROC curve can look good (ROC-AUC averages across the whole threshold range, where the many true negatives dominate) while the PR curve shows the classifier is near-useless at identifying positives (Section 10). The first is an evaluation-side problem; the second and third interact with training.
The simplest training-side fix is to rebalance the classes. Random oversampling duplicates minority-class examples; simple but prone to overfitting on the duplicated points. Random undersampling discards majority-class examples; simple but throws away information. SMOTE (Chawla et al. 2002, Synthetic Minority Over-sampling Technique) generates synthetic minority examples by interpolating between a real minority example and one of its k-nearest-neighbour minority examples, producing new training points without duplication. The imbalanced-learn library implements SMOTE and a family of related variants (Borderline-SMOTE, ADASYN, SMOTEENN).
The cleaner fix, where the model supports it, is class weighting: multiply the loss for each class by a weight inversely proportional to the class frequency, so that the optimiser treats the classes equally. Most scikit-learn classifiers accept class_weight="balanced" as a one-line activation. More generally, cost-sensitive learning assigns different costs to each type of error (FP vs FN) rather than each class, which is strictly more expressive: a false negative on a cancer screening is worse than a false positive, and cost-sensitive learning lets you say so directly in the loss. Elkan's 2001 The Foundations of Cost-Sensitive Learning is the canonical reference.
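A minimal sketch of the one-line activation on a synthetic ~5%-positive problem (exact scores depend on the seed; the direction of the recall shift is the point):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 95% negative, 5% positive.
X, y = make_classification(n_samples=4_000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1_000, class_weight="balanced").fit(X_tr, y_tr)

# Weighting the rare class up moves the operating point toward higher recall
# (at some cost in precision) without any resampling of the training data.
assert recall_score(y_te, balanced.predict(X_te)) >= recall_score(y_te, plain.predict(X_te))
```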
An underappreciated fact: for any classifier that outputs well-calibrated probabilities, the class-imbalance problem at prediction time reduces to choosing the right threshold. You do not need to rebalance the training data — you need to threshold at the Bayes-optimal point given the class frequencies and costs (Section 10). Training on the natural class distribution and then thresholding is often the simplest, best-calibrated, and most reproducible fix. Sampling-based fixes (SMOTE, oversampling) distort the base rate in the training data, which can itself hurt calibration; use them when they empirically help, not as a reflex.
Hyperparameter tuning is just model selection in disguise: each hyperparameter value is a candidate model, and selecting the best hyperparameter is choosing among candidate models by their cross-validated score. The algorithms in this section are search strategies over a hyperparameter space, and differ mostly in how they trade off sample efficiency against implementation simplicity.
Exhaustively evaluate every combination on a discrete grid of hyperparameter values. Simple, reproducible, embarrassingly parallel, and exponentially expensive in the number of hyperparameters. Appropriate when you have at most two or three hyperparameters each with five or so levels; useless when the search space has more than about a hundred configurations unless compute is free. GridSearchCV in scikit-learn.
Sample hyperparameter configurations uniformly (or from a specified prior) and evaluate each with cross-validation. Bergstra & Bengio's 2012 Random Search for Hyper-Parameter Optimization showed that for almost any hyperparameter space where only a subset of parameters actually matter (which is the overwhelmingly common case), random search outperforms grid search dramatically at the same compute budget: random search explores many more values of the handful of important hyperparameters because it does not waste trials on combinations of unimportant ones. Random search should be the default for any search with more than two hyperparameters. RandomizedSearchCV in scikit-learn.
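A minimal RandomizedSearchCV sketch on synthetic data, sampling the regularisation strength from a log-uniform prior (the search space and trial budget here are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1_000),
    param_distributions={"C": loguniform(1e-3, 1e3)},  # sample C on a log scale
    n_iter=20,          # 20 sampled configurations, each scored by 5-fold CV
    cv=5,
    random_state=0,
).fit(X, y)

assert 1e-3 <= search.best_params_["C"] <= 1e3
assert 0.0 <= search.best_score_ <= 1.0
```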
Both grid and random search ignore the evidence from past trials. Bayesian optimisation fits a probabilistic model (usually a Gaussian process or a tree Parzen estimator) to the observed trial scores and uses it to choose the next trial that maximises an acquisition function — expected improvement, upper confidence bound, or Thompson sampling. The result is dramatically better sample efficiency on hyperparameter spaces with tens of dimensions. Snoek, Larochelle & Adams's 2012 Practical Bayesian Optimization of Machine Learning Algorithms was the paper that brought BO into mainstream ML practice; it is now the engine underneath Optuna (TPE), Hyperopt (TPE), scikit-optimize (GP), and the tuning features of most major cloud ML platforms.
A complementary family based on early stopping: evaluate many configurations at a small compute budget, kill the worst, double the budget, evaluate the survivors, and repeat. Jamieson & Talwalkar's 2016 Non-stochastic Best Arm Identification and Hyperparameter Optimization formalised the idea; Li et al.'s 2017 Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization extended it to handle the budget-vs-number-of-configs trade-off automatically. Hyperband and its Bayesian-optimisation refinement BOHB (Falkner et al. 2018) are the state of the art for expensive-to-train models where the budget-aware aspect makes a large practical difference; Optuna's ASHA pruner implements Asynchronous Successive Halving for distributed tuning.
Every tuning method above slots into the inner loop of nested cross-validation (Section 5). The outer loop gives honest generalisation-error estimates; the inner loop chooses hyperparameters. Whether the inner loop is grid search, random search, or Bayesian optimisation is an efficiency decision, not a statistical one. Report the outer-loop number as your test error; report the chosen hyperparameters (from a final inner-CV run on the full dataset) as what the shipped model uses.
You have two models with similar CV scores. Is the difference real, or could it be explained by the random variation inherent in the cross-validation protocol? The machinery of statistical hypothesis testing and confidence intervals gives principled answers — and using it should be the default, not an afterthought.
For two classifiers evaluated on the same test set, the simplest comparison is McNemar's test: form the 2 × 2 table of example-level agreement / disagreement — examples both classifiers got right, both got wrong, A-only right, B-only right — and test the null hypothesis that the two off-diagonal counts are equal. The test is exact, uses the same-test-set paired structure explicitly, and does not require any assumption about the test set's distribution. Dietterich's 1998 Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms reviewed five popular comparison tests and concluded that McNemar, when applicable, is the most powerful and best-behaved.
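The exact version of McNemar's test is a one-liner once the disagreement cells are counted. A sketch with hypothetical counts (the 30 and 14 are invented for illustration):

```python
from scipy.stats import binomtest

# Hypothetical disagreement counts for two classifiers on one shared test set:
n_a_only = 30   # examples A got right and B got wrong
n_b_only = 14   # examples B got right and A got wrong
# (The two agreement cells of the 2x2 table do not enter the test.)

# Exact McNemar: under H0, the split between the two off-diagonal cells
# is Binomial(n_a_only + n_b_only, 0.5).
p_value = binomtest(n_a_only, n_a_only + n_b_only, 0.5).pvalue

assert 0.0 < p_value < 0.05   # 30-vs-14 is lopsided enough to reject at the 5% level
```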
For comparing two methods via k-fold CV, a natural approach is the paired t-test on the per-fold score differences. The approach works in expectation but is known to have inflated Type I error because the per-fold score differences are not independent (the folds share training data). Nadeau & Bengio's 2003 Inference for the Generalization Error proposed a corrected resampled t-test that adjusts the variance estimate for the train/test overlap; it gives much better Type I error control and is the recommended replacement. For repeated cross-validation, the correction involves a term n₁/n₂ where n₁ is the size of the test fold and n₂ the size of the training fold.
A further refinement: run 5 iterations of 2-fold CV (2-fold so that training and test folds are the same size, making the variance estimate more tractable), compute the paired score difference on each of the 10 folds, and form a test statistic that Dietterich 1998 showed has better-controlled Type I error than the plain paired t-test. The 5×2 CV test is the gold standard for small-data comparisons in the classical-ML literature; it is implemented in mlxtend.
Rather than a hypothesis test, you can report a bootstrap confidence interval around the score difference: resample the test set n times with replacement, compute the score difference on each resample, and take the 2.5%/97.5% percentiles. The bootstrap approach, due to Efron 1979, is non-parametric, does not require assuming a specific variance structure, and produces a number that is directly interpretable as "with 95% confidence, A is better than B by somewhere between 0.3% and 1.2%". Report bootstrap intervals in any paper or report where a difference-of-metrics claim is being made.
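A sketch of the paired bootstrap on simulated per-example correctness indicators (the 0.86 and 0.82 accuracies are invented; in practice you resample the real test-set predictions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
# Simulated per-example 0/1 correctness for two classifiers on one test set.
correct_a = rng.random(n) < 0.86
correct_b = rng.random(n) < 0.82

# Resample test examples with replacement; recompute the paired accuracy gap.
diffs = [correct_a[idx].mean() - correct_b[idx].mean()
         for idx in (rng.integers(0, n, n) for _ in range(2_000))]
lo, hi = np.percentile(diffs, [2.5, 97.5])
point = correct_a.mean() - correct_b.mean()

assert lo <= point <= hi       # the point estimate sits inside its own interval
assert hi - lo < 0.10          # a few percentage points wide at n = 2,000
```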
The evaluation outputs of the preceding sections — CV scores, learning curves, training-vs-test gaps — are diagnostic inputs, not just reporting outputs. Reading them tells you which of two canonical failure modes your model is in, and what to do next.
Underfitting is high bias. Symptoms: both training error and test error are high, and they are close to each other. The learning curve flattens well above the irreducible-noise floor. The model is not capable of capturing the signal in the data. The remedies, in rough order of effort: add more features, add interaction or polynomial features, switch to a more flexible model (linear → tree → gradient boosting → neural network), reduce regularisation. Adding more training data will not help — the learning curve has already flattened, which means more data does not change the answer.
Overfitting is high variance. Symptoms: training error is much lower than test error, with a large persistent gap between the two curves. The model memorises training-set specifics that do not generalise. The remedies, in rough order of effort: add more training data (the classical fix — more data shrinks the variance of the estimator), add regularisation (L1, L2, dropout, early stopping), reduce model capacity (shallower trees, fewer features, simpler model class), use cross-validation more aggressively in tuning, ensemble (bagging specifically targets variance reduction). Adding more features usually makes overfitting worse, not better — the opposite of underfitting.
Plot training and cross-validated test error as a function of training-set size. If both curves have converged close together and well above the noise floor, you are underfit and more data won't help. If they have converged close together and close to the noise floor, you are well-fit and there is not much left to optimise. If they are still separating as data grows, you are overfit and more data will close the gap. The learning curve is the single most useful diagnostic plot in a modelling workflow and should be run at the start of every serious model-development effort, not at the end.
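scikit-learn's learning_curve produces the two arrays behind the plot directly. A sketch on synthetic data; the interesting part is the shape of the train and validation means, not the absolute scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1_500, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1_000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),   # 10% ... 100% of the training pool
    cv=5,
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

assert train_scores.shape == (5, 5)      # 5 training sizes x 5 CV folds
assert train_mean[0] >= val_mean[0]      # small-data regime: train above validation
```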
For any model with a tunable complexity parameter — ridge λ, lasso λ, tree max-depth, XGBoost n_estimators, neural-network training epochs — plot validation error as a function of that parameter and you see the classical U-curve: error decreases as the model becomes more capable, reaches a minimum, and rises again as overfitting takes over. The minimum is where you want to operate. The whole point of the training process is to find it. Cross-validation-with-grid-search is the standard procedure; early-stopping-on-validation-loss is the same thing done online during a gradient-descent fit.
Data leakage — the use, during training, of information that would not be available at prediction time — is the single most common cause of over-optimistic evaluation and the resulting production failures. This section catalogues the forms and prescribes the defences.
The most destructive form. A feature that is a proxy for the target, either because it is the target under another name, because it is computed from data generated after the label, or because it simply is the label, produces a model that scores spectacularly on evaluation and cannot possibly work in production. Kaufman et al.'s 2012 Leakage in Data Mining catalogued dozens of real-world examples from the KDD Cup and industry post-mortems; the single recurring pattern is that the leaky feature was "obviously" informative and the modeller did not stop to ask whether it would be available at serving time. The defence is procedural: before training, write down for every feature the exact time at which it becomes known, and verify that it is strictly before the label time.
Nearly as common and much more subtle. A standardisation that uses the full dataset's mean and variance, a mean-imputation that uses the full dataset's mean, a PCA fitted on the full dataset, a target encoding computed on the full dataset — each of these silently passes information from the test fold back into training. The correct protocol is always the same: fit transformations only on the training fold, then transform the test fold using the fitted state. The scikit-learn Pipeline object exists specifically to make this easy to do right; the most common failure mode is a notebook that preprocesses the whole dataframe before splitting, which can inflate CV scores by several percentage points without any visible red flag.
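The correct protocol is one line with a Pipeline, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, random_state=0)

# Right: the scaler is refit on each training fold inside the CV loop,
# so no test-fold statistics ever reach the model.
# Wrong (not shown): StandardScaler().fit_transform(X) on the whole
# dataframe before splitting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
scores = cross_val_score(pipe, X, y, cv=5)

assert scores.shape == (5,)
assert all(0.0 <= s <= 1.0 for s in scores)
```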
When rows share entities — multiple transactions per customer, multiple images per patient, multiple sessions per user — random row-level splitting places the same entity on both sides of the train/test divide, and the model memorises entity-specific patterns rather than generalisable ones. The resulting CV score reflects memorisation more than generalisation. The defence is to split at the group level: GroupKFold, GroupShuffleSplit, or a bespoke group-aware splitter. The ubiquity of this failure mode is underappreciated: any dataset with a customer ID, patient ID, user ID, or device ID is a candidate for group leakage, and the naive random-row split is wrong by default.
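GroupKFold's guarantee is easy to verify directly. A toy sketch with four invented entity IDs standing in for customer IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # e.g. customer IDs

n_folds = 0
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # No entity ever appears on both sides of the split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    n_folds += 1

assert n_folds == 4
```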
A close cousin of group leakage but with time as the grouping variable. A model evaluated on past data when it will predict the future has seen future-correlated features during training; its CV error is optimistic and its deployed error is much worse. Section 6 covered the time-series-CV protocols that defend against this. The slogan: if rows have timestamps, use time-aware splitting. If unsure, plot a histogram of training-fold timestamps next to test-fold timestamps and look for overlap that should not be there.
Tooling helps with only some of these defences. A Pipeline wrapped in a cross_val_score makes the preprocessing defence correct by construction; the feature-availability audit against target leakage and the discipline of choosing group- or time-aware splitters are process concerns, not tooling ones.
The preceding sections describe the statistical machinery of evaluation. The practitioner's layer adds the procedural discipline that makes evaluation honest in a team, across a project lifecycle, and after a model ships.
Every model report should contain the score of at least two baselines: a trivial baseline (predicting the majority class, predicting the mean, predicting zero) and a simple baseline (logistic regression, a decision stump, a one-nearest-neighbour, last-value-for-forecasting). If your elaborate neural network does not beat a logistic regression by a comfortable margin, you do not have a working neural network — you have a logistic regression with extra steps. Running the simple baseline takes ten minutes and protects against weeks of work on a methodology that does not outperform the obvious. Baselines are also the reference point of the relative-improvement claim, which is the only claim anyone should be making: "AUC improved from 0.82 to 0.86" means something; "AUC is 0.86" by itself does not.
A single aggregate metric hides systematic error patterns. A classifier with 90% overall accuracy can have 99% accuracy on the majority subgroup and 60% on the minority subgroup. The metric that matters for fairness, operational deployment, or real-world utility is often not the aggregate but the slice-level score. Slice the test set by demographic group, by input length, by time-of-day, by data source, by any axis the downstream user will care about, and report the metric on each slice. Barocas, Hardt & Narayanan's Fairness and Machine Learning (2019) and Google's What-If Tool documentation are good entry points; the fairlearn Python package implements the standard slice-based metrics directly.
A model's offline CV score and its online production performance can differ dramatically, even in the absence of bugs. Causes include distribution shift (online traffic differs from the training distribution), feedback effects (the model's own predictions change user behaviour in ways that re-enter the training data), serving latency constraints that force approximations, and simple implementation drift between the training and serving code paths. The industry-standard response is online A/B testing: ship the candidate model to a fraction of traffic, measure production metrics against the incumbent, and promote or revert based on the online outcome. Google's Overlapping experiment infrastructure (Tang et al. 2010) and Kohavi et al.'s Controlled Experiments on the Web are the standard references; Trustworthy Online Controlled Experiments (Kohavi, Tang & Xu 2020) is the book-length version.
Evaluation does not end at deployment. Production systems drift: input distributions shift, label distributions shift, upstream data pipelines change, ground-truth sources get revised. A deployed model should have continuous monitoring — prediction-distribution histograms, feature-distribution histograms, online performance metrics, calibration tracking — with alerts that fire when any of them drifts beyond a threshold. The MLOps layer (Evidently AI, WhyLabs, Arize, cloud-native monitoring) is the current standard tooling. Every ML system that survives past initial launch does so because someone is watching the monitoring dashboards.
The evaluation protocols in this chapter are the product of forty years of statistical-machine-learning experience on supervised tabular problems. In the deep-learning era and the foundation-model era, the classical toolkit still applies — but the problems being evaluated have stretched the toolkit's assumptions, and a set of new evaluation concerns has emerged alongside the old ones.
Most of this chapter assumes you can train your model tens of times to run cross-validation. For deep-learning models that take days or weeks to train, that is not feasible; the field has largely fallen back on a single fixed holdout split per benchmark. The result is an evaluation culture with high variance across random seeds, systematic over-fitting to the public test sets (ImageNet, GLUE, SuperGLUE), and long-running debates about how much a 0.3% improvement means. The reproducibility crisis in ML that Pineau et al.'s 2021 Improving Reproducibility in Machine Learning Research documented is largely a consequence of this — not of fraud but of inherently high-variance evaluation done at single-seed scale. The fix, where affordable, is multi-seed training and reporting distributions rather than points.
Fixed test sets, used for many years by many teams, become effectively training data through sheer exposure. A model that has been tuned for five years to do well on ImageNet's test set is overfitted to that test set in ways that no single submission-time evaluation will catch. Recht et al.'s 2019 Do ImageNet Classifiers Generalize to ImageNet? quantified the effect by collecting a new independent ImageNet-scale test set and observing systematic accuracy drops on all major architectures. The modern response is to rotate benchmarks (Dynabench, BIG-Bench), to evaluate on held-out distributions (out-of-distribution generalisation), and to treat headline benchmark numbers with appropriate skepticism.
Evaluating a large language model is not a classical-ML evaluation problem. Outputs are open-ended text, not labels; metrics based on string matching (BLEU, ROUGE) capture only a sliver of quality; human evaluation is expensive and noisy; and the models' few-shot flexibility means that "train" and "test" are not cleanly separable. The modern LLM evaluation toolkit — HELM (Liang et al. 2022), BIG-Bench, MMLU, human-preference ranking (ELO, pairwise judgments), LLM-as-a-judge, LMSYS Arena — is still actively evolving, and every element of it has known methodological problems. Classical evaluation principles still apply (calibration, imbalanced-class protocols, statistical-significance testing) but the underlying protocols have to be reinvented for each new class of task.
A parallel hard case. Evaluating a learned policy offline — without running it in the live environment — is counterfactual evaluation: what would the policy have done, had it been in control, when the data was generated by a different policy? The statistical machinery (inverse-propensity scoring, doubly-robust estimators, off-policy policy evaluation) is adapted from causal inference and has non-trivial variance characteristics. See Dudik et al.'s 2014 Doubly Robust Policy Evaluation and Optimization and the broader counterfactual-evaluation literature.
Evaluation has a bibliography more scattered than any other subfield of classical ML: the foundational papers are spread across statistics, biostatistics, meteorology, information retrieval, and machine learning, and no single textbook covers the full territory. The references below split into anchor textbooks and survey chapters, foundational papers (from Stone's 1974 cross-validation paper through Efron's bootstrap and Dietterich's comparison tests to the proper-scoring-rules literature), modern extensions that carry classical evaluation into deep learning and beyond, and the software where everyone actually does the work. If you only read one chapter of one book, read ESL Chapter 7.
DeLong's test for comparing correlated AUCs (DeLong, DeLong & Clarke-Pearson 1988) is implemented in R's pROC package and in several Python libraries; it is the standard tool for comparing classifier discrimination in the biostatistics literature and should be the standard tool in ML too. Pair with Hanley & McNeil's 1982 The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve for the foundational ROC reference.

scikit-learn's CalibratedClassifierCV wraps both Platt scaling and isotonic regression; which of Platt or isotonic wins is an empirical question decided on a held-out validation set. Pair with Kull, Silva Filho & Flach's 2017 Beta calibration for a third option that splits the difference.

sklearn.model_selection implements every splitter (KFold, StratifiedKFold, GroupKFold, TimeSeriesSplit, LeaveOneGroupOut), search procedure (GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV), and CV driver (cross_val_score, cross_validate, learning_curve) in the chapter. sklearn.metrics implements every regression and classification metric in Sections 7–10, plus calibration_curve, brier_score_loss, and log_loss. The User Guide's Cross-validation, Tuning the hyper-parameters of an estimator, and Metrics and scoring pages together cover the practical ground.

Optuna's define-by-run suggest_int / suggest_float / suggest_categorical API integrates naturally with PyTorch, TensorFlow, scikit-learn, XGBoost, and LightGBM training code. The Key Features documentation page is the best quick-start; the visualization submodule produces the hyperparameter-importance and parallel-coordinates plots that make tuning decisions legible to humans.

imbalanced-learn supplies SMOTE and its relatives together with a Pipeline variant that correctly places sampling inside the CV loop rather than outside. For any serious work on imbalanced datasets this is the first library to reach for.

For calibration tooling, CalibratedClassifierCV covers Platt and isotonic; netcal adds everything else. Pair with uncertainty-toolbox (for regression-calibration evaluation) and Fortuna (AWS's broader uncertainty-quantification library) for the adjacent calibration-beyond-classification territory.

This page is Chapter 09 of Part IV: Classical Machine Learning, and the closing chapter of Part IV. The nine chapters together form a coherent arc: supervised regression and classification as the two canonical problems; ensembles, clustering, and dimensionality reduction as the extensions in three directions; probabilistic graphical models and kernel methods as the two great mathematical frameworks; feature engineering as the engineering connective tissue; and evaluation and selection — this chapter — as the measurement discipline that makes any claim about any of them meaningful. Part V opens the Deep Learning Foundations arc, which re-derives most of the problems of classical ML in a setting where the learned features, the optimisation, and the generalisation behaviour all work differently — but where the measurement discipline in the chapter you have just read remains the anchor you keep coming back to.