The seven chapters before this one have described learning algorithms — regression, classification, ensembles, clustering, dimensionality reduction, probabilistic graphical models, kernel machines — as if the features they consume were given, ready-made, well-typed, and on the right scale. Real features almost never arrive that way. They arrive as raw timestamps that a linear model cannot read, as categorical strings with two hundred thousand distinct values, as continuous measurements with heavy tails and outliers and twenty percent missing, as free-form text, as nested relational joins, as images that haven't been resized. Feature engineering is the discipline of turning that raw material into numerical inputs a model can use — and, done well, it is the single highest-leverage activity in classical machine learning. Iterating on features almost always moves metrics more than iterating on algorithms; a gradient-boosted tree with well-engineered features will beat a deep network with raw inputs on most tabular problems even in 2026. Feature selection is the closely related discipline of figuring out which features to keep: many candidate features degrade a model rather than improve it, through noise, multicollinearity, leakage, or simply by costing compute and maintenance for marginal gain. Selection methods come in three canonical flavours — filter, wrapper, embedded — and choosing between them is a real engineering decision. Feature engineering is also where the largest ML accidents happen: a feature that predicts too well in training because it leaks label information is the single most common cause of models that crash on contact with production. This chapter covers the craft.
Section one motivates the whole enterprise: why the columns you feed a model matter more than the model itself on almost every tabular problem, and why that is going to remain true despite the deep-learning revolution. Section two frames feature engineering as a pipeline — a sequence of idempotent transformations fitted on training data and applied to both training and serving — and introduces the principle that there are no end-runs around this structure: every transformation must be fittable, persistable, and reproducible. Section three covers numerical features: scaling (standardisation, min-max, robust), power transforms (log, Box–Cox, Yeo–Johnson), binning, and the monotonic-transform families that tame heavy tails. Section four is the workhorse of tabular ML: categorical encoding. One-hot, ordinal, frequency, and binary encodings all have characteristic failure modes, and knowing which to reach for is most of the practical skill. Section five tackles high-cardinality categoricals (customer IDs, city names, zip codes) where one-hot is infeasible — target encoding, frequency encoding, learned embeddings, and the careful cross-validation they require. Section six is interaction terms and polynomial features: how you explicitly tell a linear model about the multiplicative structure it cannot discover on its own, and when that pays off versus letting a tree model find it automatically.
Sections seven through twelve survey the rest of the engineering toolkit. Section seven is datetime features — the most underused class of features in practice and a place where small effort routinely yields outsized wins. Section eight covers text features: bag-of-words, TF-IDF, character n-grams, and the feature-hashing trick that lets you handle vocabularies that will not fit in memory. Section nine is missing-value imputation, which is both statistical inference and feature engineering at once: simple means and medians, KNN and model-based imputation, missingness indicators, and the sharp question of when missingness is itself informative. Section ten is outlier handling: detection (z-score, IQR, isolation forests), treatment (winsorising, clipping, separate indicator features), and when outliers are the signal rather than the noise. Section eleven returns to the hashing trick in its general form, as a scalable alternative to any one-hot or vocabulary-based encoding. Section twelve introduces feature selection proper, with the filter–wrapper–embedded taxonomy that organises everything to follow.
Sections thirteen through fifteen develop the three selection families in detail. Section thirteen covers filter methods — correlation, chi-square, ANOVA F-test, mutual information, and minimum-redundancy–maximum-relevance (mRMR) — which rank features by a statistic computed independently of any model. Section fourteen covers wrapper methods — forward selection, backward elimination, recursive feature elimination (RFE), and genetic-algorithm wrappers — which treat selection as a search over feature subsets and evaluate each subset by training a model. Section fifteen covers embedded methods — L1-regularised regression (Lasso) and elastic net, tree-based importances, SHAP-based selection, and the permutation importance that has become the de facto standard diagnostic. Section sixteen is the safety-critical topic: data leakage in feature engineering. This is the section that prevents the catastrophes; read it twice. Section seventeen is a practical operational guide — when to engineer, when to stop, what to automate, and what to stash in a feature store. Section eighteen places feature engineering in the broader ML landscape: deep learning's attempt to learn features from scratch, the enduring dominance of classical feature engineering on tabular data, the rise of automated feature engineering (featuretools, auto-sklearn, H2O Driverless AI), and the coming world of learned embeddings and retrieval-augmented features.
On tabular problems — which still account for the majority of production machine-learning work — the engineering of the features beats the choice of algorithm most days of the week. A gradient-boosted tree with well-chosen features will outperform a modestly-tuned neural network with raw inputs on almost every business dataset. This is not a nostalgic claim about classical methods; it is an observation about the structure of the learning problem.
Machine-learning algorithms learn functions of the features. They cannot learn structure the features do not expose. If the relationship between customer tenure in days and churn probability is periodic over a seven-day week, a linear model that sees only tenure-in-days will not discover that periodicity no matter how much data you give it. A linear model that sees day-of-week as a one-hot feature will find it instantly. The first model is architecturally incapable of the finding; the second is already most of the way there. Choosing an algorithm and choosing features are complementary: the algorithm defines a hypothesis space, and the features define the basis in which that space is parameterised. Changing the basis can change everything.
Three empirical regularities make feature engineering the highest-leverage activity in classical ML. First, tabular data contains a lot of prior human knowledge that is not in the raw columns: the practitioner who knows that transaction amount matters less than transaction amount as a fraction of the customer's typical spend can write that feature in one line and buy a ten-percent improvement in F1 that no algorithm change will match. Second, real datasets are smaller than researchers pretend. The canonical "big data" benchmarks in the deep-learning literature are not the business datasets most practitioners work on; the typical real-world classification problem has tens of thousands of examples and dozens to hundreds of features, a regime where feature engineering dominates algorithm choice. Third, classical ML algorithms — especially gradient-boosted trees — are extraordinarily forgiving consumers of sensible features: scale them however you like, include redundant versions, miss a few, and the tree will still sort it out.
This chapter covers the two halves of the problem: engineering (turning raw data into usable numerical inputs) and selection (deciding which of the candidate features to keep). Both are craft as much as science: the statistical machinery exists, and you will see it, but the working intuition — "try target encoding, it usually helps"; "always check for leakage before celebrating a win" — is the bulk of the value. We will also be severe about the pitfalls: data leakage, train–serve skew, and the selection-bias traps of wrapper methods all lie in wait for the practitioner who cares only about accuracy. The techniques in this chapter are the ones every production ML system actually uses, and the ones every failure post-mortem keeps pointing back to.
A feature-engineering step is not a script you run once: it is a transformer — a pair of functions, one that fits state from training data and one that applies that state to new data. Getting this structure right is what separates a feature pipeline that survives production deployment from one that doesn't.
Consider standardising a numerical column to zero mean and unit variance. The operation has two parts. Fit: compute the mean and standard deviation from the training data. Transform: subtract the mean, divide by the standard deviation. Naively you might write one function that does both, but that function cannot be applied to the test set or to a single serving example without recomputing the mean — which would either use future data (on test, the test-set mean) or not be well-defined at all (on a single example). The solution is the fit/transform split used throughout scikit-learn: fit(X_train) estimates the mean and standard deviation and stores them in the transformer object; transform(X) applies the stored statistics to any future data. This structure is not optional; every non-trivial feature pipeline in production is built on it.
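The split can be sketched as a minimal hand-rolled transformer — the class name here is ours, purely illustrative; scikit-learn's StandardScaler implements exactly this contract:

```python
import numpy as np

class Standardizer:
    """Minimal illustration of the fit/transform split."""

    def fit(self, X):
        # Fit: estimate statistics from the training data only.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Transform: apply the *stored* training statistics to any data,
        # including a single serving example.
        return (X - self.mean_) / self.std_

X_train = np.array([[1.0], [2.0], [3.0]])
X_serve = np.array([[4.0]])  # one serving example — no mean of its own

scaler = Standardizer().fit(X_train)
print(scaler.transform(X_serve))  # uses the training mean and std
```

The serving example is scaled with the training mean (2.0) and standard deviation, which is precisely what a one-shot fit-and-transform script cannot do.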
Feature-engineering steps compose. You typically want to do some imputation, then some scaling, then some encoding, then pass the result to a model. Scikit-learn's Pipeline object expresses this as a named sequence, exposing a single fit/predict interface that hides the internal structure: Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler()), ("encode", OneHotEncoder()), ("model", LogisticRegression())]). ColumnTransformer is the complementary construct for applying different transformers to different columns (scale the numeric, one-hot the categorical). Together they make an entire fit-and-predict workflow a single object — crucial for cross-validation (which must re-fit transforms on each fold to avoid leakage), for model persistence, and for deployment.
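A minimal end-to-end sketch, with illustrative column names and toy data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one numeric column (with a missing value), one categorical.
X = pd.DataFrame({"age": [25, 32, None, 41], "plan": ["a", "b", "a", "c"]})
y = [0, 1, 0, 1]

# Different transformers for different columns.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(X, y)          # fits imputer, scaler, encoder, and model in order
print(clf.predict(X))  # the entire workflow is one persistable object
```

Cross-validating `clf` with `cross_val_score` re-fits every transformer on each training fold, which is exactly the leakage-avoiding behaviour the text describes.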
A production-ready feature pipeline maintains three invariants. It is fittable: all statistics needed at serving time are estimated from training data and nowhere else. It is persistable: the fitted state serialises to a file and loads back identically. It is reproducible: re-fitting on the same data produces bit-identical state (up to floating-point nondeterminism). Violating any of these invariants is the classic source of train–serve skew, the bug where features are computed one way during training and another way during inference. The most common cause is computing some statistic at training time using pandas and at serving time using a different library: the edge cases differ, the results differ, and the model silently becomes less accurate in production than it was at evaluation.
At scale, the Pipeline pattern breaks down: you want to share features across multiple models, compute them once rather than re-fitting for each training run, and keep the training and serving paths exactly aligned. The modern answer is a feature store — Feast, Tecton, Hopsworks, Databricks Feature Store — which materialises features into a cache that both offline training and online serving read from. The feature store enforces the three invariants by construction: there is one definition of each feature, one computation path, and a point-in-time query API that prevents training on features computed after the label was known. Feature stores are the subject of Section 17; for now, the principle is enough.
Numerical features rarely arrive on the scale the model wants. Raw income is skewed across four orders of magnitude; raw time-to-checkout is bimodal with a heavy right tail; raw latency is log-normal. The transformations in this section are the standard toolkit for giving these features a shape the learning algorithm can use.
Linear models, neural networks, distance-based methods (kNN, k-means, SVMs with RBF), and gradient descent in general are all sensitive to the scale of features. Tree models are not. The three standard scaling transforms cover the waterfront. Standardisation (z-scoring) subtracts the mean and divides by the standard deviation, producing a feature with zero mean and unit variance; the workhorse when the feature is roughly symmetric. Min-max scaling maps the feature to [0, 1] via (x − min)/(max − min); useful when the downstream model assumes a bounded input (some neural architectures, some calibration methods). Robust scaling subtracts the median and divides by the interquartile range; the version to use when the data has outliers that would otherwise dominate the variance estimate. Apply them after a train–test split has been established; fit on training data only.
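All three scalers are one-liners in scikit-learn; a toy column with one outlier shows how they differ:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])  # one large outlier

z = StandardScaler().fit_transform(x)   # zero mean, unit variance
mm = MinMaxScaler().fit_transform(x)    # mapped into [0, 1]
rb = RobustScaler().fit_transform(x)    # median/IQR: outlier-resistant

print(mm.ravel())  # the outlier compresses the other values toward 0
print(rb.ravel())  # the non-outliers keep a usable spread
```

Note how the outlier dominates the min-max range, squashing the first three values together, while robust scaling leaves them well separated.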
When a feature is strongly right-skewed (income, session duration, file size, page views), scaling alone does not help — the top one percent of values still dominate the model's view. The canonical fix is a power transform. The log transform x → log(x + c) is the simplest and usually first to try; it turns a log-normal distribution into a normal one, and gracefully handles four-orders-of-magnitude ranges. Box–Cox (Box & Cox 1964) parameterises a family of transforms by a single parameter λ and selects the value that makes the result most nearly normal; restricted to strictly positive inputs. Yeo–Johnson (Yeo & Johnson 2000) generalises Box–Cox to real-valued inputs (positive and negative). In scikit-learn these are one-liners via PowerTransformer(method="yeo-johnson"). Power transforms are almost always worth trying on any feature with a ratio max/median above about twenty.
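A quick demonstration on synthetic log-normal data (the exact skew values depend on the random seed):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # heavy right tail

x_log = np.log1p(x)                                     # simplest fix first
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(x)

print(skew(x.ravel()), skew(x_yj.ravel()))  # skew collapses toward zero
```

The log transform already does most of the work here; Yeo–Johnson earns its keep when the feature has negative values or when you want the λ chosen for you.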
Sometimes the right answer is to throw away the continuous structure altogether. Binning replaces a numerical feature with a categorical one indicating which bucket the value falls into. Equal-width bins (1–10, 10–20, 20–30, …), equal-frequency bins (quantile-based), and supervised binning (where bin boundaries are chosen to maximise target information) all have their uses. Binning is useful when the relationship with the target is strongly non-monotonic, when the model is linear and cannot learn the non-linearity itself, or when you want to handle a heavy-tailed feature while preserving interpretability. The cost is that binning discards information within each bin; use it where the trade-off is favourable, not as a default.
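A sketch with scikit-learn's KBinsDiscretizer, using equal-frequency bins on a heavy-tailed toy column:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1.0], [2.0], [3.0], [10.0], [200.0], [5000.0]])

# Equal-frequency (quantile) bins, returned as ordinal bin indices.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
print(binner.fit_transform(x).ravel())  # each bin holds about two values
```

With `encode="onehot"` the same transformer emits the binary-indicator form a linear model would want; `strategy="uniform"` gives equal-width bins instead.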
Beyond log, the square root (x → √x) is useful for count data (mildly skewed), the reciprocal (x → 1/x) for latencies and rates (where differences matter in inverse terms), and winsorising (clipping at, say, the 1st and 99th percentiles) for bounded robustness against outliers. None of these changes the rank ordering of the data (the reciprocal reverses it, which is equivalent for splitting purposes), which is why tree models are indifferent to them; but for any linear or kernel model, the choice of monotonic transform is a genuine modelling choice.
Most ML algorithms consume numerical vectors, not strings. The translation from categorical values to numerical representations is the single most common feature-engineering task in tabular ML, and it is where the largest number of subtle mistakes are made.
The standard encoding for a categorical feature with k distinct values is the one-hot (or dummy-variable) representation: k binary columns, of which exactly one is 1 for each row. This is the representation that lets a linear model assign a separate coefficient to each category, and it is the representation scikit-learn's OneHotEncoder produces. Two practical subtleties. First, if the categorical is fed to a linear model with an intercept, drop one category (the drop="first" option) to avoid perfect collinearity between the intercept and the one-hots; tree models do not care. Second, if a category appears at inference time that was not in training, the encoder must have a plan — usually handle_unknown="ignore", which encodes unseen categories as all zeros. One-hot scales poorly when k grows: at a thousand categories you have a very sparse thousand-column matrix; at a million categories you need a different encoding entirely (Section 5).
When the categorical values have a natural order — low / medium / high, elementary / high-school / college / graduate, bronze / silver / gold — use ordinal encoding: replace each value with an integer reflecting the order. This is the right representation for tree models (which can split at the right threshold automatically) and for linear models where you expect the effect to be monotonic in the rank (a reasonable prior for rating scales). Do not use ordinal encoding on nominal categoricals with no natural order (colours, country codes, product categories); that would impose a fictitious ordering that most models will try to use, badly.
Label encoding assigns each distinct category an arbitrary integer. This is fine for tree models (which can carve up the integer axis however they like) and for the target column of a classification problem. It is a catastrophe for linear models and distance-based methods on feature inputs: "country=77" and "country=78" are adjacent on the integer axis but not semantically adjacent, so the model will learn a fictitious smoothness that generalises poorly. Scikit-learn's LabelEncoder documents this prominently; use it only for the target, or use OrdinalEncoder when you genuinely do have an order.
Frequency encoding replaces each category with the fraction of training rows in which it appears. This turns a categorical into a single numerical feature that a tree model can split on; popular values become one feature value, rare values another. It loses the within-group information but compresses k categories into one column, which is sometimes the right trade-off. Binary encoding first ordinal-encodes to integers and then expresses each integer in binary across log₂(k) columns; it preserves more category information than frequency encoding but less than one-hot, and is a reasonable middle ground for cardinalities in the low thousands. Both are available in the category_encoders Python package.
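Frequency encoding needs no special library at all; in pandas it is a `value_counts` and a `map` (toy data):

```python
import pandas as pd

df = pd.DataFrame({"city": ["nyc", "nyc", "sf", "la", "nyc", "sf"]})

# Fraction of training rows per category, then mapped back onto the column.
freq = df["city"].value_counts(normalize=True)  # nyc 0.5, sf 0.33, la 0.17
df["city_freq"] = df["city"].map(freq)
print(df)
```

At serving time the fitted `freq` Series is the stored state: map new rows through it, filling categories never seen in training with zero or with the minimum observed frequency.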
Customer IDs, zip codes, product SKUs, URL paths — any real-world ML problem has at least one feature with tens of thousands to tens of millions of distinct categories. One-hot is out of the question. This section is the toolkit that replaces it.
The most important high-cardinality technique is target encoding (also called mean encoding, likelihood encoding, or impact encoding). Replace each category with a summary statistic of the target computed within that category: for a regression target, the category's mean target value; for a binary target, the category's positive-class rate; for a multiclass target, the category's class-conditional probabilities. A single high-cardinality column becomes a single low-cardinality column carrying the most relevant signal — the target distribution conditional on the category. In production it routinely beats one-hot by enough to be worth the effort.
Raw target encoding has a serious problem: for categories with few observations, the within-category mean is noisy and can overfit badly. The standard fix is smoothing toward the global mean: the encoded value for category c is a weighted average of the category mean μ_c and the global mean μ, with the weight depending on the number of observations n_c in the category — typically w_c = n_c / (n_c + m) for a smoothing parameter m (the "equivalent sample size"). This is precisely the James–Stein / empirical-Bayes shrinkage estimator, and it dramatically improves out-of-sample behaviour. The category_encoders library exposes this as TargetEncoder(smoothing=...).
Target encoding has a second, subtler problem: because the encoded value for row i uses row i's target in the encoding computation, the feature trivially leaks the label. Train a model on target-encoded features computed naively and you will see absurd in-sample accuracy; deploy it and watch it collapse. The solutions are all variants of out-of-fold computation: use k-fold cross-validation to compute each row's encoding from a fold that excludes that row, or use leave-one-out target encoding, or use the CatBoost approach of ordered target encoding where each row's encoding uses only the rows that came before it in a random permutation. Every serious production deployment of target encoding uses one of these schemes. The naive implementation — "just compute the group-by mean on training data" — is one of the most common causes of leakage in classical ML.
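The two fixes combine naturally: out-of-fold computation against leakage, shrinkage against rare-category noise. A sketch — the function name, defaults, and toy data are ours, not a library's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, m=10.0, n_splits=5, seed=0):
    """Out-of-fold target encoding with shrinkage toward the global mean."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    enc = np.full(len(cat), np.nan)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in kf.split(cat):
        mu = y.iloc[tr].mean()                       # fold-local global mean
        grp = y.iloc[tr].groupby(cat.iloc[tr])
        mu_c, n_c = grp.mean(), grp.size()
        # w_c = n_c / (n_c + m): shrink rare categories toward mu.
        smoothed = (n_c * mu_c + m * mu) / (n_c + m)
        # Each row is encoded from folds that exclude it; unseen → mu.
        enc[va] = cat.iloc[va].map(smoothed).fillna(mu).to_numpy()
    return pd.Series(enc)

cat = ["a"] * 50 + ["b"] * 50
y = [1] * 40 + [0] * 10 + [0] * 45 + [1] * 5
enc = oof_target_encode(cat, y)
print(enc.head(3))  # "a" rows encode well above the global mean
```

No row's own label ever enters its encoding, so the in-sample optimism of the naive group-by-mean version disappears.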
For the highest-cardinality features (URL paths with billions of distinct values, IP addresses, free-text product titles), even target encoding becomes awkward. Two alternatives. Frequency encoding (Section 4) replaces each value with its count — cheap, low-leakage, loses the target conditional. Hashing (Section 11) maps each category to one of a fixed number of buckets via a hash function, accepting collisions in exchange for a bounded memory footprint.
The modern high-end solution is to learn a low-dimensional embedding for each category by backpropagation. Each category is mapped to a vector in a d-dimensional space (typically d = 8 to 64), and those vectors are trained jointly with the rest of the model. This is the mechanism behind entity embeddings (Guo & Berkhahn 2016, whose Rossmann Kaggle entry popularised the technique) and behind every modern recommender system's treatment of user and item IDs. In tabular deep learning this is the default; in tree models it is more awkward (you have to embed outside the tree), but tools like pytorch-tabular and the fastai tabular stack make it turnkey.
A linear model is linear in its features — not in the underlying phenomenon. If the right explanation involves a product of two variables (price × demand), a linear model can't find it unless you give it the product explicitly. Polynomial and interaction features are the canonical way to do that.
Consider predicting purchase probability from price and demand. The true relationship might be something like probability ∝ 1 − price/demand: the cheaper a product relative to demand, the more likely it sells. A linear model y = β₀ + β₁ price + β₂ demand cannot express this; it has no access to the ratio. A linear model with an interaction term y = β₀ + β₁ price + β₂ demand + β₃ (price × demand) still can't express a ratio exactly, but it can approximate it reasonably over a bounded range — and adding further interactions or polynomial terms tightens the approximation. This is the Taylor-expansion view of feature engineering: you're giving the linear model access to progressively higher-order terms of the true function.
Scikit-learn's PolynomialFeatures(degree=d) generates all polynomial combinations of the input features up to total degree d. With p input features and degree d, the output has C(p+d, d) features — quickly explosive: 50 features at degree 3 produces 23,426 polynomial features. The practical range is degree 2 on small feature sets, with interaction-only mode (interaction_only=True) suppressing the pure squared-and-cubed terms when only cross-terms are wanted. For large feature sets the explosion makes polynomial features impractical, and you either use a kernel method (Chapter 07, which computes a polynomial kernel implicitly without materialising the features) or you use a model that finds interactions automatically (trees, gradient boosting, neural networks).
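A worked example on the two-feature case from above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # two features: price, demand

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))
# columns: price, demand, price², price·demand, demand²
# → [[2. 3. 4. 6. 9.]]

inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(inter.fit_transform(X))  # → [[2. 3. 6.]]  cross-term only
```

The `interaction_only=True` variant is the one to reach for when the squared and cubed terms are not wanted, which keeps the column count down to the cross-terms.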
The quiet truth is that most of the gains from interaction features come from domain-driven ones, not exhaustive polynomial expansion. The ratio feature price / typical price for this customer; the product visit_count × average_order_value; the difference current_month_spend − last_month_spend — these are one-line features that routinely outperform degree-3 polynomial expansion on the same inputs. The reason is that they encode the structure a domain expert already knows: the data has ratios, products, and differences that matter, and the model saves the work of finding them. Which interactions to try is a question to take to the subject-matter expert, not to the grid search.
Tree-based models — random forests, gradient-boosted trees — discover interactions automatically. A tree that splits on price and then on demand in a child node is effectively using an interaction between the two. This is a large part of why GBMs dominate tabular ML: they don't need you to specify the polynomial basis, because they build a piecewise-constant approximation to whatever function fits. Explicit interaction features help GBMs too (they shorten the paths the tree has to traverse), but the marginal value is much smaller than for linear models. Reserve heavy interaction engineering for linear and kernel pipelines; rely on the tree otherwise.
Splines deserve a final mention as the linear-modelling answer to single-feature non-linearity: represent the feature with a handful of smooth basis functions and let the linear model weight them, capturing curvature without any interaction machinery. SplineTransformer in scikit-learn and the spline-based GAM literature (Hastie & Tibshirani 1990) are the standard tools.
A raw timestamp — 2026-04-17 14:23:01 — is useless to most ML models. Decomposed into its components, it is routinely the most predictive feature in the dataset. The engineering is cheap, the win is reliable, and practitioners who skip this step leave large amounts of accuracy on the table.
Extract year, month, day, day-of-week, hour, minute, is_weekend, is_month_start, is_month_end, is_quarter_start, week-of-year, day-of-year, and any holiday flags relevant to the business (New Year's, Black Friday, regional holidays). These take one line each with pandas's .dt accessor. Each one exposes a different kind of seasonality: day-of-week catches weekly cycles, month catches seasonal ones, hour catches intraday patterns, holiday flags catch the human-calendar shocks. A single datetime column can legitimately produce fifteen to twenty engineered features that together capture the temporal structure the model needs.
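A sketch of the decomposition with the `.dt` accessor (toy timestamps; holiday flags would come from a calendar table or the `holidays` package):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2026-04-17 14:23:01",
                                         "2026-12-25 09:00:00"])})

df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["dow"] = df["ts"].dt.dayofweek            # Monday = 0
df["hour"] = df["ts"].dt.hour
df["is_weekend"] = df["dow"] >= 5
df["is_month_start"] = df["ts"].dt.is_month_start
print(df.drop(columns="ts"))
```

Each line is a new feature; the full fifteen-to-twenty-feature expansion is the same pattern repeated.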
Components like hour and day-of-week are cyclical: 23:59 is close to 00:01, and Sunday is close to Monday. Naively encoding hour as an integer 0–23 imposes a fake linear structure where the model thinks 00:00 and 23:00 are maximally far apart. The fix is cyclical encoding: replace the integer with two features, sin(2π × hour / 24) and cos(2π × hour / 24). The pair traces a circle in the plane, and distances in that plane reflect the true cyclical distance. Do the same for day-of-week (period 7), month (period 12), day-of-year (period 365.25). For tree models this matters less (the tree can split on the integer), but for any linear, kernel, or neural model, cyclical encoding is the right default.
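The encoding and the distance claim, concretely:

```python
import numpy as np

hour = np.array([0, 6, 12, 23])

hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)

# Distance in the (sin, cos) plane reflects true cyclical distance.
def circ_dist(i, j):
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])

print(circ_dist(0, 3))  # 00:00 vs 23:00 — small, as it should be
print(circ_dist(0, 2))  # 00:00 vs 12:00 — the maximum, 2.0
```

On the raw integer axis, hours 0 and 23 are the farthest pair; on the circle they are nearly adjacent, which is the behaviour a linear or distance-based model needs.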
Beyond decomposition, elapsed times are enormously useful: days since last purchase, days since signup, days until next scheduled event. These are easy to compute from the raw datetime and tend to be highly predictive in customer-lifetime-value and churn problems. Lag features — the value of some quantity at time t − k — are the feature-engineering workhorse of time-series forecasting. Rolling-window statistics — rolling mean, rolling standard deviation, rolling max over the last k time steps — extend lags into smoothed-history features. The library tsfresh automates a huge inventory of these time-series features (Christ et al. 2018, Time Series Feature Extraction on basis of Scalable Hypothesis tests) and is a standard tool in the time-series feature-engineering toolkit.
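Lags and rolling windows are a few lines in pandas (toy daily series; note the `shift` before the rolling mean, so the window covers only past values and cannot include the current one):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 15, 14, 18],
              index=pd.date_range("2026-01-01", periods=6, freq="D"))

feats = pd.DataFrame({
    "lag_1": s.shift(1),                          # value at t − 1
    "lag_7": s.shift(7),                          # one week ago (all NaN here)
    "roll_mean_3": s.shift(1).rolling(3).mean(),  # mean of the 3 previous days
})
print(feats)
```

The NaNs at the start of each column are the honest price of a point-in-time feature: those rows simply do not have enough history yet.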
Two ubiquitous pitfalls. First, time zones: if your timestamps mix UTC and local time without documentation, datetime features will be systematically biased. Normalise to UTC at ingest, and compute any "user local time" features explicitly using the user's known timezone. Second, point-in-time correctness: when you build a feature like "user's total purchases so far", compute it with the knowledge available as of the feature's timestamp — not the full-dataset aggregate. Getting this wrong causes a classic leak: the "user's total purchases" computed over the full dataset includes future purchases that would not be known at the feature's moment, and the model trains on that future information without noticing.
Free-form text — reviews, support tickets, product descriptions, tweets, medical notes — is among the most common non-numeric feature types. Before large language models displaced them, the classical feature-engineering approaches in this section were the state of the art for a quarter-century; and they remain the right tool when the task is small, the labels are few, or the inference budget is tight.
The simplest text representation is the bag-of-words (BoW): build a vocabulary of the k most frequent tokens in the training corpus, and represent each document as a k-dimensional vector whose entries are the counts of each vocabulary token in the document. The representation throws away word order — hence "bag" — but preserves enough to support classification, topic modelling, and many retrieval tasks. Scikit-learn's CountVectorizer produces BoW vectors in one line, with standard options for lowercasing, stop-word removal, and minimum/maximum document-frequency filtering. For most problems you want to cap the vocabulary at 10,000 to 100,000 tokens; beyond that the rare-word tail is mostly noise.
Term frequency–inverse document frequency reweights BoW so that common words (which carry little discriminative power) get down-weighted and rare words (which distinguish documents) get up-weighted. TF, the term frequency, is the count in the document (possibly log-scaled); IDF, the inverse document frequency, is log(N/df(t)) where N is the number of documents and df(t) the number containing term t. The product weights rare terms much more heavily. TF-IDF is the standard input to a linear text classifier and will beat raw counts on essentially every text classification task. Scikit-learn's TfidfVectorizer fuses the two steps.
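A three-document toy corpus shows the fitted vocabulary and the resulting matrix shape:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

vec = TfidfVectorizer()        # tokenise, count, and reweight in one step
X = vec.fit_transform(docs)    # sparse (n_documents × vocabulary) matrix

print(X.shape)
print(sorted(vec.vocabulary_)) # the fitted vocabulary, one column per term
```

The vocabulary is the fitted state of this transformer: at serving time, `vec.transform` maps new documents into the same columns, and tokens outside the training vocabulary are simply dropped.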
N-grams extend BoW by including consecutive token sequences of length n: "machine learning" becomes a single feature alongside "machine" and "learning". Unigrams plus bigrams (ngram_range=(1, 2)) is a common default that captures enough short-range word order to lift most classification tasks. Character n-grams — sliding windows of characters rather than tokens — are the right representation for morphologically rich languages, noisy social-media text, and misspelling-robust classification. Character 3-grams through 5-grams will outperform word-level features on tasks where spelling varies, capitalisation is unreliable, or the domain has many out-of-vocabulary tokens.
BoW and TF-IDF both require building a vocabulary from the training corpus — expensive in memory and awkward when the corpus is streaming or distributed. The hashing vectorizer (Weinberger et al. 2009, Feature Hashing for Large Scale Multitask Learning) sidesteps this: hash each token into one of k buckets, and represent the document as the count (or TF-IDF-weighted count) in each bucket. No vocabulary required; the mapping is fixed; the memory is bounded. The cost is collisions between unrelated tokens, but at k = 2²⁰ or so, collisions are rare enough to be negligible for most classification tasks. Scikit-learn's HashingVectorizer is the standard implementation; it is the right choice when the training corpus is too large to fit a vocabulary, or when the same model must accept tokens it has never seen before.
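The contrast with the vocabulary-based vectorizers is visible in the API itself — there is nothing to fit:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step, no stored vocabulary: the token→column map is a fixed hash.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)

X = vec.transform(["completely unseen tokens still hash to columns"])
print(X.shape)  # (1, 1024) regardless of what the corpus contains
```

The small `n_features` here is for illustration; in practice 2²⁰ or more keeps collisions negligible, as the text notes.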
Real datasets are full of missing values. Before they reach most ML algorithms they must be filled in, dropped, or explicitly flagged — and the choice between those options is both a statistical decision and a feature-engineering one.
The statistical taxonomy of missingness, due to Rubin (1976), cuts the problem three ways. Missing completely at random (MCAR): the probability of missingness does not depend on any value, observed or unobserved. The easy case — any reasonable imputation is unbiased. Missing at random (MAR): missingness depends on observed values but not on the missing value itself (e.g., older customers more often have unreported income, but conditional on age the unreported incomes are distributed like the reported ones). Model-based imputation conditional on the observed features works. Missing not at random (MNAR): missingness depends on the missing value itself (high earners refuse to report income). The hardest case — no unbiased imputation exists without auxiliary assumptions, and the missingness pattern is itself informative.
The workhorse approach is mean imputation for continuous features, median imputation for skewed ones, and mode (most-frequent) imputation for categoricals. Scikit-learn's SimpleImputer does all three in one line. Simple imputation is a blunt instrument — it biases variance downward, distorts multivariate distributions, and can damage downstream inference — but it is fast, stable, and usually an acceptable baseline. Constant imputation (filling with a sentinel value like −999 or the string "missing") is a reasonable alternative for tree models, which can split on the sentinel and effectively treat missingness as its own category.
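A one-liner in practice; the tiny matrix here is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 100.0],
              [2.0, np.nan],
              [np.nan, 300.0]])

# strategy can be "mean", "median", "most_frequent", or "constant".
imp = SimpleImputer(strategy="median")
X_filled = imp.fit_transform(X)   # fit learns one median per column
```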
More sophisticated approaches treat imputation as a supervised-learning problem: fit a model to predict each feature from the others, and use the prediction to fill in missing values. KNN imputation (KNNImputer) fills each missing entry with an average from the k nearest rows; simple and often effective. Iterative imputation (the MICE algorithm, van Buuren & Groothuis-Oudshoorn 2011; scikit-learn's IterativeImputer) cycles through features, imputing each one conditional on the rest using a regression model, and iterates until convergence. For small-to-medium datasets iterative imputation typically beats simple imputation by a meaningful margin. For very large datasets it becomes computationally awkward and practitioners fall back to simple methods or handle missingness directly in the model.
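A sketch on synthetic data where one column is a noisy copy of another, so the iterative imputer can learn the relationship:

```python
import numpy as np
# IterativeImputer is still behind an experimental flag in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)   # column 2 tracks column 0
X_miss = X.copy()
X_miss[::10, 2] = np.nan                          # knock out 10% of column 2

knn_filled = KNNImputer(n_neighbors=5).fit_transform(X_miss)
mice_filled = IterativeImputer(random_state=0).fit_transform(X_miss)

# The iterative imputer learns the column-2 ~ column-0 regression, so its
# fills land close to the true (deleted) values.
err = np.abs(mice_filled[::10, 2] - X[::10, 2]).mean()
```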
Regardless of which imputation method you use, add a binary indicator for each feature that had missing values in training. X_missing = X.isna().astype(int). This gives the model a chance to learn that "this feature was missing" is itself predictive — the MNAR case, where missingness carries information. In practice missingness indicators are often among the most important features in a churn or fraud model: the customers who don't fill in a form are systematically different from the ones who do. Missingness indicators are almost free to compute and almost always worth including.
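SimpleImputer can emit the indicator columns itself via add_indicator=True, which keeps the flag computation inside the fitted transformer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# add_indicator=True appends one binary "was missing" column per feature
# that had missing values in training.
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)   # columns: [imputed value, missing flag]
```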
Modern gradient-boosting libraries (XGBoost, LightGBM, CatBoost) handle missing values natively: at each split, they send missing values down the branch that minimises training loss. This is effectively learned imputation, and on tabular problems it routinely outperforms any explicit imputation step. If your model is a gradient-boosted tree, often the right pipeline is to do no imputation, let the model handle it, and compare against imputation-based alternatives. For neural networks and linear models, explicit imputation is still required.
Whichever imputer you choose, wrap it in a Pipeline or ColumnTransformer: cross-validation then fits the imputer separately on each fold's training portion, which is exactly what you want.
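A minimal sketch on synthetic data; with the imputer inside the Pipeline, cross_val_score re-fits it per fold automatically:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan          # ~10% missing at random
y = rng.integers(0, 2, 100)

# Each CV fold re-fits the imputer and scaler on its own training rows,
# so no held-out statistics leak into the features.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
```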
Outliers — values far from the rest of the distribution — distort feature scales, destabilise linear models, and sometimes carry the most important signal in the dataset. The engineering challenge is distinguishing the two cases.
Three standard detection approaches cover most cases. The z-score rule flags points with |z| > 3 (or some other threshold); fine for approximately normal features, fragile for heavy-tailed ones. The interquartile range (IQR) rule flags points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR; much more robust for skewed distributions, and the basis of the boxplot whisker. Model-based approaches — isolation forest (Liu, Ting, Zhou 2008), one-class SVM (met in Chapter 07), local outlier factor — are worth reaching for when features have non-Gaussian multivariate structure that univariate methods miss.
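The IQR rule is a few lines of NumPy; the toy vector here is illustrative:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])
mask = iqr_outliers(x)   # flags only the 100.0
```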
Once detected, outliers can be winsorised (replaced with the p-th percentile value, typically 1st and 99th) or clipped (bounded at a fixed threshold). Both truncate the tails without dropping rows; both are much safer than removal when the row otherwise contains valid information. Winsorising is a one-line operation in pandas, and is particularly effective when fed into linear models that would otherwise have their coefficient estimates dominated by a handful of extreme observations.
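In pandas the whole operation is a quantile lookup plus a clip; the data here is a synthetic Gaussian with one planted outlier:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.r_[rng.normal(0.0, 1.0, 99), 50.0])   # one wild outlier

# Winsorise at the 1st/99th percentiles: tail values are replaced with
# the percentile values; no rows are dropped.
lo, hi = s.quantile([0.01, 0.99])
s_wins = s.clip(lower=lo, upper=hi)
```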
Section 3 mentioned robust scaling — subtract the median, divide by the IQR — as an alternative to standardisation. In the presence of outliers, robust scaling is strictly better: extreme values do not distort the scale parameters that govern the rest of the data. Combined with winsorising, it gives a linear model a feature distribution that is approximately Gaussian in its bulk, with the tails capped at a manageable range.
Crucially, sometimes outliers are exactly what you want to capture. In fraud detection, the fraudulent transactions are the outliers; clipping them away removes the signal. In anomaly detection and novelty detection, outliers are the entire point. In scientific data (particle-physics events, astronomical transients), the interesting discoveries are the ones that do not fit the model. The decision of whether to treat outliers as noise or signal is a modelling choice that depends entirely on the task. The clipping heuristic in Section 10 is for tasks where the outliers are genuinely data-quality problems or measurement errors — not for tasks where they are what you are trying to find.
When the feature space is too large to hold a vocabulary in memory — billions of URLs, hundreds of millions of user agents, an open-ended set of product IDs — the hashing trick replaces the vocabulary with a fixed-width hash function. It is the only realistic encoding for genuinely large categorical spaces.
Choose a hash function h (a fast non-cryptographic one — MurmurHash3, FNV, xxHash) and a bucket count k (typically a power of 2 between 2¹⁶ and 2²²). Map each categorical value v to bucket h(v) mod k. Represent a row as the k-dimensional sparse vector counting how many values hashed to each bucket. The mapping is stateless — no training-time vocabulary to build, no inference-time dictionary to load. New categories at inference time are hashed just like old ones. The entire representation fits in O(k) memory regardless of how many distinct values the feature takes.
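For categorical rows (rather than text), scikit-learn's FeatureHasher implements exactly this; the IDs below are invented:

```python
from sklearn.feature_extraction import FeatureHasher

# 2**18 buckets; each row is a list of category strings.  alternate_sign=True
# (the default) is the signed variant that lets collisions cancel on average.
h = FeatureHasher(n_features=2**18, input_type="string")
X = h.transform([["user_12345", "city_london"],
                 ["user_99999", "city_paris"]])   # stateless: no fit needed
```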
Two different values can hash to the same bucket. In a linear model this means the coefficient for that bucket sums the contributions of all colliding values — which, for unrelated values, is noise. The literature (Weinberger, Dasgupta, Langford, Smola, Attenberg 2009) shows that with k ≫ n (bucket count much greater than the number of active values per example), collisions are rare enough to cost only a small amount of accuracy. A signed variant of the hashing trick uses a second hash to randomise the sign of each feature — this makes collisions cancel on average rather than add, reducing bias at no additional cost.
Feature hashing is the representation of choice for: click-through-rate models at ad networks (where the feature space is the cartesian product of tens of categorical fields, each with millions of values); online learning with streaming data (no vocabulary to maintain); and multi-tenant ML systems that must accept arbitrary new categorical features without schema changes. Vowpal Wabbit (developed at Yahoo! Research and later Microsoft Research) is built around hashing-based feature representation; scikit-learn's HashingVectorizer and FeatureHasher bring the same idea to the Python ecosystem. For most tabular problems with k under 100,000 categories, hashing is overkill and direct target-encoding or one-hot is cleaner. Above that threshold, hashing becomes the only feasible option.
The engineering cost of hashing is interpretability. "Feature 17,342 is important" is not a human-readable statement. If you need to explain what drove a model's prediction, hashed features require reverse-mapping: iterating through candidate values and checking which hash to the bucket of interest. That cost is usually acceptable for production ML systems but sometimes rules out hashing in regulated domains (lending, healthcare) where per-feature attribution is required.
With features engineered, the next question is which to keep. Feature selection is almost always worthwhile: removing uninformative features reduces overfitting, training time, serving cost, and the surface area for data drift. The methods split into three canonical families.
Four reasons feature selection pays off. First, accuracy: on small datasets, uninformative features are pure noise and hurt generalisation — even models with regularisation don't fully recover. Second, speed: fewer features means faster training and faster inference, often by a large factor. Third, interpretability: a model with ten features is easier to explain than one with a thousand. Fourth, maintenance: every feature in production is a feature that can go stale, drift, break, or suffer upstream schema changes. Selection is the activity of paying for the features that earn their keep and dropping the rest.
Guyon & Elisseeff's 2003 introduction to feature selection (An Introduction to Variable and Feature Selection, JMLR) codified the now-standard three-family taxonomy. Filter methods rank features by a statistic computed independently of any model — correlation with the target, mutual information, chi-square. They are fast, model-agnostic, and ignore interactions between features. Wrapper methods treat selection as a search over feature subsets and evaluate each subset by training a model on it. They capture interactions but are expensive and prone to overfitting the selection process. Embedded methods perform selection as a side effect of model fitting — L1-regularised regression zeros out coefficients automatically, decision trees score features by how often they are used — giving most of the wrapper benefit at a fraction of the cost.
A cross-cutting distinction: univariate selection methods look at each feature in isolation (Pearson correlation, mutual information, ANOVA F-test); multivariate methods consider features in combination (RFE, Lasso, mRMR). Univariate methods miss features that are uninformative alone but useful in combination (the classic XOR example: two features that together perfectly predict the target but are individually uncorrelated with it). Multivariate methods catch these but are more expensive and harder to analyse. A reasonable default is to start with a univariate filter for speed and follow with a multivariate embedded or wrapper method for precision.
Feature selection is not always worth the trouble. Gradient-boosted trees are robust to redundant and even uninformative features; large datasets often absorb extra noise without generalisation loss; and regularisation accomplishes much of the same goal inside the model itself. In the deep-learning era, selection has largely moved inside the model: the network learns which features to attend to. The techniques in Sections 13–15 remain important for tabular linear, kernel, and small-tree pipelines; they matter less for large-scale GBMs and practically not at all for large neural networks.
As with imputation, put the selection step inside a Pipeline so it is re-fitted on each fold's training portion, or use nested cross-validation where the outer fold's test data is entirely excluded from the selection process.
Filter methods score each feature by a statistic of its relationship to the target, independent of any downstream model. They are the fast, scalable, first-pass tool of feature selection.
For continuous features and a continuous target, the obvious statistic is the Pearson correlation coefficient. Rank features by |r|, keep the top k. Quick, cheap, and effective when the relationships are approximately linear and monotonic. Spearman correlation (rank-based) captures any monotonic relationship, not just linear ones, and is the safer default when features have heavy tails. For continuous features and a categorical target, the ANOVA F-statistic — the ratio of between-class variance to within-class variance — plays the same role: higher F means the feature's class-conditional means are more separated, so the feature is more informative. Scikit-learn's f_classif and f_regression compute these directly.
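SelectKBest wires any of these statistics into the standard transformer API; the synthetic data below plants signal in one feature only:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
X = rng.normal(size=(300, 10))
X[:, 0] += 2.0 * y                 # feature 0 is informative; the rest are noise

# Rank all features by ANOVA F-statistic and keep the top three.
sel = SelectKBest(score_func=f_classif, k=3).fit(X, y)
chosen = sel.get_support(indices=True)
```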
For categorical features and a categorical target, the chi-square statistic on the contingency table tests whether feature and target are independent. Features with large chi-square values are more informative. Scikit-learn exposes this as chi2, with the usual caveat that the statistic requires non-negative features (frequency counts are fine; centred or negative values are not).
The most general filter statistic is mutual information — the reduction in uncertainty about the target induced by knowing the feature. Unlike correlation, MI captures arbitrary non-linear relationships; unlike chi-square, it applies cleanly to mixed continuous–categorical cases. Scikit-learn's mutual_info_classif and mutual_info_regression use nearest-neighbour entropy estimators (Kraskov, Stögbauer, Grassberger 2004) that work reasonably well in practice. MI is the right filter statistic when you don't know in advance what shape the relationship takes.
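A sketch of the advantage over correlation: a quadratic (non-monotonic) relationship that Pearson r misses entirely but mutual information catches.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(1000, 1))
y = x[:, 0] ** 2 + 0.1 * rng.normal(size=1000)   # non-monotonic relationship

mi = mutual_info_regression(x, y, random_state=0)  # clearly positive
r = np.corrcoef(x[:, 0], y)[0, 1]                  # Pearson r is near zero
```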
Univariate filters ignore an important consideration: redundancy between features. Two features that are each highly correlated with the target and with each other contribute less additional information than one of them alone. mRMR (Peng, Long, Ding 2005) formalises this as a two-criterion problem: maximise relevance (mutual information with the target) while minimising redundancy (mutual information with the already-selected features). The greedy algorithm selects features one at a time, at each step choosing the feature that maximises relevance − redundancy. mRMR remains a standard workhorse in bioinformatics and any problem where feature costs are high and the chosen subset must be small.
Every filter statistic has a bias that a practitioner ought to know. Correlation misses non-monotonic relationships. Chi-square requires positive features. Mutual information estimators are noisy for small samples. ANOVA assumes approximate normality and equal variances within groups. And all univariate filters miss interaction effects — the XOR example where neither feature alone correlates with the target but their product does. Treat filters as a cheap first-pass that flags features worth further attention, not as a decisive selection criterion.
Wrapper methods treat feature selection as a search over the 2ᵖ possible feature subsets, evaluating each candidate subset by training a model on it. They are more accurate than filters — they account for feature interactions — but also more expensive and more prone to selection bias.
The two simplest wrapper algorithms are forward selection (start with no features; at each step add the feature whose inclusion most improves cross-validated score; stop when no addition helps) and backward elimination (start with all features; at each step remove the feature whose removal most improves score). Both are greedy — they don't consider larger swaps — but on typical tabular problems they produce reasonable subsets in polynomial rather than exponential time. Scikit-learn's SequentialFeatureSelector implements both.
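A forward-selection sketch on synthetic data with two planted informative features:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 8))
X[:, 0] += 1.5 * y                 # two informative features among eight
X[:, 1] -= 1.5 * y

# Greedy forward search: at each step, add the feature that most
# improves the cross-validated score.
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2, direction="forward", cv=3
).fit(X, y)
chosen = set(sfs.get_support(indices=True))
```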
The most widely used wrapper in practice is recursive feature elimination (Guyon, Weston, Barnhill, Vapnik 2002, Gene Selection for Cancer Classification Using Support Vector Machines). Fit a model to all features, rank features by the model's own importance metric (coefficient magnitude for a linear model, feature importance for a tree), remove the bottom k%, and repeat until a target number of features remains. RFE is most naturally paired with linear SVMs or L2-regularised logistic regression, where coefficient magnitudes are meaningful, and with random forests or gradient boosters, which provide a feature-importance signal for free. Scikit-learn's RFE and RFECV (the latter chooses the feature count by cross-validation) are the standard implementations.
For problems where the right subset is unlikely to be reachable by greedy steps — non-separable feature interactions, multiple local optima in the subset space — stochastic-search wrappers can help. A genetic algorithm treats each feature subset as a binary chromosome, evaluates fitness by cross-validated model score, and uses crossover and mutation to explore. Simulated annealing does the analogous local-search walk with a temperature schedule that flattens acceptance probabilities. These methods are slow, hyperparameter-laden, and rarely worth it for tabular problems with p in the hundreds; they come into their own in genomics and other applications where p is in the thousands and the subset size is tightly constrained.
Wrapper methods that use the same data to both select features and evaluate the selection will overstate the resulting model's accuracy. The fix is nested cross-validation: an inner loop selects features on each training fold, an outer loop evaluates the selected features on that fold's held-out data. This is expensive — the computation cost multiplies — but it is the only way to get an honest estimate of how the selection+model pipeline will perform on new data. Skipping this step is one of the more common causes of "my ML worked great in the lab and died in production" stories.
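The bias is easy to demonstrate on pure noise; a fast filter stands in for the wrapper here to keep it cheap, but the mechanism is identical:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))    # pure noise: honest accuracy is ~50%
y = rng.integers(0, 2, 100)

# WRONG: selection sees every fold's labels before cross-validation runs,
# so the score is inflated.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# RIGHT: selection happens inside each fold, on that fold's training rows.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("model", LogisticRegression())])
honest = cross_val_score(pipe, X, y, cv=5).mean()
```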
Embedded methods do feature selection as a side effect of model fitting. You pay almost nothing extra — the selection comes for free from a model you were going to fit anyway — and you get most of the accuracy benefit of a wrapper without the computational cost.
The archetypal embedded method is Lasso (Tibshirani 1996): add an ℓ₁ penalty on the coefficient vector to a linear-model loss. The geometry of the ℓ₁ ball means the optimum typically lands on a corner of the ball, setting many coefficients exactly to zero rather than merely shrinking them. A Lasso-regularised linear regression produces a sparse coefficient vector, and the non-zero coefficients are the selected features. The regularisation strength α is the sole hyperparameter that controls the aggressiveness of the selection; larger α means fewer features retained. Elastic net (Zou & Hastie 2005) combines ℓ₁ and ℓ₂ penalties to handle groups of correlated features (Lasso alone tends to arbitrarily pick one of a correlated pair; elastic net includes both). Both are workhorses. Scikit-learn's Lasso, LassoCV, and ElasticNet are the standard APIs.
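A minimal sketch: two true coefficients among twenty features, and the ℓ₁ penalty zeroes out the rest. The data and alpha are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=200)

# The l1 penalty drives the 18 noise coefficients exactly to zero;
# the surviving non-zero coefficients are the selected features.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```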
Random forests and gradient-boosted trees produce a feature-importance score as a byproduct of training. The classical score is mean decrease in impurity (MDI): for each feature, sum the impurity reductions at splits that use that feature, averaged across trees. MDI is fast and free but has a well-known bias toward high-cardinality features (a categorical with many unique values has more opportunities to split). The more reliable alternative is permutation importance: for each feature, measure how much the model's score drops when that feature's values are randomly shuffled. Permutation importance is model-agnostic, has no cardinality bias, and is the modern default. Scikit-learn exposes it as permutation_importance.
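A permutation-importance sketch on synthetic data where only one of five features carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
X = rng.normal(size=(400, 5))
X[:, 0] += 1.5 * y                 # only feature 0 carries signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Score drop on held-out data when each column is shuffled in turn.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
```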
The state of the art in model-specific feature attribution is SHAP (Lundberg & Lee 2017, A Unified Approach to Interpreting Model Predictions). SHAP values are the game-theoretic Shapley values applied to features, decomposing each prediction into feature contributions that sum to the prediction's deviation from the mean. Mean absolute SHAP values across a dataset give a principled feature-importance ranking that respects feature interactions and is consistent in a well-defined axiomatic sense. The shap library provides fast computations for tree models (TreeSHAP, an exact polynomial-time algorithm for ensemble trees), kernel-based approximations for arbitrary models, and deep-network-specific variants. SHAP has effectively replaced MDI and permutation importance as the default in serious production ML reporting.
Boruta (Kursa & Rudnicki 2010) combines the embedded idea with a statistical test. It creates shadow features — shuffled copies of each original feature — and fits a random forest on the augmented dataset. A real feature is retained if its importance is significantly greater (in a repeated-measures sense across forest iterations) than the maximum importance among the shadow features. Boruta answers the "all-relevant" feature-selection question — which features carry any useful information at all — rather than the "minimal-optimal" question that Lasso and RFE answer (the smallest subset that gives the best model). The two questions are different, and Boruta is the workhorse for the all-relevant case, particularly in biology and medicine.
Data leakage is the single most common cause of ML models that look great offline and fail in production. The offender is almost always in the feature-engineering pipeline, not the model. This section catalogues the traps and the defences.
Data leakage occurs when information from outside the training window enters the training features — most commonly, information from the future or information derived from the labels. The result is a model that predicts accurately on historical evaluation data (because the leaked information is essentially the answer) and fails on new data (where that information is not available). Kaufman, Rosset, Perlich, Stitelman 2012 (Leakage in Data Mining) is the classic reference; their worked examples, taken from real data-mining competitions, are a sobering reminder of how easy leakage is to commit accidentally.
The most direct form: a feature whose value is computed using the target itself. Classic examples — the amount_due column in a credit-default dataset that is only populated after default is known; the last_login column in a churn dataset that is updated only when the customer comes back; the was_refunded flag in a fraud dataset. Each of these perfectly predicts the target in training because the target determines the feature's value, but none of them is available at the moment prediction must be made. Target leakage is usually obvious in retrospect but subtle at the time: the feature just looks too good. Healthy skepticism toward any single feature that achieves near-perfect accuracy is the first line of defence.
In time-series or streaming data, a feature computed using future information is a leak. "Rolling 30-day average of revenue" computed over the full dataset includes rows from the future; the correct version uses only rows up to but not including the target row. "Days since last purchase" computed from the full dataset sees purchases that haven't happened yet. The defence is point-in-time correctness: every feature must be computable from information known as of the feature's timestamp. Feature stores enforce this by construction via their point-in-time join APIs; ad hoc pipelines must enforce it by discipline.
Computing any statistic (mean, median, quantile, imputation fill value) on the combined train+test set and then applying it to training is a leak: the test set has influenced the training features. The defence is simple — fit on training, apply to test — but the bug is common because the offending line is often innocent-looking (df['col'].fillna(df['col'].mean()) looks fine unless you notice the mean is over the full dataframe). The cleanest prevention is the Pipeline pattern, which re-fits every transformation on each cross-validation fold's training portion.
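The bug and the fix, side by side on a four-row toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, 2.0, np.nan, 100.0]})
train, test = df.iloc[:3], df.iloc[3:]

# Leak: the fill value is computed over the full frame, test row included.
leaky_fill = df["col"].mean()            # (1 + 2 + 100) / 3

# Correct: fit the statistic on the training rows, apply it everywhere.
fill = train["col"].mean()               # (1 + 2) / 2
train_filled = train["col"].fillna(fill)
test_filled = test["col"].fillna(fill)
```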
When rows are not independent — multiple rows per customer, multiple measurements per patient, multiple sensor readings per device — a random train–test split puts the same entity on both sides, and a feature that identifies the entity can leak. "This customer's historical default rate" is a useful feature in churn prediction, but if the same customer appears in both training and test, the model simply memorises customer identity rather than learning general patterns. Group-aware splitting (scikit-learn's GroupKFold, GroupShuffleSplit) is the fix; it also matters for time-series (TimeSeriesSplit).
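A sketch of group-aware splitting; the customer IDs are invented:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
groups = np.array(["cust_a", "cust_a", "cust_b", "cust_b", "cust_c", "cust_c"])

# Every row belonging to a customer lands on the same side of each split.
gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, groups=groups))
for train_idx, test_idx in splits:
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```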
The techniques in this chapter are individually simple. The skill is in assembling them into a feature-engineering workflow that produces maintainable, reproducible, leakage-free features at production scale.
Productive feature engineering is iterative. Start with a baseline model on the raw features (or the smallest reasonable set). Build a tight evaluation harness — cross-validated score, a single-command benchmark, tracking of each feature's contribution. Then iterate: generate candidate features one at a time, retrain, measure impact, keep if helpful. Two signals matter beyond raw score: variance across folds (a feature that helps on average but has high variance across folds is often capturing noise) and correlation with existing features (a new feature that is nearly identical to one already in the set adds little). Stop when the marginal score gain from each new feature falls below some threshold, or when the model hits a diminishing-returns plateau.
Hand-crafted feature engineering is slow. Automated feature engineering (AutoFE) applies a fixed library of operations — arithmetic combinations, aggregations over time windows or group keys, common transforms — to generate many candidate features at once, then uses selection to pick the survivors. The main library in the open-source Python ecosystem is featuretools (Kanter & Veeramachaneni 2015), which defines Deep Feature Synthesis: systematic generation of features by stacking aggregation and transformation primitives across relational tables. AutoML platforms (H2O Driverless AI, DataRobot, Google Cloud AutoML) include AutoFE as a core capability. AutoFE's accuracy rarely matches expert manual feature engineering, but it gets most of the way there with a fraction of the labour — a good match for teams with limited ML expertise or very large feature spaces.
At scale, features become a shared asset across many models. A feature store (Feast, Tecton, Hopsworks, Databricks Feature Store, AWS SageMaker Feature Store) is a managed system that stores feature values, serves them online with low latency, provides point-in-time-correct offline joins for training, and enforces consistency between the two. The feature-store abstraction — a feature is defined once, computed once, and consumed everywhere — is the mature pattern for organisations with more than a handful of ML models. It also centralises the enforcement of the discipline around training–serving skew and leakage that this chapter has been belabouring.
Features have meaning that is not captured in the column name. days_since_last_purchase is ambiguous until you know whether it counts only completed purchases or also abandoned cart transactions, whether it is computed at the moment of prediction or at some lag, whether "purchase" includes refunded ones. A production feature pipeline needs documentation — the feature definition, the computation path, the business meaning, the expected distribution, and the owner. And it needs versioning: the feature definition evolves over time, and the model must know which version it was trained on. Feature stores bake this in; ad-hoc pipelines must do it by convention.
The single most valuable test in a production feature pipeline: compute the features at both training time and inference time on the same input data, and assert the results are identical. This one test catches the entire class of bugs where the offline and online computation paths have drifted apart. The corollary practice is feature monitoring: track the feature distributions at inference time and alert on drift from the training distributions. Most production ML failures announce themselves as feature-distribution drift before they become prediction-quality failures; monitoring gives you a chance to respond in advance.
Deep learning's headline promise was "learn features from data; no manual feature engineering required." That promise held on images, text, speech, and molecular graphs — and it held badly on tabular data, where classical feature engineering continues to win. The modern picture is a more nuanced division of labour, where hand-engineered, learned, and retrieval-augmented features all play complementary roles.
Deep learning displaced hand-engineered features in three domains: images (where convolutional networks learn edge and texture detectors from raw pixels, ending the decades-long SIFT/HOG/GIST tradition), speech (where end-to-end acoustic models displaced MFCC and filterbank features), and text (where pretrained transformer embeddings now subsume TF-IDF and n-gram representations on most tasks). In each case, the learned features were dramatically better than the engineered ones, and the relevant feature-engineering literature is now mostly historical. The pattern is that learned features win when data is plentiful, the signal is buried deep in multi-scale patterns, and the model has enough capacity to recover those patterns.
On tabular data the pattern reverses. Gradient-boosted trees with hand-engineered features continue to match or beat deep-learning approaches on the bulk of structured-data benchmarks (Grinsztajn, Oyallon & Varoquaux 2022, Why do tree-based models still outperform deep learning on tabular data? — the most-cited empirical result on this question, drawing on 45 datasets). Four reasons: tabular features already encode substantial human knowledge (something neural networks have to recover from scratch); tabular datasets are small (thousands to millions of rows, not billions); tabular features are heterogeneous in scale and type (which trees handle natively, but neural networks handle badly without feature engineering); and the inductive biases of CNNs and RNNs do not match tabular structure. The upshot is that classical feature engineering remains the dominant practice for tabular ML, and this chapter's techniques are as relevant now as they were twenty years ago.
Even on tabular problems, one piece of deep-learning machinery has crossed over cleanly: entity embeddings for high-cardinality categoricals (Guo & Berkhahn 2016). Learning a dense vector for each category via backpropagation — much like word embeddings — gives a representation that captures category similarity and can outperform target encoding, especially when the same categorical appears in many models (a learned user embedding used across a recommender, a churn model, and a fraud model). Modern tabular-deep-learning libraries (FT-Transformer, TabNet, SAINT) use entity embeddings internally. The technique is the cleanest example of "old-school feature engineering plus new-school learning" that works in practice.
The 2020s have brought a new class of features: retrieval-augmented ones, where an embedding of the input is used to look up related records in an external store, and the retrieved records' aggregates become features. A customer-churn model might retrieve the k most-similar customers by embedding distance and feature-engineer summary statistics over those retrieved customers' behaviours. This is the feature-engineering side of the retrieval-augmented-generation pattern now dominant in LLM engineering, and it is starting to appear in tabular pipelines too. The practitioner who knows classical feature engineering and can compose it with learned embeddings and retrieval is using the full modern toolkit.
Two frontiers. First, automated feature engineering via LLM-driven feature proposal (CAAFE, Hollmann et al. 2023) — where a language model is prompted with the dataset schema and target, proposes feature transformations in code, and iteratively improves them based on validation scores — and via large-scale generate-and-evaluate search (OpenFE, Zhang et al. 2023). Early results are encouraging but not yet production-grade. Second, foundation-model features: using pretrained embeddings (text, image, multimodal) as direct inputs to a downstream classifier. For many tasks, the right feature engineering in 2026 is "call a foundation model to get embeddings, then engineer a classical tabular model on top." This pattern is eating a growing share of what used to be pure NLP or computer-vision pipelines, and it brings the techniques in this chapter — scaling, selection, missing-value handling, leakage discipline — into the foundation-model era.
Feature engineering has a peculiar bibliography: most of the craft lives in blog posts, Kaggle notebooks, and competition retrospectives rather than in a canonical textbook. But there are anchor references, and the selection literature — the half of the chapter that the machine-learning research community has cared about more than the engineering half — has a proper theoretical tradition. The references below split into anchor textbooks, foundational papers on selection and leakage, modern extensions that carry feature engineering into the deep-learning and large-model era, and the software that everyone ends up using. If you only read one book, read Kuhn & Johnson's Feature Engineering and Selection.
Zou & Hastie (2005), Regularization and Variable Selection via the Elastic Net. The canonical embedded-selection paper, blending the lasso's sparsity with the ridge's stable handling of correlated features; the l1_ratio parameter in scikit-learn's ElasticNet is this paper in miniature. Pair with Yuan & Lin's 2006 Model Selection and Estimation in Regression with Grouped Variables for the group-lasso extension.

Guyon, Weston, Barnhill & Vapnik (2002), Gene Selection for Cancer Classification using Support Vector Machines. The paper that introduced recursive feature elimination; sklearn.feature_selection.RFE is a nearly line-for-line implementation. Also a rare example of a feature-selection paper with a clear biological deliverable: the chosen genes were themselves the scientific result.

Breiman (2001), Random Forests. Alongside the forest itself, the source of permutation importance, implemented as sklearn.inspection.permutation_importance. Pair with Strobl et al.'s 2007 Bias in Random Forest Variable Importance Measures for the critical follow-up that shows when MDI misleads and why permutation importance should be the default.

Liu, Ting & Zhou (2008), Isolation Forest. The standard tree-based outlier detector, useful when cleaning heavy-tailed numerical features; sklearn.ensemble.IsolationForest is this paper in production form. Pair with the same authors' 2012 Isolation-Based Anomaly Detection journal version for the extended empirical study.

Kanter & Veeramachaneni (2015), Deep Feature Synthesis: Towards Automating Data Science Endeavors. Introduced Deep Feature Synthesis (DFS), which auto-generates candidate features by stacking primitives (mean, count, time_since_last) across entity relationships. The approach proved competitive with expert hand-crafted features on several Kaggle competitions, and its featuretools implementation remains the most widely used automated-feature-engineering library for tabular relational data. Pair with the authors' Feature Labs blog posts for the DFS case studies, and with the featuretools documentation for the current API.

scikit-learn. sklearn.preprocessing provides the transformer zoo (scalers, encoders, binners, polynomial features, power transforms); sklearn.compose.ColumnTransformer is the column-wise dispatch primitive that makes multi-type tabular pipelines tractable; sklearn.pipeline.Pipeline is the fittable-composable-persistable container that prevents train-serve skew; sklearn.feature_selection implements every filter (SelectKBest), wrapper (RFE, SequentialFeatureSelector), and embedded (SelectFromModel) method in the chapter.
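A minimal sketch of that composition, assuming scikit-learn is installed; the toy table, column layout, and category values are invented:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type table: one numeric column (index 0), one categorical (index 1).
X = np.array([[3.2, "red"], [1.1, "blue"], [5.9, "red"],
              [0.4, "green"], [4.7, "blue"], [2.5, "green"]], dtype=object)
y = np.array([1, 0, 1, 0, 1, 0])

# Column-wise dispatch: scale the numeric column, one-hot the categorical.
pre = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
])

# One fittable, persistable object: fit on training data, apply the
# identical transformations at serving time -- no train-serve skew.
clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(X, y)
serving_pred = clf.predict([[2.0, "red"]])
```

`handle_unknown="ignore"` keeps serving from crashing on a category never seen in training, which is exactly the kind of skew the Pipeline container is meant to guard against.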
Read the Common pitfalls and Preprocessing data User Guide sections start to finish — they are the most concentrated feature-engineering prose in the documentation of any ML library.

featuretools. The library form of Deep Feature Synthesis: a set of primitives (aggregation primitives like mean and count, transform primitives like day_of_week and time_since_last) that can be composed across entity relationships to auto-generate hundreds of candidate features. Strong point-in-time-correctness support via cutoff times for temporal datasets — the feature matrix for each training row uses only data available at that row's cutoff. Pair with TPOT and AutoGluon for the adjacent AutoML-pipeline tools that sometimes subsume featuretools.

shap (pip install shap) implements TreeSHAP, KernelSHAP, DeepSHAP, and the rich visualisation suite (summary plots, dependence plots, force plots) that has become the standard way to read a tree-ensemble's feature usage. BorutaPy and Boruta (R) implement the all-relevant selection algorithm. Together they cover roughly 80% of day-to-day feature-importance and feature-selection work for tabular ML. Pair with LIME (SHAP's predecessor), ELI5, and alibi for the broader interpretability ecosystem.

mlxtend provides SequentialFeatureSelector with forward / backward / floating variants, a bias-variance decomposition tool, association-rule mining, and several ensemble stackers; its accompanying textbook Python Machine Learning is a solid practical feature-engineering companion. yellowbrick provides visual diagnostics — feature importance bar charts, rank-1D / rank-2D feature ranking visualisations, residual plots — that make the feature-engineering feedback loop much tighter than staring at numeric metrics alone.

This page is Chapter 08 of Part IV: Classical Machine Learning. Chapters 01 through 07 built a gallery of algorithms — regression, classification, ensembles, clustering, dimensionality reduction, probabilistic graphical models, kernel methods and SVMs — each with its own inductive biases and characteristic failure modes.
Chapter 08 is the connective tissue: the engineering and statistical discipline of deciding what inputs the algorithms should see in the first place. Chapter 09, which closes the classical-ML arc, completes the triad by turning to model evaluation and selection — the measurement discipline that tells us honestly which of the preceding algorithm-and-feature combinations actually wins on the problem in front of us, with all of the cross-validation and resampling subtleties that separate an honest answer from a wishful one.