The seven chapters before this one have described learning algorithms — regression, classification, ensembles, clustering, dimensionality reduction, probabilistic graphical models, kernel machines — as if the features they consume were given, ready-made, well-typed, and on the right scale. Real features almost never arrive that way. They arrive as raw timestamps that a linear model cannot read, as categorical strings with two hundred thousand distinct values, as continuous measurements with heavy tails and outliers and twenty percent missing, as free-form text, as nested relational joins, as images that haven't been resized. Feature engineering is the discipline of turning that raw material into numerical inputs a model can use — and, done well, it is the single highest-leverage activity in classical machine learning. Iterating on features almost always moves metrics more than iterating on algorithms; a gradient-boosted tree with well-engineered features will beat a deep network with raw inputs on most tabular problems even in 2026. Feature selection is the closely related discipline of figuring out which features to keep: many candidate features degrade a model rather than improve it, through noise, multicollinearity, leakage, or simply by costing compute and maintenance for marginal gain. Selection methods come in three canonical flavours — filter, wrapper, embedded — and choosing between them is a real engineering decision. Feature engineering is also where the largest ML accidents happen: a feature that predicts too well in training because it leaks label information is the single most common cause of models that crash on contact with production. This chapter covers the craft.
Section one motivates the whole enterprise: why the columns you feed a model matter more than the model itself on almost every tabular problem, and why that is going to remain true despite the deep-learning revolution. Section two frames feature engineering as a pipeline — a sequence of idempotent transformations fitted on training data and applied to both training and serving — and introduces the principle that there are no end-runs around this structure: every transformation must be fittable, persistable, and reproducible. Section three covers numerical features: scaling (standardisation, min-max, robust), power transforms (log, Box–Cox, Yeo–Johnson), binning, and the monotonic-transform families that tame heavy tails. Section four is the workhorse of tabular ML: categorical encoding. One-hot, ordinal, frequency, and binary encodings all have characteristic failure modes, and knowing which to reach for is most of the practical skill. Section five tackles high-cardinality categoricals (customer IDs, city names, zip codes) where one-hot is infeasible — target encoding, frequency encoding, learned embeddings, and the careful cross-validation they require. Section six is interaction terms and polynomial features: how you explicitly tell a linear model about the multiplicative structure it cannot discover on its own, and when that pays off versus letting a tree model find it automatically.
Sections seven through twelve survey the rest of the engineering toolkit. Section seven is datetime features — the most underused class of features in practice and a place where small effort routinely yields outsized wins. Section eight covers text features: bag-of-words, TF-IDF, character n-grams, and the feature-hashing trick that lets you handle vocabularies that will not fit in memory. Section nine is missing-value imputation, which is both statistical inference and feature engineering at once: simple means and medians, KNN and model-based imputation, missingness indicators, and the sharp question of when missingness is itself informative. Section ten is outlier handling: detection (z-score, IQR, isolation forests), treatment (winsorising, clipping, separate indicator features), and when outliers are the signal rather than the noise. Section eleven returns to the hashing trick in its general form, as a scalable alternative to any one-hot or vocabulary-based encoding. Section twelve introduces feature selection proper, with the filter–wrapper–embedded taxonomy that organises everything to follow.
Sections thirteen through fifteen develop the three selection families in detail. Section thirteen covers filter methods — correlation, chi-square, ANOVA F-test, mutual information, and minimum-redundancy–maximum-relevance (mRMR) — which rank features by a statistic computed independently of any model. Section fourteen covers wrapper methods — forward selection, backward elimination, recursive feature elimination (RFE), and genetic-algorithm wrappers — which treat selection as a search over feature subsets and evaluate each subset by training a model. Section fifteen covers embedded methods — L1-regularised regression (Lasso) and elastic net, tree-based importances, SHAP-based selection, and the permutation importance that has become the de facto standard diagnostic. Section sixteen is the safety-critical topic: data leakage in feature engineering. This is the section that prevents the catastrophes; read it twice. Section seventeen is a practical operational guide — when to engineer, when to stop, what to automate, and what to stash in a feature store. Section eighteen places feature engineering in the broader ML landscape: deep learning's attempt to learn features from scratch, the enduring dominance of classical feature engineering on tabular data, the rise of automated feature engineering (featuretools, auto-sklearn, H2O Driverless AI), and the coming world of learned embeddings and retrieval-augmented features.
On tabular problems — which still account for the majority of production machine-learning work — the engineering of the features beats the choice of algorithm most days of the week. A gradient-boosted tree with well-chosen features will outperform a modestly-tuned neural network with raw inputs on almost every business dataset. This is not a nostalgic claim about classical methods; it is an observation about the structure of the learning problem.
Machine-learning algorithms learn functions of the features. They cannot learn structure the features do not expose. If the relationship between customer tenure in days and churn probability is periodic over a seven-day week, a linear model that sees only tenure-in-days will not discover that periodicity no matter how much data you give it. A linear model that sees day-of-week as a one-hot feature will find it instantly. The first model is architecturally incapable of the finding; the second is already most of the way there. Choosing an algorithm and choosing features are complementary: the algorithm defines a hypothesis space, and the features define the basis in which that space is parameterised. Changing the basis can change everything.
Three empirical regularities make feature engineering the highest-leverage activity in classical ML. First, tabular data contains a lot of prior human knowledge that is not in the raw columns: the practitioner who knows that transaction amount matters less than transaction amount as a fraction of the customer's typical spend can write that feature in one line and buy a ten-percent improvement in F1 that no algorithm change will match. Second, real datasets are smaller than researchers pretend. The canonical "big data" benchmarks in the deep-learning literature are not the business datasets most practitioners work on; the typical real-world classification problem has tens of thousands of examples and dozens to hundreds of features, a regime where feature engineering dominates algorithm choice. Third, classical ML algorithms — especially gradient-boosted trees — are extraordinarily forgiving consumers of sensible features: scale them however you like, include redundant versions, miss a few, and the tree will still sort it out.
This chapter covers the two halves of the problem: engineering (turning raw data into usable numerical inputs) and selection (deciding which of the candidate features to keep). Both are craft as much as science: the statistical machinery exists, and you will see it, but the working intuition — "try target encoding, it usually helps"; "always check for leakage before celebrating a win" — is the bulk of the value. We will also be severe about the pitfalls: data leakage, train–serve skew, and the selection-bias traps of wrapper methods all lie in wait for the practitioner who cares only about accuracy. The techniques in this chapter are the ones every production ML system actually uses, and the ones every failure post-mortem keeps pointing back to.
A feature-engineering step is not a script you run once: it is a transformer — a pair of functions, one that fits state from training data and one that applies that state to new data. Getting this structure right is what separates a feature pipeline that survives production deployment from one that doesn't.
Consider standardising a numerical column to zero mean and unit variance. The operation has two parts. Fit: compute the mean and standard deviation from the training data. Transform: subtract the mean, divide by the standard deviation. Naively you might write one function that does both, but that function cannot be applied to the test set or to a single serving example without recomputing the mean — which would either use future data (on test, the test-set mean) or not be well-defined at all (on a single example). The solution is the fit/transform split used throughout scikit-learn: fit(X_train) estimates the mean and standard deviation and stores them in the transformer object; transform(X) applies the stored statistics to any future data. This structure is not optional; every non-trivial feature pipeline in production is built on it.
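The split can be sketched as a minimal hand-rolled transformer — the class name here is ours, purely illustrative; scikit-learn's StandardScaler implements exactly this contract:

```python
import numpy as np

class Standardizer:
    """Minimal illustration of the fit/transform split."""

    def fit(self, X):
        # Fit: estimate statistics from the training data only.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Transform: apply the *stored* training statistics to any data,
        # including a single serving example.
        return (X - self.mean_) / self.std_

X_train = np.array([[1.0], [2.0], [3.0]])
X_serve = np.array([[4.0]])  # one serving example — no mean of its own

scaler = Standardizer().fit(X_train)
print(scaler.transform(X_serve))  # uses the training mean and std
```

The serving example is scaled with the training mean (2.0) and standard deviation, which is precisely what a one-shot fit-and-transform script cannot do.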
Feature-engineering steps compose. You typically want to do some imputation, then some scaling, then some encoding, then pass the result to a model. Scikit-learn's Pipeline object expresses this as a named sequence, exposing a single fit/predict interface that hides the internal structure: Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler()), ("encode", OneHotEncoder()), ("model", LogisticRegression())]). ColumnTransformer is the complementary construct for applying different transformers to different columns (scale the numeric, one-hot the categorical). Together they make an entire fit-and-predict workflow a single object — crucial for cross-validation (which must re-fit transforms on each fold to avoid leakage), for model persistence, and for deployment.
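A minimal end-to-end sketch, with illustrative column names and toy data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one numeric column (with a missing value), one categorical.
X = pd.DataFrame({"age": [25, 32, None, 41], "plan": ["a", "b", "a", "c"]})
y = [0, 1, 0, 1]

# Different transformers for different columns.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(X, y)          # fits imputer, scaler, encoder, and model in order
print(clf.predict(X))  # the entire workflow is one persistable object
```

Cross-validating `clf` with `cross_val_score` re-fits every transformer on each training fold, which is exactly the leakage-avoiding behaviour the text describes.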
A production-ready feature pipeline maintains three invariants. It is fittable: all statistics needed at serving time are estimated from training data and nowhere else. It is persistable: the fitted state serialises to a file and loads back identically. It is reproducible: re-fitting on the same data produces bit-identical state (up to floating-point nondeterminism). Violating any of these invariants is the classic source of train–serve skew, the bug where features are computed one way during training and another way during inference. The most common cause is computing some statistic at training time using pandas and at serving time using a different library: the edge cases differ, the results differ, and the model silently becomes less accurate in production than it was at evaluation.
At scale, the Pipeline pattern breaks down: you want to share features across multiple models, compute them once rather than re-fitting for each training run, and keep the training and serving paths exactly aligned. The modern answer is a feature store — Feast, Tecton, Hopsworks, Databricks Feature Store — which materialises features into a cache that both offline training and online serving read from. The feature store enforces the three invariants by construction: there is one definition of each feature, one computation path, and a point-in-time query API that prevents training on features computed after the label was known. Feature stores are the subject of Section 17; for now, the principle is enough.
Numerical features rarely arrive on the scale the model wants. Raw income is skewed across four orders of magnitude; raw time-to-checkout is bimodal with a heavy right tail; raw latency is log-normal. The transformations in this section are the standard toolkit for giving these features a shape the learning algorithm can use.
Linear models, neural networks, distance-based methods (kNN, k-means, SVMs with RBF), and gradient descent in general are all sensitive to the scale of features. Tree models are not. The three standard scaling transforms cover the waterfront. Standardisation (z-scoring) subtracts the mean and divides by the standard deviation, producing a feature with zero mean and unit variance; the workhorse when the feature is roughly symmetric. Min-max scaling maps the feature to [0, 1] via (x − min)/(max − min); useful when the downstream model assumes a bounded input (some neural architectures, some calibration methods). Robust scaling subtracts the median and divides by the interquartile range; the version to use when the data has outliers that would otherwise dominate the variance estimate. Apply them after a train–test split has been established; fit on training data only.
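All three scalers are one-liners in scikit-learn; a toy column with one outlier shows how they differ:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])  # one large outlier

z = StandardScaler().fit_transform(x)   # zero mean, unit variance
mm = MinMaxScaler().fit_transform(x)    # mapped into [0, 1]
rb = RobustScaler().fit_transform(x)    # median/IQR: outlier-resistant

print(mm.ravel())  # the outlier compresses the other values toward 0
print(rb.ravel())  # the non-outliers keep a usable spread
```

Note how the outlier dominates the min-max range, squashing the first three values together, while robust scaling leaves them well separated.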
When a feature is strongly right-skewed (income, session duration, file size, page views), scaling alone does not help — the top one percent of values still dominate the model's view. The canonical fix is a power transform. The log transform x → log(x + c) is the simplest and usually first to try; it turns a log-normal distribution into a normal one, and gracefully handles four-orders-of-magnitude ranges. Box–Cox (Box & Cox 1964) parameterises a family of transforms by a single parameter λ and selects the value that makes the result most nearly normal; restricted to strictly positive inputs. Yeo–Johnson (Yeo & Johnson 2000) generalises Box–Cox to real-valued inputs (positive and negative). In scikit-learn these are one-liners via PowerTransformer(method="yeo-johnson"). Power transforms are almost always worth trying on any feature with a ratio max/median above about twenty.
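A quick demonstration on synthetic log-normal data (the exact skew values depend on the random seed):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # heavy right tail

x_log = np.log1p(x)                                     # simplest fix first
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(x)

print(skew(x.ravel()), skew(x_yj.ravel()))  # skew collapses toward zero
```

The log transform already does most of the work here; Yeo–Johnson earns its keep when the feature has negative values or when you want the λ chosen for you.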
Sometimes the right answer is to throw away the continuous structure altogether. Binning replaces a numerical feature with a categorical one indicating which bucket the value falls into. Equal-width bins (1–10, 10–20, 20–30, …), equal-frequency bins (quantile-based), and supervised binning (where bin boundaries are chosen to maximise target information) all have their uses. Binning is useful when the relationship with the target is strongly non-monotonic, when the model is linear and cannot learn the non-linearity itself, or when you want to handle a heavy-tailed feature while preserving interpretability. The cost is that binning discards information within each bin; use it where the trade-off is favourable, not as a default.
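A sketch with scikit-learn's KBinsDiscretizer, using equal-frequency bins on a heavy-tailed toy column:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1.0], [2.0], [3.0], [10.0], [200.0], [5000.0]])

# Equal-frequency (quantile) bins, returned as ordinal bin indices.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
print(binner.fit_transform(x).ravel())  # each bin holds about two values
```

With `encode="onehot"` the same transformer emits the binary-indicator form a linear model would want; `strategy="uniform"` gives equal-width bins instead.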
Beyond log, the square root (x → √x) is useful for count data (mildly skewed), the reciprocal (x → 1/x) for latencies and rates (where differences matter in inverse terms), and winsorising (clipping at, say, the 1st and 99th percentiles) for bounded robustness against outliers. None of these changes the rank ordering of the data (the reciprocal reverses it, which is equivalent for splitting purposes), which is why tree models are indifferent to them; but for any linear or kernel model, the choice of monotonic transform is a genuine modelling choice.
Most ML algorithms consume numerical vectors, not strings. The translation from categorical values to numerical representations is the single most common feature-engineering task in tabular ML, and it is where the largest number of subtle mistakes are made.
The standard encoding for a categorical feature with k distinct values is the one-hot (or dummy-variable) representation: k binary columns, of which exactly one is 1 for each row. This is the representation that lets a linear model assign a separate coefficient to each category, and it is the representation scikit-learn's OneHotEncoder produces. Two practical subtleties. First, if the categorical is fed to a linear model with an intercept, drop one category (the drop="first" option) to avoid perfect collinearity between the intercept and the one-hots; tree models do not care. Second, if a category appears at inference time that was not in training, the encoder must have a plan — usually handle_unknown="ignore", which encodes unseen categories as all zeros. One-hot scales poorly when k grows: at a thousand categories you have a very sparse thousand-column matrix; at a million categories you need a different encoding entirely (Section 5).
When the categorical values have a natural order — low / medium / high, elementary / high-school / college / graduate, bronze / silver / gold — use ordinal encoding: replace each value with an integer reflecting the order. This is the right representation for tree models (which can split at the right threshold automatically) and for linear models where you expect the effect to be monotonic in the rank (a reasonable prior for rating scales). Do not use ordinal encoding on nominal categoricals with no natural order (colours, country codes, product categories); that would impose a fictitious ordering that most models will try to use, badly.
Label encoding assigns each distinct category an arbitrary integer. This is fine for tree models (which can carve up the integer axis however they like) and for the target column of a classification problem. It is a catastrophe for linear models and distance-based methods on feature inputs: "country=77" and "country=78" are adjacent on the integer axis but not semantically adjacent, so the model will learn a fictitious smoothness that generalises poorly. Scikit-learn's LabelEncoder documents this prominently; use it only for the target, or use OrdinalEncoder when you genuinely do have an order.
Frequency encoding replaces each category with the fraction of training rows in which it appears. This turns a categorical into a single numerical feature that a tree model can split on; popular values become one feature value, rare values another. It loses the within-group information but compresses k categories into one column, which is sometimes the right trade-off. Binary encoding first ordinal-encodes to integers and then expresses each integer in binary across log₂(k) columns; it preserves more category information than frequency encoding but less than one-hot, and is a reasonable middle ground for cardinalities in the low thousands. Both are available in the category_encoders Python package.
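Frequency encoding needs no special library at all; in pandas it is a `value_counts` and a `map` (toy data):

```python
import pandas as pd

df = pd.DataFrame({"city": ["nyc", "nyc", "sf", "la", "nyc", "sf"]})

# Fraction of training rows per category, then mapped back onto the column.
freq = df["city"].value_counts(normalize=True)  # nyc 0.5, sf 0.33, la 0.17
df["city_freq"] = df["city"].map(freq)
print(df)
```

At serving time the fitted `freq` Series is the stored state: map new rows through it, filling categories never seen in training with zero or with the minimum observed frequency.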
Customer IDs, zip codes, product SKUs, URL paths — any real-world ML problem has at least one feature with tens of thousands to tens of millions of distinct categories. One-hot is out of the question. This section is the toolkit that replaces it.
The most important high-cardinality technique is target encoding (also called mean encoding, likelihood encoding, or impact encoding). Replace each category with a summary statistic of the target computed within that category: for a regression target, the category's mean target value; for a binary target, the category's positive-class rate; for a multiclass target, the category's class-conditional probabilities. A single high-cardinality column becomes a single low-cardinality column carrying the most relevant signal — the target distribution conditional on the category. In production it routinely beats one-hot by enough to be worth the effort.
Raw target encoding has a serious problem: for categories with few observations, the within-category mean is noisy and can overfit badly. The standard fix is smoothing toward the global mean: the encoded value for category c is a weighted average of the category mean μ_c and the global mean μ, with the weight depending on the number of observations n_c in the category — typically w_c = n_c / (n_c + m) for a smoothing parameter m (the "equivalent sample size"). This is precisely the James–Stein / empirical-Bayes shrinkage estimator, and it dramatically improves out-of-sample behaviour. The category_encoders library exposes this as TargetEncoder(smoothing=...).
Target encoding has a second, subtler problem: because the encoded value for row i uses row i's target in the encoding computation, the feature trivially leaks the label. Train a model on target-encoded features computed naively and you will see absurd in-sample accuracy; deploy it and watch it collapse. The solutions are all variants of out-of-fold computation: use k-fold cross-validation to compute each row's encoding from a fold that excludes that row, or use leave-one-out target encoding, or use the CatBoost approach of ordered target encoding where each row's encoding uses only the rows that came before it in a random permutation. Every serious production deployment of target encoding uses one of these schemes. The naive implementation — "just compute the group-by mean on training data" — is one of the most common causes of leakage in classical ML.
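The two fixes combine naturally: out-of-fold computation against leakage, shrinkage against rare-category noise. A sketch — the function name, defaults, and toy data are ours, not a library's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, m=10.0, n_splits=5, seed=0):
    """Out-of-fold target encoding with shrinkage toward the global mean."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    enc = np.full(len(cat), np.nan)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in kf.split(cat):
        mu = y.iloc[tr].mean()                       # fold-local global mean
        grp = y.iloc[tr].groupby(cat.iloc[tr])
        mu_c, n_c = grp.mean(), grp.size()
        # w_c = n_c / (n_c + m): shrink rare categories toward mu.
        smoothed = (n_c * mu_c + m * mu) / (n_c + m)
        # Each row is encoded from folds that exclude it; unseen → mu.
        enc[va] = cat.iloc[va].map(smoothed).fillna(mu).to_numpy()
    return pd.Series(enc)

cat = ["a"] * 50 + ["b"] * 50
y = [1] * 40 + [0] * 10 + [0] * 45 + [1] * 5
enc = oof_target_encode(cat, y)
print(enc.head(3))  # "a" rows encode well above the global mean
```

No row's own label ever enters its encoding, so the in-sample optimism of the naive group-by-mean version disappears.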
For the highest-cardinality features (URL paths with billions of distinct values, IP addresses, free-text product titles), even target encoding becomes awkward. Two alternatives. Frequency encoding (Section 4) replaces each value with its count — cheap, low-leakage, loses the target conditional. Hashing (Section 11) maps each category to one of a fixed number of buckets via a hash function, accepting collisions in exchange for a bounded memory footprint.
The modern high-end solution is to learn a low-dimensional embedding for each category by backpropagation. Each category is mapped to a vector in a d-dimensional space (typically d = 8 to 64), and those vectors are trained jointly with the rest of the model. This is the mechanism behind entity embeddings (Guo & Berkhahn 2016, whose Rossmann Kaggle entry popularised the technique) and behind every modern recommender system's treatment of user and item IDs. In tabular deep learning this is the default; in tree models it is more awkward (you have to embed outside the tree), but tools like pytorch-tabular and the fastai tabular stack make it turnkey.
A linear model is linear in its features — not in the underlying phenomenon. If the right explanation involves a product of two variables (price × demand), a linear model can't find it unless you give it the product explicitly. Polynomial and interaction features are the canonical way to do that.
Consider predicting purchase probability from price and demand. The true relationship might be something like probability ∝ 1 − price/demand: the cheaper a product relative to demand, the more likely it sells. A linear model y = β₀ + β₁ price + β₂ demand cannot express this; it has no access to the ratio. A linear model with an interaction term y = β₀ + β₁ price + β₂ demand + β₃ (price × demand) still can't express a ratio exactly, but it can approximate it reasonably over a bounded range — and adding further interactions or polynomial terms tightens the approximation. This is the Taylor-expansion view of feature engineering: you're giving the linear model access to progressively higher-order terms of the true function.
Scikit-learn's PolynomialFeatures(degree=d) generates all polynomial combinations of the input features up to total degree d. With p input features and degree d, the output has C(p+d, d) features — quickly explosive: 50 features at degree 3 produces 23,426 polynomial features. The practical range is degree 2 on small feature sets, with interaction-only mode (interaction_only=True) suppressing the pure squared-and-cubed terms when only cross-terms are wanted. For large feature sets the explosion makes polynomial features impractical, and you either use a kernel method (Chapter 07, which computes a polynomial kernel implicitly without materialising the features) or you use a model that finds interactions automatically (trees, gradient boosting, neural networks).
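A worked example on the two-feature case from above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # two features: price, demand

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))
# columns: price, demand, price², price·demand, demand²
# → [[2. 3. 4. 6. 9.]]

inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(inter.fit_transform(X))  # → [[2. 3. 6.]]  cross-term only
```

The `interaction_only=True` variant is the one to reach for when the squared and cubed terms are not wanted, which keeps the column count down to the cross-terms.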
The quiet truth is that most of the gains from interaction features come from domain-driven ones, not exhaustive polynomial expansion. The ratio feature price / typical price for this customer; the product visit_count × average_order_value; the difference current_month_spend − last_month_spend — these are one-line features that routinely outperform degree-3 polynomial expansion on the same inputs. The reason is that they encode the structure a domain expert already knows: the data has ratios, products, and differences that matter, and the model saves the work of finding them. Which interactions to try is a question to take to the subject-matter expert, not to the grid search.
Tree-based models — random forests, gradient-boosted trees — discover interactions automatically. A tree that splits on price and then on demand in a child node is effectively using an interaction between the two. This is a large part of why GBMs dominate tabular ML: they don't need you to specify the polynomial basis, because they build a piecewise-constant approximation to whatever function fits. Explicit interaction features help GBMs too (they shorten the paths the tree has to traverse), but the marginal value is much smaller than for linear models. Reserve heavy interaction engineering for linear and kernel pipelines; rely on the tree otherwise.
Splines deserve a final mention as the linear-modelling answer to single-feature non-linearity: represent the feature with a handful of smooth basis functions and let the linear model weight them, capturing curvature without any interaction machinery. SplineTransformer in scikit-learn and the spline-based GAM literature (Hastie & Tibshirani 1990) are the standard tools.
A raw timestamp — 2026-04-17 14:23:01 — is useless to most ML models. Decomposed into its components, it is routinely the most predictive feature in the dataset. The engineering is cheap, the win is reliable, and practitioners who skip this step leave large amounts of accuracy on the table.
Extract year, month, day, day-of-week, hour, minute, is_weekend, is_month_start, is_month_end, is_quarter_start, week-of-year, day-of-year, and any holiday flags relevant to the business (New Year's, Black Friday, regional holidays). These take one line each with pandas's .dt accessor. Each one exposes a different kind of seasonality: day-of-week catches weekly cycles, month catches seasonal ones, hour catches intraday patterns, holiday flags catch the human-calendar shocks. A single datetime column can legitimately produce fifteen to twenty engineered features that together capture the temporal structure the model needs.
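A sketch of the decomposition with the `.dt` accessor (toy timestamps; holiday flags would come from a calendar table or the `holidays` package):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2026-04-17 14:23:01",
                                         "2026-12-25 09:00:00"])})

df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["dow"] = df["ts"].dt.dayofweek            # Monday = 0
df["hour"] = df["ts"].dt.hour
df["is_weekend"] = df["dow"] >= 5
df["is_month_start"] = df["ts"].dt.is_month_start
print(df.drop(columns="ts"))
```

Each line is a new feature; the full fifteen-to-twenty-feature expansion is the same pattern repeated.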
Components like hour and day-of-week are cyclical: 23:59 is close to 00:01, and Sunday is close to Monday. Naively encoding hour as an integer 0–23 imposes a fake linear structure where the model thinks 00:00 and 23:00 are maximally far apart. The fix is cyclical encoding: replace the integer with two features, sin(2π × hour / 24) and cos(2π × hour / 24). The pair traces a circle in the plane, and distances in that plane reflect the true cyclical distance. Do the same for day-of-week (period 7), month (period 12), day-of-year (period 365.25). For tree models this matters less (the tree can split on the integer), but for any linear, kernel, or neural model, cyclical encoding is the right default.
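The encoding and the distance claim, concretely:

```python
import numpy as np

hour = np.array([0, 6, 12, 23])

hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)

# Distance in the (sin, cos) plane reflects true cyclical distance.
def circ_dist(i, j):
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])

print(circ_dist(0, 3))  # 00:00 vs 23:00 — small, as it should be
print(circ_dist(0, 2))  # 00:00 vs 12:00 — the maximum, 2.0
```

On the raw integer axis, hours 0 and 23 are the farthest pair; on the circle they are nearly adjacent, which is the behaviour a linear or distance-based model needs.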
Beyond decomposition, elapsed times are enormously useful: days since last purchase, days since signup, days until next scheduled event. These are easy to compute from the raw datetime and tend to be highly predictive in customer-lifetime-value and churn problems. Lag features — the value of some quantity at time t − k — are the feature-engineering workhorse of time-series forecasting. Rolling-window statistics — rolling mean, rolling standard deviation, rolling max over the last k time steps — extend lags into smoothed-history features. The library tsfresh automates a huge inventory of these time-series features (Christ et al. 2018, Time Series Feature Extraction on basis of Scalable Hypothesis tests) and is a standard tool in the time-series feature-engineering toolkit.
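Lags and rolling windows are a few lines in pandas (toy daily series; note the `shift` before the rolling mean, so the window covers only past values and cannot include the current one):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 15, 14, 18],
              index=pd.date_range("2026-01-01", periods=6, freq="D"))

feats = pd.DataFrame({
    "lag_1": s.shift(1),                          # value at t − 1
    "lag_7": s.shift(7),                          # one week ago (all NaN here)
    "roll_mean_3": s.shift(1).rolling(3).mean(),  # mean of the 3 previous days
})
print(feats)
```

The NaNs at the start of each column are the honest price of a point-in-time feature: those rows simply do not have enough history yet.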
Two ubiquitous pitfalls. First, time zones: if your timestamps mix UTC and local time without documentation, datetime features will be systematically biased. Normalise to UTC at ingest, and compute any "user local time" features explicitly using the user's known timezone. Second, point-in-time correctness: when you build a feature like "user's total purchases so far", compute it with the knowledge available as of the feature's timestamp — not the full-dataset aggregate. Getting this wrong causes a classic leak: the "user's total purchases" computed over the full dataset includes future purchases that would not be known at the feature's moment, and the model trains on that future information without noticing.
Free-form text — reviews, support tickets, product descriptions, tweets, medical notes — is among the most common non-numeric feature types. Before large language models displaced them, the classical feature-engineering approaches in this section were the state of the art for a quarter-century; and they remain the right tool when the task is small, the labels are few, or the inference budget is tight.
The simplest text representation is the bag-of-words (BoW): build a vocabulary of the k most frequent tokens in the training corpus, and represent each document as a k-dimensional vector whose entries are the counts of each vocabulary token in the document. The representation throws away word order — hence "bag" — but preserves enough to support classification, topic modelling, and many retrieval tasks. Scikit-learn's CountVectorizer produces BoW vectors in one line, with standard options for lowercasing, stop-word removal, and minimum/maximum document-frequency filtering. For most problems you want to cap the vocabulary at 10,000 to 100,000 tokens; beyond that the rare-word tail is mostly noise.
Term frequency–inverse document frequency reweights BoW so that common words (which carry little discriminative power) get down-weighted and rare words (which distinguish documents) get up-weighted. TF, the term frequency, is the count in the document (possibly log-scaled); IDF, the inverse document frequency, is log(N/df(t)) where N is the number of documents and df(t) the number containing term t. The product weights rare terms much more heavily. TF-IDF is the standard input to a linear text classifier and will beat raw counts on essentially every text classification task. Scikit-learn's TfidfVectorizer fuses the two steps.
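A three-document toy corpus shows the fitted vocabulary and the resulting matrix shape:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

vec = TfidfVectorizer()        # tokenise, count, and reweight in one step
X = vec.fit_transform(docs)    # sparse (n_documents × vocabulary) matrix

print(X.shape)
print(sorted(vec.vocabulary_)) # the fitted vocabulary, one column per term
```

The vocabulary is the fitted state of this transformer: at serving time, `vec.transform` maps new documents into the same columns, and tokens outside the training vocabulary are simply dropped.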
N-grams extend BoW by including consecutive token sequences of length n: "machine learning" becomes a single feature alongside "machine" and "learning". Unigrams plus bigrams (ngram_range=(1, 2)) is a common default that captures enough short-range word order to lift most classification tasks. Character n-grams — sliding windows of characters rather than tokens — are the right representation for morphologically rich languages, noisy social-media text, and misspelling-robust classification. Character 3-grams through 5-grams will outperform word-level features on tasks where spelling varies, capitalisation is unreliable, or the domain has many out-of-vocabulary tokens.
BoW and TF-IDF both require building a vocabulary from the training corpus — expensive in memory and awkward when the corpus is streaming or distributed. The hashing vectorizer (Weinberger et al. 2009, Feature Hashing for Large Scale Multitask Learning) sidesteps this: hash each token into one of k buckets, and represent the document as the count (or TF-IDF-weighted count) in each bucket. No vocabulary required; the mapping is fixed; the memory is bounded. The cost is collisions between unrelated tokens, but at k = 2²⁰ or so, collisions are rare enough to be negligible for most classification tasks. Scikit-learn's HashingVectorizer is the standard implementation; it is the right choice when the training corpus is too large to fit a vocabulary, or when the same model must accept tokens it has never seen before.
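The contrast with the vocabulary-based vectorizers is visible in the API itself — there is nothing to fit:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step, no stored vocabulary: the token→column map is a fixed hash.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)

X = vec.transform(["completely unseen tokens still hash to columns"])
print(X.shape)  # (1, 1024) regardless of what the corpus contains
```

The small `n_features` here is for illustration; in practice 2²⁰ or more keeps collisions negligible, as the text notes.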
Real datasets are full of missing values. Before they reach most ML algorithms they must be filled in, dropped, or explicitly flagged — and the choice between those options is both a statistical decision and a feature-engineering one.
The statistical taxonomy of missingness, due to Rubin (1976), cuts the problem three ways. Missing completely at random (MCAR): the probability of missingness does not depend on any value, observed or unobserved. The easy case — any reasonable imputation is unbiased. Missing at random (MAR): missingness depends on observed values but not on the missing value itself (e.g., older customers more often have unreported income, but conditional on age the unreported incomes are distributed like the reported ones). Model-based imputation conditional on the observed features works. Missing not at random (MNAR): missingness depends on the missing value itself (high earners refuse to report income). The hardest case — no unbiased imputation exists without auxiliary assumptions, and the missingness pattern is itself informative.
The workhorse approach is mean imputation for continuous features, median imputation for skewed ones, and mode (most-frequent) imputation for categoricals. Scikit-learn's SimpleImputer does all three in one line. Simple imputation is a blunt instrument — it biases variance downward, distorts multivariate distributions, and can damage downstream inference — but it is fast, stable, and usually an acceptable baseline. Constant imputation (filling with a sentinel value like −999 or the string "missing") is a reasonable alternative for tree models, which can split on the sentinel and effectively treat missingness as its own category.
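A one-liner in practice; the tiny matrix here is illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 100.0],
              [2.0, np.nan],
              [np.nan, 300.0]])

# strategy can be "mean", "median", "most_frequent", or "constant".
imp = SimpleImputer(strategy="median")
X_filled = imp.fit_transform(X)   # fit learns one median per column
```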
More sophisticated approaches treat imputation as a supervised-learning problem: fit a model to predict each feature from the others, and use the prediction to fill in missing values. KNN imputation (KNNImputer) fills each missing entry with an average from the k nearest rows; simple and often effective. Iterative imputation (the MICE algorithm, van Buuren & Groothuis-Oudshoorn 2011; scikit-learn's IterativeImputer) cycles through features, imputing each one conditional on the rest using a regression model, and iterates until convergence. For small-to-medium datasets iterative imputation typically beats simple imputation by a meaningful margin. For very large datasets it becomes computationally awkward and practitioners fall back to simple methods or handle missingness directly in the model.
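A sketch on synthetic data where one column is a noisy copy of another, so the iterative imputer can learn the relationship:

```python
import numpy as np
# IterativeImputer is still behind an experimental flag in scikit-learn.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)   # column 2 tracks column 0
X_miss = X.copy()
X_miss[::10, 2] = np.nan                          # knock out 10% of column 2

knn_filled = KNNImputer(n_neighbors=5).fit_transform(X_miss)
mice_filled = IterativeImputer(random_state=0).fit_transform(X_miss)

# The iterative imputer learns the column-2 ~ column-0 regression, so its
# fills land close to the true (deleted) values.
err = np.abs(mice_filled[::10, 2] - X[::10, 2]).mean()
```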
Regardless of which imputation method you use, add a binary indicator for each feature that had missing values in training. X_missing = X.isna().astype(int). This gives the model a chance to learn that "this feature was missing" is itself predictive — the MNAR case, where missingness carries information. In practice missingness indicators are often among the most important features in a churn or fraud model: the customers who don't fill in a form are systematically different from the ones who do. Missingness indicators are almost free to compute and almost always worth including.
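SimpleImputer can emit the indicator columns itself via add_indicator=True, which keeps the flag computation inside the fitted transformer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# add_indicator=True appends one binary "was missing" column per feature
# that had missing values in training.
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)   # columns: [imputed value, missing flag]
```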
Modern gradient-boosting libraries (XGBoost, LightGBM, CatBoost) handle missing values natively: at each split, they send missing values down the branch that minimises training loss. This is effectively learned imputation, and on tabular problems it routinely outperforms any explicit imputation step. If your model is a gradient-boosted tree, often the right pipeline is to do no imputation, let the model handle it, and compare against imputation-based alternatives. For neural networks and linear models, explicit imputation is still required.
Whichever imputer you choose, wrap it in a Pipeline or ColumnTransformer: cross-validation then fits the imputer separately on each fold's training portion, which is exactly what you want.
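A minimal sketch on synthetic data; with the imputer inside the Pipeline, cross_val_score re-fits it per fold automatically:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan          # ~10% missing at random
y = rng.integers(0, 2, 100)

# Each CV fold re-fits the imputer and scaler on its own training rows,
# so no held-out statistics leak into the features.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
```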
Outliers — values far from the rest of the distribution — distort feature scales, destabilise linear models, and sometimes carry the most important signal in the dataset. The engineering challenge is distinguishing the two cases.
Three standard detection approaches cover most cases. The z-score rule flags points with |z| > 3 (or some other threshold); fine for approximately normal features, fragile for heavy-tailed ones. The interquartile range (IQR) rule flags points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR; much more robust for skewed distributions, and the basis of the boxplot whisker. Model-based approaches — isolation forest (Liu, Ting, Zhou 2008), one-class SVM (met in Chapter 07), local outlier factor — are worth reaching for when features have non-Gaussian multivariate structure that univariate methods miss.
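The IQR rule is a few lines of NumPy; the toy vector here is illustrative:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])
mask = iqr_outliers(x)   # flags only the 100.0
```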
Once detected, outliers can be winsorised (replaced with the p-th percentile value, typically 1st and 99th) or clipped (bounded at a fixed threshold). Both truncate the tails without dropping rows; both are much safer than removal when the row otherwise contains valid information. Winsorising is a one-line operation in pandas, and is particularly effective when fed into linear models that would otherwise have their coefficient estimates dominated by a handful of extreme observations.
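In pandas the whole operation is a quantile lookup plus a clip; the data here is a synthetic Gaussian with one planted outlier:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.r_[rng.normal(0.0, 1.0, 99), 50.0])   # one wild outlier

# Winsorise at the 1st/99th percentiles: tail values are replaced with
# the percentile values; no rows are dropped.
lo, hi = s.quantile([0.01, 0.99])
s_wins = s.clip(lower=lo, upper=hi)
```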
Section 3 mentioned robust scaling — subtract the median, divide by the IQR — as an alternative to standardisation. In the presence of outliers, robust scaling is strictly better: extreme values do not distort the scale parameters that govern the rest of the data. Combined with winsorising, it gives a linear model a feature distribution that is approximately Gaussian in its bulk, with the tails capped at a manageable range.
Crucially, sometimes outliers are exactly what you want to capture. In fraud detection, the fraudulent transactions are the outliers; clipping them away removes the signal. In anomaly detection and novelty detection, outliers are the entire point. In scientific data (particle-physics events, astronomical transients), the interesting discoveries are the ones that do not fit the model. The decision of whether to treat outliers as noise or signal is a modelling choice that depends entirely on the task. The clipping heuristic in Section 10 is for tasks where the outliers are genuinely data-quality problems or measurement errors — not for tasks where they are what you are trying to find.
When the feature space is too large to hold a vocabulary in memory — billions of URLs, hundreds of millions of user agents, an open-ended set of product IDs — the hashing trick replaces the vocabulary with a fixed-width hash function. It is the only realistic encoding for genuinely large categorical spaces.
Choose a hash function h (a fast non-cryptographic one — MurmurHash3, FNV, xxHash) and a bucket count k (typically a power of 2 between 2¹⁶ and 2²²). Map each categorical value v to bucket h(v) mod k. Represent a row as the k-dimensional sparse vector counting how many values hashed to each bucket. The mapping is stateless — no training-time vocabulary to build, no inference-time dictionary to load. New categories at inference time are hashed just like old ones. The entire representation fits in O(k) memory regardless of how many distinct values the feature takes.
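For categorical rows (rather than text), scikit-learn's FeatureHasher implements exactly this; the IDs below are invented:

```python
from sklearn.feature_extraction import FeatureHasher

# 2**18 buckets; each row is a list of category strings.  alternate_sign=True
# (the default) is the signed variant that lets collisions cancel on average.
h = FeatureHasher(n_features=2**18, input_type="string")
X = h.transform([["user_12345", "city_london"],
                 ["user_99999", "city_paris"]])   # stateless: no fit needed
```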
Two different values can hash to the same bucket. In a linear model this means the coefficient for that bucket sums the contributions of all colliding values — which, for unrelated values, is noise. The literature (Weinberger, Dasgupta, Langford, Smola, Attenberg 2009) shows that with k ≫ n (bucket count much greater than the number of active values per example), collisions are rare enough to cost only a small amount of accuracy. A signed variant of the hashing trick uses a second hash to randomise the sign of each feature — this makes collisions cancel on average rather than add, reducing bias at no additional cost.
Feature hashing is the representation of choice for: click-through-rate models at ad networks (where the feature space is the cartesian product of tens of categorical fields, each with millions of values); online learning with streaming data (no vocabulary to maintain); and multi-tenant ML systems that must accept arbitrary new categorical features without schema changes. Vowpal Wabbit (developed at Yahoo! Research and later Microsoft Research) is built around hashing-based feature representation; scikit-learn's HashingVectorizer and FeatureHasher bring the same idea to the Python ecosystem. For most tabular problems with k under 100,000 categories, hashing is overkill and direct target-encoding or one-hot is cleaner. Above that threshold, hashing becomes the only feasible option.
The engineering cost of hashing is interpretability. "Feature 17,342 is important" is not a human-readable statement. If you need to explain what drove a model's prediction, hashed features require reverse-mapping: iterating through candidate values and checking which hash to the bucket of interest. That cost is usually acceptable for production ML systems but sometimes rules out hashing in regulated domains (lending, healthcare) where per-feature attribution is required.
With features engineered, the next question is which to keep. Feature selection is almost always worthwhile: removing uninformative features reduces overfitting, training time, serving cost, and the surface area for data drift. The methods split into three canonical families.
Four reasons feature selection pays off. First, accuracy: on small datasets, uninformative features are pure noise and hurt generalisation — even models with regularisation don't fully recover. Second, speed: fewer features means faster training and faster inference, often by a large factor. Third, interpretability: a model with ten features is easier to explain than one with a thousand. Fourth, maintenance: every feature in production is a feature that can go stale, drift, break, or suffer upstream schema changes. Selection is the activity of paying for the features that earn their keep and dropping the rest.
Guyon & Elisseeff's 2003 introduction to feature selection (An Introduction to Variable and Feature Selection, JMLR) codified the now-standard three-family taxonomy. Filter methods rank features by a statistic computed independently of any model — correlation with the target, mutual information, chi-square. They are fast, model-agnostic, and ignore interactions between features. Wrapper methods treat selection as a search over feature subsets and evaluate each subset by training a model on it. They capture interactions but are expensive and prone to overfitting the selection process. Embedded methods perform selection as a side effect of model fitting — L1-regularised regression zeros out coefficients automatically, decision trees score features by how often they are used — giving most of the wrapper benefit at a fraction of the cost.
A cross-cutting distinction: univariate selection methods look at each feature in isolation (Pearson correlation, mutual information, ANOVA F-test); multivariate methods consider features in combination (RFE, Lasso, mRMR). Univariate methods miss features that are uninformative alone but useful in combination (the classic XOR example: two features that together perfectly predict the target but are individually uncorrelated with it). Multivariate methods catch these but are more expensive and harder to analyse. A reasonable default is to start with a univariate filter for speed and follow with a multivariate embedded or wrapper method for precision.
Feature selection is not always worth the trouble. Gradient-boosted trees are robust to redundant and even uninformative features; large datasets often absorb extra noise without generalisation loss; and regularisation accomplishes much of the same goal inside the model itself. In the deep-learning era, selection has largely moved inside the model: the network learns which features to attend to. The techniques in Sections 13–15 remain important for tabular linear, kernel, and small-tree pipelines; they matter less for large-scale GBMs and practically not at all for large neural networks.
As with imputation, put the selection step inside a Pipeline so it is re-fitted on each fold's training portion, or use nested cross-validation where the outer fold's test data is entirely excluded from the selection process.
Filter methods score each feature by a statistic of its relationship to the target, independent of any downstream model. They are the fast, scalable, first-pass tool of feature selection.
For continuous features and a continuous target, the obvious statistic is the Pearson correlation coefficient. Rank features by |r|, keep the top k. Quick, cheap, and effective when the relationships are approximately linear and monotonic. Spearman correlation (rank-based) captures any monotonic relationship, not just linear ones, and is the safer default when features have heavy tails. For continuous features and a categorical target, the ANOVA F-statistic — the ratio of between-class variance to within-class variance — plays the same role: higher F means the feature's class-conditional means are more separated, so the feature is more informative. Scikit-learn's f_classif and f_regression compute these directly.
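SelectKBest wires any of these statistics into the standard transformer API; the synthetic data below plants signal in one feature only:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
X = rng.normal(size=(300, 10))
X[:, 0] += 2.0 * y                 # feature 0 is informative; the rest are noise

# Rank all features by ANOVA F-statistic and keep the top three.
sel = SelectKBest(score_func=f_classif, k=3).fit(X, y)
chosen = sel.get_support(indices=True)
```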
For categorical features and a categorical target, the chi-square statistic on the contingency table tests whether feature and target are independent. Features with large chi-square values are more informative. Scikit-learn exposes this as chi2, with the usual caveat that the statistic requires non-negative features (frequency counts are fine; centred or negative values are not).
The most general filter statistic is mutual information — the reduction in uncertainty about the target induced by knowing the feature. Unlike correlation, MI captures arbitrary non-linear relationships; unlike chi-square, it applies cleanly to mixed continuous–categorical cases. Scikit-learn's mutual_info_classif and mutual_info_regression use nearest-neighbour entropy estimators (Kraskov, Stögbauer, Grassberger 2004) that work reasonably well in practice. MI is the right filter statistic when you don't know in advance what shape the relationship takes.
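A sketch of the advantage over correlation: a quadratic (non-monotonic) relationship that Pearson r misses entirely but mutual information catches.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(1000, 1))
y = x[:, 0] ** 2 + 0.1 * rng.normal(size=1000)   # non-monotonic relationship

mi = mutual_info_regression(x, y, random_state=0)  # clearly positive
r = np.corrcoef(x[:, 0], y)[0, 1]                  # Pearson r is near zero
```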
Univariate filters ignore an important consideration: redundancy between features. Two features that are each highly correlated with the target and with each other contribute less additional information than one of them alone. mRMR (Peng, Long, Ding 2005) formalises this as a two-criterion problem: maximise relevance (mutual information with the target) while minimising redundancy (mutual information with the already-selected features). The greedy algorithm selects features one at a time, at each step choosing the feature that maximises relevance − redundancy. mRMR remains a standard workhorse in bioinformatics and any problem where feature costs are high and the chosen subset must be small.
Every filter statistic has a bias that a practitioner ought to know. Correlation misses non-monotonic relationships. Chi-square requires positive features. Mutual information estimators are noisy for small samples. ANOVA assumes approximate normality and equal variances within groups. And all univariate filters miss interaction effects — the XOR example where neither feature alone correlates with the target but their product does. Treat filters as a cheap first-pass that flags features worth further attention, not as a decisive selection criterion.
Wrapper methods treat feature selection as a search over the 2ᵖ possible feature subsets, evaluating each candidate subset by training a model on it. They are more accurate than filters — they account for feature interactions — but also more expensive and more prone to selection bias.
The two simplest wrapper algorithms are forward selection (start with no features; at each step add the feature whose inclusion most improves cross-validated score; stop when no addition helps) and backward elimination (start with all features; at each step remove the feature whose removal most improves score). Both are greedy — they don't consider larger swaps — but on typical tabular problems they produce reasonable subsets in polynomial rather than exponential time. Scikit-learn's SequentialFeatureSelector implements both.
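A forward-selection sketch on synthetic data with two planted informative features:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 8))
X[:, 0] += 1.5 * y                 # two informative features among eight
X[:, 1] -= 1.5 * y

# Greedy forward search: at each step, add the feature that most
# improves the cross-validated score.
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2, direction="forward", cv=3
).fit(X, y)
chosen = set(sfs.get_support(indices=True))
```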
The most widely used wrapper in practice is recursive feature elimination (Guyon, Weston, Barnhill, Vapnik 2002, Gene Selection for Cancer Classification Using Support Vector Machines). Fit a model to all features, rank features by the model's own importance metric (coefficient magnitude for a linear model, feature importance for a tree), remove the bottom k%, and repeat until a target number of features remains. RFE is most naturally paired with linear SVMs or L2-regularised logistic regression, where coefficient magnitudes are meaningful, and with random forests or gradient boosters, which provide a feature-importance signal for free. Scikit-learn's RFE and RFECV (the latter chooses the feature count by cross-validation) are the standard implementations.
For problems where the right subset is unlikely to be reachable by greedy steps — non-separable feature interactions, multiple local optima in the subset space — stochastic-search wrappers can help. A genetic algorithm treats each feature subset as a binary chromosome, evaluates fitness by cross-validated model score, and uses crossover and mutation to explore. Simulated annealing does the analogous local-search walk with a temperature schedule that flattens acceptance probabilities. These methods are slow, hyperparameter-laden, and rarely worth it for tabular problems with p in the hundreds; they come into their own in genomics and other applications where p is in the thousands and the subset size is tightly constrained.
Wrapper methods that use the same data to both select features and evaluate the selection will overstate the resulting model's accuracy. The fix is nested cross-validation: an inner loop selects features on each training fold, an outer loop evaluates the selected features on that fold's held-out data. This is expensive — the computation cost multiplies — but it is the only way to get an honest estimate of how the selection+model pipeline will perform on new data. Skipping this step is one of the more common causes of "my ML worked great in the lab and died in production" stories.
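The bias is easy to demonstrate on pure noise; a fast filter stands in for the wrapper here to keep it cheap, but the mechanism is identical:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))    # pure noise: honest accuracy is ~50%
y = rng.integers(0, 2, 100)

# WRONG: selection sees every fold's labels before cross-validation runs,
# so the score is inflated.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# RIGHT: selection happens inside each fold, on that fold's training rows.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("model", LogisticRegression())])
honest = cross_val_score(pipe, X, y, cv=5).mean()
```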
Embedded methods do feature selection as a side effect of model fitting. You pay almost nothing extra — the selection comes for free from a model you were going to fit anyway — and you get most of the accuracy benefit of a wrapper without the computational cost.
The archetypal embedded method is Lasso (Tibshirani 1996): add an ℓ₁ penalty on the coefficient vector to a linear-model loss. The geometry of the ℓ₁ ball means the optimum typically lands on a corner of the ball, setting many coefficients exactly to zero rather than merely shrinking them. A Lasso-regularised linear regression produces a sparse coefficient vector, and the non-zero coefficients are the selected features. The regularisation strength α is the sole hyperparameter that controls the aggressiveness of the selection; larger α means fewer features retained. Elastic net (Zou & Hastie 2005) combines ℓ₁ and ℓ₂ penalties to handle groups of correlated features (Lasso alone tends to arbitrarily pick one of a correlated pair; elastic net includes both). Both are workhorses. Scikit-learn's Lasso, LassoCV, and ElasticNet are the standard APIs.
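A minimal sketch: two true coefficients among twenty features, and the ℓ₁ penalty zeroes out the rest. The data and alpha are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=200)

# The l1 penalty drives the 18 noise coefficients exactly to zero;
# the surviving non-zero coefficients are the selected features.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```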
Random forests and gradient-boosted trees produce a feature-importance score as a byproduct of training. The classical score is mean decrease in impurity (MDI): for each feature, sum the impurity reductions at splits that use that feature, averaged across trees. MDI is fast and free but has a well-known bias toward high-cardinality features (a categorical with many unique values has more opportunities to split). The more reliable alternative is permutation importance: for each feature, measure how much the model's score drops when that feature's values are randomly shuffled. Permutation importance is model-agnostic, has no cardinality bias, and is the modern default. Scikit-learn exposes it as permutation_importance.
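A permutation-importance sketch on synthetic data where only one of five features carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
X = rng.normal(size=(400, 5))
X[:, 0] += 1.5 * y                 # only feature 0 carries signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Score drop on held-out data when each column is shuffled in turn.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
```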
The state of the art in model-specific feature attribution is SHAP (Lundberg & Lee 2017, A Unified Approach to Interpreting Model Predictions). SHAP values are the game-theoretic Shapley values applied to features, decomposing each prediction into feature contributions that sum to the prediction's deviation from the mean. Mean absolute SHAP values across a dataset give a principled feature-importance ranking that respects feature interactions and is consistent in a well-defined axiomatic sense. The shap library provides fast computations for tree models (TreeSHAP, an exact polynomial-time algorithm for ensemble trees), kernel-based approximations for arbitrary models, and deep-network-specific variants. SHAP has effectively replaced MDI and permutation importance as the default in serious production ML reporting.
Boruta (Kursa & Rudnicki 2010) combines the embedded idea with a statistical test. It creates shadow features — shuffled copies of each original feature — and fits a random forest on the augmented dataset. A real feature is retained if its importance is significantly greater (in a repeated-measures sense across forest iterations) than the maximum importance among the shadow features. Boruta answers the "all-relevant" feature-selection question — which features carry any useful information at all — rather than the "minimal-optimal" question that Lasso and RFE answer (the smallest subset that gives the best model). The two questions are different, and Boruta is the workhorse for the all-relevant case, particularly in biology and medicine.
Data leakage is the single most common cause of ML models that look great offline and fail in production. The offender is almost always in the feature-engineering pipeline, not the model. This section catalogues the traps and the defences.
Data leakage occurs when information from outside the training window enters the training features — most commonly, information from the future or information derived from the labels. The result is a model that predicts accurately on historical evaluation data (because the leaked information is essentially the answer) and fails on new data (where that information is not available). Kaufman, Rosset, Perlich, Stitelman 2012 (Leakage in Data Mining) is the classic reference; their worked examples, taken from real data-mining competitions, are a sobering reminder of how easy leakage is to commit accidentally.
The most direct form: a feature whose value is computed using the target itself. Classic examples — the amount_due column in a credit-default dataset that is only populated after default is known; the last_login column in a churn dataset that is updated only when the customer comes back; the was_refunded flag in a fraud dataset. Each of these perfectly predicts the target in training because the target determines the feature's value, but none of them is available at the moment prediction must be made. Target leakage is usually obvious in retrospect but subtle at the time: the feature just looks too good. Healthy skepticism toward any single feature that achieves near-perfect accuracy is the first line of defence.
In time-series or streaming data, a feature computed using future information is a leak. "Rolling 30-day average of revenue" computed over the full dataset includes rows from the future; the correct version uses only rows up to but not including the target row. "Days since last purchase" computed from the full dataset sees purchases that haven't happened yet. The defence is point-in-time correctness: every feature must be computable from information known as of the feature's timestamp. Feature stores enforce this by construction via their point-in-time join APIs; ad hoc pipelines must enforce it by discipline.
Computing any statistic (mean, median, quantile, imputation fill value) on the combined train+test set and then applying it to training is a leak: the test set has influenced the training features. The defence is simple — fit on training, apply to test — but the bug is common because the offending line is often innocent-looking (df['col'].fillna(df['col'].mean()) looks fine unless you notice the mean is over the full dataframe). The cleanest prevention is the Pipeline pattern, which re-fits every transformation on each cross-validation fold's training portion.
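The bug and the fix, side by side on a four-row toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": [1.0, 2.0, np.nan, 100.0]})
train, test = df.iloc[:3], df.iloc[3:]

# Leak: the fill value is computed over the full frame, test row included.
leaky_fill = df["col"].mean()            # (1 + 2 + 100) / 3

# Correct: fit the statistic on the training rows, apply it everywhere.
fill = train["col"].mean()               # (1 + 2) / 2
train_filled = train["col"].fillna(fill)
test_filled = test["col"].fillna(fill)
```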
When rows are not independent — multiple rows per customer, multiple measurements per patient, multiple sensor readings per device — a random train–test split puts the same entity on both sides, and a feature that identifies the entity can leak. "This customer's historical default rate" is a useful feature in churn prediction, but if the same customer appears in both training and test, the model simply memorises customer identity rather than learning general patterns. Group-aware splitting (scikit-learn's GroupKFold, GroupShuffleSplit) is the fix; it also matters for time-series (TimeSeriesSplit).
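A sketch of group-aware splitting; the customer IDs are invented:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
groups = np.array(["cust_a", "cust_a", "cust_b", "cust_b", "cust_c", "cust_c"])

# Every row belonging to a customer lands on the same side of each split.
gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, groups=groups))
for train_idx, test_idx in splits:
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```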
The techniques in this chapter are individually simple. The skill is in assembling them into a feature-engineering workflow that produces maintainable, reproducible, leakage-free features at production scale.
Productive feature engineering is iterative. Start with a baseline model on the raw features (or the smallest reasonable set). Build a tight evaluation harness — cross-validated score, a single-command benchmark, tracking of each feature's contribution. Then iterate: generate candidate features one at a time, retrain, measure impact, keep if helpful. Two signals matter beyond raw score: variance across folds (a feature that helps on average but has high variance across folds is often capturing noise) and correlation with existing features (a new feature that is nearly identical to one already in the set adds little). Stop when the marginal score gain from each new feature falls below some threshold, or when the model hits a diminishing-returns plateau.
Hand-crafted feature engineering is slow. Automated feature engineering (AutoFE) applies a fixed library of operations — arithmetic combinations, aggregations over time windows or group keys, common transforms — to generate many candidate features at once, then uses selection to pick the survivors. The main library in the open-source Python ecosystem is featuretools (Kanter & Veeramachaneni 2015), which defines Deep Feature Synthesis: systematic generation of features by stacking aggregation and transformation primitives across relational tables. AutoML platforms (H2O Driverless AI, DataRobot, Google Cloud AutoML) include AutoFE as a core capability. AutoFE's accuracy rarely matches expert manual feature engineering, but it gets most of the way there with a fraction of the labour — a good match for teams with limited ML expertise or very large feature spaces.
At scale, features become a shared asset across many models. A feature store (Feast, Tecton, Hopsworks, Databricks Feature Store, AWS SageMaker Feature Store) is a managed system that stores feature values, serves them online with low latency, provides point-in-time-correct offline joins for training, and enforces consistency between the two. The feature-store abstraction — a feature is defined once, computed once, and consumed everywhere — is the mature pattern for organisations with more than a handful of ML models. It also centralises the enforcement of the discipline around training–serving skew and leakage that this chapter has been belabouring.
Features have meaning that is not captured in the column name. days_since_last_purchase is ambiguous until you know whether it counts only completed purchases or also abandoned cart transactions, whether it is computed at the moment of prediction or at some lag, whether "purchase" includes refunded ones. A production feature pipeline needs documentation — the feature definition, the computation path, the business meaning, the expected distribution, and the owner. And it needs versioning: the feature definition evolves over time, and the model must know which version it was trained on. Feature stores bake this in; ad-hoc pipelines must do it by convention.
The single most valuable test in a production feature pipeline: compute the features at both training time and inference time on the same input data, and assert the results are identical. This one test catches the entire class of bugs where the offline and online computation paths have drifted apart. The corollary practice is feature monitoring: track the feature distributions at inference time and alert on drift from the training distributions. Most production ML failures announce themselves as feature-distribution drift before they become prediction-quality failures; monitoring gives you a chance to respond in advance.
Deep learning's headline promise was "learn features from data; no manual feature engineering required." That promise held on images, text, speech, and molecular graphs — and it held badly on tabular data, where classical feature engineering continues to win. The modern picture is a more nuanced division of labour, where hand-engineered, learned, and retrieval-augmented features all play complementary roles.
Deep learning displaced hand-engineered features in three domains: images (where convolutional networks learn edge and texture detectors from raw pixels, ending the decades-long SIFT/HOG/GIST tradition), speech (where end-to-end acoustic models displaced MFCC and filterbank features), and text (where pretrained transformer embeddings now subsume TF-IDF and n-gram representations on most tasks). In each case, the learned features were dramatically better than the engineered ones, and the relevant feature-engineering literature is now mostly historical. The pattern is that learned features win when data is plentiful, the signal is buried deep in multi-scale patterns, and the model has enough capacity to recover those patterns.
On tabular data the pattern reverses. Gradient-boosted trees with hand-engineered features continue to match or beat deep-learning approaches on the bulk of structured-data benchmarks (Grinsztajn, Oyallon & Varoquaux 2022, Why do tree-based models still outperform deep learning on tabular data? — the most-cited empirical result on this question, drawing on 45 datasets). Four reasons: tabular features already encode substantial human knowledge (something neural networks have to recover from scratch); tabular datasets are small (thousands to millions of rows, not billions); tabular features are heterogeneous in scale and type (which trees handle natively, but neural networks handle badly without feature engineering); and the inductive biases of CNNs and RNNs do not match tabular structure. The upshot is that classical feature engineering remains the dominant practice for tabular ML, and this chapter's techniques are as relevant now as they were twenty years ago.
Even on tabular problems, one piece of deep-learning machinery has crossed over cleanly: entity embeddings for high-cardinality categoricals (Guo & Berkhahn 2016). Learning a dense vector for each category via backpropagation — much like word embeddings — gives a representation that captures category similarity and can outperform target encoding, especially when the same categorical appears in many models (a learned user embedding used across a recommender, a churn model, and a fraud model). Modern tabular-deep-learning libraries (FT-Transformer, TabNet, SAINT) use entity embeddings internally. The technique is the cleanest example of "old-school feature engineering plus new-school learning" that works in practice.
The 2020s have brought a new class of features: retrieval-augmented ones, where an embedding of the input is used to look up related records in an external store, and the retrieved records' aggregates become features. A customer-churn model might retrieve the k most-similar customers by embedding distance and feature-engineer summary statistics over those retrieved customers' behaviours. This is the feature-engineering side of the retrieval-augmented-generation pattern now dominant in LLM engineering, and it is starting to appear in tabular pipelines too. The practitioner who knows classical feature engineering and can compose it with learned embeddings and retrieval is using the full modern toolkit.
Two frontiers. First, automated feature engineering via LLM-driven feature proposal (CAAFE, Hollmann et al. 2023) — where a language model is prompted with the dataset schema and target, proposes feature transformations in code, and iteratively improves them based on validation scores — and via large-scale generate-and-evaluate search (OpenFE, Zhang et al. 2023). Early results are encouraging but not yet production-grade. Second, foundation-model features: using pretrained embeddings (text, image, multimodal) as direct inputs to a downstream classifier. For many tasks, the right feature engineering in 2026 is "call a foundation model to get embeddings, then engineer a classical tabular model on top." This pattern is eating a growing share of what used to be pure NLP or computer-vision pipelines, and it brings the techniques in this chapter — scaling, selection, missing-value handling, leakage discipline — into the foundation-model era.
Feature engineering has a peculiar bibliography: most of the craft lives in blog posts, Kaggle notebooks, and competition retrospectives rather than in a canonical textbook. But there are anchor references, and the selection literature — the half of the chapter that the machine-learning research community has cared about more than the engineering half — has a proper theoretical tradition. The references below split into anchor textbooks, foundational papers on selection and leakage, modern extensions that carry feature engineering into the deep-learning and large-model era, and the software that everyone ends up using. If you only read one book, read Kuhn & Johnson's Feature Engineering and Selection.
Zou & Hastie (2005), Regularization and Variable Selection via the Elastic Net. The canonical embedded-selection paper, blending the lasso's sparsity with the ridge's stable handling of correlated features; the l1_ratio parameter in scikit-learn's ElasticNet is this paper in miniature. Pair with Yuan & Lin's 2006 Model Selection and Estimation in Regression with Grouped Variables for the group-lasso extension.

Guyon, Weston, Barnhill & Vapnik (2002), Gene Selection for Cancer Classification using Support Vector Machines. The paper that introduced recursive feature elimination; sklearn.feature_selection.RFE is a nearly line-for-line implementation. Also a rare example of a feature-selection paper with a clear biological deliverable: the chosen genes were themselves the scientific result.

Breiman (2001), Random Forests. Alongside the forest itself, the source of permutation importance, implemented as sklearn.inspection.permutation_importance. Pair with Strobl et al.'s 2007 Bias in Random Forest Variable Importance Measures for the critical follow-up that shows when MDI misleads and why permutation importance should be the default.

Liu, Ting & Zhou (2008), Isolation Forest. The standard tree-based outlier detector, useful when cleaning heavy-tailed numerical features; sklearn.ensemble.IsolationForest is this paper in production form. Pair with the same authors' 2012 Isolation-Based Anomaly Detection journal version for the extended empirical study.

Kanter & Veeramachaneni (2015), Deep Feature Synthesis: Towards Automating Data Science Endeavors. Introduced Deep Feature Synthesis (DFS), which auto-generates candidate features by stacking primitives (mean, count, time_since_last) across entity relationships. The approach proved competitive with expert hand-crafted features on several Kaggle competitions, and its featuretools implementation remains the most widely used automated-feature-engineering library for tabular relational data. Pair with the authors' Feature Labs blog posts for the DFS case studies, and with the featuretools documentation for the current API.

scikit-learn. sklearn.preprocessing provides the transformer zoo (scalers, encoders, binners, polynomial features, power transforms); sklearn.compose.ColumnTransformer is the column-wise dispatch primitive that makes multi-type tabular pipelines tractable; sklearn.pipeline.Pipeline is the fittable-composable-persistable container that prevents train-serve skew; sklearn.feature_selection implements every filter (SelectKBest), wrapper (RFE, SequentialFeatureSelector), and embedded (SelectFromModel) method in the chapter.
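A minimal sketch of that composition, assuming scikit-learn is installed; the toy table, column layout, and category values are invented:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type table: one numeric column (index 0), one categorical (index 1).
X = np.array([[3.2, "red"], [1.1, "blue"], [5.9, "red"],
              [0.4, "green"], [4.7, "blue"], [2.5, "green"]], dtype=object)
y = np.array([1, 0, 1, 0, 1, 0])

# Column-wise dispatch: scale the numeric column, one-hot the categorical.
pre = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
])

# One fittable, persistable object: fit on training data, apply the
# identical transformations at serving time -- no train-serve skew.
clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(X, y)
serving_pred = clf.predict([[2.0, "red"]])
```

`handle_unknown="ignore"` keeps serving from crashing on a category never seen in training, which is exactly the kind of skew the Pipeline container is meant to guard against.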
Read the Common pitfalls and Preprocessing data User Guide sections start to finish — they are the most concentrated feature-engineering prose in the documentation of any ML library.

featuretools. The library form of Deep Feature Synthesis: a set of primitives (aggregation primitives like mean and count, transform primitives like day_of_week and time_since_last) that can be composed across entity relationships to auto-generate hundreds of candidate features. Strong point-in-time-correctness support via cutoff times for temporal datasets — the feature matrix for each training row uses only data available at that row's cutoff. Pair with TPOT and AutoGluon for the adjacent AutoML-pipeline tools that sometimes subsume featuretools.

shap (pip install shap) implements TreeSHAP, KernelSHAP, DeepSHAP, and the rich visualisation suite (summary plots, dependence plots, force plots) that has become the standard way to read a tree-ensemble's feature usage. BorutaPy and Boruta (R) implement the all-relevant selection algorithm. Together they cover roughly 80% of day-to-day feature-importance and feature-selection work for tabular ML. Pair with LIME (SHAP's predecessor), ELI5, and alibi for the broader interpretability ecosystem.

mlxtend provides SequentialFeatureSelector with forward / backward / floating variants, a bias-variance decomposition tool, association-rule mining, and several ensemble stackers; its accompanying textbook Python Machine Learning is a solid practical feature-engineering companion. yellowbrick provides visual diagnostics — feature importance bar charts, rank-1D / rank-2D feature ranking visualisations, residual plots — that make the feature-engineering feedback loop much tighter than staring at numeric metrics alone.

This page is Chapter 08 of Part IV: Classical Machine Learning. Chapters 01 through 07 built a gallery of algorithms — regression, classification, ensembles, clustering, dimensionality reduction, probabilistic graphical models, kernel methods and SVMs — each with its own inductive biases and characteristic failure modes.
Chapter 08 is the connective tissue: the engineering and statistical discipline of deciding what inputs the algorithms should see in the first place. Chapter 09, which closes the classical-ML arc, completes the triad by turning to model evaluation and selection — the measurement discipline that tells us honestly which of the preceding algorithm-and-feature combinations actually wins on the problem in front of us, with all of the cross-validation and resampling subtleties that separate an honest answer from a wishful one.