The previous three chapters built the general machinery of deep networks — the forward and backward pass, the optimisers that make training converge at scale, the regularisers that make the resulting function generalise. All of it was architecture-agnostic: the same framework applies whether the input is an image, a sentence, a spectrogram, or a graph. This chapter is about the first architecture family to show that how you connect the neurons matters. Yann LeCun's 1989 handwritten-digit recogniser proved that convolutional layers — weight-sharing across spatial position, small local receptive fields, pooling for translation invariance — beat the fully-connected MLPs of the era by a wide margin on image tasks. It took another twenty-three years and three orders of magnitude of GPU compute before Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton's AlexNet (ImageNet 2012) proved the same point decisively enough to re-found the field: a convolutional network trained end-to-end with gradient descent on a million natural images could halve the error rate of every hand-engineered computer-vision pipeline in existence. What followed was a decade of CNN architecture development — VGG's radical minimalism, GoogLeNet's multi-scale branches, ResNet's residual connections that broke the depth barrier, DenseNet's feature-reuse topology, MobileNet and EfficientNet's accuracy–compute frontier, ConvNeXt's rehabilitation of convolutions after the transformer revolution — plus the downstream applications (detection, segmentation, video understanding) that depend on convolutional backbones. Convolutional networks also carry the torch for a deeper idea: that inductive bias built into the architecture — convolution's translation equivariance, pooling's local invariance — is a form of regularisation more powerful than any explicit penalty. 
This chapter is the foundational computer-vision chapter of the compendium, both for the architectures themselves and for the principle they established.
Sections one through six build the convolutional vocabulary from first principles, with no reference to any particular architecture. Section one motivates convolution by comparing it to a fully-connected network on an image task — the parameter count, the invariances the MLP fails to capture, and the argument that architectural priors are a kind of regularisation. Section two defines the convolution operation itself: kernels, channels, the cross-correlation-vs-convolution terminology debate, and the algebraic identity that makes the operation translation-equivariant. Section three handles the geometry that surrounds the operation — padding (valid, same, reflection), stride, and dilation (Yu and Koltun 2016) — and the formulae that predict output dimensions. Section four is pooling: max-pooling, average-pooling, global pooling, and the modern discussions of anti-aliased pooling (Zhang 2019) and learned pooling. Section five introduces the receptive field — both the theoretical receptive field that grows with depth and the effective receptive field (Luo, Li, Urtasun, Zemel 2016) that a trained network actually attends to, typically much smaller. Section six covers feature hierarchies and the classic visualisation work of Zeiler and Fergus (2014), which revealed that successive CNN layers encode edges, textures, parts, and eventually object-like concepts.
Sections seven through twelve are the classic CNN architectures, taken in chronological order because each one solved the previous one's problem. Section seven is LeNet — LeCun, Boser, Denker, Henderson, Howard, Hubbard, Jackel's 1989 handwritten-digit recogniser, refined into LeNet-5 (LeCun, Bottou, Bengio, Haffner 1998) — the proof of concept that shaped everything that followed. Section eight is AlexNet (Krizhevsky, Sutskever, Hinton 2012), the ImageNet-winning paper that combined ReLUs, GPU training, aggressive augmentation, and dropout into the network that re-founded deep learning. Section nine is VGG (Simonyan, Zisserman 2014), whose radical simplification to stacked 3×3 convolutions became the default design idiom for the next five years. Section ten is GoogLeNet/Inception (Szegedy et al. 2015), which introduced multi-scale parallel branches and 1×1 convolutions as bottlenecks. Section eleven is ResNet (He, Zhang, Ren, Sun 2015) — the residual connection that broke the depth barrier, won ImageNet 2015, and became the single most important architectural invention of the decade. Section twelve is DenseNet (Huang, Liu, van der Maaten, Weinberger 2017), which took feature reuse to its logical extreme with every layer connected to every later layer.
Sections thirteen and fourteen turn to the efficient and modern CNN lineage. Section thirteen covers the efficient CNN frontier: MobileNet (Howard et al. 2017) and its depthwise-separable convolutions; SqueezeNet (Iandola et al. 2016); ShuffleNet (Zhang et al. 2018); the compound-scaling recipe of EfficientNet (Tan and Le 2019); and the broader neural-architecture-search (NAS) literature that produced much of this work. Section fourteen is ConvNeXt (Liu, Mao, Wu, Feichtenhofer, Darrell, Xie 2022), the rehabilitation of convolutions after the transformer revolution — a plain convnet re-engineered with transformer-era training and design choices that matches ViT accuracy on ImageNet at comparable compute, making the case that the architecture family was never the bottleneck.
Sections fifteen through seventeen are the downstream vision tasks built on top of CNN backbones. Section fifteen is object detection: the R-CNN lineage (Girshick, Donahue, Darrell, Malik 2014; Fast R-CNN 2015; Faster R-CNN 2015; Mask R-CNN 2017); the single-shot detectors YOLO (Redmon, Divvala, Girshick, Farhadi 2016) and SSD (Liu et al. 2016); the focal-loss stage of RetinaNet (Lin et al. 2017); and the transformer-based DETR (Carion et al. 2020) that closes the CNN chapter. Section sixteen is semantic and instance segmentation: Fully Convolutional Networks (Long, Shelhamer, Darrell 2015), U-Net (Ronneberger, Fischer, Brox 2015) for medical imaging, DeepLab's atrous convolutions (Chen et al. 2017), and Mask R-CNN for instance-level segmentation. Section seventeen is the Vision Transformer crossover — Dosovitskiy et al.'s 2020 An image is worth 16×16 words, the surprising finding that a transformer trained on enough images outperforms convnets without any convolution at all, and the subsequent hybrid architectures (Swin, CoAtNet, MaxViT) that split the difference. The chapter's closing in-ml section places CNNs in the broader landscape — as the archetype of architecture-as-inductive-bias, as the backbone family underlying most deployed vision systems, and as the benchmark the Vision Transformer, graph neural networks, and every subsequent spatial-data architecture has had to measure itself against.
A fully-connected network treats an image as an unstructured vector of pixels. It has to learn, from data alone, that adjacent pixels are related, that shifting an object a few pixels left doesn't change what it is, and that the same edge can appear anywhere in the frame. The convolutional network's founding idea is to build these facts into the architecture — to hardwire locality, weight sharing, and translation equivariance — so the network doesn't have to discover them from scratch. That single architectural choice turns out to be one of the most effective forms of regularisation deep learning has ever found.
Consider an image classifier that takes a 224 × 224 RGB image — typical for ImageNet — and maps it to 1000 class logits through two hidden layers of width 4096, the way AlexNet's final classifier head does. A fully-connected version has a first-layer weight matrix of (224 × 224 × 3) × 4096 ≈ 6.2 × 10⁸ parameters — over 600 million in a single layer. The entire AlexNet convolutional network has 60 million parameters total, an order of magnitude less, and outperforms any fully-connected network you can fit on the same dataset. The fully-connected architecture is not just compute-expensive; it is actively bad: every weight has to learn an independent mapping from one pixel position to one hidden unit, and most of those weights end up mapping noise. Worse, a fully-connected network trained on cats-centred-in-the-image will fail on cats-in-the-corner, because the weights for the corner pixels learned nothing about cats. The parameter cost and the generalisation failure are two sides of the same architectural mistake: the MLP is uninformed about how images work.
A convolutional layer encodes three prior beliefs about images. Locality: a pixel is more related to its neighbours than to a pixel on the opposite side of the frame, so hidden units should have small receptive fields in their input. Weight sharing (stationarity): a feature detector that works in one part of the image should work in any other part, so the same weights should be applied everywhere, like a stencil slid across the image. Hierarchy: complex features are compositions of simpler features over progressively larger spatial extents, so stacking convolutional layers with growing receptive fields should build edge → texture → part → object representations. These three priors are not universal truths — they fail on images where global context matters, on non-stationary patterns, on non-Euclidean grids — but for natural images of objects and scenes, they are close enough to correct that encoding them in the architecture saves data, compute, and generalisation headroom.
A convolutional layer with a 3 × 3 kernel, 64 input channels, and 128 output channels has 3 × 3 × 64 × 128 = 73,728 weights, regardless of whether the input is 32 × 32 or 512 × 512. A fully-connected layer with the same 64-channel, 512 × 512 input and 128-channel, 512 × 512 output would have (64 × 512²) × (128 × 512²) = 2⁴⁹ ≈ 5.6 × 10¹⁴ weights, more than half a quadrillion. The convolutional layer makes the same hidden representation available at every spatial position by reusing the same 74,000 weights across the whole image, rather than learning a separate mapping at every position. This is the essential bargain: hardcoded weight sharing for massive parameter efficiency. Every modern convnet depends on it.
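The parameter counts in this section are easy to check by hand or in a few lines of code. The sketch below is plain Python with hypothetical helper names (`conv_weights`, `dense_weights`); it counts weights only and ignores biases.

```python
def conv_weights(k, c_in, c_out):
    """Weights in a k x k convolution (biases excluded)."""
    return k * k * c_in * c_out


def dense_weights(n_in, n_out):
    """Weights in a fully-connected layer (biases excluded)."""
    return n_in * n_out


# First layer of an MLP on a 224 x 224 RGB image with 4096 hidden units.
mlp_first_layer = dense_weights(224 * 224 * 3, 4096)

# 3 x 3 convolution, 64 -> 128 channels: independent of the image size.
conv_3x3 = conv_weights(3, 64, 128)

# Dense layer mapping a 64-channel 512 x 512 map to a 128-channel 512 x 512 map.
dense_same_shape = dense_weights(64 * 512 ** 2, 128 * 512 ** 2)

print(f"{mlp_first_layer:,}")
print(f"{conv_3x3:,}")
print(f"{dense_same_shape:.2e}")
```

The convolution's count never mentions the image height or width, which is the whole point of weight sharing.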
Chapter 03 catalogued explicit regularisation techniques — weight decay, dropout, data augmentation. Convolution is a different kind of regularisation, baked into the function class rather than added to the loss. A convolutional network cannot represent functions that violate translation equivariance. A fully-connected network can represent such functions, and has to learn not to, from examples. When the true function has the structure the architecture encodes, the architectural prior is a free lunch — a stronger regularisation than any explicit penalty, at zero training-time cost. When the true function violates the prior (a convnet asked to do global-structure reasoning that crosses the whole image, an attention task, a non-grid input), the same prior becomes an obstacle. The art is matching architecture to data, and for natural images the match is nearly perfect.
"CNNs are translation-equivariant" is approximately true but subtly qualified. A pure convolution is exactly translation-equivariant: if f(x) is the output and you shift the input x by (u, v), the output f also shifts by (u, v). Strided convolution, pooling, and padding all break this in small ways — a stride-2 convolution is only equivariant to shifts that are multiples of 2, and the border handling at image edges is slightly asymmetric. Richard Zhang's 2019 Making convolutional networks shift-invariant again documented that modern convnets are actually quite unstable under small pixel shifts (a cat shifted by one pixel can change its predicted class), because the subsampling operations alias. The fix — blurpool (anti-aliased pooling) — restored shift-invariance at a small accuracy cost. The cleaner statement is that convnets are approximately translation-equivariant, and that modern refinements have been pushing that approximation closer to exact.
Before we can talk about architectures, we need the operation. Convolution in a neural network is a specific piece of arithmetic — simple in one dimension, a little trickier in two, and carrying a rich mathematical structure that the network training procedure exploits without the network ever having to be told it exists.
Given a signal x[t] and a kernel w[t], the discrete convolution is (x ∗ w)[t] = ∑_τ x[τ] · w[t − τ]. Strictly speaking, the operation used in every major deep-learning framework is cross-correlation, (x ⋆ w)[t] = ∑_τ x[t + τ] · w[τ], which differs only in that the kernel is not flipped. The two are equivalent for learning because the network learns the kernel, so whether we define it as "learn w" with convolution or "learn w′ = flip(w)" with cross-correlation makes no functional difference. The community long ago standardised on calling the cross-correlation operation "convolution" in a deep-learning context, with the ancient signal-processing convention politely ignored. For the rest of this chapter we use the term in that casual way; a pedant can mentally insert the kernel flip.
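The flip relationship can be checked numerically. The following is a minimal sketch in plain Python, assuming "valid" boundaries (outputs only where the kernel fits entirely inside the signal); the function names are illustrative, not any framework's API.

```python
def cross_correlate(x, w):
    """Valid-mode cross-correlation: slide w over x without flipping it."""
    k = len(w)
    return [sum(x[t + tau] * w[tau] for tau in range(k))
            for t in range(len(x) - k + 1)]


def convolve(x, w):
    """Valid-mode true convolution: the kernel is traversed in reverse."""
    k = len(w)
    return [sum(x[t + tau] * w[k - 1 - tau] for tau in range(k))
            for t in range(len(x) - k + 1)]


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]

xcorr = cross_correlate(x, w)
conv = convolve(x, w)
conv_via_flip = cross_correlate(x, w[::-1])  # flip the kernel, then correlate

print(xcorr, conv, conv_via_flip)
```

For this antisymmetric kernel the two operations differ by a sign, and convolution coincides with cross-correlation against the flipped kernel, which is why a learned kernel makes the distinction moot.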
For a 2D input (an image) and a 2D kernel, the operation is (X ⋆ W)[i, j] = ∑_{m, n} X[i + m, j + n] · W[m, n], with m, n ranging over the kernel's support. If the kernel is 3 × 3 and the image is H × W, the output is (H − 2) × (W − 2) — the kernel "walks" across the image, and its output at each position is the inner product of the kernel with the aligned image patch. Each output position is a linear combination of the nine pixels in its receptive field, weighted by the nine kernel entries. A single 2D kernel learns a single feature detector — an edge at a given orientation, a corner, a texture. To detect many features, we use many kernels.
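As a concrete instance of the 2D formula, here is a small sketch in plain Python with nested lists. The vertical-edge kernel is hand-set for illustration; a trained network would learn its own kernels.

```python
def cross_correlate2d(X, W):
    """Valid-mode 2D cross-correlation of image X with kernel W (lists of lists)."""
    h, w = len(X), len(X[0])
    kh, kw = len(W), len(W[0])
    return [
        [sum(X[i + m][j + n] * W[m][n] for m in range(kh) for n in range(kw))
         for j in range(w - kw + 1)]
        for i in range(h - kh + 1)
    ]


# 5 x 5 image: dark (0) in the left two columns, bright (1) in the right three.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical hand-set vertical-edge detector.
vertical_edge = [[-1, 0, 1]] * 3

response = cross_correlate2d(image, vertical_edge)
print(response)
```

The output is 3 × 3, that is (H − 2) × (W − 2), and it responds strongly only where the kernel window straddles the dark-to-bright boundary.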
A real convolutional layer has more structure. The input has C_in channels (3 for RGB; 64 or 128 or 256 for intermediate layers). The output has C_out channels, each a learned feature detector. For each output channel, the kernel is a 3 × 3 × C_in tensor: 3 × 3 in spatial extent, C_in deep. The full layer's weight tensor is 3 × 3 × C_in × C_out. The operation is Y[i, j, c_out] = ∑_{m, n, c_in} X[i + m, j + n, c_in] · W[m, n, c_in, c_out] + b[c_out]. Each output position and channel is a full weighted sum across the kernel's spatial footprint and the full input channel stack. This structure — local in space, global in channels — is the hallmark of a convolutional layer, and it's why a deeper network with more channels captures more feature types while still keeping spatial locality.
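The multi-channel formula translates almost line for line into code. This is a deliberately slow sketch in plain Python, assuming a channels-last layout (X indexed as [row][column][channel]) and no padding or stride; real frameworks use different memory layouts and optimised kernels.

```python
def conv2d_layer(X, W, b):
    """X: [H][W][C_in], W: [kh][kw][C_in][C_out], b: [C_out] -> Y: [H'][W'][C_out]."""
    h, w, c_in = len(X), len(X[0]), len(X[0][0])
    kh, kw, c_out = len(W), len(W[0]), len(W[0][0][0])
    Y = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            # Each output value sums over the spatial footprint AND all input channels.
            row.append([
                b[co] + sum(
                    X[i + m][j + n][ci] * W[m][n][ci][co]
                    for m in range(kh) for n in range(kw) for ci in range(c_in)
                )
                for co in range(c_out)
            ])
        Y.append(row)
    return Y


# Toy sizes: 4 x 4 input with 2 channels, 3 x 3 kernel, 3 output channels.
X = [[[1] * 2 for _ in range(4)] for _ in range(4)]
W = [[[[1] * 3 for _ in range(2)] for _ in range(3)] for _ in range(3)]
b = [0, 1, 2]

Y = conv2d_layer(X, W, b)
n_weights = 3 * 3 * 2 * 3  # kh * kw * C_in * C_out
print(len(Y), len(Y[0]), len(Y[0][0]), Y[0][0], n_weights)
```

With all-ones input and weights, every output is the patch size 3 × 3 × 2 = 18 plus the channel's bias, which makes the "local in space, global in channels" sum easy to see.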
The convolution operation has a crisp mathematical property: translation equivariance. If T_v denotes the operation of shifting an image by vector v, then (T_v X) ⋆ W = T_v (X ⋆ W) — shifting the input produces a shifted output. This is exactly the property we wanted the architecture to encode: a feature detected at position (i, j) in the input is detected at position (i + v_i, j + v_j) in the input shifted by v. The weight sharing (same kernel applied at every position) is precisely what delivers it. The full CNN is equivariant if each of its layers is, and the final pooling or classifier turns equivariance into invariance (the output is the same regardless of shift) by discarding the position information.
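The identity (T_v X) ⋆ W = T_v (X ⋆ W) can be verified directly. The sketch below works in 1D and assumes circular (wrap-around) boundaries so that the identity holds exactly with no border effects; that boundary choice is an assumption made for the demonstration, not how typical layers pad. It also shows how a stride of 2 restricts equivariance to even shifts.

```python
def circ_xcorr(x, w, stride=1):
    """Circular cross-correlation, keeping every `stride`-th output."""
    n = len(x)
    return [sum(x[(t + tau) % n] * w[tau] for tau in range(len(w)))
            for t in range(0, n, stride)]


def roll(x, v):
    """Circular shift to the right by v positions."""
    v %= len(x)
    return x[-v:] + x[:-v] if v else list(x)


x = [3, 1, 4, 1, 5, 9, 2, 6]
w = [1, -2, 1]

# Stride 1: shifting the input shifts the output by the same amount.
equivariant = all(
    circ_xcorr(roll(x, v), w) == roll(circ_xcorr(x, w), v) for v in range(len(x))
)

# Stride 2: an input shift of 2 becomes an output shift of 1 ...
y2 = circ_xcorr(x, w, stride=2)
even_shift_ok = circ_xcorr(roll(x, 2), w, stride=2) == roll(y2, 1)

# ... but an input shift of 1 produces values that are no shift of y2 at all.
y2_odd = circ_xcorr(roll(x, 1), w, stride=2)
odd_shift_ok = any(y2_odd == roll(y2, v) for v in range(len(y2)))

print(equivariant, even_shift_ok, odd_shift_ok)
```

The odd-shift case fails because the strided layer samples a different set of positions, not because the kernel changed.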
Practically, frameworks often implement convolution as a matrix multiplication. The input image patches are rearranged into a large matrix (one row per spatial position, each row holding the flattened 3 × 3 × C_in patch at that position); this is the im2col operation. The kernel is reshaped into a matrix with C_out rows and 3 × 3 × C_in columns. The convolution output is the product of the patch matrix with the transpose of the kernel matrix, giving one row per spatial position and one column per output channel. This looks wasteful, since input pixels are duplicated across many rows of the im2col matrix, but it maps exceptionally well to hardware. GPU matrix-multiply kernels (cuBLAS GEMM) are among the most heavily optimised pieces of code on Earth; rearranging convolution as matmul lets us use them. Modern frameworks also ship alternative algorithms (direct convolution, Winograd's algorithm for small kernels, FFT-based convolution for large kernels) that can be faster in specific regimes, but the GEMM formulation, explicit or implicit, remains the common baseline.
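A toy version of im2col-plus-matmul makes the rearrangement concrete. This is a sketch in plain Python under stated assumptions (channels-last layout, no padding, stride 1, hypothetical helper names); it checks the result against a direct nested-loop convolution rather than claiming to mirror any library's internals.

```python
def im2col(X, k):
    """X: [H][W][C]. One row per output position: the flattened k x k x C patch."""
    h, w, c = len(X), len(X[0]), len(X[0][0])
    return [[X[i + m][j + n][ch] for m in range(k) for n in range(k) for ch in range(c)]
            for i in range(h - k + 1) for j in range(w - k + 1)]


def flatten_kernel(Wt):
    """Wt: [k][k][C][C_out] -> C_out rows, flattened in the same (m, n, ch) order."""
    k, c, c_out = len(Wt), len(Wt[0][0]), len(Wt[0][0][0])
    return [[Wt[m][n][ch][co] for m in range(k) for n in range(k) for ch in range(c)]
            for co in range(c_out)]


def matmul_nt(A, B):
    """A (N x D) times the transpose of B (M x D), giving N x M."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B] for row_a in A]


def direct_conv(X, Wt):
    """Reference: nested-loop cross-correlation, one output row per position."""
    h, w, c = len(X), len(X[0]), len(X[0][0])
    k, c_out = len(Wt), len(Wt[0][0][0])
    return [[sum(X[i + m][j + n][ch] * Wt[m][n][ch][co]
                 for m in range(k) for n in range(k) for ch in range(c))
             for co in range(c_out)]
            for i in range(h - k + 1) for j in range(w - k + 1)]


H, W, C, K, C_OUT = 4, 5, 2, 3, 2
X = [[[i * 10 + j + ch for ch in range(C)] for j in range(W)] for i in range(H)]
Wt = [[[[(m + 2 * n - ch) * (co + 1) for co in range(C_OUT)] for ch in range(C)]
       for n in range(K)] for m in range(K)]

patches = im2col(X, K)  # (H-2)*(W-2) rows, K*K*C columns
via_gemm = matmul_nt(patches, flatten_kernel(Wt))
print(len(patches), len(patches[0]), via_gemm == direct_conv(X, Wt))
```

The duplication the text mentions is visible in `patches`: neighbouring rows share most of their entries, which is the memory price paid for handing the work to a matrix multiply.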
The same operation generalises to 1D (for audio, text, time series), 3D (for video, volumetric medical imaging), and higher. 1D convolutions have had a revival in modern speech and language models (WaveNet's causal dilated convolutions, for example). 3D convolutions show up in video classification and in volumetric medical imaging (3D U-Net variants). Graph convolutions (Kipf and Welling 2017) generalise the operation to non-Euclidean domains through adjacency-weighted neighbour aggregation, and form the basis of graph neural networks (Part XI, Ch 05). The specific 2D case dominates this chapter because it underpins the architectures, but the mathematical structure is general.
A convolution by itself shrinks the spatial dimensions of its input — a 3 × 3 kernel applied to an H × W image produces an (H − 2) × (W − 2) output. Three knobs — padding, stride, and dilation — let the network designer control that shrinkage precisely, and each has its own downstream consequences for receptive field, compute, and what the network can represent.
Padding is the operation of adding extra border pixels to the input before applying the convolution. The standard options are: valid padding (no padding; output shrinks), same padding (pad with zeros so the output has the same spatial size as the input), reflection padding (mirror the border; used when the zero-boundary would introduce artefacts), and replication padding (repeat the edge pixel). For a 3 × 3 kernel, same padding adds one row and one column of zeros on each side. For a 5 × 5 kernel, two rows and columns on each side. The vast majority of modern CNNs use same padding with zeros inside the backbone — the loss of spatial dimension from each convolution would otherwise make the network architecturally constrained to very few layers before running out of pixels. Padding has a small but non-trivial effect: the zero border is not representative of the image statistics, so early-layer responses near the edge are subtly different from responses in the interior. Reflection and replication padding mitigate this for some applications (style transfer, super-resolution, any task where edge artefacts are noticeable), at minor implementation cost.
Stride controls how far the kernel moves between output positions. A stride-1 convolution outputs one value per input position (after padding); a stride-2 convolution outputs one value per two positions, halving each spatial dimension. Stride is the cheap way to subsample inside a convolutional layer: instead of convolution-then-pool, a stride-2 convolution does both in one step. It saves a layer of compute and a layer of code. The dimension formula: for input size H, kernel size k, padding p, and stride s, the output size is ⌊(H + 2p − k) / s⌋ + 1. For the most common case (k = 3, p = 1, s = 1), output equals input; for (k = 3, p = 1, s = 2), output is half. Stride-2 convolutions show up at regular intervals in every modern convnet — roughly once per "stage" — where the network downsamples from high-resolution early features to low-resolution abstract features.
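The dimension formula is worth having as a one-line helper. A minimal sketch in plain Python (the function name `conv_out_size` is illustrative):

```python
def conv_out_size(h, k, p=0, s=1):
    """Output size along one axis: floor((h + 2p - k) / s) + 1."""
    return (h + 2 * p - k) // s + 1


same_size = conv_out_size(224, k=3, p=1, s=1)  # "same" padding keeps the size
halved = conv_out_size(224, k=3, p=1, s=2)     # stride 2 halves an even input
valid = conv_out_size(224, k=3)                # no padding: shrinks by k - 1
odd_input = conv_out_size(7, k=3, p=1, s=2)    # odd input: rounds up to 4

print(same_size, halved, valid, odd_input)
```

Note the odd-input case: with k = 3, p = 1, s = 2 the output is ⌈H / 2⌉, so "half" is exact only when H is even.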
Dilation (also called atrous convolution, from the French à trous, "with holes") inserts gaps between kernel taps. A dilation-rate-2 3 × 3 kernel covers the pixels at positions (i, j), (i, j+2), (i+2, j), … rather than the adjacent (i, j+1) etc. — a 3 × 3 kernel with dilation 2 sees a 5 × 5 receptive-field footprint while using only 9 weights. Fisher Yu and Vladlen Koltun's 2016 Multi-scale context aggregation by dilated convolutions introduced this into semantic segmentation, and Liang-Chieh Chen et al.'s DeepLab series made it the standard for that task. The point of dilation is to expand receptive field without losing spatial resolution — a key property when the network's output must be a spatial map (a segmentation mask, a depth map, a dense prediction) at the input's resolution. Stacking dilated convolutions with exponentially growing rates (1, 2, 4, 8, …) gives exponential receptive-field growth while keeping the feature map at full resolution throughout.
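The footprint and growth claims reduce to simple arithmetic. The sketch below assumes stride-1 layers throughout and uses hypothetical helper names; it computes the span of a single dilated kernel and the receptive field of a stack of dilated layers.

```python
def dilated_footprint(k, d):
    """Span (in input pixels) covered by a k-tap kernel with dilation d."""
    return d * (k - 1) + 1


def tap_offsets(k, d):
    """Offsets of the kernel taps relative to the first tap."""
    return [d * i for i in range(k)]


def stacked_receptive_field(k, rates):
    """Receptive field of stride-1 layers with the given dilation rates."""
    rf = 1
    for d in rates:
        rf += (k - 1) * d  # each layer widens the field by its footprint minus one
    return rf


print(dilated_footprint(3, 2), tap_offsets(3, 2))
print(stacked_receptive_field(3, [1, 1, 1, 1]))  # four plain 3 x 3 layers
print(stacked_receptive_field(3, [1, 2, 4, 8]))  # four layers, doubling dilation
```

Four plain layers reach a 9-pixel field; the same four layers with rates 1, 2, 4, 8 reach 31 pixels with identical weight counts and no loss of resolution.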
When the network must produce a full-resolution output from low-resolution features (segmentation, image generation, super-resolution), some layer has to upsample the spatial dimensions. The classical answer is transposed convolution (sometimes imprecisely called deconvolution), the operation that reverses the spatial shape change of a strided convolution. Transposed convolutions have a well-known artefact problem: the checkerboard pattern (Odena, Dumoulin, Olah 2016) that appears when the kernel size is not divisible by the stride, so that neighbouring output pixels receive uneven numbers of contributions. Modern pipelines often replace them with nearest-neighbour or bilinear upsampling followed by a stride-1 convolution, which is functionally similar but largely free of that artefact. For generative and super-resolution models, sub-pixel convolution (PixelShuffle) and the U-Net decoder's bilinear-plus-conv pattern are standard.
A painful part of designing a CNN is tracking the spatial dimensions through every layer. With dilation included, the output-size formula becomes ⌊(H + 2p − d·(k − 1) − 1) / s⌋ + 1, where d is the dilation rate (d = 1 recovers ⌊(H + 2p − k) / s⌋ + 1), and it is easy to make an off-by-one error that leaves the final feature map the wrong shape, which cascades into shape mismatches at the classifier. Modern frameworks tolerate this: PyTorch's nn.AdaptiveAvgPool2d will average-pool to a target size regardless of the input size, saving the designer from exact arithmetic. But for anyone designing a new architecture, the discipline of working out the spatial dimensions on paper before writing code saves hours of debugging.
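The "work it out on paper" discipline can be automated with a tiny shape tracer. The sketch below is plain Python and assumes the usual effective-kernel form d·(k − 1) + 1 for a dilated kernel; the layer stack is a made-up example, not a published architecture.

```python
def out_size(h, k, p=0, s=1, d=1):
    """Output size along one axis; d is the dilation rate."""
    effective_k = d * (k - 1) + 1
    return (h + 2 * p - effective_k) // s + 1


def trace_sizes(h, layers):
    """Propagate an input size through a list of layer settings."""
    sizes = []
    for layer in layers:
        h = out_size(h, **layer)
        if h < 1:
            raise ValueError(f"feature map vanished at layer {len(sizes)}: {layer}")
        sizes.append(h)
    return sizes


# Hypothetical stack: same conv, strided conv, dilated same conv, valid conv.
stack = [
    dict(k=3, p=1),
    dict(k=3, p=1, s=2),
    dict(k=3, p=2, d=2),
    dict(k=3),
]
sizes = trace_sizes(32, stack)
print(sizes)
```

Raising an error as soon as a feature map drops below one pixel catches the "ran out of pixels" failure at design time rather than as a shape mismatch deep in training code.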
If convolution is the mechanism that detects features, pooling is the mechanism that summarises them. A pooling layer takes a small spatial window of the input feature map and reduces it to a single output value — the maximum, or the average, or some learned combination. Pooling is the CNN's main tool for translation invariance (shift-robustness), for subsampling, and for achieving a receptive field that eventually covers the whole image.
Max pooling takes the maximum over each spatial window (typically 2 × 2, stride 2). Average pooling takes the mean. Max pooling was the default through the AlexNet–VGG era (LeNet's original subsampling layers were a form of average pooling); it approximates "whether the feature was present anywhere in this window", while average pooling is smoother and more faithful to the full spatial signal. Geoffrey Hinton has argued that pooling is what is wrong with CNNs (the line of argument behind his capsule-networks work of 2017–2018), on the grounds that max-pooling throws away too much positional information. The field partially agreed: modern CNNs lean toward a single global average-pool at the end of the backbone followed by a linear classifier, rather than repeated max-pools through the trunk. Inside the trunk itself, strided convolutions have largely replaced max-pools: they do similar subsampling but keep more information, and have tunable weights.
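The two reductions differ only in the function applied to each window. A minimal sketch in plain Python, with the reduction passed in as a parameter (an illustrative design choice, not a framework API):

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(X, size=2, stride=2, reduce=max):
    """Apply `reduce` to each size x size window of a 2D feature map."""
    h, w = len(X), len(X[0])
    return [
        [reduce([X[i + m][j + n] for m in range(size) for n in range(size)])
         for j in range(0, w - size + 1, stride)]
        for i in range(0, h - size + 1, stride)
    ]


feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 8],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]

max_pooled = pool2d(feature_map, reduce=max)
avg_pooled = pool2d(feature_map, reduce=mean)
print(max_pooled, avg_pooled)
```

The top-right window shows the difference in character: a single strong activation of 8 survives max pooling intact but is diluted to 2.0 by averaging.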
Early CNNs (AlexNet, VGG) ended the backbone with a set of fully-connected layers: enormous weight matrices mapping the flattened final feature map to the logits. This structure was replaced by global average pooling (GAP) in GoogLeNet and subsequent architectures: after the last convolutional stage, average each channel's feature map to a single scalar, giving a C-dimensional vector, and apply a single linear layer to produce class logits. GAP has three advantages. It drastically reduces the parameter count (the VGG-16 classifier head has roughly 124 million parameters; a GAP head has C × num_classes weights plus a bias, typically between a few hundred thousand and a couple of million). It allows the network to accept inputs of variable size (the backbone's output spatial dimensions can vary, but after GAP you always get a C-vector). And it introduces a useful inductive bias: the last classifier doesn't care where in the image a feature appeared, only whether it appeared. From 2015 onward, GAP + single-linear-layer is the dominant classifier head.
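Both the operation and the parameter saving fit in a few lines. The sketch below is plain Python; it assumes the published VGG-16 head dimensions (a 7 × 7 × 512 final feature map feeding layers of width 4096, 4096, and 1000) and a hypothetical GAP head on the same 512 channels.

```python
def global_average_pool(X):
    """X: [H][W][C] -> list of C per-channel means."""
    h, w, c = len(X), len(X[0]), len(X[0][0])
    return [sum(X[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
            for ch in range(c)]


def linear_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


vgg16_fc_head = (linear_params(7 * 7 * 512, 4096)
                 + linear_params(4096, 4096)
                 + linear_params(4096, 1000))
gap_head = linear_params(512, 1000)

# GAP output length depends only on the channel count, not the spatial size.
small = [[[1.0, 2.0] for _ in range(2)] for _ in range(2)]
large = [[[1.0, 2.0] for _ in range(5)] for _ in range(3)]

print(f"{vgg16_fc_head:,} vs {gap_head:,}")
print(global_average_pool(small), global_average_pool(large))
```

The two toy feature maps have different spatial sizes but pool to the same 2-vector, which is what lets a GAP-headed network accept variable-size inputs.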
A 2 × 2 max-pool is invariant to shifts of up to one pixel within each pooling window: the feature map inside the window is shifted, but the max doesn't care about position. Stacking pooling layers extends this invariance geometrically — a network with four stages of 2 × 2 pooling is (approximately) invariant to shifts of up to 16 pixels. The invariance is approximate because it's only exact for shifts aligned with the pooling stride, and because after pooling, the network can still represent shift-sensitive features via the inter-window structure. This is the CNN's mechanism for object recognition to work regardless of exact image position: a cat's face detector fires in some window, pooling abstracts away the within-window position, and the next layer classifies based on "cat-face-present" rather than "cat-face-at-pixel-(117, 84)".
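Why the invariance is only approximate is easiest to see in 1D. This sketch, in plain Python with zero-filled shifts (an assumption for the demonstration), pools a single spike before and after shifting it.

```python
def max_pool1d(x, size=2):
    """Non-overlapping max pooling with window and stride equal to `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def shift_right(x, v):
    """Shift right by v positions, filling with zeros on the left."""
    return [0] * v + x[:len(x) - v]


spike_at_2 = [0, 0, 9, 0, 0, 0, 0, 0]  # spike in the first slot of window 1
spike_at_3 = [0, 0, 0, 9, 0, 0, 0, 0]  # spike in the second slot of window 1

# A one-pixel shift that stays inside the window leaves the pooled output alone.
within_window = max_pool1d(shift_right(spike_at_2, 1)) == max_pool1d(spike_at_2)

# The same one-pixel shift from the window's last slot crosses into window 2.
across_windows = max_pool1d(shift_right(spike_at_3, 1)) == max_pool1d(spike_at_3)

print(max_pool1d(spike_at_2), within_window, across_windows)
```

Whether a one-pixel shift is absorbed depends on where the feature sits relative to the pooling grid, which is exactly the alignment caveat in the text.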
The naive subsampling in standard max-pool and strided-convolution layers violates the Nyquist sampling theorem: signals with frequencies above half the sampling rate alias, appearing in the output at incorrect frequencies and producing spurious sensitivities. Richard Zhang's 2019 Making convolutional networks shift-invariant again showed that ordinary CNNs have surprising instability under small input shifts — a single-pixel shift can change the predicted class — and that the cause is aliasing during subsampling. His fix: low-pass filter the feature map with a small blur kernel before the subsampling step. This BlurPool operation recovers smooth shift-invariance at a trivial compute cost. Anti-aliased pooling is now widely used in research architectures; industrial architectures have been slower to adopt it, largely because the accuracy gain on standard benchmarks is small (under 1% top-1) even though the stability gain is substantial.
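The aliasing failure and the blur-then-subsample remedy can be shown on the worst-case signal, one that alternates every pixel. The sketch below is 1D plain Python and assumes a [1, 2, 1] / 4 binomial blur with circular boundaries, both chosen here for illustration rather than taken from any particular implementation.

```python
def roll(x, v):
    """Circular shift to the right by v positions."""
    v %= len(x)
    return x[-v:] + x[:-v] if v else list(x)


def subsample(x, stride=2):
    return x[::stride]


def blur(x, kernel=(0.25, 0.5, 0.25)):
    """Low-pass filter with circular boundaries."""
    n, r = len(x), len(kernel) // 2
    return [sum(kernel[j] * x[(t + j - r) % n] for j in range(len(kernel)))
            for t in range(n)]


def blur_pool(x, stride=2):
    """Anti-aliased downsampling: blur first, then subsample."""
    return subsample(blur(x), stride)


signal = [1.0, 0.0] * 4      # highest representable frequency
shifted = roll(signal, 1)    # the same signal moved by one pixel

naive_a, naive_b = subsample(signal), subsample(shifted)
smooth_a, smooth_b = blur_pool(signal), blur_pool(shifted)
print(naive_a, naive_b)
print(smooth_a, smooth_b)
```

Naive subsampling maps the two inputs to all-ones and all-zeros, a maximal disagreement from a one-pixel shift; with the blur in front, both map to the same constant output.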
Beyond max and average, several pooling operations have been proposed. L_p pooling interpolates between average (p = 1, for non-negative activations) and max (p → ∞) as ((1/N) ∑|x|^p)^{1/p}, where N is the number of elements in the window. Stochastic pooling (Zeiler, Fergus 2013) samples from the pooling window according to activation magnitude. RoI pooling (Girshick 2015, Fast R-CNN) and RoI align (He et al. 2017, Mask R-CNN) are specialised pools used in object detection and segmentation, extracting a fixed-size feature representation from a region of interest; RoI align does so at non-integer positions. Attention pooling, which computes a learned weighted average over spatial positions, has become common in transformer-era vision models and in multiple-instance learning. The unifying view is that pooling is a reduction over the spatial dimension, and there's no unique best choice; the right reduction depends on the downstream task.
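The way L_p pooling moves between the two familiar reductions can be seen numerically. A minimal sketch in plain Python, assuming the normalised form ((1/N) ∑|x|^p)^{1/p} so that p = 1 gives the mean of the absolute values:

```python
def lp_pool(window, p):
    """Normalised L_p pooling over a flat list of activations."""
    n = len(window)
    return (sum(abs(v) ** p for v in window) / n) ** (1.0 / p)


window = [1.0, 2.0, 3.0, 4.0]

as_average = lp_pool(window, 1)   # equals the mean for non-negative inputs
in_between = lp_pool(window, 2)   # root-mean-square
near_max = lp_pool(window, 64)    # approaches max(window) as p grows

print(as_average, in_between, near_max, max(window))
```

Raising p weights the largest activation more heavily, so the result climbs monotonically from the mean toward the maximum.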
A unit deep in a convolutional network has a receptive field — the region of the input image whose pixels can influence its activation. Receptive field is the key spatial quantity in CNN design: it determines what a unit can "see" and therefore what concepts it can encode. Textbook treatments give a simple geometric recipe for the receptive field; actual networks behave in more interesting ways, and the difference between the theoretical and effective receptive field is one of the more useful insights of the last decade of CNN research.
For a CNN made of convolutions and pools, the theoretical receptive field of a unit is computed by walking backward through the network. At the input layer, every unit sees one pixel: receptive field 1. A 3 × 3 convolution (stride 1) produces a unit that sees a 3 × 3 patch; a second 3 × 3 convolution adds one more pixel of extent on each side, giving a 5 × 5 theoretical receptive field; a third gives 7 × 7; and so on. Stride-2 pools or strided convolutions multiply subsequent receptive-field increments by the cumulative stride, so receptive field grows geometrically with depth when striding is present. A ResNet-50 on a 224 × 224 ImageNet input has a theoretical receptive field that covers essentially the entire image well before the final stage. VGG-16 reaches 196 × 196 at its last convolutional layer and 212 × 212 after the final pool, most of a 224 × 224 input, and its fully-connected head then connects to every spatial position. By construction, the classifier head at the top of a modern backbone can "see" every input pixel.
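The receptive-field recipe is a two-variable recurrence: the field grows by (k − 1) times the cumulative stride, and the cumulative stride multiplies by each layer's stride. The sketch below is plain Python; the VGG-16 layout it uses (conv blocks of 2, 2, 3, 3, 3 layers of 3 × 3, each followed by a 2 × 2 stride-2 pool) is assumed from the published architecture.

```python
def receptive_field(layers):
    """layers: (kernel, stride) pairs from input to output.
    Returns the theoretical receptive-field width after each layer."""
    rf, jump, sizes = 1, 1, []
    for k, s in layers:
        rf += (k - 1) * jump  # widen by (k - 1) steps, each `jump` input pixels
        jump *= s             # later layers step further apart in input space
        sizes.append(rf)
    return sizes


CONV, POOL = (3, 1), (2, 2)

three_convs = receptive_field([CONV] * 3)

vgg16 = []
for n_convs in (2, 2, 3, 3, 3):
    vgg16 += [CONV] * n_convs + [POOL]
vgg16_rf = receptive_field(vgg16)

print(three_convs)
print(vgg16_rf[-2], vgg16_rf[-1])  # last conv layer, then final pool
```

The second printed line gives the receptive field at the last convolutional layer and after the final pool, which can be compared directly with the 224-pixel input width.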
Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel's 2016 NeurIPS paper Understanding the effective receptive field in deep convolutional neural networks made a simple but influential observation: the theoretical receptive field tells you what pixels can influence a deep unit, but not which pixels actually do in a trained network. They computed the partial derivative of a deep unit's activation with respect to each input pixel and observed that the magnitudes form a roughly Gaussian pattern over the receptive field, centred on the unit's nominal position and with a standard deviation that grows as √(depth), not linearly with depth. The effective receptive field is typically much smaller than the theoretical one — often half or a third of the nominal area carries most of the gradient magnitude. This has architectural consequences. A network whose theoretical receptive field covers the whole image may, in practice, only integrate a smaller window, so tasks requiring global context (scene understanding, long-range dependencies) may need architectural help even when the arithmetic says the receptive field is sufficient.
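The √(depth) scaling already appears in a heavily simplified model. The sketch below assumes a 1D stack of linear 3-tap layers with uniform weights and no nonlinearities, so the gradient of a deep unit with respect to the input is just the kernel convolved with itself once per layer. This is an idealisation for intuition, not a reproduction of the paper's experiments on trained networks.

```python
import math


def gradient_profile(n_layers, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Input-gradient of a deep unit in a linear stack: repeated self-convolution."""
    profile = [1.0]
    for _ in range(n_layers):
        out = [0.0] * (len(profile) + len(kernel) - 1)
        for i, p in enumerate(profile):
            for j, w in enumerate(kernel):
                out[i + j] += p * w
        profile = out
    return profile


def spread(profile):
    """Standard deviation of position, weighting each pixel by its gradient."""
    total = sum(profile)
    centre = sum(i * p for i, p in enumerate(profile)) / total
    return math.sqrt(sum(p * (i - centre) ** 2 for i, p in enumerate(profile)) / total)


shallow, deep = gradient_profile(4), gradient_profile(16)

theoretical_ratio = len(deep) / len(shallow)       # width 2n + 1 grows linearly
effective_ratio = spread(deep) / spread(shallow)   # spread grows as sqrt(n)
print(len(shallow), len(deep), theoretical_ratio, effective_ratio)
```

Quadrupling the depth roughly quadruples the theoretical width (9 to 33 pixels) but only doubles the spread of the gradient, so the effective field occupies a shrinking fraction of the theoretical one as the network deepens.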
Three design choices grow receptive field faster than stacking 3 × 3 convs. Larger kernels — a 7 × 7 convolution adds 6 pixels of receptive field instead of 2. The cost is quadratic in kernel size; modern architectures use this sparingly (AlexNet's first layer is 11 × 11; most of the rest of the field standardised on 3 × 3). Dilated convolutions (§3) grow receptive field multiplicatively without adding parameters or losing resolution. Stride-2 downsampling doubles the effective per-layer receptive-field increment, at the cost of spatial resolution. The DeepLab segmentation architecture uses dilation heavily to get a large receptive field while keeping feature-map resolution; the ConvNeXt architecture (§14) re-popularised large kernels (7 × 7 depthwise convolutions) for the same reason, after ViT showed that global context matters more than the 3 × 3 community assumed.
Different tasks need different receptive-field profiles. Image classification typically needs whole-image receptive field at the classifier — the class is a global property of the image. Dense prediction (segmentation, depth estimation) needs varied receptive fields — nearby pixels' labels depend on local appearance, far-apart pixels' labels depend on global scene structure. Object detection needs a receptive field that comfortably covers the largest expected object but is not so much larger that small objects get drowned in context. The feature-pyramid network (Lin et al. 2017) made this explicit: build a pyramid of feature maps at multiple resolutions, each with a receptive field matched to an object-size range, and detect at the appropriate pyramid level.
A Vision Transformer's receptive field is "the whole input from layer 1": attention has no local-versus-global distinction, since any patch can attend to any other from the start. This makes the receptive-field analysis of §17 very different from the CNN analysis here. An honest comparison: CNNs build receptive field up through spatial hierarchy (helpful when the task has spatial hierarchy, costly when it doesn't); ViTs have global receptive field from layer 1 (helpful when global context matters, wasteful when it doesn't, and worse-scaling when compute is tight). The modern hybrid architectures (Swin, CoAtNet, MaxViT) split the difference.
A convolutional network learns features by gradient descent, one loss per training example, with no explicit instruction about what each layer should represent. What actually ends up inside a trained convnet? The answer — edges in the early layers, textures and parts in the middle, object-like concepts near the top — was one of the most satisfying empirical discoveries of the deep-learning era, and it has shaped everything from architecture design to transfer learning to mechanistic interpretability.
Matthew Zeiler and Rob Fergus's 2014 ECCV paper Visualizing and understanding convolutional networks introduced the deconvolutional network technique for inspecting what a CNN has learned. The method: pass an image through a trained convnet, identify a single active unit in an intermediate layer, zero out all other activations in that layer, and pass the resulting sparse activation pattern back through transposed versions of the forward operations to produce an image in pixel space that depicts what pattern in the input that unit was responding to. The resulting images (reproduced in every introductory deep-learning lecture since) show a striking progression: units in the first convolutional layer respond to oriented edges and colour blobs; units in the second layer respond to simple textures and corners; units in the middle layers respond to parts — eyes, wheels, textures of fur or grass; units in the later layers respond to object-like concepts — whole faces, dogs of specific breeds, cars at specific orientations. This hierarchy emerges from training, not from any built-in architectural constraint.
The feature hierarchy discovered in CNNs has a structural analogue in the visual cortex. David Hubel and Torsten Wiesel's 1959–1962 experiments on cat V1 identified simple cells responsive to oriented edges and complex cells responsive to edges within a translation-tolerant window. Subsequent work showed the ventral visual stream (V1 → V2 → V4 → IT) as a hierarchy of increasing complexity and increasing invariance — edges to textures to parts to object categories, with the final inferotemporal cortex carrying face- and object-selective responses. That a CNN trained with backprop on ImageNet reproduces this structural organisation is remarkable. Daniel Yamins, James DiCarlo, and collaborators (2014–2016) showed quantitatively that the activations of hidden layers in a trained CNN are better predictors of monkey visual-cortex responses than any hand-designed neuroscience model. The convnet isn't biologically faithful in detail, but the kind of hierarchical feature decomposition it learns is one that evolution independently arrived at.
Several methods exist for asking "what in this image drove this prediction?". The simplest is the saliency map (Simonyan, Vedaldi, Zisserman 2014): compute |∂ y_c / ∂ x|, the gradient of the class-c logit with respect to each input pixel, and display the magnitude. Pixels with larger gradients are pixels the prediction is more sensitive to. Refinements include SmoothGrad (Smilkov et al. 2017), which averages saliency maps over small random input perturbations to denoise them; Integrated Gradients (Sundararajan, Taly, Yan 2017), which integrates gradients along a path from a reference baseline to the input; and Grad-CAM (Selvaraju et al. 2017), which uses the gradient of the class score with respect to feature maps to produce a class-discriminative heatmap, commonly overlaid on the image to show where the network looked. These methods are useful diagnostically — they can reveal when a network is cheating (focusing on backgrounds, spurious correlations) — though they are known to be unreliable in various ways (Adebayo et al. 2018 Sanity checks for saliency maps), so the field treats them as rough evidence rather than ground truth.
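The saliency and SmoothGrad computations can be sketched on a toy model. The example below assumes a hypothetical two-layer ReLU network on a flattened 16-pixel input, with the gradient written out by hand in NumPy; a real CNN would use an autodiff framework instead, and the names `W1`, `W2`, `saliency`, and `smoothgrad` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))  # hidden units x input pixels (toy sizes)
W2 = rng.normal(size=(3, 8))   # classes x hidden units


def logits(x):
    """Class scores y = W2 . relu(W1 . x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def saliency(x, c):
    """|d y_c / d x| for the toy network, derived by the chain rule."""
    active = (W1 @ x > 0).astype(float)  # derivative of ReLU
    grad = W1.T @ (W2[c] * active)
    return np.abs(grad)


def smoothgrad(x, c, sigma=0.1, n=50):
    """Average the saliency map over n noisy copies of the input."""
    noisy = x + sigma * rng.normal(size=(n, x.size))
    return np.mean([saliency(z, c) for z in noisy], axis=0)


x = rng.normal(size=16)
s = saliency(x, c=0)
s_smooth = smoothgrad(x, c=0)
```

Each entry of `s` says how sensitive the class-0 logit is to the corresponding input pixel; `s_smooth` is the same quantity averaged over small perturbations.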
Two deeper lines of work probe what CNN features represent. Feature inversion (Mahendran, Vedaldi 2015) attempts to reconstruct an input that produces a given feature map — the resulting images show what structure a layer preserves (early layers preserve almost the whole image; later layers preserve only object-level structure, with the texture and precise position scrambled). Network dissection (Bau et al. 2017) correlates individual-unit responses with a library of human-labelled concepts (objects, parts, materials, textures) and quantifies how many units in a given layer are strongly selective for specific concepts. For a trained ResNet on Places365, roughly 20–30% of late-layer units show strong selectivity for named concepts; the rest are harder to interpret. This gave an empirical foundation to the notion that CNNs learn recognisable features, at least in part, and motivated the mechanistic-interpretability lineage that followed (Part XI, Ch 06).
A countervailing line of evidence complicated the clean "CNNs learn the right features" story. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy's 2015 Explaining and harnessing adversarial examples showed that tiny, imperceptible pixel perturbations can change a CNN's prediction from "panda" to "gibbon" with high confidence. Andrew Ilyas et al.'s 2019 Adversarial examples are not bugs, they are features argued that CNNs rely heavily on non-robust features — high-frequency image statistics that are predictive on natural data but human-imperceptible and easily perturbed. The picture is therefore mixed: the interpretable hierarchy (edges, textures, parts, objects) exists and is real, but trained CNNs also use subtle statistical cues that don't show up in Zeiler–Fergus visualisations and that humans don't use. The full picture of CNN representation remains an active research topic.
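The mechanics of such a perturbation can be illustrated with the fast gradient sign method from the Goodfellow, Shlens, and Szegedy paper: step each input coordinate by a small ε in the direction that increases the loss. The sketch below applies it to a hypothetical four-feature logistic-regression classifier rather than a CNN; the weights, input, and ε are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5, 1.5])  # hypothetical trained weights
b = 0.0


def prob(x):
    """P(class = 1 | x) for the toy logistic model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))


def fgsm(x, y, eps):
    """x + eps * sign(dL/dx), with L the cross-entropy loss for label y."""
    grad_x = (prob(x) - y) * w  # closed-form gradient for logistic regression
    return x + eps * np.sign(grad_x)


x = np.array([0.2, -0.1, 0.3, 0.1])
y = 1.0
x_adv = fgsm(x, y, eps=0.2)

print(prob(x), prob(x_adv))
```

No coordinate moves by more than ε, yet the logit shifts by ε times the L1 norm of the weights, which is enough here to push the prediction across the decision boundary. With thousands of input pixels, the same effect is available at a far smaller ε.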
Before AlexNet stunned the computer-vision community in 2012, the convolutional neural network existed as a tested-and-working-but-niche technique pioneered largely by one group over two decades. Yann LeCun's LeNet family of networks proved, from 1989 onward, that gradient-descent-trained convolutional networks could solve real problems — first for handwritten-digit recognition, eventually for commercial bank cheque reading. The architecture that came to embody the approach, LeNet-5, is the bridge between the ideas of the 1980s and the deep-learning era that followed.
Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel's 1989 Neural Computation paper Backpropagation applied to handwritten zip code recognition describes the first convolutional, end-to-end-trainable, backprop-optimised deep network in the modern sense. Trained on 7291 handwritten zip-code digits from U.S. Postal Service mail, the network reached a 5% error rate on the test set, which at the time was competitive with the best hand-engineered pipelines. The architecture had two convolutional hidden layers (5 × 5 kernels, 12 feature maps each, with stride-2 subsampling folded into the convolutions), a 30-unit fully-connected hidden layer, and a 10-unit output layer; separate subsampling layers came with later LeNet variants. Training used standard backpropagation on a SUN-4 workstation; a full training run took three days. The paper's central methodological claim, that features can be learned from data rather than hand-designed, was far from mainstream in a computer-vision community built on wavelet filters and edge detectors, but the error-rate numbers were impossible to ignore.
The refined LeNet-5 architecture appeared in Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner's 1998 Proceedings of the IEEE paper Gradient-based learning applied to document recognition. It has seven weight layers: alternating convolutional and subsampling layers (C1: six 5 × 5 filters on the 32 × 32 input; S2: 2 × 2 average-pool; C3: sixteen 5 × 5 filters; S4: 2 × 2 average-pool; C5: a fully-connected layer realised as a convolution), a fully-connected layer F6 with 84 units, and an output layer with 10 units (one per MNIST class). The activation was a scaled tanh, the loss was a Gaussian-connection approximate-log-likelihood, and training used stochastic gradient descent. On MNIST, the benchmark assembled from NIST's handwritten-digit collections (Special Database 1 and Special Database 3), LeNet-5 achieved 0.95% error, a number that remained competitive for a decade.
Reading the 1998 paper now, the thing that stands out is how much of modern deep learning was already there. Weight sharing, local receptive fields, and pooling (all the architectural priors of §1) are present and correctly motivated. So is end-to-end learning with SGD and backprop, the idea that features should come out of the training process rather than be hand-designed. So are data augmentation (the paper trains on distorted versions of the MNIST digits), modular construction of the training pipeline, and attention to invariances: the paper discusses translation, scale, and small deformations, all of which subsequent augmentation pipelines formalise. What LeNet-5 did not have was scale: neither the data (MNIST has 60,000 training images at 28 × 28, padded to 32 × 32 for the network's input), nor the compute (no GPUs yet), nor the depth (seven layers was at the limit of what networks with saturating sigmoid or tanh activations could train reliably). The 14 years between LeNet-5 and AlexNet were largely about waiting for those three constraints to relax.
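The layer list can be checked by tracing feature-map sizes through the network. This is a small sketch using the usual output-size formula for unpadded convolutions and pools; the 120 feature maps for C5 come from the 1998 paper rather than from the text above, and the helper name `conv_out` is illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32
trace = [("input", size, 1)]
for name, kind, kernel, channels in [
    ("C1", "conv", 5, 6),
    ("S2", "pool", 2, 6),
    ("C3", "conv", 5, 16),
    ("S4", "pool", 2, 16),
    ("C5", "conv", 5, 120),  # channel count taken from the 1998 paper
]:
    stride = 2 if kind == "pool" else 1
    size = conv_out(size, kernel, stride)
    trace.append((name, size, channels))

for name, size, channels in trace:
    print(f"{name}: {size} x {size} x {channels}")
```

The trace ends at a 1 × 1 map, which is why C5 can be described as a fully-connected layer realised as a convolution: its 5 × 5 kernel covers the whole 5 × 5 input it receives from S4.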
Throughout the 1990s and 2000s, convolutional networks were a working technique for document recognition (LeCun's NCR bank-cheque-reading pipeline processed roughly 10% of U.S. cheques at one point), used by a small community. Most of the computer-vision mainstream used different techniques — SIFT features (Lowe 1999, 2004), HOG descriptors (Dalal, Triggs 2005), bag-of-visual-words pipelines (Sivic, Zisserman 2003), deformable part models (Felzenszwalb, Girshick, McAllester, Ramanan 2010). On MNIST, CNNs were the best. On the harder benchmarks emerging in the mid-2000s — PASCAL VOC for object detection, Caltech-101/256 for classification — the hand-engineered pipelines were competitive and often better. The consensus was that CNNs worked on small, clean datasets like MNIST but didn't scale. That consensus would be wrong for reasons that only became clear with AlexNet.
Two developments in the late 2000s made the bridge possible. First, GPU-accelerated deep learning: Rajat Raina, Anand Madhavan, and Andrew Ng's 2009 Large-scale deep unsupervised learning using graphics processors showed that CUDA-programmed GPUs could train deep networks an order of magnitude faster than CPUs, and Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Gambardella, and Jürgen Schmidhuber's 2011 High-performance neural networks for visual object classification combined this with deep convnets and beat state-of-the-art on several small benchmarks. Second, a large labelled dataset: Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei's 2009 ImageNet: A large-scale hierarchical image database offered a million labelled natural images across 1000 object categories — two orders of magnitude more data than any prior benchmark. When a convolutional network in the LeNet tradition was trained on ImageNet on GPUs, it could finally be the network the LeNet-era authors had always argued was possible. That network was AlexNet.
The paper that re-founded deep learning was published at NeurIPS 2012 under the title ImageNet classification with deep convolutional neural networks. Its authors — Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton — submitted to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and won by a margin that had no precedent in the competition's history: 15.3% top-5 error to the runner-up's 26.2%. The 11-percentage-point gap was so large the field assumed for weeks that there was some implementation error. When the network was reproduced and the result confirmed, computer vision was done with hand-engineered features and the deep-learning era had begun.
AlexNet has eight weight layers: five convolutional layers followed by three fully-connected layers, 60 million parameters in total. The first convolutional layer uses 11 × 11 kernels with stride 4 on the 224 × 224 RGB input (an aggressive early subsampling to reduce spatial dimension quickly). The second layer uses 5 × 5 kernels; the remaining three convolutional layers use 3 × 3 kernels. Max-pooling layers follow the first, second, and fifth convolutional layers. The first two fully-connected layers are 4096 wide each, with dropout applied to both (the network is so over-parameterised that without dropout it overfits the ImageNet training set substantially); the third is the 1000-way output layer, whose softmax gives a distribution over the ImageNet classes. The architecture was split across two GTX 580 GPUs with 3 GB of memory each, because no single GPU of the era could hold the whole thing, and the paper includes the careful engineering of which layers communicate across GPUs.
The 2012 paper identified four choices that together delivered the breakthrough. ReLU activation: rectified linear units max(0, x) instead of sigmoids or tanh. In the paper's striking figure, a four-layer convnet with ReLUs reaches 25% training error on CIFAR-10 roughly six times faster than the same network with tanh units, because the ReLU gradient is 1 or 0 rather than saturating, so deeper networks can be trained before vanishing gradients become fatal. Training on GPUs: CUDA-programmed CNN training was a recent capability, and AlexNet's six-day training run on two GPUs was achievable where the equivalent CPU run would have taken years. Dropout: the then-new regulariser from Hinton's group (Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov 2014 formalised it, but it was already in use in 2012 experiments); dropout at 0.5 in the fully-connected layers was essential to controlling overfitting on the 1.2-million-image training set. Aggressive data augmentation: random crops, horizontal flips, and colour jitter (via PCA on the RGB pixel values, AlexNet's distinctive variant of augmentation). Each of these choices was individually present in the literature; their combination at scale was the decisive step.
ImageNet in 2012 was the benchmark the field had not known it needed. Fei-Fei Li and collaborators had spent years building a 14-million-image, 22,000-category dataset organised along the WordNet hierarchy, with the ILSVRC subset using 1.2 million training images across 1000 classes. Previous years of the challenge had been won by feature-engineering pipelines (Fisher vectors, locality-constrained linear coding) achieving top-5 error rates in the 26–28% range. When AlexNet delivered 15.3% in one shot, the signal was clear: learned features beat engineered features at scale. The 2013 edition of ILSVRC was dominated by deep networks; by 2014 every entry was a deep network. The hand-engineered era ended that quickly.
Several features of the original architecture did not survive the decade. Local response normalisation (LRN), a biologically motivated cross-channel normalisation applied after ReLU, turned out to help only marginally and was replaced by batch normalisation in subsequent work. The two-GPU split architecture was an engineering constraint, not a design principle; subsequent networks are homogeneous. The 11 × 11 first-layer kernel was a choice to reduce spatial dimension quickly given the compute budget; subsequent networks (VGG, ResNet) use smaller first-layer kernels and retain more spatial information. The three stacked fully-connected layers at the top were replaced in GoogLeNet and ResNet by global average pooling, which uses 100× fewer parameters. But the core — stacked convolutions, ReLU activations, pooling, dropout on the classifier, GPU training, data augmentation — is essentially all still standard practice.
The effect of AlexNet on the research community was immediate and profound. In late 2012 and early 2013, every major computer-vision lab re-tooled around deep convnets; hiring for anyone who had trained a CNN spiked; every computer-vision benchmark that had been dominated by hand-engineered pipelines — detection (PASCAL VOC), segmentation, tracking, face recognition — was re-run with convnet-based methods in 2013–2014 and records fell across the board. Geoffrey Hinton, Krizhevsky, and Sutskever started a company (DNNresearch) that was acquired by Google a few months after AlexNet's publication. Ilya Sutskever would later co-found OpenAI; Alex Krizhevsky would eventually leave the field over a disagreement about the direction of neural-network research. The paper itself has, as of the mid-2020s, been cited over 140,000 times — one of the most-cited papers in all of computer science.
Two years after AlexNet, Karen Simonyan and Andrew Zisserman's VGG family of networks asked a simple question: what happens if you just keep making the architecture deeper, and use the smallest sensible kernel throughout? The answer — 16- and 19-layer networks built almost entirely from 3 × 3 convolutions and 2 × 2 max-pools — became the default CNN design idiom and held that position through most of the 2014–2016 period.
Simonyan and Zisserman's 2014 paper Very deep convolutional networks for large-scale image recognition (ICLR 2015) advocated a radical simplification. AlexNet used kernels of 11 × 11, 5 × 5, and 3 × 3 in different layers, with ad-hoc padding and stride choices. VGG uses only 3 × 3 convolutions with stride 1 and padding 1 (preserving spatial dimensions), alternating with 2 × 2 max-pools with stride 2 (halving them). Feature-map depth doubles at each pooling stage until it reaches 512 (64 → 128 → 256 → 512 → 512), a pattern that became the default for most subsequent convnets. This gave networks of 16 or 19 layers, roughly twice AlexNet's depth, with a simple, regular structure that was easy to reason about and easy to reproduce. The design had the important pedagogical side effect that "deep means many 3 × 3 convs" entered the field's vocabulary.
Why 3 × 3? Simonyan and Zisserman's argument was that two stacked 3 × 3 convolutions have the same receptive field as a single 5 × 5 convolution (both see a 5 × 5 patch), but with two differences. First, fewer parameters: two 3 × 3 kernels have 2 · 9 · C² = 18 C² weights, versus 25 C² for a single 5 × 5 kernel — a 28% reduction. Second, more nonlinearity: two stacked convs have two ReLU activations between input and output, where a single conv has one, making the function class richer. The same argument applies to three stacked 3 × 3 vs a single 7 × 7 — 27 C² parameters vs 49 C², and three ReLUs vs one. For a given receptive field, small stacked kernels are more parameter-efficient and more expressive than a single large kernel. This argument held for most of the next seven years; ConvNeXt (§14) would later partially reverse it.
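The parameter arithmetic in this argument is easy to reproduce. The sketch below counts weights only (biases ignored) and assumes C input and C output channels at every layer, as the text does; the function names and the choice C = 64 are illustrative.

```python
def stacked_3x3_weights(n_layers, channels):
    """Weights in n stacked 3x3 convolutions with C -> C channels each."""
    return n_layers * 3 * 3 * channels * channels


def single_kernel_weights(kernel, channels):
    """Weights in one k x k convolution with C -> C channels."""
    return kernel * kernel * channels * channels


C = 64
savings = {}
for n_layers, kernel in [(2, 5), (3, 7)]:
    small = stacked_3x3_weights(n_layers, C)
    large = single_kernel_weights(kernel, C)
    savings[kernel] = 1 - small / large
    print(f"{n_layers} x (3x3): {small:,} vs one {kernel}x{kernel}: {large:,} "
          f"({savings[kernel]:.0%} fewer)")
```

The saving is independent of C because both counts scale with C²: 28% for the 5 × 5 case and about 45% for the 7 × 7 case.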
The two canonical configurations are VGG-16 (13 convolutional layers in five stages of 2, 2, 3, 3, 3 convs each, followed by three fully-connected layers) and VGG-19 (16 convolutional layers in stages of 2, 2, 4, 4, 4, followed by the same three fully-connected layers). VGG-16 has ~138 million parameters, of which ~124 million are in the three fully-connected layers at the top. The convolutional trunk of ~15 million parameters is modest; the fully-connected classifier head is a relic of the AlexNet template that became obvious bloat. Later architectures (GoogLeNet, ResNet) replaced these fully-connected layers with global average pooling plus a single linear layer, cutting the parameter count by an order of magnitude without loss of accuracy. VGG-16's ImageNet top-1 accuracy was 71.5%, a solid improvement over AlexNet's 63%, but VGG was edged out at ILSVRC 2014 itself by GoogLeNet and eclipsed the following year by ResNet.
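The split between trunk and classifier head can be verified by counting parameters directly. This sketch assumes the standard VGG-16 layout (3 × 3 convolutions with biases, a 224 × 224 input, five 2 × 2 pools, and fully-connected layers of width 4096, 4096, and 1000); the helper names are illustrative.

```python
def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


# (number of 3x3 convs, output channels) for each of the five stages
stages = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

c_in, conv_total = 3, 0
for n_convs, c_out in stages:
    for _ in range(n_convs):
        conv_total += conv_params(c_in, c_out)
        c_in = c_out

spatial = 224 // 2 ** len(stages)  # five 2x2 pools: 224 -> 7
fc_total = (fc_params(c_in * spatial * spatial, 4096)
            + fc_params(4096, 4096)
            + fc_params(4096, 1000))
total = conv_total + fc_total

print(f"conv trunk: {conv_total:,}  fc head: {fc_total:,}  total: {total:,}")
```

Most of the head's cost sits in the first fully-connected layer, which maps the flattened 7 × 7 × 512 feature map to 4096 units; that single layer is what global average pooling removes.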
For half a decade after its publication, VGG-16 was the default feature extractor for almost every downstream vision task. Object-detection architectures (Faster R-CNN), style-transfer systems (Gatys, Ecker, Bethge 2015), perceptual-loss functions for image generation (Johnson, Alahi, Fei-Fei 2016), image-captioning systems — most of them used the pretrained VGG-16 convolutional trunk as a fixed, or fine-tuned, feature backbone. The reason was partly that VGG features were good, but mostly that the architecture was regular and well-understood, so anyone could pretrain their own variant or fine-tune a released checkpoint. The perceptual loss based on VGG features — comparing the squared distance between VGG-layer activations of two images rather than the distance between the images themselves — became one of the most useful techniques in image generation; it's still in use today even though the underlying architecture has been superseded.
VGG is also a cautionary tale about depth-without-help. At 19 layers, VGG is essentially at the limit of what can be trained without the residual-connection trick introduced by ResNet. Making VGG deeper (25 or 30 layers of the same structure) is possible architecturally but fails empirically: training accuracy degrades, not just test accuracy. This is the degradation problem that motivated ResNet, and it is why VGG-19 is the end of the line for this design family. The other VGG limitation is sheer compute and memory cost: the 138-million-parameter model is more than twice AlexNet's size, and running it on commodity hardware is expensive. The ratio of accuracy to compute cost in VGG is not great (a ResNet-18 achieves similar top-1 accuracy with roughly 8× fewer FLOPs), so VGG eventually got displaced except for the narrow feature-extraction use case.
At the same ILSVRC 2014 where VGG took second place, first place went to a network with a very different architectural philosophy. Christian Szegedy and the Google Research team's GoogLeNet abandoned the VGG style of "stack many identical small kernels" in favour of multi-scale parallel branches inside each layer. The result was a 22-layer network with only 5 million parameters — an order of magnitude smaller than VGG-16 — that won ImageNet 2014 with 6.67% top-5 error.
The central architectural unit of GoogLeNet is the Inception module (Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich 2015, Going deeper with convolutions, CVPR). Instead of applying one kernel size at each layer, an Inception module applies several in parallel: a 1 × 1 convolution, a 3 × 3 convolution, a 5 × 5 convolution, and a 3 × 3 max-pool, all operating on the same input and all producing feature maps that are concatenated along the channel dimension at the output. The network is free to learn feature detectors at multiple spatial scales at the same layer, and the next layer receives a rich multi-scale representation as input. This was a direct architectural encoding of the idea that natural images have features at multiple scales simultaneously — a belief that had been implicit in multi-scale image pyramids for decades but had not been built into a single CNN layer before.
Naively, the Inception module is expensive: applying a 5 × 5 convolution to a feature map with 256 channels takes 5² × 256 = 6400 weights per output channel, or 6400 C_out weights in total, and the compute cost scales quadratically with kernel size. The GoogLeNet paper's signature trick was to use 1 × 1 convolutions as channel-wise bottlenecks: before the 3 × 3 and 5 × 5 branches, insert a 1 × 1 convolution that reduces the channel dimension (e.g., 256 → 64), then apply the larger kernel, then let the concatenation at the output restore the full channel count. This decoupled spatial receptive-field cost from channel cost, and it was the key efficiency move of GoogLeNet. The 1 × 1 convolution has two interpretations (a channel-mixing MLP applied identically at each spatial position, or a pointwise linear combination of features), and both turn out to be widely useful across subsequent architectures.
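The saving from the 1 × 1 bottleneck can be made concrete by counting weights in a 5 × 5 branch with and without the reduction. This sketch uses the 256 → 64 reduction from the text; the branch's output width of 64 channels is a hypothetical value chosen for illustration, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights (no biases) in one k x k convolution."""
    return kernel * kernel * c_in * c_out


c_in, c_mid, c_out = 256, 64, 64  # c_out is an illustrative assumption

direct = conv_weights(5, c_in, c_out)
bottleneck = conv_weights(1, c_in, c_mid) + conv_weights(5, c_mid, c_out)

print(f"direct 5x5:      {direct:,}")
print(f"1x1 then 5x5:    {bottleneck:,}")
print(f"reduction:       {direct / bottleneck:.1f}x")
```

The expensive 5 × 5 kernel now only sees 64 channels, so its cost drops by the reduction factor, and the added 1 × 1 convolution is cheap because it has no spatial extent.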
GoogLeNet's 22-layer depth was near the training-stability limit for the pre-residual era. The paper added auxiliary classifiers at intermediate layers — small 1000-way softmax heads partway through the network, each with its own cross-entropy loss — as a form of deep supervision (Lee et al. 2014). The gradients from these auxiliary losses flowed back through the lower layers directly, bypassing the upper layers and mitigating vanishing gradients. At inference time, the auxiliary classifiers were discarded. This trick became largely obsolete once residual connections solved the depth problem more cleanly, but it was a notable intermediate step.
The Inception family evolved over four papers. Inception v2 (Szegedy, Vanhoucke, Ioffe, Shlens, Wojna 2016 Rethinking the Inception architecture for computer vision) adopted batch normalisation (Ioffe, Szegedy 2015) as standard; indeed the Inception variant that the BatchNorm paper used as its test platform is commonly called "BN-Inception". v2 also factored the 5 × 5 kernels into two stacked 3 × 3 kernels (the VGG argument applied to Inception), and factored n × n into 1 × n followed by n × 1 for asymmetric efficiency. Inception v3 refined the same template with label smoothing and RMSProp. Inception v4 and Inception-ResNet v1/v2 (Szegedy et al. 2017) combined Inception modules with residual connections, importing the then-dominant ResNet trick into the Inception lineage. Each version pushed ImageNet top-1 accuracy up a couple of percent; none displaced the residual-connection paradigm that was dominating by 2016.
Inception's most distinctive architectural choices, the hand-designed multi-branch modules and the auxiliary classifiers, largely did not carry forward into the post-ResNet era. The ResNet-style "identity plus a small residual computation" won the architectural debate by 2016 and became the dominant template. But three Inception contributions are still with us. Batch normalisation (via BN-Inception) became a universal training-time tool. The 1 × 1 convolution as a channel-mixing and bottleneck operation is now a staple of almost every efficient convnet (MobileNet, ShuffleNet, EfficientNet all use it extensively; ResNet's bottleneck block is built around it). And the general principle of multi-scale feature processing survives in modern architectures like the Feature Pyramid Network (§15) and high-resolution networks (HRNet).
If one architectural invention defines the second half of the CNN decade, it is the residual connection. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun's 2015 paper Deep residual learning for image recognition (CVPR 2016 best paper) took a problem the field was stuck on — networks deeper than about 20 layers degraded in both training and test accuracy, no matter the regularisation — and solved it in a single line of code: add the input of a block to its output. The change unlocked 50-layer, 100-layer, and 1000-layer networks; the ImageNet top-5 error fell from GoogLeNet's 6.67% to ResNet-152's 3.57% in a single generation; and the residual-connection pattern migrated from vision into every subsequent deep-learning architecture, transformer included.
Before ResNet, the field had an embarrassing empirical observation: making a CNN deeper beyond a certain point made it worse, even on training accuracy. In the ResNet paper's ImageNet experiments, a 34-layer plain VGG-style network had higher training error than an 18-layer version. This was not overfitting (if it were, training error would decrease while test error rose); it was an optimisation failure. The network had more capacity, so in principle it could express every function the shallower network could (just set the extra layers to the identity) plus more, but SGD couldn't find the parameters that would do so. He and colleagues' key observation was that learning an identity mapping is hard when the architecture forces every layer to be a non-trivial transformation; if the network needed some layers to be identity-like (as the degradation argument suggests), the optimiser had to stumble onto them.
ResNet's solution: restructure the network so that each block learns a residual on top of the identity, rather than a full transformation. A residual block computes y = F(x) + x, where x is the input and F(x) is a small convolutional sub-network (typically two or three conv-bn-relu layers). If the task needs this block to be identity, the network learns F(x) = 0 — easy, just push the weights to zero. If the task needs this block to transform, the network learns a small correction on top of the identity. The identity shortcut adds zero parameters and negligible compute, but it restructures the optimisation landscape in a way that makes arbitrary depth trainable.
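The "push the weights to zero and get the identity" property can be seen in a few lines. The sketch below is a deliberate simplification: F is two per-position linear maps with a ReLU between them (in effect 1 × 1 convolutions, without batch normalisation or the final ReLU of the original block), and the names `residual_block`, `w1`, and `w2` are illustrative.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = relu(x @ w1) @ w2.

    x has shape (positions, channels); w1 and w2 are (channels, channels).
    """
    return relu(x @ w1) @ w2 + x


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))  # 10 spatial positions, 4 channels

# With all-zero weights the residual branch vanishes and the block is the identity.
zeros = np.zeros((4, 4))
y_identity = residual_block(x, zeros, zeros)

# With small weights the block is a small correction on top of the identity.
w1 = 0.01 * rng.normal(size=(4, 4))
w2 = 0.01 * rng.normal(size=(4, 4))
y_small = residual_block(x, w1, w2)

print(np.abs(y_identity - x).max(), np.abs(y_small - x).max())
```

A plain block without the shortcut would output zeros for zero weights, so representing the identity would require the optimiser to find a specific non-trivial weight setting instead.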
For ResNet-50 and deeper, the residual block uses a bottleneck structure to save compute. Instead of two 3 × 3 convolutions, the block is: a 1 × 1 convolution that reduces channels (e.g., 256 → 64), a 3 × 3 convolution at the reduced channel count, a 1 × 1 convolution that restores channels (64 → 256), and the identity shortcut. This needs roughly 17× fewer weights and multiply-accumulates than two 3 × 3 convolutions applied directly at 256 channels (about the cost of a two-3 × 3 block at 64 channels) while still exposing a 256-channel interface to the rest of the network. The bottleneck-plus-residual pattern (channel reduce, spatial convolution, channel expand, plus identity shortcut) is among the most-replicated architectural motifs in deep learning; it shows up in ResNet and ResNeXt, in inverted form in MobileNet and EfficientNet, and even in the MLP blocks of modern Transformer variants.
The original paper reported ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152 on ImageNet, with top-1 accuracies steadily climbing with depth (69.8% → 73.3% → 76.0% → 77.4% → 78.6%). ResNet-50, the middle of this range, became the standard CNN benchmark reference for the next decade, used in hundreds of thousands of experiments and cited in almost every subsequent vision paper for accuracy comparison. ResNet-1202 (in the original paper's CIFAR-10 experiments) was trained successfully, a network deeper than any of the time, though it tested slightly worse than the 110-layer network, which the authors attributed to overfitting. The depth-vs-accuracy curve flattened in the thousands range: beyond about 1000 layers, the marginal benefit of depth was small.
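How much the bottleneck saves depends on which two-3 × 3 block it is compared against, and a direct weight count makes the baselines explicit. This is a sketch counting convolution weights only (no biases or batch-norm parameters), using the 256 and 64 channel widths from the text; the variable names are illustrative.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights (no biases) in one k x k convolution."""
    return kernel * kernel * c_in * c_out


wide, narrow = 256, 64

# Bottleneck: 1x1 reduce, 3x3 at reduced width, 1x1 restore.
bottleneck = (conv_weights(1, wide, narrow)
              + conv_weights(3, narrow, narrow)
              + conv_weights(1, narrow, wide))

# Two baselines: a two-3x3 block at full width and one at reduced width.
basic_wide = 2 * conv_weights(3, wide, wide)
basic_narrow = 2 * conv_weights(3, narrow, narrow)

print(f"bottleneck (256-64-256): {bottleneck:,}")
print(f"two 3x3 at 256 channels: {basic_wide:,}  ({basic_wide / bottleneck:.1f}x)")
print(f"two 3x3 at 64 channels:  {basic_narrow:,}  ({basic_narrow / bottleneck:.2f}x)")
```

At a fixed spatial resolution, multiply-accumulates scale with these weight counts, so the same ratios apply to compute. The bottleneck costs about as much as a basic block at 64 channels while presenting 256 channels to its neighbours.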
He, Zhang, Ren, Sun's 2016 follow-up Identity mappings in deep residual networks studied the exact placement of the residual connection. The original ResNet applied ReLU after the residual addition; the follow-up showed that moving the ReLU and BN before the convolutions in each block (the pre-activation variant) produced an even cleaner identity path — gradient could flow through the shortcuts unchanged — and improved accuracy slightly. The pre-activation variant became standard for most subsequent work, though the original layout is still in wide use because it is what the first ResNet checkpoints were trained with.
Two close relatives extended the residual template in different directions. Wide ResNet (Zagoruyko, Komodakis 2016) argued that depth wasn't the main driver — channel width was — and produced shallow-and-wide variants (WRN-28-10: 28 layers, 10× wider than ResNet) that matched deeper ResNets with lower compute cost. ResNeXt (Xie, Girshick, Dollár, Tu, He 2017) introduced grouped convolutions inside the bottleneck, splitting the intermediate 3 × 3 conv into several parallel branches — a kind of Inception-style multi-path inside a ResNet block, with a single "cardinality" parameter controlling the number of branches. ResNeXt-101-64×4d achieved ~80% top-1 on ImageNet without tricks, and the grouped-conv idea propagated into MobileNet (where depthwise separable convolution is the extreme case of cardinality = channel-count).
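The effect of cardinality on cost follows from how a grouped convolution partitions its channels: each of the g groups connects only C_in/g inputs to C_out/g outputs. The sketch below counts weights for a 3 × 3 convolution under three group settings; the 128-channel width and the helper name are illustrative assumptions.

```python
def grouped_conv_weights(kernel, c_in, c_out, groups):
    """Weights (no biases) in a grouped k x k convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    per_group = kernel * kernel * (c_in // groups) * (c_out // groups)
    return per_group * groups


channels = 128
dense = grouped_conv_weights(3, channels, channels, groups=1)
grouped = grouped_conv_weights(3, channels, channels, groups=32)
depthwise = grouped_conv_weights(3, channels, channels, groups=channels)

print(dense, grouped, depthwise)
```

Weights fall by a factor of g, which is what lets a grouped design spend the saved budget on a wider intermediate layer; the depthwise case, with one group per channel, is the limit in which no channel mixing happens in the spatial convolution at all.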
If ResNet's central idea was additive feature reuse (the block's output plus its input), DenseNet took the idea to its logical extreme with concatenative feature reuse — every layer's output is concatenated into the input of every subsequent layer in the same stage. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Weinberger's Densely connected convolutional networks (CVPR 2017 best paper) showed that this extravagant connectivity gave comparable or better accuracy than ResNet at a fraction of the parameter count.
A dense block consists of a series of layers, each of which receives as input the concatenation of all preceding layers' outputs within the block. If layer 1 produces an output of k channels, then layer 2 receives the original input plus k more channels; layer 3 receives the input plus 2k; layer 4 receives the input plus 3k; and so on. The number k is the growth rate — typically 12 or 24 — the number of new feature maps each layer contributes to the block's running concatenation. The block grows wider rather than deeper as layers are added, but each layer only has to learn a small incremental contribution (the k new channels). Between dense blocks, transition layers (a 1 × 1 convolution followed by a 2 × 2 average-pool) reduce the accumulated channel count and downsample spatial dimensions before the next block.
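The channel bookkeeping of a dense block can be sketched as follows. The block input width of 24 channels and the depth of 4 layers are hypothetical values for illustration; the growth rate of 12 is one of the typical values mentioned above.

```python
def dense_block_input_channels(c0, growth_rate, n_layers):
    """Input width seen by each layer: block input plus k per preceding layer."""
    return [c0 + i * growth_rate for i in range(n_layers)]


def dense_block_output_channels(c0, growth_rate, n_layers):
    """Width of the running concatenation after the last layer."""
    return c0 + n_layers * growth_rate


c0, k, n_layers = 24, 12, 4  # illustrative block input width and depth
inputs = dense_block_input_channels(c0, k, n_layers)
output = dense_block_output_channels(c0, k, n_layers)

print("per-layer input channels:", inputs)
print("block output channels:   ", output)
```

Every layer produces only k new channels regardless of how wide its input has become, and the linearly growing concatenation is what the transition layer's 1 × 1 convolution compresses before the next block.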
DenseNet's authors argued that dense connectivity provides three benefits. Gradient flow: the direct connections from every layer to every subsequent layer give gradients a short path backward, even more directly than ResNet's residual shortcuts. Feature reuse: instead of every layer having to recompute useful features from scratch (or inherit them via an additive residual), later layers can directly use earlier layers' features by accessing them through the concatenation. This means each layer can produce a small number of genuinely new features rather than having to redundantly carry forward the useful features from earlier layers. Parameter efficiency: because layers don't have to carry forward redundant information, DenseNet achieves comparable accuracy to ResNet with roughly half the parameter count.
The original paper reported four configurations — DenseNet-121, 169, 201, and 264 — named for their layer count. DenseNet-121 has 8 million parameters and matches ResNet-34's 21 million in ImageNet top-1 accuracy. DenseNet-264 has 33 million parameters and matches ResNet-152's 60 million. The parameter savings are real, but they come with a memory-cost trade-off: DenseNet's concatenated activations use substantial memory at training time, which can make it slower per iteration on GPUs than a ResNet of similar parameter count. This memory overhead kept DenseNet from becoming the default in spite of its parameter-efficiency arguments; most practitioners continued to use ResNet-50 as the default, accepting the higher parameter count for the simpler implementation and lower memory profile.
DenseNet's specific architecture didn't dominate the post-2017 landscape; ResNet variants did. But several ideas that DenseNet championed have become standard. Concatenative skip connections are the same mechanism used by U-Net (§16), which predates DenseNet, and by the feature-aggregation patterns in Feature Pyramid Networks and High-Resolution Networks. The principle that concatenation is sometimes better than addition (the network can learn to route which subset of features to use, rather than having them summed together) shows up in modern encoder-decoder architectures for dense prediction. And the growth-rate idea, a network that grows channel-wise rather than purely deepening, is an antecedent of modern compound-scaling strategies like EfficientNet (§13).
Despite its narrower role in mainstream image classification, DenseNet retained a strong presence in specific application areas — medical imaging, satellite imagery, and dense-prediction tasks (segmentation, depth estimation) where the feature-reuse and multi-scale-aggregation properties were particularly useful. Several high-resolution-imaging models (HRNet, in particular) use DenseNet-like repeated multi-scale aggregation. The overall lesson — that connectivity patterns beyond "previous layer to next layer" can be beneficial — became a standard design degree of freedom even when the specific dense-connectivity pattern was not the final choice.
ResNet and DenseNet optimised for ImageNet top-1 accuracy with compute and memory treated as secondary constraints. A parallel line of architectural work, starting in 2016, treated accuracy per unit of compute as the primary objective — building networks designed to run on phones, microcontrollers, and inference accelerators. The resulting architectures — MobileNet, SqueezeNet, ShuffleNet, EfficientNet — pushed the compute-accuracy Pareto frontier far beyond the ResNet template and produced the networks that shipped in billions of consumer devices.
The key efficient-CNN trick is the depthwise-separable convolution, first used in Inception and popularised by Xception (Chollet 2017) and MobileNet. A standard convolution mixes spatial and channel information simultaneously: the 3 × 3 × C_in × C_out kernel combines the 3 × 3 spatial structure with a full sum over C_in input channels for each of C_out output channels. A depthwise-separable convolution factors this into two steps. First, depthwise convolution: a single 3 × 3 × 1 kernel per input channel, applied independently, giving spatial mixing with no cross-channel mixing. Second, pointwise convolution: a 1 × 1 × C_in × C_out kernel that mixes channels with no spatial structure. The combined compute cost per output position is 3·3·C_in + C_in·C_out multiply-accumulates rather than 3·3·C_in·C_out, a factor of 1/C_out + 1/9 of the full convolution's compute. Because that factor can never fall below 1/9 for a 3 × 3 kernel, the saving is 8–9× for typical channel counts.
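The cost comparison above is easy to check numerically. The sketch below counts multiply-accumulates per output position only; the function names and the channel counts in the loop are illustrative assumptions.

```python
def standard_conv_macs(kernel, c_in, c_out):
    """Multiply-accumulates per output position for a full k x k convolution."""
    return kernel * kernel * c_in * c_out


def separable_conv_macs(kernel, c_in, c_out):
    """Depthwise (k*k per input channel) plus pointwise (c_in*c_out) cost."""
    return kernel * kernel * c_in + c_in * c_out


def cost_ratio(kernel, c_in, c_out):
    """Separable cost as a fraction of the standard cost: 1/c_out + 1/k^2."""
    return separable_conv_macs(kernel, c_in, c_out) / standard_conv_macs(kernel, c_in, c_out)


for channels in (64, 256, 1024):
    ratio = cost_ratio(3, channels, channels)
    print(f"C_in = C_out = {channels}: ratio {ratio:.4f}, saving {1 / ratio:.2f}x")
```

For a 3 × 3 kernel the saving approaches 9× as the output channel count grows but never exceeds it, since the depthwise term alone contributes 1/9 of the full cost.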
Andrew Howard, Menglong Zhu, Bo Chen, and colleagues' 2017 MobileNets: Efficient convolutional neural networks for mobile vision applications built an entire ImageNet classifier out of depthwise-separable blocks, with a width multiplier α that scaled channel counts to trade accuracy for compute. MobileNet v1 at α = 1.0 came within about a point of VGG-16 in top-1 accuracy using ~30× fewer FLOPs. MobileNet v2 (Sandler, Howard, Zhu, Zhmoginov, Chen 2018) added two innovations: inverted residuals (a bottleneck block that expands channels in the middle rather than contracting them, thin-to-thick-to-thin, the opposite of ResNet's bottleneck) and linear bottlenecks (omitting the ReLU in the last 1 × 1 projection to avoid information loss in the low-channel representation). MobileNet v3 (Howard et al. 2019) used NAS to search for block-level hyperparameters, added the h-swish activation (a cheap approximation to swish that replaces the sigmoid with a piecewise-linear hard sigmoid), and produced two versions: Small for microcontrollers, Large for mainstream mobile. Together the three MobileNet generations became the default efficient backbone for on-device vision; Google's Android Vision API, Apple's Core ML Vision framework, and most mobile-OS vision primitives use MobileNet variants.
Forrest Iandola, Song Han, and colleagues' 2016 SqueezeNet used 1 × 1 "squeeze" convolutions to reduce channels before 3 × 3 "expand" convolutions, producing AlexNet-level accuracy in 1.25 million parameters — 50× smaller — small enough to fit on the flash memory of an iPhone. ShuffleNet (Zhang, Zhou, Lin, Sun 2018) used grouped convolutions plus a channel shuffle operation to get the compute savings of groups without the information-flow bottleneck that pure grouping introduces. Both targeted the same extreme-efficiency regime as MobileNet; ShuffleNet v2 (Ma, Zhang, Zheng, Sun 2018) provided a set of design guidelines for efficient architectures (equal channel widths minimise memory access, avoid excessive group convolution, etc.) that influenced subsequent work.
Mingxing Tan and Quoc Le's 2019 EfficientNet: Rethinking model scaling for convolutional neural networks observed that scaling a CNN could be done along three independent axes (depth, meaning more layers; width, meaning more channels per layer; and resolution, meaning larger input images) and that prior work had mostly scaled one axis at a time. Their compound scaling prescription scaled all three jointly with a single parameter φ: depth ∝ α^φ, width ∝ β^φ, resolution ∝ γ^φ, with α·β²·γ² ≈ 2 so that compute roughly doubles per unit of φ. Starting from a small NAS-designed base network (EfficientNet-B0), they scaled it to EfficientNet-B7 by varying φ. The resulting family of eight networks (B0 through B7) dominated the ImageNet accuracy-vs-compute Pareto frontier for two years: EfficientNet-B7 reached 84.4% top-1 on ImageNet with 66M parameters and 37B FLOPs, versus GPipe's 84.3% at 557M parameters and 64B FLOPs. EfficientNet-B0 at 5.3M parameters matched or exceeded ResNet-50's 76% top-1 at roughly a tenth of the compute (about 0.39B versus 4.1B FLOPs).
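The compound-scaling rule can be made concrete with a short calculation. The coefficients below (α = 1.2, β = 1.1, γ = 1.15) are the values commonly quoted from the EfficientNet paper and should be treated as an assumption here; the function name is hypothetical. The compute estimate uses the fact that convolution FLOPs scale linearly with depth and quadratically with width and resolution.

```python
# Assumed coefficients (commonly quoted from the EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly in depth, quadratically in width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these coefficients α·β²·γ² comes to about 1.92, so each unit of φ slightly less than doubles the compute.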
Most modern efficient CNNs use some form of neural architecture search (NAS): automated search over architectural choices (block types, depths, widths, activations, kernel sizes) to find configurations optimised for a specific compute budget. Zoph, Le 2017 (Neural Architecture Search with Reinforcement Learning), NASNet (Zoph et al. 2018), MnasNet (Tan et al. 2019), and EfficientNet all employed variants of NAS. The compute cost of NAS has come down dramatically (the original reinforcement-learning searches consumed thousands to tens of thousands of GPU-days; one-shot and differentiable methods such as DARTS and ProxylessNAS need GPU-hours to a few GPU-days), making NAS a practical tool for architecture design rather than a research curiosity. The design playbook for efficient CNNs (depthwise-separable convs, inverted residuals, NAS-tuned block sizes, compound scaling, carefully tuned activations) has become the standard recipe for ~2020-era efficient CNNs.
The Vision Transformer (ViT, Dosovitskiy et al. 2020) outperformed ResNet on ImageNet in late 2020 — and the vision community largely concluded that convolutions were obsolete. ConvNeXt (Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie 2022) challenged that conclusion with a carefully controlled study: take a plain ResNet, apply one at a time the design choices that distinguished modern transformers from 2015-era convnets, measure what each change gave, and see if the result was a competitive architecture. It was.
Starting from a ResNet-50 baseline and ending at ConvNeXt-T (a network with similar compute cost), the ConvNeXt paper's "roadmap" applied a sequence of small modifications. (1) Training recipe: copy the ViT-era training recipe (AdamW, weight decay 0.05, cosine schedule, 300 epochs, RandAugment, MixUp, CutMix, label smoothing, stochastic depth); this alone moved a plain ResNet-50 from 76.1% to 78.8% on ImageNet top-1. (2) Macro design: change the stage ratio from ResNet-50's (3, 4, 6, 3) to (3, 3, 9, 3), mimicking Swin Transformer's (1, 1, 3, 1) ratio; replace the ResNet stem (7 × 7 conv, stride 2) with a "patchify" stem (4 × 4 non-overlapping conv, stride 4). (3) ResNeXt-ify: use grouped convolutions with groups equal to input channels (i.e., depthwise convolutions), then widen the network to compensate. (4) Inverted bottleneck: use MobileNet-v2-style inverted residuals rather than ResNet's straight bottleneck. (5) Large kernels: replace the 3 × 3 depthwise conv with 7 × 7, the large-kernel move that had been out of fashion since VGG but that ViT-style global attention suggested might be useful. (6) Micro design: replace ReLU with GELU, replace BatchNorm with LayerNorm, use fewer activations and normalisations per block, add a separate downsampling layer rather than strided convolutions. Each change individually gave a small improvement (0.1–0.8%); together they pushed ConvNeXt-T to 82.1% on ImageNet, ahead of Swin-T's 81.3% at the same compute.
Following the recipe at larger scales produced ConvNeXt-S, -B, -L, and -XL. Pretrained on ImageNet-22k, the largest model, ConvNeXt-XL, reached 87.8% top-1 on ImageNet-1k, competitive with the best Swin Transformer and ViT results at that compute budget. Transfer to downstream tasks (COCO detection, ADE20K segmentation) likewise showed ConvNeXt roughly matching or slightly beating the best transformer-based models. ConvNeXt v2 (Woo et al. 2023) added masked-autoencoder-style self-supervised pretraining and pushed the numbers further. The overall conclusion was firm: a properly-modernised convolutional architecture is not inferior to a transformer at the scale and compute regimes typical of vision tasks. What had seemed in 2020 like an architectural revolution was, more accurately, a set of training-recipe and design-choice upgrades that happened to first be applied to transformers.
The paper's understated but pointed methodological claim is that many comparisons of CNN and ViT architectures had been confounded by training recipe. ViT benefited from AdamW, heavy augmentation, cosine schedules, and long training, while contemporaneous CNN comparisons used the older SGD-momentum / step-schedule / modest augmentation recipe. When the CNN was given the same recipe, the gap narrowed dramatically; when the architecture was additionally modernised with transformer-era design choices, the gap closed. This doesn't mean ViT has no advantages — at very large data scales (billions of images), the scaling curves of ViT and ConvNeXt may still diverge in ViT's favour — but at typical supervised-ImageNet scale, the architecture family is no longer the bottleneck.
Several subsequent architectures took the convnet side of the debate further. MaxViT (Tu et al. 2022) combined MBConv (MobileNet inverted residuals) with block-attention to get a hybrid convnet-transformer. ConvFormer and MogaNet (Li et al. 2022/2023) explored further modernisations. InternImage (Wang et al. 2023) used deformable convolutions (Dai et al. 2017) for adaptive receptive fields, achieving 65.4% COCO AP at scale, a strong CNN-based detection number. None of these has become the single dominant architecture — the field has diffused into an accept-both state rather than crowning a winner — but the cumulative evidence is clear: CNNs remain a live, competitive architecture family.
ConvNeXt's implicit methodological lesson — that apparent architectural advantages can be confounded by training recipe, and that careful controlled experiments are needed before declaring architectural revolutions — has been widely absorbed. Modern vision-architecture papers now routinely include ablations over training recipe and compute-matched baselines. The broader takeaway — that architectural convergence between CNNs and transformers may be deeper than either community initially believed — is also now widely accepted; several analyses have shown that deep ViTs learn features that behave much like CNN features, and that the differences between the architectures are smaller at the feature-level than their Fourier-space or self-attention descriptions would suggest.
Image classification — "what is the main object in this image?" — is the problem that most of the CNN architecture innovation was benchmarked on. Object detection — "find every object in this image and draw a box around each one, labelling its class" — is the canonical harder task built on top of a classification backbone. The decade after AlexNet saw the detection task re-engineered in three generations (region-based, single-shot, and transformer-based) that each brought the frontier closer to real-time, high-accuracy performance.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik's 2014 Rich feature hierarchies for accurate object detection and semantic segmentation introduced the region-based CNN (R-CNN) pipeline. The process: use a classical region-proposal algorithm (Selective Search) to generate ~2000 candidate object bounding boxes per image; pass each proposed region through a pretrained CNN classifier; output per-class scores for each region and apply non-maximum suppression to overlapping detections. R-CNN improved mean average precision (mAP) by more than 30% relative to the best pre-CNN result on PASCAL VOC, the detection equivalent of AlexNet's classification breakthrough. The drawback was speed: passing 2000 regions through a CNN serially took 47 seconds per image.
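The final step of the pipeline, suppressing overlapping detections, can be sketched as a greedy procedure over intersection-over-union (IoU). This is a minimal illustration of standard greedy non-maximum suppression, not R-CNN's original code; the box format (x1, y1, x2, y2), the 0.5 threshold, and the example boxes are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (hypothetical values).
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
example_scores = [0.9, 0.8, 0.7]
print(nms(example_boxes, example_scores))  # [0, 2]
```

In practice suppression is run separately per class, so that a high-scoring detection of one class does not remove an overlapping detection of another.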
Girshick's 2015 Fast R-CNN fixed the speed problem by running the CNN once on the whole image, then cropping region-of-interest features from the shared feature map via RoIPool. Ren, He, Girshick, Sun 2015 Faster R-CNN replaced the external Selective Search with a region proposal network (RPN) — a small CNN that shared features with the detector and proposed candidate regions directly — yielding the first fully-learned detection pipeline. Mask R-CNN (He, Gkioxari, Dollár, Girshick 2017) added a per-region mask-prediction branch, extending region-based detection to instance segmentation. The R-CNN lineage became the standard high-accuracy detector of the late 2010s.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi's 2016 You only look once (YOLO) took the opposite architectural approach: a single forward pass of a CNN directly predicts bounding boxes and class probabilities for all objects in the image, with no region-proposal step. YOLO divides the input into a grid, has each grid cell predict a fixed number of bounding boxes and classes, and uses a combined loss that balances localisation and classification. The first YOLO ran at 45 FPS on a Titan X, several times faster than Faster R-CNN, though at somewhat lower accuracy. The Single Shot MultiBox Detector (SSD, Liu et al. 2016) was a contemporary architecture with a similar single-pass design and detection heads attached to feature maps at multiple scales. The YOLO lineage has iterated continuously: YOLOv2 (Redmon, Farhadi 2017), YOLOv3 (2018), YOLOv4 (Bochkovskiy et al. 2020), YOLOv5 (Ultralytics, 2020), YOLOv7, YOLOv8, YOLO-NAS, YOLOv9, YOLOv10. Each iteration pushes the accuracy–speed frontier; YOLO variants are the dominant real-time detector in production systems.
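The grid formulation fixes the shape of the network's output in advance. The sketch below assumes the configuration usually quoted for the first YOLO on PASCAL VOC (a 7 × 7 grid, 2 boxes per cell, 20 classes, 448 × 448 input) and assumes each box carries five numbers (x, y, w, h, confidence) with class probabilities shared per cell; the function names are hypothetical.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the prediction tensor: one vector per grid cell.

    Each box contributes 5 values (x, y, w, h, confidence); class
    probabilities are predicted once per cell, not once per box.
    """
    per_cell = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, per_cell)


def responsible_cell(cx, cy, image_w, image_h, grid_size):
    """Grid cell (row, col) containing an object's centre point."""
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


print(yolo_output_shape(7, 2, 20))             # (7, 7, 30)
print(responsible_cell(224, 100, 448, 448, 7))  # (1, 3)
```

During training, only the cell containing an object's centre is asked to predict that object, which is what lets the whole detector be trained as a single regression problem.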
Single-shot detectors had historically trailed region-based detectors in accuracy, for a specific reason: the sea of negative (background) proposals in the dense single-shot grid dominated the loss, and the rare positive examples (actual objects) got drowned out. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár's 2017 Focal loss for dense object detection introduced the focal loss, a modulation of cross-entropy that down-weights easy (high-confidence) examples and keeps the gradient focused on hard examples. The resulting RetinaNet matched the accuracy of two-stage detectors while keeping single-shot speed, closing the accuracy gap and making single-shot the default for many applications.
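The down-weighting effect is easiest to see on a single binary prediction. The sketch below implements the focal loss in its usual form, FL = −α_t (1 − p_t)^γ log(p_t); the defaults γ = 2 and α = 0.25 are the values commonly cited from the paper and are an assumption here, as are the function names.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background example versus a hard foreground example.
easy_negative = (0.01, 0)  # confidently and correctly rejected
hard_positive = (0.10, 1)  # an object the model is missing
for name, (p, y) in (("easy negative", easy_negative), ("hard positive", hard_positive)):
    print(f"{name}: CE = {cross_entropy(p, y):.5f}, focal = {focal_loss(p, y):.7f}")
```

With these settings the easy negative's loss is scaled down by a factor on the order of 10⁻⁴ while the hard positive keeps about a fifth of its cross-entropy, so thousands of easy background cells no longer swamp the few objects.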
Detection requires recognising objects at many scales simultaneously — a tiny person in the background and a large car in the foreground — and this is exactly the scenario the CNN's feature hierarchy was built for. Tsung-Yi Lin and colleagues' 2017 Feature Pyramid Networks for object detection (FPN) formalised the multi-scale detection idea: take feature maps from multiple stages of the CNN backbone, upsample and combine them into a pyramid of feature maps at different resolutions, and run detection heads at each level. Small objects are detected from high-resolution early-stage features; large objects from low-resolution late-stage features. FPN became part of essentially every subsequent detection architecture (Faster R-CNN + FPN, RetinaNet + FPN, etc.).
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko's 2020 End-to-end object detection with transformers (DETR) re-architected detection around transformer attention. The architecture: a CNN backbone produces features, a transformer encoder contextualises them, a transformer decoder takes a fixed set of object queries and attends to the features to predict a set of (class, box) tuples. Training uses a Hungarian-matching loss that optimally assigns each predicted tuple to a ground-truth object. DETR eliminated many hand-designed components (anchor boxes, NMS, region-proposal heuristics) and produced competitive accuracy, though early versions were slow to train (500 epochs). Subsequent improvements — Deformable DETR (Zhu et al. 2021), DINO (Zhang et al. 2022), Grounding DINO — have made DETR-style detectors competitive with CNN-based detectors and have increasingly taken over open-vocabulary and text-conditioned detection.
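The matching step can be illustrated on a toy example. The sketch below finds the minimum-cost one-to-one assignment by brute force over permutations, which is only feasible for a handful of objects; real implementations use a polynomial-time Hungarian-style solver. The cost function (L1 box distance plus a unit penalty for a wrong class) is a simplified stand-in and not DETR's actual matching cost, and all names and values are hypothetical.

```python
from itertools import permutations


def match_cost(pred, target):
    """Toy cost: L1 distance between boxes, plus 1 if the class labels differ."""
    (pred_class, pred_box), (target_class, target_box) = pred, target
    box_l1 = sum(abs(p - t) for p, t in zip(pred_box, target_box))
    return box_l1 + (0.0 if pred_class == target_class else 1.0)


def optimal_assignment(preds, targets):
    """Minimum-cost one-to-one assignment of targets to predictions.

    Returns (pairs, cost) where pairs is a sorted list of
    (prediction_index, target_index). Predictions left unmatched would be
    trained to predict "no object".
    """
    assert len(preds) >= len(targets)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return sorted((p, t) for t, p in enumerate(best_perm)), best_cost


# Three predictions (class, (cx, cy, w, h)) and two ground-truth objects.
example_preds = [
    ("cat", (0.5, 0.5, 0.2, 0.2)),
    ("dog", (0.1, 0.1, 0.1, 0.1)),
    ("dog", (0.8, 0.8, 0.3, 0.3)),
]
example_targets = [
    ("dog", (0.8, 0.8, 0.3, 0.3)),
    ("cat", (0.5, 0.5, 0.2, 0.2)),
]
pairs, total_cost = optimal_assignment(example_preds, example_targets)
print(pairs)  # [(0, 1), (2, 0)]
```

Because each ground-truth object is matched to exactly one prediction, duplicate detections are penalised during training, which is why this style of detector can drop non-maximum suppression.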
Object detection labels whole boxes; segmentation labels individual pixels. Semantic segmentation labels each pixel with a class (road, car, pedestrian, sky); instance segmentation distinguishes individual objects (this is pedestrian #3, distinct from pedestrian #4). Segmentation has been the dense-prediction workhorse of applied computer vision — autonomous driving, medical imaging, satellite analysis, content-aware image editing — and its architectures have developed a distinctive design vocabulary of encoder-decoder structures with skip connections.
Jonathan Long, Evan Shelhamer, and Trevor Darrell's 2015 Fully convolutional networks for semantic segmentation (CVPR 2015) made the first CNN-based semantic segmentation system competitive with hand-engineered pipelines. The architectural insight was that if you remove the fully-connected classifier head from a CNN and replace it with more convolutional layers (possibly transposed to upsample), the whole network produces a per-pixel output map the same spatial size as the input. Long et al.'s FCN took a VGG-16 backbone, replaced its FC layers with 1 × 1 convolutions to produce per-pixel class logits, and added upsampling layers to restore resolution from the coarse features back to the input scale. The resulting network produced a class label per pixel in a single forward pass. Key refinement: skip connections from earlier stages of the backbone (higher spatial resolution) to the upsampling decoder, so that fine detail from early layers combined with semantic content from late layers.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox's 2015 U-Net: Convolutional networks for biomedical image segmentation (MICCAI 2015) introduced the architecture that became the default for dense-prediction tasks. The structure: an encoder (progressively downsampling convolutional stages that extract features) and a decoder (progressively upsampling convolutional stages that reconstruct a full-resolution output map), with skip connections from each encoder stage to the corresponding decoder stage, concatenating features from the encoder side to the decoder side at matching resolutions. The name comes from the U-shape of the architecture diagram. U-Net was originally designed for biomedical imaging (cell segmentation in microscopy images, where the training set was only a few dozen annotated images, heavily augmented) but proved effective across domains. It has become the default segmentation architecture, and its structure is directly inherited by diffusion models (Ho, Jain, Abbeel 2020 DDPM uses a U-Net backbone). As of 2024, U-Net's citation count runs well into the tens of thousands, putting it among the most-cited architectures in machine learning.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille's DeepLab series (v1 2015, v2 2017, v3 2017, v3+ 2018) approached dense prediction through a different architectural lens. Rather than an encoder-decoder U-Net with skip connections, DeepLab uses a backbone with dilated (atrous) convolutions that preserve spatial resolution throughout the late stages, followed by an Atrous Spatial Pyramid Pooling (ASPP) module that captures multi-scale context. DeepLab avoided the information loss of aggressive downsampling, at the cost of much higher compute in the late stages. DeepLab v3+ combined the ASPP-style multi-scale module with a light U-Net-style decoder, becoming a mainstream segmentation architecture for autonomous driving and robotics applications.
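The resolution-preserving property of dilated convolutions follows from simple arithmetic on kernel extents. The sketch below uses the standard formulas for effective kernel size and output size; the dilation rates 6, 12, and 18 and the 65-pixel feature map are illustrative assumptions, and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis for a (possibly dilated) convolution."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (size + 2 * padding - k_eff) // stride + 1


# A 3 x 3 kernel at increasing dilation rates covers a much larger window
# with the same nine weights.
for rate in (1, 6, 12, 18):
    print(rate, effective_kernel_size(3, rate))

# With padding equal to the dilation rate, a stride-1 3 x 3 dilated
# convolution leaves the feature-map size unchanged.
print(conv_output_size(65, 3, padding=6, dilation=6))  # 65
```

Running several such branches in parallel at different rates is what lets a pyramid-pooling module see context at multiple scales without any further downsampling.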
Semantic segmentation labels pixels by class but doesn't distinguish instances: all pedestrians become a single "pedestrian" region. Instance segmentation — distinguishing the 3rd pedestrian from the 4th — was solved by Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick's 2017 Mask R-CNN, which extends Faster R-CNN with a per-region mask-prediction branch. Each detected bounding box gets a fine-grained binary mask at the box's resolution, producing instance-level segmentation. Panoptic segmentation (Kirillov et al. 2019) unified semantic and instance segmentation into a single task — every pixel labelled with both a class and (for countable objects) an instance ID — and associated architectures like Panoptic FPN combined both prediction heads into a single network.
The 2023 Segment Anything Model (SAM, Kirillov et al., Meta FAIR) is the segmentation analogue of a foundation model. Trained on 1.1 billion masks across 11 million images (the SA-1B dataset, assembled partly by data-engine methods), SAM takes an image and a prompt (a point, a box, a coarse mask, or, in exploratory form, text) and produces high-quality segmentation masks without task-specific fine-tuning. SAM's backbone is a ViT; its mask decoder is a small transformer. The release reshaped the segmentation landscape: for many applications, fine-tuning a task-specific segmentation architecture was replaced by prompting SAM. The follow-on SAM 2 (2024) extended this to video. For CNN-based segmentation, this is the story's current coda: the problem has partly moved from "design the right U-Net variant" to "prompt a pretrained foundation model", though U-Net remains the default for tasks where SAM doesn't apply (small domains, custom training, tight compute budgets).
In October 2020, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby posted a paper titled An image is worth 16×16 words: transformers for image recognition at scale. Its claim — that a pure transformer, applied directly to image patches with no convolution anywhere in the architecture, could match or beat CNNs on ImageNet classification given enough pretraining data — forced the vision community to confront a possibility that had been dismissed for years. Five years later the crossover is incomplete but undeniable: CNNs and transformers coexist, hybridise, and specialise for different regimes.
ViT's architecture is almost the minimum-viable adaptation of a language transformer to vision. Split the input image into non-overlapping patches (typically 16 × 16 pixels on a 224 × 224 image, giving 14 × 14 = 196 patches). Linearly project each patch into a token embedding of dimension D (typically 768 or 1024). Add learned positional embeddings so the transformer knows which patch was where. Prepend a special [CLS] token whose final representation will be the image-level classifier input. Run the sequence through a stack of standard transformer encoder blocks (multi-head self-attention + MLP, with LayerNorm and residual connections). Take the final [CLS] representation, feed it through a linear layer to produce class logits. That's it — no convolution anywhere.
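The tokenisation step can be sketched without any framework. The first function computes the sequence length and flattened patch dimension for a given image and patch size; the second cuts a small single-channel image (a list of rows) into flattened patches in raster order. Both function names are hypothetical, and the linear projection, positional embeddings, and transformer blocks are deliberately left out.

```python
def vit_sequence_shape(image_size, patch_size, channels=3, cls_token=True):
    """Return (sequence_length, flattened_patch_dim) for a square image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels
    return num_patches + (1 if cls_token else 0), patch_dim


def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


print(vit_sequence_shape(224, 16))  # (197, 768)

tiny_image = [[4 * r + c for c in range(4)] for r in range(4)]
print(patchify(tiny_image, 2))
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

For a 224 × 224 RGB image with 16 × 16 patches this gives 196 patch tokens plus the [CLS] token, each patch flattened to 768 raw values before the linear projection to dimension D.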
ViT's performance in the paper was dataset-dependent in an important way. Trained from scratch on ImageNet-1k (1.2M images), ViT underperformed a comparable-compute ResNet. Pretrained on JFT-300M (Google's 300M-image proprietary dataset) and fine-tuned on ImageNet-1k, ViT matched or beat BiT-L (the best-at-the-time CNN) and continued improving with even larger pretraining sets. The conclusion: transformers lack the inductive biases (locality, translation equivariance) that convnets have built in, so they need more data to learn those biases from examples — but given enough data, they can learn more than convnets, because their function class is less constrained. The practical takeaway is that ViT is the right choice when you have lots of data or can pretrain on lots of data (foundation-model-style); a CNN is often still the right choice when data is limited.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou's 2021 Data-efficient image transformers (DeiT) showed that a careful training recipe (heavy augmentation, label smoothing, and distillation from a CNN teacher through a dedicated distillation token trained on the teacher's hard labels) could make ViT competitive with CNNs on ImageNet-1k-only training, closing the data-scale gap substantially. DeiT made ViT practical for labs without access to 300M-image proprietary datasets, and it helped drive the architecture's broader adoption.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo's 2021 Swin Transformer: Hierarchical vision transformer using shifted windows split the difference between ViT and CNN. Swin uses transformer blocks, but within local windows (7 × 7 patches by default) rather than globally, and uses a hierarchical structure with progressive spatial downsampling: the same macro-structure as a CNN, with the per-stage operation being window-based attention instead of convolution. Shifted windows (the window partition is offset by half a window between consecutive blocks) give cross-window information flow. Swin was the first transformer to match or beat CNNs across classification, detection, and segmentation simultaneously, making it the first "general-purpose" vision transformer. The CoAtNet (Dai et al. 2021) and MaxViT (Tu et al. 2022) architectures go further in combining convolutional and attention operations within a single backbone.
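Two pieces of arithmetic explain why windowed attention is attractive and why the shift is needed. The sketch below counts query-key pairs for global versus windowed attention and shows how a cyclic shift puts tokens from neighbouring windows into the same window. The 56 × 56 token grid (a 224 × 224 image with 4 × 4 patches) and the shift of 3 (half of a 7-token window, rounded down) are illustrative assumptions, and the function names are hypothetical.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height, width, window):
    """Query-key pairs when attention is restricted to non-overlapping windows."""
    assert height % window == 0 and width % window == 0
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


def window_index(row, col, height, width, window, shift=0):
    """Window containing a token after cyclically shifting the grid by `shift`."""
    shifted_row = (row - shift) % height
    shifted_col = (col - shift) % width
    return (shifted_row // window, shifted_col // window)


print(global_attention_pairs(56, 56))     # 9834496
print(window_attention_pairs(56, 56, 7))  # 153664

# Tokens (6, 6) and (7, 7) sit in different windows under the regular
# partition but share a window once the partition is shifted by 3.
print(window_index(6, 6, 56, 56, 7), window_index(7, 7, 56, 56, 7))
print(window_index(6, 6, 56, 56, 7, shift=3), window_index(7, 7, 56, 56, 7, shift=3))
```

Windowed attention grows linearly with the number of tokens rather than quadratically, which is what makes the high-resolution early stages of a hierarchical backbone affordable.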
Five years after ViT, the vision community has settled into a rough equilibrium. For classification on standard datasets at standard scales, CNNs (ConvNeXt) and transformers (ViT, Swin) are within a point of each other at matched compute, with the choice driven by implementation and deployment considerations. For very large pretraining on hundreds of millions to billions of images (CLIP, DINOv2, JFT-scale pretraining), ViTs scale better and are the default. For real-time mobile inference, CNNs (MobileNet variants) dominate. For dense prediction (segmentation, detection), both architecture families are in use; Swin-based encoders and ConvNeXt-based encoders both perform well. For foundation vision models (SAM, CLIP, DINOv2), transformers are the default. The debate "which architecture is better?" has been replaced by "which architecture is better for this task, this data scale, this deployment constraint?" — a healthier, more nuanced equilibrium. The convolutional prior is not dead; it's one tool among several, with its strongest case at modest data scales and its weakest case when billions of images and ViT-scale transformers are available.
Convolutional networks are not only a chapter in computer vision; they are the archetype that demonstrated a broader idea about what deep learning is for. The CNN story, that encoding domain-appropriate structural priors in the architecture produces more efficient, less data-hungry, better-generalising models than starting with a maximally-general function class, has echoed across every subsequent corner of machine learning. This closing section places the CNN in the broader landscape and traces its downstream influence.
The clearest lesson of the CNN decade is that architecture is a form of regularisation often more powerful than explicit penalties (Chapter 03) or stochastic perturbations. An MLP baseline, a network with no architectural prior, is far worse than a CNN on image tasks at ordinary data scales, even with heavy regularisation. The convolutional prior's translation equivariance and local feature extraction match strong statistical regularities of natural images, and baking them into the architecture gives the network a starting point that training can otherwise reach only with far more data. This is the CNN's philosophical gift to deep learning: the answer to "how do we regularise deep networks?" includes "choose the right architecture for the data."
The pretrained ImageNet CNN has been the backbone of essentially every computer-vision application of the past decade. Object detection uses a CNN backbone and bolts on detection heads. Semantic segmentation uses a CNN backbone and bolts on a decoder. Image-to-image tasks (super-resolution, denoising, colourisation) use CNN encoder-decoders. Style transfer uses pretrained VGG features as perceptual losses. Image captioning uses a CNN backbone as a visual feature extractor feeding into a language model. Even tasks that eventually moved to transformers (image generation with diffusion models) retained CNN (specifically U-Net) backbones for the noise-prediction network, because the CNN's spatial inductive bias is useful for pixel-level output. The CNN backbone is the substrate on which most of modern computer vision runs.
The "fine-tune a pretrained ImageNet CNN on your task" workflow is one of the most productive ideas to come out of the CNN era. Take a ResNet-50 pretrained on ImageNet, replace the final classifier head with one for your task, and fine-tune on your (often much smaller) task-specific dataset. This typically requires only a few thousand labelled examples, often one or two orders of magnitude fewer than training from scratch, and produces competitive models across a wide range of domains. The workflow works because the lower layers of the pretrained CNN have learned generic low-level features (edges, textures, colours) that transfer across tasks; the fine-tuning mainly adapts the higher layers to the task specifics. This transfer-learning methodology scaled up into the foundation-model era; CLIP-pretrained ViTs, DINOv2-pretrained features, and SAM-style foundation models are all direct descendants.
The convolutional template has migrated to adjacent domains. Audio processing uses 2D CNNs on spectrograms and, in WaveNet, causal dilated 1D convolutions directly on raw waveforms, which produced state-of-the-art speech synthesis for several years before transformers took over. Text processing used CNNs briefly in the mid-2010s (Kim's 2014 sentence-classification CNN, character-level CNNs for text classification) before transformers displaced them, but 1D convolutions still show up inside language-model embedding stacks. Time-series forecasting uses temporal convolutional networks (TCN, Bai, Kolter, Koltun 2018) with causal dilated convolutions. Graph-structured data uses graph convolutions (Kipf, Welling 2017) that generalise the weight-sharing-over-a-grid idea to weight-sharing-over-a-graph. The convolutional template (local receptive fields, weight sharing, hierarchical feature extraction) is the mental model used to design new architectures across many domains.
In the 2026 view, CNNs are not the single dominant architecture they were in 2017. Transformers have taken over language, large-scale vision pretraining, and many multimodal tasks. But CNNs remain the default for modest-data vision, for real-time mobile inference, for dense-prediction decoders, and for any application where the spatial inductive bias is a clear asset. ResNet-50 is still, 11 years after publication, one of the most-used architectures in all of machine learning: it is a reference baseline that new vision papers routinely report against, and it is the workhorse of countless production systems. The CNN's intellectual gift (the demonstration that architectural priors matter, and that designing them well produces outsized returns) will outlast any specific CNN variant.
Convolutional neural networks are the most thoroughly documented architecture family in all of deep learning. The list below starts with anchor textbooks that cover the whole modern picture, then turns to the foundational papers — LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, R-CNN and its descendants, YOLO, U-Net, FCN — that each introduced a piece of the current vocabulary. Modern extensions cover the efficient-CNN lineage (MobileNet, EfficientNet), the ConvNeXt rehabilitation, the transformer crossover (ViT, Swin, DETR, SAM), and the infrastructural refinements (effective receptive fields, anti-aliased pooling, deformable convolutions). Software ships the decade's architectural work in reusable form. Start with one of the textbooks for the coherent picture, then descend to primary sources when a detail needs verifying.
This page is Chapter 04 of Part V: Deep Learning Foundations. The three preceding chapters built the general machinery of deep networks (forward and backward pass, optimisers, regularisation); this chapter has followed that machinery into its first great architectural success story, the convolutional neural network. We traced the CNN from its origins in LeCun's 1989 handwritten-digit work, through LeNet-5 and the MNIST era, to AlexNet's 2012 ImageNet breakthrough, through the rapid architectural development of VGG, GoogLeNet, ResNet, and DenseNet, into the efficient-CNN era of MobileNet and EfficientNet, and through the post-transformer rehabilitation of convolutions in ConvNeXt. We saw CNN backbones underpin the downstream computer-vision tasks — object detection (R-CNN, YOLO, DETR) and semantic and instance segmentation (FCN, U-Net, Mask R-CNN, SAM) — and we examined the crossover with Vision Transformers that defines the current equilibrium of the field. The chapter's broader argument — that architecture is a form of regularisation, that hardcoding the right inductive bias into the function class produces outsized returns — is a thread that will continue through the rest of Part V. Chapter 05 turns to sequence models, the equivalent architectural lineage for temporal data: from RNNs to LSTMs to gated recurrent units to the attention-based models that Chapter 06 will treat as the general solution. Each of these architecture families encodes a different inductive bias; each has a different place in the modern toolbox; and together they form the architectural vocabulary the rest of the compendium builds on.