RL Fundamentals: learning from reward, not labels.

Reinforcement learning is the third paradigm of machine learning: rather than fitting labels or finding structure, an agent interacts with an environment, receives scalar reward signals, and learns a behavior policy to maximize cumulative return. The Markov decision process is the mathematical backbone that makes this tractable — and the Bellman equations, value functions, and exploration–exploitation tradeoff are the concepts every subsequent algorithm builds on.

Chapter notes

This chapter establishes vocabulary and formal structure. The algorithms — dynamic programming, Q-learning, policy gradients — appear in chapters 02 through 04. Notation follows Sutton & Barto (2018) throughout: uppercase letters for random variables (S, A, R), lowercase for their realizations (s, a, r), subscript t for time step.

Prerequisites: probability theory (Part I Ch 04), basic optimization (Part I Ch 03). Knowledge of deep learning is not required here but becomes necessary starting in chapter 03.

01

The third paradigm

Supervised learning requires a teacher with correct answers. Unsupervised learning requires a dataset to find structure in. Reinforcement learning requires neither — only a way to interact with a world and receive feedback on the results.

In supervised learning, the training signal is explicit: here is an input, here is the correct output, adjust your parameters to close the gap. The data is given, the target is given, and the loss function makes the problem well-posed. Generalization is the challenge; the learning signal itself is not in question.

Reinforcement learning inverts this. There is no labeled dataset. There is no teacher who can say what the right action was. Instead there is an agent that takes actions in an environment and receives a scalar reward — a signal that encodes, loosely and often sparsely, whether things went well or poorly. The agent must discover, through trial and error, which actions lead to high cumulative reward over time. The learning signal is indirect: reward says "this was good" or "this was bad," but does not say what should have been done instead.

This is closer to how biological learning often works. A child learning to walk does not receive a labeled dataset of correct joint angles; it tries things, falls, recovers, and slowly improves. A chess player does not receive move-by-move annotations; they play games, win or lose, and adjust. The temporal credit assignment problem — figuring out which of the many decisions in a long sequence was responsible for a good or bad outcome — is what makes RL both natural as a paradigm and technically difficult as a field.

The defining features of RL. Three properties jointly distinguish RL from other learning paradigms: the agent learns from interaction, the feedback is a scalar reward (not a gradient or a label), and decisions are sequential — today's action affects tomorrow's situation. Remove any one and you have a different problem.

02

A history of reinforcement learning

Reinforcement learning did not emerge fully formed from a single paper or laboratory. It was assembled slowly, from animal psychology, operations research, cybernetics, and computer science — with long stretches where progress seemed impossible and breakthroughs that surprised even practitioners.

Roots in Animal Learning Theory (1890s–1950s)

The intellectual ancestry of RL traces further back than computing itself. Edward Thorndike, working with cats and puzzle boxes in the 1890s, articulated what he called the Law of Effect: behaviors followed by satisfying consequences become more likely; behaviors followed by discomfort become less likely. This is, in modern language, a policy gradient update — the probability of an action increases when the outcome was good and decreases when it was bad. Thorndike had no mathematics for it, no notion of states or value functions, but the core idea was there.

Ivan Pavlov's contemporaneous work on classical conditioning — the conditioned reflex — mapped a complementary territory: how animals learn to predict outcomes from preceding stimuli. Where Thorndike's work described learning to act, Pavlov's described learning to anticipate. Both would eventually find mathematical expression in RL: Thorndike's Law of Effect underlies policy-gradient methods; Pavlovian anticipation underlies temporal-difference learning, which is explicitly about learning to predict future reward. The neuroscientist Wolfram Schultz later discovered (1990s) that dopamine neurons in the primate brain fire in a pattern that precisely matches TD prediction errors — a striking convergence of neuroscience and the algorithm Sutton had formalized in 1988.

Clark Hull in the 1930s–40s attempted to mathematize Thorndike's ideas into a formal theory of habit strength, introducing notions of stimulus generalization and delayed reinforcement that foreshadow function approximation and discounting. B.F. Skinner's operant conditioning framework provided the most systematic experimental methodology for measuring the effects of reward schedules on behavior — variable-ratio schedules produce the highest response rates and the most persistent behavior, a finding that maps directly to questions about reward shaping and curriculum design in RL.

Cybernetics and the Dream of Self-Regulating Machines (1940s–1950s)

Norbert Wiener's Cybernetics (1948) placed feedback control and purposive behavior at the center of a new unified science of communication and control in animals and machines. Wiener was not doing RL in any modern sense, but he was asking RL's central question: how can a system use feedback about the consequences of its actions to regulate its future behavior toward a goal? Claude Shannon's information theory, also from 1948, provided the mathematical foundation for thinking about information flow through feedback loops, and laid the groundwork for thinking about states and observations quantitatively.

Alan Turing, in his 1950 paper introducing the Turing test, sketched a vision of machine learning via reward and punishment — what he called the "pleasure-pain system." He imagined a child machine that could be trained through reinforcement to perform arbitrary tasks, much as human children are. Marvin Minsky's 1951 SNARC (Stochastic Neural Analog Reinforcement Calculator) — an analog computer built from vacuum tubes and clutches — was arguably the first physical machine designed explicitly around a reinforcement-learning principle: it simulated a rat navigating a maze, strengthening the connections that led to reward. Minsky's 1954 PhD thesis elaborated the theoretical framework, introducing ideas about prediction and reinforcement that would not be fully formalized for another thirty years.

Dynamic Programming and the Bellman Equations (1950s)

The mathematical foundations of modern RL were laid independently, and largely in isolation from the animal-learning tradition, by Richard Bellman at RAND Corporation. Bellman was working on sequential decision problems in operations research — how to make a sequence of decisions to optimize a long-run objective — and he developed dynamic programming as the solution technique. The key insight, now called the Bellman principle of optimality, states that an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

This recursive structure — the value of a state equals the immediate reward plus the discounted optimal value of the next state — is the Bellman equation, which sits at the center of virtually every RL algorithm in use today. Bellman worked out the full framework for finite-horizon and infinite-horizon problems, proved convergence of value iteration under appropriate conditions, and introduced the MDP formalism that would become RL's standard model. He also explicitly coined the term "the curse of dimensionality" to describe why dynamic programming was computationally impractical for large state spaces — computing optimal policies requires sweeping over every state, which is feasible when there are thousands of states but not millions or trillions. The history of RL from the 1950s to the 2010s is largely the history of finding ways around this curse through approximation, sampling, and eventually deep learning.

Samuel's Checkers Player (1959–1962)

Arthur Samuel at IBM built one of the first programs capable of learning to play a game better than its programmer through self-play — a checkers player that updated its evaluation function based on game outcomes. Samuel used a form of temporal-difference learning decades before it had that name: the program compared its evaluation of a position to the evaluation it reached a few moves later, using the discrepancy to update the earlier evaluation. It also used rote learning (memorizing specific positions) and feature-based generalization, anticipating the combination of lookup tables and function approximation that would define later RL systems. Samuel's program defeated a former checkers champion in 1962, generating enormous press attention. But its significance was easily misread: the program had learned to be good at checkers by playing millions of games against itself, not because it had achieved general intelligence — a pattern that would recur throughout RL's history.

Temporal-Difference Learning and the 1980s Synthesis

The modern formal framework for RL — unifying the animal-learning tradition with Bellman's dynamic programming — was assembled primarily by Richard Sutton and Andrew Barto at the University of Massachusetts Amherst in the 1980s. Sutton's 1988 paper "Learning to Predict by the Methods of Temporal Differences" gave a rigorous formulation of temporal-difference (TD) learning: adjusting a prediction by comparing it to a prediction made slightly later, using the discrepancy as the learning signal. This was simultaneously a formalization of what Samuel had done intuitively, a mathematical model of what Pavlovian conditioning accomplishes, and a practical algorithm that could be run on a computer. The unification was conceptually significant: it showed that what animal brains do, what Samuel's program did, and what Bellman's equations prescribe are all instances of the same underlying principle.

Sutton and Barto's textbook Reinforcement Learning: An Introduction, first circulated as a draft from the late 1980s and published in full in 1998 (with a second edition in 2018), became the field's standard reference. It presented the MDP formalism, Bellman equations, TD methods, and Q-learning in a unified treatment that established the vocabulary and notation practitioners still use today. Chris Watkins introduced Q-learning in his 1989 PhD thesis — the first off-policy TD algorithm, one that could learn the optimal policy while following an exploratory one. Watkins and Dayan's 1992 convergence proof gave Q-learning theoretical solidity and made it the dominant tabular RL algorithm for two decades.

TD-Gammon: The Promise Deferred (1992)

Gerald Tesauro at IBM Research trained TD-Gammon — a backgammon program using TD learning with a neural-network value function — entirely through self-play. Starting from random play, within a few hundred thousand games it reached a level competitive with world-class human players. TD-Gammon was a stunning demonstration that neural-network RL could achieve expert performance on a complex game with a vast state space. But the breakthrough failed to generalize. Attempts to apply similar methods to other games and control problems mostly failed, for reasons that were not well understood at the time: the instabilities of combining neural networks, temporal-difference bootstrapping, and off-policy learning — what Sutton would later call the "deadly triad" — were lurking beneath TD-Gammon's success in a way that backgammon's specific structure happened to avoid. The field retreated toward more theoretically tractable tabular methods through the 1990s and 2000s.

The Policy Gradient Renaissance and RL's Quiet Decade (2000s)

The 2000s saw substantial theoretical progress with limited empirical spectacle. Williams' REINFORCE algorithm (1992) had introduced policy gradient methods, and Sutton et al.'s 1999 policy gradient theorem provided a cleaner derivation valid for function approximators. Kakade (2001) introduced natural policy gradients. Approximate dynamic programming found applications in operations research — inventory management, routing, energy control, financial trading — where reward signals were well-defined and plentiful. In robotics, RL made modest inroads in locomotion and manipulation, but required carefully shaped rewards and hand-designed state representations that limited generality.

Deep RL and the Atari Breakthrough (2013–2015)

The decisive shift came in 2013 when Volodymyr Mnih and colleagues at DeepMind submitted "Playing Atari with Deep Reinforcement Learning." The Deep Q-Network (DQN) learned to play Atari 2600 games directly from raw pixel input, using a convolutional neural network to approximate Q-values: seven games in the 2013 workshop paper, extended to 49 in the 2015 Nature follow-up. Two stabilizing techniques made this tractable for the first time: an experience replay buffer that stored past transitions and sampled random mini-batches to break temporal correlations, and a target network updated only periodically to prevent the moving-target instabilities in the TD update. With these two innovations, the deadly triad was tamed — at least for off-policy Q-learning on these benchmarks.

What made DQN historically significant was not any single game result but the proof of concept: one architecture, one algorithm, one set of hyperparameters, learning from raw pixels across 49 games, achieving human-level or better performance on more than half of them. The Nature paper (2015) launched deep RL as a recognizable subfield, attracting researchers from across machine learning and generating a wave of follow-on work — Double DQN, dueling networks, prioritized experience replay, distributional RL — each improving sample efficiency or stability.

AlphaGo and the Perfect-Information Frontier (2016–2018)

DeepMind's AlphaGo (2016) defeated world Go champion Lee Sedol 4–1 by combining supervised learning on human expert games, a learned value network, and Monte Carlo tree search — demonstrating that RL could tackle a game long considered beyond AI's reach. AlphaGo Zero (2017) eliminated the human data dependency entirely, learning superhuman Go purely through self-play in 40 days. AlphaZero (2018) generalized this to chess and shogi as well, using the same algorithm and hyperparameters for all three games. AlphaZero's chess play was qualitatively different from previous computer chess — it discovered novel openings and sacrificial strategies that had never appeared in centuries of human play, suggesting genuine discovery rather than memorization.

Real-Time Strategy and Multi-Agent RL (2018–2019)

OpenAI Five and DeepMind's AlphaStar tackled real-time strategy games — Dota 2 and StarCraft II respectively — with imperfect information, thousands of simultaneous decisions, and game lengths requiring credit assignment over tens of minutes. In 2019, OpenAI Five defeated the reigning Dota 2 world champions and AlphaStar reached Grandmaster level in StarCraft II, both after simulated millennia of self-play experience. These results demonstrated that RL could scale to real-time partially observable multi-agent environments, and also surfaced the staggering sample complexity of RL in complex settings — compute budgets that raised serious questions about the path to more general agents.

RL for Scientific Discovery (2020–)

RL has since been applied to scientific discovery problems where no labeled solution dataset exists but a reward function can be defined. DeepMind demonstrated RL control of plasma in a nuclear fusion tokamak (2022), achieving configurations that had never been manually engineered. Google's chip floor-planning work used RL to optimize TPU layouts, outperforming human engineers. Molecular design systems use RL to navigate combinatorial chemical space toward molecules with target properties. FunSearch (2023) used an LLM inside an evolutionary search loop to discover new constructions and algorithms for combinatorics problems. In each case, the same structure applies: vast action space, no labeled solutions, a simulatable or computable reward.

RLHF and the Language Model Era (2022–)

The most societally visible application of RL in the current era is Reinforcement Learning from Human Feedback. InstructGPT (OpenAI, 2022) demonstrated that a PPO-trained model optimizing a learned reward model could produce dramatically more helpful outputs than the raw pretrained model. ChatGPT, GPT-4, Claude, and Gemini all use variants of this pipeline. For the first time, RL was not producing impressive game-playing agents for researchers to admire — it was shaping the behavior of systems interacting with hundreds of millions of people, determining their helpfulness, honesty, and safety. More recently, systems like DeepSeek-R1 and OpenAI's o1 use RL with outcome-based rewards to train extended chain-of-thought reasoning, with models discovering problem-solving strategies — backtracking, checking work, trying alternative approaches — that emerge from the RL training signal rather than being explicitly programmed.

Two recurring themes. Reading RL's history, two patterns stand out. First: ideas cycle through the field. TD learning was implicit in Pavlovian conditioning, then in Samuel's checkers, then was formalized by Sutton, then retreated (in its neural-network form) after TD-Gammon as the field fell back on tabular methods, then returned as the engine of DQN. Second: scale changes what is possible. TD-Gammon ran on 300,000 games; DQN needed millions of frames; AlphaZero needed tens of millions of self-play games; OpenAI Five needed simulated millennia. The underlying algorithms were often similar; the compute increased by orders of magnitude. Understanding RL's history means understanding when the field was waiting for better theory, and when it was waiting for cheaper hardware.

03

The agent–environment interface

Everything in RL is built on a single repeating loop: the agent observes a state, selects an action, and the environment returns a new state and a reward. Everything else is commentary on this cycle.

At each discrete time step t, the interaction proceeds as follows. The environment is in some state S_t. The agent observes S_t (or some function of it) and selects an action A_t from the available action space. The environment transitions to a new state S_{t+1} according to some transition dynamics, and simultaneously emits a scalar reward R_{t+1}. The agent then observes S_{t+1} and R_{t+1} and selects A_{t+1}. This loop continues until a terminal state is reached or indefinitely.

[Figure: the agent–environment loop. The agent selects A_t; the environment emits S_{t+1} and reward R_{t+1}; the cycle repeats for t = 0, 1, 2, ….]
The agent–environment loop. At each time step, the agent observes the current state S_t and selects action A_t. The environment responds with a new state S_{t+1} and scalar reward R_{t+1}. The agent's goal is to select actions that maximize total reward accumulated over time.
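This loop can be sketched in a few lines of code. The corridor environment, its rewards, and the fixed policy below are inventions for illustration, not part of any standard library:

```python
class GridCorridor:
    """Toy environment: a 5-cell corridor. The agent starts at cell 0
    and must reach cell 4. Reward is -1 per step and +10 on reaching
    the goal (made-up numbers for this sketch)."""
    def __init__(self):
        self.n = 5

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is +1 (move right) or -1 (move left)
        self.pos = max(0, min(self.n - 1, self.pos + action))
        done = (self.pos == self.n - 1)
        reward = 10.0 if done else -1.0
        return self.pos, reward, done

# The loop itself: observe S_t, choose A_t, receive S_{t+1} and R_{t+1}
env = GridCorridor()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = +1                      # a trivial fixed policy: always move right
    state, reward, done = env.step(action)
    total_reward += reward
# three -1 steps plus the +10 goal step, so total_reward is 7.0
```

Everything that follows in this chapter is machinery for choosing `action` better than this fixed policy does.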

The boundary between agent and environment is a modeling choice, not a physical fact. In robotics, the robot's motors might be part of the agent; the external world is the environment. In a game, the game engine is the environment; the policy is the agent. In RLHF, the language model is the agent, and the human preference signal is part of the environment. The boundary goes wherever it is useful to draw it, but once drawn it fixes the problem: the agent controls actions, and actions are the only lever it has.

Episodic vs continuing tasks. Some tasks naturally terminate — a game ends in win or loss, a robot episode ends when it falls. These are episodic tasks, and the interaction resets to a start state. Other tasks run indefinitely — a trading algorithm, a recommendation system, an HVAC controller. These are continuing tasks. The distinction matters for how return is defined and how algorithms handle the boundary.

The reward signal is all the agent gets. It is the only source of information about whether the chosen actions are good ones. This places enormous design pressure on reward specification: if the reward is misaligned with what we actually want, the agent will optimize for the proxy, often in ways that violate the intent. Reward design is less a technical problem than a statement of values, and getting it wrong is the root of most spectacular RL failures in deployment.

04

Markov decision processes

The agent–environment loop becomes mathematically tractable when formalized as a Markov decision process. The MDP is the central object of reinforcement learning theory.

A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

- $\mathcal{S}$: the set of states the environment can occupy;
- $\mathcal{A}$: the set of actions available to the agent;
- $P(s' \mid s, a)$: the transition function, giving the probability of landing in state $s'$ after taking action $a$ in state $s$;
- $R(s, a)$: the reward function, giving the expected immediate reward for taking action $a$ in state $s$;
- $\gamma \in [0, 1)$: the discount factor, weighting future rewards relative to immediate ones (Section 07).

The transition function $P$ is the environment's physics. The agent does not choose $P$; it is fixed by the world. What the agent controls is the mapping from states to actions — the policy (Section 06). The goal is to find the policy that maximizes expected cumulative discounted reward.

Known vs unknown models. In some settings the agent is given $P$ and $R$ explicitly — this is the planning setting, addressed in chapter 02 via dynamic programming. In most practical RL, $P$ and $R$ are unknown and must be inferred from experience. The distinction between these two regimes — model-based vs model-free — is one of the central axes of the field (Section 11).
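In code, a small finite MDP is just a set of tables. Here is a hypothetical two-state, two-action example with invented numbers, to make the tuple concrete:

```python
import numpy as np

# A two-state, two-action MDP written out as explicit tables.
# All numbers are illustrative; S = {0, 1}, A = {0, 1}.
n_states, n_actions = 2, 2

# P[s, a, s'] = probability of moving to s' after taking a in s
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # action 0 in state 0 usually stays put
P[0, 1] = [0.2, 0.8]   # action 1 in state 0 usually moves to state 1
P[1, 0] = [1.0, 0.0]
P[1, 1] = [0.0, 1.0]

# R[s, a] = expected immediate reward
R = np.array([[0.0, 0.0],
              [1.0, 2.0]])

gamma = 0.9            # discount factor

# Sanity check: each P[s, a] must be a probability distribution over s'
assert np.allclose(P.sum(axis=2), 1.0)
```

In the planning setting these tables are handed to the agent; in model-free RL the agent only ever sees samples drawn from them.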

Real environments often violate the MDP assumptions. The state may be only partially observed — a robot cannot see behind obstacles; a poker player cannot see opponents' cards. These are partially observable MDPs (POMDPs), and they require the agent to maintain a belief state over possible true states. Much practical RL sidesteps this formally by training on observations directly and relying on recurrent networks or frame stacking to recover relevant history, but the underlying POMDP structure explains many failure modes.

05

The Markov property

The Markov property says the future is independent of the past, given the present. It is what makes the MDP framework tractable — and what breaks when environments keep secrets.

A state $S_t$ is said to be Markov if:

$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_0, A_0, S_1, A_1, \ldots, S_t, A_t)$$

That is, knowing the current state $S_t$ gives you all the information in the full history that is relevant to predicting the future. The past is irrelevant once you have the present. This is not a claim about the world; it is a claim about the state representation. A badly chosen state can violate Markov (if it omits information the future depends on), while a well-chosen one restores it.

The classic example: a ball in free flight. If the state is only the current position, it is not Markov — the future trajectory also depends on velocity. Augment the state to include both position and velocity, and Markov is restored. The physics of the ball did not change; the representation did. Designing states that are Markov — or approximately Markov — is one of the key engineering problems in applying RL to real systems.
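The ball example fits in a few lines. This sketch uses a 1-D ball with unit time steps and no gravity (purely illustrative simplifications): two histories arrive at the same position but predict different futures, so position alone is not Markov.

```python
# One step of simple 1-D motion: next position depends on velocity.
def step(pos, vel):
    return pos + vel, vel

# Two different histories that end at the same position...
pos_a, vel_a = step(0.0, +1.0)   # arrived at 1.0 moving right
pos_b, vel_b = step(2.0, -1.0)   # arrived at 1.0 moving left
assert pos_a == pos_b == 1.0

# ...but their futures differ, so position alone cannot predict them:
next_a, _ = step(pos_a, vel_a)   # moves on to 2.0
next_b, _ = step(pos_b, vel_b)   # moves back to 0.0

# Augmenting the state to (position, velocity) restores Markov:
# the pair (pos, vel) fully determines the next (pos, vel).
```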

Partial observability in practice. Most real systems are technically POMDPs: the agent sees observations, not the full state. Standard practice is to treat observations as if they were states, relying on temporal context (stacking frames in Atari, using LSTM hidden states) to recover some of the missing history. This works surprisingly often but fails when the hidden state matters significantly — a recurring source of fragility in deployed RL systems.

06

Policies

A policy is the agent's behavior function — the mapping from states to actions, or to distributions over actions. It is what RL optimizes.

Formally, a policy $\pi$ is a mapping from states to actions. Two flavors:

- Deterministic: $\pi(s) = a$ assigns a single action to each state.
- Stochastic: $\pi(a \mid s)$ assigns a probability to each action in each state, and the agent samples its action from this distribution.

Policies can be represented in many ways. In tabular settings (small, discrete state-action spaces), a policy is literally a table. In deep RL, the policy is a neural network that takes the state as input and outputs action probabilities or, for continuous actions, the parameters of a distribution (typically a Gaussian mean and variance). This representation — the policy network — is the primary object in policy gradient and actor-critic methods.
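A minimal sketch of a stochastic policy, parameterized by per-action preferences pushed through a softmax (the states, preferences, and numbers are invented; a policy network does the same thing with a neural net producing the preferences):

```python
import math
import random

# One preference number per action, per state (illustrative values).
prefs = {"s0": [2.0, 0.0], "s1": [0.0, 0.0]}

def pi(state):
    """Return action probabilities for a state via softmax over preferences."""
    z = [math.exp(p) for p in prefs[state]]
    total = sum(z)
    return [p / total for p in z]

def sample_action(state):
    """Sample an action from the policy's distribution for this state."""
    probs = pi(state)
    return random.choices(range(len(probs)), weights=probs)[0]

# pi("s1") is uniform; pi("s0") strongly prefers action 0
```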

A key distinction is between stationary and non-stationary policies. A stationary policy chooses actions based only on the current state, not on time. For MDPs with a fixed finite horizon, optimal policies may be time-dependent; for infinite-horizon discounted problems, there is always an optimal policy that is stationary — an important simplification that the discount factor buys us.

What RL actually optimizes. The goal of RL is to find a policy $\pi^*$ that maximizes the expected return from the initial state distribution. Every algorithm in the field is, at its core, a different approach to this same optimization problem — differing in what it estimates (value function, policy gradient, model), how it uses data, and what approximations it makes.

07

Return and discounting

The agent's goal is not to maximize immediate reward but cumulative reward over time. How future rewards are weighted relative to immediate ones — the discount factor — is one of the most consequential hyperparameters in any RL system.

The return $G_t$ is the total reward accumulated from time $t$ onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma \in [0, 1)$ controls the time horizon of the agent's concern. At $\gamma = 0$, the agent is purely myopic, caring only about the next immediate reward. As $\gamma \to 1$, the agent gives nearly equal weight to rewards in the far future. The discount factor serves three purposes: it ensures the infinite sum converges (for bounded rewards), it introduces a preference for earlier rewards over later ones, and it captures the practical reality that distant consequences are less certain.

Note the recursive structure: $G_t = R_{t+1} + \gamma G_{t+1}$. Today's return equals the immediate reward plus the discounted return from the next state. This recursive identity is the seed from which the Bellman equations grow.
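Both the definition and the recursive identity can be checked numerically. The reward sequence below is invented for illustration; the backward sweep is the standard way to compute every $G_t$ in one pass:

```python
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]   # R_1, R_2, R_3, R_4 (illustrative numbers)

def returns(rewards, gamma):
    """Compute G_t for every t by sweeping backward: G_t = R_{t+1} + gamma * G_{t+1}."""
    G = [0.0] * (len(rewards) + 1)   # extra slot: return after the end is 0
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G[:-1]

G = returns(rewards, gamma)
# G_0 = 1 + 0.9*0 + 0.81*0 + 0.729*5 = 4.645
assert abs(G[0] - (rewards[0] + gamma * G[1])) < 1e-12   # the recursive identity
```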

Choosing $\gamma$. In practice, $\gamma$ is treated as a hyperparameter and tuned empirically. Values of 0.99 or 0.999 are common in continuous control; 0.95–0.99 in Atari-style games. A low $\gamma$ makes optimization easier (shorter effective horizon) but produces myopic behavior. A high $\gamma$ produces farsighted agents but increases variance and slows learning. The "effective horizon" of an agent with discount $\gamma$ is approximately $1 / (1 - \gamma)$ steps.

For episodic tasks, the sum is finite — it runs to the terminal time step $T$. For continuing tasks, the discount factor is necessary to ensure convergence. Some formulations of continuing tasks use the average reward criterion rather than discounting, which avoids the $\gamma$ hyperparameter but complicates the math; average reward methods appear most often in queuing and scheduling applications.

08

Value functions

A value function assigns a number to each state — or state-action pair — measuring how much cumulative reward the agent can expect from that point onward under a given policy. Value functions are how RL converts the long-run objective into something computable at each step.

The state-value function under policy $\pi$ is:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\Bigg|\; S_t = s\right]$$

This is the expected return when starting in state $s$ and following policy $\pi$ thereafter. A high $V^\pi(s)$ means starting in $s$ leads to a lot of cumulative reward under $\pi$; a low value means it does not.

The action-value function (or Q-function) under policy $\pi$ is:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

This is the expected return when starting in state $s$, taking action $a$ first, and thereafter following $\pi$. The Q-function is more informative than the V-function: knowing $Q^\pi(s, \cdot)$ for all actions tells you exactly which action is best in state $s$, without needing to know the transition dynamics. This is why Q-functions are the centerpiece of value-based deep RL (DQN and its descendants).

The two are related by:

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$

— the value of a state is the expectation of the action-value over the actions selected by the policy. And conversely:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')$$

These interrelations recur constantly throughout RL theory and algorithm design.
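The first interrelation is a one-line computation. The Q-values and policy probabilities below are made-up numbers, not derived from any particular MDP:

```python
import numpy as np

# Checking V^pi(s) = sum_a pi(a|s) Q^pi(s, a) on illustrative numbers.
Q = np.array([[1.0, 3.0],    # Q(s0, a0), Q(s0, a1)
              [0.0, 2.0]])   # Q(s1, a0), Q(s1, a1)
pi = np.array([[0.5, 0.5],   # pi(. | s0)
               [0.9, 0.1]])  # pi(. | s1)

V = (pi * Q).sum(axis=1)     # expectation of Q over the policy's action choice
# V = [2.0, 0.2]
```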

09

The Bellman equations

The Bellman equations express the value of a state in terms of the value of its successors. This recursive consistency condition is the mathematical engine behind every value-based RL algorithm.

Substituting the recursive definition of return ($G_t = R_{t+1} + \gamma G_{t+1}$) into the definition of $V^\pi$, and taking expectations:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma\, V^\pi(s')\right]$$

This is the Bellman expectation equation for $V^\pi$. It says: the value of state $s$ under policy $\pi$ equals the expected immediate reward plus the discounted expected value of the next state — where the expectation is over both the policy's choice of action and the environment's stochastic transition. An analogous equation holds for $Q^\pi$:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')$$

These equations are simultaneously a definition and a constraint: any valid value function must satisfy them. This means they can be used to check whether a given value function is correct (compute both sides; they should match) and to improve an estimate (treat the right-hand side as a target and update toward it). The second use is bootstrapping — updating value estimates using other value estimates — and it is the defining feature of temporal difference learning.

Bootstrapping as the core idea. Supervised learning updates a prediction using a fixed external target (the label). Temporal difference methods update a prediction using another prediction — the estimated value of the next state. This sounds circular but converges because the next-state estimate is updated too, and under standard conditions the system finds the fixed point. The Bellman equation is that fixed point.
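One bootstrapped update, in miniature, with invented numbers (the full temporal-difference algorithms appear in later chapters):

```python
gamma, alpha = 0.9, 0.5          # discount factor and step size (illustrative)
V = {"s": 0.0, "s_next": 1.0}    # current value estimates

# Suppose we observed one transition: s --(reward 2.0)--> s_next
reward = 2.0
target = reward + gamma * V["s_next"]   # the target contains an *estimate*
V["s"] += alpha * (target - V["s"])     # move partway toward the target
# target = 2.0 + 0.9 * 1.0 = 2.9, so V["s"] moves from 0.0 to 1.45
```

The target is not a fixed label: as the estimate of `s_next` improves, the target for `s` shifts, and under standard conditions the whole system settles at the Bellman fixed point.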

The Bellman equations can be written compactly in matrix form for finite state spaces: $\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma \mathbf{P}^\pi \mathbf{v}^\pi$, which has the closed-form solution $\mathbf{v}^\pi = (I - \gamma \mathbf{P}^\pi)^{-1} \mathbf{r}^\pi$. This is policy evaluation in exact form. For large or continuous state spaces, exact solution is intractable; approximation (via neural networks or other function approximators) is necessary — which is the entry point into deep RL.
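The closed-form solution can be checked on a toy two-state example (transition and reward numbers invented for illustration):

```python
import numpy as np

# Exact policy evaluation: solve v = r + gamma * P v in closed form.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],    # P^pi[s, s']: state transitions under pi
                 [0.3, 0.7]])
r_pi = np.array([0.0, 1.0])     # expected one-step reward under pi

# v = (I - gamma P^pi)^{-1} r^pi, computed via a linear solve
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The solution satisfies the Bellman expectation equation exactly:
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```

Note that `np.linalg.solve` is preferred over forming the inverse explicitly; the two are mathematically equivalent here.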

10

Optimal value functions

The optimal value functions are the best any policy can achieve. Finding them — or approximating them — is the goal of every value-based RL algorithm.

The optimal state-value function $V^*$ is the value function of the best possible policy:

$$V^*(s) = \max_\pi V^\pi(s) \quad \text{for all } s \in S$$

Similarly, the optimal action-value function:

$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \quad \text{for all } s, a$$

These satisfy the Bellman optimality equations:

$$V^*(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')\right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$

The critical difference from the expectation equations is the replacement of $\sum_a \pi(a \mid s)$ with $\max_a$: rather than averaging over the policy's choices, we take the best possible action. These equations have a unique solution for $\gamma < 1$ (the Bellman optimality operator is a contraction, so the Banach fixed-point theorem applies), meaning the optimal value function is well-defined and findable — at least in principle.

Given $Q^*$, the optimal policy is simply:

$$\pi^*(s) = \arg\max_a\, Q^*(s, a)$$

This is the greedy policy with respect to $Q^*$. It is optimal without needing to know the transition dynamics — just the Q-values. This is why Q-learning is so powerful: if you can learn $Q^*$, the optimal behavior falls out for free. The entire enterprise of value-based RL is the pursuit of $Q^*$ in environments too large to solve exactly.
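In small finite MDPs, $Q^*$ can be computed by iterating the Bellman optimality backup until it stops changing (this is value iteration, treated properly in chapter 02). A minimal sketch on a hypothetical two-state, two-action MDP:

```python
import numpy as np

# Toy MDP. P[s, a, s'] are transition probabilities, R[s, a] rewards.
# In state 0, action 0 stays put; action 1 moves to state 1 (reward 1).
# State 1 is absorbing with zero reward.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman optimality backup:
    # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    Q = R + gamma * np.einsum("sap,p->sa", P, Q.max(axis=1))

# The greedy policy with respect to Q* -- optimal, no dynamics needed.
pi_star = Q.argmax(axis=1)
```

At the fixed point, $Q^*(0, 1) = 1$ beats $Q^*(0, 0) = 0.9$, so the greedy policy chooses action 1 in state 0.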

11

Exploration vs exploitation

An agent that only exploits its current knowledge never discovers whether something better exists. An agent that only explores never gets to use what it learned. Managing this tension is a fundamental challenge with no universally optimal solution.

The exploration–exploitation dilemma is cleanest in the multi-armed bandit problem — RL stripped to a single-step interaction. You have $k$ slot machines (bandits), each with an unknown reward distribution. At each round, you choose one to pull. Pull the one you think is best (exploit), and you might miss a better option you haven't tried enough. Try everything randomly (explore), and you waste pulls on machines you already know are bad. The question is how to balance these over time to maximize cumulative reward.

Three canonical strategies for exploration:

ε-greedy: with probability ε take a uniformly random action, otherwise take the action with the highest current value estimate. Simple, ubiquitous, and surprisingly hard to beat.

Optimistic initial values: initialize the estimates above any plausible true value, so every arm looks attractive until it has actually been tried. Exploration is front-loaded and fades naturally.

Upper confidence bound (UCB): choose the arm maximizing the value estimate plus an uncertainty bonus that shrinks as the arm is sampled, so rarely tried arms get the benefit of the doubt.
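The simplest of these, ε-greedy, fits in a short function. The sketch below assumes Gaussian reward noise and uses incremental sample-average estimates; all names and parameter choices are illustrative.

```python
import random

def run_epsilon_greedy(true_means, epsilon=0.1, pulls=5000, seed=0):
    """Epsilon-greedy on a k-armed bandit: explore uniformly with
    probability epsilon, otherwise exploit the empirically best arm."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(pulls):
        if rng.random() < epsilon:
            a = rng.randrange(k)                              # explore
        else:
            a = max(range(k), key=lambda i: estimates[i])     # exploit
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        # incremental sample-average update of the pulled arm
        estimates[a] += (reward - estimates[a]) / counts[a]
    return estimates, counts

estimates, counts = run_epsilon_greedy([0.1, 0.5, 0.9])
```

With enough pulls, the best arm (index 2 here) accumulates the vast majority of the pull counts, while exploration keeps the other estimates from going stale.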

In full MDPs (not just bandits), exploration is harder: the effect of an action may not be felt until many steps later, and unvisited states might be reached only through specific action sequences. Methods like count-based exploration, intrinsic motivation (curiosity-driven reward bonuses), and randomized value functions extend the bandit intuitions to sequential settings, with varying success. Sparse reward environments — where the reward signal is near-zero for most of training — are the hardest case and the site of much ongoing research.

Exploration in the real world. In physical systems — robots, power grids, clinical treatment — exploration has costs that are absent in simulation: wear, safety violations, irreversible outcomes. Safe exploration, constrained RL, and sim-to-real transfer are responses to this problem. The assumption that exploration is free is one of the cleanest signs that an RL formulation was designed for simulation.

12

Model-based vs model-free

Does the agent learn a model of the world, or does it learn directly from experience? This architectural choice determines sample efficiency, generalization, and failure modes.

A model-based algorithm learns (or is given) a representation of the environment's dynamics — the transition function $P(s' \mid s, a)$ and reward function $R(s, a)$ — and uses it to plan. With a good model, the agent can simulate trajectories, reason about consequences, and make decisions far more sample-efficiently than if it had to try everything in the real environment. Dynamic programming (chapter 02), Dyna (chapter 05), and Dreamer (chapter 05) are all model-based.

A model-free algorithm learns directly from environment interactions, estimating value functions or policy gradients without building an explicit model of dynamics. Q-learning, SARSA, PPO, and SAC are all model-free. The lack of a model is a limitation — the agent cannot plan ahead — but it is also a source of robustness: a wrong model is often worse than no model, and model errors can compound catastrophically in long-horizon reasoning.

| Dimension | Model-based | Model-free |
| --- | --- | --- |
| Sample efficiency | Higher | Lower |
| Computational cost | Higher (planning) | Lower |
| Sensitivity to model error | High | N/A (no model) |
| Typical use case | Low-data regimes, planning in simulation | High-data regimes, game playing |

In practice the distinction is a spectrum. Dyna-style algorithms maintain a model and use it for additional "imagined" transitions to supplement real experience, getting some of the sample efficiency benefits with less planning overhead. World models (Dreamer, MBPO, chapter 05) take this further: learning a rich latent-space dynamics model and doing most planning inside it. The trend in frontier RL is toward more model use, as collecting real experience in physical systems becomes the binding constraint.
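The Dyna idea — supplement each real transition with imagined transitions replayed from a learned model — can be sketched compactly. The deterministic tabular model and all names below are illustrative, assuming a dict-of-dicts Q-table:

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, alpha=0.1, gamma=0.9,
                planning_steps=10, rng=random):
    """One real Q-learning update, then extra 'imagined' updates
    replayed from a learned deterministic model (Dyna-Q style)."""
    def q_update(s, a, r, s_next):
        best_next = max(Q[s_next].values()) if s_next in Q else 0.0
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

    q_update(s, a, r, s_next)          # learn from real experience
    model[(s, a)] = (r, s_next)        # remember the transition
    for _ in range(planning_steps):    # learn again from imagined replay
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next)

# One real transition yields 1 + planning_steps value updates.
Q = {0: {0: 0.0}, 1: {0: 0.0}}
model = {}
dyna_q_step(Q, model, s=0, a=0, r=1.0, s_next=1)
```

A single environment interaction drives eleven updates here, which is exactly where the sample-efficiency gain of model use comes from.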

13

On-policy vs off-policy

On-policy algorithms learn about the policy they are currently executing. Off-policy algorithms can learn about one policy while executing another — unlocking experience replay, data reuse, and learning from demonstrations.

In on-policy learning, the data used for updating the value function or policy must come from the current policy. SARSA and the REINFORCE algorithm are on-policy: they update using the actions actually taken. If the policy changes, old data becomes stale and must be discarded. This is wasteful but safe — the estimates are always consistent with what the agent is actually doing.

In off-policy learning, updates are performed using data from a different policy — the behavior policy — while estimating the value of the target policy. Q-learning is the canonical example: regardless of what action the agent actually took, the update uses $\max_{a'} Q(s', a')$ — the value of the best action, not the one chosen. This means data collected under any policy can be used to improve any other policy.
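The on-policy / off-policy contrast is visible directly in the update rules. A minimal tabular sketch (Q stored as a list of per-state action-value lists; names illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the best next action,
    regardless of what the behavior policy actually did."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the next action actually taken."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

The two functions differ in a single term — `max(Q[s_next])` versus `Q[s_next][a_next]` — and that one term is what lets Q-learning consume data from any behavior policy.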

Off-policy learning enables two major capabilities. First, experience replay: store all past transitions in a replay buffer and sample from it during updates. The agent can learn from the same experience many times, dramatically improving data efficiency. DQN's replay buffer was central to its success on Atari. Second, learning from offline data: in offline RL (chapter 07), the agent learns entirely from a fixed dataset of previously collected transitions, without any environment interaction at all.

Importance sampling. When the behavior policy differs significantly from the target policy, raw off-policy estimates are skewed toward the behavior policy's action distribution. Importance sampling corrects for this by reweighting transitions by the ratio $\pi_{\text{target}}(a \mid s) / \pi_{\text{behavior}}(a \mid s)$. The corrected estimate is unbiased, but its variance grows sharply when the policies differ greatly. Managing this variance is a central challenge in off-policy actor-critic methods like SAC and TD3.
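The per-step correction is just a ratio of action probabilities. A toy illustration (policies stored as nested dicts mapping state to action probabilities; names and numbers are made up):

```python
def importance_weight(pi_target, pi_behavior, s, a):
    """Per-step importance ratio for correcting off-policy estimates."""
    return pi_target[s][a] / pi_behavior[s][a]

# One state, two actions: target policy nearly greedy on action 0,
# behavior policy uniform.
pi_target = {0: {0: 0.9, 1: 0.1}}
pi_behavior = {0: {0: 0.5, 1: 0.5}}

w0 = importance_weight(pi_target, pi_behavior, 0, 0)  # upweighted
w1 = importance_weight(pi_target, pi_behavior, 0, 1)  # downweighted
```

The variance problem appears when ratios are multiplied along a trajectory: a product of per-step weights like 1.8 grows exponentially with horizon, while a product of weights like 0.2 collapses toward zero.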

The on-policy / off-policy axis cuts across the model-based / model-free axis, giving a rough taxonomy: on-policy model-free (REINFORCE, PPO, A3C), off-policy model-free (Q-learning, DQN, SAC), on-policy model-based (Dyna with fresh rollouts), off-policy model-based (Dreamer, MBPO). Most frontier algorithms are off-policy — the data efficiency benefits are too large to ignore.

Further reading

Textbooks

Lecture series

Foundational papers