Introduction
Reinforcement learning is about learning from consequences. Unlike supervised learning, nobody tells the agent the correct action for every situation. The agent tries actions, receives rewards, and slowly discovers which behavior leads to better long-term outcomes.
Imagine playing a computer game. You observe the screen, make a decision, and the game responds. Sometimes you get points immediately, sometimes nothing happens, and sometimes the real consequence appears much later. The same idea can be described in a more abstract mathematical language. In such a formal language, you are the agent and the game is the environment.
- You observe a state of the environment at a timestamp .
- You execute an action by following your policy function , which chooses actions based on the current state of the environment .
- You get a reward .
- The environment transitions to a new state .
This loop is simple, but the hard part is credit assignment. An action can look useless now and still be important because it leads to a better state later. If you move toward a key in a game, the reward may appear only many steps later when the key opens a door.
So instead of asking which action gives the biggest reward immediately, reinforcement learning asks a broader question: which action puts the agent on a path with the best future? Here, the best future means the one with the largest accumulated reward, not necessarily the largest next reward. To make that question precise, we need one number that summarizes all future rewards from the current time step. This number is called the return .
However, if the game continues forever, this definition of may produce an infinite sum, which imposes certain mathematical complications. In practice, we usually replace with discounted cumulative rewards where represents a discount factor.
Discounting has two purposes.
First, the mathematical reason. When , rewards far in the future become smaller and the infinite sum can stay finite. To see this, assume a constant reward at each timestamp . Then the discounted return becomes a finite geometric series.
Second, the modeling reason. The discount factor limits the effective horizon of an action. A reward steps in the future is worth today, so a smaller makes the agent more short-sighted, while a larger makes it more patient.
At this point we have a way to score one realized future, but in reinforcement learning the future is often not fully predictable. The same state and action can lead to different next states or rewards. Instead of ignoring this randomness, we need a framework that can model it. This is why we use probabilistic theory and move from concrete values to random variables. A realized state becomes a random variable , a realized action becomes , a realized reward becomes , and a realized return becomes .
Randomness appears in two places.
- Environment randomness: if and , we don’t always have to get the same next state or the same reward . In a game, the same move can miss, slip, or trigger a random event.
- Policy randomness: the policy itself can also be stochastic. Instead of always choosing one action, can assign probabilities to actions. This is useful for exploration, for avoiding commitment to a bad action too early, and for describing strategies that intentionally mix actions.
With this probabilistic notation, the policy function will be represented as a probability distribution over actions. The discounted return is also a random variable now:
The learning goal is to find a policy that makes future return as large as possible. Since is random, this means maximizing its expectation: the expected discounted cumulative reward.
This expression conditions on the entire history of the episode. In principle, the value of the current situation could depend on every state and action that happened before. That is difficult to work with, so in reinforcement learning we often assume that the process is a Markov Decision Process (MDP).
The Markov assumption says that the current state already contains all information needed for the next transition. Once we know the current state and action , the next state does not depend on the older history.
This lets us replace the full-history question with a state-based question: if I am in state , what return should I expect from here? That is the object we will formalize as the value function in the next section.
Bellman Equations
In the introduction, we defined the discounted return as a sum of future rewards. To derive Bellman equations, we now use the same return in a recursive form: the return from today is the immediate reward plus the discounted return from the next time step.
This recursive form is the starting point for the Bellman equations. Once we define the value function as the expected return, we will substitute this recursive expression for into that expectation.
Value Function
The goal of reinforcement learning is to find an optimal policy that maximizes the expected discounted cumulative reward for every state . The star in means optimal.
Before we can find the best policy, we need a way to evaluate a policy. For a policy , we define as the expected return when the agent starts in state and follows . This quantity is called the value function.
You already use something like a value function intuitively. Imagine playing chess. There is no immediate reward after every move, but at some point you can look at the board and know that your position is probably lost. You might even resign before checkmate because you can predict the future from the current state. That prediction is the value of the state: how good this position is expected to be if you keep playing from here.
Bellman Expectation Equation
The definition of tells us what the value function means, but it is not yet very useful for computation. It asks for the expected return over the whole future. To make it practical, we substitute the recursive form of into this expectation, separating the immediate reward from the expected value of the continuation.
So the final form is:
This is the Bellman expectation equation. It says that the value of state under policy is the expected immediate reward plus the discounted value of the next state. The expectation averages over both sources of randomness: the action chosen by the policy and the next state and reward produced by the environment.
The subscript in matters. If we change the policy, the action probabilities change, and the value of the same state can change as well. This is why the Bellman expectation equation is used for policy evaluation: it tells us how good each state is for a fixed policy.
Action-Value Function
The value function answers a state question: if the agent is in state and follows policy , what return should we expect? For policy improvement, we often need a more specific question: if the agent is in state , takes action first, and then follows , what return should we expect?
This is the action-value function :
Because already conditions on the first action, we no longer average over actions inside its definition. If we want the value of the state again, we can average the action-values using the policy:
The action-value function is useful because it lets us compare actions in the same state. This comparison is exactly what we need when we move from evaluating a fixed policy to improving it.
Bellman Optimality Equation
The Bellman expectation equation evaluates a fixed policy. But the final goal is not just to evaluate one policy. We want the best policy.
If we know the action-value of each possible action, then the best policy should choose an action with the highest action-value. This replaces the policy average with a maximum over actions. The result is the Bellman optimality equation:
Here is the action-value when the agent takes action first and behaves optimally afterward.
Unlike , the optimal value function does not describe one fixed policy. It describes the best achievable expected return from each state. Once we know , we can recover an optimal policy by choosing an action with the highest in each state.
At this point, we have equations that characterize the value functions we want. The Bellman expectation equation characterizes for a fixed policy, and the Bellman optimality equation characterizes for the best policy. But an equation is not yet an algorithm. To find these value functions in practice, we need a procedure that starts with a rough guess, updates it repeatedly, and eventually converges to the right answer.
This is where contraction mapping enters the story.
Contraction Mapping
Contraction mapping is not specific to reinforcement learning. It is a general idea from metric spaces, where we study points, distances between points, and functions that move points around. The definition and theorem below show one important result: if a function always moves points closer together, then repeatedly applying that function converges to a unique fixed point.
This sounds abstract, but it will become useful because the Bellman equations are recursive. Later, we will treat the right-hand side of a Bellman equation as a function that updates a value function. Contraction mapping gives us the language to explain why applying that update repeatedly can converge, which is exactly what we need for Value Iteration and Policy Iteration.
Definition
A contraction mapping is a function on a metric space . It takes a point from and returns another point in the same space. It is called a contraction if applying always brings points closer together.
More formally, there must be a constant such that for any two points and :
In other words, after applying , the distance between two points is at most times the original distance.
Banach Fixed-Point Theorem
The Banach Fixed-Point Theorem gives us the guarantee that makes contraction mappings useful. If is a contraction mapping, then:
- has a unique fixed point such that .
- For any initial point in the space, the sequence converges to that fixed point .
Algorithmically, this means: if an update rule is a contraction, then repeatedly applying it is guaranteed to converge to one solution.
Applications in RL
Now we can connect this abstract theorem back to reinforcement learning.
The Bellman equations are recursive: the value of the current state is written using the values of possible next states. For example depends on . This means the same unknown function appears on both sides of the equation.
That recursive form is exactly what lets us turn a Bellman equation into an update rule. Instead of already knowing the true value function, we start with some current guess . Then we plug that guess into the right-hand side of the Bellman equation and get an updated guess. This update rule is the operator from the fixed-point theorem.
For a fixed policy , the Bellman expectation operator is:
For the optimal value function, the Bellman optimality operator is:
A key result, which we will use without proving here, is that both Bellman operators are contraction mappings when . So Banach’s theorem tells us that repeated Bellman updates converge to a unique fixed point.
For the Bellman expectation operator , the fixed point is , the value function for policy . For the Bellman optimality operator , the fixed point is , the optimal value function.
This is the bridge from equations to algorithms. The Bellman equations define the fixed points, and contraction mapping explains why repeated updates can find them. In the next section, we will turn these two update rules into Value Iteration and Policy Iteration.
Algorithms
We now have the ingredients for two common algorithms. The Bellman optimality operator gives us a way to update values toward directly. The Bellman expectation operator gives us a way to evaluate a fixed policy . Both ideas can be used to find an optimal policy, but they organize the work differently.
Value Iteration
Value Iteration uses the Bellman optimality operator directly. The algorithm is:
Start with some initial value function , which can be a rough guess.
Repeatedly apply the Bellman optimality update:
Because this update is a contraction when , repeated updates converge to the optimal value function .
Stop when the value function changes only slightly between two iterations, for example when:
Once the values have converged, extract a policy greedily. First compute the optimal action-value:
Then choose an action with the highest action-value in each state:
The important detail is that Value Iteration does not maintain or evaluate a separate policy during the updates. The intermediate function is only a working value estimate. A greedy policy can be extracted from it at any time, but the clean guarantee comes after the values converge to .
Policy Iteration
Policy Iteration takes a different route. Instead of updating values directly toward , it keeps an explicit policy and improves it step by step. Each iteration has two parts:
- Policy evaluation: compute for the current policy .
- Policy improvement: update the policy so it chooses better actions according to the current value estimates.
Policy Evaluation
The evaluation step uses the Bellman expectation update:
After evaluation, we know how good the current policy is. The next question is how to improve it.
Policy Improvement Theorem
First, we need to define what it means for one policy to be better than another. We say that is better than or equal to if it has value at least as high in every state:
This means that from any starting state, gives us an expected discounted cumulative reward that is no worse than .
The Policy Improvement Theorem says that if we act greedily with respect to , the new policy satisfies this condition. In other words, after evaluating , we can improve it by choosing the action that looks best under :
Proof Sketch
Why is this greedy update guaranteed to improve the policy?
Under the old policy , the value of state is an average of the action-values . The greedy policy chooses the action with the largest action-value, so:
Because is greedy, this means:
For readability, shift the time index to start at zero. So the condition from the value-function definition becomes , and we write the realized starting state as inside the rollout. The inequality we will unroll is:
Now expand the right-hand side using the definition of :
So:
The continuation term is still , not , because means: take action first, then follow the old policy . To unroll this expression once more, first write the same greedy bound for the possible next state :
Then expand that term as well:
Now we can replace by this upper bound inside the previous sum. This is not substitution by equality: it is valid here because , so replacing with a larger quantity can only make the right-hand side larger.
In the second line, we use marginalization to combine terms into one sum: the term can also be summed over because .
In the last line, is the joint distribution of the two-step rollout when the first two actions are chosen by . This does not assume ordinary independence between the two transitions. The probability chain rule gives a product of conditional probabilities:
Then the Markov property lets us drop the earlier history from the second factor:
The sum with the joint rollout probability is therefore expectation notation written out. Once we know and the policy being followed, the Markov property gives the rollout distribution, so the two-step bound becomes:
If we repeat the same substitution times, we get:
The old policy appears only in the tail term , which says what happens after the first greedy steps. As , the discounted tail goes to zero when rewards are bounded and , so what remains is the expected return from following the new policy from state :
Therefore:
Algorithm
After the proof, the actual algorithm is much simpler: it is just the thing software engineers like most, a loop. Putting the pieces together, Policy Iteration works as follows:
Start with any policy .
Evaluate the policy by computing .
Improve the policy by acting greedily with respect to :
If the policy no longer changes, stop. Otherwise, set and repeat.
Policy Iteration is therefore an explicit loop between evaluation and improvement. Evaluation asks how good is the current policy. Improvement asks if we can choose better actions using what we just learned.
Generalized Policy Iteration
Value Iteration and Policy Iteration look different, but they share the same structure. Both combine two ideas:
- Policy evaluation: estimate how good the current policy is, usually by estimating or .
- Policy improvement: change the policy so it acts greedily, or more greedily, with respect to the current value estimates.
This shared structure is called Generalized Policy Iteration (GPI).
The important word is generalized. GPI describes the interaction between evaluation and improvement, but it does not prescribe the schedule. It does not say how long we must evaluate a policy, whether we should update all states or only some states, or in what order states should be updated.
Policy Iteration keeps an explicit policy. We evaluate that policy for some amount of time, improve it greedily, and repeat. Classical Policy Iteration evaluates the policy all the way to convergence before improving it. Modified Policy Iteration evaluates it only for a few sweeps.
Value Iteration does not keep an explicit policy during the value updates. It repeatedly applies the recursive optimality update for a long time, and once the values are good enough, it extracts a greedy policy. In that sense, the greedy improvement step is built into the value update, and the explicit policy extraction can happen at the end.
This is why GPI is a useful umbrella concept. The important part is not one specific schedule, but the interaction between estimating values and improving the policy.
Frozen Lake Example
Let’s ground the algorithms in a small grid-world game. Each cell is a state , and from each non-terminal state the agent can choose one of four actions: up, down, left, or right.
The goal is to reach the gift while avoiding holes. Every step costs , and falling into a hole ends the episode. Since each move is penalized equally, the agent maximizes its total reward by reaching the gift as fast as possible - the shortest path is the optimal policy. This is also a clean example of how a classic shortest-path problem can be expressed in RL terms.
Because this is a model-based setting, we know the environment dynamics: for each state-action pair, we know the probability distribution over next states and rewards - . We don’t know exactly where the agent will land, only how likely each outcome is. That is exactly the information needed to compute the Bellman equations we derived earlier, estimate and , and improve the policy.

Optimal Policy
You can see an optimal policy on the image below:

Exercise
Want to test your understanding? I prepared an exercise for this lesson in my Reinforcement Learning Course - there’s a task to implement yourself along with an example solution.
Final thoughts
In this article we covered how a model-based agent can learn an optimal policy using Policy Iteration and Value Iteration - both grounded in the Bellman equations. The key assumption was that we have full access to the environment dynamics , which made it possible to solve for and directly.
In the real world, that assumption rarely holds. In the next article we will move to model-free environments, where the agent has no access to transition probabilities and must instead learn purely from experience. We will look at Q-learning and SARSA - two foundational algorithms that make this possible.