[{"content":"Introduction\nIn the previous article, we replaced the tabular action-value function $q(s, a)$ with a parameterized approximation $q_\\theta(s, a)$. We will use neural networks to represent $q_\\theta(s, a)$.\nDeep Q-Network (DQN) is the version of approximate Q-learning that made it practical at scale. When we combine Q-learning with neural networks, we must deal with the instability created by correlated data, moving targets, changing behavior distributions, and large TD errors. This article presents solutions to those problems.\nBefore we continue, we need to understand why we moved forward with approximate Q-learning rather than approximate SARSA or Expected SARSA.\nWhy not SARSA?\nSARSA is an on-policy algorithm: it learns the value of the policy that is currently acting. Its target uses the actual next action $a^\\prime$:\n$$r + \\gamma q_\\theta(s^\\prime, a^\\prime)$$\nThe important detail is $a^\\prime$. For SARSA to be truly on-policy, this action should come from the current policy. If we keep a history of old transitions and train on one of them later, the policy may have already changed, so the stored $a^\\prime$ is stale.\nQ-learning avoids this issue because it is off-policy. The behavior policy can explore and collect transitions, while the update still trains toward the greedy policy:\n$$r + \\gamma \\max_{a^\\prime} q_\\theta(s^\\prime, a^\\prime)$$\nThis fits neural-network training better because the same old transition can be reused many times, and each reuse can produce another gradient update to the weights.\nExpected SARSA can also be off-policy. 
Its target is:\n$$r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime) q_\\theta(s^\\prime, a^\\prime)$$\nExpected SARSA with an $\\epsilon$-greedy policy learns values under the assumption that the agent will keep exploring in the future. In other words, the learned $q_\\theta(s, a)$ estimates the return of a policy that sometimes deliberately takes non-greedy actions.\nThat is not what we want, because we usually want to play the best policy we found, not continue making random exploratory moves. If we are playing for the championship, we care about winning the game, not about gathering more data.\nOne caveat: this does not mean on-policy methods are worse by default. Later, when we discuss A3C, we will see another way to make fresh on-policy data more efficient.\nNow, let's go back to DQN and the problems we outlined in the previous article.\nProblem 1: Correlated Data\nStochastic gradient descent works best when batches give a useful estimate of the training objective. In supervised learning, we sample examples independently from the same training distribution and use the minibatch gradient as an unbiased estimate of the expected gradient:\n$$\\nabla_\\theta J(\\theta) = \\mathbb{E}_X[\\nabla_\\theta L(X, \\theta)] \\approx \\frac{1}{B}\\sum_{i=1}^{B}\\nabla_\\theta L(X_i, \\theta)$$\nConsecutive RL samples are not like independently sampled training examples. During one episode, adjacent states are strongly correlated.\nImagine an agent watching the ball move across the screen. Frame $t$ and frame $t+1$ are almost the same. If we train directly on the latest transition at every step, the network sees a long stream from one small region of experience. The gradient points too strongly toward what just happened and too weakly toward the broader game.\nThere is also a distribution problem. 
In supervised learning, the dataset is usually fixed. In RL, the behavior policy creates the data. If $q_\\theta$ changes, then an $\\epsilon$-greedy policy may choose different actions and reach different states. If we train only on the newest transitions, the training distribution can immediately become dominated by whatever the current policy just visited, instead of staying stable and well mixed.\nExperience Replay\nDQN appends each transition to a replay buffer $\\mathcal{D}$ and trains on random batches of size $B$ sampled from it:\n$$\\{(s_i, a_i, r_i, s_i^\\prime)\\}_{i=1}^{B} \\sim \\mathcal{D}$$\nThis helps in several ways.\nIt breaks short-term correlations. A random batch may contain transitions from different episodes, different game situations, and different moments in training. The samples are not truly independent, but they are much less correlated than consecutive frames.\nIt improves data efficiency. One transition can be used in many gradient updates of our neural network. This matters because environment interaction is often the expensive part.\nIt slows sudden shifts in what the network trains on. This makes learning more stable: a small change in the current policy does not immediately replace the whole training batch with new, similar transitions.\nAnother way to think about the replay buffer is as a sample-based model of the world. A real model would estimate:\n$$p(r, s^\\prime \\mid s, a)$$\nand let us ask what would happen for arbitrary state-action pairs. A replay buffer does not do that. It cannot generate new transitions for states and actions we never tried, but it is a non-parametric collection of one-step samples from the environment.\nWhat are the disadvantages of replay?\nIf a replay buffer is large, then it is memory intensive. For Atari, one uint8 grayscale frame is $84 \\times 84 = 7{,}056$ bytes. 
A transition contains $s_t$ and $s_{t+1}$, each a stack of 4 frames, so naive storage needs $8 \\times 7{,}056 = 56{,}448$ bytes per transition, i.e. about 56 GB for 1M transitions. Practical buffers avoid most of this duplication by storing each frame once and reconstructing overlapping states, which is why Atari replay buffers are often closer to about 7 GB.\nUniform random sampling to form a batch is not optimal, as it treats all transitions as equally useful. In reality, some transitions have larger TD errors or contain rare rewards. Prioritized replay improves on this by sampling more informative transitions more often, but vanilla DQN uses uniform sampling.\nReplay trains on transitions collected by older versions of the policy. Q-learning can use them because it is off-policy, but too much old experience can slow adaptation when the current policy starts reaching different states. So a bigger replay buffer is not always better.\nProblem 2: Moving Targets\nExperience replay helps with correlated data, but it does not solve the moving target problem. In approximate Q-learning, the target is:\n$$y(\\theta) = r + \\gamma \\max_{a^\\prime}q_\\theta(s^\\prime, a^\\prime)$$\nIn the previous article, we established that we use semi-gradients for targets $y(\\theta)$, so during one gradient update the target is treated as fixed. However, this is not enough. Once we update the network parameters $\\theta$, the target $y(\\theta)$ also changes, because the same network is used both to predict $q_\\theta(s, a)$ and to build the bootstrap target $y(\\theta)$.\nWhy is this a problem? Is it because the target is wrong? No. Bootstrap targets are usually wrong. That is expected. For example, consider the scalar recursive equation:\n$$q = 1 + 0.9q$$\nThe true solution is:\n$$q^* = 10$$\nSuppose we start from $q_0 = 0$. 
It does not matter that $q_0$ is wrong. If we keep applying the update:\n$$q_{k+1} = 1 + 0.9q_k$$\nwe get:\n$$q_1 = 1,\\quad q_2 = 1.9,\\quad q_3 = 2.71,\\dots$$\nEvery intermediate value is wrong, but the sequence still moves toward the fixed point. That is the core idea behind the Bellman optimality equation, contraction mappings, and the Banach fixed-point theorem. The target does not need to be correct at every step, but it needs to be a useful next approximation.\nThe problem in DQN is different. We are not applying the exact recursive equation to the whole action-value function; we are learning from finite, noisy samples using gradient descent. If we just use a noisy sample to update $q_k$, then training might behave in a very unpredictable way. Ideally, we would like to do something closer to this:\nKeep an old action-value function $q_0$ fixed.\nUse many noisy samples and many gradient updates to learn a better approximation $q_1$.\nOnce $q_1$ has been fitted reasonably well, replace $q_0$ with $q_1$.\nRepeat the process.\nIn other words, we do not want the target to change every time the online network takes a single gradient step. We want to hold the previous approximation fixed for a while, train the new approximation against it, and only then move to the next step of the recursive process. This is closer to how the recursive Bellman optimality update works.\nTarget Network\nTo apply this idea in practice, DQN introduces a second neural network called the target network, with parameters $\\theta^-$. DQN separates the network being trained from the network used to build the bootstrap target.\nThere are two networks:\nThe online network $q_\\theta$, updated by gradient descent.\nThe target network $q_{\\theta^-}$, used to compute bootstrap targets. 
The target becomes:\n$$y = \\begin{cases} r, \u0026 \\text{if } s^\\prime \\text{ is terminal} \\\\ r + \\gamma \\max_{a^\\prime}q_{\\theta^-}(s^\\prime, a^\\prime), \u0026 \\text{otherwise} \\end{cases}$$\nThe online network is still trained with batches of size $B$ sampled from a replay buffer:\n$$L(\\theta) = \\frac{1}{B}\\sum_{i=1}^{B} \\frac{1}{2}\\left(y_i - q_\\theta(s_i, a_i)\\right)^2$$\nDuring these gradient updates, the target network parameters $\\theta^-$ are held fixed. This means the online network is repeatedly trained against targets generated by the same fixed action-value approximation. Then periodically, the online network copies its parameters to the target network:\n$$\\theta^- \\leftarrow \\theta$$\nConceptually, the target network plays the role of the old approximation $q_0$. The online network tries to learn a better approximation $q_1$ using many noisy samples and gradient updates. After some time, $q_1$ becomes the new fixed reference point, and the process continues. This makes training less reactive to noisy samples and closer in spirit to approximate value iteration, although, unlike the tabular setting, there are still no convergence guarantees.\nProblem 3: Unstable Gradients\nThe TD error for one transition is:\n$$\\delta = y - q_\\theta(s, a)$$\nFor squared loss, the gradient has the form:\n$$\\nabla_\\theta L(\\theta) = -\\delta \\nabla_\\theta q_\\theta(s, a)$$\nSo the size of the TD error directly scales the update. Since action-values estimate discounted sums of rewards, larger reward scales lead to larger target values and larger TD errors. 
If rewards have very different magnitudes across environments, then the same learning rate can be too small for one environment and too large for another.\nReward Clipping\nDQN clips rewards to:\n$$r \\in [-1, 1]$$\nWith clipped rewards, the magnitude of the true optimal Q-values is bounded by roughly:\n$$\\frac{1}{1 - \\gamma}$$\nFor $\\gamma = 0.99$, that is $100$. This gives good gradients in a pragmatic sense: the TD errors are kept in a range where one learning rate can work across many games. Without clipping, the scale can be much larger and very different across games.\nReward clipping also has a cost. If one action gives reward $+1$ and another gives reward $+100$, clipping makes both immediate rewards equal to $+1$. The agent can no longer distinguish good rewards from great rewards at that time step. It learns from signs and frequencies more than from exact reward magnitudes. So reward clipping is more of an engineering trick to make the learning process stable.\nVanilla DQN\nNow we can combine all these ideas and implement a basic DQN training algorithm.\nInitialize the online network $q_\\theta$.\nInitialize the target network with the same parameters: $\\theta^- \\leftarrow \\theta$.\nInitialize a replay buffer $\\mathcal{D}$.\nAct with an $\\epsilon$-greedy behavior policy based on $q_\\theta$.\nStore each transition $(S_t, A_t, R_t, S_{t+1})$ in $\\mathcal{D}$.\nSample a batch from $\\mathcal{D}$. 
For each sampled transition, compute:\n$$y_i = \\begin{cases} r_i, \u0026 \\text{if } s_i^\\prime \\text{ is terminal} \\\\ r_i + \\gamma \\max_{a^\\prime} q_{\\theta^-}(s_i^\\prime, a^\\prime), \u0026 \\text{otherwise} \\end{cases}$$\nUpdate $\\theta$ by minimizing the loss function:\n$$L(\\theta) = \\frac{1}{B}\\sum_{i=1}^{B} \\frac{1}{2}\\left(y_i - q_\\theta(s_i, a_i)\\right)^2$$\nPeriodically update the target network with a hard copy of the online network:\n$$\\theta^- \\leftarrow \\theta$$\nThis is vanilla DQN: approximate Q-learning with a replay buffer, a target network, and reward clipping.\nAtari Pong Demo\nPong is a two-player Atari game. Here, vanilla DQN uses a convolutional neural network that sees the state as the last four frames stacked together. After about 10 hours of training on a MacBook, the learned policy controls the paddle on the right and starts exploiting certain strategies that might feel like reward hacking.\nImplementation\nFeel free to take a look at my implementation on GitHub. I recommend implementing it yourself rather than having an agent do it. It is the best way to learn.\nTips \u0026 Tricks\nThere are a few practical issues that can show up when training vanilla DQN on Atari games.\nStore frames as uint8 in the replay buffer to save memory, then convert and normalize them on the GPU:\nstates = states.to(self.device).float().div_(255.0)\nIf learning stalls early, check the gradient norms. In my runs, very small gradients with Adam made the model fail to learn. Using a larger Adam epsilon helped:\nself.optimizer = Adam(self.policy_network.parameters(), lr=config.learning_rate, eps=1e-4)\nI also used LeakyReLU instead of ReLU in the convolutional neural network to avoid dying ReLU neurons: 
nn.LeakyReLU(negative_slope=0.01)\nAtari environments return both terminated and truncated. Only true terminations should stop bootstrapping; time-limit truncations can still use the next-state Q-value:\ntransition = Transition(state=state, action=action, reward=clipped_reward, next_state=next_state, done=terminated)\nPrioritized Replay Buffer\nUniform replay treats every stored transition as equally useful. That is simple and it already helps a lot, but it is not how learning usually feels. Some transitions are boring because the network already predicts them well. Others are surprising: the reward was unexpected, the state was rare, or the bootstrap target disagrees strongly with the current estimate. The TD error gives us a simple way to measure that surprise:\n$$\\delta_i = y_i - q_\\theta(s_i, a_i)$$\nIf $|\\delta_i|$ is large, then transition $i$ currently creates a large learning signal. Prioritized replay uses this idea by sampling those transitions more often than transitions with small TD errors. A common priority is:\n$$p_i = |\\delta_i| + \\epsilon$$\nwhere $\\epsilon \u003e 0$ is a small constant (unrelated to the $\\epsilon$-greedy exploration rate) that keeps every transition sampleable. Without it, a transition with zero TD error would have zero priority and might never be sampled again, even though it could become useful later after the network changes. Then the sampling probability is:\n$$P(i) = \\frac{p_i^\\alpha}{\\sum_j p_j^\\alpha}$$\nThe parameter $\\alpha$ controls how much prioritization we use. If $\\alpha = 0$, then $p_i^\\alpha = 1$ for every transition, so we end up with uniform replay. If $\\alpha$ is larger, high-error transitions are sampled more often. In practice, a common choice is $\\alpha = 0.6$. It is enough to prefer high-error transitions, but not so aggressive that the replay buffer is dominated only by the largest errors.\nSo we changed our sampling distribution. 
How does that affect the gradient? Before, in uniform replay, we sampled uniformly from the $N$ stored transitions:\n$$P_{\\text{Uniform}}(i) = \\frac{1}{N}$$\nWith uniform replay, the expected minibatch gradient estimates the average gradient over the whole buffer:\n$$\\nabla_\\theta J(\\theta) \\approx \\frac{1}{N}\\sum_{i=1}^{N}\\nabla_\\theta L_i(\\theta)$$\nPrioritized replay samples transition $i$ with probability $P(i)$ instead, so before correction the expectation becomes:\n$$\\nabla_\\theta J(\\theta) \\approx \\sum_{i=1}^{N}P(i)\\nabla_\\theta L_i(\\theta)$$\nThat is the bias introduced by prioritized sampling. High-priority transitions are not only seen more often; they also count more in expectation. To correct that bias, multiply each sampled gradient by the ratio between its uniform probability and its prioritized probability:\n$$\\frac{1}{N \\cdot P(i)}$$\nThen the weighted prioritized gradient becomes the uniform replay gradient again:\n$$\\nabla_\\theta J(\\theta) \\approx \\sum_{i=1}^{N} P(i)\\frac{1}{N \\cdot P(i)}\\nabla_\\theta L_i(\\theta) = \\frac{1}{N}\\sum_{i=1}^{N}\\nabla_\\theta L_i(\\theta)$$\nThis may look like it removes the benefit of prioritized replay, but it does not. The bias correction changes the weight of a transition after it has already been sampled. It does not make the sampling process uniform again. High-priority transitions are still selected more often, so the agent spends more updates looking at them.\nImplementation-wise, we still need a way to sample high-priority transitions efficiently. More on that later, but once we have these high-priority samples in the batch, we correct the bias by weighting their losses. 
For the full correction, the weight is:\n$$w_i = \\frac{1}{N \\cdot P(i)}$$\nIn PyTorch, this just means computing a per-sample loss for each sampled transition $i_k$, multiplying it by its weight, and then averaging over the batch:\n$$L(\\theta) = \\frac{1}{B}\\sum_{k=1}^{B} w_{i_k} \\frac{1}{2}\\left(y_{i_k} - q_\\theta(s_{i_k}, a_{i_k})\\right)^2$$\nIn practice, prioritized replay often does not apply the full correction immediately. It introduces a parameter $\\beta$:\n$$w_i = \\left(\\frac{1}{N \\cdot P(i)}\\right)^\\beta$$\nHere $N$ is the number of transitions in the replay buffer. The parameter $\\beta$ controls how strongly we correct the sampling bias. If $\\beta = 0$, there is no bias correction. If $\\beta = 1$, the bias correction is full.\nSometimes we intentionally want stronger gradients from high-priority samples. Early in training, the policy, targets, and priorities are changing quickly, and high-error transitions often contain the useful signal the network is currently missing. If we fully correct the bias immediately, many of those transitions are downweighted after we sample them. That can make prioritization less aggressive exactly when we want it to move learning quickly.\nSo prioritized replay usually starts with weaker correction, for example $\\beta = 0.4$, and anneals $\\beta$ upward toward $1.0$ during training. Later, as the value estimates become more stable, stronger correction reduces the extra influence of high-priority samples and makes the update less biased relative to uniform replay.\nTraining Curves\nHere is one Pong run comparing uniform replay with prioritized replay. 
The prioritized replay agent reaches positive returns earlier, although both runs end up close after enough environment steps.\nSince high-priority transitions appear more often in batches, the sampled batch TD error can stay higher while the policy improves.\nImplementation\nThe replay buffer itself can still be a normal circular buffer: arrays for states, actions, rewards, next states, and done flags. The extra part is one priority per stored transition. The algorithm is straightforward:\nSample a batch from the replay buffer using priorities to form a probability distribution.\nTrain on that batch and compute the TD error for each sampled transition.\nConvert each TD error into a new priority.\nStore that priority next to the transition in the regular replay buffer.\nNew transitions do not have a TD error yet, so they usually get the current maximum priority. That makes it likely that they are sampled at least once.\nThe naive sampling approach is to build a probability distribution over the current buffer from these priorities, then sample from that distribution. This works, but it requires scanning the whole replay buffer, which takes $O(N)$ time.\nA sum tree is a data structure that makes this weighted sampling efficient. It stores priority sums in a binary tree, so we can sample a transition in $O(\\log N)$ instead of scanning the whole buffer. Adding a new transition or updating an existing priority is also $O(\\log N)$. The actual transitions do not live in the tree; they still live in the replay buffer. The tree acts as an index that maps priority mass to replay-buffer indices.\nFor the implementation details, see my prioritized replay buffer code.\nOverestimation Bias\nTarget networks made the bootstrap target change more slowly, so we could apply recursive Bellman optimality equations with more confidence and less noise. 
I also mentioned that target estimates are usually wrong, and that this is fine because it is the nature of Bellman updates.\nStill, if the targets were perfectly correct, the algorithm would converge much faster. Let's first understand what is wrong with our target estimate. For one sampled non-terminal transition, the ideal target for that sample would be:\n$$y^* = r + \\gamma \\max_{a^\\prime}q^*(s^\\prime, a^\\prime)$$\nDQN does not know $q^*$, so it trains on the estimate:\n$$\\hat y = r + \\gamma \\max_{a^\\prime}\\hat q_{\\theta^-}(s^\\prime, a^\\prime)$$\nDuring that gradient update, $\\hat y$ is just a fixed number. The analysis below treats the fixed target-network action-value estimates as noisy estimates of the true action-values.\nFor each next action $a^\\prime$, write the target-network estimate as the true action-value plus an estimation error $\\epsilon_{a^\\prime}$:\n$$\\hat q_{\\theta^-}(s^\\prime, a^\\prime) = q^*(s^\\prime, a^\\prime) + \\epsilon_{a^\\prime}, \\qquad \\mathbb{E}_\\epsilon[\\epsilon_{a^\\prime}] = 0$$\nThen each next action-value estimate is correct on average:\n$$\\mathbb{E}_\\epsilon[\\hat q_{\\theta^-}(s^\\prime, a^\\prime)] = \\mathbb{E}_\\epsilon[q^*(s^\\prime, a^\\prime) + \\epsilon_{a^\\prime}] = q^*(s^\\prime, a^\\prime) + \\mathbb{E}_\\epsilon[\\epsilon_{a^\\prime}] = q^*(s^\\prime, a^\\prime)$$\nBased on that, if we averaged out the noise before taking the max, the max term would match the true optimal value of the sampled next state:\n$$\\max_{a^\\prime} \\mathbb{E}_\\epsilon[\\hat q_{\\theta^-}(s^\\prime, a^\\prime)] = \\max_{a^\\prime}q^*(s^\\prime, a^\\prime)$$\nThat is the reference we would want the $\\max$ term to match. If the max term were equal to this quantity, then the target estimate $\\hat y$ would be an unbiased estimator of $y^*$:\n$$\\mathbb{E}_\\epsilon[\\hat y] = y^*$$\nUnfortunately, the actual DQN target maximizes the noisy estimates before averaging, so in general $\\mathbb{E}_\\epsilon[\\hat y] \\neq y^*$. It means that the target estimate $\\hat y$ is a biased estimator of $y^*$:\n$$\\mathbb{E}_\\epsilon[\\hat y] = \\mathbb{E}_\\epsilon\\left[ r + \\gamma \\max_{a^\\prime}\\hat q_{\\theta^-}(s^\\prime, a^\\prime) \\right] = r + \\gamma \\mathbb{E}_\\epsilon\\left[ \\max_{a^\\prime}\\hat q_{\\theta^-}(s^\\prime, a^\\prime) \\right]$$\n$$y^* = r + \\gamma \\max_{a^\\prime} q^*(s^\\prime, a^\\prime) = r + \\gamma \\max_{a^\\prime} \\mathbb{E}_\\epsilon[\\hat q_{\\theta^-}(s^\\prime, a^\\prime)]$$\n$$\\mathbb{E}_\\epsilon[\\hat y] \\ge y^*$$\nWe do not get the desired unbiased estimator, because the DQN estimate is larger in expectation. That is why we call it overestimation bias. The inequality comes from:\n$$\\mathbb{E}_\\epsilon\\left[ \\max_{a^\\prime}\\hat q_{\\theta^-}(s^\\prime, a^\\prime) \\right] \\geq \\max_{a^\\prime} \\mathbb{E}_\\epsilon[\\hat q_{\\theta^-}(s^\\prime, a^\\prime)]$$\nThis follows from Jensen's inequality, because the max over actions is a convex function of the action-value vector.\nFor a tiny example, assume two actions have true value $0$, and each estimate is independently either $-1$ or $+1$ with equal probability. 
Each row below has probability $\\frac{1}{4}$.\n| $\\hat q_1$ | $\\hat q_2$ | $\\max(\\hat q_1, \\hat q_2)$ |\n| --- | --- | --- |\n| $-1$ | $-1$ | $-1$ |\n| $-1$ | $+1$ | $+1$ |\n| $+1$ | $-1$ | $+1$ |\n| $+1$ | $+1$ | $+1$ |\nIf we average before the max, there is no overestimation:\n$$\\mathbb{E}[\\hat q_1] = 0,\\quad \\mathbb{E}[\\hat q_2] = 0,\\quad \\max_i \\mathbb{E}[\\hat q_i] = 0$$\nIf we take the max before averaging, there is:\n$$\\mathbb{E}[\\max(\\hat q_1, \\hat q_2)] = \\frac{-1 + 1 + 1 + 1}{4} = 0.5$$\nThe takeaway is that DQN can learn values that are too high, not because the rewards or true next-state values are high, but because the bootstrap target repeatedly selects action estimates with positive error terms $\\epsilon_{a^\\prime}$.\nDouble DQN\nIn the overestimation section, the bad term was:\n$$\\mathbb{E}\\left[ \\max_{a^\\prime}\\hat q_{\\theta^-}(s^\\prime, a^\\prime) \\right]$$\nDouble DQN changes this: instead of letting the target network choose the action $a^\\prime$ whose value appears in the target, it lets the online network choose.\nThe online network $q_\\theta$ chooses the next action:\n$$a_\\theta = \\argmax_{a^\\prime} q_\\theta(s^\\prime, a^\\prime)$$\nThe target network $q_{\\theta^-}$ evaluates that selected action:\n$$y^{\\text{Double DQN}} = \\begin{cases} r, \u0026 \\text{if } s^\\prime \\text{ is terminal} \\\\ r + \\gamma q_{\\theta^-}(s^\\prime, a_\\theta), \u0026 \\text{otherwise} \\end{cases}$$\nSo the max is still there, but it is only used to choose an action index:\n$$q_{\\theta^-}\\left( s^\\prime, \\argmax_{a^\\prime} q_\\theta(s^\\prime, a^\\prime) \\right)$$\nAs you can see, the target-network value used in the target is no longer the maximum over the target-network outputs, but the value at the index selected by the online network. As a result, if we condition on the action selected by the online network, the maximization bias disappears:\n$$\\mathbb{E}_{\\epsilon^-}\\left[ \\hat y \\mid a_\\theta \\right] = r + \\gamma \\mathbb{E}_{\\epsilon^-}\\left[ q_{\\theta^-}(s^\\prime, a_\\theta) \\right] = r + \\gamma q^*(s^\\prime, a_\\theta) \\le y^*$$\nwith equality when $a_\\theta$ is an optimal action in $s^\\prime$. You might say that the online network can be noisy too. You are right. In the expectation above we assumed that the decision $a_\\theta$ was already made, and we computed the expectation with respect to $\\epsilon^-$ only.\nLet's go back to the tiny two-action example from the previous section where $q^*(a_1)=q^*(a_2)=0$. Now let $\\epsilon^\\theta_i$ be the online-network error for action $a_i$, and let $\\epsilon^-_i$ be the target-network error.\nThen the online network selects one of the two actions:\n$$a_\\theta = \\argmax_a q_\\theta(s^\\prime, a) = \\argmax(\\epsilon^\\theta_1,\\epsilon^\\theta_2)$$\nThe target network evaluates that selected action:\n$$q_{\\theta^-}(s^\\prime, a_\\theta) = \\epsilon^-_{a_\\theta}$$\nNow let's average the Double DQN value. 
The selected action $a_\\theta$ is determined by the online-network errors $\\epsilon^\\theta_1$ and $\\epsilon^\\theta_2$.\n$$\\begin{aligned} \\mathbb{E}[q_{\\theta^-}(s^\\prime, a_\\theta)] \u0026= \\mathbb{E}[\\epsilon^-_{a_\\theta}] \u0026\u0026 {\\scriptsize\\text{(because } q^*(a_1)=q^*(a_2)=0 \\text{)}} \\\\[0.5em] \u0026= P(a_\\theta=a_1)\\mathbb{E}[\\epsilon^-_1 \\mid a_\\theta=a_1] + P(a_\\theta=a_2)\\mathbb{E}[\\epsilon^-_2 \\mid a_\\theta=a_2] \u0026\u0026 {\\scriptsize\\text{(law of total expectation)}} \\\\[0.5em] \u0026= P(a_\\theta=a_1)\\mathbb{E}[\\epsilon^-_1] + P(a_\\theta=a_2)\\mathbb{E}[\\epsilon^-_2] \u0026\u0026 {\\scriptsize\\text{(selection is independent of target noise)}} \\\\[0.5em] \u0026= P(a_\\theta=a_1)\\cdot 0 + P(a_\\theta=a_2)\\cdot 0 \u0026\u0026 {\\scriptsize\\text{(zero-mean target-network errors)}} \\\\[0.5em] \u0026= 0 \\end{aligned}$$\nThis equation tells us that if the noise $\\epsilon^\\theta$ used to select the action is independent of the noise $\\epsilon^-$ used to evaluate that action, then the target network does not add a positive error on average.\nIn practice the two networks are not fully independent, because $\\theta^-$ is periodically copied from $\\theta$. 
As a result, Double DQN reduces this overestimation bias rather than removing it completely.\nDueling DQN\nA normal DQN directly predicts one value per action:\n$$q_\\theta(s, a)$$\nHowever, in many situations, knowing whether the state is good or bad is easier than knowing the exact best action. Imagine playing Pong when the ball is on the opposite side of the screen. Moving the paddle up, down, or doing nothing for one frame may lead to almost the same long-term outcome, because one slightly bad move does not decide the point. In that sense, all $q_\\theta(s, a)$ in such a state share roughly the same value $v_\\theta(s)$.\nAdmittedly, a neural network shares hidden layers across actions, so the action outputs are not completely independent. However, to estimate $q_\\theta(s, a)$ well for each action, the network has to learn each action's value separately, even in states where the choice of action barely matters. This is a bit wasteful, and it is often better to first estimate $v_\\theta(s)$ and then only use an action-specific term for the difference between actions. Dueling DQN explicitly decomposes action-values by representing them with two terms:\nA state-value term $v_\\theta(s)$, which outputs one number.\nAn advantage term $z_\\theta(s, a)$, which outputs one number per action.\nThe naive decomposition is:\n$$q_\\theta(s, a) = v_\\theta(s) + z_\\theta(s, a)$$\nWe train $v_\\theta(s)$ and $z_\\theta(s, a)$ together in a single network, but they come from separate output heads. 
After obtaining the separate outputs, we combine them to get $q_\\theta(s, a)$.\nThe difference is:\nIn the Dueling DQN, every sampled transition updates the value term $v_\\theta(s)$, because $v_\\theta(s)$ contributes to every action-value built from that state.\nIn a normal DQN update, the TD loss is applied to the output for the sampled action $a$. The shared hidden layers can still change, but the other action outputs do not receive their own direct TD error in that update.\nIn other words, the network does not need to relearn the same state-quality signal separately for each action output. It can spend more capacity on estimating the state value well, and then let the advantage term learn the smaller differences between actions. This matters for temporal-difference learning, because the bootstrap target depends on having accurate next-state values. It also explains why the dueling architecture becomes more useful when the action space is large.\nNow let\u0026rsquo;s look more closely at the naive decomposition of $q_\\theta(s, a)$:\n$$q_\\theta(s, a) = v_\\theta(s) + z_\\theta(s, a)$$\nFor any state-dependent offset $c(s)$:\n$$v_\\theta(s) + z_\\theta(s, a) = \\left(v_\\theta(s) + c(s)\\right) + \\left(z_\\theta(s, a) - c(s)\\right)$$\nThis means that the same $q_\\theta(s, a)$ can be represented in infinitely many ways. For example, the network could add $10$ to the value term and subtract $10$ from every raw score output, and the final $q_\\theta(s, a)$ would not change.\nAt first this may look harmless, because the TD loss only cares about the final $q_\\theta(s, a)$. 
The problem is that the two heads are supposed to give the network a useful division of work: $v_\\theta(s)$ should carry the common state-quality signal, while $z_\\theta(s, a)$ should carry the action-specific differences. Without a constraint, the training loss gives no reason to prefer one offset over another, so the value term and advantage term can drift together by arbitrary state-dependent constants. In that case neither has a stable meaning, and the network can waste capacity coordinating cancelling offsets instead of learning. So the decomposition needs an anchor.\nOne possible anchor is to subtract the largest raw advantage score:\n$$q_\\theta(s, a_i) = v_\\theta(s) + \\left( z_\\theta(s, a_i) - \\max_j z_\\theta(s, a_j) \\right)$$\nThis makes the best raw score equal to zero after normalization. It matches the intuition that if $v_\\theta(s)$ means the value of the best action, then the normalized score can be interpreted as a lost opportunity: every worse action has non-positive advantage.\nFor example, suppose a state has two true optimal action-values:\n$$q^*(s, a_1) = 100,\\qquad q^*(s, a_2) = 1$$\nIf $v^*(s)$ is interpreted as the best action-value, then $v^*(s) = \\max_a q^*(s, a) = 100$. 
The optimal advantage relative to that best action is:\n$$A^*(s, a_1) = 0,\\qquad A^*(s, a_2) = -99$$\nTaking $a_2$ is therefore a $99$-point lost opportunity compared with the best available action.\nIn practice, Dueling DQN usually uses a mean-subtraction anchor instead:\n$$q_\\theta(s, a) = v_\\theta(s) + \\left( z_\\theta(s, a) - \\frac{1}{|\\mathcal{A}|}\\sum_b z_\\theta(s, b) \\right)$$\nNow the normalized scores have average zero:\n$$\\frac{1}{|\\mathcal{A}|}\\sum_a \\left( z_\\theta(s, a) - \\frac{1}{|\\mathcal{A}|}\\sum_b z_\\theta(s, b) \\right) = 0$$\nSo the average action-value in a state is exactly the value term:\n$$\\frac{1}{|\\mathcal{A}|}\\sum_a q_\\theta(s, a) = v_\\theta(s)$$\nThe mean version no longer makes $v_\\theta(s)$ equal to the best action-value, but it makes $v_\\theta(s)$ equal to the average action-value under the network\u0026rsquo;s current outputs. That is also okay because the point of the dueling architecture is not to recover a separately supervised \u0026ldquo;true\u0026rdquo; value term and \u0026ldquo;true\u0026rdquo; advantage term. The point is to remove the arbitrary offset and give the network a stable way to build $q_\\theta(s, a)$.\nIn practice we often use mean subtraction because it gives a smoother, more stable parameterization: the anchor depends on all action scores instead of only the maximal one. 
In code, it is a small change to the network architecture and the forward pass.\ndef forward(self, states: Tensor) -\u0026gt; Tensor:\n    # Shared feature extractor followed by two separate heads.\n    features = self._model(states)\n    value = self._value_head(features)  # one number per state\n    advantage = self._advantage_head(features)  # one number per action\n    # Mean-subtraction anchor removes the arbitrary offset.\n    return value + advantage - advantage.mean(dim=1, keepdim=True)\nFinal Thoughts\nVanilla DQN showed how to learn efficiently from a buffer of stored transitions instead of training only on the latest environment step. Experience replay makes the data more reusable and less correlated, while the target network makes the bootstrap target less reactive.\nPrioritized replay then improves the buffer by sampling more informative transitions more often. Double DQN improves the target by reducing overestimation bias. Dueling DQN improves the network architecture by separating the shared state-quality signal from action-specific differences.\nThere are many more DQN improvements in the literature beyond the ones covered in this article. Rainbow and Beyond The Rainbow are useful next papers because they show how several compatible improvements can be combined into one stronger agent.\nReferences\nQ-learning\nPlaying Atari with Deep Reinforcement Learning\nHuman-level control through deep reinforcement learning\nPrioritized Experience Replay\nDeep Reinforcement Learning with Double Q-learning\nDueling Network Architectures for Deep Reinforcement Learning\nRainbow: Combining Improvements in Deep Reinforcement Learning\nBeyond The Rainbow: High Performance Deep Reinforcement Learning On A Desktop PC","permalink":"https://mateuszpieniak.com/courses/reinforcement-learning/104-deep-q-networks/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eIn the previous article, we replaced the tabular action-value function \u003cspan class=\"katex\"\u003e\u003cspan class=\"katex-mathml\"\u003e\u003cmath 
xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmi\u003eq\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmi\u003ea\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003eq(s, a)\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e\u003cspan class=\"katex-html\" aria-hidden=\"true\"\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.03588em;\"\u003eq\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"mpunct\"\u003e,\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003ea\u003c/span\u003e\u003cspan class=\"mclose\"\u003e)\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e with a parameterized approximation: \u003cspan class=\"katex\"\u003e\u003cspan class=\"katex-mathml\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmsub\u003e\u003cmi\u003eq\u003c/mi\u003e\u003cmi\u003eθ\u003c/mi\u003e\u003c/msub\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmi\u003ea\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003eq_\\theta(s, a)\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e\u003cspan class=\"katex-html\" aria-hidden=\"true\"\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" 
style=\"height:1em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord\"\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.03588em;\"\u003eq\u003c/span\u003e\u003cspan class=\"msupsub\"\u003e\u003cspan class=\"vlist-t vlist-t2\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.3361em;\"\u003e\u003cspan style=\"top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:2.7em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mathnormal mtight\" style=\"margin-right:0.02778em;\"\u003eθ\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-s\"\u003e​\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.15em;\"\u003e\u003cspan\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"mpunct\"\u003e,\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003ea\u003c/span\u003e\u003cspan class=\"mclose\"\u003e)\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e. 
We will use neural networks to approximate \u003cspan class=\"katex\"\u003e\u003cspan class=\"katex-mathml\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmsub\u003e\u003cmi\u003eq\u003c/mi\u003e\u003cmi\u003eθ\u003c/mi\u003e\u003c/msub\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmi\u003ea\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003eq_\\theta(s, a)\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e\u003cspan class=\"katex-html\" aria-hidden=\"true\"\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord\"\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.03588em;\"\u003eq\u003c/span\u003e\u003cspan class=\"msupsub\"\u003e\u003cspan class=\"vlist-t vlist-t2\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.3361em;\"\u003e\u003cspan style=\"top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:2.7em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mathnormal mtight\" style=\"margin-right:0.02778em;\"\u003eθ\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-s\"\u003e​\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.15em;\"\u003e\u003cspan\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"mpunct\"\u003e,\u003c/span\u003e\u003cspan class=\"mspace\" 
style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003ea\u003c/span\u003e\u003cspan class=\"mclose\"\u003e)\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eDeep Q-Network (DQN) is the version of approximate Q-learning that made it practical at a scale. When we combine Q-learning with neural networks, then we must deal with the instability created by correlated data, moving targets, changing behavior distributions, and large TD errors. This article presents solutions to deal with those problems.\u003c/p\u003e","title":"Reinforcement Learning 104: Deep Q-Networks"},{"content":"Introduction\nIn the previous article, we learned model-free control with tabular action-values:\n$$q(s, a)$$\nStrictly speaking, $q$ is already a function from state-action pairs to numbers. In the tabular setting, we represent that function with an explicit table. Every state-action pair has its own cell. If the agent observes a transition from $(s, a)$, it updates that one cell and leaves the others unchanged. That is conceptually clean and enough for small grid worlds, but it does not scale.\nIf the state is an image, then each possible image would need a separate table row. If the state is a continuous vector, like the position and angle of a robot arm, then the table has no natural finite set of keys. Even if we discretize the state space, nearby states are often related and should share information. If a spaceship is moving directly toward us, then slightly different pixel positions should not require learning the same lesson from scratch.\nSo instead of storing one value for every possible state-action pair, we approximate the action-value function with a parameterized function:\n$$q_\\theta(s, a)$$\nHere $\\theta$ is the parameter vector. It may be the weights of a linear model, a neural network, or any other differentiable function approximator. 
The state $s$ and action $a$ are still inputs to the function; in an implementation they have to be represented somehow, for example as feature vectors, tensors, or embeddings. We may write the model input as $x = \\phi(s, a)$, where $\\phi$ is the chosen representation. The goal is to choose $\\theta$ so that $q_\\theta(s, a)$ is close to the action-value function we care about.\nThis changes the problem in an important way. In tabular learning, the estimates are independent. Updating $q(s, a)$ does not directly modify $q(\\tilde{s}, \\tilde{a})$ for another pair $(\\tilde{s}, \\tilde{a})$.\nWith function approximation, the same parameters are shared across many states and actions. One update to $\\theta$ can improve many predictions at once, which is exactly why approximation is useful, but the same update can also damage many predictions at once. Generalization and interference come from the same mechanism.\nThis article is about that replacement:\n$$q(s, a) \\quad \\rightarrow \\quad q_\\theta(s, a)$$\nWe will first turn action-value prediction into a regression problem. Then we will see why reinforcement learning is not ordinary supervised learning, why TD methods use semi-gradients, and how approximate SARSA, Expected SARSA, and Q-learning are written with function approximation.\nSupervised Learning\nThe tabular update had the form:\n$$q(s, a) \\leftarrow q(s, a) + \\alpha(y - q(s, a))$$\nThe target $y$ depended on the algorithm. For example, for SARSA:\n$$y = r + \\gamma q(s^\\prime, a^\\prime)$$\nThe update moves the current estimate toward the target. When $q$ is a table, this is a direct assignment to one table entry. With function approximation, the estimate is produced by a function:\n$$q_\\theta(s, a)$$\nSo the update can no longer say \u0026ldquo;change this table cell\u0026rdquo;. 
Instead, it must say \u0026ldquo;change the parameters so that the function output at $(s, a)$ moves toward the target\u0026rdquo;. This makes the problem look like supervised learning. Abstractly, the input is the state-action pair:\n$$x = (s, a)$$\nFor an actual function approximator, this usually means feeding in the chosen representation $\\phi(s, a)$. The model prediction is:\n$$\\hat{y} = q_\\theta(s, a)$$\nAnd the target is some number $y$ that we want the prediction to match. The analogy is useful, but incomplete. In ordinary supervised learning, we usually start with a fixed dataset:\n$$\\{(x_i, y_i)\\}_{i=1}^{N}$$\nThe inputs are known, the labels are known, and the optimization problem is fixed before training begins. In reinforcement learning, none of this is so clean.\nThe input $(s, a)$ appears only when the agent or another behavior policy visits it. The data distribution depends on the policy used to collect experience. The target $y$ is not given by a teacher. It is estimated from returns in Monte Carlo learning, or bootstrapped from the current action-value estimate in TD learning. So approximate RL borrows tools from supervised learning, but the learning problem keeps moving while we train.\nLoss Function\nFor a sampled pair $(S_t, A_t)$ with target $y_t$, the simplest objective is squared error:\n$$L_t(\\theta) = \\frac{1}{2}\\left(y_t - q_\\theta(S_t, A_t)\\right)^2$$\nHuber loss is also common in deep RL because occasional large TD errors can produce large gradients. The important question is where $y_t$ comes from. Monte Carlo targets use sampled returns. 
TD targets bootstrap from another action-value estimate, which makes the target available earlier, but also makes it depend on the current approximation.\nAfter deciding on the target, we still need to optimize this loss while both the target and the data distribution come from interaction with the environment.\nMonte Carlo\nThe Monte Carlo target is the sampled return:\n$$y_t^{\\text{MC}} = G_t$$\nThe Monte Carlo target has a nice property: once the episode is finished, $G_t$ is just a number. It does not depend on $\\theta$. From the point of view of gradient descent, it behaves like an ordinary supervised label. This means Monte Carlo does not have bootstrap bias: errors in the current estimate $q_\\theta$ do not enter the target.\nThe downside is that we must wait until the return is known. For long episodes this delays learning. For continuing tasks, there may be no natural episode end. Monte Carlo targets can also have high variance because $G_t$ is a sum over the rest of the trajectory. Each of $R_t$, $R_{t+1}$, $R_{t+2}$, and so on is a random variable, and the more random variables we sum together, the higher the variance typically becomes.\nTemporal Difference\nThe one-step SARSA target is:\n$$y_t^{\\text{SARSA}}(\\theta) = R_t + \\gamma q_\\theta(S_{t+1}, A_{t+1})$$\nCompared with Monte Carlo, this target usually has lower variance because it does not sum over the whole remaining trajectory. The cost is that it uses our current estimate $q_\\theta$. If $q_\\theta(S_{t+1}, A_{t+1})$ is wrong, then the target is wrong too.\nYou might wonder now: why would a model learn from a target that can be wrong? Recall contraction mappings and the Banach fixed-point theorem. 
The whole idea is that recursive Bellman-style equations can start from a wrong estimate, and repeatedly applying them can improve that estimate.\nSo a wrong target is not automatically the problem. The problem to keep in mind is that the target can shift systematically every time we update $\\theta$, which can make learning unstable. We will address that problem in the next article by introducing a target network.\nAnother thing to keep in mind is that the TD target has bootstrap bias. In approximate TD, we do not know the true value $q^\\pi$, so we rely on the estimate produced by our approximator, for example a neural network $q_\\theta$. In statistics, bias is the difference between the expected value of an estimator and the true value.\nIn SARSA, once the next state $S_{t+1}$ has been observed, the target depends on two random variables: the reward $R_t$ and the sampled next action $A_{t+1}$. Expected SARSA has the same bootstrap bias as SARSA, but lower variance in the target because it removes the next-action random variable $A_{t+1}$ and replaces it with a policy expectation:\n$$y_t^{\\text{Expected SARSA}}(\\theta) = R_t + \\gamma \\sum_{a^\\prime}\\pi(a^\\prime \\mid S_{t+1})q_\\theta(S_{t+1}, a^\\prime)$$\nQ-learning uses the Bellman optimality target:\n$$y_t^{\\text{Q-learning}}(\\theta) = R_t + \\gamma \\max_{a^\\prime}q_\\theta(S_{t+1}, a^\\prime)$$\nSimilarly to Expected SARSA, Q-learning does not sample the next action for the target. However, Q-learning can introduce overestimation bias: when action-value estimates are noisy, the max tends to select actions whose estimates are too high. 
We will describe that in more detail in the next article.\nTo sum up, Monte Carlo uses completed returns, but it has to wait and can have high variance. TD uses imperfect bootstrap targets, but it can update after one transition and often has lower variance. A useful empirical reminder is the paper TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning, which found that finite-horizon Monte Carlo targets can be competitive in deep RL settings. The practical choice is a bias-variance and engineering tradeoff, not a universal rule.\nTraining Distribution\nNow assume we can produce a target $y_t$ from Monte Carlo or TD. The next question is what objective this sampled loss represents.\nIn supervised learning, the loss is usually an average over the training dataset. In approximate RL, it is better to think of the loss as an expectation over the state-action pairs we train on. For a discrete state-action space, an ideal weighted squared error would look like:\n$$J(\\theta) = \\frac{1}{2}\\sum_{s,a}\\mu(s, a) \\left(y(s, a) - q_\\theta(s, a)\\right)^2$$\nHere $\\mu(s, a)$ is the distribution that says how much we care about each state-action pair. For continuous variables, the corresponding sums become integrals, but the idea is the same.\nThis weighting matters because with function approximation we usually cannot be perfectly accurate everywhere. The model has limited capacity, the state-action space is large, and some regions are visited much more often than others. The distribution $\\mu$ tells the optimizer where errors are expensive.\nIn on-policy learning, samples come from the same policy we are evaluating or improving. A rough way to think about the update distribution is:\n$$\\mu_\\pi(s, a) = d_\\pi(s)\\pi(a \\mid s)$$\nwhere $d_\\pi(s)$ is the state visitation distribution under policy $\\pi$. 
We multiply by $\\pi(a \\mid s)$ because $\\mu_\\pi$ is a distribution over state-action pairs. The first term says how often the policy reaches state $s$; the second says how often it chooses action $a$ once it is there. This means the loss emphasizes states the policy actually visits and actions it actually chooses.\nIn off-policy learning, the samples come from a behavior policy $b$:\n$$\\mu_b(s, a) = d_b(s)b(a \\mid s)$$\nThe target may still describe a different policy, but the updates happen where the behavior policy provides data.\nA common misconception is to read $\\mu(s, a)$ as how important the state is in some human sense. It is only the training distribution. If the agent rarely visits a dangerous state, then ordinary SGD rarely updates that state. If we want to learn more accurately there, we need more samples, different exploration, replay, prioritization, or explicit weighting.\nIn practice, we cannot compute the loss $J(\\theta)$ directly:\nThe state-action space may be huge or continuous.\nThe target value for every pair $(s, a)$ is not known in advance.\nThe visitation distribution $\\mu$ is observed through rollouts rather than available as a full table.\nThe useful part is that we do not need to compute $J(\\theta)$ exactly to improve $\\theta$. Rollouts give us sampled state-action pairs from this distribution, so we can use stochastic gradient descent: estimate an update direction from sampled experience instead of evaluating the full expectation.\nStochastic Gradient Descent\nMonte Carlo\nFirst consider the easy case where the target $y_t$ is fixed with respect to $\\theta$. 
This is true for a completed Monte Carlo return $G_t$.\nThe sample loss is:\n$$L_t(\\theta) = \\frac{1}{2}\\left(y_t - q_\\theta(S_t, A_t)\\right)^2$$\nThe gradient is:\n$$\\nabla_\\theta L_t(\\theta) = -\\left(y_t - q_\\theta(S_t, A_t)\\right) \\nabla_\\theta q_\\theta(S_t, A_t)$$\nGradient descent subtracts this gradient, scaled by the step size $\\alpha$:\n$$\\theta \\leftarrow \\theta + \\alpha \\left(y_t - q_\\theta(S_t, A_t)\\right) \\nabla_\\theta q_\\theta(S_t, A_t)$$\nThis is the basic approximate action-value update. It says: change the parameters in the direction that would increase the prediction if the target is larger than the prediction, and decrease the prediction if the target is smaller. For a linear function approximator, this becomes especially simple. Suppose:\n$$q_\\theta(s, a) = \\theta^\\top \\phi(s, a)$$\nwhere $\\phi(s, a)$ is a feature vector. Then:\n$$\\nabla_\\theta q_\\theta(s, a) = \\phi(s, a)$$\nand the update is:\n$$\\theta \\leftarrow \\theta + \\alpha \\left(y_t - q_\\theta(S_t, A_t)\\right) \\phi(S_t, A_t)$$\nThis looks very close to a tabular update. In fact, a table can be viewed as a special case of linear approximation where every state-action pair has its own one-hot feature. The difference is that in a general feature representation, multiple state-action pairs share features. 
Updating $\\theta$ for one pair also changes predictions for other pairs with overlapping features.\nSemi-gradient Methods\nTD targets are different from Monte Carlo targets because they can depend on $\\theta$.\nFor SARSA, the target is:\n$$y_t(\\theta) = R_t + \\gamma q_\\theta(S_{t+1}, A_{t+1})$$\nThe loss is:\n$$L_t(\\theta) = \\frac{1}{2}\\left(y_t(\\theta) - q_\\theta(S_t, A_t)\\right)^2$$\nIf we take the full gradient, the target also has a derivative:\n$$\\nabla_\\theta L_t(\\theta) = \\left(y_t(\\theta) - q_\\theta(S_t, A_t)\\right) \\left( \\nabla_\\theta y_t(\\theta) - \\nabla_\\theta q_\\theta(S_t, A_t) \\right)$$\nFor the SARSA target:\n$$\\nabla_\\theta y_t(\\theta) = \\gamma \\nabla_\\theta q_\\theta(S_{t+1}, A_{t+1})$$\nSo the full gradient would update parameters through both the current prediction and the next-state prediction inside the target.\nTD methods usually do not do that. They use a semi-gradient. A semi-gradient treats the target as fixed for the purpose of the current update, even when the target was computed from the current action-value function. The semi-gradient update ignores $\\nabla_\\theta y_t(\\theta)$:\n$$\\theta \\leftarrow \\theta + \\alpha \\left(y_t(\\theta) - q_\\theta(S_t, A_t)\\right) \\nabla_\\theta q_\\theta(S_t, A_t)$$\nIn PyTorch, this means stopping gradients through the target while keeping gradients through the current prediction. A common pattern is:\nimport torch\n# The bootstrap target is computed without tracking gradients.\nwith torch.no_grad():\n    target = reward + gamma * neural_network(next_state, next_action)\n# Gradients flow only through the current prediction.\nprediction = neural_network(state, action)\nloss = 0.5 * (target - prediction).pow(2).mean()\nThe intuition is practical. 
The bootstrap target is acting as a temporary label. We want to move the current action-value estimate toward the estimate of the next step. If we allowed the update to also change the next-step estimate, the model could reduce the loss by moving the target instead of improving the current prediction.\nFor Q-learning with function approximation, the same issue appears. The target:\n$$y_t(\\theta) = R_t + \\gamma \\max_{a^\\prime}q_\\theta(S_{t+1}, a^\\prime)$$\nalso depends on $\\theta$. Approximate Q-learning normally uses the semi-gradient update:\n$$\\theta \\leftarrow \\theta + \\alpha \\left(R_t + \\gamma \\max_{a^\\prime}q_\\theta(S_{t+1}, a^\\prime) - q_\\theta(S_t, A_t) \\right) \\nabla_\\theta q_\\theta(S_t, A_t)$$\nThe fact that a behavior policy collected the transition does not make the target fixed. The sampled transition $(S_t, A_t, R_t, S_{t+1})$ is fixed after it is observed, but the bootstrap value $\\max_{a^\\prime}q_\\theta(S_{t+1}, a^\\prime)$ still depends on the parameters. One way to make the target fixed with respect to the current update is to use a separate target network:\n$$y_t = R_t + \\gamma \\max_{a^\\prime}q_{\\theta^-}(S_{t+1}, a^\\prime)$$\nwhere $\\theta^-$ is held constant while updating $\\theta$. This idea becomes central in Deep Q-Networks, and we will return to it in the next article.\nAlgorithms\nThe approximate algorithms differ mainly in how they build the target. The training loop has the same shape:\nCollect transitions $(S, A, R, S^\\prime)$ and optionally $A^\\prime$ by acting in the environment.\nForm a minibatch $\\mathcal{B}$ from the collected transitions. 
This may be a short segment of recent experience or the most recent rollout.\nBuild one scalar target $y_i$ for each transition in the minibatch.\nTake one gradient or semi-gradient step using the average minibatch loss.\nHere $i$ indexes transitions in the minibatch. The sampled loss is:\n$$L_{\\mathcal{B}}(\\theta) = \\frac{1}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\frac{1}{2}\\left(y_i - q_\\theta(S_i, A_i)\\right)^2$$\nFor Monte Carlo, wait until the return is known and set:\n$$y_i = G_i$$\nFor one-step TD methods, use $y_i = R_i$ if $S_i^\\prime$ is terminal. Otherwise:\n$$y_i(\\theta) = \\begin{cases} R_i + \\gamma q_\\theta(S_i^\\prime, A_i^\\prime), \u0026amp; \\text{SARSA} \\\\ R_i + \\gamma \\sum_{a^\\prime}\\pi(a^\\prime \\mid S_i^\\prime)q_\\theta(S_i^\\prime, a^\\prime), \u0026amp; \\text{Expected SARSA} \\\\ R_i + \\gamma \\max_{a^\\prime}q_\\theta(S_i^\\prime, a^\\prime), \u0026amp; \\text{Q-learning} \\end{cases}$$\nFor TD methods, treat $y_i(\\theta)$ as fixed during the current update. In code, this means detaching the target before computing the minibatch loss.\nWhy Approximate RL Is Hard\nAt this point, approximate RL may look like a small modification of supervised learning:\nCollect experience.\nBuild targets.\nTake gradient steps.\nThe difficulty is that each of those steps hides a problem.\nGeneralization and Interference\nFunction approximation generalizes; that is the point. If two state-action pairs share features, then learning from one pair can improve predictions for the other. 
This is what lets us handle large state-action spaces without experiencing every possible state-action combination separately.\nThe same shared parameters also create interference. An update from one transition can change predictions for many other state-action pairs, including ones we did not intend to change. In tabular learning, a bad update is local to one cell; in approximate learning, it can spread through the function approximator. With neural networks this is especially visible, because a single gradient step can change the representation used by many inputs. Sometimes that improves generalization, and sometimes it damages behavior that was previously working.\nCorrelated Data The clean SGD picture assumes that minibatches give a reasonably representative estimate of the gradient of the training objective. This is easiest when samples are approximately independent and identically distributed (IID), or at least well mixed. The samples do not have to be perfectly independent, but strong correlation can make optimization worse.\nRL data is naturally correlated, so it violates this IID picture. During a rollout, consecutive states are close in time and often close in content. If the agent is driving down a road, many adjacent frames look almost the same. If we train directly on this stream, the model may overfit to the most recent part of experience and move away from older knowledge.\nCorrelated updates also make the gradient estimates less representative of the broader training distribution. Instead of getting a useful average direction, the optimizer may chase whatever small region the agent recently visited.\nThis is one reason experience replay is useful. A replay buffer lets the agent train on a more mixed batch of past transitions instead of only the latest transition. The catch is that replayed transitions were collected by older versions of the policy, so replay fits off-policy algorithms and does not directly apply to on-policy ones.\nMoving Targets We already saw this in the semi-gradient section. 
During one gradient step, we detach the TD target and treat it like a temporary label.\nBut the label is fixed only for that step. After we update $\\theta$, the action-value estimates change. The next TD target is then built from the new $q_\\theta$, so the target changes too.\nThis is the moving target problem: the model is training against targets produced by its own changing predictions. The problem is not simply that the targets are wrong. Bootstrap targets are usually approximate. The problem is that a noisy or overestimated value can be used as a target, learned by earlier state-action pairs, and then reused in later targets. A deeper explanation will be in the next article.\nMoving Data Distribution In ordinary supervised learning, the dataset is often fixed. In RL, the data distribution is produced by a behavior policy. If that behavior policy is tied to the current $q_\\theta$, then changing $\\theta$ also changes what data we collect.\nFor example, if the behavior policy chooses actions $\\epsilon$-greedily according to $q_\\theta$, then changing $\\theta$ changes both how the agent exploits and how it explores. That changes which states it visits, which actions it tries, and which rewards it observes. This is clearly true for on-policy methods such as SARSA, where the policy being evaluated is also the policy collecting data.\nFor Q-learning, the target is greedy rather than defined by the behavior policy, so the update can use off-policy data, but it is the behavior policy that determines which transitions are observed. The moving distribution comes from data collection, not from the target itself. With a fixed offline dataset this movement disappears, although the target may still ask for values of actions that are rare or missing in the data.\nSharp Value Boundaries Function approximators are often smooth functions. Similar inputs tend to produce similar outputs. 
That is helpful when similarity in input space matches similarity in action-value, but in RL, tiny changes in state can sometimes cause huge changes in return.\nImagine a helicopter flying close to a tree. A small change in position may be the difference between passing safely and crashing. The reward changes abruptly because one trajectory continues and the other terminates with a large penalty. In other words, two inputs can look close in raw pixels or coordinates while requiring very different value predictions.\nThis does not always mean the true action-value function is mathematically non-differentiable, although it can be. The practical issue is that the action-value function may have sharp boundaries or high curvature. A smooth approximator can smear the boundary and assign unsafe values to states near failure.\nThe Deadly Triad The feedback loop above is especially dangerous in what is usually called the deadly triad:\nFunction approximation\nBootstrapping\nOff-policy learning\nNone of these is suspicious on its own. Function approximation is how we handle large state-action spaces. Bootstrapping is how we update before the full return is known. Off-policy learning is how we learn from data generated by a different policy.\nThe trouble starts when they are combined. With function approximation, an update for one state-action pair also changes predictions for other pairs. With bootstrapping, those changed predictions are used as targets for later updates. With off-policy learning, the data may be weak exactly where the target asks for values, so the model has to extrapolate.\nFor example, suppose an action-value is too high in a poorly covered next state. A bootstrapped target uses that value. The update then increases the estimate for a previous state-action pair. Because parameters are shared, that same update may also change other predictions. Later targets can use those changed predictions again. 
The error is no longer isolated in one table entry; it becomes part of the training signal.\nThis is where the tabular fixed-point intuition stops being enough. The update may still settle down, but it is no longer guaranteed to behave like a simple contraction. It can oscillate or diverge.\nThis is not just a neural-network problem. Divergence examples exist even with linear function approximation. Papers such as Breaking the Deadly Triad with a Target Network study how target networks can stabilize some of these dynamics.\nThe deadly triad doesn\u0026rsquo;t mean that every approximate off-policy TD method fails. DQN uses all three ingredients and can work very well. The whole point is that stability is no longer automatically guaranteed.\nFinal Thoughts Tabular model-free control represents the action-value function with one number per state-action pair. Approximate model-free control replaces that explicit table representation with a parameterized function:\n$$q_\\theta(s, a)$$\nThis lets the agent generalize to large, continuous, or high-dimensional state-action spaces. It also makes each update less local. One gradient step can change many predictions at once.\nMonte Carlo targets turn RL into the most supervised-looking regression problem, because the completed return is a fixed number. TD targets update sooner, but they bootstrap from current estimates, so the target itself can move. Semi-gradient methods handle this by treating the target as fixed during the current update.\nApproximate SARSA, Expected SARSA, and Q-learning all share the same update shape. They differ in the target used to compute $\\delta_t$: sampled next action, expected next action, or greedy next action.\n$$\\theta \\leftarrow \\theta + \\alpha \\delta_t \\nabla_\\theta q_\\theta(S_t, A_t)$$\nThe main lesson is that function approximation changes the learning dynamics. 
Correlated samples, moving data distributions, sharp value boundaries, moving targets, and the deadly triad all appear because we no longer have independent table entries.\nThe next step is to make approximate Q-learning work with neural networks. That leads to Deep Q-Networks, where replay buffers and target networks are introduced as practical answers to the instability described here.\n","permalink":"https://mateuszpieniak.com/courses/reinforcement-learning/103-approximate-methods/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eIn the previous article, we learned model-free control with tabular action-values:\u003c/p\u003e\n\u003cspan class=\"katex-display\"\u003e\u003cspan class=\"katex\"\u003e\u003cspan class=\"katex-mathml\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmi\u003eq\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmi\u003ea\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003e\nq(s, a)\n\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e\u003cspan class=\"katex-html\" aria-hidden=\"true\"\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.03588em;\"\u003eq\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"mpunct\"\u003e,\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003ea\u003c/span\u003e\u003cspan 
class=\"mclose\"\u003e)\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\n\u003cp\u003eStrictly speaking, \u003cspan class=\"katex\"\u003e\u003cspan class=\"katex-mathml\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmi\u003eq\u003c/mi\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003eq\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e\u003cspan class=\"katex-html\" aria-hidden=\"true\"\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:0.625em;vertical-align:-0.1944em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.03588em;\"\u003eq\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e is already a function from state-action pairs to numbers. In the tabular setting, we represent that function with an explicit table. Every state-action pair has its own cell. If the agent observes a transition from \u003cspan class=\"katex\"\u003e\u003cspan class=\"katex-mathml\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmi\u003ea\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003e(s, a)\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e\u003cspan class=\"katex-html\" aria-hidden=\"true\"\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"mpunct\"\u003e,\u003c/span\u003e\u003cspan class=\"mspace\" 
style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003ea\u003c/span\u003e\u003cspan class=\"mclose\"\u003e)\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e, it updates that one cell and leaves the others unchanged. That is conceptually clean and enough for small grid worlds, but it does not scale.\u003c/p\u003e","title":"Reinforcement Learning 103: Approximate Methods"},{"content":"Introduction In the previous article, we derived Value Iteration. The update was:\nvk+1(s)=max⁡a∑r,s′p(r,s′∣s,a)(r+γvk(s′)) v_{k+1}(s) = \\max_a \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a)(r + \\gamma v_k(s^\\prime)) vk+1​(s)=amax​r,s′∑​p(r,s′∣s,a)(r+γvk​(s′)) This is a model-based update. It assumes that we know the environment dynamics:\np(r,s′∣s,a) p(r, s^\\prime \\mid s, a) p(r,s′∣s,a) If this distribution is available, the update can evaluate actions before choosing one. For each possible action aaa in state sss, it averages over all possible rewards rrr and next states s′s^\\primes′ to compute that action\u0026rsquo;s candidate value. Once every action has been evaluated this way, the outer max⁡a\\max_amaxa​ selects the best one and uses it to update v(s)v(s)v(s).\nIn most real problems, we do not know p(r,s′∣s,a)p(r, s^\\prime \\mid s, a)p(r,s′∣s,a) in advance. One way forward is to learn this model from data and then plan with it. This is also close to how people often reason: before acting, we imagine what might happen next and whether it will be good or bad. For example, stealing something may lead to getting arrested.\nBut learning an explicit dynamics model is not trivial. Yann LeCun makes a related point in his work on predictive world models: useful prediction should often happen at a more abstract level than raw pixels. 
If the state is everything visible on the screen, then predicting the exact next state means predicting the next image pixel by pixel, including details that may not matter for the decision.\nTo learn in this model-free setup, we need samples from interaction with the environment. They may come from the current agent, another policy, or an existing dataset, but each sample still records what happened along one realized path. We cannot pause in state $s$, try every possible action, observe all possible outcomes, average them, and then go back to choose the best action. Some policy chooses one action, gets one reward, moves to one next state, and the episode continues.\nSo instead of the exact model-based update, interaction gives us samples from the unknown dynamics. A single transition looks like:\n$$(s, a, r, s^\\prime)$$\nOver time, these sampled transitions form trajectories, episodes, or rollouts. This is the data source for the rest of the article. The next question is what we should estimate from those samples so that we can improve a policy.\nUse Q, not V One tempting idea is to use rollouts to estimate $v(s)$, as we did in the previous article. That works for prediction, but it is not enough for control. The reason is simple: $v(s)$ tells us how good state $s$ is under some policy, but it does not tell us which action to choose in that state. In the model-based setting, this was not a problem. If we knew $v(s)$ and the transition model, we could compare actions by computing:\n$$q(s, a) = \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a)(r + \\gamma v(s^\\prime))$$\nThen we could improve the policy greedily:\n$$\\pi^\\prime(s) = \\argmax_a q(s, a)$$\nIn the model-free setting, this path is blocked. We do not know $p(r, s^\\prime \\mid s, a)$, so we cannot turn $v(s)$ into action comparisons. 
Estimating $v(s)$ alone would tell us that a state is good or bad, but not which action made it good or bad. So for model-free control we should learn action-values directly:\n$$q_\\pi(s, a) = \\mathbb{E}_\\pi[G_t \\mid S_t = s, A_t = a]$$\nThe action-value function answers the question we need for control: if I am in state $s$ and take action $a$, what return should I expect? Once we have an estimate $q(s, a)$, we can extract a greedy policy without a transition model:\n$$\\pi^\\prime(s) = \\argmax_a q(s, a)$$\nThat is why the rest of the article focuses on learning $q(s, a)$ from sampled rollouts.\nEstimation methods Before looking at control algorithms, we first need two ways to estimate $q(s, a)$ from sampled experience.\nMonte Carlo Monte Carlo learning uses complete episodes. Suppose an episode produces the following trajectory:\n$$S_0, A_0, R_0, S_1, A_1, R_1, \\ldots, S_T$$\nThe episode ends in terminal state $S_T$. After it finishes, we can compute the realized return from each time step:\n$$G_t = R_t + \\gamma R_{t+1} + \\gamma^2 R_{t+2} + \\ldots + \\gamma^{T-t-1}R_{T-1} = \\sum_{k=t}^{T-1} \\gamma^{k-t}R_k$$\nFor every visited state-action pair $(S_t, A_t)$, this $G_t$ is one sample of the return after taking action $A_t$ in state $S_t$. The estimate $q(s, a)$ does not depend on time; it is tied to the state-action pair. The time index is only needed to identify which sampled return $G_t$ came from the rollout.\nTo write the estimate without carrying the time index everywhere, let $s = S_t$ and $a = A_t$ for a sampled visit. 
Across episodes, or even within one episode, the same state-action pair may appear many times, and each visit gives another sampled return. Monte Carlo estimates the action-value by averaging those returns:\n$$q(s, a) = \\frac{1}{N(s, a)} \\sum_{i=1}^{N(s, a)} G_i(s, a)$$\nHere $N(s, a)$ is the number of observed returns for that state-action pair. We do not need to store all previous episodes to compute this average. When we observe a new return for $(s, a)$, we first increment the count:\n$$N(s, a) \\leftarrow N(s, a) + 1$$\nThen we treat the new return as $G_i(s, a)$ and update the running average:\n$$q(s, a) \\leftarrow \\frac{(N(s, a) - 1)q(s, a) + G_i(s, a)}{N(s, a)}$$\nEquivalently:\n$$q(s, a) \\leftarrow q(s, a) + \\frac{1}{N(s, a)}(G_i(s, a) - q(s, a))$$\nThe cost is that we must wait until the episode finishes. If the episode is very long, or if the task is continuing and does not naturally end, this can be inconvenient. Monte Carlo returns can also have high variance, because a full return may depend on many random events that happen after the current state-action pair. Averaging over more episodes reduces this variance, but it may take many samples.\nSparse rewards make this worse. A reward signal is sparse when most transitions produce zero or uninformative reward, and useful feedback appears only after a rare event. Imagine a game where the agent receives a meaningful reward only after winning. If the initial policy is random, the agent may play many episodes without ever winning, so most sampled returns contain no useful learning signal. 
In that case Monte Carlo learning may need a large number of samples before it observes even one successful trajectory.\nTemporal Difference Monte Carlo gives us a clean sampled target, but it has one practical problem: we need to wait for the final return.\nTemporal Difference (TD) learning changes the target. Instead of waiting for the complete return $G_t$, it uses the recursive Bellman equation for action-values and bootstraps from the current estimate. This is the same Bellman expectation idea from the previous article, where we used it for $v_\\pi$, but unrolled one step further so that the value is attached to a state-action pair. For a fixed policy $\\pi$, the equation for $q_\\pi$ is:\n$$q_\\pi(s, a) = \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) \\left(r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime)q_\\pi(s^\\prime, a^\\prime)\\right)$$\nThis says: take action $a$ in state $s$, receive reward $r$, move to $s^\\prime$, and then follow policy $\\pi$ from the next state. If we knew the transition distribution, we could compute that expectation exactly.\nThere are two expectations inside this equation. The outer one is over the environment dynamics $p(r, s^\\prime \\mid s, a)$. In a model-free setting, we do not know those probabilities, so this is the part we must replace with a sampled transition:\n$$(s, a, r, s^\\prime)$$\nThat sampled transition is enough to handle the unknown transition probabilities. It does not, by itself, require the transition to be generated by the same policy $\\pi$ that appears in the Bellman equation. Once we condition on $(s, a)$, the reward and next state come from the environment dynamics $p(r, s^\\prime \\mid s, a)$. 
Which policy collected the sample matters for coverage and for the action distribution used in the target, and we will return to that in the on-policy vs. off-policy section.\nThe inner expectation is over the next action under the policy $\\pi$. That part is different: if we know the policy and the current action-value estimates, we can either sample one next action from $\\pi$ or compute the expectation over actions directly.\nIf we sample the next action $a^\\prime \\sim \\pi(\\cdot \\mid s^\\prime)$, the one-step sample is:\n$$(s, a, r, s^\\prime, a^\\prime)$$ and its target is:\n$$r + \\gamma q(s^\\prime, a^\\prime)$$\nThis version samples both the environment outcome and the next action under $\\pi$. There is another possibility: keep the action expectation instead. Then the transition $(s, a, r, s^\\prime)$ is enough, and the target becomes:\n$$r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime)q(s^\\prime, a^\\prime)$$\nWe will come back to that difference when we compare SARSA with Expected SARSA. For now, assume the sampled tuple $(s, a, r, s^\\prime, a^\\prime)$ and the sampled-action target:\n$$r + \\gamma q(s^\\prime, a^\\prime)$$\nThis target can be used immediately after observing $s^\\prime$ and choosing $a^\\prime$. It gives us more frequent updates than Monte Carlo, but it bootstraps from the current action-value estimate $q(s^\\prime, a^\\prime)$, which may still be wrong. 
It also has extra noise from sampling $a^\\prime$.\nThe TD update has the same shape as the Monte Carlo running-average update:\n$$q(s, a) \\leftarrow q(s, a) + \\text{step size} \\cdot (\\text{target} - q(s, a))$$\nFor Monte Carlo, the target was the sampled return $G_i$, and the step size was often $\\frac{1}{N(s, a)}$. For a sampled-action TD target, the target is the bootstrapped estimate $r + \\gamma q(s^\\prime, a^\\prime)$. Because this target is noisy and also depends on the current value estimate, we usually use a learning rate $\\alpha$ instead of $\\frac{1}{N(s, a)}$:\n$$q(s, a) \\leftarrow q(s, a) + \\alpha \\left(r + \\gamma q(s^\\prime, a^\\prime) - q(s, a)\\right)$$\nThe term in parentheses is the error between the bootstrapped target and the current estimate. Equivalently, the same update can be written as:\n$$q(s, a) \\leftarrow (1 - \\alpha)q(s, a) + \\alpha \\left(r + \\gamma q(s^\\prime, a^\\prime)\\right)$$\nWith a constant $\\alpha$, this behaves like an exponential moving average: recent targets receive more weight, but older targets still influence the estimate indirectly through the previous value of $q(s, a)$.\nWe do not have to update after exactly one step. 
We can keep the sampled trajectory for longer, collect more real rewards, and only then bootstrap from the current estimate.\nThe one-step target is:\n$$G_t^{(1)} = R_t + \\gamma q(S_{t+1}, A_{t+1})$$\nThe two-step target is:\n$$G_t^{(2)} = R_t + \\gamma R_{t+1} + \\gamma^2 q(S_{t+2}, A_{t+2})$$\nMore generally, the $n$-step target is:\n$$G_t^{(n)} = R_t + \\gamma R_{t+1} + \\ldots + \\gamma^{n-1}R_{t+n-1} + \\gamma^n q(S_{t+n}, A_{t+n})$$\nThe larger $n$ is, the closer the target gets to Monte Carlo. If $n$ reaches the end of the episode, the bootstrap term disappears and the target becomes the full return $G_t$. The smaller $n$ is, the sooner we can update.\nTD($\\lambda$) combines these $n$-step targets. Do not confuse $n$ and $\\lambda$: $n$ can be $1$, $2$, $3$, and so on, while $\\lambda$ is a mixing parameter between $0$ and $1$.\nThink of $\\lambda$ as controlling how quickly the influence of larger $n$ fades. A small $\\lambda$ puts almost all of the mixture on the one-step return. A large $\\lambda$ lets returns that look farther ahead keep meaningful weight, so the result behaves more like Monte Carlo. A simple way to express this is geometric decay: each successive $n$-step return receives $\\lambda$ times the raw weight of the previous one:\n$$1, \\lambda, \\lambda^2, \\lambda^3, \\ldots$$\nThese raw weights form a geometric series. They decay geometrically, but they do not sum to $1$. For $\\lambda \u0026lt; 1$, their sum is:\n$$1 + \\lambda + \\lambda^2 + \\ldots = \\frac{1}{1 - \\lambda}$$\nSo we multiply by $(1 - \\lambda)$ to normalize the weights. 
Mathematically, for $\\lambda \u0026lt; 1$, TD($\\lambda$) uses the $\\lambda$-return, which is a weighted average of $n$-step returns:\n$$G_t^\\lambda = (1 - \\lambda)\\sum_{n=1}^{\\infty}\\lambda^{n-1}G_t^{(n)}$$\nSo the mixture weight assigned to the $n$-step return is:\n$$(1 - \\lambda)\\lambda^{n-1}$$\nHere, weight means the fraction of the final $\\lambda$-return assigned to that particular $n$-step return. It is not the same thing as the reward discount $\\gamma$.\nThe plot shows this weight as a function of $n$. In the algorithm, $n$ is an integer, but the smooth curves make the geometric decay easier to see.\nIn finite episodic tasks, $\\lambda = 1$ is handled as the limiting case where the return becomes the full Monte Carlo return. We will not need the full TD($\\lambda$) machinery for basic algorithms, but it is useful to understand the spectrum:\nMonte Carlo waits longer and uses less bootstrapping.\nOne-step TD updates sooner and uses more bootstrapping.\nTD($\\lambda$) sits between those extremes.\nExploration We now have two ways to estimate $q_\\pi(s, a)$ from samples: Monte Carlo returns and TD targets. Both methods depend on the samples we actually collect. If the policy never tries action $a$ in state $s$, then we do not get returns or TD targets for $(s, a)$, and we cannot estimate $q(s, a)$ well. As a consequence, the improved policy may never discover actions that would maximize the total discounted reward.\nThis is another consequence of being model-free. Because we do not know $p(r, s^\\prime \\mid s, a)$, we cannot reliably predict what an untried action would do. The data-collecting policy therefore has to try alternatives sometimes instead of always choosing the current greedy action. That is exploration.\nBefore writing control algorithms, we should make action selection explicit. 
A policy is a rule for choosing actions. It may be deterministic, like:\n$$\\pi(s) = \\argmax_a q(s, a)$$ or stochastic, in which case we sample from an action distribution:\n$$a \\sim \\pi(\\cdot \\mid s)$$\nThe stochastic policy we will use is $\\epsilon$-greedy. In each state, look at the current $q(s, a)$ values and find the action that looks best. Most of the time, choose that action. With probability $\\epsilon$, choose randomly so that other actions still get tried. The greedy choice exploits the current estimates. The random choice explores actions that may look worse now but could teach us something important. Balancing these two forces is the exploration-exploitation problem.\nOne downside of constant $\\epsilon$-greedy exploration is that we keep paying an exploration cost even after we have learned a good policy. Over time, we want to maximize returns, not keep probing random actions. A common fix is to decay $\\epsilon$ over training: start high to explore broadly, then lower it so the agent increasingly exploits what it has learned. That said, in non-stationary environments where the rules or rewards can change over time, keeping some exploration permanently makes sense.\nGeneralized Policy Iteration We now have the ingredients for model-free control:\nAn $\\epsilon$-greedy policy usually chooses the action with the largest current $q(s, a)$ but still tries other actions.\nMonte Carlo and TD give us ways to estimate $q_\\pi(s, a)$ from sampled rollouts.\nSo far, Monte Carlo and TD were only prediction tools: keep a policy fixed and estimate how good its actions are. For control, we do not need to invent a new framework. We can reuse Policy Iteration from the previous article and run the same two-step loop, now with sampled estimates:\nPolicy evaluation: estimate action-values for a target policy. 
Policy improvement: make the policy prefer actions with larger estimated values.\nThe model-based version used the transition dynamics to evaluate a policy. Here we do not have those dynamics, so evaluation has to come from samples. The improvement step also changes slightly: instead of always choosing the action with the largest $q(s, a)$, the policy keeps some random exploration so that training continues to produce useful data.\nThis is Generalized Policy Iteration (GPI). The two steps are still distinct, but they can be interleaved at different rates. We can evaluate for many episodes and then improve, or we can improve after every small update to $q$. The algorithms below differ mostly in what target they use for evaluation: complete returns, sampled next actions, an average over next actions, or a greedy optimality target.\nMonte Carlo Control Monte Carlo Control uses complete returns for the evaluation part. Generate an episode with the current policy, use the observed returns to update $q(s, a)$, then refresh the policy from the new estimates: in each state, the action with the largest $q(s, a)$ becomes the greedy choice, while random exploration with probability $\\epsilon$ remains.\nA basic version is:\nGenerate an episode using the current policy $\\pi$.\nFor each visited pair $(S_t, A_t)$, compute the return $G_t$ from that point in the episode.\nUpdate the running average for $q(S_t, A_t)$.\nAfter the $q$ update, recompute the greedy action in each state, while keeping probability $\\epsilon$ for random exploration.\nRepeat.\nSARSA SARSA keeps the same policy-evaluation, policy-improvement loop, but replaces the complete Monte Carlo return with a one-step TD target.\nFor the one-step target, we need to know which action the policy will take in the next state. 
So the sampled piece is:\n$$S_t, A_t, R_t, S_{t+1}, A_{t+1}$$\nThis is where the name SARSA comes from: state, action, reward, state, action. If the sampled step is $(s, a, r, s^\\prime, a^\\prime)$, the target is:\n$$r + \\gamma q(s^\\prime, a^\\prime)$$ and the update becomes:\n$$q(s, a) \\leftarrow (1 - \\alpha)q(s, a) + \\alpha \\left(r + \\gamma q(s^\\prime, a^\\prime)\\right)$$\nIf $s^\\prime$ is terminal, there is no next action and no future value, so the target is just:\n$$r$$\nA simple SARSA loop is:\nChoose $a \\sim \\pi(\\cdot \\mid s)$, take it, and observe $r$ and $s^\\prime$.\nIf $s^\\prime$ is not terminal, choose $a^\\prime \\sim \\pi(\\cdot \\mid s^\\prime)$ using the same policy.\nUpdate $q(s, a)$ toward $r + \\gamma q(s^\\prime, a^\\prime)$, or toward $r$ if $s^\\prime$ is terminal.\nAfter the $q$ update, recompute the greedy action in each state, while keeping probability $\\epsilon$ for random exploration.\nIf the episode is not done, set $s \\leftarrow s^\\prime$, $a \\leftarrow a^\\prime$, and continue.\nThe important detail is that $a^\\prime$ is not arbitrary. The Bellman expectation equation for $q_\\pi$ says that after reaching $s^\\prime$, we continue with policy $\\pi$. If we approximate that next-action expectation with one sampled action, then the sampled action has to come from $\\pi(\\cdot \\mid s^\\prime)$. This is the core reason SARSA is on-policy: the policy being evaluated must also generate the next sampled action. If $\\pi$ is $\\epsilon$-greedy, then the target includes that exploration, so SARSA learns the value of the policy it actually follows. We will talk more about off-policy vs. 
on-policy in the next section.\nExpected SARSA\nExpected SARSA uses the alternative we mentioned in the TD section. SARSA samples one next action and uses:\n$$r + \\gamma q(s^\\prime, a^\\prime)$$\nExpected SARSA keeps the action expectation instead. After observing $(s, a, r, s^\\prime)$, it averages over the actions that policy $\\pi$ could take in $s^\\prime$:\n$$r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime)q(s^\\prime, a^\\prime)$$\nThe update is:\n$$q(s, a) \\leftarrow (1 - \\alpha)q(s, a) + \\alpha \\left(r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime)q(s^\\prime, a^\\prime)\\right)$$\nFor a terminal next state, the target is again:\n$$r$$\nThe difference is small but useful. SARSA asks which next action the policy happened to sample. Expected SARSA asks for the average value under all actions the policy might sample. The cost is that we have to know the action probabilities $\\pi(a^\\prime \\mid s^\\prime)$ and sum over the available actions, which is cheap when the action set is small and discrete.\nA simple control loop is:\n1. Choose $a \\sim \\pi(\\cdot \\mid s)$, take it, and observe $r$ and $s^\\prime$.\n2. Update $q(s, a)$ toward $r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime) q(s^\\prime, a^\\prime)$, or toward $r$ if $s^\\prime$ is terminal.\n3. After the $q$ update, recompute the greedy action in each state, while keeping probability $\\epsilon$ for random exploration.\n4. Continue from $s^\\prime$.\nUnlike SARSA, this update does not need to carry a sampled $a^\\prime$ forward. The sampled tuple is really $(s, a, r, s^\\prime)$. 
The final \u0026ldquo;A\u0026rdquo; in Expected SARSA is the action distribution inside the expectation, not a sampled action in the update.\nOn-policy vs. Off-policy\nOnce the behavior policy includes exploration, we need to separate two ideas:\n- The behavior policy is the policy that collects data.\n- The target policy is the policy being learned or evaluated.\nIn on-policy learning, the behavior policy and target policy are the same. The update evaluates the policy that actually acts.\nIn off-policy learning, they can be different. It is like watching over someone else\u0026rsquo;s shoulder while they play a game: their actions generate experience, but we can use that experience to learn about a different way of playing. Watching strong players is especially useful because their trajectories spend more time near good decisions, so the behavior policy does not have to explore as much.\nThe Bellman expectation equation makes the target policy explicit. It evaluates $q_\\pi$, the value of taking one action and then following policy $\\pi$:\n$$q_\\pi(s, a) = \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) \\left(r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime)q_\\pi(s^\\prime, a^\\prime)\\right)$$\nAfter we replace the outer sum with a sampled transition, the model-free update no longer needs explicit transition probabilities. It only needs an observed $(s, a, r, s^\\prime)$. That transition may have been collected by the current policy, by another behavior policy, or from an existing dataset. Once action $a$ was taken in state $s$, the observed $r$ and $s^\\prime$ are samples from the environment dynamics $p(r, s^\\prime \\mid s, a)$.\nThe target policy $\\pi$ appears in a different place: the next-state action distribution. 
That is the inner sum:\n$$\\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime)q(s^\\prime, a^\\prime)$$\nSo the key question is not only \u0026ldquo;who collected this transition?\u0026rdquo; It is also \u0026ldquo;which policy is used inside the target?\u0026rdquo;\nSARSA samples the inner expectation with one actual next action $a^\\prime$. If we are evaluating $\\pi$, that sampled $a^\\prime$ must come from $\\pi$. In the basic control algorithm, the same $\\epsilon$-greedy policy both acts and appears in the target, so SARSA is on-policy.\nExpected SARSA does not sample $a^\\prime$. It computes the inner expectation directly from the target policy\u0026rsquo;s action probabilities. That gives two valid cases:\n- On-policy Expected SARSA: the behavior policy and target policy are the same, so transitions come from $\\pi$ and the expectation is also under $\\pi$.\n- Off-policy Expected SARSA: a behavior policy collects $(s, a, r, s^\\prime)$, but the expectation is computed under a different target policy $\\pi$.\nPlain Monte Carlo Control is also on-policy, but over a longer horizon. Its target is the full return:\n$$G_t = R_t + \\gamma R_{t+1} + \\gamma^2 R_{t+2} + \\ldots$$\nThis return is not just about the first action. It also depends on the later actions in the episode. If those later actions were chosen by the current policy $\\pi$, then $G_t$ is a sample of $q_\\pi(S_t, A_t)$. So plain Monte Carlo Control is on-policy because the episode is generated by the same policy whose action-values the return is estimating. The policy may still improve between episodes; on-policy does not mean the policy is frozen forever.\nQ-learning is different again: it does not use the Bellman expectation equation for a fixed policy. It uses the Bellman optimality equation, whose next-state target is greedy. 
The behavior policy can explore, but the update target is the value of acting greedily after the sampled transition.\nQ-learning\nQ-learning takes the Value Iteration idea and applies it to action-values. Instead of evaluating a fixed policy with the Bellman expectation equation, it uses the Bellman optimality equation for $q^*$. The question is: if we take action $a$ in state $s$, and then act optimally, what return do we expect?\n$$q^*(s, a) = \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) \\left(r + \\gamma \\max_{a^\\prime} q^*(s^\\prime, a^\\prime)\\right)$$\nThere is no policy distribution in this equation. The maximum already encodes the idea that, from the next state onward, the agent chooses the best available action. Once we know $q^*$, an optimal policy falls out by acting greedily:\n$$\\pi^*(s) = \\argmax_a q^*(s, a)$$\nThe Bellman optimality equation still contains transition probabilities, so computing the right-hand side exactly requires a model of the environment. Q-learning is the model-free, sampled version of that idea. One observed transition:\n$$(s, a, r, s^\\prime)$$\nreplaces the expectation over all possible outcomes. The sampled optimality target is:\n$$r + \\gamma \\max_{a^\\prime} q(s^\\prime, a^\\prime)$$\nThe update is:\n$$q(s, a) \\leftarrow (1 - \\alpha)q(s, a) + \\alpha \\left(r + \\gamma \\max_{a^\\prime} q(s^\\prime, a^\\prime)\\right)$$\nThe maximum is computed from the current table of action-values. In the next state $s^\\prime$, look at every available action $a^\\prime$ and take the largest current estimate:\n$$\\max_{a^\\prime} q(s^\\prime, a^\\prime)$$\nThis does not mean the agent must actually take that action next. 
The maximum is only used to compute the update target. The next real action is still chosen by the behavior policy. In fact, the agent does not even need to be acting now: the transitions could come from a logged dataset of past behavior.\nQ-learning is off-policy for two connected reasons:\n1. The Bellman optimality equation does not require an explicit policy. The target is greedy because of the maximum.\n2. The sampled transition $(s, a, r, s^\\prime)$ is only used to estimate the environment dynamics. It can come from any behavior policy, including an existing record of actions, as long as the data covers enough state-action pairs.\nA simple Q-learning loop is:\n1. Choose $a$ in state $s$ using the behavior policy. For example, use $\\epsilon$-greedy with respect to the current $q$ values.\n2. Take $a$, observe $r$ and $s^\\prime$.\n3. Update $q(s, a)$ toward $r + \\gamma \\max_{a^\\prime} q(s^\\prime, a^\\prime)$, or toward $r$ if $s^\\prime$ is terminal.\n4. After the $q$ update, recompute which actions the behavior policy treats as greedy.\n5. If the episode is not done, set $s \\leftarrow s^\\prime$ and continue.\nCliff Walking Example\nLet\u0026rsquo;s ground the algorithms in another small grid-world game. Each cell is a state $s$, and from each non-terminal state the agent can move up, down, left, or right.\nThe agent starts in the bottom-left cell and has to reach the goal in the bottom-right cell. The dark cells between them are the cliff. Every normal step gives reward $-1$. Stepping into the cliff gives reward $-100$ and sends the agent back to the start. Reaching the goal ends the episode.\nThis environment is useful because two routes are both reasonable, depending on what policy we are evaluating. The shortest route goes directly above the cliff. It reaches the goal quickly, but one exploratory move downward can be very expensive. A safer route goes one row higher. 
It takes a few extra steps, but it gives the behavior policy more room for mistakes: even if exploration makes the agent choose a random action, it is less likely to step straight into the cliff and receive the $-100$ penalty.\nThe images below show the deterministic greedy policy extracted after training. They do not show the random exploratory moves taken during training. They show what each learned $q$ table would do if we acted greedily afterward.\nThat detail matters. For Monte Carlo Control, SARSA, and Expected SARSA, the $q$ values were learned for an $\\epsilon$-greedy policy. The final plot removes exploration by taking the best action in each state, but the values behind those arrows still include the cost of possible future exploratory moves. So the greedy path in the image does not have to be the shortest deterministic path.\nMonte Carlo Control\nMonte Carlo Control does find a route to the goal, but the greedy route in this plot is less direct.\nThe main reason is that Monte Carlo Control updates from the full return of the episode. In Cliff Walking, falling into the cliff gives $-100$ and sends the agent back to the start, but it does not end the episode. So if an exploratory move hits the cliff later in the episode, that penalty and the recovery steps become part of the return for earlier state-action pairs too. With $\\epsilon$-greedy episodes, those full returns can be noisy, and the learned action-values may make a safer-looking route appear best.\nThis does not mean Monte Carlo is implemented incorrectly. It means that, in this environment and with this exploration setup, full-episode returns have high variance. The bootstrapping methods below use shorter targets, so they usually stabilize the path faster.\nSARSA\nSARSA learns the most cautious route in this run: it goes to the top row first, then moves right. 
This matches the usual cliff-walking effect described by Sutton and Barto: SARSA is on-policy, so it learns action-values for the $\\epsilon$-greedy policy that is actually used during training. Most of the time the agent follows the current best action, but with probability $\\epsilon$ it takes a random action. Near the cliff, a random move down can cost $-100$.\nAfter reaching $s^\\prime$, SARSA chooses the next action $a^\\prime$ with the same $\\epsilon$-greedy policy and uses that action in the update:\n$$r + \\gamma q(s^\\prime, a^\\prime)$$\nImagine the agent is on the middle path and moves to a state $s^\\prime$ closer to the cliff. From there, $\\epsilon$-greedy can still sample the dangerous action down. After enough falls, $q(s^\\prime,\\text{down})$ becomes very low because that action leads to the $-100$ cliff penalty. If down is the sampled $a^\\prime$, SARSA uses this low action-value in the target for the previous move. In other words, the low $q(s^\\prime,a^\\prime)$ helps update $q(s,a)$ for the middle-path move. So the middle-path move can look bad too.\nExpected SARSA\nExpected SARSA learns the middle route. It is still on-policy in this setup, so it is also evaluating the $\\epsilon$-greedy policy that may explore. The difference is that it does not wait to see which single $a^\\prime$ was sampled. After reaching $s^\\prime$, it asks what would happen on average if the agent followed the $\\epsilon$-greedy policy from there:\n$$r + \\gamma \\sum_{a^\\prime} \\pi(a^\\prime \\mid s^\\prime)q(s^\\prime, a^\\prime)$$\nIf $s^\\prime$ is close to the cliff, that average includes the small chance that exploration picks down and falls. So Expected SARSA already knows that states near the cliff are risky under the $\\epsilon$-greedy policy. But it also sees that going all the way to the top row costs extra steps. 
In this run, the middle route has the best balance: safer than the cliff edge, but shorter than the top path.\nQ-learning\nQ-learning learns the shortest route just above the cliff. It is off-policy: the behavior policy may still be $\\epsilon$-greedy during training, but the update learns action-values for a greedy target policy. Its target uses the best next action:\n$$r + \\gamma \\max_{a^\\prime} q(s^\\prime, a^\\prime)$$\nSo after a move reaches $s^\\prime$, Q-learning does not ask what random exploratory action might be sampled there. It assumes the agent will take the best-known action. Under that assumption, the path directly above the cliff is attractive: it reaches the goal in the fewest steps, and the greedy policy will not intentionally move down into the cliff. The agent may still fall during training because the behavior policy explores, but that exploration risk is not part of the policy Q-learning is learning.\nExercise\nWant to test your understanding? I prepared an exercise for this lesson in my Reinforcement Learning Course.\nSummary\nThe main shift in this lesson was from planning with a known model to learning from sampled interaction. When we do not know the transition probabilities, learning only $v(s)$ is not enough for control, because we cannot look ahead through the model to compare actions. Instead we learn $q(s, a)$, which lets us choose actions directly:\n$$\\pi^\\prime(s) = \\argmax_a q(s, a)$$\nWe also saw two broad ways to estimate action-values from data. Monte Carlo methods wait until an episode ends and use the complete return. Temporal Difference methods update after each step by bootstrapping from the estimates they already have. SARSA, Expected SARSA, and Q-learning are all TD control methods, but their targets differ: a sampled next action, an expected next action, or a greedy maximum. 
The cliff-walking example made that difference visible, since these targets lead to different preferred routes.\nBecause the agent only learns from actions it actually tries, exploration is not optional. An $\\epsilon$-greedy behavior policy keeps collecting useful samples while still exploiting what has already been learned.\nFinally, we separated on-policy from off-policy learning. SARSA learns about the same policy it uses to act. Expected SARSA can be either on-policy or off-policy, depending on whether its expectation uses the behavior policy or a separate target policy. Q-learning is off-policy because it can explore during training while learning the greedy policy.\nTowards Deep Q-learning\nSo far $q(s, a)$ was treated as a table. That is enough for small environments, but it will not scale to images, continuous state vectors, or very large state spaces. Neural networks can address this scaling problem by approximating action-values with parameters $\\theta$:\n$$q_\\theta(s, a)$$\nOnce we do that, the algorithm needs extra machinery. Neural networks are sensitive to correlated data, and Q-learning targets move as the network changes. This is where experience replay and target networks become important. We will cover those ideas in the next article when we move from tabular Q-learning to approximate methods.\n","permalink":"https://mateuszpieniak.com/courses/reinforcement-learning/102-q-learning-sarsa/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eIn the previous article, we derived Value Iteration. 
The update was:\u003c/p\u003e\n\u003cspan class=\"katex-display\"\u003e\u003cspan class=\"katex\"\u003e\u003cspan class=\"katex-mathml\"\u003e\u003cmath xmlns=\"http://www.w3.org/1998/Math/MathML\" display=\"block\"\u003e\u003csemantics\u003e\u003cmrow\u003e\u003cmsub\u003e\u003cmi\u003ev\u003c/mi\u003e\u003cmrow\u003e\u003cmi\u003ek\u003c/mi\u003e\u003cmo\u003e+\u003c/mo\u003e\u003cmn\u003e1\u003c/mn\u003e\u003c/mrow\u003e\u003c/msub\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003cmo\u003e=\u003c/mo\u003e\u003cmunder\u003e\u003cmrow\u003e\u003cmi\u003emax\u003c/mi\u003e\u003cmo\u003e⁡\u003c/mo\u003e\u003c/mrow\u003e\u003cmi\u003ea\u003c/mi\u003e\u003c/munder\u003e\u003cmunder\u003e\u003cmo\u003e∑\u003c/mo\u003e\u003cmrow\u003e\u003cmi\u003er\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmsup\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo mathvariant=\"normal\"\u003e′\u003c/mo\u003e\u003c/msup\u003e\u003c/mrow\u003e\u003c/munder\u003e\u003cmi\u003ep\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003er\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmsup\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo mathvariant=\"normal\"\u003e′\u003c/mo\u003e\u003c/msup\u003e\u003cmo\u003e∣\u003c/mo\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo separator=\"true\"\u003e,\u003c/mo\u003e\u003cmi\u003ea\u003c/mi\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmi\u003er\u003c/mi\u003e\u003cmo\u003e+\u003c/mo\u003e\u003cmi\u003eγ\u003c/mi\u003e\u003cmsub\u003e\u003cmi\u003ev\u003c/mi\u003e\u003cmi\u003ek\u003c/mi\u003e\u003c/msub\u003e\u003cmo stretchy=\"false\"\u003e(\u003c/mo\u003e\u003cmsup\u003e\u003cmi\u003es\u003c/mi\u003e\u003cmo mathvariant=\"normal\"\u003e′\u003c/mo\u003e\u003c/msup\u003e\u003cmo stretchy=\"false\"\u003e)\u003c/mo\u003e\u003cmo 
stretchy=\"false\"\u003e)\u003c/mo\u003e\u003c/mrow\u003e\u003cannotation encoding=\"application/x-tex\"\u003e\nv_{k+1}(s) = \\max_a \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a)(r + \\gamma v_k(s^\\prime))\n\u003c/annotation\u003e\u003c/semantics\u003e\u003c/math\u003e\u003c/span\u003e\u003cspan class=\"katex-html\" aria-hidden=\"true\"\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord\"\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.03588em;\"\u003ev\u003c/span\u003e\u003cspan class=\"msupsub\"\u003e\u003cspan class=\"vlist-t vlist-t2\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.3361em;\"\u003e\u003cspan style=\"top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:2.7em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mtight\"\u003e\u003cspan class=\"mord mathnormal mtight\" style=\"margin-right:0.03148em;\"\u003ek\u003c/span\u003e\u003cspan class=\"mbin mtight\"\u003e+\u003c/span\u003e\u003cspan class=\"mord mtight\"\u003e1\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-s\"\u003e​\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.2083em;\"\u003e\u003cspan\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"mclose\"\u003e)\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.2778em;\"\u003e\u003c/span\u003e\u003cspan class=\"mrel\"\u003e=\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.2778em;\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:2.4801em;vertical-align:-1.4301em;\"\u003e\u003c/span\u003e\u003cspan class=\"mop op-limits\"\u003e\u003cspan class=\"vlist-t vlist-t2\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.4306em;\"\u003e\u003cspan style=\"top:-2.4em;margin-left:0em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:3em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mathnormal mtight\"\u003ea\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"top:-3em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:3em;\"\u003e\u003c/span\u003e\u003cspan\u003e\u003cspan class=\"mop\"\u003emax\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-s\"\u003e​\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.7em;\"\u003e\u003cspan\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mop op-limits\"\u003e\u003cspan class=\"vlist-t vlist-t2\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:1.05em;\"\u003e\u003cspan style=\"top:-1.856em;margin-left:0em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:3.05em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mtight\"\u003e\u003cspan class=\"mord mathnormal mtight\" style=\"margin-right:0.02778em;\"\u003er\u003c/span\u003e\u003cspan class=\"mpunct mtight\"\u003e,\u003c/span\u003e\u003cspan class=\"mord mtight\"\u003e\u003cspan class=\"mord mathnormal mtight\"\u003es\u003c/span\u003e\u003cspan class=\"msupsub\"\u003e\u003cspan class=\"vlist-t\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.6828em;\"\u003e\u003cspan 
style=\"top:-2.786em;margin-right:0.0714em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:2.5em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size3 size1 mtight\"\u003e\u003cspan class=\"mord mtight\"\u003e′\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"top:-3.05em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:3.05em;\"\u003e\u003c/span\u003e\u003cspan\u003e\u003cspan class=\"mop op-symbol large-op\"\u003e∑\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-s\"\u003e​\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:1.4301em;\"\u003e\u003cspan\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003ep\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.02778em;\"\u003er\u003c/span\u003e\u003cspan class=\"mpunct\"\u003e,\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord\"\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"msupsub\"\u003e\u003cspan class=\"vlist-t\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.8019em;\"\u003e\u003cspan style=\"top:-3.113em;margin-right:0.05em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:2.7em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mtight\"\u003e′\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mspace\" 
style=\"margin-right:0.2778em;\"\u003e\u003c/span\u003e\u003cspan class=\"mrel\"\u003e∣\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.2778em;\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:1em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"mpunct\"\u003e,\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.1667em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\"\u003ea\u003c/span\u003e\u003cspan class=\"mclose\"\u003e)\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.02778em;\"\u003er\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.2222em;\"\u003e\u003c/span\u003e\u003cspan class=\"mbin\"\u003e+\u003c/span\u003e\u003cspan class=\"mspace\" style=\"margin-right:0.2222em;\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"base\"\u003e\u003cspan class=\"strut\" style=\"height:1.0519em;vertical-align:-0.25em;\"\u003e\u003c/span\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.05556em;\"\u003eγ\u003c/span\u003e\u003cspan class=\"mord\"\u003e\u003cspan class=\"mord mathnormal\" style=\"margin-right:0.03588em;\"\u003ev\u003c/span\u003e\u003cspan class=\"msupsub\"\u003e\u003cspan class=\"vlist-t vlist-t2\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.3361em;\"\u003e\u003cspan style=\"top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:2.7em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mathnormal mtight\" style=\"margin-right:0.03148em;\"\u003ek\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"vlist-s\"\u003e​\u003c/span\u003e\u003c/span\u003e\u003cspan 
class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.15em;\"\u003e\u003cspan\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mopen\"\u003e(\u003c/span\u003e\u003cspan class=\"mord\"\u003e\u003cspan class=\"mord mathnormal\"\u003es\u003c/span\u003e\u003cspan class=\"msupsub\"\u003e\u003cspan class=\"vlist-t\"\u003e\u003cspan class=\"vlist-r\"\u003e\u003cspan class=\"vlist\" style=\"height:0.8019em;\"\u003e\u003cspan style=\"top:-3.113em;margin-right:0.05em;\"\u003e\u003cspan class=\"pstrut\" style=\"height:2.7em;\"\u003e\u003c/span\u003e\u003cspan class=\"sizing reset-size6 size3 mtight\"\u003e\u003cspan class=\"mord mtight\"\u003e′\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan class=\"mclose\"\u003e))\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\n\u003cp\u003eThis is a model-based update. It assumes that we know the environment dynamics:\u003c/p\u003e","title":"Reinforcement Learning 102: Q-learning \u0026 SARSA"},{"content":"Introduction Reinforcement learning is about learning from consequences. Unlike supervised learning, nobody tells the agent the correct action for every situation. The agent tries actions, receives rewards, and slowly discovers which behavior leads to better long-term outcomes.\nImagine playing a computer game. You observe the screen, make a decision, and the game responds. Sometimes you get points immediately, sometimes nothing happens, and sometimes the real consequence appears much later. The same idea can be described in a more abstract mathematical language. In such a formal language, you are the agent and the game is the environment.\nYou observe a state of the environment sts_tst​ at a timestamp ttt. 
You execute an action $a_t$ by following your policy function $\\pi$, which chooses actions based on the current state of the environment $s_t$. You get a reward $r_t$. The environment transitions to a new state $s_{t+1}$.\nThis loop is simple, but the hard part is credit assignment. An action can look useless now and still be important because it leads to a better state later. If you move toward a key in a game, the reward may appear only many steps later when the key opens a door.\nSo instead of asking which action gives the biggest reward immediately, reinforcement learning asks a broader question: which action puts the agent on a path with the best future? Here, the best future means the one with the largest accumulated reward, not necessarily the largest next reward. To make that question precise, we need one number that summarizes all future rewards from the current time step. This number is called the return $g_t$.\n$$g_t = r_{t} + r_{t+1} + r_{t+2} + \\ldots = \\sum_{k=0}^{\\infty} r_{t+k}$$\nHowever, if the game continues forever, this definition of $g_t$ may produce an infinite sum, which is hard to work with mathematically. In practice, we usually replace $g_t$ with the discounted cumulative reward, where $0 \\leq \\gamma \\leq 1$ is the discount factor.\n$$g_t = r_{t} + \\gamma r_{t+1} + \\gamma^2 r_{t+2} + \\ldots = \\sum_{k=0}^{\\infty} \\gamma^k r_{t+k}$$\nDiscounting has two purposes.\nFirst, the mathematical reason. When $\\gamma \u0026lt; 1$ and rewards are bounded, rewards far in the future shrink and the infinite sum stays finite. To see this, assume a constant reward at each time step, $r_t = r$. 
Then the discounted return becomes a convergent geometric series.\n$$g_t = r + \\gamma r + \\gamma^2 r + \\ldots = r \\sum_{k=0}^{\\infty} \\gamma^k = \\frac{r}{1-\\gamma}$$\nSecond, the modeling reason. The discount factor limits the effective horizon of an action. A reward $k$ steps in the future is worth $\\gamma^k$ today, so a smaller $\\gamma$ makes the agent more short-sighted, while a larger $\\gamma$ makes it more patient.\nAt this point we have a way to score one realized future, but in reinforcement learning the future is often not fully predictable. The same state and action can lead to different next states or rewards. Instead of ignoring this randomness, we need a framework that can model it. This is why we use probability theory and move from concrete values to random variables. A realized state $s_t$ becomes a random variable $S_t$, a realized action $a_t$ becomes $A_t$, a realized reward $r_t$ becomes $R_t$, and a realized return $g_t$ becomes $G_t$.\nRandomness appears in two places.\n- Environment randomness: if $S_t = s_t$ and $A_t = a_t$, we do not always get the same next state $S_{t+1}$ or the same reward $R_t$. In a game, the same move can miss, slip, or trigger a random event.\n- Policy randomness: the policy itself can also be stochastic. Instead of always choosing one action, $\\pi(a \\mid s)$ can assign probabilities to actions. This is useful for exploration, for avoiding commitment to a bad action too early, and for describing strategies that intentionally mix actions.\nWith this probabilistic notation, the policy function $\\pi$ will be represented as a probability distribution over actions. 
The discounted return is also a random variable now:\n$$G_t = R_{t} + \\gamma R_{t+1} + \\gamma^2 R_{t+2} + \\ldots = \\sum_{k=0}^{\\infty} \\gamma^k R_{t+k}$$\nThe learning goal is to find a policy $\\pi$ that makes future return as large as possible. Since $G_t$ is random, this means maximizing its expectation: the expected discounted cumulative reward.\n$$\\mathbb{E}[G_t \\mid S_t = s_t, A_{t-1} = a_{t-1}, S_{t-1} = s_{t-1}, \\ldots, S_0 = s_0, A_0 = a_0]$$\nThis expression conditions on the entire history of the episode. In principle, the value of the current situation could depend on every state and action that happened before. That is difficult to work with, so in reinforcement learning we often assume that the process is a Markov Decision Process (MDP).\nThe Markov assumption says that the current state already contains all information needed for the next transition. Once we know the current state $s_t$ and action $a_t$, the next state does not depend on the older history.\n$$p(s_{t+1} \\mid s_t, a_t, s_{t-1}, a_{t-1}, \\ldots, s_0, a_0) = p(s_{t+1} \\mid s_t, a_t)$$\nThis lets us replace the full-history question with a state-based question: if I am in state $s$, what return should I expect from here? That is the object we will formalize as the value function in the next section.\n$$\\mathbb{E}[G_t \\mid S_t = s]$$\nBellman Equations\nIn the introduction, we defined the discounted return as a sum of future rewards. 
To derive Bellman equations, we now use the same return in a recursive form: the return from today is the immediate reward plus the discounted return from the next time step.\n$$G_t = R_t + \\gamma G_{t+1}$$\nThis recursive form is the starting point for the Bellman equations. Once we define the value function as the expected return, we will substitute this recursive expression for $G_t$ into that expectation.\nValue Function\nThe goal of reinforcement learning is to find an optimal policy $\\pi^*(s)$ that maximizes the expected discounted cumulative reward for every state $s$. The star in $\\pi^*$ means optimal.\nBefore we can find the best policy, we need a way to evaluate a policy. For a policy $\\pi$, we define $v_\\pi(s)$ as the expected return when the agent starts in state $s$ and follows $\\pi$. This quantity is called the value function.\n$$v_\\pi(s) = \\mathbb{E}_\\pi[G_t \\mid S_t = s]$$\nYou already use something like a value function intuitively. Imagine playing chess. There is no immediate reward after every move, but at some point you can look at the board and know that your position is probably lost. You might even resign before checkmate because you can predict the future from the current state. That prediction is the value of the state: how good this position is expected to be if you keep playing from here.\nBellman Expectation Equation\nThe definition of $v_\\pi(s)$ tells us what the value function means, but it is not yet very useful for computation. It asks for the expected return over the whole future. 
To make it practical, we substitute the recursive form of $G_t$ into this expectation, separating the immediate reward from the expected value of the continuation.\n$$\\begin{aligned} v_\\pi(s) \u0026amp;= \\mathbb{E}_\\pi[G_t \\mid S_t = s] \u0026amp;\u0026amp; {\\scriptsize\\text{(definition of } v_\\pi \\text{)}} \\\\[0.5em] \u0026amp;= \\mathbb{E}_\\pi[R_t + \\gamma G_{t+1} \\mid S_t = s] \u0026amp;\u0026amp; {\\scriptsize\\text{(recursive definition of } G_t \\text{)}} \\\\[0.5em] \u0026amp;= \\mathbb{E}_\\pi[R_t \\mid S_t = s] + \\gamma \\mathbb{E}_\\pi[G_{t+1} \\mid S_t = s] \u0026amp;\u0026amp; {\\scriptsize\\text{(linearity of expectation)}} \\\\[0.5em] \u0026amp;= \\mathbb{E}_\\pi[R_t \\mid S_t = s] + \\gamma \\sum_{s^\\prime} p(s^\\prime \\mid s)\\mathbb{E}_\\pi[G_{t+1} \\mid S_t = s,S_{t+1} = s^\\prime] \u0026amp;\u0026amp; {\\scriptsize\\text{(law of total expectation)}} \\\\[0.5em] \u0026amp;= \\mathbb{E}_\\pi[R_t \\mid S_t = s] + \\gamma \\sum_{s^\\prime} p(s^\\prime \\mid s)\\mathbb{E}_\\pi[G_{t+1} \\mid S_{t+1} = s^\\prime] \u0026amp;\u0026amp; {\\scriptsize\\text{(Markov assumption)}} \\\\[0.5em] \u0026amp;= \\mathbb{E}_\\pi[R_t \\mid S_t = s] + \\gamma \\sum_{s^\\prime} p(s^\\prime \\mid s) v_\\pi(s^\\prime) \u0026amp;\u0026amp; {\\scriptsize\\text{(definition of } v_\\pi \\text{)}} \\\\[0.5em] \u0026amp;= \\mathbb{E}_\\pi[R_t \\mid S_t = s] + \\gamma \\sum_{r, a, s^\\prime} p(r, a, s^\\prime \\mid s) v_\\pi(s^\\prime) \u0026amp;\u0026amp; {\\scriptsize\\text{(marginalization)}} \\\\[0.5em] \u0026amp;= \\sum_{r, a, s^\\prime} p(r, a, s^\\prime \\mid s) r + \\gamma \\sum_{r, a, s^\\prime} p(r, a, s^\\prime \\mid s) v_\\pi(s^\\prime) \u0026amp;\u0026amp; {\\scriptsize\\text{(definition of } \\mathbb{E}_\\pi \\text{)}} \\\\[0.5em] \u0026amp;= \\sum_{r, a, s^\\prime} p(r, a, s^\\prime \\mid s) (r + \\gamma v_\\pi(s^\\prime)) \u0026amp;\u0026amp; {\\scriptsize\\text{(combining the sums)}} \\\\[0.5em] \u0026amp;= \\sum_{a}p (a \\mid s) \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v_\\pi(s^\\prime)) \u0026amp;\u0026amp; {\\scriptsize\\text{(probability chain rule)}} \\\\[0.5em] \u0026amp;= \\sum_{a}\\pi (a \\mid s) \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v_\\pi(s^\\prime)) \u0026amp;\u0026amp; {\\scriptsize\\text{(} p(a \\mid s) = \\pi(a \\mid s) \\text{)}} \\end{aligned}$$\nSo the final form is:\n$$v_\\pi(s) = \\sum_{a}\\pi(a \\mid s) \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v_\\pi(s^\\prime))$$\nThis is the Bellman expectation equation. It says that the value of state $s$ under policy $\\pi$ is the expected immediate reward plus the discounted value of the next state. The expectation averages over both sources of randomness: the action chosen by the policy and the next state and reward produced by the environment.\nThe subscript in $v_\\pi(s)$ matters. 
If we change the policy, the action probabilities $\\pi(a \\mid s)$ change, and the value of the same state can change as well. This is why the Bellman expectation equation is used for policy evaluation: it tells us how good each state is for a fixed policy.\nAction-Value Function\nThe value function answers a state question: if the agent is in state $s$ and follows policy $\\pi$, what return should we expect? For policy improvement, we often need a more specific question: if the agent is in state $s$, takes action $a$ first, and then follows $\\pi$, what return should we expect?\nThis is the action-value function $q_\\pi(s, a)$:\n$$q_\\pi(s, a) = \\mathbb{E}_\\pi[G_t \\mid S_t = s, A_t = a] = \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a)(r + \\gamma v_\\pi(s^\\prime))$$\nBecause $q_\\pi(s, a)$ already conditions on the first action, we no longer average over actions inside its definition. If we want the value of the state again, we can average the action-values using the policy:\n$$v_\\pi(s) = \\sum_{a}\\pi(a \\mid s) q_\\pi(s, a)$$\nThe action-value function is useful because it lets us compare actions in the same state. This comparison is exactly what we need when we move from evaluating a fixed policy to improving it.\nBellman Optimality Equation\nThe Bellman expectation equation evaluates a fixed policy. But the final goal is not just to evaluate one policy. We want the best policy.\nIf we know the action-value of each possible action, then the best policy should choose an action with the highest action-value. This replaces the policy average with a maximum over actions. 
The result is the Bellman optimality equation:\n$$v^*(s) = \\max_{a} \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v^*(s^\\prime)) = \\max_{a} q^*(s, a)$$\nHere $q^*(s, a)$ is the action-value when the agent takes action $a$ first and behaves optimally afterward.\nUnlike $v_\\pi(s)$, the optimal value function $v^*(s)$ does not describe one fixed policy. It describes the best achievable expected return from each state. Once we know $v^*(s)$, we can recover an optimal policy by choosing an action with the highest $q^*(s, a)$ in each state.\nAt this point, we have equations that characterize the value functions we want. The Bellman expectation equation characterizes $v_\\pi(s)$ for a fixed policy, and the Bellman optimality equation characterizes $v^*(s)$ for the best policy. But an equation is not yet an algorithm. To find these value functions in practice, we need a procedure that starts with a rough guess, updates it repeatedly, and eventually converges to the right answer.\nThis is where contraction mapping enters the story.\nContraction Mapping\nContraction mapping is not specific to reinforcement learning. It is a general idea from metric spaces, where we study points, distances between points, and functions that move points around. The definition and theorem below show one important result: if a function always moves points closer together, then repeatedly applying that function converges to a unique fixed point.\nThis sounds abstract, but it will become useful because the Bellman equations are recursive. Later, we will treat the right-hand side of a Bellman equation as a function that updates a value function. 
Contraction mapping gives us the language to explain why applying that update repeatedly can converge, which is exactly what we need for Value Iteration and Policy Iteration.\nDefinition\nA contraction mapping is a function $T: X \\rightarrow X$ on a metric space $X$ with distance function $d$. It takes a point from $X$ and returns another point in the same space. It is called a contraction if applying $T$ always brings points closer together.\nMore formally, there must be a constant $0 \\leq \\kappa \u0026lt; 1$ such that for any two points $x$ and $y$:\n$$d(T(x), T(y)) \\leq \\kappa \\cdot d(x, y)$$\nIn other words, after applying $T$, the distance between two points is at most $\\kappa$ times the original distance.\nBanach Fixed-Point Theorem\nThe Banach Fixed-Point Theorem gives us the guarantee that makes contraction mappings useful. If $T$ is a contraction mapping on a non-empty complete metric space, then:\n$T$ has a unique fixed point $x^*$ such that $T(x^*) = x^*$.\nFor any initial point $x_0$ in the space, the sequence $x_0, T(x_0), T(T(x_0)), \\ldots$ converges to that fixed point $x^*$.\nAlgorithmically, this means: if an update rule is a contraction, then repeatedly applying it is guaranteed to converge to one solution.\nApplications in RL\nNow we can connect this abstract theorem back to reinforcement learning.\nThe Bellman equations are recursive: the value of the current state is written using the values of possible next states. For example, $v_\\pi(s)$ depends on $v_\\pi(s^\\prime)$. This means the same unknown function appears on both sides of the equation.\nThat recursive form is exactly what lets us turn a Bellman equation into an update rule. Instead of already knowing the true value function, we start with some current guess $v$. Then we plug that guess into the right-hand side of the Bellman equation and get an updated guess. 
This update rule is the operator $T$ from the fixed-point theorem.\nFor a fixed policy $\\pi$, the Bellman expectation operator is:\n$$T^\\pi v(s) = \\sum_{a} \\pi(a \\mid s) \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v(s^\\prime))$$\nFor the optimal value function, the Bellman optimality operator is:\n$$T^* v(s) = \\max_{a} \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v(s^\\prime))$$\nA key result, which we will use without proving here, is that both Bellman operators are contraction mappings when $0 \\leq \\gamma \u0026lt; 1$. So Banach\u0026rsquo;s theorem tells us that repeated Bellman updates converge to a unique fixed point.\nFor the Bellman expectation operator $T^\\pi$, the fixed point is $v_\\pi$, the value function for policy $\\pi$. For the Bellman optimality operator $T^*$, the fixed point is $v^*$, the optimal value function.\nThis is the bridge from equations to algorithms. The Bellman equations define the fixed points, and contraction mapping explains why repeated updates can find them. In the next section, we will turn these two update rules into Value Iteration and Policy Iteration.\nAlgorithms\nWe now have the ingredients for two common algorithms. The Bellman optimality operator gives us a way to update values toward $v^*(s)$ directly. The Bellman expectation operator gives us a way to evaluate a fixed policy $\\pi$. Both ideas can be used to find an optimal policy, but they organize the work differently.\nValue Iteration\nValue Iteration uses the Bellman optimality operator directly. 
The algorithm is:\nStart with some initial value function $v_0(s)$, which can be a rough guess.\nRepeatedly apply the Bellman optimality update:\n$$v_{k+1}(s) = \\max_{a} \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v_k(s^\\prime))$$\nBecause this update is a contraction when $0 \\leq \\gamma \u0026lt; 1$, repeated updates converge to the optimal value function $v^*(s)$.\nStop when the value function changes only slightly between two iterations, for example when:\n$$\\max_s |v_{k+1}(s) - v_k(s)| \\leq \\epsilon$$\nOnce the values have converged, extract a policy greedily. First compute the optimal action-value:\n$$q^*(s, a) = \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a)(r + \\gamma v^*(s^\\prime))$$\nThen choose an action with the highest action-value in each state:\n$$\\pi^*(s) = \\argmax_{a} q^*(s, a)$$\nThe important detail is that Value Iteration does not maintain or evaluate a separate policy during the updates. The intermediate function $v_k(s)$ is only a working value estimate. A greedy policy can be extracted from it at any time, but the clean guarantee comes after the values converge to $v^*(s)$.\nPolicy Iteration\nPolicy Iteration takes a different route. Instead of updating values directly toward $v^*(s)$, it keeps an explicit policy and improves it step by step. Each iteration has two parts:\nPolicy evaluation: compute $v_\\pi(s)$ for the current policy $\\pi$.\nPolicy improvement: update the policy so it chooses better actions according to the current value estimates. 
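For comparison with the Policy Iteration route, here is a minimal Python sketch of the Value Iteration steps described earlier. It is only an illustration under stated assumptions: the four-cell corridor, the 20% slip probability, the names `build_model`, `q_value`, `value_iteration`, and `greedy_policy`, and the constants `GAMMA` and `EPSILON` are all hypothetical and do not come from the course code.

```python
# Hypothetical toy MDP: cells 0..3 in a row, cell 3 is the terminal goal.
# model[s][a] lists (probability, reward, next_state) outcomes, which is a
# tabular stand-in for the dynamics p(r, s' | s, a).
GAMMA = 0.9       # assumed discount factor
EPSILON = 1e-8    # assumed stopping threshold
TERMINAL = 3
ACTIONS = ('left', 'right')


def build_model():
    model = {}
    for s in range(TERMINAL):
        model[s] = {
            # Moving left is deterministic and costs -1.
            'left': [(1.0, -1.0, max(s - 1, 0))],
            # Moving right slips and stays in place 20% of the time.
            'right': [(0.8, -1.0, s + 1), (0.2, -1.0, s)],
        }
    return model


def q_value(model, v, s, a):
    # Sum over outcomes of p(r, s' | s, a) * (r + gamma * v(s')).
    return sum(p * (r + GAMMA * v[s2]) for p, r, s2 in model[s][a])


def value_iteration(model):
    # v_0 is all zeros; the terminal state keeps value 0 throughout.
    v = {s: 0.0 for s in range(TERMINAL + 1)}
    while True:
        new_v = dict(v)
        for s in model:
            # Bellman optimality update for one state.
            new_v[s] = max(q_value(model, v, s, a) for a in ACTIONS)
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta <= EPSILON:
            return v


def greedy_policy(model, v):
    # Pick an action with the highest action-value in each state.
    return {s: max(ACTIONS, key=lambda a: q_value(model, v, s, a))
            for s in model}


model = build_model()
v_star = value_iteration(model)
policy = greedy_policy(model, v_star)
print(v_star)
print(policy)
```

The sketch keeps a full copy of the previous values during each sweep, so every update uses $v_k$ on the right-hand side, matching the update written above. Updating values in place is a common variant, but it is a different schedule.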
Policy Evaluation\nThe evaluation step uses the Bellman expectation update:\n$$v_\\pi(s) = \\sum_{a} \\pi(a \\mid s) \\sum_{r, s^\\prime} p(r, s^\\prime \\mid s, a) (r + \\gamma v_{\\pi}(s^\\prime))$$\nAfter evaluation, we know how good the current policy is. The next question is how to improve it.\nPolicy Improvement Theorem\nFirst, we need to define what it means for one policy to be better than another. We say that $\\pi^\\prime$ is better than or equal to $\\pi$ if it has value at least as high in every state:\n$$v_{\\pi^\\prime}(s) \\geq v_\\pi(s) \\quad \\text{for every state } s$$\nThis means that from any starting state, $\\pi^\\prime$ gives us an expected discounted cumulative reward that is no worse than $\\pi$.\nThe Policy Improvement Theorem says that if we act greedily with respect to $q_\\pi(s, a)$, the new policy satisfies this condition. In other words, after evaluating $\\pi$, we can improve it by choosing the action that looks best under $q_\\pi$:\n$$\\pi^\\prime(s) = \\argmax_{a} q_\\pi(s, a)$$\nProof Sketch\nWhy is this greedy update guaranteed to improve the policy?\nUnder the old policy $\\pi$, the value of state $s$ is an average of the action-values $q_\\pi(s, a)$. The greedy policy $\\pi^\\prime$ chooses the action with the largest action-value, so:\n$$v_\\pi(s) = \\sum_a \\pi(a \\mid s) q_\\pi(s, a) \\leq \\max_a q_\\pi(s, a)$$\nBecause $\\pi^\\prime$ is greedy, this means:\n$$v_\\pi(s) \\leq q_\\pi(s, \\pi^\\prime(s))$$\nFor readability, shift the time index to start at zero. 
So the condition $S_t = s$ from the value-function definition becomes $S_0 = s$, and we write the realized starting state as $s_0 = s$ inside the rollout. The inequality we will unroll is:\n$$v_\\pi(s_0) \\leq q_\\pi(s_0, \\pi^\\prime(s_0))$$\nNow expand the right-hand side using the definition of $q_\\pi$:\n$$q_\\pi(s_0, \\pi^\\prime(s_0)) = \\sum_{r_0, s_1} p(r_0, s_1 \\mid s_0, \\pi^\\prime(s_0))(r_0 + \\gamma v_\\pi(s_1))$$\nSo:\n$$v_\\pi(s_0) \\leq \\sum_{r_0, s_1} p(r_0, s_1 \\mid s_0, \\pi^\\prime(s_0))(r_0 + \\gamma v_\\pi(s_1))$$\nThe continuation term is still $v_\\pi$, not $v_{\\pi^\\prime}$, because $q_\\pi(s, a)$ means: take action $a$ first, then follow the old policy $\\pi$. To unroll this expression once more, first write the same greedy bound for the possible next state $s_1$:\n$$v_\\pi(s_1) = \\sum_a \\pi(a \\mid s_1) q_\\pi(s_1, a) \\leq q_\\pi(s_1, \\pi^\\prime(s_1))$$\nThen expand that $q_\\pi$ term as well:\n$$q_\\pi(s_1, \\pi^\\prime(s_1)) = \\sum_{r_1, s_2} p(r_1, s_2 \\mid s_1, \\pi^\\prime(s_1))(r_1 + \\gamma v_\\pi(s_2))$$\nNow we can replace $v_\\pi(s_1)$ by this upper bound inside the previous sum. 
This is not substitution by equality: it is valid here because $\\gamma \\geq 0$, so replacing $v_\\pi(s_1)$ with a larger quantity can only make the right-hand side larger.\n$$\\begin{aligned} v_\\pi(s_0) \u0026amp;\\leq \\sum_{r_0, s_1} p(r_0, s_1 \\mid s_0, \\pi^\\prime(s_0)) \\left(r_0 + \\gamma \\sum_{r_1, s_2} p(r_1, s_2 \\mid s_1, \\pi^\\prime(s_1))(r_1 + \\gamma v_\\pi(s_2))\\right) \\\\ \u0026amp;= \\sum_{r_0, s_1, r_1, s_2} p(r_0, s_1 \\mid s_0, \\pi^\\prime(s_0))p(r_1, s_2 \\mid s_1, \\pi^\\prime(s_1)) (r_0 + \\gamma r_1 + \\gamma^2 v_\\pi(s_2)) \\\\ \u0026amp;= \\sum_{r_0, s_1, r_1, s_2} p(r_0, s_1, r_1, s_2 \\mid s_0, \\pi^\\prime(s_0), \\pi^\\prime(s_1)) (r_0 + \\gamma r_1 + \\gamma^2 v_\\pi(s_2)) \\end{aligned}$$\nIn the second line, we use marginalization to combine terms into one sum: the $r_0$ term can also be summed over $r_1, s_2$ because $\\sum_{r_1, s_2} p(r_1, s_2 \\mid s_1, \\pi^\\prime(s_1)) = 1$.\nIn the last line, $p(r_0, s_1, r_1, s_2 \\mid s_0, \\pi^\\prime(s_0), \\pi^\\prime(s_1))$ is the joint distribution of the two-step rollout when the first two actions are chosen by $\\pi^\\prime$. This does not assume ordinary independence between the two transitions. 
The probability chain rule gives a product of conditional probabilities:\n$$p(r_0, s_1, r_1, s_2 \\mid s_0, \\pi^\\prime(s_0), \\pi^\\prime(s_1)) = p(r_0, s_1 \\mid s_0, \\pi^\\prime(s_0)) p(r_1, s_2 \\mid s_0, r_0, s_1, \\pi^\\prime(s_0), \\pi^\\prime(s_1))$$\nThen the Markov property lets us drop the earlier history from the second factor:\n$$p(r_1, s_2 \\mid s_0, r_0, s_1, \\pi^\\prime(s_0), \\pi^\\prime(s_1)) = p(r_1, s_2 \\mid s_1, \\pi^\\prime(s_1))$$\nThe sum with the joint rollout probability is therefore expectation notation written out. Once we know $S_0 = s$ and the policy being followed, the Markov property gives the rollout distribution, so the two-step bound becomes:\n$$v_\\pi(s) \\leq \\mathbb{E}\\left[R_0 + \\gamma R_1 + \\gamma^2 v_\\pi(S_2) \\mid S_0 = s,\\ \\text{follow } \\pi^\\prime \\text{ for two steps, then } \\pi\\right]$$\nIf we repeat the same substitution $n$ times, we get:\n$$v_\\pi(s) \\leq \\mathbb{E}\\left[\\sum_{k=0}^{n-1} \\gamma^k R_k + \\gamma^n v_\\pi(S_n) \\mid S_0 = s,\\ \\text{follow } \\pi^\\prime \\text{ for } n \\text{ steps, then } \\pi\\right]$$\nThe old policy $\\pi$ appears only in the tail term $v_\\pi(S_n)$, which says what happens after the first $n$ greedy steps. 
As $n \\rightarrow \\infty$, the discounted tail $\\gamma^n v_\\pi(S_n)$ goes to zero when rewards are bounded and $0 \\leq \\gamma \u0026lt; 1$, so what remains is the expected return from following the new policy $\\pi^\\prime$ from state $s$:\n$$\\mathbb{E}_{\\pi^\\prime}\\left[\\sum_{k=0}^{\\infty} \\gamma^k R_k \\mid S_0 = s\\right] = \\mathbb{E}_{\\pi^\\prime}[G_0 \\mid S_0 = s] = v_{\\pi^\\prime}(s)$$\nTherefore:\n$$v_\\pi(s) \\leq v_{\\pi^\\prime}(s)$$\nAlgorithm\nAfter the proof, the actual algorithm is much simpler: it is just the thing software engineers like most, a loop. Putting the pieces together, Policy Iteration works as follows:\nStart with any policy $\\pi$.\nEvaluate the policy by computing $v_\\pi(s)$.\nImprove the policy by acting greedily with respect to $q_\\pi(s, a)$:\n$$\\pi^\\prime(s) = \\argmax_{a} q_\\pi(s, a)$$\nIf the policy no longer changes, stop. Otherwise, set $\\pi \\leftarrow \\pi^\\prime$ and repeat.\nPolicy Iteration is therefore an explicit loop between evaluation and improvement. Evaluation asks how good the current policy is. Improvement asks whether we can choose better actions using what we just learned.\nGeneralized Policy Iteration\nValue Iteration and Policy Iteration look different, but they share the same structure. Both combine two ideas:\nPolicy evaluation: estimate how good the current policy is, usually by estimating $v_\\pi(s)$ or $q_\\pi(s, a)$.\nPolicy improvement: change the policy so it acts greedily, or more greedily, with respect to the current value estimates.\nThis shared structure is called Generalized Policy Iteration (GPI).\nThe important word is generalized. GPI describes the interaction between evaluation and improvement, but it does not prescribe the schedule. 
It does not say how long we must evaluate a policy, whether we should update all states or only some states, or in what order states should be updated.\nPolicy Iteration keeps an explicit policy. We evaluate that policy for some amount of time, improve it greedily, and repeat. Classical Policy Iteration evaluates the policy all the way to convergence before improving it. Modified Policy Iteration evaluates it only for a few sweeps.\nValue Iteration does not keep an explicit policy during the value updates. It repeatedly applies the recursive optimality update for a long time, and once the values are good enough, it extracts a greedy policy. In that sense, the greedy improvement step is built into the value update, and the explicit policy extraction can happen at the end.\nThis is why GPI is a useful umbrella concept. The important part is not one specific schedule, but the interaction between estimating values and improving the policy.\nFrozen Lake Example\nLet\u0026rsquo;s ground the algorithms in a small grid-world game. Each cell is a state $s$, and from each non-terminal state the agent can choose one of four actions: up, down, left, or right.\nThe goal is to reach the gift while avoiding holes. Every step costs $-1$, and falling into a hole ends the episode. Since each move is penalized equally, the agent maximizes its total reward by reaching the gift as fast as possible - the shortest path is the optimal policy. This is also a clean example of how a classic shortest-path problem can be expressed in RL terms.\nBecause this is a model-based setting, we know the environment dynamics: for each state-action pair, we know the probability distribution over next states and rewards - $p(r, s^\\prime \\mid s, a)$. We don\u0026rsquo;t know exactly where the agent will land, only how likely each outcome is. 
That is exactly the information needed to compute the Bellman equations we derived earlier, estimate $v(s)$ and $q(s, a)$, and improve the policy.\nOptimal Policy\nYou can see an optimal policy in the image below:\nExercise\nWant to test your understanding? I prepared an exercise for this lesson in my Reinforcement Learning Course - there\u0026rsquo;s a task to implement yourself along with an example solution.\nFinal thoughts\nIn this article we covered how a model-based agent can learn an optimal policy using Policy Iteration and Value Iteration - both grounded in the Bellman equations. The key assumption was that we have full access to the environment dynamics $p(r, s^\\prime \\mid s, a)$, which made it possible to solve for $v(s)$ and $q(s, a)$ directly.\nIn the real world, that assumption rarely holds. In the next article we will move to model-free environments, where the agent has no access to transition probabilities and must instead learn purely from experience. We will look at Q-learning and SARSA - two foundational algorithms that make this possible.\n","permalink":"https://mateuszpieniak.com/courses/reinforcement-learning/101-policy-iteration-value-iteration/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eReinforcement learning is about learning from consequences. Unlike supervised learning, nobody tells the agent the correct action for every situation. The agent tries actions, receives rewards, and slowly discovers which behavior leads to better long-term outcomes.\u003c/p\u003e\n\u003cp\u003eImagine playing a computer game. You observe the screen, make a decision, and the game responds. Sometimes you get points immediately, sometimes nothing happens, and sometimes the real consequence appears much later. The same idea can be described in a more abstract mathematical language. 
In such a formal language, you are the agent and the game is the environment.\u003c/p\u003e","title":"Reinforcement Learning 101: Policy Iteration \u0026 Value Iteration"}]