RL Policy Gradient (PG) methods are model-free methods that try to maximize the RL objective directly without requiring a value function. Although a value function can be used as a baseline for variance reduction, or in order to evaluate current and successor state's value (Actor-Critic), it is not required for the purpose of action selection. The RL objective or the performance measure $J(\theta)$ is defined as the sum of rewards from the initial state to a terminal state for the episodic task, and the average return for the continuing task. We are trying to fit an ANN parametrized by $\theta$ to learn a stochastic policy $\pi_{\theta}(a|s_t)$, a probability distribution mapping from state to actions, that will maximize $J(\theta)$.

In this blog post I will start with a review of PG methods for model-free RL, followed by a recap of some relevant RL concepts. I will then discuss the different types of PG algorithms (REINFORCE with/without a baseline, Actor-Critic) and some variance reduction methods ("reward to go" and advantage). Finally, I will summarize the advantages of PG methods over the state and state-action value methods. If you are not familiar with the concept of the cross-entropy loss for a supervised learning classification task or just need a refresh, I suggest reading my previous blog post cross entropy loss derivative before continuing.

Relevant RL concepts

  • Policy: a mapping from states to action probabilities. Following a policy $\pi$ means that at each timestep $t$ when beeing at state $s_t$ we select an action $a_t$ with probability $\pi(a_t|s_t)$. The notation $\pi^*$ represents the optimal policy for a task, which is the policy that yields the highest expected reward.
  • Experience: a record of all the interactions the agent had with the environment. RL methods try to improve a policy based on the agent's experience.
  • Value function: a function that estimates the expected future reward.
  • State-Value ($V$) function: $\hat{v}_\pi(s)$ is a value function that estimates $v_\pi(s)$, the expected reward starting at a state $s$ and following policy $\pi$ onwards. The notation $v_{\pi^*}(s)$ represents the true state-value for the optimal policy $\pi^*$.
  • State-Action Value ($Q$) function: $\hat{q}_\pi(s, a)$ is a value function that estimates $q_\pi(s, a)$, the expected reward of applying action $a$ at state $s$ and following policy $\pi$ onwards. The notation $q_{\pi^*}(s, a)$ represents the true state-action value for the optimal policy $\pi^*$.
  • State-Value and State-Action value methods: methods that try to estimate the State and State-Action Value functions respectively.
  • Greedy policy: a policy that fully exploits the current state/state-action value estimation. This means it always selects the action with the maximal estimated value and never explores new actions. If there are multiple actions that have the same maximum value, it chooses an action arbitrarily.
  • $\epsilon$-greedy policy: a policy that tries to balance between exploration and exploitation. It chooses a greedy action with probability $1-\epsilon$ and a random action with probability $\epsilon$. It is common to start with a large $\epsilon$ value and decrease it during the learning process.
  • Episodic Task: an RL task with terminal states, for which the experience is separated into independent sequences (episodes) of state-reward pairs. All episodes will eventually end in a terminal state regardless of the actions selected.
  • Continuing Task: an RL infinite horizon task that continues indefinitely with no terminal states.
  • Trajectory $\tau$ - a sequence of continuous timestep state-reward pairs from a single episode. In episodic tasks, this term sometimes describes the full path from the initial state to a terminal state. In other cases, the explicit term Full/Complete trajectory is used.
  • Discount rate: receiving rewards sooner is better than later. A concept that is essential for an infinite horizon (continuing) task because an infinite sum of finite rewards can be infinite, but can also be applied to episodic tasks. A discount rate $0 \le \gamma \le 1$ is a parameter that is used to calculate the present value of future rewards. A reward that is received $k$ steps in the future is worth $\gamma^{k-1}$ at the current time step. On the extremes, when $\gamma=1$ there is no discounting, and $\gamma=0$ means the agent only cares about immediate rewards.
  • Bootstrapping: a method that updates an estimate for a given value function at a given state based on value estimates of successor states. Dynamic Programming (DP), Temporal-Difference (TD) and Actor-Critic methods use Bootstrapping.
  • Environment Dynamics - In RL we assume that the environment has a transition distribution $p(s_{t+1}|s_t, a_t)$ (a.k.a transition function or transition dynamics) that defines the probability to arrive at state $s_{t+1}$ at time $t+1$ after being at state $s_t$ at time $t$ and applying action $a_t$. We also assume that the environment also has a reward distribution $R(r_{t+1}|s_t, a_t)$ that defines the probability of receiving reward $r_{t+1}$ when applying action $a_t$ at state $s_t$. These two distributions are usually unknown to us, but we are able to sample from them by interacting with (sampling from) the environment.
  • Sampling Methods - often it is easy to generate experience sampled from the desired probability distribution but it is infeasible to obtain the explicit form of that distribution. Sampling methods do not require a perfect probabilistic model of the environment, they interact with it and create samples of temporal state transitions and rewards.
  • Dynamic Programming (DP) a collection of algorithms that computes the optimal policy given a perfect probabilistic model of the environment (state transition, reward probabilities, and termination) as a Markov decision process (MDP). DP methods use bootstrapping but don't use sampling.
  • Monte Carlo (MC): a sampling method that is defined for episodic tasks. It updates the state/state-action value estimates only when an episode is complete (an exception is the use eligibility traces, for which MC methods can also be applied to continuing tasks and values updates happen online). MC methods are not biased because they actually sample a complete trajectory instead of using a biased value estimate of successor states (they don't bootstrap). Because of the nature of sampling multiple random variables and only updating at the end of the episode, MC methods usually have higher variance compared to methods that update on a step-to-step or n-step basis (e.g. TD methods).
  • Temporal Difference (TD) learning: combine both MC and DP methods and they are suitable for online step by step learning. Like MC methods they learn by sampling experience directly from the environment and do not require the full model of the environment dynamics, and like DP methods they use successor state value estimates (bootstrap) in order to estimate the current state’s state/state-action value estimate. TD methods can be calculated for more than a one-step (n-step target) and also be united with Monte Carlo methods (TD($\lambda$)).
  • Tabular methods: refers to the use of state-value and State-Action value methods when the state and action spaces are small enough to fit in a table. Tabular methods have no generalization, each state/state-action has its unique value in the table.
  • Function Approximation for action/action-value methods: methods that can be applied to tasks with arbitrarily large state spaces, for which there is not enough memory for state/state-action tables, or that these tables cannot be filled optimally even in the limit of infinite amounts of time and data. For example, an autonomous car camera will never capture the exact same image but will usually encounter images of scenarios it can generalize (e.g. "a single pedestrian walking on the zebra crossing"). For these tasks, an optimal solution cannot be obtained and generalization and finding a good approximated solution with limited compute resources is the goal. Function approximation in RL is related to Supervised Learning, but it also deals with some unique issues such as nonstationarity, bootstrapping, and delayed targets. Although in theory all supervised learning methods can be applied to RL tasks, in practice we will usually use a linear function approximator or artificial neural network (ANN).
  • Markov Property: a Markov property of a stochastic process is a property of the conditional probability distribution of arrival at future states only depends on the current state and not the past. Meaning that given the current state, the history/path of arriving to it is a redundant information that does not affect the transition distribution: $P(s_t | s_0, s_1,...s_{t-1})=P(s_t | s_{t-1})$. A Markov chain satisfies the Markov property.
  • Stationary Distributions: the definition of Stationary Distributions comes from Markov Chains and it refers to a probability distribution over the state space of being in a specific state after enough timestamps since starting at the initial state. Stationary Distributions are defined for Ergodic and Absorbing Markov chains.
  • Ergodic Markov Chains: contain only recurrent states (a state you are guaranteed to return too) and unique stationary distribution. In the context of RL, they represent a continuing task for which all states will be revisited with non-zero probability and a reward distribution for each state that won't vary with time.
  • Absorbing Markov Chains: have absorbing states which are states that are impossible to leave. Any path from any given state will eventually reach an absorbing state with non-zero probability. This is the reason that for these types of Markov chains the stationary distributions are non-zero only for absorbing states. In the RL context, episodic task's terminal states can be viewed as absorbing states with infinite zero rewards regardless of the actions chosen.
  • Nonstationarity in the context of RL: RL mainly deals with nonstationary target functions (target functions that change over time) either because the task in itself is nonstationary (e.g. reward probabilities that vary with time), or it is effectively nonstationary (e.g. target policies that change during training, or using bootstrapping with successor state reward estimates that haven't converged yet). This nonstationary nature of targets in reinforcement methods requires function approximation methods that can handle it.

Policy Gradient methods VS Supervised Learning

PG methods are similar to DL methods for supervised learning problems in the sense that they both try to fit a neural network to approximate some function by learning an approximation of its gradient using a Stochastic Gradient Descent (SGD) method and then using this gradient to update the network parameters. There are also some key differences in PG methods:

  • There are no static targets labeled by an expert to be used in training. The reward signal is used as a way of measuring the "goodness" of the performance.
  • They are harder to fine-tune and take longer to converge compared to supervised learning problems such as image classification tasks.
  • Unlike supervised learning problems in which we are trying to minimize a loss function, here our goal is to maximize the performance so we will update the parameters $\theta$ with the direction of the gradient (gradient ascent).
Policy Network
Figure 1: A comparison between an Image Classification task and a Policy Gradient network used to play Atari games. In the classification example, we are given an image and a true label "dog" for the object in the image. Our task is to predict a probability distribution over $k=3$ classes and the objective is to minimize a loss function or maximize the log-likelihood of the "dog" class. In the Policy Gradient example, we are given an image or a sequence of images from sequential timeframes (in order to understand the speed/direction of objects in the game) and we are trying to predict a discrete probability of over $k=3$ possible actions ("left", "right", "down"). Our objective is to maximize the reward, in this case, the game's score, but unfortunately there is no labeled target or an expert that tells us how to play. We need to monitor the reward signal and attribute good rewards to our actions. In addition, the reward signal can be very sparse and can happen many timesteps after the action was applied (e.g. a Chess game) which makes the training process even more challenging.

Introduction to Policy Gradient

Training procedure

In RL we assume that the environment dynamics are an unknown probability function $p(s_{t+1}|s_t, a_t)$ that defines the probability to arrive at state $s_{t+1}$ at time $t+1$ after being at state $s_t$ at time $t$ and applying action $a_t$. Because PG is a model-free RL method, there isn't a model for the environment dynamics at hand, and we have to sample the environment. Meaning, at each timestep $t$, sample an action $a_t$ from the policy, apply it, and record the reward $r_t$ and the new state $s_{t+1}$.

Like in supervised learning, we first initialize the policy network parameters $\theta$ arbitrarily and then train the policy network iteratively. Unlike with supervised learning, there aren't fix target labels available. Instead, we derive an estimation of the gradient with respect to the actions we sampled from policy, multiply it by a function of the observed rewards and update the network parameters. We will first consider the REINFORCE algorithm (vanilla PG) which is a Monte Carlo method, meaning we only update the policy parameters at the end of the episode. Later we will introduce the Actor-Critic algorithm, which is a TD method that bootstraps using a value function.

Objective function

RL Policy Gradient (PG) methods are model-free methods that try to maximize the RL objective directly without requiring a value function. Although value functions can be used as a baseline for variance reduction, or in order to evaluate current and successor state’s value (Actor-Critic), it is not required for the purpose of action selection. The RL objective or the performance measure $J(\theta)$ is defined as the sum of rewards from the initial state to a terminal state for the episodic task, and the average return for the continuing task when following policy $\pi_{\theta}$.

Episodic task

For the episodic task this objective is:

$J(\theta) \doteq v_{\pi_{\theta}}(s_{0}) = E_{\tau \sim \pi_{\theta}(\tau)}[\gamma^{t} r(s_t,a_t)]$

Where this value function $v_{\pi_{\theta}}(s_{0})$ is the value of the expected discounted sum of rewards for a trajectory starting at state $s_0$ and following policy $\pi_{\theta}$ until the episode terminates. This objective can be unbiasly evaluated by sampling N trajectories from the environment using policy $\pi_{\theta}$:

$J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_{i}-1} \gamma^{t} r(s_{i,t}, a_{i,t})$

$T_{i}$ is the timestep in which trajectory $\tau_{i}$ terminates. I’ll discard the discounting factor $\gamma$ from now on for simplicity but it usually is added to the reward function, especially on continuing tasks.

Continuing task

For the continuing task, the objective function is:

$J(\theta) = \lim_{h\to\infty} \frac{1}{h} \sum_{t=0}^{h}E[r(s_t, a_t) | A_{0:t} \sim \pi_{\theta}] = \lim_{t\to\infty} E[r(s_t, a_t) | A_{0:t} \sim \pi_{\theta}]$

This objective function is proportional to the episodic objective by a constant that is equal to the average episode length. In practice, this constant multiplier can be discarded and we can use the same gradient term in both cases because the gradient estimate is also multiplied by the learning rate $\alpha$ that can be chosen arbitrarily.

Policy $\pi_{\theta}(a|s)$

We are trying to fit an ANN parametrized by $\theta$ to learn a stochastic policy $\pi_{\theta}(a|s)$, a probability distribution mapping from state to actions, that will maximize $J(\theta)$. Figure 2 shows an example of such a policy network.

Policy Network
Figure 2: An example of a basic policy network $\pi$ parameterized by $\theta$ ($\theta$ represents the weights of the network). This network gets a state $s\in R^3$ as input and maps it to a probability distribution over actions $a\in R^5$.

The probability distribution $\pi_{\theta}(a|s)$ can be defined:

  • Over discrete action space, in which case the distribution is usually categorical with a softmax over the action logits.
  • Over continuous action space, in which case the output is the parameters of a continuous distribution (e.g. the mean and variance of a gaussian).
Policy Network
Figure 3: An example of a policy network for Autonomous Driving. If the policy $\pi_{\theta}(a|s)$ is discrete, a possible set of actions could be ("left", "right"). For a continuous action space, the network output is a continuous value like the degree in which to turn the steering wheel. (This image was taken from UC Berkeley's CS285 lecture slides)


The gradient with respect to $\theta$ according to the Policy Gradient Theorem can be approximated over N trajectories as:

$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left( \sum_{t=0}^{T_{i}-1} G_{i,t} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t}) \right)$

where $a_{i,t}$ is the action taken at time $t$ of episode $i$ at state $s_{i,t}$, $T_{i}$ is the timestep in which trajectory $\tau_{i}$ terminates and $G_{i,t}$ is a function of the reward assigned to this action. In vanilla PG settings (REINFORCE algorithm) $G_{i,t}$ is just trajectory $i$'s sum of rewards:

$G_{i,t} = \sum_{t'=0}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})$

Later, when we discuss variance reduction methods like "reward to go" and Advantage, we'll use a more sophisticated function for $G_{i,t}$.

Policy Network
Figure 4: An example of a reward function calculation on a given backup diagram of a full trajectory $\tau_0$ of an episode of length $3$. In a backup diagram, an open circle represents a state, a solid black circle represents an action, the observed reward is written above the edge that connects an action to the next state and the solid grey square is a terminal state. Starting at the initial state $s_0$ we apply action $a_1$, observe a reward $r_1=1$ and arrive at state $s_1$. This process of applying an action, observing a reward and arriving at a new state continues until we arrive at a terminal state $s_3$ and the episode ends. In the vanilla PG settings the reward function has the same value for all $3$ non-terminal states: $G_{0,0} = G_{0,1} = G_{0,2} = \sum_{t'=0}^{T_{i}-1} r(s_{0,t'}, a_{0, t'}) = r_1 + r_2 + r_3 = 1 + 3 + 5 = 9$

PG gradient VS Supervised Learning classification gradient

In order to gain some intuition regarding this gradient expression, let’s consider the discrete action case in which each episode has a length of $1$, and compare it to a supervised learning classification task. We assume both networks are using a softmax function to normalize the logits to a probability distribution. If you are not familiar with the Cross-Entropy Loss, I recommend reading my previous blog post cross entropy loss derivative before continuing.

In a supervised learning classification task, for each training example $X_i$ we are given a true label $y_i$. The network output $p_{\theta}(y_i|X_i)$ is a distribution over all possible labels just like $\pi_{\theta}(a_{t}|s_{t})$ is a distribution over all possible actions. If we consider the contribution of a single training example $X_i$ (e.g. a single image in a classification task) to the gradient, the objecive is to maximize the log-likelihood $log p_{\theta}(y_i|X_i)$ which is like trying to minimize the cross entropy loss $-log p_{\theta}(y_i|X_i)$. The gradient of this minimization problem is $\nabla_{\theta} -log p_{\theta}(y_i|X_i)$ and the parameters update is:

$\theta_{i+1} = \theta_{i} - \alpha \nabla_{\theta} -log p_{\theta}(y_i|X_i)$

where $\alpha$ is the learning rate and it has a $(-)$ prefix because this is a minimization problem (therefore we are running gradient descent).

If we look at the PG gradient, a single episode $i$'s (of lenght $1$) contribution to the gradient term is $G_{i} \nabla_{\theta} log \pi_{\theta}(a|s_{0})$ and the update to the network parameters $\theta$ can be seen as:

$\theta_{i+1} = \theta_{i} + \alpha G_{i} \nabla_{\theta} log \pi_{\theta}(a|s_{0})$

Here the $\alpha$ sign is $(+)$ because this is a maximization problem and we are updating with the gradient direction (gradient accent). Equivalently, we can rewrite this as a minimization problem by adding a $(-)$ to the gradient term:

$\theta_{i+1} = \theta_{i} - \alpha G_{i} \nabla_{\theta} -log \pi_{\theta}(a|s_{0})$

which is exactly the classification gradient when $G_{i}$ has a constant value of $1$. In the case of non-negative rewards, we can view the PG gradient as a weighted Cross-Entropy Loss gradient for which the weights are determined by the reward. And in the general case with possibly negative rewards, the reward function $G_{i}$ determines the sign and magnitude of the gradient. This notion will help us in understanding the PG gradient flow and the concept of Advantage we'll discuss shortly.

Wait a minute... Are we treating an action sampled from our policy as the true/optimal action?

Yes and no. Yes, we are using this sampled action as the true label, our target will be a 1-hot encoded vector in which the $1$ is at the index of the sampled action. And no, we don't assume it is the optimal action and we trust/hope that the reward signal $G_{i, t}$ which multiplies this gradient, to "guide" it by both sign and magnitude.

Personally, I found this fact about PG algorithms very hard to digest. Let's try to build some intuition.

Classification task gradient flow

In a regular supervised learning classification task, we are given a true target label that determines the gradient for each sample. Let's zoom-in only on the last few layers of the network, from the logits layer onward. Assume we have a classification task with $k$ classes and $q\in R^k$ is the softmax layer's output which implies that $q$ is a probability distribution over $k$ classes and therefore satisfies:

\begin{align*} & 0 \leq q_j \leq 1 \hspace{10pt} \forall j \in \{1,..k\} \\ & and \hspace{10pt} \sum_{j \in \{1,..k\}} q_j=1 \end{align*}

For a single input example $X_i$ with true label $y_i$ (an integer $1 \leq y_i \leq k$) the gradients that will backpropagate from the softmax are the gradients of the loss with respect to the logits vector $z \in R^k$:

$$ \nabla_{z}Loss =\begin{cases} q_{y_i} - 1 \hspace{10pt} if \hspace{10pt} j=y_i\\ q_j \hspace{10pt} \forall j\neq y_i\\ \end{cases} $$

where each $q_j$ represent the model approximation of the probability $p(y=j|X_i)$. Note that the gradient for the true label's logit is non-positive (as expected) and decreases proportionally in magnitude as $q_{y_i}$ increases. The rest of the logits gradients ($q_j \hspace{5pt} \forall j\neq y_i$) will be non-negative and increase proportionally as $q_j$ increases. In the specific case of perfect classification where $q_{y_i}=1$, the gradient will be $\vec{0}$ and thus none of the network's parameters will be modified. Figure 5 shows the forward and backward passes of backpropagation of a single training example when $k=3$.

Policy Network
Figure 5: This example shows the end layers of a basic classifier for a classification task with $k=3$ when classifying a single sample with label $y=1$. The top image demonstrates the forward pass of the backpropagation in which the softmax outputs a probability distribution $(q_1, q_2, q_3)$ over the 3 classes. The bottom image demonstrates the backward pass of backpropagation in which the gradients out of the softmax are shown. Note that this gradient is different for the true label ($y=1$) than the rest of the labels.

The cross-entropy loss function for the classification task with a single training example is $-log(q_{y_i})$. Increasing the probability of the true label's class $q_{y_i}$ will decrease the loss, and increasing the probability of each of the incorrect classes will increase the loss. Because gradients are directed towards the maximal value increase of their function, when running gradient descent we will update the network parameters in the counter direction to the gradient in order to minimize the loss. As a result, the network will try to move all the probability mass towards the correct class, which will reduce the current training loss, and (hopefully) generalize and improve the classification of new unseen inputs.

Discrete action gradient flow

The applied actions in PG methods are not the true/optimal actions to be taken. We just used actions sampled from the policy and hoped for the best. We saw earlier that in the case of discrete action space, the PG gradient can be viewed as a weighted cross-entropy loss gradient for which the weights are determined by the reward magnitude, except that the reward sign can flip the direction of the gradient which can be interpreted as allowing negative weights.

To make this explanation simple, I'll make a number of assumptions:

  • There are only 2 actions, a good action $a_+$ and a bad action $a_-$.
  • Our reward function is constant and intuitive, a positive reward of $+1$ for applying $a_+$ and a negative reward of $-1$ when applying $a_-$.
  • The episode length is always $1$ to simplify the notation by removing the subindex $t$.

Now suppose that we picked the good action $a_+$ for which the reward is $+1$. In this case, the gradient is the same as in the classification task because we only multiply a constant $+1$ by the cross-entropy loss. We know that in this case, the gradient will try to increase the probability of selecting $a_+$ and reduce the probabilities of $a_-$. And if on the other hand, we were to choose a bad action $a_-$, the negative $-1$ reward will change the sign of the gradient, causing the probability of selecting $a_-$ to decrease and increase the probability of $a_+$.

Now let's consider a reward function that is less intuitive. What will happen if we multiply the original rewards by $10$, in which actions $a_+$ and $a_-$ will yield a reward of $+10$, $-10$ respectively. What will happen to the gradient? Well, this will increase the magnitude of the gradient by $10$ but won't change its direction, which is equivalent to increasing our learning rate $\alpha$ by $10$. Also, the use of gradient clipping can mitigate this effect.

But what if we add a constant to the original actions? Adding $+2$ to each of the original values yields a reward of $+3$, $+1$ to $a_+$ and $a_-$. Because the objective is to maximize the future reward, adding a constant should not change the actions' relative goodness. But in practice when using policy gradient methods, choosing the suboptimal action $a_-$ will also yield a positive reward, which will increase the probability of choosing action $a_-$ and thus make the convergence of the policy network slower. A third issue can arise when a given action has a reward of exactly $0$. In this case, the gradient will be $\vec{0}$ regardless of its relative goodness. These scenarios can be tackled when using the concept of Advantage, which tries to address these relative action goodness issues by reducing the average/expected return from each state.

The REINFORCE algorithm

So far we have considered the REINFORCE algorithm. This is the basic PG algorithm in which the reward function $G_{i,t}$ for trajectory $\tau_i$ at time $t$ is defined as the total sum of rewards of the episode:

$G_{i,t} = \sum_{t'=0}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})$

The REINFORCE algorithm has the folowing steps:

  1. Sample N trajectories by following the policy $\pi_{\theta}$. Each trajectory $\tau_i$ is the sequence: $s_{i,0}, a_{i,0}, r_{i,1}, s_{i,1}....s_{i,T-1}, a_{i,T-1}, r_{i,T-1}, s_{i,T}$
  2. Evaluate the gradient using these samples:
$$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left( \sum_{t=0}^{T_{i}-1} G_{i,t} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t}) \right)$$

$\hspace{20pt}$ where the reward function $G_{i,t}$ for trajectory $\tau_i$ at time $t$ is defined as the sum rewards from the beginning of the episode:

$$G_{i,t} = \sum_{t'=0}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})$$
  1. Update the policy parameters: $$\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$$

Variance reduction methods

This basic REINFORCE algorithm probably won't perform very well, however there are various methods of improving it - mainly by trying to reduce its variance. Because REINFORCE is a Monte Carlo algorithm, it is unbiased but has high variance. It is unbiased because we are sampling from our current policy and interacting with the environment. The reward function is only based on the sum of rewards with no bootstrapping. Thus, if we were to run an infinite number of trajectories starting from $s_0$, the average sum of trajectory rewards will be the true value of $v_{\pi_{\theta}}(s_{0})$ which is exactly our objective $J(\theta)$.

What does it actually mean to have high variance? It means that if we try to estimate the gradient over a small number of trajectories we will probably get a different estimate every time. Why is it a bad thing? Because the gradient is very noisy and the model will take a long time to converge to a local maximum. This is especially true when dealing with nonconvex function approximators like ANNs. We also might need to reduce our learning rate as a consequence of the high variance.

Reward To Go (Causality)

This variance reduction method exploits causality, as it is based on the fact that the future does not effect the past. The policy at time $t$ cannot effect the rewards we get until time $t$ and therefore we can define the reward $G_{i,t}$ only based on future rewards relative to $t$:

$G_{i,t} = \sum_{t'=t}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})$

And the corresponding gradient estimate is:

$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=0}^{T_{i}-1} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t}) * \left( \sum_{t'=t}^{T_{i}-1} r(s_{i,t'}, a_{i, t'}) \right)$

Policy Network
Figure 6: The calculation of the "reward to go" reward function for the same backup diagram presented earlier. In this case we get a different value for each of the states:
$G_{0,0} = r_1 + r_2 + r_3 = 1 + 3 + 5 = 9$
$G_{0,1} = r_2 + r_3 = 3 + 5 = 8$
$G_{0,2} = r_3 = 5$

Why does the "reward to go" reduce variance? The variance is reduced because we are summing up less random variables. It is not trivial to prove but I'll try to give some intuition for a specific case of independent rewards. If we assume that the rewards in a trajectory are independent of each other, then the variance of the sum of N random variables is equal to the sum of the individual variances:

$Var(\sum_{i=1}^{N}r_i) = \sum_{i=1}^{N}Var(r_i)$

Because the variance is a nonnegative function, this sum monotonically increases as N increases.

Discount rate

The use of a discount rate also reduces the variance because it puts less emphasis on rewards far into the future. Rewards in the far future are more likely to have high variance because in many cases the far future is more uncertain than the near future. We may choose to go in different directions and therefore arrive at various states that are very different from our current state.

The discounted gradient estimate is:

$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=0}^{T_{i}-1} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t}) * \left( \sum_{t'=0}^{T_{i}-1} \gamma^{t'} r(s_{i,t'}, a_{i, t'}) \right)$

And with "reward to go":

$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=0}^{T_{i}-1} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t}) * \left( \sum_{t'=t}^{T_{i}-1} \gamma^{t'-t} r(s_{i,t'}, a_{i, t'}) \right)$

Policy Network
Figure 7: The calculation of the discounted "reward to go" reward function for a discount rate $\gamma=0.9$:
$G_{0,0} = \gamma^0r_1 + \gamma^1r_2 + \gamma^2r_3 = (0.9)^0 1 + (0.9)^1 3 + (0.9)^2 5 = 7.75$
$G_{0,1} = \gamma^0r_2 + \gamma^1r_3 = (0.9)^0 3 + (0.9)^1 5 = 7.5$
$G_{0,2} = \gamma^0r_3 = (0.9)^0 5 = 5$


The advantage tries to quantify how much the chosen action was better than the expected/average action. We already saw some of the motivation for the use of advantage in the previous sections. Because the direction of the gradient is determined by the sum of rewards, any action that will have a positive reward function $G_{i, t}$ will cause a positive update to the policy and make the chosen action more likely. We want our policy to be robust to the absolute reward function's value, and to differentiate between actions based on their relative performance to the alternative actions. The average reward is called a Baseline.

For example, let's assume that we have two actions $a_1$ and $a_2$ that yield an average reward of $100$ and $102$ respectively. Reducing a baseline of $101$ from these rewards will yield an average reward of $-1$ and $+1$ for the two actions. This simple reward transformation will ease the convergence of the PG algorithm. If we sample $a_1$, the gradient signal will be negative on average and reduce the probability of choosing $a_1$, and if we chose $a_2$ the gradient will be positive on average, making $a_2$ more likely to be chosen.

We usualy prefer a baseline that is unbiased and state dependent, the latter because some states are inherently better than others and therefore will have different average returns. Our new reward function $G_{i,t}$ is the advantage function $A(s_t, a_t)$ which is defined as:

$A(s_t, a_t) = (\sum_{t'=0}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})) - b(s_t)$

where $b(s)$ is a baseline that is only a function of the state $s$ or combined with "reward to go", which is more common:

$A(s_t, a_t) = (\sum_{t'=t}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})) - b(s_t)$

Proof that the baseline is unbiased

We now need to show that the advantage is an unbias estimator of the gradient. What does it mean to be an unbiased estimator? It means that the expected value $E_{\pi_\theta}$ of the two terms are equal. We need to show that:

$E_{\pi_\theta}[\nabla_{\theta}log\pi_{\theta}(a|s)r(s, a)] \stackrel{?}{=} E_{\pi_\theta}[\nabla_{\theta}log\pi_{\theta}(a|s) (r(s, a) - b(s)) ]$

The notation $E_{\pi_\theta}$ is the expectation with respect to the policy $\pi_\theta$ which is defined as the probability of taking action $a$ at state $s$ according to the policy $\pi_{\theta}(a|s)$. Meaning, the expectation should be summed/integrated over all possible actions that can be taken at state $s$ with probability $\pi_{\theta}(a|s)$.

The linearity of expectation defines that the sum/difference between two random variables is equal to the sum/difference of their expectations:

$E(X-Y) = E(X) - E(Y)$

Therefore, we can rewrite the left term as the difference between the right term and an additional term:

$E_{\pi_\theta}[\nabla_{\theta}log\pi_{\theta}(a|s) (r(s, a) - b(s)) ] = E_{\pi_\theta}[\nabla_{\theta}log\pi_{\theta}(a|s)r(s, a)] - E_{\pi_\theta}[\nabla_{\theta}log\pi_{\theta}(a|s)b(s)]$

Then it is sufficient to show that the second term's expectation is $0$:

$E_{\pi_\theta}[\nabla_{\theta}log\pi_{\theta}(a|s)b(s)] \stackrel{?}{=} 0$


\begin{align*} E_{\pi_\theta}[\nabla_{\theta}log\pi_{\theta}(a|s)b(s)] & = \int_{a}\pi_{\theta}(a|s) \hspace{2pt} \nabla_{\theta}log\pi_{\theta}(a|s)b(s) \hspace{2pt} da &&\text{by definition of } E_{\pi_\theta}[X] = \int_{a}\pi_{\theta}(a|s) \hspace{2pt} X \hspace{2pt} da \\ & = \int_{a}\nabla_{\theta}\pi_{\theta}(a|s)b(s) \hspace{2pt} da &&\text{from} \hspace{10pt} \nabla_{x} log f(x) = \frac{\nabla_{x}f(x)}{f(x)} => \nabla_{x}f(x) = f(x)\nabla_{x} log f(x)\\ & = b(s)\nabla_{\theta}\int_{a}\pi_{\theta}(a|s) \hspace{2pt} da && b(s) \text{ is not a function of the actions and } \nabla_{\theta} \text{ is a linear operator, both can be taken outside the integral}\\ & = b(s)\nabla_{\theta}1 && \text{integrating over a probability distribution sums up to 1}\\ & = 0 && \text{the gradient of a constant is 0} \end{align*}

In fact, any baseline that is not a function of the actions is unbiased.

Baseline network

The use of a state-dependent baseline has no effect on the expected value of the gradient update, but it may reduce the variance significantly and speed up learning. Most commonly, we'll use an ANN parametrized by $\phi$ as our function approximator of the state-value function $\hat{V}_\phi^\pi(s)$. The notation represents the expected value of rewards starting at state $s$ and then following policy $\pi$. The state-value function $\hat{V}_\phi^\pi(s)$ can be approximated with a new separated network, but in some cases, sharing some of the network weights is very reasonable because the features extracted for the policy network are also required for evaluating the state-value. For example, in Atari games where the input is a sequence of images from sequential timeframes, identifying a gold coin or a monster can influence both policy and the current state's value. Figure 8 below demonstrates the two architecture options.

Policy and Value Networks

The value network's head is just a regression head with a single output (the state-value $\hat{V}_\phi^\pi(s)$). The loss function can be any regression loss (e.g. Mean Square Error) and the target is the actual sum of rewards or the "reward to go" for each specific timestep. For example, the state value’s gradient with MSE loss and "reward to go" targets can be estimated as:

$\nabla_{\phi} Loss(\phi) \approx \frac{1}{2N} \sum_{i=1}^{N} \sum_{t=0}^{T_{i}-1} \nabla_{\phi} \left( \underbrace{(\sum_{t'=t}^{T_{i}-1} r(s_{i,t'}, a_{i, t'}))}_\text{target} - \hat{V}_\phi^\pi(s_{i,t}) \right)^2$

Baseline gradient estimate

Using the value network output $\hat{V}_\phi^\pi(s)$ as our baseline, the advantage gradient estimate is:

$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=0}^{T_{i}-1} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t}) * \left( (\sum_{t'=0}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})) - \hat{V}_\phi^\pi(s_{i,t}) \right)$

And with "reward to go":

$\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=0}^{T_{i}-1} \nabla_{\theta} log \pi_{\theta}(a_{i,t}|s_{i,t}) * \left( (\sum_{t'=t}^{T_{i}-1} r(s_{i,t'}, a_{i, t'})) - \hat{V}_\phi^\pi(s_{i,t}) \right)$

Actor-Critic Algorithm

The Actor-Critic algorithm is the TD version of Policy Gradient. It is called actor-critic because it uses two networks, a policy network (the actor) is used for action selection, and a state-value network (the critic) that is used to evaluate the goodness or the outcome of the selected actions. This sounds similar to the REINFORCE with baseline, right? Like REINFORCE with baseline, it uses a network that evaluates the state-value $\hat{V}_\phi^\pi(s)$. Unlike REINFORCE with baseline, this state-value estimate is not just a baseline. It is used to bootstrap. So, what does it mean to bootstrap? In REINFORCE we sampled the full trajectory and used the actual trajectory's rewards as our value. For the case of using all the variance reduction methods described in the previous sections, our reward function $G_{i, t}$ is the advantage:

$A(s_t, a_t) = (\sum_{t'=t}^{T-1} \gamma^{t'-t} r(s_{t'}, a_{t'})) - \hat{V}_\phi^\pi(s_t)$

With bootstrapping, we are using an approximation of successor states values instead of sampling the full trajectory. This means that with actor-critic we don't need to wait until the episode ends in order to update the policy network, and that we can apply PG to continuing tasks. I will present the 1-step or online version of actor-critic, but it can be trivially extended to any desired n-step version. In the online version, at each timestep $t$ when being at state $s_t$ we apply an action $a_t$ that was sampled from the policy. We then observe the reward $r_{t+1}$ and a new state $s_{t+1}$ returned from the environment. The advantage function for the online actor-critic is the following:

$A(s_t, a_t) = r(s_{t}, a_{t}) + \gamma \hat{V}_\phi^\pi(s_{t+1}) - \hat{V}_\phi^\pi(s_t)$

This advantage function is the difference (or error) according to our state-value network, between $\hat{V}_\phi^\pi(s_t)$, our prior expectation of $s_t$'s state-value when following policy $\pi$, and $r(s_{t}, a_{t}) + \gamma \hat{V}_\phi^\pi(s_{t+1})$, which is the observed reward $r_{t+1}$ plus the discounted state-value evaluation of the new state $s_{t+1}$. Note that if:

$\hat{V}_\phi^\pi(s) = V_\pi(s) \hspace{7pt} \forall \hspace{3pt} s \hspace{12pt} => \hspace{12pt} E_{\pi}[A(s_t, a_t)]=0$

Meaning that if we actually had the true state-value function $V_\pi(s)$ then the expectation of the advantage when following policy $\pi$ would be equal to $0$. This implies that when exploring new actions, getting a positive advantage on average means that we are applying better actions than the actions sampled from $\pi$.

Bias & Variance

In practice, we don't have a perfect estimation of the state-value function because:

  • We keep changing our policy on-the-fly and do not wait for it to converge to the true value.
  • Even if we waited enough time, ANN's with nonlinearities aren't convex functions and are unlikely to converge to the true value.
  • In many cases the state space is too large that even if someone told us the true state-value we wouldn't have the capacity to store a unique value for each possible state. In these cases we are actually using an aggregation of states or observations.

And therefore, unlike REINFORCE, the advantage estimation is biased.

Now you probably expect me to say that on the other hand, actor-critic improves the variance... Well, if we are to use the online version described above, we'll be updating the gradient-based on a single environment interaction and that will be a very noisy signal (high variance). We prefer to use batches, and this can be achieved with the Asynchronous Advantage Actor-Critic (A3C) algorithm, which I'm not going to dive deep into here. In a nutshell, it runs multiple agents in parallel and synchronized their gradient signals which reduce the variance.

Advantages of Policy Gradient over action-value methods

This is a summary of the advantages of Policy Gradient over action-value given in Sutton and Barto's book chapter 13. It requires reader familiarity with state-value and action-value methods.

  1. Allows deterministic policies (discrete action space):
    The deterministic policy is naturally achieved by a PG method. In case a certain action dominants others, the logit value for this action will be pushed infinitely high, causing this action's softmax value to reach a probability of 1 in the limit. On the contrary, for action-values with an e-greedy policy, there is always an epsilon probability of choosing a random action. Trying to achieve exploration by defining a softmax over the action-values also won't allow convergence to a deterministic policy. This is because the action-values will converge to their true values, which have a finite difference, and therefore all actions will have a non-zero probability.
  2. Can learn a stochastic policy
    In some cases, the policy that yields the highest reward is stochastic - for example, in an imperfect information card game like poker, where you occasionally choose to bluff. PG methods using an ANN function approximator with a softmax layer can easily output any arbitrary probability over actions, while for action-value methods, the policy is set implicitly by choosing the action with the maximal action value estimate. This inherent maximization makes it impossible for action-value methods to naturally learn a distribution over actions.
  3. Stronger convergence guarantees due to policy smoothness
    Action probability of PG methods changes smoothly when using a continuous policy parametrization (e.g. an ANN model). This is an important theoretical advantage in comparison to the action-value policy, which is set implicitly by the learned action values. This implicit policy is not a smooth function because of the use of maximization to select between actions. A minor increase in a single action-value estimate can make this action's value estimation maximal and significantly increase the probability for that action to be chosen. This theoretical advantage is largely responsible for the stronger convergence guarantees available for PG methods over action-value methods.
  4. Injecting prior knowledge to policy parametrization
    The choice of policy parametrization can be a good way of injecting prior knowledge of the desired form of the policy into the reinforcement learning system. This is often the most important reason for using a policy-based learning method.
    "The advantages of policy gradient methods for parameterized motor primitives are numerous. Among the most important ones are that the policy representation can be hosen such that it is meaningful for the task, i.e., we can use a suitable motor primitive representation, and that domain knowledge can be incorporated, which often leads to fewer parameters in the learning process in comparison to traditional value function based approaches." (Reinforcement learning of motor skills with policy gradients)


This was an introductory blog post that went over the basic Policy Gradient methods like REINFORCE with/without variance reduction and Actor-Critic. I hope you enjoyed it - I would be pleased to recieve feedback and/or answer any of your questions.

In the next blog post, we'll write some Tensorflow code for both the discrete and continues action spaces (similar to the reparametrization trick used in VAE). We'll also dive into some more sophisticated PG methods that try to restrict the change in policy in comparison to our current policies like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).

Implementation warning

Note that when using a value network (in REINFORCE with baseline and actor-critic), it should be trained separately from the policy network. When training the policy network with an automatic differentiation software (e.g. Tensorflow), we need to be careful not to also generate gradients for the value network that is part of the advantage function. In Tensorflow for example, this can be achieved by first computing the reward function for each timestep $G_{i,t}$ in numpy and then inserting the resulting numpy array as input to the model with tf.placeholder (using feed_dict).


[1] Sutton and Barto book Reinforcement Learning: An Introduction.

[2] UC Berkeley's CS285: "Deep Reinforcement Learning" course by Sergey Levine.