Introduction to Policy Gradient Methods

Omar Darwiche Domingues

Inria Lille

Recap

Reinforcement Learning


  • Environment $\rightarrow$ Markov decision process: $\mathcal{S}, \mathcal{A}, P, R$
  • Agent $\rightarrow$ Policy $\pi$
  • Goal: Find $\pi$ with maximum value $$ V^\pi(s) = \mathbf{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \middle| S_0 = s \right] $$
Finite $\mathcal{S}$ and $\mathcal{A}$, known $P$ and $R$


Bellman Equations

$$ Q^*(s, a) = r(s, a) + \gamma \sum_{s'}P(s'|s, a) V^*(s'), \quad\text{where}\quad V^*(s) = \max_a Q^*(s, a) $$

$$ \pi^*(s) = \arg\max_a Q^*(s, a) $$

+ Dynamic Programming algorithms to compute $Q^*$ (Value Iteration, Policy Iteration)
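
For concreteness, here is a minimal value iteration sketch (not from the slides; it assumes $P$ and $R$ are given as arrays of shapes $(|\mathcal{S}|, |\mathcal{A}|, |\mathcal{S}|)$ and $(|\mathcal{S}|, |\mathcal{A}|)$, and the tolerance is a placeholder):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration sketch: P has shape (S, A, S), R has shape (S, A)."""
    Q = np.zeros(R.shape)
    while True:
        # Bellman optimality update: Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new, Q_new.argmax(axis=1)  # Q* and greedy policy pi*(s) = argmax_a Q*(s,a)
        Q = Q_new
```
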
Finite $\mathcal{S}$ and $\mathcal{A}$, unknown $P$ and $R$


$Q$-Learning

$$ \delta_t = r_t + \gamma \max_a \widehat{Q}(s_{t+1}, a) - \widehat{Q}(s_t, a_t) $$

$$ \widehat{Q}(s_t, a_t) \gets \widehat{Q}(s_t, a_t) + \alpha_t \delta_t $$
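
A short sketch of this update in code (tabular $\widehat{Q}$ stored as an array; step size and discount are placeholder values):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step; Q is an array of shape (S, A), updated in place."""
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # temporal-difference error delta_t
    Q[s, a] += alpha * delta
    return Q
```

In practice, the actions $a_t$ would be chosen by an exploratory policy such as $\varepsilon$-greedy with respect to $\widehat{Q}$.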

Large $\mathcal{S}$ and $\mathcal{A}$


Function approximation to estimate $Q$, from a dataset of transitions $(s_i, a_i, r_i, s_i')_{i=1}^N$:

$$ y_i = r_i + \gamma \max_a\widehat{Q}(s_i', a) $$

$$ \widehat{Q} \gets \arg\min_{f \in \mathcal{F}} \sum_{i=1}^N (f(s_i, a_i) - y_i)^2 $$
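
One possible instantiation of this regression step, assuming a linear function class $\mathcal{F}$ over given state-action features (the feature representation is an assumption; the slides do not fix $\mathcal{F}$):

```python
import numpy as np

def fitted_q_step(phi, r, phi_next_all, w, gamma=0.95):
    """One fitted-Q regression step with Q_hat(s, a) = phi(s, a) @ w.

    phi:          features of the sampled pairs (s_i, a_i), shape (N, d)
    r:            rewards r_i, shape (N,)
    phi_next_all: features of (s_i', a) for every action a, shape (N, A, d)
    """
    y = r + gamma * (phi_next_all @ w).max(axis=1)    # targets y_i = r_i + gamma max_a Q_hat(s_i', a)
    w_new, *_ = np.linalg.lstsq(phi, y, rcond=None)   # least-squares fit over the linear class
    return w_new
```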

Policy Gradient: Idea

Idea


  • Instead of learning a $Q$ function, learn the policy $\pi$ directly.
  • Consider a parametric family of stochastic policies $\{\pi_\theta\}_{\theta\in\Theta}$.
  • $\pi_\theta(a|s)$ = probability of choosing $a$ in $s$, given $\theta$.
  • $\theta \mapsto \pi_\theta(a|s)$ is assumed to be differentiable.


Goal: find $\theta$ that maximizes $$ J(\theta) = \mathbf{E}_{\pi_\theta}\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \middle| S_0 \sim \rho \right] $$ where $\rho$ is an initial state distribution.
Idea


$$ J(\theta) = \mathbf{E}_{\pi_\theta}\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \middle| S_0 \sim \rho \right] $$ If we can compute $\nabla_\theta J(\theta)$, use gradient ascent $$ \theta \gets \theta + \eta \nabla J(\theta) $$ We will see how to estimate this gradient!

A simplified case



Maximizing a function $f: \mathbb{R} \to \mathbb{R}$ $$ \max_x f(x) $$
Maximizing $f(x)$


What can we do if $f$ is not concave and not differentiable?

If $f$ is a reward function, this is often the case!
Maximizing $f(x)$


What can we do if $f$ is not concave and not differentiable?

  • Let $p_\theta(x)$ be a probability distribution on $\mathbb{R}$ parameterized by $\theta$
  • Create a "relaxed" version of the optimization problem: $$ \max_\theta \mathbf{E}_\theta[f(X)] \quad \text{where} \quad X \sim p_\theta $$ and solve it instead of $\max_x f(x)$.
Maximizing $f(x)$


Relaxed version of the optimization problem: $$ \max_\theta \mathbf{E}_\theta[f(X)] \quad \text{where} \quad X \sim p_\theta $$ Intuition:

  • Let $x^* \in \arg\max_x f(x)$
  • If $\exists\, \theta^*$ such that $p_{\theta^*}$ is the Dirac measure at $x^*$:
  • $$ \max_\theta \mathbf{E}_\theta[f(X)] = \max_x f(x) $$
  • To simplify the problem, take a family $\{p_\theta\}_{\theta\in\Theta}$ such that $\theta\mapsto p_\theta(x)$ is differentiable!
Maximizing $f(x)$


Relaxed version of the optimization problem: $\max_\theta J(\theta)$, $$ \text{where} \quad J(\theta) = \mathbf{E}_\theta[f(X)] \text{ and } X \sim p_\theta $$ Now, since $\theta\mapsto p_\theta$ is differentiable: $$ \nabla J(\theta) = \nabla \int f(x)p_\theta(x) dx = \int f(x) \nabla p_\theta(x) dx \\ = \int f(x) \left[\nabla\log p_\theta(x)\right] p_\theta(x) dx $$ $$ = \mathbf{E}_\theta\left[f(X)\nabla\log p_\theta(X)\right] \text{ for } X \sim p_\theta $$
Maximizing $f(x)$
$$ \nabla J(\theta) = \mathbf{E}_\theta\left[f(X)\nabla\log p_\theta(X)\right] \approx \frac{1}{N} \sum_{i=1}^N f(X_i)\nabla\log p_\theta(X_i) \quad \text{ for }X_i \sim p_\theta $$ Example: Gaussian distribution with mean $\theta$ and fixed variance $\sigma^2$: $$ p_\theta(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right) $$
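
A minimal sketch of this estimator combined with gradient ascent, for the Gaussian family above (the objective $f$ and all constants are made up for illustration; for a Gaussian mean, $\nabla_\theta \log p_\theta(x) = (x-\theta)/\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A reward-like objective: neither concave nor differentiable everywhere
    return -np.abs(x - 2.0) + np.sin(5.0 * x)

theta, sigma, eta, N = 0.0, 1.0, 0.05, 100
for _ in range(500):
    x = rng.normal(theta, sigma, size=N)        # X_i ~ p_theta
    score = (x - theta) / sigma**2              # grad_theta log p_theta(X_i)
    grad = np.mean(f(x) * score)                # Monte Carlo estimate of grad J(theta)
    theta += eta * grad                         # gradient ascent
print(theta)                                    # drifts toward a (local) maximizer of f
```
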
Exercise

For any real constant $b$, show that $$ \mathbf{E}_\theta\left[b\nabla\log p_\theta(X)\right] = 0 $$

Finite Horizon Policy Gradient

Finite-Horizon Objective Function


  • We want to find $\pi$ that gives the maximum:
  • $$ \max_{\pi} \mathbf{E}_{\pi}\left[ \sum_{t=0}^{\color{yellow}{T}-1} \gamma^t r(S_t, A_t) + \gamma^{\color{yellow}{T}} r(S_T) \middle| S_0 \sim \rho \right] $$ and we know that there exists a deterministic policy that achieves it.

  • But we consider a relaxed version of the problem, using stochastic and differentiable policies $\pi_\theta$:
  • $$ \max_{\color{yellow}{\theta}} J(\color{yellow}{\theta}) = \mathbf{E}_{\pi_{\color{yellow}{\theta}}}\left[ \sum_{t=0}^{\color{yellow}{T}-1} \gamma^t r(S_t, A_t) + \gamma^{\color{yellow}{T}} r(S_T) \middle| S_0 \sim \rho \right] $$
Notation


  • Trajectory $\tau$ of states and actions:
  • $$ \tau = (S_0, A_0, S_1, A_1, \ldots, S_{T-1}, A_{T-1}, S_T) $$
  • Probability of this trajectory by following a policy $\pi_\theta$:
  • $$ p_\theta(\tau) = \rho(S_0)\prod_{t=0}^{T-1} \pi_\theta(A_t|S_t) p(S_{t+1}|S_t, A_t) $$
  • Sum of rewards along $\tau$:
  • $$ R(\tau) = \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t) + \gamma^{T} r(S_T) $$
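
In code, the $\theta$-dependent part of $\log p_\theta(\tau)$ reduces to a sum of policy log-probabilities, while $R(\tau)$ is a discounted sum (a small sketch; `log_pi` is a hypothetical callable returning $\log \pi_\theta(a|s)$):

```python
import numpy as np

def trajectory_return(rewards, terminal_reward, gamma=0.99):
    """R(tau) = sum_{t<T} gamma^t r(S_t, A_t) + gamma^T r(S_T)."""
    T = len(rewards)
    return np.sum(gamma ** np.arange(T) * np.asarray(rewards)) + gamma**T * terminal_reward

def policy_log_prob(log_pi, states, actions):
    """Theta-dependent part of log p_theta(tau): sum_t log pi_theta(A_t|S_t).

    The terms log rho(S_0) and log p(S_{t+1}|S_t, A_t) do not depend on theta."""
    return sum(log_pi(s, a) for s, a in zip(states, actions))
```
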
Finite-Horizon Policy Gradient


With that notation, we can write the objective function as: $$ J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t) + \gamma^{T} r(S_T) \middle| S_0 \sim \rho \right] = \color{yellow}{\int R(\tau) p_\theta(\tau) \mathrm{d}\tau} $$ And we can compute the gradient of $J(\theta)$ as: $$ \nabla_\theta J(\theta) = \int R(\tau) \nabla_\theta p_\theta(\tau) \mathrm{d}\tau = \int R(\tau) \left[\nabla_\theta \log p_\theta(\tau)\right] p_\theta(\tau)\mathrm{d}\tau \\ = \color{yellow}{ \mathbf{E}_{\pi_{\theta}}\left[ R(\tau) \nabla_\theta \log p_\theta(\tau) \right] } $$
Computing $\nabla_\theta \log p_\theta(\tau)$


$$ \color{yellow}{\nabla_\theta \log p_\theta(\tau)} = \nabla_\theta \log \left( \rho(S_0)\prod_{t=0}^{T-1} \pi_\theta(A_t|S_t) p(S_{t+1}|S_t, A_t) \right) \\ = \nabla_\theta \left( \log \rho(S_0) + \sum_{t=0}^{T-1} \left[ \log \pi_\theta(A_t|S_t) + \log p(S_{t+1}|S_t, A_t) \right] \right) \\ \color{yellow}{ = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t|S_t)} $$ since the initial distribution $\rho$ and the transitions $p$ do not depend on $\theta$.
Finite-Horizon Policy Gradient


Finally, we obtain

$$ \nabla_\theta J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} R(\tau) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$

REINFORCE Algorithm


At each iteration:

  • Sample $N$ trajectories $(\tau_i)_{i=1}^N$ following the policy $\pi_\theta$, where
  • $$ \tau_i = (S_0^i, A_0^i, S_1^i, A_1^i, \ldots, S_{T-1}^i, A_{T-1}^i, S_T^i) $$
  • Estimate the gradient as
  • $$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} R(\tau_i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$
  • Update the policy parameters
  • $$ \theta \gets \theta + \eta \widehat{\nabla_\theta J(\theta)} $$
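
A self-contained sketch of these three steps on a toy tabular MDP with a softmax policy (the MDP, horizon, and hyperparameters are made up; the terminal reward $r(S_T)$ is omitted for simplicity; for the softmax, $\nabla_{\theta_{s,\cdot}} \log \pi_\theta(a|s) = \mathbf{1}_a - \pi_\theta(\cdot|s)$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP, for illustration only
nS, nA, T, gamma = 4, 2, 20, 0.99
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over next states
R = rng.uniform(size=(nS, nA))                  # r(s, a)

def softmax(theta_s):
    z = np.exp(theta_s - theta_s.max())
    return z / z.sum()

def reinforce_iteration(theta, N=50, eta=0.1):
    grad = np.zeros_like(theta)
    for _ in range(N):                          # 1) sample N trajectories under pi_theta
        s = rng.integers(nS)                    # S_0 ~ rho (uniform here)
        sum_grad_log_pi, ret = np.zeros_like(theta), 0.0
        for t in range(T):
            pi = softmax(theta[s])
            a = rng.choice(nA, p=pi)
            ret += gamma**t * R[s, a]                    # accumulate R(tau_i)
            sum_grad_log_pi[s] += np.eye(nA)[a] - pi     # grad_theta log pi_theta(A_t|S_t)
            s = rng.choice(nS, p=P[s, a])
        grad += ret * sum_grad_log_pi           # 2) R(tau_i) * sum_t grad log pi
    return theta + eta * grad / N               # 3) gradient ascent step

theta = np.zeros((nS, nA))
for _ in range(200):
    theta = reinforce_iteration(theta)
```
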
Exercise

For any real constant $b$, show that $$ \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} b \nabla_\theta \log \pi_\theta(A_t|S_t) \right] = 0 $$

Exercise

For any $t \in \{0, \ldots, T-1\}$, let $Y_t$ be a random variable that depends only on $\tau^{0:t} = \left(S_0, A_0, \ldots, S_{t-1}, A_{t-1}, S_t\right)$. Show that $$ \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} Y_t \nabla_\theta \log \pi_\theta(A_t|S_t) \right] = 0 $$

REINFORCE: Variance Reduction


True gradient $\nabla_\theta J(\theta)$: $$ \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} R(\tau) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$
Estimated gradient $\widehat{\nabla_\theta J(\theta)}$: $$ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} R(\tau_i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$


Problem: the variance $$ \mathbf{E}_{\pi_{\theta}}\left[ \left\| \widehat{\nabla_\theta J(\theta)} - \nabla_\theta J(\theta) \right\|^2 \right] $$ can be too high!
REINFORCE: Variance Reduction using Baselines


The variance can be reduced by subtracting a baseline $Y_t$ from the estimator: $$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} (R(\tau_i) -\color{yellow}{Y_t} ) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$ and, if $Y_t$ is a function only of $\tau^{0:t}$, then the estimator remains unbiased: $$ \mathbf{E}_{\pi_{\theta}}\left[ \widehat{\nabla_\theta J(\theta)} \right] =\nabla_\theta J(\theta) $$
REINFORCE: Variance Reduction


A common choice is to take $Y_t = \sum_{t'=0}^{t-1} \gamma^{t'} r(S_{t'}, A_{t'})$, the discounted reward accumulated before time $t$, which gives us

$$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \left( \sum_{t'=t}^{T-1} \gamma^{t'} r(S_{t'}^i, A_{t'}^i) + \gamma^T r(S_T^i) \right) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$
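
A small helper computing these reward-to-go terms, which simply replace $R(\tau_i)$ in the REINFORCE estimator (a sketch; discounting is from $t=0$, i.e. $\gamma^{t'}$, to match the formula above):

```python
import numpy as np

def rewards_to_go(rewards, terminal_reward=0.0, gamma=0.99):
    """G_t = sum_{t'=t}^{T-1} gamma^{t'} r_{t'} + gamma^T r(S_T), for every t."""
    T = len(rewards)
    discounted = gamma ** np.arange(T) * np.asarray(rewards, dtype=float)
    tails = np.cumsum(discounted[::-1])[::-1]   # suffix sums: sum_{t'>=t} gamma^{t'} r_{t'}
    return tails + gamma**T * terminal_reward
```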

Exercise


Let $$ Q_t^{\pi_\theta}(s, a) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t'=t}^{T-1} \gamma^{t'-t} r(S_{t'}, A_{t'}) + \gamma^{T-t} r(S_T) \middle | S_t = s, A_t = a \right] $$

Show that $$ \nabla_\theta J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t Q_t^{\pi_\theta}(S_t, A_t) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$

Infinite Horizon Policy Gradient

Infinite-Horizon Objective Function


  • Find $\theta$ that maximizes:
  • $$ \max_{\theta} J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \middle| S_0 \sim \rho \right] $$
  • That is, $J(\theta)$ is now the limit of the finite-horizon objective as $T\to \infty$:
  • $$ J(\theta) = \lim_{T\to\infty}\mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t) + \gamma^{T} r(S_T) \middle| S_0 \sim \rho \right] $$
Policy Gradient Theorem


[Sutton et al. (1999)] $$ \nabla_\theta J(\theta) = \frac{1}{1-\gamma} \mathbb{E}_{(s, a) \sim d^{\pi_\theta} } \left[ \nabla_\theta \log\pi_\theta(a|s) Q^{\pi_\theta}(s, a) \right] $$



where, in the discounted setting, $d^{\pi_\theta}$ is the normalized discounted state-action occupancy measure (with $S_0 \sim \rho$ and actions drawn from $\pi_\theta$): $$ d^{\pi_\theta}(s, a) = (1-\gamma) \sum_{t=0}^\infty \gamma^t \mathbb{P}\left[ S_t = s, A_t = a\right] $$
Policy Gradient Theorem


Notice that, by using the definition of $d^{\pi_\theta}(s, a)$: $$ \nabla_\theta J(\theta) = \sum_{s, a} \sum_{t=0}^\infty \gamma^t \mathbb{P}\left[ S_t = s, A_t = a\right] \nabla_\theta \log\pi_\theta(a|s) Q^{\pi_\theta}(s, a) \\ = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t \nabla_\theta \log\pi_\theta(A_t|S_t) Q^{\pi_\theta}(S_t, A_t) \right] $$ and notice the similarity with the finite-horizon case.
Policy Gradient Theorem: Proof Idea


Use the Bellman equations $$ Q^{\pi_\theta}(s, a) = r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^{\pi_\theta}(s') \\ V^{\pi_\theta}(s) = \sum_a \pi_\theta(a|s) Q^{\pi_\theta}(s, a) $$ to show that $$ \nabla_\theta V^{\pi_\theta}(s) = \sum_a Q^{\pi_\theta}(s, a)\nabla_\theta \pi_\theta(a|s) + \gamma \sum_a \pi_\theta(a|s)\sum_{s'} p(s'|s, a) \nabla_\theta V^{\pi_\theta}(s') $$ and continue recursively.

Policy Gradient: Recap and Examples

Exact gradient


Finite horizon:

$$ \nabla_\theta J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t Q_t^{\pi_\theta}(S_t, A_t) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$



Infinite horizon with discount $\gamma$:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t Q^{\pi_\theta}(S_t, A_t) \nabla_\theta\log\pi_\theta(A_t|S_t) \right] $$

REINFORCE: Gradient estimation and updates


$$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \color{red}{\gamma^t} \widehat{Q}_t^i(S_t^i, A_t^i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$ where $$ \widehat{Q}_t^i(S_t^i, A_t^i) = \sum_{t'=t}^{T-1} \gamma^{t'-t} r(S_{t'}^i, A_{t'}^i) + \gamma^{T-t} r(S_T^i) $$

Then, the policy is updated as $$ \theta \gets \theta + \eta \widehat{\nabla_\theta J(\theta)} $$
How to represent a policy?


  • Gaussian policy with mean $\mu_{\theta_1}(s)$ and variance $\sigma_{\theta_2}(s)^2$, where $\theta = (\theta_1, \theta_2)$
  • $$ \pi_\theta(a|s) = \frac{1}{\sqrt{2\pi} \sigma_{\theta_2}(s)} \exp\left( -\frac{1}{2\sigma_{\theta_2}(s)^2} (a -\mu_{\theta_1}(s))^2 \right) $$
  • Softmax policy
  • $$ \pi_\theta(a|s) = \frac{ \exp(\theta_{s, a}) }{ \sum_{a'}\exp(\theta_{s, a'}) } $$
  • A neural network with parameters $\theta$, with states as input and a distribution over actions as output.
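
Minimal sketches of the first two parameterizations, together with the gradients of $\log \pi_\theta$ that the estimators above require (the linear-in-features parameterization of the Gaussian mean and log-standard-deviation is an assumption; the slides leave it unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_softmax(theta, s):
    """Tabular softmax: pi_theta(a|s) proportional to exp(theta[s, a])."""
    z = np.exp(theta[s] - theta[s].max())
    pi = z / z.sum()
    a = rng.choice(len(pi), p=pi)
    grad_log_pi = np.zeros_like(theta)
    grad_log_pi[s] = np.eye(len(pi))[a] - pi                  # d log pi / d theta[s, :]
    return a, grad_log_pi

def sample_gaussian(theta1, theta2, phi_s):
    """Gaussian policy with mean theta1 @ phi(s) and log-std theta2 @ phi(s)."""
    mu, sigma = theta1 @ phi_s, np.exp(theta2 @ phi_s)
    a = rng.normal(mu, sigma)
    grad_theta1 = (a - mu) / sigma**2 * phi_s                 # d log pi / d theta1
    grad_theta2 = ((a - mu)**2 / sigma**2 - 1.0) * phi_s      # d log pi / d theta2
    return a, grad_theta1, grad_theta2
```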

Advantage Actor-Critic

Back to variance reduction


The gradient estimated by REINFORCE can still have a high variance!

Taking $V^{\pi_\theta}(S_t)$ as a baseline, we have

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t \left[ Q^{\pi_\theta}(S_t, A_t) - V^{\pi_\theta}(S_t) \right] \nabla_\theta\log\pi_\theta(A_t|S_t) \right] \\ = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t \color{yellow}{A^{\pi_\theta}(S_t, A_t)} \nabla_\theta\log\pi_\theta(A_t|S_t) \right] $$

where $\color{yellow}{A^{\pi_\theta}(S_t, A_t)} = Q^{\pi_\theta}(S_t, A_t) - V^{\pi_\theta}(S_t)$ is the advantage function.
Advantage Actor-Critic (A2C)


  • Assume we have access to $V_\omega(s)$, a function approximation of $V^{\pi_\theta}(s)$ with parameters $\omega$.
  • The advantage function at $(S_t, A_t)$ can be estimated as
  • $$ \widehat{A}(S_t, A_t) = r(S_t, A_t) + \gamma V_\omega(S_{t+1}) - V_\omega(S_t) $$
  • Which leads to the following gradient estimator:
  • $$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \color{red}{\gamma^t} \widehat{A}(S_t^i, A_t^i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$
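
A sketch of this estimator, assuming two hypothetical helpers: `grad_log_pi(s, a)` returning $\nabla_\theta \log \pi_\theta(a|s)$ and `V(s)` returning the critic's estimate $V_\omega(s)$:

```python
def a2c_actor_gradient(trajectories, grad_log_pi, V, gamma=0.99):
    """Actor gradient with the TD(0)-style advantage estimate A_hat.

    trajectories: list of trajectories, each a list of (s, a, r, s_next) transitions."""
    grad = None
    for traj in trajectories:
        for t, (s, a, r, s_next) in enumerate(traj):
            adv = r + gamma * V(s_next) - V(s)          # A_hat(S_t, A_t)
            g = gamma**t * adv * grad_log_pi(s, a)
            grad = g if grad is None else grad + g
    return grad / len(trajectories)
```
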
Advantage Actor-Critic (A2C)


Actor $\leftrightarrow$ policy

Critic $\leftrightarrow$ value function approximation



This reduces the variance of the gradient estimator, but can introduce bias!
A2C: How to update the approximation $V_\omega(s)$?


  • Use the temporal difference $\delta_t$
  • $$ \delta_t = r(S_t, A_t) + \gamma V_\omega(S_{t+1}) - V_\omega(S_t) , \quad \omega \gets \omega + \beta \delta_t \nabla_\omega V_\omega(S_t) $$
  • Use batches and bootstrapping
  • $$ \widehat{V}(S_t^i) = \sum_{k=t}^{t+\color{yellow}{p}} \gamma^{k-t}r(S_{k}^i, A_{k}^i) + \gamma^{\color{yellow}{p}+1} V_\omega(S_{t+\color{yellow}{p}+1}^i) $$ and minimize the loss with respect to $\omega$ $$ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \left( \widehat{V}(S_t^i)- V_\omega(S_t^i) \right)^2 $$
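
Sketches of the two critic updates (the `V` and `grad_V` callables are hypothetical, e.g. a linear critic $V_\omega(s) = \omega^\top \phi(s)$; the $p$-step target is computed here for a single window of rewards):

```python
import numpy as np

def td_critic_update(omega, V, grad_V, s, r, s_next, beta=0.01, gamma=0.99):
    """TD(0) update: omega <- omega + beta * delta_t * grad_omega V_omega(S_t)."""
    delta = r + gamma * V(omega, s_next) - V(omega, s)
    return omega + beta * delta * grad_V(omega, s)

def p_step_target(rewards, bootstrap_value, gamma=0.99):
    """V_hat(S_t) = sum_{k=t}^{t+p} gamma^{k-t} r_k + gamma^{p+1} V_omega(S_{t+p+1}),
    with rewards = [r_t, ..., r_{t+p}] and bootstrap_value = V_omega(S_{t+p+1})."""
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(discounts * np.asarray(rewards)) + gamma**len(rewards) * bootstrap_value
```

The batch loss on the slide is then the mean squared error between these targets and $V_\omega(S_t^i)$, minimized with respect to $\omega$ (e.g. by gradient descent).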

Conclusion

Policy Gradient: Advantages and Disadvantages


Advantages:

  • Allows conservative policy updates that make learning more stable.
  • Easy to implement and can handle continuous states and actions.

Disadvantages:

  • Requires a lot of samples!
  • It can be difficult to control the variance of the gradient.
  • How to avoid bad local optima?