Introduction to Policy Gradient Methods

Omar Darwiche Domingues

Inria Lille

Recap

Reinforcement Learning


  • Environment $\rightarrow$ Markov decision process: $\mathcal{S}, \mathcal{A}, P, R$
  • Agent $\rightarrow$ Policy $\pi$
  • Goal: Find $\pi$ with maximum value $$ V^\pi(s) = \mathbf{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \middle| S_0 = s \right] $$
Finite $\mathcal{S}$ and $\mathcal{A}$, known $P$ and $R$


Bellman Equations

$$ Q^*(s, a) = r(s, a) + \gamma \sum_{s'}P(s'|s, a) V^*(s'), \quad\text{where}\quad V^*(s) = \max_a Q^*(s, a) $$

$$ \pi^*(s) = \arg\max_a Q^*(s, a) $$

+ Dynamic Programming algorithms to compute $Q^*$ (Value Iteration, Policy Iteration)
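
For concreteness, here is a minimal value iteration sketch (not from the slides; it assumes $P$ and $R$ are given as arrays of shapes $(|\mathcal{S}|, |\mathcal{A}|, |\mathcal{S}|)$ and $(|\mathcal{S}|, |\mathcal{A}|)$, and the tolerance is a placeholder):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration sketch: P has shape (S, A, S), R has shape (S, A)."""
    Q = np.zeros(R.shape)
    while True:
        # Bellman optimality update: Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new, Q_new.argmax(axis=1)  # Q* and greedy policy pi*(s) = argmax_a Q*(s,a)
        Q = Q_new
```
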
Finite $\mathcal{S}$ and $\mathcal{A}$, unknown $P$ and $R$


$Q$-Learning

$$ \delta_t = r_t + \gamma \max_a \widehat{Q}(s_{t+1}, a) - \widehat{Q}(s_t, a_t) $$

$$ \widehat{Q}(s_t, a_t) \gets \widehat{Q}(s_t, a_t) + \alpha_t \delta_t $$
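
A short sketch of this update in code (tabular $\widehat{Q}$ stored as an array; step size and discount are placeholder values):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step; Q is an array of shape (S, A), updated in place."""
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # temporal-difference error delta_t
    Q[s, a] += alpha * delta
    return Q
```

In practice, the actions $a_t$ would be chosen by an exploratory policy such as $\varepsilon$-greedy with respect to $\widehat{Q}$.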

Large $\mathcal{S}$ and $\mathcal{A}$


Function approximation to estimate $Q$, from a dataset of transitions $(s_i, a_i, r_i, s_i')_{i=1}^N$:

$$ y_i = r_i + \gamma \max_a\widehat{Q}(s_i', a) $$

$$ \widehat{Q} \gets \arg\min_{f \in \mathcal{F}} \sum_{i=1}^N (f(s_i, a_i) - y_i)^2 $$
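
One possible instantiation of this regression step, assuming a linear function class $\mathcal{F}$ over given state-action features (the feature representation is an assumption; the slides do not fix $\mathcal{F}$):

```python
import numpy as np

def fitted_q_step(phi, r, phi_next_all, w, gamma=0.95):
    """One fitted-Q regression step with Q_hat(s, a) = phi(s, a) @ w.

    phi:          features of the sampled pairs (s_i, a_i), shape (N, d)
    r:            rewards r_i, shape (N,)
    phi_next_all: features of (s_i', a) for every action a, shape (N, A, d)
    """
    y = r + gamma * (phi_next_all @ w).max(axis=1)    # targets y_i = r_i + gamma max_a Q_hat(s_i', a)
    w_new, *_ = np.linalg.lstsq(phi, y, rcond=None)   # least-squares fit over the linear class
    return w_new
```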

Policy Gradient: Idea

Idea


  • Instead of learning a $Q$ function, learn the policy $\pi$ directly.
  • Consider a parametric family of stochastic policies $\{\pi_\theta\}_{\theta\in\Theta}$.
  • $\pi_\theta(a|s)$ = probability of choosing $a$ in $s$, given $\theta$.
  • $\theta \mapsto \pi_\theta(a|s)$ is assumed to be differentiable.


Goal: find $\theta$ that maximizes $$ J(\theta) = \mathbf{E}_{\pi_\theta}\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \middle| S_0 \sim \rho \right] $$ where $\rho$ is an initial state distribution.
Idea


$$ J(\theta) = \mathbf{E}_{\pi_\theta}\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \middle| S_0 \sim \rho \right] $$ If we can compute $\nabla_\theta J(\theta)$, use gradient ascent $$ \theta \gets \theta + \eta \nabla J(\theta) $$ We will see how to estimate this gradient!

A simplified case



Maximizing a function $f: \mathbb{R} \to \mathbb{R}$ $$ \max_x f(x) $$
Maximizing $f(x)$


What can we do if $f$ is not concave and not differentiable?

If $f$ is a reward function, this is often the case!
Maximizing $f(x)$


What can we do if $f$ is not concave and not differentiable?

  • Let $p_\theta(x)$ be a probability distribution on $\mathbb{R}$ parameterized by $\theta$
  • Create a "relaxed" version of the optimization problem: $$ \max_\theta \mathbf{E}_\theta[f(X)] \quad \text{where} \quad X \sim p_\theta $$ and solve it instead of $\max_x f(x)$.
Maximizing $f(x)$


Relaxed version of the optimization problem: $$ \max_\theta \mathbf{E}_\theta[f(X)] \quad \text{where} \quad X \sim p_\theta $$ Intuition:

  • Let $x^* \in \arg\max_x f(x)$
  • If $\exists\, \theta^*$ such that $p_{\theta^*}$ is the Dirac measure at $x^*$:
  • $$ \max_\theta \mathbf{E}_\theta[f(X)] = \max_x f(x) $$
  • To simplify the problem, take a family $\{p_\theta\}_{\theta\in\Theta}$ such that $\theta\mapsto p_\theta(x)$ is differentiable!
Maximizing $f(x)$


Relaxed version of the optimization problem: $\max_\theta J(\theta)$, $$ \text{where} \quad J(\theta) = \mathbf{E}_\theta[f(X)] \text{ and } X \sim p_\theta $$ Now, since $\theta\mapsto p_\theta$ is differentiable: $$ \nabla J(\theta) = \nabla \int f(x)p_\theta(x) dx = \int f(x) \nabla p_\theta(x) dx \\ = \int f(x) \left[\nabla\log p_\theta(x)\right] p_\theta(x) dx $$ $$ = \mathbf{E}_\theta\left[f(X)\nabla\log p_\theta(X)\right] \text{ for } X \sim p_\theta $$
Maximizing $f(x)$
$$ \nabla J(\theta) = \mathbf{E}_\theta\left[f(X)\nabla\log p_\theta(X)\right] \approx \frac{1}{N} \sum_{i=1}^N f(X_i)\nabla\log p_\theta(X_i) \quad \text{ for }X_i \sim p_\theta $$ Example: Gaussian distribution with mean $\theta$ and fixed variance $\sigma^2$: $$ p_\theta(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right) $$
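
A minimal sketch of this estimator combined with gradient ascent, for the Gaussian family above (the objective $f$ and all constants are made up for illustration; for a Gaussian mean, $\nabla_\theta \log p_\theta(x) = (x-\theta)/\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A reward-like objective: neither concave nor differentiable everywhere
    return -np.abs(x - 2.0) + np.sin(5.0 * x)

theta, sigma, eta, N = 0.0, 1.0, 0.05, 100
for _ in range(500):
    x = rng.normal(theta, sigma, size=N)        # X_i ~ p_theta
    score = (x - theta) / sigma**2              # grad_theta log p_theta(X_i)
    grad = np.mean(f(x) * score)                # Monte Carlo estimate of grad J(theta)
    theta += eta * grad                         # gradient ascent
print(theta)                                    # drifts toward a (local) maximizer of f
```
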
Exercise

For any real constant $b$, show that $$ \mathbf{E}_\theta\left[b\nabla\log p_\theta(X)\right] = 0 $$

Finite Horizon Policy Gradient

Finite-Horizon Objective Function


  • We want to find $\pi$ that gives the maximum:
  • $$ \max_{\pi} \mathbf{E}_{\pi}\left[ \sum_{t=0}^{\color{yellow}{T}-1} \gamma^t r(S_t, A_t) + \gamma^{\color{yellow}{T}} r(S_T) \middle| S_0 \sim \rho \right] $$ and we know that there exists a deterministic policy that achieves it.

  • But we consider a relaxed version of the problem, using stochastic and differentiable policies $\pi_\theta$:
  • $$ \max_{\color{yellow}{\theta}} J(\color{yellow}{\theta}) = \mathbf{E}_{\pi_{\color{yellow}{\theta}}}\left[ \sum_{t=0}^{\color{yellow}{T}-1} \gamma^t r(S_t, A_t) + \gamma^{\color{yellow}{T}} r(S_T) \middle| S_0 \sim \rho \right] $$
Notation


  • Trajectory $\tau$ of states and actions:
  • $$ \tau = (S_0, A_0, S_1, A_1, \ldots, S_{T-1}, A_{T-1}, S_T) $$
  • Probability of this trajectory by following a policy $\pi_\theta$:
  • $$ p_\theta(\tau) = \rho(S_0)\prod_{t=0}^{T-1} \pi_\theta(A_t|S_t) p(S_{t+1}|S_t, A_t) $$
  • Sum of rewards along $\tau$:
  • $$ R(\tau) = \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t) + \gamma^{T} r(S_T) $$
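
In code, the $\theta$-dependent part of $\log p_\theta(\tau)$ reduces to a sum of policy log-probabilities, while $R(\tau)$ is a discounted sum (a small sketch; `log_pi` is a hypothetical callable returning $\log \pi_\theta(a|s)$):

```python
import numpy as np

def trajectory_return(rewards, terminal_reward, gamma=0.99):
    """R(tau) = sum_{t<T} gamma^t r(S_t, A_t) + gamma^T r(S_T)."""
    T = len(rewards)
    return np.sum(gamma ** np.arange(T) * np.asarray(rewards)) + gamma**T * terminal_reward

def policy_log_prob(log_pi, states, actions):
    """Theta-dependent part of log p_theta(tau): sum_t log pi_theta(A_t|S_t).

    The terms log rho(S_0) and log p(S_{t+1}|S_t, A_t) do not depend on theta."""
    return sum(log_pi(s, a) for s, a in zip(states, actions))
```
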
Finite-Horizon Policy Gradient


With that notation, we can write the objective function as: $$ J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t) + \gamma^{T} r(S_T) \middle| S_0 \sim \rho \right] = \color{yellow}{\int R(\tau) p_\theta(\tau) \mathrm{d}\tau} $$ And we can compute the gradient of $J(\theta)$ as: $$ \nabla_\theta J(\theta) = \int R(\tau) \nabla_\theta p_\theta(\tau) \mathrm{d}\tau = \int R(\tau) \left[\nabla_\theta \log p_\theta(\tau)\right] p_\theta(\tau)\mathrm{d}\tau \\ = \color{yellow}{ \mathbf{E}_{\pi_{\theta}}\left[ R(\tau) \nabla_\theta \log p_\theta(\tau) \right] } $$
Computing $\nabla_\theta \log p_\theta(\tau)$


$$ \color{yellow}{\nabla_\theta \log p_\theta(\tau)} = \nabla_\theta \log \left( \rho(S_0)\prod_{t=0}^{T-1} \pi_\theta(A_t|S_t) p(S_{t+1}|S_t, A_t) \right) \\ = \nabla_\theta \left( \log \rho(S_0) + \sum_{t=0}^{T-1} \left[ \log \pi_\theta(A_t|S_t) + \log p(S_{t+1}|S_t, A_t) \right] \right) \\ \color{yellow}{ = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t|S_t)} $$ since the initial distribution $\rho$ and the transitions $p$ do not depend on $\theta$.
Finite-Horizon Policy Gradient


Finally, we obtain

$$ \nabla_\theta J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} R(\tau) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$

REINFORCE Algorithm


At each iteration:

  • Sample $N$ trajectories $(\tau_i)_{i=1}^N$ following the policy $\pi_\theta$, where
  • $$ \tau_i = (S_0^i, A_0^i, S_1^i, A_1^i, \ldots, S_{T-1}^i, A_{T-1}^i, S_T^i) $$
  • Estimate the gradient as
  • $$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} R(\tau_i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$
  • Update the policy parameters
  • $$ \theta \gets \theta + \eta \widehat{\nabla_\theta J(\theta)} $$
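
A self-contained sketch of these three steps on a toy tabular MDP with a softmax policy (the MDP, horizon, and hyperparameters are made up; the terminal reward $r(S_T)$ is omitted for simplicity; for the softmax, $\nabla_{\theta_{s,\cdot}} \log \pi_\theta(a|s) = \mathbf{1}_a - \pi_\theta(\cdot|s)$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP, for illustration only
nS, nA, T, gamma = 4, 2, 20, 0.99
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over next states
R = rng.uniform(size=(nS, nA))                  # r(s, a)

def softmax(theta_s):
    z = np.exp(theta_s - theta_s.max())
    return z / z.sum()

def reinforce_iteration(theta, N=50, eta=0.1):
    grad = np.zeros_like(theta)
    for _ in range(N):                          # 1) sample N trajectories under pi_theta
        s = rng.integers(nS)                    # S_0 ~ rho (uniform here)
        sum_grad_log_pi, ret = np.zeros_like(theta), 0.0
        for t in range(T):
            pi = softmax(theta[s])
            a = rng.choice(nA, p=pi)
            ret += gamma**t * R[s, a]                    # accumulate R(tau_i)
            sum_grad_log_pi[s] += np.eye(nA)[a] - pi     # grad_theta log pi_theta(A_t|S_t)
            s = rng.choice(nS, p=P[s, a])
        grad += ret * sum_grad_log_pi           # 2) R(tau_i) * sum_t grad log pi
    return theta + eta * grad / N               # 3) gradient ascent step

theta = np.zeros((nS, nA))
for _ in range(200):
    theta = reinforce_iteration(theta)
```
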
Exercise

For any real constant $b$, show that $$ \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} b \nabla_\theta \log \pi_\theta(A_t|S_t) \right] = 0 $$

Exercise

For any $t \in \{0, \ldots, T-1\}$, let $Y_t$ be a random variable that depends only on $\tau^{0:t} = \left(S_0, A_0, \ldots, S_{t-1}, A_{t-1}, S_t\right)$. Show that $$ \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} Y_t \nabla_\theta \log \pi_\theta(A_t|S_t) \right] = 0 $$

REINFORCE: Variance Reduction


True gradient $\nabla_\theta J(\theta)$: $$ \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} R(\tau) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$
Estimated gradient $\widehat{\nabla_\theta J(\theta)}$: $$ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} R(\tau_i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$


Problem: the variance $$ \mathbf{E}_{\pi_{\theta}}\left[ \left\| \widehat{\nabla_\theta J(\theta)} - \nabla_\theta J(\theta) \right\|^2 \right] $$ can be too high!
REINFORCE: Variance Reduction using Baselines


The variance can be reduced by subtracting a baseline $Y_t$ from the estimator: $$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} (R(\tau_i) -\color{yellow}{Y_t} ) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$ and, if $Y_t$ is a function only of $\tau^{0:t}$, then the estimator remains unbiased: $$ \mathbf{E}_{\pi_{\theta}}\left[ \widehat{\nabla_\theta J(\theta)} \right] =\nabla_\theta J(\theta) $$
REINFORCE: Variance Reduction


A common choice is to take $Y_t = \sum_{t'=0}^{t-1} \gamma^{t'} r(S_{t'}, A_{t'})$, the discounted reward accumulated before time $t$, which gives us

$$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \left( \sum_{t'=t}^{T-1} \gamma^{t'} r(S_{t'}^i, A_{t'}^i) + \gamma^T r(S_T^i) \right) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$
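
A small helper computing these reward-to-go terms, which simply replace $R(\tau_i)$ in the REINFORCE estimator (a sketch; discounting is from $t=0$, i.e. $\gamma^{t'}$, to match the formula above):

```python
import numpy as np

def rewards_to_go(rewards, terminal_reward=0.0, gamma=0.99):
    """G_t = sum_{t'=t}^{T-1} gamma^{t'} r_{t'} + gamma^T r(S_T), for every t."""
    T = len(rewards)
    discounted = gamma ** np.arange(T) * np.asarray(rewards, dtype=float)
    tails = np.cumsum(discounted[::-1])[::-1]   # suffix sums: sum_{t'>=t} gamma^{t'} r_{t'}
    return tails + gamma**T * terminal_reward
```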

Exercise


Let $$ Q_t^{\pi_\theta}(s, a) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t'=t}^{T-1} \gamma^{t'-t} r(S_{t'}, A_{t'}) + \gamma^{T-t} r(S_T) \middle | S_t = s, A_t = a \right] $$

Show that $$ \nabla_\theta J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t Q_t^{\pi_\theta}(S_t, A_t) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$

Infinite Horizon Policy Gradient

Infinite-Horizon Objective Function


  • Find $\theta$ that maximizes:
  • $$ \max_{\theta} J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \middle| S_0 \sim \rho \right] $$
  • That is, $J(\theta)$ is now the limit of the finite-horizon objective as $T\to \infty$:
  • $$ J(\theta) = \lim_{T\to\infty}\mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t) + \gamma^{T} r(S_T) \middle| S_0 \sim \rho \right] $$
Policy Gradient Theorem


[Sutton et al. (1999)] $$ \nabla_\theta J(\theta) = \frac{1}{1-\gamma} \mathbb{E}_{(s, a) \sim d^{\pi_\theta} } \left[ \nabla_\theta \log\pi_\theta(a|s) Q^{\pi_\theta}(s, a) \right] $$



where, in the discounted setting, $d^{\pi_\theta}$ is the normalized discounted state-action occupancy measure (with $S_0 \sim \rho$ and actions drawn from $\pi_\theta$): $$ d^{\pi_\theta}(s, a) = (1-\gamma) \sum_{t=0}^\infty \gamma^t \mathbb{P}\left[ S_t = s, A_t = a\right] $$
Policy Gradient Theorem


Notice that, by using the definition of $d^{\pi_\theta}(s, a)$: $$ \nabla_\theta J(\theta) = \sum_{s, a} \sum_{t=0}^\infty \gamma^t \mathbb{P}\left[ S_t = s, A_t = a\right] \nabla_\theta \log\pi_\theta(a|s) Q^{\pi_\theta}(s, a) \\ = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t \nabla_\theta \log\pi_\theta(A_t|S_t) Q^{\pi_\theta}(S_t, A_t) \right] $$ and notice the similarity with the finite-horizon case.
Policy Gradient Theorem: Proof Idea


Use the Bellman equations $$ Q^{\pi_\theta}(s, a) = r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^{\pi_\theta}(s') \\ V^{\pi_\theta}(s) = \sum_a \pi_\theta(a|s) Q^{\pi_\theta}(s, a) $$ to show that $$ \nabla_\theta V^{\pi_\theta}(s) = \sum_a Q^{\pi_\theta}(s, a)\nabla_\theta \pi_\theta(a|s) + \gamma \sum_a \pi_\theta(a|s)\sum_{s'} p(s'|s, a) \nabla_\theta V^{\pi_\theta}(s') $$ and continue recursively.

Policy Gradient: Recap and Examples

Exact gradient


Finite horizon:

$$ \nabla_\theta J(\theta) = \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t Q_t^{\pi_\theta}(S_t, A_t) \nabla_\theta \log \pi_\theta(A_t|S_t) \right] $$



Infinite horizon with discount $\gamma$:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t Q^{\pi_\theta}(S_t, A_t) \nabla_\theta\log\pi_\theta(A_t|S_t) \right] $$

REINFORCE: Gradient estimation and updates


$$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \color{red}{\gamma^t} \widehat{Q}_t^i(S_t^i, A_t^i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$ where $$ \widehat{Q}_t^i(S_t^i, A_t^i) = \sum_{t'=t}^{T-1} \gamma^{t'-t} r(S_{t'}^i, A_{t'}^i) + \gamma^{T-t} r(S_T^i) $$

Then, the policy is updated as $$ \theta \gets \theta + \eta \widehat{\nabla_\theta J(\theta)} $$
How to represent a policy?


  • Gaussian policy with mean $\mu_{\theta_1}(s)$ and variance $\sigma_{\theta_2}(s)^2$, where $\theta = (\theta_1, \theta_2)$
  • $$ \pi_\theta(a|s) = \frac{1}{\sqrt{2\pi} \sigma_{\theta_2}(s)} \exp\left( -\frac{1}{2\sigma_{\theta_2}(s)^2} (a -\mu_{\theta_1}(s))^2 \right) $$
  • Softmax policy
  • $$ \pi_\theta(a|s) = \frac{ \exp(\theta_{s, a}) }{ \sum_{a'}\exp(\theta_{s, a'}) } $$
  • A neural network with parameters $\theta$, with states as input and a distribution over actions as output.
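
Minimal sketches of the first two parameterizations, together with the gradients of $\log \pi_\theta$ that the estimators above require (the linear-in-features parameterization of the Gaussian mean and log-standard-deviation is an assumption; the slides leave it unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_softmax(theta, s):
    """Tabular softmax: pi_theta(a|s) proportional to exp(theta[s, a])."""
    z = np.exp(theta[s] - theta[s].max())
    pi = z / z.sum()
    a = rng.choice(len(pi), p=pi)
    grad_log_pi = np.zeros_like(theta)
    grad_log_pi[s] = np.eye(len(pi))[a] - pi                  # d log pi / d theta[s, :]
    return a, grad_log_pi

def sample_gaussian(theta1, theta2, phi_s):
    """Gaussian policy with mean theta1 @ phi(s) and log-std theta2 @ phi(s)."""
    mu, sigma = theta1 @ phi_s, np.exp(theta2 @ phi_s)
    a = rng.normal(mu, sigma)
    grad_theta1 = (a - mu) / sigma**2 * phi_s                 # d log pi / d theta1
    grad_theta2 = ((a - mu)**2 / sigma**2 - 1.0) * phi_s      # d log pi / d theta2
    return a, grad_theta1, grad_theta2
```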

Advantage Actor-Critic

Back to variance reduction


The gradient estimated by REINFORCE can still have a high variance!

Taking $V^{\pi_\theta}(S_t)$ as a baseline, we have

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t \left[ Q^{\pi_\theta}(S_t, A_t) - V^{\pi_\theta}(S_t) \right] \nabla_\theta\log\pi_\theta(A_t|S_t) \right] \\ = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^\infty \gamma^t \color{yellow}{A^{\pi_\theta}(S_t, A_t)} \nabla_\theta\log\pi_\theta(A_t|S_t) \right] $$

where $\color{yellow}{A^{\pi_\theta}(S_t, A_t)} = Q^{\pi_\theta}(S_t, A_t) - V^{\pi_\theta}(S_t)$ is the advantage function.
Advantage Actor-Critic (A2C)


  • Assume we have access to $V_\omega(s)$, a function approximation of $V^{\pi_\theta}(s)$ with parameters $\omega$.
  • The advantage function at $(S_t, A_t)$ can be estimated as
  • $$ \widehat{A}(S_t, A_t) = r(S_t, A_t) + \gamma V_\omega(S_{t+1}) - V_\omega(S_t) $$
  • Which leads to the following gradient estimator:
  • $$ \widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \color{red}{\gamma^t} \widehat{A}(S_t^i, A_t^i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i) $$
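
A sketch of this estimator, assuming two hypothetical helpers: `grad_log_pi(s, a)` returning $\nabla_\theta \log \pi_\theta(a|s)$ and `V(s)` returning the critic's estimate $V_\omega(s)$:

```python
def a2c_actor_gradient(trajectories, grad_log_pi, V, gamma=0.99):
    """Actor gradient with the TD(0)-style advantage estimate A_hat.

    trajectories: list of trajectories, each a list of (s, a, r, s_next) transitions."""
    grad = None
    for traj in trajectories:
        for t, (s, a, r, s_next) in enumerate(traj):
            adv = r + gamma * V(s_next) - V(s)          # A_hat(S_t, A_t)
            g = gamma**t * adv * grad_log_pi(s, a)
            grad = g if grad is None else grad + g
    return grad / len(trajectories)
```
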
Advantage Actor-Critic (A2C)


Actor $\leftrightarrow$ policy

Critic $\leftrightarrow$ value function approximation



This reduces the variance of the gradient estimator, but can introduce bias!
A2C: How to update the approximation $V_\omega(s)$?


  • Use the temporal difference $\delta_t$
  • $$ \delta_t = r(S_t, A_t) + \gamma V_\omega(S_{t+1}) - V_\omega(S_t) , \quad \omega \gets \omega + \beta \delta_t \nabla_\omega V_\omega(S_t) $$
  • Use batches and bootstrapping
  • $$ \widehat{V}(S_t^i) = \sum_{k=t}^{t+\color{yellow}{p}} \gamma^{k-t}r(S_{k}^i, A_{k}^i) + \gamma^{\color{yellow}{p}+1} V_\omega(S_{t+\color{yellow}{p}+1}^i) $$ and minimize the loss with respect to $\omega$ $$ \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} \left( \widehat{V}(S_t^i)- V_\omega(S_t^i) \right)^2 $$
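
Sketches of the two critic updates (the `V` and `grad_V` callables are hypothetical, e.g. a linear critic $V_\omega(s) = \omega^\top \phi(s)$; the $p$-step target is computed here for a single window of rewards):

```python
import numpy as np

def td_critic_update(omega, V, grad_V, s, r, s_next, beta=0.01, gamma=0.99):
    """TD(0) update: omega <- omega + beta * delta_t * grad_omega V_omega(S_t)."""
    delta = r + gamma * V(omega, s_next) - V(omega, s)
    return omega + beta * delta * grad_V(omega, s)

def p_step_target(rewards, bootstrap_value, gamma=0.99):
    """V_hat(S_t) = sum_{k=t}^{t+p} gamma^{k-t} r_k + gamma^{p+1} V_omega(S_{t+p+1}),
    with rewards = [r_t, ..., r_{t+p}] and bootstrap_value = V_omega(S_{t+p+1})."""
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(discounts * np.asarray(rewards)) + gamma**len(rewards) * bootstrap_value
```

The batch loss on the slide is then the mean squared error between these targets and $V_\omega(S_t^i)$, minimized with respect to $\omega$ (e.g. by gradient descent).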

Conclusion

Policy Gradient: Advantages and Disadvantages


Advantages:

  • Allows conservative policy updates that make learning more stable.
  • Easy to implement and can handle continuous states and actions.

Disadvantages:

  • Requires a lot of samples!
  • It can be difficult to control the variance of the gradient.
  • How to avoid bad local optima?