- Environment $\rightarrow$ Markov decision process: $\mathcal{S}, \mathcal{A}, P, R$
- Agent $\rightarrow$ Policy $\pi$
- Goal: Find $\pi$ with maximum value
$$
V^\pi(s) = \mathbf{E}_\pi\left[ \sum_{t=0}^\infty \gamma^t r(S_t, A_t) \middle| S_0 = s \right]
$$
Finite $\mathcal{S}$ and $\mathcal{A}$, known $P$ and $R$
Bellman Equations
$$
Q^*(s, a) = r(s, a) + \gamma \sum_{s'}P(s'|s, a) V^*(s'), \quad\text{where}\quad V^*(s) = \max_a Q^*(s, a)
$$
$$
\pi^*(s) = \arg\max_a Q^*(s, a)
$$
+ Dynamic Programming algorithms to compute $Q^*$ (Value Iteration, Policy Iteration)
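A minimal NumPy sketch of Value Iteration under these assumptions: $P$ stored as a dense $(S, A, S)$ array and $R$ as an $(S, A)$ array (the array layout is illustrative, not part of the slides):
```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """P: (S, A, S) array with P[s, a, s'] = P(s'|s, a); R: (S, A) array."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V          # Bellman backup: r(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)          # V(s) = max_a Q(s, a)
        if np.abs(V_new - V).max() < tol:
            return Q, V_new, Q.argmax(axis=1)   # Q*, V*, greedy policy pi*
        V = V_new
```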
Finite $\mathcal{S}$ and $\mathcal{A}$, unknown $P$ and $R$
$Q$-Learning
$$
\delta_t = r_t + \gamma \max_a \widehat{Q}(s_{t+1}, a) - \widehat{Q}(s_t, a_t)
$$
$$
\widehat{Q}(s_t, a_t) \gets \widehat{Q}(s_t, a_t) + \alpha_t \delta_t
$$
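A minimal sketch of this tabular update, assuming integer-indexed states and actions and an $(S, A)$ NumPy array for $\widehat{Q}$:
```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD error: delta_t = r_t + gamma * max_a' Q_hat(s_{t+1}, a') - Q_hat(s_t, a_t)
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    # Update: Q_hat(s_t, a_t) <- Q_hat(s_t, a_t) + alpha_t * delta_t
    Q[s, a] += alpha * delta
    return Q
```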
Large $\mathcal{S}$ and $\mathcal{A}$
Function approximation to estimate $Q$
$$
y_i = r_i + \gamma \max_a\widehat{Q}(s_i', a)
$$
$$
\widehat{Q} \gets \arg\min_{f \in \mathcal{F}} \sum_{i=1}^N (f(s_i, a_i) - y_i)^2
$$
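One way this regression could be realized, as a sketch: discrete actions, one scikit-learn regressor per action, and repeated target refreshes (fitted $Q$-iteration). The regressor choice and batch format are illustrative assumptions; any class $\mathcal{F}$ with a fit/predict interface would do, and every action is assumed to appear in the batch.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q(S, A, R, S_next, n_actions, gamma=0.99, n_iters=20):
    """S, S_next: (N, d) state features; A: (N,) int actions; R: (N,) rewards."""
    models = [RandomForestRegressor(n_estimators=50) for _ in range(n_actions)]
    y = R.copy()                                  # first targets: Q_hat = 0
    for _ in range(n_iters):
        for a in range(n_actions):                # fit f(s_i, a_i) to targets y_i
            models[a].fit(S[A == a], y[A == a])
        # refresh targets: y_i = r_i + gamma * max_a Q_hat(s_i', a)
        Q_next = np.stack([m.predict(S_next) for m in models], axis=1)
        y = R + gamma * Q_next.max(axis=1)
    return models
```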
Finite-Horizon Policy Gradient
Finite-Horizon Objective Function
- We want to find a policy $\pi$ that achieves the maximum:
$$
\max_{\pi} \mathbf{E}_{\pi}\left[ \sum_{t=0}^{\color{yellow}{T}-1} \gamma^t r(S_t, A_t)
+ \gamma^{\color{yellow}{T}} r(S_T) \middle| S_0 \sim \rho \right]
$$
and we know that there exists a deterministic policy that achieves it.
- But we consider a relaxed version of the problem, using stochastic
and differentiable policies $\pi_\theta$:
$$
\max_{\color{yellow}{\theta}} J(\color{yellow}{\theta}) =
\mathbf{E}_{\pi_{\color{yellow}{\theta}}}\left[
\sum_{t=0}^{\color{yellow}{T}-1} \gamma^t r(S_t, A_t)
+ \gamma^{\color{yellow}{T}} r(S_T)
\middle| S_0 \sim \rho
\right]
$$
- Trajectory $\tau$ of states and actions:
$$
\tau = (S_0, A_0, S_1, A_1, \ldots, S_{T-1}, A_{T-1}, S_T)
$$
- Probability of this trajectory by following a policy $\pi_\theta$:
$$
p_\theta(\tau) = \rho(S_0)\prod_{t=0}^{T-1} \pi_\theta(A_t|S_t) p(S_{t+1}|S_t, A_t)
$$
- Sum of rewards along $\tau$:
$$
R(\tau) = \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t)
+ \gamma^{T} r(S_T)
$$
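As a sketch, computing $R(\tau)$ from a sampled trajectory, where `rewards[t]` holds $r(S_t, A_t)$ and `terminal_reward` holds $r(S_T)$ (these argument names are illustrative):
```python
def trajectory_return(rewards, terminal_reward, gamma=0.99):
    """rewards[t] = r(S_t, A_t) for t = 0..T-1; terminal_reward = r(S_T)."""
    T = len(rewards)
    return sum(gamma**t * r for t, r in enumerate(rewards)) + gamma**T * terminal_reward
```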
Finite-Horizon Policy Gradient
With that notation, we can write the objective function as:
$$
J(\theta) =
\mathbf{E}_{\pi_{\theta}}\left[
\sum_{t=0}^{T-1} \gamma^t r(S_t, A_t)
+ \gamma^{T} r(S_T)
\middle| S_0 \sim \rho
\right]
=
\color{yellow}{\int R(\tau) p_\theta(\tau) \mathrm{d}\tau}
$$
And we can compute the gradient of $J(\theta)$ as:
$$
\nabla_\theta J(\theta) = \int R(\tau) \nabla_\theta p_\theta(\tau) \mathrm{d}\tau
= \int R(\tau) \left[\nabla_\theta \log p_\theta(\tau)\right] p_\theta(\tau)\mathrm{d}\tau
\\
=
\color{yellow}{
\mathbf{E}_{\pi_{\theta}}\left[ R(\tau) \nabla_\theta \log p_\theta(\tau) \right]
}
$$
Computing $\nabla_\theta \log p_\theta(\tau)$
$$
\color{yellow}{\nabla_\theta \log p_\theta(\tau)}
= \nabla_\theta \log \left( \rho(S_0)\prod_{t=0}^{T-1} \pi_\theta(A_t|S_t) p(S_{t+1}|S_t, A_t) \right)
\\
= \nabla_\theta \left( \log \rho(S_0) + \sum_{t=0}^{T-1} \left( \log p(S_{t+1}|S_t, A_t) + \log \pi_\theta(A_t|S_t) \right) \right)
\\
\color{yellow}{ = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t|S_t)}
$$
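The last step uses the fact that $\rho$ and $p$ do not depend on $\theta$, so only the policy terms survive. For a linear-softmax policy $\pi_\theta(a|s) \propto \exp(\theta^\top \phi(s, a))$ (an assumed policy class, not fixed by the slides), the score has the closed form $\nabla_\theta \log \pi_\theta(a|s) = \phi(s, a) - \sum_{a'} \pi_\theta(a'|s)\, \phi(s, a')$, sketched below:
```python
import numpy as np

def softmax_policy(theta, phi_s):
    """phi_s: (n_actions, d) array of features phi(s, a); theta: (d,)."""
    logits = phi_s @ theta
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    return probs / probs.sum()

def grad_log_pi(theta, phi_s, a):
    # grad_theta log pi_theta(a|s) = phi(s, a) - sum_a' pi_theta(a'|s) phi(s, a')
    probs = softmax_policy(theta, phi_s)
    return phi_s[a] - probs @ phi_s
```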
Finite-Horizon Policy Gradient
Finally, we obtain
$$
\nabla_\theta J(\theta)
= \mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} R(\tau) \nabla_\theta \log \pi_\theta(A_t|S_t) \right]
$$
At each iteration:
- Sample $N$ trajectories $(\tau_i)_{i=1}^N$ following the policy $\pi_\theta$, where
$$
\tau_i = (S_0^i, A_0^i, S_1^i, A_1^i, \ldots, S_{T-1}^i, A_{T-1}^i, S_T^i)
$$
- Estimate the gradient as
$$
\widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} R(\tau_i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i)
$$
- Update the policy parameters
$$
\theta \gets \theta + \eta \widehat{\nabla_\theta J(\theta)}
$$
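Putting the three steps together, a sketch of one REINFORCE iteration, reusing the helpers from the earlier sketches; the `env` interface (`reset() -> s`, `step(a) -> (s', r)`), the feature map `phi`, and a zero terminal reward are all assumptions:
```python
import numpy as np

def reinforce_iteration(theta, env, phi, n_actions, N=16, T=100, gamma=0.99, eta=1e-2):
    """phi(s) returns the (n_actions, d) feature matrix for state s."""
    grad = np.zeros_like(theta)
    for _ in range(N):                            # sample N trajectories under pi_theta
        s = env.reset()
        rewards, scores = [], []
        for t in range(T):
            probs = softmax_policy(theta, phi(s))
            a = np.random.choice(n_actions, p=probs)
            scores.append(grad_log_pi(theta, phi(s), a))
            s, r = env.step(a)
            rewards.append(r)
        # contribution R(tau_i) * sum_t grad log pi(A_t^i|S_t^i); r(S_T) assumed 0
        R_tau = trajectory_return(rewards, terminal_reward=0.0, gamma=gamma)
        grad += R_tau * np.sum(scores, axis=0)
    return theta + eta * grad / N                 # theta <- theta + eta * gradient estimate
```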
For any real constant $b$, show that
$$
\mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} b \nabla_\theta \log \pi_\theta(A_t|S_t) \right] = 0
$$
For any $t \in \{0, \ldots, T-1\}$, let $Y_t$ be a random variable that depends only on
$\tau^{0:t} = \left(S_0, A_0, \ldots, S_{t-1}, A_{t-1}, S_t\right)$. Show that
$$
\mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} Y_t \nabla_\theta \log \pi_\theta(A_t|S_t) \right] = 0
$$
REINFORCE: Variance Reduction
True gradient $\nabla_\theta J(\theta)$:
$$
\mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} R(\tau) \nabla_\theta \log \pi_\theta(A_t|S_t) \right]
$$
Estimated gradient $\widehat{\nabla_\theta J(\theta)}$:
$$
\frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1} R(\tau_i) \nabla_\theta \log \pi_\theta(A_t^i|S_t^i)
$$
Problem: the variance
$$
\mathbf{E}_{\pi_{\theta}}\left[
\left\|
\widehat{\nabla_\theta J(\theta)}
-
\nabla_\theta J(\theta)
\right\|^2
\right]
$$
can be too high!
REINFORCE: Variance Reduction using Baselines
The variance can be reduced by subtracting a baseline $Y_t$ from the return inside the estimator:
$$
\widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1}
(R(\tau_i) -\color{yellow}{Y_t} )
\nabla_\theta \log \pi_\theta(A_t^i|S_t^i)
$$
and, if $Y_t$ is a function only of $\tau^{0:t}$, then the estimator remains unbiased:
$$
\mathbf{E}_{\pi_{\theta}}\left[ \widehat{\nabla_\theta J(\theta)} \right] =\nabla_\theta J(\theta)
$$
REINFORCE: Variance Reduction
A common choice is to take $Y_t = \sum_{t'=0}^{t-1} \gamma^{t'} r(S_{t'}, A_{t'})$, which gives us
$$
\widehat{\nabla_\theta J(\theta)} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T-1}
\left(
\sum_{t'=t}^{T-1} \gamma^{t'} r(S_{t'}^i, A_{t'}^i) + \gamma^T r(S_T^i)
\right)
\nabla_\theta \log \pi_\theta(A_t^i|S_t^i)
$$
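A sketch of the reward-to-go weights $\sum_{t'=t}^{T-1} \gamma^{t'} r_{t'} + \gamma^T r(S_T)$ used by this estimator, computed in one backward pass:
```python
import numpy as np

def reward_to_go(rewards, terminal_reward, gamma=0.99):
    """w[t] = sum_{t'=t}^{T-1} gamma^{t'} * rewards[t'] + gamma^T * terminal_reward."""
    T = len(rewards)
    w = np.zeros(T)
    tail = gamma**T * terminal_reward
    for t in reversed(range(T)):      # accumulate discounted rewards from t onward
        tail += gamma**t * rewards[t]
        w[t] = tail
    return w
```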
Let
$$
Q_t^{\pi_\theta}(s, a) =
\mathbf{E}_{\pi_{\theta}}\left[
\sum_{t'=t}^{T-1} \gamma^{t'-t} r(S_{t'}, A_{t'}) + \gamma^{T-t} r(S_T)
\middle | S_t = s, A_t = a
\right]
$$
Show that
$$
\nabla_\theta J(\theta) =
\mathbf{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1} \gamma^t Q_t^{\pi_\theta}(S_t, A_t) \nabla_\theta \log \pi_\theta(A_t|S_t) \right]
$$