Jake Gunther
2020/3/4
“Proximal Policy Optimization Algorithms”
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
(OpenAI)
18 August 2017
Recall REINFORCE and REINFORCE with baseline \[ \begin{align} \nabla J(\mathbf{\theta}) &= E\left[ q_\pi(S_t,A_t)\frac{\nabla\pi(A_t|S_t,\mathbf{\theta})}{\pi(A_t|S_t,\mathbf{\theta})}\right]\\ &= E\left[ \left(q_\pi(S_t,A_t)-v_\pi(S_t)\right)\frac{\nabla\pi(A_t|S_t,\mathbf{\theta})}{\pi(A_t|S_t,\mathbf{\theta})}\right] \end{align} \]
Advantage function \(A(s, a) = q(s, a) - v(s)\) measures the improvement of action \(a\) compared to the average \(v(s) = \sum_a \pi(a|s)q(s, a)\).
If \(A(s,a)>0\), the gradient pushes the policy toward action \(a\).
If \(A(s,a)<0\), the gradient pushes the policy away from action \(a\).
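A quick numerical check (values made up purely for illustration): take two actions with \(\pi(a_1|s)=0.75\), \(\pi(a_2|s)=0.25\), \(q(s,a_1)=4\), \(q(s,a_2)=8\). Then
\[ v(s) = 0.75\cdot 4 + 0.25\cdot 8 = 5, \qquad A(s,a_1) = 4-5 = -1, \qquad A(s,a_2) = 8-5 = 3, \]
so the update pushes probability away from \(a_1\) and toward \(a_2\).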
\[ L^{PG}(\theta) = \hat{E}\left[ \hat{A}_t \log \pi_\theta(a_t|s_t)\right], \qquad \hat{g} = \nabla_\theta L^{PG}(\theta) \]
where \(\hat{A}_t\) is an estimate of the advantage function and \(\hat{E}\) denotes the empirical average over a batch of samples.
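As a minimal sketch (NumPy, with made-up batch values), the empirical objective \(L^{PG}\) is just the batch average of \(\hat{A}_t\log\pi_\theta(a_t|s_t)\):

```python
import numpy as np

# L^PG = mean over the batch of A_hat_t * log pi_theta(a_t|s_t).
def pg_objective(log_probs, advantages):
    # log_probs: log pi_theta(a_t|s_t) for each sampled (s_t, a_t)
    # advantages: corresponding advantage estimates A_hat_t
    return np.mean(advantages * log_probs)

# Toy batch of three samples (illustrative numbers only).
log_probs = np.log(np.array([0.20, 0.50, 0.10]))
advantages = np.array([1.3, -0.4, 0.7])
print(pg_objective(log_probs, advantages))
```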
Problem: gradient steps on \(L^{PG}(\theta)\) can produce destructively large policy updates. This is why we saw such small step sizes \(\alpha\) used with REINFORCE in the book.
A surrogate objective is maximized subject to a constraint on the size of the policy update (the trust-region / TRPO formulation) \[ \begin{align} \max_\theta &\;\; \hat{E}_t\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t\right) \\ \text{s.t.} &\;\; \hat{E}_t\left(\text{KL}[\pi_{\theta_\text{old}}(\cdot|s_t),\pi_\theta(\cdot|s_t)]\right) \leq \delta \end{align} \]
Use a penalty instead of a constraint \[ \max_\theta\;\; \hat{E}_t\left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}\hat{A}_t - \beta\text{KL}[\pi_{\theta_\text{old}}(\cdot|s_t),\pi_\theta(\cdot|s_t)]\right) \]
Tuning \(\beta\) is problematic: it is hard to choose a single value that works across different problems, or even throughout learning on a single problem.
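A minimal NumPy sketch of the penalized objective for a single state with a discrete action space (names and numbers are illustrative, not the authors' code):

```python
import numpy as np

# Penalized surrogate at one timestep: r_t(theta) * A_hat_t - beta * KL[pi_old, pi_new].
def kl_penalized_objective(pi_old, pi_new, a_t, adv, beta):
    ratio = pi_new[a_t] / pi_old[a_t]                # r_t(theta) for the sampled action
    kl = np.sum(pi_old * np.log(pi_old / pi_new))    # KL[pi_old(.|s_t), pi_new(.|s_t)]
    return ratio * adv - beta * kl

pi_old = np.array([0.50, 0.30, 0.20])   # pi_theta_old(.|s_t)
pi_new = np.array([0.60, 0.25, 0.15])   # pi_theta(.|s_t)
print(kl_penalized_objective(pi_old, pi_new, a_t=0, adv=1.0, beta=0.5))
```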
\[ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}, \qquad r_t(\theta_\text{old}) = 1 \]
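Implementations typically compute the ratio from stored log-probabilities, \(r_t = \exp(\log\pi_\theta(a_t|s_t) - \log\pi_{\theta_\text{old}}(a_t|s_t))\), which avoids dividing raw probabilities; a small sketch with made-up numbers:

```python
import numpy as np

# r_t(theta) = exp(log pi_theta(a_t|s_t) - log pi_theta_old(a_t|s_t)).
log_prob_new = np.log(np.array([0.30, 0.10, 0.55]))
log_prob_old = np.log(np.array([0.25, 0.20, 0.55]))
ratios = np.exp(log_prob_new - log_prob_old)
print(ratios)   # [1.2, 0.5, 1.0]; note r_t(theta_old) = 1 by construction
```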
Conservative policy iteration (CPI): \(L^\text{CPI}(\theta) = \hat{E}_t \left[ r_t(\theta) \hat{A}_t \right]\).
Without a constraint, maximizing \(L^\text{CPI}\) leads to excessively large policy updates.
\[ L^\text{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] \]
Clipping the ratio removes the incentive to move \(r_t\) outside the interval \([1-\epsilon,1+\epsilon]\); taking the minimum of the clipped and unclipped terms makes \(L^\text{CLIP}\) a pessimistic (lower) bound on the unclipped objective.
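A minimal NumPy sketch of the per-timestep clipped surrogate (illustrative values; with \(\hat{A}_t>0\) the objective gains nothing from pushing \(r_t\) above \(1+\epsilon\)):

```python
import numpy as np

# Per-timestep clipped surrogate: min(r_t * A_hat_t, clip(r_t, 1-eps, 1+eps) * A_hat_t).
def clipped_surrogate(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

ratios = np.array([0.5, 1.0, 1.5])        # r_t(theta) values
advantages = np.array([2.0, 2.0, 2.0])    # positive advantages
print(clipped_surrogate(ratios, advantages))   # [1.0, 2.0, 2.4]: gain is capped at (1+eps)*A
```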
\(N\) = number of parallel instances of the environment.
The agent runs \(\pi_{\theta_\text{old}}\) in each of the \(N\) environment instances to collect a batch of data.
The surrogate is then optimized with SGD or Adam on minibatches of that data; a structural sketch follows.
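A structural sketch of the collect-then-update loop, using hypothetical helpers `collect_rollouts` and `update_on_minibatch` as placeholders (not the paper's code; hyperparameters are illustrative):

```python
import numpy as np

N, T, EPOCHS, MINIBATCH = 8, 128, 4, 64   # actors, horizon, update epochs, minibatch size

def collect_rollouts(n_envs, horizon):
    # Placeholder: would run pi_theta_old in n_envs parallel environments for `horizon`
    # steps and return (states, actions, old_log_probs, advantage_estimates).
    size = n_envs * horizon
    return np.zeros((size, 4)), np.zeros(size, int), np.zeros(size), np.zeros(size)

def update_on_minibatch(batch):
    # Placeholder: one SGD/Adam step on the clipped surrogate L^CLIP for this minibatch.
    pass

for iteration in range(3):
    states, actions, old_log_probs, advantages = collect_rollouts(N, T)
    idx = np.arange(len(advantages))
    for epoch in range(EPOCHS):
        np.random.shuffle(idx)
        for start in range(0, len(idx), MINIBATCH):
            mb = idx[start:start + MINIBATCH]
            update_on_minibatch((states[mb], actions[mb], old_log_probs[mb], advantages[mb]))
    # After the update epochs, pi_theta_old <- pi_theta before collecting the next batch.
```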
The paper compares PPO with other algorithms considered to be effective for continuous-control problems.