PPO is commonly referred to as an on-policy algorithm. We argue that this naming scheme is confusing, and show that truly on-policy PPO reduces to the vanilla policy gradient, i.e. REINFORCE with baseline.
Reinforcement learning is used in LLM post-training because we cannot backpropagate through generation.
The original policy gradient $$\hat{g}=\hat{\mathbb{E}}_t [\nabla_\theta \log \pi_\theta (a_t|s_t)\hat{A}_t]$$ intuitively increases the probability of actions that led to high returns, and decreases that of actions that led to low returns.
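For concreteness, here is a minimal PyTorch sketch of this update for a toy categorical policy (the tensor names and values are illustrative, not taken from any particular implementation):

```python
import torch

# Toy batch: a categorical policy over 4 actions, 3 sampled timesteps.
logits = torch.randn(3, 4, requires_grad=True)   # policy outputs for each state s_t
actions = torch.tensor([0, 2, 1])                # sampled actions a_t
advantages = torch.tensor([1.0, -0.5, 2.0])      # advantage estimates A_t (constants)

log_probs = torch.log_softmax(logits, dim=-1)
logp_taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t|s_t)

# Surrogate whose gradient is  E_t[ A_t * grad log pi_theta(a_t|s_t) ].
loss = -(logp_taken * advantages).mean()
loss.backward()   # logits.grad now holds the (negative) policy gradient estimate
```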
PPO is a two-fold modification of the vanilla policy gradient.
The first modification extends the vanilla policy gradient to the off-policy regime: the policy gradient theorem assumes that the behaviour policy equals the target policy, because its expectation is taken over trajectories sampled from the target policy.
When the behaviour and target policies differ, a (further approximated) policy gradient with an importance-sampling correction can be used instead: $$\hat{g}=\hat{\mathbb{E}}_t [\frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_\text{old}} (a_t|s_t)} \nabla_\theta \log \pi_\theta (a_t|s_t)\hat{A}_t]$$.
Rather than nudging the log-probabilities directly, this estimator nudges the importance sampling ratio $$\frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_\text{old}} (a_t|s_t)}$$, leading to the surrogate objective $$L^{CPI}=\hat{\mathbb{E}}_t [\frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_\text{old}} (a_t|s_t)} \hat{A}_t ]$$.
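A hedged sketch of $$L^{CPI}$$ in the same toy setup, where `old_log_probs` stands for the log-probabilities recorded under the behaviour policy at rollout time (the numbers are arbitrary placeholders):

```python
import torch

logits = torch.randn(3, 4, requires_grad=True)   # current policy pi_theta
actions = torch.tensor([0, 2, 1])
advantages = torch.tensor([1.0, -0.5, 2.0])
# Log-probs recorded under the behaviour policy pi_theta_old at rollout time;
# they enter the computation as plain constants (placeholder values here).
old_log_probs = torch.tensor([-1.2, -0.8, -1.5])

logp_taken = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
ratio = torch.exp(logp_taken - old_log_probs)    # pi_theta / pi_theta_old
loss_cpi = -(ratio * advantages).mean()          # negative L^CPI
loss_cpi.backward()
```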
A common source of confusion for newcomers to policy gradient methods is the meaning of $$\theta_{\text{old}}$$: it refers to the parameters of the policy that was used to gather the experience (commonly called the behaviour policy).
PPO's primary contribution over the vanilla policy gradient is increased sample-efficiency by reusing collected trajectories.
More specifically, trajectories are implicitly reused by performing multiple gradient updates on the same experience. Note that this is different from classical off-policy methods, which usually sample from a replay buffer.
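Schematically, the reuse looks like the loop below: one rollout batch, several optimizer steps. This is a toy sketch (unclipped surrogate, made-up data, a linear policy), not a full PPO implementation:

```python
import torch
import torch.nn as nn

def taken_logp(logits, actions):
    """log pi(a_t|s_t) for the actions that were actually taken."""
    return torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)

# One pre-collected rollout (toy tensors standing in for states/actions/advantages).
states = torch.randn(8, 5)
actions = torch.randint(0, 3, (8,))
advantages = torch.randn(8)

policy = nn.Linear(5, 3)                          # toy policy network
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behaviour-policy log-probs, frozen at rollout time (theta_old).
with torch.no_grad():
    old_logp = taken_logp(policy(states), actions)

# PPO performs several optimizer steps on the same batch before collecting new data.
for epoch in range(4):
    logp = taken_logp(policy(states), actions)
    ratio = torch.exp(logp - old_logp)
    loss = -(ratio * advantages).mean()           # unclipped surrogate, for brevity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```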
Therefore, a better (but slightly less scientific-sounding) way to characterize PPO is to call it an on-policy-ish algorithm, where the "-ish" reflects the fact that the behaviour and target policies in PPO remain fairly similar, unlike in classical off-policy methods.
However, even with the importance sampling ratio, taking multiple gradient steps on $$L^{CPI}$$ empirically leads to destructively large policy updates, which brings us to PPO's second contribution:
Instead of directly optimizing $$L^{CPI}$$, PPO optimizes $$L^{CLIP}=\hat{\mathbb{E}}_t [ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t) ]$$, where the clipping disincentivizes large changes to the policy within a single PPO step.
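A minimal sketch of the clipped surrogate as it typically appears in code, assuming the per-timestep log-probabilities and advantages are already available ($$\epsilon = 0.2$$ is a common default, not a prescription):

```python
import torch

def ppo_clip_loss(logp, old_logp, advantages, eps=0.2):
    """Negative L^CLIP for a batch of timesteps."""
    ratio = torch.exp(logp - old_logp)                           # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise min keeps the more pessimistic of the two terms.
    return -torch.min(unclipped, clipped).mean()
```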
PPO reduces to REINFORCE with baseline on the first optimizer step
In PPO, the behaviour and target policies are identical for the first (of potentially many) optimizer steps of each PPO step, yet the gradient at that first step is still non-zero. The reason is an implicit stop_gradient in the denominator of the importance sampling ratio: in PPO implementations, $$\pi_{\theta_{\text{old}}}(a_t|s_t)$$ is recorded during the rollout and enters the objective as a constant.
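The following toy snippet illustrates the point: the ratio is numerically one on the first step, yet its gradient with respect to the policy parameters does not vanish, because the denominator is a detached constant (a sketch with made-up data):

```python
import torch

logits = torch.randn(3, 4, requires_grad=True)    # theta (toy categorical policy)
actions = torch.tensor([0, 2, 1])

logp = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)

# On the first optimizer step theta == theta_old, so the "old" log-probs are the
# same numbers -- but stored without a graph, as if wrapped in stop_gradient.
old_logp = logp.detach()

ratio = torch.exp(logp - old_logp)
print(ratio)                      # tensor([1., 1., 1.]): numerically one
ratio.sum().backward()
print(logits.grad.abs().sum())    # strictly positive: the gradient does not vanish
```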
Specifically, the gradient at the first optimizer step reduces to the gradient of REINFORCE with baseline (and GAE).
At this initial step, the target parameters $$\theta$$ are identical to the behaviour parameters $$\theta_{\text{old}}$$, meaning the ratio $$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$ numerically evaluates to 1.
Since $$1-\epsilon < 1 < 1+\epsilon$$, the clipping mechanism is inactive, and the objective matches $$L^{CPI}$$ with the gradient $$ \nabla_\theta L^{CPI} = \hat{\mathbb{E}}_t [ \hat{A}_t \nabla_\theta r_t(\theta) ] $$.
Using the identity $$ \nabla_\theta r_t(\theta) = r_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t) $$, the gradient becomes $$ \nabla_\theta L^{CPI} = \hat{\mathbb{E}}_t [ \hat{A}_t r_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t) ] $$. Evaluating this expression when $$r_t(\theta)=1$$ yields:
$$ \nabla_\theta L^{CLIP}(\theta)|_{\theta=\theta_{\text{old}}} = \hat{\mathbb{E}}_t [ \hat{A}_t \nabla_\theta \log \pi_\theta(a_t|s_t) ] $$
This is exactly the gradient of REINFORCE using the generalized advantage estimate $$\hat{A}_t$$. PPO's clipping only modifies gradients in subsequent steps as $$\theta$$ diverges from $$\theta_{\text{old}}$$.
When GAE is used with $$\gamma=1$$ and $$\lambda=1$$, the advantage estimate reduces to the empirical return minus the value baseline, and the first optimizer step of PPO fully reduces to that of REINFORCE with baseline.
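The reduction is easy to verify numerically: at $$\theta = \theta_{\text{old}}$$, the gradient of the clipped PPO surrogate coincides with the gradient of the REINFORCE-with-baseline surrogate. A self-contained toy check (random data, categorical policy; nothing here is tied to a specific codebase):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(6, 5, requires_grad=True)    # theta (toy categorical policy)
actions = torch.randint(0, 5, (6,))
advantages = torch.randn(6)                       # stands in for the GAE estimates
eps = 0.2

def taken_logp(lg):
    return torch.log_softmax(lg, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)

# Gradient of the REINFORCE-with-baseline surrogate.
reinforce_loss = -(taken_logp(logits) * advantages).mean()
reinforce_grad = torch.autograd.grad(reinforce_loss, logits)[0]

# Gradient of the clipped PPO surrogate at the first optimizer step (theta == theta_old).
logp = taken_logp(logits)
old_logp = logp.detach()                          # the implicit stop_gradient
ratio = torch.exp(logp - old_logp)
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
ppo_grad = torch.autograd.grad(ppo_loss, logits)[0]

print(torch.allclose(reinforce_grad, ppo_grad))   # True
```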
Notes
To derive the identity $$ \nabla_\theta r_t(\theta) = r_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t) $$, first note that $$ \nabla_\theta r_t(\theta) = \nabla_\theta \left[ \frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \right] = \frac{1}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \cdot \nabla_\theta \pi_\theta (a_t|s_t) $$, since $$ \theta_{\text{old}} $$ is treated as a constant.
Now we apply the log-derivative trick $$ \nabla_\theta f(\theta) = f(\theta) \nabla_\theta \log f(\theta) $$ to $$ \pi_\theta (a_t|s_t) $$ and get $$ \nabla_\theta \pi_\theta(a_t|s_t) = \pi_\theta(a_t|s_t) \nabla_\theta \log \pi_\theta(a_t|s_t) $$.
Substituting this back, we get $$ \nabla_\theta r_t(\theta) = \frac{1}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \cdot \pi_\theta(a_t|s_t) \nabla_\theta \log \pi_\theta (a_t|s_t) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \cdot \nabla_\theta \log \pi_\theta (a_t|s_t) = r_t(\theta) \nabla_\theta \log \pi_\theta(a_t|s_t) $$.
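The identity also holds when $$\theta \neq \theta_{\text{old}}$$, which can be confirmed with autograd on toy data:

```python
import torch

logits = torch.randn(3, 4, requires_grad=True)     # theta
old_logits = torch.randn(3, 4)                     # theta_old, a constant
actions = torch.tensor([1, 0, 3])

logp = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
old_logp = torch.log_softmax(old_logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
ratio = torch.exp(logp - old_logp)                 # r_t(theta)

# grad of sum_t r_t(theta)  vs.  sum_t r_t(theta) * grad log pi_theta(a_t|s_t)
lhs = torch.autograd.grad(ratio.sum(), logits, retain_graph=True)[0]
rhs = torch.autograd.grad((ratio.detach() * logp).sum(), logits)[0]
print(torch.allclose(lhs, rhs))                    # True
```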
Contributions
MM and FS worked on research and analysis; FS wrote the manuscript. We thank Gemini 2.5 Pro for deriving the identity.