When using PPO in LLM post-training, the typical hyperparameter settings turn Generalized Advantage Estimation into Monte Carlo Advantage Estimation.


Although PPO originally uses Generalized Advantage Estimation, modern LLM post-training usually employs $$\lambda=1$$, $$\gamma=1$$, which means that we use full rollouts and do not discount future rewards. Disabling discounting is natural in the post-training regime, since the trajectories are not very long. Using full rollouts means that we choose not to trade increased bias for decreased variance: in post-training we want the policy updates to be as stable as possible, but we also want them to be as correct as possible. Since there are other ways of reducing variance (e.g. increasing the batch size), we refrain from introducing bias.
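To make the role of the two hyperparameters concrete, here is a minimal sketch of a standard GAE computation over a finite rollout (the function name and array layout are illustrative, not taken from any particular library). With `gamma=1.0` and `lam=1.0`, the recursion telescopes into the full-return advantage described below.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over a finite rollout.

    rewards: per-step rewards r_0, ..., r_{T-1}
    values:  value estimates V(s_0), ..., V(s_T); the last entry is the
             bootstrap value after the final step (0 for a finished episode).
    With gamma=1 and lam=1 this reduces to A_t = sum_{l>=0} r_{t+l} - V(s_t).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```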

The original authors define the generalized advantage estimator as $$\hat{A}_t^{GAE(\gamma,\lambda)} := \sum_{l=0}^{\infty}(\gamma \lambda)^l \delta^V_{t+l}$$. They already show that in the setting $$\lambda=1$$, the advantage estimate reduces to $$\hat{A}_t := \sum_{l=0}^{\infty}\gamma^l \delta_{t+l} = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t)$$. Additionally setting $$\gamma=1$$ reduces this to $$\hat{A}_t := \sum_{l=0}^{\infty}r_{t+l} - V(s_t)$$, the empirical return minus a value function baseline. Since in post-training we only get a sparse reward $$r_T$$ at the end of the trajectory, the advantage becomes $$\hat{A}_t := r_{T} - V(s_t)$$.
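As a quick sanity check of the sparse-reward case, we can run the `gae_advantages` sketch above on a rollout where only the final token receives a reward (the numbers are made up for illustration): every entry comes out as $$r_T - V(s_t)$$.

```python
# Sparse terminal reward: only the last step gets a nonzero reward.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])        # r_T = 1 at the end
values = np.array([0.4, 0.5, 0.6, 0.7, 0.8, 0.0])    # V(s_0)..V(s_T), terminal V = 0

adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
# Each entry equals r_T - V(s_t): [0.6, 0.5, 0.4, 0.3, 0.2]
```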

This is simply Monte Carlo Advantage Estimation. Thus in the LLM post-training setting, the difference between REINFORCE with baseline and PPO boils down to PPO's clipping and the use of a likelihood ratio:
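Written out in the standard formulations, the two surrogate objectives (both maximized) are

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

A minimal PyTorch-style sketch of the two losses is given below; the function names are illustrative, and details such as advantage normalization, masking of prompt tokens, and any KL penalty are omitted.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate (to be minimized).

    logprobs:     log pi_theta(a_t | s_t) under the current policy
    old_logprobs: log pi_theta_old(a_t | s_t) from the rollout policy (detached)
    advantages:   Monte Carlo advantages, here r_T - V(s_t)
    """
    ratio = torch.exp(logprobs - old_logprobs)  # likelihood ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def reinforce_baseline_loss(logprobs, advantages):
    """Negative REINFORCE-with-baseline surrogate (to be minimized)."""
    return -(logprobs * advantages).mean()
```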

Contributions

FS worked on all aspects of this post, including research, analysis and writing.