When using PPO in LLM post-training, the typical hyperparameter settings ($$\gamma=1$$ and $$\lambda=1$$) turn Generalized Advantage Estimation into Monte Carlo Advantage Estimation.
Although PPO originally uses Generalized Advantage Estimation (GAE), these settings collapse it into a much simpler estimator.
The original authors define the generalized advantage as $$\hat{A}_t^{GAE(\gamma,\lambda)} := \sum_{l=0}^{\infty}(\gamma \lambda)^l \delta^V_{t+l}$$. They already show that in the setting $$\lambda=1$$, the advantage estimate reduces to $$\hat{A}_t := \sum_{l=0}^{\infty}\gamma^l \delta^V_{t+l} = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t)$$. Additionally setting $$\gamma=1$$ reduces this to $$\hat{A}_t := \sum_{l=0}^{\infty}r_{t+l} - V(s_t)$$, the empirical return minus a value-function baseline. Since in post-training we only get a sparse reward at the end of the trajectory, the advantage becomes $$\hat{A}_t := r_{T} - V(s_t)$$.
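To make the reduction concrete, here is a minimal numerical sketch (assuming a finite trajectory with bootstrap value $$V(s_T)=0$$ after the last token; the function and variable names are illustrative, not taken from any particular library): with $$\gamma=\lambda=1$$ and a single terminal reward, every token's GAE advantage collapses to $$r_T - V(s_t)$$.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over a finite trajectory.

    rewards[t] = r_t and values[t] = V(s_t); the episode ends after the
    last step, so the bootstrap value V(s_T) is taken to be 0.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual delta^V_t
        last_adv = delta + gamma * lam * last_adv             # GAE recursion
        advantages[t] = last_adv
    return advantages

# Sparse LLM-style reward: zero everywhere except the final token.
T = 6
rewards = np.zeros(T)
rewards[-1] = 1.0                        # terminal reward r_T (e.g. a reward-model score)
values = np.random.uniform(0.0, 1.0, T)  # arbitrary value estimates V(s_t)

adv = gae(rewards, values, gamma=1.0, lam=1.0)
print(np.allclose(adv, rewards[-1] - values))  # True: A_t = r_T - V(s_t) for every t
```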
This is simply Monte Carlo Advantage Estimation. Thus, in the LLM post-training setting, the difference between REINFORCE with a baseline and PPO's advantage estimation disappears: both weight each token by the same quantity, $$r_T - V(s_t)$$.
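Continuing the sketch above (again an illustration, not the original authors' code), the REINFORCE-with-baseline advantage, i.e. the empirical return minus the value baseline, matches the collapsed GAE values token for token:

```python
# REINFORCE with baseline: advantage = return-to-go minus baseline b(s_t) = V(s_t).
# With a single terminal reward, the return-to-go from every token is just r_T.
reinforce_adv = rewards.sum() - values
print(np.allclose(adv, reinforce_adv))  # True: identical per-token advantages
```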
FS worked on all aspects of this post, including research, analysis and writing.