PPO Is Secretly Using Monte Carlo Advantage Estimation In LLM Post-Training

When using PPO in LLM post-training, the common hyperparameter settings \lambda=1, \gamma=1 turn Generalized Advantage Estimation into Monte Carlo Advantage Estimation.

Authors: Franz Srambical

Affiliations: p(doom)

Published: Feb. 12, 2025

DOI: No DOI yet.

Although PPO originally uses Generalized Advantage Estimation (GAE) [1], modern LLM post-training usually employs \lambda=1, \gamma=1, which means that we use full rollouts and do not discount future rewards. Disabling discounting is natural in the post-training regime, since the trajectories are not very long. Using full rollouts means that we do not trade increased bias for decreased variance: in post-training we want the policy updates to be as stable as possible, but we also want them to be as correct as possible. Since there are other ways of reducing variance (e.g. increasing the batch size), we refrain from introducing bias.
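As a concrete reference, below is a minimal NumPy sketch of GAE over a single finite rollout; the function name and array conventions are illustrative rather than taken from any particular training framework. Setting gamma = lam = 1 recovers the Monte Carlo advantage discussed next.

import numpy as np

# Illustrative sketch, not taken from any specific RL library.
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    # rewards[t] is the reward received after step t, for t = 0 .. T-1.
    # values[t] is V(s_t) for t = 0 .. T; values[T] is the bootstrap value,
    # which is 0 when the final state is terminal (as in single-turn post-training).
    T = len(rewards)
    advantages = np.zeros(T)
    next_advantage = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages

# With a sparse terminal reward and gamma = lam = 1, every advantage
# reduces to the terminal reward minus the value baseline at that step.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
values = np.array([0.3, 0.1, -0.2, 0.4, 0.2, 0.0])  # V(s_0) .. V(s_5), terminal value 0
adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
assert np.allclose(adv, rewards[-1] - values[:-1])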

The original authors define the generalized advantage estimator as \hat{A}_t^{GAE(\gamma,\lambda)} := \sum_{l=0}^{\infty}(\gamma \lambda)^l \delta^V_{t+l}. They already show that in the setting \lambda=1, the advantage estimator reduces to \hat{A}_t := \sum_{l=0}^{\infty}\gamma^l \delta^V_{t+l} = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t). Additionally setting \gamma=1 reduces this to \hat{A}_t := \sum_{l=0}^{\infty}r_{t+l} - V(s_t), the empirical return minus a value-function baseline. Since in post-training we only receive a sparse reward at the end of the trajectory, the advantage becomes \hat{A}_t := r_{T} - V(s_t).
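To make the reduction explicit: at \gamma=\lambda=1 the TD residuals \delta^V_t = r_t + V(s_{t+1}) - V(s_t) telescope. Writing the infinite sum as a finite one (all terms after the terminal step vanish, and we take the post-terminal value to be zero), a quick sketch of the steps is

\hat{A}_t = \sum_{l=0}^{T-t} \delta^V_{t+l} = \sum_{l=0}^{T-t} \left( r_{t+l} + V(s_{t+l+1}) - V(s_{t+l}) \right) = \sum_{l=0}^{T-t} r_{t+l} + V(s_{T+1}) - V(s_t) = r_T - V(s_t),

where the last equality uses the sparse-reward assumption (only r_T is nonzero) and V(s_{T+1}) = 0.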

This is simply Monte Carlo Advantage Estimation. Thus, in the LLM post-training setting, the difference between REINFORCE with baseline [2] and PPO boils down to PPO's clipping and its use of a likelihood ratio.
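For reference, here is a sketch of the two objectives in standard notation (the likelihood ratio \rho_t(\theta) := \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} and clipping parameter \epsilon are assumed notation, not reproduced from either paper):

L^{REINFORCE}(\theta) = \hat{\mathbb{E}}_t\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right], \qquad L^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( \rho_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \right) \right],

with \hat{A}_t = r_T - V(s_t) in both cases.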

Contributions

FS worked on all aspects of this post, including research, analysis and writing.

Citation

For attribution in academic contexts, please cite this work as

Srambical, "PPO Is Secretly Using Monte Carlo Advantage Estimation In LLM Post-Training", p(doom), 2025.

BibTeX citation

@article{srambical2025ppo,
  author = {Srambical, Franz},
  title = {PPO Is Secretly Using Monte Carlo Advantage Estimation In LLM Post-Training},
  journal = {p(doom) blog},
  year = {2025},
  note = {https://pdoom.org/blog.html}
}

References

1. Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
2. Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, Vol 8, pp. 229-256. Springer.