Although PPO originally uses Generalized Advantage Estimation (GAE),
modern LLM post-training usually employs \lambda=1, \gamma=1, which means that we use full rollouts
and do not discount future rewards. Disabling discounting is natural in the post-training regime, since the trajectories
are relatively short. Using full rollouts means that we choose not to trade increased bias for decreased variance.
In post-training, we want the policy updates to be as stable as possible, but we also want them to be as
correct as possible. Since there are other ways of reducing variance (e.g. increasing the batch size), we refrain
from introducing bias.
The original authors define generalized advantage as
\hat{A}_t^{GAE(\gamma,\lambda)} := \sum_{l=0}^{\infty}(\gamma \lambda)^l \delta^V_{t+l}.
They already show that in the setting \lambda=1,
the advantage estimate reduces to \hat{A}_t := \sum_{l=0}^{\infty}\gamma^l \delta^V_{t+l} = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t).
Additionally setting \gamma=1 reduces this to \hat{A}_t := \sum_{l=0}^{\infty}r_{t+l} - V(s_t), the empirical return minus a value-function baseline.
Since in post-training we only receive a sparse reward at the end of the trajectory, the advantage becomes \hat{A}_t := r_{T} - V(s_t), which is simply the Monte Carlo advantage estimate (see the sanity check below).
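As a sanity check of this reduction, here is a minimal sketch (illustrative code, not from the original post) that computes GAE over a single finite trajectory and confirms that with \lambda=1, \gamma=1 and a single terminal reward the per-token advantages equal r_T - V(s_t):

```python
# Illustrative sketch: GAE over one finite trajectory (assumed helper, not the post's code).
import numpy as np

def gae(rewards, values, gamma=1.0, lam=1.0):
    """rewards: [T] per-step rewards; values: [T+1] with the terminal value set to 0."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t^V
        running = delta + gamma * lam * running                  # accumulates the (gamma * lam)^l terms
        advantages[t] = running
    return advantages

# Sparse-reward LLM setting: only the final token carries a reward.
rewards = np.array([0.0, 0.0, 0.0, 1.0])        # r_T = 1
values  = np.array([0.3, 0.5, 0.4, 0.6, 0.0])   # V(s_0), ..., V(s_T), terminal value 0

print(gae(rewards, values, gamma=1.0, lam=1.0))  # [0.7 0.5 0.6 0.4]
print(rewards[-1] - values[:-1])                 # identical: r_T - V(s_t)
```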
Thus, in the LLM post-training setting, the difference between REINFORCE with baseline
and PPO boils down to PPO's clipping and its use of a likelihood ratio:
-
\mathcal{L}^{\text{CLIP}}=\mathbb{E}[\min(\text{ratio}_t(\theta)(r_T-V(s_t)), \text{clip}(\text{ratio}_t(\theta), 1-\epsilon, 1+\epsilon)(r_T-V(s_t)))]
, where \text{ratio}_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}.
-
\mathcal{L}^{\text{REINFORCE}}=\mathbb{E}[\log \pi_\theta(a_t|s_t)(r_T-V(s_t))]
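For concreteness, here is a minimal sketch of the two objectives written as losses to minimize (hypothetical function and tensor names; logp, logp_old and advantage are assumed to have shape [batch, seq_len], with advantage = r_T - V(s_t) broadcast over the tokens of each response):

```python
# Illustrative sketch of the two objectives, not a reference implementation.
import torch

def ppo_clip_loss(logp, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate with the Monte Carlo advantage r_T - V(s_t).

    logp_old comes from the rollout and carries no gradient.
    """
    ratio = torch.exp(logp - logp_old)                         # pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.mean(torch.min(unclipped, clipped))          # negate to turn the objective into a loss

def reinforce_with_baseline_loss(logp, advantage):
    """REINFORCE with a value-function baseline, using the same advantage."""
    return -torch.mean(logp * advantage)
```

At \theta = \theta_{\text{old}} (ratio = 1) the clipping is inactive and the two losses have identical gradients; they only diverge once the policy moves away from the sampling policy during an update.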
Contributions
FS worked on all aspects of this post, including research, analysis and writing.
Citation
For attribution in academic contexts, please cite this work as
Srambical, "PPO Is Secretly Using Monte Carlo Advantage Estimation In LLM Post-Training", p(doom), 2025.
BibTeX citation
@article{srambical2025ppo,
author = {Srambical, Franz},
title = {PPO Is Secretly Using Monte Carlo Advantage Estimation In LLM Post-Training},
journal = {p(doom) blog},
year = {2025},
note = {https://pdoom.org/blog.html}
}
References
- High-dimensional continuous control using generalized advantage estimation. Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. arXiv preprint arXiv:1506.02438.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. Williams, R.J., 1992. Machine Learning, Vol 8, pp. 229-256. Springer.