The GAE Implementation Of RLax Is Wrong

Franz Srambical

RLax does not adhere to the original GAE formula when computing the last advantage of a trajectory.
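For reference, the original formulation (Schulman et al., 2015) defines the advantage as a discounted sum of TD residuals, computed in practice by a backward recursion:

\[
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}
\quad\Longleftrightarrow\quad
\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}.
\]

For a rollout truncated at step \(T\), the recursion bottoms out at \(\hat{A}_{T-1} = \delta_{T-1}\); it is this last step where the post locates the discrepancy.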

PPO Is Secretly Using Monte Carlo Advantage Estimation In LLM Post-Training

Franz Srambical

When using PPO in LLM post-training, the typical hyperparameter settings turn Generalized Advantage Estimation into Monte Carlo advantage estimation.
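The reduction itself is a standard identity, independent of the post: with \(\lambda = 1\), the sum of discounted TD residuals telescopes, and every intermediate value estimate cancels:

\[
\hat{A}_t = \sum_{l=0}^{T-t-1} \gamma^{l}\,\delta_{t+l}
= \sum_{l=0}^{T-t-1} \gamma^{l} r_{t+l} + \gamma^{T-t} V(s_T) - V(s_t)
= G_t - V(s_t),
\]

where the last equality uses \(V(s_T) = 0\) at termination. The value function survives only as a baseline subtracted from the Monte Carlo return \(G_t\), which is precisely Monte Carlo advantage estimation.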

NNs Do Not Generalize OOD

Franz Srambical, Mihir Mahajan

Neural networks are mean-seeking. They work well when you run inference on data points that lie near the mean of their training data. Otherwise, they fail embarrassingly.
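A hypothetical toy experiment (ours, purely illustrative, not from the post) that makes the claim concrete: fit a small MLP to \(\sin(x)\) on \(x \in [-\pi, \pi]\), then query it far outside that range.

```python
import jax
import jax.numpy as jnp

def init(key):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (1, 64)) * 0.5, "b1": jnp.zeros(64),
        "w2": jax.random.normal(k2, (64, 1)) * 0.5, "b2": jnp.zeros(1),
    }

def mlp(params, x):
    h = jnp.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def loss(params, x, y):
    return jnp.mean((mlp(params, x) - y) ** 2)

@jax.jit
def step(params, x, y, lr=1e-2):
    # Plain full-batch gradient descent, enough for a toy problem.
    grads = jax.grad(loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

x = jnp.linspace(-jnp.pi, jnp.pi, 256).reshape(-1, 1)
y = jnp.sin(x)
params = init(jax.random.PRNGKey(0))
for _ in range(10_000):
    params = step(params, x, y)

print(mlp(params, jnp.array([[0.5]])))         # in-distribution: approaches sin(0.5)
print(mlp(params, jnp.array([[3 * jnp.pi]])))  # OOD: bears no relation to sin(3*pi) = 0
```

Because the tanh units saturate, the network extrapolates to a near-constant value outside the training range, regardless of what the true function does there.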

Going Beyond the Causal Mask in Language Modeling

Franz Srambical

Although the causal mask is used ubiquitously in large-scale language modeling, its necessity is seldom questioned in the literature. Why do we really need the causal mask?
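For context, a minimal sketch in JAX of the mechanism being questioned (the function name is ours): position \(t\) may only attend to positions \(\le t\).

```python
import jax
import jax.numpy as jnp

def causal_attention_weights(scores):
    """Apply the standard causal mask to raw attention logits of shape (T, T)."""
    T = scores.shape[-1]
    mask = jnp.tril(jnp.ones((T, T), dtype=bool))  # position t sees positions <= t
    masked = jnp.where(mask, scores, -jnp.inf)     # future positions get zero weight
    return jax.nn.softmax(masked, axis=-1)

weights = causal_attention_weights(jax.random.normal(jax.random.PRNGKey(0), (4, 4)))
print(weights)  # upper triangle is zero: no token attends to the future
```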

ACT: Adaptive Compute Transformer

Mihir Mahajan, Franz Srambical

Large language models exhibit remarkable reasoning capabilities with scale. However, a fundamental flaw of current-generation transformer-based language models is their uniform allocation of compute per token.