NNs Do Not Generalize OOD

Franz Srambical, Mihir Mahajan

Neural networks are mean-seeking. They work well at inference time on data points that lie near the mean of their training distribution; they fail embarrassingly otherwise.
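
As a minimal illustration of this failure mode (an illustrative sketch, not code from the post; it assumes scikit-learn and a toy regression target), fit a small MLP to y = x² on inputs drawn from [-1, 1] and then query it well outside that range:

```python
# Toy demonstration: an MLP fit on x in [-1, 1] fails to extrapolate.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=(2048, 1))
y_train = (x_train ** 2).ravel()

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
model.fit(x_train, y_train)

for x in (0.5, 1.0, 3.0, 10.0):  # the last two are far out-of-distribution
    pred = model.predict([[x]])[0]
    print(f"x={x:>5}: prediction={pred:9.3f}, true={x**2:9.3f}")
```

In-distribution predictions land close to the true values; the out-of-distribution ones do not, since a ReLU network can only extrapolate piecewise-linearly beyond its training range.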

Going Beyond the Causal Mask in Language Modeling

Franz Srambical

Although the causal mask is ubiquitous in large-scale language modeling, its necessity is seldom questioned in the literature. Why do we really need the causal mask?
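
For reference, a minimal NumPy sketch (not code from the post) of what the causal mask does in scaled dot-product attention: logits above the diagonal are set to -inf, so position i cannot attend to positions j > i.

```python
# Causal (autoregressive) self-attention with an explicit lower-triangular mask.
import numpy as np

def causal_attention(q, k, v):
    """q, k, v: arrays of shape (seq_len, d)."""
    seq_len, d = q.shape
    logits = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)     # block attention to future positions
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v

q = k = v = np.random.default_rng(0).normal(size=(4, 8))
print(np.round(causal_attention(q, k, v), 3))
```

Dropping or relaxing this mask changes which positions each token may condition on during training, which is the question the post takes up.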

ACT: Adaptive Compute Transformer

Mihir Mahajan, Franz Srambical

Large language models exhibit remarkable reasoning capabilities with scale. However, a fundamental flaw of current-generation transformer-based language models is that they allocate the same amount of compute to every token.
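
To make the uniform-compute claim concrete, here is an illustrative NumPy sketch (a toy stand-in for transformer blocks, not the ACT mechanism itself): every token passes through every layer, so each token accrues exactly the same cost regardless of how hard it is to predict.

```python
# Every token pays the same per-layer cost in a vanilla transformer-style stack.
import numpy as np

def block(x, w1, w2):
    # Toy stand-in for a transformer block: a two-layer ReLU MLP applied per token.
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
d, n_layers, seq_len = 16, 6, 5
weights = [(rng.normal(size=(d, 4 * d)) / np.sqrt(d),
            rng.normal(size=(4 * d, d)) / np.sqrt(4 * d))
           for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d))           # one hidden state per token
flops_per_token = np.zeros(seq_len)
for w1, w2 in weights:                      # every layer processes every token
    x = block(x, w1, w2)
    flops_per_token += 2 * d * (4 * d) + 2 * (4 * d) * d  # identical increment per token

print(flops_per_token)                      # the same number for every token
```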