Going Beyond the Causal Mask in Language Modeling

Franz Srambical

Although the causal mask is used ubiquitously in large-scale language modeling, its necessity is seldom questioned in the literature. Why do we really need the causal mask?
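For context, the causal mask in question is conventionally a lower-triangular matrix applied to attention scores so that each position cannot attend to future positions. A minimal NumPy sketch of this standard construction (function names are illustrative, not from the post):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # True where attention is allowed: position i may attend to j only if j <= i
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def apply_causal_mask(scores: np.ndarray) -> np.ndarray:
    # Set disallowed (future) positions to -inf so softmax assigns them zero weight
    mask = causal_mask(scores.shape[-1])
    return np.where(mask, scores, -np.inf)

# Example: a 4x4 grid of raw attention scores
scores = np.zeros((4, 4))
masked = apply_causal_mask(scores)
```

After masking, row i of `masked` keeps only entries for positions 0..i; everything above the diagonal is `-inf`.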

ACT: Adaptive Compute Transformer

Mihir Mahajan, Franz Srambical

Large language models exhibit remarkable reasoning capabilities with scale. However, current-generation transformer-based language models share a fundamental flaw: they allocate the same amount of compute to every token, regardless of its difficulty.