ACT: Adaptive Compute Transformer
Mihir Mahajan, Franz Srambical
Large language models exhibit remarkable reasoning capabilities as they scale. However, a fundamental limitation of current-generation transformer-based language models is that they allocate a uniform amount of compute to every token, regardless of how difficult that token is to predict.