Large language models exhibit remarkable reasoning capabilities as they scale. However, a fundamental flaw of current-generation transformer-based language models is that they allocate a uniform amount of compute to every token.
A vanilla transformer (and every variation currently in use) generates each token after one, and only one, forward pass through the network. Intuitively, this means that at inference time the network thinks for the same amount of time before generating each token. While this is not an inherent limitation on the reasoning capabilities of transformer models, it does mean that we would need obscenely large transformers before they exhibit longer-term planning capabilities. Since we posit that (standalone) models with more than 1 trillion parameters are already infeasible to serve at scale, we need to overcome this limitation of the transformer architecture.
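To make the "same amount of compute per token" point concrete, here is a minimal sketch of a standard greedy decoding loop. It assumes `model` is any decoder-only transformer that maps token ids to per-position logits; the function name `generate` and the omission of a KV cache are simplifications for illustration, not a description of any particular system.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids: torch.Tensor, n_new_tokens: int) -> torch.Tensor:
    """Greedy autoregressive decoding: each new token costs exactly one forward pass."""
    ids = prompt_ids                                              # shape: (1, prompt_len)
    for _ in range(n_new_tokens):
        logits = model(ids)                                       # ONE forward pass per new token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick at the last position
        ids = torch.cat([ids, next_id], dim=-1)                   # append and repeat
    return ids
```

However hard the next token is to predict, the loop above spends exactly one pass through the network on it.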
While we have been aware of this limitation and of its severity for a long time (since well before p(doom) was even a term), today it is an open secret in the ML research community: everyone knows that all the big AGI labs (DeepMind, OpenAI, imbue, to name a few) are rushing to overcome it.
Most public discourse around scaling up reasoning capabilities involves developing methods to shift computational resources from training to inference time: while only a tiny fraction of the computational resources spent on LLMs goes to inference (>99% of compute goes to pretraining), systems like Pluribus have shown how much capability can be unlocked by spending additional compute on search at inference time.
We posit that one elegant way of addressing the transformer's limitation is adaptive computation. We want to find a way of letting the model think for as long as it deems necessary before each token generation. That way, we do not hard-code a 'reasoning architecture' into the model; instead, the model itself learns how much computation each token requires.
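As a deliberately simplified illustration of what we mean by adaptive computation, the sketch below re-applies a shared sub-block to each token's hidden state until a learned halting unit decides that token has been pondered enough, in the spirit of Adaptive Computation Time (Graves, 2016). The module name `AdaptiveDepthBlock`, the halting threshold, and the step budget are our illustrative assumptions, not a description of our eventual approach.

```python
import torch
import torch.nn as nn

class AdaptiveDepthBlock(nn.Module):
    """Re-applies a shared block to each token's hidden state until a learned
    halting unit decides that token has been 'thought about' for long enough."""

    def __init__(self, d_model: int, max_steps: int = 8, halt_threshold: float = 0.5):
        super().__init__()
        self.step_fn = nn.Sequential(                 # shared per-step computation
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.halt = nn.Linear(d_model, 1)             # per-token halting score
        self.max_steps = max_steps
        self.halt_threshold = halt_threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -- one hidden state per token
        running = torch.ones(h.shape[:-1], dtype=torch.bool, device=h.device)
        for _ in range(self.max_steps):
            update = self.step_fn(h)                                    # one more "thinking" step
            h = torch.where(running.unsqueeze(-1), h + update, h)       # only unhalted tokens change
            p_halt = torch.sigmoid(self.halt(h)).squeeze(-1)            # halting probability per token
            running = running & (p_halt < self.halt_threshold)
            if not running.any():                                       # every token has halted
                break
        return h
```

Under such a scheme, 'easy' tokens can halt after a single step while 'hard' tokens spend the full budget, so the depth of computation is chosen by the model rather than fixed by the architecture.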
MM worked on research and analysis, FS wrote the manuscript.