Large language models exhibit remarkable reasoning capabilities as they scale. However, a fundamental flaw of current-generation transformer-based language models is that they allocate a uniform amount of compute to every token.
A vanilla transformer (and every variation currently in use) generates each token after one, and only one, forward pass through the network. Intuitively, this means that at inference time the network thinks for the same amount of time before generating each token. While this is not an inherent limitation on the reasoning capabilities of transformer models, it does mean that we would need obscenely large transformers before they exhibit longer-term planning capabilities. Since we posit that (standalone) models with more than 1 trillion parameters are already infeasible to serve at scale, we need to overcome this limitation of the transformer architecture.
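To make the "same amount of compute per token" point concrete, here is a minimal sketch of a standard greedy decoding loop. It assumes `model` is any decoder-only transformer that maps token ids to per-position logits; the function name `generate` and the omission of a KV cache are simplifications for illustration, not a description of any particular system.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids: torch.Tensor, n_new_tokens: int) -> torch.Tensor:
    """Greedy autoregressive decoding: each new token costs exactly one forward pass."""
    ids = prompt_ids                                              # shape: (1, prompt_len)
    for _ in range(n_new_tokens):
        logits = model(ids)                                       # ONE forward pass per new token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick at the last position
        ids = torch.cat([ids, next_id], dim=-1)                   # append and repeat
    return ids
```

However hard the next token is to predict, the loop above spends exactly one pass through the network on it.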
While we have been aware of this limitation and of its severity for a long time (since well before p(doom) was even a term), today it is an open secret in the ML research community: everyone knows that all the big AGI labs (DeepMind, OpenAI, imbue, to name a few) are rushing to overcome it.
Most public discourse around scaling up reasoning capabilities involves developing methods to shift computational resources from training to inference time: while only a tiny fraction of the computational resources spent on LLMs goes to inference (>99% of compute goes to pretraining), systems like Pluribus have shown how much capability can be unlocked by spending additional compute on search at inference time.
We posit that one elegant way of addressing the transformer's limitation is adaptive computation. We want to find a way of letting the model think for as long as it deems necessary before each token generation. That way, we do not hard-code a 'reasoning architecture' into the model; instead, the model itself learns how much computation each token requires.
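As a deliberately simplified illustration of what we mean by adaptive computation, the sketch below re-applies a shared sub-block to each token's hidden state until a learned halting unit decides that token has been pondered enough, in the spirit of Adaptive Computation Time (Graves, 2016). The module name `AdaptiveDepthBlock`, the halting threshold, and the step budget are our illustrative assumptions, not a description of our eventual approach.

```python
import torch
import torch.nn as nn

class AdaptiveDepthBlock(nn.Module):
    """Re-applies a shared block to each token's hidden state until a learned
    halting unit decides that token has been 'thought about' for long enough."""

    def __init__(self, d_model: int, max_steps: int = 8, halt_threshold: float = 0.5):
        super().__init__()
        self.step_fn = nn.Sequential(                 # shared per-step computation
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        self.halt = nn.Linear(d_model, 1)             # per-token halting score
        self.max_steps = max_steps
        self.halt_threshold = halt_threshold

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -- one hidden state per token
        running = torch.ones(h.shape[:-1], dtype=torch.bool, device=h.device)
        for _ in range(self.max_steps):
            update = self.step_fn(h)                                    # one more "thinking" step
            h = torch.where(running.unsqueeze(-1), h + update, h)       # only unhalted tokens change
            p_halt = torch.sigmoid(self.halt(h)).squeeze(-1)            # halting probability per token
            running = running & (p_halt < self.halt_threshold)
            if not running.any():                                       # every token has halted
                break
        return h
```

Under such a scheme, 'easy' tokens can halt after a single step while 'hard' tokens spend the full budget, so the depth of computation is chosen by the model rather than fixed by the architecture.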
MM worked on research and analysis, FS wrote the manuscript.