Neural networks are mean-seeking. They work well when you run inference on data points that lie around the mean of their training data. They fail embarrassingly otherwise.


Currently, we exploit the paradigm of data-driven self-supervision: we use neural networks to approximate the underlying data-generating process of a training distribution until we run out of data. Once we have exhausted the data readily accessible to us, a system trained with naïve self-supervision on such a large-scale dataset has a reasonably good internal model of the neighborhoods of data-space that are densely represented in the training data. The further we move along the long tail of the data distribution, the worse the network becomes at modeling those data points in its representation-space.
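A minimal toy sketch of this effect, under purely illustrative assumptions: the training inputs, the target function, and the polynomial standing in for a neural network are all placeholders, not anything from a real training run. A fixed-capacity model fit to samples from a skewed distribution typically tracks the dense region well and degrades as the region gets sparser.

```python
# Toy illustration: a model fit on a skewed data distribution approximates the
# densely sampled region well and typically degrades along the tail.
import numpy as np

rng = np.random.default_rng(0)

# Training inputs are heavily concentrated near zero (exponential density),
# so the long tail (large x) is rarely observed.
x_train = rng.exponential(scale=1.0, size=10_000)
y_train = np.sin(x_train)

# Stand-in for a fixed-capacity network: fit a low-degree polynomial to the data.
model = np.poly1d(np.polyfit(x_train, y_train, deg=7))

# Evaluate squared error in regions of decreasing training density.
for lo, hi in [(0, 2), (2, 4), (4, 6), (6, 8)]:
    x_eval = np.linspace(lo, hi, 500)
    mse = np.mean((model(x_eval) - np.sin(x_eval)) ** 2)
    density = np.mean((x_train >= lo) & (x_train < hi))
    print(f"x in [{lo}, {hi}): train density {density:.3f}, test MSE {mse:.4f}")
```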

Language models are simulators. They will simulate anything, so long as you let them ingest enough imitation data at training time. While general, conversational data is abundant in a simple data dump of the internet, data of highly skilled behavior is scarce. Upsampling the desired behavioral data works until that data becomes too scarce. At that skill level, behavior cloning is no longer feasible, and we instead use reward signals to approximate how the model's parameters should be updated. Crucially, the entire progression from naïvely modeling internet data dumps towards attaining higher skill levels is handcrafted: dataset curation is handcrafted, upsampling is handcrafted, reward signals are handcrafted.
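A rough sketch of the two update rules involved; the policy, data, and reward function below are hypothetical placeholders, not any lab's actual training stack. Behavior cloning maximizes the log-probability of expert tokens; once demonstrations run out, a handcrafted reward reweights the model's own samples instead.

```python
# Minimal contrast between behavior cloning and a reward-weighted (REINFORCE-style)
# update, on a toy linear "policy head" with random placeholder data.
import torch
import torch.nn.functional as F

vocab, hidden = 100, 32
policy = torch.nn.Linear(hidden, vocab)  # toy stand-in for a language-model head
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

def behavior_cloning_step(states, expert_tokens):
    """Supervised imitation: maximize log-prob of expert tokens."""
    loss = F.cross_entropy(policy(states), expert_tokens)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def reward_weighted_step(states, reward_fn):
    """REINFORCE-style: sample tokens, reinforce them in proportion to reward."""
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    rewards = reward_fn(actions)  # handcrafted reward signal
    loss = -(dist.log_prob(actions) * rewards).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: random states, random "expert" tokens, a dummy reward preferring even ids.
states = torch.randn(64, hidden)
expert_tokens = torch.randint(0, vocab, (64,))
behavior_cloning_step(states, expert_tokens)
reward_weighted_step(states, lambda a: (a % 2 == 0).float())
```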

They should not be. In systems prospectively as capable as we are, they cannot be. At inference time, when given a long-horizon goal such as winning an IMO gold medal, an intelligent system should itself identify whether its internal model, and by extension the distribution of training data it has seen, is sufficient to solve the task at hand, and if not, what data to gather next. Beyond gathering new data points, such an endeavor requires some verification signal, external or intrinsic, to gauge whether progress towards the long-horizon goal is being made.
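One way such a loop could look, sketched under heavy assumptions: the uncertainty measure (predictive entropy), the acquisition rule, and the `verify` / `train` / `collect` hooks below are placeholders of our own choosing, not a prescription. The system assesses itself against the goal, gathers the data it is least sure about, and checks progress with a verification signal.

```python
# Active-learning-style sketch: self-assessment, targeted data gathering,
# and verification against a long-horizon goal.
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-example entropy of the model's predictive distribution."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def gather_next_data(model, candidate_pool: np.ndarray, budget: int) -> np.ndarray:
    """Pick the candidates the current model is most uncertain about."""
    scores = predictive_entropy(model.predict_proba(candidate_pool))
    return candidate_pool[np.argsort(scores)[-budget:]]

def pursue_goal(model, goal_tasks, candidate_pool, verify, train, collect,
                threshold: float, max_rounds: int = 10):
    """Alternate between self-assessment, data gathering, and verification."""
    for _ in range(max_rounds):
        if verify(model, goal_tasks) >= threshold:   # external or intrinsic signal
            return model                             # goal reached
        queries = gather_next_data(model, candidate_pool, budget=64)
        new_data = collect(queries)                  # e.g. experiments, tool use, search
        model = train(model, new_data)
    return model
```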

There are four largely orthogonal but additive directions towards AGI: acting, reasoning, continual learning, and the ability to gather the next most useful set of data points. While the first three are active research areas both in the academic literature and in closed AGI labs, the fourth is still largely neglected. We motivate it by the inherent inability of neural networks to generalize beyond their training distribution. Ultimately, such a capability would solve both curriculum learning and the ‘long-tail problem’ from first principles, allowing us to move beyond power-law scaling and into a new regime: one that leads to AGI in a world full of constraints.

Contributions

MM and FS worked on research and analysis; FS wrote the manuscript.