We introduce Jasmine, a production-ready JAX-based codebase for world modeling from unlabeled videos. Thanks to XLA, Jasmine scales from single hosts to hundreds of xPUs.

Figure 1: Jasmine in action.

Introduction

We are at the cusp of an intelligence revolution. Given enough compute, data, and the right algorithms, neural networks can clone the behaviour of peak human intellectual performance. While ever more capital expenditure is allocated to compute clusters, and a well-working recipe for equipping models with the required priors and the capacity to reason is publicly available, the path to human-level intelligence capable of automating large fractions of the economy will increasingly be shaped by paradigms that can find and efficiently use untouched data troves.

While product feedback loops constitute an adaptive data trove, many domains, such as robotics, are not yet mature enough to yield a product with adoption wide enough to create a feedback loop of sufficient magnitude, prompting the search for alternatives. One paradigm the research community has proposed to overcome data scarcity in these domains is that of world models. While world models can help frontier model development in numerous ways, an ambitious goal of the community is to train a world model that acts as a simulation of the world, in order to train an agent inside that simulation, via an adaptive curriculum or otherwise.

Deriving Empirical Environment Complexity Scaling Trends

While numerous previous works have investigated large-scale world modeling and its application to robotics, world modeling for agent training calls for a vastly different treatment. Such a regime requires the compounding error of world models to be orders of magnitude smaller than when they are used solely for short-term look-ahead. The feasibility of such a world model in its truest sense is entirely understudied, and Jasmine, a world modeling codebase, is our first milestone towards studying this setting with rigorous evaluations. Specifically, we want to derive Empirical Environment Complexity Scaling Trends, where we train world models to full convergence in environments of increasing complexity (Atari, RetroGym, Craftax, Minecraft) and under a synthetic infinite-data regime. Subsequently, we want to evaluate those models in two ways: i) via a taxonomy of granular benchmarks probing specific world modeling capabilities (reconstruction quality, environment dynamics at the body and tail of the data distribution, long-horizon consistency), and ii) by training reinforcement learning (RL) agents in both the world model and the corresponding ground-truth environment, and measuring the performance difference between those agents.
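To make the second evaluation concrete, the following minimal sketch spells out the agent-gap measurement. All names here (agent_gap, train_in, evaluate) are illustrative and not part of Jasmine's API; the closer the gap is to zero, the better the world model serves as a training simulator.

```python
from typing import Any, Callable

Agent = Any  # an RL policy; concrete type depends on the RL library
Env = Any    # anything exposing an environment interface (real env or world model)

def agent_gap(
    train_in: Callable[[Env], Agent],       # an RL training loop, e.g. PPO
    evaluate: Callable[[Agent, Env], float],  # mean return over eval episodes
    world_model: Env,                        # learned simulator, env interface
    env: Env,                                # ground-truth environment
) -> float:
    """Gap between an agent trained in the ground-truth environment and one
    trained in the world model, both evaluated in the ground-truth environment."""
    agent_env = train_in(env)           # RL training in the real environment
    agent_wm = train_in(world_model)    # RL training inside the learned simulator
    return evaluate(agent_env, env) - evaluate(agent_wm, env)
```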

Ultimately, this treatment lets us derive empirical estimates of the compute and data required to model environments of increasing complexity sufficiently well (as determined by our evaluation procedure). Only with such estimates can we draw conclusions about the feasibility of world modeling for agent training in environments as complex as the real world. If our empirical estimates show resource requirement trends that remain feasible assuming a continuation of Moore's Law and increased capital expenditure, that would establish world modeling as a paradigm with a high likelihood of overcoming data scarcity in domains as general as (humanoid) robotics. Otherwise, the world modeling research community must realign its direction with downstream goals that are feasible.

A batteries-included foundation for world modeling research

Jasmine, our first milestone towards deriving Empirical Environment Complexity Scaling Trends, is the result of weeks of infrastructure work to make large-scale world modeling research more accessible. What started as a fork of Jafar grew into a full-fledged world modeling codebase amenable to large-scale training, implementing multiple dynamics model baselines, process-parallel dataloading, asynchronous checkpointing of model weights, optimizer state, and dataloader state, configurable checkpointing policies, full reproducibility with identical training curves, mixed-precision training, optimized FlashAttention (via cuDNN SDPA), activation checkpointing, DDP (with FSDP/HSDP requiring a single-LoC change), a WSD schedule, index-shuffling during dataloading, and native Treescope support. Jasmine is built on the new flax.nnx API and strictly adheres to Noam Shazeer's shape suffix convention, thereby providing a didactic implementation of world modeling architectures; the sketches below illustrate both points. Jasmine depends solely on battle-tested libraries from the Google ecosystem (Flax, Optax, Orbax, Grain, PIX, ArrayRecord).
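As a flavor of the coding style, here is a minimal sketch (not actual Jasmine code) of a flax.nnx module using the shape suffix convention, where every tensor name documents its own shape:

```python
import jax
import jax.numpy as jnp
from flax import nnx

# Illustrative sketch, not actual Jasmine code. Shape suffixes follow
# Shazeer's convention: B = batch, T = time, D = model dimension.
class DynamicsBlock(nnx.Module):
    def __init__(self, dim_D: int, rngs: nnx.Rngs):
        self.norm = nnx.LayerNorm(dim_D, rngs=rngs)
        self.proj = nnx.Linear(dim_D, dim_D, rngs=rngs)

    def __call__(self, x_BTD: jax.Array) -> jax.Array:
        # The suffix makes each tensor's shape legible at a glance.
        h_BTD = self.norm(x_BTD)
        return x_BTD + self.proj(h_BTD)

block = DynamicsBlock(dim_D=256, rngs=nnx.Rngs(0))
y_BTD = block(jnp.zeros((2, 16, 256)))  # B=2, T=16, D=256
```

As for the single-LoC switch between DDP and FSDP, the following is a hypothetical illustration of how such a switch typically looks in JAX (it mirrors the claim above but is not copied from Jasmine): only the parameter PartitionSpec changes.

```python
import numpy as np
import jax
from jax.sharding import Mesh, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
param_spec = P()          # DDP: replicate parameters on every device
# param_spec = P("data")  # FSDP: shard parameters along the data axis
```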

Releasing a dataset of fine-grained research engineering

We captured every step of the research engineering process behind Jasmine using crowd-code, a VS Code/Cursor extension that captures fine-grained IDE interactions (character-level edits, navigation, debugging patterns, terminal usage) and allows researchers to contribute their engineering process to a crowd-sourced dataset. Today, we release crowd-code-0.1, our first dataset of dense IDE interactions, which encompasses the entire development of Jasmine. crowd-code-0.1 is unfiltered, uncleaned, and uncurated, but contains only IDE interactions of the Jasmine authors. We are actively working on cleaning and curating the full dataset, which will be released in the future.

Contributions

MM, AN, and FS worked on research, ideation, and implementation. FS wrote the manuscript. SB provided feedback and guidance.