We introduce Jasmine, a production-ready JAX-based codebase for world modeling from unlabeled videos. Thanks to XLA, Jasmine scales from single hosts to hundreds of xPUs.
We are at the cusp of an intelligence revolution: neural networks are now able to clone behaviour at the level of peak human intellectual performance.
While product feedback loops supply ever-growing streams of training data in established software domains, no comparable data source exists for robotics.
While numerous previous works have investigated large-scale world modeling and its application to robotics, to our knowledge none treat environment complexity as an explicit axis along which to scale compute and data.
Ultimately, such treatment permits us to derive empirical estimates of the compute and data required to model environments of increasing complexity sufficiently well (as determined by our evaluation procedure). Only with such estimates can we draw conclusions about the feasibility of world modeling of environments as complex as the real world for agent training. If the empirical resource-requirement trends are feasible under the assumption of continued Moore's Law and increased capital expenditure, that would establish world modeling as a paradigm with a high likelihood of overcoming the data scarcity in domains as general as (humanoid) robotics. Otherwise, the world modeling research community must realign its direction with downstream goals that are feasible.
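To make this concrete, here is a purely hypothetical sketch of what deriving such an estimate could look like: fit a power law compute = a * complexity^b to measurements and extrapolate. All numbers below are invented for illustration and are not results of this work.

```python
import numpy as np

# Hypothetical measurements: compute (FLOPs) needed to model environments of
# increasing complexity "sufficiently well" under some fixed evaluation.
complexity = np.array([1.0, 2.0, 4.0, 8.0])     # invented complexity scores
flops = np.array([1e18, 5e18, 2.6e19, 1.3e20])  # invented compute budgets

# Fit log(flops) = b * log(complexity) + log(a), i.e. flops = a * complexity^b.
b, log_a = np.polyfit(np.log(complexity), np.log(flops), deg=1)
print(f"flops ~ {np.exp(log_a):.2e} * complexity^{b:.2f}")

# Extrapolate to a far more complex environment to judge feasibility.
print(f"predicted FLOPs at complexity 64: {np.exp(log_a) * 64.0 ** b:.2e}")
```

Whether the fitted exponent implies a feasible trajectory is exactly the kind of question the estimates above are meant to answer.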
Jasmine, our first milestone towards deriving Empirical Environment Complexity Scaling Trends, is the result of weeks of infrastructure work to make large-scale world modeling research more accessible. What started off as a fork of Jafar grew into a full-fledged world modeling codebase amenable to large-scale training, implementing multiple dynamics model baselines, asynchronous checkpointing of model weights, optimizer and dataloader states, checkpointing policies, process-parallel dataloading with index shuffling, full reproducibility with identical training curves, mixed precision training, optimized FlashAttention (via cuDNN SDPA), activation checkpointing, DDP (with FSDP/HSDP requiring a single-LoC change), a WSD schedule, and native Treescope support. Jasmine is built on the new flax.nnx API and strictly adheres to Noam Shazeer's shape suffix convention, thereby providing a didactic implementation of world modeling architectures. Jasmine depends solely on battle-tested libraries from the Google ecosystem (Flax, Optax, Orbax, Grain, PIX, ArrayRecord).
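As a hedged illustration of a few of these pieces, the following minimal sketch (not Jasmine's actual code; the module, function, and axis names are our inventions) shows the shape-suffix convention under flax.nnx, a WSD schedule assembled from Optax primitives, and the one-line mesh change that turns DDP-style replication into FSDP-style parameter sharding:

```python
# Illustrative sketch only: shape suffixes, a WSD schedule, and DDP vs. FSDP
# parameter shardings. Names (MLP, wsd_schedule, axis "data") are invented.
import jax
import numpy as np
import optax
from flax import nnx
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P


class MLP(nnx.Module):
    """Shape suffixes: B=batch, T=time, D=model width, F=hidden width."""

    def __init__(self, d_model: int, d_hidden: int, rngs: nnx.Rngs):
        self.up_proj_DF = nnx.Linear(d_model, d_hidden, rngs=rngs)
        self.down_proj_FD = nnx.Linear(d_hidden, d_model, rngs=rngs)

    def __call__(self, x_BTD: jax.Array) -> jax.Array:
        h_BTF = jax.nn.gelu(self.up_proj_DF(x_BTD))
        return self.down_proj_FD(h_BTF)


def wsd_schedule(peak_lr: float, warmup: int, stable: int, decay: int):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay."""
    return optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, peak_lr, warmup),
            optax.constant_schedule(peak_lr),
            optax.linear_schedule(peak_lr, 0.0, decay),
        ],
        boundaries=[warmup, warmup + stable],
    )


# A 1-D device mesh: replicating parameters over it yields DDP; sharding
# them over the same axis yields FSDP -- a single-line difference.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
ddp_params = NamedSharding(mesh, P())         # replicate -> DDP
fsdp_params = NamedSharding(mesh, P("data"))  # shard dim 0 -> FSDP
```

Because XLA's GSPMD compiler propagates shardings through the jitted training step, the same step function runs unchanged under either sharding, which is what makes the DDP-to-FSDP switch a one-line change.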
We captured every step of the research engineering process behind Jasmine using crowd-code, and release crowd-code-0.1, our first dataset of dense IDE interactions, which encompasses the entire development of Jasmine. crowd-code-0.1 is unfiltered, uncleaned, and uncurated, but contains only IDE interactions of the Jasmine authors. We are actively working on cleaning and curating the full dataset, which will be released in the future.
MM, AN and FS worked on research, ideation and implementation. FS wrote the manuscript. SB provided feedback and guidance.