We introduce crowd-code 2.0, a complete redesign of crowd-code: a VS Code / Cursor extension for crowd-sourcing software engineering traces as action–observation rollouts. Install once, and forget about it.
Install crowd-code 2.0 on Cursor, VS Code, and Antigravity
Models can win IMO gold medals yet struggle at tasks that would provide obvious economic value. They can one-shot solutions yet struggle at long debugging sessions. This is not a capability problem, it is a data problem. Models do what they are trained to do.
Millions of people work every day, creating billions of hours of open-ended, yet goal-conditioned long-horizon behaviour-cloning data every month, but this data trove goes uncaptured and lost. With AGI-CAST, we recently published the largest, open long-horizon screen capture dataset by recording p(doom) members conducting open-ended AGI research. We believe that behaviour-cloning from long-horizon screencasts (and eventual product-feedback loops) will unlock the next set of model capabilities in a way that mere internet-scale training and data-labeler-powered reinforcement learning cannot. The story of deep learning is that of scalable methods winning. The data story is no different.
Behaviour cloning from unlabeled videos means learning policies from observation streams without action labels or rewards. AGI-CAST captures raw screen recordings of AGI research, but training on videos is compute-expensive, and beyond AGI-CAST, large-scale open datasets of workforce screencasts suitable for training do not exist.
crowd-code 2.0 is a simplified setting for studying behaviour cloning from screencasts: observations correspond to what humans can see inside IDEs (editor and terminal viewports), while actions correspond to edits, cursor movement, navigation, and terminal interaction.
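To make this concrete, here is a minimal sketch of what one step of such a rollout could look like; the type and field names are illustrative assumptions, not crowd-code's actual schema.

```typescript
// Illustrative sketch only; not crowd-code's actual schema.
type Action =
  | { kind: "edit"; file: string; offset: number; inserted: string; deleted: string }
  | { kind: "cursor"; file: string; line: number; column: number }
  | { kind: "navigation"; openedFile: string }
  | { kind: "terminal"; input: string };

interface Observation {
  timestamp: number;
  editorViewport: string;   // text currently visible in the editor
  terminalViewport: string; // text currently visible in the terminal
}

// One step of an action-observation rollout: the action taken and the
// state the user saw afterwards.
interface RolloutStep {
  action: Action;
  observation: Observation;
}
```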
The result is a sequence of action–observation rollouts, directly analogous to video-based imitation learning but purely text-based. We subsample continuously changing viewports (scrolling, streaming terminal output) at 10 Hz, matching the temporal granularity of video. This means we capture the state of interactive CLI tools like Claude Code, Codex CLI, vim, and less in real time.
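A minimal sketch of how such 10 Hz subsampling could be implemented inside an extension; captureViewport and record are hypothetical helpers, not the extension's actual API.

```typescript
// Illustrative sketch: poll continuously changing viewports at 10 Hz and
// record an observation only when the visible text actually changed.
const SAMPLE_INTERVAL_MS = 100; // 10 Hz

function startViewportSampling(
  captureViewport: () => string,                                 // hypothetical helper
  record: (obs: { timestamp: number; viewport: string }) => void // hypothetical sink
): ReturnType<typeof setInterval> {
  let last = "";
  return setInterval(() => {
    const viewport = captureViewport();
    if (viewport !== last) {
      record({ timestamp: Date.now(), viewport });
      last = viewport;
    }
  }, SAMPLE_INTERVAL_MS);
}
```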
Over the past six months, we crowd-sourced over 2.5 million actions across 25 people using the original crowd-code extension. Today, we publicly release crowd-code-dataset-1.0, a cleaned and anonymized version of the dataset, under the most permissive Creative Commons license (CC0).
The crowd-code dataset is flexible enough to support serialization to formats suitable for continual training of off-the-shelf language models. crowd-pilot-serializer provides a Rust-based serialization implementation of an action-observation format familiar to coding agents:
python src/input_pipeline/validation_to_testcases.py --input_file data/temp/validation.jsonl --output_file data/temp/validation_testcases.jsonl
Traceback (most recent call last):
File "/fast/home/mihir.mahajan/Projects/tab-model-eval/src/input_pipeline/validation_to_testcases.py", line 47, in <module>
args = tyro.cli(Args)
^^^^
NameError: name 'tyro' is not defined
cat -n src/input_pipeline/validation_to_testcases.py
1 import json
2 import os
3 from dataclasses import dataclass
6 @dataclass
7 class Args:
8 input_file: str = "data/eval/val/validation.jsonl"
9 output_file: str = "data/eval/val/validation_testcases.jsonl"
10 task_name: str = "validation_set"
...
sed -i '1i\import tyro' src/input_pipeline/validation_to_testcases.py && cat -n src/input_pipeline/validation_to_testcases.py | sed -n '1,10p'
1 import tyro
2 import json
3 import os
4 from dataclasses import dataclass
7 @dataclass
8 class Args:
9 input_file: str = "data/eval/val/validation.jsonl"
10 output_file: str = "data/eval/val/validation_testcases.jsonl"
To give one example that illustrates crowd-code's usefulness, we can use this serialization format to train tab completion models that avoid KV cache invalidation and harness multi-token-prediction-powered speculative decoding at deployment time. That way, we can deploy much larger models than traditional tab completion systems under the same latency requirements.
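A minimal sketch of why this helps at deployment time, assuming the serializer emits events append-only; SerializedEvent and RolloutPrompt are hypothetical names, not part of crowd-pilot-serializer.

```typescript
// Illustrative sketch: an append-only rollout means the prompt only grows,
// so a server-side KV cache built over the existing prefix stays valid.
interface SerializedEvent {
  timestamp: number;
  kind: "user_edit" | "agent_edit" | "terminal" | "viewport";
  payload: string; // already rendered in the serialization format above
}

class RolloutPrompt {
  private prompt = "";

  // Each event is appended; earlier text is never rewritten, so the cached
  // prefix can be reused and only the new suffix needs to be prefilled.
  append(event: SerializedEvent): string {
    this.prompt += `\n<${event.kind} t=${event.timestamp}>\n${event.payload}`;
    return this.prompt;
  }
}
```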
However, crowd-code 1.0 was not designed for the agent-heavy software engineering workflows that are now standard, and its data capture logic did not mirror the setting of behaviour cloning from videos. crowd-code 2.0 moves beyond purely event-based recordings towards capturing both states (observations) and events (actions), with states designed to be as close to human observations as possible. We now capture edits workspace-wide and differentiate between user actions and edits (keystrokes, navigation, terminal commands), agent edits (changes made by Cursor, Copilot, or other IDE agents), and external changes (git operations, CLI tools).
Attributing user edits to filesystem changes is non-trivial: we buffer user edits, correlate them with filesystem changes on save, and attribute uncorrelated changes to agents. We tested our implementation against edge cases: if a save is triggered by an agent edit while user changes are pending, the agent edit is still correctly attributed.
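A simplified sketch of this attribution logic, under the assumption that agent edits arrive through the editor while external changes only touch the filesystem; names and heuristics are illustrative, not the extension's actual implementation.

```typescript
// Illustrative sketch of attributing filesystem changes observed on save.
type EditSource = "user" | "agent" | "external";

interface PendingUserEdit { file: string; insertedText: string }

class EditAttributor {
  private pendingUserEdits: PendingUserEdit[] = [];

  // Keystroke-level edits reported by the editor are buffered as user edits.
  bufferUserEdit(edit: PendingUserEdit): void {
    this.pendingUserEdits.push(edit);
  }

  // On save, correlate each diff hunk with the buffered user edits. Hunks the
  // buffer does not explain are attributed to an agent (even if user edits
  // were pending), and changes that never went through the editor are external.
  attributeHunks(file: string, hunks: string[], viaEditor: boolean): EditSource[] {
    const userEdits = this.pendingUserEdits.filter((e) => e.file === file);
    this.pendingUserEdits = this.pendingUserEdits.filter((e) => e.file !== file);
    return hunks.map((hunk) => {
      if (userEdits.some((e) => hunk.includes(e.insertedText))) return "user";
      return viaEditor ? "agent" : "external";
    });
  }
}
```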
With crowd-code 2.0 we directly capture prompt-response pairs of CLI agents invoked from the terminal. Even for non-CLI agents (whose prompts we cannot capture through the VS Code extension API), we can reconstruct prompts and model rollouts by capturing a compressed snapshot of the workspace immediately before each agent edit together with a diff of the agent edit itself: language models can reconstruct likely prompts by conditioning on the workspace snapshot and the agent edit diff. This has an additional benefit: we can replay the rollout with a different model to obtain on-policy data.
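A minimal sketch of this snapshot-plus-diff capture; serializeWorkspace is a hypothetical helper standing in for however the extension renders the workspace state.

```typescript
// Illustrative sketch: before an agent edit lands, store a compressed snapshot
// of the workspace plus the unified diff of the edit, so a language model can
// later reconstruct the likely prompt or replay the edit with another model.
import { gzipSync } from "node:zlib";

interface AgentEditRecord {
  timestamp: number;
  workspaceSnapshotGz: Buffer; // gzip of the pre-edit workspace state
  diff: string;                // unified diff of the agent edit itself
}

function recordAgentEdit(
  serializeWorkspace: () => string, // hypothetical helper
  diff: string
): AgentEditRecord {
  return {
    timestamp: Date.now(),
    workspaceSnapshotGz: gzipSync(serializeWorkspace()),
    diff,
  };
}
```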
crowd-code 2.0 decouples the capture format from the training format. The raw capture format is a sequence of timestamped actions and observations; post-processing can transform this into training-ready serializations such as the action-observation format shown above.
With crowd-code 2.0, we hope to also enable the community to work on methods and algorithms for product-feedback loops.
We believe that many capabilities are yet to be unlocked in current-generation architectures through behaviour cloning: expanding the task horizon of models, working on single problems for hours and days at a time, imitating human exploration priors, learning on-the-go, and knowing where to look (how to search by efficiently jumping around repositories, how to recall information beyond working memory; what we call attention in environment-space). Architectural bottlenecks only become apparent by constructing potential paths towards AGI. Architecture research presupposes first exhausting the data regime.
Our long-term goal is not merely to train on crowd-code data. We want to use it to train inverse dynamics models that infer actions from unlabeled observation sequences, as a testbed for behaviour cloning from videos that yields insights transferable to the setting of AGI-CAST and beyond, and to bootstrap agents that can acquire the data they need when they encounter unfamiliar regimes.
AGI will require systems that can expand their training distribution by knowing where to look. We believe software engineering is one of the best environments to study these problems.
We are greater than the sum of our parts. Together.
AN, MM, and FS worked on research, ideation and implementation. FS wrote the manuscript.