We train an inverse dynamics model on crowd-cast's paired dataset, yielding a model that recovers key presses, mouse clicks, cursor movements and scroll events from unlabeled videos. Along with the model, we openly release 600 hours of IDM-annotated screencasts. Trained exclusively on macOS, the model generalizes to Windows and Linux.

Interactive demo: the model watches a screencast and recovers every user action.
1

Long-horizon data for long-horizon models

Frontier labs are spending billions on handcrafted RL environments hoping to lengthen the task horizon to general intelligence. Meanwhile, a billion people produce months-long trajectories of economically valuable digital work. We capture that data as action-annotated screencasts and use it to behaviour-clone models towards longer task horizons.

Previously, we openly released over 600 hours of unlabeled screencasts of AGI research. However, unlabeled screencasts alone are insufficient for behaviour-cloning. With crowd-cast, our open-source recording tool, we started capturing paired data: screencasts with synchronized keylogs and mouse movement. To date, we have collected 2500+ hours of paired recordings across dozens of contributors.

While crowd-cast gives us paired screencasts and input logs from highly productive participants, often spanning months of work on single projects, it is still limited in breadth.

The internet contains thousands of hours of unlabeled screencasts across more professions, applications and operating systems. To use them for behaviour-cloning, we first need to recover the actions behind the frames. We start with an inverse dynamics model trained on crowd-cast, and use it to action-label all 600 hours of AGI-CAST. We openly release that annotated dataset alongside the model.

2

An IDM for unlabeled screencasts

A human watching screen recordings can infer what was typed, what was clicked, and how the mouse moved. An inverse dynamics model is no different. Trained on crowd-cast's paired data (videos and ground-truth actions), our model learns to recover actions from pixels alone.

Unlike previous work, our IDM predicts sparse, low-level input events over multi-frame clips. Its action space is close to raw OS input logs: individual key presses, mouse clicks, scrolls and relative mouse movement, rather than GUI-level action steps. To our knowledge, this is the first openly released IDM for recovering raw computer input events from screen recordings.

Figure: bar chart of F1 per model (legend: ours, closed-source, open-source). Bars in order: Ours (Qwen3-VL 8B fine-tuned) 0.86, Gemini 3.5 Flash 0.73, GPT 5.5 0.70, Kimi K2.6 0.53, Gemma 4 31B 0.42, Qwen3-VL 8B 0.34.
F1 on the 44-clip eval set.

Our fine-tune of Qwen3-VL 8B surpasses the strongest zero-shot off-the-shelf model, Gemini 3.5 Flash. Our qualitative evaluations (see interactive demo) show that predicted cursor displacements closely track both the direction and magnitude of ground truth.

Without fine-tuning, even the best off-the-shelf model leaves significant room for improvement. We observe two failure modes across nearly all models: they are competent at single-frame grounding and OCR but struggle with multi-frame tracking, and they severely underpredict, missing the majority of actions.

2.1

Format

Given a short window of consecutive frames, the model outputs a sparse list of user actions: key presses, mouse clicks, mouse movement and scroll events. While most frames have no action, some have several, and the model is trained to emit only those. We deliberately avoid per-frame action classification. At 5 FPS, ~70% of frames have no associated input event, hence a naive per-frame classifier collapses to always predicting no-ops, getting 70% accuracy with zero useful signal. The sparse formulation sidesteps this entirely: the empty list [] is a valid response for clips where nothing happened.
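
To make the imbalance argument concrete, here is a small sketch on synthetic labels (not crowd-cast data). The 70/30 split mirrors the no-op rate quoted above; everything else is illustrative. It shows that an always-no-op per-frame classifier reaches 70% accuracy while recovering none of the actions.

```python
# Synthetic per-frame labels: 1 = some input event on this frame, 0 = no-op.
# The 70/30 split mirrors the ~70% no-op rate at 5 FPS; the data is made up.
labels = [0] * 70 + [1] * 30

# A degenerate per-frame classifier that always predicts "no action".
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall over frames that actually carry an action.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.70 recall=0.00
```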

| Type | Details | Notes |
| --- | --- | --- |
| KeyPress | key name + modifiers, e.g. Cmd+S | |
| MouseClick | Left, Right, or Middle | position is implied by the preceding MouseMove sequence |
| MouseMove | signed dx,dy | relative displacement on a 0-1000 per-axis normalized scale |
| MouseScroll | signed magnitude | direction and magnitude |

The action space stays close to raw OS input events.
[
  {"frame": "F01", "type": "MouseMove",   "details": "120,45"},
  {"frame": "F02", "type": "MouseClick",  "details": "Left"},
  {"frame": "F04", "type": "KeyPress",    "details": "Shift+H"},
  {"frame": "F04", "type": "KeyPress",    "details": "E"},
  {"frame": "F05", "type": "KeyPress",    "details": "L"},
  {"frame": "F05", "type": "KeyPress",    "details": "L"},
  {"frame": "F05", "type": "KeyPress",    "details": "O"},
  {"frame": "F07", "type": "MouseMove",   "details": "-340,12"},
  {"frame": "F08", "type": "MouseScroll", "details": "-150"}
]
Example model output for a 10-frame clip.
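
A consumer of this format might parse it along the following lines. This is a minimal sketch, not the released tooling: the `Action` class, `parse_details` and `parse_actions` are hypothetical names, and splitting key presses on `+` is an assumption that would need special handling for a literal plus key.

```python
import json
from dataclasses import dataclass

VALID_TYPES = {"KeyPress", "MouseClick", "MouseMove", "MouseScroll"}


@dataclass(frozen=True)
class Action:
    frame: int       # frame index parsed from labels such as "F04"
    type: str        # one of VALID_TYPES
    details: object  # parsed payload, see parse_details


def parse_details(action_type: str, details: str):
    """Turn the string payload into a typed value (hypothetical helper)."""
    if action_type == "MouseMove":
        dx, dy = (int(v) for v in details.split(","))
        return (dx, dy)
    if action_type == "MouseScroll":
        return int(details)
    if action_type == "KeyPress":
        # Assumption: modifiers and key are joined by "+", key comes last.
        *modifiers, key = details.split("+")
        return (tuple(modifiers), key)
    return details  # MouseClick: "Left", "Right" or "Middle"


def parse_actions(raw: str) -> list[Action]:
    """Parse a JSON array of actions; the empty list [] is valid."""
    actions = []
    for item in json.loads(raw):
        if item["type"] not in VALID_TYPES:
            raise ValueError(f"unknown action type: {item['type']!r}")
        frame = int(item["frame"].lstrip("F"))
        payload = parse_details(item["type"], item["details"])
        actions.append(Action(frame, item["type"], payload))
    return actions


example = (
    '[{"frame": "F01", "type": "MouseMove", "details": "120,45"},'
    ' {"frame": "F04", "type": "KeyPress", "details": "Shift+H"},'
    ' {"frame": "F08", "type": "MouseScroll", "details": "-150"}]'
)
parsed = parse_actions(example)
```

Rejecting unknown types early keeps malformed generations out of downstream metrics.
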
2.2

Data processing

Each crowd-cast recording consists of a screencast and the corresponding keylog containing raw OS-level input events with microsecond-level timestamps. We process these into training clips as follows:

  1. Frame extraction. We downsample each recording to 5 FPS and resize to 720p. crowd-cast records a black screen when the focused application is not in the user's predefined recording list. These black frames carry no visual information and have no associated actions, so we detect and drop them. We then slide a 10-frame window across each recording (stride=5). At 5 FPS, roughly 70% of the 2-second windows contain no user action at all. We keep only 5% of these zero-action clips as negative examples, enough for the model to learn that [] is a valid response without dominating the training loss.
  2. Action extraction. Raw OS-level keylog events are converted into the four action types. Modifier keys are never emitted standalone; they are tracked as held state and prefixed to the next non-modifier key press. For example, holding Cmd+Shift and subsequently pressing P yields one event Cmd+Shift+P. Mouse movement deltas are normalized to $$(dx, dy) \in [-1000, 1000]^2$$, where $$\pm1000$$ corresponds to a full screen-width or screen-height traversal. Scroll magnitudes are normalized the same way. Since multiple events sometimes land on the same frame, we coalesce scrolling and mouse movement by summing their per-frame vectors.
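
The action extraction in step 2 can be sketched roughly as follows. The raw event shape `(kind, key)`, the modifier names and their ordering, and the rounding in `normalize_delta` are all assumptions made for illustration; the actual crowd-cast keylog schema is not described here.

```python
MODIFIERS = ("Cmd", "Ctrl", "Alt", "Shift")  # hypothetical names and ordering


def fold_modifiers(events):
    """events: iterable of (kind, key) with kind in {"down", "up"}.

    Modifiers only update held state; each non-modifier key-down emits one
    KeyPress string prefixed with the currently held modifiers.
    """
    held = set()
    presses = []
    for kind, key in events:
        if key in MODIFIERS:
            if kind == "down":
                held.add(key)
            else:
                held.discard(key)
        elif kind == "down":
            prefix = [m for m in MODIFIERS if m in held]
            presses.append("+".join(prefix + [key]))
    return presses


def normalize_delta(dx_px, dy_px, width_px, height_px):
    """Map a pixel delta to the [-1000, 1000] scale (rounding is an assumption)."""
    return (round(1000 * dx_px / width_px), round(1000 * dy_px / height_px))


def coalesce_moves(moves):
    """moves: iterable of (frame, dx, dy). Sums deltas that land on one frame."""
    totals = {}
    for frame, dx, dy in moves:
        sx, sy = totals.get(frame, (0, 0))
        totals[frame] = (sx + dx, sy + dy)
    return sorted((f, dx, dy) for f, (dx, dy) in totals.items())


keys = fold_modifiers([
    ("down", "Cmd"), ("down", "Shift"), ("down", "P"), ("up", "P"),
    ("up", "Shift"), ("up", "Cmd"), ("down", "S"),
])
print(keys)  # ['Cmd+Shift+P', 'S']
print(coalesce_moves([(1, 10, 5), (1, -3, 2), (2, 4, 4)]))  # [(1, 7, 7), (2, 4, 4)]
```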

We use a snapshot of the crowd-cast dataset from May 19, 2026, comprising ~18,000 5-minute recordings. After processing, this yields approximately 1.5M training clips and 210k validation clips, split at the session level.
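
The windowing and negative subsampling from step 1 might look like the sketch below. It assumes black frames have already been dropped, that only full windows are kept, and that the 5% of zero-action clips are chosen by independent random sampling; `make_windows` and `select_clips` are hypothetical names.

```python
import random


def make_windows(num_frames, window=10, stride=5):
    """Return (start, end) frame ranges, end exclusive, for each full window."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def select_clips(num_frames, action_frames, keep_empty=0.05, seed=0):
    """Keep every window containing an action and a fraction of empty ones.

    action_frames: set of frame indices that carry at least one action.
    keep_empty: probability of keeping a zero-action window (assumed i.i.d.).
    """
    rng = random.Random(seed)
    clips = []
    for start, end in make_windows(num_frames):
        has_action = any(start <= f < end for f in action_frames)
        if has_action or rng.random() < keep_empty:
            clips.append((start, end))
    return clips


# A 5-minute recording at 5 FPS has 1500 frames.
print(len(make_windows(1500)))  # 299
```

With a stride of 5 and a window of 10, consecutive windows overlap by half, so most actions appear in two training clips at different positions.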

2.3

Model and training

We fine-tune Qwen3-VL-8B with LoRA applied to both the language model and the vision encoder. Frames are interleaved with text labels ("Frame F00:", "Frame F01:", ...) in the input sequence, giving the model text-based anchors across the images. The model produces a JSON array of actions as output, where only model responses contribute to the loss.
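
As an illustration of the interleaving and loss masking described above, the sketch below builds a user turn in the list-of-parts chat format commonly used with vision-language processors. The exact message schema, the instruction text and both function names are assumptions; only the "Frame F00:" label pattern comes from the text.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_user_content(frame_paths, instruction):
    """Interleave a text label before each frame, then append the instruction."""
    content = []
    for index, path in enumerate(frame_paths):
        content.append({"type": "text", "text": f"Frame F{index:02d}:"})
        content.append({"type": "image", "image": path})
    content.append({"type": "text", "text": instruction})
    return content


def mask_prompt_labels(prompt_ids, response_ids):
    """Build labels so that only response tokens contribute to the loss."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


# Hypothetical file names and instruction text.
frames = [f"clip/frame_{i:02d}.png" for i in range(10)]
content = build_user_content(frames, "List the user actions as a JSON array.")
labels = mask_prompt_labels([11, 12, 13], [21, 22])
```

Placing each label directly before its image gives the model a textual anchor it can copy into the `frame` field of its output.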

| Hyperparameter | Value |
| --- | --- |
| LoRA rank / alpha | 256 / 512 |
| LoRA dropout | 0.05 |
| Optimizer | AdamW |
| Peak learning rate | 2e-5 |
| Schedule | WSD (500-step warmup; 1,333-step decay to 10% of peak) |
| Per-GPU batch size | 2 |
| DDP world size | 8 |
| Number of steps | 5,000 |
| Max. pixels | 524,288 |
| Sequence length | 8192 |
| Hardware | 8× H100 |
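
For readers unfamiliar with warmup-stable-decay (WSD) schedules, the sketch below evaluates one with the step counts listed above. The linear shape of both the warmup and the decay, and placing the decay in the final 1,333 of the 5,000 steps, are assumptions; `wsd_lr` is a hypothetical helper.

```python
def wsd_lr(step, peak=2e-5, total=5000, warmup=500, decay=1333, floor=0.1):
    """Learning rate at a given step of a warmup-stable-decay schedule.

    Assumed shape: linear warmup from 0, constant plateau, then linear decay
    to floor * peak over the last `decay` steps.
    """
    if step < warmup:
        return peak * step / warmup
    decay_start = total - decay
    if step < decay_start:
        return peak
    progress = min(1.0, (step - decay_start) / decay)
    return peak * (1.0 - (1.0 - floor) * progress)


# Samples per optimizer step, assuming no gradient accumulation.
global_batch = 2 * 8

for s in (0, 250, 500, 3667, 5000):
    print(s, f"{wsd_lr(s):.2e}")
```

Under these assumptions the plateau spans steps 500 to 3,667, and the final learning rate is 2e-6.
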
2.4

Evaluation setup

We manually curate 44 macOS clips from a held-out set of crowd-cast (5 seconds each at 5 FPS) covering keystroke-heavy, click-heavy, scroll/drag, hotkey and mixed workflows, with hand-verified ground-truth actions. Each ground-truth action is annotated with a visibility class: visible (~91%, clear visual change in frames), inferable (~4%, deducible from context), or not predictable (~5%, no visual evidence at this frame rate). Only visible and inferable actions count toward the score.

Each evaluation clip is 25 frames, but the model only sees 10-frame sequences during training. Thus, during inference we run four overlapping windows (stride=5) and only evaluate predictions from frame indices 2 to 7. We match predictions to ground truth with a type-dependent frame tolerance (5 frames for key presses and mouse clicks, 0 frames for scrolling and mouse movement).
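
The tolerance-based matching can be sketched as a greedy one-to-one assignment. Several details here are assumptions: that key presses and clicks must also agree on their details, that movement and scroll events match on type and frame only, and that ties go to the nearest frame. For brevity the scores are pooled across types rather than computed per type.

```python
TOLERANCE = {"KeyPress": 5, "MouseClick": 5, "MouseMove": 0, "MouseScroll": 0}
TYPE_ONLY = ("MouseMove", "MouseScroll")  # assumption: details not compared


def match_actions(predicted, truth):
    """Count greedy one-to-one matches. Items are (frame, type, details)."""
    unmatched = list(truth)
    hits = 0
    for p_frame, p_type, p_details in predicted:
        candidates = [
            t for t in unmatched
            if t[1] == p_type
            and abs(t[0] - p_frame) <= TOLERANCE[p_type]
            and (p_type in TYPE_ONLY or t[2] == p_details)
        ]
        if candidates:
            # Prefer the ground-truth action closest in time.
            best = min(candidates, key=lambda t: abs(t[0] - p_frame))
            unmatched.remove(best)
            hits += 1
    return hits


def precision_recall_f1(hits, num_predicted, num_truth):
    precision = hits / num_predicted if num_predicted else 0.0
    recall = hits / num_truth if num_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


truth = [(3, "KeyPress", "H"), (4, "KeyPress", "E"),
         (7, "MouseMove", (10, 0)), (9, "MouseClick", "Left")]
predicted = [(5, "KeyPress", "H"), (7, "MouseMove", (12, 1)),
             (8, "MouseMove", (3, 3))]
hits = match_actions(predicted, truth)
precision, recall, f1 = precision_recall_f1(hits, len(predicted), len(truth))
```

In this toy example two of three predictions match two of four ground-truth actions, so an underpredicting model is penalized through recall.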

We report per-type precision, recall and F1. For mouse movement we additionally report R² between predicted and ground-truth displacement vectors, and cosine similarity for directional alignment. Reasoning was enabled for Gemini 3.5 Flash, GPT 5.5 and Kimi K2.6, and disabled for Gemma 4, Qwen3-VL and our fine-tuned model.
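
The two mouse-movement metrics can be computed as in the sketch below, which takes already-matched pairs of predicted and ground-truth displacement vectors. Pooling both axes into a single R² and averaging cosine similarity over pairs (skipping zero-length vectors) are assumptions; the exact aggregation is not specified above.

```python
import math


def mean_cosine(pred, truth):
    """Mean cosine similarity over matched (dx, dy) pairs."""
    sims = []
    for (px, py), (tx, ty) in zip(pred, truth):
        norm = math.hypot(px, py) * math.hypot(tx, ty)
        if norm > 0:  # direction is undefined for zero-length vectors
            sims.append((px * tx + py * ty) / norm)
    return sum(sims) / len(sims) if sims else float("nan")


def r_squared(pred, truth):
    """Coefficient of determination, pooled over both axes (assumption)."""
    flat_p = [v for vec in pred for v in vec]
    flat_t = [v for vec in truth for v in vec]
    mean_t = sum(flat_t) / len(flat_t)
    ss_res = sum((t - p) ** 2 for p, t in zip(flat_p, flat_t))
    ss_tot = sum((t - mean_t) ** 2 for t in flat_t)
    return 1.0 - ss_res / ss_tot


truth_vecs = [(100, 0), (-100, 0), (0, 50)]
flipped = [(-100, 0), (100, 0), (0, -50)]
print(r_squared(truth_vecs, truth_vecs))  # 1.0
print(r_squared(flipped, truth_vecs) < 0)  # True
```

R² drops below zero when predictions are worse than always predicting the mean displacement, which is why it can be negative while cosine similarity stays within [-1, 1].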

2.5

Ablations

Our ablations each change one component of the final recipe.

| Config | F1 | R² | cos |
| --- | --- | --- | --- |
| Final recipe (8B, r=256) | 0.86 | 0.66 | 0.99 |
| 2B backbone | 0.86 | 0.61 | 0.97 |
| 4B backbone | 0.84 | 0.55 | 0.91 |
| − vision LoRA | 0.84 | 0.64 | 0.76 |
| − interleaved frame labels | 0.77 | -0.03 | 0.57 |

Removing interleaved frame labels materially affects all metrics and degenerates mouse movement prediction. Without text anchors between frames, the model loses track of frame identity and enters repetitive prediction loops.

Model size, LoRA rank and ablating vision-LoRA have no measurable effect on action detection. However, bigger models and vision-LoRA improve mouse movement prediction, and direction accuracy scales monotonically with LoRA rank.

Figure: F1 and cos as a function of LoRA rank. At ranks 16, 64, 128 and 256, F1 is 0.84, 0.85, 0.85 and 0.86, and cos is 0.84, 0.89, 0.94 and 0.99.
F1 is stable across ranks. Mouse direction quality scales monotonically with adapter capacity.
2.6

Out-of-distribution generalization

Although the IDM has only been trained on macOS data, it largely transfers to Windows and Linux screencasts without adaptation. Mouse movement and click detection generalize best, while keyboard shortcuts are the main failure mode.

We additionally evaluate the model on AgentNet, which is even more out-of-distribution than mere cross-platform clips due to each frame depicting one logical action (like inserting one full command into the terminal). Since the model has never seen non-uniformly sampled frames, it hallucinates dynamics that could explain extensive changes between frames.

Positive example

Negative example

The model has not seen this kind of terminal text insertion during training. It first predicts only part of the inserted text, and subsequently explains large frame changes as terminal-history navigation.

3

What's next

Today, we expand crowd-cast support to Linux and Windows, thus covering all major operating systems. We expect our IDM to get better with more diverse data.

Our ultimate goal is to lengthen models' task horizons from days to weeks and months. With crowd-cast, we now have an in-house data supply chain of such months-long trajectories, and with our IDM we unlock the internet as a data trove. Our firm belief has always been that models must be trained on trajectories that span months if they are to exhibit month-long horizons, but a dataset alone will not suffice. While we scale our data collection effort by orders of magnitude, we are also investigating alternatives to dense attention for retaining memory over months-long rollouts (fixed-size state), continually training off-the-shelf models, inserting goals and synthetic thinking traces, and letting the model learn from experience.

With cross-platform support, anyone can now contribute to crowd-cast. We are paying participants to record their work sessions. You are not asked to do tasks for us; you record yourself doing work you would be doing anyway. If you are interested, apply here or check the live dashboard to see collection progress.

Contributions

MM worked on data sourcing, training, evaluation, and wrote large parts of the manuscript and the crowd-cast codebase. MM ported crowd-cast to Windows. FS ported crowd-cast to Linux and helped with the manuscript. The p(doom) team jointly contributed to crowd-cast.