We train an inverse dynamics model on crowd-cast's paired dataset, yielding a model that recovers key presses, mouse clicks, cursor movements and scroll events from unlabeled videos. Along with the model, we openly release 600 hours of IDM-annotated screencasts. Trained exclusively on macOS, the model generalizes to Windows and Linux.
Frontier labs are spending billions on handcrafted RL environments hoping to lengthen the task horizon to general intelligence. Meanwhile, a billion people produce months-long trajectories of economically valuable digital work. We capture that data as action-annotated screencasts and use it to behaviour-clone models towards longer task horizons.
Previously, we openly released over 600 hours of unlabeled screencasts of AGI research. However, unlabeled screencasts alone are insufficient for behaviour-cloning. With crowd-cast, our open-source recording tool, we started capturing paired data: screencasts with synchronized keylogs and mouse movement. To date, we have collected 2500+ hours of paired recordings across dozens of contributors.
While crowd-cast gives us paired screencasts and input logs from highly productive participants, often spanning months of work on single projects, it is still limited in breadth.
The internet contains thousands of hours of unlabeled screencasts across more professions, applications and operating systems. To use them for behaviour-cloning, we first need to recover the actions behind the frames. We start with an inverse dynamics model trained on crowd-cast, and use it to action-label all 600 hours of AGI-CAST. We openly release that annotated dataset alongside the model.
2
A human watching screen recordings can infer what was typed, what was clicked, and how the mouse moved. An inverse dynamics model is no different. Trained on crowd-cast's paired data (videos and ground-truth actions), our model learns to recover actions from pixels alone.
Unlike previous work
Our fine-tune of Qwen3-VL 8B surpasses the strongest zero-shot off-the-shelf model, Gemini 3.5 Flash. Our qualitative evaluations (see interactive demo) show that predicted cursor displacements closely track both the direction and magnitude of ground truth.
Without fine-tuning, even the best off-the-shelf model leaves significant room for improvement. We observe two failure modes across nearly all models: they are competent at single-frame grounding and OCR but struggle with multi-frame tracking, and they severely underpredict, missing the majority of actions.
2.1
Given a short window of consecutive frames, the model outputs a sparse list of user actions: key presses, mouse clicks, mouse movement and scroll events.
While most frames have no action, some have several, and the model is trained to emit only those.
We deliberately avoid per-frame action classification. At 5 FPS, ~70% of frames have no associated input event, so a naive per-frame classifier collapses to always predicting no-ops, reaching 70% accuracy with zero useful signal.
The sparse formulation sidesteps this entirely: the empty list [] is a valid response for clips where nothing happened.
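To make the collapse concrete, the following sketch scores an always-no-op classifier on a synthetic label sequence. The labels are made up for illustration; only the 70% share of empty frames comes from the text above.

```python
# Synthetic per-frame labels: 1 = frame has an input event, 0 = no-op.
# 7 of every 10 frames are empty, mirroring the ~70% figure above.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1] * 100

# A degenerate classifier that always predicts "no action".
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the action class: fraction of true action frames that were found.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
action_recall = true_positives / sum(labels)

print(f"accuracy={accuracy:.2f}, action recall={action_recall:.2f}")
```

The accuracy looks respectable while the recall on actual actions is zero, which is why accuracy over frames is a poor target for this task.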
| Type | Details | Notes |
|---|---|---|
| KeyPress | key name + modifiers, e.g. Cmd+S | |
| MouseClick | Left, Right, or Middle | position is implied by the preceding MouseMove sequence |
| MouseMove | signed dx,dy | relative displacement on a 0-1000 per-axis normalized scale |
| MouseScroll | signed magnitude | direction and magnitude |
```json
[
{"frame": "F01", "type": "MouseMove", "details": "120,45"},
{"frame": "F02", "type": "MouseClick", "details": "Left"},
{"frame": "F04", "type": "KeyPress", "details": "Shift+H"},
{"frame": "F04", "type": "KeyPress", "details": "E"},
{"frame": "F05", "type": "KeyPress", "details": "L"},
{"frame": "F05", "type": "KeyPress", "details": "L"},
{"frame": "F05", "type": "KeyPress", "details": "O"},
{"frame": "F07", "type": "MouseMove", "details": "-340,12"},
{"frame": "F08", "type": "MouseScroll", "details": "-150"}
]
```
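A consumer of this output format might parse it along the following lines. This is a minimal sketch, not the project's actual parser: the `Action` dataclass and the helpers `parse_actions` and `parse_move` are hypothetical names, and the only facts taken from the text are the four action types and the `"F04"`-style frame labels.

```python
import json
from dataclasses import dataclass

# The four action types listed in the table above.
VALID_TYPES = {"KeyPress", "MouseClick", "MouseMove", "MouseScroll"}

@dataclass(frozen=True)
class Action:
    frame: int    # frame index parsed from a label such as "F04"
    type: str
    details: str

def parse_actions(raw: str) -> list[Action]:
    """Parse a JSON array of actions; raise ValueError on unknown types."""
    actions = []
    for item in json.loads(raw):
        if item["type"] not in VALID_TYPES:
            raise ValueError(f"unknown action type: {item['type']}")
        actions.append(Action(int(item["frame"][1:]), item["type"], item["details"]))
    return actions

def parse_move(details: str) -> tuple[int, int]:
    """Split a MouseMove details string such as "-340,12" into (dx, dy)."""
    dx, dy = details.split(",")
    return int(dx), int(dy)

example = (
    '[{"frame": "F01", "type": "MouseMove", "details": "120,45"},'
    ' {"frame": "F04", "type": "KeyPress", "details": "Shift+H"}]'
)
actions = parse_actions(example)
```

An empty array parses to an empty list, so clips without any action need no special handling.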
Each crowd-cast recording consists of a screencast and the corresponding keylog containing raw OS-level input events with microsecond-level timestamps. We process these into training clips as follows:
[] is a valid response without dominating the training loss.
Cmd+Shift and subsequently pressing P yields one event Cmd+Shift+P.
Mouse movement deltas are $$(dx, dy) \in [-1000, 1000]^2$$, where $$\pm1000$$ corresponds to a full screen-width or screen-height traversal. Scroll magnitudes are normalized the same way.
Since multiple events sometimes land on the same frame, we coalesce scrolling and mouse movement by summing their per-frame vectors.
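The normalization and per-frame coalescing described above can be sketched as follows. The function names, the rounding to integers, and the choice to sum in normalized units are assumptions made for illustration; the source only states the ±1000 scale and that per-frame vectors are summed.

```python
from collections import defaultdict

SCALE = 1000  # +/-1000 corresponds to a full screen-width or screen-height traversal

def normalize_delta(dx_px, dy_px, screen_w, screen_h):
    """Convert a pixel displacement to the per-axis normalized scale."""
    return (round(dx_px / screen_w * SCALE), round(dy_px / screen_h * SCALE))

def coalesce_moves(events):
    """Sum (frame, dx, dy) events that land on the same frame."""
    totals = defaultdict(lambda: [0, 0])
    for frame, dx, dy in events:
        totals[frame][0] += dx
        totals[frame][1] += dy
    return [(frame, dx, dy) for frame, (dx, dy) in sorted(totals.items())]

# Three raw movement events, two of which fall on frame 1.
merged = coalesce_moves([(1, 10, 5), (1, -3, 2), (2, 7, 0)])
```

Normalizing per axis makes the targets independent of screen resolution, at the cost of treating horizontal and vertical pixels differently on non-square displays.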
We use a snapshot of the crowd-cast dataset from May 19, 2026, comprising ~18,000 5-minute recordings. After processing, this yields approximately 1.5M training clips and 210k validation clips, split at the session level.
2.3
We fine-tune Qwen3-VL-8B with LoRA applied to both the language model and the vision encoder.
Frames are interleaved with text labels ("Frame F00:", "Frame F01:", ...) in the input sequence, giving the model text-based anchors across the images.
The model produces a JSON array of actions as output, where only model responses contribute to the loss.
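The interleaving of frame labels and images might look like the sketch below. The dictionary layout follows a common chat-message convention for vision-language models and is an assumption, as are the function name `build_interleaved_content` and the file names; only the `"Frame F00:"` label format is taken from the text.

```python
def build_interleaved_content(frame_paths):
    """Alternate a text label and an image entry for every frame."""
    content = []
    for i, path in enumerate(frame_paths):
        # Text anchor that precedes each image, e.g. "Frame F00:".
        content.append({"type": "text", "text": f"Frame F{i:02d}:"})
        content.append({"type": "image", "image": path})
    return content

# Hypothetical frame files for a three-frame clip.
content = build_interleaved_content(["f0.png", "f1.png", "f2.png"])
```

Because each image is preceded by its own label, the `frame` field in the model's JSON output can refer back to a token sequence that literally appears in the input.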
| Hyperparameter | Value |
|---|---|
| LoRA rank / alpha | 256 / 512 |
| LoRA dropout | 0.05 |
| Optimizer | AdamW |
| Peak learning rate | 2e-5 |
| Schedule | WSD (500-step warmup; 1,333-step decay to 10% of peak) |
| Per-GPU batch size | 2 |
| DDP world size | 8 |
| Number of steps | 5,000 |
| Max. pixels | 524,288 |
| Sequence length | 8192 |
| Hardware | 8× H100 |
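The learning-rate schedule in the table can be approximated with the sketch below. The step counts, peak rate and 10% floor come from the table; the linear shape of both the warmup and the decay, the placement of the decay in the final 1,333 steps, and the absence of gradient accumulation are assumptions.

```python
PEAK_LR = 2e-5
WARMUP_STEPS = 500
DECAY_STEPS = 1333
TOTAL_STEPS = 5000
FINAL_FRACTION = 0.10  # decay ends at 10% of the peak learning rate

def wsd_lr(step: int) -> float:
    """Learning rate at a 0-indexed step under warmup-stable-decay (WSD)."""
    decay_start = TOTAL_STEPS - DECAY_STEPS
    if step < WARMUP_STEPS:
        # Linear warmup (assumed shape) that reaches the peak at the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return PEAK_LR
    # Linear decay (assumed shape) from the peak down to FINAL_FRACTION of it.
    progress = min(1.0, (step - decay_start) / DECAY_STEPS)
    return PEAK_LR * (1.0 - (1.0 - FINAL_FRACTION) * progress)

# Per-GPU batch size x DDP world size, assuming no gradient accumulation.
EFFECTIVE_BATCH = 2 * 8
```

Under these assumptions the stable phase spans steps 500 to 3,666, and each optimizer step sees 16 clips.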
We manually curate 44 macOS clips from a held-out set of crowd-cast (5 seconds each at 5 FPS) covering keystroke-heavy, click-heavy, scroll/drag, hotkey and mixed workflows, with hand-verified ground-truth actions. Each ground-truth action is annotated with a visibility class: visible (~91%, clear visual change in frames), inferable (~4%, deducible from context), or not predictable (~5%, no visual evidence at this frame rate). Only visible and inferable actions count toward the score.
Each evaluation clip is 25 frames, but the model only sees 10-frame sequences during training. Thus, during inference we run four overlapping windows (stride=5) and only evaluate predictions from frame indices 2 to 7. We match predictions to ground truth with a type-dependent frame tolerance (5 frames for key presses and mouse clicks, 0 frames for scrolling and mouse movement).
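The windowing arithmetic can be checked with a short sketch. The function names are hypothetical, and whether the "2 to 7" range is inclusive is not stated in the text; the sketch assumes it is.

```python
def window_starts(n_frames: int, window: int, stride: int) -> list[int]:
    """Start indices of all full windows that fit inside the clip."""
    return list(range(0, n_frames - window + 1, stride))

def kept_global_frames(start: int, lo: int = 2, hi: int = 7) -> list[int]:
    """Global frame indices scored for one window.

    Treats lo..hi as inclusive window-local indices (an assumption).
    """
    return [start + i for i in range(lo, hi + 1)]

# 25-frame clip, 10-frame windows, stride 5.
starts = window_starts(25, 10, 5)
kept = [kept_global_frames(s) for s in starts]
```

With a 25-frame clip this yields the four windows mentioned above. Under the inclusive reading, adjacent windows share one boundary frame, so an implementation would need to deduplicate predictions there.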
We report per-type precision, recall and F1. For mouse movement we additionally report R² between predicted and ground-truth displacement vectors, and cosine similarity for directional alignment. Reasoning was enabled for Gemini 3.5 Flash, GPT 5.5 and Kimi K2.6, and disabled for Gemma 4, Qwen3-VL and our fine-tuned model.
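A simplified version of this scoring is sketched below. The greedy matching strategy and all function names are assumptions, since the text specifies the tolerances but not the matching algorithm. R² is omitted because its exact definition for two-dimensional displacements is not given.

```python
import math

def match_with_tolerance(pred, truth, tolerance):
    """Greedily pair (frame, label) predictions with ground truth.

    A prediction matches the first unmatched ground-truth action with the
    same label whose frame is within `tolerance`. Returns the true-positive count.
    """
    unmatched = list(truth)
    tp = 0
    for p_frame, p_label in pred:
        for t in unmatched:
            if t[1] == p_label and abs(t[0] - p_frame) <= tolerance:
                unmatched.remove(t)
                tp += 1
                break
    return tp

def precision_recall_f1(tp, n_pred, n_truth):
    """Standard precision, recall and F1, returning 0.0 for empty denominators."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_truth if n_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cosine_similarity(a, b):
    """Cosine of the angle between two 2D displacement vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0

# Made-up key presses: two of three predictions fall within 5 frames of a match.
pred = [(4, "H"), (9, "E"), (20, "X")]
truth = [(5, "H"), (6, "E"), (7, "L")]
tp = match_with_tolerance(pred, truth, tolerance=5)
precision, recall, f1 = precision_recall_f1(tp, len(pred), len(truth))
```

Cosine similarity ignores magnitude, which is why it is reported alongside R² rather than instead of it: a prediction can point in the right direction while badly misjudging the distance travelled.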
2.5
Our ablations each change one component of the final recipe.
| Config | F1 | R² | cos |
|---|---|---|---|
| Final recipe (8B, r=256) | 0.86 | 0.66 | 0.99 |
| 2B backbone | 0.86 | 0.61 | 0.97 |
| 4B backbone | 0.84 | 0.55 | 0.91 |
| − vision LoRA | 0.84 | 0.64 | 0.76 |
| − interleaved frame labels | 0.77 | -0.03 | 0.57 |
Removing interleaved frame labels materially affects all metrics and degenerates mouse movement prediction. Without text anchors between frames, the model loses track of frame identity and enters repetitive prediction loops.
Model size, LoRA rank and ablating vision-LoRA have no measurable effect on action detection. However, bigger models and vision-LoRA improve mouse movement prediction, and direction accuracy scales monotonically with LoRA rank.
Although the IDM has only been trained on macOS data, it largely transfers to Windows and Linux screencasts without adaptation. Mouse movement and click detection generalize best, while keyboard shortcuts are the main failure mode.
We additionally evaluate the model on AgentNet, which is even more out-of-distribution than mere cross-platform clips because each frame depicts one logical action (like inserting one full command into the terminal). Since the model has never seen non-uniformly sampled frames, it hallucinates dynamics that could explain extensive changes between frames.
Positive example
Negative example
The model has not seen this kind of terminal text insertion during training. It first predicts only part of the inserted text, and subsequently explains large frame changes as terminal-history navigation.
Today, we expand crowd-cast support to Linux and Windows, thus covering all major operating systems. We expect our IDM to get better with more diverse data.
Our ultimate goal is to lengthen models' task horizons from days to weeks and months. With crowd-cast, we now have our in-house data supply chain of such months-long trajectories, and with our IDM we unlock the internet as a data trove. While our firm belief has always been that you need to train models on trajectories that span months if you want them to exhibit month-long horizons, a dataset alone will not suffice. While we are scaling our data collection effort by orders of magnitude, we are also investigating alternatives to dense attention for retaining memory over months-long rollouts (fixed-size state), continually training off-the-shelf models, inserting goals and synthetic thinking traces, and letting the model learn from experience.
With cross-platform support, anyone can now contribute to crowd-cast. We are paying participants to record their work sessions. You are not asked to do tasks for us; you record yourself doing work you would be doing anyway. If you are interested, apply here or check the live dashboard to see collection progress.
MM worked on data sourcing, training, evaluation, and wrote large parts of the manuscript and the crowd-cast codebase. MM ported crowd-cast to Windows. FS ported crowd-cast to Linux and helped with the manuscript. The p(doom) team jointly contributed to crowd-cast.