We train an inverse dynamics model on crowd-cast's action-labeled dataset, yielding a model that recovers key presses, mouse clicks, cursor movements and scroll events from unlabeled videos. Trained exclusively on macOS, the model outperforms orders-of-magnitude larger off-the-shelf models and generalizes to Windows and Linux. Along with the model, we openly release 600 hours of IDM-annotated screencasts.

Interactive demo: the model watches a screencast and recovers every user action.

Long-horizon data for long-horizon models

Frontier labs are spending billions on handcrafted RL environments, hoping to lengthen the task horizon toward general intelligence. Meanwhile, a billion people produce months-long trajectories of economically valuable digital work: data that can be used to behaviour-clone models towards longer task horizons. We believe that if you want models to work on tasks for weeks and months at a time, you must train them on trajectories that span months.

We previously started capturing that data as screencasts with synchronized keylogs and mouse movement using crowd-cast. While crowd-cast yields data from highly productive participants, often spanning months of work on single projects, it is still limited in breadth.

At the same time, the internet contains thousands of hours of unlabeled screencasts across diverse professions, applications and operating systems. To use that data for behaviour-cloning, we first need to recover the actions behind the frames. As a first step towards unlocking internet-scale screencasts, we train an inverse dynamics model on crowd-cast data and use it to action-label all 600 hours of AGI-CAST, a dataset of unlabeled long-horizon screencasts. We openly release that annotated dataset alongside our model.

An IDM for unlabeled screencasts

A human watching screen recordings can infer what was typed, what was clicked, and how the mouse moved. An inverse dynamics model is no different. Trained on crowd-cast's paired data (videos and ground-truth actions), our model learns to recover actions from pixels alone.

Unlike previous work, our IDM predicts sparse, low-level input events over multi-frame clips. Its action space is close to raw OS input logs: individual key presses, mouse clicks, scrolls and relative mouse movement, rather than GUI-level action steps. To our knowledge, this is the first openly released IDM for recovering raw computer input events from screen recordings.

Model	F1
Ours (Qwen3-VL 8B fine-tuned)	0.79
Gemini 3.5 Flash	0.74
GPT 5.5	0.71
Kimi K2.6	0.54
Gemma 4 31B	0.43
Qwen3-VL 8B	0.36

Overall F1 on the 44-clip eval set.

Our fine-tune of Qwen3-VL 8B surpasses the strongest zero-shot off-the-shelf model, Gemini 3.5 Flash. Our qualitative evaluations (see interactive demo) show that predicted cursor displacements closely track both the direction and magnitude of ground truth.

Without fine-tuning, even the best off-the-shelf model leaves significant room for improvement. We observe two failure modes across nearly all models: they are competent at single-frame grounding and OCR but struggle with multi-frame tracking, and they severely underpredict, missing the majority of actions.

2.1 Format

Given a short window of consecutive frames, the model outputs a sparse list of user actions: key presses, mouse clicks, mouse movement and scroll events. Most frames have no action and some have several; the model is trained to emit only the actions that occurred. We deliberately avoid per-frame action classification. At 5 FPS, ~70% of frames have no associated input event, so a naive per-frame classifier can collapse to always predicting no-ops, reaching 70% accuracy with zero useful signal. The sparse formulation sidesteps this entirely: the empty list [] is a valid response for clips where nothing happened.

Type	Details	Notes
`KeyPress`	key name + modifiers, e.g. `Cmd+S`
`MouseClick`	`Left`, `Right`, or `Middle`	position is implied by the preceding MouseMove sequence
`MouseMove`	signed `dx,dy`	relative displacement on a 0-1000 per-axis normalized scale
`MouseScroll`	signed magnitude	direction and magnitude

The action space stays close to raw OS input events.

[
  {"frame": "F01", "type": "MouseMove",   "details": "120,45"},
  {"frame": "F02", "type": "MouseClick",  "details": "Left"},
  {"frame": "F04", "type": "KeyPress",    "details": "Shift+H"},
  {"frame": "F04", "type": "KeyPress",    "details": "E"},
  {"frame": "F05", "type": "KeyPress",    "details": "L"},
  {"frame": "F05", "type": "KeyPress",    "details": "L"},
  {"frame": "F05", "type": "KeyPress",    "details": "O"},
  {"frame": "F07", "type": "MouseMove",   "details": "-340,12"},
  {"frame": "F08", "type": "MouseScroll", "details": "-150"}
]

Example model output for a 10-frame clip.
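As an illustration of how a consumer might read this format, here is a minimal Python sketch that parses such a JSON array into typed records. The `Action` dataclass and the helper names `parse_actions` and `mouse_delta` are hypothetical and not part of the released code; only the field names and the four action types are taken from the example above.

```python
import json
from dataclasses import dataclass

# The four action types listed in the table above.
VALID_TYPES = {"KeyPress", "MouseClick", "MouseMove", "MouseScroll"}


@dataclass(frozen=True)
class Action:
    frame: int    # frame index within the clip, e.g. "F07" -> 7
    type: str     # one of VALID_TYPES
    details: str  # raw details string, e.g. "Cmd+S" or "120,45"


def parse_actions(raw: str) -> list[Action]:
    """Parse a JSON array of actions; raise ValueError on unknown types."""
    actions = []
    for item in json.loads(raw):
        if item["type"] not in VALID_TYPES:
            raise ValueError(f"unknown action type: {item['type']!r}")
        frame = int(item["frame"].lstrip("F"))
        actions.append(Action(frame, item["type"], item["details"]))
    return actions


def mouse_delta(action: Action) -> tuple[int, int]:
    """Split a MouseMove details string such as "-340,12" into (dx, dy)."""
    dx, dy = action.details.split(",")
    return int(dx), int(dy)
```

An empty array parses to an empty list, which matches the convention that [] is a valid response for a clip without actions.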

2.2 Data processing

Each crowd-cast recording consists of a screencast and the corresponding keylog containing raw OS-level input events with microsecond-level timestamps. We process these into training clips as follows:

Frame extraction. We downsample each recording to 5 FPS and resize to 720p. crowd-cast records a black screen whenever the focused application is not in the user's predefined recording list. These black frames carry no visual information and have no associated actions, so we detect and drop them. We then slide a 10-frame window across each recording (stride=5). At 5 FPS, roughly 70% of the resulting 2-second windows contain no user action at all. We keep only 5% of these zero-action clips as negative examples, enough for the model to learn that [] is a valid response without letting them dominate the training loss.
Action extraction. Raw OS-level keylog events are converted into the four action types. Modifier keys are never emitted as standalone events; they are tracked as held state and prefixed to the next non-modifier key press. For example, holding Cmd+Shift and subsequently pressing P yields a single event, Cmd+Shift+P. Mouse movement deltas are normalized to $$(dx, dy) \in [-1000, 1000]^2$$, where $$\pm1000$$ corresponds to a full screen-width or screen-height traversal. Scroll magnitudes are normalized the same way. Since multiple events sometimes land on the same frame, we coalesce scrolling and mouse movement by summing their per-frame vectors.
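The action-extraction rules above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual pipeline: the `(kind, key)` event tuples, the modifier names and their prefix order, and the use of pixel deltas with rounding are all hypothetical choices made for the example.

```python
# Assumed modifier names and prefix order; the real pipeline may differ.
MODIFIERS = ("Cmd", "Ctrl", "Alt", "Shift")


def fold_modifiers(events):
    """events: iterable of (kind, key) with kind in {"down", "up"}.

    Modifiers only update held state; each non-modifier key-down is
    emitted once, prefixed with the modifiers held at that moment.
    """
    held = set()
    presses = []
    for kind, key in events:
        if key in MODIFIERS:
            if kind == "down":
                held.add(key)
            else:
                held.discard(key)
        elif kind == "down":
            prefix = [m for m in MODIFIERS if m in held]
            presses.append("+".join(prefix + [key]))
    return presses


def normalize_delta(dx_px, dy_px, width, height):
    """Map a pixel displacement to the [-1000, 1000] per-axis scale."""
    return round(1000 * dx_px / width), round(1000 * dy_px / height)


def coalesce_moves(moves):
    """moves: iterable of (frame, dx, dy). Sum displacements per frame."""
    totals = {}
    for frame, dx, dy in moves:
        tx, ty = totals.get(frame, (0, 0))
        totals[frame] = (tx + dx, ty + dy)
    return sorted((f, dx, dy) for f, (dx, dy) in totals.items())
```

With these helpers, the sequence Cmd down, Shift down, P down produces the single string `Cmd+Shift+P`, and two mouse events landing on the same frame collapse into one summed displacement.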

We use a snapshot of the crowd-cast dataset from May 19, 2026, comprising ~18,000 5-minute recordings. After processing, this yields approximately 1.5M training clips and 210k validation clips, split at the session level.
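To make the windowing step concrete, the following Python sketch enumerates 10-frame windows at stride 5 and subsamples the zero-action ones. It is an illustration only: the function names are hypothetical, and keeping each empty window independently with probability 0.05 is an assumption about how "5%" is realized.

```python
import random


def sliding_windows(num_frames, size=10, stride=5):
    """Start indices of all full windows over a recording."""
    return list(range(0, num_frames - size + 1, stride))


def select_clips(num_frames, action_frames, keep_empty=0.05, seed=0,
                 size=10, stride=5):
    """Keep every window containing an action; keep empty ones at random.

    action_frames: set of frame indices that carry at least one action.
    Returns a list of (start_index, has_action) pairs.
    """
    rng = random.Random(seed)
    clips = []
    for start in sliding_windows(num_frames, size, stride):
        has_action = any(start <= f < start + size for f in action_frames)
        # Assumption: each empty window is kept with probability keep_empty.
        if has_action or rng.random() < keep_empty:
            clips.append((start, has_action))
    return clips
```

Because the stride is half the window length, consecutive windows share five frames, so an action can appear in up to two training clips.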

2.3 Model and training

We fine-tune Qwen3-VL 8B with LoRA applied to both the language model and the vision encoder. Frames are interleaved with text labels ("Frame F00:", "Frame F01:", ...) in the input sequence, giving the model text-based anchors between the images. The model outputs a JSON array of actions, and only the model's response tokens contribute to the loss.
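The interleaving of frame labels and images could look roughly like the sketch below. The list-of-typed-parts layout follows a common chat-message convention for vision-language models and is an assumption here; the exact prompt template used in training is not given in this post, and `build_user_content` is a hypothetical helper.

```python
def build_user_content(frame_paths):
    """Interleave a text label before each frame image.

    Produces [text "Frame F00:", image 0, text "Frame F01:", image 1, ...].
    """
    content = []
    for i, path in enumerate(frame_paths):
        content.append({"type": "text", "text": f"Frame F{i:02d}:"})
        content.append({"type": "image", "image": path})
    return content
```

The labels use the same `F00`-style identifiers as the `frame` field in the model output, so each predicted action can point back to a labeled image.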

Hyperparameter	Value
LoRA rank / alpha	256 / 512
LoRA dropout	0.05
Optimizer	AdamW
Peak learning rate	2e-5
Schedule	WSD (500-step warmup; 1,333-step decay to 10% of peak)
Per-GPU batch size	2
DDP world size	8
Number of steps	5,000
Max. pixels	524,288
Sequence length	8192
Hardware	8× H100
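For readers unfamiliar with warmup-stable-decay (WSD) schedules, here is a small Python sketch using the values from the table. The linear shape of both the warmup and the decay, and the placement of the decay in the final 1,333 steps, are assumptions; the table only fixes the step counts, the peak and the 10% floor.

```python
def wsd_lr(step, peak=2e-5, total=5000, warmup=500, decay=1333, floor=0.10):
    """Learning rate at a given step under a warmup-stable-decay schedule."""
    decay_start = total - decay  # 3,667 with the defaults
    if step < warmup:
        # Assumed linear warmup from 0 to the peak.
        return peak * step / warmup
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return peak
    # Assumed linear decay from the peak down to floor * peak.
    progress = min(1.0, (step - decay_start) / decay)
    return peak * (1.0 - (1.0 - floor) * progress)
```

Under these assumptions the schedule holds the peak rate of 2e-5 from step 500 to step 3,667 and ends at 2e-6.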

2.4 Evaluation setup

We manually curate 44 macOS clips from a held-out set of crowd-cast (5 seconds each at 5 FPS) covering keystroke-heavy, click-heavy, scroll/drag, hotkey and mixed workflows, with hand-verified ground-truth actions. Each ground-truth action is annotated with a visibility class: visible (~91%, clear visual change in frames), inferable (~4%, deducible from context), or not predictable (~5%, no visual evidence at this frame rate). Only visible and inferable actions count toward the score.

Each evaluation clip is 25 frames, but the model only sees 10-frame sequences during training. Thus, during inference we run four overlapping windows (stride=5) and only evaluate predictions from frame indices 2 to 7. We match predictions to ground truth with a type-dependent frame tolerance (5 frames for key presses and mouse clicks, 0 frames for scrolling and mouse movement).
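The windowed inference and tolerance-based matching described above can be sketched as follows. Several details are assumptions made for illustration: "frame indices 2 to 7" is treated as the half-open range [2, 7) so that kept frames tile without overlap, matching is greedy and one-to-one, and mouse movement and scroll events are matched on type and frame only (their magnitudes being scored separately). All function names are hypothetical.

```python
# Frame tolerances per action type, as stated in the text.
TOLERANCE = {"KeyPress": 5, "MouseClick": 5, "MouseScroll": 0, "MouseMove": 0}


def kept_frames(clip_len=25, size=10, stride=5, keep=(2, 7)):
    """Map each window start to the absolute frame indices that are scored."""
    lo, hi = keep  # assumed half-open: indices lo .. hi - 1 within the window
    starts = range(0, clip_len - size + 1, stride)
    return {s: list(range(s + lo, s + hi)) for s in starts}


def same_event(pred, truth):
    """pred/truth: (frame, type, details) tuples."""
    frame_p, type_p, details_p = pred
    frame_t, type_t, details_t = truth
    if type_p != type_t or abs(frame_p - frame_t) > TOLERANCE[type_p]:
        return False
    # Assumption: discrete events must agree on details; continuous
    # events only need the same type within the frame tolerance.
    return type_p in ("MouseMove", "MouseScroll") or details_p == details_t


def match(preds, truths):
    """Greedy one-to-one matching; returns (TP, FP, FN)."""
    unused = list(truths)
    tp = 0
    for pred in preds:
        for cand in unused:
            if same_event(pred, cand):
                unused.remove(cand)
                tp += 1
                break
    return tp, len(preds) - tp, len(unused)
```

For a 25-frame clip this yields four windows starting at frames 0, 5, 10 and 15, whose kept frames cover indices 2 through 21 under the half-open reading.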

We report per-type precision, recall and F1. For mouse movement we additionally report R² between predicted and ground-truth displacement vectors, and mean cosine similarity for directional alignment. Reasoning was enabled for Gemini 3.5 Flash, GPT 5.5 and Kimi K2.6, and disabled for Gemma 4, Qwen3-VL and our fine-tuned model.
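A minimal Python sketch of these metrics is given below. Precision, recall and F1 follow their standard definitions; how R² is aggregated over the two displacement axes is not specified in the text, so pooling both axes into one sum of squares is an assumption, as is skipping zero-length vectors when averaging cosine similarity.

```python
import math


def prf(tp, fp, fn):
    """Precision, recall and F1 from match counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def r_squared(pred, true):
    """R^2 over (dx, dy) pairs, pooling both axes (assumed aggregation)."""
    n = len(true)
    mean_x = sum(x for x, _ in true) / n
    mean_y = sum(y for _, y in true) / n
    ss_res = sum((px - tx) ** 2 + (py - ty) ** 2
                 for (px, py), (tx, ty) in zip(pred, true))
    ss_tot = sum((tx - mean_x) ** 2 + (ty - mean_y) ** 2 for tx, ty in true)
    return 1.0 - ss_res / ss_tot


def mean_cosine(pred, true):
    """Mean cosine similarity between predicted and true displacements."""
    sims = []
    for (px, py), (tx, ty) in zip(pred, true):
        denom = math.hypot(px, py) * math.hypot(tx, ty)
        if denom:  # direction is undefined for zero-length vectors
            sims.append((px * tx + py * ty) / denom)
    return sum(sims) / len(sims) if sims else 0.0
```

Note that R² is not bounded below by zero: predictions worse than always guessing the mean displacement give negative values, which is how a score such as -0.08 can arise.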

2.5 Ablations

Our ablations each change one component of the final recipe.

Config	F1	R²	cos
Final recipe (8B, r=256)	0.79	0.71	0.64
2B backbone	0.74	0.65	0.57
4B backbone	0.76	0.54	0.40
− vision LoRA	0.76	0.63	0.42
− interleaved frame labels	0.73	-0.08	0.23

Removing interleaved frame labels degrades all metrics and causes mouse movement prediction to degenerate. Without text anchors between frames, the model loses track of frame identity and enters repetitive prediction loops.

Performance scales cleanly with both model size and LoRA rank: overall F1 improves along both axes, and mouse-direction quality improves sharply with LoRA rank. Removing vision LoRA hurts performance, especially mouse-direction quality.

Scaling sweeps. Overall F1 improves with LoRA rank and model size; mouse-direction cosine improves with LoRA rank.

2.6 Out-of-distribution generalization

Although the IDM has only been trained on macOS data, it largely transfers to Windows and Linux screencasts without adaptation. Mouse movement and click detection generalize best, while keyboard shortcuts are the main failure mode.

The remaining cross-platform errors are mostly Cmd/Ctrl confusion from macOS-only training.

We additionally evaluate the model on AgentNet, which is even further out of distribution than cross-platform clips because each frame depicts one logical action (such as inserting one full command into the terminal). Since the model has never seen non-uniformly sampled frames, it hallucinates dynamics that could explain the extensive changes between frames.

Positive example

Negative example

The model has not seen this kind of terminal text insertion during training. It first predicts only part of the inserted text, and subsequently explains large frame changes as terminal-history navigation.

What's next

Today, we expand crowd-cast support to Linux and Windows, thus covering all major operating systems. We expect our IDM to get better with more diverse data.

Our ultimate goal is to lengthen models' task horizons from days to weeks and months. With crowd-cast, we now have an in-house data supply chain of such months-long trajectories, and with our IDM we unlock the internet as a data trove. Our firm belief has always been that you need to train models on trajectories that span months if you want them to exhibit month-long horizons, but a dataset alone will not suffice. While we scale our data collection effort by orders of magnitude, we are investigating alternatives to dense attention for retaining memory over months-long rollouts (fixed-size state), continually training off-the-shelf models, inserting goals and synthetic thinking traces, and letting the model learn from experience.

With cross-platform support, anyone can now contribute to crowd-cast. We are paying participants to record their work sessions. You are not asked to do tasks for us; you record yourself doing work you would be doing anyway. If you are interested, apply here or check the live dashboard to see collection progress.

Contributions

Mihir Mahajan worked on data sourcing, training and evaluation, wrote large parts of the manuscript and the crowd-cast codebase, and ported crowd-cast to Windows. Franz Srambical ported crowd-cast to Linux and helped with the manuscript. Alfred Nguyen contributed to the data pipeline of the IDM. Stefan Bauer provided feedback and guidance. The p(doom) team jointly contributed to crowd-cast.