We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. Unlike prior benchmarks that evaluate video prediction on fixed scenes, CRONOS systematically intervenes on four key factors while keeping the underlying physical event (such as a collision, occlusion, or fall) fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event is affected by appearance, environment, and particularly by viewpoint changes. CRONOS provides a principled and reproducible testbed for diagnosing how generated video quality changes across interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions.
A model's ability to produce predictions of physical events that remain coherent across counterfactual variants of the visual input, such as changes in viewpoint, scene context, object appearance, or object category.
CRONOS frames video model evaluation as a controlled counterfactual experiment. Each physical event is rendered into multiple counterfactual observations by intervening on one scene factor at a time while holding all others fixed.
We span three fundamental rigid-body interactions, chosen to isolate distinct aspects of physical reasoning.
An object rolls across a surface and falls from an edge, testing prediction across changing contact conditions and free-fall motion.
One object impacts another, testing whether generated videos preserve plausible interaction dynamics, temporal/spatial coherence, and object permanence.
An object becomes fully occluded behind a scene element and later reappears, probing long-range temporal coherence and inference of hidden motion.
For each event, CRONOS applies one intervention at a time while holding the remaining variables fixed.
The rendering viewpoint is changed. Probes whether models disentangle scene geometry from observed motion.
Visual attributes (e.g., color) of the primary object are changed without altering physical parameters, isolating whether models disentangle appearance from dynamics.
The full scene is replaced: background, lighting, and event-relevant layout details change. Alters how the event unfolds across scenes, testing whether models adapt the event dynamics coherently to the new context.
The object of interest is replaced with another, changing both visual properties and physical parameters (mass, friction). Probes generalization across instances.
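The one-factor-at-a-time design described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual generation code: the `SceneConfig` class, its field names, and all example values (such as "warehouse" or "crate") are hypothetical and do not come from CRONOS.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class SceneConfig:
    """Hypothetical description of one rendered observation."""
    event: str        # physical event type, held fixed across variants
    viewpoint: str
    appearance: str
    scene: str
    object_type: str


def single_factor_variants(base, alternatives):
    """Build one variant per factor, each differing from `base` in that factor only."""
    variants = {}
    for factor, value in alternatives.items():
        if factor == "event":
            raise ValueError("the physical event type is not an intervention factor")
        variants[factor] = replace(base, **{factor: value})
    return variants


def changed_fields(base, variant):
    """List the field names on which two configurations differ."""
    a, b = asdict(base), asdict(variant)
    return [name for name in a if a[name] != b[name]]


# Illustrative values only.
base = SceneConfig("collision", "front", "red", "warehouse", "ball")
variants = single_factor_variants(
    base,
    {"viewpoint": "side", "appearance": "blue", "scene": "kitchen", "object_type": "crate"},
)

for factor, variant in variants.items():
    print(factor, changed_fields(base, variant))
```

Each variant changes exactly one configuration field, so a difference in generation quality between the base video and a variant can be attributed to that single intervention.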
CRONOS decomposes generation quality into complementary per-video metrics, validated against human ratings.
We evaluate Cosmos2.5 (2B, 14B), CogVideoX1.5 (5B), MAGI-1 (4.5B), and Wan2.2 (14B) across both I2V and V2V settings.
All evaluated video models fail to reliably generate short clips of basic rigid-body physics. Even the strongest model (Cosmos2.5-2B V2V) achieves only a 22% success rate, with some models below 5%.
Generation quality shifts substantially under every intervention, including superficial appearance changes, and particularly under viewpoint, object-type, and scene interventions.
V2V outperforms I2V not only on motion fidelity (as expected) but also on background and object stability, suggesting additional conditioning frames help form more stable internal representations at inference time.
Scaling Cosmos from 2B to 14B parameters yields no improvement on physical event generation. Model size alone does not guarantee better counterfactual physical consistency.
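One simple way to quantify how much quality shifts under an intervention is the mean absolute change in a per-video score between the base video and its variant. The sketch below assumes this measure for illustration; it is not necessarily the statistic CRONOS reports, the function name `intervention_shift` is hypothetical, and the scores are made-up numbers rather than benchmark results.

```python
from statistics import mean


def intervention_shift(base_scores, variant_scores):
    """Mean absolute score change per intervention.

    base_scores:    {event_id: score of the base video}
    variant_scores: {intervention: {event_id: score of the intervened video}}
    """
    shifts = {}
    for intervention, scores in variant_scores.items():
        shifts[intervention] = mean(
            abs(score - base_scores[event_id]) for event_id, score in scores.items()
        )
    return shifts


# Made-up scores on a 0-1 scale, for illustration only.
base_scores = {"collision": 0.75, "occlusion": 0.5}
variant_scores = {
    "viewpoint": {"collision": 0.25, "occlusion": 0.25},
    "appearance": {"collision": 0.75, "occlusion": 0.25},
}

shifts = intervention_shift(base_scores, variant_scores)
for intervention, shift in shifts.items():
    print(f"{intervention}: {shift:.3f}")
```

A counterfactually consistent model would show small shifts for every intervention; under this measure, a large value flags the factor the model is most sensitive to.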
Performance averaged across all videos. The table reports the headline success rate; per-metric scores (background stability, motion similarity, appearance stability, 3D shape, physical plausibility) are plotted in the radar chart, with all axes on a 0-1 scale.
| Rank | Model | Mode | Success |
|---|---|---|---|
| 1 | Cosmos2.5-2B | V2V | 0.22 |
| 2 | Wan2.2-14B | I2V | 0.20 |
| 3 | Cosmos2.5-14B | V2V | 0.14 |
| 4 | Cosmos2.5-2B | I2V | 0.12 |
| 5 | Cosmos2.5-14B | I2V | 0.08 |
| 6 | MAGI-1-4.5B | I2V | 0.02 |
| 6 | CogVideoX1.5-5B | I2V | 0.02 |
| 8 | MAGI-1-4.5B | V2V | 0.01 |
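
The ranks in the table, including the tie at rank 6, and the Cosmos scaling comparison can be reproduced from the success rates alone. The sketch below copies the rows from the table; the helper `competition_ranks` is a hypothetical name, and the assumption is that the table uses standard competition ranking (tied entries share a rank and the following rank is skipped).

```python
# Rows copied from the leaderboard table: (model, mode, success rate).
rows = [
    ("Cosmos2.5-2B", "V2V", 0.22),
    ("Wan2.2-14B", "I2V", 0.20),
    ("Cosmos2.5-14B", "V2V", 0.14),
    ("Cosmos2.5-2B", "I2V", 0.12),
    ("Cosmos2.5-14B", "I2V", 0.08),
    ("MAGI-1-4.5B", "I2V", 0.02),
    ("CogVideoX1.5-5B", "I2V", 0.02),
    ("MAGI-1-4.5B", "V2V", 0.01),
]


def competition_ranks(scores):
    """Standard competition ranking: ties share a rank, the next rank is skipped."""
    ordered = sorted(scores, reverse=True)
    # The first position of a score in descending order gives its rank.
    return [ordered.index(score) + 1 for score in scores]


ranks = competition_ranks([success for _, _, success in rows])

# Change in success rate when moving from Cosmos2.5-2B to Cosmos2.5-14B, per mode.
success = {(model, mode): rate for model, mode, rate in rows}
cosmos_delta = {
    mode: round(success[("Cosmos2.5-14B", mode)] - success[("Cosmos2.5-2B", mode)], 2)
    for mode in ("V2V", "I2V")
}

print(ranks)
print(cosmos_delta)
```

Both differences are negative, which matches the observation that the 14B Cosmos model does not improve on the 2B model in either setting.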
Side-by-side comparisons of generated futures for the same physical event. While most models preserve coarse scene structure, they fail to generate plausible motion, static camera viewpoints, or consistent objects.
Generated video and four intervention variants for a selected model. All videos are generated with the same random seed. A counterfactually consistent model would produce stable, comparable predictions across all variants.
AK acknowledges support via his Emmy Noether Research Group funded by the German Research Foundation (DFG) under grant number 468670075.
This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 539134284, through EFRE (FEIH_2698644) and the state of Baden-Württemberg.
@misc{begiristain2026cronos,
title={CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models},
author={Le{\'o}n Begiristain and Olaf D{\"u}nkel and Adam Kortylewski},
year={2026},
eprint={2605.23699},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.23699},
}