CRONOS: Benchmarking Counterfactual
Physical Consistency in Video Models

León Begiristain1 Olaf Dünkel2 Adam Kortylewski3
1University of Freiburg, Germany 2Max Planck Institute for Informatics, Saarland Informatics Campus, Germany 3CISPA Helmholtz Center for Information Security, Germany

Abstract

We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. Unlike prior benchmarks that evaluate video prediction on fixed scenes, CRONOS systematically intervenes on four key factors while keeping the underlying physical event (such as a collision, occlusion, or fall) fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event is affected by appearance, environment, and particularly by viewpoint changes. CRONOS provides a principled and reproducible testbed for diagnosing how generated video quality changes across interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions.

Core Concept

Counterfactual Physical Consistency

A model's ability to produce predictions of physical events that remain coherent across counterfactual variants of the visual input, such as changes in viewpoint, scene context, object appearance, or object category.

The CRONOS Benchmark

CRONOS frames video model evaluation as a controlled counterfactual experiment. Each physical event is rendered into multiple counterfactual observations by intervening on one scene factor at a time while holding all others fixed.

CRONOS benchmark overview: counterfactual interventions on viewpoint, scene, appearance, and object category.
Figure 1. The CRONOS benchmark systematically intervenes on visual factors while preserving the underlying physical event, isolating each axis of model sensitivity.

Physical Events

The benchmark covers three fundamental rigid-body interactions, each chosen to isolate a distinct aspect of physical reasoning.

Fall

An object rolls across a surface and falls from an edge, testing prediction across changing contact conditions and free-fall motion.

Collision

One object impacts another, testing whether generated videos preserve plausible interaction dynamics, temporal/spatial coherence, and object permanence.

Occlusion

An object becomes fully occluded behind a scene element and later reappears, probing long-range temporal coherence and inference of hidden motion.

Systematic Visual Interventions

For each event, CRONOS applies one intervention at a time while holding the remaining variables fixed.
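
The one-intervention-at-a-time design can be illustrated with a minimal sketch. The `SceneConfig` class, its field names, and all values below are hypothetical and chosen only for illustration; they are not the benchmark's actual configuration format.

```python
from dataclasses import dataclass, replace

# Hypothetical scene description; field names and values are illustrative only.
@dataclass(frozen=True)
class SceneConfig:
    event: str            # physical event, held fixed across all variants
    viewpoint: str
    scene: str
    appearance: str
    object_category: str

def one_factor_variants(base, alternatives):
    """Return {factor: config}; each config differs from `base` in exactly one factor."""
    variants = {}
    for factor, value in alternatives.items():
        if factor == "event":
            raise ValueError("the physical event must stay fixed")
        # dataclasses.replace copies `base` and overrides a single field.
        variants[factor] = replace(base, **{factor: value})
    return variants

base = SceneConfig("collision", "front", "warehouse", "red_matte", "ball")
variants = one_factor_variants(base, {
    "viewpoint": "side",
    "scene": "kitchen",
    "appearance": "blue_glossy",
    "object_category": "cube",
})
for factor, cfg in variants.items():
    print(factor, cfg)
```

Because every variant shares all but one field with the base configuration, any change in generation quality between the base and a variant can be attributed to that single factor.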

Evaluation Metrics

CRONOS decomposes generation quality into complementary per-video metrics, validated against human ratings.

Background Stability: perturbations of the environment over time, camera movement, and the appearance of new objects.
Appearance Stability: consistency of object appearance over time.
3D-Shape Stability: geometric consistency of objects throughout the video.
Motion Similarity: agreement of the generated motion with the reference rendered video.
Physical Plausibility: VLM-as-judge (Qwen3-VL-32B) answering physics-specific questions.
Success Rate: binary pass/fail aggregation of the metrics using human-calibrated thresholds.
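
As a rough illustration of how a binary pass/fail aggregation could work, the sketch below marks a video as a success only if every metric reaches its threshold. Both the AND rule and the threshold values are assumptions made for this example; the human-calibrated thresholds used by CRONOS are not listed on this page.

```python
# Hypothetical thresholds; the benchmark's human-calibrated values are not given here.
THRESHOLDS = {
    "background_stability": 0.5,
    "appearance_stability": 0.5,
    "shape_stability": 0.5,
    "motion_similarity": 0.5,
    "physical_plausibility": 0.5,
}

def passes(scores, thresholds=THRESHOLDS):
    """A video passes only if every metric reaches its threshold (assumed AND rule)."""
    return all(scores[name] >= t for name, t in thresholds.items())

def success_rate(videos, thresholds=THRESHOLDS):
    """Fraction of videos that pass; 0.0 for an empty list."""
    if not videos:
        return 0.0
    return sum(passes(v, thresholds) for v in videos) / len(videos)

# Toy scores for two videos, not benchmark data.
videos = [
    {name: 0.9 for name in THRESHOLDS},
    {**{name: 0.9 for name in THRESHOLDS}, "motion_similarity": 0.2},
]
print(success_rate(videos))
```

Under this rule a single weak metric fails the whole video, which is one way a model can score reasonably on individual axes and still have a low success rate.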

Key Findings

We evaluate Cosmos2.5 (2B, 14B), CogVideoX1.5 (5B), MAGI-1 (4.5B), and Wan2.2 (14B) across both I2V and V2V settings.

01

Models fail at basic rigid-body physics

All evaluated video models fail to reliably generate short clips of basic rigid-body physics. Even the strongest model (Cosmos2.5-2B V2V) achieves only a 22% success rate, with some models below 5%.

02

No model is counterfactually consistent

Generation quality shifts substantially under every intervention, including superficial appearance changes, and particularly under viewpoint, object-type, and scene interventions.

03

Video conditioning helps beyond motion

V2V outperforms I2V not only on motion fidelity (as expected) but also on background and object stability, suggesting additional conditioning frames help form more stable internal representations at inference time.

04

Scaling alone is not enough

Scaling Cosmos from 2B to 14B parameters yields no improvement on physical event generation. Model size alone does not guarantee better counterfactual physical consistency.
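
The scaling comparison can be checked directly against the success rates reported in the Benchmark Results table. The short sketch below only restates those four numbers; the helper name `scaling_delta` is introduced here for illustration.

```python
# Success rates copied from the Benchmark Results table.
success = {
    ("Cosmos2.5-2B", "V2V"): 0.22,
    ("Cosmos2.5-14B", "V2V"): 0.14,
    ("Cosmos2.5-2B", "I2V"): 0.12,
    ("Cosmos2.5-14B", "I2V"): 0.08,
}

def scaling_delta(mode):
    """Success(14B) minus success(2B) for Cosmos2.5 in the given mode."""
    delta = success[("Cosmos2.5-14B", mode)] - success[("Cosmos2.5-2B", mode)]
    return round(delta, 2)  # rounding hides floating-point noise

for mode in ("V2V", "I2V"):
    print(f"{mode}: {scaling_delta(mode):+.2f}")
# prints:
# V2V: -0.08
# I2V: -0.04
```

In both modes the larger model has the lower success rate, consistent with the finding that scaling alone does not help.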

Benchmark Results

Performance averaged across all videos. The table reports the headline success rate; the per-metric scores (background stability, motion similarity, appearance stability, 3D-shape stability, physical plausibility) are shown in the accompanying radar plot, with all axes on a 0-1 scale.

Rank  Model            Mode  Success
1     Cosmos2.5-2B     V2V   0.22
2     Wan2.2-14B       I2V   0.20
3     Cosmos2.5-14B    V2V   0.14
4     Cosmos2.5-2B     I2V   0.12
5     Cosmos2.5-14B    I2V   0.08
6     MAGI-1-4.5B      I2V   0.02
6     CogVideoX1.5-5B  I2V   0.02
8     MAGI-1-4.5B      V2V   0.01
Per-metric breakdown on a 0-1 scale. Higher is better on every axis.
Figure 2. Sensitivity to counterfactual interventions averaged across metrics. Lower is better. All models show substantial sensitivity across every intervention type, including superficial appearance changes.
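
One plausible way to compute such a sensitivity score is sketched below: the absolute change in each metric between the original video and its intervened counterpart, averaged over metrics. This definition is an assumption made for illustration, since the exact formula behind Figure 2 is not stated on this page, and the scores are toy values.

```python
from statistics import mean

def sensitivity(original, intervened):
    """Mean absolute change in metric score between original and intervened video."""
    shared = original.keys() & intervened.keys()
    if not shared:
        raise ValueError("no shared metrics to compare")
    return mean(abs(original[m] - intervened[m]) for m in shared)

# Toy metric scores on a 0-1 scale, not benchmark data.
original = {"motion_similarity": 0.8, "appearance_stability": 0.9}
by_intervention = {
    "viewpoint": {"motion_similarity": 0.4, "appearance_stability": 0.7},
    "appearance": {"motion_similarity": 0.7, "appearance_stability": 0.8},
}
scores = {name: sensitivity(original, v) for name, v in by_intervention.items()}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Under this definition a perfectly consistent model would score 0 for every intervention, matching the "lower is better" reading of the figure.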

Qualitative Results

Side-by-side comparisons of generated futures for the same physical event. While most models preserve coarse scene structure, they fail to generate plausible motion, static camera viewpoints, or consistent objects.

Video panels for the selected event: Render GT, Wan2.2-14B I2V, Cosmos2.5-2B V2V, Cosmos2.5-14B V2V, MAGI-1-4.5B V2V, Cosmos2.5-2B I2V, Cosmos2.5-14B I2V, MAGI-1-4.5B I2V, CogVideoX1.5-5B I2V.

Counterfactual Variants of the Same Event

Generated video and four intervention variants for a selected model. All videos are generated with the same random seed. A counterfactually consistent model would produce stable, comparable predictions across all variants.

Video panels for the selected model (V2V or I2V): Original, Viewpoint, Appearance, Scene, Object.

Limitations

Acknowledgments

AK acknowledges support via his Emmy Noether Research Group funded by the German Research Foundation (DFG) under grant number 468670075.

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 539134284, through EFRE (FEIH_2698644) and the state of Baden-Württemberg.

BibTeX

@misc{begiristain2026cronos,
      title={CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models}, 
      author={Le{\'o}n Begiristain and Olaf D{\"u}nkel and Adam Kortylewski},
      year={2026},
      eprint={2605.23699},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.23699}, 
}