CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Abstract

We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. Unlike prior benchmarks that evaluate video prediction on fixed scenes, CRONOS systematically intervenes on four key factors while keeping the underlying physical event — such as a collision, occlusion, or fall — fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event is affected by appearance, environment, and particularly by viewpoint changes. CRONOS provides a principled and reproducible testbed for diagnosing how generated video quality changes across interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions.

Core Concept

Counterfactual Physical Consistency

A model's ability to produce predictions of physical events that remain coherent across counterfactual variants of the visual input, such as changes in viewpoint, scene context, object appearance, or object category.

The CRONOS Benchmark

CRONOS frames video model evaluation as a controlled counterfactual experiment. Each physical event is rendered into multiple counterfactual observations by intervening on one scene factor at a time while holding all others fixed.

Figure 1. CRONOS benchmark overview: counterfactual interventions on viewpoint, scene, appearance, and object category. The benchmark systematically intervenes on visual factors while preserving the underlying physical event, isolating each axis of model sensitivity.

Physical Events

We span three fundamental rigid-body interactions, chosen to isolate distinct aspects of physical reasoning.

Fall

An object rolls across a surface and falls from an edge — testing prediction across changing contact conditions and free-fall motion.

Collision

One object impacts another — testing whether generated videos preserve plausible interaction dynamics, temporal/spatial coherence, and object permanence.

Occlusion

An object becomes fully occluded behind a scene element and later reappears — probing long-range temporal coherence and inference of hidden motion.

Systematic Visual Interventions

For each event, CRONOS applies one intervention at a time while holding the remaining variables fixed.

Camera Viewpoint

The rendering viewpoint is changed. Probes whether models disentangle scene geometry from observed motion.

Object Appearance

Visual attributes (e.g., color) of the primary object are changed without altering physical parameters — isolating whether models disentangle appearance from dynamics.

Scene

The full scene is replaced: background, lighting, and event-relevant layout details change. This alters how the event unfolds, testing whether models adapt the event dynamics coherently to the new context.

Object Category

The object of interest is replaced with another, changing both visual properties and physical parameters (mass, friction). Probes generalization across instances.
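
The one-factor-at-a-time protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual tooling: the `SceneConfig` class, its field names, and the example values are hypothetical. It also only models the configuration level, so it does not capture that an object-category or scene change also alters the physics of the rendered event.

```python
from dataclasses import dataclass, replace


# Hypothetical scene description. Field names and values are illustrative
# assumptions, not the benchmark's real configuration schema.
@dataclass(frozen=True)
class SceneConfig:
    event: str            # physical event: "fall", "collision", or "occlusion"
    viewpoint: str
    appearance: str
    scene: str
    object_category: str


def single_factor_variants(base, alternatives):
    """Return one variant per intervention, each changing exactly one factor."""
    variants = {}
    for factor, new_value in alternatives.items():
        if factor == "event":
            raise ValueError("the physical event must stay fixed")
        # replace() copies the base config and overrides only the named field.
        variants[factor] = replace(base, **{factor: new_value})
    return variants


base = SceneConfig("collision", "front", "red", "kitchen", "ball")
variants = single_factor_variants(base, {
    "viewpoint": "side",
    "appearance": "blue",
    "scene": "warehouse",
    "object_category": "crate",
})
```

Each entry of `variants` differs from `base` in a single field, so any change in generation quality between the two can be attributed to that one intervention.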

Evaluation Metrics

CRONOS decomposes generation quality into complementary per-video metrics, validated against human ratings.

Background Stability: penalizes perturbations in the environment over time, camera movement, and the appearance of new objects.

Appearance Stability: consistency of object appearance over time.

3D-Shape Stability: geometric consistency of objects throughout the video.

Motion Similarity: agreement of the generated motion with the reference rendered video.

Physical Plausibility: VLM-as-judge (Qwen3-VL-32B) over physics-specific questions.

Success Rate: binary pass/fail aggregation with human-calibrated thresholds.
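
One way to read the Success Rate aggregation is sketched below. The rule that a video passes only when every metric meets its threshold is an assumption, since the page does not spell out the aggregation rule. The threshold values and metric keys are hypothetical placeholders, not the human-calibrated values used by CRONOS.

```python
# Hypothetical thresholds: placeholders, not the benchmark's calibrated values.
THRESHOLDS = {
    "background_stability": 0.5,
    "appearance_stability": 0.5,
    "shape_stability": 0.5,
    "motion_similarity": 0.5,
    "physical_plausibility": 0.5,
}


def video_passes(scores, thresholds=THRESHOLDS):
    """Assumed rule: a video passes only if every metric meets its threshold."""
    return all(scores[name] >= limit for name, limit in thresholds.items())


def success_rate(per_video_scores, thresholds=THRESHOLDS):
    """Fraction of videos that pass; 0.0 for an empty list."""
    if not per_video_scores:
        return 0.0
    passed = sum(video_passes(s, thresholds) for s in per_video_scores)
    return passed / len(per_video_scores)


# Made-up scores for two videos: the second fails on motion similarity.
videos = [
    {name: 0.9 for name in THRESHOLDS},
    {**{name: 0.9 for name in THRESHOLDS}, "motion_similarity": 0.4},
]
rate = success_rate(videos)
```

Under this all-metrics rule a single weak axis is enough to fail a video, which would be consistent with headline success rates sitting well below the individual per-metric scores.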

Key Findings

We evaluate Cosmos2.5 (2B, 14B), CogVideoX1.5 (5B), MAGI-1 (4.5B), and Wan2.2 (14B) across both I2V and V2V settings.

01

Models fail at basic rigid-body physics

All evaluated video models fail to reliably generate short clips of basic rigid-body physics. Even the strongest model (Cosmos2.5-2B V2V) achieves only a 22% success rate, with some models below 5%.

02

No model is counterfactually consistent

Generation quality shifts substantially under every intervention, including superficial appearance changes, and particularly under viewpoint, object-type, and scene interventions.

03

Video conditioning helps beyond motion

V2V outperforms I2V not only on motion fidelity (as expected) but also on background and object stability, suggesting additional conditioning frames help form more stable internal representations at inference time.

04

Scaling alone is not enough

Scaling Cosmos from 2B to 14B parameters yields no improvement on physical event generation. Model size alone does not guarantee better counterfactual physical consistency.

Benchmark Results

Performance averaged across all videos. The table reports the headline success rate; per-metric scores (background stability, motion similarity, appearance stability, 3D shape, physical plausibility) are plotted in a radar chart, with all axes on a 0-1 scale.

Rank	Model	Mode	Success
1	Cosmos2.5-2B	V2V	0.22
2	Wan2.2-14B	I2V	0.20
3	Cosmos2.5-14B	V2V	0.14
4	Cosmos2.5-2B	I2V	0.12
5	Cosmos2.5-14B	I2V	0.08
6	MAGI-1-4.5B	I2V	0.02
6	CogVideoX1.5-5B	I2V	0.02
8	MAGI-1-4.5B	V2V	0.01
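
The comparisons behind the key findings can be reproduced from the table with a few lines of arithmetic. The sketch below copies the success rates from the table; the helper name `v2v_minus_i2v` and the variable names are illustrative.

```python
# Success rates copied from the table: (model, mode) -> success rate.
SUCCESS = {
    ("Cosmos2.5-2B", "V2V"): 0.22,
    ("Wan2.2-14B", "I2V"): 0.20,
    ("Cosmos2.5-14B", "V2V"): 0.14,
    ("Cosmos2.5-2B", "I2V"): 0.12,
    ("Cosmos2.5-14B", "I2V"): 0.08,
    ("MAGI-1-4.5B", "I2V"): 0.02,
    ("CogVideoX1.5-5B", "I2V"): 0.02,
    ("MAGI-1-4.5B", "V2V"): 0.01,
}


def v2v_minus_i2v(success):
    """Success-rate gap (V2V minus I2V) for models listed in both modes."""
    models = sorted({model for model, _ in success})
    return {
        model: round(success[(model, "V2V")] - success[(model, "I2V")], 2)
        for model in models
        if (model, "V2V") in success and (model, "I2V") in success
    }


gaps = v2v_minus_i2v(SUCCESS)

# Change in success rate when scaling Cosmos2.5 from 2B to 14B, per mode.
scale_change = {
    mode: round(SUCCESS[("Cosmos2.5-14B", mode)] - SUCCESS[("Cosmos2.5-2B", mode)], 2)
    for mode in ("I2V", "V2V")
}
```

On these numbers the V2V gap is +0.10 for Cosmos2.5-2B, +0.06 for Cosmos2.5-14B, and -0.01 for MAGI-1-4.5B, while scaling Cosmos2.5 from 2B to 14B changes the success rate by -0.04 in I2V and -0.08 in V2V.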

Per-metric breakdown on a 0-1 scale. Higher is better on every axis.

Figure 2. Sensitivity to counterfactual interventions averaged across metrics. Lower is better. All models show substantial sensitivity across every intervention type, including superficial appearance changes.
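
The page does not give the formula behind the sensitivity score. One plausible definition, shown here purely as an assumption, is the mean absolute change in each metric between the original video and its intervened counterpart, averaged across metrics. The metric keys and scores below are made up for illustration.

```python
from statistics import mean


def sensitivity(original, variant):
    """Mean absolute per-metric change between original and intervened video.

    Assumed definition for illustration; lower means more consistent.
    """
    shared = original.keys() & variant.keys()
    if not shared:
        raise ValueError("no metrics in common")
    return mean(abs(original[m] - variant[m]) for m in shared)


# Made-up scores for one event under two interventions.
original = {"motion_similarity": 0.75, "background_stability": 0.5}
by_intervention = {
    "viewpoint": {"motion_similarity": 0.25, "background_stability": 0.5},
    "appearance": {"motion_similarity": 0.75, "background_stability": 0.5},
}
scores = {name: sensitivity(original, v) for name, v in by_intervention.items()}
```

With these toy inputs the viewpoint intervention scores 0.25 and the appearance intervention scores 0.0, so a fully counterfactually consistent model would score 0 on every intervention.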

Qualitative Results

Side-by-side comparisons of generated futures for the same physical event. While most models preserve coarse scene structure, they fail to generate plausible motion, static camera viewpoints, or consistent objects.

Videos shown: Render GT, Wan2.2-14B I2V, Cosmos2.5-2B V2V, Cosmos2.5-14B V2V, MAGI-1-4.5B V2V, Cosmos2.5-2B I2V, Cosmos2.5-14B I2V, MAGI-1-4.5B I2V, CogVideoX1.5-5B I2V.

Counterfactual Variants of the Same Event

Generated video and four intervention variants for a selected model. All videos are generated with the same random seed. A counterfactually consistent model would produce stable, comparable predictions across all variants.

Videos shown (per model, in V2V or I2V mode): Original, Viewpoint, Appearance, Scene, Object.

Limitations

Synthetic-to-real domain gap. CRONOS uses Unreal Engine renderings; this control is necessary for matched counterfactuals, but introduces a domain gap.
Single-reference rollouts. Most metrics compare against one rendered reference, while the conditioning permits multiple plausible futures. We mitigate this with multi-seed evaluation and reference-independent stability metrics.
Scope of evaluated models. We evaluate open-source models with reproducible settings, not closed commercial systems (Veo, Sora, Kling). Even so, the benchmark is far from saturated.

Acknowledgments

AK acknowledges support via his Emmy Noether Research Group funded by the German Research Foundation (DFG) under grant number 468670075.

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 539134284, through EFRE (FEIH_2698644) and the state of Baden-Württemberg.

BibTeX

@misc{begiristain2026cronos,
      title={CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models}, 
      author={Le{\'o}n Begiristain and Olaf D{\"u}nkel and Adam Kortylewski},
      year={2026},
      eprint={2605.23699},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.23699}, 
}