An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they mostly do not capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify the OOD robustness of image classifiers under continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts at continuous severities by applying LoRA adapters to diffusion models. To remove failure cases, we propose a filtering mechanism that outperforms previous methods and hence enables reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change across shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness.
CNS-Bench is the first benchmark enabling robustness evaluation w.r.t. realistic and continuous nuisance shifts, scalable to any number of classes and shifts. It covers 14 diverse shifts across 100 ImageNet classes (200k images at 5 severity levels).
ImageNet class-specific LoRA adapters are applied to a diffusion model, continuously modulating nuisance intensity. To close the distribution gap between diffusion-generated images and ImageNet, we apply textual inversion to learn class-specific embeddings and adapters.
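As a rough illustration of how a LoRA adapter can modulate a shift continuously, the sketch below uses the standard LoRA formulation, in which a low-rank update `B @ A` is added to a frozen weight `W` and multiplied by a scalar. That CNS-Bench scales its adapters in exactly this way is an assumption; the function name `lora_effective_weight` and all numbers are hypothetical.

```python
def lora_effective_weight(W, A, B, scale):
    """Return W + scale * (B @ A) for matrices given as nested lists.

    W: frozen base weight, shape (rows, cols)
    B: low-rank factor, shape (rows, r)
    A: low-rank factor, shape (r, cols)
    scale: scalar controlling how strongly the adapter is applied
    """
    rows, inner, cols = len(B), len(A), len(A[0])
    delta = [
        [sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]
    return [[W[i][j] + scale * delta[i][j] for j in range(cols)] for i in range(rows)]


# Toy 2x2 base weight with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (2, 1)
A = [[0.5, -0.5]]    # shape (1, 2)

# scale = 0 leaves the base model unchanged; larger values apply more of the shift.
for scale in (0.0, 0.5, 1.0):
    print(scale, lora_effective_weight(W, A, B, scale))
```

In this toy setting, sweeping `scale` interpolates smoothly between the unmodified weight and the fully adapted one, which is the property a continuous severity axis relies on.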
Out-of-class samples are filtered via an ensemble of four filters: two text-alignment scores (CLIP with base and shifted prompts) and two image-feature similarity scores (CLIP and DINOv2 CLS token), calibrated on a human-annotated dataset. Our filter achieves substantially higher filtering accuracy than previous CLIP-based strategies.
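One simple way to combine four such scores is to keep an image only if every score clears its own threshold. The sketch below assumes this AND rule and assumes the scores have already been computed; the combination rule, the class `FilterScores`, the function `keep_sample`, and the threshold values are all hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class FilterScores:
    clip_text_base: float     # CLIP text alignment with the base prompt
    clip_text_shifted: float  # CLIP text alignment with the shifted prompt
    clip_image_sim: float     # CLIP image-feature similarity
    dino_cls_sim: float       # DINOv2 CLS-token similarity


# Hypothetical thresholds; in practice they would be calibrated on
# human-annotated examples rather than chosen by hand.
THRESHOLDS = FilterScores(0.20, 0.20, 0.60, 0.50)


def keep_sample(scores, thresholds=THRESHOLDS):
    """Keep an image only if all four scores reach their thresholds (assumed rule)."""
    return all(
        getattr(scores, name) >= getattr(thresholds, name)
        for name in vars(scores)
    )


print(keep_sample(FilterScores(0.30, 0.30, 0.70, 0.60)))  # all scores pass
print(keep_sample(FilterScores(0.30, 0.30, 0.70, 0.40)))  # DINOv2 score too low
```

Requiring all filters to agree favors precision over recall: a sample is discarded as soon as any single score suggests it has drifted out of class.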
Accuracy drops and model rankings vary across shifts and scales. For example, ViT outperforms other architectures at low painting-style scales but performs worse at high scales. This shift-and-scale dependence cannot be captured by single-point binary evaluations.
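To make the scale dependence concrete, the sketch below ranks two models at each severity level and reports the first scale at which a model's accuracy falls below a chosen minimum. The accuracies, scale values, model names, and the threshold-based notion of a failure point are all made up for illustration and are not results from the benchmark.

```python
def rank_models(acc_by_model, scale_idx):
    """Return model names sorted from best to worst accuracy at one scale."""
    return sorted(acc_by_model, key=lambda m: acc_by_model[m][scale_idx], reverse=True)


def failure_scale(accs, scales, min_acc):
    """Return the first scale where accuracy drops below min_acc, else None."""
    for scale, acc in zip(scales, accs):
        if acc < min_acc:
            return scale
    return None


# Hypothetical severity levels and accuracies (illustrative only).
scales = [0.5, 1.0, 1.5, 2.0, 2.5]
acc_by_model = {
    "model_a": [0.90, 0.85, 0.70, 0.50, 0.30],
    "model_b": [0.86, 0.83, 0.75, 0.65, 0.55],
}

for idx, scale in enumerate(scales):
    print(scale, rank_models(acc_by_model, idx))

for name, accs in acc_by_model.items():
    print(name, "drops below 0.6 at scale", failure_scale(accs, scales, 0.6))
```

In this toy data the ranking flips between the lowest and highest scale, which an evaluation at a single severity would miss.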
Average relative corruption error (rCE, lower is better) along three axes: (i) architecture, (ii) model size, and (iii) pre-training paradigm and data.
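For readers unfamiliar with the metric, the sketch below computes a relative corruption error following the common ImageNet-C convention: the model's error increase over its clean error, summed over severities, divided by the same quantity for a reference model. Whether CNS-Bench uses exactly this normalization is an assumption; the function name and all error values are hypothetical.

```python
def relative_corruption_error(model_errs, model_clean_err, ref_errs, ref_clean_err):
    """Relative corruption error for one shift (lower is better).

    model_errs / ref_errs: error rates at each severity level
    model_clean_err / ref_clean_err: error rates on unshifted images
    """
    numerator = sum(e - model_clean_err for e in model_errs)
    denominator = sum(e - ref_clean_err for e in ref_errs)
    if denominator == 0:
        raise ValueError("reference model shows no degradation; rCE is undefined")
    return numerator / denominator


# Illustrative error rates over three severity levels.
rce = relative_corruption_error(
    model_errs=[0.2, 0.3, 0.4], model_clean_err=0.1,
    ref_errs=[0.3, 0.5, 0.7], ref_clean_err=0.2,
)
print(round(rce, 3))
```

A value below 1 means the model degrades less than the reference under the shift, independent of how accurate either model is on clean images.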
Adam Kortylewski acknowledges support via his Emmy Noether Research Group funded by the German Research Foundation (DFG) under Grant No. 468670075.
@inproceedings{duenkel2025cns,
title = {CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts},
author = {D{\"u}nkel, Olaf and Jesslen, Artur and Xie, Jiaohao and Theobalt, Christian and Rupprecht, Christian and Kortylewski, Adam},
booktitle = {ICCV},
year = {2025}
}