CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts

ICCV 2025
1 Max Planck Institute for Informatics, 2 University of Freiburg, 3 University of Oxford

We evaluate the robustness of different models under gradually increasing nuisance shifts. A failure point is the smallest shift severity at which a model first misclassifies an image (highlighted in red). The distribution of failure points reveals whether models degrade gradually or abruptly.

Abstract

An important challenge when using computer vision models in the real world is evaluating their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they mostly do not capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify the OOD robustness of image classifiers under continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts at continuous severities by applying LoRA adapters to diffusion models. To remove failure cases, we propose a filtering mechanism that outperforms previous methods and hence enables reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness.

Our Approach

CNS-Bench is the first benchmark enabling robustness evaluation w.r.t. realistic and continuous nuisance shifts, scalable to any number of classes and shifts. It covers 14 diverse shifts across 100 ImageNet classes (200k images at 5 severity levels).

Overview.

ImageNet class-specific LoRA adapters are applied to a diffusion model, continuously modulating nuisance intensity. To close the distribution gap between diffusion-generated images and ImageNet, we apply textual inversion to learn class-specific embeddings and adapters.

Method generation.
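To make the idea of continuously modulating nuisance intensity concrete, here is a minimal sketch of the standard LoRA update with an explicit scale factor, W = W0 + scale * (B A). It is an illustration under assumptions, not the benchmark's implementation: the matrices are toy values, the function names are hypothetical, and how the scale is applied inside the diffusion model is not specified on this page.

```python
# Sketch only: generic LoRA weight update with an explicit scale.
# All shapes and values are toy examples, not taken from CNS-Bench.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w0, lora_b, lora_a, scale):
    """Return W0 + scale * (B @ A); scale=0 recovers the frozen base weight."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

W0 = [[1.0, 0.0], [0.0, 1.0]]  # toy frozen base weight (2x2)
B = [[1.0], [2.0]]             # toy low-rank factor (2x1)
A = [[0.5, -0.5]]              # toy low-rank factor (1x2)

# Sweeping the scale moves the effective weight smoothly away from W0.
for s in (0.0, 0.5, 1.0):
    print(s, lora_weight(W0, B, A, s))
```

Because the adapter contribution is linear in the scale, intermediate values interpolate between the unshifted model and the fully shifted one, which is what makes a continuous severity axis possible.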

Out-of-class samples are filtered via an ensemble of four filters: two text-alignment scores (CLIP with base and shifted prompts) and two image-feature similarity scores (CLIP and DINOv2 CLS token), calibrated on a human-annotated dataset. Our filter achieves substantially higher filtering accuracy than previous CLIP-based strategies.

Method filtering.
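The filtering step can be sketched as follows. This is a hedged illustration only: the page states that four scores are combined and calibrated on human annotations, but not how. The sketch assumes a per-filter threshold chosen to maximize accuracy on the annotated samples and a logical AND across filters. The filter names, the scores, and the labels are all hypothetical.

```python
from typing import Dict, List

# Hypothetical names for the four scores described in the text.
FILTERS = ("clip_text_base", "clip_text_shifted", "clip_image", "dinov2_cls")

def calibrate_threshold(scores: List[float], labels: List[bool]) -> float:
    """Pick the candidate threshold that maximizes accuracy on labeled samples."""
    best_t, best_acc = min(scores), -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == y for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def keep_sample(sample: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Assumed rule: keep an image only if every filter score passes its threshold."""
    return all(sample[name] >= thresholds[name] for name in FILTERS)

# Toy annotated set: (scores, human label "image still shows the class").
annotated = [
    ({"clip_text_base": 0.30, "clip_text_shifted": 0.28, "clip_image": 0.80, "dinov2_cls": 0.75}, True),
    ({"clip_text_base": 0.27, "clip_text_shifted": 0.31, "clip_image": 0.70, "dinov2_cls": 0.65}, True),
    ({"clip_text_base": 0.15, "clip_text_shifted": 0.25, "clip_image": 0.40, "dinov2_cls": 0.30}, False),
    ({"clip_text_base": 0.20, "clip_text_shifted": 0.12, "clip_image": 0.55, "dinov2_cls": 0.35}, False),
]
labels = [label for _, label in annotated]
thresholds = {
    name: calibrate_threshold([scores[name] for scores, _ in annotated], labels)
    for name in FILTERS
}
decisions = [keep_sample(scores, thresholds) for scores, _ in annotated]
print(thresholds, decisions)
```

Other combination rules (for example a learned classifier over the four scores) would fit the same description; the AND rule is used here only because it is the simplest to read.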

Benchmarking Results

Accuracy drops and model rankings vary across shifts and scales. For example, ViT outperforms other architectures at low painting-style scales but performs worse at high scales. Single-point binary evaluations cannot capture this dependence on shift and scale.

Result plots.
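A ranking change of this kind can be detected with a few lines of code once per-scale accuracies are available. The sketch below is illustrative only: the model names and accuracy values are invented placeholders, not results from the benchmark, and the helper names are hypothetical.

```python
from typing import Dict, List

def ranking_at(acc_by_model: Dict[str, List[float]], scale_idx: int) -> List[str]:
    """Model names sorted by accuracy (best first) at one severity level."""
    return sorted(acc_by_model, key=lambda name: acc_by_model[name][scale_idx], reverse=True)

def ranking_changes(acc_by_model: Dict[str, List[float]]) -> List[int]:
    """Severity indices at which the ranking differs from the previous level."""
    n_levels = len(next(iter(acc_by_model.values())))
    rankings = [ranking_at(acc_by_model, i) for i in range(n_levels)]
    return [i for i in range(1, n_levels) if rankings[i] != rankings[i - 1]]

# Invented accuracies over three severity levels (placeholders, not measured).
toy_acc = {
    "model_a": [0.90, 0.85, 0.70],
    "model_b": [0.88, 0.80, 0.75],
}
print(ranking_changes(toy_acc))
```

A binary benchmark that only evaluated the last level would report model_b as better and miss that model_a leads at the lower levels.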

Average relative corruption error (rCE, lower is better) along three axes: (i) architecture, (ii) model size, and (iii) pre-training paradigm and data.

Model evaluation.
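For readers unfamiliar with the metric, the sketch below computes a relative corruption error in the style of the ImageNet-C definition: the model's error increase over its clean error, summed over severity levels and normalized by the same quantity for a reference model. Whether CNS-Bench uses exactly this normalization and which reference model it uses are not stated on this page, so treat both as assumptions. The numbers are placeholders.

```python
from typing import Sequence

def relative_corruption_error(
    model_errs: Sequence[float],
    model_clean_err: float,
    ref_errs: Sequence[float],
    ref_clean_err: float,
) -> float:
    """Summed error increase of the model divided by that of a reference model."""
    numerator = sum(e - model_clean_err for e in model_errs)
    denominator = sum(e - ref_clean_err for e in ref_errs)
    if denominator == 0:
        raise ValueError("reference model shows no error increase; rCE is undefined")
    return numerator / denominator

# Placeholder error rates over three severity levels (not measured values).
rce = relative_corruption_error(
    model_errs=[0.25, 0.5, 0.75], model_clean_err=0.125,
    ref_errs=[0.5, 0.75, 1.0], ref_clean_err=0.25,
)
print(rce)
```

Values below 1 mean the model degrades less than the reference under the shift, independent of how accurate it is on clean images.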

Key Findings

  • Visual state-space models are most robust. VMamba achieves the lowest rCE among architectures with comparable parameters and training data, outperforming DeiT3, ConvNeXt, ResNet-152, and ViT.
  • Self-supervised pre-training beats supervised pre-training. DINOv1 fine-tuned on IN1k outperforms a fine-tuned model that was pre-trained on the larger IN21k in a supervised manner, despite using less training data.
  • Diffusion classifiers are less robust than discriminative models, and the gap widens with higher shift severity.
  • Model failure points differ across shift types. Weather shifts (e.g., snow) cause gradual failure accumulation, while style shifts (e.g., cartoon) cause abrupt failure clusters at a specific scale, a distinction that is invisible to binary benchmarks.
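The failure-point analysis follows directly from the definition given at the top of the page: for each image, take the smallest severity at which the model first misclassifies it. A minimal sketch, assuming per-image correctness is recorded at each of the five severity levels in increasing order; the level indices 1 to 5 and the correctness records are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence

def failure_point(correct_by_level: Sequence[bool], levels: Sequence[int]) -> Optional[int]:
    """Smallest severity level at which the image is first misclassified, or None."""
    for level, is_correct in zip(levels, correct_by_level):
        if not is_correct:
            return level
    return None  # the model never fails on this image

levels = [1, 2, 3, 4, 5]  # assumed indexing of the five severity levels

# Hypothetical per-image correctness across the five levels.
records: List[List[bool]] = [
    [True, True, False, False, False],
    [True, True, True, True, True],
    [True, False, True, False, False],
    [True, True, False, True, True],
    [False, False, False, False, False],
]
points = [failure_point(r, levels) for r in records]
histogram = Counter(p for p in points if p is not None)
print(points, dict(histogram))
```

A histogram spread evenly across levels corresponds to gradual failure accumulation, while a spike at a single level corresponds to the abrupt failure clusters described for style shifts.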

Acknowledgments

Adam Kortylewski acknowledges support via his Emmy Noether Research Group funded by the German Research Foundation (DFG) under Grant No. 468670075.

BibTeX

@inproceedings{duenkel2025cns,
    title = {CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts},
    author = {D{\"u}nkel, Olaf and Jesslen, Artur and Xie, Jiaohao and Theobalt, Christian and Rupprecht, Christian and Kortylewski, Adam},
    booktitle = {ICCV},
    year = {2025}
}