Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence covering 100 categories and over 1M correspondence pairs. SOCO introduces a taxonomy of correspondence types, provides consistent and functionally meaningful keypoint annotations, and expands both the scale and category diversity of previous datasets. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision–language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localisation than at visual-reference cross-image matching, and (iii) correspondence performance correlates strongly with downstream tasks such as segmentation, tracking, 3D pose estimation, and 3D detection. Together, these findings position SOCO as a benchmark for evaluating structured, part-level representation quality in modern vision and multimodal foundation models.
A taxonomy-driven formulation of semantic correspondence that disentangles three distinct abilities collapsed by prior benchmarks: recognising the same local concept (CC), identifying the correct object-relative instance of that concept (SOC), and transferring concepts across related categories (Cross-SOC). This decomposition standardises what counts as a valid correspondence and makes distinct model failure modes separately measurable.
SOCO reframes correspondence evaluation around a taxonomy of object parts. Every keypoint is tied to a hierarchical, functionally meaningful concept, allowing consistent part definitions both within and across object categories.
SOCO decomposes a single correspondence score into three progressively harder tasks, each isolating a distinct aspect of structured object understanding.
CC: Match the same local semantic concept across instances (e.g., a wheel centre to a wheel centre). This is pure part recognition, ignoring which specific instance of the part it is.
SOC: Match the same concept with the same object-relative identity (front-left wheel to front-left wheel), which requires awareness of object geometry and structure.
Cross-SOC: Match object-relative keypoints across related categories through shared taxonomy concepts (a wheel on a car, bus, or tractor), which requires category-level abstraction.
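As a rough illustration of how the three tasks differ, the sketch below encodes each as a predicate over a pair of annotated keypoints. The `Keypoint` fields (`category`, `concept`, `identity`) are hypothetical names chosen for this example and are not SOCO's annotation schema. It also assumes that CC places no constraint on category, which the description above does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keypoint:
    # Hypothetical fields for illustration only, not SOCO's actual schema.
    category: str  # object category, e.g. "car"
    concept: str   # local semantic concept, e.g. "wheel_centre"
    identity: str  # object-relative identity, e.g. "front_left"


def is_cc_match(a: Keypoint, b: Keypoint) -> bool:
    """CC: same local concept, regardless of which instance of the part."""
    return a.concept == b.concept


def is_soc_match(a: Keypoint, b: Keypoint) -> bool:
    """SOC: same concept and same object-relative identity, same category."""
    return (a.category == b.category
            and a.concept == b.concept
            and a.identity == b.identity)


def is_cross_soc_match(a: Keypoint, b: Keypoint) -> bool:
    """Cross-SOC: same concept and identity, but across different categories."""
    return (a.category != b.category
            and a.concept == b.concept
            and a.identity == b.identity)


car_fl = Keypoint("car", "wheel_centre", "front_left")
car_rr = Keypoint("car", "wheel_centre", "rear_right")
bus_fl = Keypoint("bus", "wheel_centre", "front_left")
```

Here `car_fl` and `car_rr` form a valid CC pair but not a valid SOC pair, which is the repeated-part case that separates the first two tasks. `car_fl` and `bus_fl` form a valid Cross-SOC pair.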
SOCO substantially expands the scale, diversity, and consistency of prior correspondence datasets, and is the first to pair every keypoint with a natural-language description.
Categories span four super-classes — Transportation, Hand-held objects, Furniture, and Animals — sharing a hierarchical vocabulary of semantic concepts that enables consistent intra- and cross-category matching.
We evaluate a broad family of vision foundation models and large vision–language models on the SOC taxonomy, and study how SOC relates to dense downstream tasks across 37 vision models.
Strong backbones recognise local concepts well but degrade sharply when object-relative identity is required. Even the best model, DINOv2, drops from 78.9 (CC) to 60.4 (SOC), and the same CC→SOC drop, reflecting confusion between repeated parts, appears for every model.
Moving from SOC to Cross-SOC reveals limited category-level abstraction: DINOv2 falls further to 55.0, and most backbones lose 15–24 points between CC and Cross-SOC, exposing reliance on category-specific appearance.
Current LVLMs are far stronger at text-prompted part localisation than at visual-reference cross-image matching. Qwen3-VL-8B rises from 34.2 (Vis.) to 54.0 (Desc.), yet all LVLMs remain well below the DINOv2 ceiling of 81.0.
Across 37 vision models, SOC correlates with dense downstream tasks — segmentation, tracking, 3D pose, 3D detection — far more strongly than ImageNet kNN classification, making it a practical zero-shot probe of representation quality.
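The drops quoted in the findings above can be recomputed from the per-model scores. The following is a small sketch using the (CC, SOC, Cross-SOC) PCK@0.1 values from the results table on this page. The helper name `drops` and the dictionary layout are choices made for this example.

```python
# (CC, SOC, Cross-SOC) PCK@0.1 values copied from the results table on this page.
scores = {
    "DINOv2": (78.9, 60.4, 55.0),
    "DINOv3": (69.7, 55.5, 49.4),
    "C-RADIOv3": (69.0, 51.1, 46.3),
    "I-JEPA": (60.5, 46.3, 38.4),
    "DUNE": (60.1, 45.7, 38.5),
    "PE-Spatial": (60.6, 43.8, 38.8),
    "SD 2.1": (56.0, 44.8, 38.3),
    "iBOT": (55.2, 39.6, 34.1),
    "PIXIO": (49.5, 37.5, 32.9),
    "DINOv1": (43.8, 30.6, 23.9),
    "QWEN-L": (27.2, 19.4, 16.2),
    "CLIP": (24.9, 16.1, 11.2),
    "CroCov2": (15.2, 10.2, 7.8),
    "MAE": (14.4, 9.4, 7.2),
}


def drops(cc: float, soc: float, cross: float) -> tuple[float, float, float]:
    """Return the (CC->SOC, SOC->Cross-SOC, CC->Cross-SOC) drops in points."""
    return round(cc - soc, 1), round(soc - cross, 1), round(cc - cross, 1)


# Models whose total CC->Cross-SOC drop lies in the 15 to 24 point range.
in_range = [name for name, s in scores.items() if 15.0 <= drops(*s)[2] <= 24.0]

for name, s in scores.items():
    print(f"{name:12s} drops: {drops(*s)}")
print(f"{len(in_range)} of {len(scores)} models lose 15 to 24 points from CC to Cross-SOC")
```

With the table values as listed, DINOv2 loses 18.5 points from CC to SOC and 23.9 points overall, and 10 of the 14 listed models fall in the 15 to 24 point range.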
PCK@0.1 across the three correspondence tasks. Scores fall from CC to SOC to Cross-SOC for every model.
| Model | CC | SOC | Cross-SOC | Avg |
|---|---|---|---|---|
| DINOv2 | 78.9 | 60.4 | 55.0 | 64.8 |
| DINOv3 | 69.7 | 55.5 | 49.4 | 58.2 |
| C-RADIOv3 | 69.0 | 51.1 | 46.3 | 55.5 |
| I-JEPA | 60.5 | 46.3 | 38.4 | 48.4 |
| DUNE | 60.1 | 45.7 | 38.5 | 48.1 |
| PE-Spatial | 60.6 | 43.8 | 38.8 | 47.7 |
| SD 2.1 | 56.0 | 44.8 | 38.3 | 46.4 |
| iBOT | 55.2 | 39.6 | 34.1 | 43.0 |
| PIXIO | 49.5 | 37.5 | 32.9 | 40.0 |
| DINOv1 | 43.8 | 30.6 | 23.9 | 32.8 |
| QWEN-L | 27.2 | 19.4 | 16.2 | 20.9 |
| CLIP | 24.9 | 16.1 | 11.2 | 17.4 |
| CroCov2 | 15.2 | 10.2 | 7.8 | 11.1 |
| MAE | 14.4 | 9.4 | 7.2 | 10.3 |
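For readers unfamiliar with the metric in the table, PCK@α (percentage of correct keypoints) counts a prediction as correct when it lies within α times a reference size of the ground-truth keypoint. The sketch below normalises by the longer side of the object bounding box, which is a common convention. Whether SOCO normalises by bounding box or image size is not stated on this page, so treat that choice, along with the function name and argument layout, as an assumption.

```python
import math


def pck(pred, gt, bbox_sizes, alpha=0.1):
    """Percentage of predictions within alpha * max(w, h) of the ground truth.

    pred, gt:    lists of (x, y) pixel coordinates, one per query keypoint.
    bbox_sizes:  list of (width, height) of the target object's bounding box
                 (assumed normalisation, see the note above).
    """
    if not gt or not (len(pred) == len(gt) == len(bbox_sizes)):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = 0
    for (px, py), (gx, gy), (w, h) in zip(pred, gt, bbox_sizes):
        # A keypoint is correct if its Euclidean error is within the threshold.
        if math.hypot(px - gx, py - gy) <= alpha * max(w, h):
            correct += 1
    return 100.0 * correct / len(gt)


# Toy example: threshold is 0.1 * 100 = 10 px for both keypoints.
example = pck(
    pred=[(10, 10), (50, 50)],
    gt=[(12, 10), (80, 50)],
    bbox_sizes=[(100, 50), (100, 50)],
)
```

In the toy example the first prediction is 2 px off and the second 30 px off, so only one of the two counts as correct.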
SOC accuracy broken down by SOCO's four super-categories. Models behave quite differently across object families: e.g. DINOv3 leads on furniture, while DINOv2 dominates transportation, hand-held objects, and animals.
| Model | Transp. | Hand | Furn. | Animals |
|---|---|---|---|---|
| DINOv2 | 56.9 | 61.6 | 45.5 | 66.3 |
| DINOv3 | 51.6 | 57.4 | 59.9 | 56.6 |
| C-RADIOv3 | 51.7 | 48.1 | 39.7 | 54.8 |
| I-JEPA | 41.5 | 51.0 | 46.7 | 47.7 |
| DUNE | 40.0 | 50.5 | 51.0 | 46.6 |
| PE-Spatial | 45.1 | 40.7 | 37.6 | 45.9 |
| SD 2.1 | 42.4 | 45.3 | 47.7 | 45.6 |
| iBOT | 36.1 | 40.1 | 32.8 | 43.9 |
| PIXIO | 37.3 | 40.2 | 46.7 | 34.2 |
| DINOv1 | 29.1 | 32.9 | 27.4 | 31.4 |
| QWEN-L | 21.7 | 24.1 | 22.6 | 14.5 |
| CLIP | 17.5 | 14.2 | 11.6 | 16.9 |
| CroCov2 | 11.6 | 12.3 | 10.2 | 8.1 |
| MAE | 10.0 | 11.9 | 11.7 | 7.1 |
SOC-geo restricts the candidates to keypoints of the same semantic concept, so the model only has to identify the correct geometric position (e.g. front-left vs. rear-right wheel). Random baseline ≈ 41.2%. The ranking changes substantially: SD 2.1 leads, and DINOv3 surpasses DINOv2. Geometric part awareness does not track overall SOC.
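One way to see why the SOC-geo random baseline is so much higher than in the unrestricted setting: if each query has only a handful of same-concept candidates and exactly one is correct, uniform guessing succeeds with probability 1/n per query. The sketch below averages that over queries. The candidate counts are invented toy numbers, not SOCO statistics, and the single-correct-candidate assumption is a simplification.

```python
def random_baseline(candidate_counts):
    """Expected accuracy (%) of uniform guessing, one correct candidate per query.

    candidate_counts: number of same-concept candidate keypoints for each query
                      (hypothetical input format).
    """
    if not candidate_counts or any(n < 1 for n in candidate_counts):
        raise ValueError("each query needs at least one candidate")
    return 100.0 * sum(1.0 / n for n in candidate_counts) / len(candidate_counts)


# Toy counts: two queries with 2 candidates, two queries with 4 candidates.
toy = random_baseline([2, 2, 4, 4])
```

Under these assumptions, a baseline near 41% would correspond to queries that typically have two or three same-concept candidates.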
Across 37 vision models, we compare how well SOC and ImageNet kNN predict performance on dense downstream tasks (Pearson r). SOC dominates kNN on every task.
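The comparison above rests on the Pearson correlation coefficient between a probe score (SOC or kNN accuracy) and a downstream metric, computed across models. The sketch below is a minimal pure-Python version of that coefficient. The input lists are toy placeholders, not values from the paper, and the function name is chosen for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy placeholders: one probe score and one downstream score per model.
probe_scores = [1.0, 2.0, 3.0, 4.0]
downstream_scores = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(probe_scores, downstream_scores)
```

In practice each list would hold one entry per evaluated model, and the resulting `r` for SOC would be compared against the `r` obtained with kNN accuracy for the same downstream task.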
All settings show the target image with candidate keypoints; only the query differs. Vis. marks the query in a source image, Vis.+Desc. adds the keypoint description, and Desc. uses the description alone. Most LVLMs improve markedly as language is added (Qwen3-VL-8B is the exception in the Vis.+Desc. setting, dipping from 34.2 to 30.8 before reaching 54.0 with Desc. alone), but all remain well below the strongest vision backbones (cf. table above).
| Method | Vis. | Vis.+Desc. | Desc. |
|---|---|---|---|
| Random | 0.4 | 0.4 | 0.4 |
| Random++ | 25.0 | 25.0 | 25.0 |
| DINOv2 (VFM ceiling) | 81.0 | — | — |
| LLaVA-OV-7B | 2.9 | 14.1 | 24.3 |
| InternVL3.5-8B | 24.9 | 38.5 | 39.6 |
| Qwen2.5-VL-3B | 5.2 | 17.4 | 29.9 |
| Qwen2.5-VL-7B | 19.4 | 30.8 | 39.1 |
| Qwen3-VL-4B | 8.6 | 18.0 | 44.4 |
| Qwen3-VL-8B | 34.2 | 30.8 | 54.0 |
| GPT4o | 30.2 | 30.9 | 37.6 |
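The effect of adding language can be read off the table by differencing the columns. The sketch below does this for the LVLM rows, using the values as listed above. The helper name `gains` is a choice made for this example.

```python
# (Vis., Vis.+Desc., Desc.) scores copied from the LVLM table above.
lvlm = {
    "LLaVA-OV-7B": (2.9, 14.1, 24.3),
    "InternVL3.5-8B": (24.9, 38.5, 39.6),
    "Qwen2.5-VL-3B": (5.2, 17.4, 29.9),
    "Qwen2.5-VL-7B": (19.4, 30.8, 39.1),
    "Qwen3-VL-4B": (8.6, 18.0, 44.4),
    "Qwen3-VL-8B": (34.2, 30.8, 54.0),
    "GPT4o": (30.2, 30.9, 37.6),
}


def gains(vis: float, vis_desc: float, desc: float) -> tuple[float, float]:
    """Return the gains over Vis. from (adding the description, description only)."""
    return round(vis_desc - vis, 1), round(desc - vis, 1)


# Models whose score does not increase monotonically as language is added.
non_monotonic = [name for name, (v, vd, d) in lvlm.items() if not (v <= vd <= d)]

for name, s in lvlm.items():
    print(f"{name:16s} gains over Vis.: {gains(*s)}")
print("non-monotonic:", non_monotonic)
```

Every listed LVLM scores higher with Desc. than with Vis., and Qwen3-VL-8B is the only one whose Vis.+Desc. score falls below its Vis. score.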
Taxonomy-driven keypoint annotations across diverse categories, with consistent semantic concepts and object-relative identities.
AK acknowledges support via his Emmy Noether Research Group funded by the German Research Foundation (DFG) under grant number 468670075.
We thank Matthis Heimberg for early analyses and experiments.
@misc{duenkel2026soco,
title = {SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models},
author = {D{\"u}nkel, Olaf and Sunagad, Basavaraj and Wang, Haoran and
Hoffmann, David T. and Theobalt, Christian and Kortylewski, Adam},
year = {2026},
eprint = {2605.31597},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.31597}
}