Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

ICCV 2025
1Max Planck Institute for Informatics, 2ETH Zurich, 3University of Oxford, 4University of Freiburg
DIY-SC finds semantic correspondences for extreme appearance and shape changes. Feature refinement with filtered pseudo-labels brings significant improvements.

Interactive Demo

Try DIY-SC directly in your browser: upload two images and explore semantic correspondences.

Abstract

Finding correspondences between semantically similar points across images and object instances is one of the long-standing challenges in computer vision. While large pre-trained vision models have recently been shown to be effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features with pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we set a new state of the art on SPair-71k, with an absolute gain of over 4% overall and of over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.

Method

Method overview. We use azimuth information to sample image pairs for which higher zero-shot performance can be expected (1). We then chain the pairwise predictions to get correspondences for larger viewpoint changes, where we reject matches that do not fulfill a relaxed cyclic consistency constraint (2). We further filter pseudo-labels by rejecting pairs that cannot be mapped to a similar location on a 3D spherical prototype (3). Finally, we use the resulting pseudo-labels to train an adapter in a supervised manner (4). This approach does not require keypoint annotations and is, therefore, easily scalable to larger datasets, which we demonstrate by pre-training on ImageNet-3D and fine-tuning on SPair-71k.
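Steps (2) and (3) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the pixel `radius` used to relax the cycle check, the per-point sphere coordinates, and the angular threshold are all assumptions made for illustration. Matches are chained view by view through nearest-neighbor search, a match survives only if its forward-backward round trip lands within `radius` of where it started, and a pair is rejected when its two points map to distant locations on the sphere.

```python
import numpy as np


def nearest_neighbors(feat_src, feat_tgt):
    """For each source feature (N, D), index of the most cosine-similar target feature (M, D)."""
    src = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    return np.argmax(src @ tgt.T, axis=1)


def relaxed_cycle_mask(fwd, bwd, coords_src, radius):
    """True where the source -> target -> source round trip ends within `radius` pixels.

    radius = 0 corresponds to a strict cycle check; a larger radius relaxes it
    (the exact relaxation used in the paper is an assumption here).
    """
    round_trip = coords_src[bwd[fwd]]
    return np.linalg.norm(round_trip - coords_src, axis=1) <= radius


def chain_matches(feats, coords, radius):
    """Chain pairwise matches along a sequence of views.

    feats:  list of (N_i, D) feature arrays, ordered by viewpoint.
    coords: list of (N_i, 2) pixel coordinates for the same points.
    Returns, for every point of the first view, its index in the last view
    and a flag saying whether every hop passed the relaxed cycle check.
    """
    n = feats[0].shape[0]
    current = np.arange(n)
    valid = np.ones(n, dtype=bool)
    for a in range(len(feats) - 1):
        fwd = nearest_neighbors(feats[a], feats[a + 1])
        bwd = nearest_neighbors(feats[a + 1], feats[a])
        ok = relaxed_cycle_mask(fwd, bwd, coords[a], radius)
        valid &= ok[current]      # a chain is valid only if each hop is valid
        current = fwd[current]    # follow the match into the next view
    return current, valid


def spherical_rejection_mask(sphere_src, sphere_tgt, max_angle_deg):
    """True where matched points lie within `max_angle_deg` of each other on the sphere.

    sphere_src, sphere_tgt: (N, 3) hypothetical sphere coordinates of matched points.
    """
    a = sphere_src / np.linalg.norm(sphere_src, axis=1, keepdims=True)
    b = sphere_tgt / np.linalg.norm(sphere_tgt, axis=1, keepdims=True)
    cos = np.clip(np.sum(a * b, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)) <= max_angle_deg
```

In such a setup, the pseudo-labels passed on to adapter training would be the chained matches for which both the cycle flag and the spherical mask are true.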

Quantitative Results

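| Supervision | Method | SPair-71k PCK@0.1 | SPair-71k PCK@0.05 | SPair-71k PCK@0.01 | AP-10k I.S. (PCK@0.1) | AP-10k C.S. (PCK@0.1) | AP-10k C.F. (PCK@0.1) |
|---|---|---|---|---|---|---|---|
| Zero-shot | SD + DINOv2 | 59.9 | 44.7 | 7.9 | 62.9 | 59.3 | 48.3 |
| Zero-shot | DistillDIFT* (U.S.) | 60.8 | 45.4 | 8.0 | - | - | - |
| Weakly supervised | SphMap† | 64.4 | 48.2 | 8.4 | 65.4 | 63.1 | 51.0 |
| Weakly supervised | TLR | 65.4 | 49.1 | 9.9 | 68.7 | 64.6 | 52.7 |
| Weakly supervised | DistillDIFT* (W.S.) | 65.3 | 49.8 | 8.9 | - | - | - |
| Weakly supervised | Ours | 71.6 | 53.8 | 10.1 | 70.6 | 69.1 | 57.8 |
| Weakly supervised | Ours (DINOv2 only) | 70.6 | 51.1 | 9.0 | 71.2 | 69.8 | 58.3 |
| Fully supervised | TLR (sup) | 82.9 | 72.6 | 21.6 | 70.1 | 68.3 | 58.4 |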

PCK per image. Bold = best, underlined = second best among non-supervised methods. AP-10k: intra-species (I.S.), cross-species (C.S.), cross-family (C.F.). † SphMap evaluated with our 2-sphere configuration on ImageNet-3D. * DistillDIFT uses dataset-specific keypoint label definitions. DINOv2 only: adapter trained on DINOv2 features without SD.
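For readers unfamiliar with the metric, the following is a minimal sketch of PCK@α computed per image pair. It assumes the common convention that a predicted keypoint counts as correct when it lies within α · max(h, w) of the ground truth, with (h, w) the size of the object bounding box; the function name and input layout are illustrative and may differ from the evaluation code actually used.

```python
import numpy as np


def pck_per_image(pred, gt, bbox_sizes, alpha=0.1):
    """Percentage of correct keypoints, averaged per image pair.

    pred, gt:   lists of (K_i, 2) arrays with predicted / ground-truth keypoints.
    bbox_sizes: list of (h, w) bounding-box sizes, one per image pair (assumed convention).
    """
    scores = []
    for p, g, (h, w) in zip(pred, gt, bbox_sizes):
        dist = np.linalg.norm(np.asarray(p, float) - np.asarray(g, float), axis=1)
        # Fraction of keypoints of this pair that fall within the threshold.
        scores.append(np.mean(dist <= alpha * max(h, w)))
    # Average the per-pair fractions, then express as a percentage.
    return 100.0 * float(np.mean(scores))
```

Averaging per pair first means that every image pair contributes equally, regardless of how many keypoints it has annotated.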

Ablations

Ablations on SPair-71k. Each component brings a significant improvement; the baseline is the SD+DINOv2 zero-shot approach.

Pseudo-labels Cyc. cons. Relaxed c.c. Chaining Sph. rej. PCK@0.1
65.0
✓ 67.2
✓ ✓ 66.9
✓ ✓ 68.4
✓ ✓ ✓ 70.0
✓ ✓ 72.9
✓ ✓ ✓ ✓ 74.4

Scaling to ImageNet-3D

Pre-training on ImageNet-3D (86k images) and fine-tuning on SPair-71k further boosts performance, demonstrating the scalability of the weakly-supervised training strategy.

| Model | SPair-71k | AP-10k I.S. | AP-10k C.S. | AP-10k C.F. |
|---|---|---|---|---|
| Ours (SPair) | 71.6 | 70.6 | 69.1 | 57.8 |
| Ours (IN3D) | 68.0 | 67.8 | 65.8 | 53.3 |
| Ours (IN3D → SPair) | 72.2 | 71.1 | 69.4 | 58.1 |

PCK@0.1 per image. AP-10k splits: intra-species (I.S.), cross-species (C.S.), cross-family (C.F.).

Key Findings

  • New state-of-the-art on SPair-71k. DIY-SC achieves +4.5pp absolute gain over the previous best, and +7pp over methods with comparable supervision requirements.
  • Strongest gains on symmetric and repeated-part objects. Categories like Bus (+15.7pp) and Car (+14.0pp) benefit most, where prior methods fail due to left-right ambiguities.
  • Generalizes out-of-the-box to unseen datasets. Without any training on AP-10k, DIY-SC outperforms competing methods on intra-, cross-species, and cross-family splits.
  • Scalable to larger data sources. Pre-training on ImageNet-3D further improves SPair-71k performance, demonstrating the method's scalability beyond CO3D.

Qualitative Comparison

Four challenging SPair-71k examples where prior SOTA methods mostly fail. Green = correct match, red = incorrect.

Bus. Appearance change, repeated object parts.

Car. Appearance change, repeated parts, viewpoint change.

Motorbike. Viewpoint change, feature ambiguity in background.

Person. Left-right ambiguity, semantically similar categories.

Columns (left to right): SphMap, TLR, DistillDIFT (W.S.), Ours.

Animated Visualizations

UMAP feature visualization. The features encode 3D-aware semantic information.


Exemplary tracking. DIY-SC-refined features yield more stable tracking than DINOv2 features.

Acknowledgments

Adam Kortylewski acknowledges support via his Emmy Noether Research Group funded by the German Research Foundation (DFG) under Grant No. 468670075. Thomas Wimmer is supported through the Max Planck ETH Center for Learning Systems.

BibTeX

@inproceedings{duenkel2025diysc,
    title = {Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels},
    author = {D{\"u}nkel, Olaf and Wimmer, Thomas and Theobalt, Christian and Rupprecht, Christian and Kortylewski, Adam},
    booktitle = {ICCV},
    year = {2025}
  }