Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

Abstract

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D.

We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences.

We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision.

Motivation

Purely 2D features confuse symmetric sides and repeated parts. Geodesic filtering recovers precision at the cost of recall. Our PartField integration restores dense, accurate correspondences.

Interactive comparison (image viewer omitted): geodesic-filtered correspondences versus all mutual nearest-neighbor matches, shown for SD+DINO and for SD+DINO+PartField (Ours).

Figure 1. Semantic correspondence on a challenging cross-instance pair. (a) SD+DINO features produce many incorrect matches due to left–right and repeated-part confusion. (b) Geodesic filtering removes wrong matches but leaves sparse correspondences. (c) Incorporating PartField descriptors with filtering yields dense and accurate correspondences even under large pose changes.

Method

Our approach consists of three sequential stages. Starting from a single image, we reconstruct instance-specific 3D geometry and use it to generate geometry-verified pseudo-labels for training a lightweight correspondence adapter.


Stage 1 — Canonicalized 3D Reconstruction

Single Image → SAM3 + SAM3D → Pose Canonicalization → Canonicalized Mesh + Pose

SAM3 + SAM3D: SAM3 segments the object (2D mask); SAM3D lifts the mask to an instance-specific 3D mesh.

Pose Canonicalization: Render-and-compare aligns the mesh to the image in two phases: distance-transform alignment (Phase 1), then soft-IoU refinement (Phase 2). OrientAnything V2 resolves the four-fold yaw ambiguity by majority vote over 8 orientations.

Canonicalized Mesh + Pose: grounds matching in geometry.
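As a rough illustration of the soft-IoU term in Phase 2, the sketch below scores how well a rendered silhouette overlaps a target mask. It is a minimal NumPy version under stated assumptions: the function name `soft_iou`, the `eps` constant, and the toy masks are hypothetical, and in an actual render-and-compare loop the rendered mask would come from a differentiable renderer so that the loss can be backpropagated to the pose.

```python
import numpy as np

def soft_iou(rendered, target, eps=1e-6):
    """Soft IoU between two masks whose values lie in [0, 1]."""
    rendered = np.asarray(rendered, dtype=float)
    target = np.asarray(target, dtype=float)
    # Products and sums stand in for set intersection and union,
    # which keeps the score meaningful for non-binary (soft) masks.
    intersection = np.sum(rendered * target)
    union = np.sum(rendered + target - rendered * target)
    return float(intersection / (union + eps))

# Toy masks (illustrative only): a 4x4 square, and the same square shifted by 2 px.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
aligned = target.copy()
shifted = np.roll(target, shift=2, axis=1)

# A pose refinement step would try to drive this loss towards zero.
loss_aligned = 1.0 - soft_iou(aligned, target)
loss_shifted = 1.0 - soft_iou(shifted, target)
```

The perfectly aligned mask gives a loss near zero, while the shifted mask is penalized in proportion to the lost overlap.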

Stage 2 — Geometry-Verified Pseudo-Labels

DINOv2 + Stable Diffusion + PartField → Feature Fusion → NN Matching → Geodesic Filtering → Verified Pseudo-labels

PartField: Part-aware descriptors rasterized from the Stage-1 mesh; this is the 3D prior that carries geometry into matching.

NN Matching: Nearest-neighbour search in the fused descriptor space, kept under relaxed cyclic consistency.

Geodesic Filtering: The quality gate. Candidates are lifted onto the mesh surface and rejected if their bicyclic geodesic distance exceeds a threshold. This 3D check replaces fragile spherical priors.

Verified Pseudo-labels: supervision with no manual annotations.
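The matching step can be sketched as nearest-neighbour search followed by a cycle check. The version below is one plausible reading of "relaxed cyclic consistency", not the authors' implementation: a match from image A to image B is kept if mapping back from B to A lands within `max_cycle_px` pixels of the starting point. The function names, the pixel-space tolerance, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def relaxed_cycle_matches(feat_a, feat_b, xy_a, max_cycle_px):
    """Return index pairs (idx_a, idx_b) of A->B nearest neighbours whose
    B->A return trip ends within max_cycle_px of the starting location."""
    sim = l2_normalize(feat_a) @ l2_normalize(feat_b).T  # (Na, Nb) cosine similarity
    a_to_b = sim.argmax(axis=1)   # best match in B for every point in A
    b_to_a = sim.argmax(axis=0)   # best match in A for every point in B
    back = b_to_a[a_to_b]         # point in A reached after A -> B -> A
    cycle_err = np.linalg.norm(xy_a[back] - xy_a, axis=1)
    keep = cycle_err <= max_cycle_px
    idx_a = np.nonzero(keep)[0]
    return idx_a, a_to_b[idx_a]

# Toy descriptors and pixel locations (illustrative only).
feat_a = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
feat_b = np.array([[1.0, 0.0], [0.0, 1.0]])
xy_a = np.array([[0.0, 0.0], [3.0, 0.0], [50.0, 50.0]])

# max_cycle_px = 0 reduces to strict mutual nearest neighbours.
strict_a, strict_b = relaxed_cycle_matches(feat_a, feat_b, xy_a, max_cycle_px=0.0)
# A small tolerance also keeps point 1, whose cycle returns 3 px away.
relaxed_a, relaxed_b = relaxed_cycle_matches(feat_a, feat_b, xy_a, max_cycle_px=5.0)
```

Relaxing the tolerance raises recall at the risk of admitting wrong matches, which is why a separate geometric check follows in the pipeline.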

Stage 3 — Lightweight Adapter Training

Verified Pseudo-labels + DINO + Stable Diffusion (frozen) → Lightweight Adapter → Trained Correspondence Model

Lightweight Adapter: A small trainable head on the frozen features, supervised by a sparse contrastive loss plus a dense regression loss.

Figure 2. Overview of our three-stage pipeline. Stage 1 turns a single image into a canonicalized 3D mesh and pose (SAM3 + SAM3D, render-and-compare, OrientAnything V2 for yaw). Stage 2 fuses frozen DINO and Stable Diffusion features with mesh-derived PartField descriptors, matches them, and keeps only candidates that pass a geodesic-distance check on the mesh, yielding geometry-verified pseudo-labels. Stage 3 uses those pseudo-labels to train a lightweight adapter on the frozen backbone, requiring no manual pose annotations.
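To make the geodesic-distance check more concrete, the sketch below approximates surface distance by shortest paths along mesh edges (Dijkstra) and rejects a candidate whose cycle-back vertex is farther than a threshold `tau` from its start vertex. This is a simplified stand-in under stated assumptions: the exact definition of the bicyclic geodesic distance is not given on this page, and the helper names, `tau`, and the toy mesh are hypothetical.

```python
import heapq
import math

def edge_graph(vertices, faces):
    """Adjacency map {u: {v: edge_length}} built from a triangle mesh."""
    adj = {i: {} for i in range(len(vertices))}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            w = math.dist(vertices[u], vertices[v])
            adj[u][v] = w
            adj[v][u] = w
    return adj

def geodesic_from(adj, source):
    """Shortest path length along mesh edges from source to every vertex."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def passes_geodesic_check(adj, v_start, v_return, tau):
    """Keep a candidate only if the cycle returns close to where it started."""
    return geodesic_from(adj, v_start)[v_return] <= tau

# Toy mesh (illustrative only): a unit square split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
adj = edge_graph(vertices, faces)
dist_from_1 = geodesic_from(adj, 1)
```

Edge-path distances overestimate the true surface geodesic on coarse meshes (here vertex 3 is 2.0 away from vertex 1 along edges, although the straight path across the square is shorter), so a denser mesh or an exact geodesic solver would tighten the check.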

Quantitative Results

We report per-image PCK (Percentage of Correct Keypoints, %, ↑ higher is better) on four standard benchmarks: SPair-71k and its challenging Geo-Aware subset (symmetric / repeated parts), AP-10K (animal pose), and SPair-U. Methods are grouped by supervision; the best result per group is in bold and our method is highlighted. Within its category (weakly supervised without human annotations), 3D-SC is best on every benchmark, and on SPair-71k it even surpasses DIY-SC, which is trained with human pose annotations (73.0 vs. 71.6 at PCK@0.10).

Method	SPair-71k			SPair-Geo-Aware			AP-10K (0.10)			SPair-U
Method	0.01	0.05	0.10	0.01	0.05	0.10	I.S.	C.S.	C.F.	0.01	0.05	0.10
Supervised
DHF Luo et al.	8.7	50.2	64.9	8.0	45.8	62.7	62.7	60.0	47.8	–	–	–
SD+DINOv2 Zhang et al.	9.6	57.7	74.6	9.9	57.0	77.0	77.0	74.0	65.8	–	–	–
GECO Hartwig et al.	14.2	59.6	73.6	–	–	–	82.5	81.2	76.6	–	–	55.2
Jamais Vu Mariotti et al.	20.5	71.9	82.5	–	–	–	–	–	–	–	–	62.4
Geo-SC Zhang et al.	21.7	72.8	83.2	–	–	–	87.7	85.9	78.5	–	–	56.9
SemAlign3D Wandel & Wang	15.8	77.5	88.9	–	–	–	–	–	–	–	–	–
MARCO Cuttano et al.	27.0	77.6	87.2	22.8†	76.8†	87.5†	89.1	88.3	83.4	5.0†	42.7†	67.5
Unsupervised
DINOv2+NN Zhang et al.	6.3	38.4	53.9	3.4	28.2	42.0	60.9	57.3	47.4	–	–	54.9
DIFT Tang et al.	7.2	39.7	52.9	3.4	28.2	42.5	50.3	46.0	35.0	–	–	47.4
Weakly supervised — with human annotations
Spherical Map. Mariotti et al.	8.4	48.2	64.4	–	–	–	65.4	63.1	51.0	–	–	61.0
DIY-SC Dünkel et al.	10.1	53.8	71.6	7.7	47.7	67.5	70.6	69.8	57.8	5.4	44.0	67.9
Weakly supervised — without human annotations
SD+DINOv2 Zhang et al.	7.9	44.7	59.9	5.3	34.5	49.3	62.9	59.3	48.3	–	–	59.4
DIY-SC+OriAny Dünkel et al.	9.5	51.2	69.6	6.9	45.7	65.8	69.3	66.8	54.0	5.2	43.1	66.3
3D-SC (Ours)	10.2	54.8	73.0	7.8	50.1	70.8	69.6	68.5	56.9	5.6	43.5	67.3

PCK@α: a prediction is correct if it lies within α·max(h, w) of the ground-truth keypoint. AP-10K columns: I.S. intra-species, C.S. cross-species, C.F. cross-family. † obtained from the official checkpoint. Best per supervision group in bold; our method highlighted. Numbers reproduced from Table C1 of the paper.
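The PCK definition above, and the difference between per-image and per-keypoint aggregation, can be sketched as follows. This is a minimal NumPy illustration, not the benchmark's official evaluation code: the function names and the toy keypoints are hypothetical, and `(h, w)` is assumed to be whatever extent the benchmark prescribes, passed in by the caller.

```python
import numpy as np

def correct_mask(pred, gt, hw, alpha):
    """Boolean array: True where a prediction is within alpha * max(h, w) of the ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    thresh = alpha * max(hw)
    return np.linalg.norm(pred - gt, axis=1) <= thresh

def pck_per_image(samples, alpha=0.10):
    """Average over images of each image's fraction of correct keypoints."""
    return float(np.mean([correct_mask(p, g, hw, alpha).mean() for p, g, hw in samples]))

def pck_per_keypoint(samples, alpha=0.10):
    """Fraction of correct keypoints after pooling the keypoints of all images."""
    pooled = np.concatenate([correct_mask(p, g, hw, alpha) for p, g, hw in samples])
    return float(pooled.mean())

# Toy data (illustrative only): (predicted xy, ground-truth xy, (h, w)) per image.
samples = [
    ([[12, 11], [50, 70], [91, 20], [60, 80]],
     [[10, 10], [50, 50], [90, 20], [30, 80]],
     (100, 100)),
    ([[5, 5]], [[5, 6]], (50, 40)),
]

per_image = pck_per_image(samples, alpha=0.10)
per_keypoint = pck_per_keypoint(samples, alpha=0.10)
```

In this toy case the first image has 2 of 4 keypoints correct and the second has 1 of 1, so the per-image score is 0.75 while the pooled per-keypoint score is 0.6. The two aggregations diverge whenever images carry different numbers of keypoints.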

Per-category PCK@0.1 on SPair-71k

Per-keypoint PCK@0.1, grouped into man-made objects and fauna & flora. Our gains concentrate in rigid, symmetric man-made categories — bus (+10.8), tv/monitor (+9.8), bottle (+8.8), car (+6.9), train (+6.2) over DIY-SC+OriAny — exactly where 2D features confuse symmetric sides and repeated parts; non-rigid living categories show little change.

Method	Man-made objects										Fauna and Flora								Avg
Method	Aero	Bike	Boat	Bottle	Bus	Car	Chair	Motorbike	Train	TV	Bird	Cat	Cow	Dog	Horse	Person	Sheep	Plant	Avg
Unsupervised
DINOv2+NN Zhang et al.	72.7	62.0	41.3	40.4	52.3	51.5	36.2	61.0	54.3	24.2	85.2	71.1	67.1	64.6	67.6	68.2	62.0	30.7	55.6
DIFT Tang et al.	63.5	54.5	34.5	46.2	52.7	48.3	39.0	53.3	71.1	63.4	80.8	77.7	76.0	54.9	61.3	46.0	57.1	57.8	57.7
Weakly supervised — with human annotations
Spherical Mapper Mariotti et al.	75.3	63.8	48.2	50.9	74.9	71.1	47.3	65.4	75.0	58.5	87.7	81.7	81.6	66.9	73.1	61.8	70.2	55.5	67.8
DIY-SC Dünkel et al.	77.2	69.1	54.2	57.9	83.7	77.5	53.1	72.5	77.2	69.5	90.8	86.5	86.7	73.1	78.5	74.0	76.0	73.5	74.4
Weakly supervised — without human annotations
SD+DINOv2 Zhang et al.	73.0	64.1	40.7	52.9	55.0	53.8	45.5	63.3	66.2	53.5	86.4	78.6	77.3	64.7	69.7	69.2	67.6	58.4	64.0
Geo-SC Zhang et al.	78.0	66.4	44.5	60.1	66.6	60.8	53.2	66.1	83.8	55.5	90.2	82.7	82.3	69.5	75.1	71.7	71.6	58.9	69.6
DIY-SC+OriAny Dünkel et al.	76.1	65.9	52.2	57.3	75.7	75.3	52.8	69.9	76.7	69.6	90.4	85.0	86.3	71.4	78.3	73.5	75.0	69.2	72.9
3D-SC (Ours)	77.6	70.3	54.8	66.1	86.5	82.2	56.8	75.0	82.9	79.4	90.4	83.5	84.6	72.6	77.8	72.5	72.3	68.6	76.3

Per-keypoint PCK@0.1 (differs from the per-image PCK in the table above). Columns grouped into man-made objects and fauna & flora. Best per supervision group in bold; our method highlighted.

Qualitative Results

We compare the pseudo-ground-truth correspondence annotations generated by each method, that is, the pseudo-labels used to supervise the adapter. Our 3D-aware filtering yields denser and more accurate annotations than the spherical-prior pipeline of DIY-SC (Dünkel et al.), particularly on symmetric and repeated parts.

Figure panels (images omitted): pseudo-ground-truth annotations from 3D-SC (Ours); pseudo-ground-truth annotations from DIY-SC (Dünkel et al.).

Limitations & Future Work

Our framework relies on off-the-shelf 3D foundation models, and its current weaknesses stem largely from where those models are weakest.

Dependence on single-image reconstruction. The pipeline depends on SAM3D's pose and shape estimates. Reconstruction errors propagate through the 2D–3D reprojection and can degrade geodesic consistency — although our bicyclic geodesic filtering removes most of the resulting false positives before they reach the adapter.

PartField gives coarse, part-level cues. PartField is trained with a part-level contrastive objective, so it provides regional rather than precise within-part localization. This is why it receives a relatively low fusion weight (γ = 1/6), and it is most visible on SPair-U, where the extra keypoints typically lie in the middle of a part — exactly where PartField contributes the least signal.
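One way to picture what a fusion weight of γ = 1/6 means is the sketch below, which concatenates L2-normalized feature blocks scaled by the square root of their weights, so that the dot product of two fused descriptors equals the weighted sum of the per-source cosine similarities. The fusion form itself is an assumption, as are the even split of the remaining weight between DINO and Stable Diffusion, the feature dimensions, and the function names; only γ = 1/6 for PartField comes from the text.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse(feats, weights):
    """Concatenate normalized feature blocks, each scaled by sqrt(weight)."""
    parts = [np.sqrt(w) * l2_normalize(f) for f, w in zip(feats, weights)]
    return np.concatenate(parts, axis=-1)

gamma = 1.0 / 6.0              # PartField weight quoted in the text
w_other = (1.0 - gamma) / 2.0  # ASSUMPTION: remaining weight split evenly

# Random stand-ins for per-pixel descriptors (dimensions are illustrative only).
rng = np.random.default_rng(0)
dino = rng.normal(size=(5, 8))
sd = rng.normal(size=(5, 16))
partfield = rng.normal(size=(5, 4))

fused = fuse([dino, sd, partfield], [w_other, w_other, gamma])

# Similarity in the fused space is a weighted sum of per-source cosine similarities.
fused_sim = fused @ fused.T
expected_sim = (
    w_other * l2_normalize(dino) @ l2_normalize(dino).T
    + w_other * l2_normalize(sd) @ l2_normalize(sd).T
    + gamma * l2_normalize(partfield) @ l2_normalize(partfield).T
)
```

Under this reading, a low γ lets PartField break ties between symmetric or repeated parts without overriding the finer localization carried by the 2D features.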

The biggest opportunity: morphable categories (animals & flora). PartField is trained predominantly on rigid shapes, so it generalizes less reliably to deformable objects. This is precisely the pattern in our per-category results: rigid man-made categories see large gains (bus +10.8, tv/monitor +9.8, bottle +8.8, car +6.9), whereas non-rigid living categories show little movement or slight regressions, namely sheep (−2.7), cow (−1.7), cat (−1.5), and potted plant (−0.6, since SAM3D fuses pot and plant into one shape while the keypoints lie only on the pot). Our method is therefore currently bottlenecked by the 3D feature, not the framework. A 3D descriptor purpose-built for morphable / articulated categories, one that models part deformation and within-part structure for animals, could be assigned a much higher fusion weight. We expect it would substantially improve results on the fauna & flora half of the benchmark, narrowing the remaining gap to the man-made categories and lifting the overall average beyond what rigid-shape features allow today.

Sparse cross-mesh matching. Our cross-mesh correspondence uses nearest-neighbor matching in PartField space. Replacing it with denser registration, for example via optimal transport or functional maps, is a natural next step that trades additional compute for finer, deformation-aware alignment, and it would compound with the morphable-object descriptor proposed above.

BibTeX

@article{jesslen2026geometry,
  title     = {Geometry Matters: {3D} Foundation Priors for Learning Semantic Correspondence},
  author    = {Jesslen, Artur and D{\"u}nkel, Olaf and Kortylewski, Adam},
  journal   = {arXiv preprint arXiv:2605.30093},
  year      = {2026},
}