Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

¹ University of Freiburg · ² CISPA Helmholtz Center for Information Security
^* Equal contribution

arXiv Code [Coming Soon] Dataset [Coming Soon] BibTeX

Teaser: semantically consistent 3D keypoints transferred across different instances of a category. — Fig. 1 We predict semantically consistent 3D keypoint locations across instances of a category from a single RGB-D image. Matching colors denote corresponding areas.

TL;DR

We define monocular category-level 3D correspondence in camera space, release HouseCorr3D to benchmark it, and propose Morpheus, a morphable-prior model that sets a new state of the art — without any correspondence supervision.

178k: RGB-D image pairs
50: household categories
280: unique instances
2329: 3D keypoint annotations

Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR. While recent work has progressed in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. We study category-level 3D correspondence in camera space — predicting, from a single image, 3D locations that remain consistent across instances within a category — and show it can emerge without explicit correspondence supervision by learning a shared morphable object prior. We introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence, with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models — including amodal correspondence labels for occluded regions and explicit symmetry annotations. We further propose Morpheus, which learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose; semantically meaningful 3D correspondences in camera space then emerge implicitly, setting a new state of the art on HouseCorr3D.

The Task — Monocular category-level 3D correspondence in camera space

Traditional 3D understanding stops at pose, detection, or reconstruction — it never says which point on one object is the same functional part as on another. 2D semantic correspondence tries, but is trapped by viewpoint, occlusion, and symmetry.

We move the question into 3D camera space: given a 3D query point on one instance, return the 3D point on another instance that represents the same semantic part — resolving ambiguities that image-space matching cannot.

Problem definition

Given query & target RGB-D images I^q, I^t of the same category and a query 3D point x^q ∈ ℝ³ in the camera space of I^q, predict x^t ∈ ℝ³ in the camera space of I^t at the same semantic point.

f : (x^q, I^q, I^t) → x^t

Amodal correspondences

Evaluate parts that are occluded or off-screen, which is impossible for any 2D matcher.

Explicit symmetry

Whole orbits of rotation-equivalent points count as valid.

3D over 2D reasoning

Camera space removes the ambiguous center & scale of object-centric spaces.

Task setup: query point projected to deformed query mesh, transferred via barycentric coordinates to the target mesh. — Fig. 3a Given a query point x^q ∈ ℝ³, we project it onto the deformed query mesh M^q and encode its location as barycentric coordinates. Since query and target instances share the same mesh topology, these coordinates transfer directly to M^t, yielding the corresponding point x^t ∈ ℝ³.
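The transfer described in Fig. 3a can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the closest triangle on the query mesh is assumed to be known already (its index is passed in), the point is projected onto the triangle's plane without clamping to the triangle, and the function names `barycentric_coords` and `transfer_point` are hypothetical.

```python
import numpy as np

def barycentric_coords(p, a, b, c):
    """Barycentric coordinates of p projected onto the plane of triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])

def transfer_point(x_q, verts_q, verts_t, faces, face_idx):
    """Encode x_q on one face of the query mesh, decode on the same face of the target mesh."""
    i, j, k = faces[face_idx]
    bary = barycentric_coords(x_q, verts_q[i], verts_q[j], verts_q[k])
    # Shared topology: the same vertex indices exist on the target mesh.
    return bary @ verts_t[[i, j, k]]

# Toy example: one triangle; the "target" is a scaled and shifted copy of the query.
faces = np.array([[0, 1, 2]])
verts_q = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
verts_t = 2.0 * verts_q + np.array([0.0, 0.0, 5.0])
x_q = np.array([0.25, 0.25, 0.1])  # slightly off the query surface
x_t = transfer_point(x_q, verts_q, verts_t, faces, face_idx=0)
print(x_t)  # approximately [0.5, 0.5, 5.0]
```

Because only vertex indices and barycentric weights are carried over, no matching step between the two meshes is needed once both share the template topology.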

The Benchmark — HouseCorr3D

The first large-scale benchmark for category-level 3D correspondence from monocular images

Built on the photorealistic synthetic subset of Omni6DPose, HouseCorr3D crops 178k test images (and 2.6M for training) across 50 everyday categories. Keypoints are annotated once on CAD meshes, then projected through ground-truth poses into every view — yielding consistent, amodal-aware labels at scale. It is a test-only benchmark: keypoints are used exclusively for evaluation.
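The projection of mesh keypoints into each view can be sketched as a rigid transform followed by a pinhole projection. This is an illustrative sketch only: it assumes the ground-truth pose is given as an object-to-camera rotation `R` and translation `t`, that the intrinsics `K` follow the usual pinhole convention, and the function names are hypothetical.

```python
import numpy as np

def keypoints_to_camera(kp_obj, R, t):
    """Map (N, 3) object-space keypoints into camera space: x_cam = R x_obj + t."""
    return kp_obj @ R.T + t

def project_to_pixels(kp_cam, K):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates."""
    uvw = kp_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

# Toy example: object placed 2 m in front of the camera, no rotation.
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
kp_obj = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])

kp_cam = keypoints_to_camera(kp_obj, R, t)
kp_px = project_to_pixels(kp_cam, K)
print(kp_cam)
print(kp_px)
```

Nothing in this transform depends on whether a keypoint is visible, which is why labels obtained this way also cover occluded or out-of-frame parts.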

Dataset overview: up to 19 semantic 3D keypoints annotated on CAD meshes across categories, shown for several instances each. — Fig. 2 Keypoints are semantically consistent and shared across all instances within each category.

Amodal labels

Correspondences for parts that are occluded or out of frame, which requires inferring the full 3D extent of objects.

Explicit symmetry

Discrete & continuous symmetries handled by treating the full rotation orbit as valid matches.

Mesh annotation

One annotation on a CAD mesh scales to 178k pairs through ground-truth pose projection.

Test-only

Synthetic but high-quality: exact-by-construction labels, modeled transparency for depth.

Table 1 · Comparison to existing correspondence datasets

Prior benchmarks evaluate in 2D camera or 3D object space. HouseCorr3D is the first to target 3D camera space.

Dataset	Pairs	Classes	Input	Eval. space	Symmetry	Occlusion
Pascal-Parts	4k	20	2D	2D camera	✗	✗
PF-Pascal	2k	20	2D	2D camera	✗	✗
SPair-71k	71k	18	2D	2D camera	✗	✓
KeypointNet	—	16	3D	3D object	✗	✗
CPNet	—	25	3D	3D object	✓	✗
DenseCorr3D	—	23	3D	3D object	✓	✗
HouseCorr3D (ours)	178k	50	2.5D	3D camera	✓	✓

Annotation protocol

Two annotators independently mark keypoints on each mesh in an interactive 3D tool.
Auto-merge via mutual nearest neighbours (5% bbox-diagonal threshold) + consistency.
Manual merge resolves undecided keypoints side by side across instances.
~65 h of annotation yield 2329 keypoints (2–19 per category), projected into all 178k views.
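The auto-merge step can be illustrated with a small mutual-nearest-neighbour check. This sketch makes several assumptions that the protocol above does not state: accepted pairs are merged by averaging the two annotations, the additional consistency check is omitted, and the function name `mutual_nn_merge` is hypothetical. Only the 5% bounding-box-diagonal threshold comes from the protocol.

```python
import numpy as np

def mutual_nn_merge(kp_a, kp_b, bbox_diag, rel_thresh=0.05):
    """Merge two annotators' keypoints on one mesh via mutual nearest neighbours.

    Returns the merged keypoints and the indices in kp_a left undecided.
    """
    d = np.linalg.norm(kp_a[:, None, :] - kp_b[None, :, :], axis=-1)
    nn_ab = d.argmin(axis=1)  # nearest B-keypoint for each A-keypoint
    nn_ba = d.argmin(axis=0)  # nearest A-keypoint for each B-keypoint
    merged, undecided_a = [], []
    for i, j in enumerate(nn_ab):
        is_mutual = nn_ba[j] == i
        is_close = d[i, j] <= rel_thresh * bbox_diag
        if is_mutual and is_close:
            merged.append((kp_a[i] + kp_b[j]) / 2.0)  # assumption: average the pair
        else:
            undecided_a.append(i)  # left for the manual merge
    return np.array(merged), undecided_a

# Toy example: the first keypoint agrees, the second is too far apart.
kp_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
kp_b = np.array([[0.01, 0.0, 0.0], [1.5, 0.0, 0.0]])
merged, undecided = mutual_nn_merge(kp_a, kp_b, bbox_diag=2.0)
print(merged, undecided)
```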

Evaluation metric

PCK@0.1

d(x̂^t, x^t) < 0.1 · max(h, w, d)

A prediction is correct if within 10% of the largest side of the object's 3D bounding box. We report 2D, 3D modal (both points visible), and 3D amodal (one point occluded) settings.
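A minimal sketch of the metric, assuming Euclidean distance and assuming that symmetry is handled by taking the minimum distance over a discretized orbit of valid ground-truth points. The function name `pck` and the orbit representation (a list of points per keypoint) are hypothetical.

```python
import numpy as np

def pck(pred, gt_orbits, bbox_dims, alpha=0.1):
    """Fraction of predictions within alpha * max(h, w, d) of any valid ground-truth point.

    pred:      (N, 3) predicted 3D points in the target camera space
    gt_orbits: list of N arrays, each (M_i, 3); M_i = 1 for asymmetric keypoints
    bbox_dims: (h, w, d) of the object's 3D bounding box
    """
    thresh = alpha * max(bbox_dims)
    hits = []
    for p, orbit in zip(pred, gt_orbits):
        dist = np.linalg.norm(np.asarray(orbit) - p, axis=1).min()
        hits.append(dist < thresh)
    return float(np.mean(hits))

# Toy example: threshold is 0.1 * 0.3 = 0.03.
pred = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
gt_orbits = [
    np.array([[0.02, 0.0, 0.0]]),                    # asymmetric keypoint, 0.02 away: hit
    np.array([[-0.1, 0.0, 0.0], [0.1, 0.05, 0.0]]),  # two symmetric copies, closest 0.05: miss
]
score = pck(pred, gt_orbits, bbox_dims=(0.2, 0.1, 0.3))
print(score)  # 0.5
```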

The Method — Morpheus

Morphable category priors, so 3D correspondence emerges

Morpheus represents every object in a category as an identity-preserving deformation of one shared template mesh. Because template vertices keep their identity while the mesh morphs, correspondence becomes free: points tied to the same template vertex are the same semantic part across instances. 3D correspondence reduces to predicting a pose and a deformation.

PRIOR

3D morphable prior

A canonical category shape as a signed distance field, turned into a mesh via Differentiable Marching Tetrahedra — a hybrid volumetric-mesh representation that is stable to optimize.

DEFORM

Instance deformation

A DINOv2 encoder maps the image to a latent code that drives a per-vertex affine field ϕ_a(v,l) = α(v,l)⊙v + δ(v,l), morphing the template to the observed instance.
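The affine field itself is a per-vertex scale and offset. The sketch below only illustrates the formula ϕ_a(v,l) = α(v,l)⊙v + δ(v,l) with α and δ already evaluated for each vertex; the network that predicts them from the latent code is not modeled, and the function name `affine_deform` and the toy values are hypothetical.

```python
import numpy as np

def affine_deform(verts, alpha, delta):
    """Apply phi_a(v, l) = alpha(v, l) * v + delta(v, l) elementwise to (N, 3) vertices."""
    return alpha * verts + delta

# Toy template with two vertices, deformed into two different "instances".
template = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

alpha_a = np.array([[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]])  # stretch along x
delta_a = np.zeros((2, 3))
alpha_b = np.ones((2, 3))
delta_b = np.array([[0.0, 0.0, 0.5], [0.0, 0.0, 0.5]])  # shift along z

inst_a = affine_deform(template, alpha_a, delta_a)
inst_b = affine_deform(template, alpha_b, delta_b)

# Vertex k of inst_a corresponds to vertex k of inst_b by construction.
print(inst_a[0], inst_b[0])
```

The vertex order never changes under this map, which is what lets a shared template index stand in for a semantic part.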

POSE

6D pose

A pretrained pose-diffusion network places the deformed mesh into camera space, disentangling pose from shape and canonicalization.

Training objectives — geometric supervision only

Amodal mask

Pixel-wise MSE + distance-transform overlap against ground-truth amodal masks.

Chamfer

Aligns deformed mesh vertices to ground-truth geometry for accurate 3D shape.
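For reference, one common form of the Chamfer distance is sketched below. The exact variant used for training is not specified here, so the symmetric, squared-distance, mean-reduced form is an assumption, as is the function name `chamfer_distance`. The brute-force pairwise matrix is only practical for small point sets.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses squared Euclidean distances, averaged in each direction.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Toy example: the sets agree on one point and differ on the other.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
cd = chamfer_distance(a, b)
print(cd)  # 1.0
```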

REG

Regularizers

Eikonal (SDF), small-deformation ℓ₂, and edge-based smoothness keep meshes clean.
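The Eikonal term penalizes deviations of the SDF gradient norm from 1. The sketch below checks this numerically with central finite differences on an analytic sphere SDF; a training setup would more likely use automatic differentiation, and the names `eikonal_loss` and `sphere_sdf` are hypothetical.

```python
import numpy as np

def eikonal_loss(sdf, points, eps=1e-4):
    """Mean of (||grad sdf(x)|| - 1)^2 over (N, 3) points, via central differences."""
    grads = np.zeros_like(points)
    for axis in range(3):
        step = np.zeros(3)
        step[axis] = eps
        grads[:, axis] = (sdf(points + step) - sdf(points - step)) / (2.0 * eps)
    return np.mean((np.linalg.norm(grads, axis=1) - 1.0) ** 2)

def sphere_sdf(p, radius=0.5):
    """Exact signed distance to a sphere centered at the origin."""
    return np.linalg.norm(p, axis=1) - radius

pts = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]])

loss_valid = eikonal_loss(sphere_sdf, pts)                        # true SDF: close to 0
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), pts)    # gradient norm 2: close to 1
print(loss_valid, loss_scaled)
```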

No explicit correspondence supervision. Semantic alignment emerges because every instance must explain its image through the same canonical template.

Results

A new state of the art on 2D, 3D-modal, and 3D-amodal correspondence

Table 2 · PCK@0.1 on HouseCorr3D (mean over categories, %)

Morpheus outperforms every 2D and 3D baseline. ★ 2D predictions lifted to 3D via depth — amodal is not applicable.

Method	2D	3D Modal	3D Amodal	3D (M+A)
DINOv2 ★	22.9	24.4	n/a	n/a
MagicPony_2D ★	15.7	14.0	n/a	n/a
NOCS ★	26.7	26.4	n/a	n/a
GenPose++	36.3	37.0	32.9	34.3
MagicPony + GP++	10.7	7.5	7.1	7.1
Morpheus w/o Def.	39.1	40.2	37.8	38.4
Morpheus (ours)	41.2	43.7	40.8	41.5

Qualitative correspondence results: Morpheus predictions across several categories. — Fig. 4 Qualitative comparison. DINOv2 confuses parts; GenPose++ predicts points off-object; MagicPony is plausible in 2D but wrong in 3D. Morpheus stays consistent across viewpoints.

Table 3 · Real-world (ROPE subset)

5 classes · 24 instances · 134 keypoints. Morpheus generalizes to real data.

Method	2D@0.1	3D@0.1
MagicPony_2D	16.8	n/a
GenPose++	37.0	25.1
MagicPony + GP++	12.6	7.3
Morpheus (ours)	44.7	34.8

Why it wins

Occlusions: only a ~2.9-point PCK drop from modal to amodal (43.7 → 40.8); 2D matchers can't evaluate amodal at all.
Disentangled pose & shape: avoids MagicPony's failure of compensating rotation with deformation.
More challenging than SPair-71k: DINOv2 performs noticeably worse on HouseCorr3D, where the categories are more diverse.
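The modal-to-amodal gaps can be recomputed directly from the 3D Modal and 3D Amodal columns of Table 2. The snippet is plain arithmetic on the reported numbers; the dictionary layout is only for illustration.

```python
# (3D Modal, 3D Amodal) PCK@0.1 values copied from Table 2.
table2 = {
    "GenPose++": (37.0, 32.9),
    "MagicPony + GP++": (7.5, 7.1),
    "Morpheus w/o Def.": (40.2, 37.8),
    "Morpheus": (43.7, 40.8),
}

# Drop in percentage points when moving from modal to amodal evaluation.
drops = {name: round(modal - amodal, 1) for name, (modal, amodal) in table2.items()}
for name, drop in drops.items():
    print(f"{name}: {drop} points")
```

The gap is an absolute difference in percentage points, so it should be read alongside the absolute scores: a small gap at a low score, as for MagicPony + GP++, does not indicate good amodal performance.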

Limitations

Fixed template topology can't model large topological change; correspondence is sensitive to pose error; smoothness regularization can over-smooth thin structures.

Citation

If you find HouseCorr3D or Morpheus useful, please cite

@misc{sommer2026categorylevel3dcorrespondencecamera,
      title  = {Category-Level 3D Correspondence in Camera Space via Morphable Object Priors},
      author = {Leonhard Sommer and Artur Jesslen and Basavaraj Sunagad and Adam Kortylewski},
      year   = {2026},
      eprint = {2605.28257},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV},
      url    = {https://arxiv.org/abs/2605.28257},
}
