Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

1 University of Freiburg · 2 CISPA Helmholtz Center for Information Security
* Equal contribution

Fig. 1 We predict semantically consistent 3D keypoint locations across instances of a category from a single RGB-D image. Matching colors denote corresponding areas.
TL;DR

We define monocular category-level 3D correspondence in camera space, release HouseCorr3D to benchmark it, and propose Morpheus, a morphable-prior model that sets a new state of the art — without any correspondence supervision.

178k
RGB-D image pairs
50
household categories
280
unique instances
2329
3D keypoint annotations
Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR. While recent work has progressed in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. We study category-level 3D correspondence in camera space — predicting, from a single image, 3D locations that remain consistent across instances within a category — and show it can emerge without explicit correspondence supervision by learning a shared morphable object prior. We introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence, with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models — including amodal correspondence labels for occluded regions and explicit symmetry annotations. We further propose Morpheus, which learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose; semantically meaningful 3D correspondences in camera space then emerge implicitly, setting a new state of the art on HouseCorr3D.

01

The Task — Monocular category-level 3D correspondence in camera space

Traditional 3D understanding stops at pose, detection, or reconstruction — it never says which point on one object is the same functional part as on another. 2D semantic correspondence tries, but is trapped by viewpoint, occlusion, and symmetry.

We move the question into 3D camera space: given a 3D query point on one instance, return the 3D point on another instance that represents the same semantic part — resolving ambiguities that image-space matching cannot.

Problem definition

Given query and target RGB-D images I_q, I_t of the same category and a query 3D point x_q ∈ ℝ³ in the camera space of I_q, predict x_t ∈ ℝ³ in the camera space of I_t at the same semantic point.

f : (x_q, I_q, I_t) → x_t

  • Amodal correspondences

    Evaluate parts that are occluded or off-screen — impossible for any 2D matcher.

  • Explicit symmetry

    Whole orbits of rotation-equivalent points count as valid.

  • 3D over 2D reasoning

    Camera space removes the ambiguous center & scale of object-centric spaces.
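
One plausible reading of the explicit-symmetry criterion is to score a prediction against every member of the ground-truth point's symmetry orbit and keep the smallest distance. The sketch below assumes a discrete n-fold rotational symmetry about a known axis through the object center; the function names and this parameterization are illustrative assumptions, not the benchmark's evaluation code.

```python
import numpy as np

def rotation_orbit(point, axis, center, n_fold):
    """All n_fold copies of `point` under rotation about `axis` through `center`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    center = np.asarray(center, dtype=float)
    p = np.asarray(point, dtype=float) - center
    orbit = []
    for k in range(n_fold):
        theta = 2.0 * np.pi * k / n_fold
        # Rodrigues' rotation formula: rotate p by theta about the unit axis.
        rotated = (p * np.cos(theta)
                   + np.cross(axis, p) * np.sin(theta)
                   + axis * np.dot(axis, p) * (1.0 - np.cos(theta)))
        orbit.append(rotated + center)
    return np.stack(orbit)

def symmetry_aware_distance(pred, gt, axis, center, n_fold):
    """Smallest Euclidean distance from `pred` to any point in the orbit of `gt`."""
    orbit = rotation_orbit(gt, axis, center, n_fold)
    pred = np.asarray(pred, dtype=float)
    return float(np.min(np.linalg.norm(orbit - pred, axis=1)))

# Toy case (hypothetical): an object with 4-fold symmetry about the z-axis.
d_sym = symmetry_aware_distance([0, 1, 0], [1, 0, 0],
                                axis=[0, 0, 1], center=[0, 0, 0], n_fold=4)
# With n_fold=1 the orbit is just the ground-truth point itself.
d_plain = symmetry_aware_distance([0, 1, 0], [1, 0, 0],
                                  axis=[0, 0, 1], center=[0, 0, 0], n_fold=1)
```

In the toy case the prediction sits on a rotated copy of the ground-truth point, so it counts as a match under the symmetry-aware distance but not under the plain one.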

Fig. 3a Given a query point x_q ∈ ℝ³, we project it onto the deformed query mesh M_q and encode its location as barycentric coordinates. Since query and target instances share the same mesh topology, these coordinates transfer directly to M_t, yielding the corresponding point x_t ∈ ℝ³.
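
The barycentric transfer described in Fig. 3a can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes both meshes are given as vertex arrays sharing one `faces` array, picks the face by nearest centroid instead of an exact closest-point query, and clamps the coordinates crudely. All function names are hypothetical.

```python
import numpy as np

def barycentric_coords(p, a, b, c):
    """Barycentric coordinates of p's projection onto the plane of triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])

def transfer_point(x_q, verts_q, verts_t, faces):
    """Map x_q from the query mesh to the target mesh (shared topology)."""
    # Simplification: choose the face whose centroid is nearest to x_q.
    centroids = verts_q[faces].mean(axis=1)
    f = int(np.argmin(np.linalg.norm(centroids - x_q, axis=1)))
    a, b, c = verts_q[faces[f]]
    bary = barycentric_coords(x_q, a, b, c)
    # Crude clamp so the point stays inside the triangle.
    bary = np.clip(bary, 0.0, None)
    bary = bary / bary.sum()
    # Same face index and weights, but target vertex positions.
    return bary @ verts_t[faces[f]]

# Toy meshes: one triangle; the target is a scaled and shifted copy.
verts_q = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
verts_t = 2.0 * verts_q + np.array([0.0, 0.0, 5.0])
faces = np.array([[0, 1, 2]])
x_t = transfer_point(np.array([0.25, 0.25, 0.1]), verts_q, verts_t, faces)
```

Because the face index and the weights are the only things carried over, the transfer works for any two meshes that share vertex ordering and connectivity.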
02

The Benchmark — HouseCorr3D

The first large-scale benchmark for category-level 3D correspondence from monocular images

Built on the photorealistic synthetic subset of Omni6DPose, HouseCorr3D crops 178k test images (and 2.6M for training) across 50 everyday categories. Keypoints are annotated once on CAD meshes, then projected through ground-truth poses into every view, yielding consistent, amodal-aware labels at scale. The keypoint annotations are test-only: they are used exclusively for evaluation, never as training supervision.

Dataset overview: up to 19 semantic 3D keypoints annotated on CAD meshes across categories, shown for several instances each.
Fig. 2 Keypoints are semantically consistent and shared across all instances within each category.

Table 1 · Comparison to existing correspondence datasets

Prior benchmarks evaluate in 2D camera or 3D object space. HouseCorr3D is the first to target 3D camera space.

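| Dataset | Pairs | Classes | Input | Eval. space | Symmetry | Occlusion |
|---|---|---|---|---|---|---|
| Pascal-Parts | 4k | 20 | 2D | 2D camera | | |
| PF-Pascal | 2k | 20 | 2D | 2D camera | | |
| SPair-71k | 71k | 18 | 2D | 2D camera | | |
| KeypointNet | | 16 | 3D | 3D object | | |
| CPNet | | 25 | 3D | 3D object | | |
| DenseCorr3D | | 23 | 3D | 3D object | | |
| HouseCorr3D (ours) | 178k | 50 | 2.5D | 3D camera | | |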

Annotation protocol

  1. 2 annotators independently mark keypoints on each mesh in an interactive 3D tool.
  2. Auto-merge via mutual nearest neighbours (5% bbox-diagonal threshold) + consistency.
  3. Manual merge resolves undecided keypoints side-by-side across instances.
  4. ~65 h → 2329 keypoints (2–19 per category) projected into all 178k views.
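
The auto-merge in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions: two annotators' keypoints arrive as arrays on the same mesh, a pair is accepted when each point is the other's nearest neighbour and their distance is within 5% of the bounding-box diagonal, and accepted pairs are averaged. The averaging and the function name are assumptions; the additional consistency check mentioned in step 2 is not modelled.

```python
import numpy as np

def merge_mutual_nn(kps_a, kps_b, bbox_diag, rel_thresh=0.05):
    """Merge two keypoint sets; return (merged points, unmatched A idx, unmatched B idx)."""
    # Pairwise Euclidean distances, shape (len(A), len(B)).
    dists = np.linalg.norm(kps_a[:, None, :] - kps_b[None, :, :], axis=-1)
    nn_ab = dists.argmin(axis=1)  # nearest B for each A
    nn_ba = dists.argmin(axis=0)  # nearest A for each B
    merged, used_a, used_b = [], set(), set()
    for i, j in enumerate(nn_ab):
        is_mutual = nn_ba[j] == i
        is_close = dists[i, j] <= rel_thresh * bbox_diag
        if is_mutual and is_close:
            merged.append(0.5 * (kps_a[i] + kps_b[j]))  # assumption: average the pair
            used_a.add(i)
            used_b.add(int(j))
    unmatched_a = [i for i in range(len(kps_a)) if i not in used_a]
    unmatched_b = [j for j in range(len(kps_b)) if j not in used_b]
    return np.array(merged), unmatched_a, unmatched_b

# Toy annotations on a mesh with unit bounding-box diagonal.
kps_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
kps_b = np.array([[0.01, 0.0, 0.0], [1.0, 0.3, 0.0]])
merged, todo_a, todo_b = merge_mutual_nn(kps_a, kps_b, bbox_diag=1.0)
```

In the toy data the first pair merges automatically, while the second pair is mutual but too far apart, so it would be left for the manual merge in step 3.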

Evaluation metric

PCK@0.1

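d(x̂_t, x_t) < 0.1 · max(h, w, d)

A prediction x̂_t is correct if its distance d(x̂_t, x_t) to the ground-truth point x_t is below 10% of the largest side of the object's 3D bounding box; inside the max, h, w, and d denote the box's height, width, and depth. We report 2D, 3D modal (both points visible), and 3D amodal (one point occluded) settings.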
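
As a concrete reading of the metric, the sketch below computes PCK@0.1 for the keypoints of one object. It assumes the distance is Euclidean and that the bounding-box side lengths are given directly; the function name and argument layout are illustrative.

```python
import numpy as np

def pck(pred, gt, bbox_dims, alpha=0.1):
    """Fraction of keypoints whose error is below alpha * largest bounding-box side."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    threshold = alpha * np.max(bbox_dims)
    errors = np.linalg.norm(pred - gt, axis=-1)  # assumption: Euclidean distance
    return float(np.mean(errors < threshold))

# Toy object with bounding box 1.0 x 0.4 x 0.2, so the threshold is 0.1.
pred = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.5]]
gt = [[0.05, 0.0, 0.0], [0.0, 0.0, 0.0]]
score = pck(pred, gt, bbox_dims=(1.0, 0.4, 0.2))
```

The first keypoint is off by 0.05 and counts as correct; the second is off by 0.5 and does not.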

03

The Method — Morpheus

Morphable category priors, so 3D correspondence emerges

Morpheus represents every object in a category as an identity-preserving deformation of one shared template mesh. Because template vertices keep their identity while the mesh morphs, correspondence becomes free: points tied to the same template vertex are the same semantic part across instances. 3D correspondence reduces to predicting a pose and a deformation.

Fig. 3b From an RGB-D image, the deformation encoder predicts a latent code that adapts the category shape prior; the deformed mesh is posed in camera space.
PRIOR

3D morphable prior

A canonical category shape as a signed distance field, turned into a mesh via Differentiable Marching Tetrahedra — a hybrid volumetric-mesh representation that is stable to optimize.

DEFORM

Instance deformation

A DINOv2 encoder maps the image to a latent code l that drives a per-vertex affine field ϕ_a(v, l) = α(v, l) ⊙ v + δ(v, l), where v is a template vertex, α an element-wise scale, and δ an offset, morphing the template to the observed instance.
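
To make the affine field concrete, here is a minimal sketch of applying α ⊙ v + δ to every template vertex. In the actual model α and δ would come from a learned decoder conditioned on the latent code; the toy `toy_scale` and `toy_offset` functions below are hypothetical stand-ins chosen only so the example runs.

```python
import numpy as np

def deform(template_verts, latent, scale_fn, offset_fn):
    """Apply the per-vertex affine field alpha * v + delta to all template vertices."""
    alpha = scale_fn(template_verts, latent)   # (N, 3) element-wise scales
    delta = offset_fn(template_verts, latent)  # (N, 3) offsets
    return alpha * template_verts + delta

# Hypothetical stand-ins for the learned decoder heads.
def toy_scale(verts, latent):
    return np.ones_like(verts) + latent[0]

def toy_offset(verts, latent):
    return np.broadcast_to(latent[1:4], verts.shape)

template = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
identity = deform(template, np.zeros(4), toy_scale, toy_offset)
morphed = deform(template, np.array([1.0, 0.5, 0.0, 0.0]), toy_scale, toy_offset)
```

The output keeps the vertex count and ordering of the template, which is what lets a vertex index act as a correspondence label across instances.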

POSE

6D pose

A pretrained pose-diffusion network places the deformed mesh into camera space, disentangling pose from shape and canonicalization.
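
Placing the deformed mesh into camera space amounts to a rigid transform of its vertices. The sketch below assumes the pose is given as a rotation matrix R and a translation t with x_cam = R x_obj + t; how the pose network parameterizes its output is not specified here, so this convention is an assumption.

```python
import numpy as np

def to_camera_space(verts_obj, R, t):
    """Apply x_cam = R @ x_obj + t to every row of verts_obj, shape (N, 3)."""
    return verts_obj @ R.T + t

# Toy pose: rotate 90 degrees about z, then move 2 units along the camera z-axis.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 2.0])
verts_cam = to_camera_space(np.array([[1.0, 0.0, 0.0]]), R, t)
```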

Training objectives — geometric supervision only

2D

Amodal mask

Pixel-wise MSE + distance-transform overlap against ground-truth amodal masks.

3D

Chamfer

Aligns deformed mesh vertices to ground-truth geometry for accurate 3D shape.
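
For reference, one common form of the Chamfer distance between two point sets is shown below: the symmetric sum of mean squared nearest-neighbour distances. Whether the paper uses squared or unsquared distances, or a different weighting, is not stated here, so treat this variant as an assumption.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Squared pairwise distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean nearest-neighbour distance in both directions.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
same = chamfer_distance(pts, pts)
apart = chamfer_distance(np.array([[0.0, 0.0, 0.0]]), np.array([[1.0, 0.0, 0.0]]))
```

The dense pairwise matrix is fine for small examples; for full meshes a nearest-neighbour structure such as a k-d tree would be the usual choice.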

REG

Regularizers

Eikonal (SDF), small-deformation ℓ₂, and edge-based smoothness keep meshes clean.
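
The three regularizers can be illustrated with simple NumPy versions. These are sketches under assumptions: the Eikonal term is evaluated with finite differences on a regular grid rather than with automatic differentiation, the small-deformation term is the mean squared vertex displacement, and the edge term penalizes displacement differences across mesh edges. The exact formulations and weights used in training are not given here.

```python
import numpy as np

def eikonal_loss(sdf_grid, spacing):
    """Mean of (|grad f| - 1)^2, with gradients from finite differences."""
    gx, gy, gz = np.gradient(sdf_grid, spacing)
    grad_norm = np.sqrt(gx ** 2 + gy ** 2 + gz ** 2)
    return float(np.mean((grad_norm - 1.0) ** 2))

def small_deformation_loss(template, deformed):
    """Mean squared displacement of vertices from the template."""
    return float(np.mean(np.sum((deformed - template) ** 2, axis=1)))

def edge_smoothness_loss(template, deformed, edges):
    """Mean squared difference of displacements across each edge (i, j)."""
    offsets = deformed - template
    diff = offsets[edges[:, 0]] - offsets[edges[:, 1]]
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy SDF: a sphere of radius 0.5 sampled on a 21^3 grid with spacing 0.1.
axis = np.linspace(-1.0, 1.0, 21)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
sphere_sdf = np.sqrt(X ** 2 + Y ** 2 + Z ** 2) - 0.5

# Toy mesh: one triangle, rigidly shifted by 0.1 along x.
template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edges = np.array([[0, 1], [1, 2], [2, 0]])
shifted = template + np.array([0.1, 0.0, 0.0])
```

A true distance field such as the sphere gives an Eikonal loss near zero, while a rigid shift gives zero edge loss but a nonzero small-deformation loss, which shows that the two mesh terms penalize different things.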

No explicit correspondence supervision. Semantic alignment emerges because every instance must explain its image through the same canonical template.

04

Results

A new state of the art on 2D, 3D-modal, and 3D-amodal correspondence

Table 2 · PCK@0.1 on HouseCorr3D (mean over categories, %)

Morpheus outperforms every 2D and 3D baseline. 2D predictions are lifted to 3D via depth; for these methods the amodal setting is not applicable (n/a).

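| Method | 2D | 3D Modal | 3D Amodal | 3D (M+A) |
|---|---|---|---|---|
| DINOv2 | 22.9 | 24.4 | n/a | n/a |
| MagicPony (2D) | 15.7 | 14.0 | n/a | n/a |
| NOCS | 26.7 | 26.4 | n/a | n/a |
| GenPose++ | 36.3 | 37.0 | 32.9 | 34.3 |
| MagicPony + GP++ | 10.7 | 7.5 | 7.1 | 7.1 |
| Morpheus w/o Def. | 39.1 | 40.2 | 37.8 | 38.4 |
| Morpheus (ours) | 41.2 | 43.7 | 40.8 | 41.5 |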
Qualitative correspondence results: Morpheus predictions across several categories.
Fig. 4 Qualitative comparison. DINOv2 confuses parts; GenPose++ predicts points off-object; MagicPony is plausible in 2D but wrong in 3D. Morpheus stays consistent across viewpoints.
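
Table 2 notes that 2D predictions are lifted to 3D via depth. A standard way to do this is pinhole back-projection, sketched below; the intrinsics `fx`, `fy`, `cx`, `cy` and the function name are assumptions, since the exact lifting procedure used for the baselines is not described here.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics for a 640 x 480 image.
center_point = backproject(320.0, 240.0, 1.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
side_point = backproject(920.0, 240.0, 2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

The depth map only stores the first visible surface along each ray, so a lifted 2D prediction cannot land on an occluded part, which is why the amodal columns are n/a for these methods.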

Table 3 · Real-world (ROPE subset)

5 classes · 24 instances · 134 keypoints. Morpheus generalizes to real data.

| Method | 2D@0.1 | 3D@0.1 |
|---|---|---|
| MagicPony (2D) | 16.8 | n/a |
| GenPose++ | 37.0 | 25.1 |
| MagicPony + GP++ | 12.6 | 7.3 |
| Morpheus (ours) | 44.7 | 34.8 |

Why it wins

  • Occlusions: PCK drops by only about 2.9 points from modal to amodal (43.7 → 40.8); 2D matchers cannot be evaluated on amodal points at all.
  • Disentangled pose & shape: avoids MagicPony's failure mode of compensating for rotation with deformation.
  • More challenging than SPair-71k: DINOv2 performs noticeably worse on HouseCorr3D than on SPair-71k, as the categories are more diverse.

Limitations

Fixed template topology can't model large topological change; correspondence is sensitive to pose error; smoothness regularization can over-smooth thin structures.

05

Citation

If you find HouseCorr3D or Morpheus useful, please cite

@misc{sommer2026categorylevel3dcorrespondencecamera,
      title  = {Category-Level 3D Correspondence in Camera Space via Morphable Object Priors},
      author = {Leonhard Sommer and Artur Jesslen and Basavaraj Sunagad and Adam Kortylewski},
      year   = {2026},
      eprint = {2605.28257},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CV},
      url    = {https://arxiv.org/abs/2605.28257},
}