[NeurIPS 2025 Spotlight]
Orient Anything V2: Unifying Orientation and Rotation Understanding

1Zhejiang University     2SEA AI Lab     3HKU

*Equal Contribution

Note: To avoid ambiguity, Orient Anything V2 only accepts images containing a single object as input. For multi-object scenes, we first use SAM to isolate each object and then predict its orientation separately, as sketched below.
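A minimal sketch of this multi-object workflow, using the `segment-anything` package and a hypothetical `predict_orientation` wrapper around Orient Anything V2 (the wrapper name and the white-background cropping are assumptions, not the released API):

```python
# Sketch: isolate objects with SAM, then predict orientation per object.
# `predict_orientation` is a hypothetical wrapper around Orient Anything V2.
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

def orientations_per_object(image_path, sam_checkpoint, predict_orientation):
    image = np.array(Image.open(image_path).convert("RGB"))

    # Generate instance masks for every object in the scene.
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    masks = SamAutomaticMaskGenerator(sam).generate(image)

    results = []
    for m in masks:
        # Crop each object onto a white background so every input to the
        # model contains a single object, as required.
        x, y, w, h = (int(v) for v in m["bbox"])      # bbox is XYWH
        crop = image[y:y + h, x:x + w].copy()
        seg = m["segmentation"][y:y + h, x:x + w]
        crop[~seg] = 255
        results.append(predict_orientation(Image.fromarray(crop)))
    return results
```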

Orient Anything V2, a unified spatial vision model for understanding orientation, symmetry, and relative rotation, achieves SOTA performance across 14 datasets. For visualization, the object's orientation is shown by the red axis, while the blue and green axes indicate its up and left directions.
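For reference, a small sketch of how the three visualization axes can be recovered from predicted angles. The spherical-coordinate convention below is an assumption for illustration only and may differ from the released implementation:

```python
import numpy as np

def orientation_axes(azimuth_deg, polar_deg, roll_deg=0.0):
    """Front (red), up (blue), and left (green) axes from predicted angles.

    Assumes front = +x, left = +y, up = +z in the canonical object frame;
    this parameterization is illustrative, not the official one.
    """
    az, po, ro = np.deg2rad([azimuth_deg, polar_deg, roll_deg])

    # Front axis from azimuth (around the vertical) and polar (elevation).
    front = np.array([np.cos(po) * np.cos(az),
                      np.cos(po) * np.sin(az),
                      np.sin(po)])

    # Start from world-up, remove its component along `front`, then apply
    # the in-plane roll around the front axis (Rodrigues' formula).
    up0 = np.array([0.0, 0.0, 1.0]) - np.sin(po) * front
    up0 /= np.linalg.norm(up0)
    up = up0 * np.cos(ro) + np.cross(front, up0) * np.sin(ro)
    left = np.cross(up, front)
    return front, up, left
```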

Visualizations on Images in the Wild.

Visualizations on Symmetric Objects.


Qualitative Comparison

Abstract

This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.

3D Asset Synthesis

The synthetic 3D asset generation pipeline is composed of three steps: 1) Class Tag → Caption: Starting from ImageNet-21K category tags, we use Qwen-2.5 to generate rich, descriptive captions that capture object attributes and pose variations, ensuring broad and diverse category coverage. 2) Caption → Image: Leveraging the FLUX.1-Dev text-to-image model, we synthesize high-fidelity images from captions, enhanced with positional descriptors to encourage upright poses and explicit 3D structure. 3) Image → 3D Mesh: Using Hunyuan-3D-2.0, we convert the generated images into high-quality 3D meshes with complete geometry and detailed textures.
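A schematic of this three-step pipeline. The wrapper functions `caption_from_tag`, `image_from_caption`, and `mesh_from_image` are hypothetical stand-ins for the actual Qwen-2.5, FLUX.1-Dev, and Hunyuan-3D-2.0 calls, and the prompt wording is an assumption:

```python
# Schematic of the Class Tag -> Caption -> Image -> 3D Mesh pipeline.
# caption_from_tag / image_from_caption / mesh_from_image are hypothetical
# wrappers around Qwen-2.5, FLUX.1-Dev, and Hunyuan-3D-2.0 respectively.

def synthesize_assets(imagenet_tags, caption_from_tag, image_from_caption,
                      mesh_from_image, captions_per_tag=3):
    assets = []
    for tag in imagenet_tags:
        # 1) Class Tag -> Caption: rich descriptions covering attributes and
        #    pose variations, with positional cues encouraging upright poses.
        prompt = (f"Describe a single {tag} in detail, mentioning its front "
                  f"face, attributes, and an upright, clearly 3D pose.")
        for caption in caption_from_tag(prompt, n=captions_per_tag):
            # 2) Caption -> Image: high-fidelity text-to-image synthesis.
            image = image_from_caption(caption)
            # 3) Image -> 3D Mesh: complete geometry and textures.
            mesh = mesh_from_image(image)
            assets.append({"tag": tag, "caption": caption, "mesh": mesh})
    return assets
```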

generation

Robust Annotation

The annotation pipeline features two stages. First, an enhanced Orient-Anything-V1 model generates pseudo-labels from multiple rendered views of each 3D asset; these are projected into a shared 3D space and aggregated to infer dominant front-facing directions and rotational symmetry (e.g., single-front, bilateral, or full symmetry), effectively suppressing view-specific errors.
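A simplified sketch of the multi-view aggregation step: per-view azimuth pseudo-labels are mapped into the asset's canonical frame using the known render camera, then pooled into an angular histogram whose peak structure reveals the valid front faces. The yaw-only camera compensation, 10-degree binning, and peak thresholds are assumptions for illustration:

```python
import numpy as np

def aggregate_front_faces(view_azimuths_deg, camera_yaws_deg,
                          bin_deg=10, peak_ratio=0.5):
    # Rotate each view-space prediction back into the shared object frame.
    object_az = (np.asarray(view_azimuths_deg)
                 + np.asarray(camera_yaws_deg)) % 360.0

    # Vote into an angular histogram; view-specific errors spread out while
    # consistent front directions accumulate into sharp peaks.
    hist, edges = np.histogram(object_az, bins=int(360 / bin_deg),
                               range=(0.0, 360.0))
    peaks = np.where(hist >= peak_ratio * hist.max())[0]
    fronts = 0.5 * (edges[peaks] + edges[peaks + 1])

    if hist.max() < 2.0 * hist.mean():
        symmetry = "full"               # near-uniform votes: full symmetry
    elif len(peaks) == 1:
        symmetry = "single-front"
    elif len(peaks) == 2:
        symmetry = "bilateral"
    else:
        symmetry = f"{len(peaks)}-fold"
    return fronts, symmetry
```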

Second, a human-in-the-loop consistency check is applied at the category level: assets within the same ImageNet-21K class are expected to share the same symmetry type. Categories with consistent annotations are auto-accepted, while inconsistent ones, which account for only about 15% of the 21K classes and usually contain few assets, are flagged for manual review.
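A minimal sketch of this category-level triage. The data layout, a list of `(class_id, symmetry_type)` pairs per asset, is an assumption for illustration:

```python
from collections import Counter, defaultdict

def triage_categories(annotations):
    """annotations: iterable of (class_id, symmetry_type) pairs, one per asset."""
    by_class = defaultdict(list)
    for class_id, symmetry in annotations:
        by_class[class_id].append(symmetry)

    auto_accepted, needs_review = {}, {}
    for class_id, symmetries in by_class.items():
        counts = Counter(symmetries)
        if len(counts) == 1:
            # All assets agree on the symmetry type: accept automatically.
            auto_accepted[class_id] = symmetries[0]
        else:
            # Conflicting symmetry labels: flag the class for manual review.
            needs_review[class_id] = dict(counts)
    return auto_accepted, needs_review
```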

annotation

Model Training

Orient-Anything-V2 builds on the VGGT backbone and employs multi-task heads for prediction. The model is trained to estimate the 3D distribution of absolute pose and symmetry for the first-frame image, as well as the 3D distribution of relative rotation between the second frame and the first.
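To illustrate the symmetry-aware, periodic fitting idea, here is a minimal sketch of a multi-modal azimuth target: one wrapped Gaussian bump per valid front face, so an N-fold symmetric object gets N equally plausible modes and a fully symmetric object gets a uniform target. The bin count, bandwidth, and cross-entropy form are assumptions; the paper's exact objective may differ:

```python
import torch

def periodic_target(front_azimuths_deg, num_bins=360, sigma_deg=5.0):
    if len(front_azimuths_deg) == 0:
        # Fully symmetric object: every direction is equally plausible.
        return torch.full((num_bins,), 1.0 / num_bins)

    bins = torch.arange(num_bins, dtype=torch.float32) * (360.0 / num_bins)
    target = torch.zeros(num_bins)
    for front in front_azimuths_deg:                # 1..N valid front faces
        # Wrapped angular distance between each bin centre and this front.
        diff = (bins - front + 180.0) % 360.0 - 180.0
        target += torch.exp(-0.5 * (diff / sigma_deg) ** 2)
    return target / target.sum()                    # normalize to a pmf

def symmetry_aware_loss(logits, front_azimuths_deg):
    # Cross-entropy between the predicted azimuth distribution and the
    # multi-modal periodic target.
    target = periodic_target(front_azimuths_deg, num_bins=logits.shape[-1])
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(target * log_probs).sum()
```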

model

Citation