The Prismatic Model and CADView Dataset

How Vitrus is training a geometry-aware vision-language model on open CAD-grounded pose data.

Dataset sample

One CAD-conditioned pose example

A scene render from the CAD-pose pipeline. The full training sample pairs this scene with CAD reference renders, a 32-view orientation bank, and exact 3D supervision from the manifest.

Sample seed: 32
Target asset: 00001300
Target size: 150 x 35 x 60 mm
Scene resolution: 1280 x 720
Pool: machined / asymmetric

Robots do not need captions for the world. They need geometry they can act on: where a part is, how it is oriented, which face is visible, and how that state changes when the robot moves.

The Prismatic model is our current answer to that problem. It pairs a language-aligned vision-language model with a geometry-rich DINOv2 tower, then trains the combined model on CAD-conditioned scenes where every target has exact 3D supervision.

Why prismatic vision

SigLIP-style vision towers are strong at semantic alignment, but the CAD-pose runs exposed a failure mode: orientation stayed weak even when the model could identify the right part. DINOv2 carries sharper patch-level geometry, so the Prismatic model fuses DINOv2 features into Qwen's native vision tokens while keeping Qwen's generation path intact.

The result is still a text-generating VLM. It sees a tabletop scene, CAD references, and a 32-view CAD bank, then emits center coordinates, a matching view index, and small residual orientation deltas.
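To make that output format concrete, the sketch below shows one way a matched view index and small residual deltas could be combined into a full orientation. It is a minimal illustration under stated assumptions, not the Prismatic decoder: the toy bank of 32 evenly spaced yaw rotations, the Z-Y-X Euler convention for the residual, and every function name are hypothetical. The release defines the actual bank and delta parameterization.

```python
import math

def rot_x(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose_orientation(bank, view_index, residual_deg):
    """Apply a small residual rotation on top of the selected bank view.

    Assumption: residual_deg is (rx, ry, rz) in degrees, composed as Rz * Ry * Rx.
    """
    rx, ry, rz = residual_deg
    delta = matmul(rot_z(rz), matmul(rot_y(ry), rot_x(rx)))
    return matmul(delta, bank[view_index])

# Hypothetical stand-in bank: 32 yaw rotations spaced 11.25 degrees apart.
toy_bank = [rot_z(i * 360.0 / 32) for i in range(32)]

# View 8 is a 90 degree yaw in this toy bank; the residual adds 2.5 degrees.
orientation = compose_orientation(toy_bank, view_index=8,
                                  residual_deg=(0.0, 0.0, 2.5))
```

In a factorization like this, the view index carries the coarse orientation and the deltas only need to cover the gap between neighboring views.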

Qwen3.5-4B provides the language and multimodal generation backbone.
DINOv2 contributes spatial features that preserve object geometry.
The fusion starts identity-initialized, so training begins from the original VLM behavior.
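One common way to get an identity-initialized fusion is to add a zero-initialized projection of the auxiliary features to the existing tokens. The sketch below illustrates that general idea in plain Python; it is an assumption about the technique, not the Prismatic implementation. The names `zero_projection` and `fuse`, the one-to-one pairing of DINOv2 patches with Qwen tokens, and the toy dimensions are all hypothetical.

```python
def zero_projection(in_dim, out_dim):
    """Projection matrix (in_dim x out_dim) initialized to all zeros."""
    return [[0.0] * out_dim for _ in range(in_dim)]

def fuse(qwen_tokens, dino_tokens, proj):
    """Add a linear projection of DINOv2 features to Qwen vision tokens.

    Assumption: both towers yield the same number of tokens, paired by index.
    qwen_tokens: N x Dq, dino_tokens: N x Dd, proj: Dd x Dq.
    """
    fused = []
    for q, d in zip(qwen_tokens, dino_tokens):
        projected = [sum(d[k] * proj[k][j] for k in range(len(d)))
                     for j in range(len(q))]
        fused.append([qj + pj for qj, pj in zip(q, projected)])
    return fused

qwen_tokens = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]  # 2 tokens, width 3
dino_tokens = [[0.2, 0.9], [-0.4, 0.1]]             # 2 patches, width 2

proj = zero_projection(2, 3)
fused = fuse(qwen_tokens, dino_tokens, proj)  # unchanged Qwen tokens at init
```

Because the projection starts at zero, the fused tokens equal the original Qwen tokens on the first step, and the DINOv2 contribution grows only as training updates `proj`.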

The dataset

The open CADView dataset is built around CAD-conditioned 3D pose grounding. Each example includes a rendered scene, a target CAD reference, a 32-view orientation bank for that part, and a manifest with camera calibration, dimensions, visibility, 3D corners, and split metadata.
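As an illustration of how camera calibration and 3D corners relate, the sketch below projects the corners of a box into pixel coordinates with a pinhole model. The manifest schema is not shown here, so the intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-frame center, and the function names are hypothetical. Only the 150 x 35 x 60 mm size and the 1280 x 720 resolution come from the sample above, and the box is left axis-aligned for brevity.

```python
def box_corners(size_mm, center_mm):
    """Eight corners of an axis-aligned box in the camera frame (no rotation)."""
    hx, hy, hz = (s / 2.0 for s in size_mm)
    cx, cy, cz = center_mm
    return [
        (cx + sx * hx, cy + sy * hy, cz + sz * hz)
        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)
    ]

def project(points_mm, fx, fy, cx, cy):
    """Pinhole projection of camera-frame points (z forward) to pixels."""
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points_mm]

# Sample target size, placed 600 mm in front of a hypothetical camera.
corners = box_corners((150.0, 35.0, 60.0), center_mm=(0.0, 0.0, 600.0))
pixels = project(corners, fx=900.0, fy=900.0, cx=640.0, cy=360.0)

# Check that every projected corner lands inside the 1280 x 720 frame.
in_frame = all(0 <= u < 1280 and 0 <= v < 720 for u, v in pixels)
```

A real loader would first rotate and translate the corners by the annotated pose, then apply the calibration stored in the manifest.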

The current v2 pool contains 6,320 machined CAD parts with strict scene-level train and holdout splits. A symmetry manifest marks orbit-equivalent orientations, so symmetric parts are graded on what is physically observable rather than on an arbitrary choice among indistinguishable labels.

Scenes: cluttered tabletop images with exact target pose and box annotations.
CAD refs: canonical reference renders for the queried part.
Atlas banks: one 32-view grid per part for render-and-compare orientation.
Symmetry groups: 6,320 part groups, including 3,034 non-trivial orientation orbits.
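Symmetry-aware grading of this kind can be sketched as the smallest geodesic rotation error over a part's symmetry group. The code below is a minimal illustration, not the benchmark's scoring code: the function names and the representation of a symmetry group as a list of 3x3 rotation matrices in the object frame are assumptions, and the actual layout of `symmetry_groups.json` may differ.

```python
import math

def rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def geodesic_deg(a, b):
    """Angle in degrees of the rotation that takes a to b."""
    rel = matmul(transpose(a), b)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_theta))

def symmetric_error_deg(pred, gt, symmetries):
    """Smallest error against any orientation equivalent to the ground truth."""
    return min(geodesic_deg(pred, matmul(gt, s)) for s in symmetries)

# Toy part with a two-fold symmetry about its z axis.
identity = rot_z(0.0)
two_fold = [identity, rot_z(180.0)]

pred, gt = rot_z(180.0), identity
plain_error = geodesic_deg(pred, gt)                 # about 180 degrees
graded_error = symmetric_error_deg(pred, gt, two_fold)  # about 0 degrees
```

For a part with only the trivial symmetry, the group contains just the identity and the graded error reduces to the plain geodesic error.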

What this unlocks

The useful unit is not object recognition. It is CAD-to-scene correspondence: can the model find this exact part, infer its metric pose from one camera view, and express that pose in a format a planner can consume?

Open-sourcing this dataset makes the benchmark and training interface reproducible. It gives researchers a concrete way to test whether a model understands physical parts as geometry, not just as pixels or names.

Dataset layout

The release is structured as scene images, CAD references, atlas banks, and manifests so training can stay reproducible.

Geometry-aware model

Prismatic fusion combines DINOv2 geometry with Qwen's language-aligned visual tokens without replacing the native generation path.

Strict holdout

Train and evaluation examples are split by CAD identity, so held-out scenes measure whether the model generalizes to new parts.

Canonical dataset paths

gs://vitrus-assets/cad_pose_grounding/v2/scenes
gs://vitrus-assets/cad_pose_grounding/v2/cad_refs
gs://vitrus-assets/cad_pose_grounding/v2/atlas/n32
gs://vitrus-assets/cad_pose_grounding/v2/symmetry_groups.json