3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos

Carnegie Mellon University
Teaser
    3PoinTr is a general and scalable method for pretraining manipulation policies from casual human videos. Given an observed point cloud, 3PoinTr answers: how will the scene evolve as the task is completed? We contribute a state-of-the-art 3D point track prediction transformer, and combine a Perceiver IO architecture with a Diffusion Policy for sample-efficient imitation learning.

Casual Human Video Pretraining

20 Robot Demos

Downstream Policy Rollouts

Abstract

Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap, often requiring restrictive and tightly calibrated setups. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate, embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. Using a Perceiver IO architecture, we extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct a thorough evaluation on simulated and real-world tasks, and find that 3PoinTr achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms baselines, including behavior cloning methods and prior methods for pretraining from human videos. We also compare 3PoinTr's 3D point track predictions against an existing point track prediction baseline, and find that 3PoinTr produces more accurate and higher-quality point tracks, owing to a lightweight yet expressive architecture built on a single transformer and a training formulation that preserves supervision of partially occluded points.

Model Architecture


Diagram of the 3PoinTr network architecture. We first encode an initial point cloud, pass through a transformer decoder, and project each point token to a 3D point trajectory. This yields dense 3D point tracks that encode goal specifications, scene geometry, and spatiotemporal relationships. We then aggregate the per-point trajectory features using a Perceiver IO-style cross-attention module. A small set of learned query tokens attends to the full set of point track tokens, producing a compact global representation of the task. This representation is used as conditioning input to a Diffusion Policy, which generates an open-loop sequence of robot actions.
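The Perceiver IO-style aggregation step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the query count, feature dimension, and single-head attention are illustrative assumptions, and a real model would use learned projections and multiple heads.

```python
import numpy as np

def perceiver_io_aggregate(track_tokens, queries):
    """Perceiver IO-style cross-attention: a small set of learned
    query tokens attends to the full set of point-track tokens,
    yielding a compact representation whose size is independent
    of the number of points.

    track_tokens: (N, d) per-point trajectory features
    queries:      (M, d) learned latent queries, with M << N
    returns:      (M, d) compact task representation
    """
    d = queries.shape[-1]
    # Single-head scaled dot-product attention (illustrative only).
    scores = queries @ track_tokens.T / np.sqrt(d)            # (M, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over tokens
    return weights @ track_tokens                             # (M, d)

# Example: 4096 point-track tokens compressed to 16 query outputs,
# which would then condition the Diffusion Policy.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4096, 64))
queries = rng.standard_normal((16, 64))
compact = perceiver_io_aggregate(tokens, queries)
print(compact.shape)  # (16, 64)
```

The key design property this sketch shows is that the policy's conditioning input has a fixed size regardless of how many points are in the observed cloud.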

Results: Sample-Efficient Policy Learning

Simulation Tasks

Simulation task success rate vs. number of robot demonstrations. Results are averaged across the three simulation tasks. 3PoinTr achieves the highest average success rate across all tasks and numbers of demonstrations.

Real-World Tasks

Task               ATM    DP3    3PoinTr
Open Drawer        3/10   7/10   9/10
Right Glass        3/10   8/10   10/10
Throw Away Paper   0/10   1/10   9/10
Fold Sock          0/10   2/10   9/10

Real-world success rates, evaluated over 10 rollouts per task.

Rollouts

Input Point Cloud

3D Point Track Prediction

Example Policy Rollouts

Cup Scene
Cup 3D Point Tracks
Sock Scene
Sock 3D Point Tracks
Paper Scene
Paper 3D Point Tracks
Drawer Scene
Drawer 3D Point Tracks
Glass Scene
Glass 3D Point Tracks
Microwave Scene
Microwave 3D Point Tracks
Block Stack Scene
Block Stack 3D Point Tracks

Results: 3D Point Track Prediction

We also evaluate the quality of 3PoinTr's 3D point track predictions. Here we show Average Distance Error (ADE) and ADE of the 5% of points that move the most (5% ADE) for 3PoinTr and General Flow on simulation and real-world tasks (in millimeters). 3PoinTr outperforms General Flow in both metrics on every task in our experiments. Across all tasks, 3PoinTr achieves average error reductions of 49.1% and 61.8% compared to General Flow.
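For reference, the two metrics can be computed as follows. This is a minimal NumPy sketch based on the description above; the array shapes and the exact rule for selecting the top-5% moving points (total ground-truth path length) are assumptions, not necessarily the paper's protocol.

```python
import numpy as np

def ade(pred, gt):
    """Average Distance Error: mean Euclidean distance between
    predicted and ground-truth 3D point tracks.

    pred, gt: (N, T, 3) arrays of N point tracks over T timesteps.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

def top5_ade(pred, gt):
    """ADE restricted to the 5% of points whose ground-truth tracks
    move the most (measured here by total path length)."""
    motion = np.linalg.norm(np.diff(gt, axis=1), axis=-1).sum(axis=1)  # (N,)
    k = max(1, int(0.05 * len(motion)))
    idx = np.argsort(motion)[-k:]          # indices of the most-moving points
    return ade(pred[idx], gt[idx])

# Example with synthetic tracks (units would be millimeters).
rng = np.random.default_rng(0)
gt = rng.standard_normal((200, 16, 3))
pred = gt + 0.1 * rng.standard_normal((200, 16, 3))
print(ade(pred, gt), top5_ade(pred, gt))
```

The 5% ADE emphasizes the points that actually move during manipulation, which would otherwise be washed out by the many static background points.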

Acknowledgements

This work is supported by the NSF GRFP (Grant No. DGE2140739).

BibTeX


@article{3pointr,
  title={3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos},
  author={Hung, Adam and Duisterhof, Bardienus P. and Ichnowski, Jeffrey},
  journal={arXiv preprint arXiv:XXXX.XXXX},
  year={2026}
}