Diagram of the 3PoinTr network architecture. We first encode an initial point cloud, pass it through a transformer decoder, and project each point token to a 3D point trajectory. This yields dense 3D point tracks that encode goal specifications, scene geometry, and spatiotemporal relationships. We then aggregate the per-point trajectory features using a Perceiver IO-style cross-attention module: a small set of learned query tokens attends to the full set of point track tokens, producing a compact global representation of the task. This representation conditions a Diffusion Policy, which generates an open-loop sequence of robot actions.
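The Perceiver IO-style aggregation step can be sketched as a single cross-attention pass in which a few learned queries summarize many point-track tokens. This is a minimal NumPy sketch; the function names, token counts, and feature dimension here are illustrative assumptions, not the actual hyperparameters or implementation of 3PoinTr.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_aggregate(track_tokens, queries, Wq, Wk, Wv):
    """Cross-attend a small set of learned queries over point-track tokens.

    track_tokens: (N, D) per-point trajectory features (N may be large)
    queries:      (M, D) learned query tokens, M << N
    Returns a fixed-size (M, D) task representation, independent of N.
    """
    Q = queries @ Wq                                 # (M, D)
    K = track_tokens @ Wk                            # (N, D)
    V = track_tokens @ Wv                            # (N, D)
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (M, N) attention weights
    return attn @ V                                  # (M, D) compact summary

# Illustrative sizes (assumptions, not the paper's settings)
rng = np.random.default_rng(0)
N, M, D = 512, 8, 64
tracks = rng.normal(size=(N, D))
queries = rng.normal(size=(M, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
rep = perceiver_aggregate(tracks, queries, Wq, Wk, Wv)
print(rep.shape)  # fixed-size output regardless of the number of tracks
```

Because the output size depends only on the number of query tokens, the Diffusion Policy receives a conditioning vector of constant shape no matter how many point tracks the scene contains.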
Simulation task success rate vs. number of robot demonstrations, averaged across the three simulation tasks. 3PoinTr achieves the highest average success rate at every demonstration count.
| Task | ATM | DP3 | 3PoinTr |
|---|---|---|---|
| Open Drawer | 3/10 | 7/10 | 9/10 |
| Right Glass | 3/10 | 8/10 | 10/10 |
| Throw Away Paper | 0/10 | 1/10 | 9/10 |
| Fold Sock | 0/10 | 2/10 | 9/10 |
Real-world success rates evaluated over 10 rollouts.
We also evaluate the quality of 3PoinTr's 3D point track predictions. Here we show Average Distance Error (ADE) and ADE of the 5% of points that move the most (5% ADE) for 3PoinTr and General Flow on simulation and real-world tasks (in millimeters). 3PoinTr outperforms General Flow in both metrics on every task in our experiments. Across all tasks, 3PoinTr achieves average error reductions of 49.1% and 61.8% compared to General Flow.
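The two track-prediction metrics above can be stated concretely. Below is a minimal NumPy sketch of ADE and 5% ADE under stated assumptions: ADE is taken as the mean Euclidean error over all points and timesteps, and the "5% of points that move the most" are selected by total ground-truth path length. These conventions are our reading of the metric names, not necessarily the exact definitions used in the paper.

```python
import numpy as np

def ade(pred, gt):
    """Average Distance Error in the same units as the inputs.

    pred, gt: (N, T, 3) predicted and ground-truth 3D point tracks.
    Mean Euclidean error over all N points and T timesteps.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

def top_frac_ade(pred, gt, frac=0.05):
    """ADE restricted to the fraction of points that move the most.

    Motion is measured as total ground-truth path length per point
    (an assumption; other definitions of "moves the most" are possible).
    """
    motion = np.linalg.norm(np.diff(gt, axis=1), axis=-1).sum(axis=1)  # (N,)
    k = max(1, int(round(frac * len(motion))))
    idx = np.argsort(motion)[-k:]  # indices of the k most-moving points
    return np.linalg.norm(pred[idx] - gt[idx], axis=-1).mean()
```

The 5% ADE variant isolates the points on the manipulated object, so it is not diluted by the many static background points that dominate plain ADE.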
This work is supported by the NSF GRFP (Grant No. DGE2140739).
@article{3pointr,
  title={3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos},
  author={Hung, Adam and Duisterhof, Bardienus P. and Ichnowski, Jeffrey},
  journal={arXiv preprint arXiv:XXXX.XXXX},
  year={2026}
}