¹Meta ²KAUST ³Max Planck Institute for Intelligent Systems
* Work done during internship at Meta † Equal supervision
Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training.
Our model is fully differentiable; it introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and it supports both keypoint and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. It follows a teacher–student scheme to generate pseudo-labels and guides training with uncertainty distillation, enabling the model to generalize to different environments.
On the EgoBody3M benchmark, with a 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. In addition, our auto-labeling system improves wrist MPJPE by a further 13.1%.
EgoPoseFormer v2 (EPFv2) is a fully differentiable, end-to-end transformer architecture. Unlike its predecessor, which uses a separate learnable query for each joint, EPFv2 uses a single holistic query conditioned on pose-dependent metadata (user identity and headset pose) to represent the entire body state. This makes the computational cost independent of the body representation, improving efficiency and flexibility.
The model stacks two architecturally identical transformer decoder blocks with full gradient flow. The first decoder predicts an initial 3D pose from temporal multi-view features. The second refines this estimate by projecting the coarse 3D keypoints onto the image planes and using the resulting 2D positions as spatial conditioning for conditioned multi-view cross-attention. Coupled with causal temporal attention, EPFv2 produces accurate, temporally consistent motion even for body parts that are not visible in the egocentric views.
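The two mechanisms above can be illustrated with a minimal sketch: a standard pinhole projection that maps coarse 3D keypoints to 2D image positions for spatial conditioning, and a lower-triangular causal mask that restricts temporal attention to past frames. This is an illustrative approximation, not the paper's implementation; the array shapes and function names are our own.

```python
import numpy as np

def project_keypoints(kp3d_cam, K):
    """Project 3D keypoints (given in the camera frame) onto the image plane.

    kp3d_cam: (J, 3) coarse keypoints from the first decoder (hypothetical layout).
    K:        (3, 3) camera intrinsics.
    Returns (J, 2) pixel coordinates, usable as 2D spatial conditioning.
    """
    uvw = kp3d_cam @ K.T             # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

def causal_mask(T):
    """Boolean mask: entry (i, j) is True iff frame i may attend to frame j <= i."""
    return np.tril(np.ones((T, T), dtype=bool))
```

In a real model the mask would be passed to the temporal attention layer, and the projected 2D positions would be encoded and added to the multi-view cross-attention keys.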
Architecture overview (left). We stack two transformer decoders for coarse-to-fine pose estimation. A single holistic query, initialized from auxiliary metadata, attends to multi-view features and historic information. Illustration of the two core attention modules (right). Causal temporal attention enables each frame to attend to its temporal history. Conditioned multi-view cross attention incorporates view identity and optional 2D keypoint projections.
Per-keypoint uncertainty predicted by EPFv2. Larger ellipse extent and higher transparency indicate higher predicted uncertainty. Predictions are shown in green and ground truth in red.
Collecting large-scale, labeled egocentric motion data is costly. EPFv2 adopts a scalable semi-supervised learning pipeline that leverages large collections of unlabeled egocentric videos. A high-quality teacher model is trained on a small labeled subset and used to generate pseudo-labels for a large pool of unlabeled data. The student is then trained jointly on labeled and pseudo-labeled samples using an uncertainty-guided distillation loss, enabling it to recognize and down-weight unreliable pseudo-labels.
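One common way to realize such uncertainty-guided distillation is a negative log-likelihood style weighting, where the teacher's per-keypoint uncertainty divides the regression error so that unreliable pseudo-labels contribute less. The sketch below assumes a Laplace-style formulation; the paper's exact loss may differ.

```python
import numpy as np

def uncertainty_distill_loss(student_pred, pseudo_label, teacher_sigma):
    """Uncertainty-weighted pseudo-label loss (illustrative sketch).

    student_pred, pseudo_label: (J, 3) 3D keypoints.
    teacher_sigma:              (J,) teacher-predicted per-joint uncertainty.
    Joints with high uncertainty are down-weighted; the log term keeps the
    teacher from trivially inflating sigma.
    """
    err = np.abs(student_pred - pseudo_label).sum(axis=-1)  # per-joint L1 error
    return float(np.mean(err / teacher_sigma + np.log(teacher_sigma)))
```

In training, this term would be mixed with the fully supervised loss on the labeled subset.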
Overview of mixture training in the auto-labeling system. A stronger teacher model generates pseudo-labels, and an uncertainty distillation loss facilitates knowledge transfer. The teacher is pre-trained on the labeled dataset before this stage.
On the EgoBody3M benchmark, EgoPoseFormer v2 achieves a mean per-joint position error (MPJPE) of 4.02 cm, improving upon EgoBody3M and EgoPoseFormer by 22.4% and 15.4%, respectively. The wrist MPJPE reaches 4.99 cm, significantly outperforming prior works by more than 15%. In terms of temporal stability, EPFv2 achieves an MPJVE reduction of 22.2% compared to EgoBody3M and 51.7% compared to EgoPoseFormer.
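For reference, the two reported metrics can be computed as follows: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and MPJVE applies the same measure to frame-to-frame joint velocities, capturing temporal jitter. This is a standard definition sketch; the benchmark's evaluation code may differ in details such as root alignment or frame rate.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (T, J, 3) sequences, in input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def mpjve(pred, gt, fps=30.0):
    """Mean per-joint velocity error: MPJPE of finite-difference velocities."""
    v_pred = np.diff(pred, axis=0) * fps
    v_gt = np.diff(gt, axis=0) * fps
    return float(np.linalg.norm(v_pred - v_gt, axis=-1).mean())
```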
Qualitative results on EgoBody3M. Predictions are shown in green and ground truth in orange.
As more unlabeled data is used, both student models achieve improved accuracy. Notably, the MobileNetv4-S-based model benefits proportionally more from auto-labeling despite its lower capacity, indicating the pipeline's suitability for lightweight deployment models. The auto-labeling system yields an 11.7% reduction in wrist MPJPE.
Effectiveness of the auto-labeling system (ALS) under in-domain scaling. As more unlabeled data is used, both students achieve improved accuracy.
Qualitative video results comparing ground-truth (GT) and predicted poses on the EgoBody3M dataset. Estimated 3D poses are overlaid in green and ground-truth annotations in orange.
@inproceedings{li2026egoposeformerv2,
title = {EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR},
author = {Li, Zhenyu and Dwivedi, Sai Kumar and Maric, Filip and Chac{\'o}n, Carlos and Bertsch, Nadine and Arcadu, Filippo and Hodan, Tomas and Ramamonjisoa, Michael and Wonka, Peter and Zhao, Amy and Kips, Robin and Keskin, Cem and Tkach, Anastasia and Yang, Chenhongyi},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}