Stanford University
* Equal Contribution
We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously over time, we take advantage of the fact that more body parts become observable by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively utilize this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.
For truly affordable human digitization, the ideal setup is capture with a single mobile phone. This leads to monocular reconstruction, where a full 3D human must be recovered from a single view while large regions remain unobserved. To address this, GenFusion leverages temporal memory and probabilistic rendering.
We consider a video of a subject moving in front of the camera. At the current frame, only a side view may be visible, making frontal reconstruction difficult. However, previous frames can provide frontal, side, and back views of the subject. Assuming continuous motion, we propose to gather information from earlier frames to accurately reconstruct missing details in the current frame.
Deterministic regression supervision (e.g., a pixel-wise loss) penalizes any mismatch caused by deformation between past and current states, leading to blurry outputs; probabilistic regression supervision instead targets perceptually realistic synthesis rather than exact pixel-wise agreement.
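To make the distinction concrete, here is a minimal PyTorch sketch contrasting a pixel-wise L2 loss with one common instantiation of probabilistic regression, a heteroscedastic Gaussian negative log-likelihood. This is an illustrative assumption on our part, not necessarily the paper's exact probabilistic formulation:

```python
import torch

def deterministic_loss(pred, target):
    # Pixel-wise L2: any deformation mismatch is penalized directly,
    # so the optimal prediction averages over plausible states -> blur.
    return ((pred - target) ** 2).mean()

def probabilistic_nll(mean, log_var, target):
    # Heteroscedastic Gaussian NLL: the model can widen its predicted
    # variance where past and current observations conflict, instead of
    # blurring the mean toward an average of the modes.
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()
```

With a constant unit variance (`log_var = 0`), the NLL reduces to half the L2 loss; the benefit comes from letting the variance vary per pixel.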
(a) Given a live frame \(I_t\), a feature map \(F_t\) is obtained with a pretrained ResNet-18 encoder \(R_{18}\). A feature set \(S_t\) is sampled from \(F_t\) at the 2D projected locations of the corresponding SMPL-X vertices.
(b) \(S_t\) is then weighted by its SMPL-X vertex visibility \(V_t\) and fused into the canonical feature set \(S_{can}\):
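The sampling and fusion steps above can be sketched in PyTorch as follows. The bilinear sampling via `grid_sample` and the convex-combination fusion rule are our illustrative assumptions; the paper's exact fusion operator may differ:

```python
import torch
import torch.nn.functional as F

def sample_vertex_features(feat_map, verts_2d):
    """Sample per-vertex features from a feature map at projected locations.
    feat_map: (1, C, H, W); verts_2d: (N, 2), normalized to [-1, 1]."""
    grid = verts_2d.view(1, 1, -1, 2)                            # (1, 1, N, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, 1, N)
    return sampled.squeeze(0).squeeze(1).T                       # (N, C)

def fuse_canonical(S_can, S_t, V_t):
    """Visibility-weighted update of the canonical feature set (assumed
    convex combination). S_can, S_t: (N, C); V_t: (N,) in [0, 1]."""
    w = V_t.unsqueeze(-1)
    return (1 - w) * S_can + w * S_t  # invisible vertices keep old features
```

Under this rule, fully visible vertices (\(V_t = 1\)) overwrite their canonical features with the live observation, while occluded vertices (\(V_t = 0\)) retain the accumulated memory.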
The first three columns show the feature maps, and the next three columns show their RGB renderings. Our method synthesizes realistic details even without observations and refines the canonical space as more frames are incorporated.
Champ, a frame-based probabilistic method, produces sharp and visually appealing results at the frame level due to its generative capabilities. However, its lack of temporal context leads it to generate details inconsistent with past observations. NHP, a temporal deterministic method, leverages past frames but still produces blurry results, as it suppresses sharp features to avoid misalignment penalties. GenFusion effectively integrates temporal context and probabilistic rendering for robust human performance capture in monocular settings.
GenFusion exhibits strong generalization, producing details that align with past observations.
Although the back view is not available in the current frame, GenFusion reconstructs details consistent with past observations using its canonical space memory. In contrast, LHM, a per-frame probabilistic method, generates details that are inconsistent with past observations.
@inproceedings{kwon2026genfusion,
title={GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates},
author={Kwon, Youngjoong and He, Yao and Choi, Heejung and Geng, Chen and Liu, Zhengmao and Wu, Jiajun and Adeli, Ehsan},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}