GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates

ICLR 2026

Stanford University

* Equal Contribution

GenFusion is a feed-forward human performance capture method that progressively updates a canonical space and renders it in a probabilistic manner to reconstruct humans in alignment with past observations from a monocular RGB stream.

Abstract

We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously over time, we take advantage of the fact that more body parts become observable by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively utilize this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.

Motivation

For truly affordable human digitization, the ideal setup is capturing with a single mobile phone. This leads to monocular reconstruction, where we must recover a full 3D human from a single view despite missing unseen regions. To address this, GenFusion leverages temporal memory and probabilistic rendering.

Key Idea 1 : Temporal history

We consider a video of a subject moving in front of the camera. At the current frame, only a side view may be visible, making frontal reconstruction difficult. However, previous frames can provide frontal, side, and back views of the subject. Assuming continuous motion, we propose to gather information from earlier frames to accurately reconstruct missing details in the current frame.

Key Idea 2 : Probabilistic rendering

Deterministic regression supervision (e.g., pixel-wise loss) penalizes deformation mismatch, leading to blurry outputs, while probabilistic regression supervision focuses on perceptually realistic synthesis rather than pixel-wise mismatch.

Overview

(a) Given a live frame $I_t$, feature map $F_t$ is obtained with a pretrained ResNet-18 encoder $R_{18}$. Feature set $S_t$ is sampled from $F_t$ at the 2D projected location of corresponding SMPL-X vertices.

(b) $S_t$ is then weighted by its SMPL-X vertex visibility $V_t$ and fused into the canonical feature set $S_{can}$:

$$ S_{can} = \frac{(S_t \cdot V_t) + (S_{can} \cdot V_{can})}{\max(V_t + V_{can}, 1)} $$

(c) $S_t$ is reposed into the live pose and rasterized from the novel view to become $W_t$, which is further encoded with $U_{enc}$ into $G_{context,t}$.

(d) Conditioned on the live frame feature $G_{live,t}$ and $G_{context,t}$, the denoising network $U_{denoiser}$ denoises the noisy image $Z$ into the final novel view live frame.

Canonical space visualization

The first three columns show the feature maps, and the next three columns show their RGB renderings. Our method synthesizes realistic details even without observations and refines the canonical space as more frames are incorporated.

In-domain Generalization - Trained on THuman 2.1 + 4D-Dress and tested on 4D-Dress (unseen subjects)

Champ, a frame-based probabilistic method, produces sharp and visually appealing results at the frame level due to its generative capabilities. However, its lack of temporal context leads to the generation of details that are irrelevant to past observations. NHP, a temporal deterministic method, leverages past frames but still produces blurry results, as it suppresses sharp features to avoid misalignment penalties. GenFusion effectively integrates temporal context and probabilistic rendering for robust human performance capture in monocular settings.

Cross-domain Generalization - Trained on THuman 2.1 + 4D-Dress and tested on MVHumanNet

GenFusion exhibits strong generalization, producing details that align with past observations.

In-the-wild Generalization - Trained on THuman 2.1 + 4D-Dress and tested on TikTok

Although the back view is not available in the current frame, GenFusion reconstructs details consistent with past observations using canonical space memory. In contrast, per-frame probabilistic method LHM generates details that are inconsistent with past observations.

BibTeX

@inproceedings{kwon2026genfusion,
  title={GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates},
  author={Kwon, Youngjoong and He, Yao and Choi, Heejung and Geng, Chen and Liu, Zhengmao and Wu, Jiajun and Adeli, Ehsan},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}