
The Conference on Computer Vision and Pattern Recognition (CVPR) 2026 is being hosted in Denver, Colorado, from June 3 to 7. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos, and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!
List of Accepted Papers
BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models
Authors: Ryan Po, Eric Ryan Chan, Changan Chen, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: autoregressive video generation, exposure bias, diffusion transformers, backwards aggregation
BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: controllable video generation, video diffusion models, camera trajectory control, 4d video synthesis
Choreographing a World of Dynamic Objects
Authors: Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, Jiajun Wu
Contact: yanzhel@stanford.edu
Links: Paper | Website
Keywords: 4d generation, diffusion models, score distillation sampling, motion generation
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Authors: Mark Endo, Serena Yeung-Levy
Contact: markendo@stanford.edu
Links: Paper | Website
Keywords: small multimodal models, perception, reasoning
Dual Ascent Diffusion for Inverse Problems
Authors: Minseo Kim, Axel Levy, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Links: Paper | Video | Website
Keywords: inverse problems, diffusion priors, constrained optimization, image restoration
Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data
Authors: Ji Woong Kim*, Ke Wang*, Zipeng Fu, Sirui Chen, Cong Gao, Jeff Lai, Chelsea Finn
Contact: jwbkim@stanford.edu
Links: Paper | Video | Website
Keywords: humanoid robots, robot learning, vla, human data
GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator
Authors: Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, Iro Armeni
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: 3d gaussian splatting, novel view synthesis, geometry-guided video generation, neural rendering
Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
Authors: Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Workshop: Findings
Links: Paper | Website
Keywords: video world models, interactive video generation, hand pose control, embodied ai
GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models
Authors: Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, Ehsan Adeli
Contact: fnerrise@stanford.edu
Workshop: CV4Clinic (Computer Vision for Real-world Clinical Translation) 2026
Links: Paper | Website
Keywords: brain mri foundation models, sparse autoencoders, mechanistic interpretability, biomarker discovery, alzheimer’s disease
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
Authors: Xiaomeng Xu, Jisang Park, Han Zhang, Eric Cousineau, Aditya Bhat, Jose Barreiros, Dian Wang, Jeannette Bohg, Shuran Song
Contact: xuxm@stanford.edu
Workshop: Embodied AI Workshop
Links: Paper | Video | Website
Keywords: mobile manipulation, learning from human demonstrations, imitation learning
Physical Object Understanding with a Physically Controllable World Model
Authors: Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins
Contact: rmvenkat@stanford.edu
Award nominations: Highlight
Links: Paper | Blog Post | Video | Website
Keywords: visual world models, object understanding, physical control
Spherical Leech Quantization for Visual Tokenization and Generation
Authors: Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Krähenbühl
Contact: yzz@stanford.edu
Award nominations: Highlight
Links: Paper | Blog Post | Website
Keywords: quantization, tokenization, compression, generation
Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
Contact: jackykwok@stanford.edu
Workshop: CVPR 2026 Scalable Robot Learning Systems Workshop
Award nominations: Best Paper Finalist
Links: Paper | Website
Keywords: vision-language-action models, test-time scaling, contrastive learning, visuomotor control
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Authors: Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, Jing Lyu
Contact: bowenxue@stanford.edu
Links: Paper | Video | Website
Keywords: video generation, identity control
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
Authors: Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li
Contact: PingyueZhang2029@u.northwestern.edu
Links: Paper | Blog Post | Website
Keywords: large language model, vision-language model, spatial reasoning, spatial agent, active exploration
VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Authors: Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: 3d scene understanding, multi-agent systems, spatial reasoning, embodied ai
We look forward to seeing you at CVPR!