The Conference on Computer Vision and Pattern Recognition (CVPR) 2026 is being hosted in Denver, Colorado from June 3-7. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

Authors: Ryan Po, Eric Ryan Chan, Changan Chen, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: autoregressive video generation, exposure bias, diffusion transformers, backwards aggregation


BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: controllable video generation, video diffusion models, camera trajectory control, 4d video synthesis


Choreographing a World of Dynamic Objects

Authors: Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, Jiajun Wu
Contact: yanzhel@stanford.edu
Links: Paper | Website
Keywords: 4d generation, diffusion models, score distillation sampling, motion generation


Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

Authors: Mark Endo, Serena Yeung-Levy
Contact: markendo@stanford.edu
Links: Paper | Website
Keywords: small multimodal models, perception, reasoning


Dual Ascent Diffusion for Inverse Problems

Authors: Minseo Kim, Axel Levy, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Links: Paper | Video | Website
Keywords: inverse problems, diffusion priors, constrained optimization, image restoration


Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Authors: Ji Woong Kim*, Ke Wang*, Zipeng Fu, Sirui Chen, Cong Gao, Jeff Lai, Chelsea Finn
Contact: jwbkim@stanford.edu
Links: Paper | Video | Website
Keywords: humanoid robots, robot learning, vla, human data


GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator

Authors: Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, Iro Armeni
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: 3d gaussian splatting, novel view synthesis, geometry-guided video generation, neural rendering


Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Authors: Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
Contact: gordonwz@stanford.edu
Workshop: Findings
Links: Paper | Website
Keywords: video world models, interactive video generation, hand pose control, embodied ai


GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

Authors: Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, Ehsan Adeli
Contact: fnerrise@stanford.edu
Workshop: CV4Clinic (Computer Vision for Real-world Clinical Translation) 2026
Links: Paper | Website
Keywords: brain mri foundation models, sparse autoencoders, mechanistic interpretability, biomarker discovery, alzheimer’s disease


HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Authors: Xiaomeng Xu, Jisang Park, Han Zhang, Eric Cousineau, Aditya Bhat, Jose Barreiros, Dian Wang, Jeannette Bohg, Shuran Song
Contact: xuxm@stanford.edu
Workshop: Embodied AI Workshop
Links: Paper | Video | Website
Keywords: mobile manipulation, learning from human demonstrations, imitation learning


Physical Object Understanding with a Physically Controllable World Model

Authors: Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins
Contact: rmvenkat@stanford.edu
Award nominations: Highlight
Links: Paper | Blog Post | Video | Website
Keywords: visual world models, object understanding, physical control


Spherical Leech Quantization for Visual Tokenization and Generation

Authors: Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Krähenbühl
Contact: yzz@stanford.edu
Award nominations: Highlight
Links: Paper | Blog Post | Website
Keywords: quantization, tokenization, compression, generation


Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
Contact: jackykwok@stanford.edu
Workshop: CVPR 2026 Scalable Robot Learning Systems Workshop
Award nominations: Best Paper Finalist
Links: Paper | Website
Keywords: vision-language-action models, test-time scaling, contrastive learning, visuomotor control


Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Authors: Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, Jing Lyu
Contact: bowenxue@stanford.edu
Links: Paper | Video | Website
Keywords: video generation, identity control


Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Authors: Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li
Contact: PingyueZhang2029@u.northwestern.edu
Links: Paper | Blog Post | Website
Keywords: large language model, vision-language model, spatial reasoning, spatial agent, active exploration


VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

Authors: Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo
Contact: gordonwz@stanford.edu
Links: Paper | Website
Keywords: 3d scene understanding, multi-agent systems, spatial reasoning, embodied ai


We look forward to seeing you at CVPR!