The Conference on Computer Vision and Pattern Recognition (CVPR) 2025 is being hosted from June 11th to June 15th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos, and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
Contact: orrzohar@stanford.edu
Links: Paper | Website
Keywords: lmms, video understanding


Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Authors: Yuhui Zhang*, Yuchang Su*, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt†, Serena Yeung-Levy†
Contact: yuhuiz@stanford.edu
Links: Paper | Video | Website
Keywords: vision language models, evaluation, multiple choice questions


Birth and Death of a Rose

Authors: Chen Geng, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
Contact: gengchen@cs.stanford.edu
Links: Paper | Website
Keywords: 4d vision, computer graphics, computer vision


Category-Agnostic Neural Object Rigging

Authors: Guangzhao He, Chen Geng, Shangzhe Wu, Jiajun Wu
Contact: gengchen@cs.stanford.edu
Links: Paper | Website
Keywords: 4d vision, computer vision, computer graphics


FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

Authors: Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, Alireza Fathi
Contact: ianhuang@stanford.edu
Links: Paper | Video | Website
Keywords: multimodal large language models, 3d object placement, scene generation


Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis

Authors: Yousef Yeganeh, Azade Farshad, Ioannis Charisiadis, Marta Hasny, Martin Hartenberger, Björn Ommer, Nassir Navab, Ehsan Adeli
Contact: azade.farshad@tum.de
Award nominations: CVPR Highlight
Links: Paper | Website
Keywords: medical image generation, counterfactual image generation, diffusion models


MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Contact: jmhb@stanford.edu
Links: Paper | Blog Post | Website
Keywords: reasoning, benchmark, science, microscopy, biomedical, vqa


The Scene Language: Representing Scenes with Programs, Words, and Embeddings

Authors: Yunzhi Zhang, Zizhang Li, Matt Zhou, Shangzhe Wu, Jiajun Wu
Contact: yzzhang@cs.stanford.edu
Links: Paper | Website
Keywords: visual representation, visual generation


We look forward to seeing you at CVPR 2025!