The Conference on Computer Vision and Pattern Recognition (CVPR) 2025 is being hosted from June 11th to June 15th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos, and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
Contact: orrzohar@stanford.edu
Links: Paper | Website
Keywords: lmms, video understanding


Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Authors: Yuhui Zhang*, Yuchang Su*, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt†, Serena Yeung-Levy†
Contact: yuhuiz@stanford.edu
Links: Paper | Video | Website
Keywords: vision language models, evaluation, multiple choice questions


Birth and Death of a Rose

Authors: Chen Geng, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
Contact: gengchen@cs.stanford.edu
Links: Paper | Website
Keywords: 4d vision, computer graphics, computer vision


Category-Agnostic Neural Object Rigging

Authors: Guangzhao He, Chen Geng, Shangzhe Wu, Jiajun Wu
Contact: gengchen@cs.stanford.edu
Links: Paper | Website
Keywords: 4d vision, computer vision, computer graphics


FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

Authors: Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, Alireza Fathi
Contact: ianhuang@stanford.edu
Links: Paper | Video | Website
Keywords: multimodal large language models, 3d object placement, scene generation


Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis

Authors: Yousef Yeganeh, Azade Farshad, Ioannis Charisiadis, Marta Hasny, Martin Hartenberger, Björn Ommer, Nassir Navab, Ehsan Adeli
Contact: azade.farshad@tum.de
Award nominations: CVPR Highlight
Links: Paper | Website
Keywords: medical image generation, counterfactual image generation, diffusion models


MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Contact: jmhb@stanford.edu
Links: Paper | Blog Post | Website
Keywords: reasoning, benchmark, science, microscopy, biomedical, vqa


The Scene Language: Representing Scenes with Programs, Words, and Embeddings

Authors: Yunzhi Zhang, Zizhang Li, Matt Zhou, Shangzhe Wu, Jiajun Wu
Contact: yzzhang@cs.stanford.edu
Links: Paper | Website
Keywords: visual representation, visual generation


We look forward to seeing you at CVPR 2025!