Xiaohan Wang

Postdoc, Stanford University

xhanwang [AT] stanford.edu

Bio

I am a Postdoctoral Researcher at the Stanford AI Lab, where I work with Prof. Serena Yeung-Levy. I build systems for video understanding, multimodal learning, and physical intelligence, with a focus on training models that can perceive, reason about, and predict the dynamics of the physical world from large-scale video.

At Stanford, I created VideoAgent, the first multimodal agent capable of tool use and long-horizon video reasoning; developed Temporal Preference Optimization (TPO), one of the first post-training frameworks designed specifically for video large multimodal models; and co-authored Apollo, the first comprehensive exploration of how to build VLMs that understand video. My doctoral research advanced egocentric video understanding and text–video alignment, achieving state-of-the-art performance in four international competitions.

I believe the path to physical AI runs through understanding the world first—through video, through multimodal learning, and through grounding models in real physical dynamics. If you're building toward this future, feel free to reach out.

Publications

The most recent publications are listed on Google Scholar.
* indicates equal contribution.

Temporal Preference Optimization for Long-Form Video Understanding

Rui Li*, Xiaohan Wang*, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

arXiv preprint (2025)

project paper code

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia

arXiv preprint (2024)

project paper

Video-STaR: Bootstrapping Weak Video Supervision for Visual Instruction Tuning

Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

ICLR (2025)

project paper code

Video Action Differencing

James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy

ICLR (2025)

project paper code

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Mark Endo, Xiaohan Wang, Serena Yeung-Levy

arXiv preprint (2024)

project paper

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, Serena Yeung-Levy

ECCV (2024)

project paper code

Why are Visually-Grounded Language Models Bad at Image Classification?

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy

NeurIPS (2024)

project paper code

Describing Differences in Image Sets with Natural Language

Lisa Dunlap*, Yuhui Zhang*, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell*, Jacob Steinhardt*, Joseph E. Gonzalez*, Serena Yeung-Levy*

CVPR (2024) Oral (90/11532)

project paper code

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

ICLR (2024)

project paper code

LANA: A Language-Capable Navigator for Instruction Following and Generation

Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang

CVPR (2023)

paper code

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

CVPR (2023)

paper code

Gloss-Free End-to-End Sign Language Translation

Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang

ACL (2023) Oral

paper code

Action Sensitivity Learning for Temporal Action Localization

Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, Yi Yang

ICCV (2023)

paper code

Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark

Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang

CVPR (2022)

paper code

Interactive Prototype Learning for Egocentric Action Recognition

Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang

ICCV (2021)

paper

Symbiotic Attention for Egocentric Action Recognition with Object-centric Alignment

Xiaohan Wang, Linchao Zhu, Yu Wu, Yi Yang

T-PAMI (2021)

paper code

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

Xiaohan Wang, Linchao Zhu, Yi Yang

CVPR (2021)

paper code

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

AAAI (2020) Oral

paper code

Acknowledgements

This website uses a design and template by Martin Saveski.