I am a Postdoctoral Researcher at the Stanford AI Lab, where I work with Prof. Serena Yeung-Levy. I build systems for video understanding, multimodal learning, and physical intelligence, with a focus on training models that can perceive, reason about, and predict the dynamics of the physical world from large-scale video.
At Stanford, I created VideoAgent, the first multimodal agent capable of tool-use and long-horizon video reasoning; developed Temporal Preference Optimization (TPO), one of the first post-training frameworks designed specifically for video large multimodal models; and co-authored Apollo, the first comprehensive exploration of how to create VLMs that understand video. My doctoral research advanced egocentric video understanding and text–video alignment, achieving state-of-the-art performance across four international competitions.
I believe the path to physical AI runs through understanding the world first—through video, through multimodal learning, and through grounding models in real physical dynamics. If you're building toward this future, feel free to reach out.
For the most recent publications, see Google Scholar.
* indicates equal contribution.
Temporal Preference Optimization for Long-Form Video Understanding
Rui Li*, Xiaohan Wang*, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
arXiv preprint (2025)
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
arXiv preprint (2024)
Video-STaR: Bootstrapping Weak Video Supervision for Visual Instruction Tuning
Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
ICLR (2025)
Video Action Differencing
James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
ICLR (2025)
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Mark Endo, Xiaohan Wang, Serena Yeung-Levy
arXiv preprint (2024)
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, Serena Yeung-Levy
ECCV (2024)
Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy
NeurIPS (2024)
Describing Differences in Image Sets with Natural Language
Lisa Dunlap*, Yuhui Zhang*, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell*, Jacob Steinhardt*, Joseph E. Gonzalez*, Serena Yeung-Levy*
CVPR (2024) Oral (90/11532)
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
ICLR (2024)
LANA: A Language-Capable Navigator for Instruction Following and Generation
Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang
CVPR (2023)
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
CVPR (2023)
Gloss-Free End-to-End Sign Language Translation
Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang
ACL (2023) Oral
Action Sensitivity Learning for Temporal Action Localization
Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, Yi Yang
ICCV (2023)
Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark
Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang
CVPR (2022)
Interactive Prototype Learning for Egocentric Action Recognition
Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang
ICCV (2021)
Symbiotic Attention for Egocentric Action Recognition with Object-centric Alignment
Xiaohan Wang, Linchao Zhu, Yu Wu, Yi Yang
T-PAMI (2021)
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Xiaohan Wang, Linchao Zhu, Yi Yang
CVPR (2021)
Symbiotic Attention with Privileged Information for Egocentric Action Recognition
Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
AAAI (2020) Oral
This website uses the website design and template by Martin Saveski.