I am a Postdoctoral Researcher at the Stanford AI Lab, where I work with Prof. Serena Yeung-Levy. I build systems for video understanding, multimodal learning, and physical intelligence, with a focus on training models that can perceive, reason about, and predict the dynamics of the physical world from large-scale video.
At Stanford, I created VideoAgent, the first multimodal agent capable of tool-use and long-horizon video reasoning; developed Temporal Preference Optimization (TPO), one of the first post-training frameworks designed specifically for video large multimodal models; and co-authored Apollo, the first comprehensive exploration of how to create VLMs that understand video. My doctoral research advanced egocentric video understanding and text–video alignment, achieving state-of-the-art performance across four international competitions.
I believe the path to physical AI runs through understanding the world first—through video, through multimodal learning, and through grounding models in real physical dynamics. If you're building toward this future, feel free to reach out.
For the most recent publications, see Google Scholar.
* indicates equal contribution.
Temporal Preference Optimization for Long-Form Video Understanding
Rui Li*, Xiaohan Wang*, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
arXiv preprint (2025)
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
arXiv preprint (2024)
Video-STaR: Bootstrapping Weak Video Supervision for Visual Instruction Tuning
Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
ICLR (2025)
Video Action Differencing
James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
ICLR (2025)
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Mark Endo, Xiaohan Wang, Serena Yeung-Levy
arXiv preprint (2024)
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, Serena Yeung-Levy
ECCV (2024)
Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy
NeurIPS (2024)
Describing Differences in Image Sets with Natural Language
Lisa Dunlap*, Yuhui Zhang*, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell*, Jacob Steinhardt*, Joseph E. Gonzalez*, Serena Yeung-Levy*
CVPR (2024) Oral (90/11532)
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
ICLR (2024)
LANA: A Language-Capable Navigator for Instruction Following and Generation
Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang
CVPR (2023)
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
CVPR (2023)
Gloss-Free End-to-End Sign Language Translation
Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang
ACL (2023) Oral
Action Sensitivity Learning for Temporal Action Localization
Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, Yi Yang
ICCV (2023)
Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark
Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang
CVPR (2022)
Interactive Prototype Learning for Egocentric Action Recognition
Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang
ICCV (2021)
Symbiotic Attention for Egocentric Action Recognition with Object-centric Alignment
Xiaohan Wang, Linchao Zhu, Yu Wu, Yi Yang
T-PAMI (2021)
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Xiaohan Wang, Linchao Zhu, Yi Yang
CVPR (2021)
Symbiotic Attention with Privileged Information for Egocentric Action Recognition
Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang
AAAI (2020) Oral
This website uses the website design and template by Martin Saveski.