I believe video is the foundation for AI that truly understands the world—and I've spent my PhD proving it.
Vision Language Models
O'Reilly, 2025
I am a PhD candidate at the Stanford AI Lab and a Knight-Hennessy Scholar. Currently, I am at NVIDIA, studying how unified multimodal pre-training shapes foundation model capabilities, and how these representations transfer to physical-world understanding.
At Meta, I led Apollo, the first systematic study of how to achieve video understanding in large multimodal models. I then created SmolVLM, a family of tiny vision-language models for on-device image and video understanding, which now powers HuggingSnap and serves as the foundation model for SmolVLA for robotic action. I also built Video-STaR, the first method to train models to reason over video; created FineVision, the largest open VLM training corpus; and wrote the O'Reilly textbook on Vision Language Models.
LLMs proved general systems beat task-specific engineering. The same shift is coming for the physical world—but it requires video: motion, causality, temporal structure. If you're thinking about where AI goes next, I'd like to talk.
Most recent publications on Google Scholar.
* indicates equal contribution.
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti*, Orr Zohar*, Miquel Farré*, Merve Noyan, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf
COLM (2025)
The smallest LMMs for image + video—runs on web browsers and iPhones; powers HuggingSnap and SmolVLA for robot control
FineVision: Open Data Is All You Need
Luis Wiedmann*, Orr Zohar*, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
arXiv (2025)
Largest open training corpus for vision-language models (24M samples)
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
CVPR (2025)
First work exploring video understanding in large multimodal models
Temporal Preference Optimization for Long-Form Video Understanding
Rui Li*, Xiaohan Wang*, Yuhui Zhang, Orr Zohar, Zeyu Wang, Serena Yeung-Levy
arXiv (2025)
Learning precise temporal grounding through preference optimization
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
ICLR (2025)
First work exploring video reasoning and self-training in large multimodal models
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, Serena Yeung-Levy
ECCV (2024)
Long-form temporal reasoning through agentic architectures
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
arXiv preprint (2025)
Open World Object Detection in the Era of Foundation Models
Orr Zohar, Alejandro Lozano, Shelly Goel, Serena Yeung-Levy, Kuan-Chieh Wang
arXiv preprint (2023)
LOVM: Language-Only Vision Model Selection
Orr Zohar, Shih-Cheng Huang, Kuan-Chieh Wang, Serena Yeung-Levy
NeurIPS (2023)
PROB: Probabilistic Objectness for Open World Object Detection
Orr Zohar, Kuan-Chieh Wang, Serena Yeung-Levy
CVPR (2023)
Analyzing surgical technique in diverse open surgical videos with multitask machine learning
Emmett D Goodman, Krishna K Patel, Yilun Zhang, William Locke, Chris J Kennedy, Rohan Mehrotra, Stephen Ren, Melody Guan, Orr Zohar, Maren Downing, Hao Wei Chen, Jevin Z Clark, Margaret T Berrigan, Gabriel A Brat, Serena Yeung-Levy
JAMA Surgery (2023)
Biointerfaced Sensors for Biodiagnostics
Orr Zohar*, Muhammad Khatib*, Rawan Omar, Rotem Vishinkin, Yoav Y Broza, Hossam Haick
View (2021)
Self-Healing Soft Sensors: From Material Design to Implementation
Muhammad Khatib, Orr Zohar, Hossam Haick
Advanced Materials (2021)
A Multifunctional Electronic Skin Empowered with Damage Mapping and Autonomic Acceleration of Self-Healing in Designated Locations
Muhammad Khatib, Orr Zohar, Walaa Saliba, Hossam Haick
Advanced Materials (2020)
Highly Efficient and Water-Insensitive Self-Healing Elastomer for Wet and Underwater Electronics
Muhammad Khatib, Orr Zohar, Walaa Saliba, Simcha Srebnik, Hossam Haick
Advanced Functional Materials (2020)
Angular Compounding for Speckle Reduction in Optical Coherence Tomography using Geometric Image Registration Algorithm and Digital Focusing
Jingjing Zhao, Yonatan Winetraub, Edwin Yuan, Warren H Chan, Sumaira Z Aasi, Kavita Y Sarin, Orr Zohar, Adam de la Zerda
Scientific Reports (2020)
Epitaxial Superconducting Tunnel Diodes for Light Detection Applications
Krishna Balasubramanian, John Wright, Orr Zohar, Boaz Taitler, Shlomi Bouscher, Huili Grace Xing, Debdeep Jena, Alex Hayat
Optical Materials Express (2020)
Photoresponse above 85 K of Selective Epitaxy Grown High-Tc Superconducting Microwires
Xinxi Xing, Krishna Balasubramanian, Shlomi Bouscher, Orr Zohar, Yuval Nitzav, Amit Kanigel, Alex Hayat
Applied Physics Letters (2020)
Full Resume in PDF.
You can find all the code needed to build this website on my GitHub. Feel free to use it, but please link back here, as well as to Martin Saveski and Nerfies, whose templates I adapted for this website. Licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.