Weixin Liang Research @ OpenAI


Weixin Liang is a Member of Technical Staff in Research at OpenAI, where he works on reinforcement-related topics at the frontier of large-scale AI systems. He received his Ph.D. in Computer Science from Stanford University as part of the Stanford Artificial Intelligence Laboratory (SAIL), working closely with Professors Christopher D. Manning, Christopher Potts, Daniel Jurafsky, Daniel A. McFarland, James Zou, Li Fei-Fei, and Serena Yeung.

Weixin’s recent research focuses on multimodal AI and building the next generation of multimodal foundation models. He studies scaling, architecture, and the science of representation learning across modalities. His Mixture-of-Transformers paper has been a foundational contribution in ushering in the era of models that unify multimodal understanding and generation while remaining modality-aware by design. By introducing modality-decoupled MoE and decoupled attention, the work articulated a general architectural template: a shared foundation model with specialized capacity per modality. Since its introduction, this approach has become a vibrant research direction, inspiring extensive follow-on work across the multimodal modeling community.

Earlier in his career, Weixin worked on dialog systems and conversational AI—then a comparatively niche area that foreshadowed today’s large language model–based assistants—and developed early instincts for how language technologies evolve from research prototypes into product-defining platforms. That experience gave him a first-hand view of multiple generational shifts in the field and a practical understanding of how to make interactive systems robust outside the lab. Complementing this, he has a background in computer architecture and systems, which anchors his view of AI as an end-to-end stack spanning models, infrastructure, and deployment constraints. More recently, he has combined these strengths to help drive agentic AI projects that automate chip-design workflows.

Weixin’s research is deeply interdisciplinary, with first-author publications across multiple high-impact venues, including Nature Machine Intelligence and Nature Human Behaviour.

His work has been recognized with the Best Presentation Runner-up Award at ICSSI 2024 at the U.S. National Academy of Sciences and was selected for Cell Patterns “Editors’ Picks: Best of 2025.” He also serves as a reviewer for Science, and his research has been covered by 300+ media outlets worldwide, including Nature, The New York Times, Scientific American, The Guardian, and Fortune.

  • Semantic Scholar
  • Google Scholar
  • DBLP

    Signature Papers

    Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
    Transactions on Machine Learning Research (TMLR 2025) 
    Paper Twitter Bluesky Code

    How can we reduce pretraining costs for multi-modal models without sacrificing quality? At Meta, we introduced Mixture-of-Transformers (MoT), a sparse architecture with modality-aware sparsity for every non-embedding transformer parameter (e.g., feed-forward networks, attention matrices, and layer normalization). MoT achieves dense-level performance with up to 66% fewer FLOPs and was pretrained from scratch at 30B scale; this TMLR signature paper highlights roughly 2x efficiency gains from modality-aware sparsity.
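
    To make the architectural idea concrete, here is a minimal sketch (in PyTorch, with hypothetical dimensions; not the released implementation) of modality-aware sparsity: every non-embedding parameter, here the FFN and layer norm, is duplicated per modality, while self-attention still runs over the full mixed sequence.

    ```python
    import torch
    import torch.nn as nn

    class ModalityAwareBlock(nn.Module):
        """Toy transformer block with per-modality FFN and LayerNorm."""
        def __init__(self, d_model=64, n_heads=4, n_modalities=2):
            super().__init__()
            # Attention is shared: tokens of all modalities attend globally.
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Non-embedding parameters are duplicated per modality.
            self.ffns = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_modalities))
            self.norms = nn.ModuleList(
                nn.LayerNorm(d_model) for _ in range(n_modalities))

        def forward(self, x, modality_ids):
            h, _ = self.attn(x, x, x)     # global attention over the mixed sequence
            x = x + h
            out = torch.zeros_like(x)
            for m, (ffn, norm) in enumerate(zip(self.ffns, self.norms)):
                mask = modality_ids == m  # route each token to its modality's weights
                out[mask] = ffn(norm(x[mask]))
            return x + out

    tokens = torch.randn(2, 8, 64)               # a mixed text+image sequence
    modality_ids = torch.randint(0, 2, (2, 8))   # 0 = text token, 1 = image token
    print(ModalityAwareBlock()(tokens, modality_ids).shape)  # torch.Size([2, 8, 64])
    ```

    Each token only touches its own modality's weights, so per-token compute stays at the dense level while each modality gets specialized capacity.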


    Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

    Weixin Liang*, Yuhui Zhang*, Yongchan Kwon*, Serena Yeung, James Zou
    NeurIPS (2022) 
    Paper HTML Poster Website Code

    Our paper explains the intriguing modality gap phenomenon: in multi-modal AI, the representations of different data types (e.g., images and their captions) are separated by a large gap in the shared embedding space. We show that changing the gap improves zero-shot learning and fairness. Interestingly, the modality gap is created at model initialization and is reinforced by contrastive learning.
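
    As a concrete illustration, here is a minimal sketch (toy NumPy data standing in for CLIP-style embeddings; not the paper's code) of how the gap is measured and modified: each modality's L2-normalized embeddings form a cluster on the shared hypersphere, and the gap is the vector between the two cluster centroids.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Stand-ins for L2-normalized image and text embeddings; the offsets
    # mimic the narrow-cone effect that creates the gap at initialization.
    img = normalize(rng.normal(size=(512, 64)) + 2.0)
    txt = normalize(rng.normal(size=(512, 64)) - 2.0)

    gap = img.mean(axis=0) - txt.mean(axis=0)
    print("gap magnitude:", np.linalg.norm(gap))

    # "Changing the gap": shift all embeddings along the gap direction and
    # re-normalize, as in the paper's gap-modification experiments.
    lam = 0.5
    img_shifted = normalize(img - lam * gap)
    txt_shifted = normalize(txt + lam * gap)
    print("shifted gap magnitude:",
          np.linalg.norm(img_shifted.mean(axis=0) - txt_shifted.mean(axis=0)))
    ```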


    Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

    Weixin Liang*, Zachary Izzo*, Yaohui Zhang*, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou
    International Conference on Machine Learning (ICML 2024)
    Oral/top 5% of accepted papers, ICML 2024
    Best Presentation Runner-up award at ICSSI 2024 (International Conference on the Science of Science and Innovation), National Academy of Sciences in Washington, DC.  
    Paper NY Times Article Twitter Code
    Slides PDF Google Slides

    We present an approach for estimating the fraction of text in a large corpus that is likely to have been substantially modified or produced by a large language model (LLM). Our estimates suggest that 10.6% of ICLR 2024 review sentences and 16.9% of EMNLP review sentences were substantially modified by ChatGPT, with no significant evidence of ChatGPT usage in Nature portfolio reviews. Estimated ChatGPT usage in reviews spikes significantly within three days of review deadlines.
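
    The estimation idea can be sketched in a few lines (with toy Gaussian score distributions standing in for the paper's token-frequency models of human- and AI-written reviews): treat the corpus as a mixture of a human distribution and an LLM distribution, and recover the mixing fraction by maximum likelihood.

    ```python
    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # Toy corpus: each document is summarized by one score, with different
    # (hypothetical) score distributions for human vs. AI documents.
    n, true_alpha = 5000, 0.2
    is_ai = rng.random(n) < true_alpha
    scores = np.where(is_ai, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))

    ll_human = norm.logpdf(scores, loc=0.0)   # log P_human(x)
    ll_ai = norm.logpdf(scores, loc=1.0)      # log P_ai(x)

    def neg_log_likelihood(alpha):
        # -sum over documents of log[(1-a) P_human(x) + a P_ai(x)]
        return -np.logaddexp(np.log1p(-alpha) + ll_human,
                             np.log(alpha) + ll_ai).sum()

    res = minimize_scalar(neg_log_likelihood, bounds=(1e-4, 1 - 1e-4),
                          method="bounded")
    print(f"estimated fraction: {res.x:.3f} (true: {true_alpha})")
    ```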


    2025

    Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

    Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
    SCOPE Workshop @ ICLR (2025)  Oral  
    Paper Twitter Bluesky Code

    Extends Mixture-of-Transformers to state-space models (Mamba), pretrained from scratch at 1.4B scale. In the Transfusion (image+text) setting, Mixture-of-Mamba delivers dense-level performance with just 34.76% of the FLOPs.


    LMFusion: Adapting Pretrained Language Models for Multimodal Generation

    Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
    NeurIPS (2025) 
    Paper Twitter

    Applies Mixture-of-Transformers in a post-training setting to Llama-3, enabling text+image understanding and generation while preserving the model's language performance.


    Adaptive Self-improvement LLM Agentic System for ML Library Development

    Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun
    International Conference on Machine Learning (ICML 2025) 
    Best Paper Award at Deep Learning for Code Workshop, ICLR 2025
    Paper Twitter Bluesky

    Can LLMs program themselves to run faster? Programming AI accelerators is a major bottleneck in ML. Our self-improving LLM agent learns to write optimized code for new hardware, achieving 3.9x better results.


    From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review

    Yaohui Zhang, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, Weixin Liang
    NeurIPS (2025) 
    Paper PDF

    Explores how pairwise comparisons can improve LLM-based peer review workflows.


    ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

    Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T Truong, Weixin Liang, Fan-Yun Sun, Nick Haber
    NeurIPS (2025) 
    Paper PDF Project Code Leaderboard

    Can today’s LLMs implement tomorrow’s research ideas? We put them to the test. Introducing ResearchCodeBench: 212 tasks from 2024–25 ML papers and code, most released after any model’s training cutoff.


    OmicsNavigator: an LLM-driven multi-agent system for autonomous zero-shot biological analysis in spatial omics

    Li Yiyao, Nirvi Vakharia, Weixin Liang, Aaron T Mayer, Ruibang Luo, Alexandro E Trevino, Zhenqin Wu
    bioRxiv (2025) 
    Paper PDF

    LLM-driven multi-agent system for zero-shot biological analysis in spatial omics.


    Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors

    Fan Nie, Lan Feng, Haotian Ye, Weixin Liang, Pan Lu, Huaxiu Yao, Alexandre Alahi, James Zou
    COLM (2025) 
    Paper PDF Code

    Trains a weak meta-agent to orchestrate stronger executors for improved performance.


    Quantifying large language model usage in scientific papers.

    Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, Diyi Yang, Christopher Potts, Christopher D Manning, James Y. Zou
    Nature Human Behaviour (2025) 
    Best Presentation Runner-up award at ICSSI 2024 (International Conference on the Science of Science and Innovation), National Academy of Sciences in Washington, DC.  
    Nature.com PDF Code

    Our study finds ~17% of recent CS arXiv papers and ~8% of bioRxiv papers substantially rely on LLM assistance. Higher LLM usage correlates with more frequent preprints, crowded research areas, and shorter papers. Featured in Scientific American, ABC.net.au, Entrepreneur, The Conversation, Tech Xplore, and more.


    The Widespread Adoption of Large Language Model-Assisted Writing Across Society

    Weixin Liang*, Yaohui Zhang*, Mihai Codreanu, Jiayu Wang, Hancheng Cao, James Y. Zou
    Cell Patterns (2025)
    Selected for Cell Patterns “Editors’ Picks: Best of 2025”
    Paper Code Media

    This paper offers the first large-scale measurement of LLM-assisted writing across a diverse range of public online text. We find that AI-generated or AI-edited text is already widespread in domains such as consumer complaints, corporate communications, job postings, and international organization press releases. Our results highlight how LLMs are quietly shaping digital communication and raise urgent questions about transparency, authorship, and the future of human-AI collaboration.


    2024

    Can large language models provide useful feedback on research papers? A large-scale empirical analysis

    Weixin Liang*, Yuhui Zhang*, Hancheng Cao*, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel A. McFarland, James Zou
    NEJM AI (2024) 
    Paper Twitter Code

    With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline that uses GPT-4 to comment on the full PDFs of scientific papers. Our results suggest that LLM and human feedback can complement each other. While human expert review is and should continue to be the foundation of a rigorous scientific process, LLM feedback could benefit researchers, especially when timely expert feedback is unavailable and in the earlier stages of manuscript preparation, before peer review.


    Navigating Dataset Documentation in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

    Xinyu Yang*, Weixin Liang*, James Zou
    ICLR 2024 (2024) 
    PDF Code

    Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. By analyzing all 7,433 dataset cards on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into how datasets are documented in practice.


    Systematic analysis of 32,111 AI model cards characterizes documentation practice in AI.

    Weixin Liang, Nazneen Rajani, Xinyu Yang, Ezinwanne Ozoani, Eric Wu, Yiqun Chen, Daniel Scott Smith, James Zou
    Nature Machine Intelligence (2024) 
    Paper

    The rapid proliferation of AI models has underscored the importance of thorough documentation, which enables users to understand, trust, and effectively utilize these models in various applications. Although developers are encouraged to produce model cards, it is not clear how much or what information these cards contain. In this study, we conduct a comprehensive analysis of 32,111 AI model cards on Hugging Face, a leading platform for distributing and deploying AI models. Our investigation sheds light on prevailing model card documentation practices. Most of the AI models with substantial downloads provide model cards, though the cards are unevenly informative. We find that sections addressing environmental impact, limitations, and evaluation have the lowest fill-out rates, while the training section is the most consistently completed. Our study opens up a new perspective for analyzing community norms and practices for model documentation through large-scale data science and linguistic analysis.


    2023

    GPT detectors are biased against non-native English writers

    Weixin Liang*, Mert Yuksekgonul*, Yining Mao*, Eric Wu*, James Zou
    Cell Patterns (2023) 
    Paper Cell.com Code Data

    We should be very cautious about using detectors to classify whether text was written by AI or by a human. Our research shows that such detectors classify over 50% of real text written by non-native English speakers as AI-generated, while most polished essays generated by GPT evade detection. This creates bias and false positives against non-native speakers, whose writing is flagged as AI-like while more literary language is classified as "human."

    Media Coverage: The Markup, The Guardian, Fortune, Stanford HAI, Stanford EE, The Hechinger Report


    Accuracy on the Curve: On the nonlinear correlation of ML performance between data subpopulations

    Weixin Liang*, Yining Mao*, Yongchan Kwon*, Xinyu Yang, James Zou
    International Conference on Machine Learning (ICML 2023) 
    Paper Poster Website Recording Code

    Recent work empirically finds a strong linear relationship between in-distribution (ID) and out-of-distribution (OOD) performance, but we show that this does not necessarily hold under subpopulation shifts. In this paper, we empirically show that OOD performance often has a nonlinear ("moon shape") correlation with ID performance when subpopulations shift.


    OpenDataVal: a Unified Benchmark for Data Valuation

    Kevin Jiang*, Weixin Liang*, James Zou, Yongchan Kwon
    NeurIPS Datasets and Benchmarks Track (2023) 
    Paper Project Page Code Twitter

    Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms.
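
    For intuition, here is a minimal sketch of one classic scheme such benchmarks cover, leave-one-out valuation, written against plain scikit-learn rather than the OpenDataVal API itself:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=120, random_state=0)
    X_tr, y_tr, X_val, y_val = X[:100], y[:100], X[100:], y[100:]

    def utility(idx):
        """Validation accuracy of a model trained on the given subset."""
        model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
        return model.score(X_val, y_val)

    full = np.arange(len(X_tr))
    base = utility(full)
    # Leave-one-out value: how much validation accuracy drops when one
    # point is removed. Noisy or mislabeled points tend to score low.
    loo = np.array([base - utility(np.delete(full, i)) for i in full])
    print("lowest-value points:", np.argsort(loo)[:5])
    ```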


    Characterizing the Clinical Adoption of Medical AI Devices through U.S. Insurance Claims

    Kevin Wu, Eric Wu, Brandon Theodorou, Weixin Liang, Christina Mack, Lucas Glass, Jimeng Sun, James Zou
    NEJM AI (2023) 
    Paper

    We analyze billions of insurance claims to produce a first look at medical AI adoption.

    Media Coverage: Nature Medicine, The Imaging Wire, Cardiac Wire


    2022

    Advances, opportunities and challenges in creating data for trustworthy AI

    Weixin Liang, Girmaw Abebe Tadesse, Daniel Ho, Li Fei-Fei, Matei Zaharia, Ce Zhang, James Zou
    Nature Machine Intelligence (2022) 
    Paper Nature.com Twitter

    As AI model building becomes more automated, much of the time and resources in practice are devoted to designing what data to collect, cleaning data, annotating it, and evaluating it. Our article discusses best practices, new challenges, and opportunities for each of these key components of the data-for-AI pipeline.


    Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set

    R Daneshjou*, K Vodrahalli*, W Liang*, R Novoa, M Jenkins, V Rotemberg, J Ko, S Swetter, E Bailey, O Gevaert, P Mukherjee, M Phung, K Yekrang, B Fong, R Sahasrabudhe, Albert Chiou, James Zou
    Science Advances (2022) 
    Machine Learning for Health (ML4H 2021) 
    Paper Science.org Diverse Dermatology Images (DDI) Dataset
    Training physicians and algorithms in dermatology diversity | Scope

    In order to train and test AI algorithms in dermatology, we need diverse, validated benchmarks. We curated the Diverse Dermatology Images (DDI) dataset to meet this need—the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones.


    Systematic analysis of 50 years of Stanford University Technology Transfer and Commercialization

    Weixin Liang, Scott Elrod, Daniel A. McFarland, James Zou
    Patterns (2022) 
    Paper Cell.com Twitter Recording
    Stanford HAI News: Analyzing 50 Years of Stanford Patents
    Finding patterns of success across 50 years of innovation | Scope
    OTL 50th Anniversary Report: A Half Century of Pioneering Innovation

    Computational analysis of 4,512 inventions marketed by Stanford's Office of Technology Licensing between 1970 and 2020 characterizes how the academic innovation landscape changed over time. We identified factors, such as the composition of the inventors, associated with the commercial success of the inventions. We also identified linguistic differences in how high-revenue and low-revenue inventions in the same field are described and marketed.


    MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts

    Weixin Liang, James Zou
    Contributed Talk at ICML 2022 Workshop on Shift happens: Crowdsourcing metrics and test datasets beyond ImageNet
    International Conference on Learning Representations (ICLR 2022) 
    Paper HTML Website Code
    HuggingFace Recording Blog

    MetaShift introduces a collection of more than 10K sets of images with annotated contexts. Context is missing in many ML datasets but is critical for understanding model performance; MetaShift enables evaluating how ML works in different contexts (e.g., indoor cat vs. outdoor cat). As a bonus, we also provide a distance score between contexts.


    SEAL: Interactive Tool for Systematic Error Analysis and Labeling

    Nazneen Rajani, Weixin Liang, Lingjiao Chen, Meg Mitchell, James Zou
    Empirical Methods in Natural Language Processing (EMNLP 2022) 
    Paper SEAL toolkit and Demo

    Machine learning systems that seemingly perform well on average can still make systematic errors on important subsets of data. We introduce the interactive Systematic Error Analysis and Labeling (SEAL) tool, which uses a two-step approach: it first identifies high-error slices of data and then gives human-understandable semantics to those under-performing slices.
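
    A minimal sketch of the two-step idea (toy embeddings; the actual tool works on learned text representations and generates natural-language descriptions for each slice):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 8))        # validation-set embeddings
    errors = rng.random(1000) < 0.15        # points the model misclassified

    # Step 1: cluster only the misclassified points into candidate slices.
    err_emb = emb[errors]
    slices = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(err_emb)

    # Step 2: summarize each slice (here just its size and centroid; SEAL
    # instead attaches human-understandable semantics to each slice).
    for k in range(4):
        members = err_emb[slices == k]
        print(f"slice {k}: {len(members)} errors, centroid norm "
              f"{np.linalg.norm(members.mean(axis=0)):.2f}")
    ```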


    Improving Out-of-Distribution Robustness via Selective Augmentation

    Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, Chelsea Finn
    International Conference on Machine Learning (ICML 2022) 
    Paper HTML Poster Code Recording

    To deploy machine learning algorithms in real-world applications, we must pay attention to distribution shift, i.e., when the test distribution differs from the training distribution, which substantially degrades model performance. We propose a simple mixup-based method to learn invariant functions via selective augmentation.
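
    As a sketch of one of the paper's selective strategies, intra-label mixup (toy tensors; not the authors' code): interpolate pairs that share a label but come from different domains, which encourages features that are invariant to the domain.

    ```python
    import torch

    def intra_label_mixup(x, y, domain, alpha=2.0):
        """Mix each example with a same-label example from another domain."""
        beta = torch.distributions.Beta(alpha, alpha)
        x_mix = x.clone()
        for i in range(len(x)):
            # Candidates share the label but come from a different domain.
            idx = ((y == y[i]) & (domain != domain[i])).nonzero().flatten()
            if len(idx) == 0:
                continue
            j = idx[torch.randint(len(idx), (1,))].item()
            lam = beta.sample()
            x_mix[i] = lam * x[i] + (1 - lam) * x[j]  # labels agree, so y[i] is kept
        return x_mix, y

    x = torch.randn(32, 16)              # features
    y = torch.randint(0, 2, (32,))       # labels
    domain = torch.randint(0, 3, (32,))  # domain/subpopulation ids
    x_mix, y_mix = intra_label_mixup(x, y, domain)
    print(x_mix.shape, y_mix.shape)
    ```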


    2021

    Neural Group Testing to Accelerate Deep Learning

    Weixin Liang, James Zou
    International Symposium on Information Theory (ISIT 2021) 
    Paper HTML Code Slides

    Our ISIT 2021 paper proposes neural group testing to speed up deep learning inference. The idea is to adaptively apply the network to groups of data pooled at suitable layers, which greatly reduces total compute.
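
    A minimal sketch of the idea (toy untrained network and threshold; not the paper's exact design): pool a group of inputs at an intermediate layer and run the remaining layers once, and only if the pooled pass fires do we test the group's members individually.

    ```python
    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # early layers
    head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())    # late layers

    def group_test(xs, threshold=0.5):
        feats = backbone(xs)                            # per-item features
        pooled = feats.max(dim=0, keepdim=True).values  # pool the whole group
        if head(pooled).item() < threshold:
            # One forward pass through the late layers clears the group.
            return torch.zeros(len(xs), dtype=torch.bool)
        # Otherwise fall back to testing each member individually.
        return (head(feats) > threshold).squeeze(1)

    xs = torch.randn(8, 32)  # a group of 8 samples
    print(group_test(xs))
    ```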


    HERALD: An Annotation Efficient Method to Train User Engagement Predictors in Dialogs

    Weixin Liang*, Kaihui Liang*, Zhou Yu
    Annual Conference of the Association for Computational Linguistics (ACL 2021) 
    Paper HTML Code Recording

    We propose a workflow that automatically labels training data with minimal human effort, building on our previous ACL 2020 work.


    2020

    ALICE: Active Learning with Contrastive Natural Language Explanations

    Weixin Liang, James Zou, Zhou Yu
    Empirical Methods in Natural Language Processing (EMNLP 2020) 
    Paper HTML Slides Recording Blog

    Review ratings: 4, 4, 4.5 on a 5-point scale

    Our EMNLP paper shows how to teach ML models via natural language explanations of the contrasts between concepts (e.g., "the difference between COVID and flu is ..."). This is much more efficient than using labeled examples alone. Excited for more human-like learning!


    Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

    Weixin Liang, James Zou, Zhou Yu
    Annual Conference of the Association for Computational Linguistics (ACL 2020) 
    Paper HTML Code Slides Recording Blog

    Review ratings: 4.5, 4.5, 5 on a 5-point scale

    For dialog system evaluation, we found that self-reported dialog ratings are skewed, noisy, and insensitive because of bias and variance among different users. We propose a three-stage denoising pipeline to reduce noise in self-reported ratings and, at the same time, build a comparison-based automatic dialog quality predictor.


    MOSS: Training End-to-End Dialog Systems with Modular Supervision

    Weixin Liang*, Youzhi Tian*, Chengcai Chen, Zhou Yu
    AAAI Conference on Artificial Intelligence (AAAI 2020) 
    Paper HTML Slides Press

    We propose an end-to-end framework for task-oriented dialog systems that can flexibly incorporate supervision from multiple intermediate dialog system modules (e.g., natural language understanding, dialog state tracking, dialog policy learning, and natural language generation).


    2019 and before

    DeepStore: In-Storage Acceleration for Intelligent Queries

    VS Mailthody, Z Qureshi, W Liang, Z Feng, SG De Gonzalo, Y Li, H Franke, J Xiong, J Huang, Wen-mei Hwu
    International Symposium on Microarchitecture (MICRO 2019) 
    Paper

    A computer architecture conference paper on in-storage hardware acceleration for deep learning.


    CU-Net: Component Unmixing Network for Textile Fiber Identification.

    Zunlei Feng, Weixin Liang, Daocheng Tao, Li Sun, Anxiang Zeng, Mingli Song
    International Journal of Computer Vision (IJCV 2019) 
    Paper

    We are the first to leverage computer vision techniques for image-based, nondestructive textile fiber identification, which is practically useful in the fashion, decoration, and design industries. Existing methods based on physical, chemical, and microscopy techniques are typically limited by long identification cycles, heavy dependence on human operators, high technological barriers, and the damage they cause to samples.


    MemCloak: Practical Access Obfuscation for Untrusted Memory

    Weixin Liang, Kai Bu, Ke Li, Jinhong Li, Arya Tavakoli
    Annual Computer Security Applications Conference (ACSAC 2018) 
    Paper

    Outstanding Graduation Thesis, Zhejiang University

    Access patterns over untrusted memory have long been exploited to infer sensitive information such as program types or even secret keys. We propose a lightweight obfuscation solution to hide real memory accesses.


    Latest News

    • [Aug 22, 2022] Nature Machine Intelligence article on data-centric AI; analysis of 50 years of Stanford research commercialization published in Patterns; Science Advances paper on disparities in dermatology AI and the new Diverse Dermatology Images (DDI) dataset.
    • [Jul 17, 2022] Look forward to meeting you all at ICML 2022, Baltimore, Maryland!
    • [Apr 23, 2022] New AI4Health Dataset! In order to train and test AI algorithms in dermatology, we need diverse, validated benchmarks. We curated the Diverse Dermatology Images (DDI) dataset to meet this need—the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones.
    • [Mar 3, 2022] Our new paper explains the intriguing modality gap in multi-modal AI.
    • [Jan 22, 2022] New ICLR paper MetaShift offers a resource of 1000s of distribution shifts.
    • [Jun 13, 2021] Happy to share that I graduated from the master's program at @Stanford today and will stay as a Ph.D. student starting this fall! Endlessly grateful to the people who supported me throughout the journey.
    • [Sep 6, 2019] Arrived @Stanford!
    More >

    Teaching