3D-Generalist:
Self-Improving Vision-Language-Action Models for Crafting 3D Worlds

🎉 3DV 2026

3D-Generalist is a generative graphics framework that combines multiple foundation models and modules to scale up 3D environments and data that are readily usable for synthetic data generation and embodied AI.

Here are some 3D environments crafted by 3D-Generalist, demonstrating controllable generation over
🎨 materials, 💡 lighting, 🏠 assets, and 📐 layout:

"An international restaurant with vibrant decor."

"A spacious home gym that is fully equipped."

"A bohemian art studio with a vintage easel."

3D-Generalist uses a diffusion model to generate panoramic images, which an inverse graphics pipeline then converts into the structure of the 3D environment.


"A chic clothing store with mannequins."

3D-Generalist employs a Vision-Language-Action (VLA) model to generate code to craft and modify all aspects (materials, lighting, assets, and layout) of the resulting 3D environments. The VLA is finetuned to optimize for prompt alignment via a self-improvement training loop.
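To make the idea of a self-improvement loop concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a scene can be reduced to a dictionary of numeric attributes and that prompt alignment can be scored by a simple function; `Scene`, `propose_action`, `alignment_score`, and `collect_finetuning_data` are all hypothetical names invented for illustration. The loop samples candidate edits, keeps only those that raise the alignment score, and records the accepted (scene, action) pairs as data a policy could later be finetuned on.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Scene:
    # Toy stand-in for a 3D environment: a flat dict of editable attributes.
    attributes: dict = field(default_factory=dict)


def propose_action(scene, rng):
    """Hypothetical policy: propose one edit as a (key, value) pair."""
    key = rng.choice(["materials", "lighting", "assets", "layout"])
    return key, rng.random()


def apply_action(scene, action):
    """Return a new scene with the edit applied; the input is left unchanged."""
    key, value = action
    new_attributes = dict(scene.attributes)
    new_attributes[key] = value
    return Scene(new_attributes)


def alignment_score(scene, target):
    """Placeholder alignment metric (assumption): negative L1 distance to a target."""
    return -sum(abs(scene.attributes.get(k, 0.0) - v) for k, v in target.items())


def self_improve_round(scene, target, rng, num_candidates=8):
    """Sample candidate edits and keep the best one only if it improves the score."""
    best_scene, best_action = scene, None
    best_score = alignment_score(scene, target)
    for _ in range(num_candidates):
        action = propose_action(scene, rng)
        candidate = apply_action(scene, action)
        score = alignment_score(candidate, target)
        if score > best_score:
            best_scene, best_action, best_score = candidate, action, score
    return best_scene, best_action


def collect_finetuning_data(target, rounds=20, seed=0):
    """Run several rounds and record the accepted (scene, action) pairs."""
    rng = random.Random(seed)
    scene = Scene()
    dataset = []
    for _ in range(rounds):
        new_scene, action = self_improve_round(scene, target, rng)
        if action is not None:
            dataset.append((scene, action))
            scene = new_scene
    return scene, dataset
```

In a real system the placeholder score would be replaced by an image-text alignment measure computed on renders, and the recorded pairs would feed a finetuning step; both are outside the scope of this sketch.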

"A colorful arcade with neon signs."

3D-Generalist employs another VLA to handle diverse small object placement tasks with *unlabeled* 3D assets, capable of:

  • Densely populating surfaces
  • Adding assets between shelves
  • Stacking assets
"A modern bar with brick wall and marble bar counter."

"A quaint bookstore."

After Finetuning

These examples qualitatively highlight 3D-Generalist's self-correcting behavior.


In the Omniverse ecosystem:

  • Omniverse Replicator enables large-scale synthetic data generation with domain randomization.
  • Isaac Lab provides readily available embodiments (e.g., humanoid robots) that can be used in these generated environments for robotic simulation.

We train a vision foundation model using the Florence-2 framework on synthetic data rendered from 3D environments generated by 3D-Generalist. The results approach those achieved with real datasets that are orders of magnitude larger.
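As a rough illustration of the domain randomization mentioned above, the sketch below samples per-frame rendering parameters from fixed ranges. It deliberately does not use the Omniverse Replicator API; `RenderConfig`, its fields, and all numeric ranges are hypothetical choices made for this example, and a real pipeline would pass such parameters to the renderer.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderConfig:
    # All fields and ranges are illustrative assumptions, not values from the paper.
    light_intensity: float
    camera_yaw_deg: float
    camera_height_m: float
    material_id: int


def sample_render_config(rng, num_materials=10):
    """Draw one randomized configuration for a single rendered frame."""
    return RenderConfig(
        light_intensity=rng.uniform(200.0, 2000.0),
        camera_yaw_deg=rng.uniform(0.0, 360.0),
        camera_height_m=rng.uniform(1.0, 2.0),
        material_id=rng.randrange(num_materials),
    )


def randomized_configs(num_frames, seed=0):
    """Generate a reproducible list of randomized configurations."""
    rng = random.Random(seed)
    return [sample_render_config(rng) for _ in range(num_frames)]
```

Seeding the generator keeps a dataset reproducible, while widening the ranges increases the visual variety the downstream model sees.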

BibTeX

@article{sun20253d,
  title={3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds},
  author={Sun, Fan-Yun and Wu, Shengguang and Jacobsen, Christian and Yim, Thomas and Zou, Haoming and Zook, Alex and Li, Shangru and Chou, Yu-Hsin and Can, Ethem and Wu, Xunlei and others},
  journal={arXiv preprint arXiv:2507.06484},
  year={2025}
}