Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks such as arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts that are better aligned with the semantic intent of the input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation, extracted from existing scene datasets, can improve their reasoning performance.
Our approach employs VLMs to generate code in our proposed scene layout representation, which specifies both an initial layout and a set of spatial relations between assets (and walls). This representation is then used to produce the final object placements through differentiable optimization.
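To make this two-part representation concrete, below is a minimal sketch (in PyTorch) of how such a layout program might look: VLM-proposed initial poses for each asset, plus spatial relations expressed as differentiable losses that a gradient-based optimizer refines into final placements. All names here (distance_between, against_wall, the specific assets and target values) are illustrative assumptions, not the actual LayoutVLM code or API.

# Hypothetical sketch: initial layout + spatial relations refined by gradient descent.
import torch

# Part 1: VLM-proposed initial layout, one (x, y, rotation) pose per asset.
init_poses = {
    "sofa":  [1.0, 0.5, 0.0],
    "table": [1.2, 1.8, 0.0],
}
poses = {name: torch.tensor(p, requires_grad=True) for name, p in init_poses.items()}

# Part 2: VLM-proposed spatial relations, encoded as differentiable losses.
def distance_between(a, b, target):
    # Penalize deviation from a target center-to-center distance.
    return (torch.norm(a[:2] - b[:2]) - target) ** 2

def against_wall(a, wall_y=0.0):
    # Pull an asset's y-coordinate toward a wall placed at y = wall_y.
    return (a[1] - wall_y) ** 2

relations = [
    lambda p: distance_between(p["sofa"], p["table"], target=1.0),
    lambda p: against_wall(p["sofa"], wall_y=0.0),
]

# Differentiable optimization: adjust the poses until the relations are satisfied.
optimizer = torch.optim.Adam(list(poses.values()), lr=0.05)
for _ in range(300):
    optimizer.zero_grad()
    loss = sum(rel(poses) for rel in relations)
    loss.backward()
    optimizer.step()

final_layout = {name: pose.detach().tolist() for name, pose in poses.items()}

Because the relations are differentiable, an imperfect initial layout proposed by the VLM can still be pulled into a configuration that satisfies the stated constraints, which is what allows the optimization stage to enforce physical plausibility on top of the VLM's semantic plan.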
@article{sun2024layoutvlm,
  title={LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models},
  author={Sun, Fan-Yun and Liu, Weiyu and Gu, Siyi and Lim, Dylan and Bhat, Goutam and Tombari, Federico and Li, Manling and Haber, Nick and Wu, Jiajun},
  journal={arXiv preprint arXiv:2412.02193},
  year={2024}
}