Programs, Words, and Embeddings
We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity. This representation can be inferred from pre-trained language models via a training-free inference technique, given text or image inputs. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms a robust, automated system for high-quality 3D and 4D scene generation. Compared with existing representations like scene graphs, our proposed Scene Language generates complex scenes with higher fidelity, while explicitly modeling the scene structures to enable precise control and editing.
In the representation, a program declares a set of functions. Each function defines a semantic class of parts or objects, named with natural-language words, as a mapping from neural embeddings capturing geometry and appearance details to class instances. The function body explicitly describes how simpler semantic components are spatially transformed and composed into more complex scenes.
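To make this concrete, here is a minimal, hypothetical sketch of what such a program might look like once expressed in Python. The `Entity` dataclass and the `primitive`, `translate`, and `union` helpers are stand-ins invented for illustration, and the embedding is reduced to a plain string; the actual Scene Language grammar and embedding space differ.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A scene entity: a semantic word, an appearance embedding, and child parts."""
    word: str                            # natural-language class name
    embedding: str                       # placeholder for a neural appearance embedding
    transform: tuple = (0.0, 0.0, 0.0)   # translation only, for brevity
    children: list = field(default_factory=list)


def primitive(word: str, embedding: str) -> Entity:
    """Leaf entity; geometry and appearance details live in the embedding."""
    return Entity(word=word, embedding=embedding)


def translate(entity: Entity, offset: tuple) -> Entity:
    """Spatially transform an entity (translation only in this sketch)."""
    return Entity(entity.word, entity.embedding, offset, entity.children)


def union(word: str, *parts: Entity) -> Entity:
    """Compose simpler entities into a more complex semantic class."""
    return Entity(word=word, embedding="", children=list(parts))


def chair(embedding: str) -> Entity:
    """A function defines a semantic class; its body composes parts spatially."""
    def leg(x: float, z: float) -> Entity:
        return translate(primitive("leg", embedding), (x, 0.0, z))

    legs = [leg(x, z) for x in (-0.2, 0.2) for z in (-0.2, 0.2)]
    seat = translate(primitive("seat", embedding), (0.0, 0.45, 0.0))
    back = translate(primitive("back", embedding), (0.0, 0.8, -0.2))
    return union("chair", *legs, seat, back)


scene = union("scene", chair(embedding="wooden, varnished"))
```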
Below, we show various applications of the proposed pipeline. Each video is a 360-degree rendering of a pipeline output. Click the "Show Response" buttons to reveal the raw language model responses, which contain the program and word components of the full representation.
This section shows results on text-conditioned 3D scene generation (left column) and editing instruction following (right column), with text inputs shown below the corresponding output renderings. Our representation provides an intuitive interface for controllable scene generation and editing: 1) program function names correspond to natural-language words, offering interpretable semantic meanings, and 2) the program structure reflects the scene structure, so a simple one-line edit can produce a significant scene change while preserving the overall structure, as sketched below.
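Reusing the hypothetical helpers from the sketch above, such a one-line edit might look like the following: the program for a dining set stays fixed, and a single changed line propagates into a large but structure-preserving change in the rendered scene.

```python
def dining_set(embedding: str) -> Entity:
    table = translate(primitive("table", embedding), (0.0, 0.7, 0.0))
    # One-line edit point: change range(4) to range(3), or swap chair for
    # another word-named function, to realize an editing instruction while
    # leaving the rest of the program (and scene) untouched.
    chairs = [translate(chair(embedding), (1.2 * i, 0.0, 0.0)) for i in range(4)]
    return union("dining set", table, *chairs)
```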
The proposed representation captures the structure of not only static but also dynamic scenes, and can be applied to synthesize 4D scenes conditioned on text inputs. It explicitly represents the temporal correspondence of entities; click the button below for tracking visualizations.
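One way a dynamic scene could fit the same sketch is to parameterize the program by time, so that the same word-named entities are re-instantiated at every frame and temporal correspondence follows directly from the program structure. The motion model below is invented for illustration and is not the paper's actual 4D formulation.

```python
import math


def orbiting_chair(embedding: str, t: float) -> Entity:
    """Scene at time t (seconds); the same chair entity circles the origin."""
    radius, angular_speed = 1.5, 0.5
    x = radius * math.cos(angular_speed * t)
    z = radius * math.sin(angular_speed * t)
    return union("scene", translate(chair(embedding), (x, 0.0, z)))


# 4 seconds at 24 frames per second; entity identity is shared across frames.
frames = [orbiting_chair("wooden, varnished", t=i / 24.0) for i in range(96)]
```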
The same representation can be rendered with different renderers, demonstrating its versatility. The renderers produce outputs that adhere to the same representation and are therefore visually aligned, while each exhibits a distinct imaging style. The following shows text-conditioned 3D generation results, with renderer names and input text prompts shown below the corresponding renderings.
The proposed representation can also be used to parse images and generate 3D scenes consistent with the parsed image structure and content. In the results below, input images are shown in the first column, and renderings from two graphics renderers are shown in the second and third columns.
To automatically infer the representation from text or image inputs, we convert the Scene Language grammar into Python and prompt Claude 3.5 Sonnet [1] to generate programs and words. Neural embeddings reside in the CLIP text embedding space [2,3] and are obtained from texts via the CLIP text encoder or inverted from images with a pre-trained text-to-image model [4,5]. Afterward, a program interpreter executes the program and computes a data object for the scene, and a graphics renderer renders the data object into an image.
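The following is a high-level sketch of that pipeline under stated assumptions: `SYSTEM_PROMPT`, `execute_program`, and `render` are hypothetical placeholders for the paper's grammar prompt, program interpreter, and graphics renderer, and the specific CLIP checkpoint is chosen only for illustration; the Anthropic and open_clip calls themselves are standard APIs.

```python
import anthropic
import open_clip
import torch

SYSTEM_PROMPT = "You write Scene Language programs in Python ..."  # placeholder


def infer_program(task: str) -> str:
    """Ask Claude 3.5 Sonnet for the program and words given a text task."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    reply = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=4096,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": task}],
    )
    return reply.content[0].text  # program source with word-named functions


def embed_words(words: list[str]) -> torch.Tensor:
    """Map each entity's word(s) into the CLIP text embedding space."""
    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    with torch.no_grad():
        return model.encode_text(tokenizer(words))


# program_src = infer_program("a chessboard at game start")
# scene = execute_program(program_src, embed_words([...]))  # interpreter (placeholder)
# image = render(scene, renderer="gaussians")               # renderer (placeholder)
```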
We evaluate our method on text-conditioned 3D generation via a user study across the 9 scenes shown below. We compare with two baselines: GraphDreamer [6], which uses scene graphs as the intermediate representation, and MVDream [7], which generates scenes directly from text inputs. Input text prompts are shown below the output renderings. Our representation offers flexible and precise specification of entity relations, which in practice offloads the burden of reasoning about complex entity relations from the graphics renderer to the language model in our inference pipeline and yields accurate scene generation results.
Failure case 1: prompt sensitivity. For text-conditioned tasks, minor variations in textual scene descriptions can lead to large quality differences in the output.
Failure case 2: image parsing errors. For image-conditioned tasks, input images are parsed with the backbone visual language model (Claude 3.5 Sonnet). In the example below, parsing results for the same input image vary substantially across multiple inference runs.
- Anthropic. (2024). The Claude 3 model family: Opus, Sonnet, Haiku.
- Gadre, S. Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., ... & Schmidt, L. (2024). DataComp: In search of the next generation of multimodal datasets.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-resolution image synthesis with latent diffusion models.
- Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2022). An image is worth one word: Personalizing text-to-image generation using textual inversion.
- Gao, G., Liu, W., Chen, A., Geiger, A., & Schölkopf, B. (2024). GraphDreamer: Compositional 3D scene synthesis from scene graphs.
- Shi, Y., Wang, P., Ye, J., Long, M., Li, K., & Yang, X. (2023). MVDream: Multi-view diffusion for 3D generation.
@article{zhang2024scenelanguage,
  title={The Scene Language: Representing Scenes with Programs, Words, and Embeddings},
  author={Yunzhi Zhang and Zizhang Li and Matt Zhou and Shangzhe Wu and Jiajun Wu},
  year={2024},
  journal={arXiv preprint arXiv:2410.16770},
}
We thank Jiaqi Chen and Koven Yu for insightful discussions, and Jiayuan Mao and Chen Geng for their detailed comments and feedback.
The source code for this webpage is available here.