Language-Informed Visual Concept Learning
ICLR 2024
Stanford University
*Equal contribution, alphabetically ordered


Abstract

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g., color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g., a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.

Method

During training, an input image is processed by a set of concept encoders that predict concept embeddings specific to given concept axes. These embeddings are trained to (1) retain information in order to reproduce visual inputs via a pre-trained Text-to-Image model (DeepFloyd) given an axis-informed text template, and (2) ensure disentanglement across different axes by anchoring to text embeddings obtained from a pre-trained Visual Question Answering model (BLIP-2).
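As a rough illustration of how the two training signals described above could be combined, the sketch below sums a reconstruction term and a per-axis anchor term. The mean-squared-error form of both terms, the `anchor_weight` value, and all function and variable names are assumptions made for illustration. This page does not specify them, and the vectors are toy stand-ins for the outputs of the T2I model and the encoders.

```python
from typing import Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(
    predicted: Vector,
    target: Vector,
    concept_embs: Dict[str, Vector],
    anchor_embs: Dict[str, Vector],
    anchor_weight: float = 0.1,  # hypothetical weighting, not from the source
) -> float:
    """Reconstruction term plus a weighted anchor term averaged over axes."""
    # (1) Reconstruction: how closely the T2I output matches its target.
    reconstruction = mse(predicted, target)
    # (2) Anchoring: pull each axis embedding toward its VQA text embedding.
    anchor = sum(
        mse(concept_embs[axis], anchor_embs[axis]) for axis in concept_embs
    ) / len(concept_embs)
    return reconstruction + anchor_weight * anchor
```

Here `concept_embs` and `anchor_embs` are keyed by axis name (for example `"color"`), so each encoder is only compared against the text anchor for its own axis.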

I. Concept Recomposition

At test time, our model extracts visual concepts along various axes, such as <category>, <color>, and <material>, from different images and recomposes them to generate new images. We show recomposition results across different pairs of concept axes in three datasets: (a) Fruits, (b) Paintings, and (c) Furniture.
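The recomposition step can be pictured as swapping per-axis embeddings between images before generation. The sketch below is a minimal illustration under that reading: the `recompose` helper, the dictionary layout, and the 2-D embedding values are all hypothetical, and the final call to the T2I model is omitted.

```python
from typing import Dict, List

Embedding = List[float]
Concepts = Dict[str, Embedding]


def recompose(base: Concepts, donors: Dict[str, Concepts]) -> Concepts:
    """Return base's concepts with selected axes taken from donor images.

    `donors` maps an axis name to the concept dict of the image that
    should supply that axis.
    """
    mixed = dict(base)  # shallow copy so the inputs stay untouched
    for axis, donor in donors.items():
        if axis not in base:
            raise KeyError(f"unknown concept axis: {axis}")
        mixed[axis] = donor[axis]
    return mixed


# Toy 2-D embeddings standing in for encoder outputs (made-up values).
apple = {"category": [1.0, 0.0], "color": [0.9, 0.1], "material": [0.2, 0.2]}
banana = {"category": [0.0, 1.0], "color": [0.1, 0.9], "material": [0.3, 0.3]}

# Keep the apple's category and material, borrow the banana's color.
yellow_apple = recompose(apple, {"color": banana})
```

The mixed dictionary would then be fed to the generator in place of the embeddings extracted from a single image.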

II. Generalization to Unseen Concepts via Finetuning

After finetuning on a single test-time image, the encoders can adapt to a novel concept. Visual details from the input image are preserved, as can be seen in the images visualizing the embedding predictions. Importantly, these embeddings do not overfit to the input image and remain well disentangled, so they can be freely recomposed to create new concepts.

III. Text-Based Image Editing

Compared to existing text-based image editing methods, our method achieves significantly better compositionality thanks to its disentangled concept representation, and it captures more fine-grained visual details from the input image.

IV. Concept Extrapolation

Given an input image, we can extrapolate along a concept axis by querying BLIP-2 and GPT-4 to name a few alternatives to the concept in the input and generate new variants for visual exploration.

V. Concept Interpolation

Given two input images, their concept embeddings can be extracted individually and then interpolated to synthesize intermediate images.
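One simple way to realize this is linear interpolation between the two embeddings along a chosen axis. The page does not state which interpolation scheme is used, so the linear form below, along with the helper names `lerp` and `interpolation_path`, is an assumption for illustration only.

```python
from typing import List

Vector = List[float]


def lerp(a: Vector, b: Vector, t: float) -> Vector:
    """Linearly blend two equal-length embeddings; t=0 gives a, t=1 gives b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_path(a: Vector, b: Vector, steps: int) -> List[Vector]:
    """Return `steps` evenly spaced embeddings from a to b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(a, b, i / (steps - 1)) for i in range(steps)]


# Toy embeddings for one concept axis of two images (made-up values).
path = interpolation_path([0.0, 2.0], [1.0, 4.0], steps=3)
```

Each embedding in `path` would be passed to the generator in turn, with the other axes held fixed, to produce the sequence of in-between images.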

BibTeX
@article{lee2023languageinformed,
  author    = {Sharon Lee and Yunzhi Zhang and Shangzhe Wu and Jiajun Wu},
  title     = {Language-Informed Visual Concept Learning},
  journal   = {arXiv preprint arXiv:2312.03587},
  year      = {2023}
}
Acknowledgement

We thank Kyle Hsu, Joy Hsu, and Stephen Tian for their detailed comments and feedback. This work is in part supported by NSF RI #2211258, ONR MURI N00014-22-1-2740, ONR N00014-23-1-2355, AFOSR YIP FA9550-23-1-0127, the Stanford Institute for Human-Centered AI (HAI), Amazon, J.P. Morgan, and Samsung.