Compositionality in Computer Vision

June 15th, held in conjunction with CVPR 2020 Virtual

Overview

People understand the world as a sum of its parts. Events are composed of other actions, objects can be broken down into pieces, and this sentence is composed of a series of words. When presented with new concepts, people can decompose the novelty into familiar parts. Our knowledge representation is naturally compositional. Unfortunately, many of the underlying architectures that drive vision tasks generate representations that are not compositional.

In our workshop, we will discuss compositionality in computer vision: the notion that the representation of the whole should be composed of the representations of its parts. As humans, our perception is deeply intertwined with reasoning through composition: we understand a scene by components, a 3D shape by parts, an activity by events, etc. We hypothesize that intelligent agents also need to develop compositional understanding that is robust, generalizable, and powerful. In computer vision, there is a long-standing line of work based on semantic compositionality, such as part-based object recognition. Pioneering statistical modeling approaches have built hierarchical feature representations for numerous vision tasks. More recently, work has demonstrated that concepts can be learned from only a few examples using a compositional representation. As we move towards higher-level reasoning tasks, our workshop aims at revisiting the idea and reflecting on the future directions of compositionality.

At the workshop, we would like to discuss the following questions. How should we represent composition in scenes, videos, 3D spaces and robotics? How can human perception shed light on compositional understanding algorithms? What are the benefits of exploring compositionality? What structures, architectures and learning algorithms help models learn compositionality? How do we find the balance between compositional and black-box-based understanding? What problems are there in the current compositional understanding methods and how can we remedy them? What efforts should our community make in the future? What inductive biases can be built into our architectures to improve few-shot learning, meta-learning and compositional decomposition?

Program Schedule

Time (Pacific Time, UTC-7)

Event

Title/Presenter

Links

08:30 - 08:45

Opening remarks

Ranjay Krishna, Stanford University

[video] [slides]

08:45 - 10:15

Keynote talk

Composition in Concept, Space and Time
Abhinav Gupta, Carnegie Mellon University

[video]

Meta-Learning Symmetries and Distributions
Chelsea Finn, Stanford University

[video]

10:15 - 11:00

Keynote talk

A Roadmap for Activity and Event Recognition Models
Aude Oliva, Massachusetts Institute of Technology

11:00 - 11:45

Keynote talk

What next in Computer Vision
Jitendra Malik, University of California, Berkeley

[video] [slides 1] [slides 2]

11:45 - 12:30

Lunch break

12:30 - 13:00

Poster session #1

Training Neural Networks to Produce Compatible Features
Michael Gygli, Jasper Uijlings, Vittorio Ferrari

[paper] [video, slides]

Exploring Latent Class Structures in Classification-By-Components Networks
Lars Holdijk

[paper] [video, slides]

Decomposing Image Generation into Layout Prediction and Conditional Synthesis
Anna Volokitin, Ender Konukoglu, Luc Van Gool

[paper] [video, slides]

Semantic Bottleneck Layers: Quantifying and Improving Inspectability of Deep Representations
Max Losch, Mario Fritz, Bernt Schiele

[paper] [video, slides]

13:00 - 13:45

Keynote talk

Unsupervised Representations towards Counterfactual Predictions
Animesh Garg, University of Toronto

[slides]

13:45 - 14:30

Keynote talk

Composing Humans and Objects in the 3D World
Angjoo Kanazawa, University of California, Berkeley

[slides]

14:30 - 15:15

Live panel discussion

Panelists:

Jitendra Malik
Aude Oliva
Chelsea Finn
Animesh Garg
Angjoo Kanazawa

Moderated by Ranjay Krishna

15:15 - 15:45

Afternoon break

15:45 - 16:05

Oral talk #1

Compositional Convolutional Neural Networks: A Deep Architecture with Innate Robustness to Partial Occlusion
Adam Kortylewski, Ju He, Qing Liu, Alan Yuille

[paper] [video, slides]

16:05 - 16:25

Oral talk #2

PaStaNet: Toward Human Activity Knowledge Engine
Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, Cewu Lu

[paper] [video, slides]

16:25 - 16:45

Oral talk #3

Searching for Actions on the Hyperbole
Teng Long, Pascal Mettes, Heng Tao Shen, Cees Snoek

[paper] [video, slides]

16:45 - 17:15

Poster session #2

Inferring Temporal Compositions of Actions Using Probabilistic Automata
Rodrigo Santa Cruz, Anoop Cherian, Basura Fernando, Dylan Campbell, Stephen Gould

[paper] [video] [slides]

Understanding Action Recognition in Still Images
Deeptha Girish, Vineeta Singh, Anca Ralescu

[paper] [video, slides]

17:15 - 17:30

Closing remarks

Keynote Speakers

Jitendra Malik is the Arthur J. Chick Professor in the Department of Electrical Engineering and Computer Science at the University of California at Berkeley, where he also holds appointments in vision science, cognitive science and bioengineering. He received his PhD in Computer Science from Stanford University in 1985, after which he joined UC Berkeley as a faculty member. He served as Chair of the Computer Science Division during 2002-2006, and of the Department of EECS during 2004-2006. Jitendra's group has worked on computer vision, computational modeling of biological vision, computer graphics and machine learning. Several well-known concepts and algorithms arose in this work, such as anisotropic diffusion, normalized cuts, high dynamic range imaging and shape contexts. He received the Longuet-Higgins Award for “A Contribution that has Stood the Test of Time” twice, in 2007 and 2008, the PAMI Distinguished Researcher Award in computer vision in 2013, the K.S. Fu Prize in 2014, and the IEEE PAMI Helmholtz Prize for two different papers in 2015. Jitendra Malik is a Fellow of the IEEE, ACM, and the American Academy of Arts and Sciences, and a member of the National Academy of Sciences and the National Academy of Engineering.

Aude Oliva has a dual French baccalaureate in Physics and Mathematics and a B.Sc. in Psychology (minor in Philosophy). She received two M.Sc. degrees, in Experimental Psychology and in Cognitive Science, and a Ph.D. from the Institut National Polytechnique of Grenoble, France. She joined the MIT faculty in the Department of Brain and Cognitive Sciences in 2004, the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) in 2012, the MIT-IBM Watson AI Lab in 2017, and the leadership of the Quest for Intelligence in 2018. She is also affiliated with the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research at MIT, and the MIT CSAIL Initiative "Systems That Learn". She is the MIT Executive Director of the MIT-IBM Watson AI Lab, and the Executive Director of the MIT Quest for Intelligence, a new MIT-wide initiative which seeks to discover the foundations of human and machine intelligence and deliver transformative new technology for humankind. She is currently on the Scientific Advisory Board of the Allen Institute for Artificial Intelligence.

Abhinav Gupta is an Associate Professor at the Robotics Institute, Carnegie Mellon University, and a Research Manager at Facebook AI Research (FAIR). Abhinav's research focuses on scaling up learning by building self-supervised, lifelong and interactive learning systems. Specifically, he is interested in how self-supervised systems can effectively use data to learn visual representations, common sense and representations for actions in robots. Abhinav is a recipient of several awards, including the ONR Young Investigator Award, PAMI Young Researcher Award, Sloan Research Fellowship, Okawa Foundation Grant, Bosch Young Faculty Fellowship, YPO Fellowship, IJCAI Early Career Spotlight, ICRA Best Student Paper Award, and the ECCV Best Paper Runner-up Award. His research has also been featured in Newsweek, BBC, Wall Street Journal, Wired and Slashdot.

Chelsea Finn is an Assistant Professor in Computer Science and Electrical Engineering at Stanford University. Finn's research interests lie in the ability to enable robots and other agents to develop broadly intelligent behavior through learning and interaction. To this end, Finn has developed deep learning algorithms for concurrently learning visual perception and control in robotic manipulation skills, inverse reinforcement methods for scalable acquisition of nonlinear reward functions, and meta-learning algorithms that can enable fast, few-shot adaptation in both visual perception and deep reinforcement learning. Finn received her Bachelor's degree in Electrical Engineering and Computer Science at MIT and her PhD in Computer Science at UC Berkeley. Her research has been recognized through the ACM Doctoral Dissertation Award, an NSF graduate fellowship, a Facebook fellowship, the C.V. Ramamoorthy Distinguished Research Award, and the MIT Technology Review 35 under 35 Award, and her work has been covered by various media outlets, including the New York Times, Wired, and Bloomberg. With Sergey Levine and John Schulman, Finn also designed and taught a course on deep reinforcement learning, with thousands of followers online. Throughout her career, she has sought to increase the representation of underrepresented minorities within CS and AI by developing an AI outreach camp at Berkeley for underprivileged high school students, creating a mentoring program for underrepresented undergraduates across three universities, and leading efforts within the WiML and Berkeley WiCSE communities of women researchers.

Animesh Garg is an Assistant Professor of Computer Science at the University of Toronto and a Faculty Member at the Vector Institute. He leads the Toronto People, AI and Robotics (PAIR) research group. He is affiliated with Mechanical and Industrial Engineering (courtesy) and the Toronto Robotics Institute. He also shares time as a senior research scientist at Nvidia in ML and Robotics. Prior to this, he was a postdoc at the Stanford AI Lab working with Fei-Fei Li and Silvio Savarese. He received an M.S. in Computer Science and a Ph.D. in Operations Research from UC Berkeley in 2016. He was advised by Ken Goldberg in the Automation Lab as a part of the Berkeley AI Research Lab (BAIR). He also worked closely with Pieter Abbeel, Alper Atamturk and UCSF Radiation Oncology.

Angjoo Kanazawa will start as an Assistant Professor at UC Berkeley in Fall 2020. She is a research scientist at Google NYC. Previously, she was a BAIR postdoc at UC Berkeley advised by Jitendra Malik, Alexei A. Efros and Trevor Darrell. She completed her PhD in CS at the University of Maryland, College Park with her advisor David Jacobs. Prior to UMD, she spent four years at NYU, where she worked with Rob Fergus and completed her BA in Mathematics and Computer Science.

Call for Papers

This workshop aims to bring together researchers from both academia and industry interested in addressing various aspects of compositional understanding in computer vision. The domains include but are not limited to scene understanding, video analysis, 3D vision and robotics. For each of these domains, we will discuss the following topics:

Algorithmic approaches: How should we develop and improve representations of compositionality for learning, such as graph embedding, message-passing neural networks, probabilistic models, etc.?
Evaluation methods: What are convincing metrics for measuring the robustness, generalizability, and accuracy of compositional understanding algorithms?
Cognitive aspects: How can cognitive science research inspire computational models that capture compositionality as humans do?
Optimization and scalability challenges: How should we handle the inherent representations of different components and the curse of dimensionality of graph-based data? How should we effectively collect large-scale databases for training multi-tasking models?
Domain-specific applications: How should we improve scene graph generation, spatio-temporal-graph-based action recognition, structural 3D recognition and reconstruction, meta-learning, reinforcement learning, etc.?
Any other topic of interest for compositionality in computer vision.

Submission

Submit in this CMT portal: cmt3.research.microsoft.com/CICV2020

We provide three submission tracks; please submit to the one that fits your paper:

Archival full paper track. The length limit is 4 - 8 pages excluding references. The format is the same as for CVPR'20 main conference submissions (template). Accepted papers in this track will be published in the CVPR workshop proceedings and IEEE Xplore. These papers will also be in the CVF open access archive.
Non-archival short paper track. The length limit is 4 pages including references. The format is the same as for CVPR'20 main conference submissions (template) but shorter in length. Accepted papers in this track will NOT be published in the CVPR workshop proceedings but will be made public on this workshop website. Note that accepted papers in this non-archival short paper track will not conflict with the dual submission policy of ECCV'20.
Non-archival long paper track. This track is only for previously published papers or papers to appear at the CVPR'20 main conference. There is no page limit. Accepted papers in this track will NOT be published in the CVPR workshop proceedings.

The submission deadline for all tracks has been extended to April 3rd, 2020 at 11:59 pm PST due to the COVID-19 situation. Author notifications will be sent out on April 10th, 2020. The camera-ready deadline is April 18th, 2020.

All accepted papers must be presented as posters. Oral presentations will be selected from the accepted papers.

Organizers

Please contact Jingwei Ji or Ranjay Krishna with any questions: jingweij / ranjaykrishna [at] cs [dot] stanford [dot] edu.

Important Dates and Details

Sign up to receive updates: ~~using this form~~
Apply to be part of the Program Committee by: ~~Feb 15, 2020~~
Paper submission deadline: ~~Mar 27~~ ~~Apr 3, 2020 at 11:59pm PST. CMT portal: cmt3.research.microsoft.com/CICV2020~~
Notification of acceptance: ~~Apr 10, 2020~~
Camera ready due: ~~April 18, 2020~~
Workshop date: June 15, 2020

Program Committee

Shyamal Buch - Stanford University
Chien-Yi Chang - Stanford University
Apoorva Dornadula - Stanford University
Yong-Lu Li - Shanghai Jiao Tong University
Bingbin Liu - Carnegie Mellon University
Karttikeya Mangalam - University of California, Berkeley
Kaichun Mo - Stanford University
Samsom Saju - Mindtree
Gunnar Sigurdsson - Carnegie Mellon University
Paroma Varma - Stanford University
Alec Hodgkinson - Panasonic Beta
Boxiao Pan - Stanford University
Mingzhe Wang - Princeton University
Kaidi Cao - Stanford University