Continuous Perception Benchmark

Stanford University

Abstract

Humans continuously perceive and process visual signals. However, current video models typically either sample key frames sparsely or divide videos into chunks and densely sample within each chunk. This approach stems from the fact that most existing video benchmarks can be addressed by analyzing key frames or aggregating information from separate chunks. We anticipate that the next generation of vision models will emulate human perception by processing visual input continuously and holistically. To facilitate the development of such models, we propose the Continuous Perception Benchmark, a video question answering task that cannot be solved by focusing solely on a few frames or by captioning small chunks and then summarizing using language models. Extensive experiments demonstrate that existing vision models, whether commercial or open-source, struggle with these tasks, indicating the need for new technical advancements in this direction.

Background

Pull figure.

(Top) Most existing video understanding models process videos in one of two ways: either by sparsely sampling frames across the entire video or by densely processing it in chunks. Similarly, most existing video benchmarks can be addressed with these approaches, as the information needed to answer questions can either be sparsely extracted from the entire video or found within a local region of the video. (Bottom) We propose the Continuous Perception Benchmark, a task that requires models to densely process the input video as a whole in order to answer questions correctly. We hope this task will facilitate the development of the next generation of vision models that emulate the human ability to continuously perceive and process visual signals.
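To make the two strategies concrete, the sketch below is our own illustration (not code from the paper); the function names, frame budget, and chunk length are assumptions chosen for clarity. It only shows how frame indices are typically selected under each approach; the model call itself is omitted.

```python
# Illustrative sketch: sparse sampling over the whole video vs. dense per-chunk
# processing. Frame budgets and chunk lengths here are arbitrary assumptions.
import numpy as np

def sparse_sample(num_frames: int, budget: int = 32) -> np.ndarray:
    """Pick `budget` frame indices spread uniformly over the entire video."""
    return np.linspace(0, num_frames - 1, budget).round().astype(int)

def chunked_sample(num_frames: int, chunk_len: int = 64) -> list[np.ndarray]:
    """Split the video into fixed-length chunks and keep every frame in each chunk."""
    starts = range(0, num_frames, chunk_len)
    return [np.arange(s, min(s + chunk_len, num_frames)) for s in starts]

num_frames = 3000  # e.g., a 100-second video at 30 fps
print(sparse_sample(num_frames)[:5])    # a handful of frames covering the whole video
print(len(chunked_sample(num_frames)))  # ~47 chunks, each processed independently
```

Neither strategy maintains a persistent, continuously updated representation of the full video, which is exactly the capability the benchmark is designed to probe.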

Benchmark

Dataset figure.

We create the benchmark using OmniGibson, a simulation environment based on NVIDIA's Omniverse platform. We choose a 3D scene, furnish it with items like chairs and tables, and randomly place objects on the tables. Videos are then rendered with a camera that moves along a set trajectory. The task is straightforward: count how many instances of a specific object appear in the input video. Despite its simplicity, none of the current state-of-the-art video models perform well on this task.
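Purely as an illustration of the task (the field names, prompt wording, and answer parsing below are our assumptions, not the benchmark's actual data format), a single example might be evaluated as follows.

```python
# Hypothetical counting-task example and exact-match check; not the benchmark's
# real schema or evaluation code.
import re

example = {
    "video": "scene_0042.mp4",           # rendered with a camera on a fixed trajectory
    "question": "How many apples appear in the video?",
    "answer": 7,                          # ground-truth count from the simulator
}

def parse_count(model_output: str) -> int | None:
    """Extract the first integer from a model's free-form answer."""
    match = re.search(r"\d+", model_output)
    return int(match.group()) if match else None

prediction = parse_count("I counted 5 apples in total.")  # -> 5
correct = prediction == example["answer"]                  # exact-match check
```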

Experiments

Experiment result 1.

We evaluate several off-the-shelf video models, both open-source and commercial. All models struggle with the task, as reflected by the low scores in the table. The abbreviations are as follows: 'OBZ' stands for Off-By-Zero, 'OBO' is Off-By-One, 'OBF' is Off-By-Five, 'MAE' refers to Mean Absolute Error, 'RMSE' is Root Mean Square Error, and 'CORR' represents Pearson Correlation. These results show that none of the existing video models can 'holistically' understand visual input, underscoring the need for new technical advancements in this area.
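The sketch below shows one way these metrics could be computed from predicted and ground-truth counts. We assume OBZ, OBO, and OBF denote the fraction of predictions within 0, 1, and 5 of the true count, respectively; the example numbers are made up.

```python
# Sketch of counting metrics over a set of (prediction, ground truth) pairs.
# OBZ/OBO/OBF interpreted as tolerance accuracies; values below are illustrative.
import numpy as np

def counting_metrics(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    err = np.abs(pred - gt)
    return {
        "OBZ": float(np.mean(err == 0)),             # exact match
        "OBO": float(np.mean(err <= 1)),             # off by at most one
        "OBF": float(np.mean(err <= 5)),             # off by at most five
        "MAE": float(np.mean(err)),                  # mean absolute error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),   # root mean square error
        "CORR": float(np.corrcoef(pred, gt)[0, 1]),  # Pearson correlation
    }

print(counting_metrics(pred=[5, 9, 12, 3], gt=[7, 9, 10, 3]))
```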

BibTeX

@article{wang2024continuous,
  title={Continuous Perception Benchmark},
  author={Wang, Zeyu and Weng, Zhenzhen and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2408.07867},
  year={2024}
}