Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory (LSTM) deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
The MultiTHUMOS dataset contains dense, multilabel, frame-level action annotations for 30 hours across 400 videos in the THUMOS'14 action detection dataset. It consists of 38,690 annotations of 65 action classes, with an average of 1.5 labels per frame and 10.5 action classes per video. Examples of the multilabel annotations are shown below.
MultiLSTM extends the standard long short-term memory network (LSTM) to target the dense labeling setting, by expanding the temporal receptive fields at both input and output to be fixed-length windows of frames. A soft-attention weighting is furthermore learned over the input window. These connections enable the model to have direct pathways for referencing previously-seen frames, and refining predictions in retrospect after seeing more frames, without forcing it to maintain and communicate this information solely through its recurrent connections.
The multilabel nature of this model and dataset allows us to go beyond simple action labeling, and additionally tackle higher level tasks such as retrieval of video segments containing sequential or co-occurring actions. Qualitative examples of retrieval results are shown below.