Temporal Aggregate Representations for Long-Range Video Understanding

Fadime Sener, Dipika Singhania and Angela Yao
In Proceedings of the European Conference on Computer Vision (ECCV), 2020
 

Abstract

Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
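
As a rough illustration of the aggregation idea named in the abstract, the sketch below max-pools frame features over several temporal spans and fuses the resulting per-span summaries with a learned attention weighting. This is a minimal, hypothetical sketch in PyTorch, not the authors' released implementation; the module and parameter names (MultiGranularAggregator, spans) are assumptions made for illustration only.

# Illustrative sketch only (not the paper's code): max-pool frame features
# over multiple temporal granularities, then fuse the per-granularity
# summaries with a learned soft-attention weighting.
import torch
import torch.nn as nn


class MultiGranularAggregator(nn.Module):
    """Hypothetical module: pools a (T, D) feature sequence at several temporal scales."""

    def __init__(self, feat_dim: int, spans=(10, 20, 30)):
        super().__init__()
        self.spans = spans                  # temporal extents, in frames (assumed values)
        self.attn = nn.Linear(feat_dim, 1)  # scores each pooled summary

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, D) features of the observed video
        summaries = []
        for span in self.spans:
            recent = frame_feats[-span:]                    # last `span` frames (or fewer)
            summaries.append(recent.max(dim=0).values)      # max-pool over time -> (D,)
        pooled = torch.stack(summaries)                     # (num_spans, D)
        weights = torch.softmax(self.attn(pooled), dim=0)   # (num_spans, 1) attention weights
        return (weights * pooled).sum(dim=0)                # fused representation, (D,)


if __name__ == "__main__":
    feats = torch.randn(64, 128)             # 64 observed frames, 128-d features
    agg = MultiGranularAggregator(feat_dim=128)
    print(agg(feats).shape)                  # torch.Size([128])

The fused vector would then feed an anticipation head (e.g. a classifier over future actions); the choice of spans, pooling operator, and attention form here is purely for demonstration.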

Download paper

BibTeX

@INPROCEEDINGS{sener-2020-temporal,
     author = {Sener, Fadime and Singhania, Dipika and Yao, Angela},
      title = {Temporal Aggregate Representations for Long-Range Video Understanding},
  booktitle = {European Conference on Computer Vision (ECCV)},
       year = {2020},
   abstract = {Future prediction, especially in long-range videos, requires reasoning from current and past
               observations. In this work, we address questions of temporal extent, scaling, and level of semantic
               abstraction with a flexible multi-granular temporal aggregation framework. We show that it is
               possible to achieve state of the art in both next action and dense anticipation with simple
               techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our
               model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve
               state-of-the-art results. With minimal modifications, our model can also be extended for video
               segmentation and action recognition.}
}