Unsupervised Learning and Segmentation of Complex Activities from Video

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
 

Abstract

This paper presents a new method for unsupervised segmentation of complex activities from video into multiple steps, or sub-activities, without any textual input. We propose an iterative discriminative-generative approach that alternates between discriminatively learning the appearance of sub-activities, by mapping the videos' visual features to sub-activity labels, and generatively modelling the temporal structure of sub-activities with a Generalized Mallows Model. In addition, we introduce a background model to account for frames unrelated to the actual activities. Our approach is validated on the challenging Breakfast Actions and Inria Instructional Videos datasets and outperforms both unsupervised and weakly-supervised state-of-the-art methods.
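The Generalized Mallows Model mentioned above places an exponentially decaying probability on orderings that deviate from a canonical sub-activity order. Below is a minimal sketch of its likelihood, not the authors' implementation: the inversion-count parameterization and the per-position dispersion vector `rho` follow the standard GMM formulation, and the canonical order is assumed to be 0..n-1.

```python
import math
from itertools import permutations

def inversion_vector(order):
    """Inversion counts v_j: for each item j of the canonical order 0..n-1,
    the number of items with a larger canonical index that appear before j."""
    pos = {item: i for i, item in enumerate(order)}
    n = len(order)
    return [sum(1 for k in range(j + 1, n) if pos[k] < pos[j])
            for j in range(n - 1)]

def gmm_log_prob(order, rho):
    """Log-probability of an ordering under a Generalized Mallows Model
    with per-position dispersions rho (length n-1), canonical order 0..n-1."""
    n = len(order)
    v = inversion_vector(order)
    logp = 0.0
    for j in range(n - 1):
        # Normaliser psi_j(rho_j) = sum_{k=0}^{n-j-1} exp(-rho_j * k),
        # since v_j can take values 0 .. n-j-1.
        psi = sum(math.exp(-rho[j] * k) for k in range(n - j))
        logp += -rho[j] * v[j] - math.log(psi)
    return logp

# The canonical order has zero inversions and is the most probable ordering;
# probabilities over all permutations sum to one.
rho = [1.0, 0.5]
total = sum(math.exp(gmm_log_prob(list(p), rho))
            for p in permutations(range(3)))
```

Because the inversion counts are independent and each has its own normaliser, the distribution factorizes per position, which is what makes larger `rho[j]` values concentrate position j more tightly around the canonical order.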

Note: Accepted manuscript at CVPR 2018 (spotlight)


Download Paper


Bibtex

@INPROCEEDINGS{2018_cvpr_sener,
     author = {Sener, Fadime and Yao, Angela},
      title = {Unsupervised Learning and Segmentation of Complex Activities from Video},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
       year = {2018},
      month = jun,
   abstract = {This paper presents a new method for unsupervised segmentation of complex activities from video into
               multiple steps, or sub-activities, without any textual input. We propose an iterative
               discriminative-generative approach that alternates between discriminatively learning the appearance
               of sub-activities, by mapping the videos' visual features to sub-activity labels, and generatively
               modelling the temporal structure of sub-activities with a Generalized Mallows Model. In addition, we
               introduce a background model to account for frames unrelated to the actual activities. Our
               approach is validated on the challenging Breakfast Actions and Inria Instructional Videos datasets
               and outperforms both unsupervised and weakly-supervised state-of-the-art methods.}
}