A good action proposal method should generate proposals with high recall and high temporal overlap with the ground truth. The quality of the proposals depends on the labeled data available during training. Obtaining labeled data for untrimmed videos is a time-consuming, expensive, and error-prone task. The labels obtained are also subjective, and the temporal bounds are inconsistent between different human annotators. We propose using a single key frame label for each action instance, instead of start and end point labels, to generate temporal proposals. This reduces the number of labeled action frames in the dataset, which leads to class imbalance. To overcome this, we recast the learning problem as positive-unlabeled (PU) learning. We demonstrate that using key frames as labels gives high-quality proposals and yields results comparable to those obtained with full annotations, while being faster to annotate because the exact temporal bounds no longer need to be marked. We evaluate our method on the THUMOS'14 and ActivityNet v1.2 datasets. Further experiments indicate that, by combining an existing action classifier with our proposals, our method achieves high mean average precision (mAP) for action localization.
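To make the two proposal-quality criteria above concrete, the following is a minimal sketch of how temporal overlap and recall are commonly computed for temporal proposals. It is an illustration only, not the evaluation code used in this work: the function names `temporal_iou` and `recall_at_tiou`, the representation of segments as `(start, end)` pairs in seconds, and the default threshold of 0.5 are all assumptions made for the example.

```python
def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(proposals, ground_truth, threshold=0.5):
    """Fraction of ground-truth segments matched by at least one proposal.

    A ground-truth segment counts as recalled when some proposal overlaps
    it with temporal IoU >= threshold (threshold value is an assumption).
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        any(temporal_iou(p, g) >= threshold for p in proposals)
        for g in ground_truth
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    gt = [(0.0, 10.0), (20.0, 30.0)]       # hypothetical annotated actions
    props = [(1.0, 9.0), (40.0, 50.0)]     # hypothetical proposals
    print(temporal_iou(props[0], gt[0]))   # overlap with the first action
    print(recall_at_tiou(props, gt))       # recall at tIoU 0.5
```

In this toy case the first proposal covers the first action closely, while the second action has no overlapping proposal, so only one of the two ground-truth segments is recalled.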