Semi-Supervised Action Recognition with Temporal Contrastive Learning in CVPR 2021

A. Singh¹, O. Chakraborty², A. Varshney², R. Panda³, R. Feris³, K. Saenko³,⁴, A. Das²

¹IIT Madras, ²IIT Kharagpur, ³MIT-IBM Watson AI Lab, ⁴Boston University

The success of deep learning models depends critically on training with large datasets, which require tedious human annotation. Annotating videos is particularly laborious because each one must be watched to the end. This paper from Prof. Das's lab at IIT Kharagpur tackles the problem of learning useful video representations for semi-supervised activity recognition, starting from only a handful of labeled videos. It proposes a two-pathway Temporal Contrastive Learning (TCL) framework that exploits a rich source of supervision already present in an otherwise unlabeled pool of videos: time. The TCL model maximizes the similarity between representations of the same video played at two different speeds, while minimizing the similarity between different videos played at different speeds. With this simple yet effective strategy of manipulating video playback rates, the framework considerably outperforms video extensions of sophisticated state-of-the-art semi-supervised image recognition methods across multiple diverse benchmark datasets and network architectures. Interestingly, TCL also benefits from out-of-domain unlabeled videos, which points to the generalization and robustness of the approach. Code, a short video description of the work, and more details are available on the project webpage.
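To make the contrastive objective concrete, here is a minimal sketch in plain Python of one common way to express it: an InfoNCE-style loss in which the fast-speed and slow-speed embeddings of the same video form the positive pair, and the slow-speed embeddings of the other videos in the batch act as negatives. The function names, the use of cosine similarity, and the temperature value are assumptions made for illustration; this is not the authors' PyTorch implementation and may differ from the exact loss in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_contrastive_loss(fast_feats, slow_feats, temperature=0.5):
    """InfoNCE-style loss (illustrative assumption, not the paper's code).

    fast_feats[i] and slow_feats[i] are embeddings of the same video i
    at two playback speeds. The loss is low when each fast embedding is
    more similar to its own slow embedding than to those of other videos.
    """
    n = len(fast_feats)
    total = 0.0
    for i in range(n):
        # Similarity of video i (fast) to every video k (slow), scaled.
        logits = [
            cosine_similarity(fast_feats[i], slow_feats[k]) / temperature
            for k in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching video.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: two videos, two-dimensional features.
fast = [[1.0, 0.0], [0.0, 1.0]]
slow_matched = [[1.0, 0.0], [0.0, 1.0]]   # same video agrees across speeds
slow_swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairs deliberately mismatched

loss_matched = temporal_contrastive_loss(fast, slow_matched)
loss_swapped = temporal_contrastive_loss(fast, slow_swapped)
print(loss_matched < loss_swapped)
```

In the toy example, the batch in which each video agrees with itself across speeds yields the lower loss, which is the behavior the objective is meant to reward.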

The large-scale experiments on video data were performed on an NVIDIA DGX Station with four V100 GPUs, each with 32 GB of memory. The code is written in PyTorch. Average runtimes and memory usage of the models on one benchmark dataset (mini-something-something-v2) with two backbones are listed below for reference.

* A batch size of n means the batch contains n labeled videos. The batch also includes 3n unlabeled videos, making the effective batch size 4n.
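As a quick worked example of this note, the short sketch below computes the batch composition for a given number of labeled videos. The helper name `batch_composition` and the value n = 8 are hypothetical and chosen only for illustration; the 3:1 ratio of unlabeled to labeled videos is the one stated above.

```python
def batch_composition(n_labeled, unlabeled_ratio=3):
    """Return the labeled, unlabeled, and effective batch sizes.

    unlabeled_ratio=3 follows the note above (3n unlabeled videos for
    every n labeled videos); the function itself is a hypothetical helper.
    """
    n_unlabeled = unlabeled_ratio * n_labeled
    return {
        "labeled": n_labeled,
        "unlabeled": n_unlabeled,
        "effective": n_labeled + n_unlabeled,
    }


# Illustrative value only: n = 8 labeled videos per batch.
composition = batch_composition(8)
print(composition)
```

With n = 8, the batch holds 8 labeled and 24 unlabeled videos, for an effective batch size of 32.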
