Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

Rui Li, Dong Liu

Learning temporal correspondence from unlabeled videos is of vital importance in computer vision, and has been tackled by different kinds of self-supervised pretext tasks. For the self-supervised learning, recent studies suggest using large-scale video datasets despite the training cost. We propose a spatial-then-temporal pretext task to address the training data cost problem. The task consists of two steps. First, we use contrastive learning from unlabeled still image data to obtain appearance-sensitive features. Then we switch to unlabeled video data and learn motion-sensitive features by reconstructing frames. In the second step, we propose a global correlation distillation loss to retain the appearance sensitivity learned in the first step, as well as a local correlation distillation loss in a pyramid structure to combat temporal discontinuity. Experimental results demonstrate that our method surpasses the state-of-the-art self-supervised methods on a series of correspondence-based tasks. The conducted ablation studies verify the effectiveness of the proposed two-step task and loss functions.

Knowledge Graph



Sign up or login to leave a comment