The source code for the paper "Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion".
1. Introduction (scene-dominated to motion-dominated)
Video datasets are usually scene-dominated. We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to motion information.
The generated triplet is as below:
What does DSM learn?
With DSM pretraining, the model learns to focus strongly on motion regions (not necessarily the actor) without a single label.
Please refer to dataset.md for details.
- Intel (on-the-fly decoding)
- hmdb51: the train/val lists of HMDB51
- ucf101: the train/val lists of UCF101
- kinetics-400: the train/val lists of Kinetics-400
- diving48: the train/val lists of Diving48
- logs: detailed experiment records
- gradientes: gradient checks
- data: load data
- loss: the losses evaluated in this paper
- model: network architectures
- scripts: train/eval scripts
- augment: detailed implementation of the spatio-temporal augmentation
- feature_extract.py: feature extractor for a given pretrained model
- main.py: the main entry point for finetuning
- pt.py: self-supervised pretrain
- ft.py: supervised finetune
Supervised Finetune (Clip-level)
Following the common practice of TSN and Non-local, the final video-level result is averaged over 10 temporally sampled windows with corner crops, which leads to better results than clip-level evaluation.
Refer to test.py for details.
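The video-level averaging described above can be illustrated with a minimal NumPy sketch. It is not the code in test.py: the function name `video_level_prediction`, the choice to average softmax probabilities (rather than raw logits), and the number of views are all assumptions made for illustration.

```python
import numpy as np


def video_level_prediction(clip_logits: np.ndarray) -> int:
    """Average per-view class probabilities and return the predicted class.

    clip_logits has shape [num_views, num_classes], where each view is one
    (temporal window, crop) combination sampled from the same video.
    Hypothetical helper, not part of the repository.
    """
    # Numerically stable softmax over the class dimension of each view.
    shifted = clip_logits - clip_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    # Average the per-view probabilities to obtain the video-level score.
    video_probs = probs.mean(axis=0)
    return int(video_probs.argmax())


# Hypothetical example: 30 views of one video, 5 classes.
logits = np.zeros((30, 5))
logits[:, 2] = 2.0   # every view favours class 2 ...
logits[0, 4] = 5.0   # ... except the first view, which favours class 4
prediction = video_level_prediction(logits)
single_view_prediction = video_level_prediction(logits[:1])
```

In this toy case the first view alone votes for class 4, while the average over all views recovers class 2, which is the kind of robustness that video-level evaluation is meant to provide over a single clip.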
Pretrain and Eval in One Step
Note: more training options and ablation studies can be found in scripts.
Video Retrieval and Other Visualization
(1). Feature Extractor
As STCR can be easily extended to other video representation tasks, we provide scripts for feature extraction.
The features will be saved as a single NumPy file with shape [video_nums, features_dim] for further visualization.
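As a rough illustration of that file layout, the sketch below writes and reads back a feature matrix with `np.save` and `np.load`. The file name `features.npy`, the sizes, and the random values are placeholders; the actual output path and dtype used by feature_extract.py are not specified here.

```python
import os
import tempfile

import numpy as np

# Placeholder features: 4 videos with 8-dimensional features each.
features = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)

# Hypothetical file name; the real script may choose a different path.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, features)

# Row i holds the feature vector of video i.
loaded = np.load(path)
video_nums, features_dim = loaded.shape
```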
(2). Retrieval Evaluation
Modify lines 60-62 in reterival.py.
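For readers who want to see what a retrieval evaluation over the extracted features might look like, here is a minimal sketch of top-k nearest-neighbour retrieval accuracy. It assumes cosine similarity and a query/gallery split; the function name `topk_retrieval_accuracy` and its arguments are hypothetical, and the protocol in reterival.py may differ.

```python
import numpy as np


def topk_retrieval_accuracy(query_feats, query_labels,
                            gallery_feats, gallery_labels, ks=(1, 5)):
    """Fraction of queries whose k nearest gallery items contain the query's class.

    Hypothetical helper, not part of the repository.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = q @ g.T  # shape [num_queries, num_gallery]
    # Gallery indices sorted from most to least similar for each query.
    ranked = np.argsort(-sims, axis=1)
    ranked_labels = gallery_labels[ranked]
    results = {}
    for k in ks:
        hits = (ranked_labels[:, :k] == query_labels[:, None]).any(axis=1)
        results[k] = float(hits.mean())
    return results


# Toy example with 2-dimensional features.
gallery_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gallery_labels = np.array([0, 1, 1])
query_feats = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]])
query_labels = np.array([0, 1, 0])

results = topk_retrieval_accuracy(query_feats, query_labels,
                                  gallery_feats, gallery_labels, ks=(1, 2))
```

In the toy example the third query is closest to a gallery item of the wrong class, so it counts as a miss at k=1 but as a hit at k=2.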
UCF101 Pretrained (I3D)
Video Retrieval (UCF101-C3D)
Video Retrieval (HMDB51-C3D)