Action Recognition Zoo

Codes for popular action recognition models, written based on pytorch, verified on the something-something dataset. This code is built on top of the TRN-pytorch.

Note The main purpose of this repositoriy is to go through several methods and get familiar with their pipelines. Due to the challenge of Something-something dataset and the limitations of our GPUs, the results on [email protected] is unsatisfactory.

Note: always use git clone --recursive to clone this project Otherwise you will not be able to use the inception series CNN architecture.


  • Opencv-2.4.13 or some greater version that has tvl1 api for the computing of optical flow.
  • Pytorch-0.2.0_3
  • Tensorflow-1.3.1,this is only for the using of tensorboard, it's ok without this, but you need to comment the corresponding codes.

Data preparation


  • Download the something-something dataset. Decompress them into some folder.
  • Note that this dataset contains 108,499 videos and each video is presented in JPG images. The JPG images were extracted from the orginal videos at 12 frames per seconds.
  • The temporal evolution in videos is important for this dataset, so it's hard for some classic models which pay attention to short motion such as: Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014.

Prepare optical flow using Opencv

Note that optical flow is an important modal feature in two-stream series methods, which contains the motion information of videos.

Since there only rgb frames in the official dataset, we need compute optical flow by ourselves.

I apply a TV-L1 optical flow algorithm, pixel values are truncated to the range [-20, 20], then rescaled between 0 and 255, each optical flow has two channels representing horizontal and vertical components. Note that the fps in original dataset is 12, which is too fast for optical flow computing in practice, so i sample frame at 6fps.

  • The command to compute optical flow:
cd optical_flow
make bin                          #for cpu
make -f gpu_makefile gpu_bin      #for gpu

./bin          #for cpu
./gpu_bin      #for gpu

Before using the code you should modify the path in main.cpp or gpu_main.cpp.

Generate the meta files



Before using the code you should modify the path as your own. The test time for one video is measured on one K80.

Two stream action recognition

Main Reference Paper: Two-stream convolutional networks for action recognition in videos


  • Base CNN: BN-Inception pretrained on ImageNet.
  • Partical BN and cross-modality tricks have been used in the code.
  • Spatial stream: it's input is single rgb frame.
  • Temporal stream: it's input is stacked optical flows.


  • Spatial CNN: A single rgb frame is randomly selected for a video, which equals to image classification,input channel is 3.

  • Temporal CNN: 5 consequent stacked optical flows are selected for a video, input channel is 5*2(2 channels:x and y).

  • The command to train models:

train for spatial stream:
python TwoStream RGB two-stream-rgb --arch BNInception --batch_size 256 --lr 0.002
train for temporal stream:
python TwoStream Flow two-stream-flow --arch BNInception --batch_size 256 --lr 0.0005

Testing on validation set

At test time, given a video, i sample a fixed number of frames (25 for spatial stream and 8 for temporal stream in my experiments) with equal temporal spacing between them. From each of the frames i then obtain 10 ConvNet
inputs by cropping and flipping four corners and the center of the frame. The class scores for the
whole video are then obtained by averaging the scores across the sampled frames and crops therein.

  • The command to test models:
test for spatial stream:
python --model TwoStream --modality RGB --weights TwoStream_RGB_BNInception_best.pth.tar --train_id two-stream-rgb --save_scores rgb.npz --arch BNInception --test_segments 25

test for temporal stream;
python --model TwoStream --modality Flow --weights TwoStream_Flow_BNInception_best.pth.tar --train_id two-stream-flow --save_scores flow.npz --arch BNInception --test_segments 25

After running the test code, we get the precision scores on validation set and the probability for all class is saved in npz files which is useful in late fusion.

fusion: combine spatial stream and temporal stream results.

Results on validation set

Methods [email protected] test time for one video
Spatial-stream 2.54% 0.507s
Temporal-stream 2.66% 0.621s
Two-stream 7.4% 1.128s

Temporal Segment Networks

Main Reference Paper: Temporal Segment Networks: Towards Good Practices for Deep Action Recognition


  • Base CNN: BN-Inception pretrained on ImageNet.
  • Partical BN and cross-modality tricks have been used in the code.
  • Spatial stream: it's input is k rgb frames, k is the segment number.
  • Temporal stream: it's input is k stacked optical flows.
  • The consensus function i've implemented is average function.


train spatial stream:
python TSN RGB tsn-rgb --arch BNInception --batch_size 128 --lr 0.001 --num_segments 3

train temporal stream:
python TSN Flow tsn-flow --arch BNInception --batch_size 128 --lr 0.0007 --num_segments 3

Testing on validation set

Note that in testing phrase the k equals 1 according to the paper and it's offical code. So the segment mechanism is only used in training phrase.

test spatial stream:
python --model TSN --modality RGB --weights TSN_RGB_BNInception_best.pth.tar --train_id tsn-rgb --save_scores rgb.npz --arch BNInception --test_segments 25

test temporal stream:
python --model TSN --modality Flow --weights TSN_Flow_BNInception_best.pth.tar --train_id tsn-flow --save_scores flow.npz --arch BNInception --test_segments 25

python   # need modify the path to your own

Results on validation set

Methods [email protected] test time for one video
Spatial-stream 2.82% 0.51s
Temporal-stream 4.37% 0.62s
TSN-2stream 9.2% 1.13s

Pretrained-C3D :3D Convolutional Networks

Main Reference Paper: Learning Spatiotemporal Features with 3D Convolutional Networks


  • finetune the model pretrained on sports-1M, the pretrained model is upload to Baidu Cloud: link


python C3D RGB c3d-rgb --arch BNInception --batch_size 32 --lr 0.0001 --num_segments 1 --lr_steps 2 5 10 20 --factor 0.5


python --model C3D --modality RGB --weights C3D_RGB_BNInception_best.pth.tar --train_id c3d-rgb --save_scores rgb.npz --test_segments 5 --test_crops 1

Results on validation set

  • It seems like the C3D is faster than previous methods, but the input size for C3D is 112*112 vs 224*224 for Two-Stream models.
  • The result is not good. I've found that it's hard to traing 3D CNN on this difficult dataset. This is mainly due to the poor GPU which slows the training phrase, so it's hard to choose proper hyperparameters with my machine, but this code works and it'll give you a quick start.
Methods [email protected] test time for one video
mine pretrained-C3D 2.28% 0.37s
official pretrained-3D 12% -


Main Reference Paper: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset


  • The code for I3D model is based on hassony2
  • Training is too slow to report the results on Something-Something, but this code is useful
  • Kinetics pretrained model is uploaded to Baidu Cloud: link


python I3D RGB i3d-rgb --arch I3D --batch_size 32 --lr 0.002 --num_segments 1 --lr_steps 2 10 20 --factor 0.5

Our method:DIN

We will soon public the paper and code.

Results on validation set

Methods [email protected] test time for one video
DIN 34.26% 0.27s