TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

By Dongxu Li*, Chenchen Xu*, Xin Yu, Kaihao Zhang, Benjamin Swift, Hanna Suominen and Hongdong Li

This repository contains the implementation of TSPNet. The preprocessed dataset, video features and inference results are available on Google Drive.

We thank the authors of fairseq for their efforts.

Requirements
  • PyTorch version >= 1.4.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and (optionally) NCCL
  • (optional) BPEmb, if you prepare the datasets yourself (see below)

Install from source

Install the project from source and develop locally:

cd fairseq
pip install --editable .

Getting started

Preparation
Download the preprocessed dataset and arrange the files as:

├── i3d-features/
│   ├── span=8_stride=2
│   ├── span=12_stride=2
│   └── span=16_stride=2
├── data-bin/
│   └── phoenix2014T/
│       └── sp25000/
├── README.md
├── run-scripts/
└── test-scripts/
  • i3d-features: the i3d output features of input videos
  • data-bin: the preprocessed translation texts
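
Before training, it can help to confirm that the download was unpacked into the layout above. The following is a minimal sketch and not part of the repository: the helper name `missing_paths` and the constant `EXPECTED_DIRS` are assumptions made for illustration, and the paths are copied from the tree shown above.

```python
from pathlib import Path

# Directories copied from the layout above, relative to the project root.
EXPECTED_DIRS = [
    "i3d-features/span=8_stride=2",
    "i3d-features/span=12_stride=2",
    "i3d-features/span=16_stride=2",
    "data-bin/phoenix2014T/sp25000",
]


def missing_paths(root):
    """Return the expected directories that do not exist under `root`.

    Hypothetical helper for illustration; it is not part of TSPNet.
    """
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the project root to check the current directory.
    missing = missing_paths(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("Layout looks complete.")
```

An empty list means every directory from the tree was found; anything listed still needs to be downloaded or moved into place.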

Training
Go to the run_scripts folder and start training:

cd TSPNet/run_scripts
SAVE_DIR=CHECKPOINT_PATH bash run_phoenix_pos_embed_sp_test_3lvl.sh

Inference
After training, you can run inference on the test set by specifying a checkpoint file.

Note that CHECKPOINT_FILE_PATH points to a saved checkpoint file, rather than the checkpoint folder.

CHECKPOINT=CHECKPOINT_FILE_PATH bash test_phoenix_pos_embed_sp_test_3lvl.sh
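
Because the script expects a file rather than a folder, a small check before launching inference can catch the mistake early. The sketch below is an illustration only: `resolve_checkpoint` is a hypothetical helper, and the default file name `checkpoint_best.pt` is an assumption based on fairseq's usual checkpoint naming. Adjust it if your training run saved files under other names.

```python
from pathlib import Path


def resolve_checkpoint(path, default_name="checkpoint_best.pt"):
    """Return a checkpoint *file* path for the given file or folder.

    Hypothetical helper, not part of TSPNet. If `path` is a folder, look for
    `default_name` inside it (an assumed fairseq-style file name).
    """
    p = Path(path)
    if p.is_file():
        return p
    if p.is_dir():
        candidate = p / default_name
        if candidate.is_file():
            return candidate
        raise FileNotFoundError(f"{p} is a folder without {default_name}")
    raise FileNotFoundError(f"{p} does not exist")
```

The returned path is the value that CHECKPOINT should be set to when calling the test script.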

The script reports multiple metrics, including ROUGE-L and BLEU-{n}, as reported in the paper.
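
For readers unfamiliar with ROUGE-L, the sketch below shows the usual sentence-level definition based on the longest common subsequence (LCS) of tokens. It is not the evaluation code used by the test script. The function names, the whitespace tokenization and the `beta` weighting parameter (set to 1.0 here, so the score is the plain harmonic mean of precision and recall) are assumptions for illustration, and published scores may rely on a different `beta` and tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the processed prefix of `a` and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(hypothesis, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score on whitespace tokens (illustrative)."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

For instance, comparing "morgen regnet es im norden" with the reference "morgen regnet es im süden" gives an LCS of four tokens out of five on each side, so precision and recall are both 0.8.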

Alternative instructions for preparing datasets by yourself

  1. Text

Install BPEmb for the German subword embeddings: pip install bpemb.

Use preprocess_sign.py to convert the translation texts to BPE. Repeat this for each split, for example:

python preprocess_sign.py --save-vecs data/processed/emb data/ori/phoenix2014T.train.de data/processed/train.de

python preprocess_sign.py data/ori/phoenix2014T.test.de data/processed/test.de
  2. Vocabulary

Generate the dictionary file dict.de.txt.

fairseq-preprocess --source-lang de --target-lang de --trainpref data/processed/train --testpref data/processed/test --destdir data-bin/ --dataset-impl raw

  3. Video

Prepare the sign videos and the corresponding video features (e.g. from pretrained I3D networks), and create a json file for each split (e.g. train.sign-de.sign). The json file should have the format below. It must contain the same number of entries as the text file, with each entry corresponding to the sentence on the same line of the prepared text file.

[
    {
        "ident": "VIDEO_ID",
        "size": 64  // length of video features
    },
    ...
]

  4. Finally, arrange the text files, video json files, word embeddings and vocabulary file into a folder as below:
├── train.sign-de.sign
├── train.sign-de.de
├── test.sign-de.sign
├── test.sign-de.de
├── emb
└── dict.de.txt
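
The video json file described above can be written and checked with a few lines of Python. The following is a minimal sketch under stated assumptions: the helper names `write_sign_json` and `check_split` are illustrative and not part of the repository, the file is assumed to be a JSON list of objects with "ident" and "size" keys, "size" is assumed to be an integer, and you are assumed to already hold the (video id, feature length) pairs in the same order as the lines of the text file.

```python
import json
from pathlib import Path


def write_sign_json(entries, json_path):
    """Write (video_id, feature_length) pairs as a list of JSON objects.

    `entries` must be in the same order as the lines of the text file.
    Hypothetical helper for illustration; it is not part of TSPNet.
    """
    records = [{"ident": ident, "size": int(size)} for ident, size in entries]
    Path(json_path).write_text(json.dumps(records, indent=4), encoding="utf-8")


def check_split(folder, split):
    """Return True if <split>.sign-de.sign has one entry per text line."""
    folder = Path(folder)
    sign_file = folder / f"{split}.sign-de.sign"
    text_file = folder / f"{split}.sign-de.de"
    records = json.loads(sign_file.read_text(encoding="utf-8"))
    lines = text_file.read_text(encoding="utf-8").splitlines()
    return len(records) == len(lines)
```

Running `check_split` for every split in the final folder is a cheap way to catch a video list and a text file that have drifted out of alignment.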

Citation
Please cite our paper and the WLASL dataset (for pre-training) as:

	title        = {TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation},
	author       = {Li, Dongxu and Xu, Chenchen and Yu, Xin and Zhang, Kaihao and Swift, Benjamin and Suominen, Hanna and Li, Hongdong},
	year         = 2020,
	booktitle    = {Advances in Neural Information Processing Systems},
	volume       = 33

    title={Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison},
    author={Li, Dongxu and Rodriguez, Cristian and Yu, Xin and Li, Hongdong},
    booktitle={The IEEE Winter Conference on Applications of Computer Vision},

Other works you might be interested in:

  title={Transferring cross-domain knowledge for video sign language recognition},
  author={Li, Dongxu and Yu, Xin and Xu, Chenchen and Petersson, Lars and Li, Hongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},