Upper body image synthesis from skeleton(Keypoints). Sub module in the ICCV-2021 paper “Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates”. [arxiv / github]

This is a modified implementation of Synthesizing Images of Humans in Unseen Poses.


To install dependencies, run

pip install -r requirements.txt

To run this module, you need two NVIDIA gpus with at least 11 GB respectively. Our code is tested on Ubuntu 18.04LTS with Python3.6.

Train on the custom dataset

For your own dataset, you need to modify custom config.yaml.

python main.py \
    --name test \
    --config_path {config_path}.yaml \
    --batch_size 1 \
  • The raw keypoints for each frame is of shape (3, 137)


Generate a realistic video for Oliver from {keypoints}.npz. You need to do some minor modifications to $configs/yaml/oliver.yaml$.

python inference.py \
    --cfg_path configs/yaml/oliver.yaml \
    --name test \
    --npz_path target_pose/oliver_pose.npz \
    --wav_path target_pose/oliver_audio.wav
  • In the result directory, you can find jpg files which correspond to the npz.

Demo Dataset and Checkpoint

  • Coming soon


If you find this code useful for your research, please use the following BibTeX entry.

  title={Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates},
  author={Qian, Shenhan and Tu, Zhi and Zhi, YiHao and Liu, Wen and Gao, Shenghua},
  journal={International Conference on Computer Vision (ICCV)},


View Github