UDP-Pose

This is the PyTorch implementation of UDP++, which won first place in the COCO Keypoint Challenge at the ECCV 2020 Workshop.
*Figure: Illustrating the performance of the proposed UDP.*

Top-Down

Results on MPII val dataset

| Method | Head | Sho. | Elb. | Wri. | Hip | Kne. | Ank. | Mean | Mean@0.1 |
|--------|------|------|------|------|-----|------|------|------|----------|
| HRNet32 | 97.1 | 95.9 | 90.3 | 86.5 | 89.1 | 87.1 | 83.3 | 90.3 | 37.7 |
| +Dark | 97.2 | 95.9 | 91.2 | 86.7 | 89.7 | 86.7 | 84.0 | 90.6 | 42.0 |
| +UDP | 97.4 | 96.0 | 91.0 | 86.5 | 89.1 | 86.6 | 83.3 | 90.4 | 42.1 |

Results on COCO val2017 with a detector having human AP of 65.1 on COCO val2017

| Arch | Input size | #Params | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR |
|------|------------|---------|--------|----|-------|--------|--------|--------|----|
| pose_resnet_50 | 256×192 | 34.0M | 8.90 | 71.3 | 89.9 | 78.9 | 68.3 | 77.4 | 76.9 |
| +UDP | 256×192 | 34.2M | 8.96 | 72.9 | 90.0 | 80.2 | 69.7 | 79.3 | 78.2 |
| pose_resnet_50 | 384×288 | 34.0M | 20.0 | 73.2 | 90.7 | 79.9 | 69.4 | 80.1 | 78.2 |
| +UDP | 384×288 | 34.2M | 20.1 | 74.0 | 90.3 | 80.0 | 70.2 | 81.0 | 79.0 |
| pose_resnet_152 | 256×192 | 68.6M | 15.7 | 72.9 | 90.6 | 80.8 | 69.9 | 79.0 | 78.3 |
| +UDP | 256×192 | 68.8M | 15.8 | 74.3 | 90.9 | 81.6 | 71.2 | 80.6 | 79.6 |
| pose_resnet_152 | 384×288 | 68.6M | 35.6 | 75.3 | 91.0 | 82.3 | 71.9 | 82.0 | 80.4 |
| +UDP | 384×288 | 68.8M | 35.7 | 76.2 | 90.8 | 83.0 | 72.8 | 82.9 | 81.2 |
| pose_hrnet_w32 | 256×192 | 28.5M | 7.10 | 75.6 | 91.9 | 83.0 | 72.2 | 81.6 | 80.5 |
| +UDP | 256×192 | 28.7M | 7.16 | 76.8 | 91.9 | 83.7 | 73.1 | 83.3 | 81.6 |
| +UDPv1 | 256×192 | 28.7M | 7.16 | 77.2 | 91.6 | 84.2 | 73.7 | 83.7 | 82.5 |
| +UDPv1+AID | 256×192 | 28.7M | 7.16 | 77.9 | 92.1 | 84.5 | 74.1 | 84.1 | 82.8 |
| RSN18+UDP | 256×192 | – | 2.5 | 74.7 | – | – | – | – | – |
| pose_hrnet_w32 | 384×288 | 28.5M | 16.0 | 76.7 | 91.9 | 83.6 | 73.2 | 83.2 | 81.6 |
| +UDP | 384×288 | 28.7M | 16.1 | 77.8 | 91.7 | 84.5 | 74.2 | 84.3 | 82.4 |
| pose_hrnet_w48 | 256×192 | 63.6M | 14.6 | 75.9 | 91.9 | 83.5 | 72.6 | 82.1 | 80.9 |
| +UDP | 256×192 | 63.8M | 14.7 | 77.2 | 91.8 | 83.7 | 73.8 | 83.7 | 82.0 |
| pose_hrnet_w48 | 384×288 | 63.6M | 32.9 | 77.1 | 91.8 | 83.8 | 73.5 | 83.5 | 81.8 |
| +UDP | 384×288 | 63.8M | 33.0 | 77.8 | 92.0 | 84.3 | 74.2 | 84.5 | 82.5 |

Note:

- Flip test is used.
- Person detector has person AP of 65.1 on the COCO val2017 dataset.
- GFLOPs is for convolution and linear layers only.
- UDPv1 differs from UDP (v0) only in the keypoint-distance hyperparameter: v0 uses LOSS.KPD=4.0, v1 uses LOSS.KPD=3.5 (see the sketch after this list).
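For concreteness, here is a minimal sketch of what a KPD-style radius controls in UDP's offset-based target encoding: pixels within KPD of the ground-truth keypoint are treated as positives and regress offsets to the exact sub-pixel location. Function and variable names are illustrative, not the repo's actual API.

```python
# Illustrative sketch only: how a KPD-style radius shapes an
# offset-based keypoint target (positive mask + sub-pixel offsets).
import numpy as np

def make_offset_target(kpt_xy, heatmap_size, kpd=4.0):
    """kpt_xy: (x, y) in heatmap coords; returns (mask, dx, dy) maps."""
    w, h = heatmap_size
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    dx = kpt_xy[0] - xs              # x-offset from each pixel to the keypoint
    dy = kpt_xy[1] - ys              # y-offset
    mask = (np.abs(dx) <= kpd) & (np.abs(dy) <= kpd)  # positive region
    return mask.astype(np.float32), dx * mask, dy * mask

# UDP (v0) trains with KPD=4.0; UDPv1 only tightens it to 3.5.
mask, dx, dy = make_offset_target((23.7, 31.2), (48, 64), kpd=3.5)
```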

Results on COCO test-dev with a detector having human AP of 65.1 on COCO val2017

| Arch | Input size | #Params | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR |
|------|------------|---------|--------|----|-------|--------|--------|--------|----|
| pose_resnet_50 | 256×192 | 34.0M | 8.90 | 70.2 | 90.9 | 78.3 | 67.1 | 75.9 | 75.8 |
| +UDP | 256×192 | 34.2M | 8.96 | 71.7 | 91.1 | 79.6 | 68.6 | 77.5 | 77.2 |
| pose_resnet_50 | 384×288 | 34.0M | 20.0 | 71.3 | 91.0 | 78.5 | 67.3 | 77.9 | 76.6 |
| +UDP | 384×288 | 34.2M | 20.1 | 72.5 | 91.1 | 79.7 | 68.8 | 79.1 | 77.9 |
| pose_resnet_152 | 256×192 | 68.6M | 15.7 | 71.9 | 91.4 | 80.1 | 68.9 | 77.4 | 77.5 |
| +UDP | 256×192 | 68.8M | 15.8 | 72.9 | 91.6 | 80.9 | 70.0 | 78.5 | 78.4 |
| pose_resnet_152 | 384×288 | 68.6M | 35.6 | 73.8 | 91.7 | 81.2 | 70.3 | 80.0 | 79.1 |
| +UDP | 384×288 | 68.8M | 35.7 | 74.7 | 91.8 | 82.1 | 71.5 | 80.8 | 80.0 |
| pose_hrnet_w32 | 256×192 | 28.5M | 7.10 | 73.5 | 92.2 | 82.0 | 70.4 | 79.0 | 79.0 |
| +UDP | 256×192 | 28.7M | 7.16 | 75.2 | 92.4 | 82.9 | 72.0 | 80.8 | 80.4 |
| pose_hrnet_w32 | 384×288 | 28.5M | 16.0 | 74.9 | 92.5 | 82.8 | 71.3 | 80.9 | 80.1 |
| +UDP | 384×288 | 28.7M | 16.1 | 76.1 | 92.5 | 83.5 | 72.8 | 82.0 | 81.3 |
| pose_hrnet_w48 | 256×192 | 63.6M | 14.6 | 74.3 | 92.4 | 82.6 | 71.2 | 79.6 | 79.7 |
| +UDP | 256×192 | 63.8M | 14.7 | 75.7 | 92.4 | 83.3 | 72.5 | 81.4 | 80.9 |
| pose_hrnet_w48 | 384×288 | 63.6M | 32.9 | 75.5 | 92.5 | 83.3 | 71.9 | 81.5 | 80.5 |
| +UDP | 384×288 | 63.8M | 33.0 | 76.5 | 92.7 | 84.0 | 73.0 | 82.4 | 81.6 |

Note:

- Flip test is used (a sketch of the procedure follows this list).
- Person detector has person AP of 65.1 on the COCO val2017 dataset.
- GFLOPs is for convolution and linear layers only.
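The flip test referenced above averages predictions from the image and its horizontal mirror. Below is a minimal sketch; names are illustrative, not the repo's actual API. (Classic implementations also shift the flipped heatmap by one pixel to compensate for a flip misalignment; UDP's unbiased transforms are designed to make that shift unnecessary.)

```python
# Illustrative flip-test sketch: run the model on the image and its
# horizontal mirror, un-mirror the second set of heatmaps, swap
# left/right joint channels, and average.
import torch

def flip_test(model, img, flip_pairs):
    """img: (1, 3, H, W) tensor; flip_pairs: [(left_idx, right_idx), ...]."""
    heat = model(img)
    heat_flipped = model(torch.flip(img, dims=[3]))    # mirror the input
    heat_flipped = torch.flip(heat_flipped, dims=[3])  # mirror heatmaps back
    for left, right in flip_pairs:  # e.g. left/right eye channels in COCO
        heat_flipped[:, [left, right]] = heat_flipped[:, [right, left]]
    return (heat + heat_flipped) / 2.0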

Bottom-Up

HRNet

| Arch | P2I | Input size | Speed (task/s) | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR |
|------|-----|------------|----------------|----|-------|--------|--------|--------|----|
| HRNet(ori) | T | 512×512 | – | 64.4 | – | – | 57.1 | 75.6 | – |
| HRNet(mmpose) | F | 512×512 | 39.5 | 65.8 | 86.3 | 71.8 | 59.2 | 76.0 | 70.7 |
| HRNet(mmpose) | T | 512×512 | 6.8 | 65.3 | 86.2 | 71.5 | 58.6 | 75.7 | 70.9 |
| HRNet+UDP | T | 512×512 | 5.8 | 65.9 | 86.2 | 71.8 | 59.4 | 76.0 | 71.4 |
| HRNet+UDP | F | 512×512 | 37.2 | 67.0 | 86.2 | 72.0 | 60.7 | 76.7 | 71.6 |
| HRNet+UDP+AID | F | 512×512 | 37.2 | 68.4 | 88.1 | 74.9 | 62.7 | 77.1 | 73.0 |

HigherHRNet

| Arch | P2I | Input size | Speed (task/s) | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR |
|------|-----|------------|----------------|----|-------|--------|--------|--------|----|
| HigherHRNet(ori) | T | 512×512 | – | 67.1 | – | – | 61.5 | 76.1 | – |
| HigherHRNet | T | 512×512 | 9.4 | 67.2 | 86.1 | 72.9 | 61.8 | 76.1 | 72.2 |
| HigherHRNet+UDP | T | 512×512 | 9.0 | 67.6 | 86.1 | 73.7 | 62.2 | 76.2 | 72.4 |
| HigherHRNet | F | 512×512 | 24.1 | 67.1 | 86.1 | 73.6 | 61.7 | 75.9 | 72.0 |
| HigherHRNet+UDP | F | 512×512 | 23.0 | 67.6 | 86.2 | 73.8 | 62.2 | 76.2 | 72.4 |
| HigherHRNet+UDP+AID | F | 512×512 | 23.0 | 69.0 | 88.0 | 74.9 | 64.0 | 76.9 | 73.8 |

Note:

- ori: results reported by the original HigherHRNet work.
- mmpose: pretrained models from mmpose.
- P2I: the PROJECT2IMAGE option (a sketch of what it toggles follows this list).
- We use mmpose as the codebase.
- The baseline configuration is HRNet-W32, 512×512 input, batch size 16, learning rate 0.001.
- Speed is tested with dist_test in the mmpose codebase, using 8 GPUs and batch size 16.
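As referenced in the P2I bullet above, here is a minimal sketch of what PROJECT2IMAGE toggles in mmpose-style bottom-up decoding, under our reading of the tables: with P2I=T the multi-resolution heatmaps are resized to the full input-image size before aggregation and grouping (slower, as the Speed column shows); with P2I=F they are aggregated at the base heatmap resolution. Names and the align_corners choice here are illustrative assumptions.

```python
# Illustrative sketch of the PROJECT2IMAGE (P2I) switch for aggregating
# multi-resolution bottom-up heatmaps before keypoint grouping.
import torch
import torch.nn.functional as F

def aggregate_heatmaps(heatmaps, img_size, project2image):
    """heatmaps: list of (1, K, h_i, w_i) tensors; img_size: (H, W)."""
    # P2I=T: upsample to the input-image size; P2I=F: stay at the base
    # heatmap resolution (cheaper, hence the higher task/s above).
    size = tuple(img_size) if project2image else tuple(heatmaps[0].shape[-2:])
    resized = [F.interpolate(h, size=size, mode='bilinear',
                             align_corners=False) for h in heatmaps]
    return torch.stack(resized, dim=0).mean(dim=0)  # average across scales
```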

Quick Start

(Recommended) For mmpose, please refer to MMPose.

For HRNet, please refer to HRNet.

For RSN, please refer to RSN

Data preparation

For COCO, we provide the human detection results and pretrained models at BaiduDisk (extraction code: dsa9).
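The detection file follows the standard COCO detection-result JSON layout; here is a small sketch of consuming it (the file name is hypothetical, use whatever the download provides):

```python
# Illustrative sketch: load the provided person detection results
# (standard COCO result format) and keep the person boxes.
import json

with open('COCO_val2017_detections_person.json') as f:  # hypothetical name
    dets = json.load(f)  # list of {image_id, category_id, bbox, score}

persons = [d for d in dets if d['category_id'] == 1]  # category 1 = person
print(len(persons), persons[0]['bbox'])  # bbox is [x, y, width, height]
```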

Citation

If you use our code or models in your research, please cite:

@inproceedings{cai2020learning,
  title={Learning Delicate Local Representations for Multi-Person Pose Estimation},
  author={Yuanhao Cai and Zhicheng Wang and Zhengxiong Luo and Binyi Yin and Angang Du and Haoqian Wang and Xinyu Zhou and Erjin Zhou and Xiangyu Zhang and Jian Sun},
  booktitle={ECCV},
  year={2020}
}
@article{huang2020joint,
  title={Joint COCO and LVIS workshop at ECCV 2020: COCO keypoint challenge track technical report: UDP++},
  author={Huang, Junjie and Shan, Zengguang and Cai, Yuanhao and Guo, Feng and Ye, Yun and Chen, Xinze and Zhu, Zheng and Huang, Guan and Lu, Jiwen and Du, Dalong},
  year={2020}
}
