Distribuuuu

The pure and clear PyTorch Distributed Training Framework.

Distribuuuu is a Distributed Classification Training Framework powered by native PyTorch.

Please check the tutorial for detailed distributed training tutorials.

For the complete training framework, please see distribuuuu.

Requirements and Usage

Dependency

  • Install PyTorch >= 1.5 (tested on 1.5, 1.7.1, and 1.8)
  • Install other dependencies: pip install -r requirements.txt
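After installing the dependencies, a quick sanity check (not part of the framework, just a hedged suggestion) is to confirm the PyTorch version and that CUDA devices and the NCCL backend are visible:

import torch
import torch.distributed as dist

print(torch.__version__)              # expect >= 1.5
print(torch.cuda.is_available())      # should be True on a GPU machine
print(torch.cuda.device_count())      # number of visible GPUs
print(dist.is_nccl_available())       # NCCL is the backend used for multi-GPU training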

Dataset

Download the ImageNet dataset and move validation images to labeled subfolders, using the script valprep.sh.

Expected datasets structure for ILSVRC

ILSVRC
|_ train
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ val
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ ...

Create a directory containing symlinks:

mkdir -p /path/to/distribuuuu/data

Symlink ILSVRC:

ln -s /path/to/ILSVRC /path/to/distribuuuu/data/ILSVRC
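The symlinked layout is the standard class-per-subfolder format, so torchvision's ImageFolder can read it directly. A minimal sketch, assuming the data root created above (the transforms are illustrative; distribuuuu's actual data pipeline may differ):

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Each subfolder (n01440764, ..., n15075141) becomes one class index.
train_set = datasets.ImageFolder(
    root="/path/to/distribuuuu/data/ILSVRC/train",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
)
val_set = datasets.ImageFolder(
    root="/path/to/distribuuuu/data/ILSVRC/val",
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
)
print(len(train_set.classes))  # 1000 ImageNet classes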

Basic Usage

Single Node with one task

# 1 node, 8 GPUs
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml
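The real entry point is train_net.py in this repository. As a rough sketch of what a script started by torch.distributed.launch (pre-1.9, without --use_env) typically does — the names below are illustrative, not the repo's actual code:

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.launch passes --local_rank to each spawned process
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--cfg", type=str, default="config/resnet18.yaml")
    args, _ = parser.parse_known_args()

    # The launcher also exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
    # so env:// initialization needs no extra arguments.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)

    model = torch.nn.Linear(128, 10).cuda()          # stand-in for the real network
    model = DDP(model, device_ids=[args.local_rank])
    # ... build DistributedSampler-backed loaders and run the training loop ...

if __name__ == "__main__":
    main()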

Distribuuuu uses yacs, an elegant and lightweight package, to define and manage system configurations.
You can set up the config via a YAML file and override individual options with trailing opts. If no YAML file is provided, the default configuration is used; please check distribuuuu/config.py.

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml \
    OUT_DIR /tmp \
    MODEL.SYNCBN True \
    TRAIN.BATCH_SIZE 256

# --cfg config/resnet18.yaml parse config from file
# OUT_DIR /tmp            overwrite OUT_DIR
# MODEL.SYNCBN True       overwrite MODEL.SYNCBN
# TRAIN.BATCH_SIZE 256    overwrite TRAIN.BATCH_SIZE
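For reference, the merge behaviour behind --cfg and the trailing opts looks roughly like the sketch below. The node names (OUT_DIR, MODEL.SYNCBN, TRAIN.BATCH_SIZE) come from the command above; the default values here are placeholders, so check distribuuuu/config.py for the real defaults:

from yacs.config import CfgNode as CN

_C = CN()
_C.OUT_DIR = "./output"            # placeholder default
_C.MODEL = CN()
_C.MODEL.SYNCBN = False
_C.TRAIN = CN()
_C.TRAIN.BATCH_SIZE = 32

cfg = _C.clone()
cfg.merge_from_file("config/resnet18.yaml")   # values from the YAML file (--cfg)
cfg.merge_from_list([                         # trailing opts override the YAML
    "OUT_DIR", "/tmp",
    "MODEL.SYNCBN", "True",
    "TRAIN.BATCH_SIZE", "256",
])
cfg.freeze()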

Single Node with two tasks

# 1 node, 2 tasks, 4 GPUs per task (8 GPUs)
# task 1:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# task 2:
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml
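The two tasks form a single 8-process job because the launcher combines --node_rank with each process's local rank to get a unique global rank. The arithmetic, in plain Python, is roughly:

nproc_per_node = 4
nnodes = 2

world_size = nnodes * nproc_per_node                 # 8 processes in total
for node_rank in range(nnodes):
    for local_rank in range(nproc_per_node):
        rank = node_rank * nproc_per_node + local_rank
        print(f"task {node_rank}, local_rank {local_rank} -> global rank {rank} / {world_size}")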

Multiple Nodes Training

# 2 nodes, 8 GPUs per node (16 GPUs)
# node 1:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# node 2:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Slurm Cluster Usage

# see srun --help 
# and https://slurm.schedmd.com/ for details

# example: 64 GPUs
# batch size = 64 * 128 = 8192
# iterations per epoch = 1.28M / 8192 ≈ 156
# lr = 64 * 0.1 = 6.4

srun --partition=openai-a100 \
     -n 64 \
     --gres=gpu:8 \
     --ntasks-per-node=8 \
     --job-name=Distribuuuu \
     python -u train_net.py --cfg config/resnet18.yaml \
     TRAIN.BATCH_SIZE 128 \
     OUT_DIR ./resnet18_8192bs \
     OPTIM.BASE_LR 6.4
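Under srun there is no torch.distributed.launch, so each task typically bootstraps itself from the standard Slurm environment variables. A minimal sketch of that pattern, assuming one task per GPU as in the command above (distribuuuu's actual setup code may differ):

import os
import subprocess

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])            # global rank, 0..63 in the example above
world_size = int(os.environ["SLURM_NTASKS"])      # 64 tasks in total
local_rank = rank % torch.cuda.device_count()     # GPU index on this node

# Use the first node of the allocation as the rendezvous address.
node_list = os.environ["SLURM_NODELIST"]
master_addr = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
os.environ["MASTER_ADDR"] = master_addr
os.environ.setdefault("MASTER_PORT", "29500")
os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://")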

Baselines

Baseline models trained by Distribuuuu:

model       epoch  total batch          lr policy  base lr  Acc@1   Acc@5   model / config
resnet18    100    256   (32*8GPUs)     cos        0.2      70.902  89.894  Drive / cfg
resnet18    100    1024  (128*8GPUs)    cos        0.8      70.994  89.892
resnet18    100    8192  (128*64GPUs)   cos        6.4      70.165  89.374
resnet18    100    16384 (256*64GPUs)   cos        12.8     68.766  88.381
resnet50    100    256   (32*8GPUs)     cos        0.2      77.252  93.430  Drive / cfg
botnet50    100    256   (32*8GPUs)     cos        0.2      77.604  93.682  Drive / cfg
resnext101  100    256   (32*8GPUs)     cos        0.2      78.938  94.482

Zombie processes problem

Before PyTorch 1.8, torch.distributed.launch could leave zombie processes behind after a Ctrl + C. Try the following command to kill them (fairseq/issues/487):

kill $(ps aux | grep YOUR_SCRIPT.py | grep -v grep | awk '{print $2}')

PyTorch 1.8 is recommended, as it fixed the zombie-process issue (pytorch/pull/49305).

Acknowledgments

The provided code was adapted from existing open-source codebases.

I strongly recommend pycls, a brilliant image classification codebase adopted by a number of projects at Facebook AI Research.

Citation

@misc{bigballon2021distribuuuu,
  author = {Wei Li},
  title = {Distribuuuu: The pure and clear PyTorch Distributed Training Framework},
  howpublished = {\url{https://github.com/BIGBALLON/distribuuuu}},
  year = {2021}
}

GitHub

https://github.com/BIGBALLON/distribuuuu