Bagua is a distributed training utility developed by AI platform@Kuaishou Technology and DS3 Lab@ETH. Users can extend single-GPU training to multiple GPUs (possibly across multiple machines) by adding just a few lines of code. A prominent feature of Bagua is a flexible system abstraction that supports state-of-the-art system relaxation techniques for distributed training. This design makes it straightforward to implement and extend a variety of state-of-the-art distributed learning algorithms, which in turn improves the scalability and efficiency of the end-to-end training process. Researchers can also develop new distributed training algorithms within the Bagua framework without worrying about low-level optimizations.

So far, Bagua has integrated the following communication primitives:

  • Centralized Synchronous Communication (AllReduce)
  • Decentralized Synchronous Communication
  • Low Precision Communication
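To make the three primitives concrete, here is a minimal pure-Python sketch that simulates them on in-memory "workers". It is an illustration only and does not use or reproduce Bagua's API or its Rust engine: the function names, the ring topology for the decentralized case, and the 8-bit min-max quantizer are all assumptions chosen for clarity.

```python
def centralized_allreduce(grads):
    """Centralized synchronous: every worker ends up with the global average."""
    n, dim = len(grads), len(grads[0])
    avg = [sum(g[i] for g in grads) / n for i in range(dim)]
    return [list(avg) for _ in range(n)]


def decentralized_ring_average(params):
    """Decentralized synchronous (assumed ring topology): each worker
    averages its values with its left and right neighbours only."""
    n = len(params)
    out = []
    for rank in range(n):
        left = params[(rank - 1) % n]
        right = params[(rank + 1) % n]
        out.append([(a + b + c) / 3 for a, b, c in zip(left, params[rank], right)])
    return out


def quantize_8bit(vec):
    """Low precision (assumed scheme): uniform min-max quantization to 256 levels."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant vectors
    codes = [round((v - lo) / scale) for v in vec]
    return codes, lo, scale


def dequantize_8bit(codes, lo, scale):
    """Map 8-bit codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


def low_precision_allreduce(grads):
    """Each worker sends a quantized copy; the average is taken over decoded values."""
    decoded = [dequantize_8bit(*quantize_8bit(g)) for g in grads]
    return centralized_allreduce(decoded)


if __name__ == "__main__":
    workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(centralized_allreduce(workers)[0])
    print(decentralized_ring_average(workers))
    print(low_precision_allreduce(workers)[0])
```

In this sketch the decentralized step leaves workers with different values after one round, but the mean across workers is preserved, so repeated rounds move every worker toward the same average while each round only talks to two peers. The low-precision variant trades a bounded rounding error for sending one byte per value instead of a full float.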

Its effectiveness has been evaluated in various scenarios, including VGG and ResNet on ImageNet, BERT-Large, and many industrial applications at Kuaishou.

The underlying communication execution engine is in bagua-core, a library written in Rust.


The scalability of different systems on VGG16 with up to 128 GPUs.

Epoch time of BERT-Large fine-tuning under different network conditions for different systems.

For more comprehensive and up-to-date results, refer to the Bagua benchmark page.


Development version:

pip install git+

Release version:

pip install bagua

Build API documentation locally

pip install -r docs/doc-requirements.txt
make html

Cite Bagua

% System Overview
  title={BAGUA: Scaling up Distributed Learning with System Relaxations}, 
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hongmei Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},

% Theory on System Relaxation Techniques
  title={Distributed Learning Systems with First-Order Methods: An Introduction},
  author={Liu, J. and Zhang, C.},
  series={Foundations and trends in databases},
  publisher={now publishers}


  • When communication is not a bottleneck in the training task, using Bagua's communication algorithms will not provide a significant performance improvement (unless you use other optimizations in Bagua, such as the fused optimizer).
  • Bagua is currently tested only on Linux with NVIDIA GPUs.
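The first limitation can be illustrated with a back-of-the-envelope cost model. The sketch below is a toy model, not a Bagua benchmark: it assumes that compute and communication do not overlap, that communication time is simply bytes divided by bandwidth, and that a communication algorithm shrinks the traffic by a fixed `compression` factor. All names and numbers are hypothetical.

```python
def step_time(compute_s, grad_bytes, bandwidth_bytes_per_s, compression=1.0):
    """Time for one training step under the toy model: compute + communication."""
    comm_s = grad_bytes / compression / bandwidth_bytes_per_s
    return compute_s + comm_s


def speedup(compute_s, grad_bytes, bandwidth_bytes_per_s, compression):
    """Ratio of the uncompressed step time to the compressed step time."""
    baseline = step_time(compute_s, grad_bytes, bandwidth_bytes_per_s)
    relaxed = step_time(compute_s, grad_bytes, bandwidth_bytes_per_s, compression)
    return baseline / relaxed


if __name__ == "__main__":
    compute_s = 0.5        # hypothetical forward + backward time per step
    grad_bytes = 4e8       # hypothetical 400 MB of gradients per step
    slow_net = 1.25e9      # about 10 Gbit/s, expressed in bytes per second
    fast_net = 1.25e10     # about 100 Gbit/s, expressed in bytes per second

    # Same 4x traffic reduction, two different networks.
    print("slow network speedup:", speedup(compute_s, grad_bytes, slow_net, 4.0))
    print("fast network speedup:", speedup(compute_s, grad_bytes, fast_net, 4.0))
```

Under these assumptions the same traffic reduction yields a clear gain on the slow network and only a marginal one on the fast network, because communication is already a small share of the step time there.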