NATSpeech: A Non-Autoregressive Text-to-Speech Framework

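This repo contains the official PyTorch implementations of:

  • PortaSpeech: Portable and High-Quality Generative Text-to-Speech
  • DiffSpeech, from DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism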

Key Features

We implement the following features in this framework:

  • Data processing for non-autoregressive Text-to-Speech
    using the Montreal Forced Aligner.
  • Convenient and scalable framework for training and inference.
  • Simple but efficient random-access dataset implementation.
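
To illustrate what a random-access dataset can look like, here is a minimal sketch. It is not the implementation shipped in this repo: the class names `IndexedDatasetBuilder` and `IndexedDataset`, the `.data`/`.idx` file suffixes, and the use of `pickle` are assumptions made for the example. The idea is to append serialized items to one binary file and keep a list of byte offsets, so any item can be read with a single seek.

```python
import os
import pickle
import tempfile


class IndexedDatasetBuilder:
    """Hypothetical writer: appends pickled items to <path>.data and
    records cumulative byte offsets in <path>.idx."""

    def __init__(self, path):
        self.path = path
        self.data_file = open(path + ".data", "wb")
        self.offsets = [0]  # offsets[i] is where item i starts

    def add_item(self, item):
        written = self.data_file.write(pickle.dumps(item))
        self.offsets.append(self.offsets[-1] + written)

    def finalize(self):
        self.data_file.close()
        with open(self.path + ".idx", "wb") as f:
            pickle.dump(self.offsets, f)


class IndexedDataset:
    """Hypothetical reader: loads only the offset table up front and
    fetches each item on demand with one seek and one read."""

    def __init__(self, path):
        with open(path + ".idx", "rb") as f:
            self.offsets = pickle.load(f)
        self.data_file = open(path + ".data", "rb")

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self.offsets[i], self.offsets[i + 1]
        self.data_file.seek(start)
        return pickle.loads(self.data_file.read(end - start))

    def close(self):
        self.data_file.close()


if __name__ == "__main__":
    # Usage: write three toy items, then read one back by index.
    prefix = os.path.join(tempfile.mkdtemp(), "train")
    builder = IndexedDatasetBuilder(prefix)
    for i in range(3):
        builder.add_item({"item_name": "utt%d" % i, "mel_len": 100 + i})
    builder.finalize()

    ds = IndexedDataset(prefix)
    print(len(ds), ds[1]["item_name"])
    ds.close()
```

Because only the offset table is held in memory, a layout like this keeps startup cheap for large corpora. When used with multi-worker data loading, each worker would typically open its own file handle.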

Install Dependencies

# We tested on Linux/Ubuntu 18.04.
# Install Python 3.6+ first (Anaconda recommended).

# build a virtual env (recommended).
python -m venv venv
source venv/bin/activate
# install requirements.
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 # torch >= 1.9.0 recommended
pip install -r requirements.txt
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/ # install forced alignment tool
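
After running the commands above, a quick sanity check of the environment can save debugging time later. The script below is a hedged sketch, not part of this repo: the helper names `parse_version` and `check_environment` are hypothetical, and the minimum versions are simply the ones mentioned in the comments above (Python 3.6+, torch >= 1.9.0).

```python
import re
import sys

# Minimum versions taken from the install notes; adjust if they change.
MIN_PYTHON = (3, 6)
MIN_TORCH = (1, 9, 0)


def parse_version(text):
    """Turn a version string such as '1.9.0+cu111' into (1, 9, 0).

    Only the leading numeric components are used; any local or
    pre-release suffix is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % text)
    return tuple(int(part or 0) for part in match.groups())


def check_environment():
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d+ is required" % MIN_PYTHON)
    try:
        import torch  # imported lazily so the check works without torch
    except ImportError:
        problems.append("torch is not installed")
    else:
        if parse_version(torch.__version__) < MIN_TORCH:
            problems.append(
                "torch %s found, >= %d.%d.%d recommended"
                % ((torch.__version__,) + MIN_TORCH)
            )
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks OK.")
```

Run it inside the activated virtual environment; it only inspects the interpreter and the installed torch version and does not verify sox or the forced alignment tool.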



If you find this useful for your research, please cite the following papers:

  • PortaSpeech
  title={PortaSpeech: Portable and High-Quality Generative Text-to-Speech},
  author={Ren, Yi and Liu, Jinglin and Zhao, Zhou},
  journal={Advances in Neural Information Processing Systems},
  • DiffSpeech
  title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
  author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
  journal={arXiv preprint arXiv:2105.02446},


Our code is influenced by the following repos:

About this repository: NATSpeech is a Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including the official PyTorch implementations of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022).