Chinese mandarin text to speech (MTTS)

This is a modularized Text-to-speech framework aiming to support fast research and product developments. Main features include

  • all modules are configurable via yaml,
  • speaker embedding / prosody embeding/ multi-stream text embedding are supported and configurable,
  • various vocoders (VocGAN, hifi-GAN, waveglow, melGAN) are supported by adapter so that comparison across different vocoders can be done easily,
  • durations/pitch/energy variance predictor are supported, and other variances can be added easily,
  • and more on the road-map.

Contributions are welcome.

Audio samples

Quick start


git clone
cd mandarin-tts
git submodule update --force --recursive --init --remote
pip install -e .  # this is necessary!


Two examples are provided here: biaobei and aishell3.

To train your own models, first make a copy from existing examples, then prepare the melspectrogram features using by

cd examples
python -c ./aishell3/config.yaml -w <aishell3_wav_folder> -m <mel_folder> -d cpu

prepare the scp files necessary for training,

cd examples/aishell3
python --wav_folder <aishell3_wav_folder>  --mel_folder <mel_folder> --dst_folder ./train/

This will generate scp files required by config.yaml (in the dataset/train section).
You would also need to check that everything is fine in the config file.
Usually you don't need to change the code.

Now you can start your training by

cd examples/aishell3
python ../../mtts/ -c config.yaml -d cuda

For biaobei dataset, the workflow is the same, except that there is no speaker embedding but you can add prosody embedding.

More examples will be added. Please stay.


Pretrained mtts checkpoints

Currently two examples are provided, and the corresponding checkpoints/configs are summarized as follows.

dataset checkpoint config
aishell3 link link
biaobei link link

Supported vocoders

Vocoders play the role of converting melspectrograms to waveforms. They are added as submodules and will be be trained in this project. Hence you should download the checkpoints before synthesizing. In training, vocoders are not necessary, as you can monitor the training process from generated melspectrograms and also the loss curve. Current we support the following vocoders,

Vocoder checkpoint github
Waveglow link link
hifi-gan link link
VocGAN link link link
MelGAN link link

All vocoders will be ready after running git submodule update --force --recursive --init --remote. However, you have to download the checkpoint manually and properly set the path in the config.yaml file.

Preparing your input text

The input.txt should be consistent with your setting of emb_type1 to emb_type_n in config file, i.e., same type, same order.

To facilitate transcription of hanzi to pinyin, you can try:

cd examples/aishell3/
python ../../mtts/text/ -t "为适应新的网络传播方式和读者阅读习惯"
>> sil wei4 shi4 ying4 xin1 de5 wang3 luo4 chuan2 bo1 fang1 shi4 he2 du2 zhe3 yue4 du2 xi2 guan4 sil|sil 为 适 应 新 的 网 络 传 播 方 式 和 读 者 阅 读 习 惯 sil

Not you can copy the text to input.txt, and remember to put down the self-defined name and speaker id, separated by '|'.

Synthesizing your waves

With the above checkpoints and text ready, finally you can run the synthesis process,

python ../../mtts/  -d cuda --c config.yaml --checkpoint ./checkpoints/checkpoint_1240000.pth.tar -i input.txt

Please check the config.yaml file for the vocoder settings.

If lucky, audio examples can be found in the output folder.