Comprehensive-Transformer-TTS – PyTorch Implementation
A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, aiming to achieve the ultimate TTS. Any suggestions toward the best Non-AR TTS are welcome 🙂
- Fastformer: Additive Attention Can Be All You Need (Wu et al., 2021)
- Long-Short Transformer: Efficient Transformers for Language and Vision (Zhu et al., 2021)
- Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020)
- Reformer: The Efficient Transformer (Kitaev et al., 2020)
- Attention Is All You Need (Vaswani et al., 2017)
Supervised Duration Modeling
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (Ren et al., 2020)
Unsupervised Duration Modeling
One TTS Alignment To Rule Them All (Badlani et al., 2021): We are finally freed from external aligners such as MFA! Validation alignments for LJ014-0329 up to 70K steps are shown below as an example.
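As a rough illustration of what unsupervised alignment buys us (a hypothetical sketch, not the repository's actual code), a learned soft alignment can be collapsed into per-phoneme durations by hard-assigning each mel frame to its most likely phoneme:

```python
import torch

def durations_from_alignment(attn_soft):
    """Hypothetical helper: collapse a learned soft alignment into durations.

    attn_soft: (T_mel, N_phones) alignment probabilities for one utterance.
    Returns a (N_phones,) tensor counting mel frames assigned to each phoneme.
    """
    hard = attn_soft.argmax(dim=-1)  # most likely phoneme index per mel frame
    return torch.bincount(hard, minlength=attn_soft.size(-1))
```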
Transformer Performance Comparison on LJSpeech (1 TITAN RTX 24G / 16 batch size)
|Model|Memory Usage|Training Time (1K steps)|
|---|---|---|
|Fastformer (lucidrains’)|10531MiB / 24220MiB|4m 25s|
|Fastformer (wuch15’s)|10515MiB / 24220MiB|4m 45s|
|Long-Short Transformer|10633MiB / 24220MiB|5m 26s|
|Conformer|18903MiB / 24220MiB|7m 4s|
|Reformer|10293MiB / 24220MiB|10m 16s|
|Transformer|7909MiB / 24220MiB|4m 51s|
Toggle the type of building blocks by
```yaml
# In the model.yaml
block_type: "transformer" # ["transformer", "fastformer", "lstransformer", "conformer", "reformer"]
```
Toggle the type of duration modeling by
```yaml
# In the model.yaml
duration_modeling:
  learn_alignment: True # True for unsupervised modeling, False for supervised modeling
```
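Internally, a toggle like `block_type` typically maps to a module choice at model construction time. The following is only a schematic sketch of such a dispatch; the encoder class names are placeholders, not the repository's actual ones:

```python
import torch.nn as nn

# Placeholder classes standing in for the actual encoder implementations.
class TransformerEncoder(nn.Module): ...
class FastformerEncoder(nn.Module): ...
class LSTransformerEncoder(nn.Module): ...
class ConformerEncoder(nn.Module): ...
class ReformerEncoder(nn.Module): ...

BLOCKS = {
    "transformer": TransformerEncoder,
    "fastformer": FastformerEncoder,
    "lstransformer": LSTransformerEncoder,
    "conformer": ConformerEncoder,
    "reformer": ReformerEncoder,
}

def build_encoder(model_config):
    # model_config mirrors model.yaml; block_type picks the building block
    return BLOCKS[model_config["block_type"]]()
```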
DATASET refers to the name of a dataset, such as LJSpeech and VCTK, in the following documents.
You can install the Python dependencies with
```bash
pip3 install -r requirements.txt
```
A Dockerfile is also provided for Docker users.
You have to download the pretrained models and put them in `output/ckpt/DATASET/`. The models are trained with unsupervised duration modeling under the transformer building block.
For a single-speaker TTS, run
```bash
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
```
For a multi-speaker TTS, run
```bash
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
```
The dictionary of learned speakers can be found at `preprocessed_data/DATASET/speakers.json`, and the generated utterances will be put under `output/`.
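For example, a valid SPEAKER_ID can be looked up from that dictionary before synthesis. A minimal sketch, assuming `speakers.json` maps speaker names to integer IDs (as in FastSpeech2-style preprocessing) and using the VCTK speaker `p225`:

```python
import json

# Assumed layout: {"p225": 0, "p226": 1, ...} for VCTK-style speaker names
with open("preprocessed_data/VCTK/speakers.json") as f:
    speakers = json.load(f)

print(speakers["p225"])  # the integer to pass as --speaker_id
```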
Batch inference is also supported; try
```bash
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
```
to synthesize all utterances in `preprocessed_data/DATASET/val.txt`.
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios.
For example, one can increase the speaking rate by 20% and decrease the volume by 20% by
```bash
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
```
Add `--speaker_id SPEAKER_ID` for a multi-speaker TTS.
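Conceptually, these ratios simply rescale the predicted variance values before they are consumed downstream. The sketch below shows the idea for duration control in the FastSpeech2 style; it is a schematic, not the repository's exact code:

```python
import torch

def apply_duration_control(log_duration_pred, control=1.0):
    """FastSpeech2-style duration control (a schematic, not the exact code).

    The duration predictor outputs log durations; scaling the decoded
    durations by control < 1.0 speeds speech up, > 1.0 slows it down.
    """
    durations = torch.clamp(
        torch.round((torch.exp(log_duration_pred) - 1.0) * control), min=0.0
    )
    return durations.long()  # mel frames per phoneme after control
```

With `control=0.8`, every phoneme is allotted fewer mel frames, so the utterance plays back faster.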
The supported datasets are
- LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
Both a single-speaker TTS dataset (e.g., Blizzard Challenge 2013) and a multi-speaker TTS dataset (e.g., LibriTTS) can be added by following LJSpeech and VCTK, respectively. Moreover, your own language and dataset can be adapted following here.
For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy’s DeepSpeaker for the speaker embedding, and place it where the config expects.
Run
```bash
python3 prepare_align.py --dataset DATASET
```
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
You have to unzip the files in `preprocessed_data/DATASET/TextGrid/`. Alternatively, you can run the aligner by yourself.
After that, run the preprocessing script by
```bash
python3 preprocess.py --dataset DATASET
```
Train your model with
```bash
python3 train.py --dataset DATASET
```
- To use Automatic Mixed Precision, append the `--use_amp` argument to the above command (a minimal AMP sketch follows this list).
- The trainer assumes single-node multi-GPU training. To use specific GPUs, prefix the above command with `CUDA_VISIBLE_DEVICES=<GPU_IDs>`.
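Under the hood, `--use_amp` presumably enables PyTorch's automatic mixed precision. A minimal, self-contained sketch of an AMP training step (placeholder model and data, not the repository's trainer):

```python
import torch

model = torch.nn.Linear(10, 1).cuda()            # placeholder model
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()             # scales loss to avoid fp16 underflow

for _ in range(10):                              # placeholder training loop
    x = torch.randn(16, 10, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # forward pass in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                # backprop on the scaled loss
    scaler.step(optimizer)                       # unscales grads, then steps
    scaler.update()                              # adjusts scale for next step
```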
Use
```bash
tensorboard --logdir output/log
```
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
- Both phoneme-level and frame-level variance are supported in both supervised and unsupervised duration modeling.
- Note that there are no pre-extracted phoneme-level variance features in unsupervised duration modeling.
- For phoneme-level variance in unsupervised duration modeling, convolutional embedding is used as in StyleSpeech; otherwise, bucket-based embedding is used as in FastSpeech2 (see the sketch after this list).
- Phoneme-level unsupervised duration modeling takes longer than frame-level, since the additional computation of phoneme-level variance is activated at runtime.
- There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy’s pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config.
- DeepSpeaker on the VCTK dataset shows clear identification among speakers, as illustrated by a t-SNE plot of the extracted speaker embeddings.
- For the vocoder, HiFi-GAN and MelGAN are supported.
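To make the two variance-embedding notes above concrete, here is a rough sketch of both styles; the sizes and value range are illustrative, not the repository's actual hyperparameters:

```python
import torch
import torch.nn as nn

N_BINS, HIDDEN = 256, 256                           # illustrative sizes

# Bucket-based embedding (FastSpeech2-style): quantize the continuous
# pitch/energy value into bins, then look up a learned embedding.
bins = torch.linspace(-1.0, 1.0, N_BINS - 1)        # illustrative value range
bucket_embed = nn.Embedding(N_BINS, HIDDEN)

def bucket_embedding(x):                            # x: (B, T) continuous values
    return bucket_embed(torch.bucketize(x, bins))   # -> (B, T, HIDDEN)

# Convolutional embedding (StyleSpeech-style): project the continuous
# contour with a 1-D convolution instead of quantizing it.
conv_embed = nn.Conv1d(1, HIDDEN, kernel_size=9, padding=4)

def conv_embedding(x):                              # x: (B, T) continuous values
    return conv_embed(x.unsqueeze(1)).transpose(1, 2)  # -> (B, T, HIDDEN)
```

The bucket variant discretizes the contour, while the convolutional variant operates on the continuous values directly.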
Please cite this repository using the “Cite this repository” button in the About section (top right of the main page).
- ming024’s FastSpeech2
- wuch15’s Fastformer
- lucidrains’ fast-transformer-pytorch
- lucidrains’ long-short-transformer
- sooftware’s conformer
- lucidrains’ reformer-pytorch
- NVIDIA’s NeMo: Special thanks to Onur Babacan and Rafael Valle for unsupervised duration modeling.