StyleSpeech

PyTorch implementation of Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation.

Status (2021.06.13)

  • [x] StyleSpeech (naive branch)
  • [x] Meta-StyleSpeech (main branch)

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

Download the pretrained models and put them in output/ckpt/LibriTTS/.

To synthesize a single English utterance in the style of a reference audio clip, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --ref_audio path/to/reference_audio.wav --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml

The generated utterances are written to output/result/. The synthesized speech takes on the style of ref_audio.
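
Because the three config flags repeat in every command, it can be convenient to assemble the call from Python when scripting several syntheses. The following is a minimal sketch and not part of the repository: build_synthesize_cmd is a hypothetical helper, and it assumes the config files follow the config/<dataset>/ layout used in the command above. It uses only the flags that appear in that command.

```python
import shlex


def build_synthesize_cmd(text, ref_audio, restore_step=200000, dataset="LibriTTS"):
    """Return the argv list for a single-utterance synthesize.py call.

    Hypothetical helper: it only mirrors the command shown in this README.
    """
    config_dir = f"config/{dataset}"  # assumed layout: config/<dataset>/*.yaml
    return [
        "python3", "synthesize.py",
        "--text", text,
        "--ref_audio", ref_audio,
        "--restore_step", str(restore_step),
        "--mode", "single",
        "-p", f"{config_dir}/preprocess.yaml",
        "-m", f"{config_dir}/model.yaml",
        "-t", f"{config_dir}/train.yaml",
    ]


cmd = build_synthesize_cmd("Hello world", "path/to/reference_audio.wav")
print(shlex.join(cmd))  # shell-quoted form, for inspection

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the text contains spaces or quotes.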

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/LibriTTS/val.txt --restore_step 200000 --mode batch -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml

to synthesize all utterances in preprocessed_data/LibriTTS/val.txt. This amounts to reconstructing the validation set, with each utterance serving as its own style reference.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios.
For example, the following command shortens the phoneme durations by 20 % (faster speech) and lowers the energy by 20 % (quieter speech):

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --ref_audio path/to/reference_audio.wav --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml --duration_control 0.8 --energy_control 0.8

Note that this controllability originates from FastSpeech2 and is not a central focus of StyleSpeech.
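
The relation between the duration ratio and the speaking rate can be made concrete with a small sketch. It assumes, as in common FastSpeech2 implementations, that the control ratio multiplies each predicted phoneme duration (in frames) before rounding. The function name and the sample durations are illustrative and are not taken from this repository.

```python
def apply_duration_control(durations, ratio):
    """Scale per-phoneme durations (in frames) by `ratio` and round to whole frames.

    Assumption: the ratio multiplies the predicted durations, so a ratio
    below 1 shortens the utterance. Negative results are clamped to zero.
    """
    return [max(0, round(d * ratio)) for d in durations]


predicted = [5, 10, 20, 15]            # illustrative frame counts per phoneme
scaled = apply_duration_control(predicted, 0.8)

total_before = sum(predicted)          # total length before scaling
total_after = sum(scaled)              # total length after scaling
speedup = total_before / total_after   # relative speaking rate

print(scaled, total_before, total_after, speedup)
```

With a ratio of 0.8 the utterance becomes 20 % shorter, which corresponds to a speaking rate of about 1.25 times the original.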

Training

Datasets

The supported datasets are

  • LibriTTS: a multi-speaker English dataset containing 585 hours of speech by 2456 speakers.
  • (more to be added)

Preprocessing

First, run

python3 prepare_align.py config/LibriTTS/preprocess.yaml

to prepare the corpus for alignment.

In this implementation, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.

Download the official MFA package and run

./montreal-forced-aligner/bin/mfa_align raw_data/LibriTTS/ lexicon/librispeech-lexicon.txt english preprocessed_data/LibriTTS

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LibriTTS/ lexicon/librispeech-lexicon.txt preprocessed_data/LibriTTS

to align the corpus. Then run the preprocessing script:

python3 preprocess.py config/LibriTTS/preprocess.yaml

Training

Train your model with

python3 train.py -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml

As described in the paper, the script first pre-trains the naive model until step meta_learning_warmup and then meta-trains the model for additional steps via episodic training.
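
The two-phase schedule can be pictured as a single loop that switches its update rule once the warm-up step count is reached. The sketch below is illustrative only: naive_step and meta_step are hypothetical stand-ins for the repository's actual update functions, the step counts are made up, and the exact boundary condition (step < meta_learning_warmup) is an assumption. Only the name meta_learning_warmup comes from the description above.

```python
def training_phase(step, meta_learning_warmup):
    """Return the phase of a global step (assumed boundary: step < warmup is naive)."""
    return "naive" if step < meta_learning_warmup else "meta"


def run_schedule(total_steps, meta_learning_warmup, naive_step, meta_step):
    """Call naive_step during warm-up and meta_step afterwards; return per-phase counts."""
    counts = {"naive": 0, "meta": 0}
    for step in range(total_steps):
        phase = training_phase(step, meta_learning_warmup)
        if phase == "naive":
            naive_step(step)   # stand-in for an ordinary pre-training update
        else:
            meta_step(step)    # stand-in for an episodic meta-training update
        counts[phase] += 1
    return counts


# Illustrative numbers, not the repository's defaults.
counts = run_schedule(
    total_steps=10,
    meta_learning_warmup=6,
    naive_step=lambda step: None,
    meta_step=lambda step: None,
)
print(counts)
```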

TensorBoard

Use

tensorboard --logdir output/log/LibriTTS

to serve TensorBoard on your localhost.

GitHub

https://github.com/keonlee9420/StyleSpeech