A multispeaker voice synthesis model based on Tacotron 2 GST

Nov 30, 2019 1 min read

Mellotron

In our recent paper we propose Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data.

By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice.
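To make "continuous pitch contour" concrete, here is a minimal sketch that computes one pitch value per frame from an audio signal. It uses a plain autocorrelation estimator chosen for illustration; the text above does not say how Mellotron extracts pitch, so this should not be read as the project's method. The function names, frame sizes, and frequency range are all assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Return the frequency (Hz) whose lag maximizes the frame's autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def pitch_contour(signal, sample_rate, frame_len=400, hop=200):
    """Slide a window over the signal and estimate one pitch value per frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [estimate_f0(signal[s:s + frame_len], sample_rate) for s in starts]


# Illustrative input: a pure 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
contour = pitch_contour(tone, sample_rate)
```

The resulting list is the kind of per-frame signal a model can be conditioned on; a sung melody would give a contour that rises and falls with the notes instead of staying flat.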

Pre-requisites

NVIDIA GPU with CUDA and cuDNN

Setup

Clone this repo: git clone https://github.com/NVIDIA/mellotron.git
Change into the repo directory: cd mellotron
Initialize the submodules: git submodule init; git submodule update
Install PyTorch
Install Apex
Install the Python requirements or build the Docker image
- Install the Python requirements: pip install -r requirements.txt

Training

Update the filelists inside the filelists folder to point to your data
python train.py --output_directory=outdir --log_directory=logdir
(Optional) Monitor training with TensorBoard: tensorboard --logdir=outdir/logdir
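Pointing the filelists at your data usually means rewriting the audio path in each entry. The sketch below assumes each filelist line is pipe-separated with the audio path as the first field; that layout, the example paths, and the helper name reroot_line are assumptions for illustration, so check the files in the filelists folder before adapting it.

```python
def reroot_line(line, old_root, new_root):
    """Swap the leading directory of the first pipe-separated field.

    Assumes the audio path is the first field; the remaining fields
    (for example the transcript) are passed through unchanged.
    """
    path, sep, rest = line.rstrip("\n").partition("|")
    if path.startswith(old_root):
        path = new_root + path[len(old_root):]
    return path + sep + rest


# Hypothetical entry and directories, for illustration only.
example = "/old/data/speaker1/utt1.wav|hello world|0\n"
updated = reroot_line(example, "/old/data", "/mnt/corpus")
```

Lines whose path does not start with old_root are returned unchanged, so the helper can be run safely over a filelist that mixes several roots.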

Training using a pre-trained model

Training using a pre-trained model can lead to faster convergence.
By default, the speaker embedding layer is ignored when the pre-trained weights are loaded.

Download our published Mellotron model trained on LibriTTS
python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
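The idea of ignoring a layer during a warm start can be sketched with plain dictionaries standing in for model weights. This is an illustration of the general technique, not the code in train.py; the layer names, the toy weights, and the helper names are hypothetical.

```python
def filter_ignored_layers(checkpoint_state, ignore_layers):
    """Drop checkpoint entries whose names are listed in ignore_layers."""
    return {name: weights for name, weights in checkpoint_state.items()
            if name not in ignore_layers}


def warm_start(model_state, checkpoint_state, ignore_layers):
    """Overlay checkpoint weights on a fresh model, skipping ignored layers."""
    merged = dict(model_state)  # start from the freshly initialized weights
    merged.update(filter_ignored_layers(checkpoint_state, ignore_layers))
    return merged


# Hypothetical layer names and toy weights, for illustration only.
fresh = {"encoder.weight": [0.0, 0.0], "speaker_embedding.weight": [0.1, 0.2, 0.3]}
pretrained = {"encoder.weight": [0.5, 0.7], "speaker_embedding.weight": [9.0, 9.0]}
state = warm_start(fresh, pretrained, ignore_layers=["speaker_embedding.weight"])
```

Skipping the speaker embedding makes sense when the new dataset has a different set of speakers: the pre-trained table would have the wrong size or meaning, so that layer keeps its fresh initialization while everything else starts from the checkpoint.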

Multi-GPU (distributed) and Automatic Mixed Precision Training

python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
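The --hparams value above is a comma-separated list of name=value overrides. How train.py parses it is not shown here, so the following is only a sketch of how such a string could be turned into typed settings; parse_overrides and its conversion rules are assumptions.

```python
def convert_value(raw):
    """Interpret a raw string as bool, int, or float, falling back to str."""
    if raw in ("True", "False"):
        return raw == "True"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_overrides(text):
    """Parse 'name=value,name=value' into a dict of typed overrides."""
    overrides = {}
    for item in text.split(","):
        if not item.strip():
            continue  # tolerate empty segments such as a trailing comma
        name, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {item!r}")
        overrides[name.strip()] = convert_value(raw.strip())
    return overrides


overrides = parse_overrides("distributed_run=True,fp16_run=True")
```

Further settings can be appended to the same string in the same form, provided the names match hyperparameters the training script actually defines.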

Inference demo

jupyter notebook --ip=127.0.0.1 --port=31337
Load inference.ipynb
(Optional) Download our published WaveGlow model
