A PyTorch Implementation of Tacotron: An End-to-End Text-to-Speech Deep-Learning Model

An implementation of Google's Tacotron TTS system in PyTorch.


2018/09/15: Fix RNN feeding bug.


Install Python and PyTorch yourself:

  • python==3.6.5
  • pytorch==0.4.1

You can use requirements.txt to install the packages below.

# I recommend you use virtualenv.
$ pip install -r requirements.txt
  • librosa
  • numpy
  • pandas
  • scipy
  • matplotlib


  • Data
Download LJSpeech, provided by keithito. It contains 13,100 short audio clips from a single speaker, approximately 24 hours in total.

  • Set config.

# Set 'meta_path' and 'wav_dir' in `hyperparams.py` to your downloaded LJSpeech's meta file and wav directory.
meta_path = 'Data/LJSpeech-1.1/metadata.csv'
wav_dir = 'Data/LJSpeech-1.1/wavs'
  • Train
# If you have a pretrained model, add --ckpt <ckpt_path>
$ python main.py --train --cuda
  • Evaluate
# You can change the evaluation texts in `hyperparams.py`
# ckpt files are saved in 'tmp/ckpt/' by default
$ python main.py --eval --cuda --ckpt <ckpt_timestep.pth.tar>
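For reference, LJSpeech's metadata.csv is a pipe-separated file with no header and three columns: clip id, raw transcript, and normalized transcript. A minimal sketch of reading it with pandas (which is in requirements.txt) and resolving wav paths follows; the column names and the `wav_path` helper are illustrative assumptions, not code from this repo. A small inline sample stands in for the real file:

```python
import io
import os
import pandas as pd

# metadata.csv is pipe-separated with no header: clip id | raw text | normalized text.
# quoting=3 (csv.QUOTE_NONE) matters because transcripts contain quote characters.
sample = io.StringIO(
    'LJ001-0001|Printing, in the only sense|Printing, in the only sense\n'
    'LJ001-0002|in being comparatively modern.|in being comparatively modern.\n'
)
meta = pd.read_csv(sample, sep='|', header=None, quoting=3,
                   names=['id', 'text', 'normalized'])

def wav_path(wav_dir, clip_id):
    # Map a clip id like 'LJ001-0001' to its wav file under wav_dir.
    return os.path.join(wav_dir, clip_id + '.wav')

print(wav_path('Data/LJSpeech-1.1/wavs', meta['id'][0]))
```

In real use you would pass the `meta_path` and `wav_dir` values from `hyperparams.py` instead of the inline sample.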


The sample texts are based on Harvard Sentences. See the samples in samples/, which were generated after 200k training steps.


The model starts to learn at around 30k steps.


Differences from the original Tacotron

  1. Data bucketing (the original Tacotron used a loss mask)
  2. No residual connection in the decoder CBHG
  3. Batch size of 8
  4. Gradient clipping
  5. Noam-style learning rate decay
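Noam-style decay (from "Attention Is All You Need") warms the learning rate up linearly for a number of warmup steps, then decays it roughly as 1/sqrt(step). A minimal sketch of the schedule; the `d_model` and `warmup_steps` values are illustrative defaults, not this repo's settings:

```python
def noam_lr(step, d_model=256, warmup_steps=4000):
    """Noam schedule: linear warmup for warmup_steps, then ~1/sqrt(step) decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises during warmup, peaks at warmup_steps, then decays.
print(noam_lr(100), noam_lr(4000), noam_lr(40000))
```

In a PyTorch training loop the rate is typically applied by writing `noam_lr(step)` into each `param_group['lr']` of the optimizer before `optimizer.step()`; gradient clipping (difference 4 above) is a single call to `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` right after `loss.backward()`.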