End-2-end speech synthesis with recurrent neural networks

TTS-Cube

TTS-Cube is an end-2-end speech synthesis system that provides a full processing pipeline to train and deploy TTS models.

It is entirely based on neural networks, requires no pre-aligned data and can be trained to produce audio just by using character or phoneme sequences.

View Documentation Download Source

Output examples

Encoder outputs:

"Arată că interesul utilizatorilor de internet față de acțiuni ecologiste de genul Earth Hour este unul extrem de ridicat."
encoder_output_1

"Pentru a contracara proiectul, Rusia a demarat un proiect concurent, South Stream, în care a încercat să atragă inclusiv o parte dintre partenerii Nabucco."
encoder_output_2

Vocoder output (conditioned on gold-standard data)

Note: The mel-spectrum is computed with a frame-shift of 12.5ms. This means that Griffin-Lim reconstruction produces sloppy results at most (regardless on the number of iterations)

original vocoder

End to end decoding

The encoder model is still converging, so right now the examples are still of low quality. We will update the files as soon as we have a stable Encoder model.

synthesized original(unseen)

End-2-end speech synthesis with recurrent neural networks

TTS-Cube

Output examples

End to end decoding

GitHub

John

Dance generator using Autoencoder LSTM and Mixture Density Network

Tensorflow prebuilt binary for Windows