The goal of this repository is to provide a real-time neural vocoder that is compatible with ESPnet-TTS. This repository can also be combined with the NVIDIA/tacotron2-based implementation (see this comment).
You can try the real-time end-to-end text-to-speech demonstration in Google Colab!
Real-time demonstration with ESPnet2
Real-time demonstration with ESPnet1
What’s new
2021/10/21 Single-speaker Korean recipe [egs/kss/voc1] is available.
2021/08/24 Add more pretrained models of StyleMelGAN and HiFi-GAN.
2021/08/07 Add initial pretrained models of StyleMelGAN and HiFi-GAN.
2021/08/03 Support StyleMelGAN generator and discriminator!
2021/08/02 Support HiFi-GAN generator and discriminator!
This repository is tested on Ubuntu 20.04 with a Titan V GPU.
Python 3.6+
CUDA 10.0+
cuDNN 7+
NCCL 2+ (for distributed multi-gpu training)
libsndfile (you can install via sudo apt install libsndfile-dev on Ubuntu)
jq (you can install via sudo apt install jq on Ubuntu)
sox (you can install via sudo apt install sox on Ubuntu)
Different CUDA versions should work but are not explicitly tested. All of the code is tested with PyTorch 1.4, 1.5.1, 1.7.1, 1.8.1, and 1.9.
PyTorch 1.6 works, but there are some issues in CPU mode (see #198).
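If you want to quickly verify the Python side of your environment, the following one-liner prints the installed PyTorch version, the cuDNN version it was built with, and whether a GPU is visible (this only checks the PyTorch installation, not the system CUDA toolkit):
$ python -c "import torch; print(torch.__version__, torch.backends.cudnn.version(), torch.cuda.is_available())"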
Setup
You can choose between two installation methods.
A. Use pip
$ git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
$ cd ParallelWaveGAN
$ pip install -e .
# If you want to use distributed training, please install
# apex manually by following https://github.com/NVIDIA/apex
$ ...
Note that your CUDA version must exactly match the version used to build the PyTorch binary in order to install apex. To install PyTorch compiled with a different CUDA version, see tools/Makefile.
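As a quick sanity check before building apex, you can compare the system toolkit version with the CUDA version the installed PyTorch binary was built against (this assumes nvcc is on your PATH):
# CUDA version of the system toolkit
$ nvcc --version
# CUDA version the PyTorch binary was compiled with
$ python -c "import torch; print(torch.version.cuda)"
If the two versions differ, the apex build will fail.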
B. Make virtualenv
$ git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
$ cd ParallelWaveGAN/tools
$ make
# If you want to use distributed training, please run the following
# command to install apex.
$ make apex
Note that we specify the CUDA version used to compile the PyTorch wheel. If you want to use a different CUDA version, please check tools/Makefile to change the PyTorch wheel to be installed.
Recipe
This repository provides Kaldi-style recipes, in the same way as ESPnet. Currently, the following recipes are supported.
To run a recipe, please follow the instructions below.
# Let us move on the recipe directory
$ cd egs/ljspeech/voc1
# Run the recipe from scratch
$ ./run.sh
# You can change config via command line
$ ./run.sh --conf <your_customized_yaml_config>
# You can select the stage to start and stop
$ ./run.sh --stage 2 --stop_stage 2
# If you want to specify the gpu
$ CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2
# If you want to resume training from 10000 steps checkpoint
$ ./run.sh --stage 2 --resume <path>/<to>/checkpoint-10000steps.pkl
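While training is running, you can monitor the progress with TensorBoard. The log directory below is an assumption based on the usual Kaldi-style layout; point it at your actual experiment directory:
# Training logs are written under the experiment directory
$ tensorboard --logdir exp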
If you use the MelGAN generator, decoding will be even faster.
# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [04:00<00:00, 1.04it/s, RTF=0.0882]
2020-02-08 10:45:14,111 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.137).
# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:06<00:00, 36.38it/s, RTF=0.00189]
2020-02-08 05:44:42,231 (decode:142) INFO: Finished generation of 250 utterances (RTF = 0.002).
If you use the Multi-band MelGAN generator, decoding will be much faster still.
# On CPU (Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz 16 threads)
[decode]: 100%|██████████| 250/250 [01:47<00:00, 2.95it/s, RTF=0.048]
2020-05-22 15:37:19,771 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.059).
# On GPU (TITAN V)
[decode]: 100%|██████████| 250/250 [00:05<00:00, 43.67it/s, RTF=0.000928]
2020-05-22 15:35:13,302 (decode:151) INFO: Finished generation of 250 utterances (RTF = 0.001).
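For reference, the RTF (real-time factor) reported above is the synthesis time divided by the duration of the generated audio, so RTF = 0.001 means roughly 1 ms of computation per 1 s of audio. A quick back-of-the-envelope check:
# Seconds of computation needed to generate 1 minute (60 s) of audio at RTF = 0.001
$ python -c "print(0.001 * 60)"
0.06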
If you want to accelerate inference further, it is worth trying the conversion from PyTorch to TensorFlow. An example of the conversion is available in the notebook (provided by @dathudeptrai).
Results
The results are summarized in the table. You can listen to the samples and download pretrained models from the link to our Google Drive.
Here is the minimal code to perform analysis-synthesis using a pretrained model.
# Please make sure you installed `parallel_wavegan`
# If not, please install via pip
$ pip install parallel_wavegan
# You can download the pretrained model from terminal
$ python <<EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrain_model_tag>", "pretrain_model")
EOF
# You can get all of the available pretrained models as follows:
$ python <<EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(PRETRAINED_MODEL_LIST.keys())
EOF
# Now you can find the downloaded pretrained model in `pretrain_model/<pretrain_model_tag>/`
$ ls pretrain_model/<pretrain_model_tag>
checkpoint-400000steps.pkl config.yml stats.h5
# These files can also be downloaded manually from the above results
# Please put an audio file in the `sample` directory to perform analysis-synthesis
$ ls sample/
sample.wav
# Then perform feature extraction -> feature normalization -> synthesis
$ parallel-wavegan-preprocess \
--config pretrain_model/<pretrain_model_tag>/config.yml \
--rootdir sample \
--dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-normalize \
--config pretrain_model/<pretrain_model_tag>/config.yml \
--rootdir dump/sample/raw \
--dumpdir dump/sample/norm \
--stats pretrain_model/<pretrain_model_tag>/stats.h5
2019-11-13 13:44:29,574 (normalize:87) INFO: the number of files = 1.
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 513.13it/s]
$ parallel-wavegan-decode \
--checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
--dumpdir dump/sample/norm \
--outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).
# You can skip normalization step (on-the-fly normalization, feature extraction -> synthesis)
$ parallel-wavegan-preprocess \
--config pretrain_model/<pretrain_model_tag>/config.yml \
--rootdir sample \
--dumpdir dump/sample/raw
100%|████████████████████████████████████████| 1/1 [00:00<00:00, 914.19it/s]
$ parallel-wavegan-decode \
--checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
--dumpdir dump/sample/raw \
--normalize-before \
--outdir sample
2019-11-13 13:44:31,229 (decode:91) INFO: the number of features to be decoded = 1.
[decode]: 100%|███████████████████| 1/1 [00:00<00:00, 18.33it/s, RTF=0.0146]
2019-11-13 13:44:37,132 (decode:129) INFO: finished generation of 1 utterances (RTF = 0.015).
# You can find the generated speech in the `sample` directory
$ ls sample
sample.wav sample_gen.wav
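You can also run the synthesis step from Python instead of the command-line tools. Below is a minimal sketch using the load_model utility in parallel_wavegan.utils; it assumes config.yml sits next to the checkpoint (which load_model uses by default), and the random mel-spectrogram is only there to make the snippet self-contained, so substitute a real normalized feature array in practice:
$ python <<EOF
import torch
from parallel_wavegan.utils import load_model

# Load the downloaded checkpoint (config.yml is resolved from the same directory)
model = load_model("pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl")
model.remove_weight_norm()
model = model.eval()

# Dummy normalized mel-spectrogram with shape (#frames, #mels)
c = torch.randn(512, 80)
with torch.no_grad():
    y = model.inference(c).view(-1)  # generated waveform
print(y.shape)
EOF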
Decoding with ESPnet-TTS model’s features
Here, I show the procedure to generate waveforms with features generated by ESPnet-TTS models.
# Make sure you have already finished running the ESPnet-TTS recipe.
# You must use the same feature settings for both Text2Mel and Mel2Wav models.
# Let us move to the "ESPnet" recipe directory
$ cd /path/to/espnet/egs/<recipe_name>/tts1
$ pwd
/path/to/espnet/egs/<recipe_name>/tts1
# If you use ESPnet2, move to `egs2/`
$ cd /path/to/espnet/egs2/<recipe_name>/tts1
$ pwd
/path/to/espnet/egs2/<recipe_name>/tts1
# Please install this repository in ESPnet conda (or virtualenv) environment
$ . ./path.sh && pip install -U parallel_wavegan
# You can download the pretrained model from terminal
$ python <<EOF
from parallel_wavegan.utils import download_pretrained_model
download_pretrained_model("<pretrain_model_tag>", "pretrain_model")
EOF
# You can get all of the available pretrained models as follows:
$ python <<EOF
from parallel_wavegan.utils import PRETRAINED_MODEL_LIST
print(PRETRAINED_MODEL_LIST.keys())
EOF
# You can find the downloaded pretrained model in `pretrain_model/<pretrain_model_tag>/`
$ ls pretrain_model/<pretrain_model_tag>
checkpoint-400000steps.pkl config.yml stats.h5
# These files can also be downloaded manually from the above results
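Because the feature settings must match between the Text2Mel model and this vocoder, it can be worth eyeballing the relevant fields of the downloaded config before decoding. A sketch (the exact key names, e.g. fs vs sampling_rate, may differ between configs):
$ grep -E "(sampling_rate|fs|fft_size|hop_size|win_length|fmin|fmax)" pretrain_model/<pretrain_model_tag>/config.yml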
Case 1: If you use the same dataset for both Text2Mel and Mel2Wav
# In this case, you can directly use the generated features for decoding.
# Please specify the `feats.scp` path for `--feats-scp`, which is located in
# exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp.
# Note: do not use outputs_*decode_denorm/<set_name>/feats.scp, since
# those are de-normalized features (the input for PWG is normalized features).
$ parallel-wavegan-decode \
--checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
--feats-scp exp/<your_model_dir>/outputs_*_decode/<set_name>/feats.scp \
--outdir <path_to_outdir>
# In the case of ESPnet2, the generated features can be found in
# exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp.
$ parallel-wavegan-decode \
--checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
--feats-scp exp/<your_model_dir>/decode_*/<set_name>/norm/feats.scp \
--outdir <path_to_outdir>
# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
utt_id_1_gen.wav utt_id_2_gen.wav ... utt_id_N_gen.wav
Case 2: If you use different datasets for Text2Mel and Mel2Wav models
# In this case, you must additionally provide the `--normalize-before` option
# and use the `feats.scp` of the de-normalized generated features.
# ESPnet1 case
$ parallel-wavegan-decode \
--checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
--feats-scp exp/<your_model_dir>/outputs_*_decode_denorm/<set_name>/feats.scp \
--outdir <path_to_outdir> \
--normalize-before
# ESPnet2 case
$ parallel-wavegan-decode \
--checkpoint pretrain_model/<pretrain_model_tag>/checkpoint-400000steps.pkl \
--feats-scp exp/<your_model_dir>/decode_*/<set_name>/denorm/feats.scp \
--outdir <path_to_outdir> \
--normalize-before
# You can find the generated waveforms in <path_to_outdir>/.
$ ls <path_to_outdir>
utt_id_1_gen.wav utt_id_2_gen.wav ... utt_id_N_gen.wav
If you want to combine these models in Python, you can try the real-time demonstration in Google Colab!
Real-time demonstration with ESPnet2
Real-time demonstration with ESPnet1
Decoding with dumped npy files
Sometimes we want to decode from dumped npy files, i.e., mel-spectrograms generated by TTS models. Please make sure you use the same feature extraction settings as the pretrained vocoder (fs, fft_size, hop_size, win_length, fmin, and fmax). Only a difference in log_base can be corrected with some post-processing (we use log base 10 instead of the natural log by default). See details in the comment.
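For instance, if your TTS model outputs natural-log mel-spectrograms, you can convert them to the log10 convention with the change-of-base formula log10(x) = ln(x) / ln(10). A minimal sketch with hypothetical file names:
$ python <<EOF
import numpy as np
mel_ln = np.load("mel_natural_log.npy")  # (#frames, #mels), natural-log scale
mel_log10 = mel_ln / np.log(10.0)        # change of base: log10(x) = ln(x) / ln(10)
np.save("mel_log10.npy", mel_log10)
EOF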
# Generate dummy npy file of mel-spectrogram
$ ipython
[ins] In [1]: import numpy as np
[ins] In [2]: x = np.random.randn(512, 80) # (#frames, #mels)
[ins] In [3]: np.save("dummy_1.npy", x)
[ins] In [4]: y = np.random.randn(256, 80) # (#frames, #mels)
[ins] In [5]: np.save("dummy_2.npy", y)
[ins] In [6]: exit
# Make scp file (key-path format)
$ find . -name "*.npy" | sort | awk '{print "dummy_" NR " " $1}' > feats.scp
# Check (<utt_id> <path>)
$ cat feats.scp
dummy_1 ./dummy_1.npy
dummy_2 ./dummy_2.npy
# Decode without feature normalization
# This case assumes that the input mel-spectrogram is normalized with the same statistics as the pretrained model.
$ parallel-wavegan-decode \
--checkpoint /path/to/checkpoint-400000steps.pkl \
--feats-scp ./feats.scp \
--outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).
# Decode with feature normalization
# This case assumes that the input mel-spectrogram is not normalized.
$ parallel-wavegan-decode \
--checkpoint /path/to/checkpoint-400000steps.pkl \
--feats-scp ./feats.scp \
--normalize-before \
--outdir wav
2021-08-10 09:13:07,624 (decode:140) INFO: The number of features to be decoded = 2.
[decode]: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 13.84it/s, RTF=0.00264]
2021-08-10 09:13:29,660 (decode:174) INFO: Finished generation of 2 utterances (RTF = 0.005).
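To sanity-check the generated audio, you can inspect the waveform shape and sampling rate with soundfile; the file name below assumes the `*_gen.wav` naming convention shown above:
$ python -c "import soundfile as sf; x, fs = sf.read('wav/dummy_1_gen.wav'); print(x.shape, fs)"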