UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
This is an unofficial PyTorch implementation of Jang et al. (Kakao), UnivNet.
- [ ] Release checkpoint of pre-trained model
- [ ] Extract wav samples for audio sample page
- [ ] Add results including validation loss graph
According to the authors of the paper, UnivNet obtained the best objective results among recent GAN-based neural vocoders (including HiFi-GAN) and outperformed HiFi-GAN in a subjective evaluation. The authors also report that its inference is 1.5 times faster than HiFi-GAN's.
Our default mel calculation hyperparameters are as below, following the original paper.
audio:
  n_mel_channels: 100
  filter_length: 1024
  hop_length: 256 # WARNING: this can't be changed.
  win_length: 1024
  sampling_rate: 24000
  mel_fmin: 0.0
  mel_fmax: 12000.0
You can modify the hyperparameters to be compatible with your acoustic model.
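When adapting these values to another acoustic model, it can help to check a few derived quantities first. The sketch below only does arithmetic on the defaults listed above; the `audio` dictionary and the variable names are illustrative and are not part of this repository.

```python
# Default mel hyperparameters, copied from the config above.
audio = {
    "n_mel_channels": 100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 24000,
    "mel_fmin": 0.0,
    "mel_fmax": 12000.0,
}

# mel_fmax should not exceed the Nyquist frequency (half the sampling rate).
nyquist = audio["sampling_rate"] / 2
assert audio["mel_fmax"] <= nyquist

# Each mel frame advances by hop_length samples.
frame_shift_ms = 1000 * audio["hop_length"] / audio["sampling_rate"]
frames_per_second = audio["sampling_rate"] / audio["hop_length"]

print(f"Nyquist frequency: {nyquist} Hz")
print(f"Frame shift: {frame_shift_ms:.2f} ms")
print(f"Frames per second: {frames_per_second}")
```

If your acoustic model uses a different sampling rate or frame shift, the same arithmetic shows quickly whether its mel settings line up with the vocoder's.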
This implementation needs the following dependencies.
- Python 3.6
- PyTorch 1.6.0
- NumPy 1.17.4 and SciPy 1.5.4
- Install other dependencies in requirements.txt.
pip install -r requirements.txt
- Download the training dataset. This can be any set of wav files with a sampling rate of 24,000 Hz. The original paper used LibriTTS.
- LibriTTS train-clean-360 split tar.gz link
- Unzip and place its contents under
- If you want to use wav files with a different sampling rate, please edit the configuration file (see below).
Note: The mel-spectrograms calculated from the audio files will be saved as **.mel at first, and then loaded from disk afterwards.
Following the format from NVIDIA/tacotron2, the metadata should be formatted as:
path_to_wav|transcript|speaker_id
path_to_wav|transcript|speaker_id
...
Train/validation metadata for the LibriTTS train-clean-360 split are already prepared in
5% of the train-clean-360 utterances were randomly sampled for validation.
Since this model is a vocoder, the transcripts are NOT used during training.
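The metadata format above can be read with a few lines of Python. This is a minimal sketch, not the repository's data loader: `MetaEntry` and `parse_metadata` are hypothetical names, the sample paths are made up, and the sketch assumes that transcripts never contain the `|` character.

```python
from typing import List, NamedTuple


class MetaEntry(NamedTuple):
    """One metadata line (hypothetical helper type)."""
    wav_path: str
    transcript: str
    speaker_id: str


def parse_metadata(text: str) -> List[MetaEntry]:
    """Parse `path_to_wav|transcript|speaker_id` lines into entries."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("|")
        if len(parts) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(parts)}"
            )
        entries.append(MetaEntry(*parts))
    return entries


# Made-up sample lines in the format described above.
sample = "a/b.wav|Hello there.|100\na/c.wav|Second line.|200\n"
entries = parse_metadata(sample)

# A vocoder only needs the audio paths; the transcripts are ignored.
wav_paths = [entry.wav_path for entry in entries]
print(wav_paths)
```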
Preparing Configuration Files
Run cp config/default.yaml config/config.yaml and then edit the copied file.
Write down the root paths of the train/validation data in the data section. The data loader parses the list of files within each path recursively.
data:
  train_dir: 'datasets/' # root path of train data (either relative/absolute path is ok)
  train_meta: 'metadata/libritts_train_clean_360_train.txt' # relative path of metadata file from train_dir
  val_dir: 'datasets/' # root path of validation data
  val_meta: 'metadata/libritts_train_clean_360_val.txt' # relative path of metadata file from val_dir
We provide the default metadata for LibriTTS train-clean-360 split.
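Since the metadata paths are given relative to the data root, the full location of each metadata file is the root joined with the relative path. The sketch below illustrates this with the default values; the `data` dictionary mirrors the config, and the joining logic is an assumption about how the paths combine, not code taken from this repository.

```python
from pathlib import PurePosixPath

# Values copied from the default `data` section.
data = {
    "train_dir": "datasets/",
    "train_meta": "metadata/libritts_train_clean_360_train.txt",
    "val_dir": "datasets/",
    "val_meta": "metadata/libritts_train_clean_360_val.txt",
}

# Join each root directory with its metadata path (relative to that root).
train_meta_path = PurePosixPath(data["train_dir"]) / data["train_meta"]
val_meta_path = PurePosixPath(data["val_dir"]) / data["val_meta"]

print(train_meta_path)
print(val_meta_path)
```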
Modify channel_size in gen to switch between UnivNet-c16 and c32.
gen:
  noise_dim: 64
  channel_size: 32 # 32 or 16
  dilations: [1, 3, 9, 27]
  strides: [8, 8, 4]
  lReLU_slope: 0.2
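One plausible reading of the warning on hop_length in the audio section is that the generator expands each mel frame into hop_length waveform samples, so the product of strides would have to equal hop_length. This README does not state that explicitly, so the check below is a sketch under that assumption, using only the default values.

```python
from functools import reduce
from operator import mul

hop_length = 256      # from the audio section
strides = [8, 8, 4]   # from the gen section

# Assumed relationship: total upsampling factor == hop_length.
upsampling_factor = reduce(mul, strides, 1)
assert upsampling_factor == hop_length, (
    f"strides multiply to {upsampling_factor}, expected {hop_length}"
)

# Under this assumption, 100 mel frames map to this many samples.
n_frames = 100
n_samples = n_frames * upsampling_factor
print(f"{n_frames} frames -> {n_samples} samples")
```

If the assumption holds, changing hop_length would also require a set of strides whose product matches the new value.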
python trainer.py -c CONFIG_YAML_FILE -n NAME_OF_THE_RUN
tensorboard --logdir logs/
If you are running tensorboard on a remote machine, you can open the tensorboard page by adding
python inference.py -p CHECKPOINT_PATH -i INPUT_MEL_PATH
A pre-trained model will be released soon.
The model was trained on LibriTTS train-clean-360 split.
See audio samples at https://mindslab-ai.github.io/univnet/
Comparison with the results on paper
|Results in Paper (UnivNet-c32)||3.93±0.09||3.70||0.316|
Because this code is an unofficial implementation, there may be some differences from the original paper.
- Our UnivNet generator has a smaller number of parameters (c32: 5.11M, c16: 1.42M) than the paper (c32: 14.89M, c16: 4.00M). So far, we have not encountered any issues from using a smaller model size. If you run into any problems, please report them as an issue.
Implementation authors are:
- Kang-wook Kim @ MINDsLab Inc.
- Wonbin Jung @ MINDsLab Inc.