DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).

In addition, a more detailed and improved code framework, containing implementations of FastSpeech 2, DiffSpeech, and our NeurIPS-2021 work PortaSpeech, is coming soon ✨ ✨ ✨.

[Figures: DiffSinger/DiffSpeech at training (left); DiffSinger/DiffSpeech at inference (right)]

News:

Dec. 01, 2021: DiffSinger was accepted by AAAI-2022.
Sep. 29, 2021: Our recent work PortaSpeech: Portable and High-Quality Generative Text-to-Speech was accepted by NeurIPS-2021.
May 06, 2021: We submitted DiffSinger to arXiv.

Environments

conda create -n your_env_name python=3.8
source activate your_env_name
pip install -r requirements_2080.txt   # GPU 2080Ti, CUDA 10.2
# or
pip install -r requirements_3090.txt   # GPU 3090, CUDA 11.4

DiffSpeech (TTS version)

1. Data Preparation

a) Download and extract the LJ Speech dataset, then create a link to the dataset folder: ln -s /xxx/LJSpeech-1.1/ data/raw/

b) Download and unpack the ground-truth durations extracted by MFA (Montreal Forced Aligner): tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/

c) Run the following script to pack the dataset for training/inference.

CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.
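Before training, it can help to confirm that the three steps above left the expected folders in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the exact folder names assume that the `ln -s` and `mv` commands above were run as written from the repository root.

```python
from pathlib import Path

# Folder names inferred from the data-preparation commands above.
# This list and the helper below are illustrative assumptions,
# not code shipped with the repository.
EXPECTED_DIRS = (
    "data/raw/LJSpeech-1.1",                # symlink created in step a)
    "data/processed/ljspeech/mfa_outputs",  # MFA durations moved in step b)
    "data/binary/ljspeech",                 # output of binarize.py in step c)
)


def missing_dirs(repo_root="."):
    """Return the expected data folders that do not exist under repo_root."""
    root = Path(repo_root)
    # is_dir() follows symlinks, so the linked LJSpeech folder counts as present.
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    for d in missing_dirs():
        print(f"missing: {d}")
```

Running the file from the repository root prints one line per missing folder and prints nothing when all three are present.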

2. Training Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name xxx --reset --infer
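The training and inference commands differ only in the `--infer` flag, so they can be driven from one small wrapper. This is a sketch under stated assumptions: the functions `build_run_command` and `run` are hypothetical helpers, not part of the repository, and they only reuse the flags shown in the two commands above.

```python
import os
import subprocess
import sys


def build_run_command(config, exp_name, infer=False):
    """Assemble the tasks/run.py command line shown above as an argument list."""
    cmd = [
        sys.executable, "tasks/run.py",
        "--config", config,
        "--exp_name", exp_name,
        "--reset",
    ]
    if infer:
        # Same command as training, plus the inference switch.
        cmd.append("--infer")
    return cmd


def run(config, exp_name, infer=False, gpu="0"):
    """Run the command from the repository root on the selected GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    return subprocess.run(build_run_command(config, exp_name, infer),
                          env=env, check=True)


if __name__ == "__main__":
    # Print the inference command without executing it.
    print(" ".join(build_run_command("usr/configs/lj_ds_beta6.yaml", "xxx",
                                     infer=True)))
```

Using the same `exp_name` for both calls keeps training and inference pointed at the same experiment.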

We also provide:

the pre-trained model of DiffSpeech;
the pre-trained model of the HifiGAN vocoder;
the separately pre-trained FastSpeech 2 model used by the shallow diffusion mechanism in DiffSpeech.

Remember to put the pre-trained models in the checkpoints directory.

On choosing ‘k’ in shallow diffusion: we recommend the trick introduced in Appendix B of the paper. The config files already provide a suitable ‘k’ for the LJSpeech dataset.
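For intuition about what ‘k’ controls, the forward-diffusion step that shallow diffusion relies on can be sketched as follows. This is a dependency-free illustration, not the repository's code: the helper names, the linear schedule endpoints (1e-4 to 0.06 over 100 steps), and the value of k in the usage lines are assumptions made for the example. The actual ‘k’ should be taken from the config files.

```python
import math
import random


def linear_betas(num_steps, beta_start, beta_end):
    """Linearly spaced noise schedule beta_1 .. beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, k):
    """Cumulative product of (1 - beta_t) for t = 1 .. k."""
    prod = 1.0
    for beta in betas[:k]:
        prod *= 1.0 - beta
    return prod


def diffuse_to_step(x0, betas, k, rng):
    """Sample x_k ~ q(x_k | x_0) = N(sqrt(abar_k) * x_0, (1 - abar_k) * I)."""
    abar = alpha_bar(betas, k)
    scale, sigma = math.sqrt(abar), math.sqrt(1.0 - abar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


if __name__ == "__main__":
    betas = linear_betas(100, 1e-4, 0.06)   # assumed values, for illustration
    coarse_mel = [0.2, -0.4, 1.3]           # stand-in for an auxiliary decoder output
    k = 50                                  # hypothetical; use the config's value
    print(diffuse_to_step(coarse_mel, betas, k, random.Random(0)))
```

At inference, the reverse process then runs k denoising steps starting from this noised sample, instead of T steps starting from pure Gaussian noise. A larger k adds more noise to the auxiliary decoder's output before denoising begins.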

DiffSinger (SVS version)

0. Data Acquisition

WIP.
We will provide a form to apply for the PopCS dataset.

1. Data Preparation

WIP.
Similar to DiffSpeech.

2. Training Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6.yaml --exp_name xxx --reset
# or
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name xxx --reset

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config xxx --exp_name xxx --reset --infer

The pre-trained model for SVS will be provided soon.

Tensorboard

tensorboard --logdir_spec exp_name

Mel Visualization

Along the vertical axis, rows [0-80] show DiffSpeech and rows [80-160] show FastSpeech 2.

[Figure: DiffSpeech vs. FastSpeech 2]
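To compare the two models separately, the stacked image can be split back into its two halves. This is a minimal sketch, assuming the stacked mel is available as a sequence of 160 rows indexed along the vertical axis, with 80 mel bins per model. The function name is hypothetical, and whether row 0 is drawn at the top or the bottom depends on the plotting settings.

```python
def split_stacked_mel(stacked, n_bins=80):
    """Split a vertically stacked mel into (DiffSpeech rows, FastSpeech 2 rows).

    `stacked` is any sliceable sequence of rows; rows [0, n_bins) are taken as
    DiffSpeech and rows [n_bins, 2 * n_bins) as FastSpeech 2.
    """
    if len(stacked) != 2 * n_bins:
        raise ValueError(
            f"expected {2 * n_bins} rows, got {len(stacked)}"
        )
    return stacked[:n_bins], stacked[n_bins:]


if __name__ == "__main__":
    # Dummy data: 160 rows of 4 frames each.
    stacked = [[float(row)] * 4 for row in range(160)]
    diffspeech_rows, fastspeech2_rows = split_stacked_mel(stacked)
    print(len(diffspeech_rows), len(fastspeech2_rows))
```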

Audio Demos

Audio samples can be found in our demo page.

We also put some of the test-set audio samples generated by DiffSpeech+HifiGAN (marked as [P]) and GTmel+HifiGAN (marked as [G]) in resources/demos_1218. These samples correspond to the pre-trained DiffSpeech model listed above.

Citation

@misc{liu2021diffsinger,
  title={DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism},
  author={Jinglin Liu and Chengxi Li and Yi Ren and Feiyang Chen and Zhou Zhao},
  year={2021},
  eprint={2105.02446},
  archivePrefix={arXiv}
}

Acknowledgements

Our code is based on the following repos:

We also thank Keon Lee for a fast implementation of our work.
