What is deepaudio-speaker?
Deepaudio-speaker is a framework for training neural-network-based speaker embedders. It supports online audio augmentation thanks to torch-audiomentations. It includes, or will include, popular neural network architectures and losses used for speaker embedders.
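"Online" augmentation means that each training example is corrupted with freshly sampled parameters every time it is loaded, rather than being augmented once offline. The sketch below illustrates one common transform, mixing background noise at a random signal-to-noise ratio. It is a pure-Python illustration of the idea under stated assumptions, not torch-audiomentations' API; the function names and the 5 to 20 dB range are hypothetical choices.

```python
import math
import random


def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal` so that the mixture has the requested SNR (dB)."""
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as signal")
    noise = noise[: len(signal)]
    sig_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise has zero power")
    # Choose `scale` so that 10*log10(sig_power / (scale**2 * noise_power)) == snr_db.
    scale = math.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(signal, noise)]


def random_snr_augment(signal, noise, min_snr_db=5.0, max_snr_db=20.0, rng=random):
    """Hypothetical online augmentation: draw a new SNR on every call."""
    return mix_at_snr(signal, noise, rng.uniform(min_snr_db, max_snr_db))
```

Because the SNR is resampled on every call, the same utterance yields a different noisy version in each epoch.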
To make features such as mixed-precision training, multi-node training, and TPU training easy to use, I introduced PyTorch Lightning and Hydra into this framework (as pyannote-audio and openspeech do).
Deepaudio-tts is coming soon.
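With Hydra, command-line arguments such as `model.channels=1024` override fields of a nested configuration using dotted keys. The sketch below is a simplified, hypothetical re-implementation that shows only this dotted-key mechanism. It is not Hydra's code, and it deliberately ignores config groups: in real Hydra, an argument like `model=ecapa` selects a config file from the `model` group rather than assigning a string.

```python
import copy


def _parse_value(raw):
    """Interpret an override value as int, float, bool, or string (simplified)."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    if raw in ("true", "false"):
        return raw == "true"
    return raw


def apply_overrides(config, overrides):
    """Return a copy of `config` with `a.b.c=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"override must look like key=value: {item!r}")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise TypeError(f"{part!r} is not a nested config section")
        node[leaf] = _parse_value(raw)
    return result
```

For example, `apply_overrides(base, ["model.channels=1024"])` changes only `base["model"]["channels"]` in the returned copy and leaves the original dictionary untouched.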
Installation
conda create -n deepaudio python=3.8.5
conda activate deepaudio
conda install numpy cffi
conda install libsndfile=1.0.28 -c conda-forge
git clone https://github.com/deepaudio/deepaudio-speaker.git
cd deepaudio-speaker
pip install -e .
Get Started
Supported Datasets
VoxCeleb2
- Download the VoxCeleb dataset and follow this script to obtain the following directory structure:
/path/to/voxceleb/voxceleb1/dev/wav/id10001/1zcIwhmdeo4/00001.wav
/path/to/voxceleb/voxceleb1/test/wav/id10270/5r0dWxy17C8/00001.wav
/path/to/voxceleb/voxceleb2/dev/aac/id00012/21Uxsk56VDQ/00001.m4a
/path/to/voxceleb/voxceleb2/test/aac/id00017/01dfn2spqyE/00001.m4a
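In this layout each path encodes `<subset>/<split>/<format>/<speaker id>/<video id>/<utterance>`, and the speaker id directory is what a classification-style loss typically uses as the class label. The hypothetical helpers below (not part of deepaudio-speaker) parse such paths and assign contiguous integer labels to speakers, assuming the exact depth shown above.

```python
from collections import namedtuple
from pathlib import PurePosixPath

Utterance = namedtuple("Utterance", "subset split speaker_id video_id utt_id path")


def parse_voxceleb_path(path):
    """Split a VoxCeleb-style path into its components (hypothetical helper)."""
    p = PurePosixPath(path)
    if len(p.parts) < 6:
        raise ValueError(f"unexpected VoxCeleb path: {path}")
    # .../<subset>/<split>/<wav|aac>/<speaker>/<video>/<utterance>.<ext>
    subset, split, _fmt, speaker, video, filename = p.parts[-6:]
    return Utterance(subset, split, speaker, video, PurePosixPath(filename).stem, str(p))


def build_speaker_index(paths):
    """Map each speaker id to an integer label; sorting keeps labels deterministic."""
    speakers = sorted({parse_voxceleb_path(x).speaker_id for x in paths})
    return {spk: i for i, spk in enumerate(speakers)}
```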
Training examples
- Example 1: Train the ecapa-tdnn model with fbank features on GPU.
$ deepaudio-speaker-train \
dataset=voxceleb2 \
dataset.dataset_path=/your/path/to/voxceleb2/dev/wav/ \
model=ecapa \
model.channels=1024 \
feature=fbank \
lr_scheduler=warmup_reduce_lr_on_plateau \
trainer=gpu \
criterion=aamsoftmax
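The `criterion=aamsoftmax` option selects an additive angular margin softmax loss. The sketch below shows the standard ArcFace-style formulation for a single embedding in pure Python: embeddings and class weights are L2-normalized, the target logit becomes `s * cos(theta_y + m)`, the other logits stay `s * cos(theta_j)`, and cross-entropy is applied. It is an illustration, not deepaudio-speaker's implementation; the default `margin` and `scale` values are illustrative, and real training uses batched tensors with learnable weights.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def aam_softmax_loss(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Additive angular margin softmax loss for one embedding.

    class_weights: one weight vector per speaker class.
    target: index of the true speaker class.
    """
    e = _normalize(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, _normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # keep acos inside its domain
        if j == target:
            # Add the angular margin only to the true class.
            logits.append(scale * math.cos(math.acos(cos_theta) + margin))
        else:
            logits.append(scale * cos_theta)
    # Numerically stable cross-entropy: logsumexp(logits) - logits[target].
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]
```

The margin makes the true class harder to score, which pushes same-speaker embeddings closer together in angle. Many practical implementations also special-case `theta_y + m > pi`, which this sketch omits.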
- Example 2: Extract speaker embeddings with a trained model.
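Once embeddings have been extracted, a speaker verification trial is commonly scored by the cosine similarity between the two embeddings and accepted if the score exceeds a threshold. The sketch below shows that scoring step only; the function names are hypothetical, and the threshold is an assumption that would normally be tuned on a development set.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot score a zero embedding")
    return dot / (norm_a * norm_b)


def verify(emb_a, emb_b, threshold):
    """Accept the trial (same speaker) if the cosine score reaches `threshold`."""
    return cosine_score(emb_a, emb_b) >= threshold
```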