ESPnet

ESPnet is an end-to-end speech processing toolkit, mainly focusing on end-to-end speech recognition and end-to-end text-to-speech. ESPnet uses Chainer and PyTorch as its main deep learning engines, and also follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

Key Features

Kaldi-style complete recipes

  • Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
  • Supports a number of TTS recipes in a manner similar to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports a number of MT recipes (IWSLT'16, the above ST recipes, etc.)
  • Supports a speech separation and recognition recipe (WSJ-2mix)
  • Supports a voice conversion recipe (VCC2020 baseline) (new!)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention-based end-to-end ASR (the multitask loss and joint decoding score are sketched after this list)
    • Fast and accurate training with CTC/attention multitask learning
    • CTC/attention joint decoding to encourage monotonic alignment during decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), or Transformer
  • Attention: dot-product attention, location-aware attention, and multi-head variants
  • Incorporates RNNLM/LSTMLM/TransformerLM/N-gram language models trained only on text data
  • Batch GPU decoding
  • Transducer-based end-to-end ASR
    • Available: RNN-based encoder/decoder or custom encoder/decoder with support for Transformer, Conformer, and TDNN (encoder) and causal Conv1d (decoder) blocks.
    • Also supported: mixed RNN/custom encoder-decoder, VGG2L (for the RNN/custom encoder), and various decoding algorithms.

    Please refer to the tutorial page for complete documentation.

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples supporting endangered language documentation (please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec 2.0 pretrained model as the encoder, imported from fairseq
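
The hybrid CTC/attention items above follow the formulation from the ESPnet papers: training optimizes a weighted sum of the CTC and attention losses, and joint decoding combines both scores. As a sketch, with CTC weight \lambda (0 <= \lambda <= 1):

    % multitask training objective
    \mathcal{L}_{\mathrm{MTL}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{Att}}

    % joint decoding score (an LM score can be added analogously)
    \hat{Y} = \operatorname*{arg\,max}_{Y} \bigl\{ \lambda \log p_{\mathrm{ctc}}(Y \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(Y \mid X) \bigr\}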

Demonstration

  • Real-time ASR demo with ESPnet2 (Open In Colab)
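
For reference, a minimal ESPnet2 ASR inference sketch using a pretrained model from the ESPnet Model Zoo (assumes the espnet_model_zoo package is installed; "user/model_tag" and "speech.wav" are placeholders, not real entries):

    import soundfile
    from espnet_model_zoo.downloader import ModelDownloader
    from espnet2.bin.asr_inference import Speech2Text

    # download_and_unpack() caches the model and returns the config/model
    # paths as keyword arguments for Speech2Text.
    d = ModelDownloader()
    speech2text = Speech2Text(**d.download_and_unpack("user/model_tag"))

    # Each n-best entry is (text, tokens, token_ids, hypothesis object).
    speech, rate = soundfile.read("speech.wav")
    nbests = speech2text(speech)
    text, *_ = nbests[0]
    print(text)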

TTS: Text-to-speech

  • Tacotron2
  • Transformer-TTS
  • FastSpeech
  • FastSpeech2 (in ESPnet2)
  • Conformer-based FastSpeech & FastSpeech2 (in ESPnet2)
  • Multi-speaker model with pretrained speaker embedding
  • Multi-speaker model with GST (in ESPnet2)
  • Phoneme-based training (En, Jp, and Zh)
  • Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)

Demonstration

  • Real-time TTS demo with ESPnet2 (Open In Colab)
  • Real-time TTS demo with ESPnet1 (Open In Colab)

To train a neural vocoder, please check the corresponding vocoder repositories (e.g., kan-bayashi/ParallelWaveGAN and r9y9/wavenet_vocoder).

NOTE:

  • We are moving to ESPnet2-based development for TTS.
  • If you are a beginner, we recommend using ESPnet2-TTS.
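
For reference, a minimal ESPnet2-TTS inference sketch (assumes espnet_model_zoo is installed; "user/model_tag" is a placeholder, and the dict-style output noted below may differ across versions):

    import soundfile
    from espnet_model_zoo.downloader import ModelDownloader
    from espnet2.bin.tts_inference import Text2Speech

    # Replace "user/model_tag" with a real tag from the ESPnet Model Zoo.
    d = ModelDownloader()
    text2speech = Text2Speech(**d.download_and_unpack("user/model_tag"))

    # Recent versions return a dict whose "wav" entry is the waveform tensor.
    output = text2speech("Hello world")
    fs = 22050  # assumption: set this to the chosen model's sampling rate
    soundfile.write("out.wav", output["wav"].numpy(), fs)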

SE: Speech enhancement (and separation)

  • Single-speaker speech enhancement
  • Multi-speaker speech separation
  • Unified encoder-separator-decoder structure for time-domain and frequency-domain models
    • Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
    • Separators: BLSTM, Transformer, Conformer, DPRNN, Neural Beamformers, etc.
  • Flexible ASR integration: working as an individual task or as the ASR frontend
  • Easy import of pretrained models from Asteroid
    • Both pretrained models from Asteroid and their specific configurations are supported.

Demonstration

  • Interactive SE demo with ESPnet2 (Open In Colab)
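
For reference, a minimal enhancement/separation sketch using the ESPnet2 SeparateSpeech interface (the config/model paths are placeholders pointing at a trained model; argument names follow espnet2.bin.enh_inference):

    import soundfile
    from espnet2.bin.enh_inference import SeparateSpeech

    # Placeholders: point these at a trained enhancement/separation model.
    separate_speech = SeparateSpeech(
        train_config="exp/enh_train/config.yaml",
        model_file="exp/enh_train/valid.loss.best.pth",
    )

    # Input is batched as (batch, num_samples); the output is a list with
    # one enhanced/separated waveform array per estimated speaker.
    mixture, fs = soundfile.read("mixture.wav")
    waves = separate_speech(mixture[None, :], fs=fs)
    for spk, wav in enumerate(waves):
        soundfile.write(f"speaker{spk}.wav", wav.squeeze(0), fs)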

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer-based end-to-end ST (new!)
  • Transformer-based end-to-end MT (new!)

VC: Voice conversion

  • Transformer- and Tacotron2-based parallel VC using mel-spectrogram features (new!)
  • End-to-end VC based on cascaded ASR+TTS (baseline system for the Voice Conversion Challenge 2020!)

DNN Framework

  • Flexible network architecture thanks to Chainer and PyTorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard based monitoring

ESPnet2

See ESPnet2.

  • Independent of Kaldi/Chainer, unlike ESPnet1
  • On-the-fly feature extraction and text processing during training
  • Supports both DistributedDataParallel and DataParallel
  • Supports multi-node training, integrated with Slurm or MPI
  • Supports sharded training provided by fairscale
  • A template recipe that can be applied to all corpora
  • Can train on corpora of any size without CPU memory errors
  • ESPnet Model Zoo
  • Integrated with wandb

Installation

  • If you intend to do full experiments including DNN training, then see Installation.

  • If you only need the Python module:

    pip install espnet
    # To install the latest development version
    # pip install git+https://github.com/espnet/espnet
    

    You also need to install some additional packages:

    pip install torch
    pip install chainer==6.0.0 cupy==6.0.0    # [Optional] if you use ESPnet1
    pip install torchaudio                    # [Optional] if you use the enhancement task
    pip install torch_optimizer               # [Optional] if you use additional optimizers in ESPnet2
    

    Some tasks require additional packages beyond the above. If you encounter an ImportError, please install them at that point.
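
    To quickly verify the basic installation (a minimal check; assumes the pip installation above succeeded):

    # Run in a Python interpreter: prints the installed ESPnet version.
    import espnet
    print(espnet.__version__)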

  • (ESPnet2) Once installed, run wandb login and set --use_wandb true to enable run tracking with Weights & Biases.

GitHub

https://github.com/espnet/espnet