ESPnet

ESPnet is an end-to-end speech processing toolkit, mainly focusing on end-to-end speech recognition and end-to-end text-to-speech. ESPnet uses Chainer and PyTorch as its main deep learning engines, and also follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

Key Features

Kaldi-style complete recipes

  • Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
  • Supports a number of TTS recipes in a manner similar to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports a number of MT recipes (IWSLT'16, the above ST recipes, etc.)
  • Supports a speech separation and recognition recipe (WSJ-2mix)
  • Supports a voice conversion recipe (VCC2020 baseline) (new!)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention-based end-to-end ASR (the multitask loss and joint decoding score are sketched after this list)
    • Fast and accurate training with CTC/attention multitask learning
    • CTC/attention joint decoding to encourage monotonic alignment during decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), or Transformer
  • Attention: dot-product attention, location-aware attention, and multi-head variants
  • Incorporates RNNLM/LSTMLM/TransformerLM/N-gram language models trained only on text data
  • Batch GPU decoding
  • Transducer-based end-to-end ASR
    • Available: RNN-based encoder/decoder or custom encoder/decoder with support for Transformer, Conformer, and TDNN (encoder) and causal Conv1d (decoder) blocks.
    • Also supported: mixed RNN/custom encoder-decoder, VGG2L (for the RNN/custom encoder), and various decoding algorithms.

    Please refer to the tutorial page for complete documentation.

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples supporting endangered language documentation (please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec 2.0 pretrained model as the encoder, imported from fairseq
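
The hybrid CTC/attention items above follow the formulation from the ESPnet papers: training optimizes a weighted sum of the CTC and attention losses, and joint decoding combines both scores. As a sketch, with CTC weight \lambda (0 <= \lambda <= 1):

    % multitask training objective
    \mathcal{L}_{\mathrm{MTL}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{Att}}

    % joint decoding score (an LM score can be added analogously)
    \hat{Y} = \operatorname*{arg\,max}_{Y} \bigl\{ \lambda \log p_{\mathrm{ctc}}(Y \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(Y \mid X) \bigr\}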

Demonstration

  • Real-time ASR demo with ESPnet2 (Open In Colab)
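
For reference, a minimal ESPnet2 ASR inference sketch using a pretrained model from the ESPnet Model Zoo (assumes the espnet_model_zoo package is installed; "user/model_tag" and "speech.wav" are placeholders, not real entries):

    import soundfile
    from espnet_model_zoo.downloader import ModelDownloader
    from espnet2.bin.asr_inference import Speech2Text

    # download_and_unpack() caches the model and returns the config/model
    # paths as keyword arguments for Speech2Text.
    d = ModelDownloader()
    speech2text = Speech2Text(**d.download_and_unpack("user/model_tag"))

    # Each n-best entry is (text, tokens, token_ids, hypothesis object).
    speech, rate = soundfile.read("speech.wav")
    nbests = speech2text(speech)
    text, *_ = nbests[0]
    print(text)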

TTS: Text-to-speech

  • Tacotron2
  • Transformer-TTS
  • FastSpeech
  • FastSpeech2 (in ESPnet2)
  • Conformer-based FastSpeech & FastSpeech2 (in ESPnet2)
  • Multi-speaker model with pretrained speaker embedding
  • Multi-speaker model with GST (in ESPnet2)
  • Phoneme-based training (En, Jp, and Zh)
  • Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)

Demonstration

  • Real-time TTS demo with ESPnet2 (Open In Colab)
  • Real-time TTS demo with ESPnet1 (Open In Colab)

To train a neural vocoder, please check the corresponding vocoder repositories (e.g., kan-bayashi/ParallelWaveGAN and r9y9/wavenet_vocoder).

NOTE:

  • We are moving to ESPnet2-based development for TTS.
  • If you are a beginner, we recommend using ESPnet2-TTS.
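
For reference, a minimal ESPnet2-TTS inference sketch (assumes espnet_model_zoo is installed; "user/model_tag" is a placeholder, and the dict-style output noted below may differ across versions):

    import soundfile
    from espnet_model_zoo.downloader import ModelDownloader
    from espnet2.bin.tts_inference import Text2Speech

    # Replace "user/model_tag" with a real tag from the ESPnet Model Zoo.
    d = ModelDownloader()
    text2speech = Text2Speech(**d.download_and_unpack("user/model_tag"))

    # Recent versions return a dict whose "wav" entry is the waveform tensor.
    output = text2speech("Hello world")
    fs = 22050  # assumption: set this to the chosen model's sampling rate
    soundfile.write("out.wav", output["wav"].numpy(), fs)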

SE: Speech enhancement (and separation)

  • Single-speaker speech enhancement
  • Multi-speaker speech separation
  • Unified encoder-separator-decoder structure for time-domain and frequency-domain models
    • Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
    • Separators: BLSTM, Transformer, Conformer, DPRNN, Neural Beamformers, etc.
  • Flexible ASR integration: working as an individual task or as the ASR frontend
  • Easy import of pretrained models from Asteroid
    • Both pretrained models from Asteroid and their specific configurations are supported.

Demonstration

  • Interactive SE demo with ESPnet2 (Open In Colab)
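
For reference, a minimal enhancement/separation sketch using the ESPnet2 SeparateSpeech interface (the config/model paths are placeholders pointing at a trained model; argument names follow espnet2.bin.enh_inference):

    import soundfile
    from espnet2.bin.enh_inference import SeparateSpeech

    # Placeholders: point these at a trained enhancement/separation model.
    separate_speech = SeparateSpeech(
        train_config="exp/enh_train/config.yaml",
        model_file="exp/enh_train/valid.loss.best.pth",
    )

    # Input is batched as (batch, num_samples); the output is a list with
    # one enhanced/separated waveform array per estimated speaker.
    mixture, fs = soundfile.read("mixture.wav")
    waves = separate_speech(mixture[None, :], fs=fs)
    for spk, wav in enumerate(waves):
        soundfile.write(f"speaker{spk}.wav", wav.squeeze(0), fs)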

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer-based end-to-end ST (new!)
  • Transformer-based end-to-end MT (new!)

VC: Voice conversion

  • Transformer- and Tacotron2-based parallel VC using mel-spectrogram features (new!)
  • End-to-end VC based on cascaded ASR+TTS (baseline system for the Voice Conversion Challenge 2020!)

DNN Framework

  • Flexible network architecture thanks to Chainer and PyTorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard based monitoring

ESPnet2

See ESPnet2.

  • Independent of Kaldi/Chainer, unlike ESPnet1
  • On-the-fly feature extraction and text processing during training
  • Supports both DistributedDataParallel and DataParallel
  • Supports multi-node training, integrated with Slurm or MPI
  • Supports sharded training provided by fairscale
  • A template recipe that can be applied to all corpora
  • Can train on corpora of any size without CPU memory errors
  • ESPnet Model Zoo
  • Integrated with wandb

Installation

  • If you intend to do full experiments including DNN training, then see Installation.

  • If you only need the Python module:

    pip install espnet
    # To install the latest development version
    # pip install git+https://github.com/espnet/espnet
    

    You also need to install some additional packages:

    pip install torch
    pip install chainer==6.0.0 cupy==6.0.0    # [Optional] if you use ESPnet1
    pip install torchaudio                    # [Optional] if you use the enhancement task
    pip install torch_optimizer               # [Optional] if you use additional optimizers in ESPnet2
    

    Some tasks require additional packages beyond the above. If you encounter an ImportError, please install them at that point.
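
    To quickly verify the basic installation (a minimal check; assumes the pip installation above succeeded):

    # Run in a Python interpreter: prints the installed ESPnet version.
    import espnet
    print(espnet.__version__)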

  • (ESPnet2) Once installed, run wandb login and set --use_wandb true to enable run tracking with Weights & Biases.

GitHub

https://github.com/espnet/espnet