ESPnet
ESPnet is an end-to-end speech processing toolkit that mainly focuses on end-to-end speech recognition and end-to-end text-to-speech. ESPnet uses Chainer and PyTorch as its main deep learning engines, and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.
Key Features
Kaldi style complete recipe
- Support a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
- Support a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
- Support a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
- Support a number of MT recipes (IWSLT'16, the above ST recipes, etc.)
- Support speech separation and recognition recipe (WSJ-2mix)
- Support voice conversion recipe (VCC2020 baseline) (new!)
ASR: Automatic Speech Recognition
- State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
- Hybrid CTC/attention based end-to-end ASR
- Fast/accurate training with CTC/attention multitask training
- CTC/attention joint decoding to boost monotonic alignment decoding
- Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU) or Transformer
- Attention: Dot product, location-aware attention, variants of multihead
- Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
- Batch GPU decoding
- Transducer based end-to-end ASR
- Available: RNN-based encoder/decoder or custom encoder/decoder with support for Transformer, Conformer, TDNN (encoder) and causal conv1d (decoder) blocks.
- Also supported: mixed RNN/custom encoder-decoder, VGG2L (RNN/custom encoder) and various decoding algorithms.
Please refer to the tutorial page for complete documentation.
- CTC segmentation
- Non-autoregressive model based on Mask-CTC
- ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
- Wav2Vec2.0 pretrained model as Encoder, imported from FairSeq.
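The hybrid CTC/attention items above combine two objectives during training and two scores during decoding. The following is a minimal sketch in plain Python of that interpolation, assuming a weight called `ctc_weight` and an optional language-model weight `lm_weight`. The function names and the toy numbers are hypothetical and are not ESPnet's API.

```python
def joint_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Multitask objective: interpolate the CTC and attention losses."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(ctc_logp, att_logp, lm_logp=0.0, ctc_weight=0.3, lm_weight=0.0):
    """Joint decoding score for one hypothesis (inputs are log-probabilities)."""
    return (ctc_weight * ctc_logp
            + (1.0 - ctc_weight) * att_logp
            + lm_weight * lm_logp)


def rescore(hypotheses, ctc_weight=0.3, lm_weight=0.0):
    """Sort hypotheses best first. Each is (text, ctc_logp, att_logp, lm_logp)."""
    return sorted(
        hypotheses,
        key=lambda h: joint_score(h[1], h[2], h[3], ctc_weight, lm_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy numbers, made up for illustration only.
    hyps = [("hello word", -4.0, -1.0, -6.0),
            ("hello world", -2.0, -1.5, -3.0)]
    print(joint_loss(ctc_loss=2.0, att_loss=1.0, ctc_weight=0.5))
    print(rescore(hyps, ctc_weight=0.0)[0][0])
    print(rescore(hyps, ctc_weight=0.5, lm_weight=0.3)[0][0])
```

In this toy case the attention score alone prefers the first hypothesis, while adding the CTC and language-model terms changes the ranking. A real decoder applies the same kind of interpolation to partial hypotheses inside beam search rather than to a finished list.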
Demonstration
TTS: Text-to-speech
- Tacotron2
- Transformer-TTS
- FastSpeech
- FastSpeech2 (in ESPnet2)
- Conformer-based FastSpeech & FastSpeech2 (in ESPnet2)
- Multi-speaker model with pretrained speaker embedding
- Multi-speaker model with GST (in ESPnet2)
- Phoneme-based training (En, Jp, and Zh)
- Integration with neural vocoders (WaveNet, ParallelWaveGAN, and MelGAN)
Demonstration
To train the neural vocoder, please check the following repositories:
NOTE:
- We are moving to ESPnet2-based development for TTS.
- If you are a beginner, we recommend using ESPnet2-TTS.
SE: Speech enhancement (and separation)
- Single-speaker speech enhancement
- Multi-speaker speech separation
- Unified encoder-separator-decoder structure for time-domain and frequency-domain models
- Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
- Separators: BLSTM, Transformer, Conformer, DPRNN, Neural Beamformers, etc.
- Flexible ASR integration: working as an individual task or as the ASR frontend
- Easy to import pretrained models from Asteroid
- Both the pre-trained models from Asteroid and the specific configuration are supported.
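The unified encoder-separator-decoder structure listed above can be pictured as three composable stages. The sketch below is only an illustration, assuming a mask-based separator. The function `separate` and the toy stand-ins `identity` and `fixed_masks` are hypothetical names, not ESPnet classes.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def separate(
    mixture: Sequence[float],
    encoder: Callable[[Sequence[float]], Vector],
    separator: Callable[[Vector], List[Vector]],
    decoder: Callable[[Vector], Vector],
) -> List[Vector]:
    """Run encoder -> separator -> decoder and return one signal per source."""
    feats = encoder(mixture)    # e.g. STFT or a learned convolution
    masks = separator(feats)    # one mask per source, same length as feats
    # Apply each mask to the shared features, then map back to a waveform.
    return [decoder([f * m for f, m in zip(feats, mask)]) for mask in masks]


# Toy stand-ins (hypothetical): identity transforms and fixed masks.
def identity(x: Sequence[float]) -> Vector:
    return list(x)


def fixed_masks(feats: Vector) -> List[Vector]:
    return [[0.25] * len(feats), [0.75] * len(feats)]


if __name__ == "__main__":
    estimates = separate([1.0, -2.0, 4.0], identity, fixed_masks, identity)
    print(estimates)
```

Swapping the encoder/decoder pair (STFT/iSTFT versus convolution/transposed convolution) or the separator changes the model without touching the surrounding pipeline, which is the point of keeping the three stages separate.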
Demonstration
ST: Speech Translation & MT: Machine Translation
- State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
- Transformer based end-to-end ST (new!)
- Transformer based end-to-end MT (new!)
VC: Voice conversion
- Transformer and Tacotron2 based parallel VC using melspectrogram (new!)
- End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)
DNN Framework
- Flexible network architecture thanks to chainer and pytorch
- Flexible front-end processing thanks to kaldiio and HDF5 support
- Tensorboard based monitoring
ESPnet2
See ESPnet2.
- Independent of Kaldi/Chainer, unlike ESPnet1
- On-the-fly feature extraction and text processing during training
- Supports both DistributedDataParallel and DataParallel
- Supports multi-node training, integrated with Slurm or MPI
- Supporting Sharded Training provided by fairscale
- A template recipe which can be applied for all corpora
- Possible to train on a corpus of any size without running out of CPU memory
- ESPnet Model Zoo
- Integrated with wandb
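To illustrate what on-the-fly feature extraction and text processing can look like, here is a minimal sketch of a dataset that stores only lightweight references and computes features when an item is requested. The class `LazyDataset` and the toy feature `frame_energy` are hypothetical and are not ESPnet2's implementation.

```python
class LazyDataset:
    """Holds only (utterance id, loader, transcript); features are built on access."""

    def __init__(self, entries, extract, tokenize):
        self.entries = entries      # list of (utt_id, load_fn, text)
        self.extract = extract      # waveform -> features
        self.tokenize = tokenize    # text -> tokens

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, index):
        utt_id, load_fn, text = self.entries[index]
        waveform = load_fn()        # read audio only when requested
        return utt_id, self.extract(waveform), self.tokenize(text)


def frame_energy(waveform, frame_len=2):
    """Toy feature: sum of squares over non-overlapping frames."""
    return [
        sum(s * s for s in waveform[i:i + frame_len])
        for i in range(0, len(waveform), frame_len)
    ]


if __name__ == "__main__":
    data = LazyDataset(
        entries=[("utt1", lambda: [1.0, 2.0, 3.0, 4.0], "hello world")],
        extract=frame_energy,
        tokenize=str.split,
    )
    print(data[0])
```

Because no features are precomputed or cached, memory use depends on the batch being processed rather than on the size of the corpus.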
Installation
- If you intend to do full experiments including DNN training, then see Installation.
- If you just need the Python module only:
    pip install espnet
    # To install latest
    # pip install git+https://github.com/espnet/espnet
You need to install some packages.
    pip install torch
    pip install chainer==6.0.0 cupy==6.0.0    # [Option] If you'll use ESPnet1
    pip install torchaudio                    # [Option] If you'll use enhancement task
    pip install torch_optimizer               # [Option] If you'll use additional optimizers in ESPnet2
Other packages may be required depending on the task. If you encounter an ImportError, please install the missing package at that time.
- (ESPnet2) Once installed, run
    wandb login
  and set --use_wandb true to enable tracking runs using W&B.
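Since the packages in the installation notes above are installed separately, it can help to check which ones are present before a run fails with an ImportError. The sketch below uses only the standard library. The `OPTIONAL_PACKAGES` table and the `missing_packages` helper are hypothetical names, and the import names are assumed to match the pip package names shown above.

```python
import importlib.util

# Import names of the packages mentioned in the installation notes,
# mapped to what they are needed for.
OPTIONAL_PACKAGES = {
    "torch": "required",
    "chainer": "ESPnet1",
    "cupy": "ESPnet1",
    "torchaudio": "enhancement task",
    "torch_optimizer": "additional optimizers in ESPnet2",
}


def missing_packages(packages=OPTIONAL_PACKAGES):
    """Return the names of top-level packages that cannot be found."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_packages():
        print(f"{name} is not installed (needed for: {OPTIONAL_PACKAGES[name]})")
```

`importlib.util.find_spec` only locates a package without importing it, so the check stays fast and has no side effects.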