Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.
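To illustrate the idea behind language model fusion during decoding (independent of Espresso's actual look-ahead implementation), here is a minimal sketch of shallow fusion: at each decoding step, the ASR model's log-probability for a candidate token is interpolated with a weighted external LM log-probability. The function name `shallow_fusion_score`, the toy distributions, and the weight value are all hypothetical, for illustration only.

```python
import math

def shallow_fusion_score(asr_logprobs, lm_logprobs, lm_weight=0.5):
    """Fuse ASR and LM scores: log P_asr(tok) + lm_weight * log P_lm(tok).

    Both inputs are dicts mapping candidate tokens to log-probabilities;
    returns a dict of fused scores over the same candidates.
    """
    return {tok: asr_logprobs[tok] + lm_weight * lm_logprobs[tok]
            for tok in asr_logprobs}

# Toy next-token distributions (log domain) over a 3-token vocabulary.
asr = {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}
lm  = {"a": math.log(0.2), "b": math.log(0.7), "c": math.log(0.1)}

fused = shallow_fusion_score(asr, lm, lm_weight=0.5)
best = max(fused, key=fused.get)
```

In a real beam-search decoder this fused score would be accumulated per hypothesis at every step; Espresso's look-ahead word-based fusion additionally scores subword hypotheses against a word-level LM.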
We provide state-of-the-art training recipes for a number of speech datasets.
# Requirements and Installation
- PyTorch version >= 1.5.0
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install Espresso from source and develop locally:
```bash
git clone https://github.com/freewym/espresso
cd espresso
pip install --editable .

# on MacOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./

pip install kaldi_io sentencepiece soundfile
cd espresso/tools; make KALDI=<path/to/a/compiled/kaldi/directory>
```
- kaldi_io is required for reading kaldi scp files. sentencepiece is required for subword pieces training/encoding. soundfile is required for reading raw waveform files.
- Kaldi is required for data preparation, feature extraction, scoring for some datasets (e.g., Switchboard), and decoding for all hybrid systems. Also add your Python path to the PATH variable in examples/asr_<dataset>/path.sh; the current default is ~/anaconda3/bin.
- To build OpenFst and PyChain, set the PYTHON_DIR variable in espresso/tools/Makefile (default: ~/anaconda3/bin), and then:

```bash
cd espresso/tools; make openfst pychain
```
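For readers unfamiliar with the Kaldi scp files that kaldi_io consumes: an scp file is a plain-text map from an utterance ID to the location of its data in an archive, one entry per line. The sketch below parses that format; `parse_scp` is a hypothetical helper for illustration only, as real code would simply iterate over `kaldi_io.read_mat_scp(...)`.

```python
def parse_scp(lines):
    """Parse Kaldi scp lines of the form '<utt_id> <archive_path>[:offset]'.

    Returns a list of (utterance_id, location) tuples, skipping blank lines.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first whitespace: everything after it is the location.
        utt_id, location = line.split(None, 1)
        entries.append((utt_id, location))
    return entries

# Example entries in the style of a feats.scp file (paths are made up).
scp_text = [
    "utt1 feats.ark:13",
    "utt2 feats.ark:2748",
]
entries = parse_scp(scp_text)
```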
- For faster training, install NVIDIA's apex library:
```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
```