🤗 Transformers Wav2Vec2 + PyCTCDecode
Included is a file to create an ngram with KenLM as well as a simple evaluation script to
compare the results of using Wav2Vec2 with PyCTCDecode + KenLM vs. without using any language model.
Note: The scripts are written to be used on GPU. If you want to use a CPU instead,
simply remove all .to("cuda") occurrences in the scripts.
In a first step, one should install KenLM. For Ubuntu, it should be enough to follow the installation steps
described here. The installed KenLM should be moved into this repo for ./create_ngram.py to function correctly.
Alternatively, one can also make the lmplz binary file available as a lmplz bash command, so that lmplz can be run directly instead of being called by its path.
Next, some Python dependencies should be installed. Assuming PyTorch is installed, it should be sufficient to run
pip install -r requirements.txt.
In a first step, one should create an n-gram. E.g. for Polish, the command would be:
./create_ngram.py --language polish --path_to_ngram polish.arpa
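For orientation, the following is a minimal sketch of how KenLM's lmplz could be called from Python to estimate a 5-gram model. It is not the content of ./create_ngram.py. The function names and the file names in the usage note are hypothetical, and the sketch assumes that lmplz is reachable as a command and that the training text is already available as a plain-text file with one sentence per line.

```python
import shutil
import subprocess


def build_lmplz_command(order=5, lmplz="lmplz"):
    """Return the argument list for lmplz.

    lmplz reads the training text from stdin and writes the ARPA file to
    stdout; -o sets the order of the n-gram model.
    """
    return [lmplz, "-o", str(order)]


def create_ngram(text_path, arpa_path, order=5, lmplz="lmplz"):
    """Estimate an n-gram model from text_path and write it to arpa_path."""
    # Fail early with a readable error if the binary cannot be found.
    if shutil.which(lmplz) is None:
        raise FileNotFoundError(f"{lmplz!r} was not found; is KenLM installed?")
    with open(text_path, "rb") as source, open(arpa_path, "wb") as target:
        subprocess.run(
            build_lmplz_command(order, lmplz),
            stdin=source,
            stdout=target,
            check=True,  # raise if lmplz exits with an error
        )
```

With a hypothetical text file polish.txt, `create_ngram("polish.txt", "polish.arpa")` would write the language model to polish.arpa.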
After the language model is created, one should open the file, because a </s> token has to be added to it by hand.
The file should have a structure which looks more or less as follows:
\data\
ngram 1=86586
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206 <unk> 0
0 <s> -0.06677356
-3.4645514 drugi -0.2088903
...
Now it is very important to also add a </s> token to the n-gram
so that it can be correctly loaded. You can simply copy the line:
0 <s> -0.06677356
and change <s> to </s> in the copy. When doing this you should also increase
the ngram 1 count in the \data\ header by 1 (here from 86586 to 86587).
The new ngram should look as follows:
\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206 <unk> 0
0 <s> -0.06677356
0 </s> -0.06677356
-3.4645514 drugi -0.2088903
...
Now the ngram can be correctly used with PyCTCDecode.
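The manual edit described above can also be scripted. The following is a minimal sketch, not part of this repo: the function names are hypothetical, and it assumes the ARPA file has the layout shown above, with a \data\ header followed by a \1-grams: section that contains a <s> line.

```python
def add_eos_token(lines):
    """Return a copy of the ARPA lines with a </s> unigram added after <s>.

    The "ngram 1=" count in the header is increased by 1. If </s> is already
    present, or the expected lines are missing, the input is returned unchanged.
    """
    section = None
    count_index = None
    bos_index = None
    has_eos = False
    for i, line in enumerate(lines):
        text = line.strip()
        if text.startswith("\\") and text.endswith(":"):
            section = text  # e.g. "\1-grams:", "\2-grams:"
        elif text.startswith("ngram 1="):
            count_index = i
        elif section == "\\1-grams:":
            parts = text.split()
            # Unigram lines look like: log-prob, token, optional back-off.
            if len(parts) >= 2 and parts[1] == "<s>":
                bos_index = i
            elif len(parts) >= 2 and parts[1] == "</s>":
                has_eos = True
    if has_eos or bos_index is None or count_index is None:
        return list(lines)
    out = list(lines)
    count = int(lines[count_index].split("=")[1])
    out[count_index] = f"ngram 1={count + 1}"
    # Copy the <s> line and turn the copy into the </s> entry.
    out.insert(bos_index + 1, lines[bos_index].replace("<s>", "</s>", 1))
    return out


def fix_arpa_file(source_path, target_path):
    """Read an ARPA file, add the </s> unigram and write the result."""
    with open(source_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    with open(target_path, "w", encoding="utf-8") as f:
        f.write("\n".join(add_eos_token(lines)) + "\n")
```

Since the whole file is read into memory, this sketch is only meant for language models of moderate size.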
Having created the ngram, one can run:
./eval.py --language polish --path_to_ngram polish.arpa
to compare Wav2Vec2 + LM vs. Wav2Vec2 + No LM on Polish.
Without tuning any hyperparameters, the following results were obtained:
Comparison of Wav2Vec2 without Language model vs. Wav2Vec2 with `pyctcdecode` + KenLM 5gram.
Fine-tuned Wav2Vec2 models were used and evaluated on MLS datasets.
Take a closer look at `./eval.py` for comparison

==================================================portuguese==================================================
polish - No LM - | WER: 0.3069742867206763 | CER: 0.06054530156286364 | Time: 58.04590034484863
polish - With LM - | WER: 0.2291299753434308 | CER: 0.06211174564528545 | Time: 191.65409898757935
==================================================spanish==================================================
portuguese - No LM - | WER: 0.18208286674132138 | CER: 0.05016682956422096 | Time: 114.61633825302124
portuguese - With LM - | WER: 0.1487761958086706 | CER: 0.04489231909945738 | Time: 429.78511357307434
==================================================polish==================================================
spanish - No LM - | WER: 0.2581272104769545 | CER: 0.0703088156033147 | Time: 147.8634352684021
spanish - With LM - | WER: 0.14927852292116295 | CER: 0.052034208044195916 | Time: 563.0732748508453
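It can be seen that the word error rate (WER) is significantly improved when using PyCTCDecode + KenLM: it drops from about 0.31 to 0.23 for the polish rows, from 0.18 to 0.15 for the portuguese rows and from 0.26 to 0.15 for the spanish rows.
However, the character error rate (CER) does not improve as much (portuguese, spanish) or not at all (polish, where it goes from 0.061 to 0.062).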
This is expected since a language model pushes the decoder towards words that exist in the language’s vocabulary.
Wav2Vec2 without a LM produces many words that are more or less correct but contain a couple of spelling errors, and every such word counts as a full error for WER.
Those words are likely to be “corrected” by Wav2Vec2 + LM, leading to an improved WER. However, Wav2Vec2 already has a good character error rate since its
vocabulary is composed of characters, so a “word-based” language model does not help much on that metric.
Overall, WER is probably the more important metric though, so it might make a lot of sense to add a LM to Wav2Vec2.
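To make the difference between the two metrics concrete, here is a small, self-contained sketch that computes WER and CER with a plain Levenshtein distance. It is not necessarily how ./eval.py computes its metrics, and the example sentence is made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_item != hyp_item),  # substitution
                )
            )
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    reference_words = reference.split()
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "the quick brown fox"
    hypothesis = "the quik brown fox"  # one missing character
    print(wer(reference, hypothesis))  # 0.25: one of four words is wrong
    print(cer(reference, hypothesis))  # 1/19: one of nineteen characters is wrong
```

A single missing character costs a quarter of the words here but only about 5% of the characters, which is why fixing such spelling errors moves WER much more than CER.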
In terms of speed, adding a LM makes the evaluation considerably slower: in the runs above it takes roughly 3.3 to 3.8 times as long. However, the script is not at all optimized for speed,
so using multi-processing and batched inference would significantly speed up both Wav2Vec2 without LM and with LM.
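The trade-off between accuracy and speed can be read directly off the results. The short sketch below only does arithmetic on the WER and time values reported above, keyed by the language label at the start of each result row; the variable and function names are chosen for illustration.

```python
# (WER, time) pairs copied from the results above.
RESULTS = {
    "polish": {
        "no_lm": (0.3069742867206763, 58.04590034484863),
        "with_lm": (0.2291299753434308, 191.65409898757935),
    },
    "portuguese": {
        "no_lm": (0.18208286674132138, 114.61633825302124),
        "with_lm": (0.1487761958086706, 429.78511357307434),
    },
    "spanish": {
        "no_lm": (0.2581272104769545, 147.8634352684021),
        "with_lm": (0.14927852292116295, 563.0732748508453),
    },
}


def summarize(results):
    """Return the relative WER reduction and the slowdown factor per language."""
    summary = {}
    for language, runs in results.items():
        wer_no_lm, time_no_lm = runs["no_lm"]
        wer_with_lm, time_with_lm = runs["with_lm"]
        summary[language] = {
            "relative_wer_reduction": (wer_no_lm - wer_with_lm) / wer_no_lm,
            "slowdown": time_with_lm / time_no_lm,
        }
    return summary


if __name__ == "__main__":
    for language, stats in summarize(RESULTS).items():
        print(
            f"{language}: WER reduced by {stats['relative_wer_reduction']:.1%}, "
            f"{stats['slowdown']:.1f}x slower"
        )
```

For these runs the relative WER reduction comes out at roughly 25%, 18% and 42%, bought with a slowdown of between 3 and 4 times.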