Textless NLP is an active area of research that aims to extend NLP techniques to work directly on spoken language. By using discrete
speech representations learnt in a self-supervised way, the area promises to unlock interesting NLP applications for languages without a written form and for facets of spoken
language that are inaccessible to text-based approaches, e.g. prosody. To learn more, please check some of the papers.
textlesslib is a library that aims to facilitate research in Textless NLP. The goal of the library is to speed up the research cycle and
lower the learning curve for those who want to start. We provide highly configurable, off-the-shelf tools to encode speech
as sequences of discrete values and tools to decode such streams back into the audio domain.
Table of Contents
- Usage examples
- Provided models
    git clone git@github.com:facebookresearch/textlesslib.git
    cd textlesslib
    pip install -e .
    pip install git+https://github.com/pytorch/fairseq.git@dd106d9534b22e7db859a6b87ffd7780c38341f8
We include a set of examples in the examples folder:
- Discrete speech resynthesis (& compression)
- Probing for speaker information in the representations
- Generative Spoken Language Modeling (aka Speech Continuation)
We believe these examples can serve both as illustrations of the provided components and as
a starting point for tinkering in interesting directions.
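To give a feel for why discrete units are interesting for compression, here is a rough back-of-the-envelope sketch. It assumes one unit is emitted every 20 ms (50 units per second), that all units are equally likely, and that no deduplication is applied; the function name is ours and is not part of textlesslib.

```python
import math


def upper_bound_bitrate(vocab_size: int, frames_per_second: float) -> float:
    """Bits per second needed if every unit costs log2(vocab_size) bits."""
    return frames_per_second * math.log2(vocab_size)


# Assumption: 50 units per second (one unit every 20 ms).
for vocab_size in (50, 100, 200):
    bitrate = upper_bound_bitrate(vocab_size, frames_per_second=50.0)
    print(f"vocab={vocab_size}: {bitrate:.1f} bits/s")
```

Deduplication and entropy coding would lower these numbers further, so they are best read as an upper bound under the stated assumptions.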
Below is an example of loading an audio file and encoding it as a sequence of HuBERT-based discrete tokens (aka pseudo-units).
Downloading of the required checkpoints is handled by textlesslib itself (by default they are stored in
    import torchaudio
    from textless.data.speech_encoder import SpeechEncoder

    dense_model_name = "hubert-base-ls960"
    quantizer_name, vocab_size = "kmeans", 100
    input_file = "input.wav"

    # now let's load an audio example
    waveform, sample_rate = torchaudio.load(input_file)

    # We can build a speech encoder module using names of pre-trained
    # dense and quantizer models. The call below will download
    # appropriate checkpoints as needed behind the scenes. We can
    # also construct an encoder by directly passing model instances
    encoder = SpeechEncoder.by_name(
        dense_model_name=dense_model_name,
        quantizer_model_name=quantizer_name,
        vocab_size=vocab_size,
        deduplicate=True,
    ).cuda()

    # now convert it in a stream of deduplicated units (as in GSLM)
    encoded = encoder(waveform.cuda())
    # encoded is a dict with keys ('dense', 'units', 'durations').
    # It can also contain 'f0' if SpeechEncoder was initialized
    # with need_f0=True flag.
    units = encoded["units"]  # tensor([71, 12, 57, ...], ...)
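For intuition about what `deduplicate=True` and the `'durations'` key refer to, here is a minimal pure-Python sketch of run-length deduplication. It is an illustration of the idea only, not textlesslib's implementation; the function name and the toy input are made up.

```python
from itertools import groupby


def deduplicate(frame_units):
    """Collapse runs of repeated units; return (units, durations)."""
    units, durations = [], []
    for unit, run in groupby(frame_units):
        units.append(unit)
        durations.append(sum(1 for _ in run))  # length of the run
    return units, durations


# Toy frame-level unit stream (made-up values).
frame_units = [71, 71, 71, 12, 12, 57, 71]
units, durations = deduplicate(frame_units)
print(units)      # [71, 12, 57, 71]
print(durations)  # [3, 2, 1, 1]
```

Note that the durations always sum to the number of input frames, so the frame-level stream can be recovered from the pair.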
The units can now be converted back into the audio domain:
    # as with encoder, we can setup vocoder by passing checkpoints
    # directly or by specifying the expected format by the names
    # of dense and quantizer models (these models themselves
    # won't be loaded)
    vocoder = TacotronVocoder.by_name(
        dense_model_name,
        quantizer_name,
        vocab_size,
    ).cuda()

    # now we turn those units back into the audio.
    audio = vocoder(units)

    # save the audio
    torchaudio.save(output_file, audio.cpu().float().unsqueeze(0), vocoder.output_sample_rate)
Below is an example of using the textless view of the LibriSpeech dataset:
    encoder = SpeechEncoder.by_name(
        dense_model_name=dense_model_name,
        quantizer_model_name=quantizer_name,
        vocab_size=vocab_size,
        deduplicate=True,
    ).cuda()

    quantized_dataset = QuantizedLibriSpeech(
        root=existing_root, speech_encoder=encoder, url=url)

    # index into the dataset to get a single item
    datum = quantized_dataset[0]
    sample_rate, utterance, speaker_id, chapter_id, utterance_id = datum['rest']
    # datum['units'] = tensor([71, 12, 63, ...])
In the probing example we illustrate how such a dataset
can be used with a standard PyTorch dataloader in a scalable manner.
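One practical detail when batching such data is that unit sequences differ in length, so a batch has to be padded to a common length. The sketch below shows only that padding step on plain Python lists; the function name and the padding value are our own choices, and the probing example may handle batching differently.

```python
def pad_unit_sequences(sequences, pad_value=-1):
    """Right-pad integer sequences to a common length.

    Returns the padded sequences and the original lengths, which a
    model can use to ignore the padded positions.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [
        list(seq) + [pad_value] * (max_len - len(seq))
        for seq in sequences
    ]
    return padded, lengths


# Toy batch of unit sequences (made-up values).
batch = [[71, 12, 63], [5], [8, 9]]
padded, lengths = pad_unit_sequences(batch)
print(padded)   # [[71, 12, 63], [5, -1, -1], [8, 9, -1]]
print(lengths)  # [3, 1, 2]
```

In a real pipeline the same logic would typically live in a collate function that wraps the padded lists in tensors.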
We also provide a multi-GPU/multi-node preprocessing tool
for the cases where on-the-fly processing of audio should be avoided.
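As a rough illustration of how offline preprocessing can be spread over several workers, the sketch below splits a list of files so that each worker gets a disjoint share. This is a generic round-robin scheme under our own assumptions (the names `rank` and `world_size` and the file list are hypothetical); it does not describe how the provided tool is implemented.

```python
def shard_for_rank(items, rank, world_size):
    """Return the items that worker `rank` (0-based) should process."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Round-robin: worker r takes items r, r + world_size, r + 2 * world_size, ...
    return items[rank::world_size]


# Hypothetical file list split across 3 workers.
files = [f"utt_{i}.wav" for i in range(7)]
for rank in range(3):
    print(rank, shard_for_rank(files, rank, world_size=3))
```

Every file is assigned to exactly one worker, and shard sizes differ by at most one.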
We provide implementations and pre-trained checkpoints for the following models:
- Dense representations: HuBERT-base (trained on LibriSpeech 960h) and CPC (trained on a 6K-hour subset of LibriLight);
- Quantizers: k-means quantizers with vocabulary sizes of 50, 100, 200 for both dense models (trained on LibriSpeech 960h);
- Decoders: Tacotron2 models for all (dense model x quantizer) combinations (trained on LJSpeech).
Finally, the pitch extraction is done via YAAPT.
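To make the role of the quantizer concrete: a k-means quantizer maps each dense frame to the index of its nearest centroid, and that index is the discrete unit. The following is a minimal pure-Python sketch with made-up 2-dimensional centroids; real dense features have many more dimensions, and this is not the library's code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_units(frames, centroids):
    """Map each frame to the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]


# Toy example: 3 centroids (vocabulary size 3) and 4 dense frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.8, 0.9]]
print(assign_units(frames, centroids))  # [0, 1, 2, 1]
```

The vocabulary size of the quantizer is simply the number of centroids, which is why the provided quantizers come in sizes of 50, 100, and 200.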
We use pytest (pip install pytest pytest-xdist). Our unit tests are located in the
cd tests && pytest -n 8
textlesslib is licensed under the MIT license; the text of the license can be found here.
Internally, it uses
- WaveGlow – licensed under BSD-3-Clause license;
- tacotron implementation – licensed under MIT license;
- tacotron2 implementation – licensed under BSD-3-Clause license;
- STFT implementation – licensed under BSD-3-Clause license.