BYOL for Audio
This is a demo implementation of BYOL for Audio (BYOL-A), a self-supervised learning method for general-purpose audio representation. It includes:
- Training code that can train models with arbitrary audio files.
- Evaluation code that can evaluate trained models with downstream tasks.
- Pretrained weights.
If you find BYOL-A useful in your research, please use the following BibTeX entry for citation.
@misc{niizumi2021byol-a,
title={BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation},
author={Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
booktitle = {2021 International Joint Conference on Neural Networks, {IJCNN} 2021},
year={2021},
eprint={2103.06695},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
Getting Started
- Download external source files and apply a patch. Our implementation uses the following:
  - BYOL implementation: https://github.com/lucidrains/byol-pytorch/blob/master/byol_pytorch/byol_pytorch.py
  - MLPClassifier for PyTorch: https://github.com/daisukelab/general-learning/blob/master/MLP/torch_mlp_clf.py
curl -O https://raw.githubusercontent.com/lucidrains/byol-pytorch/2aa84ee18fafecaf35637da4657f92619e83876d/byol_pytorch/byol_pytorch.py
patch < byol_a/byol_pytorch.diff
mv byol_pytorch.py byol_a
curl -O https://raw.githubusercontent.com/daisukelab/general-learning/7b31d31637d73e1a74aec3930793bd5175b64126/MLP/torch_mlp_clf.py
mv torch_mlp_clf.py utils
- Install PyTorch 1.7.1, torchaudio, and the other dependencies listed in requirements.txt.
Evaluating BYOL-A Representations
Downstream Task Evaluation
The following steps perform a downstream task evaluation in a linear-probe fashion.
This example uses SPCV2 (Speech Commands dataset v2).
- Preprocess metadata (.csv file) and audio files. Processed files will be stored under the folder work.
# usage: python -m utils.preprocess_ds <downstream task> <path to its dataset>
python -m utils.preprocess_ds spcv2 /path/to/speech_commands_v0.02
- Run evaluation. This first converts all .wav audio to representation embeddings, then trains a linear layer network on them, and finally calculates accuracy as the result.
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth spcv2
You can also run an evaluation multiple times and take the average result. The following evaluates on UrbanSound8K 10 times, with a unit audio duration of 4.0 seconds.
# usage: python evaluate.py <your weight> <downstream task> <unit duration sec.> <# of iteration>
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth us8k 4.0 10
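If you collect the per-run accuracies yourself, for example to compare several weights, a mean and spread can be computed as in the minimal sketch below. The helper name `summarize` and the numbers in `runs` are hypothetical placeholders, not BYOL-A results, and the sketch makes no assumption about how evaluate.py reports its output.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    # A single run has no spread; stdev() needs at least two values.
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean(accuracies), spread


# Placeholder values for illustration only; these are NOT BYOL-A results.
runs = [0.80, 0.82, 0.78, 0.80]
avg, spread = summarize(runs)
print(f"accuracy: {avg:.3f} +/- {spread:.3f} over {len(runs)} runs")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two weights exceeds run-to-run variation.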
Evaluating Representations In Your Tasks
The following example calculates a feature vector for an audio sample.
from byol_a.common import *
from byol_a.augmentations import PrecomputedNorm
from byol_a.models import AudioNTT2020
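# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')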
cfg = load_yaml_config('config.yaml')
print(cfg)
# Mean and standard deviation of the log-mel spectrogram of input audio samples, pre-computed.
# See calc_norm_stats in evaluate.py for your reference.
stats = [-5.4919195, 5.0389895]
# Preprocessor and normalizer.
to_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sample_rate,
    n_fft=cfg.n_fft,
    win_length=cfg.win_length,
    hop_length=cfg.hop_length,
    n_mels=cfg.n_mels,
    f_min=cfg.f_min,
    f_max=cfg.f_max,
)
normalizer = PrecomputedNorm(stats)
# Load pretrained weights.
model = AudioNTT2020(d=cfg.feature_d)
model.load_weight('pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth', device)
# Load your audio file.
wav, sr = torchaudio.load('work/16k/spcv2/one/00176480_nohash_0.wav') # a sample from SPCV2 for now
assert sr == cfg.sample_rate, "Let's convert the audio sampling rate in advance, or do it here online."
# Convert to a log-mel spectrogram, then normalize.
lms = normalizer((to_melspec(wav) + torch.finfo(torch.float).eps).log())
# Now, convert the audio to the representation.
features = model(lms.unsqueeze(0))
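The `stats` pair above is described as the mean and standard deviation of the log-mel spectrograms. The dependency-free sketch below illustrates that arithmetic on toy data. It assumes the normalizer applies `(x - mean) / std` and that a population standard deviation is used; neither detail is confirmed here, so check `PrecomputedNorm` and `calc_norm_stats` in the repository for the actual behavior. The function names and the toy spectrograms are hypothetical.

```python
import math

# float32 machine epsilon, the value of torch.finfo(torch.float).eps
EPS = 1.1920928955078125e-07


def log_mel(mel):
    """Element-wise log of a mel spectrogram given as a list of rows."""
    return [[math.log(v + EPS) for v in row] for row in mel]


def calc_stats(spectrograms):
    """Mean and population standard deviation over every bin of every spectrogram."""
    values = [v for spec in spectrograms for row in spec for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def normalize(spec, stats):
    """Apply (x - mean) / std to every bin (assumed normalizer behavior)."""
    mean, std = stats
    return [[(v - mean) / std for v in row] for row in spec]


# Toy 2x2 "mel spectrograms"; real ones come from torchaudio's MelSpectrogram.
mels = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.25], [8.0, 1.0]]]
lms = [log_mel(m) for m in mels]
stats = calc_stats(lms)
normed = [normalize(l, stats) for l in lms]
print(stats)
```

After normalization with statistics computed on the same data, the values have zero mean and unit standard deviation, which is why the statistics should be computed on the dataset you evaluate rather than reused blindly.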
Training From Scratch
You can also train models. The following is an example of training on FSD50K.
- Convert all samples to 16kHz. This will convert all FSD50K files to the folder work/16k/fsd50k while preserving the folder structure.
python -m utils.convert_wav /path/to/fsd50k work/16k/fsd50k
- Start training. This example trains with all development set audio samples from FSD50K.
python train.py work/16k/fsd50k/FSD50K.dev_audio
Refer to Table VI in our paper for the performance of a model trained on FSD50K.
Pretrained Weights
We include three sets of pretrained weights for our encoder network.
Method | Dim. | Filename | NSynth | US8K | VoxCeleb1 | VoxForge | SPCV2/12 | SPCV2 | Average |
---|---|---|---|---|---|---|---|---|---|
BYOL-A | 512-d | AudioNTT2020-BYOLA-64x96d512.pth | 69.1% | 78.2% | 33.4% | 83.5% | 86.5% | 88.9% | 73.3% |
BYOL-A | 1024-d | AudioNTT2020-BYOLA-64x96d1024.pth | 72.7% | 78.2% | 38.0% | 88.5% | 90.1% | 91.4% | 76.5% |
BYOL-A | 2048-d | AudioNTT2020-BYOLA-64x96d2048.pth | 74.1% | 79.1% | 40.1% | 90.2% | 91.0% | 92.2% | 77.8% |
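The Average column appears to be the unweighted mean of the six task accuracies in each row. The short check below, using only the numbers from the table, reproduces it under that assumption.

```python
# Per-task accuracies (%) copied from the table, in column order:
# NSynth, US8K, VoxCeleb1, VoxForge, SPCV2/12, SPCV2
table = {
    "512-d": [69.1, 78.2, 33.4, 83.5, 86.5, 88.9],
    "1024-d": [72.7, 78.2, 38.0, 88.5, 90.1, 91.4],
    "2048-d": [74.1, 79.1, 40.1, 90.2, 91.0, 92.2],
}

# Unweighted mean per row, rounded to one decimal place like the table.
averages = {dim: round(sum(scores) / len(scores), 1) for dim, scores in table.items()}

for dim, avg in averages.items():
    print(f"{dim}: {avg}%")
```

The computed values agree with the Average column (73.3%, 76.5%, 77.8%).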
License
This implementation is provided for your evaluation of the BYOL-A paper. See LICENSE for details.
Acknowledgements
BYOL-A is built on top of byol-pytorch, a BYOL implementation by Phil Wang (@lucidrains). We thank Phil for open-sourcing such sophisticated code.
@misc{wang2020byol-pytorch,
author = {Phil Wang},
title = {Bootstrap Your Own Latent (BYOL), in Pytorch},
howpublished = {\url{https://github.com/lucidrains/byol-pytorch}},
year = {2020}
}