/ Machine Learning

Voice Activity Detection based on Deep Learning & TensorFlow

Voice Activity Detection based on Deep Learning & TensorFlow

Voice Activity Detection project

Voice Activity Detection based on Deep Learning & TensorFlow.

1. Installation

This project was designed for:

  • Ubuntu 18.04
  • Python 3.6
  • TensorFlow 1.12.0

Please install requirements & project:

$ cd /path/to/project/
$ git clone https://github.com/filippogiruzzi/voice_activity_detection.git
$ cd voice_activity_detection/
$ pip3 install -r requirements.txt
$ pip3 install -e . --user --upgrade

2. Introduction

2.1 Goal

The purpose of this project is to design and implement
a real-time Voice Activity Detection algorithm based on Deep Learning.

The designed solution is based on MFCC feature extraction and
a 1D-Resnet model that classifies whether a audio signal is
speech or noise.

2.2 Results

Model Train acc. Val acc. Test acc.
1D-Resnet 99 % 98 % 97 %

Raw and post-processed inference results on a test audio signal are shown below.



3. Project structure

The project voice_activity_detection/ has the following structure:

  • vad/data_processing/: raw data labeling, processing,
    recording & visualization
  • vad/training/: data, input pipeline, model
    & training / evaluation / prediction
  • vad/inference/: exporting trained model & inference

4. Dataset

Please download the LibriSpeech ASR corpus dataset from https://openslr.org/12/,
and extract all files to : /path/to/LibriSpeech/.

The dataset contains approximately 1000 hours of 16kHz read English speech
from audiobooks, and is well suited for Voice Activity Detection.

I automatically annotated the test-clean set of the dataset with a
pretrained VAD model.

Please feel free to use the labels/ folder and the pre-trained VAD model (only for inference) from this
link .

5. Project usage

$ cd /path/to/project/voice_activity_detection/vad/

5.1 Dataset automatic labeling

Skip this subsection if you already have the labels/ folder, that contains annotations
from a different pre-trained model.

$ python3 data_processing/librispeech_label_data.py --data_dir /path/to/LibriSpeech/test-clean/
                                                    --exported_model /path/to/pretrained/model/
                                                    --out_dir /path/to/LibriSpeech/labels/

This will record the annotations into /path/to/LibriSpeech/labels/ as
.json files.

5.2 Record raw data to .tfrecord format

$ python3 data_processing/data_to_tfrecords.py --data_dir /path/to/LibriSpeech/

This will record the splitted data to .tfrecord format in /path/to/LibriSpeech/tfrecords/

5.3 Train a CNN to classify Speech & Noise signals

$ python3 training/train.py --data-dir /path/to/LibriSpeech/tfrecords/

5.4 Export trained model & run inference on Test set

$ python3 inference/export_model.py --model-dir /path/to/trained/model/dir/
                                    --ckpt /path/to/trained/model/dir/
$ python3 inference/inference.py --data_dir /path/to/LibriSpeech/
                                 --exported_model /path/to/exported/model/

The trained model will be recorded in /path/to/LibriSpeech/tfrecords/models/resnet1d/.
The exported model will be recorded inside this directory.

6. Todo

  • [ ] Compare Deep Learning model to a simple baseline
  • [ ] Train on full dataset
  • [ ] Improve data balancing
  • [ ] Add time series data augmentation
  • [ ] Study ROC curve & classification threshold
  • [ ] Add online inference
  • [ ] Evaluate quantitatively post-processing methods on the Test set
  • [ ] Add model description & training graphs
  • [ ] Add Google Colab demo

7. Resources

  • Voice Activity Detection for Voice User Interface,
  • Deep learning for time series classifcation: a review,
    Fawaz et al., 2018, Arxiv
  • Time Series Classification from Scratch
    with Deep Neural Networks: A Strong Baseline
    , Wang et al., 2016,