Voice Activity Detection based on Deep Learning & TensorFlow

Voice Activity Detection project

Voice Activity Detection based on Deep Learning & TensorFlow.

1. Installation

This project was designed for:

Ubuntu 18.04
Python 3.6
TensorFlow 1.12.0

Please install requirements & project:

$ cd /path/to/project/
$ git clone https://github.com/filippogiruzzi/voice_activity_detection.git
$ cd voice_activity_detection/
$ pip3 install -r requirements.txt
$ pip3 install -e . --user --upgrade

2. Introduction

2.1 Goal

The purpose of this project is to design and implement
a real-time Voice Activity Detection algorithm based on Deep Learning.

The designed solution is based on MFCC feature extraction and
a 1D-Resnet model that classifies whether a audio signal is
speech or noise.

2.2 Results

Model	Train acc.	Val acc.	Test acc.
1D-Resnet	99 %	98 %	97 %

Raw and post-processed inference results on a test audio signal are shown below.

inference_raw

inference_smooth

3. Project structure

The project voice_activity_detection/ has the following structure:

vad/data_processing/: raw data labeling, processing,
recording & visualization
vad/training/: data, input pipeline, model
& training / evaluation / prediction
vad/inference/: exporting trained model & inference

4. Dataset

Please download the LibriSpeech ASR corpus dataset from https://openslr.org/12/,
and extract all files to : /path/to/LibriSpeech/.

The dataset contains approximately 1000 hours of 16kHz read English speech
from audiobooks, and is well suited for Voice Activity Detection.

I automatically annotated the test-clean set of the dataset with a
pretrained VAD model.

Please feel free to use the labels/ folder and the pre-trained VAD model (only for inference) from this
link .

5. Project usage

$ cd /path/to/project/voice_activity_detection/vad/

5.1 Dataset automatic labeling

Skip this subsection if you already have the labels/ folder, that contains annotations
from a different pre-trained model.

$ python3 data_processing/librispeech_label_data.py --data_dir /path/to/LibriSpeech/test-clean/
                                                    --exported_model /path/to/pretrained/model/
                                                    --out_dir /path/to/LibriSpeech/labels/

This will record the annotations into /path/to/LibriSpeech/labels/ as
.json files.

5.2 Record raw data to .tfrecord format

$ python3 data_processing/data_to_tfrecords.py --data_dir /path/to/LibriSpeech/

This will record the splitted data to .tfrecord format in /path/to/LibriSpeech/tfrecords/

5.3 Train a CNN to classify Speech & Noise signals

$ python3 training/train.py --data-dir /path/to/LibriSpeech/tfrecords/

5.4 Export trained model & run inference on Test set

$ python3 inference/export_model.py --model-dir /path/to/trained/model/dir/
                                    --ckpt /path/to/trained/model/dir/
$ python3 inference/inference.py --data_dir /path/to/LibriSpeech/
                                 --exported_model /path/to/exported/model/
                                 --smoothing

The trained model will be recorded in /path/to/LibriSpeech/tfrecords/models/resnet1d/.
The exported model will be recorded inside this directory.

6. Todo

[ ] Compare Deep Learning model to a simple baseline
[ ] Train on full dataset
[ ] Improve data balancing
[ ] Add time series data augmentation
[ ] Study ROC curve & classification threshold
[ ] Add online inference
[ ] Evaluate quantitatively post-processing methods on the Test set
[ ] Add model description & training graphs
[ ] Add Google Colab demo

7. Resources

Voice Activity Detection for Voice User Interface,
Medium
Deep learning for time series classifcation: a review,
Fawaz et al., 2018, Arxiv
Time Series Classification from Scratch
with Deep Neural Networks: A Strong Baseline, Wang et al., 2016,
Arxiv

Voice Activity Detection based on Deep Learning & TensorFlow

Voice Activity Detection project

1. Installation

2. Introduction

2.1 Goal

2.2 Results

3. Project structure

4. Dataset

5. Project usage

5.1 Dataset automatic labeling

5.2 Record raw data to .tfrecord format

5.3 Train a CNN to classify Speech & Noise signals

5.4 Export trained model & run inference on Test set

6. Todo

7. Resources

GitHub

John

A Very Fast Deep Video Denoising algorithm

Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline

Voice Activity Detection project

1. Installation

2. Introduction

2.1 Goal

2.2 Results

3. Project structure

4. Dataset

5. Project usage

5.1 Dataset automatic labeling

5.2 Record raw data to .tfrecord format

5.3 Train a CNN to classify Speech & Noise signals

5.4 Export trained model & run inference on Test set

6. Todo

7. Resources

GitHub

A Very Fast Deep Video Denoising algorithm

Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline

You might also like...