Auto Label Pipeline

A practical ML pipeline for data labeling with experiment tracking using DVC

Goals:

  • Demonstrate reproducible ML
  • Use DVC to build a pipeline and track experiments
  • Automatically clean noisy data labels using Cleanlab cross-validation (see the sketch after this list)
  • Determine which FastText subword embedding performs better for semi-supervised cluster classification
  • Determine optimal hyperparameters through experiment tracking
  • Prepare casually labeled data for human evaluation
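
A minimal sketch of that cleaning step, assuming cleanlab 2.x, scikit-learn, and an SVC classifier (an SVM model does appear in this project, per model/svm.model.pkl); X would be embeddings of the text_data items, and all names here are illustrative rather than the pipeline's actual code:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from cleanlab.filter import find_label_issues

def find_noisy_labels(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return indices of likely mislabeled rows, most suspect first."""
    # Out-of-sample class probabilities from 5-fold cross-validation, so each
    # row is scored by a model that never saw it during training.
    pred_probs = cross_val_predict(
        SVC(probability=True), X, y, cv=5, method="predict_proba"
    )
    # cleanlab flags rows whose given label disagrees with the model's
    # prediction and ranks them by its self-confidence score.
    return find_label_issues(
        labels=y,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
```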

Demo: view the experiments recorded in git branches (asciicast recording).

The Data

For our working demo, we will purify some of the slightly noisy/dirty labels found in Wikidata people entries for the Employer and Occupation attributes. Our initial data labels were harvested from a JSON dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

Data Input Format

Tab-separated CSV files, with the following fields (a loading sketch follows the list):

  • text_data – the item that is to be labeled (single word or short group of words)
  • class_type – the class label
  • context – any text that surrounds the text_data field in situ, or that otherwise defines the text_data item
  • count – the number of occurrences of this label, i.e., how common it appears in the existing data
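
To make the input format concrete, here is a minimal loading sketch with pandas; the choice of file and the assumption that there is no header row are mine:

```python
import pandas as pd

# The input files are tab-separated with a fixed set of columns.
COLUMNS = ["text_data", "class_type", "context", "count"]

df = pd.read_csv(
    "data/raw/occupations.wikidata.csv",  # one of the repo's raw input files
    sep="\t",
    names=COLUMNS,  # assumes no header row; drop `names=` if the files carry one
)
print(df.sort_values("count", ascending=False).head())
```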

Data Output Format

The same fields as the data input, plus:

  • date_updated – when the label was updated
  • previous_class_type – the previous class_type label
  • mislabeled_rank – a rank of how low the model's confidence in the original label was prior to relabeling
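
As a sketch of how these columns might be attached during relabeling (the issues/new_labels interfaces are assumptions, not the pipeline's actual code):

```python
from datetime import date

import pandas as pd

def apply_relabels(df: pd.DataFrame, issues, new_labels) -> pd.DataFrame:
    """Return a copy of df with the output columns attached.

    issues: row indices ordered most-suspect-first (e.g., from the
    cleanlab sketch above); new_labels: maps row index -> suggested class.
    """
    out = df.copy()
    out["previous_class_type"] = out["class_type"]
    out["date_updated"] = date.today().isoformat()
    for rank, idx in enumerate(issues):
        out.loc[idx, "class_type"] = new_labels[idx]
        out.loc[idx, "mislabeled_rank"] = rank  # low rank = low original confidence
    return out
```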

The Pipeline

  • Fetch
  • Prepare
  • Train
  • Relabel

For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file and parameterized via params.yaml; a sketch of a stage definition follows.
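
As orientation, a DVC stage entry generally takes the following shape; this is a sketch using paths from the project structure below, not a copy of this repo's actual dvc.yaml:

```yaml
stages:
  train:
    cmd: python src/train.py
    deps:
      - data/prepared/data.all.csv
      - src/train.py
    params:
      - train                      # hypothetical section of params.yaml
    outs:
      - model/svm.model.pkl
    metrics:
      - model/train.metrics.json:
          cache: false             # keep small metrics files in git, not the DVC cache
```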

Using/Extending the pipeline

  1. Drop your own CSV files into the data/raw directory
  2. Run the pipeline (example commands below)
  3. Tune settings, embeddings, etc., until no longer amused
  4. Verify your results manually and by submitting data/final/data.csv for human evaluation, using random sampling and drawing heavily from the mislabeled_rank entries.
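
For step 2, the standard DVC commands apply; the parameter name in the override is hypothetical:

```bash
dvc repro                    # run every stage defined in dvc.yaml
dvc exp run -S train.C=10    # re-run as a tracked experiment with a parameter override
dvc exp show                 # compare experiments and their metrics
```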

Project Structure

<div class="snippet-clipboard-content position-relative overflow-auto" data-snippet-clipboard-copy-content="├── LICENSE
├── README.md
├── data # <– Directory with all types of data
│ ├── final # <– Directory with final data
│ │ ├── class.metrics.csv # <– Directory with raw and intermediate data
│ │ └── data.csv # <– Pipeline output (not stored in git)
│ ├── interim # <– Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared # <– Directory with prepared data
│ │ └── data.all.csv
│ └── raw # <– Directory with raw data; populated by pipeline's fetch stage
│ ├── README.md
│ ├── cc.en.300.bin # <– Fasttext binary model file, creative commons
│ ├── crawl-300d-2M-subword.bin # <– Fasttext binary model file, common crawl
│ ├── crawl-300d-2M-subword.vec
│ ├── employers.wikidata.csv # <– Our initial data, 1 set of class labels
│ ├── lid.176.ftz
│ └── occupations.wikidata.csv # <– Our initial data, 1 set of class labels
├── dvc.lock # <– DVC internal state tracking file
├── dvc.yaml # <– DVC project configuration file
├── dvc_plots # <– Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json # <– Metrics from the pipeline's train stage
├── mypy.ini
├── params.yaml # <– Parameter configuration file for the pipeline
├── reports # <– Directory with metrics output
│ ├── prepare.metrics.json
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src #

```
├── LICENSE
├── README.md
├── data                    # <-- Directory with all types of data
│ ├── final                 # <-- Directory with final data
│ │ ├── class.metrics.csv   # <-- Per-class metrics for the final data
│ │ └── data.csv            # <-- Pipeline output (not stored in git)
│ ├── interim               # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared              # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw                   # <-- Directory with raw data; populated by pipeline's fetch stage
│     ├── README.md
│     ├── cc.en.300.bin               # <-- FastText binary model file (Common Crawl + Wikipedia vectors)
│     ├── crawl-300d-2M-subword.bin   # <-- FastText binary model file (2M subword vectors, Common Crawl)
│     ├── crawl-300d-2M-subword.vec
│     ├── employers.wikidata.csv      # <-- Our initial data, 1 set of class labels 
│     ├── lid.176.ftz                 # <-- FastText language identification model (compressed)
│     └── occupations.wikidata.csv    # <-- Our initial data, 1 set of class labels
├── dvc.lock                # <-- DVC internal state tracking file
├── dvc.yaml                # <-- DVC project configuration file
├── dvc_plots               # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json    # <-- Metrics from the pipeline's train stage  
├── mypy.ini
├── params.yaml             # <-- Parameter configuration file for the pipeline
├── reports                 # <-- Directory with metrics output
│ ├── prepare.metrics.json  
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                     # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py
```