# Auto Label Pipeline

A practical ML pipeline for data labeling, with experiment tracking using DVC.
## Goals

- Demonstrate reproducible ML
- Use DVC to build a pipeline and track experiments
- Automatically clean noisy data labels using Cleanlab cross-validation
- Determine which fastText subword embedding performs better for semi-supervised cluster classification
- Determine optimal hyperparameters through experiment tracking
- Prepare casually labeled data for human evaluation
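The Cleanlab step above builds on a simple idea: score each example's given label with out-of-sample predicted probabilities from cross-validation, and flag examples where the model disagrees confidently with the label. This is a minimal sketch of that idea using scikit-learn only, on hypothetical toy data — it is not this project's code, and the real pipeline uses the Cleanlab library itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data: two well-separated clusters, with one deliberately flipped
# label (index 0) standing in for label noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
y[0] = 1  # inject label noise

# Out-of-sample predicted probabilities via cross-validation, so each
# example is scored by a model that never trained on its label.
probs = cross_val_predict(LogisticRegression(), X, y, cv=5,
                          method="predict_proba")

# Self-confidence: the probability the model assigns to the GIVEN label.
# Low self-confidence flags likely mislabeled examples.
self_conf = probs[np.arange(len(y)), y]
suspects = np.argsort(self_conf)[:3]
print(suspects[0])  # index 0, the flipped label
```

Cleanlab's confident-learning machinery refines this with per-class confidence thresholds, but the cross-validated self-confidence score is the core signal.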
Demo: view the experiments recorded in git branches.
## The Data

For our working demo, we purify some of the slightly noisy/dirty labels found in Wikidata people entries for the Employer and Occupation attributes. Our initial data labels were harvested from a JSON dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.
## Data Input Format

Tab-separated CSV files, with the fields:

- `text_data` – the item to be labeled (a single word or short group of words)
- `class_type` – the class label
- `context` – any text that surrounds the `text_data` field in situ, or defines the `text_data` item in other words
- `count` – the number of occurrences of this label; how common it appears in the existing data
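A quick sketch of parsing this format with the standard library, assuming the files carry a header row (the sample rows here are hypothetical, not from the real dataset):

```python
import csv
import io

# A hypothetical two-row sample in the input format described above.
raw = (
    "text_data\tclass_type\tcontext\tcount\n"
    "Acme Corp\temployer\tShe worked at Acme Corp as an engineer\t42\n"
    "plumber\toccupation\tHe trained as a plumber in Leeds\t7\n"
)

# csv.DictReader keys each row by the header fields.
rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
print(len(rows))              # 2
print(rows[1]["class_type"])  # occupation
```

Note that `csv` yields strings, so `count` would need an explicit `int()` conversion downstream.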
## Data Output Format

Same fields as the data input, plus:

- `date_updated` – when the label was updated
- `previous_class_type` – the previous `class_type` label
- `mislabeled_rank` – records how low the confidence was prior to re-labeling
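To make the relationship between the two formats concrete, here is a hypothetical helper (not part of the pipeline) that turns an input-format row into an output-format row when a label is changed:

```python
from datetime import date

def relabel_row(row: dict, new_class: str, rank: int) -> dict:
    """Hypothetical: produce an output-format row from an input-format row,
    recording the old label, the update date, and the confidence rank."""
    out = dict(row)
    out["previous_class_type"] = row["class_type"]
    out["class_type"] = new_class
    out["date_updated"] = date.today().isoformat()
    out["mislabeled_rank"] = rank
    return out

row = {"text_data": "plumber", "class_type": "employer",
       "context": "He trained as a plumber", "count": 7}
fixed = relabel_row(row, "occupation", 1)
print(fixed["previous_class_type"])  # employer
```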
## The Pipeline

- Fetch
- Prepare
- Train
- Relabel

For details, see the README in the `src` folder. The pipeline is orchestrated via the `dvc.yaml` file and parameterized via `params.yaml`.
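For orientation, a `dvc.yaml` wiring these four stages might look roughly like the sketch below. The commands, dependencies, and parameter groups here are illustrative assumptions; the repo's actual `dvc.yaml` is authoritative:

```yaml
# Hypothetical sketch only; see the real dvc.yaml in this repo.
stages:
  fetch:
    cmd: python src/fetch.py
    outs:
      - data/raw
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw
      - src/prepare.py
    params:
      - prepare
    outs:
      - data/prepared/data.all.csv
    metrics:
      - reports/prepare.metrics.json
  train:
    cmd: python src/train.py
    deps:
      - data/prepared/data.all.csv
      - src/train.py
    params:
      - train
    outs:
      - model/svm.model.pkl
    metrics:
      - model/train.metrics.json
  relabel:
    cmd: python src/relabel.py
    deps:
      - model/svm.model.pkl
      - src/relabel.py
    outs:
      - data/final/data.csv
    metrics:
      - reports/relabel.metrics.json
```

With a layout like this, `dvc repro` re-runs only the stages whose dependencies or parameters changed, which is what makes the experiment tracking reproducible.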
## Using/Extending the Pipeline

- Drop your own CSV files into the `data/raw` directory
- Run the pipeline
- Tune settings, embeddings, etc., until no longer amused
- Verify your results manually, and by submitting `data/final/data.csv` for human evaluation, using random sampling and drawing heavily from the `mislabeled_rank` entries
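One way to build such an evaluation sample — a sketch under the assumption that re-labeled rows carry a non-null `mislabeled_rank` and the rest do not; the helper name and split sizes are made up for illustration:

```python
import random

def eval_sample(rows, n_relabeled=4, n_random=2, seed=0):
    """Hypothetical helper: draw an evaluation sample that leans heavily
    on rows flagged by mislabeled_rank, topped up with uniform random
    rows from the rest."""
    rng = random.Random(seed)
    flagged = [r for r in rows if r.get("mislabeled_rank") is not None]
    others = [r for r in rows if r.get("mislabeled_rank") is None]
    picks = rng.sample(flagged, min(n_relabeled, len(flagged)))
    picks += rng.sample(others, min(n_random, len(others)))
    return picks

# Toy rows: the first 5 were "re-labeled", the rest were not.
rows = [{"text_data": f"item{i}", "mislabeled_rank": i if i < 5 else None}
        for i in range(20)]
sample = eval_sample(rows)
print(len(sample))  # 6
```

Oversampling the flagged rows like this concentrates human effort where the model was least confident, while the uniform portion keeps the estimate honest for the rest of the data.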
## Project Structure
```
├── LICENSE
├── README.md
├── data                              # <-- Directory with all types of data
│   ├── final                         # <-- Directory with final data
│   │   ├── class.metrics.csv         # <-- Per-class metrics
│   │   └── data.csv                  # <-- Pipeline output (not stored in git)
│   ├── interim                       # <-- Directory with temporary data
│   │   ├── datafile.0.csv
│   │   └── datafile.1.csv
│   ├── prepared                      # <-- Directory with prepared data
│   │   └── data.all.csv
│   └── raw                           # <-- Directory with raw data; populated by the pipeline's fetch stage
│       ├── README.md
│       ├── cc.en.300.bin             # <-- fastText binary model file, creative commons
│       ├── crawl-300d-2M-subword.bin # <-- fastText binary model file, Common Crawl
│       ├── crawl-300d-2M-subword.vec
│       ├── employers.wikidata.csv    # <-- Our initial data, 1 set of class labels
│       ├── lid.176.ftz
│       └── occupations.wikidata.csv  # <-- Our initial data, 1 set of class labels
├── dvc.lock                          # <-- DVC internal state-tracking file
├── dvc.yaml                          # <-- DVC project configuration file
├── dvc_plots                         # <-- Temp directory for DVC plots; not tracked by git
│   └── README.md
├── model
│   ├── class.metrics.csv
│   ├── svm.model.pkl
│   └── train.metrics.json            # <-- Metrics from the pipeline's train stage
├── mypy.ini
├── params.yaml                       # <-- Parameter configuration file for the pipeline
├── reports                           # <-- Directory with metrics output
│   ├── prepare.metrics.json
│   └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                               # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py
```