Auto Label Pipeline

A practical ML pipeline for data labeling with experiment tracking using DVC

Goals:

  • Demonstrate reproducible ML
  • Use DVC to build a pipeline and track experiments
  • Automatically clean noisy data labels using Cleanlab cross-validation (see the sketch after this list)
  • Determine which FastText subword embedding performs better for semi-supervised cluster classification
  • Determine optimal hyperparameters through experiment tracking
  • Prepare casually labeled data for human evaluation
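
A minimal sketch of that cleaning step, assuming cleanlab 2.x, scikit-learn, and an SVC classifier (an SVM model does appear in this project, per model/svm.model.pkl); X would be embeddings of the text_data items, and all names here are illustrative rather than the pipeline's actual code:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from cleanlab.filter import find_label_issues

def find_noisy_labels(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return indices of likely mislabeled rows, most suspect first."""
    # Out-of-sample class probabilities from 5-fold cross-validation, so each
    # row is scored by a model that never saw it during training.
    pred_probs = cross_val_predict(
        SVC(probability=True), X, y, cv=5, method="predict_proba"
    )
    # cleanlab flags rows whose given label disagrees with the model's
    # prediction and ranks them by its self-confidence score.
    return find_label_issues(
        labels=y,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
```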

Demo: view the experiments recorded in git branches (asciicast recording).

The Data

For our working demo, we will purify some of the slightly noisy/dirty labels found in Wikidata people entries for the Employer and Occupation attributes. Our initial data labels were harvested from a JSON dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

Data Input Format

Tab-separated CSV files, with the following fields (a loading sketch follows the list):

  • text_data – the item that is to be labeled (single word or short group of words)
  • class_type – the class label
  • context – any text that surrounds the text_data field in situ, or that otherwise defines the text_data item
  • count – the number of occurrences of this label, i.e., how common it appears in the existing data
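
To make the input format concrete, here is a minimal loading sketch with pandas; the choice of file and the assumption that there is no header row are mine:

```python
import pandas as pd

# The input files are tab-separated with a fixed set of columns.
COLUMNS = ["text_data", "class_type", "context", "count"]

df = pd.read_csv(
    "data/raw/occupations.wikidata.csv",  # one of the repo's raw input files
    sep="\t",
    names=COLUMNS,  # assumes no header row; drop `names=` if the files carry one
)
print(df.sort_values("count", ascending=False).head())
```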

Data Output Format

The same fields as the data input, plus:

  • date_updated – when the label was updated
  • previous_class_type – the previous class_type label
  • mislabeled_rank – a rank of how low the model's confidence in the original label was prior to relabeling
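
As a sketch of how these columns might be attached during relabeling (the issues/new_labels interfaces are assumptions, not the pipeline's actual code):

```python
from datetime import date

import pandas as pd

def apply_relabels(df: pd.DataFrame, issues, new_labels) -> pd.DataFrame:
    """Return a copy of df with the output columns attached.

    issues: row indices ordered most-suspect-first (e.g., from the
    cleanlab sketch above); new_labels: maps row index -> suggested class.
    """
    out = df.copy()
    out["previous_class_type"] = out["class_type"]
    out["date_updated"] = date.today().isoformat()
    for rank, idx in enumerate(issues):
        out.loc[idx, "class_type"] = new_labels[idx]
        out.loc[idx, "mislabeled_rank"] = rank  # low rank = low original confidence
    return out
```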

The Pipeline

  • Fetch
  • Prepare
  • Train
  • Relabel

For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file and parameterized via params.yaml; a sketch of a stage definition follows.
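
As orientation, a DVC stage entry generally takes the following shape; this is a sketch using paths from the project structure below, not a copy of this repo's actual dvc.yaml:

```yaml
stages:
  train:
    cmd: python src/train.py
    deps:
      - data/prepared/data.all.csv
      - src/train.py
    params:
      - train                      # hypothetical section of params.yaml
    outs:
      - model/svm.model.pkl
    metrics:
      - model/train.metrics.json:
          cache: false             # keep small metrics files in git, not the DVC cache
```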

Using/Extending the pipeline

  1. Drop your own CSV files into the data/raw directory
  2. Run the pipeline (example commands below)
  3. Tune settings, embeddings, etc., until no longer amused
  4. Verify your results manually and by submitting data/final/data.csv for human evaluation, using random sampling and drawing heavily from the mislabeled_rank entries.
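
For step 2, the standard DVC commands apply; the parameter name in the override is hypothetical:

```bash
dvc repro                    # run every stage defined in dvc.yaml
dvc exp run -S train.C=10    # re-run as a tracked experiment with a parameter override
dvc exp show                 # compare experiments and their metrics
```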

Project Structure

<div class="snippet-clipboard-content position-relative overflow-auto" data-snippet-clipboard-copy-content="├── LICENSE
├── README.md
├── data # <– Directory with all types of data
│ ├── final # <– Directory with final data
│ │ ├── class.metrics.csv # <– Directory with raw and intermediate data
│ │ └── data.csv # <– Pipeline output (not stored in git)
│ ├── interim # <– Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared # <– Directory with prepared data
│ │ └── data.all.csv
│ └── raw # <– Directory with raw data; populated by pipeline's fetch stage
│ ├── README.md
│ ├── cc.en.300.bin # <– Fasttext binary model file, creative commons
│ ├── crawl-300d-2M-subword.bin # <– Fasttext binary model file, common crawl
│ ├── crawl-300d-2M-subword.vec
│ ├── employers.wikidata.csv # <– Our initial data, 1 set of class labels
│ ├── lid.176.ftz
│ └── occupations.wikidata.csv # <– Our initial data, 1 set of class labels
├── dvc.lock # <– DVC internal state tracking file
├── dvc.yaml # <– DVC project configuration file
├── dvc_plots # <– Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json # <– Metrics from the pipeline's train stage
├── mypy.ini
├── params.yaml # <– Parameter configuration file for the pipeline
├── reports # <– Directory with metrics output
│ ├── prepare.metrics.json
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src #

```
├── LICENSE
├── README.md
├── data                    # <-- Directory with all types of data
│ ├── final                 # <-- Directory with final data
│ │ ├── class.metrics.csv   # <-- Per-class metrics for the final data
│ │ └── data.csv            # <-- Pipeline output (not stored in git)
│ ├── interim               # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared              # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw                   # <-- Directory with raw data; populated by pipeline's fetch stage
│     ├── README.md
│     ├── cc.en.300.bin               # <-- FastText binary model file (Common Crawl + Wikipedia vectors)
│     ├── crawl-300d-2M-subword.bin   # <-- FastText binary model file (2M subword vectors, Common Crawl)
│     ├── crawl-300d-2M-subword.vec
│     ├── employers.wikidata.csv      # <-- Our initial data, 1 set of class labels 
│     ├── lid.176.ftz                 # <-- FastText language identification model (compressed)
│     └── occupations.wikidata.csv    # <-- Our initial data, 1 set of class labels
├── dvc.lock                # <-- DVC internal state tracking file
├── dvc.yaml                # <-- DVC project configuration file
├── dvc_plots               # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json    # <-- Metrics from the pipeline's train stage  
├── mypy.ini
├── params.yaml             # <-- Parameter configuration file for the pipeline
├── reports                 # <-- Directory with metrics output
│ ├── prepare.metrics.json  
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                     # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py
```