Git Based MLOps

This project shows how to achieve MLOps using DVC, DVC Studio, and DVCLive (all products built by iterative.ai), together with Google Drive, Jarvislabs.ai, and HuggingFace Hub.

Instructions

Before you begin

  1. Click the “Use this template” button to create your own repository
  2. Wait a few seconds; an Initial Setup PR will be created automatically
  3. Merge the PR, and you are good to go

After your repo is set up

  1. Run pip install -r requirements.txt to install the dependencies
  2. Run dvc init to enable DVC
  3. Add your data under the data directory
  4. Run dvc add [ADDED FILE OR DIRECTORY] to track your data with DVC
  5. Run dvc remote add -d gdrive_storage gdrive://[ID of specific folder in gdrive] to add Google Drive as the remote data storage
  6. Run dvc push; a URL for authentication will be printed. Open it in your browser and authenticate
  7. Copy the contents of .dvc/tmp/gdrive-user-credentials.json and put them in a GitHub Secret named GDRIVE_CREDENTIALS
  8. Run git add . && git commit -m "initial commit" && git push origin main to keep the initial setup
  9. Write your own pipeline under the pipeline directory. Code for basic image classification in TensorFlow is provided as a starting point.
  10. Run the following dvc stage add for the train stage (note: there must be no spaces in the comma-separated parameter list)

$ dvc stage add -n train \
                -p train.train_size,train.batch_size,train.epoch,train.lr \
                -d pipeline/modeling.py -d pipeline/train.py -d data \
                --plots-no-cache dvclive/scalars/loss.tsv \
                --plots-no-cache dvclive/scalars/sparse_categorical_accuracy.tsv \
                --plots-no-cache dvclive/scalars/val_loss.tsv \
                --plots-no-cache dvclive/scalars/val_sparse_categorical_accuracy.tsv \
                -o outputs/model \
                python pipeline/train.py outputs/model
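
For reference, here is a minimal sketch of what a pipeline/train.py matching the stage above could look like. The parameter names, the dvclive output paths, and the model output path come from the stage definition; the MNIST data, the model architecture, and the interpretation of train_size as a train/validation fraction are placeholder assumptions, and the code actually provided under pipeline/ may differ.

import sys

import tensorflow as tf
import yaml
from dvclive.keras import DVCLiveCallback  # named DvcLiveCallback in older dvclive releases

# Hyperparameters tracked by the stage's -p flag, read from params.yaml.
params = yaml.safe_load(open("params.yaml"))["train"]

# Placeholder dataset and model; the real code lives in pipeline/modeling.py.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(params["lr"]),
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)

# The callback records loss, sparse_categorical_accuracy, val_loss, and
# val_sparse_categorical_accuracy as TSV files, which the --plots-no-cache
# flags above point at (the exact directory layout depends on the dvclive version).
model.fit(
    x_train, y_train,
    batch_size=params["batch_size"],
    epochs=params["epoch"],
    validation_split=1.0 - params["train_size"],  # assumes train_size is a fraction
    callbacks=[DVCLiveCallback()],
)

# The stage passes outputs/model as the first CLI argument (matching -o outputs/model).
model.save(sys.argv[1])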
  11. Run the following dvc stage add for the evaluate stage

$ dvc stage add -n evaluate \
                -p evaluate.test,evaluate.batch_size \
                -d pipeline/evaluate.py -d data/test -d outputs/model \
                -M outputs/metrics.json \
                python pipeline/evaluate.py outputs/model
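
Likewise, a hypothetical sketch of pipeline/evaluate.py. The one hard requirement imposed by the stage definition is that the script writes its scores to outputs/metrics.json (the -M flag); the MNIST test set here is a placeholder for whatever lives under data/test.

import json
import sys

import tensorflow as tf
import yaml

# Parameters tracked by the evaluate stage's -p flag.
params = yaml.safe_load(open("params.yaml"))["evaluate"]

# Load the model produced by the train stage (outputs/model is passed as argv[1]).
model = tf.keras.models.load_model(sys.argv[1])

# Placeholder test set; the real project reads it from data/test.
_, (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_test = x_test / 255.0

loss, accuracy = model.evaluate(x_test, y_test, batch_size=params["batch_size"])

# -M outputs/metrics.json tells DVC to treat this file as the stage's metrics.
with open("outputs/metrics.json", "w") as f:
    json.dump({"loss": loss, "sparse_categorical_accuracy": accuracy}, f)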
  12. Update params.yaml as you need (a sketch is shown after this list)
  13. Run git add . && git commit -m "add initial pipeline setup" && git push origin main
  14. Run dvc repro to run the pipeline for the first time
  15. Run dvc add outputs/model.tar.gz to track the compressed version of the model
  16. Run dvc push outputs/model.tar.gz
  17. Run echo "/pipeline/__pycache__" >> .gitignore to ignore the unnecessary directory
  18. Run git add . && git commit -m "add initial pipeline run" && git push origin main
  19. Add your Jarvislabs.ai access token and user email to GitHub Secrets as JARVISLABS_ACCESS_TOKEN and JARVISLABS_USER_EMAIL
  20. Add a GitHub access token to GitHub Secrets as GH_ACCESS_TOKEN
  21. Create a PR and write #train as a comment (you have to be the owner of the repo)
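
As mentioned in step 12, here is a params.yaml sketch. The keys mirror the -p flags of the two stages; the values (and the meaning of evaluate.test) are illustrative guesses, so adjust them to your own pipeline.

train:
  train_size: 0.8       # fraction of data used for training (assumption)
  batch_size: 32
  epoch: 10
  lr: 0.001

evaluate:
  test: data/test       # what this means depends on your evaluate.py
  batch_size: 32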

TODO

Brief description of each tool

  • DVC (Data Version Control): Manages data somewhere else (e.g. cloud storage) while keeping the version and remote information in metadata files in the Git repository.
  • DVCLive: Provides callbacks for ML frameworks (e.g. TensorFlow, Keras) to record metrics during training in TSV format.
  • DVC Studio: Visualizes the metrics from files in the Git repository. What to visualize is recorded in dvc.yaml (see the excerpt below).
  • Google Drive: Used here as the remote data storage, but you can use alternatives such as AWS S3, Google Cloud Storage, or your own file server.
  • Jarvislabs.ai: Used to provision cloud GPU VM instances to run each experiment.
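
For reference, the dvc.yaml generated by the two dvc stage add commands above should look roughly like this; entries marked cache: false come from the --plots-no-cache and -M flags:

stages:
  train:
    cmd: python pipeline/train.py outputs/model
    deps:
      - data
      - pipeline/modeling.py
      - pipeline/train.py
    params:
      - train.train_size
      - train.batch_size
      - train.epoch
      - train.lr
    outs:
      - outputs/model
    plots:
      - dvclive/scalars/loss.tsv:
          cache: false
      - dvclive/scalars/sparse_categorical_accuracy.tsv:
          cache: false
      - dvclive/scalars/val_loss.tsv:
          cache: false
      - dvclive/scalars/val_sparse_categorical_accuracy.tsv:
          cache: false
  evaluate:
    cmd: python pipeline/evaluate.py outputs/model
    deps:
      - data/test
      - outputs/model
      - pipeline/evaluate.py
    params:
      - evaluate.test
      - evaluate.batch_size
    metrics:
      - outputs/metrics.json:
          cache: false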
