# Git Based MLOps

This project shows how to achieve MLOps using DVC, DVC Studio, and DVCLive (all built by iterative.ai), together with Google Drive, Jarvislabs.ai, and HuggingFace Hub.
## Instructions

### Prior work

- Click the "Use this template" button to create your own repository
- Wait a few seconds; an `Initial Setup` PR will be created automatically
- Merge the PR, and you are good to go
### After your repo is set up

- Run `pip install -r requirements.txt` ([requirements.txt](requirements.txt))
- Run `dvc init` to enable DVC
- Add your data under the `data` directory
- Run `dvc add [ADDED FILE OR DIRECTORY]` to track your data with DVC
- Run `dvc remote add -d gdrive_storage gdrive://[ID of specific folder in gdrive]` to add Google Drive as the remote data storage
- Run `dvc push`; a URL for authentication will be printed. Open it in a browser and authenticate
- Copy the content of `.dvc/tmp/gdrive-user-credentials.json` and add it as a GitHub Secret named `GDRIVE_CREDENTIALS`
- Run `git add . && git commit -m "initial commit" && git push origin main` to keep the initial setup
- Write your own pipeline under the `pipeline` directory. Code for basic image classification in TensorFlow is provided initially.
- Run the following `dvc stage add` for the training stage:
  ```bash
  # note: no spaces between the comma-separated -p items
  $ dvc stage add -n train \
      -p train.train_size,train.batch_size,train.epoch,train.lr \
      -d pipeline/modeling.py -d pipeline/train.py -d data \
      --plots-no-cache dvclive/scalars/loss.tsv \
      --plots-no-cache dvclive/scalars/sparse_categorical_accuracy.tsv \
      --plots-no-cache dvclive/scalars/val_loss.tsv \
      --plots-no-cache dvclive/scalars/val_sparse_categorical_accuracy.tsv \
      -o outputs/model \
      python pipeline/train.py outputs/model
  ```
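  For reference, `dvc stage add` records the stage in `dvc.yaml`. With the flags above, the generated entry should look roughly like the following sketch (field order and details may differ by DVC version):

  ```yaml
  stages:
    train:
      cmd: python pipeline/train.py outputs/model
      deps:
        - data
        - pipeline/modeling.py
        - pipeline/train.py
      params:
        - train.train_size
        - train.batch_size
        - train.epoch
        - train.lr
      outs:
        - outputs/model
      plots:
        - dvclive/scalars/loss.tsv:
            cache: false
        - dvclive/scalars/sparse_categorical_accuracy.tsv:
            cache: false
        - dvclive/scalars/val_loss.tsv:
            cache: false
        - dvclive/scalars/val_sparse_categorical_accuracy.tsv:
            cache: false
  ```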
- Run the following `dvc stage add` for the evaluate stage:

  ```bash
  $ dvc stage add -n evaluate \
      -p evaluate.test,evaluate.batch_size \
      -d pipeline/evaluate.py -d data/test -d outputs/model \
      -M outputs/metrics.json \
      python pipeline/evaluate.py outputs/model
  ```
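  The `-p` flags in the two stages above refer to keys in `params.yaml`. The concrete values depend on your pipeline; a hypothetical layout matching those flags:

  ```yaml
  train:
    train_size: 0.8   # hypothetical value
    batch_size: 32
    epoch: 10
    lr: 0.001

  evaluate:
    test: data/test   # hypothetical value
    batch_size: 32
  ```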
- Update `params.yaml` as you need
- Run `git add . && git commit -m "add initial pipeline setup" && git push origin main`
- Run `dvc repro` to run the pipeline for the first time
- Compress the trained model (for example into `outputs/model.tar.gz`), then run `dvc add outputs/model.tar.gz` to track the compressed model
- Run `dvc push outputs/model.tar.gz`
- Run `echo "/pipeline/__pycache__" >> .gitignore` to ignore the unnecessary directory
- Run `git add . && git commit -m "add initial pipeline run" && git push origin main`
- Add the Jarvislabs.ai access token and user email to GitHub Secrets as `JARVISLABS_ACCESS_TOKEN` and `JARVISLABS_USER_EMAIL`
- Add a GitHub access token to GitHub Secrets as `GH_ACCESS_TOKEN`
- Create a PR and write `#train` in a comment (you have to be the owner of the repo)
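The `dvc add outputs/model.tar.gz` step above assumes the archive has already been created from the `outputs/model` directory. A minimal sketch of that compression step (the `compress_model` helper is not part of this repo, just an illustration):

```python
import tarfile
from pathlib import Path

def compress_model(model_dir: str, archive_path: str) -> str:
    """Pack a model directory into a gzipped tarball so DVC can track a single file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name
        tar.add(model_dir, arcname=Path(model_dir).name)
    return archive_path
```

For example, `compress_model("outputs/model", "outputs/model.tar.gz")` before running `dvc add`.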
## TODO

- Write solid steps to reproduce this repo for other tasks
- Deploy experimental model to HF Space
- Deploy current model to GKE with the auto TFServing deployment project
- Add more cloud providers
- Add more scripts
## Brief description of each tool

- DVC (Data Version Control): manages data elsewhere (e.g. cloud storage) while keeping the version and remote information in metadata files in the Git repository.
- DVCLive: provides callbacks for ML frameworks (e.g. TensorFlow, Keras) to record metrics during training in TSV format.
- DVC Studio: visualizes the metrics from files in the Git repository. What to visualize is recorded in `dvc.yaml`.
- Google Drive: used as the remote data storage here; you can use others such as AWS S3, Google Cloud Storage, or your own file server.
- Jarvislabs.ai: used to provision cloud GPU VM instances to run each experiment.
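To make the TSV layout concrete, here is a hand-rolled sketch of the kind of per-metric scalar logging DVCLive performs (an illustration of the file format, not the DVCLive API; the `dvclive/scalars` default mirrors the plot paths referenced in the train stage):

```python
import os

def log_scalar(name: str, value: float, step: int, root: str = "dvclive/scalars") -> None:
    """Append one (step, value) row to a per-metric TSV file."""
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, f"{name}.tsv")
    write_header = not os.path.exists(path)
    with open(path, "a") as f:
        if write_header:
            f.write(f"step\t{name}\n")  # one header row per metric file
        f.write(f"{step}\t{value}\n")
```

Calling this once per epoch for `loss`, `val_loss`, and so on yields the `.tsv` files that DVC Studio can plot.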