# Git Based MLOps
This project shows how to achieve MLOps using DVC, DVC Studio, and DVCLive (all built by iterative.ai), together with Google Drive, Jarvislabs.ai, and HuggingFace Hub.
- Click the “Use this template” button to create your own repository
- Wait a few seconds; an Initial Setup PR will be created automatically
- Merge the PR, and you are good to go
## After your repo is set up
- Run `pip install -r requirements.txt` to install the dependencies listed in `requirements.txt`
- Run `dvc init` to enable DVC
- Add your data under the `data` directory
- Run `dvc add [ADDED FILE OR DIRECTORY]` to track your data with DVC
- Run `dvc remote add -d gdrive_storage gdrive://[ID of specific folder in gdrive]` to add Google Drive as the remote data storage
- Run `dvc push`; a URL for authentication will be printed. Open it in your browser and authenticate
- Copy the content of `.dvc/tmp/gdrive-user-credentials.json` and store it in a GitHub Secret under the name your GitHub Actions workflow expects
- Run `git add . && git commit -m "initial commit" && git push origin main` to keep the initial setup
- Write your own pipeline under the `pipeline` directory. Code for basic image classification in TensorFlow is provided initially.
- Run the following `dvc stage add` for the train stage (note: no spaces between the comma-separated `-p` items):

```shell
$ dvc stage add -n train \
    -p train.train_size,train.batch_size,train.epoch,train.lr \
    -d pipeline/modeling.py -d pipeline/train.py -d data \
    --plots-no-cache dvclive/scalars/loss.tsv \
    --plots-no-cache dvclive/scalars/sparse_categorical_accuracy.tsv \
    --plots-no-cache dvclive/scalars/val_loss.tsv \
    --plots-no-cache dvclive/scalars/val_sparse_categorical_accuracy.tsv \
    -o outputs/model \
    python pipeline/train.py outputs/model
```
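Running this command records the stage into `dvc.yaml`. The generated entry looks roughly like the fragment below (standard `dvc.yaml` schema; exact key ordering may differ):

```yaml
stages:
  train:
    cmd: python pipeline/train.py outputs/model
    deps:
      - data
      - pipeline/modeling.py
      - pipeline/train.py
    params:
      - train.train_size
      - train.batch_size
      - train.epoch
      - train.lr
    outs:
      - outputs/model
    plots:
      - dvclive/scalars/loss.tsv:
          cache: false
      - dvclive/scalars/sparse_categorical_accuracy.tsv:
          cache: false
      - dvclive/scalars/val_loss.tsv:
          cache: false
      - dvclive/scalars/val_sparse_categorical_accuracy.tsv:
          cache: false
```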
- Run the following `dvc stage add` for the evaluate stage:

```shell
$ dvc stage add -n evaluate \
    -p evaluate.test,evaluate.batch_size \
    -d pipeline/evaluate.py -d data/test -d outputs/model \
    -M outputs/metrics.json \
    python pipeline/evaluate.py outputs/model
```
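The `-M outputs/metrics.json` flag marks that file as a DVC metrics file, so `pipeline/evaluate.py` is expected to produce it. A minimal sketch of such a writer (metric names and values are illustrative, not taken from the provided code):

```python
import json
import os

def write_metrics(path: str, metrics: dict) -> None:
    """Dump evaluation metrics as JSON so `dvc metrics show` can read them."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)

# Example: values a model.evaluate(...) call might have returned.
write_metrics("outputs/metrics.json", {"loss": 0.35, "accuracy": 0.91})
```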
- Update `params.yaml` as you need
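The `-p` flags above imply `params.yaml` has `train` and `evaluate` sections. Assuming PyYAML is installed, pipeline code might read it like this (the values shown are placeholders, not the repository's defaults):

```python
import yaml

# Hypothetical params.yaml content matching the -p flags used above.
PARAMS_TEXT = """\
train:
  train_size: 0.8
  batch_size: 32
  epoch: 10
  lr: 0.001
evaluate:
  test: data/test
  batch_size: 64
"""

def load_params(text: str) -> dict:
    """Parse params.yaml content into a nested dict."""
    return yaml.safe_load(text)

params = load_params(PARAMS_TEXT)
print(params["train"]["lr"])  # -> 0.001
```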
- Run `git add . && git commit -m "add initial pipeline setup" && git push origin main`
- Run `dvc repro` to run the pipeline initially
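The next step tracks `outputs/model.tar.gz`, so the archive has to be created first. One way to compress the model directory, sketched with Python's standard `tarfile` module (paths assumed from the stage definitions above):

```python
import os
import tarfile

def compress_model(model_dir: str, archive_path: str) -> None:
    """Pack the trained model directory into a gzipped tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=os.path.basename(model_dir))

# compress_model("outputs/model", "outputs/model.tar.gz")
```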
- Run `dvc add outputs/model.tar.gz` to track the compressed version of the model
- Run `dvc push outputs/model.tar.gz`
- Run `echo "/pipeline/__pycache__" >> .gitignore` to ignore the unnecessary directory
- Run `git add . && git commit -m "add initial pipeline run" && git push origin main`
- Add the Jarvislabs.ai access token and user email to GitHub Secrets
- Add a GitHub access token to GitHub Secrets
- Create a PR and write `#train` as a comment (you have to be the owner of the repo)
## Brief description of each tool
- DVC (Data Version Control): manages data elsewhere (e.g., cloud storage) while keeping the version and remote information in metadata files in the Git repository.
- DVCLive: provides callbacks for ML frameworks (e.g., TensorFlow/Keras) to record metrics during training in TSV format.
- DVC Studio: visualizes the metrics from files in the Git repository. What to visualize is recorded in `dvc.yaml`.
- Google Drive: used as the remote data storage. You can use others instead, such as AWS S3, Google Cloud Storage, or your own file server.
- Jarvislabs.ai: used to provision cloud GPU VM instances to run each experiment.
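To illustrate the TSV files that appear under `dvclive/scalars/` (the paths passed to `--plots-no-cache` above), here is a hypothetical stand-in logger; the column layout is an assumption, not the exact DVCLive format:

```python
import csv
import os
import time

def log_scalar(logdir: str, name: str, step: int, value: float) -> None:
    """Append one metric row to <logdir>/<name>.tsv, writing a header first."""
    os.makedirs(logdir, exist_ok=True)
    path = os.path.join(logdir, f"{name}.tsv")
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(["timestamp", "step", name])
        writer.writerow([int(time.time()), step, value])

# One row per epoch, e.g. called from a Keras callback's on_epoch_end.
log_scalar("dvclive/scalars", "loss", 0, 1.234)
log_scalar("dvclive/scalars", "loss", 1, 0.987)
```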