Tracking ML Experiments With Data Version Control

Amit Last Updated : 24 Jun, 2021
9 min read

This article was published as a part of the Data Science Blogathon

” DVC brings agility, reproducibility, and collaboration into your existing data science workflow “

Introduction:

In a typical data science project, the scientists and machine learning experts deal with a large volume of data of various kinds. There are multiple models built with varying configurations, features, with multiple iterations of parameters tuning to get the best performing model. In such a scenario, all the changes we make to data and in the model building process need to be tracked and measured for us to understand what has worked and what has not. It also is necessary to have the option to go back to a specific version and investigate past results. One such tool which lets us track all of this is Data Version Control (DVC) which helps in governing the data, the underlying model and run reproducible results.

What is Data Version Control (DVC)?

Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that we are familiar with (Git, CI/CD, etc.). DVC is meant to be run alongside Git. The git and DVC commands will often be used in tandem, one after the other. While Git is used to storing and version code, DVC does the same for data and model files.

The .dvc file is lightweight and meant to be stored with code in GitHub. When you download a Git repository, you also get the .dvc files. You can then use those files to get the data associated with that repository. Large data and model files go in your DVC remote storage, and small .dvc files that point to your data go in GitHub.
The entire process of loading the data to measuring model metrics to tracking the model experiments is called Machine learning operations aka MLOps. If you need an example of MLOps implementation with Github then please read my previous blog – Bring DevOps To Data Science With MLOps
Data Version Control

In this blog, we will explore and learn the following:

  • Install Data Version Control and learn about its project setup
  • Build ML model and create experiment pipelines that can be reproduced with DVC
  • Compare the metrics with previous iterations/results.
  • Generate ML metrics plots
  • Use Github to add, commit, and push commands

Any Prerequisites?

To understand MLOps & DVC, you will need very basic knowledge of model building in python and a GitHub account. I will be using Visual Studio Code as editor; you can use any editor of your choice as long as you are comfortable with it.

Getting Started:

We will be using Kaggle’s South Africa Heart Disease dataset. Here is the data dictionary for reference. Our objective is to predict chd i.e., coronary heart disease (yes=1 or no=0). A simple binary classification model.
  • sbp: systolic blood pressure
  • tobacco: cumulative tobacco (kg)
  • ldl: low-density lipoprotein cholesterol
  • famhist: family history of heart disease (Present=1, Absent=0)
  • typea: type-A behavior
  • alcohol: current alcohol consumption
  • age: age at onset
  • chd: coronary heart disease (yes=1 or no=0)

Here is a quick snap of basic stats from our data:

data described | Data Version Control

Project Setup: 

To start with, our project folder structure is as below. As we process the files in the later sections, each of these folders will have respective files in them.

+—data
| +—processed –> This will have 2 files train and test after split
| —raw –> This will have the raw data file
+—data_given –> The data from the source either from the local machine or external source
+—report –> A place holder for JSON files.
+—saved_models –> The model pickle file will be saved
—src –> Four python files: get data, load data, splitting data, and Model building

 

Configurations:

The configuration plays a key role in setting up a workflow and also to keep the code clean and improve readability. We will use the YAML file format to define our configurations in stages. To know more about configs, please read my previous blog Reproducible ML Reports Using YAML Configs

Create a file by name param.yaml and have the below configuration defined:

base:
  project: SAHeart-project
  random_state: 999
  target_col: chd
.........
split_data:
  train_path: data/processed/train_SAHeart.csv
  test_path: data/processed/test_SAHeart.csv
  test_size: 0.25
model_dir: saved_models

Please refer to the complete code on GitHub

Fetching Data & Pre-processing:

Let’s create a python file by the name get_data.py where we will load the data and carry out basic data processing. The processed data will be saved in the data_given folder which we had created in an earlier section

...........
def read_params(config_path):
    with open(config_path) as yaml_file:
        config = yaml.safe_load(yaml_file)
    return config
def get_data(config_path):
    config = read_params(config_path)
    data_path = config["data_source"]["s3_source"]
    df = pd.read_csv(data_path, sep=",", encoding='utf-8')
    df = pd.get_dummies(df, columns = ['famhist'], drop_first=True)
    df.drop("sbp",axis=1, inplace=True)
    return df
..........

Please refer to the complete code on GitHub

Loading Data:

We will create a python file load_data.py and load the data from our previous stage and save it in ./raw folder which we had created earlier

.......
def load_and_save(config_path):
    config = read_params(config_path)
    df = get_data(config_path)
    df = df.dropna()
    new_cols = [col.replace(" ", "_") for col in df.columns]
    raw_data_path = config["load_data"]["raw_dataset_csv"]
    df.to_csv(raw_data_path, sep=",", index=False, header=new_cols)
.......

Please refer to the complete code on GitHub

Splitting Data:

Our next stage would be to split the data into train and test which is an essential step before model building. There will be two files created train_SAHeart.csv and test_SAHeart.csv. These two files will be saved under data/processed folder

.............
    train, test = train_test_split(
        df, 
        test_size=split_ratio, 
        random_state=random_state
        )
    train.to_csv(train_data_path, sep=",", index=False, encoding="utf-8")
    test.to_csv(test_data_path, sep=",", index=False, encoding="utf-8")
.............

Please refer to the complete code on GitHub

Model Building:

This is the last of four stages – the model building where we will build a binary classification model using Logistic Regression. Lets create a python file by name train_and_evaluate.py with the below code. In the end, the model in the pickle format is saved under the/saved_models folder.

...........
def train_and_evaluate(config_path):
    config = read_params(config_path)
    test_data_path = config["split_data"]["test_path"]
    train_data_path = config["split_data"]["train_path"]
    random_state = config["base"]["random_state"]
    model_dir = config["model_dir"]
    target = [config["base"]["target_col"]]
    train = pd.read_csv(train_data_path, sep=",")
    test = pd.read_csv(test_data_path, sep=",")
    train_y = train[target]
    test_y = test[target]
    train_x = train.drop(target, axis=1)
    test_x = test.drop(target, axis=1)
    # Build logistic regression model
    model = LogisticRegression(solver='saga', random_state=0).fit(train_x, train_y)
    # Report training set score
    train_score = model.score(train_x, train_y) * 100
    print(train_score)
    # Report test set score
    test_score = model.score(test_x, test_y) * 100
    print(test_score)
.............

Please refer to the complete code on GitHub

At this stage, our project structure is as below. The files highlighted are the ones that we generated from our previous four stages.

+---data
|   +---processed
|   |       test_SAHeart.csv
|   |       train_SAHeart.csv
|   ---raw
|           SAHeart.csv
+---data_given
|       SAHeart.csv
+---report
+---saved_models
|       model.joblib
---src
    |   get_data.py
    |   load_data.py
    |   split_data.py
    |   train_and_evaluate.py

DVC Setup:

If you don’t have DVC Install then install as below. For more information refer install DVC

pip install dvc

Getting started with DVC:

Assuming you have already installed DVC, initiate DVC as below. This is very similar to initiating git.

dvc init

Mapping remote storage: We need to tell DVC where our remote storage is on local machine.

dvc add . /data_given/SAHeart.csv

Note: DVC supports many cloud-based storage systems, such as AWS S3 buckets, Google Cloud Storage, and Microsoft Azure Blob Storage. You can find out more in the official DVC documentation for the dvc remote add command. 

We have successfully installed DVC and mapped our dataset to bed tracked & versioned by DVC.

Create metrics file: Let’s create JSON file under-report folder by name scores.json where we will write some of the key results we need to track. Remember, in our last section, we had created a report folder but didn’t have any file under it.

DVC pipeline:

DVC pipelines are effectively version-controlled steps in a typical machine learning workflow (e.g. data loading, pre-processing, data splitting, model building, metrics measurement, etc.). we will define the dependencies and outputs. Under the hood, DVC is building up a Directed Acyclic Graph (DAG) which is a fancy way of saying a series of steps that can only be executed in one direction.

dvc dag
Data Version Control dag

Let’s create a YAML file by the name dvc.yaml where we define all the stages of the pipeline. In the below sample code we are defining the command that needs to execute with all the dependent files and also the output which in this case is SAHeart.csv file to be placed under /data/raw folder

stages:
  load_data:
    cmd: python src/load_data.py --config=params.yaml
    deps:
    - src/get_data.py
    - src/load_data.py
    - data_given/SAHeart.csv
    outs:
    - data/raw/SAHeart.csv

We will define the remaining stages on similar lines as below:

.......
  split_data:
    cmd: python src/split_data.py --config=params.yaml
    deps:
    - src/split_data.py
    - data/raw/SAHeart.csv
    outs:
    - data/processed/train_SAHeart.csv
    - data/processed/test_SAHeart.csv 
  train_and_evaluate:
    cmd: [python src/train_and_evaluate.py --config=params.yaml, dvc plots show]
    deps:
    - data/processed/train_SAHeart.csv
    - data/processed/test_SAHeart.csv 
    - src/train_and_evaluate.py
    metrics:
    - report/scores.json:
        cache: false
.......

Please refer to the complete code on GitHub

At this stage, we have our model built, DVC installed and setup, defined pipeline with dvc.yaml file. Lets run the pipeline to see if all the stages are executed sequentially as defined.

dvc repro
## Here is the output
'data_givenSAHeart.csv.dvc' didn't change, skipping
Running stage 'load_data':
> python src/load_data.py --config=params.yaml
Updating lock file 'dvc.lock'
Running stage 'split_data':
> python src/split_data.py --config=params.yaml
Updating lock file 'dvc.lock'
Running stage 'train_and_evaluate':
> python src/train_and_evaluate.py --config=params.yaml
.....
.....
To track the changes with git, run:
        git add dvc.lock
Use `dvc push` to send your updates to remote storage.

Ahh !!. All our stages are executed successfully. I have trimmed the output to keep it clean and precise. The dvc also automatically creates a file by name dvc.lock which tracks all the changes. Lets take a look at the content of this file.

schema: '2.0'
stages:
  load_data:
    cmd: python src/load_data.py --config=params.yaml
    deps:
    - path: data_given/SAHeart.csv
      md5: d75f04589cb6fb8dce5772ed762a5621
      size: 22761
    - path: src/get_data.py
      md5: cd5029054879f4cdebb4f105ef891912
      size: 774
    - path: src/load_data.py
      md5: fd8ed8a4e1d3a894523c33f274a42ba1db
      size: 693
    outs:
    - path: data/raw/SAHeart.csv
      md5: 9bcb98547eea864e13d075f2f0e17cea9
      size: 19013

If you observe, the structure of the file is similar to dvc.yaml but the content is different. It has created an identifier to track every single change that goes in the pipeline.

Model Metrics:

As our pipeline has completed its task successfully, lets look at some of the model metrics that are of our interest. Remember, we had created report/scores.json file in the earlier section?. This file would have the metrics data and it was populated with data during the model building stage. Let’s take a look.

dvc metrics show
## Here is the output
Path                Logistic Accuracy    roc_auc    test_score    train_score
reportscores.json  0.62069              0.65093    62.06897      71.96532

Note: Our focus is not on model metrics or on improving it further but on building a reproducible pipeline with DVC to help us track our experiment.

Let’s push the code to github before proceeding further.

git add . && git commit -m "experiment with dvc" && git push origin main

Metric Plots:

As we have built a binary classification model, it will be good to see the confusion matrix. DVC provides plot templates and in this case, we will use the confusion matrix template. We will also set the x (Actual value from test data i.e. target (chd)) and y axis (predicted value from our model).

dvc plots show cm.csv --template confusion -x chd -y Predicted
## Here is the output. The confusion matrix is generated in HTML and saved in current directory
file://C:/Desktop/Versioning-ML-Models-datasets-With-DVC/plots.html
confusion matrix 1 Data Version Control

 

Let’s now experiment by changing some of the parameters and re-running dvc pipeline.

Here are the changes:
param.json change the test size from 0.25 to 0.3
train_and_evaluate.py change solver from ‘saga’ to ‘lbfgs’

Let’s run the DVC pipeline and also check metrics.

dvc repro
dvc metrics show
## Here is the output
Path Logistic Accuracy roc_auc test_score train_score
reportscores.json 0.65517 0.72764 65.51724 74.27746

Well, we have some improvement compared to the last iteration. Let’s compare it with DVC.

dvc metrics diff
## Here is the comparision of both the iterations
Path                Metric             Old       New       Change
reportscores.json  Logistic Accuracy  0.62069   0.65517   0.03448
reportscores.json  roc_auc            0.65093   0.72764   0.07671
reportscores.json  test_score         62.06897  65.51724  3.44828
reportscores.json  train_score        71.96532  74.27746  2.31214

Let’s check confusion matrix as well for this iteration:

dvc plots show cm.csv --template confusion -x chd -y Predicted

 

confusion matrix

We just wrote one line of code to reproduce the entire pipeline, that’s the simplicity DVC brings to ML experiments. 

Closing Note:

You have learned how to use Data Version Control to create experiments!. For every experiment you run, you can version the data you use and the model you train. The best part is you can place your high-volume dataset on remote(cloud) and use DVC to point to the data source saving disk space on your local and on Github. And moreover, the experiments are reproducible, and anyone can repeat what you’ve done.

I hope this article was useful. Good luck with all your experiments !!!!

You can connect with me Linkedin

You can find the code for reference – Github

References

https://dvc.org/

https://unsplash.com/

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

I am a Data Science enthusiast with experience in building predictive models, data processing, and data mining algorithms to solve challenging business problems. Involved in open source community and passionate about building data apps.

Responses From Readers

Clear

Congratulations, You Did It!
Well Done on Completing Your Learning Journey. Stay curious and keep exploring!

We use cookies essential for this site to function well. Please click to help us improve its usefulness with additional cookies. Learn about our use of cookies in our Privacy Policy & Cookies Policy.

Show details