Getting Started with Data Version Control (DVC)

Dheeraj Bhat Last Updated : 13 Jun, 2023

Introduction

If you are reading this blog, you are probably familiar with Git and how integral it has become to software development. Data Version Control (DVC) is an open-source, Git-based version management tool for machine learning development that instills best practices across teams. It manages and tracks changes to data and machine learning models in a collaborative and reproducible manner. It draws inspiration from version control systems used in software development, such as Git, but is tailored specifically to data science projects.

Learning Objectives

In this article, you will develop a basic understanding of:

What Git is
What Data Version Control is
The basics of Data Version Control

This article was published as a part of the Data Science Blogathon.

Table of contents

Introduction
Advantages of Data Version Control (DVC)
- ML Project Version Control
Getting Started
Gdrive Remote Configuration
DVC Pipelines
Conclusion
Frequently Asked Questions

Advantages of Data Version Control (DVC)

ML Project Version Control

DVC lets you connect with storage providers like AWS S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, HDFS, etc., to store ML models and datasets.

ML Experiment Management

DVC tracks metrics automatically, which makes it easier to navigate between experiments and compare them.

Deployment and Collaboration

DVC introduces pipelines that make it easy to bundle ML models, data, and code and move them to production, remote machines, or a colleague’s computer.

Source: dvc.org
<h2>Learning Objectives</h2>
<p>With this article, you will learn the following:</p>
<ul>
<li>Understanding the basics of DVC</li>
<li>How DVC can help with a variety of problems</li>
<li>Installing and using DVC in a Git repository</li>
<li>Configuring DVC for GDrive remote storage</li>
<li>How to use DVC pipelines to reproduce workflows</li>
</ul>
<h2>Use cases of DVC</h2>
<figure>
<figcaption>Source: dvc.org</figcaption>
</figure>
<p>The use cases of DVC are as follows:</p>
<ul>
<li><b>Versioning Data and Models:</b> We can track versions of data and ML models using Git commits. For each dataset or model tracked by DVC, a metafile with a .dvc extension is created. It contains metadata such as the md5 hash, size, number of files, and path.</li>
<li><b>CI/CD for Machine Learning:</b> DVC helps manage data, models, and reproducible pipelines.</li>
<li><b>Fast and Secure Data Caching Hub:</b> DVC’s built-in data caching speeds up data transfers and lets us set up a shared DVC cache that prevents repetitive transfers by linking working files and directories.</li>
<li><b>Experiment Tracking:</b> Running DVC Experiments in your workspace captures relevant changes automatically (input data, source code, hyperparameters, artifacts, etc.). This helps you iterate quickly on experiments, create checkpoints, and compare results.</li>
<li><b>Model Registry:</b> DVC enables us to catalog ML models and versions. This helps to organize model versions from different sources, share metadata, and deploy specific models to dev, test, and production environments.</li>
<li><b>Data Registry:</b> DVC enables cross-project reuse of data artifacts, i.e., different projects can depend on data stored in other repositories.</li>
</ul>
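To make the metafile idea more concrete, here is a minimal Python sketch of the kind of metadata a .dvc file records for a single file: an md5 hash, a size, and a path. It is only an illustration, not DVC's actual implementation; the function names and the sample file are hypothetical, and DVC handles directories and caching in its own way.

```python
import hashlib
import os
import tempfile


def file_metadata(path, chunk_size=1 << 20):
    """Return md5, size, and file name for one file (illustrative only)."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "md5": md5.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


def demo():
    # Hypothetical sample file, created only for this illustration.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")
        with open(sample, "wb") as handle:
            handle.write(b"hello")
        return file_metadata(sample)


if __name__ == "__main__":
    print(demo())
```

Because the hash depends only on the file's content, any change to the data produces a different hash. That is what lets a small metafile committed to Git stand in for a large file stored elsewhere.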
<h2>Installation</h2>
<p>You can install DVC from the PyPI repository using the following command:</p>

pip install dvc

Depending on the type of remote storage that will be used, we have to install optional dependencies: [s3], [gdrive], [gs], [azure], [ssh], [hdfs], [webdav], [oss]. Use [all] to include them all. In this blog, we will use Google Drive as remote storage, so run pip install "dvc[gdrive]" to install the gdrive dependencies (the quotes stop shells such as zsh from interpreting the square brackets).

Getting Started

In this blog, we will see how to use DVC to track data and ML models with Google Drive as remote storage. Imagine a Git repository with the following structure:

Figure: Folder Structure
<p>The data and models folders are usually far larger than the source code of the repository. This is where DVC comes into the picture: it tracks the data and models folders so that Git does not have to store them. Go to the root of the Git repository (the one that contains the data and models folders) and initialize DVC using the command:</p>
<pre><code>dvc init</code></pre>
<p>To start tracking the data and models directories, run the following commands:</p>
<pre><code>dvc add data
dvc add models</code></pre>
<p>This creates a special file with a .dvc extension for each directory (data.dvc and models.dvc). Each .dvc file contains metadata such as the md5 hash, size, number of files, and path. These .dvc files are versioned with the source code in Git. The dvc add command also adds the data and models folders to the .gitignore file. Then, we need to commit the changes to Git using the following command:</p>
<pre><code>git add -A
git commit -m "..."</code></pre>

Gdrive Remote Configuration

Now, we need to configure Google Drive remote storage. Go to your Google Drive and create a folder called dvc_storage. Open the dvc_storage folder and get its folder-id from the URL:

https://drive.google.com/drive/folders/folder-id

# example: https://drive.google.com/drive/folders/0AIac4JZqHhKmUk9PDA

Now, use the following command to register the dvc_storage folder in Google Drive as remote storage. The -d flag marks it as the default remote, so a plain dvc push or dvc pull knows which remote to use:

dvc remote add -d myremote gdrive://folder-id

# example: dvc remote add -d myremote gdrive://0AIac4JZqHhKmUk9PDA
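If you set up remotes often, the folder-id can be extracted programmatically instead of being copied by hand. The Python sketch below assumes the URL contains a folders/<folder-id> path segment, as in the example above. The helper name gdrive_remote_url is hypothetical and not part of DVC.

```python
from urllib.parse import urlparse


def gdrive_remote_url(folder_url):
    """Turn a Google Drive folder URL into a gdrive:// URL for `dvc remote add`."""
    parts = [p for p in urlparse(folder_url).path.split("/") if p]
    # Expect a path that contains .../folders/<folder-id>
    if "folders" not in parts or parts.index("folders") + 1 >= len(parts):
        raise ValueError(f"no folder id found in {folder_url!r}")
    folder_id = parts[parts.index("folders") + 1]
    return f"gdrive://{folder_id}"


if __name__ == "__main__":
    url = "https://drive.google.com/drive/folders/0AIac4JZqHhKmUk9PDA"
    print(gdrive_remote_url(url))  # gdrive://0AIac4JZqHhKmUk9PDA
```

The returned string is the value that follows the remote name in the dvc remote add command.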

Now, we need to commit the changes to the Git repository using the command:

git add -A
git commit -m "configure dvc remote storage"

To push the data to remote storage, we use the following command:

dvc push

Then, we push the changes to git using the command:

git push

To pull the data from remote storage (for example, after cloning the repository on another machine), we can use the following command:

dvc pull

DVC Pipelines

We can use DVC pipelines to reproduce the workflows in our repository. The main advantage is that we can go back to a particular point in time and run the pipeline to reproduce the result we obtained then. A DVC pipeline consists of stages, such as prepare, train, and evaluate, each performing a different task. The pipeline is a DAG (Directed Acyclic Graph) in which the nodes represent the stages and the edges represent the direct dependencies between them. The pipeline is defined in a YAML file (dvc.yaml). A simple dvc.yaml file is as follows:

stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - src/cleanup.sh
      - data/raw
    outs:
      - data/clean.csv
  train:
    cmd: python src/model.py data/clean.csv
    deps:
      - src/model.py
      - data/clean.csv
    outs:
      - data/predict.dat
  evaluate:
    cmd: python src/evaluate.py data/predict.dat
    deps:
      - src/evaluate.py
      - data/predict.dat

The prepare stage runs the data cleaning and pre-processing steps. The train stage trains the machine learning model using the cleaned data produced by the prepare stage. The evaluate stage uses the predictions from the train stage to produce plots and metrics. Running dvc repro executes the stages in dependency order and skips any stage whose dependencies have not changed.
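To see how the DAG follows from the deps and outs entries, here is a small Python sketch that derives the stage dependencies and one valid execution order. The stages are written as a plain dictionary so that no YAML parser is needed. The names STAGES, stage_graph, and run_order are hypothetical, and this is only an illustration of the idea, not how DVC itself is implemented. It requires Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# The deps and outs of the three stages from the dvc.yaml above.
STAGES = {
    "prepare": {"deps": ["src/cleanup.sh", "data/raw"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["src/model.py", "data/clean.csv"], "outs": ["data/predict.dat"]},
    "evaluate": {"deps": ["src/evaluate.py", "data/predict.dat"], "outs": []},
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce one of its dependencies."""
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    return {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }


def run_order(stages):
    """Return an execution order in which every stage follows its upstream stages."""
    return list(TopologicalSorter(stage_graph(stages)).static_order())


if __name__ == "__main__":
    print(stage_graph(STAGES))
    print(run_order(STAGES))  # ['prepare', 'train', 'evaluate']
```

An edge exists whenever one stage lists another stage's output among its dependencies: train depends on prepare through data/clean.csv, and evaluate depends on train through data/predict.dat.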

Conclusion

This blog covers the basics of Data Version Control and shows how to set up DVC with Google Drive as remote storage. For advanced uses (such as CI/CD), the DVC remote needs to be configured using a Google Cloud project. Other supported storage types include AWS S3, Microsoft Azure Blob Storage, self-hosted SSH servers, HDFS, and HTTP. Many DVC commands are analogous to Git commands (for example, dvc fetch, dvc checkout, and dvc status). DVC also has a Visual Studio Code extension, which makes things easier for developers using VS Code. Check out the DVC GitHub repository to learn more about DVC and everything it offers.

Key Takeaways:

The basics of DVC
The use cases of DVC
Installing and using DVC in a Git repository
Configuring a GDrive remote in DVC

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is the DVC command?

A. The DVC command is a command-line tool that provides various functionalities for interacting with DVC projects. It includes commands for initializing a DVC project, tracking data files, managing data pipelines, running experiments, and collaborating with other team members. It serves as the primary interface for interacting with DVC’s features.

Q2. How does DVC work?

A. DVC (Data Version Control) provides a layer of version control specifically for data and machine learning models. It tracks changes to data files, dependencies, and experiments while storing them separately from the codebase, allowing for reproducibility and efficient collaboration.

Q3. What is DVC used for?

A. DVC is used for managing and versioning large datasets, machine learning models, and experiments. It helps streamline the data pipeline, enables reproducibility, and facilitates collaboration among data scientists and machine learning engineers.

Q4. Why use DVC instead of Git?

A. DVC complements Git by focusing on versioning and managing data and machine learning models, while Git primarily handles source code. DVC’s dedicated functionality for data and models includes handling large files efficiently, storing data separately, and enabling reproducibility, which are essential for machine learning projects.

Dheeraj Bhat

Dheeraj is a Data Science and ML enthusiast. He likes writing about Machine Learning and Data Science in general and loves learning new concepts. Feel free to connect with him on LinkedIn.
