The DataHour Synopsis: Writing Reusable and Reproducible Pipeline

ankita184 Last Updated : 04 Sep, 2022

Overview

Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. To make learning data science more engaging, we began a new initiative: “DataHour.”

DataHour is a series of webinars in which top industry experts teach and democratize data science knowledge. On 25th June 2022, we were joined by Mr. Andrey Lukyanenko for a DataHour session on “Writing Reusable and Reproducible Pipelines for Training Neural Networks.”

Andrey Lukyanenko is a Senior Data Scientist at Careem, which provides IT solutions and consulting. He has over ten years of experience in Analytics and Data Science. He aspires to create Deep Learning applications that positively impact people, bringing value to the business while also improving the lives of users/clients.

He is a Kaggle Grandmaster in Notebooks (ranked first in the kernel ranking), Discussions, and Competitions.

Are you excited to dive deeper into the world of Data Engineering? We’ve got you covered. Let’s start with this session’s major highlights.

Introduction

Data science is concerned with reproducibility. Dependable code is critical if we want to ensure that our experiments are not unduly influenced by randomness. At the same time, we should be able to change the code easily as we iterate over ideas.

What is a Training Pipeline?

A training pipeline is code for training neural networks and producing checkpoints with model weights, logs, and other artifacts, for example, images for interpreting the model. Everyone who trains neural networks needs to write some kind of training pipeline. Here, though, we focus on reusable pipelines, which can be used for different datasets and tasks without changing many things.

Styles of Writing Training Code

Coding style is one of the most important things to consider when writing code for business, implementing projects, and so on. In business, it is critical to have consistent results and a well-functioning production system, and machine learning may be only a small part of that system, even if a crucial one. The code is usually expected to be of good quality: tested, well written, and perhaps optimized for a specific goal.

For example, you optimize models for speed if you need to handle a high load. If you want to deploy the model on small devices, you optimize the model size. If you need interpretability, you use simple models, and so on.

As a result, the emphasis is on stable code, stable metrics, and software engineering. You use a different coding style when you need to build a prototype or simply make something work.

For example, suppose you have a new task and only a little experience with it. You have a limited amount of time and must test numerous approaches. So you go to Stack Overflow, Kaggle, and other similar sites, take various pieces of code and hope they work. When they appear functional, you can decide whether to try something else or rewrite this code to do something better.

So, in this case, we simply iterate and don’t write much code. Kaggle is somewhere in the middle: it is about iteration, and you frequently modify the code, quickly adding some features or dropping some parts. At the same time, you must be able to track and reproduce your experiments, because if you want to optimize the metrics, you must be certain that an improvement comes from your changes to the code and is not the result of randomness or of many changes made at once. As you can see, there are many different approaches to writing code, and training pipelines aren’t suited to all cases.

For example, we don’t need to write a large pipeline when trying new things and discarding them quickly. However, when you do write a training pipeline, you get more stable code. You can run many models and compare them easily, and you can reuse your old functions and transfer them to different projects.

Another important distinction is whether you work alone or as a team. If you’re working alone, you can do whatever you want; write whatever code you want, however well or poorly; the important thing is that it works for you. You can use it, reuse it, and so on. However, if you work in a team, you must compromise and make the code understandable by writing documentation, tests, or comments. This matters because you are not the only one who reads the code; others do too.

Training Pipeline

How do we get started with writing code and training neural networks? The default approach is to take a popular framework such as TensorFlow or PyTorch. Others exist and some appear to be catching up, but so far PyTorch and TensorFlow are the most popular. So you open a notebook or a script and write the code from scratch before training the model. The advantage is that you understand exactly how everything works. You can share the code with anyone, and they will understand it if they are familiar with the framework. However, writing this code takes a significant amount of time. Everything, such as training loops, features, and so on, must be written by hand. It’s a lot of code, especially if you need to handle complex cases like distributed training and other advanced techniques, and you must write those yourself too. Experienced people can do it faster, but many potential bugs remain. You could write tests, but tests can have bugs of their own and take additional time.

Another approach is to use a high-level framework, such as Keras (the most widely used high-level framework for TensorFlow), PyTorch Lightning, Catalyst, and others. The nice thing is that you can pick the framework you want; they are usually quite different because they focus on different things. Some concentrate on cutting-edge approaches. Some are concerned with high-quality software engineering. Some have a strict API, while others are simple to use. The main advantage is that you get a lot of ready-made code and a friendly community of people who will answer your questions and help you use the framework.

However, there is a disadvantage: switching between them is much more difficult. If you have written the code in one framework, it will take a long time to port it to another. The logical progression of the first approach is to write your own framework, with your own trainers, classes, methods, and abstractions. It’s a great exercise because it helps you understand how everything works, but the main issue is that it will be difficult to share your code because no one else will understand it.

So this is a good approach mainly when you write code for yourself. Finally, there is a hybrid approach: writing your own wrapper on top of a standard high-level framework, for example, Keras, fast.ai, or another framework. The reason for this is that, while frameworks are fantastic, they are frequently insufficient for certain use cases, so you must make some changes. It’s great that you can use all of the framework’s features and, at the same time, include whatever you want.
The main issue is that when other people try to use your code, they must be familiar with both the high-level framework and your code.

So, it’s more difficult for others, but I see that more and more people use this approach because you don’t need to write everything from scratch, and you can add whatever you want.

Reasons for Writing Pipelines

Writing everything from scratch takes time and can have errors.
If you have to write everything from scratch, it will take a long time, and there may be many errors. Many popular high-level frameworks have dozens, if not hundreds, of features, and if you think it’s possible to write your own framework without any bugs, think again. It takes a lot of self-assurance to say that you can write a framework better than the hundreds of people who contribute to popular high-level frameworks.

You have repeatable pieces of code anyway.
If you work on multiple projects, you will have some repeatable pieces of code anyway, whether for calculating metrics or for setting up optimizers. Perhaps you have optimized some code or have an optimization for a specific metric. You have some code that you copied from your first project to the second, third, and so on. Converting it into classes or functions may be the next logical step.

Standardization among the team
Suppose you have a team and work on specific projects using the same pipeline. In that case, it is much easier to share code; if different people use different frameworks, comparing their solutions and metrics is much more difficult. Having the same pipeline is far superior.

A better understanding of how things work
Another intriguing aspect is that writing a pipeline greatly aids in understanding how these things work. For example, most people can change the top layer of a model in a framework, but they can’t change the inputs and don’t understand how the layers work, which is necessary when writing your own pipeline. This understanding will be extremely beneficial to your job or future projects.

So, where do you begin when creating a training pipeline? People usually open a Jupyter Notebook, Google Colab, a Kaggle kernel, or whatever and type in all the code. It’s great at first because it works, but you should realize that it’s not the best approach because it makes changing things in the code more difficult. It’s difficult to version your code, and it’s even more difficult to commit it, so after some time, you could try to split the code into multiple parts, such as a separate script for the dataset, a separate script for the model, and a separate script for optimization. Then you could add abstractions, configuration files, and so on. After a while, your project grows larger and more modular, and then you find that you need to work on a different project.

For example, if you train a model for image classification and then need to train a model for image segmentation, you won’t need to change many things if your pipeline is good. Of course, you’ll need to change your model and so on, but most other things, such as the training loop, shouldn’t change much. The pipeline is quite good if you can switch to a different task without many changes. By the way, an interesting way to check whether your pipeline is okay is to ask a friend to try to run it; if this person can get it running within, say, a couple of minutes, then your pipeline is fine.

Speaker’s Pipeline – The approach he follows

His strategy evolved over time. He first wrote his code in TensorFlow and then in PyTorch. He has experimented with various frameworks, and his current approach is based on PyTorch Lightning and Hydra. PyTorch Lightning is a high-level PyTorch framework. It abstracts away a lot of code, lets you write less, and provides a lot of flexibility and nice features. Hydra manages configuration by composing multiple configuration files and makes changing hyperparameters easier. He has used the same pipeline in several projects.

For example, he has trained on multiple GPUs and multiple nodes, and has used the pipeline for time series, tabular data, named entity recognition, and image classification without having to change many things. He is still working on his approach and wishes to share it.

Speaker’s Pipeline: Core Ideas

His pipeline is built on several key ideas. To begin with, the components are replaceable, making it simple to change the model, the data loader, or the optimizer. As mentioned above, his configuration files are managed by Hydra. The most important thing is that values in the configuration files can be changed from the command line: any value in any configuration file can be overridden easily. And of course, as with any pipeline, it has logging and is reproducible.
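To make the idea of replaceable components concrete, here is a minimal standard-library sketch. It is not Hydra’s API and not the speaker’s code: the `class_path` and `params` keys and the `apply_overrides`, `parse_value`, and `instantiate` helpers are hypothetical names chosen for illustration, and `types.SimpleNamespace` stands in for a real model or optimizer class (in practice the path might point to something like `torch.optim.Adam`).

```python
import copy
import importlib


def parse_value(raw):
    """Convert a command-line string to int, float, or bool when possible."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    return raw


def apply_overrides(config, overrides):
    """Return a copy of config with 'a.b.c=value' style overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw_value)
    return result


def instantiate(spec):
    """Build an object from {'class_path': 'module.Class', 'params': {...}}."""
    module_name, class_name = spec["class_path"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("params", {}))


# Hypothetical config; SimpleNamespace is only a placeholder class.
config = {
    "model": {"class_path": "types.SimpleNamespace", "params": {"hidden_size": 64}},
    "optimizer": {"class_path": "types.SimpleNamespace", "params": {"lr": 0.001}},
}

# Overrides as they might arrive from the command line (for example sys.argv[1:]).
run_config = apply_overrides(config, ["optimizer.params.lr=0.01"])
model = instantiate(run_config["model"])
optimizer = instantiate(run_config["optimizer"])
print(model.hidden_size, optimizer.lr)
```

Because every component is named in the config, swapping a model or optimizer becomes a config change instead of a code change, which is the property the pipeline relies on.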

To learn more about his pipeline strategy, watch the full session and try building a pipeline of your own to make the learnings stick.

The approach has drawbacks. It’s based on two frameworks; if either of them changes its API, he has to update his pipeline and spend some time on it. If he wants a feature that a library lacks, he has to wait for the next version of the library and can’t do much before that.

Pipeline Needs

Training Loops
The first element of every pipeline is a training loop, in which we iterate over the data, compute losses, calculate the metrics, and possibly trigger some events. For instance, we take a batch, calculate the loss, backpropagate it, and then log the losses and the metrics. Sometimes it is preferable to split this across multiple functions, such as distinct methods for events at the epoch level and at the batch level, which works well when the steps can be abstracted. Still, there are occasions when this is impossible, and you must write the training procedure in a different way.
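The shape of such a loop can be sketched without any framework. The example below is an illustration only, not the speaker’s code: it fits a one-variable linear model with hand-written gradients so that it runs on the standard library alone, and the `make_batches` and `train` names are hypothetical. In a PyTorch pipeline the framework’s autograd and optimizer would take over the backward pass and the update step.

```python
import random


def make_batches(data, batch_size, rng):
    """Yield shuffled mini-batches of (x, y) pairs."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


def train(data, epochs=100, batch_size=8, lr=0.05, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []  # mean training loss per epoch, the "logged" metric
    for _ in range(epochs):
        epoch_loss, seen = 0.0, 0
        for batch in make_batches(data, batch_size, rng):
            # Forward pass and loss (mean squared error).
            errors = [(w * x + b) - y for x, y in batch]
            loss = sum(e * e for e in errors) / len(batch)
            # Backward pass: gradients of the loss, written by hand.
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # Optimizer step.
            w -= lr * grad_w
            b -= lr * grad_b
            epoch_loss += loss * len(batch)
            seen += len(batch)
        history.append(epoch_loss / seen)
    return w, b, history


# Synthetic, noise-free data generated from y = 2x + 1.
data = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(-20, 21)]
w, b, history = train(data)
print(f"w={w:.3f} b={b:.3f} final_loss={history[-1]:.6f}")
```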

Reproducibility
Reproducibility is hard to guarantee fully because some layers aren’t deterministic; even if you set torch.backends.cudnn.deterministic = True, you might still see some variation between training runs. However, in most circumstances, a function that fixes the random seeds should be sufficient to make results agree to several decimal places when you repeat an experiment.
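A seed-fixing helper of this kind is commonly written along the following lines. This is a sketch, not the speaker’s exact function: the name `set_seed` and the default value are assumptions, and the NumPy and PyTorch parts are skipped when those libraries are not installed.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Fix the random seeds of the libraries a training run typically uses."""
    random.seed(seed)
    # Only affects Python processes started after this point, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


set_seed(123)
first = [random.random() for _ in range(3)]
set_seed(123)
second = [random.random() for _ in range(3)]
print(first == second)
```

Calling the helper once at the start of every run, with the seed stored in the configuration, keeps repeated experiments comparable even though some GPU operations may still introduce small differences.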

Experiment Tracking
It’s crucial to track your experiments. TensorBoard is the standard option, and if you haven’t tried it yet, I encourage you to do so because it’s quite good, has many features, and might be sufficient for you. However, many companies now offer their own solutions, such as Weights & Biases, which give you a web interface, some nice features, and frequently a community of people who can assist you.
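To show what a tracker records at its core, here is a deliberately tiny stand-in that uses only the standard library. It is not TensorBoard’s or Weights & Biases’ API; the `JsonlTracker` class, its methods, and the file names are hypothetical. It stores the run’s parameters once and appends one JSON line per logged step.

```python
import json
import tempfile
import time
from pathlib import Path


class JsonlTracker:
    """Minimal experiment tracker: params.json plus an append-only metrics.jsonl."""

    def __init__(self, run_dir, params):
        self.run_dir = Path(run_dir)
        self.run_dir.mkdir(parents=True, exist_ok=True)
        (self.run_dir / "params.json").write_text(json.dumps(params, indent=2))
        self.metrics_path = self.run_dir / "metrics.jsonl"

    def log(self, step, **metrics):
        """Append one record with the step, a timestamp, and the given metrics."""
        record = {"step": step, "time": time.time(), **metrics}
        with self.metrics_path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def read(self):
        """Return all logged records as a list of dicts."""
        with self.metrics_path.open() as f:
            return [json.loads(line) for line in f if line.strip()]

    def best(self, name, mode="min"):
        """Return the record with the lowest (or highest) value of a metric."""
        records = [r for r in self.read() if name in r]
        pick = min if mode == "min" else max
        return pick(records, key=lambda r: r[name])


with tempfile.TemporaryDirectory() as tmp:
    tracker = JsonlTracker(Path(tmp) / "run_001", params={"lr": 0.01, "batch_size": 32})
    for epoch, loss in enumerate([0.9, 0.5, 0.6]):  # made-up validation losses
        tracker.log(step=epoch, valid_loss=loss)
    best = tracker.best("valid_loss", mode="min")
    n_records = len(tracker.read())

print(best["step"], best["valid_loss"], n_records)
```

Dedicated tools add dashboards, run comparison, and artifact storage on top of this idea, which is why they are usually the better choice once a project grows.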

Changing Hyperparameters

One of the most popular approaches is using argparse: in your main training script, you have many lines that define your parameters, and you change them there. It is not very flexible, and it is difficult to find the parameters, so I recommend using configuration files. It isn’t essential to have many configuration files, like in my code; many libraries prefer to have a single huge configuration file.
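The two styles can also be combined. The sketch below is one possible arrangement, not the speaker’s setup: the flag names, the `DEFAULTS` values, and the `load_hparams` helper are assumptions. Defaults live in code, an optional JSON file overrides them, and explicit command-line flags override both.

```python
import argparse
import json

# Hypothetical default hyperparameters.
DEFAULTS = {"lr": 0.001, "batch_size": 32, "epochs": 10}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--config", help="optional JSON file with hyperparameters")
    parser.add_argument("--lr", type=float)
    parser.add_argument("--batch-size", type=int, dest="batch_size")
    parser.add_argument("--epochs", type=int)
    return parser.parse_args(argv)


def load_hparams(argv=None):
    """Merge defaults, then the config file, then explicit command-line flags."""
    args = parse_args(argv)
    hparams = dict(DEFAULTS)
    if args.config:
        with open(args.config) as f:
            hparams.update(json.load(f))
    for key in DEFAULTS:
        value = getattr(args, key)
        if value is not None:  # a flag was given, so it wins over the file
            hparams[key] = value
    return hparams


# Passing the argument list explicitly; a real script would call load_hparams().
hparams = load_hparams(["--lr", "0.01", "--epochs", "3"])
print(hparams)
```

Even in this small form, every new parameter needs another `add_argument` line, which is the inflexibility that pushes larger projects toward configuration files.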

Basic Functionality

The training pipeline should also offer some basic functionality.

Easy to modify for similar problems – It should be simple to adapt. For instance, if you trained a model for binary image classification and now have a different dataset, you shouldn’t need to make many changes to the code. Likewise, when switching from binary to multi-class classification, you shouldn’t need to make many changes. If you only have to modify the model and possibly the loss, the pipeline is fine. However, if you also have to change a lot of other things, the pipeline may not be optimal.

Make predictions – Although it may seem strange, we have seen many pipelines intended only for training: the model was trained on the data, but no methods for making predictions were provided. The pipeline should therefore be able to make predictions, and it should be possible to do so both with and without the pipeline.

Make predictions without a pipeline – Why should we be able to make predictions without the pipeline? Because using the model in production is simpler with plain TensorFlow or PyTorch. Additionally, it should be possible to convert the model to other formats to make it simpler to use in the future.

Changing isn’t very complicated – It should also be simple to change the code. I’ve given some examples of what you might change, but there are others. Some frameworks have many layers of abstraction, making it difficult to understand what exactly needs to be changed. I won’t criticize this approach, but I’d rather things be simpler, more modular, and easier to change.

Using Functionality

Configs, configs, everything – Of course, there are some nuances, but as we’ve already mentioned, configuration files are the best way to manage parameters.
Templates of everything – Some high-level frameworks already offer templates for image segmentation, image classification, text classification, etc. This is incredibly convenient because it reduces the thought required when switching from one task to another.
Training on folds and hyperparameter optimization – Unfortunately, few high-level frameworks support this because they often concentrate on building powerful single models. If you want to train on folds or do hyperparameter optimization, you sometimes have to write the loops yourself, which may be challenging, so it would be nice to have this in the pipeline.
Training with Stages – It would also be nice to train in stages, by which I mean, for instance, changing the image size after several epochs or changing the optimizer during training. Some high-level frameworks already have this feature, and it’s beneficial when trying to push your metric to the limit.
Using the pipeline for a variety of tasks without rewriting all the code
Shareable code and documentation – It’s important to write documentation so that others can use your code and you can remember what you wrote in the future.
Various cool tricks – There are many possible tricks, and it would be wonderful to have them in the pipeline so we don’t have to write them repeatedly: gradient accumulation, increasing dropout, and so on.
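As an illustration of the "training on folds" item, the loop that a pipeline would need to own can be sketched as follows. This is a plain-Python sketch under stated assumptions: `kfold_indices`, `cross_validate`, and `dummy_train` are hypothetical names, and `dummy_train` only returns a placeholder score where a real pipeline would train a model on the training indices and evaluate it on the validation indices.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, valid_indices) for each fold of a shuffled split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx


def cross_validate(n_samples, train_fn, n_folds=5, seed=0):
    """Run train_fn once per fold and return the mean score and all scores."""
    scores = [train_fn(tr, va) for tr, va in kfold_indices(n_samples, n_folds, seed)]
    return sum(scores) / len(scores), scores


def dummy_train(train_idx, valid_idx):
    """Placeholder for 'train on train_idx, score on valid_idx'."""
    return len(valid_idx) / (len(train_idx) + len(valid_idx))


mean_score, scores = cross_validate(10, dummy_train, n_folds=5)
print(mean_score, scores)
```

Fixing the shuffle seed keeps the folds identical between runs, so scores from different models or hyperparameters remain comparable.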

The speaker provided links to resources that can be useful when writing a pipeline and that he uses himself:

https://developers.google.com/machine-learning/crash-course/production-ml-systems
https://medium.com/@CodementorIO/good-developers-vs-bad-developers-fe9d2d6b582b
https://towardsdatascience.com/the-pytorch-training-loop-3c645c56665a
https://neptune.ai/blog/best-ml-experiment-tracking-tools
https://github.com/Erlemar/pytorch_tempest

Conclusion

This article has covered the roadmap for creating reusable and reproducible pipelines for neural network training, with the speaker’s own pipeline as an example.

You can connect with the speaker on:

ods.ai @artgor
https://twitter.com/AndLukyane
