Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. To make learning data science more engaging, we began a new initiative, “DataHour.”
DataHour is a series of webinars in which top industry experts teach and democratize data science knowledge. On 25th June 2022, we were joined by Mr. Andrey Lukyanenko for a DataHour session on “Writing Reusable and Reproducible Pipelines for Training Neural Networks.”
Andrey Lukyanenko is a Senior Data Scientist at Careem, which provides IT solutions and consulting. He has over ten years of experience in Analytics and Data Science. He aspires to create Deep Learning applications that positively impact people, bringing value to the business while also improving the lives of users/clients.
He is a Kaggle Grandmaster in Notebooks (ranked first in the kernel ranking), Discussions, and Competitions.
Are you excited to dive deeper into the world of Data Engineering? We’ve got you covered. Let’s start with this session’s major highlights: Writing Reusable and Reproducible Pipelines for Training Neural Networks.
Reproducibility is a core concern in data science. It is critical to have dependable code if we want to ensure that our experiments are not unduly influenced by randomness. At the same time, we should be able to change the code easily as we iterate over ideas.
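One common way to keep randomness from unduly influencing experiments is to seed every random number generator at the start of a run. The following is a minimal sketch, not the speaker’s code: the function name `set_seed` is hypothetical, and NumPy and PyTorch are treated as optional, so they are seeded only if they happen to be installed.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators available in this environment."""
    random.seed(seed)
    # Only affects hash randomization in Python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np  # optional dependency (assumption)
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency (assumption)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        # Trade some speed for deterministic cuDNN behaviour.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


set_seed(123)
first_run = [random.random() for _ in range(3)]
set_seed(123)
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)
```

Seeding does not guarantee bit-identical results on every hardware and library version, but it removes the most common source of run-to-run variation.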
A training pipeline is code for training neural networks and producing checkpoints with model weights, logs, and other artifacts, for example, images for interpreting the model. Everyone who trains neural networks needs to write some kind of training pipeline. Here, though, we focus specifically on reusable pipelines, which can be used for different datasets and tasks without changing many things.
The context in which you write code is one of the most important things to consider: writing code for business, implementing projects, and so on. In business, it is critical to have consistent results and a well-functioning production system. In this case, machine learning may be only a small part of the system, even though it’s a crucial one. The code is usually expected to be of good quality: tested, well written, and possibly optimized for a specific goal.
For example, you optimize models for speed if you need to handle a high load. If you want to deploy the model on small devices, you optimize the model size. If you need interpretability, you use simple models, and so on.
As a result, the emphasis is on stable code, stable metrics, and software engineering. You use a different coding style when you need to build a prototype or simply make something work.
For example, suppose you have a new task and only a little experience with it. You have a limited amount of time and must test numerous approaches. So you go to Stack Overflow, Kaggle, and other similar sites, take various pieces of code and hope they work. When they appear functional, you can decide whether to try something else or rewrite this code to do something better.
So, in this case, we simply iterate and don’t write much code. Kaggle is somewhere in the middle, because it is about iteration and you frequently modify the code: you quickly add some features or drop some parts. At the same time, you must be able to track and reproduce your experiments, because if you want to optimize the metrics, you must be certain that a change to the code really improves them and that the improvement is not the result of randomness or of many simultaneous changes. As you can see, there are many different approaches to writing code, and training pipelines aren’t suited to all cases.
For example, we don’t need to write a large pipeline when we are trying new things and quickly discarding them. However, when you do write a training pipeline, you get more stable code. You can run many models and compare them easily, and you can reuse your old functions and transfer them to different projects.
Another important distinction is whether you work alone or as a team. If you’re working alone, you can do whatever you want; write whatever code you want, however well or poorly; the important thing is that it works for you. You can use it, reuse it, and so on. However, if you work in a team, you must compromise and make the code understandable by writing documentation, tests, or comments. This matters because you are not the only one who reads the code; others do as well.
How do we get started with writing code for training neural networks? The default approach is to take a popular framework, such as TensorFlow or PyTorch. They are the most well-known, but there are others. Keras, for example, appears to be overtaking TensorFlow. But, so far, PyTorch and TensorFlow are the most popular. So you open a notebook or a script and write the code from scratch to train the model. The advantage is that you understand exactly how everything works. You can share the code with anyone, and they will understand it if they are familiar with the framework. However, writing this code takes a significant amount of time. Everything, such as training loops, features, and so on, must be written by hand. It’s a lot of code, especially if you need to handle complex cases like distributed training, advanced machine learning, etc., and you must write all of it yourself. Experienced people can do it faster, but many potential bugs remain. You could write tests, but tests can have their own bugs and take additional time.
Another approach is to use a high-level framework, such as Keras (the most widely used high-level framework for TensorFlow), PyTorch Lightning, Catalyst, and others. The nice thing is that you can pick the framework you want, and they are usually quite different because they focus on different things. Some concentrate on cutting-edge approaches. Some are concerned with high-quality software engineering. Some have a strict API, while others are simple to use. It’s convenient to be able to choose among them. The main advantage is that you get a lot of ready-made code and a friendly community of people who will answer your questions and help you use the framework.
However, there is a disadvantage: switching between them is much more difficult. If you have written the code in one framework, it takes a long time to port it to another. The logical progression of the first approach is to write your own framework, with your own trainers, classes, methods, and abstractions. It’s great to have because it helps you understand how everything works, but the main issue is that it is difficult to share your code because no one else will understand it.
So this is a good way to write code mainly for yourself. There is also a hybrid approach: writing your own wrapper on top of a standard high-level framework, for example, Keras, fast.ai, or another framework. The reason is that, while frameworks are fantastic, they are frequently insufficient for certain use cases, so you must make some changes. On the one hand, it’s great that you can use all of the framework’s features. At the same time, you can add whatever you want.
The main issue is that when other people try to use your code, they must be familiar with both the high-level framework and your code.
So, it’s more difficult for others, but the speaker sees more and more people using this approach because it helps: you don’t need to write everything from scratch, and you can add whatever you want.
For example, if you train a model for image classification and then need to train a model for image segmentation, a good pipeline won’t require many changes. Of course, you’ll need to change your model, but most other things, such as the training loop, shouldn’t change much. The pipeline is quite good if you can switch to a different task without many changes. By the way, an interesting way to check whether your pipeline is okay is to ask a friend to try to run it; if this person can run your pipeline within, say, a couple of minutes, then it is fine.
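The idea that the training loop stays the same while the task-specific parts change can be sketched in plain Python. This is only an illustration under simplifying assumptions: `LinearModel`, `mse_step`, and `fit` are hypothetical names, and a one-variable linear model stands in for a neural network so that the example runs without a deep learning library.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Batch = Sequence[Tuple[float, float]]  # (input, target) pairs


@dataclass
class LinearModel:
    """Toy stand-in for a neural network: y = w * x + b."""
    w: float = 0.0
    b: float = 0.0

    def predict(self, x: float) -> float:
        return self.w * x + self.b


def mse_step(model: LinearModel, batch: Batch, lr: float) -> float:
    """Task-specific part: one gradient step on mean squared error.

    Returns the batch loss measured before the update.
    """
    n = len(batch)
    loss = grad_w = grad_b = 0.0
    for x, y in batch:
        err = model.predict(x) - y
        loss += err * err / n
        grad_w += 2 * err * x / n
        grad_b += 2 * err / n
    model.w -= lr * grad_w
    model.b -= lr * grad_b
    return loss


def fit(model, batches: Sequence[Batch], step_fn: Callable, epochs: int, lr: float) -> List[float]:
    """Task-agnostic part: the loop never looks inside the model or the loss."""
    history = []
    for _ in range(epochs):
        epoch_loss = sum(step_fn(model, batch, lr) for batch in batches)
        history.append(epoch_loss / len(batches))
    return history


# Noise-free data generated from y = 2x + 1, split into two mini-batches.
data = [(float(x), 2.0 * x + 1.0) for x in range(4)]
batches = [data[:2], data[2:]]
model = LinearModel()
history = fit(model, batches, mse_step, epochs=500, lr=0.05)
print(round(model.w, 2), round(model.b, 2))
```

Switching to another task would mean passing a different model and a different `step_fn`; `fit` itself would not change, which is the property the paragraph above describes.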
The speaker’s strategy evolved over time. He first wrote his code in TensorFlow and then in PyTorch. He has experimented with various frameworks, and his current approach is based on PyTorch Lightning and Hydra. PyTorch Lightning is a high-level PyTorch framework. It abstracts away a lot of code, lets you write less code, and provides a lot of flexibility and nice features. Hydra manages configuration by combining multiple configuration files and making it easier to change hyperparameters. He has used the same pipeline in several projects.
For example, he has trained on multiple GPUs and multiple nodes, and on tasks involving time series, tabular data, named entity recognition, and image classification, without having to change many things. He is still working on his approach and wishes to share it.
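To make the configuration handling described above more concrete, here is a small standard-library sketch of the two ideas: composing one configuration from several pieces and overriding individual values with `dotted.key=value` strings. It is not Hydra’s implementation and does not use Hydra’s API; the helper names `merge` and `apply_overrides`, and the example values such as `resnet18`, are hypothetical.

```python
import ast
from copy import deepcopy
from typing import Any, Dict, List

Config = Dict[str, Any]


def merge(base: Config, extra: Config) -> Config:
    """Recursively merge `extra` into a copy of `base`; values in `extra` win."""
    result = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def apply_overrides(cfg: Config, overrides: List[str]) -> Config:
    """Apply 'dotted.key=value' strings, as they might arrive from the command line."""
    result = deepcopy(cfg)
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        try:
            node[leaf] = ast.literal_eval(raw)  # numbers, booleans, lists
        except (ValueError, SyntaxError):
            node[leaf] = raw  # anything else stays a plain string
    return result


# Two small configuration pieces, as if loaded from separate files.
model_cfg = {"model": {"name": "resnet18", "num_classes": 10}}
optim_cfg = {"optimizer": {"name": "adam", "lr": 0.001}}

cfg = merge(model_cfg, optim_cfg)
cfg = apply_overrides(cfg, ["optimizer.lr=0.01", "model.name=resnet34"])
print(cfg)
```

Keeping each concern in its own small configuration piece makes it possible to swap one part, such as the optimizer, without touching the others.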
His pipeline rests on several key concepts. To begin with, it has replaceable components, making it simple to change the model, the data loader, or the optimizer. As stated previously, his configuration files are managed by Hydra, which the session covers in more detail. The most important thing is that values in the configuration files can be changed from the command line: any value in any configuration file can be easily changed. And of course, as with any pipeline, it has logging and is reproducible.
To learn more about his pipeline strategy, watch the full session and build a pipeline of your own to reinforce what you learn.
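One way replaceable components can work is to name the class to build inside the configuration itself, so that changing a component means changing a string rather than the code. The sketch below illustrates that idea with the standard library only; the `instantiate` helper here is hypothetical, and standard-library classes stand in for model or optimizer classes so that the example runs anywhere. Hydra offers a comparable mechanism through a `_target_` key, but this sketch does not reproduce its behaviour.

```python
import importlib
from typing import Any, Dict


def instantiate(cfg: Dict[str, Any]) -> Any:
    """Build an object from a dict holding a dotted `_target_` path plus keyword arguments."""
    params = dict(cfg)  # copy so the caller's config is left untouched
    module_path, _, attr = params.pop("_target_").rpartition(".")
    factory = getattr(importlib.import_module(module_path), attr)
    return factory(**params)


# Stand-ins for "component" configs; in a real pipeline the target would be
# the import path of a model, dataset, or optimizer class (assumption).
component_a = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
component_b = {"_target_": "collections.deque", "maxlen": 5}

obj_a = instantiate(component_a)
obj_b = instantiate(component_b)
print(type(obj_a).__name__, type(obj_b).__name__)
```

Because the calling code only ever sees `instantiate(cfg)`, swapping one component for another is a configuration change.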
His approach has disadvantages too. It’s based on two frameworks; if either of them changes its API, he has to change his pipeline and spend some time on it. If he wants to add a feature that a library lacks, he has to wait for the next version of the library and can’t do anything before that.
Keeping parameters directly in the code is not very flexible, and it makes the parameters difficult to find, so the speaker recommends using configuration files. It isn’t essential to have many configuration files, as in his code; many libraries prefer a single huge configuration file. Another of the most popular approaches is argparse, where the main training script contains many lines that define the parameters, and you change their values there.
A training pipeline should also provide some additional functionality.
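The command-line approach mentioned above can be sketched with Python’s standard `argparse` module. The flag names and default values below are hypothetical and chosen only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the training parameters in one place in the main script."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative sketch).")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--batch-size", type=int, default=32, help="samples per batch")
    parser.add_argument("--epochs", type=int, default=10, help="number of training epochs")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    return parser


# An explicit argument list is passed here so the example does not depend on
# how the script is launched; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["--lr", "0.01", "--epochs", "3"])
print(args.lr, args.batch_size, args.epochs, args.seed)
```

This keeps all parameters visible in one place, but the list of `add_argument` calls grows with every new parameter, which is one reason configuration files scale better for larger pipelines.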
The speaker provided links to resources that he uses himself and that may be useful while writing a pipeline:
https://developers.google.com/machine-learning/crash-course/production-ml-systems
https://medium.com/@CodementorIO/good-developers-vs-bad-developers-fe9d2d6b582b
https://towardsdatascience.com/the-pytorch-training-loop-3c645c56665a
https://neptune.ai/blog/best-ml-experiment-tracking-tools
https://github.com/Erlemar/pytorch_tempest
This article has covered a roadmap for creating reusable and reproducible pipelines for neural network training, along with an example based on the speaker’s own pipeline.
You can connect with the speaker on: