“I must admit it (Kaggle Competitions) made a huge impact on my career. It was the key reason why I managed to switch to the Data Science area.” – Dmitry Gordeev
Remember when you said ‘no’ to data science competitions? Perhaps you found them too difficult to crack or you felt they weren’t worth the effort.
Well, our popular Kaggle Grandmaster Series is certainly bursting that bubble! We have received an overwhelmingly positive response to the first three interviews and we are delighted to bring the fourth edition today!
Please put your hands together for Kaggle Rank #9 and Grandmaster Dmitry Gordeev!
Dmitry is a Kaggle Competitions Grandmaster and one of the top community members that many beginners look up to. He has 10 gold medals and 4 silver medals to his name, an achievement that sets him apart. He is also a Kaggle Expert in the discussions category.
Dmitry graduated from Lomonosov Moscow State University (MSU) in 2010 as a specialist in pattern recognition. Before joining H2O.ai, he was deeply involved in the Risk Management industry. He brings all this experience to the table in this Kaggle Grandmaster Series interview!
So without any further ado, let's begin!
DG: I spent several years working as a specialist in retail banking credit risk, focused on statistical model development and validation. That was, to a large extent, data analytics work, but it also included applying basic machine learning and time series models.
Luckily, my background covered general areas of machine learning, so when I decided to move to Data Science, I did not have to start from scratch. But there was quite a large gap in tooling that I had to bridge. Kaggle was probably the main source of knowledge in that period, allowing students to learn best practices and new approaches, and to try new creative (and not so creative) ideas. An amazing community full of brilliant and supportive people helps you get into difficult topics quickly.
“Another big gap I had was related to tools for proper code management, collaboration, and model deployment. But I had an opportunity to develop a series of small data-related internal projects end-to-end in a small team. That was a great experience, forcing me to work with tools I hadn’t been exposed to before.”
DG: The industry is quite heavily regulated in Europe and generally focuses on explainable decision making. Therefore, it is common to prefer robust, well-known approaches over complex black-box models.
However, AI has always been a topic of interest in this area, as it can provide new ways of extracting information from the large data samples a bank typically collects, and the ability to produce more accurate predictive models for business use.
DG: I think the low-hanging fruit with regard to machine learning in Risk Management is the ability to bring new types of data into consideration, like texts, graphs, and images. It is exactly the type of data that was difficult to analyze with standard methods and hence was not scrutinized enough.
But these are the areas where machine learning shines, especially considering recent developments in language models and transfer learning in general.
Another aspect is the developing domain of explainable AI, which can be a game-changer for industries such as Risk Management. The ability to use more diverse data, make better forecasts, and be able to explain them can have a dramatic impact.
DG: Sure!
DG: It was a challenge to start the very first competition because I was insecure about my knowledge and skills. But the desire to get better on the leaderboard always motivated me to continue, constantly learn, try, and not to give up.
“I quickly realized how addictive and time-consuming competitions can be, so arguably the main challenge is to find a good balance between spending effort on trying all the ideas out and having enough rest and time off.”
Also, don’t give up if something doesn’t work. Most of the ideas will fail, and that is fine. Everyone goes through it; nobody knows the best solution upfront. You just need to be patient enough to keep looking for the approach that works. And then proceed further, searching for the next big idea that beats the current one.
DG: Looking back, I must admit it made a huge impact on my career. It was the key reason why I managed to switch to the Data Science area.
It is common for your expertise to be judged by your past employment. So, risk managers are expected to be good at risk management, but not at machine learning.
Participation in competitions, though extremely time-consuming and leaving barely any spare time for other activities, helped me change my career path.
DG: There is a single criterion and it is simple – does it look like I will enjoy working on it? It might be an interesting topic or challenging data. Most of my past competitions were driven by the desire to try something new, like language models, or time-series-like data from earthquakes.
I joined the NFL Big Data Bowl competition because it was one of the few sports-related competitions with quite novel data behind it. This way I kept my motivation high to either produce a better model or learn something new for myself, both in machine learning and in the domain of the contest. And high motivation brings new ideas and a desire to invest more and more time in implementing them.
DG: I had absolutely no knowledge about Indic languages before, but now I feel proud that I can recognize some of the graphemes when I see them.
“That’s probably the beauty of machine learning as a discipline – it can be applied across multiple domains, while often very little domain knowledge is required to produce valuable results. It is more typical to classify problems by the type of underlying data rather than by the domain.”
For instance, the Bengali AI Handwritten Grapheme Classification challenge attracted many brilliant computer vision specialists, many of whom had never worked with text images before. But the common approaches which allow AI to distinguish a dog from a cat, identify a pedestrian on a road, or even generate a realistic image of a human face, can be used to classify complex Bengali graphemes.
DG: Absolutely, xgboost and lightgbm are still the first choice for traditional structured data in tabular format, and frequently for time series forecasting. This matters in industry, where data is traditionally collected in a structured manner.
“Gradient boosting methods typically produce more accurate models, while requiring less computational resources and much less time for training. Neural networks can serve as complementary models, improving the overall ensemble, but only when carefully tuned for the dataset.”
Neural networks are opening up new areas for AI, such as natural language, computer vision, signal classification, deep reinforcement learning, and many more to come. Machine learning competitions have shifted their focus from tabular data to these new areas, which is why we see such a boom of deep learning in competitive fields. It is exciting, but traditional methods are still as important as they were before.
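For readers new to gradient boosting, the core idea behind libraries such as xgboost and lightgbm can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: a single numeric feature, squared-error loss, one-split "stump" trees, and made-up data. The function names (`fit_stump`, `fit_gradient_boosting`, and so on) are hypothetical, and this is not how either library is actually implemented.

```python
# Toy gradient boosting for regression (squared loss), standard library only.
# Assumption: one numeric feature and tiny synthetic data, for illustration.


def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(predict_fn, xs, ys):
    return sum((predict_fn(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Synthetic data (an assumption for this sketch): y = x^2 on [0, 4).
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

mean_y = sum(ys) / len(ys)
baseline_mse = mean_squared_error(lambda x: mean_y, xs, ys)

model = fit_gradient_boosting(xs, ys)
boosted_mse = mean_squared_error(lambda x: predict(model, x), xs, ys)

print("baseline MSE:", baseline_mse)
print("boosted MSE:", boosted_mse)
```

Each round adds a small correction aimed at the errors the current ensemble still makes. Production libraries add much more on top of this, such as deeper trees, regularization, and efficient handling of many features.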
DG: I think there is no single correct way to do things and everyone develops their own approach. We explore and visualize data to answer the questions we have, and what matters is how quickly I can get to the answers. Therefore, I would suggest using tools you are comfortable with and know well enough to apply quickly. In the end, data science is often about trial and error, so it is crucial to learn to fail fast.
At university, I used low-level programming languages and MATLAB. So naturally, I started learning R for data science, but quite quickly decided to switch to Python. Nowadays the Python ecosystem has probably everything a data scientist might wish for. Core packages like numpy, pandas, scipy, and scikit-learn are sufficient to efficiently answer data-related questions, while PyTorch and lightgbm cover almost all the needs for powerful and flexible model fitting. I believe knowing these core building blocks well will already allow you to build exceptional things.
One of our favorite interviews so far! Dmitry’s analytical approach to answering questions is just out of this world. Make sure you capture the lessons here and hold on to them.
This is the fourth interview in the Kaggle Grandmaster Series. You can read the first three interviews here:
What did you learn from this interview? Are there other data science leaders you would want us to interview? Let me know in the comments section below!