Kaggle Grandmaster Series – Competitions Grandmaster and Rank #9 Dmitry Gordeev’s Phenomenal Journey!

Analytics Vidhya Last Updated : 27 Nov, 2020

Welcome back to the Kaggle Grandmaster Series!

“I must admit it (Kaggle Competitions) made a huge impact on my career. It was the key reason why I managed to switch to the Data Science area.” – Dmitry Gordeev

Remember when you said ‘no’ to data science competitions? Perhaps you found them too difficult to crack or you felt they weren’t worth the effort.

Well, our popular Kaggle Grandmaster Series is certainly bursting that bubble! We have received an overwhelmingly positive response to the first three interviews and we are delighted to bring the fourth edition today!

Please put your hands together for Kaggle Rank #9 and Grandmaster Dmitry Gordeev!

Dmitry is a Kaggle Competitions Grandmaster and one of the top community members that many beginners look up to. He has 10 gold medals and 4 silver medals to his name, an achievement that sets him apart. He is also a Kaggle Expert in the discussions category.

Dmitry graduated from Lomonosov Moscow State University (MSU) in 2010 as a specialist in pattern recognition. Before joining H2O.ai, he was deeply involved in the Risk Management industry. He brings all this experience to the table in this Kaggle Grandmaster Series interview!

In this interview, we cover a range of topics, including:

Dmitry Gordeev’s Experience in Data Science
Dmitry’s Kaggle Journey from Scratch to becoming a Kaggle Grandmaster
Dmitry’s advice to beginners in Data Science

So without any further ado, let’s begin!

Dmitry Gordeev’s Experience

Analytics Vidhya (AV): You had quite a few years of experience as a data analyst before transitioning into Data Science. Was this gap too large in terms of tools and processes, and how did you bridge it?

Dmitry Gordeev (DG): I spent several years working as a specialist in retail credit risk in banking, focused on statistical model development and validation. That was, to a large extent, data analytics work, but it also included applying basic machine learning and time series models.

Luckily, my background covered general areas of machine learning, so when I decided to move to Data Science, I did not have to start from scratch. But there was quite a large gap in tools that I had to bridge. Kaggle was probably the main source of knowledge in that period, allowing students to learn best practices and new approaches, and to try new creative (and not so creative) ideas. An amazing community full of brilliant and supportive people helps you get into difficult topics quickly.

“Another big gap I had was related to tools for proper code management, collaboration, and model deployment. But I had an opportunity to develop a series of small data-related internal projects end-to-end in a small team. That was a great experience, forcing me to work with tools I hadn’t been exposed to before.”

AV: We noticed that you have considerable experience in the domain of risk management, specifically in retail. Can you tell our community how you’ve used data science in this industry?

DG: The industry is quite heavily regulated in Europe and generally focuses on explainable decision making. Therefore, it is common to prefer robust, well-known approaches over complex black-box models.

However, AI has always been a topic of interest in this area, as it can provide new ways of extracting information from the large data samples a bank typically collects, as well as the ability to produce more accurate predictive models for business use.

AV: How do you see the future shaping up in Risk Management with respect to machine learning?

DG: I think the low-hanging fruit with regard to machine learning in Risk Management is the ability to bring new types of data into consideration, like texts, graphs, and images. It is exactly the type of data that was difficult to analyze with standard methods and hence was not scrutinized enough.

But these are the areas where machine learning shines, especially considering recent developments in language models and transfer learning in general.

Another aspect is the developing domain of explainable AI, which can be a game-changer for industries such as Risk Management. The ability to use more diverse data, make better forecasts, and be able to explain them can make a dramatic impact.

AV: A lot of aspiring data scientists would love to know what your daily tasks as a Senior Data Scientist at H2O.ai entail. Can you take us through a typical day in your work life?

DG: Sure!

One of the core areas of H2O expertise is AutoML, where we provide both open source and commercial products. A part of my typical day is dedicated to helping our customers get the best out of the H2O tools for their use cases. These are companies representing various industries, such as healthcare, retail, production, and many more.
Another part of my daily job is dedicated to the development of new AI services and products. For instance, this year we invested our efforts into implementing and sharing the code of several predictive models for forecasting the spread of COVID-19. But more importantly, we stressed the necessity to properly backtest and validate such models, as key decisions can be based on the produced forecasts. The more general topic of model validation and model robustness is the current focus of my activities.
Last, but not least, initiatives related to AI applications for good always capture my attention. A good example of that was a recent Kaggle competition dedicated to predicting the stability of mRNA molecules, which can help the development of mRNA vaccines.
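To make the backtesting point above more concrete, here is a minimal sketch of a walk-forward (rolling-origin) backtest in plain Python. It is not the code H2O.ai shared; the series is made up, and the function names (`walk_forward_backtest`, `last_value_forecast`, `recent_trend_forecast`) are hypothetical helpers introduced only for illustration. The idea is simply that each forecast is scored on data the model has not seen yet.

```python
from statistics import mean


def walk_forward_backtest(series, forecast_fn, min_train=4, horizon=1):
    """Score forecast_fn by repeatedly forecasting from a growing history.

    At each step t the model sees only series[:t] and is evaluated on the
    next `horizon` points, so no future information leaks into a forecast.
    Returns the mean absolute error averaged over all folds.
    """
    fold_errors = []
    for t in range(min_train, len(series) - horizon + 1):
        history = series[:t]
        actual = series[t:t + horizon]
        predicted = forecast_fn(history, horizon)
        fold_errors.append(mean(abs(a - p) for a, p in zip(actual, predicted)))
    return mean(fold_errors)


def last_value_forecast(history, horizon):
    """Naive baseline: repeat the last observed value."""
    return [history[-1]] * horizon


def recent_trend_forecast(history, horizon):
    """Extrapolate the average step over the last three intervals."""
    step = (history[-1] - history[-4]) / 3
    return [history[-1] + step * (h + 1) for h in range(horizon)]


# Synthetic, made-up series used purely for illustration.
series = [10, 12, 14, 16, 18, 20, 22, 24]

naive_mae = walk_forward_backtest(series, last_value_forecast)
trend_mae = walk_forward_backtest(series, recent_trend_forecast)
print(f"naive MAE: {naive_mae}, trend MAE: {trend_mae}")
```

Comparing a candidate model against a naive baseline under the same backtest is one simple way to check whether its forecasts are worth acting on.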

Dmitry’s Kaggle Journey from Scratch

AV: You’re a Kaggle Competitions Grandmaster with a current rank of 9. What challenges did you face when you started out and when you started climbing the leaderboard?

DG: It was a challenge to start the very first competition because I was insecure about my knowledge and skills. But the desire to get better on the leaderboard always motivated me to continue, constantly learn, try, and not to give up.

“I quickly realized how addictive and time-consuming competitions can be, so arguably the main challenge is to find a good balance between spending effort on trying all the ideas out and having enough rest and time off.”

Also, don’t give up if something doesn’t work: most ideas will fail, and that is fine. Everyone goes through it; nobody knows the best solution upfront. You just need to be patient enough to keep looking for the approach that works. And then proceed further, searching for the next big idea that beats the current one.

AV: How has participation in Hackathons helped you in your career?

DG: Looking back, I must admit it made a huge impact on my career; it was the key reason why I managed to switch to the Data Science area.

It is common for your expertise to be judged by your past employment. So, risk managers are expected to be good at risk management, but not at machine learning.

Participating in competitions, though extremely time-consuming and leaving barely any spare time for other activities, helped me change my career path.

AV: We noticed that the competitions in which you have achieved high ranks are pretty diverse, ranging from fraud detection to earthquake prediction. Do you have any specific criteria for choosing a competition to participate in, and if so, could you list them?

DG: There is a single criterion and it is simple – does it look like I will enjoy working on it? It might be an interesting topic or challenging data. Most of my past competitions were driven by the desire to try out something new, like language models or time-series-like data from earthquakes.

I joined the NFL Big Data Bowl competition because it was one of the few sports-related competitions with quite novel data behind it. This way I kept my motivation high to either produce a better model or learn something new for myself, both in machine learning and in the domain of the contest. And high motivation brings new ideas and a desire to invest more and more time in implementing them.

AV: One of your competitions that grabbed our attention was Bengali AI Handwritten Grapheme Classification, in which you finished second. Do you have any knowledge of Indic languages? If not, how were you able to achieve such a good rank in that competition?

DG: I had absolutely no knowledge about Indic languages before, but now I feel proud that I can recognize some of the graphemes when I see them.

“That’s probably the beauty of machine learning as a discipline – it can be applied across multiple domains, while often very little domain knowledge is required to produce valuable results. It is more typical to classify problems by the type of underlying data rather than by the domain.”

For instance, the Bengali AI Handwritten Grapheme Classification challenge attracted many brilliant computer vision specialists, many of whom have never worked with text images before. But the common approaches which allow AI to distinguish a dog from a cat, identify a pedestrian on a road, or even generate a realistic image of a human face, can be used to classify complex Bengali graphemes.

Dmitry’s Advice for Beginners in Data Science

AV: With the recent boom in deep learning and neural networks, do you still see traditional techniques like ensemble modeling holding their own – both in competitions and in the industry?

DG: Absolutely. xgboost and lightgbm are still the first choice for traditional structured data in tabular format, and frequently for time series forecasting. This matters in industry, where data is traditionally collected in a structured manner.

“Gradient boosting methods typically produce more accurate models, while requiring fewer computational resources and much less time for training. Neural networks can serve as complementary models, improving the overall ensemble, but only when carefully tuned for the dataset.”

Neural networks are opening up new areas for AI, such as natural language, computer vision, signal classification, deep reinforcement learning, and many more to come. The machine learning competitions changed focus from tabular data to these new areas, therefore we see such a boom of deep learning in competitive fields. It is exciting, but traditional methods are still as important as they were before.
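As a rough illustration of how a neural network can complement a gradient boosting model in an ensemble, the sketch below blends two sets of predicted probabilities and picks the blend weight that minimizes log loss on a validation set. This is an assumption about one common way to ensemble, not Dmitry’s own method. The numbers are made up, `gbm_prob` and `nn_prob` stand in for hypothetical model outputs, and in practice the weight would be chosen on out-of-fold predictions.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy, with probabilities clipped to (eps, 1 - eps)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def blend(prob_a, prob_b, weight_a):
    """Weighted average of two models' predicted probabilities."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(prob_a, prob_b)]


def best_blend_weight(y_true, prob_a, prob_b, steps=20):
    """Grid-search the weight of model A in [0, 1] that minimizes log loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: log_loss(y_true, blend(prob_a, prob_b, w)))


# Made-up validation labels and predictions, for illustration only.
y_val = [1, 0, 1, 0, 1, 0]
gbm_prob = [0.9, 0.2, 0.6, 0.1, 0.4, 0.3]  # hypothetical gradient boosting output
nn_prob = [0.7, 0.4, 0.8, 0.3, 0.7, 0.1]   # hypothetical neural network output

weight = best_blend_weight(y_val, gbm_prob, nn_prob)
blended_loss = log_loss(y_val, blend(gbm_prob, nn_prob, weight))
print(f"weight on GBM: {weight}")
print(f"GBM loss: {log_loss(y_val, gbm_prob):.4f}")
print(f"NN loss: {log_loss(y_val, nn_prob):.4f}")
print(f"blended loss: {blended_loss:.4f}")
```

Because the candidate weights include 0 and 1, the blend can never score worse on the validation set than the better of the two models alone. Whether that gain carries over to unseen data is exactly what careful validation has to establish.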

AV: What are your go-to tools for analytics and data science tasks like visualization and statistical analysis, and how do they differ from the tools that you used as a beginner?

DG: I think there is no single correct way to do things and everyone develops their own approach. We explore and visualize data to answer the questions we have, and what matters is how quickly I can get to the answers. Therefore, I would suggest using tools you are comfortable with and know well enough to apply them fast. In the end, data science is often about trial and error, so it is crucial to learn to fail fast.

At university, I used low-level programming languages and MATLAB. So naturally, I started learning R for data science, but quite quickly decided to switch to Python. Nowadays the Python ecosystem has probably everything a data scientist might wish for. Core packages like numpy, pandas, scipy, and scikit-learn are sufficient to efficiently answer data-related questions, while PyTorch and lightgbm cover almost all the needs for powerful and flexible model fitting. I believe knowing these core blocks well will already allow you to build exceptional things.

End Notes

One of our favorite interviews so far! Dmitry’s analytical approach to answering questions is just out of this world. Make sure you capture the lessons here and hold on to them.

This interview is part of our Kaggle Grandmaster Series; the earlier interviews in the series are also worth reading.

What did you learn from this interview? Are there other data science leaders you would want us to interview? Let us know in the comments section below!

Analytics Vidhya

Analytics Vidhya Content team
