Kaggle Grandmaster Series – Exclusive Interview with Kaggle Notebooks Grandmaster Gabriel Preda (Rank #10)

Analytics Vidhya | Last Updated: 05 Feb, 2021
8 min read

“When, a few years ago, I started to study Data Science systematically, I could use all this previous experience”- Gabriel Preda

The above statement is a testament to the fact that data science is a multi-disciplinary field, and your past experience will only make it easier to interpret the qualitative aspects of data.

To explore this, joining us in the 18th edition of the Kaggle Grandmaster Series is Kaggle Grandmaster Gabriel Preda.

Gabriel is a Kaggle Notebooks Grandmaster and ranks 10th with 27 gold medals. He is also an Expert in the Kaggle Competitions and Kaggle Datasets categories, and he holds the Master title in Kaggle Discussions.

Gabriel has a Ph.D. in Computational Electromagnetics from the University POLITEHNICA of Bucharest. He is currently a Lead Data Scientist at Endava.

You can go through the previous Kaggle Grandmaster Series Interviews here.

 

In this interview, we cover a range of topics, including:

  • Gabriel’s Education and Work
  • Gabriel’s Kaggle Journey
  • Gabriel’s Advice to Beginners in Data Science
  • Gabriel’s Inspiration

So without any further ado, let’s begin.

 

Gabriel’s Education and Work


Analytics Vidhya (AV): You hold a Ph.D. in Computational Electromagnetics and have significant experience in various fields such as research, software development, project management, and Data Science. Speaking of Data Science specifically, how did you start your career in the field?

Gabriel Preda (GP): More than 20 years ago, I was using Neural Networks to solve ill-posed inverse problems in Nondestructive Testing and Evaluation. Our task was to reconstruct defect geometry in structural steel parts from electromagnetic signals picked up by a coil probe. Since the problem was severely ill-posed, exploring the solution space with classical methods based on conjugate gradients was error-prone (it was easy to fall into local minima), so a more robust approach, using a Neural Network pre-trained on simulated as well as measured signals, was the better solution.
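Purely as an illustration of that kind of setup, here is a minimal Python sketch of the idea. The forward model, the defect parametrization, and the network below are invented stand-ins for illustration, not the original research code:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)

def forward_model(depth, width, n_points=64):
    # Toy stand-in for the electromagnetic simulation: maps a defect's
    # geometry to the signal a coil probe would pick up above it.
    x = np.linspace(-1.0, 1.0, n_points)
    return depth * np.exp(-((x / width) ** 2))

# Training set built from simulated defects: geometry -> signal.
geometries = rng.uniform(low=[0.1, 0.1], high=[1.0, 0.5], size=(2000, 2))
signals = np.array([forward_model(d, w) for d, w in geometries])

# Train the network on the *inverse* mapping: signal -> geometry.
net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
net.fit(signals, geometries)

# A noisy "measured" signal is inverted in one forward pass of the network,
# with no iterative search over the solution space (and no local minima).
measured = forward_model(0.7, 0.3) + rng.normal(0.0, 0.01, size=64)
print("reconstructed (depth, width):", net.predict(measured.reshape(1, -1))[0])
```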

We were experimenting with several other approaches, including Genetic Algorithms and other evolutionary methods, and we used PCA and a data splitting/data fusion approach as a pre-processing step. Later, after I became a software engineer, I worked with various techniques for image processing, object detection, and pattern matching.

Later on, developing software solutions for medical imaging with my own company, I also learned more about image filters, image registration, image segmentation – all kinds of techniques used for image manipulation. I also built software for visualizing PDE results (computer graphics software). So I was using quite a lot of the tools and techniques that are now commonly used in DS & ML, and while developing solvers for large-scale PDE problems, I was also using common algorithmics that are now part of various models.

I was also always interested in data exploration. When, a few years ago, I started to study Data Science systematically, I could use all this previous experience. In the first few years, I was mainly studying: learning R and Python, how to explore data, and how to build features and models, while also publishing content online and competing in Kaggle competitions. Then I helped create a Data Science community in my company by giving presentations, delivering training, and facilitating technical communication about Data Science subjects across the company. After that, I started to work as a Data Scientist at my current company.

 

AV: You were mainly into research and teaching before becoming a full-time Software Developer at Integrisoft Solutions. How did you transition from academia to industry?

GP: I started the transition to industry four years earlier, working in Tokyo for Science Solutions International Laboratory, a company that developed a high-performance computing solver for PDEs (a field very similar to my academic research, just in industry). This was a product we mainly sold to the R&D departments of large Japanese manufacturing companies, who used it to validate designs made with other, slower or less precise CAD systems.

We also did a lot of consulting – in electromagnetics and image processing – for industries like aerospace, energy, transportation, construction of electrical machinery and equipment, and steel. Some of our activity consisted of taking inventions (which required very advanced skills to implement) and developing prototypes/MVPs to make them ready for VC investment rounds.

So I moved to industry gradually, at first doing work very similar to what I did in academia, converting from a researcher developing high-performance software into a software engineer developing very specialized scientific software as well as more “common” types of software, like the animation graphics software used in CAD systems for post-processing the results of PDE simulations.

 

Gabriel’s Kaggle Journey from Scratch


AV: How did you get to know about Kaggle and what was your first impression as a beginner?

GP: I got to Kaggle while learning Data Science. I was learning the R and Python languages, and learning about tools, techniques, and algorithms from blog posts, papers, and online articles (from R-bloggers, Data Science Central, KDnuggets, Analytics Vidhya, but also arXiv) and GitHub projects, absorbing virtually any information I found.

On Kaggle, I found an entire community looking for basically the same things I was looking for, and I found many of the resources that had been scattered all around the internet gathered in one place. In a way, it finally felt like home.

 

AV: You are a Kaggle Kernels Grandmaster, currently ranked 10th – this is really impressive. You must have faced a lot of challenges during this journey; can you recall a few of them, and how did you overcome them?

GP: In the early stages, I used Notebooks a lot to analyze datasets. I am happy I did not do what many people do with Kaggle: just download a dataset and start wrangling the data on their local computer. I started to spend a lot of time on the Kaggle platform, and I also started to compete.

At a certain point, I felt that my knowledge was no longer advancing at the same pace as before, so I felt the need to return to a more formal way of learning; in parallel with Kaggle, I spent quite a lot of time taking Data Science and Machine Learning courses on Coursera. I also started to study Kernels more carefully and to read discussion topics from high-ranked Kagglers.

This also helped me improve a lot, and my Kernels started to become more visible. Seeing my Kernels forked and used by others was my biggest reward; upvotes and medals then arrived naturally (and at one point, I even reached rank #3 in Kernels).

 

AV: Which is your favorite kernel to date, and what do you think makes it unique among your kernels?

GP: Well, the Kernel of mine that I like most is a funny one: Beer or Coffee in London – Tough Choice? No more! It uses two datasets (Starbucks locations and English pubs) to establish which Starbucks is closest to your local pub, so you can sober up with a coffee after some beers. It is an R Kernel that uses Voronoi polygons and polygon clipping to create maps with multiple superposed layers displaying the areas covered by coffee shops and pubs in London.

It is by far not my most popular one, though. My most popular Kernels are the few that explore in minute detail the data from some high-profile competitions.
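The underlying technique is easy to reproduce. Below is a minimal sketch of the Voronoi-plus-clipping idea, written in Python with shapely rather than R, and using made-up coordinates; the original kernel’s datasets and map layers are not reproduced here:

```python
from shapely.geometry import MultiPoint, box
from shapely.ops import voronoi_diagram

# Hypothetical coffee-shop locations (lon, lat) and a rough London bounding box.
coffee_shops = MultiPoint([(-0.12, 51.50), (-0.10, 51.52), (-0.14, 51.51)])
london = box(-0.20, 51.45, -0.05, 51.55)

# Voronoi diagram: each cell is the area closer to one shop than to any other.
cells = voronoi_diagram(coffee_shops, envelope=london)

# Clip the (possibly oversized) cells to the map boundary - the polygon-clipping
# step - so they can be drawn as one layer of a multi-layer map.
clipped = [cell.intersection(london) for cell in cells.geoms]

# A pub located inside a cell has that cell's coffee shop as its nearest one.
for cell in clipped:
    print(round(cell.area, 4), cell.geom_type)
```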

 

AV: Please tell us about your checklist for creating a notebook. What are the mandatory steps that one should follow and always keep in mind when creating any notebook?

GP: If it is an Exploratory Data Analysis (EDA) Notebook related to a competition, I try to clearly cover all the steps: ingesting the data, doing a preliminary exploration, profiling the data, checking for data quality issues, and then performing a complete data exploration, using the best choices for visualizing the features, to capture the most useful aspects of the data in preparation for a model.

I also try to find hidden patterns or anomalies in the data, which can provide an original or different angle from which to approach the problem. If I also include a baseline model, I will have sections on feature selection, feature engineering, and training the model, as well as inference on the test set and the submission.
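As a rough illustration of that checklist, the skeleton of such an EDA Notebook might look like this in Python (the file names and the target column below are placeholders, not from any specific competition):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Ingest the data (hypothetical competition files).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Preliminary exploration and data profiling.
print(train.shape, test.shape)
print(train.dtypes)
print(train.describe(include="all"))

# Data quality: missing values and duplicated rows.
print(train.isna().mean().sort_values(ascending=False).head(10))
print("duplicate rows:", train.duplicated().sum())

# Complete exploration: choose the visualization per feature type.
train["target"].value_counts().plot.bar(title="class balance")
train.select_dtypes("number").hist(bins=30, figsize=(12, 8))
plt.show()

# Hidden patterns / anomalies: e.g. check for train-test feature drift.
for col in train.select_dtypes("number").columns[:5]:
    if col in test.columns:
        print(col, "train mean:", train[col].mean(), "test mean:", test[col].mean())
```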

 

AV: How has your experience with the other aspects of Kaggle – Competitions, Datasets, and Discussions – contributed to your ascent to Grandmaster?

GP: I was not very active in Datasets or Discussions until recently. Now I am in the top 50 in Discussions (also a Master) and the top 20 in Datasets. Recently, I started to spend a significant amount of time collecting, curating, and publishing interesting data. Over the last two years, my work on Competitions has been closely tied to my work on Kernels: most of my high-profile Kernels are related to competitions, and I also invest around 50% of my Kaggle time in preparing private kernels for competing. My performance in Competitions is not comparable with my performance in the other three categories (Kernels, Datasets, Discussions), but I think I have started to make more progress recently. And, of course, I aim to become a Master in Competitions.

 

Gabriel’s Advice to Beginners in Data Science


AV: How has Kaggle helped you in your professional career so far? The idea behind this question is to help beginners understand what they can expect from hackathons and how that translates to the real world.

GP: Kaggle is the best place to accelerate your learning curve in Data Science and Machine Learning. Before Kaggle, I was learning at a “normal” speed. Kaggle, through competitions, made me learn a lot of useful techniques in a very short time so that I could advance on the leaderboard. With Notebooks, I also learned how to better communicate Data Science findings.

And reading the contributions of the best Kaggle GMs in the comments was like (not just like – it was the actual thing) being able to speak to the best of the best in Data Science. There are things you need while working as a Data Scientist that you will not find on Kaggle: mostly selecting, profiling, and curating the data (well, you do this for Datasets or when you prepare an In-Class competition), productizing models, and most of the data engineering. But important parts of algorithmics, data exploration, feature selection and feature engineering, the ML pipeline, and how to organize, structure, and perform experiments – all of these you can learn at lightning speed, and from the best people, on Kaggle.

 

AV: Based on your own experience, what would you suggest to people who have an educational background in non-CS STEM fields and want to learn Data Science?

GP: I assume you are talking about Humanities, Law, Economics, History, Linguistics – all domains that now increasingly make use of Data Science. Sociology and demography have traditionally used DS techniques – some of these techniques were actually developed within those fields.

My suggestion is to take an iterative approach: start small, experiment, and add more topics to your learning list only after you have experimented with the simplest algorithms (as simple as possible, but not simpler). Do not jump straight to AutoML solutions or Deep Learning. It is very important that you understand what your models are doing, so that you can correctly interpret your findings and be credible when presenting and explaining your results to others. Find the simplest approach available for your class of problem, test it, and if it works, build on it. Then, if you need more advanced tools, go to the next step. Only then.
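In code, that loop can be as small as the sketch below (a generic scikit-learn example on a built-in dataset, not tied to any particular problem): fit the simplest model you can defend first, and only escalate if it is clearly beaten.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 1: the simplest interpretable model you can defend.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())

# Step 2, only if the baseline is insufficient: a more complex model,
# evaluated with the same cross-validation so the comparison is fair.
boosted = GradientBoostingClassifier(random_state=0)
print("boosted accuracy :", cross_val_score(boosted, X, y, cv=5).mean())
```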

 

Gabriel’s Inspiration


AV: Can you name five Data Science experts whose work you always look forward to?

GP: I will name a few Data Science experts from Kaggle. I admire and follow the work of Gilberto Titericz (Giba) – especially his comments during and after competitions; Sudalai Rajkumar (SRK) – comments and Notebooks; Chris Deotte – with so many valuable contributions; Bojan Tunguz – a very friendly and helpful Kaggler; and Abhishek Thakur – very active in the community outside Kaggle as well.

 

End Notes

Having a multi-disciplinary background is a plus in data science. We hope that coming from a different background does not stop your journey.

This is the 18th interview in the Kaggle Grandmaster Series. You can read the previous interviews here.

What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!

Analytics Vidhya Content team
