“I think one of the nice things about the data science field is that it is so multi-disciplinary and that anyone who aspires to become a data scientist can do so.” – Gilles Vandewiele
Golden words!
As a beginner in data science, this quote gives me a lot of hope, since I, like many other data science aspirants, don’t come from a scientific or technical background. And for people like us, having someone’s journey to look up to and learn from is really important.
To ease the process, we are excited to bring you an exclusive interview with Gilles Vandewiele. He is a 2x Kaggle Master, holding the title in both the Competitions and Discussions categories.
He has already won three competition gold medals this year. He actively participates in Kaggle discussions, where he helps others based on his experience and learnings. He’s the perfect community presence to learn from!
Gilles is also a Ph.D. student in Machine Learning at the Internet and Data Science Lab (IDLab) research group in the Department of Information Technology (INTEC) at Ghent University. There he conducts research on white-box machine learning for critical domains and on (semantic) knowledge models.
This is a highly insightful interview for beginners in Data Science. So take it all in and enjoy your journey!
Gilles Vandewiele(GV): My transition was rather smooth as I started a Ph.D. in Machine Learning at IDLab (Ghent University) directly after finishing my master’s degree in CS engineering there.
I think one of the nice things about the DS field is that it is so multi-disciplinary and that anyone who aspires to become a data scientist can do so. Of course, some degrees, such as CS and mathematics, do make this transition easier, but such a degree is definitely not a requirement to become a data scientist.
GV: We typically make a distinction between white-box and black-box ML models. White-box models are techniques that are inherently interpretable, such as decision trees, linear regression, and Bayesian networks. On the other hand, we have black-box models that are very difficult to explain, such as neural networks. While there are techniques out there, such as SHAP, that can highlight why a model makes a specific prediction, they are only able to give local, instance-based explanations, and it remains impossible to fully grasp the internals of the model.
In critical domains, where decisions have significant consequences (e.g. law, health, and finance), it is of key importance that ML techniques support the expert in making decisions instead of making the decisions for them. This importance is being increasingly recognized, as we are seeing a surge in the domain of eXplainable AI (xAI).
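To make the white-box versus black-box distinction above a bit more concrete, here is a minimal sketch (our own illustration, not code from Gilles’s research). It prints the rules of an inherently interpretable decision tree and then uses SHAP to produce a local, instance-level explanation of a black-box gradient boosting model; the dataset and model choices are purely illustrative assumptions.

```python
# Illustrative sketch only (not from the interview): a white-box model whose learned
# rules can be read directly, versus a black-box model explained locally with SHAP.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import GradientBoostingClassifier
import shap

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# White box: the fitted tree is globally interpretable; its decision rules print as-is.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Black box: SHAP attributes one prediction to input features (a local, instance-based
# explanation), but the model's full internal logic stays opaque.
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(gbm)
print(explainer.shap_values(X.iloc[:1]))  # per-feature contributions for a single instance
```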
GV: I got to know Kaggle in my final master year, 5 years ago, as part of a project for a Machine Learning course in which we had to recognize traffic signs. I am a very competitive person and remember spending a lot of time on that project because I wanted to end up high on the leaderboard. While the result was not that great (we only finished 20th out of 31 teams), I did learn a lot.
I then Kaggled on and off over the next two years, mostly joining playground competitions to hone my skills. It is only around 2019 that I started Kaggling on a frequent basis, continuously participating in competitions, one at a time. I achieved Kaggle Expert status roughly 10 months ago and Kaggle Master status 5 months ago. 2020 was a good year for me, as I was able to win 3 Competition Gold Medals.
GV: Projects on Kaggle and in the real world definitely look different at first sight, but on closer inspection they have more similarities than one would think. In real-world projects, a lot of time and work needs to be invested in the earlier and later steps of a typical data science pipeline (such as data collection, data cleaning, model visualization, …). While a data scientist should have some experience in each of the steps of such a pipeline, we cannot expect everyone to be an expert in all of them. Therefore, I think Kaggle is the ideal place to hone your skills in the modeling and analysis part of the pipeline, even more so than most real-world projects. The main reason Kaggle is a better learning environment than the real world is that your boundaries are pushed further by other competitors: you want to end up high in the competition and thus create a solution that is better than the others (of which there are often thousands). In the real world, you create a solution that fulfills the client’s needs and then you are done.
GV: Actively reading and participating in discussions helped me to better understand many different subjects: you learn new things by reading other people’s posts, and you come to understand the things you already know more deeply once you have to explain them to others.
GV: It is definitely not easy to create a good write-up, and it is something I can still improve on myself. Nevertheless, explaining your solutions to people with all kinds of backgrounds is a very important skill for a data scientist. I typically start out with a schematic drawing of my solution, which helps to structure my post and also gives me an overview of the components that need to be discussed. I then pay more attention to the components that I struggled to understand myself and try to explain them in a way that would have helped me understand the subject. It can also help to mentally go back in time to before the competition (when you did not know anything about the data and the problem) and see whether you would have been able to understand the post at that time. You could also ask a friend to check your post and see whether they can understand it.
GV: I have never focused solely on the discussion aspect. All of my discussion posts are made in the context of competitions in which I participated myself.
But I do spend a reasonable amount of my time on discussions, as I learn a lot from them. I think some of the most valuable learning experiences on Kaggle happen in a team, as you learn from others. Similarly, discussing ideas on the forum helps you understand the problem and the data at hand.
GV: I wish I could say that I have a nice, structured approach and workflow for all of my competitions, but I am a very chaotic person. I make a lot of copies of the same notebook with small changes, and my competition directory quickly becomes a huge mess. If there is one piece of advice I can give, it is that iterating very fast during these competitions is of key importance. You need to set up a pipeline quickly and make some simplifications to it that increase its efficiency without sacrificing too much of its performance. After that, a lot of different ideas need to be tried in a trial-and-error fashion. While implementing these ideas, it is important to keep an overview of which ideas did and did not work. In the end, all of the working ideas can then be integrated into the pipeline.
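As an illustration of that “iterate fast and keep an overview of what worked” workflow, here is a minimal sketch (a hypothetical setup of our own, not Gilles’s actual pipeline): each idea gets a quick cross-validated score and is logged in a small experiment table, so the working ideas are easy to spot and integrate later.

```python
# Hypothetical fast-iteration loop: score each idea quickly with cross-validation
# and keep a running overview of which ideas did and did not work.
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
experiments = []  # running log of tried ideas and their scores

def try_idea(name, model):
    """Evaluate one idea with 5-fold CV and record the mean score."""
    score = cross_val_score(model, X, y, cv=5).mean()
    experiments.append({"idea": name, "cv_score": round(score, 4)})

try_idea("baseline_rf", RandomForestClassifier(n_estimators=200, random_state=0))
try_idea("shallow_rf", RandomForestClassifier(max_depth=3, random_state=0))

# Overview of what did and did not work, best idea first.
print(pd.DataFrame(experiments).sort_values("cv_score", ascending=False))
```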
GV: This will perhaps sound cliché, but my main piece of advice would be to “not hold back”. When you start Kaggling, you should not care about your results but rather about how much you learn. I sometimes hear from others that they do not want to participate on Kaggle because they are afraid of ending up low on the leaderboard. I think that’s a big mistake.
One other piece of advice I would like to give is to “never take shortcuts” or “game” the system. We sometimes see malpractice in the notebook, discussion, and dataset categories, where people spam others on LinkedIn for upvotes or plagiarize other people’s work. This will never pay off in the long term.
GV: In order to build your own profile, personal branding is important. Definitely share your achievements from hackathons on social media. Blog posts (e.g. write-ups of your solutions) also help you reach people who do not participate in hackathons. Finally, a personal website or live CV is a good thing as well; I would suggest making one as early as possible so that you can extend it over time.
I thoroughly enjoyed interacting with Gilles Vandewiele through this interview. He has a clear structure to his thoughts, and his enthusiasm for sharing his experience is something all beginners will benefit from.
This is the third interview in our series of Kaggle interviews. You can read the first two interviews here:
What did you learn from this interview? Are there other data science leaders you would want us to interview? Let me know in the comments section below!