“Bad examples can often be just as educational as good ones” - Martin Henze
In data science, every mistake, bad experience, and example is unique to its dataset and contains a lesson. Don’t agree with us?
The Kaggle Grandmaster Series is back with its 5th edition to challenge that disagreement.
To talk more about learning through bad examples, we are thrilled to bring you this interview with Martin Henze, known on Kaggle and beyond as ‘Heads or Tails’.
Martin is the first Kaggle Notebooks Grandmaster, with 20 Gold Medals to his name, and currently ranks 12th. His granular, detailed documentation is widely lauded within the community. He is also a Discussions Master with 45 Gold Medals.
He holds a Ph.D. in Astrophysics from the Technical University of Munich and currently works as a Data Scientist at Edison Software. He brings all of this experience from diverse fields to this Kaggle Grandmaster Series interview.
So without any further ado, let’s begin!
Martin Henze (MH): From the very beginning, my work in astrophysics was data-focused. The vast majority of my research during my academic career was based on observational data obtained via various ground- and space-based observatories. Through my desire to analyze this data, and to understand the physics of the astronomical objects in question, I was motivated to learn programming basics, Python, R, statistics, and eventually some basic machine learning methods like logistic regression or decision trees.
I always derived a lot of insights from data visualizations. I am a very visual person. The DataViz capabilities of the R language, together with its rich statistical toolset, were my gateway to frameworks beyond simple Bash scripts or astronomy-specific tools (of which there are quite a number). Practically at the same time, I picked up Python to replace Bash as a “glue language” and due to its larger collection of astrophysical libraries.
My first exposure to the wider world of Data Science was through the Kaggle community. In 2017, I joined Kaggle with the goal to learn more about state-of-the-art Machine Learning and Data Science techniques. The intention was to see which of the tools could be useful for my astrophysical projects. However, very quickly I became interested in the wide variety of challenges that Kaggle provided, which in turn opened my eyes to the myriad ways in which I could apply my data skills to real-world problems. I was intrigued. And the more I learned, the more I realized that it was time for a change.
MH: Kaggle was really instrumental in learning Data Science and Machine Learning techniques. The community is truly remarkable in the way that it unites expertise with a welcoming atmosphere. Working on a specific problem for a few months with like-minded people is a fantastic way to experience how others are approaching the project and to learn from them. At that time, Kaggle Notebooks (aka Kernels) were starting to become popular, and I learned a lot from other people’s code and their write-ups.
In parallel, I read up on the different techniques that were new to me, like boosted trees, to understand the underlying principles. I don’t recall that there was a single, main source of knowledge; although I still think that the scikit-learn documentation is a pretty thorough (and underrated) way to get started. My maths background, from my physics degree, might have helped; but I don’t think it’s a strong requirement.
MH: I think that astrophysics provides a lot of potential for the application of state-of-the-art ML techniques. Astronomers have always had a lot of data, starting 100 years ago with the first large telescopes and targeted data collection using photographic plates. Currently, we are in a golden age of astronomical surveys, where large areas of the sky are being monitored regularly by professional astronomers and citizen scientists alike. The resulting data sets are rich, diverse, and very large. Astrophysics is gradually adopting Deep Learning tools, and I’m certain that there are many future synergies between the two fields.
MH: A Kernels Grandmaster title is awarded for 15 gold notebooks, which I achieved with my first 15 notebooks within about a year of joining Kaggle. My notebooks usually focus on extensive exploratory data analysis (EDA) for competition data. I’m always aiming to provide a comprehensive overview of all the relevant aspects of the data as quickly as possible, to give other competitors a head start in the competition. I don’t think there’s much of a secret to it – my goal is to be thorough and explain my insights.
MH: In my view, the most important property of high-level public notebooks is having detailed and well-narrated documentation. Regardless of the notebook topic, you need to be able to explain your work and insights to the reader; ideally in a clear and engaging style. The level of detail in the documentation depends on the topic of the notebook and the knowledge of your audience. Remember that one major purpose of a notebook is to communicate your thinking and approach.
For EDA notebooks, I recommend starting with the fundamental building blocks of the dataset and working gradually towards more complex features and interactions. The challenge here is to work methodically and not get sidetracked by new ideas. Those new ideas will inevitably occur to you when digging deeper into any reasonably interesting dataset. EDA is always about answering certain questions that you have about the dataset, which is why the specifics of the EDA depend on those questions and on the data itself.
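To make that progression concrete, here is a minimal pandas sketch of such an EDA skeleton, assuming a hypothetical train.csv; the file name and the category_feature/target column names are placeholders, not from any particular competition.

```python
import pandas as pd

# Hypothetical competition file and column names -- placeholders only.
df = pd.read_csv("train.csv")

# Step 1: fundamental building blocks -- shape, types, missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))

# Step 2: univariate views -- one feature at a time.
print(df.describe())  # numeric summaries
for col in df.select_dtypes("object").columns:
    print(df[col].value_counts(normalize=True).head())

# Step 3: only then move on to interactions, e.g. grouped summaries
# of a (placeholder) target by a (placeholder) categorical feature.
print(df.groupby("category_feature")["target"].mean())
```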
To make sure that a modeling notebook is not only performing strongly but is also accessible to a reader, it is vital to structure and document your code well. Beyond software engineering best practices, this means explaining your thinking about why you chose specific pre-processing, model architecture, or post-processing steps.
In any case, remember that clear communication is important – not just for other people to understand your work but also for yourself to recall why you were doing what you were doing when looking at the notebook again a few months later. This also addresses the very core of the notebook’s format: reproducibility.
MH: It differs in the sense that different types of data call for a DL approach (e.g. images, text) instead of more traditional ML techniques (e.g. tabular data, time series). Tabular data is often the easiest to explore because its features are reasonably well defined and can be studied in isolation as well as in their interactions. The same goes for time series data, where we have an established set of visual techniques that deal with e.g. decomposition or autocorrelations.
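As a small illustration of those two standard time series views, here is a minimal sketch using statsmodels on a synthetic monthly series; the trend and seasonality are invented purely for demonstration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf

# Synthetic monthly series: linear trend + yearly seasonality + noise.
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
y = pd.Series(
    np.linspace(10, 30, 72)
    + 5 * np.sin(2 * np.pi * np.arange(72) / 12)
    + np.random.default_rng(0).normal(0, 1, 72),
    index=idx,
)

# Decomposition: trend, seasonal, and residual components in one figure.
seasonal_decompose(y, model="additive", period=12).plot()

# Autocorrelation: how strongly the series correlates with its own lags.
plot_acf(y, lags=24)
plt.show()
```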
In the DL realm, text data is probably closest to the tabular paradigm: basic NLP features like word frequencies or sentiment scores can be extracted and visualized much like categorical tabular columns. Image data are more complex in terms of their feature space, but I strongly recommend looking at samples of your images before starting the modeling. While this might give you data augmentation ideas, it primarily serves to unveil sources of bias (e.g. an image classifier learning about the background of the image instead of the intended foreground objects).
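A minimal sketch of that kind of sanity check could look like the following; the directory layout (one subfolder per class under train_images/) and the file pattern are assumptions, not part of any specific dataset.

```python
import matplotlib.pyplot as plt
from pathlib import Path

# Hypothetical directory of training images grouped by class folder.
image_dir = Path("train_images")
paths = sorted(image_dir.glob("*/*.jpg"))[:16]   # first 16 samples

# Plot a 4x4 grid; the class label is taken from the parent folder name.
fig, axes = plt.subplots(4, 4, figsize=(10, 10))
for ax, path in zip(axes.ravel(), paths):
    ax.imshow(plt.imread(path))
    ax.set_title(path.parent.name, fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```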
MH: Let’s discuss two different, common scenarios. The first is a binary classification problem with very imbalanced target classes, as is commonly found in fraud detection or similar contexts. Basic visualizations will instantly reveal this imbalance. If instead you jump straight into a basic model and choose accuracy as your metric, you will likely end up with, say, a 95% accurate model which simply predicts the majority class in every case. This is certainly not what you’d want.
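To illustrate that failure mode, here is a small sketch using scikit-learn’s DummyClassifier on synthetic data: the majority-class “model” reaches roughly 95% accuracy while being useless by any class-aware metric.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Synthetic imbalanced target: roughly 95% negatives, 5% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.05).astype(int)

# A "model" that always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)

print(f"accuracy:            {accuracy_score(y, pred):.3f}")            # ~0.95, looks great
print(f"balanced accuracy:   {balanced_accuracy_score(y, pred):.3f}")   # 0.5, i.e. no skill
print(f"F1 (positive class): {f1_score(y, pred, zero_division=0):.3f}") # 0.0
```

A quick value_counts() or bar plot of the target would have flagged the imbalance long before any modeling started.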
The second scenario assumes that you have been given separate train and test samples (which mirrors the setup of most Kaggle competitions). While you don’t want to touch the test set for building or tuning your model, it is important to make sure that your training data is indeed representative of this test set. In a business context, this translates to confirming that you build your model on data similar to what it will encounter in production. Otherwise, there is a real danger of encoding a significant bias into your final model, which will then not generalize well to future data. Visual comparisons of the train vs test features will reveal such biases.
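One way to do such a visual comparison, sketched below with placeholder file and column names, is to overlay density-normalised histograms of the same feature in train and test.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical train/test files and a numeric column name; all are placeholders.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
col = "feature_1"

# Overlaid histograms (density-normalised so different sample sizes don't matter).
plt.hist(train[col].dropna(), bins=50, alpha=0.5, density=True, label="train")
plt.hist(test[col].dropna(), bins=50, alpha=0.5, density=True, label="test")
plt.xlabel(col)
plt.ylabel("density")
plt.legend()
plt.title(f"Train vs test distribution of {col}")
plt.show()
```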
MH: For most projects, I’m getting a lot of mileage out of bar plots, scatter plots, and line charts. Those are the Swiss Army knives in your DataViz tool belt that are most important to know and to understand. They come with a few rules – e.g. bar plots should always start from zero on the frequency axis – but are generally intuitive: bars measure counts or percentages for categorical variables, scatter points show how two continuous features relate to one another, and lines are great to see changes over time.
For visualizing multiple feature interactions, I recommend multi-facet plots (especially for categoricals with relatively few levels) and heatmaps. Heatmaps can produce very insightful visuals to uncover patterns hidden in feature interactions. In this context, correlation plots and confusion matrices can be considered a type of heatmap. For specific categories of data, you’d want to be familiar with the appropriate plots. For instance, geospatial data often looks best on maps.
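The sketch below pulls these workhorse plots together on a small synthetic frame, using matplotlib/pandas rather than ggplot2 purely for illustration: a bar plot of counts, a scatter plot of two continuous features, a line chart over time, and a simple correlation heatmap.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small synthetic frame standing in for real competition data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category": rng.choice(list("ABCD"), size=200),
    "x": rng.normal(size=200),
    "z": rng.normal(size=200),
})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)
df["t"] = pd.date_range("2020-01-01", periods=200, freq="D")

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Bar plot of category counts -- the frequency axis starts at zero by default.
df["category"].value_counts().plot.bar(ax=axes[0, 0], title="counts per category")

# Scatter plot: how two continuous features relate to one another.
df.plot.scatter(x="x", y="y", ax=axes[0, 1], title="x vs y")

# Line chart: change over time (7-day rolling mean to smooth the noise).
df.set_index("t")["y"].rolling(7).mean().plot(ax=axes[1, 0], title="y over time")

# Heatmap of pairwise correlations between the numeric columns.
corr = df[["x", "y", "z"]].corr()
im = axes[1, 1].imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr)))
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks(range(len(corr)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("correlation heatmap")
fig.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.show()
```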
Bad examples can often be just as educational as good ones, so here is a recommendation of what *not* to do: pie charts have a well-deserved reputation for being bad because slight differences between pie slices are very hard for human brains to interpret. Bar plots are always better in this situation. There is a very limited set of cases where pie charts can be useful: e.g. only 3 or 4 different slices, or a focus on one specific slice and its growth. Personally, I would always avoid pie charts.
More generally, less is more when it comes to DataViz. Every visual dimension (x, y, z, color, size, facet, time) should correspond to one and only one feature. Bells and whistles like interactivity or animation can sometimes help but are often a distraction. Always remember that the purpose of a good visualization is to communicate one (or a small set of) insights in a clear and accessible way.
MH: I’m a huge fan of R’s ggplot2 and related libraries. In my view, ggplot2 is the gold standard for DataViz tools. This is mainly due to the way in which it implements the grammar of graphics as an intuitive set of building blocks. In ggplot2, the frequent iterations in the plot building process are quick and seamless. The reusability of visuals is high, which means that your past work can serve as an adaptable starting point for new projects.
In general, I advocate for the use of tools that use code to build visuals – as opposed to drag-and-drop tools like Tableau. The main reason is reproducibility: adapting your existing ggplot2 code to new or related data is just as simple as interpreting and explaining your insights based on the visualization choices you made. I’m convinced that any time investment you make to learn a tool like ggplot2 will pay off tenfold in terms of productivity in the future.
MH: The challenge here is to restrict myself to only five people. There are so many smart and generous people out there who share their knowledge with the community, and I have been fortunate to learn a great deal from most of them. So, I’m going to cheat a bit and give you the names of five experts on Kaggle, and five beyond it.
On Kaggle, one of my first inspirations was Sudalai Rajkumar, or SRK as he is affectionately known. He has a gift for accessible and powerful code. One of the pillars of the Kaggle community is the inimitable Bojan Tunguz, who continues to share so much valuable advice. Gilberto Titericz, also known as Giba, is a true ML expert with a deep understanding of how to (quickly) build high-performance models. Rohan Rao, known on Kaggle as Vopani, is an inspiration and a role model for so many of us – not just as a data scientist but also as a human being. One of Kaggle’s recent rising stars is Chris Deotte, who always shares creative and thorough insights into any new challenge.
An important expert bridging the worlds of Kaggle and beyond is Abhishek Thakur, whose YouTube channel and hands-on NLP tutorials teach ML best practices to a new generation. Another great teacher is the fastai founder Jeremy Howard – everything he touches seems to turn to gold. When it comes to making DL architectures accessible, it’s hard to overestimate the visuals of Jay Alammar. Hadley Wickham is the mastermind behind the R tidyverse – building the tools that allow us to do data science. In a similar way, I admire the thoughtful and user-focused philosophy of the Keras creator François Chollet.
This interview was an eye-opener, highlighting the importance of Notebooks in the Kaggle community. Combined with Martin’s wealth of insights on EDA and visualization, there is a lot we can learn from it.
This is the fifth interview in the Kaggle Grandmaster Series. You can read some of the past interviews here:
What did you learn from this interview? Are there other data science leaders you would want us to interview? Let us know in the comments section below!