“Start with the ‘knowledge’ type of hackathons. There you do not compete for money (or other rewards). You can receive more help, and there is no stress if you do not do very well.” – Marios Michailidis
When money rather than learning becomes the aspiration, that is a red flag. We have seen many people not practice enough on Kaggle because they were not able to win the cash prize, and eventually drop data science as a career option.
So to motivate you to break this habit, we are pleased to be joined by Marios Michailidis in this edition of the Kaggle Grandmaster Series.
Marios is a 2x Kaggle Grandmaster, holding titles in the Competitions and Discussions categories, where he ranks 5th and 73rd with 39 and 69 gold medals to his name respectively. He is also an Expert in the Kaggle Notebooks category.
Marios has a Ph.D. in Financial Computing from University College London. He is currently working as a Competitive Data Scientist at H2O.ai.
You can go through the previous Kaggle Grandmaster Series Interviews here.
So let’s begin without any further ado.
Marios Michailidis (MM): That’s right. Not only books but many of the things that I have learned also came straight from the free internet, from websites like Wikipedia, StackOverflow, the usual suspects. Back then, the data science field was not as refined as it is now – even the term “data science” did not exist. 10 years ago, there was no specific module or university degree that could make you a data scientist.
Having said that, I think what I did back then and the way I learned data science may not be optimal given the choices you have today. Nowadays, there are nice courses at universities; for example, both my previous universities, UCL and Southampton, have good MScs in Data Science. There are also many good ones (for multiple seniorities or specializations) on online platforms like Coursera. I have looked at the curriculums of many of these online courses and they look pretty good.
If you follow the reviews, I think you cannot go wrong. Then you have scientific blogs dedicated to data science and organizations like yourselves and Kaggle that provide multiple means for people to learn the craft. The reason I mention these is that the path to becoming a data scientist is now a bit clearer, and my answer on how I learned it is potentially outdated if someone intends to follow it.
In any case, I started by learning programming. I began with C++ (I don’t remember the title of the book), but I do recall that when I reached the chapter explaining pointers, I totally lost it and thought that programming was not for me. I gave it one more chance with Java, which was very hot back then and easier to learn! The book I used was called “Head First Java”. It took me something like 3 weeks just to create a JTable and populate it with data from a CSV file, but after that, the learning increased exponentially. I learned computer-aided statistics, hypothesis testing, and basic regression from “Discovering Statistics Using SPSS” by Andy Field.
I dived deeper into machine learning concepts by reading the book that came along with the Weka software, which I used a lot as a reference both for learning the concepts and for learning how to code machine learning models. I read countless other books, articles, blogs, etc. in that period, but these three stand out the most, and my recommendation for today’s data-scientist-to-be is to try and acquire knowledge from the same three pillars, which in my opinion are:
MM: It took me 2-3 months to start feeling more comfortable with it and about 6 months to start creating some basic machine learning applications. After understanding the basics, I tried to implement multiple machine learning techniques and make them faster than the software that I was using. I could implement multiple machine learning techniques (like logistic regression, decision trees, simple neural networks, etc.) from scratch after one year of constant trial and error. I was putting many hours into it – maybe 6-8 a day on top of my job, as there was no significant overlap with it. Soon after, I realized that there are other libraries (e.g. sklearn, H2O) that can do it faster, give better results, and are easier to use, and I gave up! However, that learning was, and still is, the foundation I rely upon to further develop my skills.
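As a minimal, hedged sketch of the kind of library call that replaced those from-scratch implementations: the example below trains a logistic regression with scikit-learn. The synthetic data, seed, and parameters are illustrative assumptions, not details from the interview.

```python
# Minimal sketch: a logistic regression via scikit-learn instead of a from-scratch implementation.
# The synthetic dataset below is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(1000, 10))           # 1,000 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # simple synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```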
I think with both programming and data science, you can never really be complacent. Especially the latter is changing rapidly. For all we know, there might be a different programming language tomorrow that is ideal for performing data science tasks, or a new library/technique may come out that totally changes the dynamics of what is considered state of the art today. As a data scientist, one of the most important skills you must have is the ability to learn and adapt to what is new.
MM: In principle, the main difference is that it is automated! As far as the actual pipeline goes, there are different levels of automation that apply to various aspects of the machine learning/data science process, and automated machine learning toolkits need to account for these from the start of the experiment through to production. Within the organization I work for (H2O.ai), we have developed various tools that fall into this space and automate the following aspects:
MM: I do not think it affects the role of existing data scientists as much as people may think. The main reason automated tools have become more popular is that the supply of data scientists is not enough to meet current needs. These automated tools help make data scientists more productive. Data scientists can still run things programmatically; what changes is that they can handle more experiments and cover more ground in less time. Many of the mundane, repetitive tasks (like rerunning a deep learning model with a higher learning rate to see if results are better) are handled automatically, and reporting, documentation, presenting the insights, and model explainability can also be handled by the tools. The tools can also prevent errors that may arise out of negligence (like leakage) and errors in the data. The data scientist can focus more on other things that are more likely to yield uplift, like:
Just to name a few.
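Since data scientists can still drive these tools programmatically, here is a hedged sketch using H2O’s open-source H2OAutoML as one example. The file name "train.csv" and the column name "target" are placeholder assumptions, not details from the interview.

```python
# Sketch: driving an AutoML toolkit programmatically with H2O's open-source H2OAutoML.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")          # hypothetical training file
train["target"] = train["target"].asfactor()  # treat the target as categorical (classification)

# Limit the search by model count and runtime so the experiment fits a time budget.
aml = H2OAutoML(max_models=20, max_runtime_secs=600, seed=1)
aml.train(y="target", training_frame=train)

print(aml.leaderboard.head())                 # candidate models ranked by cross-validated metric
```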
MM: It makes life easier for all these roles. For the data analyst, it becomes easier to run experiments using a GUI than to code everything from scratch. Managers can appreciate the details and insights presented in the documentation and reports produced by the tools (and, commonly, the production-ready code that comes with them), and see more work being done in less time by their team.
In my opinion, these do not change the fact that a more experienced data scientist or data practitioner will be able to get more done and be more efficient when using these tools than somebody who just entered the field. The role of the data scientist gets strengthened with these tools, not the other way around.
MM: We have been automating things for years now and demand for programmers has only been increasing (and is expected to increase more). This is because the field is changing so quickly and the state of the art, as well as the expectations, are different every year. Automation helps us achieve higher highs; however, there is still an extra mile we need to go to reach the top.
In the meantime, it seems like the ceiling will keep going up. I do not see the demand ceasing. For all we know, there could be a different programming language next year that is the best one to do machine learning.
MM: They could sign up to H2O.ai’s learning center (more info here). These courses are specifically designed to teach AutoML, and there are variants for all levels (beginners to pros). I have been involved with these tutorials and I can recommend them confidently. There are other sources out there, courses and books too. Learning some form of ML can greatly help too (before diving specifically into AutoML).
MM:
MM: When I had my best years on Kaggle (never thought I would say that!), the title of Kaggle Grandmaster did not exist, so I did not specifically try to attain it the way recent members may need to. What I did try was to achieve the #1 spot (in Competitions), and that was tough. I had to put a lot of hours into it on top of my day job (like 60+ per week) and I ended up exhausted by the end of it, but I am glad that I was able to do it. Another challenge was maintaining a top-10 position for 6 straight years or so, because data science back then was different from what it is today.
The biggest challenge is to keep learning and motivating yourself. Maybe it is not so much of a challenge if you like it, but there have been cases where I had to dive into areas I was not very familiar with and try to cover the gaps as quickly as I could. As data science becomes more refined, different areas have developed (like computer vision, reinforcement learning, NLP, etc.) that require a lot of expertise.
There came a point where I could not be as good in all of them as I would have liked, but I never became complacent. I still see myself as a student on the data science journey, and I feel you need this kind of mentality if you want to be successful on Kaggle or in your working environment. Another challenge is getting the right technology/hardware.
I feel I had a good set-up for the pre-deep-learning era (multiple 256 GB RAM machines with 40 cores), but it quickly became outdated. Kaggle does help by providing GPU/TPU resources through kernels. Colab may be another option (especially if you are in the US). Making optimal use of the resources you own and those freely available is important to do well in competitions.
In general, becoming a Grandmaster is a nice goal to have, primarily because of the journey it takes to get there: the things you will learn along the way, the people you will meet, and the challenges you will face. So do not obsess over obtaining the title; simply being on that track pays dividends for your development as a data scientist.
MM: You need some automation, and you need to manage your time right. Managing expectations is also important (to maintain your sanity). I do not join every competition with the goal of winning; that is almost never my goal. Not anymore. I mostly join to learn and have fun. In that sense, I do not try to “complete a competition before the deadline” but rather to do as well as I reasonably can given the time left and the time I am expected to invest.
Ideally, you want to prepare an iterative process that can, for example, run overnight so that you get the results the next day. Managing the time your machine will be running things for you is essential to cover as much ground as possible within the time constraints of a hackathon. I do most of the work between 7 pm and midnight.
Things run overnight. I submit them in the morning or evening, depending on when they finish. I look at the results and strategize what to do next until I have time to code it, and the same loop happens again.
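A minimal, hedged sketch of such an overnight loop is below; the dataset, model, hyperparameter grid, and output file name are illustrative assumptions, not Marios’s actual pipeline. The idea is simply to queue several configurations, cross-validate each, and log scores to disk for review the next morning.

```python
# Sketch: queue several configurations overnight, cross-validate each, log results to a CSV.
import csv
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)  # placeholder data

configs = [
    {"n_estimators": 200, "max_depth": 6},
    {"n_estimators": 500, "max_depth": 8},
    {"n_estimators": 1000, "max_depth": None},
]

with open("overnight_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["n_estimators", "max_depth", "cv_auc_mean"])
    for cfg in configs:
        model = RandomForestClassifier(random_state=0, n_jobs=-1, **cfg)
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        writer.writerow([cfg["n_estimators"], cfg["max_depth"], np.mean(scores)])
```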
MM: These models have been used and studied so much that I do not think there are any hidden gems here! A few things to say about them are:
MM: Just to clarify, I was a data scientist before I started competing in hackathons. However, I have mentored people and have generally seen people who landed their first job without any experience other than hackathons. My advice is:
I have found Kaggle to be a very good place to keep up with new developments. Most of the time, the competitors or the researchers themselves will choose this platform to publish some of their work, so you can try it right away, as it usually comes with code. For instance, XGBoost became known because of Kaggle.
I must admit, I am a bit tired of papers that come out claiming to have beaten all the benchmarks with a new technique, but then you try it on a new dataset and it underperforms. I prefer to miss out on a few months of something potentially good and wait to see it tested on the platform before investing my own time.
Other than that, following some of the top conferences in the field is probably the best source for keeping up with new things. For instance, I like KDD, the Deep Learning Summit (London), RecSys, Big Data London, and Strata, to name a few. As for implementing things on my own, I now do less of that. I prefer to pick from the available choices out there and improve/adjust them if needed.
I follow the usual suspects: G. Hinton, Y. LeCun, Andrew Ng, F. Chollet. I also follow the work of many of the top Kagglers, many of whom happen to be my colleagues at H2O.ai. You can follow them on Twitter or on other social media.
Jeremy Howard is also a data scientist I really like to follow. He always posts good material and, if you happen to listen to any of his lectures online, you will notice he has a gift for explaining things.
Also, advancements in Machine learning interpretability are very interesting to me.
Well, this is one of the longest interviews we have had. It is certainly a goldmine for people trying to chart their data science journey.
This is the 15th interview in the Kaggle Grandmaster Series. You can read the previous few at the following links:
What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!