In the 20th edition of the Kaggle Grandmaster Series, we are honored to be joined by Quadruple Kaggle Grandmaster Rohan Rao.
Rohan ranks 100th in Kaggle Competitions, 6th in Datasets, 12th in Notebooks, and 12th in the Kaggle Discussion category, with 8, 8, 15, and 56 gold medals to his name, respectively.
Rohan currently works as a Data Scientist at H2O.ai. He has a Master’s degree in Applied Statistics from IIT Bombay. He is also a 17-time National Sudoku/Puzzle Champion.
You can go through the previous Kaggle Grandmaster Series Interviews here.
So, without any further ado, let’s begin.
Rohan Rao(RR): Mathematics & Statistics form the roots of many data science workflows and machine learning algorithms. Having a solid understanding of the fundamentals helped me learn and grow faster, professionally as well as competitively.
The Master’s degree and experience gave me more exposure to the field and also opened up a lot of opportunities in different industries.
RR: I work in between our products and customers, constantly engaging and helping users of our platform while building and enhancing our products in parallel. I enjoy this kind of role because it gives me an opportunity to contribute to different aspects of the business.
It’s important to understand the exact role of a Data Scientist, because it can vary widely across companies, and then to prepare and apply for the roles that best suit your skillset and interests.
RR: Thank You! It has been one of the most exhilarating experiences of my life and I’m glad I could accomplish this feat.
Competitions is the only category that I find has a meaningful ranking and points system, while the other three are more for learning and networking. It’s also the primary reason why I enjoy spending time on Kaggle. What helped me most here was working on as many different types of competitions as time permitted in my early days; those experiences helped me achieve successful results in subsequent competitions.
Datasets were the hardest category to reach the KGM tier. What helped me was to constantly work on dataset ideas that would be useful for projects or competitions and publish extremely clean datasets that users can use conveniently.
Notebooks is a great category to showcase novel visualizations, models, or pipelines. I don’t like writing EDA notebooks much, so most of my efforts have gone into interesting model pipelines, utility scripts, or tutorials.
Discussions are fun and easy for me because I enjoy writing ☺
RR: Competitions (in 2016), followed by Notebooks, Datasets, and Discussions in order (all in 2020).
Datasets were the hardest, most likely due to lower activity and visibility compared to the other three categories on the platform, so it took more effort and time.
RR: Start with the basics and tweak as you go.
Spend the majority of the time exploring the data. Read every discussion and scan every notebook. Be open to approaches. Often the craziest of ideas can help you rank well.
Team up to learn more and ensemble better; teaming up is often underrated.
RR: The ASHRAE competition is my most memorable competition, as we had a great team and it was my first prize-winning competition on Kaggle.
The Large Datasets Tutorial notebook is my most popular notebook and, in fact, I had deleted it because I did not like it. Fortunately, I had shared the notebook with a couple of friends who gave me extremely positive feedback and convinced me to republish it.
Santander Product Recommendation was the competition that gave me my 5th gold medal to become a Competitions Grandmaster. My teammate, Sudalai, and I were struggling to make our ensemble work until we chose different weights for different rows, some of them not even summing to 1. A nonsensical idea, but it worked.
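To illustrate what row-specific ensemble weights can look like, here is a minimal sketch in plain Python. It is not the team's actual code: the function name `rowwise_blend`, the two-model setup, and all scores and weights are hypothetical, chosen only to show that each row carries its own weight pair and that a pair does not have to sum to 1.

```python
from typing import List, Sequence


def rowwise_blend(
    preds_a: Sequence[float],
    preds_b: Sequence[float],
    weights_a: Sequence[float],
    weights_b: Sequence[float],
) -> List[float]:
    """Blend two models' predictions with a separate weight pair per row.

    Hypothetical helper for illustration only. The weights are used as
    given: they are not normalized and need not sum to 1.
    """
    n = len(preds_a)
    if not (len(preds_b) == len(weights_a) == len(weights_b) == n):
        raise ValueError("all inputs must have the same number of rows")
    return [
        wa * pa + wb * pb
        for pa, pb, wa, wb in zip(preds_a, preds_b, weights_a, weights_b)
    ]


# Made-up scores from two models for three rows.
preds_a = [0.20, 0.70, 0.40]
preds_b = [0.30, 0.50, 0.90]

# Made-up per-row weights; the last row's pair sums to 1.3 on purpose.
weights_a = [0.5, 0.8, 0.6]
weights_b = [0.5, 0.2, 0.7]

blended = rowwise_blend(preds_a, preds_b, weights_a, weights_b)
print(blended)
```

How the rows were grouped and how the weights were chosen is not described in the interview, so the values above are placeholders rather than a reconstruction.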
RR: Most of the knowledge competitions on Kaggle are specifically meant for beginners to get started and they are the best place to start and learn stuff.
RR: I use a combination of both. I try to make the maximum use of freely available resources before moving to my personal servers (which are sometimes costly).
RR: python-datatable is a good alternative to pandas, especially when working on large datasets with limited resources.
Recently, I quite enjoyed using PyTorch in R; it’s a great wrapper for DL practitioners who want to stick to R.
RR: Kaggle is the best place to keep oneself acquainted with all the latest happenings in ML/DL, and arXiv is the place for reading research papers. The more time you spend working on these hands-on, the better you get at it and the easier it becomes to apply them to hackathons and real-world industry problems.
It was a pleasure for us to interview such a multi-talented person. His thoughts and words are enough to get anyone to begin and stay focused on their data science journey.
This is the 20th interview in the Kaggle Grandmaster Series. You can read the previous few at the following links:
What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!