Have you ever participated in a Kaggle competition? Have you ever wondered what it takes to win one or to become a Kaggle Grandmaster? H2O.ai’s Senior Data Scientist, Nikhil Kumar Mishra, recently achieved the Kaggle Grandmaster title with his 5th Gold in competitions. He spoke to Analytics Vidhya following the win to share with us his journey, struggles, milestones, and what it’s like to be a Kaggle Grandmaster.
And here’s the interview.
Nikhil Mishra (NM): Thank you. I think it’s been a dream for me since the time I started with data science, which is when I started participating in competitions. So yeah, it’s finally a dream come true and I think it’s the same feeling for most competitive data scientists out there when they become a Grandmaster – it’s just pure happiness and excitement.
NM: I think my journey is similar to that of many data scientists back at that time. We started with Andrew Ng’s famous Machine Learning course, about which everyone said something like, ‘If you know this, you probably know more than half the engineers do,’ which was motivating for us. Around the same time, I discovered that data science competitions were a good way to earn money – although I never made any money in the first 3 or 4 years.
There were hackathons going on in college at that time. And although I was not too good at those hackathons, I was interested in data science. So I started participating in data science competitions on platforms like Analytics Vidhya and, obviously, Kaggle. That’s where I came across people like Rohan Rao, SRK, Sahil Verma, and Mohsin – who were all No.1 on Analytics Vidhya at that time. I saw them doing well in almost every competition and felt if they could do it, then maybe even I could. So, that just kept me going.
I’m not going to lie, initially, it was the money that got me into competing. But even when you lose you learn something from it. And when you win, you invest it back in – buy more GPUs, or more cloud computing time, or a better system. It’s a cycle of investing and making money out of it.
The other motivation is the opportunity to try out the latest technology in the field and learn about data science as it evolves. Kaggle competitions let you do that and they also teach you things that you may later use in your work as well. So, I guess, that’s what keeps me going.
NM: I don’t really remember my first competition, but I vividly remember the first one I took part in seriously on Kaggle, for a month and a half. It was the Microsoft Malware Prediction competition, in which we placed 25th. What makes it memorable is that it was the first time I collaborated with so many people, and that too from different countries.
One of my teammates was from Vietnam, another was from England, and the third was from the US. Also, they were all very senior to me. Seeing this aspect of competitions, where you get to collaborate with people all over the world, and learn from them – was also very motivating for me.
NM: My first win was, I think, 4000 or 5000 rupees, which felt okay. But seeing yourself on the top of the leaderboard for the first time, that too after so many days, so many attempts – that was something. I think there were 3 or 4 times before that when I came in the top 2 or top 3, or even No. 1 on the public leaderboard. But then I kept falling on the private leaderboard. So finally when I came on top of the private leaderboard, it was a surreal feeling. It was like, “Okay, even I can do this!”
NM: Firstly, as I mentioned, Kaggle competitions are very much about collaboration. I think when you collaborate with people from different parts of the world or different walks of life, you get to learn a lot. You get to see through other people’s minds – how they think, how they try to solve problems. And when you put that into your own strategies, I think it makes you 4x or 5x of what you already are.
The second thing about competitions which I really like is that you have to try a lot of things in a very short period of time. That really helps you evolve as a data scientist. You see, in most of the projects we do ourselves, we have a lot of time to work, but we don’t have some leaderboard to race against. So we usually take it slowly. We try a few experiments and see if they work or reset till we are satisfied with the results. But for competitions, you have so many different things to try in a very short period of time. So the learnings you get in a competition are much more and much better than when you just do things by yourself at work.
The third thing that I think these competitions really help with is your career. At least for me, in my entire journey, all the jobs I got came through references I earned by doing well in competitions. They were because people knew me from competitions and they saw that I was good at competitions. It helped me build a good network of helpful data scientists and friends. That’s a great takeaway for beginners and aspiring data scientists.
NM: As I mentioned earlier, in Kaggle competitions you constantly have to evolve in a very short period of time because you’re racing against a lot of people and even the smallest differences matter. But in the real world, you don’t know the limits, and you might get satisfied after reaching a certain accuracy in your model. And then you say okay, ‘this is enough.’ But for a competition, you’ll have to constantly try out a lot of things; you’ll have to constantly push yourself to be better. And after you compete on a few platforms, you will feel that the projects in the real world become much simpler for you because you know what to try and what will work, because you have tried it before.
Another thing is, in Kaggle, it’s always about the state-of-the-art solutions. Even if the problems are simple, the solutions are cutting-edge or bleeding edge. You have the best and latest technologies at your fingertips to try out and see if they work. That is one really big advantage of Kaggle, which you don’t get otherwise.
You’ll even get to reinvent, say, some architectures if you talk about deep learning, or try some really fancy method and share it after the competition. So when any problem of a similar domain comes to you at work, it becomes very easy.
NM: When I initially started it was mostly about structured data problems, and I think the competition was relatively easier compared to what it is now. Not taking anything away from the people who have done it before, they too have worked really hard. But I think it’s much tougher now to secure a good position as compared to, say, six or seven years back. There are a lot more people actively participating on Kaggle now, which makes it more challenging. Also, the kind of resources that were available to us back then is very different from what we have now.
NM: I think in solo competitions, right from the beginning, you have to try things on your own. You’ll have to map out how you want things to go. For instance, if it’s a three-month competition on Kaggle, you’ll have to decide how to progress, what kind of experiments you want to try, and how you would put them together at the end, when you only have one or two weeks left. In solo competitions, all of this solely depends on you.
When you work with teams, if you get stuck somewhere or can’t find something, there’s always a teammate who’ll find it or guide you. Also, it gives you a lot of exposure to how other people think and how the same problem can be approached differently. Each person in the team will have their own way of coding and their way of thinking. The learning is more in this case. The competition also becomes comparatively easier because you split the work and effort, and it’s more exciting to see how all our different ideas come together at the end.
NM: When I began participating in Data Science competitions, most of the problems on Kaggle or even on Analytics Vidhya were on structured data. So I developed a knack for solving those. So, not talking about preference, but I’m definitely much better at solving structured data problems. But I’ve got 2 or 3 gold medals in classic sequence problems, which aren’t completely structured. So I guess I handle unstructured datasets pretty well too. I definitely want to evolve more in them though.
NM: I think in my initial days, say, from 2018 to 2021, you could easily manage most competitions on a local workstation, or maybe with a really high-end laptop. But now, most of the competitions require a lot of resources.
See, the amount of resources that you’ll need at the beginning of the competition is a lot different from what you’ll need towards the end. Towards the end, you want to try a lot of ideas together and run some big experiments. And for that you will need bigger resources, like what a cloud setup can provide. But that calls for a big investment, which I feel will eventually pay off when you win competitions.
NM: So, if you split a three-month competition – the time we spend every month is equal. But speaking of the effort we put in as data scientists, I think it peaks towards the end of the competition. In the last one or two weeks, our effort is double, triple, or even 10 times what it is during the rest of it.
At the beginning of the competition, we are all chill, just thinking about which experiments to run. And then we test them out slowly and observe the results. In the middle, we try out different ideas, change some parameters, and figure out what works. But by the end, we have hundreds of ideas to try and only 10 days left! Then it’s mostly just sleepless nights and coffees.
NM: It’s a lot of fun to engage in Kaggle discussion forums and even on LinkedIn or Twitter. We share some of our ideas and updates on where we are on the leaderboard. We sometimes even challenge each other on social media.
Apart from that, I think the learnings shared by the Kaggle community are completely different from what you find on any other platform. The wealth of knowledge you get from these discussions and the solutions at the end of competitions is very valuable. On Kaggle, you can find the latest paper on state-of-the-art technology or a really fancy technique you may want to try. You will also find the outcomes of experiments tried out by different people and the different approaches they take. All of that adds to who you are as a data scientist. And the best part I think is that it’s completely open for anybody to access.
Then again, when you compete, you find teammates from around the world who share their knowledge with you. That also helps you with your networking and future jobs, which I think is a big bonus for aspiring and upcoming data scientists.
NM: Most beginners keep wondering how to start on Kaggle, and I tell them that the most important part is to start. It’s not about how you start, what’s important is that you start. Once you start, you’ll eventually find your way.
The other concern I often hear from beginners is that they get low ranks although they compete a lot. Hear me out – that’s how it is for most people.
Even if you check my profile, you’ll see that my first few competitions were really bad. But that’s how you start, and from there you will evolve. Now, how do you get better? Read solutions from past competitions and try to implement them on your own. Keep doing this and you’ll notice that your ranks improve. It definitely requires that effort from your end.
That’s what I did. I would go crazy experimenting and trying out past solutions. This helped me understand how others think and how they go about solving problems. All of that added to my experience and gradually helped me move up the leaderboard.
NM: The first thing is if you are starting in a Kaggle competition, start early. Most competitions are 3 months long and starting early gives you ample time to experiment, run tests, and do really well on a project.
The second thing is to plan out your time really well. Kaggle competitions are all about doing good experiments and doing a lot of experiments. If you want to do that, you need to plan out what kind of experiments you want to try and figure out how to make your iteration faster. You could do this by sampling the data, through better allocation of the resources, etc.
The third thing I think you should do is a lot of reading. This could be the latest research papers, or solutions of previous problems, or just skimming the internet to see what’s new. And as you read, see how you can use these new models and techniques in your projects. Keep asking yourself: “Can I use that model? Can I train it on my data? What kind of results would I get?” and so on.
That being said, one cannot stay updated on everything, all the time. You can gain surface-level knowledge of the latest large language models and technologies from reading, and also from the discussion boards on Kaggle. From that, you need to pick what topics to focus on and explore them further, depending on your project or work. But even that surface-level knowledge will help you stay ahead in the competition.
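As a minimal sketch of the “sampling the data” idea for faster iteration mentioned above, assuming a labeled tabular dataset, one could run early experiments on a small stratified subset and move to the full data only for promising ideas. The function name `stratified_sample`, the `target` field, and the toy rows are illustrative assumptions, not something described in the interview.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Return roughly `fraction` of rows while keeping each label's share."""
    rng = random.Random(seed)  # fixed seed so experiments are repeatable
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    sample = []
    for group in by_label.values():
        # Keep at least one row per class so rare labels are not lost.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    rng.shuffle(sample)
    return sample

# Hypothetical toy dataset: 1,000 rows with a 90/10 class imbalance.
rows = [{"id": i, "target": int(i % 10 == 0)} for i in range(1000)]
subset = stratified_sample(rows, lambda r: r["target"], fraction=0.1)
```

Sampling per class rather than uniformly keeps the class balance of the subset close to that of the full data, so scores on the subset are more likely to move in the same direction as scores on the full dataset.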
NM: Thankfully for me, my company really motivates everyone to participate in competitions. So much so that it has its own team of Grandmasters! So my work and colleagues really motivate me and appreciate me when I do well in competitions.
My usual day during competitions would mostly be in front of two screens – one for work and the other running experiments for the competition. But during the last part of the competition, it’s just sleep-competitions-eat-repeat! During that time, the rest and fun part of life goes on hold. That’s the only accommodation I have to make.
NM: I think by now I would have participated in over a hundred competitions. Now that I’m at H2O, I’m more actively participating – so, about 20-25 competitions per year. Obviously, on Kaggle you cannot participate in more than 5-6 competitions a year because of their length. But there are platforms with smaller competitions lasting a week or two, or even over weekends.
NM: It’s really motivating when you work with people who are much more talented than you and even some who were your idols when you began your journey. Back in 2019, there was a conference near my college, where Rohan Rao was one of the speakers, and Sanyam Bhutani was an organizer. At that time, they didn’t even know me and I just attended as a college student. And now I’m participating with Rohan on a regular basis.
It’s a different feeling when you get to work alongside such people. And they are constantly pushing the limits at work while doing really well in competitions. When you have such a great circle to work with, it definitely pushes you.
NM: For me, like I said, in my initial years of competing, Rohan, SRK, Sahil, Mohsin – all of these people were the ones who really inspired me. I’ve learned a lot from whatever they have posted – be it articles or notebooks, or solutions to problems.
During my college time, there was Josh Starmer, whose short videos helped me learn things quickly and prepare for college exams and interviews. Nowadays there are a lot of good YouTubers like 3Blue1Brown who post interesting and informational content. There’s Andrej Karpathy teaching about LLMs and the world is moving towards open sourcing the knowledge hidden behind LLMs. So there’s knowledge and inspiration everywhere!
Don’t miss out on the opportunity to learn to build a ChatGPT-style language model from Josh Starmer at the DataHack Summit 2024!
NM: Apart from reading discussion forums, as I mentioned earlier, I like to read research papers, which is now easier than ever, thanks to tools like ChatGPT. That keeps me updated with the latest developments in machine learning.
I haven’t really read many books, but I’m sure those are great sources of knowledge too. I prefer articles posted on Twitter or Reddit because you get them as soon as something new comes out.
For courses, I’d definitely recommend Andrej Karpathy’s CS231n and Andrew Ng’s courses on machine learning and AI. Gilbert Strang’s videos on linear algebra are, I think, quite helpful too.
And for competitive data science specifically, I suggest you read the solutions to previous problems and get the latest updates from research papers.
NM: I don’t think I prepared myself for this question. Well, I’m generally interested in multimodal LLMs. Apart from that, I read about Agentic AI. I try to learn how we can use different agents to automate our tasks. Then, if I start with a Kaggle competition, I get interested in knowing more about the LLMs or generative AI related to that problem.
NM: I was talking to Nischay about this the other day. He’s a friend and I compete a lot with him. I was telling him that now that I’ve come into the top 100, at rank 63, his being 5th in the world pushes me to participate more and get better. So I am definitely looking forward to more competitions and pushing myself to be in the top 10 or top 20 by next year.
I haven’t really set goals for the far future, but I’d definitely like to keep participating in competitions and build some really good AI products. I also hope to make some good open source contributions in the future.
With 6 gold, 9 silver, and a bronze medal under his belt, Nikhil Kumar Mishra finally earned his Kaggle Competitions Grandmaster title! In this interview, he told us how Kaggle as a platform helps data scientists showcase their skills, learn from others, and tackle real-world problems. He also shared with us some great tips and course recommendations for people who are just starting out on their Kaggle or data science journeys.
However, approaching Kaggle competitions can be overwhelming, especially for beginners with limited domain knowledge. To help you out, we are bringing you Kaggle Grandmaster Nischay Dhankhar for a GenAI Hack Session on “Mastering Kaggle Competitions: Strategies and Techniques for Success.” Don’t miss out on this great opportunity at the DataHack Summit 2024!