Introduction
So, what are you doing this weekend? We have an amazing opportunity you wouldn’t want to miss (if you are crazy about machine learning). Brace yourself, the action is about to start.
Lord of the Machines is around the corner. To help you make last-minute strategic plans, we thought of sharing these winning tips with you all. These tips will give you a unique perspective and help you build better ML models.
Competitions conducted on DataHack are fast-paced (live for only 48 to 72 hours) so that you learn to think, act and respond faster when solving business problems. We want you to be fast and efficient. So, if you are determined to keep pace with us, very soon you’ll reach another milestone in your life. Stay with us!
Follow the tips below, shared by four DataHack champions.
Tips by 4 DataHack Winners
Nalin Pasricha, DataHack Rank 1, Mumbai
Nalin is an investment banker turned data scientist who currently works as an independent consultant.
He has participated in 17 hackathons at DataHack. He won Data Hackathon 3.x and emerged as the 1st Runner Up in Black Friday DataHack. Check out his complete profile here.
Here’s what Nalin has to say:
- Our mind works subconsciously at night on our problems in a very powerful manner. So I try to start work on the problem as early as possible so that my mind has at least one night to work subconsciously on the problem.
- Read inspirational books or watch inspiring videos during a competition. I think it really helps your mind to go beyond its usual limits. I remember I was reading ‘The Wright Brothers’ by David McCullough during one hackathon. It’s the story of two brothers who were only bicycle manufacturers. They had not even attended college and had no funding, and still they managed to build the world’s first aeroplane, beating top scientists, universities, etc. I did really well in the hackathon mainly because this book changed my mindset.
- Try to use a package or language that is new to you. It’ll make you think differently and spur your creativity. I normally use R, but when I try to use Python instead I think I come up with unusual solutions.
Sudalai Rajkumar (SRK), DataHack Rank 2, Chennai
SRK is a Senior Data Scientist at Tiger Analytics. He is currently positioned at Rank 23 on Kaggle and has been bestowed with the Grandmaster Title on Kaggle. He is an inspiration for most of the aspiring data scientists in our community.
He has participated in 17 hackathons on DataHack. He’s a two-time winner of Mini DataHack and 2nd runner up for the Black Friday DataHack. Check out his complete profile here.
Here’s what SRK has to say:
- Feature engineering – The first and most important thing. We need to concentrate a lot on this since this makes a huge difference in the scores.
- Solid Validation Strategy – Without this, competitions are more or less like gambling and so it is essential to have a proper local validation strategy. Public LB can be misleading at times.
- Ensembling / Stacking – This is an important last step which helps us cover that extra mile at the end.
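SRK's last point can be illustrated with a minimal sketch of blending, the simplest form of ensembling: averaging the predictions of several models and checking the result on a local holdout set rather than on the public leaderboard. The holdout targets, the two sets of predictions, and the helper names `rmse` and `blend` are all made up for illustration; this is not SRK's own code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def blend(predictions, weights=None):
    """Weighted average of several models' predictions, row by row.

    `predictions` is a list with one list of predictions per model.
    With no weights given, every model counts equally.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*predictions)]


# Hypothetical holdout targets and predictions from two made-up models.
y_holdout = [10.0, 12.0, 9.0, 15.0, 11.0]
model_a = [11.0, 11.0, 10.0, 14.0, 12.0]
model_b = [9.5, 13.0, 8.0, 15.5, 10.0]

blended = blend([model_a, model_b])

print("model A RMSE:", rmse(y_holdout, model_a))
print("model B RMSE:", rmse(y_holdout, model_b))
print("blend   RMSE:", rmse(y_holdout, blended))
```

In this toy data the two models tend to err in opposite directions, so the average scores better than either one. Blending helps far less when the models make strongly correlated errors, which is why the gain should always be measured on local validation data.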
Rohan Rao, DataHack Rank 5, Bengaluru
Rohan is the Lead Data Scientist at AdWyze. He is currently positioned at Rank 70 on Kaggle and holds the prestigious Kaggle Master title. He has represented and brought laurels to India in World Sudoku championships.
He has participated in 11 hackathons on DataHack. He’s the winner of The Seer’s Accuracy DataHack and was 1st runner up in Last Man Standing. Check out his complete profile here.
Here’s what Rohan has to say:
- Understand The Problem: Without understanding the problem statement, the data and the evaluation metric, most of your work is fruitless. Spend time reading as much as possible about them. Only once you are very clear about the objective can you proceed with exploration. I spend a good amount of time reading and re-reading all the available information. It usually helps me figure out an approach / direction before writing a single line of code.
- Summarize / Visualize Data: Data Science competitions are driven by data. It’s all about the data. Sometimes you can have a great problem statement but noisy data. Sometimes you can have really clean data but a tricky evaluation metric. Sometimes you might have a good model, but outliers that skew the data. While there are huge advancements being made to automate a lot of this, there is still a lot of value in exploring data yourself. Cleaning data, handling outliers, transforming data, engineering features, etc. are all winners. I’ve found these to be major factors in Machine Learning projects. Feature engineering is the most useful output of data exploration. I believe that if you find the right and useful features, you can build a single powerful model better than any ensemble. Remember the Garbage In Garbage Out philosophy: if you feed noisy/unclean data into a model, no matter how powerful the model is, the output will be noisy.
- Validation Framework: A lot of people jump into building models by dumping data into the algorithms. While it is useful to get a sense of basic benchmarks, you need to take a step back and build a robust validation framework. Without validation, you are just shooting in the dark. You will be at the mercy of overfitting, leakage and other possible evaluation issues. By replicating the evaluation mechanism locally, you can measure your validation results to make faster and better improvements, while making sure your model is robust enough to perform well on various subsets of the train/test data.
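The validation framework Rohan describes could be sketched as k-fold cross-validation scored with the competition's metric. The version below is a minimal illustration under stated assumptions, not Rohan's actual setup: the data is synthetic, the "model" is a one-variable least-squares line, and mean absolute error stands in for whatever metric the competition really uses. All function names are hypothetical.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, valid) index lists so each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train, folds[i]


def mae(y_true, y_pred):
    """Mean absolute error; swap in the competition's real metric here."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=5, metric=mae):
    """Fit on k-1 folds, score on the held-out fold, return one score per fold."""
    scores = []
    for train, valid in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [slope * xs[i] + intercept for i in valid]
        scores.append(metric([ys[i] for i in valid], preds))
    return scores


# Synthetic data: a noisy straight line (purely illustrative).
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [3.0 * x + 2.0 + rng.gauss(0.0, 1.0) for x in xs]

scores = cross_validate(xs, ys, k=5)
print("fold scores:", [round(s, 3) for s in scores])
print("mean score :", round(sum(scores) / len(scores), 3))
```

Looking at the spread of the fold scores, not just their mean, gives a rough sense of how stable a model is across subsets of the data. If the competition data is ordered in time or grouped, a random shuffle like this one can leak information, and the split should mirror how the test set was built instead.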
Shantanu Dutta, DataHack Rank 6, Kolkata
Shan is a Senior Associate at ACME. He is a self-taught data scientist and specializes in BFSI and marketing. So, if you ever doubted that self-learning can make you a data scientist, you were wrong.
Shan has participated in 37 hackathons on DataHack. He won the Date Your Data and Re-date Your Data competitions. Check out his complete profile here.
Here’s what Shan has to say:
- Understand the Data: Do not worry about needing huge amounts of compute power. It is possible to do well in these competitions with moderate setups. Understand the data and generate a hypothesis. This part is important.
- Preprocessing & Feature Engineering: Spend a considerable amount of time on pre-processing and feature engineering. I have participated in many competitions, and no dataset is ever perfectly clean. There’s always some sort of inherent noise in the dataset that creates hiccups in models. It may be missing values, outliers, etc. Being able to visualize the data at each level of extraction will avoid many frustrations at the end.
- Algorithm Selection: Select the algorithm best suited to the data. Have confidence in your handcrafted cross-validation results.
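The two kinds of noise Shan mentions, missing values and outliers, can be handled in many ways. As one hedged illustration (not Shan's own pipeline), the sketch below fills missing values with the median and clips outliers to the common 1.5 × IQR fences. The column of values and the function names are invented for the example.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (the usual boxplot fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical raw column with two missing values and one extreme reading.
raw = [21.0, None, 23.0, 22.0, 250.0, 20.0, None, 24.0]

filled = impute_median(raw)
cleaned = clip_outliers(filled)

print("filled :", filled)
print("cleaned:", cleaned)
```

In a real competition the median and the clipping fences would normally be computed on the training data only and then reused on the validation and test data, so that the preprocessing itself does not leak information.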
Now, you have the winning potion. It’s time to test your winning habit. Use these tips in our upcoming competition, Lord of the Machines, and shine as a champion.
This competition is going to be intense and mind-boggling. You will have to fight and survive to reach the end. Be a winner, and challenge all your limits this time.
Register Now
To know more about the competition, visit here.
You can test your skills and knowledge. Check out Live Competitions and compete with the best data scientists from all over the world.
Kunal Jain is the Founder and CEO of Analytics Vidhya, one of the world's leading communities of AI professionals. With over 17 years of experience in the field, Kunal has been instrumental in shaping the global AI landscape. His expertise spans diverse markets, from developed economies like the UK to emerging ones like India, where he has successfully led and delivered complex data-driven solutions. As a recognized thought leader, Kunal has empowered countless individuals to realize their AI ambitions through his visionary approach to AI education and community building. Before founding Analytics Vidhya, Kunal earned both his undergraduate and postgraduate degrees from IIT Bombay and held key roles at Capital One and Aviva Life Insurance across multiple geographies. His passion lies at the intersection of analytics, AI, and fostering a thriving community of data science professionals.