This article was published as a part of the Data Science Blogathon.
The first step towards problem-solving in data science projects isn’t about building machine learning models. Yes, you read that right!
That distinction belongs to hypothesis generation – the step where we combine our problem-solving skills with our business intuition. It’s a truly crucial step in ensuring a successful data science project.
Let’s be honest – all of us think of a hypothesis almost every day. Let us consider the example of a famous sport in India – cricket. It is that time of the year when IPL fever is high and we are all absorbed in predicting the winner.
If you have been guessing which team would win based on factors like the size of the stadium, or whether the team has batsmen with six-hitting ability or high T20 averages, then kudos to you. You have been making an educated guess and generating hypotheses based on your domain knowledge of the sport.
Similarly, the first step towards solving any business problem using machine learning is hypothesis generation. Understanding the problem statement with good domain knowledge is important and formulating a hypothesis will further expose you to newer ideas of problem-solving.
So in this article, let’s dive into what hypothesis generation is and figure out why it is important for every data scientist.
Hypothesis generation is an educated “guess” about the various factors that impact the business problem to be solved using machine learning. When framing a hypothesis, the data scientist must not already know its outcome from any evidence; the hypothesis comes first and the data is examined afterwards.
“A hypothesis may be simply defined as a guess. A scientific hypothesis is an intelligent guess.” – Isaac Asimov
Hypothesis generation is a crucial step in any data science project. If you skip this or skim through it, the likelihood of the project failing rises sharply.
This is a very common mistake data science beginners make.
Hypothesis generation is a process that begins with an educated guess, whereas hypothesis testing is a process that concludes whether the educated guess holds, that is, whether the relationship between the variables is statistically significant or not.
The result of this testing can then support further research with statistical evidence. A hypothesis is rejected or retained based on the significance level and the test statistic of the test used.
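To make the difference concrete, here is a minimal sketch of how one generated hypothesis ("peak-hour trips take longer than off-peak trips") could be tested. It uses a simple permutation test, which is my choice for illustration and not a method named in this article. The trip durations, the function name `permutation_p_value`, and the 0.05 significance level are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided permutation test: is mean(group_a) greater than mean(group_b)?"""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up trip durations in minutes, for illustration only.
peak_hour_trips = [22, 25, 27, 30, 24, 28, 26, 29]
off_peak_trips = [14, 16, 15, 18, 13, 17, 15, 16]

ALPHA = 0.05  # significance level, chosen before looking at the data
p_value = permutation_p_value(peak_hour_trips, off_peak_trips)
reject_null = p_value < ALPHA
print(f"p-value = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

The hypothesis itself came from domain intuition. The test only checks whether the data is consistent with it at the chosen significance level.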
To understand more about hypothesis testing in detail, you can read about it here or you can also learn it through this course.
Here are 5 key reasons why hypothesis generation is so important in data science:
The million-dollar question – when in the world should you perform hypothesis generation?
To understand hypothesis generation, let us now look at the “NEW YORK CITY TAXI TRIP DURATION PREDICTION” problem statement and generate a few hypotheses about the factors that could affect taxi trip duration.
Here’s the problem statement:
To predict the duration of a trip so that the company can assign the cabs that are free for the next trip. This will help in reducing the wait time for customers and will also help in earning customer trust.
Let’s begin!
Let us try to come up with a formula that would have a relation with trip duration and would help us in generating various hypotheses for the problem:
TIME=DISTANCE/SPEED
Distance and speed play an important role in predicting the trip duration.
We can notice that the trip duration is directly proportional to the distance traveled and inversely proportional to the speed of the taxi. Using this we can come up with a hypothesis based on distance and speed.
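As a rough sketch of the formula above, the snippet below estimates trip time from a straight-line distance and an assumed average speed. The haversine formula is a standard way to get the great-circle distance between two coordinates. The function names, the coordinates, and the 20 km/h speed are assumptions for illustration and are not taken from the problem's dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def estimated_duration_minutes(distance_km, speed_kmh):
    """TIME = DISTANCE / SPEED, converted from hours to minutes."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return distance_km / speed_kmh * 60


# Approximate coordinates, used only as an example pickup and dropoff.
pickup = (40.7580, -73.9855)   # around Times Square
dropoff = (40.6413, -73.7781)  # around JFK airport

distance = haversine_km(*pickup, *dropoff)
minutes = estimated_duration_minutes(distance, speed_kmh=20)  # assumed city speed
print(f"distance = {distance:.1f} km, estimated time = {minutes:.0f} min")
```

A straight line underestimates the real road distance, so this is best read as a lower bound that a distance-based hypothesis could start from.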
Cars come in various types, sizes, and brands, and these features could matter not only for the safety of the passengers but also for the trip duration. Let us now generate a few hypotheses based on the features of the car.
Trip types can be different based on trip vendors – it could be an outstation trip, single or pool rides. Let us now define a hypothesis based on the type of trip used.
A driver is an important person when it comes to commute time. Various factors about the driver can help in understanding the reason behind trip duration, and here are a few hypotheses about this.
Passengers can influence the trip duration knowingly or unknowingly. For example, we often come across passengers asking drivers to speed up because they are running late. There are other passenger-related factors we can hypothesize about as well.
The day and time of the week are important as New York is a busy city and could be highly congested during office hours or weekdays. Let us now generate a few hypotheses on the date and time-based features.
Pickup Day:
Time:
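The date and time hypotheses above need features such as the pickup day and hour. Here is a small sketch of how they could be derived from a pickup timestamp using only the standard library. The timestamp format, the function name `pickup_time_features`, and the peak-hour windows (7 to 10 and 16 to 20) are assumptions, not values given in the problem statement.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Assumed office-hour windows: 07:00-09:59 and 16:00-19:59.
PEAK_HOURS = set(range(7, 10)) | set(range(16, 20))


def pickup_time_features(pickup_timestamp):
    """Turn a 'YYYY-MM-DD HH:MM:SS' string into day and time features."""
    pickup = datetime.strptime(pickup_timestamp, "%Y-%m-%d %H:%M:%S")
    day_of_week = pickup.weekday()  # Monday = 0 ... Sunday = 6
    is_weekend = day_of_week >= 5
    return {
        "day_name": DAY_NAMES[day_of_week],
        "hour": pickup.hour,
        "is_weekend": is_weekend,
        # Treat peak hours as a weekday-only effect in this sketch.
        "is_peak_hour": (not is_weekend) and pickup.hour in PEAK_HOURS,
    }


features = pickup_time_features("2016-03-14 17:24:55")
print(features)
```

Each feature maps to one hypothesis, for example "trips that start during weekday peak hours take longer", which can then be checked against the data.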
Roads are of different types and the condition of the road or obstructions in the road are factors that can’t be ignored. Let’s form some hypotheses based on these factors.
Weather can change at any time and could possibly impact the commute if the weather turns bad. Hence, this is an important feature to consider in our hypothesis.
I hope you were able to get some value from this post. If there is anything that I missed or something was inaccurate or if you have any feedback, please let me know in the comments. I would greatly appreciate it.