This article was published as a part of the Data Science Blogathon.
Let us walk through an Exploratory Data Analysis (EDA) of the NYC Taxi Trip Duration dataset.
Exploratory Data Analysis means investigating a dataset to understand its main characteristics and draw insights from it. EDA can be done using both statistical and visualization techniques.
We simply can’t make sense of such huge datasets if we don’t explore the data.
Exploring and analyzing the data lets us see how features contribute to the target variable, identify anomalies and outliers and treat them before they affect our model, study the nature of the features, and clean the data so that our model building process is as efficient as possible.
Without exploratory data analysis, we would not find the inconsistent or incomplete data that could present misleading trends to our model.
From a business point of view, business stakeholders often have certain assumptions about data. Exploratory Data Analysis helps us look deeper and see if our intuition matches with the data. It helps us see if we are asking the right questions.
This step also serves as the basis for answering our business questions.
import pandas as pd  # data processing
import numpy as np  # linear algebra
# data visualisation
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline
import datetime as dt
import warnings; warnings.simplefilter('ignore')
Let us now import the dataset. (You can download the dataset from here.)
The dataset, originally a CSV file, is now loaded into a pandas DataFrame that we have named ‘data’.
We see the shape of the dataset is (729322, 11) which essentially means that there are 729322 rows and 11 columns in the dataset.
Now let’s see what those 11 columns are.
Let us now look at the datatypes of all these columns.
Now, let us look at what the data in these columns looks like.
Let us see if there are any null values in our dataset.
There are no null values in this dataset, which saves us the imputation step.
Let us check for unique values of all columns.
Let us finally check for a statistical summary of our dataset.
Note that this summary (pandas’ describe() with its default settings) covers numerical features only.
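The inspection steps above can be sketched as follows. This is only an illustration on a tiny made-up DataFrame: the name `toy` and all of its values are hypothetical, and the calls shown are standard pandas attributes and methods rather than the author's exact code.

```python
import pandas as pd

# Tiny hypothetical frame that mimics a few columns of the taxi data.
toy = pd.DataFrame({
    "id": ["id001", "id002", "id003"],
    "vendor_id": [1, 2, 2],
    "pickup_datetime": ["2016-01-01 08:00:00", "2016-01-02 18:30:00", "2016-01-03 23:15:00"],
    "passenger_count": [1, 2, 1],
    "trip_duration": [400, 1100, 800],
})

shape = toy.shape                  # (rows, columns)
columns = list(toy.columns)        # column names
dtypes = toy.dtypes                # data type of every column
null_counts = toy.isnull().sum()   # missing values per column
unique_counts = toy.nunique()      # distinct values per column
summary = toy.describe()           # numeric columns only, by default

print(shape)
print(null_counts)
print(summary)
```

On the real dataset the same calls would be made on `data` instead of `toy`.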
Some insights from the above summary:
Let us create some new features from the existing variables so that we can gain more insights from the data.
Remember pickup_datetime and dropoff_datetime were both of type object.
To make use of this data, we can convert both columns to the datetime type, which provides many accessors for creating the new features we will see soon.
We can convert it to datetime using the following code.
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'])
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'])
Now if you check the dtypes attribute again, you will see the type as datetime64[ns].
Now, let us extract and create new features from these datetime features we just created.
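As a small sanity check of the conversion, the sketch below uses one made-up trip (the timestamps and the name `toy` are hypothetical). It confirms that the columns are datetime-typed after `pd.to_datetime` and shows that, once converted, they support arithmetic such as computing the elapsed seconds between pickup and dropoff.

```python
import pandas as pd

# One hypothetical trip with timestamps stored as strings (dtype object).
toy = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55"],
    "dropoff_datetime": ["2016-03-14 17:32:30"],
})

toy["pickup_datetime"] = pd.to_datetime(toy["pickup_datetime"])
toy["dropoff_datetime"] = pd.to_datetime(toy["dropoff_datetime"])

# True once the column holds real timestamps instead of strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["pickup_datetime"])

# Datetime columns can be subtracted; the result is a timedelta column.
elapsed_seconds = (toy["dropoff_datetime"] - toy["pickup_datetime"]).dt.total_seconds().tolist()

print(is_datetime, elapsed_seconds)
```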
data['pickup_day'] = data['pickup_datetime'].dt.day_name()
data['dropoff_day'] = data['dropoff_datetime'].dt.day_name()
data['pickup_day_no'] = data['pickup_datetime'].dt.weekday
data['dropoff_day_no'] = data['dropoff_datetime'].dt.weekday
data['pickup_hour'] = data['pickup_datetime'].dt.hour
data['dropoff_hour'] = data['dropoff_datetime'].dt.hour
data['pickup_month'] = data['pickup_datetime'].dt.month
data['dropoff_month'] = data['dropoff_datetime'].dt.month
We have created the following features:
Next, I have defined a function that determines what time of the day the ride was taken. I have created 4 time-of-day buckets: ‘Morning’ (from 6:00 am to 11:59 am), ‘Afternoon’ (from 12 noon to 3:59 pm), ‘Evening’ (from 4:00 pm to 9:59 pm), and ‘Late night’ (from 10:00 pm to 5:59 am).
def time_of_day(x):
    # x is the hour of the day (0-23)
    if x in range(6, 12):
        return 'Morning'
    elif x in range(12, 16):
        return 'Afternoon'
    elif x in range(16, 22):
        return 'Evening'
    else:
        return 'Late night'
Now let us apply this function and create new columns in the dataset.
data['pickup_timeofday'] = data['pickup_hour'].apply(time_of_day)
data['dropoff_timeofday'] = data['dropoff_hour'].apply(time_of_day)
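To see exactly where the bucket boundaries fall, here is a quick check that applies the same function to the first and last hour of each bucket. The list of hours is chosen for illustration only.

```python
import pandas as pd

def time_of_day(x):
    # Same bucketing as in the article; x is the hour of the day (0-23).
    if x in range(6, 12):
        return 'Morning'
    elif x in range(12, 16):
        return 'Afternoon'
    elif x in range(16, 22):
        return 'Evening'
    else:
        return 'Late night'

# First and last hour of every bucket.
hours = pd.Series([0, 5, 6, 11, 12, 15, 16, 21, 22, 23])
labels = hours.apply(time_of_day).tolist()

for hour, label in zip(hours, labels):
    print(hour, label)
```

Hours 22 through 5 all fall into the final `else` branch, which is how the ‘Late night’ bucket wraps around midnight.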
We also saw during dataset exploration that we have coordinates in the form of longitude and latitude for pickup and dropoff. But, we can’t really gather any insights or draw conclusions from that.
So, the most obvious feature that we can extract from this is distance. Let us do that.
Importing the library which lets us calculate distance from geographical coordinates.
from geopy.distance import great_circle
Defining a function to take coordinates as inputs and return us distance.
def cal_distance(pickup_lat, pickup_long, dropoff_lat, dropoff_long):
    start_coordinates = (pickup_lat, pickup_long)
    stop_coordinates = (dropoff_lat, dropoff_long)
    return great_circle(start_coordinates, stop_coordinates).km
Finally, applying the function to our dataset and creating the feature ‘distance’.
data['distance'] = data.apply(
    lambda x: cal_distance(x['pickup_latitude'], x['pickup_longitude'],
                           x['dropoff_latitude'], x['dropoff_longitude']),
    axis=1)
Now let us look at the head of the dataset again to see these new features.
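Applying a Python function row by row can be slow on several hundred thousand rows. As an alternative to the geopy-based function above (not the author's method), here is a sketch of the haversine great-circle formula written with NumPy so that it works on whole columns at once. The function name `haversine_km` and the Earth radius constant are assumptions of this sketch.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # mean Earth radius in km (assumed for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two illustrative checks with made-up coordinates.
same_point = haversine_km(40.75, -73.98, 40.75, -73.98)     # identical points
one_degree_lat = haversine_km(40.0, -73.98, 41.0, -73.98)   # one degree of latitude apart

print(same_point, one_degree_lat)
```

On the real data it could be called with the four coordinate columns of `data` as arguments. The results should agree closely with `great_circle`, though small numerical differences are possible.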
Thus, we successfully created some new features which we will analyze in univariate and bivariate analysis.
Univariate analysis involves studying the patterns of each variable individually.
Let us start by analyzing the target variable.
As we can see, the histogram is heavily skewed.
Let us also look at the boxplot.
We can clearly see an outlier.
We can see that there is an entry which is significantly different from others.
As there is a single row only, let us drop this row.
data.drop(data[data['trip_duration'] == 1939736].index, inplace = True)
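The drop above hard-codes the outlier's value. A slightly more general sketch, assuming there is exactly one extreme row as in this case, locates it with `idxmax` instead. The frame `toy` and its other values are made up for illustration.

```python
import pandas as pd

# Hypothetical durations in seconds; 1939736 is the outlier value seen in the article.
toy = pd.DataFrame({"trip_duration": [420, 615, 1939736, 980]})

outlier_index = toy["trip_duration"].idxmax()   # row label of the largest duration
cleaned = toy.drop(index=outlier_index)         # returns a new frame without that row

print(outlier_index, cleaned["trip_duration"].max())
```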
We see that there is not much difference between the trips taken by both vendors.
We see the highest number of trips are with 1 passenger.
Let us remove the rows which have a passenger count of 0, 7, or 9.
data = data[data['passenger_count'] != 0]
data = data[data['passenger_count'] <= 6]
Now, let’s see our value counts again.
Now, that seems like a fair distribution.
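The same filter can also be written as a single range condition. The sketch below uses a made-up passenger column (the frame `toy` is hypothetical) and then recomputes the value counts.

```python
import pandas as pd

toy = pd.DataFrame({"passenger_count": [0, 1, 1, 2, 6, 7, 9]})  # hypothetical values

# between() is inclusive on both ends by default, so this keeps counts 1 to 6.
kept = toy[toy["passenger_count"].between(1, 6)]
counts = kept["passenger_count"].value_counts().sort_index()

print(counts)
```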
We see that fewer than 1% of trips were stored before forwarding.
We see there are 2893 trips with 0 km distance.
The reasons for 0 km distance can be:
We will analyze these trips further in bivariate analysis.
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 5))  # two side-by-side axes
ax1.set_title('Pickup Days')
sns.countplot(x="pickup_day", data=data, ax=ax1)
ax2.set_title('Dropoff Days')
sns.countplot(x="dropoff_day", data=data, ax=ax2)
We see Fridays are the busiest days followed by Saturdays. That is probably because of the weekend.
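The numbers behind a count plot can also be read directly with `value_counts`. Here is a minimal sketch on made-up pickup days; the frame `toy` and the `day_order` list are assumptions of this example.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

toy = pd.DataFrame({"pickup_day": ["Friday", "Saturday", "Friday", "Monday"]})  # hypothetical

# Count trips per day, then arrange them in calendar order (days with no trips get 0).
trips_per_day = toy["pickup_day"].value_counts().reindex(day_order, fill_value=0)
busiest_day = trips_per_day.idxmax()

print(trips_per_day)
print(busiest_day)
```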
fig, (ax9, ax10) = plt.subplots(ncols=2, figsize=(20, 5))  # two side-by-side axes
ax9.set_title('Pickup Hours')
sns.countplot(x="pickup_hour", data=data, ax=ax9)
ax10.set_title('Dropoff Hours')
sns.countplot(x="dropoff_hour", data=data, ax=ax10)
We see the busiest hours are 6:00 pm to 7:00 pm and that makes sense as this is the time when people return from their offices.
fig, (ax3, ax4) = plt.subplots(ncols=2, figsize=(20, 5))  # two side-by-side axes
ax3.set_title('Pickup Time of Day')
sns.countplot(x="pickup_timeofday", data=data, ax=ax3)
ax4.set_title('Dropoff Time of Day')
sns.countplot(x="dropoff_timeofday", data=data, ax=ax4)
As we saw above, evenings are the busiest.
fig, (ax11, ax12) = plt.subplots(ncols=2, figsize=(20, 5))  # two side-by-side axes
ax11.set_title('Pickup Month')
sns.countplot(x="pickup_month", data=data, ax=ax11)
ax12.set_title('Dropoff Month')
sns.countplot(x="dropoff_month", data=data, ax=ax12)
There is not much difference in the number of trips across months.
Now, we will analyze all these variables further in bivariate analysis.
Bivariate Analysis involves finding relationships, patterns, and correlations between two variables.
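One simple numeric way to compare a target against a categorical feature is a grouped mean. The sketch below uses a made-up frame (`toy` and its values are hypothetical, chosen only to show the mechanics) and is a counterpart to the plots, not the author's code.

```python
import pandas as pd

toy = pd.DataFrame({
    "vendor_id": [1, 1, 2, 2],
    "pickup_hour": [6, 15, 6, 15],
    "trip_duration": [300, 900, 500, 1300],  # seconds, made up
})

# Average trip duration for each value of the second variable.
mean_by_vendor = toy.groupby("vendor_id")["trip_duration"].mean()
mean_by_hour = toy.groupby("pickup_hour")["trip_duration"].mean()

print(mean_by_vendor)
print(mean_by_hour)
```

Because the mean is sensitive to extreme values, it may be worth checking the median as well on skewed data such as trip durations.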
Vendor id 2 takes longer trips as compared to vendor 1.
Trip duration is generally longer for trips whose flag was not stored.
There is no visible relation between trip duration and passenger count.
We see the trip duration is the maximum around 3 pm which may be because of traffic on the roads.
Trip duration is the lowest around 6 am as streets may not be busy.
As we saw above, trip duration is the maximum in the afternoon and lowest between late night and morning.
Trip duration is the longest on Thursdays closely followed by Fridays.
From February, we can see trip duration rising every month.
The distribution for both vendors is very similar.
We see for longer distances the trip is not stored.
We see some of the longer distances are covered by either 1 or 2 or 4 passenger rides.
Distances are the longest around 5 am.
As seen above, distances are the longest during late night, which may also be called early morning.
This can probably point to outstation trips where people start early for the day.
As we saw with trip duration per month, trip distance is also the lowest in February and the maximum in June.
This shows that vendor 2 generally carries 2 passengers while vendor 1 generally carries 1 passenger.
We can see there are trips with a trip duration as short as 0 seconds that nevertheless cover a large distance, and trips with 0 km distance and long trip durations.
Let us see a few rows whose distances are 0.
We can see that even though the distance is recorded as 0, the trip duration is clearly greater than 0.
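Rows like these can be pulled out with boolean filters. Here is a minimal sketch on a made-up frame (the name `toy` and all values are hypothetical) that selects zero-distance trips, longest first, as well as zero-duration trips.

```python
import pandas as pd

toy = pd.DataFrame({
    "distance": [0.0, 2.4, 0.0, 11.8],        # km, made up
    "trip_duration": [1200, 540, 35, 0],      # seconds, made up
})

# Trips recorded with no distance, sorted so the longest durations come first.
zero_distance = toy[toy["distance"] == 0].sort_values("trip_duration", ascending=False)

# Trips recorded with no duration at all.
zero_duration = toy[toy["trip_duration"] == 0]

print(zero_distance)
print(zero_duration)
```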
A. The NYC TLC dataset stands out as a prominent public dataset, renowned for being among the select few that are not only sizable (exceeding 100GBs) but also characterized by a relatively orderly structure and cleanliness.
A. Several factors contribute to the perceived expense of NYC taxis. High operating costs, including fuel, maintenance, and insurance, are a key factor. Additionally, the dense urban traffic can lead to longer trip durations, raising fares. Regulatory fees, such as those imposed by the Taxi and Limousine Commission, also add to costs. Moreover, demand often outstrips supply, especially during peak hours, allowing taxis to charge premium prices. These factors, combined with the city’s high cost of living, contribute to the perception of NYC taxis as expensive transportation options.
So, we have seen how Exploratory Data Analysis helps us identify underlying patterns in the data and lets us draw conclusions, and how it serves as the basis of feature engineering before we start building our model.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.