Transport and logistics, food and shopping, payments, daily needs, business, news and entertainment: Gojek, an Indonesian firm, does it all through a mobile app, contributing more than $7 billion to the economy. It has 900K registered merchants, more than 190M app downloads, and more than 2M drivers delivering more than 180K orders within a span of 120 minutes. It's a behemoth! Using business analytics, we will be solving a case study. Below is a list of the 20+ services it offers:
Data generated through these services is enormous, and the GO team has engineering solutions to tackle day-to-day data engineering issues. The Central Analytics and Science Team (CAST) enables multiple products within the Gojek ecosystem to make efficient use of the abundance of data involved in the working of the app. The team has analysts, data scientists, data engineers, business analysts, and decision scientists working on in-house deep analytics solutions and other ML systems.
The analyst's role concentrates on solving day-to-day business problems: applying sound business knowledge, creating impact, deriving insights, performing RCAs (root cause analyses), and keeping top management informed on micro as well as macro metrics and on product decisions that address business problems.
Learning Objectives
This article was published as a part of the Data Science Blogathon.
GOJEK directors have asked BI analysts to look at the data to understand what happened during Q1 2016 and what they should do to maximize the revenue for Q2 2016.
Our GO-FOOD service in Surabaya performed very well last month, with 20% more completed orders than the month before. The manager of GO-FOOD in Surabaya needs to see what is happening in order to maintain this success from next month onwards.
Before beginning to solve, research the blogs and whitepapers on the company website (links are added below). Company archives provide useful resources that act as guides and help you understand what the company stands for and what it expects from this role. Questions one and three can be considered open-ended problems. Question two is a simple exercise on regression; the focus is not necessarily on the best model but on the processes involved in building one.
Import data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
#import csv
sales_df =pd.read_csv('https://raw.githubusercontent.com/chrisdmell/Project_DataScience/working_branch/09_gojek/sales_data_all.csv')
print("Shape of the df")
display(sales_df.shape)
print("HEAD")
display(sales_df.head())
print("NULL CHECK")
display(sales_df.isnull().any().sum())
print("NULL CHECK")
display(sales_df.isnull().sum())
print("df INFO")
display(sales_df.info())
print("DESCRIBE")
display(sales_df.describe())
Create a pandas datetime column from the object format; pandas datetimes are easy to work with when manipulating dates. Derive the month column from the datetime, filter out month 4 (April), and rename the months Jan, Feb, and Mar.
## convert to datetime
# convert order_status to string
##
time_to_pandas_time = ["date"]
for cols in time_to_pandas_time:
    sales_df[cols] = pd.to_datetime(sales_df[cols])
sales_df.dtypes
sales_df['Month'] = sales_df['date'].dt.month
sales_df.head()
sales_df['Month'].drop_duplicates()
sales_df[sales_df['Month'] !=4]
Q1_2016_df = sales_df[sales_df['Month'] !=4]
Q1_2016_df['Month'] = np.where(Q1_2016_df['Month'] == 1,"Jan",np.where(Q1_2016_df['Month'] == 2,"Feb",np.where(Q1_2016_df['Month'] == 3,"Mar","Apr")))
print(Q1_2016_df.head(1))
display(Q1_2016_df.order_status.unique())
display(Q1_2016_df.service.unique())
At the group level, overall revenue has grown by 14%, which is a positive outcome. Let's break this down by service and identify the services that are performing well.
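The share-of-revenue calculation used in this section can be sketched on a toy frame; the service names and numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical January revenue per service (not the case-study numbers)
toy = pd.DataFrame({
    "service": ["GO-RIDE", "GO-FOOD", "GO-MART"],
    "Jan": [50.0, 30.0, 20.0],
})

# Each service's percentage share of total January revenue
share = toy.set_index("service")["Jan"] / toy["Jan"].sum() * 100
print(share)  # GO-RIDE 50.0, GO-FOOD 30.0, GO-MART 20.0
```

Sorting these shares and taking a cumulative sum, as done below, is what isolates the services that together make up 95% of revenue.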
revenue_total.sort_values(["Jan"], ascending=[False],inplace=True)
revenue_total.head()
revenue_total['cummul1'] = revenue_total["Jan"].cumsum()
revenue_total['cummul2'] = revenue_total["Feb"].cumsum()
revenue_total['cummul3'] = revenue_total["Mar"].cumsum()
top_95_revenue = revenue_total[revenue_total["cummul3"]<=95 ]
display(top_95_revenue)
ninety_five_perc_gmv = list(top_95_revenue.service.unique())
print(ninety_five_perc_gmv)
top_95_revenue_plot = top_95_revenue[["Jan", "Feb", "Mar"]]
top_95_revenue_plot.index = top_95_revenue.service
top_95_revenue_plot.T.plot.line(figsize=(5,3))
## share of revenue has changed, but has the overall revenue changed for these top 4 services?
## NET - completed rides
Q1_2016_df_pivot_cbv_4 = Q1_2016_df[Q1_2016_df["order_status"] == "Completed"]
Q1_2016_df_pivot_cbv_4 = Q1_2016_df_pivot_cbv_4[Q1_2016_df_pivot_cbv_4.service.isin(ninety_five_perc_gmv)]
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv_4.pivot_table(index='service', columns=['Month' ], values='total_cbv', aggfunc= 'sum')
# display(Q1_2016_df_pivot_cbv.head())
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv[["Jan", "Feb", "Mar"]]
for cols in Q1_2016_df_pivot_cbv.columns:
    Q1_2016_df_pivot_cbv[cols] = (Q1_2016_df_pivot_cbv[cols]/1000000000) ## scale revenue to billions
display(Q1_2016_df_pivot_cbv)
display(Q1_2016_df_pivot_cbv.T.plot())
## We see that GO-SHOP's revenue has reduced, while for the others revenue is constant.
Q1_2016_df_pivot_cbv_4 = Q1_2016_df_pivot_cbv
Q1_2016_df_pivot_cbv_4.reset_index(inplace = True)
Q1_2016_df_pivot_cbv_4["Feb_jan_growth"] = (Q1_2016_df_pivot_cbv_4.Feb / Q1_2016_df_pivot_cbv_4.Jan -1)*100
Q1_2016_df_pivot_cbv_4["Mar_Feb_growth"] = (Q1_2016_df_pivot_cbv_4.Mar / Q1_2016_df_pivot_cbv_4.Feb -1)*100
display(Q1_2016_df_pivot_cbv_4)
Q1_2016_df_pivot_cbv = Q1_2016_df[Q1_2016_df["order_status"] == "Completed"] ## revenue share is computed on completed orders
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv.pivot_table(index='service', columns=['Month' ], values='total_cbv', aggfunc= 'sum')
Q1_2016_df_pivot_cbv = Q1_2016_df_pivot_cbv[["Jan", "Feb", "Mar"]]
revenue_total = pd.DataFrame()
for cols in Q1_2016_df_pivot_cbv.columns:
    revenue_total[cols] = (Q1_2016_df_pivot_cbv[cols]/Q1_2016_df_pivot_cbv[cols].sum())*100 ## percent share per service
revenue_total.reset_index(inplace = True)
display(revenue_total.head())
overall_cbv = Q1_2016_df_pivot_cbv.sum()
print(overall_cbv)
overall_cbv.plot()
plt.show()
overall_cbv = Q1_2016_df_pivot_cbv.sum()
overall_cbv_df = pd.DataFrame(data = overall_cbv).T
display(overall_cbv_df)
overall_cbv_df["Feb_jan_growth"] = (overall_cbv_df.Feb / overall_cbv_df.Jan -1)*100
overall_cbv_df["Mar_Feb_growth"] = (overall_cbv_df.Mar / overall_cbv_df.Feb -1)*100
display(overall_cbv_df)
revenue_total.sort_values(["Jan"], ascending=[False],inplace=True)
revenue_total.head()
revenue_total['cummul1'] = revenue_total["Jan"].cumsum()
revenue_total['cummul2'] = revenue_total["Feb"].cumsum()
revenue_total['cummul3'] = revenue_total["Mar"].cumsum()
top_95_revenue = revenue_total[revenue_total["cummul3"]<=95 ]
display(top_95_revenue)
ninety_five_perc_gmv = list(top_95_revenue.service.unique())
print(ninety_five_perc_gmv)
Q1_2016_df_can_com = Q1_2016_df[Q1_2016_df.order_status.isin(["Cancelled", "Completed"])]
Q1_2016_df_can_com = Q1_2016_df_can_com[Q1_2016_df_can_com.service.isin(ninety_five_perc_gmv)]
Q1_2016_df_pivot = Q1_2016_df_can_com.pivot_table(index='service', columns=['order_status','Month' ], values='num_orders', aggfunc= 'sum')
Q1_2016_df_pivot.fillna(0, inplace = True)
multi_tuples =[
('Cancelled', 'Jan'),
('Cancelled', 'Feb'),
('Cancelled', 'Mar'),
('Completed', 'Jan'),
('Completed', 'Feb'),
('Completed', 'Mar')]
multi_cols = pd.MultiIndex.from_tuples(multi_tuples, names=['order_status', 'Month'])
Q1_2016_df_pivot = pd.DataFrame(Q1_2016_df_pivot, columns=multi_cols)
display(Q1_2016_df_pivot.columns)
display(Q1_2016_df_pivot.head(3))
Q1_2016_df_pivot.columns = ['_'.join(col) for col in Q1_2016_df_pivot.columns.values]
display(Q1_2016_df_pivot)
Q1_2016_df_pivot["jan_total"] = Q1_2016_df_pivot.Cancelled_Jan + Q1_2016_df_pivot.Completed_Jan
Q1_2016_df_pivot["feb_total"] = Q1_2016_df_pivot.Cancelled_Feb + Q1_2016_df_pivot.Completed_Feb
Q1_2016_df_pivot["mar_total"] = Q1_2016_df_pivot.Cancelled_Mar + Q1_2016_df_pivot.Completed_Mar
Q1_2016_df_pivot[ "Cancelled_Jan_ratio" ] =Q1_2016_df_pivot.Cancelled_Jan/Q1_2016_df_pivot.jan_total
Q1_2016_df_pivot[ "Cancelled_Feb_ratio" ]=Q1_2016_df_pivot.Cancelled_Feb/Q1_2016_df_pivot.feb_total
Q1_2016_df_pivot[ "Cancelled_Mar_ratio" ]=Q1_2016_df_pivot.Cancelled_Mar/Q1_2016_df_pivot.mar_total
Q1_2016_df_pivot[ "Completed_Jan_ratio" ]=Q1_2016_df_pivot.Completed_Jan/Q1_2016_df_pivot.jan_total
Q1_2016_df_pivot[ "Completed_Feb_ratio" ]=Q1_2016_df_pivot.Completed_Feb/Q1_2016_df_pivot.feb_total
Q1_2016_df_pivot[ "Completed_Mar_ratio" ] =Q1_2016_df_pivot.Completed_Mar/Q1_2016_df_pivot.mar_total
Q1_2016_df_pivot_1 = Q1_2016_df_pivot[["Cancelled_Jan_ratio"
,"Cancelled_Feb_ratio"
,"Cancelled_Mar_ratio"
,"Completed_Jan_ratio"
,"Completed_Feb_ratio"
,"Completed_Mar_ratio"]]
Q1_2016_df_pivot_1
## column-wise check: has the cancellation share increased?
perc_of_cols_orders = pd.DataFrame()
for cols in Q1_2016_df_pivot.columns:
    perc_of_cols_orders[cols] = (Q1_2016_df_pivot[cols]/Q1_2016_df_pivot[cols].sum())*100
perc_of_cols_orders
perc_of_cols_orders.T.plot(kind='bar', stacked=True)
The business team has a budget of 40 Billion for Q2 and has set growth targets for each service. For each service, the cost of an incremental 100 rides and the maximum growth target for Q2 are given below. For Go-Box, getting 100 more bookings costs 40M, and the maximum growth target in Q2 is 7%.
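The cost arithmetic for a single service can be sketched as follows. The 40M-per-100-bookings cost and the 7% cap come from the brief above; the Q1 order count is an assumed number, for illustration only:

```python
cost_per_100_bookings = 40_000_000   # Go-Box: 40M per 100 incremental bookings (from the brief)
max_growth = 0.07                    # 7% maximum Q2 growth target (from the brief)
q1_completed_orders = 100_000        # assumed Q1 completed orders, illustration only

incremental_orders = q1_completed_orders * max_growth            # 7,000 extra bookings
cost_of_max_growth = incremental_orders / 100 * cost_per_100_bookings
print(f"{cost_of_max_growth:,.0f}")  # 2,800,000,000 -> hitting the 7% cap costs 2.8B
```

The same per-service arithmetic is vectorized over the budget table in the code below.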
Import budget data and use sales data from the above analysis.
budget_df =pd.read_csv('https://raw.githubusercontent.com/chrisdmell/Project_DataScience/working_branch/09_gojek/optimization_budge.csv')
print("Shape of the df")
display(budget_df.shape)
print("HEAD")
display(budget_df.head())
print("NULL CHECK")
display(budget_df.isnull().any().sum())
print("NULL CHECK")
display(budget_df.isnull().sum())
print("df INFO")
display(budget_df.info())
print("DESCRIBE")
display(budget_df.describe())
## convert to datetime
# convert order_status to string
##
time_to_pandas_time = ["date"]
for cols in time_to_pandas_time:
    sales_df[cols] = pd.to_datetime(sales_df[cols])
sales_df.dtypes
sales_df['Month'] = sales_df['date'].dt.month
sales_df.head()
sales_df['Month'].drop_duplicates()
sales_df_q1 = sales_df[sales_df['Month'] !=4]
## Assumptions
sales_df_q1 = sales_df_q1[sales_df_q1["order_status"] == "Completed"]
# Q1_2016_df_pivot = Q1_2016_df.pivot_table(index='service', columns=['order_status','Month' ], values='num_orders', aggfunc= 'sum')
sales_df_q1_pivot = sales_df_q1.pivot_table(index='service', columns=['order_status'], values='total_cbv', aggfunc= 'sum')
sales_df_q1_pivot_orders = sales_df_q1.pivot_table(index='service', columns=['order_status'], values='num_orders', aggfunc= 'sum')
sales_df_q1_pivot.reset_index(inplace = True)
sales_df_q1_pivot.columns = ["Service","Q1_revenue_completed"]
sales_df_q1_pivot
sales_df_q1_pivot_orders.reset_index(inplace = True)
sales_df_q1_pivot_orders.columns = ["Service","Q1_order_completed"]
optimization_Df = pd.merge(
sales_df_q1_pivot,
budget_df,
how="left",
on="Service",
)
optimization_Df = pd.merge(
optimization_Df,
sales_df_q1_pivot_orders,
how="left",
on="Service",
)
optimization_Df.columns = ["Service", "Q1_revenue_completed", "Cost_per_100_inc_booking", "max_q2_growth_rate","Q1_order_completed"]
optimization_Df.head(5)
Is it possible to achieve the maximum growth rate for all the services with an available budget of 40B?
## If max growth for every service is to be achieved, what is the budget needed, and what is the deficit?
optimization_Df["max_q2_growth_rate_upd"] = optimization_Df['max_q2_growth_rate'].str.extract(r'(\d+)').astype(int) ## extract int from string
optimization_Df["max_growth_q2_cbv"] = (optimization_Df.Q1_order_completed *(1+ optimization_Df.max_q2_growth_rate_upd/100)) ## Q2 max orders based on Q1 orders
optimization_Df["abs_inc_orders"] = optimization_Df.max_growth_q2_cbv-optimization_Df.Q1_order_completed ## Total increase in orders
optimization_Df["cost_of_max_inc_q2_order"] = optimization_Df.abs_inc_orders * optimization_Df.Cost_per_100_inc_booking /100 ## Total cost to get maximum growth for each service
display(optimization_Df)
display(budget_df[budget_df["Service"] == "Budget:"].reset_index())
budget_max = budget_df[budget_df["Service"] == "Budget:"].reset_index()
budget_max = budget_max.iloc[:,2:3].values[0][0]
print("Budget difference by")
display(budget_max-optimization_Df.cost_of_max_inc_q2_order.sum() )
## Therefore, the maximum for every service cannot be achieved
The answer is No. 247B (247,244,617,204) more budget is required to achieve the growth targets for all services.
Is it possible to achieve at least 10% of the maximum growth rate for all the services with an available budget of 40B?
## Then what is the budget needed, and what extra budget remains in hand?
optimization_Df["min_10_max_growth_q2_cbv"] = (optimization_Df.Q1_order_completed *(1+ optimization_Df.max_q2_growth_rate_upd/1000)) ## orders if at least 10% of max growth is achieved
optimization_Df["min_10_abs_inc_orders"] = optimization_Df.min_10_max_growth_q2_cbv-optimization_Df.Q1_order_completed ## increase in orders needed for 10% of max growth
optimization_Df["min_10_cost_of_max_inc_q2_order"] = optimization_Df.min_10_abs_inc_orders * optimization_Df.Cost_per_100_inc_booking /100 ## cost associated with that increase
display(budget_max-optimization_Df.min_10_cost_of_max_inc_q2_order.sum() ) ## Total budget remaining
display((budget_max-optimization_Df.min_10_cost_of_max_inc_q2_order.sum())/budget_max) ## Fraction of budget left unutilized
optimization_Df["perc_min_10_max_growth_q2_cbv"] = optimization_Df.max_q2_growth_rate_upd/1000 ## 10% of max growth: e.g. 7 -> 7% -> 0.7%, hence divide by 1000
optimization_Df["perc_max_growth_q2_cbv"] =( ( optimization_Df.max_q2_growth_rate_upd/100)) ## Max growth to be achieved
optimization_Df["q1_aov"] = optimization_Df.Q1_revenue_completed/optimization_Df.Q1_order_completed ## Q1 average order value
optimization_Df["order_profitability"] = 0.1 ## this is assumption that 10% will be profit
optimization_Df["a_orders_Q2"] = (optimization_Df.Q1_order_completed *(1+ optimization_Df.perc_min_10_max_growth_q2_cbv)) ## based on 10% growth, total new orders for qc
optimization_Df["a_abs_inc_orders"] = optimization_Df.a_orders_Q2-optimization_Df.Q1_order_completed
optimization_Df["a_Q2_costs"] = optimization_Df.Cost_per_100_inc_booking* optimization_Df.a_abs_inc_orders/100
##There is scope for improvement here, so This can be adjusted based on revenue or ranking from Q1
display(budget_max - optimization_Df.a_Q2_costs.sum())
optimization_Df
The answer is Yes: this can be achieved with only 28% of the available 40B budget. But underutilizing the available budget is never an option, and no business leader would use only 28% of it.
So, the maximum growth across all services cannot be achieved, and achieving 10% of the maximum growth rate will lead to an underutilized budget. Hence the need here is to optimize spending such that:
Assumptions used here:
Pre-optimization data pipeline:
## Data prep for pulp optimization
perc_all_df = pd.DataFrame(data = list(range(1,optimization_Df.max_q2_growth_rate_upd.max()+1)), columns = ["growth_perc"])
## create a list of all percentage growth, from 1 to max to growth expected, this is to create simulation for optimization
display(perc_all_df.head(1))
optimization_Df_2 = optimization_Df.merge(perc_all_df, how = "cross") ## cross join with opti DF
## Filter, keeping all percentages up to the maximum for each service
## The minimum percentage kept is 1
optimization_Df_2["filter_flag"] = np.where(optimization_Df_2.max_q2_growth_rate_upd >= (optimization_Df_2.growth_perc),1,0)
optimization_Df_2["abs_profit"] = (optimization_Df_2.q1_aov)*(optimization_Df_2.order_profitability)
optimization_Df_3 = optimization_Df_2[optimization_Df_2["filter_flag"] == 1]
display(optimization_Df_3.head(1))
display(optimization_Df_3.columns)
## Filter columns needed
optimization_Df_4 = optimization_Df_3[[
'Service', ## services offered
'Cost_per_100_inc_booking', ## cost of additional 100 orders
'Q1_order_completed', ## to calculate q2 growth based on q1 orders
'perc_min_10_max_growth_q2_cbv', ## minimum growth percent need
'perc_max_growth_q2_cbv', ## max growth percent allowed
'abs_profit', ## profit per order
'growth_perc' ## to simulate growth percent across services
]]
display(optimization_Df_4.head(2))
optimization_Df_4["orders_Q2"] = (optimization_Df_4.Q1_order_completed *(1+ optimization_Df_4.growth_perc/100)) ## based on growth, total new orders for qc
optimization_Df_4["abs_inc_orders"] = optimization_Df_4.orders_Q2-optimization_Df_4.Q1_order_completed
optimization_Df_4["profit_Q2_cbv"] = optimization_Df_4.orders_Q2 * optimization_Df_4.abs_profit
optimization_Df_4["growth_perc"] = optimization_Df_4.growth_perc/100
optimization_Df_4["Q2_costs"] = optimization_Df_4.Cost_per_100_inc_booking* optimization_Df_4.abs_inc_orders/100
display(optimization_Df_4.head())
optimization_Df_5 = optimization_Df_4[[
'Service', ## services offered
'Q2_costs', ## cost total for the growth expected
'perc_min_10_max_growth_q2_cbv', ## minimum growth percent need
'perc_max_growth_q2_cbv', ## max growth percent allowed
'profit_Q2_cbv', ## total profit at the assumed order_profitability rate
'growth_perc' ## to simulate growth percent across services
]]
optimization_Df_5
display(optimization_Df_5.head(10))
display(optimization_Df_5.shape)
## Best optimization formulation for our case.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, LpStatus, value ## PuLP imports
prob = LpProblem("growth_maximize", LpMaximize) ## Initialize optimization problem - Maximization problem
optimization_Df_5.reset_index(inplace = True, drop = True)
markdowns = list(optimization_Df_5['growth_perc'].unique()) ## List of all growth percentages
cost_v = list(optimization_Df_5['Q2_costs']) ## List of all incremental cost to achieve the growth % needed
perc_min_10_max_growth_q2_cbv = list(optimization_Df_5['perc_min_10_max_growth_q2_cbv'])
growth_perc = list(optimization_Df_5['growth_perc'])
## lp variables
low = LpVariable.dicts("l_", perc_min_10_max_growth_q2_cbv, lowBound = 0, cat = "Continuous")
growth = LpVariable.dicts("g_", growth_perc, lowBound = 0, cat = "Continuous")
delta = LpVariable.dicts ("d", markdowns, 0, 1, LpBinary)
x = LpVariable.dicts ("x", range(0, len(optimization_Df_5)), 0, 1, LpBinary)
## objective function - Maximise profit, column name - profit_Q2_cbv
## Assign a binary variable to each row -
## each row in the table is assigned x_0, x_1, x_2, etc.
## This is later used to filter the optimal growth percent
prob += lpSum(x[i] * optimization_Df_5.loc[i, 'profit_Q2_cbv'] for i in range(0, len(optimization_Df_5)))
## one unique growth percentage for each service
## Constraint one
for i in optimization_Df_5['Service'].unique():
    prob += lpSum([x[idx] for idx in optimization_Df_5[optimization_Df_5['Service'] == i].index]) == 1
## Do not cross total budget
## Constraint two
prob += (lpSum(x[i] * optimization_Df_5.loc[i, 'Q2_costs'] for i in range(0, len(optimization_Df_5))) - budget_max) <= 0
## constraint: the minimum growth must be achieved
for i in range(0, len(optimization_Df_5)):
    prob += lpSum(x[i] * optimization_Df_5.loc[i, 'growth_perc']) >= lpSum(x[i] * optimization_Df_5.loc[i, 'perc_min_10_max_growth_q2_cbv'])
prob.writeLP('markdown_problem') ## Write the problem to an .lp file
prob.solve() ## Solve Problem
display(LpStatus[prob.status]) ## Problem status - Optimal, if problem solved successfully
display(value(prob.objective)) ## Objective: the maximized profit with the available budget - 98731060158.842 @ 10% profit per order
print(prob)
print(growth)
var_name = []
var_values = []
for variable in prob.variables():
    if 'x' in variable.name:
        var_name.append(variable.name)
        var_values.append(variable.varValue)
results = pd.DataFrame()
results['variable_name'] = var_name
results['variable_values'] = var_values
results['variable_name_1'] = results['variable_name'].apply(lambda x: x.split('_')[0])
results['variable_name_2'] = results['variable_name'].apply(lambda x: x.split('_')[1])
results['variable_name_2'] = results['variable_name_2'].astype(int)
results.sort_values(by='variable_name_2', inplace=True)
results.drop(columns=['variable_name_1', 'variable_name_2'], inplace=True)
results.reset_index(inplace=True)
results.drop(columns='index', axis=1, inplace=True)
# results.head()
optimization_Df_5['variable_name'] = results['variable_name'].copy()
optimization_Df_5['variable_values'] = results['variable_values'].copy()
optimization_Df_5['variable_values'] = optimization_Df_5['variable_values'].astype(int)
## with no budget constraint
optimization_Df_10 = optimization_Df_5[optimization_Df_5['variable_values'] == 1].reset_index()
optimization_Df_10["flag"] = np.where(optimization_Df_10.growth_perc >= optimization_Df_10.perc_min_10_max_growth_q2_cbv,1,0)
display(optimization_Df_10)
display(budget_max - optimization_Df_10.Q2_costs.sum())
display( optimization_Df_10.Q2_costs.sum())
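The formulation above picks exactly one growth level per service, maximizes profit, and stays within budget. The same selection logic can be checked on a toy instance; the services, costs, and profits below are made-up numbers, and the instance is small enough to solve by exhaustive search instead of PuLP:

```python
from itertools import product

# (cost, profit) options per hypothetical service; exactly one option must be chosen each
options = {
    "SERVICE-A": [(10, 30), (25, 45)],
    "SERVICE-B": [(15, 20), (30, 50)],
}
budget = 40

best_profit, best_pick = -1, None
for pick in product(*options.values()):            # one option per service
    cost = sum(c for c, _ in pick)
    profit = sum(p for _, p in pick)
    if cost <= budget and profit > best_profit:    # respect the budget cap
        best_profit, best_pick = profit, pick

print(best_profit, best_pick)  # 80 ((10, 30), (30, 50))
```

Brute force only works for tiny instances; the ILP solver handles the real table, where each service has up to dozens of candidate growth levels.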
sales_df =pd.read_csv('https://raw.githubusercontent.com/chrisdmell/Project_DataScience/working_branch/09_gojek/model_analytics__data.csv')
time_to_pandas_time = ["date"]
for cols in time_to_pandas_time:
    sales_df[cols] = pd.to_datetime(sales_df[cols])
sales_df['Month'] = sales_df['date'].dt.month
Q1_2016_df = sales_df[sales_df['Month'] != 900] ## no-op filter: keep all months (April will be the test period)
Q1_2016_df['Month'] = np.where(Q1_2016_df['Month'] == 1,"Jan",np.where(Q1_2016_df['Month'] == 2,"Feb",np.where(Q1_2016_df['Month'] == 3,"Mar","Apr")))
Q1_2016_df['test_control'] = np.where(Q1_2016_df['date'] <= "2016-03-30","train", "test")
display(Q1_2016_df.head(5))
display(Q1_2016_df.order_status.unique())
display(Q1_2016_df.service.unique())
display(Q1_2016_df.date.max())
display(Q1_2016_df.head())
display(Q1_2016_df.date.max())
Q1_2016_df_2 = Q1_2016_df[Q1_2016_df["date"] <= "2016-04-01"]
display(Q1_2016_df_2.date.max())
Q1_2016_df_2 = Q1_2016_df_2[Q1_2016_df_2["order_status"] == "Cancelled"]
Q1_2016_df_date_unique = Q1_2016_df_2[["date"]].drop_duplicates()
Q1_2016_df_date_service = Q1_2016_df_2[["service"]].drop_duplicates()
Q1_2016_df_CJ = Q1_2016_df_date_unique.merge(Q1_2016_df_date_service, how = "cross") ## cross join dates with services
display(Q1_2016_df_date_unique.head())
display(Q1_2016_df_date_unique.shape)
display(Q1_2016_df_date_unique.max())
display(Q1_2016_df_date_unique.min())
display(Q1_2016_df_2.shape)
Q1_2016_df_3 = Q1_2016_df_CJ.merge(Q1_2016_df_2, on=['date','service'], how='left', suffixes=('_x', '_y'))
display(Q1_2016_df_3.head())
display(Q1_2016_df_3.shape)
display(Q1_2016_df_CJ.shape)
Q1_2016_df_3["total_cbv"].fillna(0, inplace = True)
print("Null check ",Q1_2016_df_3.isnull().values.any())
nan_rows = Q1_2016_df_3[Q1_2016_df_3['total_cbv'].isnull()]
nan_rows
display(Q1_2016_df_3[Q1_2016_df_3.isnull().any(axis=1)])
Q1_2016_df_3["dayofweek"] = Q1_2016_df_3["date"].dt.dayofweek
Q1_2016_df_3["dayofmonth"] = Q1_2016_df_3["date"].dt.day
Q1_2016_df_3["Is_Weekend"] = Q1_2016_df_3["date"].dt.day_name().isin(['Saturday', 'Sunday'])
Q1_2016_df_3.head()
Q1_2016_df_4 = Q1_2016_df_3[Q1_2016_df_3["service"] != "GO-TIX"]
Q1_2016_df_5 = pd.get_dummies(Q1_2016_df_4, columns=["Month","dayofweek"])
display(Q1_2016_df_5.head())
import numpy as np
import pandas as pd
# from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_predict
Q1_2016_df_5.columns
all_columns = ['date', 'service', 'num_orders', 'order_status', 'total_cbv',
'test_control', 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']
model_variables = [ 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']
target_Variable = ["total_cbv"]
all_columns = ['service',
'test_control', 'dayofmonth', 'Is_Weekend', 'Month_Apr', 'Month_Feb',
'Month_Jan', 'Month_Mar', 'dayofweek_0', 'dayofweek_1', 'dayofweek_2',
'dayofweek_3', 'dayofweek_4', 'dayofweek_5', 'dayofweek_6']
model_1 = Q1_2016_df_5[Q1_2016_df_5["service"] =="GO-FOOD"]
test = model_1[model_1["test_control"]!="train"]
train = model_1[model_1["test_control"]=="train"]
X = train[model_variables]
y = train[target_Variable]
train_predict = model_1[model_1["test_control"]=="train"]
x_ = X[model_variables]
sc = StandardScaler()
X_train = sc.fit_transform(X)
X_test = sc.transform(x_)
# define a custom function which returns a single metric score
def NMAPE(y_true, y_pred):
    return 100 * (1 - np.mean(np.abs((y_true - y_pred) / y_true)))
# make a scorer from the custom function
nmape_scorer = make_scorer(NMAPE)
# prepare the cross-validation procedure
cv = KFold(n_splits=3, random_state=1, shuffle=True)
# create model
model = LinearRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring=nmape_scorer, cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
y_pred = cross_val_predict(model, X, y, cv=cv)
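As a sanity check on the scorer, the accuracy-style metric (100 * (1 - MAPE), one common convention for a normalized-MAPE score) can be computed by hand on toy arrays:

```python
import numpy as np

y_true = np.array([100.0, 200.0])
y_pred = np.array([90.0, 210.0])

mape = np.mean(np.abs((y_true - y_pred) / y_true))  # (0.10 + 0.05) / 2 = 0.075
accuracy = 100 * (1 - mape)
print(accuracy)  # 92.5
```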
def go_model(Q1_2016_df_5, go_service, model_variables, target_Variable):
    """
    Q1_2016_df_5     : input dataframe
    go_service       : service to model
    model_variables  : feature columns
    target_Variable  : target column
    """
    model_1 = Q1_2016_df_5[Q1_2016_df_5["service"] == go_service]
    test = model_1[model_1["test_control"] != "train"]
    train = model_1[model_1["test_control"] == "train"]
    X = train[model_variables]
    y = train[target_Variable]
    train_predict = model_1[model_1["test_control"] == "train"]
    x_ = X[model_variables]
    X_train = sc.fit_transform(X)
    X_test = sc.transform(x_)
    # prepare the cross-validation procedure
    cv = KFold(n_splits=3, random_state=1, shuffle=True)
    # create model
    model = LinearRegression()
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=nmape_scorer, cv=cv, n_jobs=-1)
    # report performance
    print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
    y_pred = cross_val_predict(model, X, y, cv=cv)
    return y_pred, mean(scores), std(scores)
a,b,c = go_model(Q1_2016_df_5, "GO-FOOD",model_variables,target_Variable)
b
Question 3 is an open-ended question, and readers are encouraged to solve it on their own. Some hypotheses to explore are:
Case studies, when done right following the steps given above, have a positive impact on the business. Recruiters aren't looking for answers but for the approach to those answers: the structure followed, the reasoning used, and the business and practical knowledge applied. This article provides an easy-to-follow framework for data analysts, using a real business case study as an example.
Key Takeaways:
Good luck! Here's my LinkedIn profile if you want to connect with me or want to help improve the article. Feel free to ping me on Topmate/Mentro; you can drop me a message with your query. I'll be happy to connect. Check out my other articles on data science and analytics here.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.