Data scientists, engineers and analysts are in huge demand, and firms large and small are hiring left, right and centre for various roles. Entry to mid-level roles include quite a lot of individual contributor positions, where one person owns a project or a solution end to end. Business case study assignments can be used as a proxy to understand a candidate's ability to work as an individual contributor, and knowing the nuances of solving a case study assignment effectively can surely help in landing multiple job offers.
Data science hiring involves case studies, and they are an effective way to judge a candidate's fit for the role. The case study is mostly treated as an elimination round, and about 80% of candidates are filtered out. Business analytics case studies usually evaluate coding knowledge, problem-solving ability, presentation skills and overall role fitness. Some case studies focus purely on building models, while others focus on insights, final outcomes and recommendations, and many otherwise eligible candidates fail to clear this round because they lack a clear understanding of what needs to be presented.
This article sheds some light on how to go about solving business analytics or data science case studies.
About the Data – The client is a leading fashion retailer in Australia. Company A runs a display advertising campaign for this brand, where it shows ads to users leading them to make a purchase on the brand’s website. The given dataset is the Sales data for all users who made a purchase online in the first half of October ’17. Link to download data.
Broadly, below are the steps to be followed: read and clean the data, engineer features (order IDs, quantities, coupon and time-of-day tags), build customer segments, answer EDA questions, run the apriori algorithm for product affinity, and present insights and recommendations.
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import time
import re
import numpy as np

file_name = "Fashion - Assignment Data .csv"
fashin_df = pd.read_csv(file_name, encoding="cp1252", parse_dates=True,
                        dtype={'Product_Name': str, 'Product_ID': str, "Number_of_Products": str})
display(fashin_df.head())
display(fashin_df.columns)
Data cleaning takes the most effort during actual work but is the least assessed skill in interviews. As there is no standard process, this step can be used to highlight programming as well as data-processing skills. Below are the operations that need to be done before segmentation and before using the apriori algorithm.
# Inspect rows where the product name is undefined
fashin_df[fashin_df["Product_Name"] == "undefined"]

# Parse the timestamp and derive date, birthday and time-of-day columns
fashin_df.timestamp = pd.to_datetime(fashin_df.timestamp)
fashin_df['dates'] = fashin_df['timestamp'].dt.date
fashin_df["User_Birthday_new"] = pd.to_datetime(fashin_df["User_Birthday"], errors='coerce')
fashin_df['Time'] = [datetime.datetime.time(d) for d in fashin_df['timestamp']]
types_dict = {'Product_Name': str, 'Product_ID': str, "Number_of_Products": str}
for col, col_type in types_dict.items():
    fashin_df[col] = fashin_df[col].astype(col_type)
fashin_df["Order_ID"] = fashin_df['timestamp'].astype(str).str.cat(fashin_df[['user ID', 'ip_address']].astype(str), sep=' - ')
fashin_df["Revenue"] = pd.to_numeric(fashin_df["Revenue"], errors='coerce') fashin_df["Revenue"] = fashin_df["Revenue"].replace(np.nan, 0)
Multiple products from the same order are stored as a single comma-separated list, which is not an ideal input for the apriori algorithm. Product pairs from the same transaction are one way to represent the data for the apriori algorithm; the reshaping needed for this is done below.
## Get a quantity column (could be wrapped in a function)
## Split the comma-separated lists into one row per product per order
df_c = pd.concat([fashin_df["Order_ID"], fashin_df["Number_of_Products"].str.split(",", expand=True)], axis=1)
df_f = df_c.melt(id_vars=["Order_ID"], var_name="Product_Split", value_name="Number_of_Products")
df_f.head()

df_c = pd.concat([fashin_df["Order_ID"], fashin_df["Product_Name"].str.split(",", expand=True)], axis=1)
df_e = df_c.melt(id_vars=["Order_ID"], var_name="Product_Split", value_name="Product_Name")
df_e.head()

df_c = pd.concat([fashin_df["Order_ID"], fashin_df["Product_ID"].str.split(",", expand=True)], axis=1)
df_d = df_c.melt(id_vars=["Order_ID"], var_name="Product_Split", value_name="Product_ID")
df_d.head()

## Join the three long tables back together on order and product position
a1 = pd.merge(df_f, df_e, on=['Order_ID', "Product_Split"], how='left')
a2 = pd.merge(a1, df_d, on=['Order_ID', "Product_Split"], how='left')
a2["Number_of_Products"] = pd.to_numeric(a2["Number_of_Products"], errors='coerce')
a2["Number_of_Products"] = a2["Number_of_Products"].replace(np.nan, 0)
display(sum(a2["Number_of_Products"]))
display(a2.head())
display(a2.shape)
# Total quantity per order, merged back onto the main dataframe
Quantity_to_join = a2.groupby(["Order_ID"]).agg({"Number_of_Products": sum}).reset_index()
Quantity_to_join.rename({"Number_of_Products": "Quantity"}, axis='columns', inplace=True)
fashin_df_002 = pd.merge(fashin_df, Quantity_to_join, on=['Order_ID'], how='left')
# Order/product-level table, keeping only rows with a valid quantity
a1 = pd.merge(df_f, df_e, on=['Order_ID', "Product_Split"], how='left')
a2 = pd.merge(a1, df_d, on=['Order_ID', "Product_Split"], how='left')
display(a2.head())
display(a2.shape)
a3 = a2[a2.Number_of_Products.notnull()]
a3.to_csv("Product_Details_002.csv")
This order/product-level table is saved to a CSV, and the apriori algorithm is run on it to get product affinity. The final output is at the order_id, product level. An order with a single product isn't useful to the apriori algorithm and can be ignored.
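The assignment itself runs apriori through SQL (linked further below). Purely as an illustration, here is a minimal Python sketch, assuming the mlxtend library is available, that computes product affinity from the same order/product table; the thresholds used are arbitrary placeholders, not values from the assignment.

# Illustrative only: assumes mlxtend is installed and Product_Details_002.csv
# has the Order_ID and Product_Name columns built above
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

products = pd.read_csv("Product_Details_002.csv")

# One-hot encode: one row per order, one boolean column per product
basket = (products.groupby(["Order_ID", "Product_Name"]).size()
                  .unstack(fill_value=0)
                  .gt(0))

# Orders with a single product carry no pair information, so drop them
basket = basket[basket.sum(axis=1) > 1]

# Frequent itemsets and association rules; thresholds are illustrative only
itemsets = apriori(basket, min_support=0.001, use_colnames=True)
rules = association_rules(itemsets, metric="lift", min_threshold=1.0)
print(rules.sort_values("lift", ascending=False).head())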
fashin_df_002["Coupon_Tag"] = np.where(fashin_df_002.Order_Coupon_Code.isnull() == True, "No Coupon","Coupon")
fashin_df_002["Hour"] = fashin_df_002['timestamp'].apply(lambda x: x.hour) fashin_df_002["Hour_Bracket"] = np.where((fashin_df_002.Hour >= 0) &(fashin_df_002.Hour <= 7), "Mid - Night - Morning", np.where((fashin_df_002.Hour > 7) &(fashin_df_002.Hour <= 13),"First Half", np.where((fashin_df_002.Hour > 13) &(fashin_df_002.Hour <= 17),"Second Half","Night")))
Top-down approach: A top-down approach uses existing business knowledge to build segments. For example, customers between the ages of 18 and 25 who have more than 5 transactions with an average quantity of 2. The approach is neat, defines each segment clearly, has a clear objective and is easy to implement.
Bottom-up approach: A bottom-up approach identifies users with similar attributes and groups them into segments. Segments created this way are not directly actionable; they first need to be evaluated on their metrics and then identified and named.
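For contrast, a minimal bottom-up sketch, assuming scikit-learn is available; the per-user features below (revenue, orders, quantity) are illustrative choices, not part of the original assignment.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user features built from the cleaned dataframe
user_features = fashin_df_002.groupby("user ID").agg(
    total_revenue=("Revenue", "sum"),
    orders=("Order_ID", "nunique"),
    quantity=("Quantity", "sum"),
).fillna(0)

# Scale features so no single metric dominates the distance calculation
scaled = StandardScaler().fit_transform(user_features)

# The choice of 4 clusters is arbitrary; in practice it would be tuned and the
# resulting clusters profiled, named and checked for actionability
user_features["segment"] = KMeans(n_clusters=4, random_state=0).fit_predict(scaled)
print(user_features.groupby("segment").mean())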
Given the time constraint of getting these assignments done as quickly as possible, the top-down approach has been chosen here.
fashin_df_002["Segment_Tag"] = np.where(fashin_df_002.Revenue == 0, "Outlier", np.where(fashin_df_002.Multi_Txns_cust.isnull() == False, "Multi_Txns", np.where(fashin_df_002.Coupon_Tag == "Coupon", "Coupon_Cust", np.where(fashin_df_002.Hour_Bracket == "First Half", "First Half", np.where(fashin_df_002.Hour_Bracket== "Mid - Night - Morning", "Mid - Night - Morning", np.where((fashin_df_002.Hour_Bracket == "Night") or (fashin_df_002.Hour_Bracket=="Second Half"), "Second Half", "Outlier") )))))
Which day of the week has the highest revenue, quantity and orders?
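The dayofweek column is not created in the snippets above; a minimal sketch, assuming it is simply derived from the parsed timestamp:

# Assumed derivation: day-of-week name taken from the parsed timestamp
fashin_df_002["dayofweek"] = fashin_df_002["timestamp"].dt.day_name()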
fashin_df_002.groupby("dayofweek").agg({"Quantity": sum,
                                        "Revenue": sum,
                                        "Order_ID": pd.Series.nunique}).reset_index()
Is coupon redemption distributed equally during the day?
Hour brackets: 0-7 Midnight-Morning, 8-13 First Half, 14-17 Second Half, >17 Night.
all_single_txns_full_df = fashin_df_002[fashin_df_002.Multi_Txns_cust.isnull() == True]
all_single_txns_full_df.groupby(["Coupon_Tag", "Hour_Bracket"]).agg({"user ID": pd.Series.nunique})
A few other EDA questions that can be answered for data science roles are:
Use Product_Details_002.csv as the input data and convert the apriori formulas into SQL code to get product affinity pairs.
SQL Apriori code link.
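As a rough stand-in for those formulas, below is a minimal pandas sketch of the usual pair metrics (support, confidence and lift), computed on the order/product table built earlier; the column names come from the preparation step above, while everything else is illustrative.

# Distinct (order, product) combinations and the total number of orders
pairs_input = pd.read_csv("Product_Details_002.csv")
order_products = pairs_input[["Order_ID", "Product_Name"]].drop_duplicates()
n_orders = order_products["Order_ID"].nunique()

# All product pairs that co-occur within the same order (A != B)
pairs = order_products.merge(order_products, on="Order_ID", suffixes=("_A", "_B"))
pairs = pairs[pairs["Product_Name_A"] != pairs["Product_Name_B"]]

pair_counts = pairs.groupby(["Product_Name_A", "Product_Name_B"]).size().rename("pair_orders").reset_index()
item_counts = order_products.groupby("Product_Name")["Order_ID"].nunique()

# Support, confidence and lift for each A -> B pair
pair_counts["support"] = pair_counts["pair_orders"] / n_orders
pair_counts["confidence"] = pair_counts["pair_orders"] / pair_counts["Product_Name_A"].map(item_counts)
pair_counts["lift"] = pair_counts["confidence"] / (pair_counts["Product_Name_B"].map(item_counts) / n_orders)
print(pair_counts.sort_values("lift", ascending=False).head())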
PPTs should contain the below sub-sections:
Until HR calls you for interviews for data science roles, below are a few things that might help.
While a case study for a data science role can be a daunting task, it provides a clear picture of the role on offer. It also shows that the firm is ready to put in significant effort to select the best candidates. In summary:
Good luck! Here's my LinkedIn profile if you want to connect with me or want to help improve the article. Feel free to ping me on Topmate/Mentro; you can drop me a message with your query, and I'll be happy to connect. Check out my other articles on data science and analytics here.