In the last few months, we have started conducting data science hackathons. These hackathons are contests with a well-defined data problem which has to be solved in a short time frame. They typically last anywhere between 2 and 7 days.
If month-long competitions on Kaggle are like marathons, then these hackathons are the shorter format of the game, a 100-metre sprint. They are high-energy events: data scientists bring in a lot of energy, the leaderboard changes almost every hour, and speed in solving the data science problem matters a lot more than in Kaggle competitions.
One of the best tips I can give to data scientists participating in these hackathons (or even in longer competitions) is to quickly build the first solution and submit it. The first few submissions should be really quick. I have created modules in Python and R which take in tabular data and the name of the target variable and BOOM! I have my first model in less than 10 minutes (assuming your data has more than 100,000 observations; for smaller data sets, this can be even faster). The reason for submitting this super-fast solution is to create a benchmark for yourself which you then need to improve upon. I will talk about my methodology in this article.
To understand the strategic areas, let’s first break down the process of predictive analysis into its essential components. Broadly, it can be divided into 4 parts, and every component demands a certain amount of time to execute. Let’s evaluate these aspects (with time taken):
Note: The percentages are based on a sample of 40 competitions I have participated in previously (rounded off).
Now we know where we need to cut down time. Let’s go step by step through the process (with time estimates):
1. Descriptive Analysis: When I started my career in analytics, we used to primarily build models based on Logistic Regression and Decision Trees. Most of the algorithms we used were greedy algorithms, which can subset the number of features I need to focus on.
With advanced machine learning tools coming into the race, the time taken to perform this task can be significantly reduced. For your initial analysis, you probably do not need to do any kind of feature engineering. Hence, the time you need for descriptive analysis is restricted to finding the missing values and the big features which are directly visible. In my methodology, you will need 2 minutes to complete this step (assuming a data set with 100,000 observations).
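As a rough illustration, a descriptive pass of this kind in R (assuming the combined data frame is called complete, as in the skeleton later in this article) can be as short as:

str(complete)             # variable types at a glance
summary(complete)         # quick distributions and obvious outliers
colSums(is.na(complete))  # count of missing values per column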
2. Data Treatment: Since this is considered the most time-consuming step, we need to find smart techniques to speed up this phase. Here are two simple tricks which you can implement: create flags for missing values, and impute missing values with a simple rule such as the mean for numeric variables or a single placeholder level for categorical variables (both are shown in the code skeleton below).
With such simple methods of data treatment, you can reduce the time spent treating data to 3-4 minutes.
3. Data Modelling: I have found GBM to be extremely effective for cases with around 100,000 observations. In case of bigger data, you can consider running a Random Forest. This step takes the maximum amount of time (~4-5 minutes).
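The skeleton later in this article uses a logistic regression to keep things simple; if you prefer the GBM route, a minimal sketch with the gbm package (hypothetical, untuned parameters; trainData and testData stand for the imputed train and test subsets) could look like this:

library(gbm)
set.seed(120)
# Bernoulli GBM on a 0/1 target named Disbursed (as used later in this article)
gbm_model <- gbm(Disbursed ~ . - ID - train, data = trainData,
                 distribution = "bernoulli", n.trees = 100,
                 interaction.depth = 3, shrinkage = 0.1)
gbm_score <- predict(gbm_model, newdata = testData, n.trees = 100, type = "response")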
4. Estimation of Performance: I find k-fold cross-validation with k = 7 highly effective for taking my initial bet. This finally takes 1-2 minutes to execute and document.
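As a sketch, here is a manual 7-fold estimate of AUC, assuming the imputed training rows sit in trainData with a 0/1 Disbursed target and that the pROC package is installed:

library(pROC)
set.seed(120)
k <- 7
folds <- sample(rep(1:k, length.out = nrow(trainData)))   # random fold assignment
cv_auc <- sapply(1:k, function(i) {
  fit  <- glm(Disbursed ~ . - ID - train, data = trainData[folds != i, ], family = "binomial")
  pred <- predict(fit, newdata = trainData[folds == i, ], type = "response")
  as.numeric(auc(trainData$Disbursed[folds == i], pred))
})
mean(cv_auc)   # cross-validated estimate of performance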
The reason for building this model is not to win the competition, but to establish a benchmark for ourselves. Let me take a deeper dive into my algorithm. I have also included a few snippets of my code in this article.
I will not include my entire function, so as to leave you space to innovate. Here is a skeleton of my algorithm (in R):
Step 1: Append both the train and test data sets together
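A minimal sketch of this step, assuming the raw files are named train.csv and test.csv (hypothetical file names) and that the test file does not carry the Disbursed target:

train <- read.csv("train.csv", stringsAsFactors = TRUE)
test  <- read.csv("test.csv", stringsAsFactors = TRUE)
train$train <- 1                 # flag so the two sets can be split apart later
test$train  <- 0
test$Disbursed <- NA             # target column is absent in the test set
complete <- rbind(train, test)   # rbind matches data frame columns by name
write.csv(complete, "complete_data.csv", row.names = FALSE)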
Step 2: Read the data set into memory
setwd("C:\\Users\\Tavish\\Desktop\\Kagg\\AV")
complete <- read.csv("complete_data.csv", stringsAsFactors = TRUE)
Step 3: View the column names/summary of the dataset
colnames(complete)
[1] "ID" "Gender" "City" "Monthly_Income" "Disbursed" "train"
Step 4: Identify the a) numeric variables b) ID variables c) factor variables d) target variable
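One quick way to do this in R, assuming (as the column summary above suggests) that ID is the ID variable and Disbursed is the target:

id_var       <- "ID"
target_var   <- "Disbursed"
numeric_vars <- setdiff(names(complete)[sapply(complete, is.numeric)],
                        c(id_var, target_var, "train"))
factor_vars  <- names(complete)[sapply(complete, is.factor)]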
Step 5 : Create flags for missing values
missing_val_var <- function(data, variable, new_var_name) {
  # Create a 1/0 flag column for missing values of the given variable
  data[[new_var_name]] <- ifelse(is.na(data[[variable]]), 1, 0)
  return(data)
}
Step 6 : Impute Numeric Missing values
numeric_impute <- function(data, variable) {
  # Replace missing values of a numeric variable with its mean
  mean1 <- mean(data[[variable]], na.rm = TRUE)
  data[[variable]] <- ifelse(is.na(data[[variable]]), mean1, data[[variable]])
  return(data)
}
Similarly, impute the categorical variables so that all missing values are coded as a single value, say "Null".
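A minimal sketch of that categorical imputation, written in the same style as the numeric helper above:

categorical_impute <- function(data, variable) {
  # Code every missing value of a factor variable as the single level "Null"
  values <- as.character(data[[variable]])
  values[is.na(values)] <- "Null"
  data[[variable]] <- factor(values)
  return(data)
}
complete <- categorical_impute(complete, "City")   # example usage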
Step 7: Pass the imputed variables into the modelling process
#Challenge: Try to Integrate a K-fold methodology in this step
create_model <- function(trainData, target) {
  set.seed(120)
  # Build the formula from the target name, e.g. "Disbursed ~ ."
  myglm <- glm(as.formula(paste(target, "~ .")), data = trainData, family = "binomial")
  return(myglm)
}
Step 8 : Make predictions
# testData here would be the test portion of complete (rows where train == 0)
score <- predict(myglm, newdata = testData, type = "response")
score_train <- predict(myglm, newdata = complete, type = "response")
Step 9 : Check performance
library(pROC)   # assuming auc() comes from the pROC package (Metrics::auc works similarly)
auc(complete$Disbursed, score_train)
And Submit!
Hopefully, this article has given you enough motivation to make your own 10-minute scoring code. Most of the masters on Kaggle and the best scientists in our hackathons have these codes ready and fire their first submission before doing a detailed analysis. Once they have an estimate of the benchmark, they start improving on it. Share your complete codes in the comment box below.
Did you find this article helpful? Please share your opinions / thoughts in the comments section below.