Recently, one of my friends and I were solving a practice problem. After 8 hours of hard work and coding, my friend Shubham got a score of 1153 (position 219). Here is his position on the leaderboard:
On the other hand, I was able to achieve this by writing only 8 lines of code:
How did I get there?
What if I tell you there exists a library called MLBox, which does most of the heavy lifting in machine learning for you in minimal lines of code? From missing value imputation to feature engineering using state-of-the-art Entity Embeddings for categorical features, MLBox has it all.
In these 8 lines of code using MLBox, I have also performed hyperparameter optimisation and tested around 50 models with blazing speed – isn’t that awesome? You will be able to use this library by the end of this article.
According to the developer of MLBox,
“MLBox is a powerful Automated Machine Learning Python library. It provides the following features:
MLBox focuses on the following three points in particular, in comparison to other libraries:
We will study each of these in some detail below to get an idea of what they do.
MLBox is currently available for Linux only. It was primarily developed for Python 2 and has only very recently been extended to Python 3. We will install the latest 3.0-dev version of MLBox. Follow the steps below to install MLBox on your Linux system.
conda create -n Python3 python=3 anaconda  # Python3 is the name of the environment we just created
source activate Python3
curl -OL https://github.com/AxeldeRomblay/mlbox/tarball/3.0-dev
sudo tar -xzvf 3.0-dev
cd AxeldeRomblay-MLBox-2befaee
cd python-package
cd dist
pip install *.whl
pip install lightgbm
pip install xgboost
pip install hyperopt
pip install tensorflow
pip install keras
pip install matplotlib
pip install ipyparallel
pip install pandas
python
import mlbox
If mlbox imports without any error, you have successfully installed the library. The additional libraries installed above (lightgbm, xgboost, hyperopt, etc.) are dependencies that MLBox uses under the hood.
Note – This library is currently under very active development, so something that works today may break tomorrow. For example, the library worked well with Python 2.7 until two days ago and not so well with Python 3.6, but at the time of writing I am experiencing issues with the 2.7 version while the Python 3 version is working fine. Please feel free to open issues on the GitHub repository or ask for help in the comments below.
The entire pipeline of MLBox looks like this:
The pipeline is divided into 3 sections/sub-packages.
We will study these 3 sub-packages in detail below.
All the functionalities inside this sub-package can be used via the command-
from mlbox.preprocessing import *
This sub-package provides two major pieces of functionality: reading (and cleaning) the input files, and removing drifting variables.
This package supports reading a wide variety of file formats (csv, Excel, hdf5, JSON, etc.), but in this article we will stick to the most common, the “.csv” file format. Follow the steps below to read a csv file.
Step1: Create an object of the Reader class with the separator as a parameter. “,” is the separator in the case of a csv file.
s=","
r=Reader(s) #initialising the object of Reader Class
Step2: Make a list of the train and test file paths and also identify the target variable name.
path=["path of the train csv file","path of the test csv file "]
target_name="name of the target variable in the train file"
Step3: Perform the cleaning operation and create cleaned train and test sets.
data=r.train_test_split(path,target_name)
The cleaning steps performed here are-
-deleting unnamed columns
-removing duplicates
-extracting month, year and day of the week from a Date column
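These steps map roughly onto plain pandas operations. The sketch below is only an illustration of what the Reader does conceptually, not MLBox's actual implementation, and the column names in the example are made up.

```python
import pandas as pd

def basic_clean(df):
    # deleting unnamed columns (pandas names them "Unnamed: 0", "Unnamed: 1", ...)
    df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
    # removing duplicates
    df = df.drop_duplicates()
    # extracting month, year and day of the week from a Date column
    if "Date" in df.columns:
        dates = pd.to_datetime(df["Date"])
        df = df.assign(Date_year=dates.dt.year,
                       Date_month=dates.dt.month,
                       Date_dayofweek=dates.dt.dayofweek).drop(columns="Date")
    return df

df = pd.DataFrame({"Unnamed: 0": [0, 1, 2],
                   "Date": ["2017-07-01", "2017-07-01", "2017-07-02"],
                   "sales": [10, 10, 12]})
cleaned = basic_clean(df)  # 2 rows left; Date replaced by year/month/dayofweek
```

Note that the duplicate row only becomes visible after the unnamed index column is dropped, which is why the steps run in that order.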
Drifting variables are explained in a later section. To remove them, follow the steps below.
Step1: Create an object of class Drift_thresholder
dft=Drift_thresholder()
Step2: Use the fit_transform method of the created object to remove the drift variables.
data=dft.fit_transform(data)
All the functionalities inside this sub-package can be used via the command-
from mlbox.optimisation import *
This is the section where the library scores the most points. The hyper-parameter optimisation in MLBox uses the hyperopt library, which is very fast, and you can optimise almost anything with it, from the choice of missing value imputation method to the depth of an XGBoost model. The library creates a high-dimensional space of the parameters to be optimised and chooses the combination of parameters that gives the best validation score.
Below are the four broad components that MLBox optimises, with the parameters that can be tuned for each listed after the hyphen.
Missing Values Encoder (ne) – numerical_strategy (when the column to be imputed is continuous, e.g. mean, median), categorical_strategy (when the column to be imputed is categorical, e.g. NaN values)
Categorical Values Encoder (ce) – strategy (method of encoding categorical variables, e.g. label_encoding, dummification, random_projection, entity_embedding)
Feature Selector (fs) – strategy (different methods for feature selection, e.g. l1, variance, rf_feature_importance), threshold (the percentage of features to be discarded)
Estimator (est) – strategy (different algorithms that can be used as estimators, e.g. LightGBM, xgboost), **params (parameters specific to the algorithm being used, e.g. max_depth, n_estimators)
Let us take an example and create a hyperparameter space to be optimised. Let us state all the parameters that I want to optimise:
Algorithm to be used- LightGBM
LightGBM max_depth-[3,5,7,9]
LightGBM n_estimators-[250,500,700,1000]
Feature selection-[variance, l1, random forest feature importance]
Missing values imputation – numerical(mean,median),categorical(NAN values)
categorical values encoder- label encoding, entity embedding and random projection
Let us now create our hyper-parameter space. Remember that the hyper-parameter space is a dictionary of key-value pairs, where each value is itself a dictionary of the form
{"search": strategy, "space": list}, where strategy is either "choice" or "uniform" and list is the list of values.
space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.NaN]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}
Now we will see the steps to choose the best combination from the above space using the following steps:
Step1: Create an object of class Optimiser, which takes the parameters ‘scoring’ and ‘n_folds’. Scoring is the metric against which we want to optimise our hyper-parameter space, and n_folds is the number of cross-validation folds.
Scoring values for classification: "accuracy", "roc_auc", "f1", "log_loss", "precision", "recall"
Scoring values for regression: "mean_absolute_error", "mean_squared_error", "median_absolute_error", "r2"
opt=Optimiser(scoring="accuracy",n_folds=5)
Step2: Use the optimise method of the object created above, which takes the hyper-parameter space, the dictionary created by train_test_split and the number of iterations as parameters. This function returns the best hyper-parameters from the hyper-parameter space.
best=opt.optimise(space,data,40)
All the functionalities inside this sub-package can be used via the command below.
from mlbox.prediction import *
This sub-package predicts on the test dataset using the best hyper-parameters calculated using the optimisation sub-package. To predict on the test dataset, go through the following steps.
Step1: Create an object of class Predictor
pred=Predictor()
Step2: Use the fit_predict method of the object created above, which takes the set of best hyper-parameters and the dictionary created through train_test_split as parameters.
pred.fit_predict(best,data)
The above method saves the feature importances, the drift variable coefficients and the final predictions into a separate folder named ‘save’.
We are now going to build a machine learning model in just 8 lines of code, with hyperparameter optimisation included. We are going to solve the Big Mart Sales problem. Download the train and test files and keep them in a single folder. Using the MLBox library, we are going to submit our first prediction without even having to look at the data. You can find the code below to make the prediction for this problem.
# coding: utf-8
# importing the required libraries
import numpy as np
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *
# reading and cleaning the train and test files
df=Reader(sep=",").train_test_split(['/home/nss/Downloads/mlbox_blog/train.csv',
'/home/nss/Downloads/mlbox_blog/test.csv'],'Item_Outlet_Sales')
# removing the drift variables
df=Drift_thresholder().fit_transform(df)
# setting the hyperparameter space
space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.NaN]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}
# calculating the best hyper-parameter
best=Optimiser(scoring="mean_squared_error",n_folds=5).optimise(space,df,40)
# predicting on the test dataset
Predictor().fit_predict(best,df)
The above code ranked 108 (top 1%) on the public leaderboard without my even having to open the train and test files. I think this is pretty awesome.
Below is the image of feature importance as calculated by LightGBM.
Drift is not a commonly discussed topic, but it is a very important one and deserves an article of its own. Here I will briefly explain what Drift_thresholder does.
In general, we assume that the train and test datasets are created by the same generative process, but this assumption is quite strong and we rarely see this behaviour in the real world, where the data-generating process may change over time. For example, in a sales prediction model, customer behaviour changes over time, so the data being generated will differ from the data that was used to build the model. This is called drift.
Another point to note is that in a dataset, both the independent features and the dependent feature may drift. When the distribution of the independent features changes, it is called covariate shift, and when the relationship between the independent and dependent features changes, it is called concept shift. MLBox deals with covariate shift.
The general idea behind drift detection is, roughly, to build a classifier that tries to tell train rows apart from test rows: if a feature lets the classifier separate the two sets easily (ROC AUC well above 0.5), that feature's distribution has drifted and it is a candidate for removal.
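To make covariate drift concrete, here is a small univariate sketch (my own illustration, not MLBox's code): score how well a single feature separates train rows from test rows using ROC AUC. An AUC near 0.5 means the two distributions look alike (no drift); an AUC near 1.0 (or 0.0) means strong drift.

```python
import numpy as np

def drift_auc(train_col, test_col):
    """ROC AUC of using the raw feature value to tell test rows from train rows.

    AUC ~ 0.5 -> train and test distributions look alike (no drift);
    AUC -> 1.0 (or 0.0) -> the feature separates the two sets (drift).
    """
    values = np.concatenate([train_col, test_col])
    labels = np.concatenate([np.zeros(len(train_col)), np.ones(len(test_col))])
    # Rank-based AUC (Mann-Whitney U statistic)
    order = values.argsort()
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    n_pos, n_neg = len(test_col), len(train_col)
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.RandomState(0)
no_drift = drift_auc(rng.normal(0, 1, 1000), rng.normal(0, 1, 1000))  # ~0.5
drifted = drift_auc(rng.normal(0, 1, 1000), rng.normal(3, 1, 1000))   # close to 1.0
```

A thresholder in this spirit would drop every feature whose AUC exceeds some cutoff, which is the behaviour the Drift_thresholder example above exhibits from the outside.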
Entity Embeddings owe their existence to word2vec embeddings, in the sense that they function the same way word vectors do. For example, we know that with word vector representations we can do things like this:
In a similar sense, categorical variables can be encoded to create new, informative features. Their power became evident in Kaggle’s Rossmann Sales competition, where a team used Entity Embeddings along with a neural network and came third without performing any significant feature engineering. The code and the research paper on Entity Embeddings that resulted from the competition can be found here. The Entity Embeddings were able to capture the relationships between the German states, as shown below.
I don’t want to bog you down with the explanation of Entity Embeddings here. It deserves its own article. In MLBox, you can use Entity Embedding as a black box for encoding categorical variables.
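As a black-box intuition only: an entity embedding replaces each label-encoded category with a learned row of a small dense matrix. The numpy sketch below shows just the lookup step; the matrix here is random and made up for illustration, whereas in practice (e.g. inside MLBox or a Keras model) it is learned jointly with the rest of the network.

```python
import numpy as np

# 4 categories (say, four states), each mapped to a 2-dimensional vector.
# In practice this matrix is *learned* during training; here it is random.
rng = np.random.RandomState(42)
embedding_matrix = rng.normal(size=(4, 2))

states = np.array([0, 2, 2, 3, 1])      # a label-encoded categorical column
embedded = embedding_matrix[states]     # lookup: one 2-d vector per row
# embedded now has shape (5, 2) and replaces the single categorical column
```

The payoff is that categories which behave similarly end up with nearby vectors, which is exactly the relationship between German states the Rossmann team's plots revealed.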
This library has its own set of pros and cons.
The pros are –
The cons are-
So, I suggest you weigh the pros and cons before making this your mainstream library for Machine Learning.
I was really excited to try this library as soon as I read about its release on GitHub. I spent the next couple of days studying the library and simplifying it for you to use on the go. I must say that I am really impressed with it and am going to explore it even more. With just 8 lines of code I was able to break into the top 1%, and since I did not have to spend time explicitly on data handling and hyperparameter optimisation, I could dedicate more time to feature engineering and test ideas on the fly. Please feel free to comment below with any questions or ideas.