In this article, we will first discuss some common ensemble methods and their disadvantages. Then we will see how those disadvantages can be addressed by another ensembling approach known as stacking and blending, and how to build it in Python. Finally, we will wrap everything up into an easy-to-use function.
Ensemble methods are machine learning techniques that combine different models to build a stronger one. Different machine learning models extract patterns in different ways, so by using all of them together a better model can often be made. One of the most common methods is simple averaging, where every model's prediction gets the same weight.
Though this method is very simple to use and often gives a better result, the problem is that the weaker models get the same priority as the stronger ones, which can sometimes drag the score down.
A variant is weighted averaging, where we assign a higher weight to the stronger models, but guessing the right weights by hand is difficult and not very desirable either. Both variants are illustrated in the sketch below.
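As a minimal sketch (the prediction arrays and the weights here are made up purely for illustration), simple and weighted averaging of two models' predictions look like this:

import numpy as np

# Hypothetical predictions from two already-fitted regressors on the same test rows
pred_a = np.array([10.0, 12.5, 9.0])   # e.g. a stronger model
pred_b = np.array([11.0, 15.0, 7.5])   # e.g. a weaker model

# Simple averaging: every model gets equal priority
simple_avg = (pred_a + pred_b) / 2

# Weighted averaging: the weights have to be guessed or tuned by hand
weights = [0.7, 0.3]
weighted_avg = weights[0] * pred_a + weights[1] * pred_b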
Stacking and Blending: To get rid of the above-mentioned problems, stacking and blending can be used for ensemble modelling. At a high level, the steps (worked through in detail below) are: divide the training data into N parts, define a set of first-layer (base) models, generate out-of-fold predictions for every chunk using those base models, collect those predictions into a new data frame, and finally train a second-layer model on that data frame.
import pandas as pd
import numpy as np

# Raw strings so the Windows path backslashes are not treated as escape characters
x_train = pd.read_csv(r"C:\Users\chakr\Desktop\Clean_data\X_train_reg.csv")
y_train = pd.read_csv(r"C:\Users\chakr\Desktop\Clean_data\y_train_reg.csv")
x_train.head()
from sklearn.model_selection import train_test_split

x_train1, x_train2, y_train1, y_train2 = train_test_split(
    x_train, y_train, test_size=0.25, random_state=42)
Step 1: Divide the dataset into N parts (here we use 20 parts)
def get_dataset(x_train, y_train, N=5):
    # Shuffle the data once, then split it into N roughly equal chunks
    merge = pd.concat([x_train, y_train], axis=1)
    merge = merge.sample(frac=1, random_state=1).reset_index(drop=True)
    y_train = merge.iloc[:, (merge.shape[1]-1):(merge.shape[1])]
    x_train = merge.iloc[:, 0:(merge.shape[1]-1)]
    z = int(len(x_train)/N)
    start = [0]
    stop = []
    for i in range(1, N):
        start.append(z*i)
        stop.append(z*i)
    stop.append(len(x_train))
    c = list()
    train_data = list()
    test_data = list()
    y_data = list()
    for i in range(0, N):
        # For each chunk: train on everything except the chunk, hold the chunk out for prediction
        c = list(range(start[i], stop[i]))
        train_data.append(x_train.iloc[[k for k in range(0, len(x_train)) if k not in c], :])
        y_data.append(y_train.iloc[[k for k in range(0, len(y_train)) if k not in c], :])
        test_data.append(x_train.iloc[c, :])
    return (train_data, y_data, test_data, y_train)

datasets = get_dataset(x_train1, y_train1, 20)
train_data = datasets[0]
y_data = datasets[1]
test_data = datasets[2]
final_y = datasets[3]
Now we have the following datasets: train_data, y_data, and test_data for each of the 20 chunks, along with final_y, the shuffled target column.
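As a quick sanity check, we can confirm that the split produced 20 chunks and look at their shapes:

len(train_data)       # 20 training chunks, each with one fold held out
train_data[0].shape   # training rows used for the first fold
test_data[0].shape    # held-out rows of the first fold (for out-of-fold predictions)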
Step 2: Define the first-layer (base) models

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor, Pool

models = [LinearRegression(), DecisionTreeRegressor(), KNeighborsRegressor(),
          CatBoostRegressor(logging_level='Silent')]
code = ['lin_reg', 'dtree_reg', 'Knn_reg', 'cat_reg']
Step 3: Build a prediction function that runs all the base models together
def stack(x_train, y_train, x_test, models, code):
    # Flatten a possibly nested list of predictions into a single flat list
    def flatten_list(_2d_list):
        flat_list = []
        for element in _2d_list:
            if type(element) is list:
                for item in element:
                    flat_list.append(item)
            else:
                flat_list.append(element)
        return flat_list

    # Fit every base model and collect its predictions on x_test
    result = list()
    for i in list(range(len(models))):
        reg = models[i]
        reg.fit(x_train, y_train)
        test_pred = flatten_list(reg.predict(x_test).tolist())
        result.append(test_pred)

    # One column of predictions per base model
    result_df = pd.DataFrame()
    for i in list(range(len(code))):
        result_df[code[i]] = result[i]
    return result_df
Step 4: Predict for each of the chunks to get the final data frame
# Out-of-fold predictions for every chunk, stacked into one frame
final_df = pd.DataFrame(columns=code)
for i in range(0, len(train_data)):
    current_df = stack(train_data[i], y_data[i], test_data[i], models, code)
    final_df = pd.concat([final_df, current_df])

# Base-model predictions on the held-out validation set
final_test = stack(x_train1, y_train1, x_train2, models, code)
final_df.head()
Step 5: Build the second-layer model
reg2 = CatBoostRegressor(logging_level='Silent')
reg2.fit(final_df, final_y)
test_pred = reg2.predict(final_test)
# RMSE on the held-out validation set
mean_squared_error(test_pred, y_train2)**0.5
In the above section, we saw how stacking and blending work to help us build an ensemble model. In this section, we will wrap everything up into a single, reusable function that returns predictions directly.
def stackblend_reg(x_train, y_train, x_test, models, code, N=20, final_layer=LinearRegression()):
    # Split the training data into N chunks (same helper as above)
    def get_dataset(x_train, y_train, N=5):
        merge = pd.concat([x_train, y_train], axis=1)
        merge = merge.sample(frac=1, random_state=1).reset_index(drop=True)
        y_train = merge.iloc[:, (merge.shape[1]-1):(merge.shape[1])]
        x_train = merge.iloc[:, 0:(merge.shape[1]-1)]
        z = int(len(x_train)/N)
        start = [0]
        stop = []
        for i in range(1, N):
            start.append(z*i)
            stop.append(z*i)
        stop.append(len(x_train))
        c = list()
        train_data = list()
        test_data = list()
        y_data = list()
        for i in range(0, N):
            c = list(range(start[i], stop[i]))
            train_data.append(x_train.iloc[[k for k in range(0, len(x_train)) if k not in c], :])
            y_data.append(y_train.iloc[[k for k in range(0, len(y_train)) if k not in c], :])
            test_data.append(x_train.iloc[c, :])
        return (train_data, y_data, test_data, y_train)

    datasets = get_dataset(x_train, y_train, N)
    train_data = datasets[0]
    y_data = datasets[1]
    test_data = datasets[2]
    final_y = datasets[3]

    # Fit every base model and collect its predictions (same helper as above)
    def stack(x_train, y_train, x_test, models=models, code=code):
        def flatten_list(_2d_list):
            flat_list = []
            for element in _2d_list:
                if type(element) is list:
                    for item in element:
                        flat_list.append(item)
                else:
                    flat_list.append(element)
            return flat_list
        result = list()
        for i in list(range(len(models))):
            reg = models[i]
            reg.fit(x_train, y_train)
            test_pred = flatten_list(reg.predict(x_test).tolist())
            result.append(test_pred)
        result_df = pd.DataFrame()
        for i in list(range(len(code))):
            result_df[code[i]] = result[i]
        return result_df

    # Out-of-fold predictions for every chunk
    final_df = pd.DataFrame(columns=code)
    for i in range(0, len(train_data)):
        current_df = stack(train_data[i], y_data[i], test_data[i], models, code)
        final_df = pd.concat([final_df, current_df])
    final_test = stack(x_train, y_train, x_test, models, code)

    # Second-layer model trained on the base-model predictions
    reg2 = final_layer
    reg2.fit(final_df, final_y)
    test_pred = reg2.predict(final_test)
    return test_pred
stack_pred = stackblend_reg(x_train1, y_train1, x_train2,
                            models=[LinearRegression(), DecisionTreeRegressor(),
                                    KNeighborsRegressor(),
                                    CatBoostRegressor(logging_level='Silent')],
                            code=['lin_reg', 'dtree_reg', 'Knn_reg', 'cat_reg'],
                            N=20,
                            final_layer=CatBoostRegressor(logging_level='Silent'))
mean_squared_error(stack_pred, y_train2)**0.5
Now that we have a function that returns predictions directly, we can experiment with different final-layer models to see what works best. A similar function can be built for classification problems too; a rough sketch is given below. Functions like these save a lot of time and are easy to use when solving supervised learning problems.
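As a rough sketch, assuming a stackblend_clf function built along the same lines as stackblend_reg (the classifier list, the function name, and the metric below are only illustrative assumptions), the classification version would swap the regressors for classifiers and evaluate with a classification metric:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Illustrative first-layer classifiers and their column names
clf_models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(), KNeighborsClassifier()]
clf_code = ['log_clf', 'dtree_clf', 'knn_clf']

# Assuming stackblend_clf mirrors stackblend_reg, with classifiers in both layers
# and class predictions (or probabilities) filling the first-layer data frame:
# stack_pred = stackblend_clf(x_train1, y_train1, x_train2, models=clf_models, code=clf_code,
#                             N=20, final_layer=LogisticRegression(max_iter=1000))
# accuracy_score(y_train2, stack_pred)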