Stacking is an ensemble learning technique that uses predictions from multiple models (for example kNN, decision trees, or SVM) to build a new model. This final model is used for making predictions on the test dataset.
***Video***
Note: If you prefer learning concepts in an audio-visual format, this entire article is explained in the video below. If not, you may continue reading.
So, what we do in stacking is we take the training data and run it through multiple models, M1 to Mn. And all of these models are typically known as base learners or base models. And we generate predictions from these models.
So Pred 1 to Pred n are the predictions, and instead of combining them with max voting or averaging, I send them as input to another model. This model takes the predictions as inputs and gives us the final prediction. Depending on whether it is a regression problem or a classification problem, I can choose the right model for this step. So the concept of stacking is very interesting and it opens up a lot of possibilities.
But doing stacking in this manner opens up a huge danger of overfitting, because I’m using my entire training data both to create the base models and to generate the predictions that the final model is trained on.
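To see why this is dangerous, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (the data and the variable names are my own, not from the article). It compares the predictions a decision tree makes on the very rows it was trained on with predictions made on rows the model never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, not from the article).
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Naive approach: fit on all of train, then predict the same rows.
in_sample_pred = tree.fit(X_train, y_train).predict(X_train)
in_sample_acc = (in_sample_pred == y_train).mean()

# Held-out approach: every row is predicted by a model that never saw it.
held_out_pred = cross_val_predict(tree, X_train, y_train, cv=10)
held_out_acc = (held_out_pred == y_train).mean()

print(f"in-sample accuracy: {in_sample_acc:.3f}")
print(f"held-out accuracy:  {held_out_acc:.3f}")
```

The in-sample predictions look far better than they really are, so a final model trained on them would learn to trust the base model too much.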
So the question is, can I become smarter and can I use the training data and the test data in a different manner so that I reduce the danger of overfitting. And that is what we’ll be discussing in this particular article. So what I’m going to cover is one of the most popular ways in which stacking is used.
So let’s say we have these train and test datasets-
And in order to reduce overfitting, I take my train data and divide it into 10 parts. So this is just done randomly. So I take the entire train data set and I convert it into 10 smaller data sets-
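A rough sketch of this random split, assuming NumPy and a hypothetical train dataset of 1,000 rows (the row count and variable names are only for illustration):

```python
import numpy as np

n_train = 1000  # hypothetical number of rows in the train dataset
n_parts = 10

rng = np.random.default_rng(seed=0)
shuffled_rows = rng.permutation(n_train)        # random order of row indices
parts = np.array_split(shuffled_rows, n_parts)  # 10 roughly equal parts

for i, part in enumerate(parts, start=1):
    print(f"part {i}: {len(part)} rows")
```

Each row index lands in exactly one part, so every row gets predicted exactly once later on.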
And now in order to reduce overfitting, what I do is I train my model on 9 out of these 10 parts and do my predictions on the 10th part. So in this particular case, I do my training on part 2 to part 10. Let’s say I’m using a decision tree as my modeling technique, so I train my model and make predictions for the cases in part 1-
So part 1 now holds predictions. The green color represents the predictions made for the points in part 1. I do the same exercise for each of these parts. So for part 2, I again train my model using part 1 and parts 3 to 10 of the data, and I do my predictions on part 2-
So in this manner, I do my predictions for all of these 10 parts. In summary, each of these predictions comes from a model that had not seen those data points during training. And for creating the predictions on the test dataset, I use the entire train data: I train the model once more on the entire train dataset and make predictions on test-
So if you think about it, we created 10 models to get the predictions on train data and the 11th model to get predictions on test data. And all of these are decision tree models. So this gives me one set of predictions or the equivalent of what predictions were coming from model M1.
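The 10 + 1 models for one technique can be sketched as follows. This is a minimal example under my own assumptions: scikit-learn, a synthetic binary classification dataset, and class-1 probabilities used as the "predictions". The helper name `out_of_fold_predictions` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the article's train and test datasets (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def out_of_fold_predictions(model_factory, X_train, y_train, X_test, n_parts=10):
    """Return (train_pred, test_pred) for one modeling technique."""
    train_pred = np.zeros(len(X_train))
    folds = KFold(n_splits=n_parts, shuffle=True, random_state=0)
    # Models 1..n_parts: train on the other parts, predict the held-out part.
    for fit_idx, held_out_idx in folds.split(X_train):
        model = model_factory()
        model.fit(X_train[fit_idx], y_train[fit_idx])
        train_pred[held_out_idx] = model.predict_proba(X_train[held_out_idx])[:, 1]
    # Model n_parts + 1: train on the entire train data, predict the test data.
    full_model = model_factory()
    full_model.fit(X_train, y_train)
    test_pred = full_model.predict_proba(X_test)[:, 1]
    return train_pred, test_pred

tree_train_pred, tree_test_pred = out_of_fold_predictions(
    lambda: DecisionTreeClassifier(max_depth=3, random_state=0),
    X_train, y_train, X_test,
)
print(tree_train_pred.shape, tree_test_pred.shape)
```

Passing a factory instead of a fitted model keeps the 11 models independent of each other. Calling the same helper with a different factory gives the predictions for another technique.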
I do the same thing with a second modeling technique. Let’s say KNN. So again, the same concept that I do predictions part by part on part 1 to part 10. And again, for getting predictions on the test dataset, I run the 11th KNN model.
I do the same thing with a third modeling technique, which could be linear or logistic regression, depending on what kind of problem I am handling.
So these are my new base learners in a way. So now I have predictions from three different types of modeling techniques, but I’ve avoided the danger of overfitting.
Now you might ask, why am I using 10? What is so sacrosanct about this number? There is nothing sacrosanct about the number 10. It’s based on the fact that if I use only two or three parts, it doesn’t give me as much benefit, and if I use more than, let’s say, 15 or 20, my number of computations increases. So it’s just a trade-off between reducing overfitting and not increasing my complexity a lot. You can just as well go ahead with 7 or 8; there is nothing specific that ties you to 10.
So feel free to choose your own number. It could be seven, it could be eight, but typically I see people using anywhere between five and maybe 11 or 12, depending on the situation. And you’ll see this again and again in ensembling: there are guidelines, but at the end of the day, you need to make decisions based on how many resources you have, how much complexity there is, what your production guidelines are, and what you can afford in production.
So I’ve taken 10 as an example, but you can use any other number as well. So coming back to stacking. So we had these predictions from three different types of models. So this becomes my new train dataset-
and the predictions which I had on my test become my new test dataset. And I now create a model on these train and test data sets to come up with my final predictions-
So we use this new train dataset to train the final model, and then make predictions on the new test dataset to get the final test predictions.
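Putting the whole procedure together, here is a hedged sketch with three base techniques and a final model. It assumes scikit-learn, synthetic binary classification data, class-1 probabilities as predictions, and logistic regression as the final model. All of these are my choices for illustration, not something the article prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and test datasets (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression(max_iter=1000),
]

train_columns, test_columns = [], []
for model in base_models:
    # 10 models: each part of train is predicted by a model that never saw it.
    oof = cross_val_predict(model, X_train, y_train, cv=10, method="predict_proba")[:, 1]
    train_columns.append(oof)
    # 11th model: fit on the entire train data, predict the test data.
    model.fit(X_train, y_train)
    test_columns.append(model.predict_proba(X_test)[:, 1])

new_train = np.column_stack(train_columns)  # one column per base technique
new_test = np.column_stack(test_columns)

# Final model: trained on the new train dataset, predicts on the new test dataset.
final_model = LogisticRegression()
final_model.fit(new_train, y_train)
final_test_pred = final_model.predict(new_test)
print(new_train.shape, new_test.shape, final_test_pred.shape)
```

`cross_val_predict` works on clones of each estimator, so the explicit `fit` on the entire train data afterwards is what produces the 11th model.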
So this is the most popular variant of stacking, which is used in industry. Let us look at a few more variations, which can be used-
So currently, if you think about it, we have used only the new predictions as features on our final model. What I can also do is I can include the original features along with the new feature. So instead of just using this red box for train and test-
I can use the complete feature set to train my model: the features which were there originally, plus the predictions which came out of the base models. So I’m opening up my train dataset to include more features-
And I do the same thing for test. And this gives me a new set of predictions-
So that’s one way in which stacking is also implemented.
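One way to try this variant is scikit-learn’s `StackingClassifier`, whose `passthrough=True` option hands the original features to the final model along with the base predictions. This is a sketch under my own assumptions (synthetic data, these three base models, logistic regression as the final model); it is scikit-learn’s implementation, not necessarily the exact procedure used in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and test datasets (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=10,             # 10 parts for the out-of-fold predictions
    passthrough=True,  # also pass the original features to the final model
)
stack.fit(X_train, y_train)
final_test_pred = stack.predict(X_test)

# Inputs seen by the final model: prediction columns plus the original features.
final_model_inputs = stack.transform(X_test)
print(final_model_inputs.shape)
```

With `passthrough=False` (the default), the final model sees only the prediction columns, which corresponds to the basic variant described earlier.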
The second way to implement stacking is by making multiple predictions on the test dataset and aggregating them. If you remember, we created 10 models to get the predictions for the 10 parts of train, and we used one more model, trained on the entire train data, to create the test dataset predictions. What I could do instead is make test predictions with each of those 10 models and then aggregate them, rather than relying on the one model trained on the entire train data.
So the same models which I was using to make the predictions for part 1, part 2, and each of the other parts, I also use to create predictions for the test data. Then I average them to come up with the final test predictions, which I’ll use as input for the final model.
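A minimal sketch of this averaging variant for a single technique, again assuming scikit-learn, synthetic data, and class-1 probabilities as predictions (the variable names are my own):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and test datasets (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

n_parts = 10
folds = KFold(n_splits=n_parts, shuffle=True, random_state=0)
train_pred = np.zeros(len(X_train))
test_pred_per_model = np.zeros((n_parts, len(X_test)))

for i, (fit_idx, held_out_idx) in enumerate(folds.split(X_train)):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    # Same as before: predict the held-out part of train.
    train_pred[held_out_idx] = model.predict_proba(X_train[held_out_idx])[:, 1]
    # New in this variant: the same model also predicts the test data.
    test_pred_per_model[i] = model.predict_proba(X_test)[:, 1]

# Average the 10 sets of test predictions; no 11th model is needed.
test_pred = test_pred_per_model.mean(axis=0)
print(train_pred.shape, test_pred.shape)
```

This saves training the 11th model, at the cost of each test prediction coming from models that saw only 9 of the 10 parts.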
Again, as I said, these are all different models and different ways of implementing stacking and ensembling. You are completely free to become creative and find new ways to reduce overfitting. The overall objectives are to make sure that you do not overfit your models, that you keep the models as simple as possible, and that you increase your accuracy.
As long as you do anything to achieve these three objectives, it would be a valid strategy, right?
So the third variant of stacking is where instead of keeping one single model on all the predictions, I ended up creating layers of models. So for example, in this particular case-
I took predictions from M1 and M2 and fed them to another model, M4. Similarly, I took predictions from M2 and M3 and fed them to M5. And the final model was actually a model built on the predictions of M4 and M5. So I ended up creating two levels of models on top of my base models. Again, it’s a valid way of stacking, and depending on the situation, you might choose it.
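The layered setup could be sketched like this. It assumes scikit-learn, synthetic data, and my own choice of techniques for M1 to M5 and the final model; the helper `oof_and_test` is hypothetical. Out-of-fold predictions are reused at every level, which is one reasonable way to limit overfitting but not the only one.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and test datasets (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def oof_and_test(model, train_features, y_train, test_features, cv=10):
    """Out-of-fold predictions on train, plus test predictions from a full fit."""
    oof = cross_val_predict(model, train_features, y_train, cv=cv,
                            method="predict_proba")[:, 1]
    fitted = clone(model).fit(train_features, y_train)
    return oof, fitted.predict_proba(test_features)[:, 1]

# Level 0: base models M1, M2, M3 on the original features.
p1, t1 = oof_and_test(DecisionTreeClassifier(max_depth=3, random_state=0),
                      X_train, y_train, X_test)
p2, t2 = oof_and_test(KNeighborsClassifier(n_neighbors=5), X_train, y_train, X_test)
p3, t3 = oof_and_test(LogisticRegression(max_iter=1000), X_train, y_train, X_test)

# Level 1: M4 uses predictions of M1 and M2; M5 uses predictions of M2 and M3.
p4, t4 = oof_and_test(LogisticRegression(), np.column_stack([p1, p2]), y_train,
                      np.column_stack([t1, t2]))
p5, t5 = oof_and_test(DecisionTreeClassifier(max_depth=2, random_state=0),
                      np.column_stack([p2, p3]), y_train, np.column_stack([t2, t3]))

# Level 2: the final model is built on the predictions of M4 and M5.
final_model = LogisticRegression().fit(np.column_stack([p4, p5]), y_train)
final_test_pred = final_model.predict(np.column_stack([t4, t5]))
print(final_test_pred.shape)
```

Every extra layer adds training cost and more places to overfit, so it is worth checking on held-out data that a deeper stack actually helps.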
So these were the variants of stacking. As I said, make sure that the three requirements of ensembling are taken care of, which are: making sure that you do not overfit your models, making sure that you keep the models as simple as possible, and making sure that you increase your accuracy. Remember that you can get as creative as you like with stacking or any other ensembling technique by just keeping these three points in mind.
I’ve covered a few variants for stacking. So feel free to use them. And with those three constraints or with those thoughts, any variation which you can come up with would be a valid variation.
If you have any queries let me know in the comment section!
Hello guys, thank you very much for this hard work. I have one question: I trained my models on the data, then used their outputs as input for the ensemble model, then trained it on the test set as new data. Is this right? Thanks in advance. I didn’t get how the ensemble works to improve the results and combine the single models.