Stacking is an ensemble machine learning technique that combines the predictions of several models into a single model that makes the final prediction for the given dataset.
Depending on the type of dataset we are dealing with, the combined models may even be of the same type. Stacking, also known as stacked generalisation, is a method for determining and combining the strengths of different models in order to predict the output for the same dataset.
The problem is divided into parts so that each model deals with some portion of it rather than the complete problem, and the subparts can then be handled individually by different models working in combination. Stacking can also be viewed as layers of prediction: each model solves a certain part of the problem, and the output from every layer serves as the input to the final layer or to an intermediate layer. These final and intermediate layers are themselves prediction models, chosen according to whether we are predicting for classification or regression.
The main benefit of using stacking is improved predictions. Broadly speaking, ensemble methods are meta-algorithms that combine several machine learning algorithms into one prediction model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).
Ensemble methods can be categorised in two ways. By how the base learners are generated:
- Sequential ensemble: the base learners are generated sequentially and depend on each other.
- Parallel ensemble: the base learners are generated in parallel and are independent of each other.
And by the learning algorithms used:
- Homogeneous ensemble: a single learning algorithm is used to build the complete learning model and to obtain the base learners.
- Heterogeneous ensemble: different learning algorithms are used to build the prediction model and to obtain the base learners.
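To make the distinction concrete, here is a minimal sketch contrasting a homogeneous ensemble (bagging many decision trees) with a heterogeneous ensemble (voting over different algorithms). The synthetic dataset and the specific estimators are illustrative choices, not part of the original article:

```python
# Homogeneous vs. heterogeneous ensembles, sketched with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Homogeneous: one learning algorithm (decision trees) repeated on
# bootstrap samples -- this is bagging.
homogeneous = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10, random_state=42
)

# Heterogeneous: different learning algorithms combined by majority vote.
heterogeneous = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("knn", KNeighborsClassifier()),
])

score_hom = cross_val_score(homogeneous, X, y, cv=5).mean()
score_het = cross_val_score(heterogeneous, X, y, cv=5).mean()
print("homogeneous (bagging):", score_hom)
print("heterogeneous (voting):", score_het)
```

Stacking belongs to the heterogeneous family: instead of a fixed vote, a meta-model learns how to combine the base learners' outputs.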
Stacked Generalisation: Now that we know what stacking in machine learning is, the next questions are: which models are useful for a particular problem? How do we choose between models? And finally, why do we need stacking at all?
Before answering these questions, we need to understand the architecture of stacking. Stacking involves two or more base models, referred to as level-0 models, and a level-1 meta-model that takes the outputs of the level-0 models as its input in order to produce the final regression or classification output, depending on the dataset we are dealing with.
- Level-0: These are the base models, which are fit (trained) on a certain portion of the dataset; their outputs are treated as inputs to the next layer.
- Level-1: This layer is the meta-model, which takes the base models' outputs as input and combines them in order to produce the final prediction.
The output we obtain from the base models, which serves as the input to the meta-model, may be a real value in the case of regression, or a probability, a probability density value, or a class label in the case of classification.
The prediction models chosen for the base layer can differ from one another. In stacking, the models may be different yet still be fitted on the same dataset, and a single meta-model is used to learn how best to combine the predictions from each model and finally produce the output, which may be our desired real value or a label attached to the data.
The details of the figure are as follows:
- The dataset has m rows and n columns, i.e. m data points with n features per data point.
- There are M different models, each with its own function, trained on the training dataset X using K-folds.
- Each model provides predictions, which are then passed to the second-level training. This data has dimensions m x M: m rows (the number of data points) and M columns (the features formed by the outputs of the M base models).
- The second-level model is trained on the data produced by the base models in order to produce the final result. This model is also referred to as the meta-model.
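The m x M data flow described above can be sketched directly: each of M base models produces out-of-fold predictions over the m training rows, and the stacked columns become the meta-model's training set. The dataset and model choices here are illustrative assumptions:

```python
# Level-0 -> level-1 data flow: M base models' out-of-fold predictions
# form an (m x M) matrix that the meta-model is trained on.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(),
]

# Each column holds one base model's K-fold (here 5-fold) predictions,
# so every row's meta-feature comes from a model that never saw that row.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for model in base_models
])
print(meta_features.shape)  # (m, M) = (200, 3)

# The meta-model learns how to combine the base models' outputs.
meta_model = LogisticRegression().fit(meta_features, y)
```

Using out-of-fold predictions (rather than predictions on data the base models were trained on) is what keeps the meta-model from simply learning the base models' overfitting.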
Stacking in Classification
Classification is the technique of predicting the label or class attached to the dataset on which training is done. Many classification algorithms can be used to obtain the desired result, such as logistic regression, support vector classification, decision trees, random forests, k-nearest neighbours, and naive Bayes. All of these algorithms play a major role in classifying data and may produce different scores.
Stacking is a generalised way to produce a single model that combines several models, where each model can deal with a portion of the dataset, and the output from each prediction classifier is finally fed as input to the meta-model in order to obtain the final result.
This stacking technique is useful in several ways. Firstly, it improves the score, or efficiency, of the overall prediction model. Secondly, it combines all the classification algorithms, which helps keep variance low.
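A minimal classification sketch, assuming scikit-learn's built-in StackingClassifier; the iris dataset and the particular base and meta estimators are illustrative choices:

```python
# Stacking for classification: three base classifiers feed a
# logistic-regression meta-model.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[                       # level-0 base models
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC()),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),  # level-1 meta-model
    cv=5,  # out-of-fold predictions build the meta-model's training data
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print("stacked test accuracy:", accuracy)
```

Internally, `cv=5` makes scikit-learn perform the K-fold procedure described above, so the meta-model is trained only on predictions the base models made for held-out folds.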
Stacking in Regression
Regression models the relationship between a set of independent quantities and a single dependent quantity; in machine learning, that dependent quantity is the output value produced from the set of inputs, i.e. the independent values or records.
The main algorithms that can be used for regression are linear regression, support vector machines, decision trees, and k-nearest neighbours. All of these can serve as base models in the stacking technique, and their outputs can be fed to the meta-model in order to obtain the final real-valued prediction that captures the relationship between the dependent and independent data in the provided dataset.
Using stacking, the score of each individual regression algorithm can be identified; on that basis it is easy to select which algorithms should be added to the combined level-0 layer, which deals with the raw portions of the dataset and whose output is fed to the meta-model for the final prediction of the real value.
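The same pattern applies to regression via scikit-learn's StackingRegressor; here is a minimal sketch on a synthetic dataset (all estimator choices are illustrative):

```python
# Stacking for regression: base regressors feed a linear-regression
# meta-model that outputs the final real value.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

stack = StackingRegressor(
    estimators=[                        # level-0 base models
        ("svr", SVR()),
        ("dt", DecisionTreeRegressor(random_state=1)),
        ("knn", KNeighborsRegressor()),
    ],
    final_estimator=LinearRegression(),  # level-1 meta-model
)
stack.fit(X_train, y_train)
r2 = stack.score(X_test, y_test)  # R^2 on the held-out test set
print("stacked R^2:", r2)
```

Comparing this R^2 against each base regressor fitted alone is one way to decide which algorithms are worth keeping in the level-0 layer.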
Steps in which stacking works
In order to work with stacking, we need to follow its architecture properly.
- First, the complete data is divided into a test set and a training set using train_test_split from model_selection.
- The training data is then divided using K-folds; these folds are mostly used for validation purposes, effectively acting as k-fold cross-validation.
- A base model is fit on a portion of the dataset (i.e. on some of the folds), and predictions are then made for the validation part, the held-out fold.
- The base model is then fit on the complete training dataset in order to measure its performance on the test set.
- The two steps above are repeated for the other base models to check the efficiency of the complete model.
- The predictions obtained from each base model are used as the features for the meta-model, which makes the final prediction.
- The meta-model then produces the final prediction on the test data from train_test_split.
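The steps above can be sketched manually with KFold, without relying on StackingClassifier, so each step is visible in the code. The dataset and the two base models are illustrative assumptions:

```python
# Manual stacking, step by step: split -> K-fold out-of-fold predictions
# -> meta-features -> meta-model -> final test prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.base import clone

X, y = make_classification(n_samples=400, n_features=10, random_state=7)

# Step 1: split the complete data into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

base_models = [DecisionTreeClassifier(random_state=7), KNeighborsClassifier()]
kf = KFold(n_splits=5, shuffle=True, random_state=7)  # step 2: K-folds

train_meta = np.zeros((len(X_train), len(base_models)))
test_meta = np.zeros((len(X_test), len(base_models)))
for j, model in enumerate(base_models):
    # Step 3: fit on training folds, predict the held-out fold.
    for fit_idx, val_idx in kf.split(X_train):
        m = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        train_meta[val_idx, j] = m.predict(X_train[val_idx])
    # Step 4: refit on the complete training set to build test features.
    test_meta[:, j] = clone(model).fit(X_train, y_train).predict(X_test)
    # Step 5: the loop repeats steps 3-4 for every base model.

# Steps 6-7: the base models' predictions become the meta-model's
# features, and the meta-model makes the final prediction on the test set.
meta = LogisticRegression().fit(train_meta, y_train)
acc = meta.score(test_meta, y_test)
print("stacked test accuracy:", acc)
```

This is essentially what StackingClassifier does internally, so in practice the built-in class is usually preferred; the manual version is mainly useful for understanding the data flow.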
By Vikas Upadhayay