August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
Bootstrap aggregation (bagging) is an ensemble technique that addresses overfitting in classification and regression problems. Its goal is to improve the accuracy and stability of machine learning models. To achieve this, random subsets of the original dataset are drawn with replacement, and a classifier (for classification) or a regressor (for regression) is fitted to each subset. The predictions of these models are then combined, by majority vote for classification or by averaging for regression, to increase prediction accuracy.
Before we can see how bagging improves model performance, we must first assess the performance of a base classifier on the dataset. If you are unfamiliar with decision trees, revisit the lesson on them before continuing, since bagging builds on that concept.
Using sklearn’s wine dataset, we’ll try to identify different classes of wine.
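To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging for classification: draw bootstrap samples, fit one decision tree per sample, and combine the predictions by majority vote. The ensemble size (n_models), the seed, and the choice to predict on the same rows used for training are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_models = 5  # hypothetical ensemble size
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# One row of predictions per tree: shape (n_models, n_samples)
all_preds = np.array([tree.predict(X) for tree in trees])

# Majority vote: for each sample, take the most frequent label across trees
ensemble_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])
print(ensemble_pred[:10])
```

For regression, the vote would be replaced by the mean of the individual predictions.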
Importing the essential modules
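The original import listing is not shown here; the following is a plausible set of imports for the steps described in this post (loading data, splitting, fitting a decision tree, and scoring accuracy).

```python
from sklearn import datasets                          # bundled example datasets
from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.metrics import accuracy_score            # evaluation metric
from sklearn.tree import DecisionTreeClassifier       # base classifier
```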
Next, we load the data and store it in the variables X (input features) and y (target). The parameter as_frame is set to True so that the feature names are preserved when the data is loaded.
To assess our model properly on unseen data, we need to split X and y into train and test sets. You can find more details about data splitting in the Train/Test lesson.
Now that our data is ready, we can instantiate a base classifier and fit it to the training data.
We can now predict the wine class for the unseen test set and evaluate the model’s performance.
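A short sketch of the loading step, using sklearn's load_wine with as_frame=True so that X comes back as a DataFrame with named columns:

```python
from sklearn import datasets

data = datasets.load_wine(as_frame=True)

X = data.data    # DataFrame of input features, one named column per feature
y = data.target  # Series of class labels

print(X.shape)
print(list(X.columns)[:3])
```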
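The split might look like the sketch below. The test_size of 0.25 and the random_state of 22 are assumptions, since the text does not state the values used; any fixed seed makes the split reproducible.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_wine(as_frame=True)
X, y = data.data, data.target

# Hold out 25% of the rows for testing (assumed ratio and seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=22
)

print(X_train.shape, X_test.shape)
```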
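Fitting the baseline tree and scoring it could be sketched as follows. The split ratio and the random_state values are assumptions, so the accuracy you get may differ from the figure quoted in this post.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = datasets.load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=22  # assumed values
)

# Fit a single decision tree as the baseline model
dtree = DecisionTreeClassifier(random_state=22)
dtree.fit(X_train, y_train)

# Score on both splits to expose any gap between train and test accuracy
train_acc = accuracy_score(y_train, dtree.predict(X_train))
test_acc = accuracy_score(y_test, dtree.predict(X_test))

print("Train data accuracy:", train_acc)
print("Test data accuracy:", test_acc)
```

A large gap between the two numbers is the overfitting that bagging is meant to reduce.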
Output:
With the current parameters, the base classifier performs reasonably well on the dataset, achieving 82% accuracy on the test set (you may see different results if the random_state parameter is not set).
Now that we have a baseline accuracy for the test dataset, we can compare the performance of the Bagging Classifier with that of a single Decision Tree Classifier.
To do bagging, we need to set the parameter n_estimators, which is the number of base classifiers our model will aggregate.
For this sample dataset the number of estimators is relatively low; much larger ranges are often explored in practice. Hyperparameter tuning is usually done with a grid search, but for now we will use a select set of values for the number of estimators.
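For reference, a grid search over n_estimators might look like the sketch below, using sklearn's GridSearchCV. The candidate values, the 5-fold setting, and the seed are assumptions for illustration; this post does not use this approach in the steps that follow.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_wine(return_X_y=True)

# Hypothetical candidate values for the number of base estimators
param_grid = {"n_estimators": [2, 4, 8, 16]}

# Each candidate is scored with 5-fold cross-validated accuracy
search = GridSearchCV(
    BaggingClassifier(random_state=22), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```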
First, we import the required model.
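In sklearn, the bagging model for classification lives in the ensemble module:

```python
from sklearn.ensemble import BaggingClassifier

# Instantiating with defaults confirms the import works
print(BaggingClassifier().get_params()["n_estimators"])
```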
Let’s establish a range of numbers to indicate the number of estimators we intend to employ in each ensemble.
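The exact values are not given in the text, so the list below is an assumption: even numbers from 2 to 16, chosen so that the range extends past the 14 estimators discussed later.

```python
# Assumed candidate ensemble sizes, one bagging model will be built per value
estimator_range = [2, 4, 6, 8, 10, 12, 14, 16]
print(estimator_range)
```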
To see how the Bagging Classifier performs with different values of n_estimators, we need a way to iterate over the range of values and store the results from each ensemble. To do this, we will write a for loop that stores the models and scores in separate lists for later visualization.
Note: Since the DecisionTreeClassifier is the default value for the base classifier in the BaggingClassifier, we don’t need to set it when we initialize the bagging model.
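The loop described above could be sketched like this. The split ratio, the seed, and the list of estimator counts are assumptions; the base estimator is left at its default, which is a decision tree.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = datasets.load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=22  # assumed values
)

estimator_range = [2, 4, 6, 8, 10, 12, 14, 16]  # assumed values
models = []
scores = []

for n_estimators in estimator_range:
    # Build and fit one bagging ensemble per candidate size
    clf = BaggingClassifier(n_estimators=n_estimators, random_state=22)
    clf.fit(X_train, y_train)

    # Keep the fitted model and its test accuracy for later comparison
    models.append(clf)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

for n, score in zip(estimator_range, scores):
    print(n, round(score, 3))
```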
With the models and scores stored, we can now visualize the improvement in model performance.
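A plotting sketch with matplotlib is shown below. It recomputes the scores so that it runs on its own; the split, seed, estimator counts, and figure styling are all assumptions.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = datasets.load_wine(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=22  # assumed values
)

estimator_range = [2, 4, 6, 8, 10, 12, 14, 16]  # assumed values
scores = []
for n_estimators in estimator_range:
    clf = BaggingClassifier(n_estimators=n_estimators, random_state=22)
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

# Test accuracy as a function of ensemble size
plt.figure(figsize=(9, 6))
plt.plot(estimator_range, scores, marker="o")
plt.xlabel("n_estimators")
plt.ylabel("Test accuracy")
plt.show()
```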
Output:
By iterating through different values for the number of estimators, we can see model accuracy rise from 82.2% to 95.5%. After 14 estimators the accuracy begins to drop; again, the values you see will vary if you set a different random_state. This is why it is best practice to use cross-validation to obtain stable results.
In this instance, we see an improvement of 13.3 percentage points in accuracy when identifying the type of wine.
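The cross-validation recommendation can be sketched with sklearn's cross_val_score, which averages accuracy over several folds instead of relying on a single split. The estimator counts, the 5-fold setting, and the seed below are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = datasets.load_wine(return_X_y=True)

cv_results = {}
for n_estimators in [2, 8, 14]:  # hypothetical subset of ensemble sizes
    clf = BaggingClassifier(n_estimators=n_estimators, random_state=22)
    # 5-fold cross-validation returns one accuracy score per fold
    fold_scores = cross_val_score(clf, X, y, cv=5)
    cv_results[n_estimators] = (fold_scores.mean(), fold_scores.std())

for n, (mean, std) in cv_results.items():
    print(n, round(mean, 3), round(std, 3))
```

Comparing the mean and spread across folds gives a steadier picture of which ensemble size works best than a single train/test split does.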
Now it’s clear how bootstrap aggregation helps improve model performance and stability.
I encourage you to give it a try. If you have any questions, feel free to ask them in the comment section.