Saturday, October 8, 2022

(R) Machine Learning - Bagging, Boosting and Bootstrap Aggregation – Pt. II

Now that you have a fundamental understanding of tree-based modeling, we can begin to discuss the concept of "Bootstrap Aggregation". Both of these concepts serve as building blocks of a separate model known as the random forest, which will be discussed in a subsequent article.

All three of these concepts fall under the heading of machine learning; specifically, supervised machine learning.

"Bagging", is a word play synonym, which serves as a short abbreviation for "Bootstrap Aggregation". Bootstrap aggregation is a term which is utilized to describe a methodology in which multiple randomized observations are drawn from a sample data set. "Boosting" refers to the algorithm which analyzes numerous sample sets which were composed as a result of the previous process. Ultimately, from these sets, numerous decision trees are created. Into which, test data is eventually passed. Each observation within the test data set is analyzed as it passes through the numerous nodes of each individual tree. The results of the predictive output being the consensus of the results reached from a majority of the individual internal models.

How Bagging is Utilized

As previously discussed, "Bagging" is a data sampling methodology. For demonstrative purposes, let's consider its application to a randomized version of the "iris" data frame. Below is a portion of the data frame as it currently exists within the "R" platform.
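If you would like to reproduce this view directly in the console, the base R functions head() and str() shown below will print the first few rows and the column structure of the frame; no additional packages are assumed.

# View the first few rows and the overall structure of the "iris" data frame #

head(iris)

str(iris)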


From this data frame, we could utilize the "bagging" methodology to create numerous subsets which contain observations drawn from the original. The methodology samples from the data frame a pre-determined number of times until a single data subset has been created. Once this task has been completed, the process is repeated until a pre-determined number of subsets have been created. Observations from the initial data frame can be sampled multiple times while building each individual subset. Therefore, each subset may contain multiple instances of the same observation.

A graphical representation of this process is illustrated below:


In the case of our illustrated example, three new data samples were created. Each new sample contains a similar number of observations; however, observations from the original data frame are not exclusive to any one sample. Also, as demonstrated in the above graphic, a data observation can repeat within the same sample.
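As a rough sketch of what this resampling step looks like in code, the base R sample() function can be used to draw row indices with replacement. The subset size of 100 rows below is an arbitrary choice made for illustration, not a value used by any package function.

# Draw row indices with replacement to form a single bootstrap subset of 100 observations #

set.seed(454)

idx <- sample(nrow(iris), size = 100, replace = TRUE)

subset1 <- iris[idx, ]

# Because sampling occurs with replacement, some indices repeat within the subset #

sum(duplicated(idx))

# Repeating this step two further times would yield the three samples shown in the graphic #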

Boosting Described

Once new data samples have been created, the "boosting" process, which is the portion of the algorithm initiated following the application of the "bagging" methodology, begins to create an individual decision tree for each newly created set. Once each decision tree has been created, the model's creation process is complete.
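To give a sense of what this step might look like if performed manually, the sketch below fits one rpart tree per bootstrap subset. This is only an approximation of what happens internally within the bagging() function, and the object names are my own.

# With the package "rpart" downloaded and enabled #

library(rpart)

# Create three bootstrap subsets of the "iris" data frame #

set.seed(454)

subsets <- lapply(1:3, function(i) iris[sample(nrow(iris), size = 100, replace = TRUE), ])

# Fit one classification tree to each subset #

trees <- lapply(subsets, function(d) rpart(Species ~ ., data = d, method = "class"))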

The Decision Making Process

With the model created, the process of predicting dependent variable values can be initiated.

Remember that each decision tree was created from the data observations which comprise its corresponding subset.


The above graphical representation illustrates observation 8 being passed through the model. The model, being comprised of three separate decision trees which were synthesized from three separate data subsets, produces three different internal outcomes. The consensus of these outcomes, the majority vote for a categorical dependent variable or the average for a continuous dependent variable, is what is eventually returned to the user as the ultimate product of the model.
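Continuing the manual sketch from the prior section, the three trees created there can each be asked to classify a single new observation, with the majority vote serving as the final answer. The observation index used below is arbitrary and simply mirrors the graphic.

# Classify a single observation with each of the three trees created above #

newobs <- iris[8, ]

votes <- sapply(trees, function(tr) as.character(predict(tr, newobs, type = "class")))

# The most frequently occurring class is returned as the model's prediction #

names(which.max(table(votes)))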

A Real Application Demonstration (Classification)

Again, we will utilize the "iris" data set, which comes embedded within the R platform.

A short note on the standard notation utilized for this model type:

D = The training data set.

n = The number of observations within the training data set.

n^1 = "n prime". The number of observations within each data subset.

m = The number of subsets.


In this example we will allow the bagging() function to perform its default behavior without specifying any additional options. If n′ = n, then, because sampling occurs with replacement, each subset which is created from the training data set is expected to contain approximately (1 - 1/e), or roughly 63.2%, of the unique observations contained within the training data set.
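The 63.2% figure follows from the probability that any given observation is omitted from a bootstrap sample of equal size, (1 - 1/n)^n, which approaches 1/e as n grows. A quick simulation in base R, shown below, confirms the expectation; the iteration count is arbitrary. As a point of reference, the number of subsets m corresponds to the nbagg argument of the bagging() function, which defaults to 25.

# Estimate the expected proportion of unique observations within a bootstrap sample of size n = 150 #

set.seed(454)

mean(replicate(1000, length(unique(sample(150, size = 150, replace = TRUE))) / 150))

# The returned value should sit near 1 - 1/e, or roughly 0.632 #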

# Create a training data set from the data frame: "iris" #

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows according to the ordering of the random values, thereby shuffling the rows #

raniris <- iris[order(rannum), ]

# With the package "ipred" downloaded and enabled #

# Create the model #

mod <- bagging(Species ~ ., data = raniris[1:100,])

# View model classification results with training data #

prediction <- predict(mod, raniris[1:100,], type="class")

table(raniris[1:100,]$Species, predicted = prediction )

# View model classification results with test data #

prediction <- predict(mod, raniris[101:150,], type="class")

table(raniris[101:150,]$Species, predicted = prediction )


Console Output (1):

            predicted
             setosa versicolor virginica
  setosa         31          0         0
  versicolor      0         35         0
  virginica       0          0        34


Console Output (2):

            predicted
             setosa versicolor virginica
  setosa         19          0         0
  versicolor      0         13         2
  virginica       0          2        14
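If a single summary figure is preferred over the full confusion matrices, overall accuracy can be computed directly from the table object, as sketched below. The calculation simply repeats the test data tabulation and divides the diagonal count by the total; nothing here is specific to the bagging() function.

# Compute overall classification accuracy for the test data #

prediction <- predict(mod, raniris[101:150,], type = "class")

confmat <- table(raniris[101:150,]$Species, predicted = prediction)

sum(diag(confmat)) / sum(confmat)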


A Real Application Demonstration (ANOVA)

In this second example demonstration, all of the notational aspects of the model and the restrictions of the function still apply. However, in this case, the dependent variable is continuous rather than categorical. To test the predictive accuracy of the model, the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) values are calculated. For more information pertaining to the calculation and interpretation of these measurements, please consult the prior article.
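As a brief reminder for readers who do not have the prior article at hand, the RMSE is the square root of the average squared difference between the actual and predicted values. A hand-rolled version, written in the same style as the MAE function defined later in this article, is shown below and can be used to cross-check the output of the Metrics package.

# Create a function to calculate Root Mean Squared Error #

RMSE <- function(actual, predicted) {sqrt(mean((actual - predicted)^2))}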

# Create a training data set from the data frame: "iris" #

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows according to the ordering of the random values, thereby shuffling the rows #

raniris <- iris[order(rannum), ]

# With the package "ipred" downloaded and enabled #

# Create the model #

anmod <- bagging(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova")

# Compute the Root Mean Squared Error (RMSE) of the model's training data #

prediction <- predict(anmod, raniris[1:100,])

# With the package "metrics" downloaded and enabled #

rmse(raniris[1:100,]$Sepal.Length, prediction )

# Compute the Root Mean Squared Error (RMSE) of the model's test data #

prediction <- predict(anmod, raniris[101:150,])

# With the package "metrics" downloaded and enabled #

rmse(raniris[101:150,]$Sepal.Length, prediction )


Console Output (1) - Training Data:

[1] 0.3032058

Console Output (2) - Test Data:

[1] 0.3427076

# Create a function to calculate Mean Absolute Error #

MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Compute the Mean Absolute Error (MAE) of model training data #

anprediction <- predict(anmod, raniris[1:100,])

MAE(raniris[1:100,]$Sepal.Length, anprediction)

# Compute the Mean Absolute Error (MAE) of model test data #

anprediction <- predict(anmod, raniris[101:150,])

MAE(raniris[101:150,]$Sepal.Length, anprediction)


Console Output (1) - Training Data:

[1] 0.2289299

Console Output (2) - Test Data:

[1] 0.2706003

Conclusions

The method from which the bagging() function was derived was initially proposed by Leo Breiman, the same researcher who co-developed the CART tree model methodology. You will likely never be inclined to use this methodology as a standalone method of analysis. As was previously mentioned within this article, the justification for this topic's discussion pertains solely to its applicability as an aspect of the random forest model. Therefore, from a pragmatic standpoint, if tree models are the model type which you wish to utilize when performing data analysis, you would either be inclined to select the basic tree model for its simplicity, or the random forest model for its enhanced predictive ability.

That's all for today.

I'll see you next week,

-RD
