**"Bootstrap Aggregation"**. Both of the previously mentioned concepts will come to serve as compositional aspects of a separate model known as

**"The Random Forest"**. This methodology will be discussed in the subsequent article.

All three of these concepts classify as

**"Machine Learning"**, specifically, supervised machine learning.

**"Bagging",**is a word play synonym, which serves as a short abbreviation for

**"Boot**strap

**Agg**regation

**"**. Bootstrap aggregation is a term which is utilized to describe a methodology in which multiple randomized observations are drawn from a sample data set.

**"Boosting"**refers to the algorithm which analyzes numerous sample sets which were composed as a result of the previous process. Ultimately, from these sets, numerous decision trees are created. Into which, test data is eventually passed. Each observation within the test data set is analyzed as it passes through the numerous nodes of each individual tree. The results of the predictive output being the consensus of the results reached from a majority of the individual internal models.

__How Bagging is Utilized__

As previously discussed, **"Bagging"** is a data sampling methodology. For demonstrative purposes, let's consider its application to a randomized version of the **"iris"** data frame. Here is a portion of the data frame as it currently exists within the "R" platform.

We will apply the **"bagging"** methodology to create numerous subsets which contain aspects of the observations contained therein. This methodology will sample from the data frame, with replacement, a pre-determined number of times until it has created a single data subset. Once this task has been completed, the process will be repeated until a pre-determined number of subsets have been created. Because observations from the initial data frame can be sampled multiple times while building each individual subset, each subset may contain multiple instances of the same observation.

A graphical representation of this process is illustrated below:

__Aggregation Described__

Once the new data samples have been created, the aggregation portion of the algorithm, which is initiated following the **"bagging"** methodology's application, begins to create an individualized decision tree for each newly created set. Once each decision tree has been created, the model's creation process is complete.

__The Decision Making Process__

With the model created, the process of predicting dependent variable values can be initiated. Remember that each decision tree was created from the observations which comprised its corresponding subset.
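The majority-vote mechanism can be sketched in a few lines of Python. The three toy "trees" below are hypothetical stand-ins (each reduced to a single split), not real fitted models:

```python
from collections import Counter

def ensemble_predict(trees, observation):
    """Pass one observation through every tree; the class receiving
    the majority of the votes is the ensemble's prediction."""
    votes = [tree(observation) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three toy "trees", each reduced to a single split on the first feature.
trees = [
    lambda obs: "setosa" if obs[0] < 2.5 else "versicolor",
    lambda obs: "setosa" if obs[0] < 2.0 else "versicolor",
    lambda obs: "virginica",
]
print(ensemble_predict(trees, [1.5]))  # two of the three trees vote "setosa"
```

Because two of the three internal models vote for "setosa", that class becomes the consensus prediction, exactly as described above.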

__A Real Application Demonstration (Classification)__

Again, we will utilize the **"iris"** data set, which comes embedded within the R platform.

A short note on the standard notation utilized for this model type:

**D** = the training data set.

**n** = the number of observations within the training data set.

**n′** = "n prime", the number of observations within each data subset.

**m** = the number of subsets.


In this example we will allow the **bagging()** function from the "ipred" package to perform its default behavior without specifying any additional options. By default, n′ = n: each subset drawn from the training data set is the same size as the training data set itself and is sampled with replacement, so each subset is expected to contain roughly 1 − 1/e (≈63.2%) of the unique observations within the training data set, with the remainder consisting of duplicate draws.
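The ≈63.2% figure can be checked empirically with a quick simulation. The Python snippet below is purely illustrative; the sample size and trial count are arbitrary choices:

```python
import math
import random

random.seed(454)
n, trials = 1000, 200
unique_fractions = []
for _ in range(trials):
    # Draw n observations from n, with replacement (n' = n).
    sample = [random.randrange(n) for _ in range(n)]
    unique_fractions.append(len(set(sample)) / n)

mean_fraction = sum(unique_fractions) / trials
print(round(mean_fraction, 3))          # hovers around 1 - 1/e
print(round(1 - 1 / math.e, 3))         # 0.632
```

The simulated mean fraction of unique observations lands very close to the theoretical value of 1 − 1/e.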

**# Create a training data set from the data frame: "iris" #**

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows by the values in which the random set is ordered #

raniris <- iris[order(rannum), ]

# With the package "ipred" downloaded and enabled #

# Create the model #

mod <- bagging(Species ~ ., data = raniris[1:100,])

# View model classification results with training data #

prediction <- predict(mod, raniris[1:100,], type="class")

table(raniris[1:100,]$Species, predicted = prediction )

# View model classification results with test data #

prediction <- predict(mod, raniris[101:150,], type="class")

table(raniris[101:150,]$Species, predicted = prediction )


__Console Output (1):__

*predicted*

setosa versicolor virginica

setosa 31 0 0

versicolor 0 35 0

virginica 0 0 34


__Console Output (2):__

*predicted*

setosa versicolor virginica

setosa 19 0 0

versicolor 0 13 2

virginica 0 2 14

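From the test-set confusion matrix above, an overall accuracy figure can be computed by hand: the correct classifications sit on the diagonal. A short Python check of that arithmetic:

```python
# Test-set confusion matrix from Console Output (2):
# rows = actual species, columns = predicted species
# (order: setosa, versicolor, virginica).
confusion = [
    [19, 0, 0],
    [0, 13, 2],
    [0, 2, 14],
]

correct = sum(confusion[i][i] for i in range(3))   # diagonal entries
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(correct, total, accuracy)                    # 46 50 0.92
```

Of the 50 test observations, 46 were classified correctly, an accuracy of 92%, with the only confusion occurring between versicolor and virginica.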

__A Real Application Demonstration (ANOVA)__

In this second example demonstration, all of the notational aspects of the model and the restrictions of the function still apply. However, in this case, the dependent variable is continuous rather than categorical. To test the predictive ability of the model, the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) values are calculated. For more information pertaining to the calculation and interpretation of these measurements, please consult the prior article.
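For reference, the two error metrics can be expressed in a few lines. The Python sketch below uses made-up sepal-length values for illustration; the R code that follows computes RMSE with the Metrics package instead:

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual    = [5.1, 4.9, 6.3, 5.8]   # made-up sepal lengths
predicted = [5.0, 5.1, 6.0, 5.9]
print(round(rmse(actual, predicted), 4))   # 0.1936
print(round(mae(actual, predicted), 4))    # 0.175
```

Because RMSE squares each difference before averaging, it penalizes large errors more heavily than MAE does, which is why the two figures differ even on the same predictions.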

**# Create a training data set from the data frame: "iris" #**

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the dataframe rows by the values in which the random set is ordered #

raniris <- iris[order(rannum), ]

# With the package "ipred" downloaded and enabled #

# Create the model #

anmod <- bagging(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova")

# Compute the Root Mean Squared Error (RMSE) of model training data #

prediction <- predict(anmod, raniris[1:100,])

# With the package "Metrics" downloaded and enabled #

rmse(raniris[1:100,]$Sepal.Length, prediction)

# Compute the Root Mean Squared Error (RMSE) of model test data #

prediction <- predict(anmod, raniris[101:150,])

# With the package "Metrics" downloaded and enabled #

rmse(raniris[101:150,]$Sepal.Length, prediction)


__Console Output (1) - Training Data:__

*[1] 0.3032058*

__Console Output (2) - Test Data:__

*[1] 0.3427076*

**# Create a function to calculate Mean Absolute Error #**

MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Compute the Mean Absolute Error (MAE) of model training data #

anprediction <- predict(anmod, raniris[1:100,])

MAE(raniris[1:100,]$Sepal.Length, anprediction)

# Compute the Mean Absolute Error (MAE) of model test data #

anprediction <- predict(anmod, raniris[101:150,])

MAE(raniris[101:150,]$Sepal.Length, anprediction)


__Console Output (1) - Training Data:__

*[1] 0.2289299*

__Console Output (2) - Test Data:__

*[1] 0.2706003*

__Conclusions__

The method from which the **bagging()** function was derived was initially postulated by Leo Breiman, the same individual who created the tree model methodology. You will likely never be inclined to use bagging as a standalone method of analysis. As was previously mentioned within this article, the justification for this topic's discussion pertains solely to its applicability as an aspect of the random forest model. Therefore, from a pragmatic standpoint, if tree models are the model type which you wish to utilize when performing data analysis, you would either select the basic tree model for its simplicity, or the random forest model for its enhanced predictive ability.

That's all for today.

I'll see you next week,

-RD
