**“gradient boosted algorithms”**.

Gradient boosted models are similar to random forest models; the primary difference between the two is how the individual trees are synthesized. Whereas a random forest seeks to minimize errors through a randomization process, a gradient boosted model builds its trees sequentially: each tree is re-assessed after its creation, and the subsequent tree is optimized to correct the prior tree’s errors.
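The sequential correction described above can be sketched in a few lines of R. The example below is a toy illustration only, not the actual **“gbm”** internals: each boosting round fits a crude depth-1 “stump” to the current residuals and adds a shrunken copy of that correction to the ensemble.

```r
# Toy sketch of boosting (illustrative only, not the "gbm" implementation):
# each round fits a depth-1 stump to the residuals of the ensemble so far.
set.seed(454)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)

shrinkage <- 0.1
pred <- rep(mean(y), length(y))              # round 0: predict the mean
for (round in 1:100) {
  resid <- y - pred                          # errors of the current ensemble
  best_sse <- Inf
  for (s in quantile(x, probs = seq(0.1, 0.9, 0.1))) {
    left <- x <= s                           # candidate depth-1 split point
    fit  <- ifelse(left, mean(resid[left]), mean(resid[!left]))
    sse  <- sum((resid - fit) ^ 2)
    if (sse < best_sse) { best_sse <- sse; stump <- fit }
  }
  pred <- pred + shrinkage * stump           # add a damped correction
}
mean((y - pred) ^ 2)                         # training MSE falls as rounds accrue
```

Note how the shrinkage value damps each correction, which is why a smaller learning rate typically demands more trees.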

__Model Creation Options__

Because the gradient boosted algorithm incorporates components of all of the previously discussed model methodologies, it exposes a greater number of options than its predecessors. These options can remain at their default assignments, in which case they will assume predetermined values appropriate to the surrounding circumstances. However, if you would like to customize the model’s synthesis, the following options are available:

__distribution__ – This option refers to the distribution type which the model will assume when analyzing the data utilized within the model design process. The following distribution types are available within the **“gbm”** package: **“gaussian”**, **“laplace”**, **“tdist”**, **“bernoulli”**, **“huberized”**, **“adaboost”**, **“poisson”**, **“coxph”**, **“quantile”** and **“pairwise”**. If this option is not explicitly indicated by the user, the system will automatically decide between **“gaussian”** and **“bernoulli”**, as to which distribution type best suits the model data.

__n.minobsinnode__ – Integer specifying the minimum number of observations in the terminal nodes of the trees.

__n.trees__ – The number of trees which will be utilized to create the final model.

__interaction.depth__ – Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1.

__cv.folds__ – Specifies the number of cross-validation folds to perform. This option essentially provides additional model output in the form of additional testing results. Similar output is generated by default within the random forest model package.

__shrinkage__ – A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually works, but a smaller learning rate typically requires more trees. Default is 0.1.
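Taken together, the options above correspond to arguments of the gbm() function. The call below is a sketch only, using the built-in **“iris”** data and spelling out the package defaults explicitly for illustration:

```r
# Requires the "gbm" package to be downloaded and enabled.
library(gbm)

# A sketch of a gbm() call with each of the options above written out
# explicitly. The values shown here are simply the package defaults.
fit <- gbm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
           data = iris,
           distribution = "gaussian",   # continuous, normally distributed outcome
           n.trees = 100,               # number of trees in the final model
           interaction.depth = 1,       # depth-1 trees, i.e. an additive model
           n.minobsinnode = 10,         # minimum observations per terminal node
           shrinkage = 0.1,             # learning rate
           cv.folds = 0)                # no cross-validation folds
```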

__Optimizing a Model with the “CARET” Package__

For the everyday analyst, appropriately assigning values to the aforementioned fields can be disconcerting, especially with the understanding that an incorrectly assigned field can vastly compromise the validity of a model’s results. Thankfully, the **“CARET”** package exists to assist us with our model optimization needs. **“CARET”** is an auxiliary package with numerous uses; primary among them is a function which can be utilized to assess model optimization prior to synthesis. In the case of our example, we will be utilizing the following packages to demonstrate this capability:

**# With the “CARET” package downloaded and enabled #**

# With the “e1071” package downloaded and enabled #

With the above packages downloaded and enabled, we can run the following **“CARET”** function to generate console output pertaining to the various model types which **“CARET”** can be utilized to optimize:

**# List different models which train() function can optimize #**

names(getModelInfo())

The console output is too voluminous to present in its entirety within this article. However, a few notable options warrant mentioning, as they pertain to previously discussed methodologies:

rf – Which refers to the random forest model.

treebag – Which refers to the bootstrap aggregation model.

glm – Which refers to the generalized linear model.

(and)

gbm – Which refers to the gradient boosted model.

Let’s start by regenerating the randomized sets of observations drawn from our favorite **“iris”** data set.

**# Create a training data set from the data frame: "iris" #**

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows by the values in which the random set is ordered #

raniris <- iris[order(rannum), ]

# Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. #

train(Species~.,data=raniris[1:100,], method = "gbm")

This produces a voluminous amount of console output; however, the primary portion of the output which we will focus upon is the bottom-most section.

This output should resemble something similar to:

*Tuning parameter 'shrinkage' was held constant at a value of 0.1*

Tuning parameter 'n.minobsinnode' was held constant at a value of 10

Accuracy was used to select the optimal model using the largest value.

The final values used for the model were n.trees = 50, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

From this information, we discover the optimal parameters with which to establish a gradient boosted model.

In this particular case:

n.trees = 50

interaction.depth = 2

shrinkage = 0.1

n.minobsinnode = 10

**A Real Application Demonstration (Classification)**

With the optimal parameters discerned, we may continue with the model building process. The model created for this example is of the classification type; for a classification model, the **“multinomial”** distribution option should be specified.

**# Create Model #**

model <- gbm(Species ~., data = raniris[1:100,], distribution = 'multinomial', n.trees = 50, interaction.depth = 2, shrinkage = 0.1, n.minobsinnode = 10)

# Test Model #

modelprediction <- predict(model, n.trees = 50, newdata = raniris[101:150,] , type = 'response')

# View Results #

modelprediction0 <- apply(modelprediction, 1, which.max)

# View Results in a readable format #

modelprediction0 <- colnames(modelprediction)[modelprediction0]

# Create Confusion Matrix #

table(raniris[101:150,]$Species, predicted = modelprediction0)

__Console Output:__

predicted

setosa versicolor virginica

setosa 19 0 0

versicolor 0 13 2

virginica 0 2 14

**A Real Application Demonstration (Continuous Dependent Variable)**

As was the case with the previous example, we will again be utilizing the **train()** function within the **“CARET”** package to determine model optimization. As it pertains to continuous dependent variables, the **“gaussian”** option should be specified if the data is normally distributed, and the **“tdist”** option should be specified if the data is non-parametric.

**# Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. #**

model <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", method = "gbm")

model

__Console Output:__

Stochastic Gradient Boosting

100 samples

3 predictor

No pre-processing

Resampling: Bootstrapped (25 reps)

Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...

Resampling results across tuning parameters:

*interaction.depth n.trees RMSE Rsquared MAE*

1 50 0.4256570 0.7506086 0.3316030

1 100 0.4083072 0.7623251 0.3258838

1 150 0.4067113 0.7607363 0.3270202

2 50 0.4241599 0.7471639 0.3347628

2 100 0.4184793 0.7466858 0.3335772

2 150 0.4212821 0.7427328 0.3369379

3 50 0.4248178 0.7433384 0.3345428

3 100 0.4260524 0.7391382 0.3385778

3 150 0.4278416 0.7345970 0.3398392

*Tuning parameter 'shrinkage' was held constant at a value of 0.1*

Tuning parameter 'n.minobsinnode' was held constant at a value of 10

RMSE was used to select the optimal model using the smallest value.

The final values used for the model were n.trees = 150, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.

**# Optimal Model Parameters #**

# n.trees = 150 #

# interaction.depth = 1 #

# shrinkage = 0.1 #

# n.minobsinnode = 10 #

# Create Model #

tmodel <- gbm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", n.trees = 150, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10)

# Test Model #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response')

# Compute the Root Mean Squared Error (RMSE) of model testing data #

# With the package "Metrics" downloaded and enabled #

rmse(raniris[101:150,]$Sepal.Length, tmodelprediction)

# Compute the Root Mean Squared Error (RMSE) of model training data #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response')

# With the package "Metrics" downloaded and enabled #

rmse(raniris[1:100,]$Sepal.Length, tmodelprediction)

__Console Output:__

*[1] 0.4060854*

[1] 0.3144518

**# Mean Absolute Error #**

# Create MAE function #

MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA #

# Utilize MAE function on model testing data #

# Regenerate Model #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response')

# Generate Output #

MAE(raniris[101:150,]$Sepal.Length, tmodelprediction)

# Utilize MAE function on model training data #

# Regenerate Model #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response')

# Generate Output #

MAE(raniris[1:100,]$Sepal.Length, tmodelprediction)

__Console Output:__

*[1] 0.3320722*

[1] 0.2563723

__Graphing and Interpreting Output__

The following method creates output which quantifies the importance of each variable within the model. The type of analysis which determines the variable importance depends on the model type specified within the initial function. For each model, the code samples below produce the subsequent outputs:

**# Multinomial Model #**

summary(model)

__Console Output:__

*var rel.inf*

Petal.Length Petal.Length 59.0666833

Petal.Width Petal.Width 38.6911265

Sepal.Width Sepal.Width 2.1148704

Sepal.Length Sepal.Length 0.1273199

**#######################################**

# T-Distribution Model #

summary(tmodel)

__Console Output:__

var rel.inf

Petal.Length Petal.Length 74.11473

Sepal.Width Sepal.Width 14.18743

Petal.Width Petal.Width 11.69784

That's all for now.

I'll see you next time, Data Heads!

-RD
