**“gradient boosted algorithms”**.

Gradient boosted models are similar to random forest models; the primary difference between the two is how the individual trees are synthesized. Whereas a random forest seeks to minimize errors through a randomization process, a gradient boosted model builds its trees sequentially: each tree is re-assessed after its creation, and the subsequent tree is optimized to correct the prior tree’s errors.
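The sequential correction described above can be sketched in a few lines of R. The example below is a toy illustration only, not the actual **“gbm”** internals: each boosting round fits a crude depth-1 “stump” to the current residuals and adds a shrunken copy of that correction to the ensemble.

```r
# Toy sketch of boosting (illustrative only, not the "gbm" implementation):
# each round fits a depth-1 stump to the residuals of the ensemble so far.
set.seed(454)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)

shrinkage <- 0.1
pred <- rep(mean(y), length(y))              # round 0: predict the mean
for (round in 1:100) {
  resid <- y - pred                          # errors of the current ensemble
  best_sse <- Inf
  for (s in quantile(x, probs = seq(0.1, 0.9, 0.1))) {
    left <- x <= s                           # candidate depth-1 split point
    fit  <- ifelse(left, mean(resid[left]), mean(resid[!left]))
    sse  <- sum((resid - fit) ^ 2)
    if (sse < best_sse) { best_sse <- sse; stump <- fit }
  }
  pred <- pred + shrinkage * stump           # add a damped correction
}
mean((y - pred) ^ 2)                         # training MSE falls as rounds accrue
```

Note how the shrinkage value damps each correction, which is why a smaller learning rate typically demands more trees.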

__Model Creation Options__

Because the gradient boosted algorithm incorporates components of all of the previously discussed model methodologies, it exposes a greater number of options than its predecessors. These options can remain at their default assignments, in which case they will assume predetermined values appropriate to the surrounding circumstances. However, if you would like to customize the model’s synthesis, the following options are available:

__distribution__ – This option refers to the distribution type which the model will assume when analyzing the data utilized within the model design process. The following distribution types are available within the **“gbm”** package: **“gaussian”**, **“laplace”**, **“tdist”**, **“bernoulli”**, **“huberized”**, **“adaboost”**, **“poisson”**, **“coxph”**, **“quantile”** and **“pairwise”**. If this option is not explicitly indicated by the user, the system will automatically decide between **“gaussian”** and **“bernoulli”**, as to which distribution type best suits the model data.

__n.minobsinnode__ – Integer specifying the minimum number of observations in the terminal nodes of the trees.

__n.trees__ – The number of trees which will be utilized to create the final model.

__interaction.depth__ – Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 1.

__cv.folds__ – Specifies the number of cross-validation folds to perform. This option essentially provides additional model output in the form of additional testing results. Similar output is generated by default within the random forest model package.

__shrinkage__ – A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually works, but a smaller learning rate typically requires more trees. Default is 0.1.
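Taken together, the options above correspond to arguments of the gbm() function. The call below is a sketch only, using the built-in **“iris”** data and spelling out the package defaults explicitly for illustration:

```r
# Requires the "gbm" package to be downloaded and enabled.
library(gbm)

# A sketch of a gbm() call with each of the options above written out
# explicitly. The values shown here are simply the package defaults.
fit <- gbm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
           data = iris,
           distribution = "gaussian",   # continuous, normally distributed outcome
           n.trees = 100,               # number of trees in the final model
           interaction.depth = 1,       # depth-1 trees, i.e. an additive model
           n.minobsinnode = 10,         # minimum observations per terminal node
           shrinkage = 0.1,             # learning rate
           cv.folds = 0)                # no cross-validation folds
```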

__Optimizing a Model with the “CARET” Package__

For the everyday analyst, appropriately assigning values to the aforementioned fields can be disconcerting, especially with the understanding that an incorrectly assigned field can vastly compromise the validity of a model’s results. Thankfully, the **“CARET”** package exists to assist us with our model optimization needs. **“CARET”** is an auxiliary package with numerous uses; primary among them is a function which can be utilized to assess model optimization prior to synthesis. In the case of our example, we will be utilizing the following packages to demonstrate this capability:

**# With the “CARET” package downloaded and enabled #**

# With the “e1071” package downloaded and enabled #

With the above packages downloaded and enabled, we can run the following **“CARET”** function to generate console output pertaining to the various model types which **“CARET”** can be utilized to optimize:

**# List different models which train() function can optimize #**

names(getModelInfo())

The console output is too voluminous to present in its entirety within this article. However, a few notable options warrant mentioning, as they pertain to previously discussed methodologies:

rf – Which refers to the random forest model.

treebag – Which refers to the bootstrap aggregation model.

glm – Which refers to the generalized linear model.

(and)

gbm – Which refers to the gradient boosted model.

Let’s start by regenerating the randomized sets of observations drawn from our favorite **“iris”** data set.

**# Create a training data set from the data frame: "iris" #**

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows by the values in which the random set is ordered #

raniris <- iris[order(rannum), ]

# Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. #

train(Species~.,data=raniris[1:100,], method = "gbm")

This produces a voluminous amount of console output; however, the primary portion of the output which we will focus upon is the bottom-most section.

This output should resemble something similar to:

*Tuning parameter 'shrinkage' was held constant at a value of 0.1*

Tuning parameter 'n.minobsinnode' was held constant at a value of 10

Accuracy was used to select the optimal model using the largest value.

The final values used for the model were n.trees = 50, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

From this information, we discover the optimal parameters with which to establish a gradient boosted model.

In this particular case:

n.trees = 50

interaction.depth = 2

shrinkage = 0.1

n.minobsinnode = 10

**A Real Application Demonstration (Classification)**

With the optimal parameters discerned, we may continue with the model building process. The model created for this example is of the classification type; for a classification model, the **“multinomial”** distribution option should be specified.

**# Create Model #**

model <- gbm(Species ~., data = raniris[1:100,], distribution = 'multinomial', n.trees = 50, interaction.depth = 2, shrinkage = 0.1, n.minobsinnode = 10)

# Test Model #

modelprediction <- predict(model, n.trees = 50, newdata = raniris[101:150,] , type = 'response')

# View Results #

modelprediction0 <- apply(modelprediction, 1, which.max)

# View Results in a readable format #

modelprediction0 <- colnames(modelprediction)[modelprediction0]

# Create Confusion Matrix #

table(raniris[101:150,]$Species, predicted = modelprediction0)

__Console Output:__

predicted

setosa versicolor virginica

setosa 19 0 0

versicolor 0 13 2

virginica 0 2 14

**A Real Application Demonstration (Continuous Dependent Variable)**

As was the case with the previous example, we will again be utilizing the **train()** function within the **“CARET”** package to determine model optimization. As it pertains to continuous dependent variables, the **“gaussian”** option should be specified if the data is normally distributed, and the **“tdist”** option should be specified if the data is non-parametric.

**# Optimize model parameters for a gradient boosted model through the utilization of the train() function. The train() function is a native command contained within the “CARET” package. #**

model <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", method = "gbm")

model

__Console Output:__

Stochastic Gradient Boosting

100 samples

3 predictor

No pre-processing

Resampling: Bootstrapped (25 reps)

Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...

Resampling results across tuning parameters:

*interaction.depth n.trees RMSE Rsquared MAE*

1 50 0.4256570 0.7506086 0.3316030

1 100 0.4083072 0.7623251 0.3258838

1 150 0.4067113 0.7607363 0.3270202

2 50 0.4241599 0.7471639 0.3347628

2 100 0.4184793 0.7466858 0.3335772

2 150 0.4212821 0.7427328 0.3369379

3 50 0.4248178 0.7433384 0.3345428

3 100 0.4260524 0.7391382 0.3385778

3 150 0.4278416 0.7345970 0.3398392

*Tuning parameter 'shrinkage' was held constant at a value of 0.1*

Tuning parameter 'n.minobsinnode' was held constant at a value of 10

RMSE was used to select the optimal model using the smallest value.

The final values used for the model were n.trees = 150, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.

**# Optimal Model Parameters #**

# n.trees = 150 #

# interaction.depth = 1 #

# shrinkage = 0.1 #

# n.minobsinnode = 10 #

# Create Model #

tmodel <- gbm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], distribution="tdist", n.trees = 150, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10)

# Test Model #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response')

# Compute the Root Mean Squared Error (RMSE) of model testing data #

# With the package "Metrics" downloaded and enabled #

rmse(raniris[101:150,]$Sepal.Length, tmodelprediction)

# Compute the Root Mean Squared Error (RMSE) of model training data #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response')

# With the package "Metrics" downloaded and enabled #

rmse(raniris[1:100,]$Sepal.Length, tmodelprediction)

__Console Output:__

*[1] 0.4060854*

[1] 0.3144518

**# Mean Absolute Error #**

# Create MAE function #

MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA #

# Utilize MAE function on model testing data #

# Regenerate Model #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[101:150,] , type = 'response')

# Generate Output #

MAE(raniris[101:150,]$Sepal.Length, tmodelprediction)

# Utilize MAE function on model training data #

# Regenerate Model #

tmodelprediction <- predict(tmodel, n.trees = 150, newdata = raniris[1:100,] , type = 'response')

# Generate Output #

MAE(raniris[1:100,]$Sepal.Length, tmodelprediction)

__Console Output:__

*[1] 0.3320722*

[1] 0.2563723

__Graphing and Interpreting Output__

The following method creates output which quantifies the importance of each variable within the model. The type of analysis which determines the variable importance depends on the model type specified within the initial function. For each model, the code samples below produce the subsequent outputs:

**# Multinomial Model #**

summary(model)

__Console Output:__

*var rel.inf*

Petal.Length Petal.Length 59.0666833

Petal.Width Petal.Width 38.6911265

Sepal.Width Sepal.Width 2.1148704

Sepal.Length Sepal.Length 0.1273199

**#######################################**

# T-Distribution Model #

summary(tmodel)

__Console Output:__

var rel.inf

Petal.Length Petal.Length 74.11473

Sepal.Width Sepal.Width 14.18743

Petal.Width Petal.Width 11.69784

That's all for now.

I'll see you next time, Data Heads!

-RD
