Sunday, September 25, 2022

(R) Machine Learning - Trees - Pt. I

This article will serve as the first of many discussing the topic of Machine Learning. Throughout the series of subsequent articles published on this site, we will discuss Machine Learning as a subject, along with the theories and algorithms which ultimately serve as its foundation.

While I do not personally consider the equations embedded within the "rpart" package to be machine learning in the literal sense, those who act as authorities on such matters define it otherwise. By the definition adopted by the greater community, tree-based models represent an aspect of machine learning known as "supervised learning". What this essentially means is that the computer software implements a statistical solution to an evidence-based question posed by the user. Afterwards, the user has the opportunity to review the solution and the rationale, and make model edits where necessary.

The functionality implemented within tree-based models is often drawn from an abstract or white paper written by mathematicians. In many cases, the algorithms which ultimately animate the decision-making process are too difficult, or too cumbersome, for a human being to apply by hand. This does not mean that such undertakings are impossible. However, given that the time commitment depends on the size of the data frame being analyzed, the more pragmatic approach is to leave the process entirely to the machines which are designed to perform such functions.

Introducing Tree-Based Models with "rpart"

Like the K-Means Cluster, "rpart" relies on an underlying algorithm which, due to its complexity, produces results that are difficult to verify. Unlike a process such as categorical regression, much of the mathematics occurs outside the observation of the user. Due to the nature of the analysis, no equation is output for the user to check, only the model itself. Without this proof of concept, the user can only assume that the analysis was appropriately performed, and that the model produced is the optimal variation for future application.

For the examples included within this article, we will be using the R data set "iris".

Preparing for Analysis

Before we begin, you will need to download two separate auxiliary packages from the CRAN repository, those being:

"rpart"

and

"rpart.plot"

Once you have completed this task, we will move forward by reviewing the data set prior to analysis.

This can be achieved by initiating the following functions:

summary(iris)

head(iris)


Since the data frame is initially sorted and organized by "Species", prior to performing the analysis, we must take steps to randomize the data contained within the data frame.

Justification for Randomization

Presenting data to a machine which performs analysis through the utilization of an algorithm is somewhat analogous to teaching a young child. To better illustrate this concept, I will present a demonstrative scenario.

Let's imagine, that for some particular reason, you were attempting to instruct a very young child on the topic of dogs, and to accomplish such, you presented the child with a series of pictures which consisted of only golden Labradors. As you might imagine, the child would walk away from the exercise with the notion that dogs, as an object, always consisted of the features associated with the Labradors of the golden variety. Instead of believing that a dog is a generalized descriptor which encompasses numerous minute and discretely defined features, the child will believe that all dogs are golden Labradors, and that golden Labradors, are the only type of dog.

Machines learn* in a similar manner. Each algorithm provides its own distinct methodology as it pertains to the overall outcome of the analysis. However, the typical algorithm possesses a bias, in a similar manner to the way in which humans possess such, based solely on the data as it is initially presented. This is why randomization of the data, which instead presents a diverse and robust summary of the data source, is so integral to the process.

This method of randomization was inspired by the YouTube user: Jalayer Academy. A link to the video which describes this randomization technique can be found below.

* - or the algorithm that is associated with the application which creates the appearance of such.

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows by the ordering of the random values #

raniris <- iris[order(rannum), ]


Training Data and The "rpart" Algorithm

Before we apply the algorithm within the "rpart" package, there are two separate topics which I wish to discuss.

The "rpart" algorithm, as was previously mentioned, is one of many machine learning methodologies which can be utilized to analyze data. The differentiating factor which separates methodologies is typically the underlying algorithm which is applied to the initial data frame. In the case of "rpart", the methodology utilized was initially postulated by Breiman, Friedman, Olshen and Stone in:

Classification and Regression Trees

On the topic of training data, let us again return to our earlier example of teaching a child. When teaching a child with the picture-card method discussed prior, you may be inclined to set a few of the cards which you have designed aside. The reason for such is that these cards can be utilized after the initial training in order to test the child's comprehension of the subject matter.

Most machines are trained in a similar manner. A portion of the initial data frame is typically set aside in order to test the overall strength of the model after the model's synthesis is complete. After passing this reserved data through the model, a rough conclusion can be drawn as to the overall effectiveness of the model's design.
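The split described above can be sketched in base R. Assuming the randomized "raniris" data frame created in the previous section, the division below mirrors the 100/50 split used throughout this article:

```r
# Reserve the first 100 randomized rows for model training
trainingrows <- raniris[1:100, ]

# Set aside the remaining 50 rows to later test the model
testingrows <- raniris[101:150, ]

nrow(trainingrows)  # 100
nrow(testingrows)   # 50
```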

Method of Application (categorical variable)

As is the case with linear regression, we must designate the dependent variable that we wish to predict. If the variable is a categorical variable, we will specify the rpart() function to include a method option of "class". If the variable is a continuous variable, we will specify the rpart() function to include a method option of "anova".

In this first case, we will attempt to create a model which, through the assessment of the independent variables, can properly predict the species variable.

The structure of the rpart() function is incredibly similar to the linear model function which is native within R.

model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="class")

Let's break this structure down:

Species - Is the model's dependent variable.

Sepal.Length + Sepal.Width + Petal.Length + Petal.Width - Are the model's independent variables.

data = raniris[1:100,] - This option specifies the data which will be included within the analysis. As we discussed previously, for the purposes of our model, only the first 100 rows of the randomized data frame will be included as the foundation upon which to structure the model.

method = "class" - This option indicates to the computer that the dependent variable is categorical and not continuous.

After running the above function, we are left with a newly created variable: "model".

Conclusions

From this variable we can draw various conclusions.

Running the variable: "model" within the terminal should produce the following console output:

n= 100

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000)

2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *

3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)

6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) *

7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *


Let's break this structure down:

Structure Summary

n = 100 - This is the initial number of observations passed into the model.

Logic of the nodal split – Example: Petal.Length>=2.45

Total Observations Included within node - Example: 69

Observations which were incorrectly designated - Example: 34

Nodal Designation – Example: versicolor

Percentage of categorical observations occupying each category – Example: (0.00000000 0.50724638 0.49275362)

The Structure Itself

root 100 65 versicolor (0.31000000 0.35000000 0.34000000) - The root contains the initial 100 observations which are fed through the tree model, hence the term root. The second number (65) is the count of observations which do not belong to the node's designated class (versicolor). The numbers found within the parentheses are the proportional breakdowns of the observations by category.

Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) * - The first split, which filters model data between two branches. This branch sorts data to the left leaf, in which all 31 of the observations are setosa (100%). The condition which determines the discrimination of data is the observation's Petal.Length variable value (< 2.45). The (*) symbol indicates that the node is a terminal node, meaning that it is a leaf.

Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362) - This branch indicates the right-sided alternative to the prior condition. The first number indicates the number of cases which remain prior to further sorting (69), and the subsequent number indicates the number of cases which are virginica, and not versicolor (34). The set of numbers in parentheses indicates the proportion of the remaining 69 cases which are versicolor (51%), and the proportion which are virginica (49%).

Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * - This branch indicates a left split. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, 35 of the observations are versicolor (95%) and 2 of the observations are virginica (5%).

Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) * - This branch indicates the right split alternative. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, all 32 of the observations are virginica (100%), and 0 of the observations are versicolor (0%).
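These counts can be verified directly against the training rows by applying the first split's condition by hand. A minimal sketch, assuming the randomized "raniris" data frame created earlier:

```r
# Tally the species of the training rows on each side of the first split
trainrows <- raniris[1:100, ]

table(trainrows$Species[trainrows$Petal.Length < 2.45])
# Expect: 31 setosa, 0 versicolor, 0 virginica

table(trainrows$Species[trainrows$Petal.Length >= 2.45])
# Expect: 0 setosa, 35 versicolor, 34 virginica
```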

Further information, for inference, can be generated by running the following code within the terminal:

summary(model)

This produces the following console output:


(I have created annotations beneath each relevant portion of output)

Call:

rpart(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length +

Petal.Width, data = raniris[1:100, ], method = "class")

n= 100

CP nsplit rel error xerror xstd

1 0.4846154 0 1.00000000 1.26153846 0.05910576

2 0.0100000 2 0.03076923 0.04615385 0.02624419

This portion of the output will be useful as we explore the process of "pruning" later in the article. 


Variable importance

Petal.Width Petal.Length Sepal.Length Sepal.Width

35 31 20 14

Node number 1: 100 observations, complexity param=0.4846154

predicted class=versicolor expected loss=0.65 P(node) =1

class counts: 31 35 34

probabilities: 0.310 0.350 0.340

left son=2 (31 obs) right son=3 (69 obs)

Primary splits:

Petal.Length < 2.45 to the left, improve=32.08725, (0 missing)

Petal.Width < 0.8 to the left, improve=32.08725, (0 missing)

Sepal.Length < 5.55 to the left, improve=18.52595, (0 missing)

Sepal.Width < 3.05 to the right, improve=12.67416, (0 missing)

Surrogate splits:

Petal.Width < 0.8 to the left, agree=1.00, adj=1.000, (0 split)

Sepal.Length < 5.45 to the left, agree=0.89, adj=0.645, (0 split)

Sepal.Width < 3.35 to the right, agree=0.83, adj=0.452, (0 split)



The initial split from the root.



Node number 2: 31 observations

predicted class=setosa expected loss=0 P(node) =0.31

class counts: 31 0 0

probabilities: 1.000 0.000 0.000





Filtered results which exist within the "setosa" leaf.




Node number 3: 69 observations, complexity param=0.4846154

predicted class=versicolor expected loss=0.4927536 P(node) =0.69

class counts: 0 35 34

probabilities: 0.000 0.507 0.493

left son=6 (37 obs) right son=7 (32 obs)





The results of the aforementioned split prior to being filtered through the petal width conditional.




Primary splits:

Petal.Width < 1.65 to the left, improve=30.708970, (0 missing)

Petal.Length < 4.75 to the left, improve=25.420120, (0 missing)

Sepal.Length < 6.35 to the left, improve= 7.401845, (0 missing)

Sepal.Width < 2.95 to the left, improve= 3.878961, (0 missing)

Surrogate splits:

Petal.Length < 4.75 to the left, agree=0.899, adj=0.781, (0 split)

Sepal.Length < 6.15 to the left, agree=0.754, adj=0.469, (0 split)

Sepal.Width < 2.95 to the left, agree=0.696, adj=0.344, (0 split)


Node number 6: 37 observations

predicted class=versicolor expected loss=0.05405405 P(node) =0.37

class counts: 0 35 2

probabilities: 0.000 0.946 0.054





Filtered results which exist within the "versicolor" leaf.




Node number 7: 32 observations

predicted class=virginica expected loss=0 P(node) =0.32

class counts: 0 0 32

probabilities: 0.000 0.000 1.000





Filtered results which exist within the "virginica" leaf.




Visualizing Output with a Much Needed Illustration

If you got lost somewhere along the way during the prior section, don't be ashamed, it is understandable. I am not in any way operating under the pretense that any of this is intuitive or easily grasped.

However, much of what I attempted to explain in the preceding paragraphs can best be summarized through the utilization of the "rpart.plot" package.

# Model Illustration Code #

rpart.plot(model, type = 3, extra = 101)


Console Output:



What is being illustrated in the graphic are the decision branches, and the leaves which ultimately serve as the destinations for the final categorical filtering process.

The leaf "setosa" contains 31 observations which were correctly identified as "setosa". This accounts for 31% of the total observational rows which were passed through the model.

The leaf "versicolor" contains 35 observations which were correctly identified as "versicolor", and 2 observations which were misidentified. The misidentified observations would instead belong within the "virginica" categorical leaf. The total number of observations contained within the "versicolor" leaf, both correct and incorrect, accounts for 37% of the observational rows which were passed through the model.

The leaf "virginica" contains 32 observations which were correctly identified as "virginica". This accounts for 32% of the total observational rows which were passed through the model.

Testing the Model

Now that our decision tree model has been built, let's test its predictive ability with the data which we left absent from our initial analysis.

# Create "confusion matrix" to test model accuracy #

prediction <- predict(model, raniris[101:150,], type="class")


table(raniris[101:150,]$Species, predicted = prediction)

A variable named "prediction" is created through the utilization of the predict() function. Passed to this function as options are: the model variable, the remaining rows of the randomized "iris" data frame, and the prediction type.

Next, a table is created which illustrates the differences between what the model predicted and what each observation actually is. The option "predicted = " will always equal your prediction variable. The numbers within the brackets [101:150, ] specify the rows of the randomized data frame which will act as test observations for the model. "raniris" is the data frame from which these observations will be drawn, and "$Species" specifies the data frame variable which will be assessed.

The result of initiating the above lines of code produces the following console output:

            predicted
             setosa versicolor virginica
  setosa         19          0         0
  versicolor      0         13         2
  virginica       0          2        14


This output table is known as a "confusion matrix". Its purpose is to sort the output into a readable format which illustrates the number of correctly predicted outcomes, and the number of incorrectly predicted outcomes, within each category. In this particular case, all setosa observations were correctly predicted. 13 versicolor observations were correctly predicted, with 2 observations misattributed as virginica. 14 virginica observations were correctly attributed, with 2 observations misattributed as versicolor.
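A single summary accuracy figure can also be derived from the confusion matrix by dividing the count of correctly predicted observations (the diagonal of the table) by the total number of test observations. A minimal sketch, assuming the "prediction" variable created above:

```r
# Overall accuracy: correctly predicted observations / total test observations
confmat <- table(raniris[101:150, ]$Species, predicted = prediction)

sum(diag(confmat)) / sum(confmat)
# For the table above: (19 + 13 + 14) / 50 = 0.92
```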

Method of Application (continuous variable) 

Now that we’ve successfully analyzed categorical data, we will progress within our study by also demonstrating rpart’s capacity as it pertains to the analysis of continuous data.

Again, we will be utilizing the “iris” data set. However, in this scenario, we will omit “Species” from our model, and instead of attempting to identify the species of the iris in question, we will attempt to identify the sepal length of an iris plant based on its other attributes. Therefore, in this example, our dependent variable will be “Sepal.Length”.

The main differentiation between the continuous data model and the categorical data model within the “rpart” package is the option which specifies the analytical methodology. Instead of specifying (method=”class”), we will instruct the package function to utilize (method=”anova”). Therefore, the function which will lead to creation of the model will resemble:

anmodel <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova")

Once the model is built, let’s take a look at the summary of its internal aspects:

summary(anmodel)

This produces the output:


Call:
rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = raniris[1:100, ], method = "anova")
n= 100

CP nsplit rel error xerror xstd
1 0.57720991 0 1.0000000 1.0240753 0.12908984
2 0.12187301 1 0.4227901 0.4792432 0.07380297
3 0.06212228 2 0.3009171 0.3499328 0.04643313
4 0.03392768 3 0.2387948 0.2920761 0.04577809
5 0.01783361 4 0.2048671 0.2920798 0.04349656
6 0.01614077 5 0.1870335 0.2838212 0.04639387
7 0.01092541 6 0.1708927 0.2792003 0.04602130
8 0.01000000 7 0.1599673 0.2849910 0.04586765

Variable importance
Petal.Length Petal.Width Sepal.Width
       46                  37                17

Node number 1: 100 observations, complexity param=0.5772099
mean=5.834, MSE=0.614244
left son=2 (49 obs) right son=3 (51 obs)
Primary splits:
Petal.Length < 4.25 to the left, improve=0.57720990, (0 missing)
Petal.Width < 1.15 to the left, improve=0.53758000, (0 missing)
Sepal.Width < 3.35 to the right, improve=0.02830809, (0 missing)
Surrogate splits:
Petal.Width < 1.35 to the left, agree=0.96, adj=0.918, (0 split)
Sepal.Width < 3.35 to the right, agree=0.65, adj=0.286, (0 split)

Node number 2: 49 observations, complexity param=0.06212228
mean=5.226531, MSE=0.1786839
left son=4 (34 obs) right son=5 (15 obs)
Primary splits:
Petal.Length < 3.45 to the left, improve=0.4358197, (0 missing)
Petal.Width < 0.35 to the left, improve=0.3640792, (0 missing)
Sepal.Width < 2.95 to the right, improve=0.1686580, (0 missing)
Surrogate splits:
Petal.Width < 0.8 to the left, agree=0.939, adj=0.8, (0 split)
Sepal.Width < 2.95 to the right, agree=0.878, adj=0.6, (0 split)

Node number 3: 51 observations, complexity param=0.121873
mean=6.417647, MSE=0.3375317
left son=6 (39 obs) right son=7 (12 obs)
Primary splits:
Petal.Length < 5.65 to the left, improve=0.4348743, (0 missing)
Sepal.Width < 3.05 to the left, improve=0.1970339, (0 missing)
Petal.Width < 1.95 to the left, improve=0.1805629, (0 missing)
Surrogate splits:
Sepal.Width < 3.15 to the left, agree=0.843, adj=0.333, (0 split)
Petal.Width < 2.15 to the left, agree=0.824, adj=0.250, (0 split)

Node number 4: 34 observations, complexity param=0.03392768
mean=5.041176, MSE=0.1288927
left son=8 (26 obs) right son=9 (8 obs)
Primary splits:
Sepal.Width < 3.65 to the left, improve=0.47554080, (0 missing)
Petal.Length < 1.35 to the left, improve=0.07911083, (0 missing)
Petal.Width < 0.25 to the left, improve=0.06421307, (0 missing)

Node number 5: 15 observations
mean=5.646667, MSE=0.03715556

Node number 6: 39 observations, complexity param=0.01783361
mean=6.205128, MSE=0.1799737
left son=12 (30 obs) right son=13 (9 obs)
Primary splits:
Sepal.Width < 3.05 to the left, improve=0.1560654, (0 missing)
Petal.Width < 2.05 to the left, improve=0.1506123, (0 missing)
Petal.Length < 4.55 to the left, improve=0.1334125, (0 missing)
Surrogate splits:
Petal.Width < 2.25 to the left, agree=0.846, adj=0.333, (0 split)

Node number 7: 12 observations
mean=7.108333, MSE=0.2257639

Node number 8: 26 observations
mean=4.903846, MSE=0.07344675

Node number 9: 8 observations
mean=5.4875, MSE=0.04859375

Node number 12: 30 observations, complexity param=0.01614077
mean=6.113333, MSE=0.1658222
left son=24 (23 obs) right son=25 (7 obs)
Primary splits:
Petal.Length < 5.15 to the left, improve=0.19929710, (0 missing)
Petal.Width < 1.45 to the right, improve=0.07411631, (0 missing)
Sepal.Width < 2.75 to the left, improve=0.06794425, (0 missing)
Surrogate splits:
Petal.Width < 2.05 to the left, agree=0.867, adj=0.429, (0 split)

Node number 13: 9 observations
mean=6.511111, MSE=0.1054321

Node number 24: 23 observations, complexity param=0.01092541
mean=6.013043, MSE=0.1620038
left son=48 (9 obs) right son=49 (14 obs)
Primary splits:
Petal.Width < 1.65 to the right, improve=0.18010500, (0 missing)
Petal.Length < 4.55 to the left, improve=0.12257150, (0 missing)
Sepal.Width < 2.75 to the left, improve=0.03274482, (0 missing)
Surrogate splits:
Petal.Length < 4.75 to the right, agree=0.783, adj=0.444, (0 split)

Node number 25: 7 observations
mean=6.442857, MSE=0.03673469

Node number 48: 9 observations
mean=5.8, MSE=0.1466667

Node number 49: 14 observations
mean=6.15, MSE=0.1239286 


The largest distinguishing factor between outputs is that instead of sorting by category, “rpart” has organized the data by mean value. “MSE” is an abbreviation for “Mean Squared Error”, which measures the average squared deviation of a node's values from that node's mean. The larger this value, the greater the spread of the set’s data points about the mean.
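To make the MSE figure concrete, the value reported for any node can be recomputed by hand as the average squared deviation from that node's mean. A minimal sketch, using the root node, which contains all 100 training observations of Sepal.Length:

```r
# MSE of the root node: the mean squared deviation from the node's mean
y <- raniris[1:100, ]$Sepal.Length

mean((y - mean(y))^2)
# Should match the root node MSE reported by summary(anmodel) (0.614244)
```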


As always, the phenomenon which is demonstrated within the raw output will look better in graphical form. To create an illustration of the model, utilize the code below:

# Note: rpart.plot will not round off the numerical figures within an ANOVA model’s output graphic #

# For this reason, I have explicitly disabled the “roundint” option #

rpart.plot(anmodel,extra = 101, type =3, roundint = FALSE)


This creates the following output:


In the leaves at the bottom of the graphic, the topmost value represents the mean value, the n value represents the number of observations which occupy that leaf, and the percentage value represents the number of observations within the leaf divided by the number of observations within the entire training set.
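The leaf values can likewise be checked by applying a leaf's split conditions to the training rows manually. As a sketch, the conditions below reproduce node 5 from the summary output shown earlier (observations with Petal.Length below 4.25 but not below 3.45):

```r
# Recompute the mean of one leaf by applying its split conditions by hand
leafrows <- subset(raniris[1:100, ], Petal.Length < 4.25 & Petal.Length >= 3.45)

nrow(leafrows)               # 15 observations, per the summary output
mean(leafrows$Sepal.Length)  # should match node 5's reported mean (5.646667)
```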

Testing the Model

Now that our decision tree model has been built, let's test its predictive ability with the data which was left absent from our initial analysis.

When assessing non-categorical models for their predictive capacity, there are numerous methodologies which can be employed. In this article, we will be discussing two specifically.

Mean Absolute Error

The first measure of predictive capacity that we will be discussing is known as the Mean Absolute Error. The Mean Absolute Error is the mean of the absolute values of the differences derived from subtracting the predicted observational values from the actual observational values.

https://en.wikipedia.org/wiki/Mean_absolute_error

Within the R platform, deriving this value can be achieved through the utilization of the following code:

# Create predictive model #

anprediction <- predict(anmodel, raniris[101:150,])

# Create MAE function #

MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA #

# Utilize MAE function #

MAE(raniris[101:150,]$Sepal.Length, anprediction)

Console Output:

[1] 0.2976927

The above output indicates that there is, on average, a difference of 0.298 centimeters between the predicted value of sepal length and the actual value of sepal length.

Root Mean Squared Error

The Root Mean Squared Error is a value produced by a methodology utilized to measure the predictive capacity of models. Like the Mean Absolute Error, this formula is applied to the observational values as they appear within the initial data frame, and the predicted observational values which are generated by the predictive model.

However, the manner in which the output value is synthesized is less straightforward. The value itself is generated by solving for the square root of the average of the squared differences between the predicted observational values and the original observational values. As a result, the final output value of the Root Mean Squared Error is more difficult to interpret than its Mean Absolute Error counterpart.

The Root Mean Squared Error is more sensitive to large differences between predicted and observed values. With the Mean Absolute Error, given enough observations, the eventual output value is smoothed out enough to provide the appearance of less distance between individual values than is actually the case. Root Mean Squared Error, through the method by which its value is synthesized, preserves the impact of large deviations regardless of the size of the set.

https://en.wikipedia.org/wiki/Root-mean-square_deviation

Within the R platform, deriving this value can be achieved through the utilization of the following code:

# Create predictive model #

anprediction <- predict(anmodel, raniris[101:150,])

# With the package "Metrics" downloaded and enabled #

# Compute the Root Mean Squared Error (RMSE) of the model's test data #

rmse(raniris[101:150,]$Sepal.Length, anprediction)
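Alternatively, if you would prefer not to rely on an auxiliary package, the Root Mean Squared Error can be derived with a function patterned after the MAE function defined earlier:

```r
# Create RMSE function: the square root of the mean squared prediction error
RMSE <- function(actual, predicted) {sqrt(mean((actual - predicted)^2))}

# Utilize RMSE function on the reserved test rows
RMSE(raniris[101:150,]$Sepal.Length, anprediction)
```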

Decision Tree Nomenclature

As much of the terminology within the field of “machine learning” is applied consistently regardless of model type, it is important to understand the basic descriptive terms in order to familiarize oneself with the contextual aspects of the subject matter.

In generating the initial graphic with the code:

rpart.plot(model, type = 3, extra = 101)

We were presented with the illustration below:


The “rpart” package, as it pertains to the model output provided, identifies each aspect of the model in the following manner:

# Generate model output with the following code #

model


> model
n= 100


node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000)
2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)
6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) *
7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *

If this identification was provided within a graphical representation of the model, the illustration would resemble the graphic below:


However, universally, the following graphic is a better representation of what each term is utilized to describe within the context of the field of study.

# Illustrate the model #

rpart.plot(model)


The first graphic provides a much more pragmatic representation of the model, a representation which is perfectly in accordance with the manner in which the rpart() function summarizes the data. The latter graphic illustrates the manner in which a model of this type would traditionally be represented.

Therefore, if an individual were discussing this model with an outside researcher, they would refer to the model as possessing 3 leaves and 2 nodes. The tree's possession of 1 root is essentially inherent. The term “branches” describes the black lines which connect the various other aspects of the model. However, like the root of the tree, the branches themselves do not warrant mention. In summary, when referring to a tree model, it is common practice to define it generally by the number of nodes and leaves it possesses.

Pruning with prune()

There will be instances in which you may wish to simplify a model by removing some of its extraneous nodes. This can be motivated either by a desire to simplify the model, or by an attempt to optimize the model’s predictive capacity.

We will apply the pruning function to the second example model that we previously created.

First, we must find the CP value of the model that we wish to prune. This can be achieved through the utilization of the code:

printcp(anmodel)

This presents the following console output:

> printcp(anmodel)

Regression tree:
rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = raniris[1:100, ], method = "anova")

Variables actually used in tree construction:
[1] Petal.Length Petal.Width Sepal.Width

Root node error: 61.424/100 = 0.61424

n= 100

CP nsplit rel error xerror xstd
1 0.577210 0 1.00000 1.04319 0.133636
2 0.121873 1 0.42279 0.52552 0.081797
3 0.062122 2 0.30092 0.39343 0.051912
4 0.033928 3 0.23879 0.32049 0.050067
5 0.017834 4 0.20487 0.32167 0.050154
6 0.016141 5 0.18703 0.29403 0.047955
7 0.010925 6 0.17089 0.29242 0.048231
8 0.010000 7 0.15997 0.29256 0.048205


Each row in the list represents a potential tree size, with the initial row (1) representing the model’s root. The typical course of action for pruning an “rpart” tree is to first identify the row with the lowest cross-validation error (xerror). Once this row has been identified, we must make note of its corresponding CP score (0.010925). It is this value which will be utilized within our pruning function to modify the model.
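Rather than reading the CP value off of the printed table by eye, the same selection can be performed programmatically from the complexity table stored within the model object. A minimal sketch, assuming the "anmodel" variable created earlier:

```r
# Select the CP value whose row possesses the lowest cross-validation error
bestcp <- anmodel$cptable[which.min(anmodel$cptable[, "xerror"]), "CP"]

bestcp
```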

With the above information ascertained, we can move forward in the pruning process by initiating the following code within R console.

prunedmodel <- prune(anmodel, cp = 0.010925)

In the case of our example, due to the small CP value, no modifications were made to the original model. However, this is not always the case. I encourage you to experiment with this function as it pertains to your own rpart models; the best way to learn is through repetition.

Dealing with Missing Values

Typically, when analyzing real-world data sets, there will be instances in which certain variable observation values are absent. You should not let this all too common occurrence hinder your model ambitions. Thankfully, within the rpart function, there exists a mechanism for dealing with missing values. However, this mechanism only applies to observations with missing independent variable values; observations which are missing their dependent variable value should be removed prior to analysis.

After testing the functionality of this method with data sets from which I had previously removed portions of data, there appeared to be very little impact on model creation or predictive capacity. The algorithms which animate the package's functions also allow incomplete data sets to be passed through the model in order to generate predictions.
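As an illustration of this behavior, the sketch below blanks out a handful of independent variable values within the reserved test rows and passes the degraded copy through the categorical model. The rows chosen for removal are arbitrary:

```r
# Copy the test rows and remove several Petal.Length values
testna <- raniris[101:150, ]
testna$Petal.Length[1:5] <- NA

# rpart will still generate predictions for the incomplete observations
napred <- predict(model, testna, type = "class")

head(napred)
```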

I’m not exactly sure how the underlying functionality of the rpart package specifically estimates the values of the missing variable observations. From reading various articles and the manual associated with the rpart package, the mechanism appears to be “surrogate splits”: when the primary splitting variable is missing for an observation, a surrogate variable whose split closely agrees with the primary split is utilized in its place.

Conclusion

The basic tree model, as discussed within the contents of this article, is often passed over in favor of the random forest model. However, as you will observe in future articles, the basic tree model is not without merit, as due to its singular nature, it is the easier model to explain and conceptually visualize. Both of the latter qualities are extremely valuable as they relate to data presentation and research publication. In the next article we will be discussing “Bagging”. Until then, stay subscribed, Data Heads.
