Sunday, September 25, 2022

(R) Machine Learning - Trees - Pt. I

This article is the first of many which will discuss the topic of Machine Learning. Throughout the subsequent articles published on this site, we will examine Machine Learning as a topic, and the theories and algorithms which ultimately serve as the subject's foundation.

While I do not personally consider the equations embedded within the "rpart" package to be machine learning in the literal sense, those who act as authorities on such matters have defined it otherwise. By the definition postulated by the greater community, tree-based models represent an aspect of machine learning known as "supervised learning". What this essentially implies is that the computer software derives a statistical solution to an evidence-based question posed by the user, after which the user has the opportunity to review the solution and its rationale, and make model edits where necessary.

The functionality implemented within tree-based models is often drawn from an abstract or white paper written by mathematicians. In many cases, the algorithms which ultimately animate the decision-making process are too difficult, or too cumbersome, for a human being to apply by hand. This does not mean that such undertakings are impossible; however, given that the time commitment depends on the size of the data frame being analyzed, the more pragmatic approach is to leave the process entirely to the machines designed to perform such functions.

Introducing Tree-Based Models with "rpart"

Like K-Means clustering, "rpart" relies on an underlying algorithm which, due to its complexity, produces results that are difficult to verify. Unlike a process such as categorical regression, much occurs outside of the user's observation from a mathematical standpoint. Due to the nature of the analysis, no equation is output for the user to check, only the model itself. Without this proof of concept, the user can only assume that the analysis was appropriately performed, and that the model produced was the optimal variation for future application.

For the examples included within this article, we will be using the R data set "iris".

Preparing for Analysis

Before we begin, you will need to download two separate auxiliary packages from the CRAN repository, those being:

"rpart"

and

"rpart.plot"

Once you have completed this task, we will move forward by reviewing the data set prior to analysis.

This can be achieved by initiating the following functions:

summary(iris)

head(iris)


Since the data frame is initially sorted and organized by "Species", prior to performing the analysis, we must take steps to randomize the data contained within the data frame.

Justification for Randomization

Presenting data to a machine which performs analysis through the utilization of an algorithm is somewhat analogous to teaching a young child. To better illustrate this concept, I will present a demonstrative scenario.

Let's imagine that, for some particular reason, you were attempting to instruct a very young child on the topic of dogs, and to accomplish such, you presented the child with a series of pictures consisting only of golden Labradors. As you might imagine, the child would walk away from the exercise with the notion that dogs always consist of the features associated with Labradors of the golden variety. Instead of believing that "dog" is a generalized descriptor which encompasses numerous minute and discretely defined features, the child will believe that all dogs are golden Labradors, and that golden Labradors are the only type of dog.

Machines learn* in a similar manner. Each algorithm provides a distinct and unique methodology as it pertains to the overall outcome of the analysis. However, the typical algorithm, in a similar manner to a human, possesses a bias based solely on the data as it is initially presented. This is why randomization of the data, which presents a diverse and robust summary of the data source, is so integral to the process.

This method of randomization was inspired by the YouTube user: Jalayer Academy. A link to the video which describes this randomization technique can be found below.

* - or the algorithm that is associated with the application which creates the appearance of such.

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows according to the ordering of the random values #

raniris <- iris[order(rannum), ]


Training Data and The "rpart" Algorithm

Before we apply the algorithm within the "rpart" package, there are two separate topics which I wish to discuss.

The "rpart" algorithm, as was previously mentioned, is one of many machine learning methodologies which can be utilized to analyze data. The differentiating factor which separates methodologies is typically based on the underlying algorithm which is applied to initial data frame. In the case of "rpart", the methodology utilized, was initially postulated by: Breiman, Friedman, Olshen and Stone.

Classification and Regression Trees

On the topic of training data, let us again return to our previous child-training example. When teaching a child utilizing the flash card method discussed prior, you may be inclined to set a few of the cards which you have designed aside. The reason for such is that these cards can be utilized after the initial training to test the child's comprehension of the subject matter.

Most machines are trained in a similar manner. A portion of the initial data frame is typically set aside in order to test the overall strength of the model after the model's synthesis is complete. After passing the additional data through the model, a rough conclusion can be drawn as it pertains to the overall effectiveness of the model's design. 
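
In the examples which follow, we will hold data out in exactly this manner: the first 100 rows of the randomized data frame will train the model, and the remaining 50 rows will test it. A minimal sketch of that split (the names "trainiris" and "testiris" are illustrative only; later code will reference the subsets directly):

# Training data: the first 100 randomized rows #

trainiris <- raniris[1:100, ]

# Test data: the remaining 50 randomized rows #

testiris <- raniris[101:150, ]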

Method of Application (categorical variable)

As is the case as it pertains to linear regression, we must designate the dependent variable that we wish to predict. If the variable is a categorical variable, we will specify the rpart() function to include a method option of "class". If the variable is a continuous variable, we will specify the rpart() function to include a method option of "anova".

In this first case, we will attempt to create a model which, through the assessment of the independent variables, can properly predict the species variable.

The structure of the rpart() function is incredibly similar to the linear model function which is native within R.

model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="class")

Let's break this structure down:

Species - Is the model's dependent variable.

Sepal.Length + Sepal.Width + Petal.Length + Petal.Width - Are the model's independent variables.

data = raniris[1:100,] - This option specifies the data which will be included within the analysis. As we discussed previously, for the purposes of our model, only the first 100 row entries of the randomized data frame will be included as the foundation upon which to structure the model.

method = "case" - This option indicates to the computer that the dependent variable is categorical and not continuous.

After running the above function, we are left with the newly created variable: "model".

Conclusions

From this variable we can draw various conclusions.

Running the variable: "model" within the terminal should produce the following console output:

n= 100

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000)

2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *

3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)

6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) *

7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *


Let's break this structure down:

Structure Summary

n = 100 - This is the initial number of observations passed into the model.

Logic of the nodal split – Example: Petal.Length>=2.45

Total Observations Included within node - Example: 69

Observations which were incorrectly designated - Example: 34

Nodal Designation – Example: versicolor

Proportion of observations occupying each category – Example: (0.00000000 0.50724638 0.49275362)

The Structure Itself

root 100 65 versicolor (0.31000000 0.35000000 0.34000000) - "root" is the initial pool of observations fed through the tree model, hence the term. The second number (65) is the number of observations which would be misclassified if every observation were labeled with the node's designation (versicolor). The numbers found within the parentheses are the proportional breakdowns of the observations by category.

Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) * - The first split, which filters model data between two branches. This branch sorts data to the left leaf, in which 31 of the observations are setosa (100%). The condition which determines the sorting of the data is the observation's Petal.Length value (< 2.45). The (*) symbol indicates that the node is a terminal node, meaning that the node is itself a leaf.

Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362) - This branch indicates a split based on the right-sided alternative to the prior condition. The first number (69) indicates the number of cases which remain prior to further sorting, and the subsequent number (34) indicates the number of those cases which are virginica (and not versicolor). The set of numbers within the parentheses indicates the proportion of the remaining 69 cases which are versicolor (51%), and the proportion which are virginica (49%).

Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * - This branch indicates a left split. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, 35 of the observations are versicolor (95%) and 2 of the observations are virginica (5%).

Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) * - This branch indicates the right split alternative. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, all 32 of the observations are virginica (100%), and 0 are versicolor (0%).

Further information, for inference, can be generated by running the following code within the terminal:

summary(model)

This produces the following console output:


(I have created annotations beneath each relevant portion of output)

Call:

rpart(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length +

Petal.Width, data = raniris[1:100, ], method = "class")

n= 100

         CP nsplit  rel error     xerror       xstd
1 0.4846154      0 1.00000000 1.26153846 0.05910576
2 0.0100000      2 0.03076923 0.04615385 0.02624419

This portion of the output will be useful as we explore the process of "pruning" later in the article. 


Variable importance

 Petal.Width Petal.Length Sepal.Length  Sepal.Width
          35           31           20           14

Node number 1: 100 observations, complexity param=0.4846154

predicted class=versicolor expected loss=0.65 P(node) =1

class counts: 31 35 34

probabilities: 0.310 0.350 0.340

left son=2 (31 obs) right son=3 (69 obs)

Primary splits:

Petal.Length < 2.45 to the left, improve=32.08725, (0 missing)

Petal.Width < 0.8 to the left, improve=32.08725, (0 missing)

Sepal.Length < 5.55 to the left, improve=18.52595, (0 missing)

Sepal.Width < 3.05 to the right, improve=12.67416, (0 missing)

Surrogate splits:

Petal.Width < 0.8 to the left, agree=1.00, adj=1.000, (0 split)

Sepal.Length < 5.45 to the left, agree=0.89, adj=0.645, (0 split)

Sepal.Width < 3.35 to the right, agree=0.83, adj=0.452, (0 split)



The initial split from the root.



Node number 2: 31 observations

predicted class=setosa expected loss=0 P(node) =0.31

class counts: 31 0 0

probabilities: 1.000 0.000 0.000





Filtered results which exist within the "setosa" leaf.




Node number 3: 69 observations, complexity param=0.4846154

predicted class=versicolor expected loss=0.4927536 P(node) =0.69

class counts: 0 35 34

probabilities: 0.000 0.507 0.493

left son=6 (37 obs) right son=7 (32 obs)





The results of the aforementioned split prior to being filtered through the petal width conditional.




Primary splits:

Petal.Width < 1.65 to the left, improve=30.708970, (0 missing)

Petal.Length < 4.75 to the left, improve=25.420120, (0 missing)

Sepal.Length < 6.35 to the left, improve= 7.401845, (0 missing)

Sepal.Width < 2.95 to the left, improve= 3.878961, (0 missing)

Surrogate splits:

Petal.Length < 4.75 to the left, agree=0.899, adj=0.781, (0 split)

Sepal.Length < 6.15 to the left, agree=0.754, adj=0.469, (0 split)

Sepal.Width < 2.95 to the left, agree=0.696, adj=0.344, (0 split)


Node number 6: 37 observations

predicted class=versicolor expected loss=0.05405405 P(node) =0.37

class counts: 0 35 2

probabilities: 0.000 0.946 0.054





Filtered results which exist within the "versicolor" leaf.




Node number 7: 32 observations

predicted class=virginica expected loss=0 P(node) =0.32

class counts: 0 0 32

probabilities: 0.000 0.000 1.000





Filtered results which exist within the "virginica" leaf.




Visualizing Output with a Much-Needed Illustration

If you got lost somewhere along the way during the prior section, don't be ashamed, it is understandable. I am not in any way operating under the pretense that any of this is innate or easily grasped.

However, much of what I attempted to explain in the preceding paragraphs can be best summarized through the utilization of the "rpart.plot" package.

# Model Illustration Code #

rpart.plot(model, type = 3, extra = 101)


Console Output:



What is being illustrated in the graphic are the decision branches, and the leaves which ultimately serve as the destinations for the final categorical filtering process.

The leaf "setosa" contains 31 observations which were correctly identified as "setosa" observations. The total number of observations equates for 31% of the total observational rows which were passed through the model.

The leaf "versicolor" contains 35 observations which were correctly identified as "versicolor", and 2 observations which were misidentified. The misidentified observations would instead belong within the “virginica” categorical leaf. The total number of observation contained within the "versicolor" leaf, both correct and incorrect, equal for a total of 37% of the observational rows which were passed through the model.

The leaf "virginica" contains 32 observations which were correctly identified as "virginica". The total number of observations equates for 32% of the total observational rows which were passed through the model.

Testing the Model

Now that our decision tree model has been built, let's test its predictive ability with the data which we left absent from our initial analysis.

# Create "confusion matrix" to test model accuracy #

prediction <- predict(model, raniris[101:150,], type="class")


table(raniris[101:150,]$Species, predicted = prediction)

A variable named "prediction" is created through the utilization of the predict() function. Passed to this function as options are: the model variable, the remaining rows of the randomized "iris" data frame, and the model type.

Next, a table is created which illustrates the differentiation between what was predicted and what the observation occurrence actually equals. The option "predicted = " will always equal your prediction variable. The numbers within the brackets [101:150, ] specify the rows of the randomized data frame which will act as test observations for the model. “raniris” is the data frame from which these observations will be drawn, and “$Species” specifies the data frame variable which will be assessed.

The result of initiating the above lines of code produces the following console output:

            predicted
             setosa versicolor virginica
  setosa         19          0         0
  versicolor      0         13         2
  virginica       0          2        14


This output table is known as a "confusion matrix". Its purpose is to sort the output into a readable format which illustrates the number of correctly predicted outcomes, and the number of incorrectly predicted outcomes, within each category. In this particular case, all setosa observations were correctly predicted. 13 versicolor observations were correctly predicted, with 2 observations misattributed as virginica. 14 virginica observations were correctly attributed, with 2 observations misattributed as versicolor.
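
As a hypothetical follow-up, the model's overall accuracy can be read directly off of the confusion matrix, as it is simply the proportion of observations which fall along the matrix's diagonal:

# Store the confusion matrix, then compute overall model accuracy #

confmat <- table(raniris[101:150,]$Species, predicted = prediction)

sum(diag(confmat)) / sum(confmat)

In this case, the result would be (19 + 13 + 14) / 50 = 0.92, meaning that 92% of the test observations were correctly classified.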

Method of Application (continuous variable) 

Now that we’ve successfully analyzed categorical data, we will progress within our study by also demonstrating rpart’s capacity as it pertains to the analysis of continuous data.

Again, we will be utilizing the “iris” data set. However, in this scenario, we will omit “Species” from our model, and instead of attempting to identify the species of the iris in question, we will attempt to predict the sepal length of an iris plant based on its other attributes. Therefore, in this example, our dependent variable will be “Sepal.Length”.

The main differentiation between the continuous data model and the categorical data model within the “rpart” package is the option which specifies the analytical methodology. Instead of specifying (method=”class”), we will instruct the package function to utilize (method=”anova”). Therefore, the function which will lead to the creation of the model will resemble:

anmodel <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova")

Once the model is built, let’s take a look at the summary of its internal aspects:

summary(anmodel)

This produces the output:


Call:
rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = raniris[1:100, ], method = "anova")
n= 100

CP nsplit rel error xerror xstd
1 0.57720991 0 1.0000000 1.0240753 0.12908984
2 0.12187301 1 0.4227901 0.4792432 0.07380297
3 0.06212228 2 0.3009171 0.3499328 0.04643313
4 0.03392768 3 0.2387948 0.2920761 0.04577809
5 0.01783361 4 0.2048671 0.2920798 0.04349656
6 0.01614077 5 0.1870335 0.2838212 0.04639387
7 0.01092541 6 0.1708927 0.2792003 0.04602130
8 0.01000000 7 0.1599673 0.2849910 0.04586765

Variable importance
Petal.Length  Petal.Width  Sepal.Width
          46           37           17

Node number 1: 100 observations, complexity param=0.5772099
mean=5.834, MSE=0.614244
left son=2 (49 obs) right son=3 (51 obs)
Primary splits:
Petal.Length < 4.25 to the left, improve=0.57720990, (0 missing)
Petal.Width < 1.15 to the left, improve=0.53758000, (0 missing)
Sepal.Width < 3.35 to the right, improve=0.02830809, (0 missing)
Surrogate splits:
Petal.Width < 1.35 to the left, agree=0.96, adj=0.918, (0 split)
Sepal.Width < 3.35 to the right, agree=0.65, adj=0.286, (0 split)

Node number 2: 49 observations, complexity param=0.06212228
mean=5.226531, MSE=0.1786839
left son=4 (34 obs) right son=5 (15 obs)
Primary splits:
Petal.Length < 3.45 to the left, improve=0.4358197, (0 missing)
Petal.Width < 0.35 to the left, improve=0.3640792, (0 missing)
Sepal.Width < 2.95 to the right, improve=0.1686580, (0 missing)
Surrogate splits:
Petal.Width < 0.8 to the left, agree=0.939, adj=0.8, (0 split)
Sepal.Width < 2.95 to the right, agree=0.878, adj=0.6, (0 split)

Node number 3: 51 observations, complexity param=0.121873
mean=6.417647, MSE=0.3375317
left son=6 (39 obs) right son=7 (12 obs)
Primary splits:
Petal.Length < 5.65 to the left, improve=0.4348743, (0 missing)
Sepal.Width < 3.05 to the left, improve=0.1970339, (0 missing)
Petal.Width < 1.95 to the left, improve=0.1805629, (0 missing)
Surrogate splits:
Sepal.Width < 3.15 to the left, agree=0.843, adj=0.333, (0 split)
Petal.Width < 2.15 to the left, agree=0.824, adj=0.250, (0 split)

Node number 4: 34 observations, complexity param=0.03392768
mean=5.041176, MSE=0.1288927
left son=8 (26 obs) right son=9 (8 obs)
Primary splits:
Sepal.Width < 3.65 to the left, improve=0.47554080, (0 missing)
Petal.Length < 1.35 to the left, improve=0.07911083, (0 missing)
Petal.Width < 0.25 to the left, improve=0.06421307, (0 missing)

Node number 5: 15 observations
mean=5.646667, MSE=0.03715556

Node number 6: 39 observations, complexity param=0.01783361
mean=6.205128, MSE=0.1799737
left son=12 (30 obs) right son=13 (9 obs)
Primary splits:
Sepal.Width < 3.05 to the left, improve=0.1560654, (0 missing)
Petal.Width < 2.05 to the left, improve=0.1506123, (0 missing)
Petal.Length < 4.55 to the left, improve=0.1334125, (0 missing)
Surrogate splits:
Petal.Width < 2.25 to the left, agree=0.846, adj=0.333, (0 split)

Node number 7: 12 observations
mean=7.108333, MSE=0.2257639

Node number 8: 26 observations
mean=4.903846, MSE=0.07344675

Node number 9: 8 observations
mean=5.4875, MSE=0.04859375

Node number 12: 30 observations, complexity param=0.01614077
mean=6.113333, MSE=0.1658222
left son=24 (23 obs) right son=25 (7 obs)
Primary splits:
Petal.Length < 5.15 to the left, improve=0.19929710, (0 missing)
Petal.Width < 1.45 to the right, improve=0.07411631, (0 missing)
Sepal.Width < 2.75 to the left, improve=0.06794425, (0 missing)
Surrogate splits:
Petal.Width < 2.05 to the left, agree=0.867, adj=0.429, (0 split)

Node number 13: 9 observations
mean=6.511111, MSE=0.1054321

Node number 24: 23 observations, complexity param=0.01092541
mean=6.013043, MSE=0.1620038
left son=48 (9 obs) right son=49 (14 obs)
Primary splits:
Petal.Width < 1.65 to the right, improve=0.18010500, (0 missing)
Petal.Length < 4.55 to the left, improve=0.12257150, (0 missing)
Sepal.Width < 2.75 to the left, improve=0.03274482, (0 missing)
Surrogate splits:
Petal.Length < 4.75 to the right, agree=0.783, adj=0.444, (0 split)

Node number 25: 7 observations
mean=6.442857, MSE=0.03673469

Node number 48: 9 observations
mean=5.8, MSE=0.1466667

Node number 49: 14 observations
mean=6.15, MSE=0.1239286 


The largest distinguishing factor between outputs is that, instead of sorting categorically, “rpart” has organized the data by mean value. “MSE” is an abbreviation for “Mean Squared Error”, which measures the average squared deviation of a node’s values from the node’s mean. The larger this value, the greater the spread between the set’s data points.
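
To make this concrete, the root node's reported MSE (0.614244) is simply the average squared distance between each training observation's sepal length and the node's mean. A minimal sketch which should reproduce that figure (the helper variable name is illustrative):

# Reproduce the root node MSE by hand #

trainlength <- raniris[1:100, ]$Sepal.Length

mean((trainlength - mean(trainlength))^2)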


As always, the phenomenon which is demonstrated within the raw output will look better in graphical form. To create an illustration of the model, utilize the code below:

# Note: by default, rpart.plot attempts to round the split values displayed within an ANOVA model’s output graphic #

# To avoid this behavior (and its associated warnings), I have explicitly disabled the “roundint” option #

rpart.plot(anmodel,extra = 101, type =3, roundint = FALSE)


This creates the following output:


In the leaves at the bottom of the graphic, the topmost value represents the mean value of the dependent variable, the n value represents the number of observations which occupy that leaf, and the percentage value represents the number of observations within the leaf, divided by the number of observations within the entire set.

Testing the Model

Now that our decision tree model has been built, let's test its predictive ability with the data which was left absent from our initial analysis.

When assessing non-categorical models for their predictive capacity, there are numerous methodologies which can be employed. In this article, we will be discussing two specifically.

Mean Absolute Error

The first measure of predictive capacity that we will be discussing is known as the Mean Absolute Error. The Mean Absolute Error is the mean of the absolute values of the differences derived from subtracting each predicted value from its corresponding observed value.

https://en.wikipedia.org/wiki/Mean_absolute_error

Within the R platform, deriving this value can be achieved through the utilization of the following code:

# Generate model predictions for the test data #

anprediction <- predict(anmodel, raniris[101:150,])

# Create MAE function #

MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA #

# Utilize MAE function #

MAE(raniris[101:150,]$Sepal.Length, anprediction)

Console Output:

[1] 0.2976927

The above output indicates that there is, on average, a difference of 0.298 centimeters between the predicted value of sepal length and the actual value of sepal length.

Root Mean Squared Error

The Root Mean Squared Error is another value utilized to measure the predictive capacity of models. Like the Mean Absolute Error, its formula is applied to the observed values as they appear within the initial data frame, and the predicted values generated by the model.

However, the manner in which the output value is synthesized is less straightforward. The value is generated by taking the square root of the average of the squared differences between each predicted value and its corresponding observed value. As a result, the final output of the Root Mean Squared Error is more difficult to interpret than its Mean Absolute Error counterpart.

The Root Mean Squared Error is more sensitive to large differences between predicted and observed values. With the Mean Absolute Error, given enough observations, the eventual output value is smoothed out enough to provide the appearance of less distance between individual values than is actually the case. However, as was previously mentioned, the Root Mean Squared Error, through the method by which its value is synthesized, preserves a certain amount of that distance variation regardless of the size of the set.

https://en.wikipedia.org/wiki/Root-mean-square_deviation
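
Before turning to the package implementation, the formula can also be written out by hand, mirroring the MAE function above. A minimal sketch (the function name "RMSE" is illustrative):

# Create RMSE function: the square root of the mean of the squared differences #

RMSE <- function(actual, predicted) {sqrt(mean((actual - predicted)^2))}

# Utilize RMSE function on the model test data #

RMSE(raniris[101:150,]$Sepal.Length, anprediction)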

Within the R platform, deriving this value can be achieved through the utilization of the following code:

# Generate model predictions for the test data #

anprediction <- predict(anmodel, raniris[101:150,])

# With the package "metrics" downloaded and enabled #

rmse(raniris[1:100,]$Sepal.Length, anprediction)

# Compute the Root Mean Standard Error (RMSE) of model test data #

prediction <- predict(anmod, raniris[101:150,], type="class")

Console Output:

[1] 1.128444

Decision Tree Nomenclature

As much of the terminology within the field of “machine learning” is synonymously applied regardless of model type, it is important to understand the basic descriptive terms in order to familiarize oneself with the contextual aspects of the subject matter.

In generating the initial graphic with the code:

rpart.plot(model, type = 3, extra = 101)

We were presented with the illustration below:


The “rpart” package, as it pertains to the model output provided, identifies each aspect of the model in the following manner:

# Generate model output with the following code #

model


> model
n= 100


node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000)
2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)
6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) *
7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *

If this identification was provided within a graphical representation of the model, the illustration would resemble the graphic below:


However, universally, the following graphic is a better representation of what each term is utilized to describe within the context of the field of study.

# Illustrate the model #

rpart.plot(model)


The first graphic provides a much more pragmatic representation of the model, a representation which is perfectly in accordance with the manner in which the rpart() function summarizes the data. The latter graphic illustrates the technique which is traditionally synonymous with the way in which a model of this type would be represented.

Therefore, if an individual were discussing this model with an outside researcher, he would refer to the model as possessing 3 leaves and 2 nodes. The tree being in possession of 1 root is essentially inherent. The term “branches” describes the black lines which connect the various other aspects of the model. However, like the root of the tree, the branches themselves do not warrant mention. In summary, when referring to a tree model, it is common practice to define it generally by the number of nodes and leaves it possesses.

Pruning with prune()

There will be instances in which you may wish to simplify a model by removing some of its extraneous nodes. Accomplishing such can be motivated by either a desire to simplify the model, or an attempt to optimize the model’s predictive capacity.

We will apply the pruning function to the second example model that we previously created.

First, we must find the CP value of the model that we wish to prune. This can be achieved through the utilization of the code:

printcp(anmodel)

This presents the following console output:

> printcp(anmodel)

Regression tree:
rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = raniris[1:100, ], method = "anova")

Variables actually used in tree construction:
[1] Petal.Length Petal.Width Sepal.Width

Root node error: 61.424/100 = 0.61424

n= 100

CP nsplit rel error xerror xstd
1 0.577210 0 1.00000 1.04319 0.133636
2 0.121873 1 0.42279 0.52552 0.081797
3 0.062122 2 0.30092 0.39343 0.051912
4 0.033928 3 0.23879 0.32049 0.050067
5 0.017834 4 0.20487 0.32167 0.050154
6 0.016141 5 0.18703 0.29403 0.047955
7 0.010925 6 0.17089 0.29242 0.048231
8 0.010000 7 0.15997 0.29256 0.048205


Each row in the list represents a candidate tree size, with the initial row (1) representing the model’s root. The typical course of action for pruning an “rpart” tree is to first identify the row with the lowest cross-validation error (xerror), which in this case is 0.29242. Once this value has been identified, we must make note of its corresponding CP score (0.010925). It is this value which will be utilized within our pruning function to modify the model.

With the above information ascertained, we can move forward in the pruning process by initiating the following code within R console.

prunedmodel <- prune(anmodel, cp = 0.010925)

In the case of our example, due to the small CP value, no modifications were made to the original model. However, this is not always the case. I encourage you to experiment with this function as it pertains to your own rpart models; the best way to learn is through repetition.
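
As a hypothetical illustration of a pruning which does visibly modify the model, supplying a deliberately larger cp value, one which falls between the second and third CP entries of the table above, should collapse the tree down to its first two splits:

# Prune the model back to its first two splits (illustrative cp value) #

smallermodel <- prune(anmodel, cp = 0.1)

# Illustrate the pruned model #

rpart.plot(smallermodel, extra = 101, type = 3, roundint = FALSE)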

Dealing with Missing Values

Typically, when analyzing real world data sets, there will be instances in which various variable observation values are absent. You should not let this all too common occurrence hinder your model ambitions. Thankfully, within the rpart function, there exists a mechanism for dealing with missing values. However, this mechanism only applies to observations which are missing independent variable values; observations which are missing their dependent variable entries should be removed prior to analysis.

After testing the functionality of the method with data sets from which I had previously removed portions of data, there appeared to be very little impact on model creation or predictive capacity. The underlying algorithms are also built in such a manner that incomplete data sets can be passed through the model to generate predictions.

As for how the underlying functionality of the rpart package estimates its way around missing variable observations: per the package’s manual, rpart constructs “surrogate” splits. When the variable required by a primary split is missing from an observation, the observation is instead routed by a surrogate variable, a substitute whose own split most closely agrees with the primary split among the complete observations.
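
As a minimal sketch of this behavior, we can remove a value from one of our own test observations and confirm that the categorical model from earlier still produces a prediction (the name "testrow" is illustrative):

# Copy a test observation and remove its Petal.Length value #

testrow <- raniris[101, ]

testrow$Petal.Length <- NA

# The model should fall back on a surrogate split (ex. Petal.Width) to predict #

predict(model, testrow, type = "class")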

Conclusion

The basic tree model, as it is discussed within the contents of this article, is often passed over in favor of the random forest model. However, as you will observe in future articles, the basic tree model is not without merit; due to its singular nature, it is the easier model to explain and conceptually visualize. Both of the latter qualities are extremely valuable as they relate to data presentation and research publication. In the next article we will be discussing “Bagging”. Until then, stay subscribed, Data Heads.

Wednesday, September 21, 2022

(Python) Enabling the Nvidia GPU - TensorFlow and Keras Utilization

Though this website does not typically feature articles related to TensorFlow or Keras (machine learning libraries), due to reader requests, in this entry I will illustrate how to enable Nvidia GPU utilization as it pertains to the aforementioned packages.

I will be operating under the following assumptions:

1. You are in possession of a Windows PC which contains a Nvidia GPU. 

2. You are relatively familiar with the TensorFlow and Keras libraries. 

3. You are utilizing the Anaconda Python distribution.

If the above assumptions are correct, then I have designated 4 steps towards the completion of this process:

1. Installation of the most recent Nvidia GPU drivers. 

2. Installation of the “tensorflow-gpu” package library.

3. Installation of the CUDA toolkit.

4. Trouble-shooting.

Installing the most recent Nvidia drivers:

This step is relatively self-explanatory. If you are in possession of a computer which contains a Nvidia GPU, you should have the following program located on your hard drive: “GeForce Experience”. Depending on the type of Nvidia GPU which you possess, the name of the program may vary. To locate this program, or a similar program which achieves the same result, search “Nvidia” from the desktop start bar.

After you have launched the Nvidia desktop interface, you will be asked to create a Nvidia Account. To achieve this, enter the appropriate information within the coinciding menu prompts. Once this has been completed, follow the link within the confirmation e-mail to finalize the creation of your user account.

With your new Nvidia account created, you will possess the ability to access the latest driver updates within the Nvidia console interface. Be sure to update all of the drivers which are listed, prior to proceeding to the next step.


Installing the TensorFlow GPU Package Library

Completing this simple pre-requisite can be achieved by either:

A. Running the following code within the Jupyter Notebook programming environment:

import pip

pip.main(['install', 'tensorflow-gpu'])

B. Running the following code within the “Anaconda Prompt”:

conda install tensorflow-gpu

To reach the “Anaconda Prompt” terminal, type “Anaconda Prompt” into the Windows desktop search bar.

Installing the CUDA toolkit

You are now prepared to complete the final pre-requisite, which is the most complicated of all of the required steps.

First, you must click the link below:

https://developer.nvidia.com/cuda-downloads

The address above will direct you to the Nvidia webpage.

Select the appropriate options which pertain to your operating system from the list of selections. Doing such will present a download link to the version of the CUDA software which is best suited for your PC.


The rest of the installation process is relatively straight-forward.

The download will produce the file below:


(File name will vary based on operating system and version selection)

Double clicking this file icon will begin the installation process.


After clicking through the associated options, the following screen should appear:


If you do not have Microsoft Visual Studio installed on your PC, you will be presented with an installation error. However, if you do not intend to utilize the CUDA software for Visual Studio development, you can continue the installation process without further hesitation.

Once the process is fully completed, the following shortcut icon should appear on your PC’s desktop:


I would now advise that you re-start your PC prior to implementing GPU utilization within your machine learning projects.

Trouble Shooting

I’ve found that GPU-enabled TensorFlow projects, at least in my experience, tend to be more error prone from session to session. However, I accept this shortcoming due to the significant speed increase enabled by GPU utilization.

Utilizing the newly installed GPU implementation is relatively simple, as the implementation is automatically assumed within the model structure. Meaning, the alteration of pre-existing machine learning project code is unnecessary as it pertains to GPU optimization. If you run code which was previously created for the purpose of utilizing the Keras and TensorFlow libraries, then the computer will automatically perform the analysis through the GPU hardware architecture.

To ensure that GPU functionality is enabled, you may run the following lines of code within the Anaconda coding platform:

from tensorflow.python.client import device_lib

from keras import backend as K

print(device_lib.list_local_devices())

K.tensorflow_backend._get_available_gpus()

This should produce output which includes the term: ‘GPU’. If this is the case, then GPU utilization has been successfully enabled.

If, for whatever reason, errors occur as it relates to keras or tensorflow implementation following the installation of the prior programs, try completing any of the following steps to remedy this occurrence.

1. Restart the Anaconda platform and Jupyter Notebook.

2. Uninstall and re-install both the tensorflow and tensorflow-gpu libraries from the Anaconda Prompt (command line). This can be achieved by utilizing the code below:

conda uninstall tensorflow

conda uninstall tensorflow-gpu

conda install tensorflow

conda install tensorflow-gpu


Assuming that these remedies addressed and solved any previously present issue, you should now be prepared to experience the blazing speed enabled by Nvidia GPU utilization.

That’s all for this entry.

Stay busy, Data Heads!

Monday, September 12, 2022

Off the Beaten Path

Between the tremendous amount of work which has found its way to yours truly, and a general lack of relevant content ideas, this blog has been on a temporary hiatus.

However, I plan on resurrecting this site for normal weekly entries moving forward.

There is nothing more depressing than revisiting the ruins of a previously maintained site.

Much of the low hanging fruit as it relates to the topic of basic statistics, and the utilization of various statistical programs, has been thoroughly covered throughout the prior articles. For this reason, I’ve had some trouble deciding what topics ought to be covered in addition to what has already been discussed.

But don’t despair, I have a few ideas as to where we might be going next.

Please stay tuned, as data never sleeps.

-RD

Sunday, July 18, 2021

Pivot Tables (MS-Excel)

You didn’t honestly believe that I would continue to write articles without mentioning every analyst’s favorite Excel technique, did you?

Example / Demonstration:

For this demonstration, we are going to be utilizing the “Removing Duplicate Entries (MS-Excel).csv” data file. This file can be found within the GitHub data repo, upload date: July 12, 2018. If you are too lazy to navigate over to the repo site, the raw .csv data can be found below:

VARA,VARB,VARC,VARD
Mike,1,Red,Spade
Mike,2,Blue,Club
Mike,1,Red,Spade
Troy,2,Green,Diamond
Troy,1,Red,Heart
Archie,2,Orange,Heart
Archie,2,Yellow,Diamond
Archie,2,Orange,Heart
Archie,1,Red,Spade
Archie,1,Blue,Spade
Archie,2,Red,Club
Archie,2,Red,Club
Jack,1,Red,Diamond
Jack,2,Blue,Diamond
Jack,2,Blue,Diamond
Rob,1,Green,Club
Rob,2,Orange,Spade
Brad,1,Red,Heart
Susan,2,Blue,Heart
Susan,2,Yellow,Club
Susan,1,Pink,Heart
Seth,2,Grey,Heart
Seth,1,Green,Club
Joanna,2,Pink,Club
Joanna,1,Green,Spade
Joanna,1,Green,Spade
Bertha,2,Grey,Diamond
Bertha,1,Grey,Diamond
Liz,1,Green,Spade


Let’s get started!

First, we’ll take a nice look at the data as it exists within MS-Excel:


Now we’ll pivot to excellence!

The easiest way to start building pivot tables, is to utilize the “Recommended PivotTables” option button located within the “Insert” menu, listed within Excel’s ribbon menu.


This should bring up the menu below:


Go ahead and select all row entries, across all variable columns.

Once this has been completed, click “OK”.

This should generate the following menu:


Let’s break down each recommendation.

“Sum of VARB by VARD” – This table is summing the total of the numerical values contained within VARB, as they correspond with VARD entries.

“Count of VARA by VARD” – This table is counting the total number of occurrences of categorical values within variable column VARD.

“Sum of VARB by VARC” – This table is summing the total of numerical values contained within VARB, as they correspond with VARC entries.


“Count of VARA by VARC” – This table is counting the total number of occurrences of categorical values within variable column VARC.

“Sum of VARB by VARA” – This table is summing the total of the numerical values contained within VARB, as they correspond with VARA entries.

Now, there may come a time in which none of the above options match exactly what you are looking for. In this case, you will want to utilize the “PivotTable” option button, located within the “Insert” menu, listed within Excel’s ribbon menu.


Go ahead and select all row entries, across all variable columns.

Change the option button to “New Worksheet”, instead of “Existing Worksheet”.

Once this has been completed, click “OK”.

Once this has been accomplished, you’ll be graced with a new menu, on a new Excel sheet (same workbook).

I won’t go into every single output option that you have available, but I will list a few you may want to try yourself. Each output variation can be created by dragging and dropping the variables listed within the topmost box, in varying order, into the boxes below: 


If VARA and VARC are both added to Rows, you will view the categorical occurrences of variable entries from VARC, with VARA acting as the unique ID.

Order matters in each pivot table variable designation place.

So, if we reverse the position of VARA and VARC, and instead list VARC first, followed by VARA, then we will see a table which lists the categorical occurrences of VARA, with VARC acting as a unique ID.

If we include VARA and VARC as rows (in that order), and set the values variable to Sum of VARB, then the output should more closely resemble an accounting sheet, in which each numerical value corresponding with VARA, categorized by VARC, is summed (VARB).


If we instead wanted the count, as opposed to the sum, we could click on the drop down arrow located next to “Count of VARB”, which presents the following options:


From the options listed, we will select “Value Field Settings”.

This presents the following menu, from which we will select “Count”.


The result of following the previously listed steps is illustrated below:


The Pivot Table creation menu also allows for further customization through the addition of column variables.

In the case of our example, we will make the following modifications to our table output:


VARC will now be designated as a column variable, VARA will be a row variable, and the count of VARB will be our values variable.

The result of these modifications is shown below:


Our output format now contains a table which contains the count of each occurrence of each color (VARC), as each color corresponds with each individual listed (VARA) within the original data set. 

In conclusion, the pivot table option within MS-Excel offers a variety of display outputs which can be utilized to present statistical summary data.

The most important skill to develop as it pertains to this feature is the ability to ascertain when a pivot table is necessary for your data project needs.

So with that, we will end this article.

I will see you next time, Data Head.

-RD

Wednesday, July 14, 2021

Getting to Know the Greeks

In today’s article, we are going to go a bit off the beaten path and discuss, The Greek Alphabet!


You might be wondering, why the sudden change of subject content…?

In order to truly master the craft of data science, you will be required to stretch your mind in creative ways. The Greek Alphabet is utilized throughout the fields of statistics, mathematics, finance, computer science, astronomy, and other western intellectual pursuits. For this reason, it really ought to be taught in elementary schools. However, to my knowledge, in most cases, it is not.

The Romans borrowed heavily from Greek Civilization, and contemporary western civilization borrowed heavily from the Romans. Therefore, to truly be a person of culture, you should learn the Greek Alphabet, and really, as much as you possibly can about Ancient Greek Culture. This includes the legends, heroes, and philosophers. We might be getting more into this in other articles, but for today, we will be sticking to the alphabet.

The Greek Alphabet

The best way to learn the Greek alphabet is to be Greek (j/k, but not really). In all other cases, application is probably the best way to commit various Greek letters, as symbols, to memory.

I would recommend drawing each letter in order, uppercase, and lowercase, and saying the name of the letter as it is written.

Let’s try this together!

Α α (Alpha) (Pronounced: AL-FUH) - Utilized in statistics as the symbol which denotes significance level. In finance, it is the percentage return of an investment above or below a predetermined index.

B β (Beta) (Pronounced: BAY-TUH) - In statistics, this symbol is utilized to represent type II errors. In finance, it is utilized to determine asset volatility.

Γ γ (Gamma) (Pronounced: GAM-UH) - In physics, this symbol is utilized to represent particle decay (Gamma Decay). There also exists Alpha Decay, and Beta Decay. The type of decay situationally differs depending on the circumstances.

Δ δ (Delta) (Pronounced: DEL-TUH) - This is currently the most common strain of the novel coronavirus (7/2021). In the field of chemistry, uppercase Delta is utilized to symbolize heat being added to a reaction.

Ε ε (Epsilon) (Pronounced: EP-SIL-ON) - “Machine Epsilon” is utilized in computer science as a way of dealing with floating point values and their assessment within logical statements.

Ζ ζ (Zeta) (Pronounced: ZAY-TUH) - The most common utilization which I have witnessed for this letter is its designation as the variable which represents the Riemann Zeta Function (number theory).

Η η (Eta) (Pronounced: EE-TUH) - I’ve mostly seen this letter designated as the variable for the Dedekind eta function (number theory).

Θ θ (Theta) (Pronounced: THAY-TUH) - Theta is utilized as the symbol to represent a pentaquark, a transient subatomic particle.

Ι ι (Iota) (Pronounced: EYE-OH-TUH) - I’ve never seen this symbol utilized for anything outside of astronomical designations. Maybe if you make it big in science, you could give Iota the love that it so deserves.

Κ κ (Kappa) (Pronounced: CAP-UH) - Kappa is the chosen variable designation for Einstein’s gravitational constant.

Λ λ (Lambda) (Pronounced: LAMB-DUH) - A potential emergent novel coronavirus variant (7/2021). Lowercase Lambda is also utilized throughout the Poisson Distribution function.

Μ μ (Mu) (Pronounced: MEW) - Lowercase Mu is utilized to symbolize the mean of a population (statistics). In particle physics, it can also be applied to represent the elementary particle: Muon.

Ν ν (Nu) (Pronounced: NEW) - As a symbol, this letter represents degrees of freedom (statistics).

Ξ ξ (Xi) (Pronounced: SEE) - In mathematics, uppercase Xi can be utilized to represent the Riemann Xi Function.

Ο ο (Omicron) (Pronounced: OH-MI-CRON) - A symbol which does not get very much love, or use, unlike its subsequent neighbor…

Π π (Pi) (Pronounced: PIE) - In mathematics, lowercase Pi often represents the mathematical real transcendental constant ≈ 3.1415…etc.

Ρ ρ (Rho) (Pronounced: ROW) - In the Black-Scholes model, Rho represents the rate of change of a portfolio with respect to interest rates.

Σ σ (Sigma) (Pronounced: SIG-MA) - Lower case Sigma represents the standard deviation of a population (statistics). Upper case sigma represents a sum function (mathematics).

Τ τ (Tau) (Pronounced: TAW) - Lower case Tau represents an elementary particle within the field of particle physics.

Υ υ (Upsilon) (Pronounced: EEP-SIL-ON) - Does not really get very much use…

Φ φ (Phi) (Pronounced: FAI) - Lowercase Phi is utilized to represent the Golden Ratio.

Χ χ (Chi) (Pronounced: KAI) - Lower case Chi is utilized as a variable throughout the Chi-Square distribution function.

Ψ ψ (Psi) (Pronounced: PSY) - Lower case Psi is used to represent the (generalized) positional states of a qubit within a quantum computer.

Ω ω (Omega) (Pronounced: OHMEGA) - Utilized for just about everything.

Αυτα για τωρα. Θα σε δω την επόμενη φορά! (That’s all for now. I will see you next time!)

-RD

Friday, June 25, 2021

(R) The Levene's Test

In today’s article we will be discussing a technique which is not specifically interesting or pragmatically applicable. Still, for the sake of true data science proficiency, today we will be discussing, THE LEVENE'S TEST!

The Levene's Test is utilized to compare the variances of two separate data sets.

So naturally, our hypothesis would be:

Null Hypothesis: The variance measurements of the two data sets do not significantly differ.

Alternative Hypothesis: The variance measurements of the two data sets do significantly differ.

The Levene's Test Example:

# The leveneTest() Function is included within the “car” package #

library(car)

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

N_LEV <- c(N1, N2)

group <- as.factor(c(rep(1, length(N1)), rep(2, length(N2))))

leveneTest(N_LEV, group)

# The above code is a modification of code provided by StackExchange user: ocram. #

# Source https://stats.stackexchange.com/questions/15722/how-to-use-levene-test-function-in-r #

This produces the output:

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.7677 0.2049
      14

Since the p-value of the output exceeds .05, we will not reject the null hypothesis (alpha = .05).

Conclusions:

The Levene’s Test for Equality of Variances did not indicate a significant differentiation in the variance measurement of Sample N1, as compared to the variance measurement of Sample N2, F(1, 14) = 1.78, p = .21.

So, what is the overall purpose of this test? Meaning, when would its application be appropriate? The Levene’s Test is typically utilized as a pre-test prior to the application of the standard T-Test. However, it is uncommon to structure a research experiment in this manner.  Therefore, the Levene’s Test is more so something which is witnessed within the classroom, and not within the field.

Still, if you find yourself in circumstances in which this test is requested, know that it is often required to determine whether a standard T-Test is applicable. If variances are found to be unequal, a Welch’s T-Test is typically preferred as an alternative to the standard T-Test.
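
As a minimal sketch, both variations can be applied to the sample data from our example:

# Welch's T-Test (R's default, var.equal = FALSE) #

t.test(N1, N2)

# The standard (pooled) T-Test, appropriate when variances are equal #

t.test(N1, N2, var.equal = TRUE)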

-----------------------------------------------------------------------------------------------------------------------------

I promise that my next article will be more exciting.

Until next time.

-RD

Friday, June 18, 2021

(R) Imputing Missing Data with the MICE() Package


In today’s article we are going to discuss basic utilization of the MICE package.

The MICE package assists with performing analysis on shoddily assembled data frames.

In the world of data science, the real world, not the YouTube world or the classroom world, data often comes down in a less than optimal state. More often than not, this is the reality of the matter.

Now, it would be easy to throw up your hands and say, “I CAN’T PERFORM ANY SORT OF ANALYSIS WITH ALL OF THESE MISSING VARIABLES”,


~OR~

(Don’t succumb to temptation!) 

Unfortunately, for you, the data scientist, whoever passed you this data expects a product and not your excuses.

Fortunately, for all of us, there is a way forward.

Example:

Let’s say that you were given this small data set for analysis:

The data is provided in an .xls format, because why wouldn’t it be?

For the sake of not having you download an example data file, I have re-coded this data into the R format.

# Create Data Frame: "SheetB" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA , 1, NA, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)

VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)

VarD <- c(70, 80, NA, 87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)

VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)


SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)


If you would like to see a version of the initial example file with no missing values, the code to create this data frame is below:

# Create Data Frame: "SheetA" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, 8, 17, 13, 6, 2, 18, 12, 17, 13, 9, 14, 18, 6, 13, 5, 2)

VarC <- c(2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 4, 4, 1, 4, 3, 1, 2, 3, 1)

VarD <- c(70, 80, 90, 87, 79, 60, 61, 75, 92, 67, 62, 93, 74, 80, 91, 51, 64, 33, 77, 50)

VarE <- c(980, 800, 983, 925, 821, 978, 881, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)

SheetA <- data.frame(VarA, VarB, VarC, VarD, VarE)


In our example, we’ll assume that the sheet which contains all values (“SheetA”) is unavailable to you. Therefore, to perform any sort of meaningful analysis, you will need to either delete all observations which contain missing data variables (DON’T DO IT!), or run an imputation function.


We will opt to do the latter, and the function which we will utilize, is the mice() function.

First, we will initialize the appropriate library:

# Initialize Library #

library(mice)

Next, we will perform the imputation function contained within the library.

# Perform Imputation #

SheetB_Imputed <- mice(SheetB, m=1, maxit = 50, method = 'pmm', seed = 500)

SheetB: is the data frame which is being called by the function.

m = 1: This is the number of imputed data frame variations which will be generated as a result of the mice function. One is all that is necessary for our purposes.

maxit = 50: This is the maximum number of iterations which will occur as the mice function calculates what it determines to be the optimal value of each missing variable cell.

method = 'pmm': This is the imputation method which the function will utilize; “pmm” is an abbreviation for predictive mean matching.

seed: The mice() function partially relies on randomness to generate missing variable values. The seed value can be whatever value you determine to be appropriate; setting it makes the results reproducible.

After performing the above function, you should be greeted with the output below:

iter imp variable
1 1 VarA VarB VarC VarD VarE
2 1 VarA VarB VarC VarD VarE
3 1 VarA VarB VarC VarD VarE
(the same line repeats for iterations 4 through 49)
50 1 VarA VarB VarC VarD VarE


The output is informing you that 50 iterations were performed upon a single imputed data set.

The code below generates a completed data frame: all of the original observed values are retained, and the newly estimated values now occupy the cells which were previously blank.

# Assign Original Values with Imputations to Data Frame #

SheetB_Imputed_Complete <- complete(SheetB_Imputed)


The outcome should resemble the original complete data frame, with every previously missing cell now occupied by an estimated value.

(Beautiful!)
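
If you would like to verify this for yourself, the completed data frame can be viewed, and checked for remaining missing values, with a few lines of base R:

# View the Completed Data Frame #

head(SheetB_Imputed_Complete)

# Confirm That No Missing Values Remain (the sum should equal 0) #

sum(is.na(SheetB_Imputed_Complete))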


A quick warning: the mice() function cannot be utilized on data frames which contain unencoded categorical variable entries (i.e., raw character strings).

An example of this would be a data frame containing a variable, "VARC", whose entries are the raw character strings "Spade", "Club", "Diamond", and "Heart".

To get mice() to work correctly on this data set, you must recode "VARC" prior to proceeding. You could do this by changing each instance of "Spade" to 1, "Club" to 2, “Diamond" to 3, and "Heart" to 4.
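
As a minimal sketch of how this recoding might be accomplished, assuming a hypothetical data frame "SheetC" in which "VARC" holds the suit names as character strings:

# Hypothetical Example: Recode Card Suit Strings to Numeric Codes #

VARC <- c("Spade", "Club", "Heart", "Diamond", "Club")

SheetC <- data.frame(VARC, stringsAsFactors = FALSE)

# match() returns each entry's position within the suit vector #

# "Spade" = 1, "Club" = 2, "Diamond" = 3, "Heart" = 4 #

SheetC$VARC <- match(SheetC$VARC, c("Spade", "Club", "Diamond", "Heart"))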

For more information as it relates to this function, please consult the documentation for the mice package.

That’s all for now, internet.

-RD

Saturday, June 12, 2021

(R) 2-Sample Test for Equality of Proportions

In today’s article, we are going to revisit, in greater detail, a topic which was reviewed within a prior article.

What the 2-Sample Test for Equality of Proportions seeks to achieve is an assessment of whether one survey group’s response proportion significantly differs from that of another.

To illustrate the application of this methodology, I will utilize a prior example which was previously published to this site (10/15/2017).

Example:

A pollster took a survey of 1300 individuals, the results of which indicated that 600 were in favor of Candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor of Candidate A. At a 10% significance level, is there evidence that the candidate's popularity has decreased?

# Model Hypothesis #

# H0: p1 - p2 = 0 #

# (The proportions are the same) # 

# Ha: p1 - p2 > 0 #

# (Candidate A's popularity has decreased) #

# Disable Scientific Notation in R Output #

options(scipen = 999)

# Model Application #

prop.test(x = c(600,500), n=c(1300,1500), conf.level = .95, correct = FALSE)


Which produces the output:

2-sample test for equality of proportions without continuity correction

data: c(600, 500) out of c(1300, 1500)
X-squared = 47.991, df = 1, p-value = 0.000000000004281
alternative hypothesis: two.sided
95 percent confidence interval:
0.09210145 0.16430881
sample estimates:
prop 1 prop 2
0.4615385 0.3333333
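
A side note: our alternative hypothesis is directional (p1 - p2 > 0), and our stated significance level is 10%. As a sketch, the corresponding one-sided call would resemble the following. The chi-square statistic is unchanged, and the one-sided p-value is half of its two-sided counterpart; conf.level is set to .90 to correspond with the 10% significance level, though it affects only the reported interval, not the p-value.

# One-Sided Variant Matching the Directional Alternative Hypothesis #

prop.test(x = c(600,500), n = c(1300,1500), alternative = "greater", conf.level = .90, correct = FALSE)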

We are now prepared to state the details of our model’s application, and the subsequent findings and analysis which occurred as a result of such.

Conclusions:

A 2-Sample Test for Equality of Proportions without Continuity Correction was performed to analyze whether the initial poll results for Candidate A significantly differed from the subsequent poll results gathered weeks later. A 10% significance level (α = .10) was assumed.

There was a significant difference in Candidate A’s favorability score from the initial poll findings: 46% (600/1300), as compared to Candidate A’s favorability score from the subsequent poll findings: 33% (500/1500); χ2 (1, N = 2800) = 47.99, p < .001.

-----------------------------------------------------------------------------------------------------------------------------

That's all for now.

I'll see you next time, Data Heads.

-RD

Saturday, June 5, 2021

(R) Pearson’s Chi-Square Test Residuals and Post Hoc Analysis

In today’s article, we are going to discuss Pearson Residuals. A Pearson Residual is a diagnostic value examined during post hoc analysis. These values can be utilized to further assess Pearson’s Chi-Square Test results.

If you are unfamiliar with Pearson’s Chi-Square Test, or with what post hoc analysis typically entails, I would encourage you to do further research prior to proceeding.

Example:

To demonstrate this post hoc technique, we will utilize a prior article’s example:

The "Smoking : Obesity" Pearson’s Chi-Squared Test Demonstration.

# To build the contingency matrix for a test of independence #

Model <-matrix(c(5, 1, 2, 2),

nrow = 2,

dimnames = list("Smoker" = c("Yes", "No"),

"Obese" = c("Yes", "No")))

# To run the chi-square test #

# 'correct = FALSE' disables the Yates’ continuity correction #

chisq.test(Model, correct = FALSE)


This produces the output:

Pearson's Chi-squared test

data: Model
X-squared = 1.2698, df = 1, p-value = 0.2598

From the output provided, we can easily conclude that our results were not significant.

However, let’s delve a bit deeper into our findings.

First, let’s take a look at the matrix of the model.

Model

Obese
Smoker Yes No
Yes 5 2
No 1 2


Now, let’s take a look at the expected model values.

chi.result <- chisq.test(Model, correct = FALSE)

chi.result$expected


Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2


What does this mean?

The values above represent the counts which we would expect to observe if smoker status and obesity were entirely independent of one another; that is, the expected counts under the null hypothesis.
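
These expected counts can be reproduced by hand. Each expected count is the product of its row total and its column total, divided by the grand total. For instance, 7 of the 10 subjects are smokers, and 6 of the 10 are obese, therefore the expected "Smoker: Yes, Obese: Yes" count is (7 x 6) / 10 = 4.2. The full matrix can be re-constructed in R as follows:

# Reconstruct the Expected Counts from the Marginal Totals #

outer(rowSums(Model), colSums(Model)) / sum(Model)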


From the previously derived values, we can derive the Pearson Residual Values.

print(chi.result$residuals)

Obese
Smoker Yes No
Yes 0.3903600 -0.4780914
No -0.5962848 0.7302967

What we are specifically looking for, as it pertains to the residual output, are values which are greater than +2, or less than -2. If such values were present within any of the above matrix entries, it would indicate a cell whose observed count deviates substantially from its expected count, and which is therefore driving the overall chi-square result.

Each entry within the residual matrix is the observed count, minus the expected count, divided by the square root of the expected count.

Thus: Pearson Residual = (Observed Value – Expected Value) / √(Expected Value)

Observed Values

Obese
Smoker Yes No
Yes 5 2
No 1 2

Expected Values

Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2

(5 – 4.2) / √ 4.2 = 0.3903600 

(1 – 1.8) / √ 1.8 = -0.5962848

(2 – 2.8) / √ 2.8 = -0.4780914

(2 – 1.2) / √ 1.2 = 0.7302967

~ OR ~

(5 - 4.2) / sqrt(4.2)

(1 - 1.8) / sqrt(1.8)

(2 - 2.8) / sqrt(2.8)

(2 - 1.2) / sqrt(1.2)


[1] 0.39036
[1] -0.5962848
[1] -0.4780914
[1] 0.7302967
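
The same four values can also be derived in a single vectorized step, which reproduces the residual matrix exactly:

# Vectorized Equivalent of the Four Calculations Above #

(Model - chi.result$expected) / sqrt(chi.result$expected)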

The Pearson Residual Values (0.39036, etc.) are the raw residuals scaled by the standard deviation expected under the null hypothesis, and therefore behave approximately as standard normal deviates. It is for this reason that any value greater than +2, or less than -2, would indicate a misapplication of the model. Or, at the very least, indicate that more observational values ought to be collected prior to the model being applied again.
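
It is also worth noting that the object returned by chisq.test() contains standardized residuals, which additionally adjust for the marginal totals, and which follow the standard normal distribution more closely:

# Standardized Residuals #

print(chi.result$stdres)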

The Fisher’s Exact Test as a Post Hoc Analysis for The Pearson's Chi-Square Test

Let’s take our example one step further by applying The Fisher’s Exact Test as a method of post hoc analysis.

Why would we do this?

Assuming that our Chi-Square Test findings were significant, we may want to consider a Fisher’s Exact Test as a method of further assessing the evidence of significance.

A Fisher’s Exact Test computes an exact p-value, rather than relying upon the chi-square distribution’s large-sample approximation. As a consequence, it is the more conservative of the two, and will generally yield a higher p-value than its chi-square counterpart. It is particularly appropriate here, where the expected cell counts are small.


fisher.result <- fisher.test(Model)

print(fisher.result$p.value)


[1] 0.5

<Yikes!>

Conclusions

Now that we have considered our analysis every which way, we can state our findings in APA Format.

This would resemble the following:

A chi-square test of independence was performed to examine the relation between smoking and obesity. The relation between these variables was not found to be significant, χ2 (1, N = 10) = 1.27, p > .05.


In investigating the Pearson Residuals produced from the model application, no value was found to be greater than +2, or less than -2. These findings indicate that no individual cell deviated substantially from its expected count, and that the model was appropriate given the circumstances of the experimental data.

In order to further confirm our experimental findings, a Fisher’s Exact Test was also performed as a method of post hoc analysis. The results indicated a non-significant difference in obesity rates between smokers: 71% (5/7), and non-smokers: 33% (1/3); (p > .05).


-----------------------------------------------------------------------------------------------------------------------------

I hope that you found all of this helpful and entertaining.

Until next time,

-RD

Monday, November 9, 2020

(R) Cohen’s d

In today’s entry, we are going to discuss Cohen’s d, what it is, and when to utilize it. We will also discuss how to appropriately apply the methodology needed to derive this value, through the utilization of the R software package.


(SPSS does not contain the innate functionality necessary to perform this calculation)

Cohen’s d - (What it is):

Cohen’s d is utilized as a method to assess the magnitude of impact as it relates to two sample groups which are subject to differing conditions. For example, if a two sample t-test was being implemented to test a single group which received a drug, against another group which did not receive the drug, then the p-value of this test would determine whether or not the findings were significant.

Cohen’s d, by contrast, would measure the magnitude of that potential impact.

Cohen’s d - (When to use it):

In your statistics class.

You could also utilize this measurement to perform post hoc analysis as it relates to the ANOVA model and the Student’s t-test. However, I have never witnessed its utilization outside of an academic setting.

Cohen’s d – (How to interpret it):

General Interpretation Guidelines:

Greater than or equal to 0.2 = small
Greater than or equal to 0.5 = medium
Greater than or equal to 0.8 = large

Cohen’s d – (How to state your findings):

The effect size for this analysis (d = x.xx) was found to exceed Cohen’s convention for a [small, medium, large] effect (d = .xx).

Cohen’s d – (How to derive it):

# Within the R-Programming Code Space #

##################################

# length of sample 1 (x) #
lenx <-
# length of sample 2 (y) #
leny <-
# mean of sample 1 (x) #
meanx <-
# mean of sample 2 (y)#
meany <-
# SD of sample 1 (x) #
sdx <-
# SD of sample 2 (y) #
sdy <-

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd

##################################


# The above code is a modified version of the code found at: #

# https://stackoverflow.com/questions/15436702/estimate-cohens-d-for-effect-size #
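
For clarity, the calculation which the template above performs can also be expressed as a formula. The template computes Cohen’s d with a pooled standard deviation:

Cohen’s d = |Mean1 – Mean2| / Pooled SD

Pooled SD = √( ((n1 – 1) * SD1^2 + (n2 – 1) * SD2^2) / (n1 + n2 – 2) )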


Cohen’s d – (Example):

FIRST WE MUST RUN A TEST IN WHICH COHEN’S d CAN BE APPLIED AS AN APPROPRIATE POST-HOC TEST METHODOLOGY.
 
Two Sample T-Test


This test is utilized if you randomly sample different sets of items from two separate groups.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures the temperature of water samples to which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, at a 95% confidence level, that his chemical is in some way altering the temperature of the water?

For this, we will use the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)


Which produces the output:

Two Sample t-test

data: N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3007929 4.4492071
sample estimates:
mean of x mean of y
75.250 72.875


# Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #

# An additional option is available when running a two sample t-test: The Welch Two Sample T-Test. To utilize this option, "var.equal = TRUE" must be changed to "var.equal = FALSE". The Welch variant does not assume equal variances between the two samples, and is therefore the more robust option when that assumption is in doubt. A sketch of the modified call follows below. #
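
# Welch Two Sample T-Test Variant #

t.test(N2, N1, alternative = "two.sided", var.equal = FALSE, conf.level = 0.95)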

From the Two Sample t-test output above, we can conclude:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.

Application of Cohen’s d 

length(N1) # 8 #
length(N2) # 8 #

mean(N1) # 72.875 #
mean(N2) # 75.25 #

sd(N1) # 2.167124 #
sd(N2) # 1.669046 #

# length of sample 1 (x) #
lenx <- 8
# length of sample 2 (y) #
leny <- 8
# mean of sample 1 (x) #
meanx <- 72.875
# mean of sample 2 (y)#
meany <- 75.25
# SD of sample 1 (x) #
sdx <- 2.167124
# SD of sample 2 (y) #
sdy <- 1.669046

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd


Which produces the output:

[1] 1.227908

################################## 

From this output we can conclude:

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).
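
As an optional cross-check, the "effsize" package, available from the CRAN repository, provides a cohen.d() function. A minimal sketch, assuming that the package has been installed:

# Optional Cross-Check via the "effsize" Package #

library(effsize)

cohen.d(N2, N1)

# The reported estimate should approximately match our manually derived value of d = 1.23 #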

Combining both conclusions, our final written product would resemble:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).

And that is it for this article.

Until next time,

-RD