Friday, June 25, 2021

(R) The Levene's Test

In today’s article we will be discussing a technique which is not specifically interesting or pragmatically applicable. Still, for the sake of true data science proficiency, today we will be discussing, THE LEVENE'S TEST!

The Levene's Test is utilized to assess whether the variances of two separate data sets significantly differ (i.e., homogeneity of variance).

So naturally, our hypothesis would be:

Null Hypothesis: The variance measurements of the two data sets do not significantly differ.

Alternative Hypothesis: The variance measurements of the two data sets do significantly differ.

The Levene's Test Example:

# The leveneTest() Function is included within the “car” package #

library(car)

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

N_LEV <- c(N1, N2)

group <- as.factor(c(rep(1, length(N1)), rep(2, length(N2))))

leveneTest(N_LEV, group)

# The above code is a modification of code provided by StackExchange user: ocram. #

# Source https://stats.stackexchange.com/questions/15722/how-to-use-levene-test-function-in-r #

This produces the output:

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.7677 0.2049
      14

Since the p-value of the output exceeds .05, we will not reject the null hypothesis (alpha = .05).

Conclusions:

The Levene’s Test for Equality of Variances did not indicate a significant difference between the variance measurement of Sample N1 and the variance measurement of Sample N2, F(1, 14) = 1.77, p = .20.

So, what is the overall purpose of this test? Meaning, when would its application be appropriate? The Levene’s Test is typically utilized as a pre-test prior to the application of the standard T-Test. However, it is uncommon to structure a research experiment in this manner. Therefore, the Levene’s Test is more often witnessed within the classroom than within the field.

Still, if you find yourself in circumstances in which this test is requested, know that it is often required to determine whether a standard T-Test is applicable. If variances are found to be un-equal, a Welch’s T-Test is typically preferred as an alternative to the standard T-Test.
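As a minimal sketch of that decision, and assuming the N1 and N2 vectors from the example above: in R, t.test() performs Welch's version by default, while var.equal = TRUE requests the standard pooled version.

```r
# Sample vectors from the Levene's Test example above
N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)
N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

# Welch's T-Test: R's default, which does not assume equal variances
welch <- t.test(N1, N2)

# Standard (pooled) T-Test: appropriate when Levene's Test is non-significant
pooled <- t.test(N1, N2, var.equal = TRUE)

welch$p.value
pooled$p.value
```

Since our Levene's Test above was non-significant, the pooled version would be defensible here; with equal sample sizes the two t statistics coincide, though their degrees of freedom (and thus their p-values) may differ slightly.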

-----------------------------------------------------------------------------------------------------------------------------

I promise that my next article will be more exciting.

Until next time.

-RD

Friday, June 18, 2021

(R) Imputing Missing Data with the MICE() Package


In today’s article we are going to discuss basic utilization of the MICE package.

The MICE package assists with performing analysis on shoddily assembled data frames.

In the world of data science (the real world, not the YouTube world, or the classroom world), data often arrives in a less than optimal state. More often than not, this is the reality of the matter.

Now, it would be easy to throw up your hands and say, “I CAN’T PERFORM ANY SORT OF ANALYSIS WITH ALL OF THESE MISSING VARIABLES”,


~OR~

(Don’t succumb to temptation!) 

Unfortunately, for you, the data scientist, whoever passed you this data expects a product and not your excuses.

Fortunately, for all of us, there is a way forward.

Example:

Let’s say that you were given this small data set for analysis:

(Image: the example data set, containing missing values)

The data is provided in an .xls format, because why wouldn’t it be?

For the sake of not having you download an example data file, I have re-coded this data into the R format.

# Create Data Frame: "SheetB" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA , 1, NA, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)

VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)

VarD <- c(70, 80, NA, 87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)

VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)


SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)


If you would like to see a version of the initial example file with the missing values, the code to create this data frame is below:

# Create Data Frame: "SheetA" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, 8, 17, 13, 6, 2, 18, 12, 17, 13, 9, 14, 18, 6, 13, 5, 2)

VarC <- c(2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 4, 4, 1, 4, 3, 1, 2, 3, 1)

VarD <- c(70, 80, 90, 87, 79, 60, 61, 75, 92, 67, 62, 93, 74, 80, 91, 51, 64, 33, 77, 50)

VarE <- c(980, 800, 983, 925, 821, 978, 881, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)

SheetA <- data.frame(VarA, VarB, VarC, VarD, VarE)


In our example, we’ll assume that the sheet which contains all values is unavailable to you (“SheetA”). Therefore, to perform any sort of meaningful analysis, you will need to either delete all observations which contain missing data variables (DON’T DO IT!), or, run an imputation function.


We will opt for the latter, and the function which we will utilize is the mice() function.

First, we will initialize the appropriate library:

# Initialize Library #

library(mice)

Next, we will perform the imputation function contained within the library.

# Perform Imputation #

SheetB_Imputed <- mice(SheetB, m=1, maxit = 50, method = 'pmm', seed = 500)

SheetB: is the data frame which is being called by the function.

m = 1: This is the number of imputed data frame variations which will be generated as a result of the mice function. One is all that is necessary.

maxit: This is the maximum number of iterations which will occur as the mice function calculates what it determines to be the optimal value of each missing variable cell.

method: This is the imputation method which will be utilized; 'pmm' specifies predictive mean matching.

seed: The mice() function partially relies on randomness to generate missing variable values. The seed value can be whatever value you determine to be appropriate.

After performing the above function, you should be greeted with the output below:

iter imp variable
1 1 VarA VarB VarC VarD VarE
2 1 VarA VarB VarC VarD VarE
3 1 VarA VarB VarC VarD VarE
4 1 VarA VarB VarC VarD VarE
5 1 VarA VarB VarC VarD VarE
6 1 VarA VarB VarC VarD VarE
7 1 VarA VarB VarC VarD VarE
8 1 VarA VarB VarC VarD VarE
9 1 VarA VarB VarC VarD VarE
10 1 VarA VarB VarC VarD VarE
11 1 VarA VarB VarC VarD VarE
12 1 VarA VarB VarC VarD VarE
13 1 VarA VarB VarC VarD VarE
14 1 VarA VarB VarC VarD VarE
15 1 VarA VarB VarC VarD VarE
16 1 VarA VarB VarC VarD VarE
17 1 VarA VarB VarC VarD VarE
18 1 VarA VarB VarC VarD VarE
19 1 VarA VarB VarC VarD VarE
20 1 VarA VarB VarC VarD VarE
21 1 VarA VarB VarC VarD VarE
22 1 VarA VarB VarC VarD VarE
23 1 VarA VarB VarC VarD VarE
24 1 VarA VarB VarC VarD VarE
25 1 VarA VarB VarC VarD VarE
26 1 VarA VarB VarC VarD VarE
27 1 VarA VarB VarC VarD VarE
28 1 VarA VarB VarC VarD VarE
29 1 VarA VarB VarC VarD VarE
30 1 VarA VarB VarC VarD VarE
31 1 VarA VarB VarC VarD VarE
32 1 VarA VarB VarC VarD VarE
33 1 VarA VarB VarC VarD VarE
34 1 VarA VarB VarC VarD VarE
35 1 VarA VarB VarC VarD VarE
36 1 VarA VarB VarC VarD VarE
37 1 VarA VarB VarC VarD VarE
38 1 VarA VarB VarC VarD VarE
39 1 VarA VarB VarC VarD VarE
40 1 VarA VarB VarC VarD VarE
41 1 VarA VarB VarC VarD VarE
42 1 VarA VarB VarC VarD VarE
43 1 VarA VarB VarC VarD VarE
44 1 VarA VarB VarC VarD VarE
45 1 VarA VarB VarC VarD VarE
46 1 VarA VarB VarC VarD VarE
47 1 VarA VarB VarC VarD VarE
48 1 VarA VarB VarC VarD VarE
49 1 VarA VarB VarC VarD VarE
50 1 VarA VarB VarC VarD VarE 


The output is informing you that 50 iterations were performed in the creation of one single imputed set.

The code below produces a completed data frame: all of the initial values from the original set are retained, and newly estimated values now occupy the variable cells which were previously blank.

# Assign Original Values with Imputations to Data Frame #

SheetB_Imputed_Complete <- complete(SheetB_Imputed)


The outcome should resemble something like:

(Beautiful!)
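To confirm that the imputation filled every previously missing cell, a quick check with anyNA() can be run against both frames. The sketch below re-creates "SheetB" from above; printFlag = FALSE is an optional mice() argument which merely suppresses the 50 lines of iteration output.

```r
library(mice)

# Re-create the "SheetB" data frame from the example above
VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA, 1, NA, 0, 0, 0, 0)
VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)
VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)
VarD <- c(70, 80, NA, 87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)
VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)
SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)

# Perform the imputation, quietly
SheetB_Imputed <- mice(SheetB, m = 1, maxit = 50, method = 'pmm',
                       seed = 500, printFlag = FALSE)
SheetB_Imputed_Complete <- complete(SheetB_Imputed)

anyNA(SheetB)                   # TRUE: the original contains missing values
anyNA(SheetB_Imputed_Complete)  # FALSE: the completed frame contains none
```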


A quick warning, the mice() function cannot be utilized on data frames which contain unencoded categorical variable entries.

An example of this:

(Image: a data frame containing an unencoded categorical column, "VARC")

To get mice() to work correctly on this data set, you must recode "VARC" prior to proceeding. You could do this by changing each instance of "Spade" to 1, "Club" to 2, “Diamond" to 3, and "Heart" to 4.
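A minimal sketch of such a recoding, using a hypothetical "VARC" vector and the suit-to-number mapping described above:

```r
# Hypothetical unencoded categorical column
VARC <- c("Spade", "Club", "Heart", "Diamond", "Club", "Spade")

# Named vector mapping each category to its numeric code
suit_codes <- c("Spade" = 1, "Club" = 2, "Diamond" = 3, "Heart" = 4)

# Indexing the named vector by the category labels performs the recode
VARC_Encoded <- unname(suit_codes[VARC])
VARC_Encoded  # 1 2 4 3 2 1
```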

For more information as it relates to this function, please check out this link.

That’s all for now, internet.

-RD

Saturday, June 12, 2021

(R) 2-Sample Test for Equality of Proportions

In today’s article we are going to revisit in greater detail, a topic which was reviewed in a prior article.

What the 2-Sample Test for Equality of Proportions seeks to achieve is an assessment of whether one survey group’s response proportion significantly differs from that of another.

To illustrate the application of this methodology, I will utilize a prior example which was previously published to this site (10/15/2017).

Example:

A pollster took a survey of 1300 individuals, the results of which indicated that 600 were in favor of Candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor of Candidate A. At a 10% significance level, is there evidence that the candidate's popularity has decreased?

# Model Hypothesis #

# H0: p1 - p2 = 0 #

# (The proportions are the same) # 

# Ha: p1 - p2 > 0 #

# (The first proportion is greater than the second, i.e., popularity has decreased) #

# Disable Scientific Notation in R Output #

options(scipen = 999)

# Model Application #

prop.test(x = c(600,500), n=c(1300,1500), conf.level = .95, correct = FALSE)


Which produces the output:

2-sample test for equality of proportions without continuity correction

data: c(600, 500) out of c(1300, 1500)
X-squared = 47.991, df = 1, p-value = 0.000000000004281
alternative hypothesis: two.sided
95 percent confidence interval:
0.09210145 0.16430881
sample estimates:
prop 1 prop 2
0.4615385 0.3333333
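Note that the stated alternative hypothesis (p1 - p2 > 0) is one-sided, while the call above used R's default two-sided alternative. As a sketch, the one-sided version can be requested explicitly via the alternative argument; when the observed difference lies in the hypothesized direction, this roughly halves the p-value.

```r
# One-sided version of the test above (p1 greater than p2)
one_sided <- prop.test(x = c(600, 500), n = c(1300, 1500),
                       alternative = "greater", conf.level = .90,
                       correct = FALSE)
one_sided$p.value
```

Either way, the conclusion at alpha = .10 is unchanged here, as the two-sided p-value is already far below the threshold.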

We are now prepared to state the details of our model’s application, and the subsequent findings and analysis which occurred as a result of such.

Conclusions:

A 2-Sample Test for Equality of Proportions without Continuity Correction was performed to analyze whether the poll survey results for Candidate A significantly differed from subsequent poll survey results gathered weeks later. A 90% confidence level (alpha = .10) was assumed for significance.

There was a significant difference in Candidate A’s favorability score from the initial poll findings: 46% (600/1300), as compared to Candidate A’s favorability score from the subsequent poll findings: 33% (500/1500); χ2 (1, N = 2800) = 47.99, p < .001.

-----------------------------------------------------------------------------------------------------------------------------

That's all for now.

I'll see you next time, Data Heads.

-RD

Saturday, June 5, 2021

(R) Pearson’s Chi-Square Test Residuals and Post Hoc Analysis

In today’s article, we are going to discuss Pearson Residuals. A Pearson Residual is a product of post hoc analysis. These values can be utilized to further assess Pearson’s Chi-Square Test results.

If you are unfamiliar with the Pearson’s Chi-Square Test, or with what post hoc analysis typically entails, I would encourage you to do further research prior to proceeding.

Example:

To demonstrate this post hoc technique, we will utilize a prior article’s example:

The "Smoking : Obesity" Pearson’s Chi-Squared Test Demonstration.

# Create the contingency matrix #

Model <-matrix(c(5, 1, 2, 2),

nrow = 2,

dimnames = list("Smoker" = c("Yes", "No"),

"Obese" = c("Yes", "No")))

# To run the chi-square test #

# 'correct = FALSE' disables the Yates’ continuity correction #

chisq.test(Model, correct = FALSE)


This produces the output:

Pearson's Chi-squared test

data: Model
X-squared = 1.2698, df = 1, p-value = 0.2598

From the output provided, we can easily conclude that our results were not significant.

However, let’s delve a bit deeper into our findings.

First, let’s take a look at the matrix of the model.

Model

      Obese
Smoker Yes No
   Yes   5  2
   No    1  2


Now, let’s take a look at the expected model values.

chi.result <- chisq.test(Model, correct = FALSE)

chi.result$expected


      Obese
Smoker Yes  No
   Yes 4.2 2.8
   No  1.8 1.2


What does this mean?

The values above represent the counts which we would expect to observe if the two categorical variables were entirely independent of one another, i.e., if the null hypothesis held exactly.
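These expected counts can also be derived by hand: each cell's expectation is its row total multiplied by its column total, divided by the grand total. A quick sketch reproducing the expected matrix:

```r
# Re-create the contingency matrix from the example
Model <- matrix(c(5, 1, 2, 2), nrow = 2,
                dimnames = list("Smoker" = c("Yes", "No"),
                                "Obese"  = c("Yes", "No")))

# Expected cell count = (row total * column total) / grand total
Expected <- outer(rowSums(Model), colSums(Model)) / sum(Model)
Expected
```

For the Smoker-Yes / Obese-Yes cell, this is (7 * 6) / 10 = 4.2, matching the chisq.test() output.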

(Karl Pearson)

From the previously derived values, we can derive the Pearson Residual Values.

print(chi.result$residuals)

      Obese
Smoker        Yes         No
   Yes  0.3903600 -0.4780914
   No  -0.5962848  0.7302967

What we are specifically looking for, as it pertains to the residual output, are values which are greater than +2, or less than -2. If these findings were present in any of the above matrix entries, it would indicate that the model was inappropriately applied given the circumstances of the collected observational data.

The matrix values themselves, in the residual matrix, are the observed categorical values minus the expected values, divided by the square root of the expected values. 

Thus: Standard Residual = (Observed Values – Expected Value) / Square Root of Expected Value

Observed Values

      Obese
Smoker Yes No
   Yes   5  2
   No    1  2

Expected Values

      Obese
Smoker Yes  No
   Yes 4.2 2.8
   No  1.8 1.2

(5 – 4.2) / √ 4.2 = 0.3903600 

(1 – 1.8) / √ 1.8 = -0.5962848

(2 – 2.8) / √ 2.8 = -0.4780914

(2 – 1.2) / √ 1.2 = 0.7302967

~ OR ~

(5 - 4.2) / sqrt(4.2)

(1 - 1.8) / sqrt(1.8)

(2 - 2.8) / sqrt(2.8)

(2 - 1.2) / sqrt(1.2)


[1] 0.39036
[1] -0.5962848
[1] -0.4780914
[1] 0.7302967

The Pearson Residual Values (0.39036, etc.) express the raw residuals in approximate standard deviation units. It is for this reason that any value greater than +2, or less than -2, would indicate a misapplication of the model. Or, at the very least, indicate that more observational values ought to be collected prior to the model being applied again.
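In addition to the residuals component, the object returned by chisq.test() also stores adjusted standardized residuals in its stdres component, which are scaled to follow a standard normal distribution more exactly; the same +/-2 rule of thumb applies. A sketch, re-creating the Model matrix from above:

```r
# Re-create the contingency matrix from the example
Model <- matrix(c(5, 1, 2, 2), nrow = 2,
                dimnames = list("Smoker" = c("Yes", "No"),
                                "Obese"  = c("Yes", "No")))

chi.result <- chisq.test(Model, correct = FALSE)

# Pearson residuals, as derived above
chi.result$residuals

# Adjusted standardized residuals
chi.result$stdres
```

For this data, neither set of residuals approaches the +/-2 threshold, which is consistent with the non-significant omnibus result.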

The Fisher’s Exact Test as a Post Hoc Analysis for The Pearson's Chi-Square Test

Let’s take our example one step further by applying The Fisher’s Exact Test as a method of post hoc analysis.

Why would we do this?

Assuming that our Chi-Square Test findings were significant, we may want to consider a Fisher’s Exact Test as a method of gathering further evidence of significance.

A Fisher’s Exact Test computes an exact p-value, rather than relying upon the chi-square approximation. With small samples such as ours, it is the more conservative of the two tests, and will often yield a higher p-value than its Chi-Square counterpart (as it does here).

(Sir Ronald Fisher)

fisher.result <- fisher.test(Model)

print(fisher.result$p.value)


[1] 0.5

<Yikes!>

Conclusions

Now that we have considered our analysis every which way, we can state our findings in APA Format.

This would resemble the following:

A chi-square test of independence was performed to examine the relation between smoking and obesity. The relation between these variables was not found to be significant, χ2 (1, N = 10) = 1.27, p > .05.


In investigating the Pearson Residuals produced from the model application, no value was found to be greater than +2, or less than -2. These findings indicate that the model was appropriate given the circumstances of the experimental data.

In order to further confirm our experimental findings, a Fisher’s Exact Test was also performed for post hoc analysis. The results of such indicated a non-significant relationship as it pertains to obesity as determined by individual smoker status: 71% (5/7), compared to individual non-smoker status: 33% (1/3); (p > .05).


-----------------------------------------------------------------------------------------------------------------------------

I hope that you found all of this helpful and entertaining.

Until next time,

-RD