Reflections of a Data Scientist: November 2020

In today’s entry, we are going to discuss Cohen’s d, what it is, and when to utilize it. We will also discuss how to appropriately apply the methodology needed to derive this value, through the utilization of the R software package.

(SPSS does not contain the innate functionality necessary to perform this calculation)

Cohen’s d - (What it is):

Cohen’s d is utilized as a method to assess the magnitude of impact as it relates to two sample groups which are subject to differing conditions. For example, if a two sample t-test was being implemented to test a single group which received a drug, against another group which did not receive the drug, then the p-value of this test would determine whether or not the findings were significant.

Cohen’s d would measure the magnitude of the potential impact.

Cohen’s d - (When to use it):

In your statistics class.

You could also utilize this test to perform post-hoc analysis as it relates to the ANOVA model and the Student’s T-Test. However, I have never witnessed the utilization of this test outside of an academic setting.

Cohen’s d – (How to interpret it):

General Interpretation Guidelines:

Greater than or equal to 0.2 = small
Greater than or equal to 0.5 = medium
Greater than or equal to 0.8 = large

Cohen’s d – (How to state your findings):

The effect size for this analysis (d = x.xx) was found to exceed Cohen’s convention for a [small, medium, large] effect (d = .xx).

Cohen’s d – (How to derive it):

# Within the R-Programming Code Space #

##################################

# length of sample 1 (x) #
lenx <-
# length of sample 2 (y) #
leny <-
# mean of sample 1 (x) #
meanx <-
# mean of sample 2 (y)#
meany <-
# SD of sample 1 (x) #
sdx <-
# SD of sample 2 (y) #
sdy <-

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd

##################################

# The above code is a modified version of the code found at: #

# https://stackoverflow.com/questions/15436702/estimate-cohens-d-for-effect-size #

Cohen’s d – (Example):

FIRST WE MUST RUN A TEST IN WHICH COHEN’S d CAN BE APPLIED AS AN APPROPRIATE POST-HOC TEST METHODOLOGY.

Two Sample T-Test

This test is utilized if you randomly sample different sets of items from two separate control groups.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures temperature in samples which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?

For this, we will use the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)

Which produces the output:

Two Sample t-test

data: N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3007929 4.4492071
sample estimates:
mean of x mean of y
75.250 72.875

# Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #

# An additional option is available when running a two sample t-test, The Welch Two Sample T-Test. To utilize this option while performing a t-test, the "var.equal = TRUE" must be changed to "var.equal = FALSE". The output produced from a Welch Two Sample t-test is slightly more robust and accounts for differing sample sizes. #

From this output we can conclude:

With a p-value of 0.02773 (.0.02773 < .05), and a corresponding t-value of 2.4558, we can state that, at a 95% confidence interval, that the scientist's chemical is altering the temperature of the water.

Application of Cohen’s d

length(N1) # 8 #
length(N2) # 8 #

mean(N1) # 72.875 #
mean(N2) # 75.25 #

sd(N1) # 2.167124 #
sd(N2) # 1.669046 #

# length of sample 1 (x) #
lenx <- 8
# length of sample 2 (y) #
leny <- 8
# mean of sample 1 (x) #
meanx <- 72.875
# mean of sample 2 (y)#
meany <- 75.25
# SD of sample 1 (x) #
sdx <- 2.167124
# SD of sample 2 (y) #
sdy <- 1.669046

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd

Which produces the output:

[1] 1.227908

##################################

From this output we can conclude:

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).

Combining both conclusions, our final written product would resemble:

With a p-value of 0.02773 (.0.02773 < .05), and a corresponding t-value of 2.4558, we can state that, at a 95% confidence interval, that the scientist's chemical is altering the temperature of the water.

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).

And that is it for this article.

Until next time,

-RD

Reflections of a Data Scientist

Monday, November 9, 2020

(R) Cohen’s d