Reflections of a Data Scientist: December 2017

Saturday, December 9, 2017

(R) Post Hoc Analysis and One Way ANOVA (SPSS)

As was previously mentioned, new entries posted on this blog will be primarily non-R related.

Today’s post will discuss Post Hoc Analysis, specifically Tukey’s Honest Significant Difference test. This test is also known as the Tukey Method or Tukey’s HSD, and is available in R as TukeyHSD().

Post Hoc refers to the testing that is performed following an ANOVA test. What this testing seeks to discover is which specific groups within an ANOVA model differ significantly from one another. There are many different Post Hoc tests that can be utilized. For the purpose of this article, we will be specifically discussing Tukey’s HSD.

Something that I should mention before proceeding is the reason for using ANOVA as opposed to a T-Test. ANOVA allows us to compare the means of several groups simultaneously while holding the overall significance level fixed. If we had four experimental groups to compare, this would require 6 T-Tests.

1 vs. 2 | 1 vs. 3 | 1 vs. 4

2 vs. 3 | 2 vs. 4

3 vs. 4

Each T-Test, assuming an alpha of .05, has a 5% chance of producing a Type I error. Across six independent tests, the chance that at least one Type I error occurs is roughly 26% (1 - .95^6 ≈ .265), and can be as high as 30% (.05 * 6) in the worst case. A T-Test checks for a statistical difference between the means of two groups, whereas ANOVA checks for differences within the whole set of means using a single test.
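The number of comparisons and the chance of at least one Type I error can be checked directly in R. This is a minimal sketch that assumes the six tests are independent, which is an idealization; the variable names are only illustrative.

```r
# Number of pairwise comparisons among four groups
n_groups <- 4
n_tests <- choose(n_groups, 2)

# Every pair of groups, one column per comparison
pairs <- combn(n_groups, 2)

# Chance of at least one Type I error across all tests,
# assuming alpha = .05 per test and independent tests
alpha <- 0.05
fwer <- 1 - (1 - alpha)^n_tests

# Simple upper bound: alpha multiplied by the number of tests
upper_bound <- alpha * n_tests

print(pairs)
print(n_tests)
print(round(fwer, 3))
print(upper_bound)
```

The value 1 - (1 - alpha)^m is the family-wise error rate for m independent tests, while alpha * m is a bound that holds even without independence.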

If you recall from the previous article, we addressed two separate scenarios, one in which a cook was testing for the salt content of soup, and the other, in which the impact of study time was being assessed as it applied to students from two different schools.

We will run a Tukey’s HSD on the data collected from each study.

Scenario A: The Soup Scenario

satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)

salt <- c(rep("low",3), rep("med",4), rep("high",3))

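salttest <- data.frame(satisfaction, salt, stringsAsFactors = TRUE) # TukeyHSD() needs salt stored as a factor; R 4.0 and later no longer convert strings automatically #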

results <- aov(satisfaction~salt, data=salttest)

Now to run the Tukey HSD Post Hoc Inquiry:

TukeyHSD(results)

Which produces the output:

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = satisfaction ~ salt, data = salttest)

$salt
                diff       lwr      upr     p adj
low-high  1.00000000 -4.148005 6.148005 0.8387911
med-high  0.91666667 -3.898852 5.732185 0.8445186
med-low  -0.08333333 -4.898852 4.732185 0.9985693

Let us review each aspect of this output:

Diff – The difference between the two group means being compared.

Lwr – The lower bound of the confidence interval for the difference.

Upr – The upper bound of the confidence interval for the difference.

P adj – The p-value for each comparison, adjusted for multiple comparisons. Again, at a 95% confidence level, we will be looking for values that are less than .05.
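To make these columns concrete, the "low-high" row can be reproduced by hand. The sketch below assumes the Tukey-Kramer form of the interval (the version for unequal group sizes) and stores salt as a factor; the object names are illustrative rather than part of the original analysis.

```r
# Soup data from Scenario A, with salt stored as a factor
satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)
salt <- factor(c(rep("low", 3), rep("med", 4), rep("high", 3)))
soup <- data.frame(satisfaction, salt)
fit <- aov(satisfaction ~ salt, data = soup)

# Ingredients of the interval
means <- sapply(split(soup$satisfaction, soup$salt), mean)  # group means
n <- lengths(split(soup$satisfaction, soup$salt))           # group sizes
mse <- deviance(fit) / df.residual(fit)                     # residual mean square

# Difference and its standard error for the low vs. high comparison
diff_lh <- means[["low"]] - means[["high"]]
se_lh <- sqrt(mse * (1 / n[["low"]] + 1 / n[["high"]]))

# Critical value from the studentized range distribution
q_crit <- qtukey(0.95, nmeans = nlevels(soup$salt), df = df.residual(fit))
half_width <- q_crit / sqrt(2) * se_lh

manual <- c(diff = diff_lh, lwr = diff_lh - half_width, upr = diff_lh + half_width)
print(round(manual, 6))

# The same row as reported by TukeyHSD(), for comparison
print(TukeyHSD(fit)$salt["low-high", ])
```

The interval is the difference in means plus or minus a margin built from the residual mean square of the whole model, which is why every comparison shares the same error estimate.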

Each value within the “p adj” column corresponds to an assessment of the significance of the difference between one pair of salt levels. In the case of the above output, assuming an alpha value of .05, there were no significant differences between any of the salt levels (p = 0.839; p = 0.845; p = 0.999).

Scenario B: Schools, Study Time and Satisfaction Scenario

satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))

school <- c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))

schooltest <- data.frame(satisfaction, studytime, school, stringsAsFactors = TRUE) # TukeyHSD() needs the grouping variables stored as factors #

results <- aov(satisfaction ~ studytime * school, data=schooltest)

summary(results)

Now to run the Tukey HSD Post Hoc Inquiry:

TukeyHSD(results)

This produces the following output:

$studytime
                      diff        lwr      upr     p adj
Three Hours-One Hour  -1.3 -4.5013364 1.901336 0.5753377
Two Hours-One Hour     2.2 -1.0013364 5.401336 0.2198626
Two Hours-Three Hours  3.5  0.2986636 6.701336 0.0302463

This portion describes the differences between the levels of study time as they pertain to satisfaction.

The next portion of the output:

$school
                diff       lwr      upr    p adj
SchoolB-SchoolA -0.6 -2.760257 1.560257 0.571817

This portion describes the difference between the two schools as it pertains to satisfaction.

Finally, the last portion of the output:

$`studytime:school`
                                        diff        lwr       upr     p adj
Three Hours:SchoolA-One Hour:SchoolA    -0.4  -6.005413 5.2054132 0.9999178
Two Hours:SchoolA-One Hour:SchoolA       3.4  -2.205413 9.0054132 0.4401459
One Hour:SchoolB-One Hour:SchoolA        0.8  -4.805413 6.4054132 0.9976117
Three Hours:SchoolB-One Hour:SchoolA    -1.4  -7.005413 4.2054132 0.9696463
Two Hours:SchoolB-One Hour:SchoolA       1.8  -3.805413 7.4054132 0.9157375
Two Hours:SchoolA-Three Hours:SchoolA    3.8  -1.805413 9.4054132 0.3223867
One Hour:SchoolB-Three Hours:SchoolA     1.2  -4.405413 6.8054132 0.9844928
Three Hours:SchoolB-Three Hours:SchoolA -1.0  -6.605413 4.6054132 0.9932117
Two Hours:SchoolB-Three Hours:SchoolA    2.2  -3.405413 7.8054132 0.8260605
One Hour:SchoolB-Two Hours:SchoolA      -2.6  -8.205413 3.0054132 0.7067715
Three Hours:SchoolB-Two Hours:SchoolA   -4.8 -10.405413 0.8054132 0.1240592
Two Hours:SchoolB-Two Hours:SchoolA     -1.6  -7.205413 4.0054132 0.9470847
Three Hours:SchoolB-One Hour:SchoolB    -2.2  -7.805413 3.4054132 0.8260605
Two Hours:SchoolB-One Hour:SchoolB       1.0  -4.605413 6.6054132 0.9932117
Two Hours:SchoolB-Three Hours:SchoolB    3.2  -2.405413 8.8054132 0.5052080

This portion describes the differences between each combination of hours studied and school.

We can make the following interpretations from the above outputs:

There was a significant difference in satisfaction level between students who study two hours and students who study three hours (p = 0.0302463).

There was not a significant difference in satisfaction level between students who attend SchoolA and students who attend SchoolB.

There were no significant differences in satisfaction level as it pertains to the combination of factors: school and study time.
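With eighteen comparisons spread over three tables, it can help to filter the output programmatically instead of scanning it by eye. The following is a minimal sketch, assuming the grouping variables are stored as factors; the helper objects all_rows and significant are illustrative names.

```r
# Scenario B data, with the grouping variables stored as factors
satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8,
                  7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)
studytime <- factor(c(rep("One Hour", 10), rep("Two Hours", 10), rep("Three Hours", 10)))
school <- factor(rep(rep(c("SchoolA", "SchoolB"), each = 5), times = 3))
schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(satisfaction ~ studytime * school, data = schooltest)
tukey <- TukeyHSD(results)

# Stack the three comparison tables into a single data frame
all_rows <- do.call(rbind, lapply(names(tukey), function(term) {
  tab <- as.data.frame(tukey[[term]])
  data.frame(term = term, comparison = rownames(tab), tab, check.names = FALSE)
}))
rownames(all_rows) <- NULL

# Keep only the comparisons with an adjusted p-value below alpha = .05
significant <- all_rows[all_rows[["p adj"]] < 0.05, ]
print(significant)
```

If only one term is of interest, TukeyHSD() also accepts a which argument, for example TukeyHSD(results, which = "studytime").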

I have often been asked what differentiates an ANOVA Post Hoc Test (such as Tukey’s HSD) from a T-Test following an ANOVA calculation. The reasons for performing a Post Hoc Test, Tukey's in our case, as opposed to a T-Test, are as follows:

1. Performing multiple T-Tests to check for significance is time consuming, and nullifies the initial convenience of running an ANOVA test. Additionally, doing so re-creates the compounding probability of error that we originally sought to avoid.

2. Tukey’s HSD evaluates every comparison within the ANOVA model: it uses the model’s pooled error variance and adjusts each p-value for the full family of comparisons. Re-testing with T-Tests may show which data sets INDEPENDENTLY differ from one another, but it will not illustrate which data sets differ within the model.
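The difference between the two approaches can be seen by placing unadjusted pairwise T-Test p-values next to Tukey-adjusted ones. This sketch uses a one way model on study time alone (a simplification of the two way model above, chosen so both methods share the same pooled variance), and the object names are illustrative.

```r
satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8,
                  7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)
studytime <- factor(c(rep("One Hour", 10), rep("Two Hours", 10), rep("Three Hours", 10)))
study <- data.frame(satisfaction, studytime)

# Unadjusted pairwise T-Tests (pooled standard deviation, no correction)
raw <- pairwise.t.test(study$satisfaction, study$studytime, p.adjust.method = "none")
raw_p <- raw$p.value[lower.tri(raw$p.value, diag = TRUE)]

# Tukey-adjusted p-values from a one way ANOVA on study time
fit <- aov(satisfaction ~ studytime, data = study)
tukey_p <- TukeyHSD(fit)$studytime[, "p adj"]

comparison <- data.frame(unadjusted = raw_p, tukey = unname(tukey_p),
                         row.names = names(tukey_p))
print(round(comparison, 4))
```

Each Tukey-adjusted p-value is at least as large as its unadjusted counterpart, which is the price paid for keeping the overall error rate at the chosen alpha.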

Now, I will demonstrate how to perform both a One Way ANOVA Test and a Post Hoc Tukey’s HSD test within SPSS.

First, we will need to define the variables. This can be achieved within the “Variable View” portion of SPSS, which is selected by clicking the “Variable View” tab on the lower left hand side of the SPSS Data Editor window.

Once “Variable View” has been selected, we can begin by defining our variable types.

Here I have defined two variables, “Satisfaction” and “Soup”. Both were assigned the default variable parameters by SPSS.

Next, we need to define our value labels. To achieve this, I clicked on the cell in the “Values” column which coincides with the variable “Soup”. This brings up a user interface which allows for the entry of each value and the label assigned to it.

Once the labels have been defined, we can enter the data values into SPSS which are required for the assembly of our ANOVA model.

When this step has been completed, to proceed, we must choose “Analyze” from the upper most drop down menu. Select the option “Compare Means”, and the subsequent option, “One-Way ANOVA”.

This course of action should cause a menu to appear. For our “Dependent List” variable, we will choose, “Satisfaction”. For our “Factor” variable, we will choose “Soup”. Once this has been completed, select the middle box on the right corner of the menu which reads, “Post Hoc”. This causes another menu to appear which presents Post Hoc Analysis options. For our purposes, we will be checking the box next to “Tukey” prior to proceeding. Significance level should be left at .05 (or Alpha = .05). Click “Continue”, then click, “OK”.

This presents a more detailed Tukey’s HSD output than what was originally available in R.

Compared to the R output, the SPSS output details the following:

Mean Difference – The difference between the two group means being compared. “diff” in R.

Std. Error – The standard error of the difference between the compared means. Not printed by TukeyHSD() in R.

Lower Bound – The lower bound of the confidence interval for the difference. “lwr” in R.

Upper Bound – The upper bound of the confidence interval for the difference. “upr” in R.

Sig. – The p-value for each comparison. Again, at a 95% confidence level, we will be looking for values of significance that are less than .05. “p adj” in R.
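Although TukeyHSD() does not print a standard error, one can be computed from the fitted model. The sketch below assumes that the SPSS “Std. Error” column is the standard error of the difference between two group means based on the pooled residual mean square; that assumption is not confirmed by the output shown here, and the object names are illustrative.

```r
satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)
salt <- factor(c(rep("low", 3), rep("med", 4), rep("high", 3)))
soup <- data.frame(satisfaction, salt)
fit <- aov(satisfaction ~ salt, data = soup)

mse <- deviance(fit) / df.residual(fit)            # residual mean square
n <- lengths(split(soup$satisfaction, soup$salt))  # observations per salt level

# Standard error of each pairwise difference: sqrt(MSE * (1/n_i + 1/n_j))
pairs <- combn(levels(soup$salt), 2)
std_error <- apply(pairs, 2, function(p) {
  sqrt(mse * (1 / n[[p[1]]] + 1 / n[[p[2]]]))
})
names(std_error) <- paste(pairs[2, ], pairs[1, ], sep = "-")

print(round(std_error, 4))
```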

That’s all for now, Data Heads. I’ll see you again soon with a brand new article, the subject matter of which is yet to be determined.

Tuesday, December 5, 2017

(R) Analysis of Variance - ANOVA

In this article, we will discuss ANOVA, specifically, when its usage is appropriate, and how it can be utilized within R. This will likely be the final article of the current series of entries pertaining to The R Programming Language. Subsequent articles will discuss concepts and usage of software within the SPSS platform.

ANOVA is an abbreviation that represents a method known as The Analysis of Variance.

There are a few terms that are specific to ANOVA. Those are:

Way – Which refers to an independent variable within the ANOVA model.

Factor – Another term which refers to an independent variable.

Level – The category of an independent variable within the ANOVA model.

ANOVA compares the means of several sample groups by analyzing the variance between the groups relative to the variance within them. In many ways it is similar to a t-test; however, ANOVA allows for multiple group comparisons. This differs from the t-test, which only allows for one single group to be compared to another single group.

A post-hoc test is often performed after ANOVA has been calculated. We will discuss this topic in a different article. A post-hoc test is used to investigate which specific groups differ, and is typically utilized when the ANOVA model returns a significant result.

Like the t-test, there are different variations of the ANOVA model that are applicable depending on the data being analyzed. We will review three common ANOVA applications as they pertain to various data types. The analysis of the model output is performed through the utilization of the F-Test. For a detailed description of the F-Test, and what conclusions it provides, please refer to the previous article.

One Way ANOVA

As a reminder, Way, in this scenario, is referring to a single independent variable.

In a one way ANOVA, we are assuming the following:

1. Each sample is random.
2. Each sample is in no way influenced by the other sampling results.
3. Each dependent variable is sampled from a normally distributed population.
4. The variances of the samples should be equal, or approximately equal. The reason is that the populations from which the samples are drawn are assumed to have equal variances.
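Assumptions 3 and 4 can be checked informally in R. The sketch below uses the soup ratings from the example problem in this article, with the Shapiro-Wilk and Bartlett tests chosen as one common option among several; with samples this small, both tests have little power, so the results are only a rough guide.

```r
satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)
salt <- factor(c(rep("low", 3), rep("med", 4), rep("high", 3)))
soup <- data.frame(satisfaction, salt)
fit <- aov(satisfaction ~ salt, data = soup)

# Assumption 3: normality, checked on the model residuals
normality <- shapiro.test(residuals(fit))

# Assumption 4: equal variances across the groups
equal_var <- bartlett.test(satisfaction ~ salt, data = soup)

print(normality$p.value)
print(equal_var$p.value)

# Sample variance of each group, for a direct comparison
group_var <- sapply(split(soup$satisfaction, soup$salt), var)
print(round(group_var, 3))
```

A p-value above the chosen alpha in either test means the data give no strong evidence against the corresponding assumption.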

The hypothesis for this model type will be:

H0: u1 = u2 = u3 =…..etc.

H1: Not all means are equal.

Example Problem:

A chef wants to test if patrons prefer a soup which he prepares based on salt content. He prepares a limited experiment in which he creates three types of soup: soup with a low amount of salt, soup with a medium amount of salt, and soup with a high amount of salt. He then serves this soup to his customers and asks them to rate their satisfaction on a scale from 1-8.

Low Salt Soup is rated: 4, 1, 8
Medium Salt Soup is rated: 4, 5, 3, 5
High Salt Soup is rated: 3, 2, 5

Hypothesis:

H0: u1 = u2 = u3 (Mean satisfaction is equal for low, medium, and high salt content.)

H1: Not all means are equal.

Let’s use this data to create a model within R:

satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)

salt <- c(rep("low",3), rep("med",4), rep("high",3))

salttest <- data.frame(satisfaction, salt)

results <- aov(satisfaction~salt, data=salttest)

summary(results)

This produces the output:

            Df Sum Sq Mean Sq F value Pr(>F)
salt         2   1.92   0.958   0.209  0.816
Residuals    7  32.08   4.583

If p < .05, we will reject the null hypothesis.

Hypothesis: 0.816 > .05

Since the model’s p-value (.816) is greater than the assumed alpha (.05), we will fail to reject the null hypothesis. What this indicates is that, at a 95% confidence level, we cannot state from the data provided that there is a significant difference in customer satisfaction as it pertains to salt content in soup.
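The F value and p-value in the table can be rebuilt from the sums of squares, which shows where each column comes from. This is a minimal sketch using only base R; the object names are illustrative.

```r
satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)
salt <- factor(c(rep("low", 3), rep("med", 4), rep("high", 3)))

grand_mean <- mean(satisfaction)
group_means <- sapply(split(satisfaction, salt), mean)
group_sizes <- lengths(split(satisfaction, salt))

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within <- sum((satisfaction - ave(satisfaction, salt))^2)

# Degrees of freedom
df_between <- nlevels(salt) - 1
df_within <- length(satisfaction) - nlevels(salt)

# F statistic: ratio of the two mean squares
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(c(SumSqBetween = ss_between, SumSqWithin = ss_within), 2))
print(round(c(F = f_value, p = p_value), 3))
```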

Two Way ANOVA

Two way, in this scenario, is referring to the two independent variables which will be utilized within this ANOVA model.

The hypothesis for this model type will be:

1.

H0: u1 = u2 = u3 =…..etc. (All means are equal)

H1: Not all means are equal.

2.

H0: uVar1 = uVar2 (Var1’s value does not significantly differ from Var2’s value)

H1: uVar1 NE uVar2

3.

H0: An interaction is absent.

H1: An interaction is present.

Example Problem:

Researchers want to test study habits within two schools as they pertain to student life satisfaction. The researchers also believe that the school each group of students attends may have an impact on study habits. Students from each school are assigned study material which, in sum, totals 1 hour, 2 hours, or 3 hours on a daily basis. The satisfaction of each student group is measured on a scale from 1-10 after a duration of 1 month.

School A:

1 Hour of Study Time: 7, 2, 10, 2, 2
2 Hours of Study Time: 9, 10, 3, 10, 8
3 Hours of Study Time: 3, 6, 4, 7, 1

School B:

1 Hour of Study Time: 8, 5, 1, 3, 10
2 Hours of Study Time: 7, 5, 6, 4, 10
3 Hours of Study Time: 5, 5, 2, 2, 2

Let’s state our hypotheses, as they apply to this problem:

1.

H0: u1 = u2 = u3 (Satisfaction levels DO NOT differ depending on hours of daily study.)

H1: Not all means are equal. (Satisfaction levels DO differ depending on hours of daily study.)

2.

H0: uSchoolA = uSchoolB (Satisfaction levels DO NOT significantly differ depending on school.)

H1: uSchoolA NE uSchoolB (Satisfaction levels DO significantly differ depending on school.)

3.

H0: An interaction is absent. (The combination of school and study time is NOT impacting the outcome)

H1: An interaction is present. (The combination of school and study time IS impacting the outcome)

Entering this into R can be tricky, but stay with me:

satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))

school <- c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))

schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(satisfaction ~ studytime * school, data=schooltest)

summary(results)

Which produces the output:

                 Df Sum Sq Mean Sq F value Pr(>F)
studytime         2   62.6  31.300   3.809 0.0366 *
school            1    2.7   2.700   0.329 0.5718
studytime:school  2    7.8   3.900   0.475 0.6278
Residuals        24  197.2   8.217
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since we have three hypothesis tests, we must assess all three of the p-values present within the output.

Study Time

p = 0.0366

School

p = 0.5718

Study Time : School

p = 0.6278

In investigating the output we can make the following conclusions:

Hypothesis 1: 0.0366 < .05

Hypothesis 2: 0.5718 > .05

Hypothesis 3: 0.6278 > .05

If p < .05, we will reject the null hypothesis.

Hypothesis 1: Reject

Hypothesis 2: Fail to Reject

Hypothesis 3: Fail to Reject
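The same decisions can be read off the model programmatically, which avoids copying p-values by hand. The sketch below assumes an alpha of .05 and uses illustrative object names.

```r
satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8,
                  7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)
studytime <- factor(c(rep("One Hour", 10), rep("Two Hours", 10), rep("Three Hours", 10)))
school <- factor(rep(rep(c("SchoolA", "SchoolB"), each = 5), times = 3))
schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(satisfaction ~ studytime * school, data = schooltest)

# The ANOVA table is the first element of the summary object
anova_table <- summary(results)[[1]]

# p-values for the three model terms (the Residuals row has none)
p_values <- anova_table[["Pr(>F)"]][1:3]
names(p_values) <- trimws(rownames(anova_table))[1:3]

# Decision rule: reject the null hypothesis when p < alpha
alpha <- 0.05
decisions <- ifelse(p_values < alpha, "Reject", "Fail to Reject")

print(data.frame(p = round(p_values, 4), decision = decisions))
```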

So we can state:

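Students of different schools did not have significantly different satisfaction levels. There was a significant difference between the levels of study time as it pertains to satisfaction. No interaction effect was present.

(The observation columns passed to a Two Way ANOVA must be of equal length. aov() is also designed for balanced data, in which every combination of factors has the same number of observations.)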
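One way to confirm that every combination of study time and school has the same number of observations (a balanced design) is to tabulate the data before fitting the model. This is a minimal sketch with illustrative object names.

```r
satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8,
                  7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)
studytime <- factor(c(rep("One Hour", 10), rep("Two Hours", 10), rep("Three Hours", 10)))
school <- factor(rep(rep(c("SchoolA", "SchoolB"), each = 5), times = 3))
schooltest <- data.frame(satisfaction, studytime, school)

# Number of observations in each study time by school cell
cell_counts <- table(schooltest$studytime, schooltest$school)
print(cell_counts)

# The design is balanced when every cell has the same count
is_balanced <- length(unique(as.vector(cell_counts))) == 1
print(is_balanced)

# Mean satisfaction in each cell, useful for eyeballing an interaction
cell_means <- tapply(schooltest$satisfaction,
                     list(schooltest$studytime, schooltest$school), mean)
print(cell_means)
```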

Repeated-Measures ANOVA

A repeated measures ANOVA is similar to a paired t-test in that it samples from the same set more than once. This model contains one factor with at least two levels, and the levels are dependent.

Example Problem:

Researchers want to test the impact of reading existential philosophy on a group of 8 individuals. They measure the happiness of the participants three times, once prior to reading, once after reading the materials for one week, and once after reading the materials for two weeks. We will assume an alpha of .05.

Before Reading = 1, 8, 2, 4, 4, 10, 2, 9
After Reading = 4, 2, 5, 4, 3, 4, 2, 1
After Reading (wk. 2) = 5, 10, 1, 1, 4, 6, 1, 8

Hypothesis:

H0: u1 = u2 = u3

H1: Not all means are equal.

Let’s use this data to create a model within R:

library(lme4) # You will need to install and enable this package #

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8 )

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- rep(1:8, times = 3) # Participant IDs 1-8, repeated once for each of the three measurements #

survey <- data.frame(id, happiness, week)

model <- lmer(happiness ~ week + (1|id), data=survey)

anova(model)

Which produces the output:

Analysis of Variance Table
     Df Sum Sq Mean Sq F value
week  2 15.083  7.5417  1.0462

The F-Test statistic = 1.0462

To calculate the p-value of our test statistic, we can use the following r-code:

pf(q=1.0462, df1=2, df2=14, lower.tail=FALSE) # Test Statistic , Numerator Degrees of Freedom = 2, Denominator Degrees of Freedom = 14 #

Which produces the output:

[1] 0.3771816

If p < .05, we will reject the null hypothesis.

Hypothesis: 0.3771816 > .05

With this information, we can conclude that the three conditions did not significantly differ pertaining to level of happiness.
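For a balanced design like this one, the same F-test can also be obtained without additional packages. The sketch below swaps in base R's aov() with an Error() term in place of lmer(); it is an alternative route to the same test, not the method used above, and the object names are illustrative.

```r
happiness <- c(1, 8, 2, 4, 4, 10, 2, 9,
               4, 2, 5, 4, 3, 4, 2, 1,
               5, 10, 1, 1, 4, 6, 1, 8)
week <- factor(rep(c("Before", "Week1", "Week2"), each = 8))
id <- factor(rep(1:8, times = 3))  # participant identifier, stored as a factor
survey <- data.frame(id, happiness, week)

# Repeated-measures ANOVA: week is tested against the within-participant error
rm_fit <- aov(happiness ~ week + Error(id / week), data = survey)
print(summary(rm_fit))

# Pull the F value and p-value for week out of the within-participant stratum
rm_table <- summary(rm_fit)[["Error: id:week"]][[1]]
f_value <- rm_table[["F value"]][1]
p_value <- rm_table[["Pr(>F)"]][1]

print(round(c(F = f_value, p = p_value), 4))
```

Because the p-value is part of the summary, the separate pf() step is not needed with this approach.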

A similar methodology that can be utilized to perform this analysis:

library(nlme) # You will need to install and enable this package; lme() comes from nlme, so lme4 is not required here #

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8 )

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- rep(1:8, times = 3) # Participant IDs 1-8, repeated once for each of the three measurements #

survey <- data.frame(id, happiness, week)

model <- lme(happiness ~ week, random=~1|id, data=survey)

anova(model)

This method saves some time by reporting the p-value directly in the output:

            numDF denDF  F-value p-value
(Intercept)     1    14 37.21053  <.0001
week            2    14  1.04624  0.3772

That is all for now, Data Heads. The topic of the next article will be Post-Hoc Analysis. Stay tuned!