Tuesday, February 27, 2018

(R) Mann-Whitney U Test (SPSS)

In the last article, we discussed non-parametric tests, specifically, the Wilcoxon Signed Rank Test. In this article, we will be discussing another test of a similar nature, the Mann-Whitney U Test. The Mann-Whitney U Test is a spiritual relative of the Wilcoxon Signed Rank Test, as it also utilizes ranks, and is employed almost exclusively for the analysis of non-parametric data.

The Mann-Whitney U Test provides a non-parametric alternative to the Two Sample Student’s T-Test. While I would recommend the latter simply due to its innate robustness, the Mann-Whitney U Test will appear from time to time in research papers. For this reason, and for a greater understanding of the inner workings of the underlying methodology, the Mann-Whitney U Test should, at the very least, be momentarily contemplated.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures the temperature of samples to which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, at a 95% confidence level, that his chemical is in some way altering the temperature of the water?

For this, we will utilize the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)
N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

wilcox.test(N2, N1, alternative = "two.sided", paired = FALSE, conf.level = 0.95)


Which produces the output:

Wilcoxon rank sum test with continuity correction

data: N2 and N1
W = 50.5, p-value = 0.05575
alternative hypothesis: true location shift is not equal to 0


From this output we can conclude:

With a p-value of 0.05575 (0.05575 > .05), we fail to reject the null hypothesis at a 95% confidence level. There is insufficient evidence to conclude that the scientist's chemical is altering the temperature of the water.
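If you are curious as to how the W statistic is derived, it is the sum of the ranks assigned to the first sample within the pooled data, minus n(n + 1)/2, where n is that sample's size. Below is a minimal sketch of this hand calculation (using N1 and N2 as defined above):

# Manual check of the W statistic (a sketch; N1 and N2 as defined above) #

ranks <- rank(c(N2, N1)) # pooled ranks; tied values receive averaged ranks

sum(ranks[1:length(N2)]) - (length(N2) * (length(N2) + 1)) / 2 # = 50.5, matching W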

The t-test equivalent of this analysis would resemble:

(If we were measuring mean values)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, paired=FALSE, conf.level = 0.95)

Which produces the output:

Two Sample t-test

data: N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3007929 4.4492071
sample estimates:
mean of x mean of y
75.250 72.875


From observing the output of both tests, you can see the difference between the p-values provided by the two analysis methods: p-value = 0.05575 (Wilcoxon) vs. 0.02773 (t-test).

Below are the steps necessary to perform the above analysis within the SPSS platform.

Mann-Whitney U Test Example:

For this particular test, data must be structured in an unconventional manner. The cases are combined into one single variable, with a second variable designating the group to which each case belongs.

Below is our example data set:


From the “Analyze” menu, select “Nonparametric Tests”, then select “Legacy Dialogs”, followed by “2 Independent Samples”.


This should populate the menu below:


Select “N1N2”, and utilize the top center arrow to designate these values as “Test Variable(s)”. Once this has been completed, utilize the bottom center arrow to designate “Group” as our “Grouping Variable”. Two groups exist, which we must specifically define. To achieve this, click “Define Groups”, then enter the value “1” into the input adjacent to “Group 1”. Next, enter the value “2” into the input adjacent to “Group 2”. Once this step has been completed, click “Continue”, and then click “OK”.

This will generate the output below:


The two values from the output that are relevant for our purposes are those labeled “Asymp. Sig.” and “Exact Sig.”. There is some debate amongst researchers as to which value should be utilized for reaching a statistical conclusion. Some recommend utilizing “Exact Sig.” when conducting analysis that contains only a few data points, and relying on “Asymp. Sig.” when working with larger data sets.
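A similar distinction exists within R's wilcox.test() function through its "exact" argument. A brief sketch (again using N1 and N2 from the example above); note that when ties are present, R cannot compute the exact p-value and will fall back to the normal approximation with a warning:

wilcox.test(N2, N1, exact = TRUE) # exact p-value requested; the ties here force the approximation

wilcox.test(N2, N1, exact = FALSE) # asymptotic p-value (normal approximation)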

Remember, SPSS and R calculate output values differently for both the Mann-Whitney U Test and the Wilcoxon Signed Rank Test. This differentiation arises from the methodology utilized to resolve rank order.

(R) Wilcoxon Signed Rank Test (SPSS)

In this entry, we will be learning how to utilize the Wilcoxon Signed Rank Test. This article is the first in a series which will discuss non-parametric tests.

What is a non-parametric test?

A non-parametric test is a method of analysis which is utilized to analyze sets of data which do not comply with a specific distribution type. As a result, this particular type of test is, by design, more robust.

Many parametric tests require, as a prerequisite, that the underlying data be structured in a certain manner. Typically, however, violations of these requirements do not adversely impact test results to a significant degree, as many tests which are parametric in nature have a good deal of robustness built into their models.

Therefore, though I believe that it is important to be familiar with tests of this particular type, I would typically recommend performing their parametric alternatives. The reason for this recommendation relates to the general acceptance of, and greater familiarity with, the parametric tests.

Wilcoxon Signed Rank Test

The Wilcoxon Signed Rank Test provides a non-parametric alternative to both the One Sample Student’s T-Test and the Paired Student’s T-Test. This test shares a particular commonality with the other non-parametric tests which will be discussed in later articles, in that it utilizes a ranking system to increase the robustness of measurements. The test is named for Frank Wilcoxon, the chemist and statistician who initially derived it.

Wilcoxon Signed Rank Test (One Sample)

Example:

A factory employee believes that the cakes produced within his factory are being manufactured with excess amounts of corn syrup, thus altering the taste. 10 cakes were sampled from the most recent batch and tested for corn syrup composition. Typically, each cake is measured to contain a median value of 20% corn syrup. Utilizing a 95% confidence level, can we assume that the new batch of cakes contains more than a median measurement of 20% corn syrup?

The levels of the samples were:

.27, .31, .27, .34, .40, .29, .37, .14, .30, .20

Our hypothesis test will be:

H0: m = .2
HA: m > .2

The t-test equivalent of this hypothesis would be:

H0: u = .2
HA: u > .2

(If we were measuring mean values)

The key difference being that the Wilcoxon Signed Rank Test tests median values (m), while the t-test tests mean values (u).

With our hypotheses created, we can state that the hypothesized median is m = .2, and that this test will be right tailed.

Within the R platform, the code required to perform this analysis is as follows:

# Wilcoxon Signed Rank Test (One Sample) #

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

wilcox.test(N, alternative="greater", mu= .2, conf.level = 0.95)

# " alternative = " Specifies the type of test that R will perform. "greater" indicates a right tailed test. "left" indicates a left tailed test."two.sided" indicates a two tailed test. #


Which produces the output:

Wilcoxon signed rank test with continuity correction

data: N
V = 44, p-value = 0.006386
alternative hypothesis: true location is greater than 0.2


Warning messages:
1: In wilcox.test.default(N, alternative = "greater", mu = 0.2, conf.level = 0.95) :
cannot compute exact p-value with ties
2: In wilcox.test.default(N, alternative = "greater", mu = 0.2, conf.level = 0.95) :
cannot compute exact p-value with zeroes


You can ignore these warning messages. Due to two values within our data set being tied, R could not assign those values separate ranks. Since ranking is a primary component of the model’s analysis, the R console is making you aware that these ties exist. Also, since the Wilcoxon Signed Rank Test ranks the differences between each data value and the hypothesized median, a difference of zero (here, the value .20) must be discarded, and R informs you of this as well. To understand exactly what this means, you will first have to learn how to derive the results of this test by hand.

V = the sum of the ranks assigned to the positive differences between the data values and the hypothesized median. This would be the value of the T+ variable if calculating the test by hand. Beyond this, the value provides no additional significance pertaining to the analysis.
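As a preview of that hand calculation, below is a minimal sketch which reproduces V for our example data:

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

d <- N - .2 # differences from the hypothesized median

d <- d[d != 0] # zero differences are discarded (the source of warning 2)

r <- rank(abs(d)) # rank the absolute differences; ties receive averaged ranks (warning 1)

sum(r[d > 0]) # sum of the ranks of the positive differences = 44, matching V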

Since the p-value is less than .05, we conclude, at a 95% confidence level, that the cakes being produced contain an excess amount of corn syrup.

The t-test equivalent of this analysis would resemble:

(If we were measuring mean values)

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

t.test(N, alternative = "greater", mu = .2, conf.level = 0.95)

# " alternative = " Specifies the typer of test that R will perform. "greater" indicates a right tailed test. "left" indicates a left tailed test."two.sided" indicates a two tailed test. #


Which would produce the output:

One Sample t-test

data: N
t = 3.6713, df = 9, p-value = 0.002572
alternative hypothesis: true mean is greater than 0.2
95 percent confidence interval:
0.244562 Inf
sample estimates:
mean of x
0.289


From observing the output of both tests, you can see the slight difference between the p-values provided by the two analysis methods: p-value = 0.006386 (Wilcoxon) vs. 0.002572 (t-test).

Wilcoxon Signed Rank Test (Two Sample)

As mentioned previously, the Wilcoxon Signed Rank Test is the non-parametric alternative to the Paired Student’s T-Test.

Example:

A watch manufacturer believes that by changing to a new battery supplier, the watches that it ships, each of which includes an initial battery, will maintain a longer battery lifespan. To test this theory, twelve watches are tested for duration of battery lifespan with the original battery.

The same twelve watches are then re-tested for duration with the new battery.

Can the watch manufacturer conclude that the new battery increases the duration of battery lifespan for the manufactured watches? (We will assume an alpha value of .05).

For this, we will utilize the code:

N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)
N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

wilcox.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )


Which produces the output:

Wilcoxon signed rank test

data: N2 and N1
V = 66, p-value = 0.01709
alternative hypothesis: true location shift is greater than 0


With a p-value of 0.01709 (0.01709 < .05), we can conclude, at a 95% confidence level, that the new battery increases the duration of battery lifespan for the manufactured watches.
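The V statistic can again be verified by hand through the same ranking procedure, applied here to the paired differences. A minimal sketch (N1 and N2 as defined above):

d <- N2 - N1 # paired differences

r <- rank(abs(d)) # rank the absolute differences

sum(r[d > 0]) # sum of the ranks of the positive differences = 66, matching V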

The t-test equivalent of this analysis would resemble:

(If we were measuring mean values)

N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)
N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

t.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )


Which would produce the output:

Paired t-test

data: N2 and N1
t = 2.4581, df = 11, p-value = 0.01589
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
12.32551 Inf
sample estimates:
mean of the differences
45.75


From observing the output of both tests, you can see the slight difference between the p-values provided by the two analysis methods: p-value = 0.01709 (Wilcoxon) vs. 0.01589 (t-test).

Below are the steps necessary to perform the above analysis within the SPSS platform.

Wilcoxon Signed Rank Test (One Sample) Example:

For the first example, we will assume that the individuals conducting the research are searching for a general fluctuation in the data. The reason for this change in methodology is a limitation of the SPSS platform: SPSS cannot perform the Wilcoxon Signed Rank Test for a single tailed hypothesis. Therefore, to illustrate the functionality, our example will be two tailed.
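For comparison with the SPSS output that follows, the two tailed equivalent of our earlier one sample example within R would resemble the code below. Its p-value will differ slightly from the SPSS figure due to the differences in tie and zero handling discussed later in this entry.

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

wilcox.test(N, alternative = "two.sided", mu = .2, conf.level = 0.95)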

Below is our example data set:


From the “Analyze” menu, select “Nonparametric Tests”, then select “One Sample”.




The following menu should appear:


With the “Fields” tab selected, click the center arrow to move the variable “N1” into the “Test Fields” area. Once this has been completed, click on the “Settings” tab.


Once the “Settings” tab has been selected, click on the button located next to “Customize Tests”. Once this option has been specified, click on the box located next to “Compare median to hypothesized (Wilcoxon signed-rank test)”. Enter the “Hypothesized median” value of .20 into the adjacent box. After this step has been completed, click “Run”.

This should generate the output below:


Presented in this output are the null hypothesis, the type of test that was conducted, the p-value coinciding with our test results (Sig.), and the decision which pertains to the p-value of the hypothesis.

In the case of our example, the null hypothesis is rejected, as our p-value is equal to .011 (.011 < .05).

Wilcoxon Signed Rank Test (Two Sample) Example:

For the second example, we will again assume that the individuals conducting the research are searching for a general fluctuation in the data. As before, this is due to the limitation of the SPSS platform mentioned above; our example will again be two tailed.

Below is our sample data set:


To begin analysis, from the “Analyze” menu, select “Nonparametric Tests”, then select “Legacy Dialogs”, followed by “2 Related Samples”.


This should cause the menu below to appear:


Through the utilization of the center arrow button, move both variables to their appropriate paired destination on the right side of the screen. Once this has been completed, click “OK”. Performing this sequence of actions will generate the model output.


The figure contained in the “Test Statistics” table which is labeled “Asymp. Sig. (2-tailed)” is the figure that we will be investigating. It is worth mentioning that the Wilcoxon Signed Rank Test is calculated slightly differently in SPSS as compared to R. The differentiation between the methodologies of calculation becomes apparent when there are ties amongst ranks, or when zero values are contained within the calculation used to generate the analysis.

With a p-value that is less than .05 (0.034 < .05), we can conclude, at a 95% confidence level, that the new battery is impacting the duration of battery lifespan for the manufactured watches.

Monday, February 19, 2018

Repeated Measures Analysis of Variance (SPSS)

In a prior article, we discussed the concept of repeated measures analysis of variance (repeated measures ANOVA). In this article, we will be utilizing the same example data set; however, instead of creating the model within the R platform, we will demonstrate the creation of the model through the utilization of SPSS.

Repeated Measures Analysis of Variance (Repeated Measures ANOVA) Example:


Repeated-Measures ANOVA

A repeated measures ANOVA is similar to a paired t-test in that it samples from the same set more than once. This model contains one factor with at least two levels, and the levels are dependent.

Stated Problem:

Researchers want to test the impact of reading existential philosophy on a group of 8 individuals. They measure the happiness of the participants three times, once prior to reading, once after reading the materials for one week, and once after reading the materials for two weeks. We will assume an alpha of .05.
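For reference, a design of this type could also be fit within R through aov() and an Error() term. The sketch below uses made-up placeholder happiness scores, as the original data set appears only within the SPSS screen captures; it illustrates the structure of the model rather than reproducing the output discussed below.

# Repeated Measures ANOVA sketch; the happiness scores below are placeholders #

happiness <- c(5, 6, 4, 7, 5, 6, 5, 4, # prior to reading (placeholder values)
4, 5, 4, 6, 5, 5, 4, 4, # after one week (placeholder values)
3, 5, 3, 6, 4, 5, 4, 3) # after two weeks (placeholder values)

subject <- factor(rep(1:8, times = 3)) # 8 participants, each measured 3 times

time <- factor(rep(c("before", "week1", "week2"), each = 8)) # the 3 levels of factor1

summary(aov(happiness ~ time + Error(subject / time)))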

To create the model, first choose “Analyze” from the topmost menu. After this selection has been made, choose “General Linear Model”, and then select “Repeated Measures”.


This series of selections should cause the following interface to appear:


“Within-Subject Factor Name:” should be pre-populated with “factor1”. You can leave this entry as is. The “Number of Levels” entry must be modified to reflect the number of levels necessary for inclusion within the model. In the case of our example, we will set this value to “3”. I have created the graphic below which illustrates the levels:


Once the required data has been entered into the appropriate fields, we can continue with the model creation by clicking “Define”.


Doing such, will present the following menu:


“Within-Subject Variables” will initially be listed as “_?_(1)”, “_?_(2)”, and “_?_(3)”. Clicking the topmost arrow pointing right will allow us to modify this list so that, instead of the default placeholder values, our model is populated with the necessary level variables. Once the variables have been listed in order, select “OK”.

This populates numerous tables within the output section of SPSS. However, only the table below is worth inspecting.


Since our p-value is greater than .05 (.377 > .05), we will not reject the null hypothesis.

With this information, we can conclude that the three conditions did not significantly differ pertaining to level of happiness.

Random Effects Analysis of Variance (SPSS)

In today’s article, we will again be discussing the ANOVA model. When building an ANOVA model, typically, as a researcher, you will prefer to have the same number of observations representing each category. Also, the experimental parameters should be pre-established in a way in which the same levels can be re-utilized in future models.

All previous ANOVA examples have demonstrated models which comply with the previously listed sentiments. However, there will be occasions in which the levels of the independent variables are not specifically chosen, but are instead drawn randomly from a larger population. If the experiment were repeated, the levels could potentially differ with the next sampling iteration. In such cases, the data can still be utilized to create a model which will be used to make inferences about a larger population.

A random effects model anticipates this additional variability between the sampled levels when estimating the mean of each variable grouping. In fixed effects models, narrower confidence intervals will occur due to the absence of this factor. In random effects models, wider confidence intervals will occur due to the model adjusting for such.

Therefore, fixed effects models are most appropriate when there is homogeneity amongst the levels. If this is indeed the case, the study will be more precise, and additionally, the confidence interval will be narrower.

Random Effects Analysis of Variance Example:

Below is a modified data set from a previous example:


For this example, we will build a model which utilizes “Satisfaction” as the dependent variable. Our independent variables, entered as “Fixed Factor(s)”, will be “School” and “Study_Time”. The “Random Factor(s)” that we will select will be the “Race” variable.
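Were we to sketch a comparable model within R, the lme4 package is one option (an assumption on my part, as this article demonstrates the SPSS platform only). Below, "df" is a hypothetical data frame containing the columns named above:

# install.packages("lme4") #

library(lme4)

model <- lmer(Satisfaction ~ School + Study_Time + (1 | Race), data = df) # "Race" enters as a random intercept

summary(model)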

These options can be selected through the utilization of the following menu selections:


Below is an image which illustrates variable specification:


After selecting “Post Hoc” from the menu options, we will be presented with the following interface:


The post hoc test that we will select for subsequent analysis will be “Tukey”. The variables which we will specify for analysis are “School” and “Study_Time”.

Clicking “Continue”, followed by “OK”, will create the model and the necessary output.


As with previous ANOVA examples, we will specifically be investigating the significance values as they pertain to each value, or combination of values, appearing in the leftmost column. In this particular example, there are no significant values which coincide with interactions or specific model variables.

Therefore, we will move on to the post hoc test, which, in tandem with the above model output, does not illustrate a significant difference between variable values.


In the next article, we will review the concept of repeated measures ANOVA. We will also discuss how to create these models within SPSS.

Thursday, February 15, 2018

Analysis of Covariance (and) Multivariate Analysis of Covariance (SPSS)

As if MANOVA and ANOVA were not difficult enough to understand, today we will be discussing their most complex forms, MANCOVA and ANCOVA. The “C” in each acronym stands for “covariance”, which of course generates the inquiry: “What is covariance?”

Covariance is defined as: “a measure of the joint variability of two random variables.” *
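Expressed as a formula, for two random variables X and Y: Cov(X, Y) = E[(X - E[X])(Y - E[Y])], where E denotes the expected value.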

In regular terms, this translates to: a variable within the model which will reduce error margin and increase specificity. The covariate, within the context of a model, essentially creates a means for weighing the model's dependent variable(s).

Here are a few examples of variable groups which contain a covariate. The covariate in each set is followed by “(CV)”.

Set 1 = { Student Name, Score in AP English, Score in AP English Placement Exam, GPA, Student’s Parents’ Income (CV) }

Set 2 = { Patient’s Name, Weight, Hours Exercised per Week, Gender, Weight at the End of X Weeks, Caloric Intake (CV) }

Set 3 = { Tenant’s Name, Earnings, Late Payment Notices, Number of Children (CV) }

A covariate is a factor which exists outside of the experimental parameters, but which may assist in the analysis of the data.

Just like their ANOVA and MANOVA equivalents, ANCOVA and MANCOVA can be multifactor. For our examples, we will perform two-factor analysis.

The dependent and covariate variables in both ANCOVA and MANCOVA models must be continuous. The independent variables must be categorical.

Two-Way ANCOVA (Analysis of Covariance) Example:

Below is the data set which we will be utilizing:


To begin, select “Analyze”, followed by “General Linear Model”, then select “Univariate”.


This series of selections should cause the following screen to appear:


After selecting the independent variables “School” and “Study_Time” as our “Fixed Factor(s)”, we will select the variable “Satisfaction” as our “Dependent Variable”. The “Covariate(s)” that we will be selecting is “Attended_Tutoring”.

We cannot create a “Post Hoc” test while utilizing the ANCOVA model. Therefore, we will proceed with creating our analysis output by clicking “OK”.

This generates the following report:


If we were to write our statistical conclusion based on this output in APA format, the conclusion would resemble:

There was no significant effect of school selection on life satisfaction after controlling for the effect of tutoring attendance, F(1, 23) = .048, p = .829.

There was a significant effect of study time on life satisfaction after controlling for the effect of tutoring attendance, F(2, 23) = 4.197, p = .028.

There was not a significant interaction between the variables school and study time, while controlling for the effect of tutoring attendance, F(2, 23) = .523, p = .600.

The covariate, tutoring attendance, was not significantly related to life satisfaction, F(1, 23) = 3.445, p = .076.
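For readers who wish to replicate this analysis within R, below is a minimal sketch under the assumption that the data resides in a hypothetical data frame "df" with the columns used above. SPSS reports Type III sums of squares, so the car package, with sum-to-zero contrasts, is utilized to approximate that behavior:

# install.packages("car") #

library(car)

options(contrasts = c("contr.sum", "contr.poly")) # sum-to-zero contrasts, needed for meaningful Type III tests

model <- lm(Satisfaction ~ Attended_Tutoring + School * Study_Time, data = df)

Anova(model, type = 3) # Type III sums of squares, as reported by SPSS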

Two-Way MANCOVA (Multivariate Analysis of Covariance) Example:

The sample data set for this exercise can be found below:


We will begin model creation by selecting “Analyze”, then “General Linear Model”, followed by the “Multivariate” option.


This should populate the following screen:


After selecting the independent variables “IndepFactor” and “IndepFactorB” as our “Fixed Factor(s)”, we will select the variables “ContVarA” and “ContVarB” as our “Dependent Variables”. Our “Covariate(s)” will be the variable “CoVar”.

Again, as was the case with the ANCOVA model, we will not be able to run a post hoc test following the model’s creation.

After clicking “OK”, the following output should populate:


From this table, we may state the following conclusions:

Using Pillai’s Trace, there was not a significant effect of “IndepFactor” on “ContVarA” and “ContVarB”, after controlling for the effect of “CoVar”.

Using Pillai’s Trace, there was not a significant effect of the “IndepFactorB” on “ContVarA” and “ContVarB”, after controlling for the effect of “CoVar”.

Using Pillai’s Trace, there was no interaction present between “IndepFactor” and “IndepFactorB”, after controlling for the effect of “CoVar”.

Using Pillai’s Trace, there was not a significant effect of the covariate, “CoVar”, on “ContVarA” and “ContVarB”.
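An equivalent MANCOVA could be sketched within R as follows, with "df" again being a hypothetical data frame containing the variables above:

model <- manova(cbind(ContVarA, ContVarB) ~ CoVar + IndepFactor * IndepFactorB, data = df)

summary(model, test = "Pillai") # Pillai's Trace, as utilized above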

* - https://en.wikipedia.org/wiki/Covariance

Wednesday, February 14, 2018

(R) Multivariate Analysis of Variance (SPSS)

Multivariate Analysis of Variance, or MANOVA, is very similar to ANOVA in implementation. The difference lies in the number of dependent variables included within the model. A One-Way MANOVA contains more than 1 dependent variable and 1 independent variable. A Two-Way MANOVA contains more than 1 dependent variable and 2 independent variables. A requirement of the model is that the independent variables must be factor type variables, and the dependent variables must be continuous type variables.

One-Way MANOVA (One Independent Factorial Variable)

We’ll begin with a One-Way MANOVA example:


To begin, select “Analyze”, followed by “General Linear Model”, then select “Multivariate”.


This populates the following screen:


After selecting the independent variable “IndepFactor” as our “Fixed Factor(s)”, we will select the variables “ContVarA” and “ContVarB” as our “Dependent Variables”.

Next, click “Post Hoc”; this brings up the following menu:


Select “IndepFactor” as the variable for which we will create a Post Hoc test. Select “Tukey” as the test to utilize.

Once this has been completed, we can move forward with the model creation.

The hypothesis test generated for this model is as follows:

H0: u1 = u2 = u3 =…..etc.

H1: Not all means are equal.


To test the hypothesis, we will analyze the “Sig” value for the “Pillai’s Trace” entry for “IndepFactor”.

Assuming an alpha value of .05, we will fail to reject the null hypothesis (.863 > .05). As a result of such, we can state:

Using Pillai’s Trace, there was not a significant effect of the “IndepFactor” on ContVarA and ContVarB.

(We utilize the Pillai’s Trace Test due to its robustness. This particular test does not assume homogeneity.)

Let’s now turn to our Tukey HSD output:


Though the MANOVA and ANOVA models differ in their composition, the Tukey’s HSD post hoc test is calculated in the same manner for both methods of analysis. This is not immediately evident in the SPSS output, but can be observed in the R code which performs the underlying function.

What is being displayed in SPSS, therefore, is the combination of multiple Tukey HSD post hoc tests. Each variable’s test values are being calculated independently prior to their combination within the output.

The column which contains “ContVarA” is displaying the interaction between the values of “IndepFactor”, as the values pertain to the variable “ContVarA”.

The column which contains “ContVarB” is displaying the interaction between the values of “IndepFactor”, as the values pertain to the variable “ContVarB”.

We can make the following interpretations from the above table:

There was not a significant difference in “ContVarA” between “IndepFactor” value 1 and “IndepFactor” value 2.

There was not a significant difference in “ContVarA” between “IndepFactor” value 2 and “IndepFactor” value 3.

There was not a significant difference in “ContVarB” between “IndepFactor” value 1 and “IndepFactor” value 2.

There was not a significant difference in “ContVarB” between “IndepFactor” value 2 and “IndepFactor” value 3.

In R, the code that would be utilized to complete a similar process is as follows:

# Create Data Frame #

contvara <- c(12.00, 64.00, 61.00, 99.00, 52.00, 65.00, 11.00, 55.00, 19.00, 42.00, 58.00, 6.00, 68.00, 75.00, 54.00)

contvarb <- c(307.00, 122.00, 199.00, 203.00, 707.00, 620.00, 208.00, 485.00, 629.00, 592.00, 316.00, 697.00, 794.00, 489.00, 274.00)

indepfactor <- c(1.00, 2.00, 3.00, 3.00, 2.00, 1.00, 2.00, 2.00, 1.00, 3.00, 2.00, 3.00, 1.00, 1.00, 1.00)

test <- data.frame(contvara, contvarb, factor(indepfactor))

# Create MANOVA Model + Analysis #

results <- manova(cbind(contvara, contvarb) ~ factor(indepfactor), data=test)

# View Model Results #

summary(results) # The default test statistic is Pillai's Trace #

# Generate Tukey's HSD for "ContVarA" #

tuk1 <- aov(lm(contvara ~ factor(indepfactor) , data=test))

TukeyHSD(tuk1)

# Generate Tukey's HSD for "ContVarB" #

tuk2 <- aov(lm(contvarb ~ factor(indepfactor) , data=test))

TukeyHSD(tuk2)

Now let’s explore a Two-Way MANOVA example. In this example, we will be utilizing two independent variables. It is the number of independent variables contained within a model which determines the number listed prior to the hyphen. Therefore, by this methodology, a MANOVA model which contains three independent variables is referred to as a Three-Way MANOVA, etc.

Two-Way MANOVA (Two Independent Factorial Variables)

We will be using the same data set from the prior exercise.


To begin, select “Analyze”, followed by “General Linear Model”, then select “Multivariate”.


This populates the following screen:


After selecting the independent variables “IndepFactor” and “IndepFactorB” as our “Fixed Factor(s)”, we will select the variables “ContVarA” and “ContVarB” as our “Dependent Variables”.

Next, click “Post Hoc”; this brings up the following menu:


Select “IndepFactor” and “IndepFactorB” as the variables for which we will create Post Hoc tests. Select “Tukey” as the test to utilize.

Once this has been completed, we can move forward with the model creation.

The hypothesis tests generated for this model are as follows:

1.

H0: u1 = u2 = u3 =…..etc.

H1: Not all means are equal.

2.

H0: u1 = u2 = u3 =…..etc. (All means are equal)

H1: Not all means are equal.

3.

H0: An interaction is absent.

H1: An interaction is present.


To test the various hypotheses, we will analyze the “Sig” value for the “Pillai’s Trace” entries pertaining to “IndepFactor”, “IndepFactorB”, and “IndepFactor * IndepFactorB”.

(Assuming an alpha value of .05)

Hypothesis 1: .695 (IndepFactor)

Hypothesis 2: .660 (IndepFactorB)

Hypothesis 3: .575 (Interaction)

Therefore:

Hypothesis 1: Fail to Reject

Hypothesis 2: Fail to Reject

Hypothesis 3: Fail to Reject

So we may state the following:

Using Pillai’s Trace, there was not a significant effect of “IndepFactor” on "ContVarA" and "ContVarB".

Using Pillai’s Trace, there was not a significant effect of the “IndepFactorB” on "ContVarA" and "ContVarB".

Using Pillai’s Trace, there was no interaction present between “IndepFactor” and “IndepFactorB”.

(We utilize the Pillai’s Trace Test due to its robustness. This particular test does not assume homogeneity.)

Let’s now turn to our Tukey HSD output:


This output is similar to the output generated from the Tukey’s HSD post hoc test of the previous model. Typically, there would be additional rows displaying data pertaining to the variable “IndepFactorB”. However, within the SPSS platform, post hoc tests are not performed if any group of a factor type variable consists of fewer than two cases (ex. the value 3 in: 1, 1, 2, 2, 3, 4, 4).

In R, the code that would be utilized to complete a similar process is as follows:

# Create Data Frame #

contvara <- c(12.00, 64.00, 61.00, 99.00, 52.00, 65.00, 11.00, 55.00, 19.00, 42.00, 58.00, 6.00, 68.00, 75.00, 54.00)

contvarb <- c(307.00, 122.00, 199.00, 203.00, 707.00, 620.00, 208.00, 485.00, 629.00, 592.00, 316.00, 697.00, 794.00, 489.00, 274.00)

indepfactor <- c(1.00, 2.00, 3.00, 3.00, 2.00, 1.00, 2.00, 2.00, 1.00, 3.00, 2.00, 3.00, 1.00, 1.00, 1.00)

indepfactorb <- c(8.00, 7.00, 8.00, 9.00, 10.00, 8.00, 5.00, 8.00, 9.00, 9.00, 5.00, 5.00, 8.00, 9.00, 6.00)

test <- data.frame(contvara, contvarb, factor(indepfactor), factor(indepfactorb))

# Create MANOVA Model + Analysis #

results <- manova(cbind(contvara, contvarb) ~ factor(indepfactor) * factor(indepfactorb) , data=test)

# View Model Results #

summary(results)

# Generate Tukey's HSD for "ContVarA" #

tuk1 <- aov(lm(contvara ~ factor(indepfactor) , data=test))

TukeyHSD(tuk1)

# Generate Tukey's HSD for "ContVarA" #

tuk1 <- aov(lm(contvara ~ factor(indepfactorb) , data=test))

TukeyHSD(tuk1)

# Generate Tukey's HSD for "ContVarB" #

tuk2 <- aov(lm(contvarb ~ factor(indepfactor) , data=test))

TukeyHSD(tuk2)

# Generate Tukey's HSD for "ContVarB" #

tuk2 <- aov(lm(contvarb ~ factor(indepfactorb) , data=test))

TukeyHSD(tuk2)

Saturday, February 10, 2018

Univariate (Two Way ANOVA) (SPSS)

Previously discussed within this blog was the concept of ANOVA. After describing the concept, example problems were solved through the utilization of the R software package. In this article, we will be solving a Two Way ANOVA example problem through the utilization of SPSS. For a conceptual understanding of this model, please refer to the previously posted articles pertaining to such.

We will be utilizing an example from the prior ANOVA article to illustrate the concept.

Two Way ANOVA

“Two way” refers to the two independent variables which will be utilized within this ANOVA model.

The hypotheses for this model type will be:

1.

H0: uVar1 = uVar2 (Var1’s value does not significantly differ from Var2’s value)

H1: uVar1 NE uVar2

2.

H0: u1 = u2 = u3 =…..etc. (All means are equal)

H1: Not all means are equal.

3.

H0: An interaction is absent.

H1: An interaction is present.

Example Problem:

Researchers want to test study habits within two schools as they pertain to student life satisfaction. The researchers also believe that the school that each group of students is attending may have an impact on study habits. Students from each school are assigned study material which totals 1 hour, 2 hours, or 3 hours on a daily basis. Measured is the satisfaction of each student group, on a scale from 1-10, after a 1 month duration.

School A:

1 Hour of Study Time: 7, 2, 10, 2, 2
2 Hours of Study Time: 9, 10, 3, 10, 8
3 Hours of Study Time: 3, 6, 4, 7, 1

School B:

1 Hour of Study Time: 8, 5, 1, 3, 10
2 Hours of Study Time: 7, 5, 6, 4, 10
3 Hours of Study Time: 5, 5, 2, 2, 2
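Before turning to SPSS, below is a sketch of the equivalent model within R, assembled from the values above; as the design is balanced, its output should closely match the SPSS figures discussed later in this article.

satisfaction <- c(7, 2, 10, 2, 2, 9, 10, 3, 10, 8, 3, 6, 4, 7, 1, # School A
8, 5, 1, 3, 10, 7, 5, 6, 4, 10, 5, 5, 2, 2, 2) # School B

school <- factor(rep(c("A", "B"), each = 15))

study_time <- factor(rep(rep(c("1hr", "2hr", "3hr"), each = 5), times = 2))

model <- aov(satisfaction ~ school * study_time)

summary(model)

TukeyHSD(model, which = "study_time") # Tukey's HSD for the study time factor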

Within the SPSS platform, data entered would resemble the following:


To generate the model, select “Analyze”, then select “General Linear Model”, followed by “Univariate”.

This selection of options will bring you to the following screen:


The “Dependent Variable” will be “Satisfaction”. The “Fixed Factor(s)” will be “School” and “Study_Time”.


To run a subsequent Post Hoc Test, select “Post Hoc” from the Univariate menu options. We will attempt to run a Tukey’s HSD by selecting the variables “School” and “Study_Time”, and selecting “Tukey”.



Click “OK” to continue, then click “OK” again to create the model.

The image below is one of the output screens which is populated after model creation:


(red arrows were added for emphasis)

In the above output chart, we can utilize the significance values to draw conclusions pertaining to the data. We will be assuming an alpha of .05, meaning that any value below .05 will be deemed significant.

Let’s restate our hypotheses, as they apply to this problem:

1.

H0: uSchoolA = uSchoolB (Satisfaction levels DO NOT significantly differ depending on school.)

H1: uSchoolA NE uSchoolB (Satisfaction levels DO significantly differ depending on school.)

2.

H0: u1 = u2 = u3 (Satisfaction levels DO NOT differ depending on hours of daily study.)

H1: Not all means are equal. (Satisfaction levels DO differ depending on hours of daily study.)

3.

H0: An interaction is absent. (The combination of school and study time is NOT impacting the outcome)

H1: An interaction is present. (The combination of school and study time IS impacting the outcome)

In investigating the output we can make the following conclusions:

Hypothesis 1: .572 (School)

Hypothesis 2: .037 (Study Time)

Hypothesis 3: .628 (Interaction)

Hypothesis 1: Fail to Reject

Hypothesis 2: Reject

Hypothesis 3: Fail to Reject

So we can state:

Students of different schools did not have significantly different satisfaction levels. There was a significant difference between the levels of study time as they pertain to satisfaction. No interaction effect was present.

(A Two Way ANOVA of this type requires groups of observations of equal size.)



Above is the Tukey’s HSD output generated by SPSS. Typically, an additional output for the variable “School” would be generated, as it was requested while specifying the Post Hoc option. However, in this specific case, there were fewer than three groups for the “School” variable; therefore, SPSS did not produce a Post Hoc output for it.

We can make the following interpretations from the above table:

There was a significant difference in satisfaction levels between students who study two hours and students who study three hours.

There was not a significant difference in satisfaction levels between students who study one hour and students who study two hours.

There was not a significant difference in satisfaction levels between students who study one hour and students who study three hours.

That’s all for now, Data Heads! Stay tuned for more insightful articles!