Reflections of a Data Scientist: 2017

Saturday, December 9, 2017

(R) Post Hoc Analysis and One Way ANOVA (SPSS)

As was previously mentioned, new entries posted on this blog will be primarily non-R related.

Today’s post will discuss Post Hoc Analysis, specifically Tukey’s Honest Significance Test. This test is also known as The Tukey Method, Tukey’s HSD, or TukeyHSD() in R.

Post Hoc refers to the testing that is performed following an ANOVA test. What this testing seeks to discover, is the significance of relationships that exist between variables within an ANOVA model. There are many different Post Hoc tests that can be utilized. For the purpose of this article, we will be specifically discussing Tukey’s HSD.

Something that I should mention before proceeding, is the reason for the utilization of ANOVA as opposed to a T-Test. ANOVA allows us to compare the means between various groups simultaneously, while maintaining the same confidence interval. If we had four experimental groups to test between, this would require 6 T-Tests.

1 vs. 2 | 1 vs. 3 | 1 vs. 4

2 vs. 3 | 2 vs. 4

3 vs. 4

Each T-Test, if assuming an alpha of .05, has a 5% chance of a Type I error occurring. This means, that there is a 30% chance (.05 * 6), that at least one Type I error would occur. The T-Test will analyze for a statistical difference between the means of two groups, whereas the ANOVA, analyzes for differences within the set of means.

If you recall from the previous article, we addressed two separate scenarios, one in which a cook was testing for the salt content of soup, and the other, in which the impact of study time was being assessed as it applied to students from two different schools.

We will run a Tukey’s HSD on the data collected from each study.

Scenario A: The Soup Scenario

satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)

salt <- c(rep("low",3), rep("med",4), rep("high",3))

salttest <- data.frame(satisfaction, salt)

results <- aov(satisfaction~salt, data=salttest)

Now to run the Tukey HSD Post Hoc Inquiry:

TukeyHSD(results)

Which produces the output:

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = satisfaction ~ salt, data = salttest)

$salt

diff lwr upr p adj

low-high 1.00000000 -4.148005 6.148005 0.8387911

med-high 0.91666667 -3.898852 5.732185 0.8445186

med-low -0.08333333 -4.898852 4.732185 0.9985693

Let us review each aspect of this output:

Diff – Is the difference of the averages between the values.

Lwr – Is the lower confidence interval of the difference.

Upr – Is the upper confidence interval of the difference.

P adj – The p-values pertaining to the significance of the compound values. Again, if 95%, we will be looking for values of significance that are less than .05.

Each value within the “p adj” column corresponds to an assessment of the significance pertaining to separate categorical aspects of the model. In the case of the above output, assuming an alpha value of .05, there were no significant differences between any of the categorical factors (p = 0.839; p = 0.844; p = 0.999).

Scenario B: Schools, Study Time and Stress Scenario

satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))

school = c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))

schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(lm(satisfaction ~ studytime * school, data=schooltest))

summary(results)

Now to run the Tukey HSD Post Hoc Inquiry:

TukeyHSD(results)

This produces the following output:

$studytime

diff lwr upr p adj

Three Hours-One Hour -1.3 -4.5013364 1.901336 0.5753377

Two Hours-One Hour 2.2 -1.0013364 5.401336 0.2198626

Two Hours-Three Hours 3.5 0.2986636 6.701336 0.0302463

Is describing the relationship between the varying levels of study time as it pertains to stress.

The next portion of the output:

$school

diff lwr upr p adj

SchoolB-SchoolA -0.6 -2.760257 1.560257 0.571817

Describes the relationship between the two school types as it pertains to stress.

Finally, the last portion of the output:

$`studytime:school`

diff lwr upr p adj

Three Hours:SchoolA-One Hour:SchoolA -0.4 -6.005413 5.2054132 0.9999178

Two Hours:SchoolA-One Hour:SchoolA 3.4 -2.205413 9.0054132 0.4401459

One Hour:SchoolB-One Hour:SchoolA 0.8 -4.805413 6.4054132 0.9976117

Three Hours:SchoolB-One Hour:SchoolA -1.4 -7.005413 4.2054132 0.9696463

Two Hours:SchoolB-One Hour:SchoolA 1.8 -3.805413 7.4054132 0.9157375

Two Hours:SchoolA-Three Hours:SchoolA 3.8 -1.805413 9.4054132 0.3223867

One Hour:SchoolB-Three Hours:SchoolA 1.2 -4.405413 6.8054132 0.9844928

Three Hours:SchoolB-Three Hours:SchoolA -1.0 -6.605413 4.6054132 0.9932117

Two Hours:SchoolB-Three Hours:SchoolA 2.2 -3.405413 7.8054132 0.8260605

One Hour:SchoolB-Two Hours:SchoolA -2.6 -8.205413 3.0054132 0.7067715

Three Hours:SchoolB-Two Hours:SchoolA -4.8 -10.405413 0.8054132 0.1240592

Two Hours:SchoolB-Two Hours:SchoolA -1.6 -7.205413 4.0054132 0.9470847

Three Hours:SchoolB-One Hour:SchoolB -2.2 -7.805413 3.4054132 0.8260605

Two Hours:SchoolB-One Hour:SchoolB 1.0 -4.605413 6.6054132 0.9932117

Two Hours:SchoolB-Three Hours:SchoolB 3.2 -2.405413 8.8054132 0.5052080

Describes the relationships between the combination of hours studied and school types.

We can make the following interpretations from the above outputs:

There was a significant difference in stress level between students who study two hours and students who study three hours (p = 0.0302463).

There was not a significant difference in stress level between students who attend SchoolA, and students who attend SchoolB.

There were not a significant differences in stress levels as it pertains to the combination of factors: school and study time.

I have often been asked what differentiates an ANOVA Post Hoc Test (such as Tukey’s HSD), from a T-Test proceeding an ANOVA calculation. The reasons for performing a Post Hoc Test, Tukey's in our case, as opposed to a T-Test, are as follows:

1. Performing multiple T-Tests to check for significance is ultimately time consuming, and nullifies the initial convenience of running an ANOVA test. Additionally, doing such, re-creates the compounding probability of error that we originally sought to avoid.

2. Tukey’s HSD takes into account the significance of each variable as they interact with other variables within the ANOVA model. Re-testing with the T-Test may show what data sets INDENPENDENTLY differ from the other data sets, but it will not illustrate what data sets differ within the model.

Now, I will demonstrate how to perform both a One Way ANOVA Test and a Post Hoc Tukey’s HSD test within SPSS .

First, we will need to define the variables, this can be achieved within the “Variable View” portion of SPSS. Selecting this view can be achieved by clicking the “Variable View” tab on the lower right hand side of the SPSS console.

Once “Variable View” has been selected, we can begin by defining our variable types.

Here I have defined two variables, “Satisfaction” and “Soup”, both were assigned the default variable parameters by the SPSS system.

Next, we need to define our value labels, to achieve this, I clicked on the cell which coincides with the variable “Soup”. This brings up a user interface, which allows for the entry of value labels, and the value for which the label is assigned.

Once this data has been input, we can now input the corresponding values into SPSS which are required for the assembly of our ANOVA model.

When this step has been completed, to proceed, we must choose “Analyze” from the upper most drop down menu. Select the option “Compare Means”, and the subsequent option, “One-Way ANOVA”.

This course of action should cause a menu to appear. For our “Dependent List” variable, we will choose, “Satisfaction”. For our “Factor” variable, we will choose “Soup”. Once this has been completed, select the middle box on the right corner of the menu which reads, “Post Hoc”. This causes another menu to appear which presents Post Hoc Analysis options. For our purposes, we will be checking the box next to “Tukey” prior to proceeding. Significance level should be left at .05 (or Alpha = .05). Click “Continue”, then click, “OK”.

This presents a more detailed Tukey’s HSD output than what was originally available in R:

Compared to the R output, the following output is detailing:

Mean Difference – Is the difference of the averages between the values. “diff” in R.

Std. Error – The standard error of the compared values. No equivalent in R.

Lower Bound – Is the lower confidence interval of the difference. “lwr” in R.

Upper Bound – Is the upper confidence interval of the difference. “upr” in R.

Sig. – The p-values pertaining to the significance of the compared values. Again, if 95%, we will be looking for values of significance that are less than .05. “p adj” in R.

That’s all for now, Data Heads. I’ll see you again soon with a brand new article, the subject matter of such is undetermined.

Tuesday, December 5, 2017

(R) Analysis of Variance - ANOVA

In this article, we will discuss ANOVA, specifically, when its usage is appropriate, and how it can be utilized within R. This will likely be the final article of the current series of entries pertaining to The R Programming Language. Subsequent articles will discuss concepts and usage of software within the SPSS platform.

ANOVA is an abbreviation that represents a method known as The Analysis of Variance.

There are few terms that are specific to ANOVA, those are:

Way – Which refers to an independent variable within the ANOVA model.

Factor – Another term which refers to an independent variable.

Level – The category of an independent variable within the ANOVA model.

ANOVA is used to compare the variances of various sample groups against one another. In many ways it is similar to a t-test, however, ANOVA allows for multiple group comparisons. This differs from the t-test, which only allows for one single group to be compared to another single group.

A post-hoc test is often performed after ANOVA has been calculated. We will discuss this topic in a different article. A post-hoc test is used to further investigate data sample similarities and is utilized when the ANOVA model returns certain results.

Like the t-test, there are different variations of the ANOVA model that are applicable depending on the data being analyzed. We will review three common ANOVA application as they pertain to various data types. The analyzation of the output of the model data is performed through the utilization of the F-Test. For a detailed description of the F-Test, and what conclusions it provides, please refer to the pervious article.

One Way ANOVA

As a reminder, Way, in this scenario, is referring to a single independent variable.

In a one way ANOVA, we are assuming the following:

1. Each sample is random.
2. Each sample is in no way influenced by the other sampling results.
3. Each dependent variable is sampled from a normally distributed population.
4. The variances of the samples, should be equivalent, or somewhat equivalent. The reason for such, is that the population variances are assumed to be equal for each sample.

The hypothesis for this model type will be:

H0: u1 = u2 = u3 =…..etc.

H1: Not all means are equal.

Example Problem:

A chef wants to test if patrons prefer a soup which he prepares based on salt content. He prepares a limited experiment in which he creates three types of soup: soup with a low amount of salt, soup with a high amount of salt, and soup with a medium amount of salt. He then servers this soup to his customers and asks them to rate their satisfaction on a scale from 1-8.

Low Salt Soup it rated: 4, 1, 8
Medium Salt Soup is rated: 4, 5, 3, 5
High Salt Soup is rated: 3, 2, 5

Hypothesis:

H0: u1 = u2 = u3 =…..etc.

H1: Not all means are equal.

Let’s use this data to create a model within R:

satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)

salt <- c(rep("low",3), rep("med",4), rep("high",3))

salttest <- data.frame(satisfaction, salt)

results <- aov(satisfaction~salt, data=salttest)

summary(results)

This produces the output:

Df Sum Sq Mean Sq F value Pr(>F)
salt 2 1.92 0.958 0.209 0.816
Residuals 7 32.08 4.583

If p < .05, we will reject the null hypothesis.

Hypothesis: 0.816 > .05

Since the model’s p-value (.816) is greater than the assumed alpha (.05), we will fail to reject the null hypothesis. What this is indicating, is that at 95% confidence interval, we cannot state that through the analysis of the data provided, that there is a significant difference of customer satisfaction as it pertains to salt content in soup.

Two Way ANOVA

Two way, in this scenario, is referring to the two independent variables which will be utilized within this ANOVA model.

The hypothesis for this model type will be:

1.

H0: u1 = u2 = u3 =…..etc. (All means are equal)

H1: Not all means are equal.

2.

H0: uVar1 = uVar2 (Var1’s value does not significantly differ from Var2’s value)

H1: uVar1 NE uVar2

3.

H0: An interaction is absent.

H1: An interaction is present.

Example Problem:

Researchers want to test study habits within two schools as they pertain to student life satisfaction. The researchers also believe that the school that each group of students is attending may also have an impact on study habits. Students from each school are assigned study material which in sum, totals to 1 hour, 2 hours, and 3 hours on a daily basis. Measured is the satisfaction of each student group on a scale from 1-10 after a 1 month duration.

School A:

1 Hour of Study Time: 7, 2, 10, 2, 2
2 Hours of Study Time: 9, 10, 3, 10, 8
3 Hours of Study Time: 3, 6, 4, 7, 1

School B:

1 Hour of Study Time: 8, 5, 1, 3, 10
2 Hours of Study Time: 7, 5, 6, 4, 10
3 Hours of Study Time: 5, 5, 2, 2, 2

Let’s state our hypothesizes, as they apply to this problem:

1.

H0: u1 = u2 = u3 (Stress levels DO NOT differ depending on hours of daily study.)

H1: Not all means are equal. (Stress levels DO differ depending on hours of daily study.)

2.

H0: uSchoolA = uSchoolB (Stress levels DO NOT significantly differ depending on school school.)

H1: uSchoolA NE uSchoolB (Stress levels DO significantly differ depending of school.)

3.

H0: An interaction is absent. (The combination of school and study time is NOT impacting the outcome)

H1: An interaction is present. (The combination of school and study time IS impacting the outcome)

Entering this into R can be tricky, but stay with me:

satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))

school = c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))

schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(lm(satisfaction ~ studytime * school, data=schooltest))

summary(results)

Which produces the output:

Df Sum Sq Mean Sq F value Pr(>F)
studytime 2 62.6 31.300 3.809 0.0366 *
school 1 2.7 2.700 0.329 0.5718
studytime:school 2 7.8 3.900 0.475 0.6278
Residuals 24 197.2 8.217
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since we have three hypothesis tests, we must assess all three of the p-values present within the output.

Study Time

p = 0.0366

School

p = 0.5718

Study Time : School

p = 0.6278

In investigating the output we can make the following conclusions:

Hypothesis 1: 0.0366 < .05

Hypothesis 2: 0.5718 > .05

Hypothesis 3: 0.6278 > .05

If p < .05, we will reject the null hypothesis.

Hypothesis 1: Reject

Hypothesis 2: Fail to Reject

Hypothesis 3: Fail to Reject

So we can state:

Students of different schools did not significantly different stress levels. There was significant difference between the levels of study time as it pertains to stress. No interaction effect was present.

(Two Way ANOVA must have columns observations of equal length)

Repeated-Measures ANOVA

A repeated measures ANOVA is similar to a paired t-test in that it samples from the same set more than once. This model contains one factor with at least two levels, and the levels are dependent.

Example Problem:

Researchers want to test the impact of reading existential philosophy on a group of 8 individuals. They measure the happiness of the participants three times, once prior to reading, once after reading the materials for one week, and once after reading the materials for two weeks. We will assume an alpha of .05.

Before Reading = 1, 8, 2, 4, 4, 10, 2, 9
After Reading = 4, 2, 5, 4, 3, 4, 2, 1
After Reading (wk. 2) = 5, 10, 1, 1, 4, 6, 1, 8

Hypothesis:

H0: u1 = u2 = u3

H1: Not all means are equal.

Let’s use this data to create a model within R:

library(lme4) # You will need to install and enable this package #

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8 )

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- c(1,2,3,4,5,6,7, 8)

survey <- data.frame(id, happiness, week)

model <- lmer(happiness ~ week + (1|id), data=survey)

anova(model)

Which produces the output:

Analysis of Variance Table
Df Sum Sq Mean Sq F value
week 2 15.083 7.5417 1.0462

The F-Test statistic = 1.0462

To calculate the p-value of our test statistic, we can use the following r-code:

pf(q=1.0462, df1=2, df2=14, lower.tail=FALSE) # Test Statistic , Numerator Degrees of Freedom = 2, Denominator Degrees of Freedom = 14 #

Which produces the output:

[1] 0.3771816

If p < .05, we will reject the null hypothesis.

Hypothesis: 0.3771816 > .05

With this information, we can conclude that the three conditions did not significantly differ pertaining to level of happiness.

A similar methodology that can be utilized to perform this analysis:

library(lme4) # You will need to install and enable this package #
library(nlme) # You will also need to install and enable this package #

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8 )

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- c(1,2,3,4,5,6,7, 8)

survey <- data.frame(id, happiness, week)

model <- lme(happiness ~ week, random=~1|id, data=survey)

anova(model)

This method saves some time by producing the output:

numDF denDF F-value p-value
(Intercept) 1 14 37.21053 <.0001
week 2 14 1.04624 0.3772

That is all for now, Data Heads. The topic of the next article will be Post-Hoc Analysis. Stay tuned!

Monday, November 13, 2017

(R) F-Test

You may remember the F-Test from the previous article on multiple linear regression. In this entry, we will further delve into the concept of the F-Test.

The F-Test is a statistical method for comparing two population variances. It’s most recognized utilization is as one of the aspects of the ANOVA method. This method will be discussed in a later article.

Essentially, the F-Test model enables the creation of a test statistic, critical value and a distribution model. With these values derived, a hypothesis test can be stated, and from such, the comparison of two variance can be achieved.

Some things to keep in mind before moving forward:

1. The F-Test assumes that the samples provided, originated from a normal distribution.

2. The F-Test attempts to discover whether two samples originate from populations with equal variances.

So for example, if we were comparing the following two samples:

samp1 <- c(-0.73544189, 0.36905647, 0.69982679, -0.91131589, -1.84019291, -1.02226811, -1.85088278, 2.24406451, 0.63377787, -0.80777949, 0.60145711, 0.43853971, -1.76386879, 0.32665597, 0.32333871, 0.90197004, 0.29803556, 0.47333427, 0.23710263, -1.48582332, -0.45548478, 0.36490345, -0.08390139, -0.46540965, -1.66657385)

samp2 <- c(0.67033912, -1.23197505, -0.18679478, 1.06563032, 0.08998155, 0.22634414, 0.06541938, -0.22454059, -1.00731073, -1.43042950, -0.62312404, -0.22700636, -0.71908729, -0.36873910, 0.15653935, -0.19328338, 0.56259671, 0.31443699, 1.02898245, 1.18903593, -0.14576090, 0.68375259, -0.15348007, 1.58654607, 0.01616986)

For a right tailed test, we would state the following hypothesis:

H0: σ2/1 =σ2/2
Ha: σ2/1>σ2/2

# Null Hypothesis = Variances are equal. #

# Alternative Hypothesis = The first measurement of variance is greater than the second measurement of variance. #

With both samples imported into R, we can now utilize the following code to perform the F-Test:

(We will assume an alpha of .05):

var.test(samp1, samp2, alternative = "greater", conf.level = .95)

Which produces the output:

F test to compare two variances

data: samp1 and samp2
F = 1.9112, num df = 24, denom df = 24, p-value =
0.05975
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
0.9634237 Inf
sample estimates:
ratio of variances

1.911201

Let us review each aspect of this output:

“ F = “ is the F-Test test statistic.

“num df = “ is the value of the degrees of freedom found within the numerator.

“denom df = “ is the value of the degrees of freedom found within the denomenator.

“p-value = “ is the probability of the corresponding F-Test statistic.

“95 percent confidence interval:” is the ratio between the two population variances at the 95% confidence level.

“ratio of variances” is the value of the variance of sample 1 divided by the variance of sample 2.

Looking at the p-value, which is greater than our alpha value (0.05975 > .05), we cannot conclude, that at a 95% confidence level, that our samples were taken from populations with differing variances.

Additionally, we can confirm this conclusions by comparing our F-Test statistic of 1.9112, to the F-Value which coincides with the appropriate degrees of freedom and alpha value. To find this value, we would typically consult a chart in the back of a statistics textbook. However, R makes the situation simpler by providing us with a method to reference this value.

Utilizing the code:

qf(.95, df1=24, df2=24) #Alpha .05, Numerator Degrees of Freedom = 24, Denomenator Degrees of Freedom = 24#

Produces the output:

[1] 1.98376

Again, we cannot conclude that because 1.9112 < 1.98376, that our samples were taken from populations with differing variances.

If we were to graph this test and distribution, the illustration would resemble:

If you would like to create your own f-distribution graphs, sans the mark-ups, you could use the following code:

curve(df(x, df1=24, df2=24), from=0, to=5) # Modify the degrees of freedom only #

Below is an illustration of a few various f-distribution types by varying degrees of freedom:

I hope that you found this article useful, in the next post, we will begin to discuss the concept of ANOVA.

* A helpful article pertaining to the F-Test statistic: http://atomic.phys.uni-sofia.bg/local/nist-e-handbook/e-handbook/eda/section3/eda359.htm

** Source for F-Distribution Image: https://en.wikipedia.org/wiki/F-distribution

Sunday, November 12, 2017

(R) VIF() and COR()

Revisiting multiple regression for the purpose of establishing a greater understanding of the topic, and additionally, to foster a more efficient application of the regression model; in this entry we will review functions and statistical concepts which were overlooked in the prior article pertaining to the subject matter.

Variance Influence Factor or VIF():

What is VIF()? According to Wikipedia, it is defined as such: “The variance inflation factor (VIF) quantifies the severity of multicollinearity in an ordinary least squares regression analysis.”

To define VIF() in layman’s terms, The Variance Influence Factor is a method for weighing each variable’s coefficient of determination (R-Squared), against all other variables within a multiple regression equation. It’s really as simple as that, but none of this will make sense without an example.

Example:

Consider the following numerical sets:

w <- c(23, 42, 55, 16, 24, 27, 24, 15, 23, 85)
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

Let’s utilize these values to create a multiple regression model:

lm.multiregressw <- (lm(w ~ x + y + z))

Now let’s take a look at that model:

summary(lm.multiregressw)

Call:
lm(formula = w ~ x + y + z)

Residuals:
Min 1Q Median 3Q Max
-29.701 -11.020 -4.462 3.465 44.108

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.40034 68.57307 -0.020 0.984
x -0.08981 1.41549 -0.063 0.951
y 0.19770 1.20528 0.164 0.875
z 1.15241 1.60560 0.718 0.500

Residual standard error: 25.18 on 6 degrees of freedom
Multiple R-squared: 0.1109, Adjusted R-squared: -0.3336
F-statistic: 0.2495 on 3 and 6 DF, p-value: 0.8591

Given the Multiple R-squared value of this output, we can assume that this model is pretty awful. Regardless of such, our model could be effected by a phenomenon known as multiple collinearity. What this means, is that the variables could be interacting with each other, and thus, disrupting our model from providing accurate results. To prevent this from happening, we can measure the VIF() for each independent variable within the equation as it pertains to each other independent variable. If we were to do this manually, the code would resemble:

lm.multiregressx <- (lm(x ~ y + z )) # .4782 #

lm.multiregressy <- (lm(y ~ x + z )) # .5249 #

lm.multiregressz <- (lm(z ~ y + x )) # .1488 #

With each output, we would note the Multiple R-Squared value. I have provided those values to the right of each code line above. To then individually calculate the VIF(), we would utilize the following code for each R-Squared variable.

1 / (1 - .4282) # 1.916 #

1 / (1 - .5249) # 2.105 #

1 / (1 - .1488) # 1.175 #

This produces the values to the right of each code line. These values are the VIF() or Variance Influence Factor.

An easier way to derive the VIF() is to install the R package, “car”. Once “car” is installed, you can execute the following command:

vif(lm.multiregressw)

Which provides the output:

x y z
1.916437 2.104645 1.174747

Typically, most data scientists consider values of VIF() which are either over 5 or 10 (depending on sensitivity), to indicate that a variable ought to be removed. If you do plan on removing a variable from your model for such reasons, remove one variable at a time, as the removal of a single variable will effect subsequent VIF() measurements.

(Pearson) Coefficient of Correlation or cor():

Another tool at your disposal is the function cor(). Cor() allows the user to derive the coefficient of correlation ( r ) from numerical sets. For example:

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)

cor(x,y)

[1] 0.6914018

As exciting as this seems, there is actually a better use for this function. For example, if you wanted to derive the coefficient of correlation as one variable as it pertains to others, you could use the following code lines to build a correlation matrix.

# Set values: #

w <- c(23, 42, 55, 16, 24, 27, 24, 15, 23, 85)
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

# Create a data frame #

dframe <- data.frame(w,x,y,z)

# Create a correlation matrix #

correlationmatrix <- cor(dframe)

The output will resemble:

w x y z
w 1.0000000 0.1057911 0.1836205 0.3261470
x 0.1057911 1.0000000 0.6914018 0.2546853
y 0.1836205 0.6914018 1.0000000 0.3853427
z 0.3261470 0.2546853 0.3853427 1.0000000

In the next article, we will again review the F-Statistic in preparation for a discussion pertaining to the concept of ANOVA. Until then, hang in there statistics fans!

(R) Confidence Interval of The Mean (T)

Below is an example which illustrates a statistical concept:

# A new machine is purchased by a pastry shop which will be utilized to fill the insides of various pastries. A test run is commissioned in which the following proportions of jelly are inserted into each pastry. We are tasked with finding the 90% confidence interval for the mean filling amount from this process. #

s <- c(.28,.25,.22,.20,.33,.20)

w <- sd(s) / sqrt(length(s))

w

[1] 0.02092314

degrees <- length(s) - 1

degrees

[1] 5

t <- abs(qt(0.05, degrees))

t

[1] 2.015048

m <- (t * w)

m

[1] 0.04216114

mean(s) + c(-m, m)

[1] 0.2045055 0.2888278

# We can state, with 90% confidence, that the machine will insert into pastries proportions of filling between .205 and .289. In the above execise, a new concept is introduced in step 4. The abs() function, in conjunction with the qt() function, allows you to find Critical-T values through the utilization of R without having to refer to a textbook. A few examples as to how this function can be utilized are listed below. #

# T VALUES (Confidence / Degrees of Freedom ) #
abs(qt(0.25, 40)) # 75% confidence, 40 degrees of freedom, 1 sided (same as qt(0.75, 40)) #
abs(qt(0.01, 40)) # 99% confidence, 40 degrees of freedom, 1 sided (same as qt(0.99, 40)) #
abs(qt(0.01/2, 40)) # 99% confidence, 40 degrees of freedom, 2 sided #

I hope that you found this abbreviated article to be interesting and helpful. Until next time, I'll see you later Data Heads!

(R) T-Tests and The T-Statistics

Welcome to the 60th article of Confessions of a Data Scientist. In this article we will be reviewing an incredibly important statistical topic: The T-Statistic.

The T-Statistic has many usages, but before we can begin to discuss these methods, I will briefly explain the conceptual significance of the subject matter, and when it is appropriate to utilize the T-Statistic.

The T-Statistic was created by a man named William Sealy Gosset. This concept was created by Gosset while he was working as a chemist at The Guinness Brewing Company. “Student”, was the anonymous name that was chosen by Gosset, which he utilized when publishing his findings. The reason for this anonymity, is that Guinness forbade its chemists from publishing their research. Gosset originally devised the T-Statistic and the concept of the “T-Test” as methods to measure the quality of stout. *

When to utilize The T-Statistic:

1. When the population standard deviation is unknown. (This is almost always the case).
2. When the current sample size is less than or equal to 30.

It is important that I mention now, that a common misconception is that the t-distribution assumes normality. This is not the case. Though, the t-distribution does begin to more closely resemble a normal distribution as the degrees of freedom approach 100. **

There are many uses for the t-distribution, some of which will be covered in future articles. However, the most commonly applied methods are: The One Sample T-Test, The Two Sample T-Test, and The Paired Test.

One Sample T-Test

This test is utilized to compare a sample mean to a specific value, it is used when the dependent variable is measured at the interval or ratio level.

To apply this test, you will need to define the following variables:

The size of the sample (N).
The applicable confidence interval, and the corresponding alpha.
The standard deviation of the sample (s).
The mean of the sample (M).
The mean of the population (mu).

To demonstrate the application of this method, I will provide two examples.

Example 1:

A factory employee believes that the cakes produced within his factory are being manufactured with excess amounts of corn syrup, thus altering the taste. 10 cakes were sampled from the most recent batch and tested for corn syrup composition. Typically, each cake should comprise of 20% corn syrup. Utilizing a 95 % confidence interval, can we assume that the new batch of cakes contain more than a 20% proportion of corn syrup?

The levels of the samples were:

.27, .31, .27, .34, .40, .29, .37, .14, .30, .20

Our hypothesis test will be:

H0: u = .2
HA: u > .2

With our hypothesis created, we can assess that mu = .2, and that this test will be right tailed.

We do do not to derive the additional variables, as R will perform that task for us. However, we must first input the sample data into R:

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

Now R will automate the rest of the equations.

t.test(N, alternative = "greater", mu = .2, conf.level = 0.95)

# " alternative = " Specifies the type of test that R will perform. "greater" indicates a right tailed test. "left" indicates a left tailed test."two.sided" indicates a two tailed test. #

Which produces the output:

One Sample t-test

data: N
t = 3.6713, df = 9, p-value = 0.002572
alternative hypothesis: true mean is greater than 0.2
95 percent confidence interval:
0.244562 Inf
sample estimates:
mean of x
0.289

t = The t-test statistic, which derived by hand would be: the sample mean (M), minus the population mean (mu), divided by the standard deviation (S) divided by the square root of the size of the sample (N).

df = This value is utilized in conjunction with the confidence interval to define critical-t. It is derived by subtracting 1 from the value of N.

p-value = This is the probability that, given the parameters of the test, that we will obtain a value equal or greater than the t-region defined within our model.

##% confidence interval = This describes the numerical values which exists on each side of the t-distribution, which between both, contains ##% of the values of the model.

sample estimates = This value is the mean of the sample(s) utilized within the function.

With this output we can conclude:

With a p-value of .002572 (.002572 < .05), and a corresponding t-value of 3.6713, we can conclude that, at a 95% confidence interval, that the cakes are being produce contain an excess amount of corn syrup.

Example 2:

A new type of gasoline has been synthesized by a company that specializes in green energy solutions. The same fuel type, when synthesized traditionally, typically provides coup class vehicles with a mean of 25.6 mpg. Company statisticians were provided with a test sample of 8 cars which utilized the new fuel type. This sample of vehicles utilized the fuel at a mean rate of 23.2 mpg, with a sample standard deviation of 5 mpg. Can it be assumed, at a 95% confidence interval, that vehicle performance significantly differs when the new fuel type is utilized?

Our hypothesis will be:

H0: u = 25.6
HA: u NE 25.6

With our hypothesis created, we can assess that mu = 25.6, and that this test will be two tailed.
In this case we will need to create a random sample within R that adheres to the following parameters.

N = 8 Sample Size
S = 5 Sample Standard Deviation
M = 23.2 Sample Mean

This can be achieved with the following code block:

sampsize <- 8
standarddev <- 5
meanval <- 23.2

N <- rnorm(sampsize)
N <- standarddev*(N-mean(N))/sd(N)+meanval

With our “N” variable defined, we can now let R do the rest of the work:

t.test(N, alternative = "two.sided", mu = 25.6, conf.level = 0.95)

Which produces the output:

One Sample t-test

data: N
t = -1.3576, df = 7, p-value = 0.2167
alternative hypothesis: true mean is not equal to 25.6
95 percent confidence interval:
19.0199 27.3801
sample estimates:
mean of x
23.2

With this output we can conclude:

With a p-value of 0.2167 (.2167 > .05), and a corresponding t-value of -1.3576, we cannot conclude that, at a 95% confidence interval, that vehicle performance differs depending on utilization of fuel type.

Two Sample T-Test

This test is utilized if you randomly sample different sets of items from two separate control groups.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures temperature in samples which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?

For this, we will use the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)

Which produces the output:

Two Sample t-test

data: N2 and N1

t = 2.4558, df = 14, p-value = 0.02773

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

0.3007929 4.4492071

sample estimates:

mean of x mean of y

75.250 72.875

# Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #

# An additional option is available when running a two sample t-test, The Welch Two Sample T-Test. To utilize this option while performing a t-test, the "var.equal = TRUE" must be changed to "var.equal = FALSE". The output produced from a Welch Two Sample t-test is slightly more robust and accounts for differing sample sizes. #

From this output we can conclude:

With a p-value of 0.02773 (.0.02773 < .05), and a corresponding t-value of 2.4558, we can state that, at a 95% confidence interval, that the scientist's chemical is altering the temperature of the water.

Paired T-Test

This test is utilized if you are sampling the same set twice, once for each variable.

Example:

A watch manufacturer believes that by changing to a new battery supplier, that the watches that are shipped which include an initial battery, will maintain longer lifespan. To test this theory, twelve watches are tested for duration of lifespan with the original battery.

The same twelve watches are then re-rested for duration with the new battery.

Can the watch manufacturer conclude, that the new battery increases the duration of lifespan for the manufactured watches? (We will assume an alpha value of .05).

For this, we will utilize the code:

N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)
N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

t.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )

This produces the output:

Paired t-test

data: N2 and N1
t = 2.4581, df = 11, p-value = 0.01589
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
12.32551 Inf
sample estimates:
mean of the differences
45.75

From this output we can state:

With a p-value of 0.01589 (0.01589 < .05), and a corresponding t-value of 2.4581, we can conclude that, at a 95% confidence interval, that the new battery increases the duration of lifespan for the manufactured watches.

That's all for now. Stay tuned, data heads!

* https://en.wikipedia.org/wiki/Student%27s_t-test

** For more information as to why this is the case:

https://www.researchgate.net/post/Can_I_use_a_t-test_that_assumes_that_my_data_fit_a_normal_distribution_in_this_case_Or_should_I_use_a_non-parametric_test_Mann_Whitney2

Thursday, November 9, 2017

(R) The Distribution of Sample Means

Below are two examples which illustrate a statistical concept:

The Distribution of Sample Means:

The average mortgage for new families in Texas in $ 1300, with a standard deviation of $ 800. If 100 new families are surveyed, what is the probability that the mean mortgage payment will exceed $ 1500?

800/sqrt(100)

[1] 80

(1500-1300) / 80

[1] 2.5

pnorm(2.5, lower.tail=FALSE)

[1] 0.006209665

There is a 0.62% chance of this occurring.

An American football team typically scores 17 points per game, with a standard deviation of 3 points. In a sample of 16 games, what is the probability that the team scored an average of between 16 - 19 points?

3/sqrt(16)

[1] 0.75

(16 - 19) / 0.75

[1] -4

1 - pnorm(-4, lower.tail = TRUE) * 2

[1] 0.9999367

There is a 99.99% chance of this occurring.

Sunday, October 15, 2017

(R) Differences of Two Proportions

Below are two exercises which illustrate statistical concepts.

Confidence interval estimate for the difference of two proportions:

# Disable Scientific Notation in R Output #

options(scipen = 999)

Suppose that 85% of a sample of 100 factory workers, employed on a day shift, express positive job satisfaction, while 70% of 80 factory workers, who are employed on the night shift, share similar sentimentality. Establish a 90% confidence interval estimate for the difference.

# n1 = 100 #
# n2 = 80 #
# p1 = .85 #
# p2 = .70 #

sqrt((.85*.15/100) + (.70*.30/80))

[1] 0.06244998

z <- qnorm(.05, lower.tail = FALSE) * 0.06244998

.85 - .70

[1] 0.15

.15 + c(-z,z)

[1] 0.04727892 0.25272108

We can be 90% certain that the proportion of workers of day shift is between .05 (5%) and .25 (25%) higher than those on the day shift.

Hypothesis test for the difference of two proportions:

A pollster took a survey of 1300 individuals, the results of such indicated that 600 were in favor of candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor with candidate A. At a 10% significant level, is there evidence that the candidate's popularity has decreased.

# H0: p1 - p2 = 0 #
# Ha: p1 - p2 > 0 #
# Alpha = .10 #
# p1 = # 600 / 1300 # = 0.4615385 #
# p2 = # 500 / 1500 # = 0.3333333 #
# p = # (600 + 500) / (1300 + 1500) # = 0.3928571
# SigmaD = # sqrt(0.3928571 * (1 - 0.3928571) * (1/1300 + 1/1500)) # = 0.01850651 #

0.4615385 - 0.3333333

[1] 0.1282052

(0.1282052 - 0) / 0.01850651

[1] 6.927573

pnorm(6.927573, lower.tail = FALSE)

[1] 0.000000000002140608

Due to 0.000000000002140608 < .10 (Alpha), we can conclude that candidate A's popularity has decreased.

Hypothesis test for the difference of two proportions (automation):

Additionally, I have created a code block below which automates the process for the hypothesis testing for two means. To utilize this code, simply input the number of affirmative responders as the first two variables, and input the total individuals surveyed for the subsequent variables. The variable "finalres", will be the p-value of the output. Multiply this value by 2 if the test is two tailed.

survey1 <- # Survey 1 Variable #
survey2 <- # Survey 2 Variable #

total1 <- # Total Survey 1 Participants #
total2 <- # Total Survey 2 Participants #

p1 <- survey1/total1
p2 <- survey2/total2
p3 <- (survey1 + survey2) / (total1 + total2)
sigmad <- sqrt(p3 * (1 - p3) * (1/total1 + 1/total2))

vara <- p1 - p2
varb <- vara / sigmad

res1 <- pnorm(varb, lower.tail = FALSE)
res2 <- pnorm(varb, lower.tail = TRUE)
finalres <- ifelse(res1 > res2, res2, res1)

finalres # Multiply by 2 if utilizing a two tailed test #

(R) Sample Size and Margin of Error

If you are in the business of creating surveys, there are two important statistical concepts that you should be somewhat familiar with.

The first concept is known as margin of error. Margin of error can be defined as a value or as a percentage. What the margin of error illustrates, is the assumed variation that will occur within a statistical inference. For example, if a survey was taken of 100 individuals, and 60 of those individuals answered positively to a survey prompt, with a margin of error value of 5, the true value which is being inferred from the survey could alternate in either direction by the value of 5. Which would mean that the true value could range anywhere from 55-66.

Let's explore this concept in the following examples:

A pollster wants to infer, from surveying a population, the likelihood of a certain candidate winning a local election. The population of the voting public is 1000 individuals, and the pollster surveys 500 members of the voting body. From his survey, he has determined that 60% of those individuals are voting for candidate A. Assuming a confidence interval of .95, and a population percentage of 50%, what will the margin of error be for this survey?

####################################################

popsize <- 1000 # N #
samplesize <- 500 # n #
confidencelevel <- .95
populationpercentage <- .50 # p #

####################################################

con2 <- (1 - confidencelevel) / 2

MOE <- qnorm(con2, lower.tail = FALSE) * sqrt(populationpercentage * (1 - populationpercentage)) / sqrt((popsize - 1) * samplesize/(popsize - samplesize))

MOE

[1] 0.03100526

The margin of error is 3.10%.

Therefore, the pollster can conclude, with 95% confidence, that 60% of the population will vote for candidate A, with a margin of error of 3.10%. (The true value could fluctuate between 56.90% and 63.10%.

The next concept that we will review is sample size. This is essentially how many people need to take your survey for it be statistically significant. In this case, we will be solving for that value, and we will be assuming a population percentage of 50%.

The same pollster wants to survey another group of individuals to determine what type of milk that they would most prefer to drink. The pollster already knows his population size (1200), and has decided on a confidence level (95%) and a margin of error (5%). To meet this predetermined criteria, how many individuals should the pollster survey?

##################################

popsize <- 1200 # N #
confidencelevel <- .95
marginoferror <- .05 # MOE #
populationpercentage <- .5 # p #

##################################

con2 <- (1 - confidencelevel) / 2

a <- qnorm(con2, lower.tail = FALSE)^2 * populationpercentage * (1 - populationpercentage) / (marginoferror)^2

b <- 1 + qnorm(con2, lower.tail = FALSE)^2 * populationpercentage * (1 - populationpercentage) / ((marginoferror)^2 * popsize)

a/b

[1] 290.9928

The pollster should survey 291 individuals.

Below are some links that provide simplified online calculators, just in case you are not using R. Also provided, within those links, are more detailed definitions of the concepts illustrated above.

https://www.surveysystem.com/sscalc.htm

https://www.surveymonkey.com/mp/margin-of-error-calculator/

https://www.surveymonkey.com/mp/sample-size-calculator/

Thursday, October 5, 2017

(R) Hypothesis Tests of Proportions

In the past two articles, we discussed various topics related to proportions. In this article, we will continue to explore the various aspects pertaining to the subject of proportions. Specifically, this entry will discuss hypothesis tests as they are applicable to proportion data.

In testing data, we are faced with choosing between two separate hypotheses.

Those hypothesis are:

The Null Hypothesis - Which is a statement that is assumed to be true. This is the assertion that you will be seeking to disprove through the application of statistical methods.

The Alternative Hypothesis - This statement is the antithesis of the Null Hypothesis.

A rough example of stated hypothesis might resemble something like:

The Null Hypothesis - It is raining outside.

The Alternative Hypothesis - It is not raining outside.

This brings us to type errors, which cannot exist without hypothesis tests. There are two types of hypothesis errors.

The error types are:

Type I - Mistakenly rejecting a true Null Hypothesis.

Type II - Mistakenly failing to reject a false Null Hypothesis.

So let's try applying this knowledge to a few example problems.

###################

# A local university claims that only 20% of its incoming freshman class attend new student orientation. A statistician who is employed by the university believes that the real percentage is higher. He plans to ask 100 new students if they plan on attending the orientation. He will inform the admissions department if 30 or more students plan on attending. What is the probability of committing a Type I Error? #

H0:p = .20 (Null Hypothesis)
Ha:p > .20 (Alternative Hypothesis)

sqrt(.2 * .8/ 100)

[1] 0.04

(.3 - .2)/0.04

[1] 2.5

pnorm(2.5,lower.tail=FALSE)

[1] 0.006209665

Probability of Type I Error: .62 percent.

###################

# A radio manufacturer claims that 85% of the radios assembled from the latest batch are defective. A quality assurance representative believes that the number is lower and wishes to test at a 5% significance level. What is the conclusion if 90 of 125 radios are defective? #

H0:p = .85 (Null Hypothesis)
Ha:p < .85 (Alternative Hypothesis)

sqrt(.15 * .85/ 100)

[1] 0.03570714

# p = #
90/125

[1] 0.72

(.72 - .85)/0.03570714

[1] -3.640728

pnorm(-3.640728,lower.tail=FALSE)

[1] 0.9998641

Since 0.9998641 > .05, there is not sufficient evidence to reject the null hypothesis at the 5% significance level. The quality assurance representative should not challenge the manufacturer's claim.

###################

# A research group conducting a nutritional survey, decides to interview 500 households to test the hypothesis that 30% of the households eat dinner after 8:00 PM. Should the hypothesis be rejected at the 5% significance level if 200 families respond that they do indeed dine after 8:00 PM? #

H0:p = .30 (Null Hypothesis)
Ha:p NE .30 (Alternative Hypothesis)
Alpha = .05

sqrt(.30 * .70/ 100)

[1] 0.04582576

#p = #
200/500

[1] 0.4

(.4 - .3)/0.04582576

[1] 2.182179

pnorm(2.182179, lower.tail = FALSE) * 2

[1] 0.02909632

Since 0.02909632 < .05, there is sufficient evidence to reject the null hypothesis.

###################

# A local university has enrolled 10% of all eligible students from an adjacent neighborhood. The office of administration plans a survey of 2000 houses. If less 7% indicate that they are interested in potential enrollment, it will be concluded that the market share has dropped. What is the probability of a Type I Error? #

H0:p = .10 (Null Hypothesis)
Ha:p < .10 (Alternative Hypothesis)

sqrt(.10 * .90/ 100)

[1] 0.03

(.07 - .10)/0.03

[1] -1

pnorm(-1, lower.tail = TRUE)

[1] 0.1586553

Probability of Type I Error: 15.87 percent.

# What is the probability of a Type II Error if university enrollment is 8%. Meaning, what is the probability that we will fail to reject the 10% null hypothesis? #

(.07 - .08)/0.03

[1] -0.3333333
pnorm(-0.3333333, lower.tail = FALSE)

[1] 0.6305586

Probability of Type II Error: 63.05 percent.

In the next article, we will continue to investigate the hypothesis testing procedure, stay tuned Data Heads!

Saturday, September 30, 2017

INTCK (Duration Measurement) (SAS)

I recently had to re-visit a SAS project that I had previously completed. This particular project required that I create an additional variable within an existing data set, this variable was to store the difference of days which occurred between two pre-existing date values. In running the code, it occurred to me that I had not previously discussed the function necessary to obtain such a result. Therefore, this will be the topic of todays article.

The SAS function, INTCK, serves as a way of determining a selected duration of time which has elapsed between two SAS variables.

The form of the function is as follows:

INTCK(‘<measured duration>’ , <DATEA>, <DATEB>);

For example, if you wanted to measure the days that occurred between variable DATEA and DATEB, the code would resemble:

INTCK(‘day’ , DATEA, DATEB);

You will need to store the output, so we will create a new variable, DATEC, to do so. The code would then resemble:

DATEC = INTCK(‘day’ , DATEA, DATEB);

This function can evaluate many different measurements of time, however, I will only mention those that are most commonly utilized.

DATEC = INTCK(‘day’ , DATEA, DATEB); /* Days */

DATEC = INTCK(‘week’ , DATEA, DATEB); /* Weeks */

DATEC = INTCK(‘month’ , DATEA, DATEB); /* Months */

DATEC = INTCK(‘year’ , DATEA, DATEB); /* Years */

DATEC = INTCK(‘qtr’ , DATEA, DATEB); /* Quarters */

/*****************************************************************************/

!IMPORTANT NOTE! - INTCK() only returns INTEGERS! Meaning, that if a week and a half elapsed, and <INTCK(‘week’, x , y)> was being utilized, then the result would be 1, as only 1 WHOLE week has lapsed in duration.

In the next article, I will return to discussing the R coding language, specifically, hypothesis tests and error types.

Wednesday, September 20, 2017

(R) The Confidence Interval Estimate of Proportions

In this article, we will again be focusing on sample data. Specifically, what today's exercises will be demonstrating, is the process necessary to determine whether we can say that a proportion lies within a certain interval.

Example 1:

If 60% of a sample of 120 individuals leaving a diner claim to have spent over $12 for lunch, determine a 99% confidence interval estimate for the proportion of patrons who spent over $12.

sqrt(.4 * .6/ 100)

[1] 0.04898979

z <- qnorm(.005, lower.tail=FALSE) * 0.04898979
# .005 is due to our test containing 2 tails #

.60 + c(-z, z)

[1] 0.4738107 0.7261893

Conclusion: We are 99% certain that the proportion of diner patrons spending over $12 for lunch is between 0.4738107 (47.34%) and 0.7261893 (72.62%).

Example 2:

In random sample of lightbulbs being produced by a factory, 20 out of 300 were found to be shattered during the shipping process. Establish a 95% confidence interval estimate that accounts for these damages.

p = 20/300

[1] 0.06666667

sqrt(.066 * .934 / 100)

[1] 0.02482821

z <- qnorm(.025, lower.tail=FALSE) * 0.02482821

0.06666667 + c(-z,z)

[1] 0.01800427 0.11532907

Therefore, we can be 95% certain that the proportion of light bulbs damaged during the shipping process is between 0.01800427 (1.80% and 11.53%).

Furthermore, if we wished, we could apply these ratios to a total shipment to create an estimation.

If 1000 light bulbs shipped, we can be 95% confident that between 18 and 115.3 light bulbs are damaged within the shipment.

In the next article, we will be discussing hypothesis tests. Until then, stay tuned Data Monkeys!

(R) Distributions of Sample Proportions

Let's suppose that we have been presented with sample data from a larger collection of population data; or, let's suppose that we have been presented with incomplete information pertaining to a larger population.

If we were tasked to reach various conclusions based on such data, how would we structure our models? This article sets to answer these questions. To begin this study, we will review a series of example problems.

Example 1:

The military has instituted a new training regime in order to screen candidates for a newly formed battalion. Due to the specialization of this unit, candidates are vetted through exercises which screen through the utilization of extremely rigorous physical routines. Presently, only 60% of candidates who have attempted the regime, have successfully passed. If 100 new candidates volunteer for the unit, what is the probability that more than 70% of those candidates will pass the physical?

# Disable Scientific Notation in R Output #

options(scipen = 999)

# Find The Standard Deviation of The Sample #

Standard Deviation = Square Root of: (x)(1-x) / 100

sqrt(.4 * .6/ 100)

[1] 0.04898979

# Find the Z-Score #

(.7 - .6)/0.04898979

[1] 2.041242

Probability of Z-Score 2.041242 = .4793
(Check Z-Table)

Finally, conclude as to whether the probability of the sample exceeds 70%
(One tailed test)

.50 - .4793

[1] 0.0207

In R, the following code can be used to expedite the process:

sqrt(.4 * .6/ 100)

[1] 0.04898979

pnorm(q=.7, mean=.6, sd=0.04898979 , lower.tail=FALSE)

[1] 0.02061341

So, we can conclude, that if 100 new candidates volunteer for the unit, there is only a 2.06% chance that more than 70% of those candidates will pass the physical.

The process really is that simple.

In the next article we will review confidence interval estimate of proportions.

Monday, September 18, 2017

(R) Multiple Linear Regression - Pt. (II)

In the previous article, we discussed linear regression. In this article, we will discuss multiple linear regression. Multiple linear regression, from a conceptual standpoint, is exactly the same as linear regression. The only fundamental difference, is that through the utilization of a multiple linear regression model, multiple independent variables can be assessed.

In this example, we have three variables, each variable is comprised of prior observational data.

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

To integrate these variable sets into a model, we will use the following code:

multiregress <- (lm(y ~ x + z))

This code creates a new set ('multiregress'), which contains the regression model data. In this model, 'y' is the dependent variable, with 'x' and 'z' both represented as dependent variables.

We will need to run the following summary function to receive output information pertinent to the model:

summary(multiregress)

The output produced within the console window is as follows:

Call:
lm(formula = y ~ x + z)

Residuals:
Min 1Q Median 3Q Max
-6.4016 -5.0054 -1.7536 0.8713 14.0886

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.1434 12.0381 3.916 0.00578 **
x 0.7808 0.3316 2.355 0.05073 .
z 0.3990 0.4804 0.831 0.43363
---

Residual standard error: 7.896 on 7 degrees of freedom
Multiple R-squared: 0.5249, Adjusted R-squared: 0.3891
F-statistic: 3.866 on 2 and 7 DF, p-value: 0.07394

In this particular model scenario, the model that we would use to determine the value of 'y' is:

y = 0.7808x + 0.3990z + 47.1434

However, in investigating the results of the summary output, we observe that:

Multiple R-squared = 0.5249

Which can be a large enough coefficient, depending on what type of data we are observing...but the following values should raise some alarm:

p-value: 0.07394 (> Alpha of .05)

AND

F-statistic: 3.866 on 2 and 7 DF

Code:

qf(.95, df1=2, df2=7) #Alpha .05#

[1] 4.737414

4.737414 > 3.866

If these concepts seem foreign, please refer to the previous article.

From the summary data, we can conclude that this model is too inaccurate to be properly accepted and utilized.

Therefore, I would recommend re-creating this model with new independent variables.

When creating multiple linear regression models, it is important to consider the values of the f-statistic and the coefficient of determination (multiple r-squared). If variables are being added, or exchanged for different variables within an existing regression model, ideally, the f-statistic and the coefficient of determination should rise in value. This increase indicates that the model is increasing in accuracy. A decline in either of these values would indicate otherwise.

Moving forward, new articles will cover less complicated fundamental aspects of statistics. If you understand this article, and all prior articles, the following topics of discussion should be mastered with relative ease. Stay tuned for more content, Data Heads!

Thursday, September 14, 2017

(R) Linear Regression - Pt. (I)

In this first entry of a two part article series, we will be discussing linear regression. In the next article, I will move on to the more advanced topic of multiple regression.

Regression analysis allows you to create a predictive model. This model can be used to ascertain future results based on the value of a known variable.

Let's begin with an example.

We have two sets of data:

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)

To determine if there is a relationship between the data points of each set, we must first decide which variable is effecting the other. Meaning, one variable's value will determine the value of the other variable. In the case of our example, we have determined that the value of 'x', is impacting the value of 'y'. Therefore, 'x' is our independent variable, and 'y' is our dependent variable, as y's value is dependent of the value of 'x'.

We can now create a linear model for this data. The dependent variable must be listed first in this function followed by the independent variable.

linregress <- (lm(y ~ x))

'lingress'' is the name of the data set that we will use to store this model. To produce a model summary, which will contain the information necessary for analysis, we must utilize the following command:

summary(linregress)

This should output the following data to the console:

Call:

lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max

-9.563 -4.649 -1.361 1.457 13.139

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 52.6319 9.8653 5.335 0.000698 ***

x 0.8509 0.3144 2.707 0.026791 *

---

Residual standard error: 7.741 on 8 degrees of freedom

Multiple R-squared: 0.478, Adjusted R-squared: 0.4128

F-statistic: 7.327 on 1 and 8 DF, p-value: 0.02679

Now to overview each aspect of the output.

Call:

This is specifying the data that was "called" by the function.

Residuals:

A residual, is a value which represents the difference between the dependent value that is produced by the model, and the actual value of the dependent variable. These values are sorted in the way that is similar to the fivenum() function. For more information on this function, please consult prior articles.

However, there may be times that you would like to view the value of the residuals in their entirety. To achieve this, please utilize the function below:

resid(linregress)

This outputs the following to the console:

1 2 3 4 5 6 7 8

-5.606860 -1.563325 1.647757 -1.159631 -7.097625 13.138522 11.094987 -9.563325

9 10

-1.774406 0.883905

Coefficients

Estimate
(Intercept) - This value is the value of the y-intercept.

Estimate
x - This is the value of the slope.

With this information, we can now create our predictive model:

Y = 0.8509x + 52.6319

Y is the value of the dependent variable, and x is the value of the independent variable. If you enter the value of x into the equation, and work out the operations, you should recieve the predicted value of Y.

Std. Error
x - This value is the standard error of the slope value.

t value
x - This value is the standard error divided by the value of the coefficient. In our example, this value is comprised of the quotient 0.3144 / 0.8509 . The value of such is 2.707.

This allows us to create a t-test to check for significance. This particular test establishes a null hypothesis, which is utilized to check as to whether the slope has significance as it pertains to the model.

To obtain the t-test value to perform this evaluation, you will need to first determine the confidence interval that you wish to utilize. In this case, we'll assume 95% confidence, this equates an alpha value of .05.

Since the t-test that we will be performing is a two tailed test, we will enter the following code to receive our critcal value:

qt(c(.05/2), df=8)

In the above code, .05 is our alpha value, which is being divided by 2 due to the test requiring two tails. df=8 is the value of our degrees of freedom, these values can be found in the row output, which reads: "Residual standard error". This output specifies 8 degrees of freedom.

Since the t-value of our model (2.707), is greater than that t-test value itself (2.306004), we can state, that based on our confidence interval, the slope is significant as it pertains to our model.

p value
x - The p-value that is being indicated, is representative of the level of change dependent on the variable 'x'.

The lower the p-value, the greater the indication of this significance. We can test this value against a confidence interval, in this case, 95%, or alpha = .05. Since our p-value is 0.026791, which is smaller than the alpha value of .05, we can state, with 95% confidence, that this model is statistically significant.

Residual standard error:

This is the estimated standard deviation of the residual values.

There is some confusion as to how this value is calculated. It is important to note, that residual standard error is an ESTIMATED STANDARD DEVIATION. As such, degrees of freedom are
calculated as (n-p), and not (n-1). In the case of our model, if you were to calculate the standard deviation of the residuals with the function: SD(residual values), then the standard deviation value would be incorrect as it pertains to the model. *

F-Statistic

One of the most important statistical methods for checking significance is the F-Statistic.

As you can see from the above output, the F-Statistic is calculated to be:

7.327

With this information, we can conduct a hypothesis test after deciding on an appropriate confidence interval. Again, we will utilize a confidence interval of 95%, and this provides us with an alpha value of .05.

Degrees of freedom are provided, those values are 1 and 8.

With this information, we can now generate the critical value in which to test the F-Statistic against. Typically this value would be found on a table within a statistics textbook, However, a much more accurate and expedient way of finding this value, is through the utilization of R software.

qf(.95, df1=1, df2=8) #Alpha .05#

This provides us with the console output:

5.317655

Since our F-Statistic of 7.327 is greater than the critical value of our test statistic 5.317655, we will not reject the null hypothesis at a 95% confidence interval. Due to such, we can conclude,
with 95% confidence, that the model provides a significant fit for our data.

The Value of The Coefficient of Determination

(Multiple R-Squared)

(Notated as: r²)

The coefficient of determination can be thought of as a percent. It gives you an idea of how many data points fall within the results of the line formed by the regression equation. The higher the coefficient, the higher percentage of points the line passes through when the data points and line are plotted.**

In most entry level statistics classes, this is the only variable that is evaluated when determining model significance. Typically, the higher the value of the coefficient of determination, assuming that all other tests of significance are met, the greater the usefulness and accuracy of the model. This value can be any number from 0-1. A value of 1 would indicate a perfect model.

(Adjusted R-Squared)

This value is the adjusted coefficient of determination, its method of calculation accounts for the number of observations contained within the model. Adjusted r-squared, by its nature, will always be of a lesser value than its multiple r-squared equivalent.***

The Value of the Coefficient of Correlation

(notated as: r)

This value is not provided in the summary. However, there may be times when you would like to have this value provided. The code to produce this value is below:

cor(y, x)

This outputs the following value to the console:

[1] 0.6914018

Therefore, r = 0.6914018.

Graphing the Linear Regression Model

The following code can be utilized to create the graph of a linear regression model in R. In this case, we will be creating a graphical representation of our example model.

plot(x, y, xlab="X-Value", ylab="Y-Value", main="Linear Regression Example")
abline(linregress)

* For more information on the standard error of the regression -
http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-to-interpret-s-the-standard-error-of-the-regression

** http://www.statisticshowto.com/what-is-a-coefficient-of-determination/

*** https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2