Monday, November 13, 2017

(R) F-Test

You may remember the F-Test from the previous article on multiple linear regression. In this entry, we will further delve into the concept of the F-Test.

The F-Test is a statistical method for comparing two population variances. Its most recognized use is as a component of the ANOVA method, which will be discussed in a later article.

Essentially, the F-Test provides a test statistic, a critical value, and a reference distribution. With these values derived, a hypothesis test can be stated, and from it, the two variances can be compared.

Some things to keep in mind before moving forward:

1. The F-Test assumes that the samples provided originated from normally distributed populations.

2. The F-Test attempts to discover whether two samples originate from populations with equal variances.

So for example, if we were comparing the following two samples:

samp1 <- c(-0.73544189, 0.36905647, 0.69982679, -0.91131589, -1.84019291, -1.02226811, -1.85088278, 2.24406451, 0.63377787, -0.80777949, 0.60145711, 0.43853971, -1.76386879, 0.32665597, 0.32333871, 0.90197004, 0.29803556, 0.47333427, 0.23710263, -1.48582332, -0.45548478, 0.36490345, -0.08390139, -0.46540965, -1.66657385)

samp2 <- c(0.67033912, -1.23197505, -0.18679478, 1.06563032, 0.08998155, 0.22634414, 0.06541938, -0.22454059, -1.00731073, -1.43042950, -0.62312404, -0.22700636, -0.71908729, -0.36873910, 0.15653935, -0.19328338, 0.56259671, 0.31443699, 1.02898245, 1.18903593, -0.14576090, 0.68375259, -0.15348007, 1.58654607, 0.01616986)


For a right tailed test, we would state the following hypothesis:

H0: σ1² = σ2²
Ha: σ1² > σ2²

# Null Hypothesis = Variances are equal. #

# Alternative Hypothesis = The first measurement of variance is greater than the second measurement of variance. #

With both samples imported into R, we can now utilize the following code to perform the F-Test:

(We will assume an alpha of .05):

var.test(samp1, samp2, alternative = "greater", conf.level = .95)

Which produces the output:

    F test to compare two variances

data: samp1 and samp2
F = 1.9112, num df = 24, denom df = 24, p-value = 0.05975
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
 0.9634237       Inf
sample estimates:
ratio of variances 
        1.911201

Let us review each aspect of this output:

“ F = “ is the F-Test test statistic.

“num df = “ is the value of the degrees of freedom found within the numerator.

“denom df = “ is the value of the degrees of freedom found within the denominator.

“p-value = “ is the probability, under the null hypothesis, of observing an F statistic at least as large as the one calculated.

“95 percent confidence interval:” is a 95% confidence interval for the ratio of the two population variances.

“ratio of variances” is the value of the variance of sample 1 divided by the variance of sample 2.
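As a quick check, the F-Test statistic shown above is simply the ratio of the two sample variances, and can be reproduced directly:

var(samp1) / var(samp2) # Should return approximately 1.911201, matching the F statistic above #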

Looking at the p-value, which is greater than our alpha value (0.05975 > .05), we cannot conclude, at a 95% confidence level, that our samples were taken from populations with differing variances.

Additionally, we can confirm this conclusion by comparing our F-Test statistic of 1.9112 to the critical F-value which corresponds with the appropriate degrees of freedom and alpha value. To find this value, we would typically consult a table in the back of a statistics textbook. However, R makes the situation simpler by providing a function which returns this value directly.

Utilizing the code:

qf(.95, df1=24, df2=24) # Alpha .05, Numerator Degrees of Freedom = 24, Denominator Degrees of Freedom = 24 #

Produces the output:

[1] 1.98376

Because our F-Test statistic does not exceed this critical value (1.9112 < 1.98376), we again cannot conclude that our samples were taken from populations with differing variances.
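As a side note, the p-value itself can also be reproduced with the pf() function, which returns the area of the F-distribution beyond the observed statistic:

pf(var(samp1) / var(samp2), df1 = 24, df2 = 24, lower.tail = FALSE) # Approximately 0.05975, matching the p-value reported by var.test() #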

If we were to graph this test and distribution, the illustration would resemble:


If you would like to create your own f-distribution graphs, sans the mark-ups, you could use the following code:

curve(df(x, df1=24, df2=24), from=0, to=5) # Modify the degrees of freedom only #

Below is an illustration of a few F-distributions produced by varying the degrees of freedom:
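If you would like to reproduce a similar set of overlaid curves yourself, something along these lines could work (the degrees of freedom below are arbitrary example values):

curve(df(x, df1 = 2, df2 = 2), from = 0, to = 5, ylim = c(0, 2.5), ylab = "Density") # The first call establishes the plot #
curve(df(x, df1 = 5, df2 = 10), add = TRUE, col = "red") # Subsequent calls layer additional curves on top #
curve(df(x, df1 = 100, df2 = 100), add = TRUE, col = "blue")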


I hope that you found this article useful. In the next post, we will begin to discuss the concept of ANOVA.

* A helpful article pertaining to the F-Test statistic: http://atomic.phys.uni-sofia.bg/local/nist-e-handbook/e-handbook/eda/section3/eda359.htm

** Source for F-Distribution Image: https://en.wikipedia.org/wiki/F-distribution

Sunday, November 12, 2017

(R) VIF() and COR()

To establish a greater understanding of multiple regression, and to foster a more efficient application of the regression model, this entry reviews functions and statistical concepts which were overlooked in the prior article on the subject.

Variance Inflation Factor or VIF():

What is VIF()? According to Wikipedia, it is defined as such: “The variance inflation factor (VIF) quantifies the severity of multicollinearity in an ordinary least squares regression analysis.”

To define VIF() in layman’s terms, the Variance Inflation Factor measures how strongly each independent variable within a multiple regression equation can be predicted from the other independent variables. For each variable, a separate regression is fit with that variable as the response and the remaining independent variables as predictors; the resulting coefficient of determination (R-Squared) is then converted into the VIF. It’s really as simple as that, but none of this will make sense without an example.
Example:
Consider the following numerical sets:

w <- c(23, 42, 55, 16, 24, 27, 24, 15, 23, 85)
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)


Let’s utilize these values to create a multiple regression model:

lm.multiregressw <- (lm(w ~ x + y + z))

Now let’s take a look at that model:
summary(lm.multiregressw)

Call:
lm(formula = w ~ x + y + z)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.701 -11.020  -4.462   3.465  44.108

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.40034   68.57307  -0.020    0.984
x            -0.08981    1.41549  -0.063    0.951
y             0.19770    1.20528   0.164    0.875
z             1.15241    1.60560   0.718    0.500

Residual standard error: 25.18 on 6 degrees of freedom
Multiple R-squared: 0.1109, Adjusted R-squared: -0.3336
F-statistic: 0.2495 on 3 and 6 DF, p-value: 0.8591

Given the Multiple R-squared value of this output, we can assume that this model is pretty awful. Regardless, our model could be affected by a phenomenon known as multicollinearity. What this means is that the independent variables could be correlated with one another, and thus prevent our model from providing accurate results. To detect this, we can measure the VIF() for each independent variable within the equation as it pertains to each of the other independent variables. If we were to do this manually, the code would resemble:

lm.multiregressx <- (lm(x ~ y + z )) # .4782 #

lm.multiregressy <- (lm(y ~ x + z )) # .5249 #

lm.multiregressz <- (lm(z ~ y + x )) # .1488 #

With each output, we would note the Multiple R-Squared value. I have provided those values to the right of each code line above. To then individually calculate the VIF(), we would utilize the following code for each R-Squared variable.

1 / (1 - .4782) # 1.916 #

1 / (1 - .5249) # 2.105 #

1 / (1 - .1488) # 1.175 #

This produces the values shown to the right of each code line. These values are the VIF(), or Variance Inflation Factor, for each variable.
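If you would prefer to automate the manual calculation, a small helper function along the following lines could work. This is only a sketch; manual_vif is a hypothetical name and not a built-in R function:

manual_vif <- function(df) {
  sapply(names(df), function(v) {
    # Regress each variable on all of the other variables and note the R-Squared value #
    r2 <- summary(lm(reformulate(setdiff(names(df), v), response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

manual_vif(data.frame(x, y, z)) # Should return approximately 1.916, 2.105, and 1.175 #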

An easier way to derive the VIF() is to install the R package, “car”. Once “car” is installed, you can execute the following command:

vif(lm.multiregressw)

Which provides the output:

       x        y        z 
1.916437 2.104645 1.174747


Typically, most data scientists consider VIF() values over 5 or 10 (depending on sensitivity) to indicate that a variable ought to be removed from the model. If you do plan on removing a variable for such reasons, remove one variable at a time, as the removal of a single variable will affect the subsequent VIF() measurements.

(Pearson) Coefficient of Correlation or cor():

Another tool at your disposal is the function cor(). cor() allows the user to derive the coefficient of correlation (r) from two numerical sets. For example:

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
cor(x,y)

[1] 0.6914018

As useful as this is, there is an even better application of this function. If you want to derive the coefficient of correlation for each variable as it pertains to every other variable, you can use the following code lines to build a correlation matrix.

# Set values: #

w <- c(23, 42, 55, 16, 24, 27, 24, 15, 23, 85)
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)


# Create a data frame #

dframe <- data.frame(w,x,y,z)

# Create a correlation matrix #

correlationmatrix <- cor(dframe)

The output will resemble:

          w         x         y         z
w 1.0000000 0.1057911 0.1836205 0.3261470
x 0.1057911 1.0000000 0.6914018 0.2546853
y 0.1836205 0.6914018 1.0000000 0.3853427
z 0.3261470 0.2546853 0.3853427 1.0000000
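For readability, it may also help to round the coefficients before inspecting the matrix:

round(correlationmatrix, 2) # Rounds each correlation coefficient to two decimal places #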


In the next article, we will again review the F-Statistic in preparation for a discussion pertaining to the concept of ANOVA. Until then, hang in there statistics fans!

(R) Confidence Interval of The Mean (T)

Below is an example which illustrates a statistical concept:

# A pastry shop purchases a new machine which will be utilized to fill the insides of various pastries. A test run is commissioned in which the following proportions of jelly are inserted into each pastry. We are tasked with finding the 90% confidence interval for the mean filling amount from this process. #

s <- c(.28,.25,.22,.20,.33,.20)

w <- sd(s) / sqrt(length(s))


w

[1] 0.02092314 

degrees <- length(s) - 1

degrees

 [1] 5 

t <- abs(qt(0.05, degrees))

t

[1] 2.015048

m <- (t * w)

m

[1] 0.04216114

mean(s) + c(-m, m)

[1] 0.2045055 0.2888278 

# We can state, with 90% confidence, that the machine will insert into pastries proportions of filling between .205 and .289. In the above exercise, a new concept is introduced in the fourth step. The abs() function, in conjunction with the qt() function, allows you to find critical t-values within R without having to refer to a textbook. A few examples of how this function can be utilized are listed below. #
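As a quick check, the same interval can also be obtained directly from the t.test() function:

t.test(s, conf.level = 0.90)$conf.int # Should return approximately 0.2045 and 0.2888, matching the interval above #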

# T VALUES (Confidence / Degrees of Freedom ) #
abs(qt(0.25, 40)) # 75% confidence, 40 degrees of freedom, 1 sided (same as qt(0.75, 40)) #
abs(qt(0.01, 40)) # 99% confidence, 40 degrees of freedom, 1 sided (same as qt(0.99, 40)) #
abs(qt(0.01/2, 40)) # 99% confidence, 40 degrees of freedom, 2 sided #

I hope that you found this abbreviated article to be interesting and helpful. Until next time, I'll see you later Data Heads!

(R) T-Tests and The T-Statistics

Welcome to the 60th article of Confessions of a Data Scientist. In this article we will be reviewing an incredibly important statistical topic: The T-Statistic.

The T-Statistic has many usages, but before we can begin to discuss these methods, I will briefly explain the conceptual significance of the subject matter, and when it is appropriate to utilize the T-Statistic.

The T-Statistic was created by a man named William Sealy Gosset, while he was working as a chemist at The Guinness Brewing Company. “Student” was the pseudonym under which Gosset published his findings, as Guinness forbade its chemists from publishing their research. Gosset originally devised the T-Statistic and the concept of the “T-Test” as methods to measure the quality of stout. *

When to utilize The T-Statistic:

1. When the population standard deviation is unknown. (This is almost always the case).
2. When the current sample size is less than or equal to 30.

It is important to mention that a common misconception is that the t-distribution assumes normality. This is not the case. However, the t-distribution does begin to more closely resemble a normal distribution as the degrees of freedom approach 100. **

There are many uses for the t-distribution, some of which will be covered in future articles. However, the most commonly applied methods are: The One Sample T-Test, The Two Sample T-Test, and The Paired T-Test.

One Sample T-Test

This test is utilized to compare a sample mean to a specific value. It is used when the dependent variable is measured at the interval or ratio level.

To apply this test, you will need to define the following variables:

The size of the sample (N).
The applicable confidence interval, and the corresponding alpha.
The standard deviation of the sample (s).
The mean of the sample (M).
The mean of the population (mu).

To demonstrate the application of this method, I will provide two examples.

Example 1:

A factory employee believes that the cakes produced within his factory are being manufactured with excess amounts of corn syrup, thus altering the taste. 10 cakes were sampled from the most recent batch and tested for corn syrup composition. Typically, each cake should consist of 20% corn syrup. Utilizing a 95% confidence level, can we assume that the new batch of cakes contains more than a 20% proportion of corn syrup?

The levels of the samples were:

.27, .31, .27, .34, .40, .29, .37, .14, .30, .20

Our hypothesis test will be:

H0: u = .2
HA: u > .2

With our hypothesis created, we can assess that mu = .2, and that this test will be right tailed.

We do not need to derive the additional variables ourselves, as R will perform that task for us. However, we must first input the sample data into R:

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

Now R will automate the rest of the equations.

t.test(N, alternative = "greater", mu = .2, conf.level = 0.95)

# " alternative = " Specifies the type of test that R will perform. "greater" indicates a right tailed test. "left" indicates a left tailed test."two.sided" indicates a two tailed test.  #

Which produces the output:

One Sample t-test

data: N
t = 3.6713, df = 9, p-value = 0.002572
alternative hypothesis: true mean is greater than 0.2
95 percent confidence interval:
0.244562 Inf
sample estimates:
mean of x
0.289


t = The t-test statistic. Derived by hand, it is the sample mean (M) minus the population mean (mu), divided by the sample standard deviation (S) divided by the square root of the sample size (N): t = (M - mu) / (S / sqrt(N)).

df = This value is utilized in conjunction with the confidence interval to define critical-t. It is derived by subtracting 1 from the value of N.

p-value = The probability, under the null hypothesis, of obtaining a t statistic at least as extreme as the one observed.

##% confidence interval = The range of values which, at the stated confidence level, is estimated to contain the true population mean. For a one tailed test such as this one, the interval is bounded on only one side (hence the Inf).

sample estimates = This value is the mean of the sample(s) utilized within the function.
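If you would like to verify the t value by hand, the formula described above can be applied directly to the sample N defined earlier:

(mean(N) - .2) / (sd(N) / sqrt(length(N))) # Should return approximately 3.6713, matching the t value reported above #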

With this output we can conclude:

With a p-value of .002572 (.002572 < .05), and a corresponding t-value of 3.6713, we can conclude, at a 95% confidence level, that the cakes being produced contain an excess amount of corn syrup.

Example 2:

A new type of gasoline has been synthesized by a company that specializes in green energy solutions. The same fuel type, when synthesized traditionally, typically provides coupe class vehicles with a mean of 25.6 mpg. Company statisticians were provided with a test sample of 8 cars which utilized the new fuel type. This sample of vehicles utilized the fuel at a mean rate of 23.2 mpg, with a sample standard deviation of 5 mpg. Can it be assumed, at a 95% confidence level, that vehicle performance significantly differs when the new fuel type is utilized?

Our hypothesis will be:

H0: u = 25.6
HA: u ≠ 25.6

With our hypothesis created, we can assess that mu = 25.6, and that this test will be two tailed.
In this case, we will need to create a random sample within R that adheres to the following parameters:

N = 8 Sample Size
S = 5 Sample Standard Deviation
M = 23.2 Sample Mean

This can be achieved with the following code block:

sampsize <- 8 # Sample size (N) #
standarddev <- 5 # Sample standard deviation (S) #
meanval <- 23.2 # Sample mean (M) #

N <- rnorm(sampsize) # Draw a random sample from the standard normal distribution #
N <- standarddev*(N-mean(N))/sd(N)+meanval # Re-center and re-scale the sample so that it has exactly the desired mean and standard deviation #


With our “N” variable defined, we can now let R do the rest of the work:

t.test(N, alternative = "two.sided", mu = 25.6, conf.level = 0.95)

Which produces the output:

One Sample t-test

data: N
t = -1.3576, df = 7, p-value = 0.2167
alternative hypothesis: true mean is not equal to 25.6
95 percent confidence interval:
19.0199 27.3801
sample estimates:
mean of x
23.2


With this output we can conclude:

With a p-value of 0.2167 (.2167 > .05), and a corresponding t-value of -1.3576, we cannot conclude, at a 95% confidence level, that vehicle performance differs depending on the fuel type utilized.
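As with the F-Test in the previous article, this conclusion can also be checked against the critical t value for a two tailed test:

abs(qt(0.025, df = 7)) # Approximately 2.365; since |-1.3576| < 2.365, we fail to reject the null hypothesis #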

Two Sample T-Test

This test is utilized if you randomly sample different sets of items from two separate control groups.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures the temperature in samples to which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, at a 95% confidence level, that his chemical is in some way altering the temperature of the water?

For this, we will use the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)


t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)

Which produces the output:

Two Sample t-test

data:  N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3007929 4.4492071
sample estimates:
mean of x mean of y 
   75.250    72.875 

# Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #

# An additional option is available when running a two sample t-test: The Welch Two Sample T-Test. To utilize this option while performing a t-test, "var.equal = TRUE" must be changed to "var.equal = FALSE". The Welch Two Sample t-test is slightly more robust, as it does not assume that the two populations share equal variances. #

From this output we can conclude:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.
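If you would like to verify the t value reported above by hand, the pooled two sample formula can be reproduced with a few lines of R (a minimal sketch, assuming equal population variances as in the test above):

sp2 <- ((length(N1) - 1) * var(N1) + (length(N2) - 1) * var(N2)) / (length(N1) + length(N2) - 2) # Pooled variance estimate #

(mean(N2) - mean(N1)) / sqrt(sp2 * (1 / length(N1) + 1 / length(N2))) # Should return approximately 2.4558 #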

Paired T-Test

This test is utilized if you are sampling the same set twice, once for each variable.

Example:

A watch manufacturer believes that by changing to a new battery supplier, the watches that are shipped with an initial battery included will maintain a longer lifespan. To test this theory, twelve watches are tested for duration of lifespan with the original battery.

The same twelve watches are then re-tested for duration of lifespan with the new battery.

Can the watch manufacturer conclude that the new battery increases the duration of lifespan for the manufactured watches? (We will assume an alpha value of .05).

For this, we will utilize the code:

N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)

N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

t.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )


This produces the output:

Paired t-test

data: N2 and N1
t = 2.4581, df = 11, p-value = 0.01589
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
12.32551 Inf
sample estimates:
mean of the differences
45.75


From this output we can state:

With a p-value of 0.01589 (0.01589 < .05), and a corresponding t-value of 2.4581, we can conclude, at a 95% confidence level, that the new battery increases the duration of lifespan for the manufactured watches.
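It may also be worth noting that a paired t-test is equivalent to a one sample t-test performed on the differences between the paired measurements. As a quick check:

t.test(N2 - N1, alternative = "greater", mu = 0, conf.level = 0.95) # Produces the same t value, degrees of freedom, and p-value as the paired test above #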

That's all for now. Stay tuned, data heads!

* https://en.wikipedia.org/wiki/Student%27s_t-test

** For more information as to why this is the case: 

https://www.researchgate.net/post/Can_I_use_a_t-test_that_assumes_that_my_data_fit_a_normal_distribution_in_this_case_Or_should_I_use_a_non-parametric_test_Mann_Whitney2

Thursday, November 9, 2017

(R) The Distribution of Sample Means

Below are two examples which illustrate a statistical concept:

The Distribution of Sample Means:

The average mortgage for new families in Texas is $1,300, with a standard deviation of $800. If 100 new families are surveyed, what is the probability that the mean mortgage payment will exceed $1,500?

800/sqrt(100)

[1] 80

(1500-1300) / 80

[1] 2.5

pnorm(2.5, lower.tail=FALSE)

[1] 0.006209665

There is a 0.62% chance of this occurring.
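As a side note, the same probability can be obtained in a single call by passing the mean and the standard error directly to pnorm():

pnorm(1500, mean = 1300, sd = 80, lower.tail = FALSE) # Approximately 0.006209665 #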

An American football team typically scores 17 points per game, with a standard deviation of 3 points. In a sample of 16 games, what is the probability that the team scored an average of between 16 and 19 points?

3/sqrt(16)

[1] 0.75

(16 - 17) / 0.75

[1] -1.333333

(19 - 17) / 0.75

[1] 2.666667

pnorm(2.666667) - pnorm(-1.333333)

[1] 0.9049584

There is approximately a 90.5% chance of this occurring.