Sunday, November 12, 2017

(R) T-Tests and The T-Statistic

Welcome to the 60th article of Confessions of a Data Scientist. In this article we will be reviewing an incredibly important statistical topic: The T-Statistic.

The T-Statistic has many uses, but before we can begin to discuss these methods, I will briefly explain the conceptual significance of the subject matter, and when it is appropriate to utilize the T-Statistic.

The T-Statistic was created by a man named William Sealy Gosset, who devised the concept while working as a chemist at the Guinness Brewery. Because Guinness forbade its chemists from publishing their research, Gosset published his findings under the pseudonym "Student". He originally devised the T-Statistic and the concept of the "T-Test" as methods to measure the quality of stout. *

When to utilize The T-Statistic:

1. When the population standard deviation is unknown. (This is almost always the case).
2. When the current sample size is less than or equal to 30.

It is important that I mention now that a common misconception is that the t-distribution assumes normality. This is not the case. The t-distribution does, however, more closely resemble a normal distribution as the degrees of freedom increase. **
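This convergence toward the normal distribution can be seen directly in R by comparing critical values of the t-distribution at increasing degrees of freedom against the standard normal (the values below are approximate):

```r
# 97.5th percentile (two-tailed test, alpha = .05) at increasing df
qt(0.975, df = 5)     # roughly 2.571
qt(0.975, df = 30)    # roughly 2.042
qt(0.975, df = 100)   # roughly 1.984

# The standard normal equivalent, which the t-distribution approaches
qnorm(0.975)          # roughly 1.960
```

As df grows, the critical t-value shrinks toward the normal critical value of 1.96.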

There are many uses for the t-distribution, some of which will be covered in future articles. However, the most commonly applied methods are: The One Sample T-Test, The Two Sample T-Test, and The Paired T-Test.

One Sample T-Test

This test is utilized to compare a sample mean to a specific value. It is used when the dependent variable is measured at the interval or ratio level.

To apply this test, you will need to define the following variables:

The size of the sample (N).
The applicable confidence interval, and the corresponding alpha.
The standard deviation of the sample (s).
The mean of the sample (M).
The mean of the population (mu).

To demonstrate the application of this method, I will provide two examples.

Example 1:

A factory employee believes that the cakes produced within his factory are being manufactured with excess amounts of corn syrup, thus altering the taste. 10 cakes were sampled from the most recent batch and tested for corn syrup composition. Typically, each cake should be composed of 20% corn syrup. Utilizing a 95% confidence level, can we assume that the new batch of cakes contains more than a 20% proportion of corn syrup?

The levels of the samples were:

.27, .31, .27, .34, .40, .29, .37, .14, .30, .20

Our hypothesis test will be:

H0: u = .2
HA: u > .2

With our hypothesis created, we can see that mu = .2, and that this test will be right-tailed.

We do not need to derive the additional variables ourselves, as R will perform that task for us. However, we must first input the sample data into R:

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

Now R will automate the rest of the equations.

t.test(N, alternative = "greater", mu = .2, conf.level = 0.95)

# "alternative =" specifies the type of test that R will perform. "greater" indicates a right-tailed test, "less" indicates a left-tailed test, and "two.sided" indicates a two-tailed test. #

Which produces the output:

One Sample t-test

data: N
t = 3.6713, df = 9, p-value = 0.002572
alternative hypothesis: true mean is greater than 0.2
95 percent confidence interval:
0.244562 Inf
sample estimates:
mean of x
0.289


t = The t-test statistic. Derived by hand, it is the sample mean (M) minus the population mean (mu), divided by the quantity of the sample standard deviation (s) divided by the square root of the sample size (N): t = (M - mu) / (s / sqrt(N)).

df = Degrees of freedom. This value is utilized in conjunction with the confidence level to define critical-t. It is derived by subtracting 1 from the value of N.

p-value = The probability of obtaining a t-statistic at least as extreme as the one observed, assuming that the null hypothesis is true.

##% confidence interval = The range of values which, with ##% confidence, contains the true population mean. For a one-tailed test such as this one, one bound of the interval is infinite.

sample estimates = The mean of the sample(s) utilized within the function.
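As a check, the t-statistic and degrees of freedom described above can be reproduced by hand in R, using the same sample vector N from earlier:

```r
# Cake sample from Example 1
N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)
mu <- .2                   # hypothesized population mean

# t = (M - mu) / (s / sqrt(N))
M <- mean(N)               # sample mean: 0.289
s <- sd(N)                 # sample standard deviation
t_stat <- (M - mu) / (s / sqrt(length(N)))
df <- length(N) - 1        # degrees of freedom: 9

round(t_stat, 4)           # roughly 3.6713, matching the t.test() output
```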

With this output we can conclude:

With a p-value of .002572 (.002572 < .05), and a corresponding t-value of 3.6713, we can conclude, at a 95% confidence level, that the cakes being produced contain an excess amount of corn syrup.

Example 2:

A new type of gasoline has been synthesized by a company that specializes in green energy solutions. The same fuel type, when synthesized traditionally, typically provides coupe class vehicles with a mean of 25.6 mpg. Company statisticians were provided with a test sample of 8 cars which utilized the new fuel type. This sample of vehicles utilized the fuel at a mean rate of 23.2 mpg, with a sample standard deviation of 5 mpg. Can it be assumed, at a 95% confidence level, that vehicle performance significantly differs when the new fuel type is utilized?

Our hypothesis will be:

H0: u = 25.6
HA: u ≠ 25.6

With our hypothesis created, we can see that mu = 25.6, and that this test will be two-tailed.
In this case we will need to create a random sample within R that adheres to the following parameters:

N = 8 Sample Size
S = 5 Sample Standard Deviation
M = 23.2 Sample Mean

This can be achieved with the following code block:

sampsize <- 8       # sample size (N)
standarddev <- 5    # sample standard deviation (s)
meanval <- 23.2     # sample mean (M)

# Draw a random normal sample, then rescale it so that it has
# exactly the desired mean and standard deviation
N <- rnorm(sampsize)
N <- standarddev*(N-mean(N))/sd(N)+meanval


With our “N” variable defined, we can now let R do the rest of the work:

t.test(N, alternative = "two.sided", mu = 25.6, conf.level = 0.95)

Which produces the output:

One Sample t-test

data: N
t = -1.3576, df = 7, p-value = 0.2167
alternative hypothesis: true mean is not equal to 25.6
95 percent confidence interval:
19.0199 27.3801
sample estimates:
mean of x
23.2


With this output we can conclude:

With a p-value of 0.2167 (.2167 > .05), and a corresponding t-value of -1.3576, we cannot conclude, at a 95% confidence level, that vehicle performance differs depending on the fuel type utilized.
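Because the sample was rescaled to have exactly the stated mean and standard deviation, the reported t-statistic can also be verified directly from the summary values alone:

```r
M  <- 23.2   # sample mean
mu <- 25.6   # hypothesized population mean
s  <- 5      # sample standard deviation
n  <- 8      # sample size

# t = (M - mu) / (s / sqrt(n))
t_stat <- (M - mu) / (s / sqrt(n))
round(t_stat, 4)   # -1.3576, matching the t.test() output
```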

Two Sample T-Test

This test is utilized if you randomly sample different sets of items from two separate, independent groups.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures the temperature of samples to which the chemical was not applied:

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, at a 95% confidence level, that his chemical is in some way altering the temperature of the water?

For this, we will use the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)


t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)

Which produces the output:

Two Sample t-test

data:  N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3007929 4.4492071
sample estimates:
mean of x mean of y 
   75.250    72.875 

# Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #

# An additional option is available when running a two sample t-test: The Welch Two Sample T-Test. To utilize this option while performing a t-test, "var.equal = TRUE" must be changed to "var.equal = FALSE" (which is in fact R's default). The Welch test does not assume that the two groups share equal variances; it adjusts the degrees of freedom accordingly, making it more robust when the group variances differ. #
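As a sketch, running the Welch variant on the same water-temperature data only requires changing the var.equal flag:

```r
N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)
N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

# Welch Two Sample T-Test: var.equal = FALSE (also R's default)
welch <- t.test(N2, N1, alternative = "two.sided", var.equal = FALSE,
                conf.level = 0.95)

# With equal sample sizes the t-statistic matches the pooled test,
# but the degrees of freedom are adjusted downward (here, below 14)
unname(welch$statistic)   # roughly 2.4558, same as before
unname(welch$parameter)   # fractional df, somewhat less than 14
```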

From this output we can conclude:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.

Paired T-Test

This test is utilized if you are sampling the same set twice, once under each condition.

Example:

A watch manufacturer believes that by changing to a new battery supplier, the watches that are shipped with an initial battery will maintain a longer lifespan. To test this theory, twelve watches are tested for duration of lifespan with the original battery.

The same twelve watches are then re-tested for duration with the new battery.

Can the watch manufacturer conclude that the new battery increases the duration of lifespan for the manufactured watches? (We will assume an alpha value of .05).

For this, we will utilize the code:

N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)

N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

t.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )


This produces the output:

Paired t-test

data: N2 and N1
t = 2.4581, df = 11, p-value = 0.01589
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
12.32551 Inf
sample estimates:
mean of the differences
45.75


From this output we can state:

With a p-value of 0.01589 (0.01589 < .05), and a corresponding t-value of 2.4581, we can conclude, at a 95% confidence level, that the new battery increases the duration of lifespan for the manufactured watches.
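It is worth noting that a paired t-test is equivalent to a one-sample t-test performed on the per-watch differences. This can be confirmed with the same data:

```r
N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)
N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

paired   <- t.test(N2, N1, alternative = "greater", paired = TRUE,
                   conf.level = 0.95)
on_diffs <- t.test(N2 - N1, alternative = "greater", mu = 0,
                   conf.level = 0.95)

# Both approaches yield identical t, df, and p-value
unname(paired$statistic)     # roughly 2.4581
unname(on_diffs$statistic)   # roughly 2.4581
```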

That's all for now. Stay tuned, data heads!

* https://en.wikipedia.org/wiki/Student%27s_t-test

** For more information as to why this is the case: 

https://www.researchgate.net/post/Can_I_use_a_t-test_that_assumes_that_my_data_fit_a_normal_distribution_in_this_case_Or_should_I_use_a_non-parametric_test_Mann_Whitney2
