Reflections of a Data Scientist: October 2017

Sunday, October 15, 2017

(R) Differences of Two Proportions

Below are two exercises which illustrate statistical concepts.

Confidence interval estimate for the difference of two proportions:

# Disable Scientific Notation in R Output #

options(scipen = 999)

Suppose that 85% of a sample of 100 factory workers employed on the day shift express positive job satisfaction, while 70% of a sample of 80 factory workers employed on the night shift share that sentiment. Establish a 90% confidence interval estimate for the difference.

# n1 = 100 #
# n2 = 80 #
# p1 = .85 #
# p2 = .70 #

sqrt((.85*.15/100) + (.70*.30/80))

[1] 0.06244998

margin <- qnorm(.05, lower.tail = FALSE) * 0.06244998 # Critical z-value times the standard error #

.85 - .70

[1] 0.15

.15 + c(-margin,margin)

[1] 0.04727892 0.25272108

We can be 90% confident that the proportion of satisfied workers on the day shift is between .05 (5%) and .25 (25%) higher than the proportion on the night shift.
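The interval calculation above can be wrapped in a small helper. This is only a sketch: the function name `two_prop_ci` and its argument names are hypothetical, and it assumes the same unpooled standard error used in the worked example.

```r
# Hypothetical helper: confidence interval for p1 - p2 (unpooled standard error)
two_prop_ci <- function(p1, n1, p2, n2, conf_level = 0.90) {
  # Standard error of the difference between two sample proportions
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  # Two-sided critical value for the requested confidence level
  z_crit <- qnorm((1 - conf_level) / 2, lower.tail = FALSE)
  diff <- p1 - p2
  c(lower = diff - z_crit * se, upper = diff + z_crit * se)
}

# Day shift vs. night shift example from above
ci <- two_prop_ci(0.85, 100, 0.70, 80, conf_level = 0.90)
ci
```

Changing `conf_level` to .95 or .99 widens the interval, since the critical value grows while the standard error stays the same.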

Hypothesis test for the difference of two proportions:

A pollster took a survey of 1300 individuals, the results of which indicated that 600 were in favor of candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor of candidate A. At a 10% significance level, is there evidence that the candidate's popularity has decreased?

# H0: p1 - p2 = 0 #
# Ha: p1 - p2 > 0 #
# Alpha = .10 #
# p1 = 600 / 1300 = 0.4615385 #
# p2 = 500 / 1500 = 0.3333333 #
# p = (600 + 500) / (1300 + 1500) = 0.3928571 #
# SigmaD = sqrt(0.3928571 * (1 - 0.3928571) * (1/1300 + 1/1500)) = 0.01850651 #

0.4615385 - 0.3333333

[1] 0.1282052

(0.1282052 - 0) / 0.01850651

[1] 6.927573

pnorm(6.927573, lower.tail = FALSE)

[1] 0.000000000002140608

Since 0.000000000002140608 < .10 (Alpha), we reject the null hypothesis and conclude that candidate A's popularity has decreased.
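As a cross-check, base R's prop.test() can run the same comparison. The sketch below assumes the continuity correction is switched off (correct = FALSE), in which case the reported chi-squared statistic should equal the square of the z-score computed by hand. The variable names are only illustrative.

```r
# Cross-check of the pollster example with base R's prop.test()
# correct = FALSE disables the continuity correction so the result
# lines up with the hand-computed pooled z-test
check <- prop.test(x = c(600, 500), n = c(1300, 1500),
                   alternative = "greater", correct = FALSE)

# The chi-squared statistic is z squared; the square root recovers |z|
z_from_chisq <- sqrt(unname(check$statistic))
z_from_chisq

# One-sided p-value for Ha: p1 - p2 > 0
check$p.value
```

Note that taking the square root discards the sign of z, so the direction of the difference still has to be read from the sample proportions.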

Hypothesis test for the difference of two proportions (automation):

Additionally, I have created a code block below which automates the hypothesis test for the difference of two proportions. To utilize this code, input the number of affirmative responders from each survey as the first two variables, and input the total number of individuals surveyed as the subsequent two variables. The variable "finalres" will hold the one-tailed p-value. Multiply this value by 2 if the test is two-tailed.

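# Example values taken from the pollster problem above; replace them with your own #

survey1 <- 600 # Survey 1 Affirmative Responders #
survey2 <- 500 # Survey 2 Affirmative Responders #

total1 <- 1300 # Total Survey 1 Participants #
total2 <- 1500 # Total Survey 2 Participants #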

p1 <- survey1/total1
p2 <- survey2/total2
p3 <- (survey1 + survey2) / (total1 + total2)
sigmad <- sqrt(p3 * (1 - p3) * (1/total1 + 1/total2))

vara <- p1 - p2
varb <- vara / sigmad

res1 <- pnorm(varb, lower.tail = FALSE)
res2 <- pnorm(varb, lower.tail = TRUE)
finalres <- ifelse(res1 > res2, res2, res1)

finalres # Multiply by 2 if utilizing a two tailed test #
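The same steps can also be written as a function in which the direction of the test is stated explicitly rather than inferred from the smaller tail. This is a sketch under that assumption; `two_prop_z_test` and its arguments are hypothetical names, not part of any package.

```r
# Hypothetical helper: pooled two-proportion z-test with an explicit alternative
two_prop_z_test <- function(x1, n1, x2, n2,
                            alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  # Pooled proportion under H0: p1 - p2 = 0
  p_pooled <- (x1 + x2) / (n1 + n2)
  se <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
  z <- (x1 / n1 - x2 / n2) / se
  p_value <- switch(alternative,
    greater   = pnorm(z, lower.tail = FALSE),
    less      = pnorm(z, lower.tail = TRUE),
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)
  )
  list(z = z, p_value = p_value)
}

# Pollster example: Ha is p1 - p2 > 0
result <- two_prop_z_test(600, 1300, 500, 1500, alternative = "greater")
result$z
result$p_value
```

Stating the alternative up front guards against a subtle issue with picking the smaller tail automatically: if the data point in the opposite direction from Ha, the smaller tail is not the correct one-tailed p-value.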

(R) Sample Size and Margin of Error

If you are in the business of creating surveys, there are two important statistical concepts that you should be somewhat familiar with.

The first concept is known as margin of error. Margin of error can be expressed as a value or as a percentage. It describes the amount of variation that is expected around a statistical estimate. For example, if a survey was taken of 100 individuals, and 60 of those individuals answered positively to a survey prompt, with a margin of error value of 5, the true value being inferred from the survey could differ from the estimate by up to 5 in either direction. This would mean that the true value could range anywhere from 55 to 65.

Let's explore this concept in the following examples:

A pollster wants to infer, from surveying a population, the likelihood of a certain candidate winning a local election. The voting public consists of 1000 individuals, and the pollster surveys 500 members of the voting body. From his survey, he has determined that 60% of those individuals are voting for candidate A. Assuming a confidence level of .95, and a population percentage of 50%, what will the margin of error be for this survey?

####################################################

popsize <- 1000 # N #
samplesize <- 500 # n #
confidencelevel <- .95
populationpercentage <- .50 # p #

####################################################

con2 <- (1 - confidencelevel) / 2

MOE <- qnorm(con2, lower.tail = FALSE) * sqrt(populationpercentage * (1 - populationpercentage)) / sqrt((popsize - 1) * samplesize/(popsize - samplesize))

MOE

[1] 0.03100526

The margin of error is 3.10%.

Therefore, the pollster can conclude, with 95% confidence, that 60% of the population will vote for candidate A, with a margin of error of 3.10%. (The true value could fluctuate between 56.90% and 63.10%.)
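The margin of error formula can be split into its two parts: the usual standard error of a proportion and a finite population correction. The sketch below is algebraically the same calculation as the one above; `margin_of_error` and its argument names are hypothetical.

```r
# Hypothetical helper: margin of error for a proportion with a
# finite population correction (fpc)
margin_of_error <- function(n, N, p = 0.5, conf_level = 0.95) {
  z_crit <- qnorm((1 - conf_level) / 2, lower.tail = FALSE)
  se <- sqrt(p * (1 - p) / n)      # standard error for sample size n
  fpc <- sqrt((N - n) / (N - 1))   # shrinks the error as n approaches N
  z_crit * se * fpc
}

# Election example: 500 surveyed out of a population of 1000
moe <- margin_of_error(n = 500, N = 1000)
moe

# Interval around the observed 60% support
0.60 + c(-moe, moe)
```

Writing the correction as a separate factor makes it easier to see why surveying half of a small population gives a tighter margin than the same sample drawn from a very large one.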

The next concept that we will review is sample size. This is essentially how many people need to take your survey for its results to achieve a chosen margin of error at a chosen confidence level. In this case, we will be solving for that value, and we will be assuming a population percentage of 50%.

The same pollster wants to survey another group of individuals to determine what type of milk they would most prefer to drink. The pollster already knows his population size (1200), and has decided on a confidence level (95%) and a margin of error (5%). To meet these predetermined criteria, how many individuals should the pollster survey?

##################################

popsize <- 1200 # N #
confidencelevel <- .95
marginoferror <- .05 # MOE #
populationpercentage <- .5 # p #

##################################

con2 <- (1 - confidencelevel) / 2

a <- qnorm(con2, lower.tail = FALSE)^2 * populationpercentage * (1 - populationpercentage) / (marginoferror)^2

b <- 1 + qnorm(con2, lower.tail = FALSE)^2 * populationpercentage * (1 - populationpercentage) / ((marginoferror)^2 * popsize)

a/b

[1] 290.9928

The pollster should survey 291 individuals.
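The sample size calculation can be packaged the same way. This is a sketch that assumes the result should always be rounded up, since a fractional respondent is not possible; `required_sample_size` is a hypothetical name.

```r
# Hypothetical helper: required sample size for a proportion,
# adjusted for a finite population of size N
required_sample_size <- function(N, moe, p = 0.5, conf_level = 0.95) {
  z_crit <- qnorm((1 - conf_level) / 2, lower.tail = FALSE)
  # Sample size that would be needed for an effectively infinite population
  n0 <- z_crit^2 * p * (1 - p) / moe^2
  # Finite population adjustment, then round up to a whole respondent
  ceiling(n0 / (1 + n0 / N))
}

# Milk preference example
n_needed <- required_sample_size(N = 1200, moe = 0.05)
n_needed
```

Using p = .5 is the conservative choice, because p * (1 - p) is largest at .5 and therefore produces the largest required sample.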

Below are some links that provide simplified online calculators, just in case you are not using R. Also provided, within those links, are more detailed definitions of the concepts illustrated above.

https://www.surveysystem.com/sscalc.htm

https://www.surveymonkey.com/mp/margin-of-error-calculator/

https://www.surveymonkey.com/mp/sample-size-calculator/

Thursday, October 5, 2017

(R) Hypothesis Tests of Proportions

In the past two articles, we discussed various topics related to proportions. In this article, we will continue to explore the various aspects pertaining to the subject of proportions. Specifically, this entry will discuss hypothesis tests as they are applicable to proportion data.

In testing data, we are faced with choosing between two separate hypotheses.

Those hypotheses are:

The Null Hypothesis - A statement that is assumed to be true. This is the assertion that you will be seeking to disprove through the application of statistical methods.

The Alternative Hypothesis - This statement is the antithesis of the Null Hypothesis.

A rough example of stated hypotheses might resemble something like:

The Null Hypothesis - It is raining outside.

The Alternative Hypothesis - It is not raining outside.

This brings us to error types, which arise only in the context of hypothesis tests. There are two types of hypothesis errors.

The error types are:

Type I - Mistakenly rejecting a true Null Hypothesis.

Type II - Mistakenly failing to reject a false Null Hypothesis.

So let's try applying this knowledge to a few example problems.

###################

# A local university claims that only 20% of its incoming freshman class attends new student orientation. A statistician who is employed by the university believes that the real percentage is higher. He plans to ask 100 new students if they plan on attending the orientation. He will inform the admissions department if 30 or more students plan on attending. What is the probability of committing a Type I Error? #

H0:p = .20 (Null Hypothesis)
Ha:p > .20 (Alternative Hypothesis)

sqrt(.2 * .8/ 100)

[1] 0.04

(.3 - .2)/0.04

[1] 2.5

pnorm(2.5,lower.tail=FALSE)

[1] 0.006209665

Probability of Type I Error: 0.62 percent.
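The figure above relies on the normal approximation. Since the decision rule is stated as a count (30 or more out of 100), the same probability can also be computed directly from the binomial distribution. The sketch below shows both side by side; the variable names are illustrative only, and the two results need not agree exactly.

```r
# Orientation example: reject H0 if 30 or more of 100 students say yes
n <- 100
p0 <- 0.20
cutoff <- 30

# Normal approximation, as in the worked example
approx_alpha <- pnorm((cutoff / n - p0) / sqrt(p0 * (1 - p0) / n),
                      lower.tail = FALSE)

# Exact binomial probability: P(X >= 30) = P(X > 29) when p = 0.20
exact_alpha <- pbinom(cutoff - 1, size = n, prob = p0, lower.tail = FALSE)

approx_alpha
exact_alpha
```

With a sample of only 100 and a proportion well away from .5, the binomial distribution is noticeably skewed, so the exact value is a useful sanity check on the approximation.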

###################

# A radio manufacturer claims that 85% of the radios assembled from the latest batch are defective. A quality assurance representative believes that the number is lower and wishes to test at a 5% significance level. What is the conclusion if 90 of 125 radios are defective? #

H0:p = .85 (Null Hypothesis)
Ha:p < .85 (Alternative Hypothesis)

sqrt(.15 * .85/ 125)

[1] 0.03193744

# p = #
90/125

[1] 0.72

(.72 - .85)/0.03193744

[1] -4.070458

# Ha is p < .85, so the p-value is the lower tail #
pnorm(-4.070458,lower.tail=TRUE)

The resulting p-value is approximately 0.00002.

Since this p-value is far below .05, there is sufficient evidence to reject the null hypothesis at the 5% significance level. The data support the quality assurance representative's belief that fewer than 85% of the radios are defective.
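For problems of this kind, a small reusable function helps keep two details straight: the standard error should use the sample size stated in the problem (125 radios here), and the tail of pnorm() should match the direction of Ha. The sketch below assumes exactly that; `one_prop_z_test` is a hypothetical name.

```r
# Hypothetical helper: one-proportion z-test
one_prop_z_test <- function(x, n, p0,
                            alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  p_hat <- x / n
  # Standard error under H0, using the actual sample size
  se <- sqrt(p0 * (1 - p0) / n)
  z <- (p_hat - p0) / se
  p_value <- switch(alternative,
    greater   = pnorm(z, lower.tail = FALSE),
    less      = pnorm(z, lower.tail = TRUE),
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)
  )
  list(p_hat = p_hat, z = z, p_value = p_value)
}

# Radio example: 90 defective out of 125, H0: p = .85, Ha: p < .85
radio <- one_prop_z_test(x = 90, n = 125, p0 = 0.85, alternative = "less")
radio$z
radio$p_value
```

A quick way to spot a mismatched tail is to compare the sign of z with the direction of Ha: a strongly negative z with a "less than" alternative should produce a small p-value, not one close to 1.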

###################

# A research group conducting a nutritional survey decides to interview 500 households to test the hypothesis that 30% of the households eat dinner after 8:00 PM. Should the hypothesis be rejected at the 5% significance level if 200 families respond that they do indeed dine after 8:00 PM? #

H0:p = .30 (Null Hypothesis)
Ha:p != .30 (Alternative Hypothesis)
Alpha = .05

sqrt(.30 * .70/ 500)

[1] 0.0204939

#p = #
200/500

[1] 0.4

(.4 - .3)/0.0204939

[1] 4.879501

pnorm(4.879501, lower.tail = FALSE) * 2

The resulting p-value is approximately 0.000001.

Since this p-value is below .05, there is sufficient evidence to reject the null hypothesis.
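The two-tailed test can also be checked against base R's prop.test(). The sketch below assumes the survey size given in the problem (500 households) and turns off the continuity correction so that the chi-squared statistic corresponds to the square of the z-score. The variable names are illustrative.

```r
# Dinner survey: 200 of 500 households, H0: p = .30, two-sided alternative
dinner <- prop.test(x = 200, n = 500, p = 0.30,
                    alternative = "two.sided", correct = FALSE)

# The chi-squared statistic is z squared; the square root recovers |z|
z_dinner <- sqrt(unname(dinner$statistic))
z_dinner

# Two-sided p-value
dinner$p.value
```

For a two-sided test the sign of z does not matter, which is why the chi-squared form and the doubled tail probability lead to the same p-value.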

###################

# A local university has enrolled 10% of all eligible students from an adjacent neighborhood. The office of administration plans a survey of 2000 houses. If fewer than 7% indicate that they are interested in potential enrollment, it will be concluded that the market share has dropped. What is the probability of a Type I Error? #

H0:p = .10 (Null Hypothesis)
Ha:p < .10 (Alternative Hypothesis)

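sqrt(.10 * .90/ 2000)

[1] 0.006708204

(.07 - .10)/0.006708204

[1] -4.472136

pnorm(-4.472136, lower.tail = TRUE)

The resulting probability is approximately 0.000004.

Probability of Type I Error: approximately 0.0004 percent.

# What is the probability of a Type II Error if university enrollment is 8%? Meaning, what is the probability that we will fail to reject the 10% null hypothesis? #

# Standard error under the assumed true proportion of .08 #
sqrt(.08 * .92/ 2000)

[1] 0.0060663

(.07 - .08)/0.0060663

[1] -1.648451

pnorm(-1.648451, lower.tail = FALSE)

The resulting probability is approximately 0.95.

Probability of Type II Error: approximately 95 percent.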
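Both error probabilities for a cutoff rule of this kind can be computed together. The sketch below makes two assumptions that should be checked against your own textbook's convention: the standard error uses the survey size stated in the problem (2000), and the Type II calculation uses the standard error under the assumed true proportion rather than under the null. `rule_error_rates` is a hypothetical name.

```r
# Hypothetical helper: error rates for the rule "reject H0 if p_hat < cutoff"
rule_error_rates <- function(n, p0, cutoff, p_true) {
  se_null <- sqrt(p0 * (1 - p0) / n)          # spread of p_hat if H0 is true
  se_true <- sqrt(p_true * (1 - p_true) / n)  # spread of p_hat if p = p_true
  # Type I: p_hat falls below the cutoff even though H0 is true
  type1 <- pnorm((cutoff - p0) / se_null, lower.tail = TRUE)
  # Type II: p_hat stays at or above the cutoff even though p = p_true
  type2 <- pnorm((cutoff - p_true) / se_true, lower.tail = FALSE)
  c(type1 = type1, type2 = type2)
}

# Enrollment example: 2000 houses, H0: p = .10, cutoff .07, true p = .08
rates <- rule_error_rates(n = 2000, p0 = 0.10, cutoff = 0.07, p_true = 0.08)
rates
```

The two rates pull in opposite directions: moving the cutoff closer to .10 would raise the Type I probability while lowering the Type II probability.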

In the next article, we will continue to investigate the hypothesis testing procedure. Stay tuned, Data Heads!