Reflections of a Data Scientist: (R) Differences of Two Proportions

Below are two exercises which illustrate statistical concepts.

Confidence interval estimate for the difference of two proportions:

# Disable Scientific Notation in R Output #

options(scipen = 999)

Suppose that 85% of a sample of 100 factory workers employed on the day shift express positive job satisfaction, while 70% of a sample of 80 factory workers employed on the night shift share that sentiment. Construct a 90% confidence interval estimate for the difference.

# n1 = 100 #
# n2 = 80 #
# p1 = .85 #
# p2 = .70 #

# Standard error of the difference between the two sample proportions #

sqrt((.85*.15/100) + (.70*.30/80))

[1] 0.06244998

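# Margin of error: the critical z value for 90% confidence (5% in each tail) times the standard error #

z <- qnorm(.05, lower.tail = FALSE) * 0.06244998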

.85 - .70

[1] 0.15

.15 + c(-z,z)

[1] 0.04727892 0.25272108

We can be 90% confident that the proportion of day shift workers expressing positive job satisfaction is between .05 (5 percentage points) and .25 (25 percentage points) higher than the proportion of night shift workers.
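The interval calculation above can be wrapped in a small function for reuse. This is only a sketch: the function name diff_prop_ci and its argument names are hypothetical and do not appear in the original exercise. It assumes the samples are large enough for the normal approximation and uses the same unpooled standard error as the steps above.

```r
# Hypothetical helper (not part of the original exercise): confidence interval
# for p1 - p2 based on the normal approximation and the unpooled standard error.
diff_prop_ci <- function(p1, n1, p2, n2, conf_level = 0.90) {
  # Standard error of the difference between the two sample proportions
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  # Critical z value that leaves (1 - conf_level) / 2 in each tail
  z_crit <- qnorm((1 - conf_level) / 2, lower.tail = FALSE)
  # Lower and upper bounds of the interval
  (p1 - p2) + c(-1, 1) * z_crit * se
}

# Day shift versus night shift figures from the exercise above
ci <- diff_prop_ci(p1 = .85, n1 = 100, p2 = .70, n2 = 80)
ci
```

Changing conf_level to .95 or .99 should widen the interval accordingly, since only the critical z value changes.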

Hypothesis test for the difference of two proportions:

A pollster surveyed 1300 individuals, and the results indicated that 600 were in favor of candidate A. A second survey, taken weeks later, showed that 500 out of 1500 voters were now in favor of candidate A. At a 10% significance level, is there evidence that the candidate's popularity has decreased?

# H0: p1 - p2 = 0 #
# Ha: p1 - p2 > 0 #
# Alpha = .10 #
# p1 = 600 / 1300 = 0.4615385 #
# p2 = 500 / 1500 = 0.3333333 #
# p = (600 + 500) / (1300 + 1500) = 0.3928571 #
# SigmaD = sqrt(0.3928571 * (1 - 0.3928571) * (1/1300 + 1/1500)) = 0.01850651 #

0.4615385 - 0.3333333

[1] 0.1282052

(0.1282052 - 0) / 0.01850651

[1] 6.927573

pnorm(6.927573, lower.tail = FALSE)

[1] 0.000000000002140608

Because 0.000000000002140608 < .10 (Alpha), we reject H0 and conclude that candidate A's popularity has decreased.
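As a cross-check, base R's prop.test() function should reproduce this result. The sketch below assumes that the continuity correction is switched off (correct = FALSE) so that the output lines up with the hand calculation; prop.test() reports a chi-squared statistic, which for two groups is the square of the z statistic.

```r
# Cross-check of the hand calculation with prop.test() from the stats package.
# correct = FALSE disables the continuity correction; alternative = "greater"
# corresponds to Ha: p1 - p2 > 0.
check <- prop.test(x = c(600, 500), n = c(1300, 1500),
                   alternative = "greater", correct = FALSE)

# The chi-squared statistic is z squared, so its square root recovers z
sqrt(unname(check$statistic))

# One-tailed p-value
check$p.value
```

Leaving correct at its default of TRUE applies the continuity correction, so the statistic and p-value would differ slightly from the values above.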

Hypothesis test for the difference of two proportions (automation):

Additionally, I have created a code block below which automates the hypothesis test for the difference of two proportions. To utilize this code, simply input the number of affirmative responders from each survey as the first two variables, and input the total number of individuals surveyed as the subsequent two variables. The variable "finalres" holds the one-tailed p-value. Multiply this value by 2 if the test is two-tailed.

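# Example values taken from the pollster exercise above; replace them with your own data #

survey1 <- 600 # Affirmative responders in survey 1 #
survey2 <- 500 # Affirmative responders in survey 2 #

total1 <- 1300 # Total survey 1 participants #
total2 <- 1500 # Total survey 2 participants #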

p1 <- survey1/total1
p2 <- survey2/total2
p3 <- (survey1 + survey2) / (total1 + total2)
sigmad <- sqrt(p3 * (1 - p3) * (1/total1 + 1/total2))

vara <- p1 - p2
varb <- vara / sigmad

res1 <- pnorm(varb, lower.tail = FALSE)
res2 <- pnorm(varb, lower.tail = TRUE)
# Keep the smaller of the two tail probabilities as the one-tailed p-value #
finalres <- min(res1, res2)

finalres # Multiply by 2 if utilizing a two tailed test #
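The same steps can also be packaged as a function that takes the direction of the alternative hypothesis as an argument. This is a sketch rather than a definitive implementation: the name two_prop_z_test and its arguments are hypothetical. Unlike the script above, which always keeps the smaller tail, this version returns the tail that matches the stated alternative, so a result in the wrong direction yields a large p-value.

```r
# Hypothetical wrapper (names are illustrative, not from the original script):
# pooled two-proportion z test with a selectable alternative hypothesis.
two_prop_z_test <- function(x1, n1, x2, n2,
                            alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  p1 <- x1 / n1
  p2 <- x2 / n2
  # Pooled proportion under H0: p1 - p2 = 0
  p_pooled <- (x1 + x2) / (n1 + n2)
  # Standard error of the difference under H0
  se <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  p_value <- switch(alternative,
    greater   = pnorm(z, lower.tail = FALSE),
    less      = pnorm(z, lower.tail = TRUE),
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)
  )
  list(z = z, p_value = p_value)
}

# Pollster exercise: Ha is p1 - p2 > 0
res <- two_prop_z_test(600, 1300, 500, 1500, alternative = "greater")
res$z
res$p_value
```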

Sunday, October 15, 2017
