Reflections of a Data Scientist: (R) Distributions of Sample Proportions

Let's suppose that we have been presented with sample data from a larger collection of population data; or, let's suppose that we have been presented with incomplete information pertaining to a larger population.

If we were tasked with reaching conclusions based on such data, how would we structure our models? This article sets out to answer that question. To begin, we will work through an example problem.

Example 1:

The military has instituted a new training regime to screen candidates for a newly formed battalion. Because the unit is highly specialized, candidates are vetted through extremely rigorous physical routines. So far, only 60% of the candidates who have attempted the regime have passed. If 100 new candidates volunteer for the unit, what is the probability that more than 70% of them will pass the physical?

# Disable Scientific Notation in R Output #

options(scipen = 999)

# Find the Standard Deviation of the Sample Proportion #

Standard Deviation = Square Root of: (p)(1 - p) / n, where p = .6 (the population proportion) and n = 100 (the sample size)

sqrt(.4 * .6/ 100)

[1] 0.04898979
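The same calculation can be written with named values, which makes it easier to reuse for other problems. This is a minimal sketch: the variable names p, n, and se are illustrative choices, and the threshold of 10 is the common rule of thumb for when the normal approximation to a sample proportion is reasonable, not something stated in the problem.

```r
# Population proportion and sample size from Example 1
p <- 0.6   # historical pass rate
n <- 100   # number of new candidates

# Rule-of-thumb check for the normal approximation (assumed threshold: 10)
n * p >= 10        # expected number of passes
n * (1 - p) >= 10  # expected number of failures

# Standard deviation (standard error) of the sample proportion
se <- sqrt(p * (1 - p) / n)
se
```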

# Find the Z-Score #

(.7 - .6)/0.04898979

[1] 2.041242

Area between the mean and a Z-Score of 2.04 = .4793
(Check Z-Table)

Finally, find the probability that the sample proportion exceeds 70% by subtracting this area from .50, the total area above the mean
(One-tailed test)

.50 - .4793

[1] 0.0207
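The Z-table lookup can also be reproduced in R rather than read from a printed table. The sketch below assumes the table in use reports the area between the mean and z, which is what the .4793 figure corresponds to. pnorm is base R's normal cumulative distribution function; the variable names are illustrative.

```r
# Z-score for a sample proportion of .7 when p = .6 and n = 100
z <- (0.7 - 0.6) / sqrt(0.6 * 0.4 / 100)

# Area between the mean and z (the value a Z-table of this type reports)
area_mean_to_z <- pnorm(z) - 0.5

# Upper-tail area: probability that the sample proportion exceeds .7
upper_tail <- 0.5 - area_mean_to_z

round(area_mean_to_z, 4)
round(upper_tail, 4)
```

Because a printed table rounds the Z-score to two decimal places, its value can differ from the R result in the fourth decimal place.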

In R, the following code can be used to expedite the process:

sqrt(.4 * .6/ 100)

[1] 0.04898979

pnorm(q=.7, mean=.6, sd=0.04898979 , lower.tail=FALSE)

[1] 0.02061341
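For repeated use, the two steps can be wrapped in a small helper. This is a sketch under the same normal-approximation assumption; the function name prob_prop_exceeds and its arguments are hypothetical and do not come from any package.

```r
# Hypothetical helper: P(sample proportion > threshold) under the
# normal approximation, given population proportion p and sample size n
prob_prop_exceeds <- function(threshold, p, n) {
  se <- sqrt(p * (1 - p) / n)  # standard deviation of the sample proportion
  pnorm(q = threshold, mean = p, sd = se, lower.tail = FALSE)
}

# Example 1: chance that more than 70% of 100 candidates pass
result <- prob_prop_exceeds(threshold = 0.7, p = 0.6, n = 100)
result
```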

So we can conclude that if 100 new candidates volunteer for the unit, there is only about a 2.06% chance that more than 70% of them will pass the physical. (The Z-table result of 2.07% differs slightly because the table rounds the Z-score to two decimal places.)
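As a sanity check on the normal approximation, the same question can be posed directly with the binomial distribution, since the number of passes among 100 candidates is a binomial count. This comparison is an addition to the walkthrough rather than part of it, and it assumes that candidates pass independently of one another. pbinom is base R's binomial cumulative distribution function; the variable names are illustrative.

```r
n <- 100
p <- 0.6

# Exact probability that more than 70 of the 100 candidates pass: P(X > 70)
exact <- pbinom(q = 70, size = n, prob = p, lower.tail = FALSE)

# Normal approximation used in the walkthrough above
normal_approx <- pnorm(q = 0.7, mean = p, sd = sqrt(p * (1 - p) / n),
                       lower.tail = FALSE)

c(exact = exact, normal_approx = normal_approx)
```

The two values will not match exactly, because the normal curve is a continuous approximation of a discrete count.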

The process really is that simple.

In the next article, we will review confidence interval estimates for proportions.

Wednesday, September 20, 2017