Saturday, September 30, 2017

INTCK (Duration Measurement) (SAS)

I recently had to revisit a SAS project that I had previously completed. This particular project required that I create an additional variable within an existing data set; this variable was to store the number of days which elapsed between two pre-existing date values. In running the code, it occurred to me that I had not previously discussed the function necessary to obtain such a result. Therefore, it will be the topic of today's article.

The SAS function INTCK serves as a way of determining the number of intervals of a selected duration (days, weeks, months, etc.) which have elapsed between two SAS date values.

The form of the function is as follows:

INTCK('<measured duration>', <DATEA>, <DATEB>);

For example, if you wanted to measure the days that occurred between variables DATEA and DATEB, the code would resemble:

INTCK('day', DATEA, DATEB);


You will need to store the output, so we will create a new variable, DATEC, to do so. The code would then resemble:

DATEC = INTCK('day', DATEA, DATEB);

This function can evaluate many different measurements of time; however, I will only mention those that are most commonly utilized.

DATEC = INTCK('day', DATEA, DATEB); /* Days */

DATEC = INTCK('week', DATEA, DATEB); /* Weeks */

DATEC = INTCK('month', DATEA, DATEB); /* Months */

DATEC = INTCK('year', DATEA, DATEB); /* Years */

DATEC = INTCK('qtr', DATEA, DATEB); /* Quarters */


/*****************************************************************************/

!IMPORTANT NOTE! - INTCK() only returns INTEGERS! Meaning, if a week and a half elapsed, and INTCK('week', x, y) was being utilized, then the result would be 1, as fractional durations are never returned. (By default, INTCK counts the number of interval boundaries crossed between the two dates, rather than the exact amount of time elapsed.)

In the next article, I will return to discussing the R coding language, specifically, hypothesis tests and error types.

Wednesday, September 20, 2017

(R) The Confidence Interval Estimate of Proportions

In this article, we will again be focusing on sample data. Specifically, today's exercises will demonstrate the process necessary to determine the interval within which we can say that a population proportion lies.

Example 1:

If 60% of a sample of 100 individuals leaving a diner claim to have spent over $12 for lunch, determine a 99% confidence interval estimate for the proportion of patrons who spent over $12.

sqrt(.4 * .6/ 100)  # standard error of the proportion: sqrt(p * (1 - p) / n) #

[1] 0.04898979

z <- qnorm(.005, lower.tail=FALSE)  * 0.04898979
# .005 = (1 - .99) / 2, as the interval is two-sided (two tails) #

.60 + c(-z, z) 

[1] 0.4738107 0.7261893

Conclusion: We are 99% certain that the proportion of diner patrons spending over $12 for lunch is between 0.4738107 (47.38%) and 0.7261893 (72.62%).
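
The steps above can also be wrapped into a small helper function, so that the interval may be re-created for any sample. The function below, prop_ci(), is a name of my own choosing (it is not a built-in R function), and it is only a minimal sketch assuming the same normal approximation utilized throughout this example:

# Hypothetical helper: normal-approximation confidence interval for a proportion #
prop_ci <- function(p, n, conf = .95) {
    se <- sqrt(p * (1 - p) / n)                      # standard error of the proportion #
    z <- qnorm((1 - conf) / 2, lower.tail = FALSE)   # two-sided critical value #
    p + c(-z, z) * se                                # lower and upper interval bounds #
}

prop_ci(.60, 100, conf = .99)

[1] 0.4738107 0.7261893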

Example 2:

In a random sample of light bulbs being produced by a factory, 20 out of 300 were found to be shattered during the shipping process. Establish a 95% confidence interval estimate for the proportion of bulbs damaged during the shipping process.

p <- 20/300
p

[1] 0.06666667

sqrt(.066 * .934 / 300)  # standard error: sqrt(p * (1 - p) / n) #

[1] 0.01433457

z <- qnorm(.025, lower.tail=FALSE)  * 0.01433457

0.06666667 + c(-z,z)

[1] 0.03857143 0.09476191

Therefore, we can be 95% certain that the proportion of light bulbs damaged during the shipping process is between 0.03857143 (3.86%) and 0.09476191 (9.48%).

Furthermore, if we wished, we could apply these ratios to a total shipment to create an estimation.

If 1,000 light bulbs shipped, we can be 95% confident that between approximately 38.6 and 94.8 light bulbs within the shipment are damaged.
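
Alternatively, R contains a built-in function, prop.test(), which will generate a confidence interval estimate for a proportion directly from the count and the sample size. Be aware that prop.test() utilizes the Wilson score method with a continuity correction, rather than the normal approximation demonstrated above, so its bounds will differ slightly from the values calculated by hand:

prop.test(20, 300, conf.level = .95)

The resulting interval is printed in the console output beneath the line which reads "95 percent confidence interval:".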

In the next article, we will be discussing hypothesis tests. Until then, stay tuned Data Monkeys!

(R) Distributions of Sample Proportions

Let's suppose that we have been presented with sample data from a larger collection of population data; or, let's suppose that we have been presented with incomplete information pertaining to a larger population.

If we were tasked to reach various conclusions based on such data, how would we structure our models? This article sets out to answer these questions. To begin this study, we will review a series of example problems.

Example 1:

The military has instituted a new training regime in order to screen candidates for a newly formed battalion. Due to the specialization of this unit, candidates are vetted through extremely rigorous physical routines. Presently, only 60% of candidates who have attempted the regime have successfully passed. If 100 new candidates volunteer for the unit, what is the probability that more than 70% of those candidates will pass the physical?

# Disable Scientific Notation in R Output #

options(scipen = 999)

# Find The Standard Deviation of The Sample Proportion #

Standard Deviation = Square Root of: p(1 - p) / n, where p = .60 and n = 100

sqrt(.4 * .6/ 100) 

[1] 0.04898979

# Find the Z-Score #

(.7 - .6)/0.04898979 

[1] 2.041242

Probability associated with a z-score of 2.041242 = .4793
(the area between the mean and the z-score, found by consulting a z-table)

Finally, subtract this area from .50 to find the probability that the sample proportion exceeds 70%
(One-tailed test)

.50 - .4793

[1] 0.0207

In R, the following code can be used to expedite the process:

sqrt(.4 * .6/ 100) 

[1] 0.04898979

pnorm(q=.7, mean=.6, sd=0.04898979 , lower.tail=FALSE) 

[1] 0.02061341

So, we can conclude, that if 100 new candidates volunteer for the unit, there is only a 2.06% chance that more than 70% of those candidates will pass the physical.

The process really is that simple.
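
As a quick verification, the same probability can also be generated from the z-score itself, without specifying the mean and standard deviation within the pnorm() function:

pnorm((.7 - .6) / 0.04898979, lower.tail=FALSE)

[1] 0.02061341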

In the next article, we will review the confidence interval estimate of proportions.

Monday, September 18, 2017

(R) Multiple Linear Regression - Pt. (II)

In the previous article, we discussed linear regression. In this article, we will discuss multiple linear regression. Multiple linear regression, from a conceptual standpoint, is exactly the same as linear regression. The only fundamental difference is that, through the utilization of a multiple linear regression model, multiple independent variables can be assessed.

In this example, we have three variables; each variable is comprised of prior observational data.

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)


To integrate these variable sets into a model, we will use the following code:

multiregress <- lm(y ~ x + z)

This code creates a new object ('multiregress'), which contains the regression model data. In this model, 'y' is the dependent variable, with 'x' and 'z' both represented as independent variables.

We will need to run the following summary function to receive output information pertinent to the model:

summary(multiregress)

The output produced within the console window is as follows:

Call:
lm(formula = y ~ x + z)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4016 -5.0054 -1.7536  0.8713 14.0886 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  47.1434    12.0381   3.916  0.00578 **
x             0.7808     0.3316   2.355  0.05073 . 
z             0.3990     0.4804   0.831  0.43363   
---

Residual standard error: 7.896 on 7 degrees of freedom
Multiple R-squared:  0.5249, Adjusted R-squared:  0.3891 
F-statistic: 3.866 on 2 and 7 DF,  p-value: 0.07394

In this particular model scenario, the model that we would use to determine the value of 'y' is:

y = 0.7808x + 0.3990z + 47.1434
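
If we would prefer not to work out the equation by hand, R's predict() function will generate the predicted value of 'y' directly from the model object. For example, using hypothetical input values of x = 30 and z = 20 (chosen purely for illustration):

predict(multiregress, newdata = data.frame(x = 30, z = 20))

# Roughly 78.5. predict() utilizes the un-rounded coefficients stored within the model, #
# so the result will differ slightly from a hand calculation with the rounded #
# coefficients shown above. #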

However, in investigating the results of the summary output, we observe that:

Multiple R-squared = 0.5249

This may be an acceptably large coefficient of determination, depending on what type of data we are observing...but the following values should raise some alarm:

p-value: 0.07394 (> Alpha of .05)

AND

F-statistic: 3.866 on 2 and 7 DF

Code: 

qf(.95, df1=2, df2=7) #Alpha .05#

[1] 4.737414

4.737414 > 3.866

If these concepts seem foreign, please refer to the previous article.

From the summary data, we can conclude that this model does not provide a statistically significant fit, and is too inaccurate to be properly accepted and utilized.

Therefore, I would recommend re-creating this model with new independent variables.

When creating multiple linear regression models, it is important to consider the values of the f-statistic and the coefficient of determination (multiple r-squared). If variables are being added, or exchanged for different variables within an existing regression model, ideally, the f-statistic and the coefficient of determination should rise in value. This increase indicates that the model is increasing in accuracy. A decline in either of these values would indicate otherwise. (Keep in mind that the multiple r-squared will never decrease when additional variables are added; the adjusted r-squared, which penalizes the model for each additional variable, is the more informative measure when comparing models of different sizes.)
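
Additionally, when one model is nested within another (as 'y ~ x' is nested within 'y ~ x + z'), R's anova() function can formally test whether the added variable provides a significant improvement in fit. Below is a minimal sketch, assuming the variable sets defined earlier in this article:

reduced <- lm(y ~ x)          # model containing only 'x' #
full <- lm(y ~ x + z)         # model containing both 'x' and 'z' #

anova(reduced, full)          # partial F-test comparing the two models #

A small p-value within this output would indicate that adding 'z' meaningfully improves the model; a large p-value would indicate that the simpler model is sufficient.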

Moving forward, new articles will cover less complicated fundamental aspects of statistics. If you understand this article, and all prior articles, the following topics of discussion should be mastered with relative ease. Stay tuned for more content, Data Heads!

Thursday, September 14, 2017

(R) Linear Regression - Pt. (I)

In this first entry of a two part article series, we will be discussing linear regression. In the next article, I will move on to the more advanced topic of multiple regression.

Regression analysis allows you to create a predictive model. This model can be used to ascertain future results based on the value of a known variable.

Let's begin with an example.

We have two sets of data:

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)


To determine if there is a relationship between the data points of each set, we must first decide which variable is affecting the other. Meaning, one variable's value will determine the value of the other variable. In the case of our example, we have determined that the value of 'x' is impacting the value of 'y'. Therefore, 'x' is our independent variable, and 'y' is our dependent variable, as y's value is dependent on the value of 'x'.

We can now create a linear model for this data. The dependent variable must be listed first in this function followed by the independent variable.

linregress <- lm(y ~ x)

'linregress' is the name of the object that we will use to store this model. To produce a model summary, which will contain the information necessary for analysis, we must utilize the following command:

summary(linregress)

This should output the following data to the console:

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max 
-9.563 -4.649 -1.361  1.457 13.139 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  52.6319     9.8653   5.335 0.000698 ***
x             0.8509     0.3144   2.707 0.026791 *  
---

Residual standard error: 7.741 on 8 degrees of freedom
Multiple R-squared:  0.478, Adjusted R-squared:  0.4128 
F-statistic: 7.327 on 1 and 8 DF,  p-value: 0.02679

Now to overview each aspect of the output.

Call: 

This section restates the formula that was "called" by the lm() function.

Residuals:

A residual is a value which represents the difference between the dependent value that is produced by the model, and the actual value of the dependent variable. These values are summarized in a manner similar to the output of the fivenum() function. For more information on this function, please consult prior articles.

However, there may be times that you would like to view the value of the residuals in their entirety. To achieve this, please utilize the function below:

resid(linregress)

This outputs the following to the console:


        1         2         3         4         5         6         7         8 
-5.606860 -1.563325  1.647757 -1.159631 -7.097625 13.138522 11.094987 -9.563325 
        9        10 
-1.774406  0.883905 


Coefficients 

Estimate 
(Intercept)  - This value is the value of the y-intercept.

Estimate 
x - This is the value of the slope.

With this information, we can now create our predictive model:

Y = 0.8509x + 52.6319


Y is the value of the dependent variable, and x is the value of the independent variable. If you enter the value of x into the equation, and work out the operations, you should receive the predicted value of Y.
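
As an illustration, the predicted value of Y for a hypothetical independent value of x = 30 (chosen purely for demonstration) can be produced either by hand or through R's predict() function:

# By hand, using the rounded coefficients from the summary #
0.8509 * 30 + 52.6319

[1] 78.1589

# Using the fitted model object, which stores the un-rounded coefficients #
predict(linregress, newdata = data.frame(x = 30))

The two results will be nearly, though not exactly, identical due to rounding.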

Std. Error
x - This value is the standard error of the slope value.

t value
x - This value is the coefficient estimate divided by its standard error. In our example, this value is comprised of the quotient 0.8509 / 0.3144. The value of such is 2.707.

This allows us to perform a t-test to check for significance. This particular test establishes a null hypothesis stating that the slope is equal to zero, which is utilized to check whether the slope has significance as it pertains to the model.

To obtain the critical value needed to perform this evaluation, you will need to first determine the confidence level that you wish to utilize. In this case, we'll assume 95% confidence, which equates to an alpha value of .05.

Since the t-test that we will be performing is a two-tailed test, we will enter the following code to receive our critical value:

qt(c(.05/2), df=8)

In the above code, .05 is our alpha value, which is being divided by 2 due to the test requiring two tails. df=8 is the value of our degrees of freedom; this value can be found in the output row which reads "Residual standard error", and it specifies 8 degrees of freedom. Note that qt() returns the negative (lower-tail) critical value, -2.306004; its absolute value is used for the comparison.

Since the t value of our model (2.707) is greater than the critical value (2.306004), we can state, based on our chosen confidence level, that the slope is significant as it pertains to our model.

p value
x - The p-value that is being indicated represents the probability of observing a t value at least this extreme if the true slope were zero, i.e., if the variable 'x' had no effect on 'y'.

The lower the p-value, the stronger the indication of significance. We can test this value against a chosen alpha level, in this case, alpha = .05 (95% confidence). Since our p-value is 0.026791, which is smaller than the alpha value of .05, we can state, with 95% confidence, that the slope is statistically significant.
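
As an additional check, this p-value can be reproduced (approximately) directly from the t value discussed above, by doubling the upper-tail probability of the t distribution:

2 * pt(2.707, df = 8, lower.tail = FALSE)

# Approximately 0.0268, matching the Pr(>|t|) value of 0.026791 shown for 'x' in the #
# summary. Small differences are due to the rounding of the t value. #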

Residual standard error:

This is the estimated standard deviation of the residual values.

There is some confusion as to how this value is calculated. It is important to note that the residual standard error is an ESTIMATED STANDARD DEVIATION. As such, the degrees of freedom are calculated as (n - p), and not (n - 1). In the case of our model, if you were to calculate the standard deviation of the residuals with the function sd(resid(linregress)), the resulting value would be incorrect as it pertains to the model. *
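
A minimal sketch of this calculation, using the residuals of our model and the 8 degrees of freedom reported within the summary:

sqrt(sum(resid(linregress) ^ 2) / 8)    # sqrt(SSE / (n - p)); approximately 7.741 #

sd(resid(linregress))    # divides by (n - 1) instead, and therefore understates the value #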

F-Statistic 

One of the most important statistical methods for checking significance is the F-Statistic.

As you can see from the above output, the F-Statistic is calculated to be:

7.327

With this information, we can conduct a hypothesis test after deciding on an appropriate confidence interval. Again, we will utilize a confidence interval of 95%, and this provides us with an alpha value of .05.

Degrees of freedom are provided, those values are 1 and 8.

With this information, we can now generate the critical value against which to test the F-statistic. Typically this value would be found in a table within a statistics textbook; however, a much more accurate and expedient way of finding this value is through the utilization of R software.

qf(.95, df1=1, df2=8) #Alpha .05#

This provides us with the console output:

5.317655

Since our F-statistic of 7.327 is greater than the critical value of 5.317655, we will reject the null hypothesis at the 95% confidence level. Due to such, we can conclude, with 95% confidence, that the model provides a significant fit for our data.
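
The p-value listed alongside the F-statistic in the summary can also be generated directly, by finding the upper-tail probability of the F distribution at the value of the statistic:

pf(7.327, df1 = 1, df2 = 8, lower.tail = FALSE)

# Approximately 0.0268, matching the p-value of 0.02679 reported in the summary. #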

The Value of The Coefficient of Determination

(Multiple R-Squared)

(Notated as: r²)

The coefficient of determination can be thought of as a percent. It gives you an idea of how many data points fall within the results of the line formed by the regression equation. The higher the coefficient, the higher percentage of points the line passes through when the data points and line are plotted.**

In most entry level statistics classes, this is the only variable that is evaluated when determining model significance. Typically, the higher the value of the coefficient of determination, assuming that all other tests of significance are met, the greater the usefulness and accuracy of the model. This value can be any number from 0-1. A value of 1 would indicate a perfect model.

(Adjusted R-Squared)

This value is the adjusted coefficient of determination; its method of calculation accounts for the number of independent variables relative to the number of observations contained within the model. Adjusted r-squared, by its nature, will always be of a lesser value than its multiple r-squared equivalent.***

The Value of the Coefficient of Correlation

(notated as: r)

This value is not provided in the summary. However, there may be times when you would like to have this value provided. The code to produce this value is below:

cor(y, x)

This outputs the following value to the console:

[1] 0.6914018

Therefore, r = 0.6914018.
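
As a brief aside, in a simple linear regression containing a single independent variable, squaring the coefficient of correlation produces the coefficient of determination discussed above:

cor(y, x) ^ 2

# Approximately 0.478, which matches the Multiple R-squared value within the summary. #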

Graphing the Linear Regression Model




The following code can be utilized to create the graph of a linear regression model in R. In this case, we will be creating a graphical representation of our example model.

plot(x, y, xlab="X-Value", ylab="Y-Value", main="Linear Regression Example")
abline(linregress)

* For more information on the standard error of the regression - 
http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-to-interpret-s-the-standard-error-of-the-regression

** http://www.statisticshowto.com/what-is-a-coefficient-of-determination/

*** https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2

Monday, September 4, 2017

(R) Chi-Square

Chi-square is an often overlooked concept in statistics. It has many uses, as will be demonstrated in this article. The first essential aspect of understanding chi-square is to understand its pronunciation. Many would assume that the pronunciation is "C-HI" or "ChEE". Neither is correct; the proper pronunciation is "Kai". Next, let's examine how a chi-square distribution appears when graphed.


Above is a graphical representation of the chi-square distribution. What is being illustrated is the probability densities of various chi-square distributions based on degrees of freedom.

Things to remember about the Chi-Squared Distribution:

1. It is a continuous probability distribution.

2. It is related to the standard normal distribution.

3. A chi-square distribution with k degrees of freedom arises as the sum of the squares of k independent standard normal variables. For the goodness-of-fit tests demonstrated below, the degrees of freedom will be the total number of categories minus one.


The chi-squared distribution is utilized for goodness-of-fit tests. Meaning, it is used to test one set of data against another in order to determine whether a model of predictability is accurate. The degrees of freedom (n-1), or the number of categories minus one, determine the shape of the probability density curve. Alpha, or 1 minus the confidence level, determines the size of the rejection region. This region is defined as the rightmost area beneath the distribution curve. The chi-square test statistic is calculated by summing, across all categories, the squared difference between each observed count and its expected count, divided by the expected count. Once derived, this value is compared against a chi-square distribution table. The chi-square value, in conjunction with the determined degrees of freedom and the alpha value, ultimately determines whether a relationship may be assumed to exist.
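
A minimal sketch of this calculation in R, using purely hypothetical observed and expected counts (chosen only for illustration), is shown below:

# Hypothetical observed and expected counts #
observed <- c(25, 30, 45)
expected <- c(30, 30, 40)

# Chi-square test statistic: sum of (observed - expected)^2 / expected #
sum((observed - expected) ^ 2 / expected)

[1] 1.458333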

Example:

A small motel owner has created a model which he believes is an accurate predictor of individuals who will stay at his establishment. He presents you with his findings:

Monday: 20
Tuesday: 28
Wednesday: 18
Thursday: 25
Friday: 16
Saturday: 22
Sunday: 26

The following week, you are tasked with keeping track of guests who rent rooms at the motel. Here are your findings:

Monday: 14
Tuesday: 25
Wednesday: 22
Thursday: 18
Friday: 16
Saturday: 24
Sunday: 30

Given your findings, and assuming a 95% confidence interval, can we assume that the motel owner's model is an accurate predictor?

Model <- c(20, 28, 18, 25, 16, 22, 26)

Results <- c(14, 25, 22, 18, 16, 24, 30)

chisq.test(Model, p = Results,  rescale.p = TRUE)

Console Output:

Chi-squared test for given probabilities

data: Model
X-squared = 6.5746, df = 6, p-value = 0.362

Findings:

Degrees of Freedom (df) - 6
Confidence Interval (CI) - .95
Alpha (α) (1-CI) - .05
Chi Square Test Statistic - 6.5746

This creates the hypothesis test parameters:

H0 : The model is a good fit (Null Hypothesis).

The critical value of 12.59 is found when consulting the chi-square distribution table (df = 6, alpha = .05). Since our chi-square value is less than this value (6.5746 < 12.59), we cannot reject the null hypothesis at the 95% confidence level; the observed data are consistent with the owner's model.

Cannot Reject: Null Hypothesis.
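
If a chi-square distribution table is not readily available, the critical value can instead be generated within R:

qchisq(.95, df = 6)  # critical value for alpha = .05 with 6 degrees of freedom #

[1] 12.59159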

Example:

The same small motel owner also created an additional model which he believes is an accurate predictor of individuals who will stay at his establishment. He presents you with his findings:

Monday: 10%
Tuesday: 5%
Wednesday: 20%
Thursday: 10%
Friday: 20%
Saturday: 30%
Sunday: 5%

(Predicted percentage of total individuals who will stay throughout the week)


The following week, you are tasked to keep track of guests who rent rooms at the motel. Here are your findings:

Monday: 11
Tuesday: 25
Wednesday: 30
Thursday: 13
Friday: 23
Saturday: 17
Sunday: 8

(Actual number of individuals who stayed throughout the week)

Given your findings, and assuming a 95% confidence interval, can we assume that the motel owner's model is an accurate predictor?

Model <- c(.10, .05, .20, .10, .20, .30, .05)

Results <- c(11, 25, 30, 13, 23, 17, 8)

chisq.test(Results, p=Model, rescale.p= FALSE)

Console Output:

Chi-squared test for given probabilities

data: Results
X-squared = 68.184, df = 6, p-value = 9.634e-13

Findings:

Degrees of Freedom (df) - 6
Confidence Interval (CI) - .95
Alpha (α) (1-CI) - .05
Chi-Square Test Statistic - 68.184

This creates the hypothesis test parameters:

H0 : The model is a good fit (Null Hypothesis).

The critical value of 12.59 is found when consulting the chi-squared distribution table. Since our chi-square value is greater than this value (68.184 > 12.59), we reject the null hypothesis at the 95% confidence level and conclude that the owner's model is not an accurate predictor.

Reject: Null Hypothesis.

Example:

While working as a statistician at a local university, you are tasked to evaluate, based on survey data, the level of job satisfaction that each member of the staff currently has for their occupational role. The data that you gather from the surveys is as follows:

General Faculty
130 Satisfied 20 Unsatisfied

Professors
30 Satisfied 20 Unsatisfied

Adjunct Professors
80 Satisfied 20 Unsatisfied

Custodians
20 Satisfied 10 Unsatisfied

The question remains, however, as to whether the assigned role of each staff member has any impact on the survey results. To decide this, with 95% confidence, you must follow the subsequent steps.

First, we will need to input this survey data into R as a matrix. This can be achieved by utilizing the code below:

Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10), nrow = 4, ncol=2)

The result should resemble:

     [,1] [,2]
[1,]  130   20
[2,]   30   20
[3,]   80   20
[4,]   20   10

Once this step has been completed, the next step is as simple as entering the code:

chisq.test(Model)

Console Output:

Pearson's Chi-squared test

data: Model
X-squared = 18.857, df = 3, p-value = 0.0002926

Findings:

Degrees of Freedom (df) - 3
Confidence Interval (CI) - .95
Alpha (α) (1-CI) - .05
Chi Square Test Statistic - 18.857

This creates the hypothesis test parameters:

H0 : There is no correlation between job type and job satisfaction (Null Hypothesis). Job type and job satisfaction are independent of one another.

HA: There is a correlation between job type and job satisfaction. Job type and job satisfaction are not independent of one another.

The critical value of 7.815 is found when consulting the chi-squared distribution table (df = 3, alpha = .05). Since our chi-square value is greater than this value (18.857 > 7.815), we can state, with 95% confidence, that there is a correlation between job type and overall satisfaction.

Reject: Null Hypothesis.
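
If you would like to view the expected counts that the test compares the survey data against, these values are stored within the object returned by chisq.test(). Each expected count is the product of its row total and column total, divided by the overall total:

chisq.test(Model)$expected  # expected counts under the assumption of independence #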

* Source for Chi Square Distribution Image - https://en.wikipedia.org/wiki/Chi-squared_distribution