Reflections of a Data Scientist: (R) Linear Regression

In this first entry of a two-part article series, we will be discussing simple linear regression. In the next article, I will move on to the more advanced topic of multiple regression.

Regression analysis allows you to create a predictive model. This model can be used to estimate the value of one variable based on the value of a known variable.

Let's begin with an example.

We have two sets of data:

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)

To determine if there is a relationship between the data points of each set, we must first decide which variable is affecting the other. In other words, one variable's value will determine the value of the other variable. In the case of our example, we have determined that the value of 'x' is impacting the value of 'y'. Therefore, 'x' is our independent variable, and 'y' is our dependent variable, as the value of 'y' is dependent on the value of 'x'.

We can now create a linear model for this data. The dependent variable must be listed first in this function followed by the independent variable.

linregress <- (lm(y ~ x))

'linregress' is the name of the object that we will use to store this model. To produce a model summary, which will contain the information necessary for analysis, we must utilize the following command:

summary(linregress)

This should output the following data to the console:

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max
-9.563 -4.649 -1.361  1.457 13.139

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  52.6319     9.8653   5.335 0.000698 ***
x             0.8509     0.3144   2.707 0.026791 *
---

Residual standard error: 7.741 on 8 degrees of freedom
Multiple R-squared: 0.478, Adjusted R-squared: 0.4128
F-statistic: 7.327 on 1 and 8 DF, p-value: 0.02679

Now let's review each aspect of the output.

Call:

This shows the function call that was used to create the model, including the formula that was passed to lm().

Residuals:

A residual is the difference between the actual value of the dependent variable and the value that is produced by the model (actual minus predicted). In the summary, the residuals are condensed into a minimum, first quartile, median, third quartile, and maximum, in a way that is similar to the fivenum() function. For more information on this function, please consult prior articles.

However, there may be times that you would like to view the value of the residuals in their entirety. To achieve this, please utilize the function below:

resid(linregress)

This outputs the following to the console:

        1         2         3         4         5         6         7         8
-5.606860 -1.563325  1.647757 -1.159631 -7.097625 13.138522 11.094987 -9.563325
        9        10
-1.774406  0.883905
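As a quick check of the definition above, the residuals can also be reproduced by hand by subtracting the fitted values from the observed values. This is only a minimal sketch using the example data; the name manual_resid is made up for illustration.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

# Residual = observed value minus the value fitted by the model
manual_resid <- y - fitted(linregress)

# Compare with resid(); names are dropped so only the numbers are compared
all.equal(unname(manual_resid), unname(resid(linregress)))
```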

Coefficients

Estimate
(Intercept) - This value is the value of the y-intercept.

Estimate
x - This is the value of the slope.

With this information, we can now create our predictive model:

Y = 0.8509x + 52.6319

Y is the value of the dependent variable, and x is the value of the independent variable. If you enter a value of x into the equation and work out the operations, you should receive the predicted value of Y.
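To illustrate the equation, the sketch below predicts Y for a single value of x in two ways: by plugging the coefficients into the equation, and by using predict(). The value x = 40 is a hypothetical input chosen only for this example, and new_x, manual_pred, and auto_pred are illustrative names.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

new_x <- 40  # hypothetical value of the independent variable

# Plug the estimated intercept and slope into Y = slope * x + intercept
b <- coef(linregress)
manual_pred <- b[["x"]] * new_x + b[["(Intercept)"]]

# The same prediction using predict(); the column name must match 'x'
auto_pred <- predict(linregress, newdata = data.frame(x = new_x))

manual_pred
auto_pred
```

Both approaches should agree. predict() is usually more convenient when you have many new values of x.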

Std. Error
x - This value is the standard error of the slope value.

t value
x - This value is the coefficient divided by its standard error. In our example, this is the quotient 0.8509 / 0.3144. R computes this from the unrounded values and reports 2.707.
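The t value can be recomputed from the coefficient table that summary() stores. This is a small sketch using the example data; coefs and t_manual are illustrative names.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

# Coefficient table: one row per term, with columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coefs <- coef(summary(linregress))

# t value = estimate divided by its standard error
t_manual <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

t_manual
coefs["x", "t value"]
```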

This allows us to perform a t-test to check for significance. The null hypothesis of this test is that the true slope is zero, meaning that 'x' has no linear relationship with 'y'. Rejecting the null hypothesis indicates that the slope has significance as it pertains to the model.

To obtain the critical value for this evaluation, you will first need to determine the confidence level that you wish to utilize. In this case, we'll assume 95% confidence, which equates to an alpha value of .05.

Since the t-test that we will be performing is a two-tailed test, we will enter the following code to receive our critical value:

qt(1 - .05/2, df=8)

In the above code, .05 is our alpha value, which is divided by 2 because the test has two tails. df=8 is the value of our degrees of freedom, which can be found in the summary output row that begins with "Residual standard error". That row specifies 8 degrees of freedom.

Since the t-value of our model (2.707) is greater in magnitude than the critical value (2.306004), we can state, at the 5% significance level, that the slope is significant as it pertains to our model.
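The comparison above can be sketched in code as follows. Here the degrees of freedom are taken from the model with df.residual() rather than typed in by hand; alpha, df_resid, t_crit, and t_obs are illustrative names.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

alpha <- 0.05
df_resid <- df.residual(linregress)  # residual degrees of freedom (n - 2)

# Upper-tail critical value for a two-tailed test
t_crit <- qt(1 - alpha / 2, df = df_resid)

# Observed t value of the slope
t_obs <- coef(summary(linregress))["x", "t value"]

# TRUE means the slope is significant at the chosen alpha
abs(t_obs) > t_crit
```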

p value
x - The p-value is the probability of observing a t-value at least as extreme as the one above if the true slope of 'x' were zero.

The lower the p-value, the stronger the evidence that the slope is significant. We can test this value against a significance level, in this case alpha = .05, which corresponds to 95% confidence. Since our p-value is 0.026791, which is smaller than the alpha value of .05, we can state that the slope is statistically significant at the 5% level.
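For readers who want to see where the p-value comes from, the sketch below derives it from the t value and the residual degrees of freedom using pt(). The names t_obs and p_manual are illustrative.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

coefs <- coef(summary(linregress))
t_obs <- coefs["x", "t value"]

# Two-tailed p-value: probability in both tails beyond |t|
p_manual <- 2 * pt(-abs(t_obs), df = df.residual(linregress))

p_manual
coefs["x", "Pr(>|t|)"]
```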

Residual standard error:

This is the estimated standard deviation of the residual values.

There is some confusion as to how this value is calculated. It is important to note that the residual standard error is an ESTIMATED STANDARD DEVIATION. As such, degrees of freedom are calculated as (n-p), where p is the number of estimated coefficients (in our model 2: the intercept and the slope), and not (n-1). In the case of our model, if you were to calculate the standard deviation of the residuals with sd(resid(linregress)), the result would be based on (n-1) and would therefore be incorrect as it pertains to the model. *
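The difference between the two calculations can be seen in a short sketch. The names res, n, p, and rse_manual are illustrative; summary(linregress)$sigma is where the summary stores the residual standard error.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

res <- resid(linregress)
n <- length(res)               # number of observations
p <- length(coef(linregress))  # number of estimated coefficients

# Residual standard error: sum of squared residuals over (n - p)
rse_manual <- sqrt(sum(res^2) / (n - p))

rse_manual                 # matches the summary output
summary(linregress)$sigma  # value stored by summary()
sd(res)                    # divides by (n - 1), so it comes out smaller
```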

F-Statistic

One of the most important statistical methods for checking significance is the F-Statistic.

As you can see from the above output, the F-Statistic is calculated to be:

7.327

With this information, we can conduct a hypothesis test after deciding on an appropriate confidence level. Again, we will utilize a confidence level of 95%, which provides us with an alpha value of .05.

Degrees of freedom are provided, those values are 1 and 8.

With this information, we can now generate the critical value against which to test the F-Statistic. Typically this value would be found in a table within a statistics textbook. However, a much more accurate and expedient way of finding this value is through the utilization of R.

qf(.95, df1=1, df2=8) #Alpha .05#

This provides us with the console output:

5.317655

Since our F-Statistic of 7.327 is greater than the critical value of 5.317655, we reject the null hypothesis at the 5% significance level. We can therefore conclude, with 95% confidence, that the model provides a significant fit for our data.
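The same F-test can be sketched without copying numbers by hand, since summary() stores the F-Statistic and both degrees of freedom. The names fstat, f_obs, f_crit, and f_p are illustrative.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

# Named vector with elements "value", "numdf", "dendf"
fstat <- summary(linregress)$fstatistic
f_obs <- fstat[["value"]]

# Critical value at alpha = .05
f_crit <- qf(0.95, df1 = fstat[["numdf"]], df2 = fstat[["dendf"]])

# p-value of the F-test (upper tail)
f_p <- pf(f_obs, fstat[["numdf"]], fstat[["dendf"]], lower.tail = FALSE)

f_obs > f_crit  # TRUE means we reject the null hypothesis
f_p
```

With a single independent variable, the F-Statistic equals the square of the slope's t value, which is why the p-value here matches the one in the coefficient table.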

The Value of The Coefficient of Determination

(Multiple R-Squared)

(Notated as: r²)

The coefficient of determination can be thought of as a percent. It is the proportion of the variation in the dependent variable that is explained by the regression equation. The higher the coefficient, the closer the data points lie to the line when the data points and line are plotted. In our example, a value of 0.478 means that about 48% of the variation in 'y' is explained by 'x'.**

In most entry level statistics classes, this is the only value that is evaluated when assessing a model. Typically, the higher the value of the coefficient of determination, assuming that all other tests of significance are met, the greater the usefulness and accuracy of the model. This value can be any number from 0 to 1. A value of 1 would indicate a perfect fit.
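As a sketch of how this value is built, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The names rss, tss, and r2_manual are illustrative.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

# Variation left unexplained by the model
rss <- sum(resid(linregress)^2)

# Total variation of y around its mean
tss <- sum((y - mean(y))^2)

r2_manual <- 1 - rss / tss

r2_manual
summary(linregress)$r.squared
```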

(Adjusted R-Squared)

This value is the adjusted coefficient of determination. Its method of calculation accounts for the number of independent variables in the model relative to the number of observations. Adjusted r-squared, by its nature, will never be of a greater value than its multiple r-squared equivalent.***
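The adjustment can be sketched with the standard formula 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of observations and k is the number of independent variables. The names r2, n, k, and adj_manual are illustrative.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

r2 <- summary(linregress)$r.squared
n <- length(y)  # number of observations
k <- 1          # number of independent variables in this model

# Penalize R-squared for the number of independent variables
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_manual
summary(linregress)$adj.r.squared
```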

The Value of the Coefficient of Correlation

(notated as: r)

This value is not provided in the summary. However, there may be times when you would like to have this value provided. The code to produce this value is below:

cor(y, x)

This outputs the following value to the console:

[1] 0.6914018

Therefore, r = 0.6914018.
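In a regression with a single independent variable, squaring the coefficient of correlation gives the coefficient of determination. The short sketch below checks this for the example data; r and r_squared are illustrative names.

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
linregress <- lm(y ~ x)

r <- cor(y, x)

# For one independent variable, r squared equals Multiple R-squared
r_squared <- r^2

r_squared
summary(linregress)$r.squared
```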

Graphing the Linear Regression Model

The following code can be utilized to create the graph of a linear regression model in R. In this case, we will be creating a graphical representation of our example model.

plot(x, y, xlab="X-Value", ylab="Y-Value", main="Linear Regression Example")
abline(linregress)

* For more information on the standard error of the regression -
http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-to-interpret-s-the-standard-error-of-the-regression

** http://www.statisticshowto.com/what-is-a-coefficient-of-determination/

*** https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2

Thursday, September 14, 2017
