Monday, September 18, 2017

(R) Multiple Linear Regression - Pt. (II)

In the previous article, we discussed linear regression. In this article, we will discuss multiple linear regression. From a conceptual standpoint, multiple linear regression is the same as linear regression. The only fundamental difference is that a multiple linear regression model allows multiple independent variables to be assessed.

In this example, we have three variables. Each variable consists of prior observational data.

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)


To integrate these variable sets into a model, we will use the following code:

multiregress <- lm(y ~ x + z)

This code creates a new object ('multiregress'), which contains the regression model data. In this model, 'y' is the dependent variable, with 'x' and 'z' both represented as independent variables.
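As a quick check that the model was specified as intended, the formula stored in the model object can be inspected. The sketch below re-enters the data from above so that it runs on its own; 'model_vars' is a name introduced here purely for illustration.

```r
# Data re-entered from above so that this block runs on its own
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

multiregress <- lm(y ~ x + z)

# The variable to the left of '~' is the dependent variable;
# the variables to the right are the independent variables
formula(multiregress)

# 'model_vars' (illustrative name) lists the dependent variable first,
# followed by the independent variables
model_vars <- all.vars(formula(multiregress))
model_vars
```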

We will need to run the following summary function to receive output information pertinent to the model:

summary(multiregress)

The output produced within the console window is as follows:

Call:
lm(formula = y ~ x + z)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4016 -5.0054 -1.7536  0.8713 14.0886 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  47.1434    12.0381   3.916  0.00578 **
x             0.7808     0.3316   2.355  0.05073 . 
z             0.3990     0.4804   0.831  0.43363   
---

Residual standard error: 7.896 on 7 degrees of freedom
Multiple R-squared:  0.5249, Adjusted R-squared:  0.3891 
F-statistic: 3.866 on 2 and 7 DF,  p-value: 0.07394

In this scenario, the equation that we would use to estimate the value of 'y' is:

y = 0.7808x + 0.3990z + 47.1434
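To put the equation to use, a new pair of 'x' and 'z' values can be passed to the fitted model. The following is a minimal sketch: the values x = 30 and z = 20 are made up for illustration and do not come from the observed data, and 'new_obs', 'by_hand', and 'by_predict' are names introduced here.

```r
# Data re-entered from above so that this block runs on its own
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

multiregress <- lm(y ~ x + z)

# Hypothetical new observation (values chosen only for illustration)
new_obs <- data.frame(x = 30, z = 20)

# Option 1: apply the equation by hand, using the fitted coefficients
b <- coef(multiregress)
by_hand <- b[["(Intercept)"]] + b[["x"]] * new_obs$x + b[["z"]] * new_obs$z

# Option 2: let predict() apply the same equation
by_predict <- predict(multiregress, newdata = new_obs)

by_hand
by_predict
```

Both approaches should return the same estimate, since predict() applies the fitted coefficients to the new values.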

However, in investigating the results of the summary output, we observe that:

Multiple R-squared = 0.5249
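This figure does not have to be read off of the printed output. As a sketch, the list returned by summary() stores it as 'r.squared', alongside 'adj.r.squared'. The names 'model_summary', 'n', 'p', and 'adj_by_hand' are introduced here for illustration.

```r
# Data re-entered from above so that this block runs on its own
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

multiregress <- lm(y ~ x + z)
model_summary <- summary(multiregress)

# Multiple R-squared and adjusted R-squared, as stored in the summary
model_summary$r.squared
model_summary$adj.r.squared

# Adjusted R-squared by hand:
# n = number of observations, p = number of independent variables
n <- length(y)
p <- 2
adj_by_hand <- 1 - (1 - model_summary$r.squared) * (n - 1) / (n - p - 1)
adj_by_hand
```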

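This can be a large enough coefficient, depending on the type of data that we are observing. However, the following values should raise some alarm:

p-value: 0.07394 (> Alpha of .05)

AND

F-statistic: 3.866 on 2 and 7 DF

The critical F value at an alpha of .05 can be found with the following code:

qf(.95, df1=2, df2=7) # Alpha .05

[1] 4.737414

4.737414 > 3.866

The F-statistic falls below the critical value, which agrees with the p-value exceeding alpha.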
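The p-value and the F-statistic describe the same test, and this can be confirmed in code. The sketch below pulls the F-statistic and its degrees of freedom out of the summary object, then recomputes the p-value with pf() and the critical value with qf(). The names 'f_info', 'f_value', 'p_value', and 'f_critical' are introduced here for illustration.

```r
# Data re-entered from above so that this block runs on its own
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

multiregress <- lm(y ~ x + z)

# Named vector holding the F-statistic ('value') and its
# degrees of freedom ('numdf' and 'dendf')
f_info <- summary(multiregress)$fstatistic
f_value <- f_info[["value"]]

# p-value: probability of an F value at least this large
# if none of the independent variables had any effect
p_value <- pf(f_value, f_info[["numdf"]], f_info[["dendf"]], lower.tail = FALSE)

# Critical F value at an alpha of .05
f_critical <- qf(.95, df1 = f_info[["numdf"]], df2 = f_info[["dendf"]])

p_value > .05          # TRUE when the model is not significant
f_value < f_critical   # the same conclusion, stated in terms of F
```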

If these concepts seem foreign, please refer to the previous article.

From the summary data, we can conclude that the model as a whole is not statistically significant at the .05 level, and should not be accepted and utilized as it stands.

Therefore, I would recommend re-creating this model with new independent variables.

When creating multiple linear regression models, it is important to consider the values of the F-statistic and the coefficient of determination (multiple R-squared). If variables are being added to, or exchanged within, an existing regression model, the F-statistic should ideally rise in value. Because multiple R-squared can never decrease when a variable is added to a model, adjusted R-squared is the better value to watch in that case. An increase in these values suggests that the model is improving in fit. A decline would indicate otherwise.
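One way to make this comparison, sketched below, is to fit the model with and without 'z' and to place the values side by side. Multiple R-squared cannot fall when a variable is added, so adjusted R-squared is included as well. The names 'model_x', 'model_xz', 'get_stats', and 'compare' are introduced here for illustration.

```r
# Data re-entered from above so that this block runs on its own
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

model_x  <- lm(y ~ x)      # one independent variable
model_xz <- lm(y ~ x + z)  # 'z' added as a second independent variable

# Illustrative helper: collect the values of interest from one model
get_stats <- function(model) {
  s <- summary(model)
  c(r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    f_statistic   = s$fstatistic[["value"]])
}

# One row per model, one column per value
compare <- rbind("y ~ x" = get_stats(model_x),
                 "y ~ x + z" = get_stats(model_xz))
compare

# Partial F-test: does adding 'z' significantly improve the model?
anova(model_x, model_xz)
```

With this data, adding 'z' raises multiple R-squared, but lowers both adjusted R-squared and the F-statistic, which suggests that 'z' is not improving the model.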

Moving forward, new articles will cover less complicated fundamental aspects of statistics. If you understand this article, and all prior articles, the following topics of discussion should be mastered with relative ease. Stay tuned for more content, Data Heads!
