Sunday, November 12, 2017

(R) VIF() and cor()

Revisiting multiple regression for the purpose of establishing a greater understanding of the topic, and to foster a more efficient application of the regression model, this entry reviews functions and statistical concepts which were overlooked in the prior article pertaining to the subject matter.

Variance Inflation Factor or VIF():

What is VIF()? According to Wikipedia, it is defined as such: “The variance inflation factor (VIF) quantifies the severity of multicollinearity in an ordinary least squares regression analysis.”

To define VIF() in layman’s terms: the Variance Inflation Factor measures how strongly each independent variable can be predicted from the other independent variables in a multiple regression equation. For each independent variable, regress it on the remaining independent variables, note the resulting coefficient of determination (R-squared), and compute VIF = 1 / (1 - R-squared). It’s really as simple as that, but none of this will make sense without an example.
Example:
Consider the following numerical sets:

w <- c(23, 42, 55, 16, 24, 27, 24, 15, 23, 85)
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)


Let’s utilize these values to create a multiple regression model:

lm.multiregressw <- lm(w ~ x + y + z)

Now let’s take a look at that model:
summary(lm.multiregressw)

Call:
lm(formula = w ~ x + y + z)

Residuals:
    Min      1Q  Median      3Q     Max
-29.701 -11.020  -4.462   3.465  44.108

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.40034   68.57307  -0.020    0.984
x           -0.08981    1.41549  -0.063    0.951
y            0.19770    1.20528   0.164    0.875
z            1.15241    1.60560   0.718    0.500

Residual standard error: 25.18 on 6 degrees of freedom
Multiple R-squared: 0.1109, Adjusted R-squared: -0.3336
F-statistic: 0.2495 on 3 and 6 DF, p-value: 0.8591

Given the Multiple R-squared value of this output, we can assume that this model is pretty awful. Regardless, our model could also be affected by a phenomenon known as multicollinearity. What this means is that the independent variables could be correlated with one another, and thus prevent our model from providing accurate results. To check for this, we can measure the VIF() for each independent variable within the equation as it pertains to each other independent variable. If we were to do this manually, the code would resemble:

lm.multiregressx <- lm(x ~ y + z) # .4782 #

lm.multiregressy <- lm(y ~ x + z) # .5249 #

lm.multiregressz <- lm(z ~ y + x) # .1488 #

With each output, viewed through summary(), we would note the Multiple R-squared value. I have provided those values to the right of each code line above. To then individually calculate the VIF(), we would utilize the following code for each R-squared value.

1 / (1 - .4782) # 1.916 #

1 / (1 - .5249) # 2.105 #

1 / (1 - .1488) # 1.175 #

This produces the values to the right of each code line. These values are the VIF(), or Variance Inflation Factor, for each independent variable.
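If you would rather not transcribe the R-squared values by hand, they can be pulled directly from each fitted model object. A minimal sketch, using the models defined above:

# Extract each Multiple R-squared directly from the fitted models #
r2 <- c(x = summary(lm.multiregressx)$r.squared,
        y = summary(lm.multiregressy)$r.squared,
        z = summary(lm.multiregressz)$r.squared)

# Apply the VIF formula to every value at once #
1 / (1 - r2)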

An easier way to derive the VIF() is through the vif() function included in the R package, “car”. If the package is not already installed, it can be installed and loaded with:
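install.packages("car") # one-time installation #
library(car) # load the package into the current session #

Once “car” is loaded, you can execute the following command: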

vif(lm.multiregressw)

Which provides the output:

       x        y        z
1.916437 2.104645 1.174747


Typically, most data scientists consider values of VIF() which are either over 5 or 10 (depending on sensitivity) to indicate that a variable ought to be removed. If you do plan on removing a variable from your model for such reasons, remove one variable at a time, as the removal of a single variable will affect subsequent VIF() measurements. The sketch below illustrates the process.
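As a hypothetical illustration (none of the VIF() values in our example actually exceed either threshold), suppose that y had shown a problematic VIF(). We would refit the model without it and then re-check the remaining variables:

lm.multiregressw2 <- lm(w ~ x + z) # refit the model without y #
vif(lm.multiregressw2) # re-examine VIF() for the remaining variables #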

(Pearson) Coefficient of Correlation or cor():

Another tool at your disposal is the function cor(), which allows the user to derive the coefficient of correlation ( r ) from numerical sets. For example:

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
cor(x,y)

[1] 0.6914018
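As the section heading implies, cor() returns the Pearson correlation by default. If a rank-based measure is preferred, the method argument can be changed, for example:

cor(x, y, method = "spearman") # Spearman rank correlation #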

As exciting as this all seems, there is actually an even better use for this function. For example, if you wanted to derive the coefficient of correlation for each variable as it pertains to all of the others, you could use the following code lines to build a correlation matrix.

# Set values: #

w <- c(23, 42, 55, 16, 24, 27, 24, 15, 23, 85)
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)


# Create a data frame #

dframe <- data.frame(w,x,y,z)

# Create a correlation matrix #

correlationmatrix <- cor(dframe)
correlationmatrix

The output will resemble:

          w         x         y         z
w 1.0000000 0.1057911 0.1836205 0.3261470
x 0.1057911 1.0000000 0.6914018 0.2546853
y 0.1836205 0.6914018 1.0000000 0.3853427
z 0.3261470 0.2546853 0.3853427 1.0000000
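If seven decimal places are more precision than you need, the matrix can also be rounded for easier reading:

round(correlationmatrix, 2) # round each entry to two decimal places #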


In the next article, we will again review the F-Statistic in preparation for a discussion pertaining to the concept of ANOVA. Until then, hang in there statistics fans!
