Reflections of a Data Scientist: (R) Finding the Best Linear Model w/stepAIC()

In today’s article, we will continue to address reader inquiries. Recently, I was contacted by an analyst who shared a concern pertaining to linear modeling. Specifically: what is the best way to build an efficient linear model when a data frame contains numerous independent variables? Trial and error isn’t a terrible option when there are only a handful of independent variables. However, when a data frame contains hundreds of independent variables, a more efficient method is necessary.

Thankfully, for the R user, a tenable solution exists.

Utilizing the “MASS” Package to find the Best Linear Model

As the title suggests, this technique requires that the “MASS” package be downloaded and enabled.
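As a minimal sketch of that setup step (assuming a standard R installation, where “MASS” is normally bundled as a recommended package), the package can be installed only if it is missing and then enabled:

```r
# Install "MASS" only when it is not already present, then enable it.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}
library(MASS)

# stepAIC() should now be available in the session
is.function(stepAIC)
```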

For this example, we will be utilizing a rather lengthy data frame. The sample data frame: “BiTestData.csv”, can be found amongst other files within the site’s corresponding GitHub.

Once the .CSV file has been downloaded, it can be loaded into the R platform through the utilization of the following code:

DataFrameA <- read.table("C:\\Users\\UserName\\Desktop\\BiTestData.csv", fill = TRUE, header = TRUE, sep = "," )

The pathway must be altered to reflect the file destination within your working environment.
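For comma-separated files, read.csv() is a shorter alternative, since it already defaults to header = TRUE, sep = "," and fill = TRUE. The sketch below is only an illustration: it writes a tiny, made-up stand-in file to a temporary location (the rows are hypothetical, not taken from BiTestData.csv) so that it runs without the real data.

```r
# Hypothetical stand-in for the real file, written to a temporary path
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("Outcome,VarA,VarB",
             "1,0,1",
             "0,1,0",
             "1,0,1"), tmp_path)

# read.csv() uses header = TRUE, sep = "," and fill = TRUE by default
DataFrameDemo <- read.csv(tmp_path)

# Check column names and types before modeling
str(DataFrameDemo)
```

With the real file, tmp_path would simply be replaced by the path to BiTestData.csv in your working environment.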

To demonstrate the capability of the “MASS” package, we will first create a logistic regression model within R through the utilization of the glm() function.

bimodel <- glm(Outcome ~., family=binomial, data=DataFrameA)

summary(bimodel)

# Console Output: #

Call:
glm(formula = Outcome ~ ., family = binomial, data = DataFrameA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.35061  -0.00005  -0.00005  -0.00004   1.77333

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.032e+01  1.980e+03  -0.010    0.992
VarA        -6.206e-02  1.269e+04   0.000    1.000
VarB         2.036e+01  1.254e+04   0.002    0.999
VarC        -4.461e-01  5.376e-01  -0.830    0.407
VarD        -5.893e-01  5.699e-01  -1.034    0.301
VarE         4.928e-01  9.435e-01   0.522    0.601
VarF        -2.334e-02  5.032e-02  -0.464    0.643

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 142.301 on 144 degrees of freedom
Residual deviance: 84.197 on 138 degrees of freedom
AIC: 98.197

Number of Fisher Scoring iterations: 19

We will now measure the model’s predictive capacity through the application of the Nagelkerke R-Squared methodology.

# Generate Nagelkerke R Squared #

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(bimodel)

# Console Output #

        McFadden     Adj.McFadden        Cox.Snell
      0.40831495       0.29587694       0.33015814
      Nagelkerke McKelvey.Zavoina           Effron
      0.52807741       0.96866777       0.33839985
           Count        Adj.Count              AIC
      0.81379310       0.03571429      98.19715620
   Corrected.AIC
     99.01467445

Notice that the Nagelkerke R-Squared value is .528, which, by most standards, indicates that the model possesses a fairly decent predictive capacity. In prior articles related to Logistic Regression Analysis, we discussed how this statistic is utilized in lieu of the traditional R-Squared figure to measure the strength of predictability in logistic regression models. However, another statistic illustrated within this output, the AIC, or Akaike Information Criterion, was not specifically mentioned.
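In case the “BaylorEdPsych” package is not available for your version of R, the Nagelkerke value can also be computed by hand. The sketch below assumes an ungrouped 0/1 outcome, for which the deviance equals -2 times the log-likelihood; the helper name nagelkerke_r2 is hypothetical, and the inputs are the null deviance, residual deviance, and observation count (144 + 1) reported by summary(bimodel). It should reproduce the value of roughly 0.528 shown above.

```r
# Nagelkerke R-squared from the deviances of a logistic regression
# fitted to an ungrouped 0/1 outcome (hypothetical helper).
nagelkerke_r2 <- function(null_deviance, residual_deviance, n) {
  # Cox & Snell R-squared
  cox_snell <- 1 - exp((residual_deviance - null_deviance) / n)
  # Largest value Cox & Snell can reach for this data
  max_cox_snell <- 1 - exp(-null_deviance / n)
  cox_snell / max_cox_snell
}

# Figures copied from the summary(bimodel) output
full_r2 <- nagelkerke_r2(142.301, 84.197, n = 145)
round(full_r2, 3)
```

For a fitted model object, the same inputs should be obtainable as fit$null.deviance, deviance(fit), and nobs(fit).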

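AIC differs from both the Nagelkerke R-Squared value and the traditional R-Squared statistic in that it does not solely measure how well the current model explains the observed data. Instead, it seeks to estimate how well the model will perform when applied to new observational data. It is calculated as 2k - 2ln(L), where k is the number of estimated parameters and L is the maximized likelihood of the model, so every additional parameter must improve the fit by enough to justify its penalty. In this sense, R-Squared reflects training error, while AIC acts as an estimate of relative test error, thus accounting for the trade-off between bias and variance. An AIC value has no meaning in isolation; it is only useful for comparing models fitted to the same data.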
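To make the calculation concrete, here is a small sketch on simulated data (the data frame demo and its columns are invented for illustration and are not the article’s data set). For an ungrouped 0/1 outcome, the residual deviance equals -2 times the log-likelihood, so AIC is simply the residual deviance plus twice the number of coefficients. The same arithmetic applies to the output above: 84.197 + 2 × 7 = 98.197.

```r
set.seed(1)
n <- 200

# Simulated data: y depends on x1 only; x2 is pure noise
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- rbinom(n, 1, plogis(0.5 + demo$x1))

fit <- glm(y ~ x1 + x2, family = binomial, data = demo)

# k = number of estimated coefficients (intercept included)
k <- length(coef(fit))

# AIC from its definition, and from the residual deviance
manual_aic   <- -2 * as.numeric(logLik(fit)) + 2 * k
deviance_aic <- deviance(fit) + 2 * k

# All three values should agree
c(builtin = AIC(fit), manual = manual_aic, from_deviance = deviance_aic)
```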

As was mentioned in the prior article pertaining to Logistic Regression, when measuring the strength of model predictability, the Nagelkerke R-Squared value is the most easily interpretable.

The Akaike Information Criterion warrants discussion here because it is the criterion by which the stepAIC() function determines model optimization. As it concerns interpretability, the smaller the AIC value, the better the model is assumed to perform when applied to new observational sets.
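A single step of this kind of search can be sketched with the drop1() function from base R, which reports the AIC of the current model (the <none> row) alongside the AIC of each model obtained by removing one term. The data below is simulated purely for illustration; the names demo, x1, x2, and x3 are assumptions and do not come from the article’s data set.

```r
set.seed(42)
n <- 200

# Simulated data: only x1 actually drives the outcome
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$y <- rbinom(n, 1, plogis(0.5 + demo$x1))

full_fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = demo)

# One backward step: AIC with no change (<none>) and with each
# single term removed
step_table <- drop1(full_fit)
step_table

# The "x2" row should match the AIC of a model refitted without x2
refit <- update(full_fit, . ~ . - x2)
c(drop1 = step_table["x2", "AIC"], refit = AIC(refit))
```

A backward search removes whichever term yields the lowest AIC, and repeats until the <none> row is the smallest entry in the table.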

Let us now apply the stepAIC() function to our model and observe the results.

# With the “MASS” package downloaded and enabled #

stepAIC(bimodel)

This produces the output:

# Console Output #

Start:  AIC=98.2
Outcome ~ VarA + VarB + VarC + VarD + VarE + VarF

         Df Deviance     AIC
- VarA    1   84.197  96.197
- VarF    1   84.414  96.414
- VarE    1   84.479  96.479
- VarC    1   84.891  96.891
- VarD    1   85.290  97.290
- VarB    1   86.022  98.022
<none>        84.197  98.197

Step:  AIC=96.2
Outcome ~ VarB + VarC + VarD + VarE + VarF

         Df Deviance     AIC
- VarF    1   84.414  94.414
- VarE    1   84.479  94.479
- VarC    1   84.891  94.891
- VarD    1   85.290  95.290
<none>        84.197  96.197
- VarB    1   96.542 106.542

Step:  AIC=94.41
Outcome ~ VarB + VarC + VarD + VarE

         Df Deviance     AIC
- VarE    1   84.677  92.677
- VarC    1   84.999  92.999
- VarD    1   85.586  93.586
<none>        84.414  94.414
- VarB    1   96.757 104.757

Step:  AIC=92.68
Outcome ~ VarB + VarC + VarD

         Df Deviance     AIC
- VarC    1   85.485  91.485
- VarD    1   85.742  91.742
<none>        84.677  92.677
- VarB    1  132.815 138.815

Step:  AIC=91.49
Outcome ~ VarB + VarD

         Df Deviance     AIC
- VarD    1   86.557  90.557
<none>        85.485  91.485
- VarB    1  139.073 143.073

Step:  AIC=90.56
Outcome ~ VarB

         Df Deviance     AIC
<none>        86.557  90.557
- VarB    1  142.301 144.301

Call:  glm(formula = Outcome ~ VarB, family = binomial, data = DataFrameA)

Coefficients:
(Intercept)         VarB
     -20.57        20.34

Degrees of Freedom: 144 Total (i.e. Null);  143 Residual
Null Deviance:      142.3
Residual Deviance: 86.56    AIC: 90.56
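The stepAIC() function also returns the selected model as an object, so the result can be stored rather than retyped. The following is a minimal sketch on simulated data (demo, x1, x2, and x3 are invented for illustration); it uses the trace and direction arguments of stepAIC() to suppress the step-by-step printout and restrict the search to term removal.

```r
library(MASS)

set.seed(42)
n <- 200

# Simulated data: only x1 actually drives the outcome
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$y <- rbinom(n, 1, plogis(0.5 + demo$x1))

full_fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = demo)

# trace = FALSE hides the search log; direction = "backward"
# only considers removing terms from the starting model
selected_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

# The chosen formula, the record of steps taken, and both AIC values
formula(selected_fit)
selected_fit$anova
c(full = AIC(full_fit), selected = AIC(selected_fit))
```

Because the search only accepts a step when it lowers the AIC, the selected model’s AIC should never exceed that of the starting model.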

As illustrated, the ideal model that the stepAIC() function suggests is:

bimodel <- glm(Outcome ~ VarB, family=binomial, data=DataFrameA)

summary(bimodel)

# Console Output #

Call:
glm(formula = Outcome ~ VarB, family = binomial, data = DataFrameA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.08424  -0.00005  -0.00005  -0.00005   1.27352

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -20.57    1957.99  -0.011    0.992
VarB           20.34    1957.99   0.010    0.992

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 142.301 on 144 degrees of freedom
Residual deviance: 86.557 on 143 degrees of freedom
AIC: 90.557

Number of Fisher Scoring iterations: 19

Now let’s measure the model’s predictive capacity.

# Generate Nagelkerke R Squared #

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(bimodel)

# Console Output #

        McFadden     Adj.McFadden        Cox.Snell       Nagelkerke McKelvey.Zavoina           Effron            Count        Adj.Count
       0.3917303        0.3495661        0.3191667        0.5104969        0.9686596        0.3114910               NA               NA
             AIC    Corrected.AIC
      90.5571588       90.6416659

As you can observe from the information presented above, the Nagelkerke value has been lowered slightly, from 0.53 to 0.51. However, the AIC value has fallen by a more substantial amount, from 98.20 to 90.56. This should be viewed as a positive occurrence. The lower the AIC value, the better the model is expected to account for new observational data. The slight decline in the Nagelkerke value is a small price for a model with five fewer terms and a lower AIC. Therefore, we can conclude that, given the independent variables present within the data set, the model below has the preferred structure:

bimodel <- glm(Outcome ~ VarB, family=binomial, data=DataFrameA)
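The trade-off behind this conclusion can be checked with the figures already reported in the two summary() outputs. As a rough sketch (the variable names are only illustrative), the change in AIC splits into the change in residual deviance plus the change in the penalty of 2 per estimated coefficient:

```r
# Figures copied from the two summary() outputs
full    <- c(deviance = 84.197, k = 7)  # intercept + VarA through VarF
reduced <- c(deviance = 86.557, k = 2)  # intercept + VarB

# The fit worsens slightly when five terms are removed...
deviance_change <- reduced[["deviance"]] - full[["deviance"]]

# ...but the penalty shrinks by 2 for each coefficient dropped
penalty_change <- 2 * (reduced[["k"]] - full[["k"]])

aic_change <- deviance_change + penalty_change

c(deviance = deviance_change, penalty = penalty_change, AIC = aic_change)
```

The deviance rises by only 2.36 while the penalty falls by 10, which accounts for the net AIC decrease of 7.64.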

For more information pertaining to The Akaike Information Criterion (AIC):

https://en.wikipedia.org/wiki/Akaike_information_criterion

For more information pertaining to the Akaike Information Criterion and the R-Squared statistic as quantifiable measurements:

https://stats.stackexchange.com/questions/140965/when-aic-and-adjusted-r2-lead-to-different-conclusions

That’s all for now, Data Heads! Stay subscribed for more substantive concepts.

Monday, October 22, 2018