In today’s article, we will continue to address reader inquiries. Recently, I was contacted by an analyst who shared a concern pertaining to linear modeling: what is the most efficient way to build a strong linear model when a data frame contains numerous independent variables? The trial-and-error technique isn’t a terrible option when the independent variables are few. However, when encountering a data frame which contains hundreds of independent variables, a more systematic method is necessary.

Thankfully, for the R user, a tenable solution exists.


**Utilizing the “MASS” Package to Find the Best Linear Model**

As the title suggests, this technique requires that the **“MASS”** package be downloaded and enabled.

For this example, we will be utilizing a rather lengthy data frame. The sample data frame, **“BiTestData.csv”**, can be found amongst other files within the site’s corresponding GitHub. Once the .CSV file has been downloaded, it can be loaded into the R platform through the utilization of the following code:

**DataFrameA <- read.table("C:\\Users\\UserName\\Desktop\\BiTestData.csv", fill = TRUE, header = TRUE, sep = ",")**

**The pathway must be altered to reflect the file destination within your working environment.**

To demonstrate the capability of the **“MASS”** package, we will first create a logistic regression model within R through the utilization of the glm() function:

**bimodel <- glm(Outcome ~ ., family = binomial, data = DataFrameA)**

summary(bimodel)

**# Console Output #**

```
Call:
glm(formula = Outcome ~ ., family = binomial, data = DataFrameA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.35061  -0.00005  -0.00005  -0.00004   1.77333

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.032e+01  1.980e+03  -0.010    0.992
VarA        -6.206e-02  1.269e+04   0.000    1.000
VarB         2.036e+01  1.254e+04   0.002    0.999
VarC        -4.461e-01  5.376e-01  -0.830    0.407
VarD        -5.893e-01  5.699e-01  -1.034    0.301
VarE         4.928e-01  9.435e-01   0.522    0.601
VarF        -2.334e-02  5.032e-02  -0.464    0.643

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.301  on 144  degrees of freedom
Residual deviance:  84.197  on 138  degrees of freedom
AIC: 98.197

Number of Fisher Scoring iterations: 19
```

We will now measure the model’s predictive capacity through the application of the Nagelkerke R-Squared methodology.

**# Generate Nagelkerke R Squared #**

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(bimodel)


**# Console Output #**

```
       McFadden     Adj.McFadden        Cox.Snell
     0.40831495       0.29587694       0.33015814
     Nagelkerke McKelvey.Zavoina           Effron
     0.52807741       0.96866777       0.33839985
          Count        Adj.Count              AIC
     0.81379310       0.03571429      98.19715620
  Corrected.AIC
    99.01467445
```

Notice that the Nagelkerke R-Squared value is .528, which, by most standards, indicates that the model possesses fairly decent predictive capacity. In prior articles related to Logistic Regression Analysis, we discussed how this statistic is utilized in lieu of the traditional R-Squared figure to measure the strength of predictability in logistic regression models. However, another statistic which is illustrated within this output, the AIC, or Akaike Information Criterion, was not specifically mentioned.
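The Nagelkerke value can also be reproduced by hand from the deviances in the summary output above. The sketch below uses only the reported null deviance (142.301), residual deviance (84.197), and sample size (145, i.e. 144 null degrees of freedom plus one) rather than refitting the model:

```r
# Reconstructing Cox & Snell and Nagelkerke R-Squared from the reported
# deviances (a sketch; figures are taken from the summary output above)
null_dev  <- 142.301   # null deviance
resid_dev <- 84.197    # residual deviance
n         <- 145       # observations (144 null df + 1)

# Cox & Snell R^2, expressed in terms of deviances
cox_snell  <- 1 - exp((resid_dev - null_dev) / n)
# Nagelkerke rescales Cox & Snell to a maximum of 1
nagelkerke <- cox_snell / (1 - exp(-null_dev / n))

round(cox_snell, 4)    # 0.3302, matching PseudoR2()
round(nagelkerke, 4)   # 0.5281
```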

AIC differs from both the Nagelkerke R-Squared value and the traditional R-Squared statistic in that it does not measure how well the current model explains the observed data, but instead seeks to estimate model accuracy as applied to new observational data. R-Squared measures training error, while AIC acts as an estimate of the test error, thus accounting for bias and variance.

As was mentioned in the prior article pertaining to Logistic Regression, when measuring the strength of model predictability, the Nagelkerke R-Squared value is the most easily interpretable.

The reason which necessitates the discussion of the Akaike Information Criterion is its use as the criterion by which the stepAIC() function determines the optimal model. As it concerns interpretability, the smaller the AIC value, the better the model is assumed to perform when applied to new observational sets.
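For a glm fit, the AIC works out to the residual deviance plus twice the number of estimated coefficients. A quick sketch, checked against the figures from the summary output above:

```r
# AIC = residual deviance + 2k, where k is the number of estimated
# coefficients (here the intercept plus VarA through VarF, so k = 7)
resid_dev <- 84.197
k <- 7
aic <- resid_dev + 2 * k
aic   # 98.197, matching the AIC reported by summary(bimodel)
```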

Let us now apply the stepAIC() function to our linear model and observe the results.

**# With the “MASS” package downloaded and enabled #**

stepAIC(bimodel)


**This produces the output:**

**# Console Output #**

```
Start:  AIC=98.2
Outcome ~ VarA + VarB + VarC + VarD + VarE + VarF

       Df Deviance    AIC
- VarA  1   84.197 96.197
- VarF  1   84.414 96.414
- VarE  1   84.479 96.479
- VarC  1   84.891 96.891
- VarD  1   85.290 97.290
- VarB  1   86.022 98.022
<none>      84.197 98.197

Step:  AIC=96.2
Outcome ~ VarB + VarC + VarD + VarE + VarF

       Df Deviance     AIC
- VarF  1   84.414  94.414
- VarE  1   84.479  94.479
- VarC  1   84.891  94.891
- VarD  1   85.290  95.290
<none>      84.197  96.197
- VarB  1   96.542 106.542

Step:  AIC=94.41
Outcome ~ VarB + VarC + VarD + VarE

       Df Deviance     AIC
- VarE  1   84.677  92.677
- VarC  1   84.999  92.999
- VarD  1   85.586  93.586
<none>      84.414  94.414
- VarB  1   96.757 104.757

Step:  AIC=92.68
Outcome ~ VarB + VarC + VarD

       Df Deviance     AIC
- VarC  1   85.485  91.485
- VarD  1   85.742  91.742
<none>      84.677  92.677
- VarB  1  132.815 138.815

Step:  AIC=91.49
Outcome ~ VarB + VarD

       Df Deviance     AIC
- VarD  1   86.557  90.557
<none>      85.485  91.485
- VarB  1  139.073 143.073

Step:  AIC=90.56
Outcome ~ VarB

       Df Deviance     AIC
<none>      86.557  90.557
- VarB  1  142.301 144.301

Call:  glm(formula = Outcome ~ VarB, family = binomial, data = DataFrameA)

Coefficients:
(Intercept)         VarB
     -20.57        20.34

Degrees of Freedom: 144 Total (i.e. Null);  143 Residual
Null Deviance:      142.3
Residual Deviance:  86.56     AIC: 90.56
```
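Note that stepAIC() returns the final fitted model, so the result can be captured directly rather than re-typed. The sketch below demonstrates this on a small simulated data frame (a hypothetical stand-in for BiTestData.csv, for illustration only), with the step-by-step trace suppressed via trace = FALSE:

```r
library(MASS)

set.seed(1)
# Hypothetical stand-in for DataFrameA, for illustration only
d <- data.frame(Outcome = rbinom(100, 1, 0.5),
                VarB = rnorm(100),
                VarC = rnorm(100))

# Full model, then backward elimination driven by AIC
m    <- glm(Outcome ~ ., family = binomial, data = d)
best <- stepAIC(m, direction = "backward", trace = FALSE)

formula(best)   # the reduced model retained by stepAIC()
```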


As illustrated, the ideal model that the stepAIC() function suggests is:

**bimodel <- glm(Outcome ~ VarB, family=binomial, data=DataFrameA)**

summary(bimodel)

**# Console Output #**

```
Call:
glm(formula = Outcome ~ VarB, family = binomial, data = DataFrameA)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.08424  -0.00005  -0.00005  -0.00005   1.27352

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -20.57    1957.99  -0.011    0.992
VarB           20.34    1957.99   0.010    0.992

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 142.301  on 144  degrees of freedom
Residual deviance:  86.557  on 143  degrees of freedom
AIC: 90.557

Number of Fisher Scoring iterations: 19
```


Now let’s measure the model’s predictive capacity.

**# Generate Nagelkerke R Squared #**

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(bimodel)

**# Console Output #**

```
       McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
      0.3917303        0.3495661        0.3191667        0.5104969
McKelvey.Zavoina           Effron            Count        Adj.Count
      0.9686596        0.3114910               NA               NA
            AIC    Corrected.AIC
     90.5571588       90.6416659
```

As you can observe from the information presented above, the Nagelkerke value (0.51) has been lowered slightly. However, the AIC value (90.56) has fallen by a much more substantial amount. This should be viewed as a positive occurrence: the lower the AIC value, the better the model is expected to account for new observational data. The slight decline in the Nagelkerke value is significantly offset by the large decline in AIC. Therefore, we can conclude that, given the independent variables present within the data set, the model below contains the optimal structure:

**bimodel <- glm(Outcome ~ VarB, family=binomial, data=DataFrameA)**

For more information pertaining to The Akaike Information Criterion (AIC):

https://en.wikipedia.org/wiki/Akaike_information_criterion

For more information pertaining to the Akaike Information Criterion and the R-Squared statistic as quantifiable measurements:

https://stats.stackexchange.com/questions/140965/when-aic-and-adjusted-r2-lead-to-different-conclusions

That’s all for now, Data Heads! Stay subscribed for more substantive concepts.
