Today we will be discussing an advanced, but nonetheless useful, topic: logistic regression. Before delving into this subject, I would advise you, the reader, to review the articles pertaining to linear regression if you are not already familiar with that method.

Logistic regression is a method similar to linear regression; however, it is utilized specifically to create models in which the dependent variable is binary in nature (True/False, Yes/No, Positive/Negative, 1/0, etc.).

Let’s begin with a simple example. I have previously entered some sample data into SPSS.

As we discussed in a previous article, you should edit the variable labels so that they correspond with the appropriate binary outcomes.

Select **"Analyze"** from the top drop-down menu, then select **"Regression"**. From the next menu, select **"Binary Logistic"**.

The next screen presents us with the options necessary to structure the equation utilized to create the model. The logistic regression model is structured in a way that is similar to the linear regression model. In the case of our example, **“Cancer”** will be our dependent variable. **“Age”**, **“Obese”**, and **“Smoking”** will be the model’s independent variables.

**NOTE: In this example, our categorical variables are binary, meaning that they can only represent two values (0, 1). If we were working with sample data that contained categorical variables that were not binary, we would have to specify this in the following screen. This menu can be toggled by clicking on the “Categorical” option.**

From the options menu, select **“Hosmer-Lemeshow goodness-of-fit”**. This test will be useful in interpreting the model results.

Selecting **“Save”** from the menu options will present you with this interface. Check **“Probabilities”** and **“Group Membership”** beneath the **“Predicted Values”** menu header. Enabling these options will output probability data directly to our SPSS spreadsheet.

So what does all of the generated output indicate?

For the most part, you can ignore all of the data listed in the screen below:


__Block 0__ *(Ignore Block 0)*

__Block 1__

Starting with Block 1 to perform the analysis, please first observe the portion of this output which reads


**“Omnibus Tests of Model Coefficients”**. We will primarily be concerned with the bottom-most line, which coincides with the row labeled **"Model"**. The rightmost column entry, **"Sig."**, indicates the significance level of the model. Depending on the pre-determined confidence level which has been established, this value will determine whether or not the model is significant. **"df"** represents the test model's degrees of freedom, and the **"Chi-square"** column contains the corresponding test statistic. All three values, in addition to the number of observations within the model, are utilized in tandem when documenting the model's significance.

**“-2 Log Likelihood”, “Cox & Snell R Square”, “Nagelkerke R Square”** – All three columns correspond with outputs generated from methodologies which test the strength of the model.

__Cox & Snell R Square__ – This value can reach a maximum of .75. Therefore, I believe that it is best to consult the “Nagelkerke R Square” value when assessing the overall model.

__Nagelkerke R Square__ – This value can reach a maximum of 1. Thus, it is the measurement that best resembles the coefficient of determination. I would recommend referring to this value when considering the strength of the predictive model.

__The “Hosmer and Lemeshow Test”__ – Only the significance of this test is valuable. A significance value lower than .05 indicates that the model does not fit the data well, in which case additional testing should be performed. Since this test utilizes a Chi-Squared distribution, it is sensitive to sample size; the test is considered most accurate when the sample size is greater than 400.

__“Contingency Table for Hosmer and Lemeshow Test”__ – This can be ignored.

__“Classification Table”__ – This output illustrates the number of positive and negative cases predicted by the model alongside the number of actual cases that occurred within the sample. For our model, 5 of the 7 actual non-occurrences of cancer were predicted correctly, and 2 were missed. Thus, the percentage of correct estimates for non-occurrences is 71.4% (5/(5+2)). Likewise, 6 of the 8 actual occurrences of cancer were predicted correctly, and 2 were missed, giving 75% correct (6/(6+2)). The total percentage of accurate estimations provided by the model is 73.3% ((6+5)/(6+2+5+2)).
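The percentages above are simply row-wise accuracy figures computed from the 2×2 classification table; a quick Python cross-check, using the counts described above:

```python
# counts from the classification table: observed non-occurrences (5 correct, 2 missed)
# and observed occurrences (6 correct, 2 missed)
tn, fp = 5, 2
tp, fn = 6, 2

pct_non_occurrence = tn / (tn + fp) * 100            # correct among observed negatives
pct_occurrence = tp / (tp + fn) * 100                # correct among observed positives
pct_total = (tn + tp) / (tn + fp + tp + fn) * 100    # overall accuracy

print(round(pct_non_occurrence, 1), round(pct_occurrence, 1), round(pct_total, 1))
# 71.4 75.0 73.3
```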

__“Variables in the Equation”__ – The first column lists the model variables. **“Age”**, **“Obese”**, and **“Smoking”** are the model’s independent variables, and **“Constant”** is the intercept term, which is included as an aspect of the model.

So, if we were to construct the model as an equation, it would resemble:

**Logit(p) = (Age * .030) + (Obese * -.389) + (Smoking * 2.544) – 2.344**

**S.E.** – represents the standard error of each variable within the context of the model.

**Wald** – represents the value of the Wald test statistic. You can ignore this value.

**df** – indicates the degrees of freedom for each variable.

**Sig.** – indicates the significance of each variable within the context of the model. An insignificant variable value (typically > .05) does not necessarily indicate that the variable itself should be excluded from subsequent model generation.

__Exp(B) or “Odds Ratio”__ – This indicates the factor by which the odds of a positive outcome will increase if the corresponding variable is increased by one. This warrants further explanation, as logistic regression models utilize the logit (log-odds) function to generate results.

Returning to our original model equation:

**Logit(p) = (Age * .030) + (Obese * -.389) + (Smoking * 2.544) – 2.344**

Let’s assume that we wish to test our model on a 19 year old, who is not obese, and who smokes. The equation would resemble:

**Logit(p) = (19 * .030) + (0 * -.389) + (1 * 2.544) – 2.344**

Logit(p) = .77

But Logit(p) does not equal a probability. It is the logit (log-odds) value of .77.

To generate the associated probability of this value, you can enter the following information into the R console:

**plogis(.77)**

Which produces the console output:

*[1] 0.6835209*
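For readers without R at hand, the inverse-logit (sigmoid) transformation performed by plogis() is easy to reproduce by hand; a minimal Python sketch:

```python
import math

def inverse_logit(x):
    # converts a log-odds (logit) value into a probability
    return 1.0 / (1.0 + math.exp(-x))

print(round(inverse_logit(0.77), 7))  # 0.6835209
```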

Additionally, you could also perform this calculation within “R” by utilizing the code:

a <- 19

b <- 0

c <- 1

p <- (.030 * a) + (-.389 * b) + (2.544 * c) - 2.344

plogis(p)


Which produces the console output:

*[1] 0.6835209*

In both cases, the output value is the probability of a positive occurrence. In the case of our scenario, this would equate to a cancer diagnosis. The numbers from our model are unrealistic, but this is only due to the unrealistic sample data that I hastily created.

Returning to the odds ratio.

The value given in the column Exp(B) is the exponentiated value of the coefficient given in column B. What this value indicates is the factor by which the odds of a positive outcome increase, within the context of the model, if the value of the corresponding variable is increased by one. To express this as a percentage change in the odds, the value must first be transformed.

Within the context of the model, if an individual being assessed for cancer were to age one year, the odds of cancer would increase by:

(1.031 – 1) * 100 = 3.1%
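The transformation above is just “subtract one, multiply by one hundred”; using the rounded Exp(B) value reported for Age, a short Python illustration:

```python
exp_b_age = 1.031  # Exp(B) for "Age", as reported in the SPSS output (rounded)

# percentage increase in the odds of a positive outcome
# for a one-unit increase in Age
pct_increase = (exp_b_age - 1) * 100
print(round(pct_increase, 1))  # 3.1
```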

__Data Sheet Output__

Previously, during this exercise, I asked that you select **“Save”** from the menu options, and then subsequently check **“Probabilities”** and **“Group Membership”**. In doing so, we informed SPSS to output data directly to our datasheet. The results are listed below.

We are presented with two new variables:


**“PRE_1”** and **“PGR_1”**. **“PRE_1”** represents the probability (.00 – 1.00) of a positive event occurring when the independent variable data found in the leftmost columns is input into the model. **“PGR_1”** represents the model’s predicted group membership (1 or 0) given that same data. This output is needed to create a ROC curve. ROC curves will be covered in a later article.
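To make the two columns concrete, here is a sketch in Python of how they are derived from our model’s rounded coefficients; the .5 classification cutoff is SPSS’s default, and the function name is my own:

```python
import math

# rounded coefficients from the "Variables in the Equation" table
B_AGE, B_OBESE, B_SMOKING, CONSTANT = 0.030, -0.389, 2.544, -2.344

def predict(age, obese, smoking, cutoff=0.5):
    logit = B_AGE * age + B_OBESE * obese + B_SMOKING * smoking + CONSTANT
    pre_1 = 1.0 / (1.0 + math.exp(-logit))  # predicted probability (PRE_1)
    pgr_1 = 1 if pre_1 >= cutoff else 0     # predicted group membership (PGR_1)
    return pre_1, pgr_1

pre, pgr = predict(19, 0, 1)  # the 19-year-old, non-obese smoker from earlier
print(round(pre, 4), pgr)  # 0.6835 1
```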

__Model Creation within R__I will now briefly illustrate how to obtain the same results in R.

**# Model Creation #**

Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)

Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0)

Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1)

Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0)

CancerModel <- data.frame(Age, Obese, Smoking, Cancer)


**# Analyze the Significance of the Model #**

CancerModelLog <- glm(Cancer ~ Age + Obese + Smoking, family=binomial)

CancerModelLog1 <- glm(Cancer ~ 1, family=binomial)

anova(CancerModelLog, CancerModelLog1, test="LRT")


# Console Output #

Analysis of Deviance Table

*Model 1: Cancer ~ Age + Obese + Smoking*

Model 2: Cancer ~ 1

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 11 16.807

2 14 20.728 -3 -3.9209 0.2701

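The reported p-value can be verified by hand: the likelihood-ratio test statistic is the difference in residual deviance between the two models, referred to a Chi-squared distribution on 3 degrees of freedom. A Python check, using the rounded deviances from the output above and the closed-form Chi-squared tail probability that holds for df = 3:

```python
import math

dev_null, dev_full = 20.728, 16.807  # residual deviances of the two models
chi_sq = dev_null - dev_full         # likelihood-ratio test statistic (~3.921)

# survival function of a Chi-squared distribution with df = 3 (closed form)
p_value = math.erfc(math.sqrt(chi_sq / 2)) + \
          math.sqrt(2 * chi_sq / math.pi) * math.exp(-chi_sq / 2)
print(round(p_value, 4))  # 0.2701
```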

**# Summary Creation and Output #**

CancerModelLog <- glm(Cancer~ Age + Obese + Smoking, family=binomial)

summary(CancerModelLog)


# Output #

*Call:*

*glm(formula = Cancer ~ Age + Obese + Smoking, family = binomial)*

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6096 -0.7471 0.5980 0.8260 1.8485

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.34431 2.25748 -1.038 0.2991

Age 0.02984 0.04055 0.736 0.4617

Obese -0.38924 1.39132 -0.280 0.7797

Smoking 2.54387 1.53564 1.657 0.0976 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 20.728 on 14 degrees of freedom

Residual deviance: 16.807 on 11 degrees of freedom

AIC: 24.807

Number of Fisher Scoring iterations: 4


**# Generate Nagelkerke R Squared #**

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(CancerModelLog)


# Console Output #

McFadden Adj.McFadden Cox.Snell Nagelkerke McKelvey.Zavoina Effron Count

0.2328838 -0.2495624 0.2751639 0.3674311 0.3477522 0.3042371 0.8000000

Adj.Count AIC Corrected.AIC

0.5714286 23.9005542 27.9005542

**# Calculate Exp(B) #**

# (Intercept) #

exp(-2.34431)

# Age #

exp(0.02984)

# Obese #

exp(-0.38924)

# Smoking #

exp(2.54387)

# Output #

[1] 0.09591336

[1] 1.03029

[1] 0.6775716

[1] 12.72884

**# Utilize vif() and cor() to check for collinearity amongst the independent variables #**

**# Generate Correlation Matrix #**

correlationmatrix <- cor(CancerModel)

correlationmatrix

**# Output #**

*Age Obese Smoking Cancer*

Age 1.00000000 0.1231756 -0.2836428 0.02147097

Obese 0.12317556 1.0000000 0.4642857 0.19642857

Smoking -0.28364280 0.4642857 1.0000000 0.46428571

Cancer 0.02147097 0.1964286 0.4642857 1.00000000

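As a cross-check of the matrix above, the Pearson correlation between any two columns can be computed by hand; a self-contained Python version using the same sample data (verifying the Obese/Smoking entry):

```python
import math

def pearson(x, y):
    # Pearson product-moment correlation coefficient
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

Obese   = [1,0,0,0,1,1,0,1,1,0,1,1,0,1,0]
Smoking = [1,0,0,1,1,1,0,0,1,0,0,1,0,1,1]
print(round(pearson(Obese, Smoking), 7))  # 0.4642857
```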

**# Generate Variance Influence Factor #**

**# Download and Enable Package: "car" #**

vif(CancerModelLog)

# Output #

*Age Obese Smoking*

1.286936 1.361140 1.654232

