Reflections of a Data Scientist: (R) Logistic Regression Analysis (Non-Binary Categorical Variables) (SPSS)

In a previous article, we covered how to analyze data with the logistic regression model. In that example, every categorical variable was conveniently binary. In this example, I will demonstrate how to use the logistic regression model when a categorical variable contains more than two categories.

We will again refer to the example data set below. I have added a variable that specifies the “Race” of each individual surveyed.

With category labels enabled, the data resembles:

Performing the analysis is similar to the prior example, except that, in this case, we will select the “Categorical” option and specify that “Race” is a categorical covariate.

Moving forward with our analysis, we will receive the following as a portion of our output:

“Race” has been split into 4 separate indicator variables, while “Race” also remains as a single variable so that it can be evaluated as a whole.

Race(1) refers to the “Race” category: “White”.

Race(2) refers to the “Race” category: “African American”.

Race(3) refers to the “Race” category: “Asian”.

Race(4) refers to the “Race” category: “Indian”.

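The “Race” category “Native American” is still accounted for within the context of the model. It serves as the reference category: an individual in this category has a value of 0 for all four “Race” variables, so its effect is carried by the constant, and each of the other “Race” coefficients is measured relative to it.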
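
To make the coding concrete, here is a minimal sketch in R of how a five-level category becomes four 0/1 columns. The factor below is a hypothetical one-row-per-category illustration, not the survey data, and choosing “Native American” as the reference level is an assumption made to mirror the output above.

```r
# Hypothetical factor with one row per "Race" category (illustration only)
Race <- factor(c("White", "African American", "Asian", "Indian", "Native American"))

# Assumption: use "Native American" as the reference category, as in the output above
Race <- relevel(Race, ref = "Native American")

# model.matrix() builds the intercept plus one indicator column per non-reference level
Design <- model.matrix(~ Race)
print(Design)
```

The row for “Native American” contains a 1 in the intercept column and 0 in every indicator column, which is why that category never appears as its own term in the equation.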

In this example case, our equation would resemble:

Logit(p) = -2.13055 + (Age * 0.03335) + (Obese * -0.56859) + (Smoking * 3.02867) + (White * -1.10077) + (African_American * -1.05379) + (Asian * -1.22213) + (Indian * 0.69143)

So, if we wanted to test our model probability for an individual who was:

55 Years of Age

Obese

A Smoker

White

The equation would resemble:

Logit(p) = -2.13055 + (55 * 0.03335) + (1 * -0.56859) + (1 * 3.02867) + (1 * -1.10077) + (0 * -1.05379) + (0 * -1.22213) + (0 * 0.69143)

So our logit(p) value would be: 1.06301

Applying the inverse logit, p = 1 / (1 + e^(-logit(p))), this corresponds to a probability of a positive outcome of: 0.7432653
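
The conversion from logit(p) to a probability can be checked with a short sketch in R. The coefficients are copied from the equation above; the variable names are my own and are only for illustration.

```r
# Linear predictor for a 55 year old, obese, smoking, White individual
# (coefficients copied from the equation above)
LogitP <- -2.13055 + (55 * 0.03335) + (1 * -0.56859) + (1 * 3.02867) + (1 * -1.10077)

# Inverse logit written out by hand
ProbManual <- 1 / (1 + exp(-LogitP))

# plogis() is the base R logistic distribution function and gives the same value
ProbPlogis <- plogis(LogitP)

print(LogitP)
print(ProbManual)
print(ProbPlogis)
```

Both approaches should agree with the values reported above (1.06301 and roughly 0.7432653).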

Additionally, if our model was tested for an individual who was:

26 Years of Age

Not Obese

A Smoker

Native American

Our equation would resemble:

Logit(p) = -2.13055 + (26 * 0.03335) + (0 * -0.56859) + (1 * 3.02867) + (0 * -1.10077) + (0 * -1.05379) + (0 * -1.22213) + (0 * 0.69143)

Logit(p) would equal: 1.76522

This corresponds to a probability of a positive outcome of: 0.8538622

You can test this model in R with the following code:

# Model Test Code #

Age <- 0

Obese <- 0

Smoking <- 0

White <- 0

African_American <- 0

Asian <- 0

Indian <- 0

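# Linear predictor: this value is logit(p), not yet a probability #

LogitP <- -2.13055 + (Age * 0.03335) + (Obese * -0.56859) + (Smoking * 3.02867) + (White * -1.10077) + (African_American * -1.05379) + (Asian * -1.22213) + (Indian * 0.69143)

# Convert logit(p) to a probability #

plogis(LogitP)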
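
If you would rather not edit each value by hand, the same check can be wrapped in a small function. This is only a sketch: the function name PredictProb and the vector name Coefs are hypothetical, and the coefficients are copied from the equation above. Every “Race” argument defaults to 0, which corresponds to the reference category “Native American”.

```r
# Coefficients copied from the equation above
Coefs <- c(Intercept = -2.13055, Age = 0.03335, Obese = -0.56859, Smoking = 3.02867,
           White = -1.10077, African_American = -1.05379, Asian = -1.22213, Indian = 0.69143)

# Hypothetical helper: returns the model probability for one individual
PredictProb <- function(Age, Obese, Smoking,
                        White = 0, African_American = 0, Asian = 0, Indian = 0) {
  # The leading 1 multiplies the intercept
  x <- c(1, Age, Obese, Smoking, White, African_American, Asian, Indian)
  plogis(sum(Coefs * x))
}

# The two individuals worked through above
ProbFirst  <- PredictProb(Age = 55, Obese = 1, Smoking = 1, White = 1)
ProbSecond <- PredictProb(Age = 26, Obese = 0, Smoking = 1)

print(ProbFirst)   # approximately 0.7432653
print(ProbSecond)  # approximately 0.8538622
```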

Here is how you would create the same model in R:

# Non-Binary Categorical Variables #

Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)

Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0)

Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1)

Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0)

White <- c(1,1,1,0,0,0,0,0,0,0,0,0,0,0,0)

African_American <- c(0,0,0,1,1,1,0,0,0,0,0,0,0,0,0)

Asian <- c(0,0,0,0,0,0,1,1,1,0,0,0,0,0,0)

Indian <- c(0,0,0,0,0,0,0,0,0,1,1,1,0,0,0)

Native_American <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,1,1)

CancerModelII <- data.frame(Age, Obese, Smoking, Cancer, White, African_American, Asian, Indian, Native_American )

CancerModelLogII <- glm(Cancer~ Age + Obese + Smoking + White + African_American + Asian + Indian + Native_American, family=binomial)

summary(CancerModelLogII)

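# Which produces the output below. Native_American is reported as NA because the five race indicators always sum to 1, so R drops the last one and treats it as the reference category #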

Call:
glm(formula = Cancer ~ Age + Obese + Smoking + White + African_American +
    Asian + Indian + Native_American, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9613  -0.7252   0.4240   0.8107   1.7092

Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      -2.13055    2.58207  -0.825    0.409
Age               0.03335    0.04641   0.719    0.472
Obese            -0.56859    1.60680  -0.354    0.723
Smoking           3.02867    1.95858   1.546    0.122
White            -1.10077    2.35673  -0.467    0.640
African_American -1.05379    2.18843  -0.482    0.630
Asian            -1.22213    2.40838  -0.507    0.612
Indian            0.69143    2.51153   0.275    0.783
Native_American        NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20.728  on 14  degrees of freedom
Residual deviance: 15.366  on  7  degrees of freedom
AIC: 31.366

Number of Fisher Scoring iterations: 4
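
An alternative that avoids building the indicator columns by hand is to store race as a single factor and let glm() create them. The sketch below uses the same 15 observations; the names CancerData, CancerModelFactor and NewPerson are hypothetical, and setting Native_American as the reference level is an assumption made so that the coefficients line up with the output above.

```r
# Same 15 observations as above, with race stored as one factor column
CancerData <- data.frame(
  Age     = c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26),
  Obese   = c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0),
  Smoking = c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1),
  Cancer  = c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0),
  Race    = factor(rep(c("White", "African_American", "Asian", "Indian", "Native_American"), each = 3))
)

# Assumption: use Native_American as the reference category, matching the output above
CancerData$Race <- relevel(CancerData$Race, ref = "Native_American")

# glm() expands Race into one indicator per non-reference level
CancerModelFactor <- glm(Cancer ~ Age + Obese + Smoking + Race, family = binomial, data = CancerData)
print(coef(CancerModelFactor))

# Predicted probability for the 55 year old, obese, smoking, White individual
NewPerson <- data.frame(Age = 55, Obese = 1, Smoking = 1,
                        Race = factor("White", levels = levels(CancerData$Race)))
PredictedProb <- predict(CancerModelFactor, newdata = NewPerson, type = "response")
print(PredictedProb)
```

Because this is the same model expressed with a different coding of the same columns, the coefficients should match the estimates above (labelled RaceWhite, RaceAfrican_American, and so on), with no NA row.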

Saturday, January 13, 2018
