Friday, May 4, 2018

(R) Logistic Ordinal Regression (SPSS)

As if multinomial logistic regression was not confusing enough, today we will be discussing an even more vexing variation of the logistic regression model. The model which will be the topic of this article is: Logistic Ordinal Regression.

The Logistic Ordinal Regression Model is very similar in concept to the multinomial logistic regression model. Both models are utilized to assess categorical outcome variables through logistic methodologies. The primary differentiation between the two models pertains to model synthesis and output: the ordinal model treats the outcome categories as possessing a natural order, estimating a single set of coefficients alongside a threshold for each category boundary, whereas the multinomial model estimates a separate set of coefficients for each non-reference category. If you have an intimate understanding of these differences, selecting the appropriate model should be a relatively simple task. However, if you do not have significant insight as to what aspects specifically differ between each model, I would suggest utilizing the multinomial logistic regression model. My reason for this suggestion pertains to the ease of use and interpretation which is present within the multinomial logistic regression model. Even with an understanding of the logistic ordinal regression model, the output produced from this method of analysis can be difficult to fully decipher into useful conclusions.
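
To make this distinction more concrete, below is a brief illustrative sketch in R. It is not part of the SPSS walkthrough which follows, and it assumes that the packages “nnet” and “MASS” are installed, and that the “testset” data frame constructed in the R example later in this article has been created.

library(nnet) # provides multinom() #
library(MASS) # provides polr() #

# Multinomial: one set of coefficients for each non-reference outcome category #
multi <- multinom(color ~ gender + smoker + car, data = testset)
coef(multi)

# Ordinal: a single set of coefficients plus ordered threshold (intercept) values #
ordin <- polr(color ~ gender + smoker + car, data = testset)
coef(ordin)
ordin$zeta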

Example (SPSS):

In this demonstration, we will assume that you are attempting to predict an individual’s favorite color based on other aspects of their individuality.

We will begin with a familiar data set:


To this data set, we will assign value labels, producing the following data interface:


To begin our analysis, we must select, from the topmost menu, “Analyze”, then “Regression”, followed by “Ordinal”.


This sequence of actions should cause the following menu to appear:


Using the topmost arrow button, assign “Color” as the “Dependent” variable. Once this has been completed, utilize the center arrow button to assign the remaining variables (“Gender”, “Smoker”, “Car”) as “Factor(s)”.

Next, click on the button labeled “Output”. This should populate the following menu:


From this interface, beneath the header labeled “Saved Variables”, select the following options: “Predicted Category” and “Estimated Response Probability”.

Once this has been completed, click the button labeled “Continue”, then select the button labeled “OK”.

This should produce a voluminous output; however, we will only concern ourselves with the following output aspects:


As was described in a prior article, Pseudo R-Squared methods are utilized to measure model fit when the traditional coefficient of determination (r-squared) is inapplicable. In the case of our example, we will be utilizing the “Nagelkerke” value, which can be assessed on a scale similar to the traditional r-squared metric. Since this model’s Nagelkerke value is .935, we can assume that the model functions as a decent predictor for our dependent variable.
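
For reference, the Nagelkerke statistic is derived from a comparison of the fitted model’s log-likelihood with that of an intercept-only model. The sketch below is not part of the SPSS output; it assumes the “MASS” package and the “testset” data frame from the R example later in this article, and it demonstrates how the value could be computed by hand.

library(MASS)

full <- polr(color ~ gender + smoker + car, data = testset)
null <- polr(color ~ 1, data = testset)

ll.full <- as.numeric(logLik(full))
ll.null <- as.numeric(logLik(null))
n <- nrow(testset)

# Cox & Snell pseudo r-squared #
r2.cs <- 1 - exp((2 / n) * (ll.null - ll.full))

# Nagelkerke pseudo r-squared (Cox & Snell rescaled to a maximum of 1) #
r2.nagelkerke <- r2.cs / (1 - exp((2 / n) * ll.null))

r2.nagelkerke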

The above output provides us with the internal aspects of the model’s synthesis. Though this may appear daunting at first, the information illustrated in the chart is interpreted in much the same manner as the output generated for a typical linear model.


Unlike with the multinomial logistic regression model, generating probabilities and model equations from this output is a bit more complicated.

To solve for a particular combination of variable observations, we must first construct the model’s linear predictor from the coefficient estimates above:

Linear Predictor = (Gender:00 * 19.226) + (Smoker:00 * 35.644) + (Car:1 * 17.313) + (Car:2 * 15.911)

Each cumulative logit is then equal to the corresponding threshold value minus this linear predictor, and the categorical probabilities follow from applying the logistic function to each cumulative logit.

To solve for the categorical probabilities of each combination, we must use the following R code. In this example, the indicator values have been set to represent an individual with Gender = 1, Smoker = 0, and Car = 2, which is the combination that generates the output displayed further below.

# Indicator for Gender: 00 #

a <- 0

# Indicator for Smoker: 00 #

b <- 1

# Indicator for Car: 1 #

c <- 0

# Indicator for Car: 2 #

d <- 1

color <- (a * 19.226) + (35.644 * b) + (17.313 * c) + (15.911 * d)

# Cumulative Logit Values #

color1 <- 51.944 - color # 51.944 = color1 threshold value #

color2 <- 52.99 - color # 52.99 = color2 threshold value #

color3 <- 53.97 - color # 53.97 = color3 threshold value #

# Cumulative Odds #

cumoddscol1 <- exp(color1)

cumoddscol2 <- exp(color2)

cumoddscol3 <- exp(color3)

# Cumulative Proportion (proportion above each category threshold) #

cumpropcol1 <- (1/(1 + cumoddscol1))

cumpropcol2 <- (1/(1 + cumoddscol2))

cumpropcol3 <- (1/(1 + cumoddscol3))

# Category Probability #

catprob1 <- (1- cumpropcol1)

catprob2 <- (cumpropcol1 - cumpropcol2)

catprob3 <- (cumpropcol2 - cumpropcol3)

catprob4 <- (1- (catprob1 + catprob2 + catprob3))

# Probability of variable categorical combinations = color:1 #

catprob1

# Probability of variable categorical combinations = color:2 #

catprob2

# Probability of variable categorical combinations = color:3 #

catprob3

# Probability of variable categorical combinations = color:4 #

catprob4 

This produces the output:

> # Probability of variable categorical combinations = color:1 #
>
> catprob1
[1] 0.5960419
>
> # Probability of variable categorical combinations = color:2 #
>
> catprob2
[1] 0.2116372
>
> # Probability of variable categorical combinations = color:3 #
>
> catprob3
[1] 0.1102848
>
> # Probability of variable categorical combinations = color:4 #
>
> catprob4
[1] 0.082036


SPSS computes these predicted values for you. As a result of the “Saved Variables” options selected earlier, the predicted category values are output as new columns within the original data set.
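
The equivalent step can also be performed within R once a model has been fitted. The sketch below assumes the polr model object (“output”) and the data frame (“testset”) which are created in the example that follows.

# Predicted category for each original observation #
testset$pred.category <- predict(output, type = "class")

# Estimated response probabilities for each original observation #
pred.probs <- predict(output, type = "probs")

testset$pred.category

pred.probs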


Example (R): 


# (With the package: "MASS", downloaded and enabled) #

library(MASS)

# Logistic Ordinal Regression # 

# Input vector data #

color <- c(3.00, 4.00, 1.00, 4.00, 4.00, 4.00, 1.00, 1.00, 1.00, 2.00)

gender <- c(.00, .00, 1.00, .00, .00, 1.00, 1.00, 1.00, .00, 1.00)

smoker <- c(.00, .00, 1.00, .00, .00, .00, .00, .00, 1.00, .00)

car <- c(3.00, 3.00, 2.00, 2.00, 3.00, 1.00, 2.00, 1.00, 2.00, 2.00)

# Set vectors to factor type #

color <- as.factor(color)

gender <- as.factor(gender)

smoker <- as.factor(smoker)

car <- as.factor(car)

# Create data frame #

testset <- data.frame(color, gender, smoker, car)

# Create model #

output <- polr(color ~ gender + smoker + car, data=testset)

# Generate output #

summary(output)


This generates the output:

Call:
polr(formula = color ~ gender + smoker + car, data = testset)

Coefficients:
          Value Std. Error t value
gender1 -33.811     29.830 -1.1335
smoker1 -42.984     89.457 -0.4805
car2     -1.403      2.095 -0.6695
car3    -31.899     29.848 -1.0687

Intercepts:
       Value Std. Error t value
1|2 -34.8240    29.8352 -1.1672
2|3 -33.7777    29.8385 -1.1320
3|4 -32.7980    29.8381 -1.0992

Residual Deviance: 14.50387
AIC: 28.50387

From this output, we can create the following code to test variable combinations:

# Indicator for gender1 (Gender: 1) #

a <- 0

# Indicator for smoker1 (Smoker: 1) #

b <- 0

# Indicator for car2 (Car: 2) #

c <- 0

# Indicator for car3 (Car: 3) #

d <- 0

color <- (a * -33.811) + (-42.984 * b) + (-1.403 * c) + (-31.899 * d)

To generate categorical probabilities, we will use a slightly modified version of the prior code which was utilized for our SPSS example.

# Cumulative Logit Values #

color1 <- -34.8240 - color # -34.8240 = color1 threshold value #

color2 <- -33.7777 - color # -33.7777 = color2 threshold value #

color3 <- -32.7980 - color # -32.7980 = color3 threshold value #

# Cumulative Odds #

cumoddscol1 <- exp(color1)

cumoddscol2 <- exp(color2)

cumoddscol3 <- exp(color3)

# Cumulative Proportion (proportion above each category threshold) #

cumpropcol1 <- (1/(1 + cumoddscol1))

cumpropcol2 <- (1/(1 + cumoddscol2))

cumpropcol3 <- (1/(1 + cumoddscol3))

# Category Probability #

catprob1 <- (1 - cumpropcol1)

catprob2 <- (cumpropcol1 - cumpropcol2)

catprob3 <- (cumpropcol2 - cumpropcol3)

catprob4 <- (1- (catprob1 + catprob2 + catprob3))

# Probability of variable categorical combinations = color:1 #

catprob1

# Probability of variable categorical combinations = color:2 #

catprob2

# Probability of variable categorical combinations = color:3 #

catprob3

# Probability of variable categorical combinations = color:4 #

catprob4
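
As a quick cross-check, which is not part of the original walkthrough, the fitted polr object can also return these category probabilities directly through the predict() function. The combination coded above, with every indicator set to zero, corresponds to the reference level of each factor (gender = 0, smoker = 0, car = 1). Any small discrepancies relative to the hand calculation stem from the rounded coefficient values transcribed from the summary output.

newobs <- data.frame(gender = factor(0, levels = levels(testset$gender)),
                     smoker = factor(0, levels = levels(testset$smoker)),
                     car = factor(1, levels = levels(testset$car)))

predict(output, newdata = newobs, type = "probs")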

NOTE: The model’s internal aspects differ depending on the platform which was utilized to generate the analysis. Though the model predictions do not differ, I would recommend, if publishing findings, utilizing SPSS in lieu of R. The reason for this recommendation pertains to the auditing record which SPSS possesses. If data output possesses abnormalities, R, being open source, cannot be held to account. Additionally, as the ordinal regression function within R (polr) exists as an aspect of an external package (“MASS”), platform computational errors could have a greater likelihood of occurrence. *

* - A similar warning was issued as it pertains to the multinomial regression model.
