Reflections of a Data Scientist: (R) Loglinear Analysis (SPSS)

Today we will be discussing a difficult concept: loglinear analysis. Loglinear analysis is similar to both logistic regression and the chi-square test. It analyzes data structured the way a chi-square test expects, that is, counts cross-classified by categorical variables. The main difference between the two methods is that the loglinear model treats the cell counts as Poisson distributed. While the chi-square test only allows us to test a hypothesis, the loglinear method produces a predictive model as the end result of the analysis. The similarity between loglinear analysis and logistic regression lies in the log function: both are generalized linear models that work on a logarithmic scale (the log of the expected count in the loglinear model, the log of the odds in logistic regression).

In research, the odds ratio is the most important part of this model’s output. Again, this concept is difficult to grasp and, in my personal opinion, the results are rarely worth the complexity required to produce them.

Example:

Below is our sample data set:

*A note on structuring data for the purpose of loglinear analysis. When utilizing SPSS to perform this type of analysis, I have found that it is best to start counting categorical cases at “1”, as opposed to “0”. That would mean, in the context of our example, that each prompt bearing a “Yes” label has an underlying value of “1”, and each prompt bearing a “No” label has an underlying value of “2”. *
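For readers following along in R, the coding scheme described in the note can be sketched as follows. This is only an illustration: the case-level layout is an assumption reconstructed from the cell counts used in the R section of this post, and the names `cells` and `cases` are hypothetical.

```r
# Cell counts taken from the R example later in this post.
# Coding follows the note above: 1 = "Yes", 2 = "No".
cells <- data.frame(Obese   = c(1, 1, 2, 2),
                    Smoking = c(1, 2, 1, 2),
                    Count   = c(5, 1, 2, 2))

# Expand the counts into one row per participant (assumed case-level layout).
cases <- cells[rep(seq_len(nrow(cells)), cells$Count), c("Obese", "Smoking")]

# Attach the "Yes"/"No" labels to the underlying 1/2 codes.
cases$Obese   <- factor(cases$Obese,   levels = c(1, 2), labels = c("Yes", "No"))
cases$Smoking <- factor(cases$Smoking, levels = c(1, 2), labels = c("Yes", "No"))

# Cross-tabulate to confirm that the cell counts are recovered.
tab <- table(Obese = cases$Obese, Smoking = cases$Smoking)
print(tab)
```

Listing "Yes" first in `labels` keeps the order of the categories the same as the 1/2 coding.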

To perform a Loglinear analysis, select "Analyze" from the top drop down menu, then select "Loglinear". From the next menu, select "General".

This series of selections should populate the following menu:

Using the topmost middle arrow, designate “Smoking” and “Obese” as “Factor(s)”.

Next, click on the “Save” button.

This should generate the following sub-menu:

Make sure that the box adjacent to “Predicted values” is checked.

After this has been completed, click on the box labeled “Model”.

The following sub-menu should appear:

Select the option “Build terms”.

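Once this has been selected, use the middle drop down menu and center arrow to designate “Smoking” and “Obese” as “Main effects” type terms. (The output discussed below, with one degree of freedom, corresponds to a model containing only main effects. Adding the interaction term as well would saturate a two-by-two table and leave no degrees of freedom for the goodness-of-fit test.)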

Click “Continue” when you have completed the above steps.

Click the “Options” button to generate the following sub-menu:

Be sure to check every option beneath “Display” and then click “Continue”.

This should create the output screens below:

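The Pearson Chi-Square value is a goodness-of-fit measure: it compares the cell counts predicted by the model with the counts actually observed. In this case, the significance value exceeds .05, which means that there is no significant difference between the predicted and observed counts. In other words, we have no evidence that the model fits poorly, though with a sample this small the test has very little power.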

Most of the other output can be disregarded. First, we will consider the model itself. Written as a linear equation, it would resemble:

Y = .847(Smoking) + .405(Obese) + .182

Here, Smoking and Obese each equal 1 for “Yes” and 0 for “No”. The variable Y is the natural log of the predicted count for a given combination of the two variables.

For example, if this experiment were repeated with the same number of participants, we could predict the number of individuals within each category.

In the category of non-obese smokers, we would expect to find:

Y = (.847 * 1) + (.405 * 0) + .182

Y = 1.029

(Remember that 1.029 is the natural log of the actual figure. Therefore, we must exponentiate the Y-value to recover the predicted count.)

So:

exp(Y) = 2.798, or roughly 2.8 (Non-Obese Smokers)

This value can be confirmed in the above table labeled, “Cell Counts and Residuals”.
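The same arithmetic can be sketched in R for all four cells at once. This is a minimal illustration, not part of the SPSS procedure: the coefficients are the rounded values quoted above, the 1/0 indicators stand for “Yes”/“No”, and the object names are hypothetical.

```r
# Rounded coefficients from the equation above.
b_intercept <- 0.182
b_smoking   <- 0.847
b_obese     <- 0.405

# Every combination of the two indicators (1 = "Yes", 0 = "No").
cells <- expand.grid(Smoking = c(1, 0), Obese = c(1, 0))

# Linear predictor: the natural log of the predicted count.
cells$LogCount <- b_intercept + b_smoking * cells$Smoking + b_obese * cells$Obese

# Exponentiate to get back to the count scale.
cells$Predicted <- exp(cells$LogCount)

print(round(cells, 3))
```

Because the coefficients are rounded to three decimals, the predicted counts differ slightly from the values in the “Cell Counts and Residuals” table (for example, 2.798 rather than 2.8).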

So what does any of this mean, and how would we report these results in a meaningful way? The loglinear model produces expected counts which could be used to anticipate future experimental results. In the case of our example, neither variable was shown to be a statistically significant term in the model. This is illustrated by the “Sig” values found adjacent to each corresponding variable within the “Parameter Estimates” table. The chi-square value is a goodness-of-fit measure. Its low value indicates that the counts expected under the model do not differ significantly from the counts observed.

As mentioned previously, the true worthwhile output that is gathered from this method exists in the form of estimated odds. To generate these odds, we must perform the following calculations:

(The numbers utilized for these calculations are taken directly from the “Cell Counts and Residuals” table.)

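Estimated Odds of Obesity Among Smokers

Obese Yes / Obese No

4.2 / 2.8 = 1.5

Estimated Odds of Obesity Among Non-Smokers

Obese Yes / Obese No

1.8 / 1.2 = 1.5

Odds Ratio (Estimated Odds of Obesity Among Smokers / Estimated Odds of Obesity Among Non-Smokers)

1.5 / 1.5 = 1

(An odds ratio of exactly 1 is expected here, because a model containing only main effects assumes that smoking and obesity are independent.)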
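The odds calculations can also be reproduced in R. The sketch below assumes the same four cell counts used in the R section of this post and fits the main-effects model with `glm()`. The object names are hypothetical.

```r
# Observed cell counts, as in the R section of this post.
Data <- data.frame(Count   = c(5, 1, 2, 2),
                   Obese   = c("Yes", "Yes", "No", "No"),
                   Smoking = c("Yes", "No", "Yes", "No"))

# Main-effects (independence) loglinear model.
fit <- glm(Count ~ Obese + Smoking, family = poisson, data = Data)

# Expected counts under the model, arranged as an Obese x Smoking table.
Data$Fitted <- fitted(fit)
tab <- xtabs(Fitted ~ Obese + Smoking, data = Data)

# Estimated odds of obesity within each smoking group.
odds_smokers    <- unname(tab["Yes", "Yes"] / tab["No", "Yes"])
odds_nonsmokers <- unname(tab["Yes", "No"]  / tab["No", "No"])

# Odds ratio implied by the model.
odds_ratio_model <- odds_smokers / odds_nonsmokers

# Odds ratio computed directly from the observed counts, for comparison.
odds_ratio_observed <- (5 / 2) / (1 / 2)

print(c(model = odds_ratio_model, observed = odds_ratio_observed))
```

The observed odds ratio differs from the model-based one because the main-effects model does not include a term that would let the two variables be associated.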

So let’s state all of our conclusions in a substantive and succinct written summary:

A loglinear analysis was utilized to produce a model. The Pearson’s chi-squared value of this model was χ²(1) = 1.270, p = .260*. This indicates that the counts predicted by the model did not differ significantly from the observed counts**. Estimated odds ratios indicate that members within this study are equally*** likely to be obese whether or not they are smokers.

* Use the Greek letter chi with a superscript 2. The corresponding output can be found in the “Goodness-of-Fit Tests” table.

** For a goodness-of-fit test, a non-significant result indicates that the model fits the data adequately.

*** An odds ratio of 1 (1 times as likely).

Here is the code to perform this analysis within the R platform:

Obese <- c("Yes", "Yes", "No", "No")

Smoking <- c("Yes", "No", "Yes", "No")

Count <- c(5, 1, 2, 2)

Data <- data.frame(Count, Obese, Smoking)

DataModel <- glm(Count ~ Obese + Smoking , family = poisson)

# To include the interaction effect, use the commented-out line below instead #

# DataModel <- glm(Count ~ Obese * Smoking, family = poisson) #

summary(DataModel)

This produces the output:

Call:
glm(formula = Count ~ Obese + Smoking, family = poisson)

Deviance Residuals:
      1        2        3        4
 0.3789  -0.6515  -0.5041   0.6658

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.1823     0.6952   0.262    0.793
ObeseYes      0.4055     0.6455   0.628    0.530
SmokingYes    0.8473     0.6901   1.228    0.219

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 3.3137 on 3 degrees of freedom
Residual deviance: 1.2654 on 1 degrees of freedom
AIC: 17.973

Number of Fisher Scoring iterations: 4

# To test for goodness of fit #

# The observed counts must be re-created in matrix form #

Model <- matrix(c(5, 1, 2, 2),
                nrow = 2,
                dimnames = list("Smoker" = c("Yes", "No"),
                                "Obese" = c("Yes", "No")))

# To run the chi-square test #

# ‘correct = FALSE’ disables the Yates’ continuity correction #

chisq.test(Model, correct = FALSE)

This produces the output:

Pearson's Chi-squared test

data: Model
X-squared = 1.2698, df = 1, p-value = 0.2598
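As a cross-check, the same Pearson statistic can be recovered directly from the Poisson model, without rebuilding the data as a matrix. This is a sketch under the assumption that the main-effects model from earlier is the one being tested. The object names are hypothetical.

```r
# Same data and main-effects model as above.
Obese   <- c("Yes", "Yes", "No", "No")
Smoking <- c("Yes", "No", "Yes", "No")
Count   <- c(5, 1, 2, 2)

fit <- glm(Count ~ Obese + Smoking, family = poisson)

# Pearson goodness-of-fit: the sum of squared Pearson residuals.
pearson_stat <- sum(residuals(fit, type = "pearson")^2)
pearson_p    <- pchisq(pearson_stat, df = df.residual(fit), lower.tail = FALSE)

# Likelihood-ratio goodness-of-fit: the residual deviance.
deviance_stat <- deviance(fit)
deviance_p    <- pchisq(deviance_stat, df = df.residual(fit), lower.tail = FALSE)

print(c(pearson = pearson_stat, p = pearson_p))
print(c(deviance = deviance_stat, p = deviance_p))
```

The Pearson value matches the `chisq.test()` result above, and the deviance matches the “Residual deviance” line of the `glm()` summary. With expected counts this small, both p-values should be read as rough approximations.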

So now you have another model at your disposal, Data Heads. Please come back to visit again soon!


Wednesday, April 4, 2018
