Reflections of a Data Scientist: (R) Pearson’s Chi-Square Test Residuals and Post Hoc Analysis

In today’s article, we are going to discuss Pearson Residuals. A Pearson Residual is a product of post hoc analysis. These values can be utilized to further assess Pearson’s Chi-Square Test results.

If you are un-familiar with The Pearson’s Chi-Square Test, or what post hoc analysis typically entails, I would encourage you to do further research prior to proceeding.

Example:

To demonstrate this post hoc technique, we will utilize a prior article’s example:

The "Smoking : Obesity" Pearson’s Chi-Squared Test Demonstration.

# To test for goodness of fit #

Model <-matrix(c(5, 1, 2, 2),

nrow = 2,

dimnames = list("Smoker" = c("Yes", "No"),

"Obese" = c("Yes", "No")))

# To run the chi-square test #

# 'correct = FALSE' disables the Yates’ continuity correction #

chisq.test(Model, correct = FALSE)

This produces the output:

Pearson's Chi-squared test

data: Model
X-squared = 1.2698, df = 1, p-value = 0.2598

From the output provided, we can easily conclude that our results were not significant.

However, let’s delve a bit deeper into our findings.

First, let’s take a look at the matrix of the model.

Model

Obese
Smoker Yes No
Yes 5 2
No 1 2

Now, let’s take a look at the expected model values.

chi.result <- chisq.test(Model, correct = FALSE)

chi.result$expected

Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2

What does this mean?

The values above represent the values which we would expect to observe if the observational categories measured, perfectly adhered to the chi-square distribution.

(Karl Pearson)

From the previously derived values, we can derived the Pearson Residual Values.

print(chi.result$residuals)

Obese
Smoker Yes No
Yes 0.3903600 -0.4780914
No -0.5962848 0.7302967

What we are specifically looking for, as it pertains to the residual output, are values which are greater than +2, or less than -2. If these findings were present in any of the above matrix entries, it would indicate that the model was inappropriately applied given the circumstances of the collected observational data.

The matrix values themselves, in the residual matrix, are the observed categorical values minus the expected values, divided by the square root of the expected values.

Thus: Standard Residual = (Observed Values – Expected Value) / Square Root of Expected Value

Observed Values

Obese
Smoker Yes No
Yes 5 2
No 1 2

Expected Values

Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2

(5 – 4.2) / √ 4.2 = 0.3903600

(1 – 1.8) / √ 1.8 = -0.5962848

(2 – 2.8) / √ 2.8 = -0.4780914

(2 – 1.2) / √ 1.2 = 0.7302967

~ OR ~

(5 - 4.2) / sqrt(4.2)

(1 - 1.8) / sqrt(1.8)

(2 - 2.8) / sqrt(2.8)

(2 - 1.2) / sqrt(1.2)

[1] 0.39036
[1] -0.5962848
[1] -0.4780914
[1] 0.7302967

The Pearson Residual Values (0.39036…etc.), are an estimate of the raw residual values’ standard deviations. It is for this reason, that any value greater than +2, or less than -2, would indicate a misapplication of the model. Or, at very least, indicate that more observational values ought to be collected prior to the model being applied again.

The Fisher’s Exact Test as a Post Hoc Analysis for The Pearson's Chi-Square Test

Let’s take our example one step further by applying The Fisher’s Exact Test as a method of post hoc analysis.

Why would we do this?

Assuming that our Chi-Square Test findings were significant, we may want to consider a Fisher’s Exact Test as a method to further prove evidence of significance.

A Fisher’s Exact Test is less robust in application as compared to the Chi-Square Test. For this reason, the Fisher’s Exact Test will always yield a lower p-value than its Chi-Square counterpart.

(Sir Ronald Fisher)

fisher.result <- fisher.test(Model)

print(fisher.result$p.value)

[1] 0.5

<Yikes!>

Conclusions

Now that we have considered our analysis every which way, we can state our findings in APA Format.

This would resemble the following:

A chi-square test of proportions was performed to examine the relation of smoking and obesity. The relation between these variables was not found to be significant χ2 (1, N = 10) = 1.27, p > .05.

In investigating the Pearson Residuals produced from the model application, no value was found to be greater than +2, or less than -2. These findings indicate that the model was appropriate given the circumstances of the experimental data.

In order to further confirm our experimental findings, a Fisher’s Exact Test was also performed for post hoc analysis. The results of such indicated a non-significant relationship as it pertains to obesity as determined by individual smoker status: 71% (5/7), compared to individual non-smoker status: 33% (1/3); (p > .05).

-----------------------------------------------------------------------------------------------------------------------------

I hope that you found all of this helpful and entertaining.

Until next time,

-RD

Reflections of a Data Scientist

Saturday, June 5, 2021

(R) Pearson’s Chi-Square Test Residuals and Post Hoc Analysis

No comments:

Post a Comment