Reflections of a Data Scientist: October 2020

Friday, October 16, 2020

(R) Fisher’s Exact Test

In today’s entry, we are going to briefly review Fisher’s Exact Test, and its appropriate application within the R programming language.

Despite the similar name, Fisher’s Exact Test does not rely on the F-Distribution, as the F-Test does. Instead, it calculates an exact p-value from the hypergeometric distribution. The test was devised by Sir Ronald Fisher, after whom the F-Distribution is also named.

(The Man)

(The Distribution)

Fisher’s Exact Test is very similar to the Chi-Squared Test. Both tests are utilized to assess categorical data classifications. Fisher’s Exact Test was designed specifically for data sorted into a 2x2 contingency table, though it can be extended to larger tables if necessary. A general rule for selecting the appropriate test for the given circumstances (Fisher’s Exact vs. Chi-Squared) pertains directly to the sample size. If any cell within the contingency table has an expected count of fewer than 5 observations, Fisher’s Exact Test would be more appropriate.
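
The usual form of this rule of thumb looks at the expected count in each cell, which R can report directly. Below is a minimal sketch, using a hypothetical “Small” matrix rather than data from this entry:

```r
# Hypothetical 2x2 table with small counts (not the avocado data) #
Small <- matrix(c(3, 1, 2, 6), nrow = 2, ncol = 2)

# chisq.test() stores the expected counts under the null hypothesis. #
# suppressWarnings() hides R's note that the approximation may be poor. #
Expected <- suppressWarnings(chisq.test(Small))$expected

Expected

# TRUE suggests that Fisher's Exact Test is the safer choice #
UseFisher <- any(Expected < 5)

UseFisher
```

If “UseFisher” comes back TRUE, the rule above points toward fisher.test() rather than chisq.test().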

The test itself was created for the purpose of studying small observational samples. For this reason, the test is considered to be “conservative” as compared to the Chi-Squared Test. In layman’s terms, you are less likely to reject the null hypothesis when utilizing Fisher’s Exact Test, as the test errs on the side of caution. Because the test was designed for smaller observational series, its conservative nature is a feature, not an error.

Let’s give it a try in today’s…

Example:

A professor instructs two classes on the subject of Remedial Calculus. He believes, based on a book that he recently completed, that students who consume avocados prior to taking an exam will generally perform better than students who do not. To test this hypothesis, the professor has one of his classes consume avocados prior to a very difficult pass/fail examination. The other class does not consume avocados, and completes the same examination. He collects the results of his experiment, which are as follows:

Class 1 (Avocado Consumers)

Pass: 15

Fail: 5

Class 2 (Avocado Abstainers)

Pass: 10

Fail: 15

It is also worth mentioning that the professor will be assuming an alpha value of .05.

# The data must first be entered into a matrix #

Model <- matrix(c(15, 10, 5, 15), nrow = 2, ncol=2)

# Let’s examine the matrix to make sure everything was entered correctly #

Model

Console Output:

     [,1] [,2]
[1,]   15    5
[2,]   10   15
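
If the unlabeled rows and columns are hard to keep straight, the same counts can be given names. This is only an optional sketch: “Labeled” is a hypothetical variable name, and the labels themselves are illustrative.

```r
# Same counts as Model, with illustrative row and column labels #
Labeled <- matrix(c(15, 10, 5, 15), nrow = 2, ncol = 2,
                  dimnames = list(Class = c("Avocado", "No Avocado"),
                                  Result = c("Pass", "Fail")))

Labeled

# Proportion of each class that passed or failed (each row sums to 1) #
PassRates <- prop.table(Labeled, margin = 1)

PassRates
```

Rows are classes and columns are exam results, so the first row holds the avocado consumers.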

# Now to apply Fisher’s Exact Test #

fisher.test(Model)

Console Output:

Fisher's Exact Test for Count Data

data: Model
p-value = 0.03373
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.063497 20.550173
sample estimates:
odds ratio
4.341278

Findings:

Fisher’s Exact Test was applied to our experimental findings. The results indicated a significant association between avocado consumption and examination success: 75% (15/20) of avocado consumers passed, as compared to 40% (10/25) of abstainers (p = .03).
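
For the curious, the p-value can be rebuilt by hand from the hypergeometric distribution. The sketch below is my reading of how the two-sided p-value is formed (summing the probability of every table that is no more likely than the observed one, with the row and column totals held fixed); the variable names are hypothetical.

```r
Model <- matrix(c(15, 10, 5, 15), nrow = 2, ncol = 2)

# Fixed margins of the table #
Passes <- sum(Model[, 1])   # total students who passed
Fails  <- sum(Model[, 2])   # total students who failed
Class1 <- sum(Model[1, ])   # size of the avocado class
Observed <- Model[1, 1]     # passes within the avocado class

# Every value the top-left cell could take given those margins #
Support <- max(0, Class1 - Fails):min(Class1, Passes)

# Hypergeometric probability of each possible table #
Probs <- dhyper(Support, Passes, Fails, Class1)
ProbObserved <- dhyper(Observed, Passes, Fails, Class1)

# Sum the probabilities of tables no more likely than the observed one. #
# The small tolerance guards against floating point ties. #
PTwoSided <- sum(Probs[Probs <= ProbObserved * (1 + 1e-7)])

PTwoSided
```

The result should agree with the p-value reported by fisher.test(Model) above.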

If we were to apply the Chi-Squared Test to the same data matrix, we would receive the following output:

# Application of Chi-Squared Test to prior experimental observations #

chisq.test(Model, correct = FALSE)

Console Output:

Pearson's Chi-squared test

data: Model
X-squared = 5.5125, df = 1, p-value = 0.01888

Findings:

As you might have expected, the application of the Chi-Squared Test yielded an even smaller p-value! If we were to utilize this test in lieu of The Fisher’s Exact Test, our results would also demonstrate significance.
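
A note on the “correct = FALSE” argument used above: it switches off Yates’ continuity correction, which R applies to 2x2 tables by default. As a small sketch (the variable names are hypothetical), the two settings can be compared side by side. The correction makes the Chi-Squared Test more cautious, so its p-value should be the larger of the two.

```r
Model <- matrix(c(15, 10, 5, 15), nrow = 2, ncol = 2)

# correct = FALSE: the plain Pearson statistic, as used above #
Uncorrected <- chisq.test(Model, correct = FALSE)$p.value

# correct = TRUE (R's default for 2x2 tables): Yates' continuity correction #
Corrected <- chisq.test(Model, correct = TRUE)$p.value

c(Uncorrected = Uncorrected, Corrected = Corrected)
```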

That is all for this entry.

Thank you for your patronage.

I hope to see you again soon.

-RD

Wednesday, October 14, 2020

Why Isn’t My Excel Function Working?! (MS-Excel)

Even an old data scientist can learn a new trick every once in a while.

Today was such a day.

Imagine my shock, as I spent about two and a half hours trying to get the most basic MS-Excel Functions to correctly execute.

This brings us to today’s example.

I’m not sure if this is now a default option within the latest version of Excel, or why this option would even exist, however, I feel that it is my duty to warn you of its existence.

For the sake of this demonstration, we’ll hypothetically assume that you are attempting to write a =COUNTIF function within cell C2, in order to assess the value contained within cell A2. If we were to drag this formula down to cells C3 and C4, a misapplication occurs: the value “Car” is not contained within A3 or A4, and yet the value 1 is returned.

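If this “error” arises, it is likely due to the option “Manual” being pre-selected within the “Calculation Options” drop-down menu, which is contained within the “Formulas” ribbon menu. In “Manual” mode, Excel does not recalculate formulas until it is told to (for example, by pressing F9), so the copied cells keep displaying the result from C2. To remedy this situation, change the selection to “Automatic” within the “Calculation Options” drop-down menu.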

The result should be the previously expected outcome:

Instead of accidentally and unknowingly encountering this error/feature in a way which is detrimental to your research, I would always recommend checking that “Calculation Options” is set to “Automatic” prior to beginning your work within the MS-Excel platform.

I hope that you found this article useful.

I’ll see you in the next entry.

-RD

Tuesday, October 6, 2020

Averaging Across Variable Columns (SPSS)

There may be a more efficient way to perform this function, as simpler functionality exists within other programming languages. However, I have not been able to discover a non “ad-hoc” method for performing this task within SPSS.

We will assume that we are operating within the following data set:

Which possesses the following data labels:

Assuming that all variables are on a similar scale, we could create a new variable by utilizing the code below:

COMPUTE CatSum=MEAN(VarA,
VarB,
VarC).
EXECUTE.

This new variable will be named “CatSum”. Despite the name, it does not hold a sum: for each row of data, it holds the mean of that row’s values across “VarA”, “VarB”, and “VarC”.

To generate the mean value of our newly created “CatSum” variable, we would execute the following code:

DESCRIPTIVES VARIABLES=CatSum
/STATISTICS=MEAN STDDEV.

This produces the output:

To reiterate what we are accomplishing by performing this task: we first compute, for each row, the mean of “VarA”, “VarB”, and “VarC”, and then we generate the mean of those row means.

Another way to conceptually envision this process is to imagine that we are placing all of the variables together into a single column (the two approaches agree as long as no row has missing values):

After which, we are generating the mean value of the column which contains all of the combined variable observational values.
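
For comparison, here is a rough sketch of the same idea in R. The data frame below is hypothetical, since the data set pictured above is not reproduced in the text, and the variable names other than “VarA”, “VarB”, “VarC”, and “CatSum” are my own.

```r
# Hypothetical data standing in for the SPSS data set #
Data <- data.frame(VarA = c(1, 2, 3),
                   VarB = c(4, 5, 6),
                   VarC = c(7, 8, 9))

# Row-wise mean across the three variables (the COMPUTE step) #
Data$CatSum <- rowMeans(Data[, c("VarA", "VarB", "VarC")])

# Mean of the new variable (the DESCRIPTIVES step) #
MeanOfRowMeans <- mean(Data$CatSum)

# The "single column" view: stack every value, then average #
Stacked <- unlist(Data[, c("VarA", "VarB", "VarC")])
MeanOfStacked <- mean(Stacked)

c(MeanOfRowMeans = MeanOfRowMeans, MeanOfStacked = MeanOfStacked)
```

With complete data, as in this example, the two figures match.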

And that, is that!

At least, for this article.

Stay studious in the interim, Data Heads!

- RD