Sunday, July 18, 2021

Pivot Tables (MS-Excel)

You didn’t honestly believe that I would continue to write articles without mentioning every analyst’s favorite Excel technique, did you?

Example / Demonstration:

For this demonstration, we are going to be utilizing the "Removing Duplicate Entries (MS-Excel).csv" data file. This file can be found within the GitHub data repo, upload date: July 12, 2018. If you are too lazy to navigate over to the repo site, the raw .csv data can be found down below:

VARA,VARB,VARC,VARD
Mike,1,Red,Spade
Mike,2,Blue,Club
Mike,1,Red,Spade
Troy,2,Green,Diamond
Troy,1,Red,Heart
Archie,2,Orange,Heart
Archie,2,Yellow,Diamond
Archie,2,Orange,Heart
Archie,1,Red,Spade
Archie,1,Blue,Spade
Archie,2,Red,Club
Archie,2,Red,Club
Jack,1,Red,Diamond
Jack,2,Blue,Diamond
Jack,2,Blue,Diamond
Rob,1,Green,Club
Rob,2,Orange,Spade
Brad,1,Red,Heart
Susan,2,Blue,Heart
Susan,2,Yellow,Club
Susan,1,Pink,Heart
Seth,2,Grey,Heart
Seth,1,Green,Club
Joanna,2,Pink,Club
Joanna,1,Green,Spade
Joanna,1,Green,Spade
Bertha,2,Grey,Diamond
Bertha,1,Grey,Diamond
Liz,1,Green,Spade


Let’s get started!

First, we’ll take a nice look at the data as it exists within MS-Excel:


Now we’ll pivot to excellence!

The easiest way to start building pivot tables is to utilize the "Recommended PivotTables" option button located within the "Insert" menu, listed within Excel's ribbon menu.


This should bring up the menu below:


Go ahead and select all row entries, across all variable columns.

Once this has been completed, click “OK”.

This should generate the following menu:


Let’s break down each recommendation.

“Sum of VARB by VARD” – This table is summing the total of the numerical values contained within VARB, as they correspond with VARD entries.

“Count of VARA by VARD” – This table is counting the total number of occurrences of categorical values within variable column VARD.

“Sum of VARB by VARC” – This table is summing the total of numerical values contained within VARB, as they correspond with VARC entries.


“Count of VARA by VARC” – This table is counting the total number of occurrences of categorical values within variable column VARC.

“Sum of VARB by VARA” – This table is summing the total of the numerical values contained within VARB, as they correspond with VARA entries.

Now, there may come a time in which none of the above options match exactly what you are looking for. In this case, you will want to utilize the “PivotTable” option button, located within the “Insert” menu, listed within Excel’s ribbon menu.


Go ahead and select all row entries, across all variable columns.

Change the option button to “New Worksheet”, instead of “Existing Worksheet”.

Once this has been completed, click “OK”.

Once this has been accomplished, you’ll be graced with a new menu, on a new Excel sheet (same workbook).

I won’t go into every single output option that you have available, but I will list a few you may want to try yourself. Each output variation can be created by dragging and dropping the variables listed within the topmost box, in varying order, into the boxes below: 


If VARA and VARC are both added to Rows, you will view the categorical occurrences of variable entries from VARC, with VARA acting as the unique ID.

Order matters in each pivot table variable designation place.

So, if we reverse the position of VARA and VARC, and instead list VARC first, followed by VARA, then we will see a table which lists the categorical occurrences of VARA, with VARC acting as a unique ID.

If we include VARA and VARC as rows (in that order), and set the values variable to Sum of VARB, then the output should more closely resemble an accounting sheet, in which the numerical values of VARB are summed for each VARA entry, categorized by VARC.


If we instead wanted the count, as opposed to the sum, we could click on the drop down arrow located next to “Count of VARB”, which presents the following options:


From the options listed, we will select “Value Field Settings”.

This presents the following menu, from which we will select “Count”.


The result of following the previously listed steps is illustrated below:


The Pivot Table creation menu also allows for further customization through the addition of column variables.

In the case of our example, we will make the following modifications to our table output:


VARC will now be designated as a column variable, VARA will be a row variable, and the count of VARB will be our values variable.

The result of these modifications is shown below:


Our output now contains a table which lists the count of each occurrence of each color (VARC), as each color corresponds with each individual (VARA) listed within the original data set.
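If you would like to sanity-check this final table outside of Excel, the same cross-tabulation can be reproduced within R. Below is a minimal sketch, which assumes that the raw .csv data listed at the top of this article has been saved as "Removing Duplicate Entries (MS-Excel).csv" within your working directory:

# Read the raw .csv data listed above #

pivot_data <- read.csv("Removing Duplicate Entries (MS-Excel).csv")

# Count of each color (VARC) per individual (VARA) #

table(pivot_data$VARA, pivot_data$VARC)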

In conclusion, the pivot table option within MS-Excel offers a variety of different output formats which can be utilized to display statistical summary data.

The most important skill to develop as it pertains to this feature is the ability to ascertain when a pivot table is necessary for your data project needs.

So with that, we will end this article.

I will see you next time, Data Head.

-RD

Wednesday, July 14, 2021

Getting to Know the Greeks

In today’s article, we are going to go a bit off the beaten path and discuss, The Greek Alphabet!


You might be wondering, why the sudden change of subject content…?

In order to truly master the craft of data science, you will be required to stretch your mind in creative ways. The Greek Alphabet is utilized throughout the fields of statistics, mathematics, finance, computer science, astronomy, and other western intellectual pursuits. For this reason, it really ought to be taught in elementary schools. However, to my knowledge, in most cases, it is not.

The Romans borrowed heavily from Greek Civilization, and contemporary western civilization borrowed heavily from the Romans. Therefore, to truly be a person of culture, you should learn the Greek Alphabet, and really, as much as you possibly can about Ancient Greek Culture. This includes the legends, heroes, and philosophers. We might be getting more into this in other articles, but for today, we will be sticking to the alphabet.

The Greek Alphabet

The best way to learn the Greek alphabet is to be Greek (j/k, but not really). In all other cases, application is probably the best way to commit various Greek letters, as symbols, to memory.

I would recommend drawing each letter in order, uppercase, and lowercase, and saying the name of the letter as it is written.

Let’s try this together!

Α α (Alpha) (Pronounced: AL-FUH) - Utilized in statistics as the symbol which denotes significance level. In finance, it is the percentage return of an investment above or below a predetermined index.

Β β (Beta) (Pronounced: BAY-TUH) - In statistics, this symbol is utilized to represent the probability of a Type II error. In finance, it is utilized to determine asset volatility.

Γ γ (Gamma) (Pronounced: GAM-UH) - In physics, this symbol is utilized to represent particle decay (Gamma Decay). There also exists Alpha Decay, and Beta Decay. The type of decay situationally differs depending on the circumstances.

Δ δ (Delta) (Pronounced: DEL-TUH) - This is currently the most common strain of the novel coronavirus (7/2021). In the field of chemistry, uppercase Delta is utilized to symbolize heat being added to a reaction.

Ε ε (Epsilon) (Pronounced: EP-SIL-ON) - “Machine Epsilon” is utilized in computer science as a way of dealing with floating point values and their assessment within logical statements.

Ζ ζ (Zeta) (Pronounced: ZAY-TUH) - The most common utilization assignment which I have witnessed for this letter is its designation as the variable which represents the Riemann Zeta Function (number theory).

Η η (Eta) (Pronounced: EE-TUH) - I’ve mostly seen this letter designated as the variable for the Dedekind eta function (number theory).

Θ θ (Theta) (Pronounced: THAY-TUH) - Theta is utilized as the symbol to represent a pentaquark, an exotic subatomic particle.

Ι ι (Iota) (Pronounced: EYE-OH-TUH) - I’ve never seen this symbol utilized for anything outside of astronomical designations. Maybe if you make it big in science, you could give Iota the love that it so deserves.

Κ κ (Kappa) (Pronounced: CAP-UH) - Kappa is the chosen variable designation for Einstein’s gravitational constant.

Λ λ (Lambda) (Pronounced: LAMB-DUH) - A potential emergent novel coronavirus variant (7/2021). Lowercase Lambda is also utilized throughout the Poisson Distribution function.

Μ μ (Mu) (Pronounced: MEW) - Lowercase Mu is utilized to symbolize the mean of a population (statistics). In particle physics, it can also be applied to represent the elementary particle: Muon.

Ν ν (Nu) (Pronounced: NEW) - As a symbol, this letter represents degrees of freedom (statistics).

Ξ ξ (Xi) (Pronounced: SEE) - In mathematics, uppercase Xi can be utilized to represent the Riemann Xi Function.

Ο ο (Omicron) (Pronounced: OH-MI-CRON) - A symbol which does not get very much love, or use, unlike its subsequent neighbor…

Π π (Pi) (Pronounced: PIE) - In mathematics, lowercase Pi often represents the mathematical real transcendental constant ≈ 3.1415…etc.

Ρ ρ (Rho) (Pronounced: ROW) - In the Black-Scholes model, Rho represents the rate of change of a portfolio’s value with respect to interest rates.

Σ σ (Sigma) (Pronounced: SIG-MA) - Lowercase Sigma represents the standard deviation of a population (statistics). Uppercase Sigma represents the summation operator (mathematics).

Τ τ (Tau) (Pronounced: TAW) - Lowercase Tau represents an elementary particle within the field of particle physics.

Υ υ (Upsilon) (Pronounced: EEP-SIL-ON) - Does not really get very much use…

Φ φ (Phi) (Pronounced: FAI) - Lowercase Phi is utilized to represent the Golden Ratio.

Χ χ (Chi) (Pronounced: KAI) - Lower case Chi is utilized as a variable throughout the Chi-Square distribution function.

Ψ ψ (Psi) (Pronounced: PSY) - Lower case Psi is used to represent the (generalized) positional states of a qubit within a quantum computer.

Ω ω (Omega) (Pronounced: OH-MEG-UH) - Utilized for just about everything.

Αυτά για τώρα. Θα σε δω την επόμενη φορά! (That’s all for now. I’ll see you next time!)

-RD

Friday, June 25, 2021

(R) The Levene's Test

In today’s article we will be discussing a technique which is not especially interesting or pragmatically applicable. Still, for the sake of true data science proficiency, today we will be discussing, THE LEVENE'S TEST!

The Levene's Test is utilized to compare the variances of two separate data sets.

So naturally, our hypothesis would be:

Null Hypothesis: The variance measurements of the two data sets do not significantly differ.

Alternative Hypothesis: The variance measurements of the two data sets do significantly differ.

The Levene's Test Example:

# The leveneTest() Function is included within the “car” package #

library(car)

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

N_LEV <- c(N1, N2)

group <- as.factor(c(rep(1, length(N1)), rep(2, length(N2))))

leveneTest(N_LEV, group)

# The above code is a modification of code provided by StackExchange user: ocram. #

# Source https://stats.stackexchange.com/questions/15722/how-to-use-levene-test-function-in-r #

This produces the output:

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.7677 0.2049
      14

Since the p-value of the output exceeds .05, we will not reject the null hypothesis (alpha = .05).

Conclusions:

The Levene’s Test for Equality of Variances did not indicate a significant differentiation in the variance measurement of Sample N1, as compared to the variance measurement of Sample N2, F(1,14) = 1.78, p= .21.

So, what is the overall purpose of this test? Meaning, when would its application be appropriate? The Levene’s Test is typically utilized as a pre-test prior to the application of the standard T-Test. However, it is uncommon to structure a research experiment in this manner.  Therefore, the Levene’s Test is more so something which is witnessed within the classroom, and not within the field.

Still, if you find yourself in circumstances in which this test is requested, know that it is often required to determine whether a standard T-Test is applicable. If variances are found to be un-equal, a Welch’s T-Test is typically preferred as an alternative to the standard T-Test.
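If you would like to see what that alternative looks like in practice, below is a minimal sketch which re-uses the N1 and N2 vectors defined above (note that var.equal = FALSE is also the default behavior of t.test()):

# Welch's T-Test (does not assume equal variances) #

t.test(N1, N2, var.equal = FALSE)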

-----------------------------------------------------------------------------------------------------------------------------

I promise that my next article will be more exciting.

Until next time.

-RD

Friday, June 18, 2021

(R) Imputing Missing Data with the MICE() Package


In today’s article we are going to discuss basic utilization of the MICE package.

The MICE package is a package which assists with performing analysis on shoddily assembled data frames.

In the world of data science (the real world, not the YouTube world or the classroom world), data often arrives in a less than optimal state. In most cases, this is simply the reality of the matter.

Now, it would be easy to throw up your hands and say, “I CAN’T PERFORM ANY SORT OF ANALYSIS WITH ALL OF THESE MISSING VARIABLES”,


~OR~

(Don’t succumb to temptation!) 

Unfortunately, for you, the data scientist, whoever passed you this data expects a product and not your excuses.

Fortunately, for all of us, there is a way forward.

Example:

Let’s say that you were given this small data set for analysis:

                                           

The data is provided in an .xls format, because why wouldn’t it be?

For the sake of not having you download an example data file, I have re-coded this data into the R format.

# Create Data Frame: "SheetB" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA , 1, NA, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)

VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)

VarD <- c(70, 80, NA, 87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)

VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)


SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)


If you would like to see a version of the initial example file without the missing values, the code to create this data frame is below:

# Create Data Frame: "SheetA" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, 8, 17, 13, 6, 2, 18, 12, 17, 13, 9, 14, 18, 6, 13, 5, 2)

VarC <- c(2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 4, 4, 1, 4, 3, 1, 2, 3, 1)

VarD <- c(70, 80, 90, 87, 79, 60, 61, 75, 92, 67, 62, 93, 74, 80, 91, 51, 64, 33, 77, 50)

VarE <- c(980, 800, 983, 925, 821, 978, 881, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)

SheetA <- data.frame(VarA, VarB, VarC, VarD, VarE)


In our example, we’ll assume that the sheet which contains all values is unavailable to you (“SheetA”). Therefore, to perform any sort of meaningful analysis, you will need to either delete all observations which contain missing data variables (DON’T DO IT!), or, run an imputation function.


We will opt to do the latter, and the function which we will utilize is the mice() function.

First, we will initialize the appropriate library:

# Initialize Library #

library(mice)

Next, we will perform the imputation function contained within the library.

# Perform Imputation #

SheetB_Imputed <- mice(SheetB, m=1, maxit = 50, method = 'pmm', seed = 500)

SheetB: is the data frame which is being called by the function.

m = 1: This is the number of data frame imputation variations which will be generated as a result of the mice function. One is all that is necessary.

maxit: This is the number of max iterations which will occur as the mice function calculates what it determines to be the optimal value of each missing variable cell.

method: The imputation method which will be utilized to perform this function ('pmm' refers to predictive mean matching).

seed: The mice() function partially relies on randomness to generate missing variable values. The seed value can be whatever value you determine to be appropriate.

After performing the above function, you should be greeted with the output below:

iter imp variable
1 1 VarA VarB VarC VarD VarE
2 1 VarA VarB VarC VarD VarE
3 1 VarA VarB VarC VarD VarE
4 1 VarA VarB VarC VarD VarE
5 1 VarA VarB VarC VarD VarE
6 1 VarA VarB VarC VarD VarE
7 1 VarA VarB VarC VarD VarE
8 1 VarA VarB VarC VarD VarE
9 1 VarA VarB VarC VarD VarE
10 1 VarA VarB VarC VarD VarE
11 1 VarA VarB VarC VarD VarE
12 1 VarA VarB VarC VarD VarE
13 1 VarA VarB VarC VarD VarE
14 1 VarA VarB VarC VarD VarE
15 1 VarA VarB VarC VarD VarE
16 1 VarA VarB VarC VarD VarE
17 1 VarA VarB VarC VarD VarE
18 1 VarA VarB VarC VarD VarE
19 1 VarA VarB VarC VarD VarE
20 1 VarA VarB VarC VarD VarE
21 1 VarA VarB VarC VarD VarE
22 1 VarA VarB VarC VarD VarE
23 1 VarA VarB VarC VarD VarE
24 1 VarA VarB VarC VarD VarE
25 1 VarA VarB VarC VarD VarE
26 1 VarA VarB VarC VarD VarE
27 1 VarA VarB VarC VarD VarE
28 1 VarA VarB VarC VarD VarE
29 1 VarA VarB VarC VarD VarE
30 1 VarA VarB VarC VarD VarE
31 1 VarA VarB VarC VarD VarE
32 1 VarA VarB VarC VarD VarE
33 1 VarA VarB VarC VarD VarE
34 1 VarA VarB VarC VarD VarE
35 1 VarA VarB VarC VarD VarE
36 1 VarA VarB VarC VarD VarE
37 1 VarA VarB VarC VarD VarE
38 1 VarA VarB VarC VarD VarE
39 1 VarA VarB VarC VarD VarE
40 1 VarA VarB VarC VarD VarE
41 1 VarA VarB VarC VarD VarE
42 1 VarA VarB VarC VarD VarE
43 1 VarA VarB VarC VarD VarE
44 1 VarA VarB VarC VarD VarE
45 1 VarA VarB VarC VarD VarE
46 1 VarA VarB VarC VarD VarE
47 1 VarA VarB VarC VarD VarE
48 1 VarA VarB VarC VarD VarE
49 1 VarA VarB VarC VarD VarE
50 1 VarA VarB VarC VarD VarE 


The output is informing you that the iteration was performed a total of 50 times on one single set.

The code below combines the initial values from the original set with the newly estimated values, which now occupy the variable cells that were previously blank.

# Assign Original Values with Imputations to Data Frame #

SheetB_Imputed_Complete <- complete(SheetB_Imputed)


The outcome should resemble something like:

(Beautiful!)
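If you would like to confirm that the imputation left no blanks behind, a quick check along these lines should return a zero for every column:

# Count of remaining missing values per column (each should be 0) #

colSums(is.na(SheetB_Imputed_Complete))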


A quick warning, the mice() function cannot be utilized on data frames which contain unencoded categorical variable entries.

An example of this:

                                       

To get mice() to work correctly on this data set, you must recode "VARC" prior to proceeding. You could do this by changing each instance of "Spade" to 1, "Club" to 2, “Diamond" to 3, and "Heart" to 4.
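As a rough sketch, one way to perform this sort of recoding within R (assuming the categorical column is a character vector named VARC) would resemble the following:

# Hypothetical example column #

VARC <- c("Spade", "Club", "Diamond", "Heart", "Spade")

# Recode "Spade" to 1, "Club" to 2, "Diamond" to 3, and "Heart" to 4 #

VARC_recoded <- ifelse(VARC == "Spade", 1,
                ifelse(VARC == "Club", 2,
                ifelse(VARC == "Diamond", 3, 4)))

VARC_recoded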

For more information as it relates to this function, please check out this link.

That’s all for now, internet.

-RD

Saturday, June 12, 2021

(R) 2-Sample Test for Equality of Proportions

In today’s article we are going to revisit in greater detail, a topic which was reviewed in a prior article.

What the 2-Sample Test for Equality of Proportions seeks to achieve, is an assessment of differentiation as it pertains to one survey group’s response, as measured against another.

To illustrate the application of this methodology, I will utilize a prior example which was previously published to this site (10/15/2017).

Example:

A pollster took a survey of 1300 individuals, the results of which indicated that 600 were in favor of Candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor of Candidate A. At a 10% significance level, is there evidence that the candidate's popularity has decreased?

# Model Hypothesis #

# H0: p1 - p2 = 0 #

# (The proportions are the same) # 

# Ha: p1 - p2 > 0 #

# (Proportion 1 is greater than proportion 2, i.e., the candidate's popularity has decreased) #

# Disable Scientific Notation in R Output #

options(scipen = 999)

# Model Application #

prop.test(x = c(600,500), n=c(1300,1500), conf.level = .95, correct = FALSE)


Which produces the output:

2-sample test for equality of proportions without continuity correction

data: c(600, 500) out of c(1300, 1500)
X-squared = 47.991, df = 1, p-value = 0.000000000004281
alternative hypothesis: two.sided
95 percent confidence interval:
0.09210145 0.16430881
sample estimates:
prop 1 prop 2
0.4615385 0.3333333

We are now prepared to state the details of our model’s application, and the subsequent findings and analysis which occurred as a result of such.

Conclusions:

A 2-Sample Test for Equality of Proportions without Continuity Correction was performed to analyze whether the poll survey results for Candidate A significantly differed from subsequent poll survey results gathered weeks later. A 90% confidence level (alpha = .10) was assumed for significance.

There was a significant difference in Candidate A’s favorability score from the initial poll findings: 46% (600/1300), as compared to Candidate A’s favorability score from the subsequent poll findings: 33% (500/1500); χ2 (1, N = 2800) = 47.99, p < .001.
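As a side note, since the research question is directional (has the candidate's popularity decreased?), the same test could also be run in its one-sided form. Below is a minimal sketch, which simply adds the "alternative" argument to the function call shown earlier:

# One-sided variant: is proportion 1 (first poll) greater than proportion 2 (second poll)? #

prop.test(x = c(600,500), n=c(1300,1500), alternative = "greater", conf.level = .90, correct = FALSE)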

-----------------------------------------------------------------------------------------------------------------------------

That's all for now.

I'll see you next time, Data Heads.

-RD

Saturday, June 5, 2021

(R) Pearson’s Chi-Square Test Residuals and Post Hoc Analysis

In today’s article, we are going to discuss Pearson Residuals. A Pearson Residual is a product of post hoc analysis. These values can be utilized to further assess Pearson’s Chi-Square Test results.

If you are un-familiar with The Pearson’s Chi-Square Test, or what post hoc analysis typically entails, I would encourage you to do further research prior to proceeding.

Example:

To demonstrate this post hoc technique, we will utilize a prior article’s example:

The "Smoking : Obesity" Pearson’s Chi-Squared Test Demonstration.

# To test for goodness of fit #

Model <-matrix(c(5, 1, 2, 2),

nrow = 2,

dimnames = list("Smoker" = c("Yes", "No"),

"Obese" = c("Yes", "No")))

# To run the chi-square test #

# 'correct = FALSE' disables the Yates’ continuity correction #

chisq.test(Model, correct = FALSE)


This produces the output:

Pearson's Chi-squared test

data: Model
X-squared = 1.2698, df = 1, p-value = 0.2598

From the output provided, we can easily conclude that our results were not significant.

However, let’s delve a bit deeper into our findings.

First, let’s take a look at the matrix of the model.

Model

Obese
Smoker Yes No
Yes 5 2
No 1 2


Now, let’s take a look at the expected model values.

chi.result <- chisq.test(Model, correct = FALSE)

chi.result$expected


Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2


What does this mean?

The values above represent the counts which we would expect to observe if the two categorical variables (smoker status and obesity) were completely independent of one another, i.e., if the null hypothesis of the Chi-Square Test held exactly.

(Karl Pearson)

From the previously derived values, we can derive the Pearson Residual Values.

print(chi.result$residuals)

Obese
Smoker Yes No
Yes 0.3903600 -0.4780914
No -0.5962848 0.7302967

What we are specifically looking for, as it pertains to the residual output, are values which are greater than +2, or less than -2. If these findings were present in any of the above matrix entries, it would indicate that the model was inappropriately applied given the circumstances of the collected observational data.

The matrix values themselves, in the residual matrix, are the observed categorical values minus the expected values, divided by the square root of the expected values. 

Thus: Standard Residual = (Observed Values – Expected Value) / Square Root of Expected Value

Observed Values

Obese
Smoker Yes No
Yes 5 2
No 1 2

Expected Values

Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2

(5 – 4.2) / √ 4.2 = 0.3903600 

(1 – 1.8) / √ 1.8 = -0.5962848

(2 – 2.8) / √ 2.8 = -0.4780914

(2 – 1.2) / √ 1.2 = 0.7302967

~ OR ~

(5 - 4.2) / sqrt(4.2)

(1 - 1.8) / sqrt(1.8)

(2 - 2.8) / sqrt(2.8)

(2 - 1.2) / sqrt(1.2)


[1] 0.39036
[1] -0.5962848
[1] -0.4780914
[1] 0.7302967

The Pearson Residual Values (0.39036…etc.) are the raw residuals expressed on a roughly standardized scale, as each raw residual is divided by the square root of its corresponding expected count. It is for this reason that any value greater than +2, or less than -2, would indicate a misapplication of the model. Or, at the very least, indicate that more observational values ought to be collected prior to the model being applied again.
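It is also worth noting that the chisq.test() output additionally contains standardized (adjusted) residuals, which some texts prefer for this sort of post hoc screening. If you would like to view them from the model which we have already fit, a quick sketch:

# Adjusted standardized residuals from the same model #

print(chi.result$stdres)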

The Fisher’s Exact Test as a Post Hoc Analysis for The Pearson's Chi-Square Test

Let’s take our example one step further by applying The Fisher’s Exact Test as a method of post hoc analysis.

Why would we do this?

Assuming that our Chi-Square Test findings were significant, we may want to consider a Fisher’s Exact Test as a method to further prove evidence of significance.

A Fisher’s Exact Test is more conservative in its application as compared to the Chi-Square Test. For this reason, the Fisher’s Exact Test will generally yield a higher p-value than its Chi-Square counterpart.

(Sir Ronald Fisher)

fisher.result <- fisher.test(Model)

print(fisher.result$p.value)


[1] 0.5

<Yikes!>

Conclusions

Now that we have considered our analysis every which way, we can state our findings in APA Format.

This would resemble the following:

A chi-square test of independence was performed to examine the relation between smoking and obesity. The relation between these variables was not found to be significant, χ2 (1, N = 10) = 1.27, p > .05.


In investigating the Pearson Residuals produced from the model application, no value was found to be greater than +2, or less than -2. These findings indicate that the model was appropriate given the circumstances of the experimental data.

In order to further confirm our experimental findings, a Fisher’s Exact Test was also performed for post hoc analysis. The results of such indicated a non-significant relationship as it pertains to obesity as determined by individual smoker status: 71% (5/7), compared to individual non-smoker status: 33% (1/3); (p > .05).


-----------------------------------------------------------------------------------------------------------------------------

I hope that you found all of this helpful and entertaining.

Until next time,

-RD

Monday, November 9, 2020

(R) Cohen’s d

In today’s entry, we are going to discuss Cohen’s d, what it is, and when to utilize it. We will also discuss how to appropriately apply the methodology needed to derive this value, through the utilization of the R software package.


(SPSS does not contain the innate functionality necessary to perform this calculation)

Cohen’s d - (What it is):

Cohen’s d is utilized as a method to assess the magnitude of impact as it relates to two sample groups which are subject to differing conditions. For example, if a two sample t-test was being implemented to test a single group which received a drug, against another group which did not receive the drug, then the p-value of this test would determine whether or not the findings were significant.

Cohen’s d would measure the magnitude of the potential impact.

Cohen’s d - (When to use it):

In your statistics class.

You could also utilize this test to perform post-hoc analysis as it relates to the ANOVA model and the Student’s T-Test. However, I have never witnessed the utilization of this test outside of an academic setting.

Cohen’s d – (How to interpret it):

General Interpretation Guidelines:

Greater than or equal to 0.2 = small
Greater than or equal to 0.5 = medium
Greater than or equal to 0.8 = large

Cohen’s d – (How to state your findings):

The effect size for this analysis (d = x.xx) was found to exceed Cohen’s convention for a [small, medium, large] effect (d = .xx).

Cohen’s d – (How to derive it):

# Within the R-Programming Code Space #

##################################

# length of sample 1 (x) #
lenx <-
# length of sample 2 (y) #
leny <-
# mean of sample 1 (x) #
meanx <-
# mean of sample 2 (y)#
meany <-
# SD of sample 1 (x) #
sdx <-
# SD of sample 2 (y) #
sdy <-

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd

##################################


# The above code is a modified version of the code found at: #

# https://stackoverflow.com/questions/15436702/estimate-cohens-d-for-effect-size #
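If you would prefer not to re-type the template for every analysis, the same computation can be wrapped in a small helper function. This is only a sketch of the template above, not a separate method:

# Cohen's d as a helper function (same pooled standard deviation computation as above) #

cohens_d <- function(x, y) {
  lx <- length(x) - 1
  ly <- length(y) - 1
  md <- abs(mean(x) - mean(y)) ## mean difference (numerator)
  csd <- sqrt((lx * var(x) + ly * var(y)) / (lx + ly)) ## common sd computation
  md / csd ## cohen's d
}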


Cohen’s d – (Example):

FIRST WE MUST RUN A TEST IN WHICH COHEN’S d CAN BE APPLIED AS AN APPROPRIATE POST-HOC TEST METHODOLOGY.
 
Two Sample T-Test


This test is utilized if you randomly sample different sets of items from two separate groups.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures temperature in samples which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?

For this, we will use the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)


Which produces the output:

Two Sample t-test

data: N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3007929 4.4492071
sample estimates:
mean of x mean of y
75.250 72.875


# Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #

# An additional option is available when running a two sample t-test: The Welch Two Sample T-Test. To utilize this option while performing a t-test, "var.equal = TRUE" must be changed to "var.equal = FALSE". The Welch Two Sample t-test is slightly more robust, as it does not assume that the two samples share equal variances, and it also better accommodates differing sample sizes. #

From this output we can conclude:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.

Application of Cohen’s d 

length(N1) # 8 #
length(N2) # 8 #

mean(N1) # 72.875 #
mean(N2) # 75.25 #

sd(N1) # 2.167124 #
sd(N2) # 1.669046 #

# length of sample 1 (x) #
lenx <- 8
# length of sample 2 (y) #
leny <- 8
# mean of sample 1 (x) #
meanx <- 72.875
# mean of sample 2 (y)#
meany <- 75.25
# SD of sample 1 (x) #
sdx <- 2.167124
# SD of sample 2 (y) #
sdy <- 1.669046

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd


Which produces the output:

[1] 1.227908

################################## 

From this output we can conclude:

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).
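As an optional sanity check, a pre-packaged function can produce the same estimate. Below is a minimal sketch using the cohen.d() function from the "effsize" package (this assumes that the package has been installed, as it is not part of base R):

# install.packages("effsize") #

library(effsize)

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

cohen.d(N2, N1) # The reported estimate should be approximately 1.23 #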

Combining both conclusions, our final written product would resemble:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).

And that is it for this article.

Until next time,

-RD

Friday, October 16, 2020

(R) Fisher’s Exact Test

In today’s entry, we are going to briefly review Fisher’s Exact Test, and its appropriate application within the R programming language.

Like the F-Test and the F-Distribution, Fisher’s Exact Test takes its name from Sir Ronald Fisher. Rather than relying on a large-sample approximation, the test computes exact probabilities for the observed contingency table directly from the hypergeometric distribution.

(The Man)

(The Distribution)

The Fisher’s Exact Test is very similar to The Chi-Squared Test. Both tests are utilized to assess categorical data classifications. The Fisher’s Exact Test was designed specifically for 2x2 contingency sorted data, though more rows could theoretically be added if necessary. A general rule for selecting the appropriate test for the given circumstances (Fisher’s Exact vs. Chi-Squared) pertains directly to the sample size. If a cell within the contingency table would contain fewer than 5 expected observations, a Fisher’s Exact Test would be more appropriate.

The test itself was created for the purpose of studying small observational samples. For this reason, the test is considered to be “conservative”, as compared to The Chi-Squared Test. Or, in layman terms, you are less likely to reject the null hypothesis when utilizing a Fisher’s Exact Test, as the test errs on the side of caution. As previously mentioned, the test was designed for smaller observational series, therefore, its conservative nature is a feature, not an error.

Let’s give it a try in today’s…

Example:

A professor instructs two classes on the subject of Remedial Calculus. He believes, based on a book that he recently completed, that students who consume avocados prior to taking an exam will generally perform better than students who do not consume avocados prior to taking an exam. To test this hypothesis, the professor has one of his classes consume avocados prior to a very difficult pass/fail examination. The other class does not consume avocados, and also completes the same examination. He collects the results of his experiment, which are as follows:

Class 1 (Avocado Consumers)

Pass: 15

Fail: 5

Class 2 (Avocado Abstainers)

Pass: 10

Fail: 15

It is also worth mentioning that the professor will be assuming an alpha value of .05.

# The data must first be entered into a matrix #

Model <- matrix(c(15, 10, 5, 15), nrow = 2, ncol=2)

# Let’s examine the matrix to make sure everything was entered correctly #

Model


Console Output:

[,1] [,2]
[1,] 15 5
[2,] 10 15


# Now to apply Fisher’s Exact Test #

fisher.test(Model)

Console Output:

        Fisher's Exact Test for Count Data

data: Model
p-value = 0.03373
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.063497 20.550173
sample estimates:
odds ratio
4.341278


Findings:

Fisher’s Exact Test was applied to our experimental findings for analysis. The results of such indicated a significant relationship as it pertains to avocado consumption and examination success: 75% (15/20), as compared to non-consumption and examination success: 40% (10/25); (p = .03).

If we were to apply the Chi-Squared Test to the same data matrix, we would receive the following output:

# Application of Chi-Squared Test to prior experimental observations #

chisq.test(Model, correct = FALSE)

Console Output:

Pearson's Chi-squared test

data: Model
X-squared = 5.5125, df = 1, p-value = 0.01888


Findings: 

As you might have expected, the application of the Chi-Squared Test yielded an even smaller p-value! If we were to utilize this test in lieu of The Fisher’s Exact Test, our results would also demonstrate significance.

That is all for this entry.

Thank you for your patronage.

I hope to see you again soon.

-RD

Wednesday, October 14, 2020

Why Isn’t My Excel Function Working?! (MS-Excel)

Even an old data scientist can learn a new trick every once in a while.

Today was such a day.

Imagine my shock, as I spent about two and a half hours trying to get the most basic MS-Excel Functions to correctly execute.

This brings us to today’s example.

I’m not sure if this is now a default option within the latest version of Excel, or why this option would even exist, however, I feel that it is my duty to warn you of its existence.


For the sake of this demonstration, we’ll hypothetically assume that you are attempting to write a =COUNTIF function within cell: C2, in order to assess the value contained within cell: A2 (for example, something along the lines of =COUNTIF(A2, "Car")). If we were to drag this formula to the cells beneath C2, in order to apply the function to cells C3 and C4, a mis-application occurs, as the value “Car” is not contained within A3 or A4, and yet, the value 1 is returned.

If this “error” arises, it is likely due to the option “Manual” being pre-selected within the “Calculation Options” drop-down menu, which itself is contained within the “Formulas” ribbon menu. To remedy this situation, change the selection to “Automatic” within the “Calculation Options” drop-down.

(Click on image to enlarge)

The result should be the previously expected outcome:


Instead of accidentally and unknowingly encountering this error/feature in a way which is detrimental to your research, I would recommend always checking that “Calculation Options” is set to “Automatic” prior to beginning your work within the MS-Excel platform.


I hope that you found this article useful.

I’ll see you in the next entry.

-RD

Tuesday, October 6, 2020

Averaging Across Variable Columns (SPSS)

There may be a more efficient way to perform this function, as simpler functionality exists within other programming languages. However, I have not been able to discover a non “ad-hoc” method for performing this task within SPSS.

We will assume that we are operating within the following data set:


Which possesses the following data labels:


Assuming that all variables are on a similar scale, we could create a new variable by utilizing the code below:

COMPUTE CatSum=MEAN(VarA,
VarB,
VarC).
EXECUTE.

This new variable will be named “CatSum”. This variable will be comprised of the row-wise mean of each observation’s values across the variables: (“VarA”, “VarB”, “VarC”).


To generate the mean value of our newly created “CatSum” variable, we would execute the following code:

DESCRIPTIVES VARIABLES=CatSum
/STATISTICS=MEAN STDDEV. 

This produces the output:



To reiterate what we are accomplishing by performing this task, we are simply generating the mean value across all of the observational values contained within the variables: “VarA”, “VarB”, “VarC”.

Another way to conceptually envision this process, is to imagine that we are placing all of the variables together into a single column:


After which, we are generating the mean value of the column which contains all of the combined variable observational values.
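For comparison, the same idea can be sketched within R (utilizing hypothetical values for "VarA", "VarB", and "VarC"):

# Hypothetical example data #

VarA <- c(1, 2, 3)

VarB <- c(4, 5, 6)

VarC <- c(7, 8, 9)

# Row-wise mean across the three variables (the "CatSum" analog) #

CatSum <- rowMeans(cbind(VarA, VarB, VarC))

# Mean of the combined observational values #

mean(CatSum)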

And that, is that!

At least, for this article. 

Stay studious in the interim, Data Heads!

- RD