Sunday, September 25, 2022

(R) Machine Learning - Trees - Pt. I

This article is the first of many which will discuss the topic of Machine Learning. Throughout the subsequent articles published on this site, we will examine Machine Learning as a topic, and the theories and algorithms which ultimately serve as the subject's foundation.

While I do not personally consider the equations embedded within the "rpart" package to be machine learning in the literal sense, those who act as authorities on such matters have defined it otherwise. By the definition postulated by the greater community, tree-based models represent an aspect of machine learning known as "supervised learning". What this essentially implies is that the computer software derives a statistical solution to an evidence-based question posed by the user, after which the user has the opportunity to review the solution and its rationale, and make model edits where necessary.

The functionality implemented within tree-based models is often drawn from an abstract or white paper written by mathematicians. In many cases, the algorithms which ultimately animate the decision-making process are too difficult, or too cumbersome, for a human being to apply by hand. This does not mean that such undertakings are impossible; however, given that the time commitment depends on the size of the data frame being analyzed, the more pragmatic approach is to leave the process entirely to the machines designed to perform such functions.

Introducing Tree-Based Models with "rpart"

Like K-Means clustering, "rpart" relies on an underlying algorithm which, due to its complexity, produces results that are difficult to verify. Unlike a process such as categorical regression, much occurs outside of the user's observation from a mathematical standpoint. Due to the nature of the analysis, no equation is output for the user to check, only the model itself. Without this proof of concept, the user can only assume that the analysis was appropriately performed, and that the model produced was the optimal variation for future application.

For the examples included within this article, we will be using the R data set "iris".

Preparing for Analysis

Before we begin, you will need to download two separate auxiliary packages from the CRAN repository, those being:

"rpart"

and

"rpart.plot"

Once you have completed this task, we will move forward by reviewing the data set prior to analysis.

This can be achieved by initiating the following functions:

summary(iris)

head(iris)


Since the data frame is initially sorted and organized by "Species", prior to performing the analysis, we must take steps to randomize the data contained within the data frame.

Justification for Randomization

Presenting data to a machine which performs analysis through the utilization of an algorithm is somewhat analogous to teaching a young child. To better illustrate this concept, I will present a demonstrative scenario.

Let's imagine that, for some particular reason, you were attempting to instruct a very young child on the topic of dogs, and to accomplish such, you presented the child with a series of pictures consisting only of golden Labradors. As you might imagine, the child would walk away from the exercise with the notion that dogs always consist of the features associated with Labradors of the golden variety. Instead of believing that "dog" is a generalized descriptor which encompasses numerous minute and discretely defined features, the child will believe that all dogs are golden Labradors, and that golden Labradors are the only type of dog.

Machines learn* in a similar manner. Each algorithm provides a distinct and unique methodology as it pertains to the overall outcome of the analysis. However, the typical algorithm, in a similar manner to a human, possesses a bias based solely on the data as it is initially presented. This is why randomization of the data, which presents a diverse and robust summary of the data source, is so integral to the process.

This method of randomization was inspired by the YouTube user: Jalayer Academy. A link to the video which describes this randomization technique can be found below.

* - or the algorithm that is associated with the application which creates the appearance of such.

# Set randomization seed #

set.seed(454)

# Create a series of random values from a uniform distribution. The number of values being generated will be equal to the number of row observations specified within the data frame. #

rannum <- runif(nrow(iris))

# Order the data frame rows according to the ordering of the random values #

raniris <- iris[order(rannum), ]


Training Data and The "rpart" Algorithm

Before we apply the algorithm within the "rpart" package, there are two separate topics which I wish to discuss.

The "rpart" algorithm, as was previously mentioned, is one of many machine learning methodologies which can be utilized to analyze data. The differentiating factor which separates methodologies is typically based on the underlying algorithm which is applied to initial data frame. In the case of "rpart", the methodology utilized, was initially postulated by: Breiman, Friedman, Olshen and Stone.

Classification and Regression Trees

On the topic of training data, let us again return to our previous child-training example. When teaching a child utilizing the flash card method discussed prior, you may be inclined to set a few of the cards which you have designed aside. The reason for such is that these cards can be utilized after the initial training to test the child's comprehension of the subject matter.

Most machines are trained in a similar manner. A portion of the initial data frame is typically set aside in order to test the overall strength of the model after the model's synthesis is complete. After passing the additional data through the model, a rough conclusion can be drawn as it pertains to the overall effectiveness of the model's design. 
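
In the examples which follow, we will hold data out in exactly this manner: the first 100 rows of the randomized data frame will train the model, and the remaining 50 rows will test it. A minimal sketch of that split (the names "trainiris" and "testiris" are illustrative only; later code will reference the subsets directly):

# Training data: the first 100 randomized rows #

trainiris <- raniris[1:100, ]

# Test data: the remaining 50 randomized rows #

testiris <- raniris[101:150, ]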

Method of Application (categorical variable)

As is the case as it pertains to linear regression, we must designate the dependent variable that we wish to predict. If the variable is a categorical variable, we will specify the rpart() function to include a method option of "class". If the variable is a continuous variable, we will specify the rpart() function to include a method option of "anova".

In this first case, we will attempt to create a model which, through the assessment of the independent variables, can properly predict the species variable.

The structure of the rpart() function is incredibly similar to the linear model function which is native within R.

model <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="class")

Let's break this structure down:

Species - Is the model's dependent variable.

Sepal.Length + Sepal.Width + Petal.Length + Petal.Width - Are the model's independent variables.

data = raniris[1:100,] - This option specifies the data which will be included within the analysis. As we discussed previously, for the purposes of our model, only the first 100 row entries of the randomized data frame will be included as the foundation upon which to structure the model.

method = "case" - This option indicates to the computer that the dependent variable is categorical and not continuous.

After running the above function, we are left with the newly created variable: "model".

Conclusions

From this variable we can draw various conclusions.

Running the variable: "model" within the terminal should produce the following console output:

n= 100

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000)

2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *

3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)

6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) *

7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *


Let's break this structure down:

Structure Summary

n = 100 - This is the initial number of observations passed into the model.

Logic of the nodal split – Example: Petal.Length>=2.45

Total Observations Included within node - Example: 69

Observations which were incorrectly designated - Example: 34

Nodal Designation – Example: versicolor

Proportion of observations occupying each category – Example: (0.00000000 0.50724638 0.49275362)

The Structure Itself

root 100 65 versicolor (0.31000000 0.35000000 0.34000000) - "root" is the initial pool of observations fed through the tree model, hence the term. The second number (65) is the number of observations which would be misclassified if every observation were labeled with the node's designation (versicolor). The numbers found within the parentheses are the proportional breakdowns of the observations by category.

Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) * - The first split, which filters model data between two branches. This branch sorts data to the left leaf, in which 31 of the observations are setosa (100%). The condition which determines the sorting of the data is the observation's Petal.Length value (< 2.45). The (*) symbol indicates that the node is a terminal node, meaning that the node is itself a leaf.

Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362) - This branch indicates a split based on the right-sided alternative to the prior condition. The first number (69) indicates the number of cases which remain prior to further sorting, and the subsequent number (34) indicates the number of those cases which are virginica (and not versicolor). The set of numbers within the parentheses indicates the proportion of the remaining 69 cases which are versicolor (51%), and the proportion which are virginica (49%).

Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) * - This branch indicates a left split. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, 35 of the observations are versicolor (95%) and 2 of the observations are virginica (5%).

Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) * - This branch indicates the right split alternative. The (*) symbol indicates that the node is a terminal node. Of the cases sorted through the node, all 32 of the observations are virginica (100%), and 0 are versicolor (0%).

Further information, for inference, can be generated by running the following code within the terminal:

summary(model)

This produces the following console output:


(I have created annotations beneath each relevant portion of output)

Call:

rpart(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length +

Petal.Width, data = raniris[1:100, ], method = "class")

n= 100

         CP nsplit  rel error     xerror       xstd
1 0.4846154      0 1.00000000 1.26153846 0.05910576
2 0.0100000      2 0.03076923 0.04615385 0.02624419

This portion of the output will be useful as we explore the process of "pruning" later in the article. 


Variable importance

 Petal.Width Petal.Length Sepal.Length  Sepal.Width
          35           31           20           14

Node number 1: 100 observations, complexity param=0.4846154

predicted class=versicolor expected loss=0.65 P(node) =1

class counts: 31 35 34

probabilities: 0.310 0.350 0.340

left son=2 (31 obs) right son=3 (69 obs)

Primary splits:

Petal.Length < 2.45 to the left, improve=32.08725, (0 missing)

Petal.Width < 0.8 to the left, improve=32.08725, (0 missing)

Sepal.Length < 5.55 to the left, improve=18.52595, (0 missing)

Sepal.Width < 3.05 to the right, improve=12.67416, (0 missing)

Surrogate splits:

Petal.Width < 0.8 to the left, agree=1.00, adj=1.000, (0 split)

Sepal.Length < 5.45 to the left, agree=0.89, adj=0.645, (0 split)

Sepal.Width < 3.35 to the right, agree=0.83, adj=0.452, (0 split)



The initial split from the root.



Node number 2: 31 observations

predicted class=setosa expected loss=0 P(node) =0.31

class counts: 31 0 0

probabilities: 1.000 0.000 0.000





Filtered results which exist within the "setosa" leaf.




Node number 3: 69 observations, complexity param=0.4846154

predicted class=versicolor expected loss=0.4927536 P(node) =0.69

class counts: 0 35 34

probabilities: 0.000 0.507 0.493

left son=6 (37 obs) right son=7 (32 obs)





The results of the aforementioned split prior to being filtered through the petal width conditional.




Primary splits:

Petal.Width < 1.65 to the left, improve=30.708970, (0 missing)

Petal.Length < 4.75 to the left, improve=25.420120, (0 missing)

Sepal.Length < 6.35 to the left, improve= 7.401845, (0 missing)

Sepal.Width < 2.95 to the left, improve= 3.878961, (0 missing)

Surrogate splits:

Petal.Length < 4.75 to the left, agree=0.899, adj=0.781, (0 split)

Sepal.Length < 6.15 to the left, agree=0.754, adj=0.469, (0 split)

Sepal.Width < 2.95 to the left, agree=0.696, adj=0.344, (0 split)


Node number 6: 37 observations

predicted class=versicolor expected loss=0.05405405 P(node) =0.37

class counts: 0 35 2

probabilities: 0.000 0.946 0.054





Filtered results which exist within the "versicolor" leaf.




Node number 7: 32 observations

predicted class=virginica expected loss=0 P(node) =0.32

class counts: 0 0 32

probabilities: 0.000 0.000 1.000





Filtered results which exist within the "virginica" leaf.




Visualizing Output with a Much-Needed Illustration

If you got lost somewhere along the way during the prior section, don't be ashamed, it is understandable. I am not in any way operating under the pretense that any of this is innate or easily grasped.

However, much of what I attempted to explain in the preceding paragraphs can be best summarized through the utilization of the "rpart.plot" package.

# Model Illustration Code #

rpart.plot(model, type = 3, extra = 101)


Console Output:



What is being illustrated in the graphic are the decision branches, and the leaves which ultimately serve as the destinations for the final categorical filtering process.

The leaf "setosa" contains 31 observations which were correctly identified as "setosa" observations. The total number of observations equates for 31% of the total observational rows which were passed through the model.

The leaf "versicolor" contains 35 observations which were correctly identified as "versicolor", and 2 observations which were misidentified. The misidentified observations would instead belong within the “virginica” categorical leaf. The total number of observation contained within the "versicolor" leaf, both correct and incorrect, equal for a total of 37% of the observational rows which were passed through the model.

The leaf "virginica" contains 32 observations which were correctly identified as "virginica". The total number of observations equates for 32% of the total observational rows which were passed through the model.

Testing the Model

Now that our decision tree model has been built, let's test its predictive ability with the data which we left absent from our initial analysis.

# Create "confusion matrix" to test model accuracy #

prediction <- predict(model, raniris[101:150,], type="class")


table(raniris[101:150,]$Species, predicted = prediction)

A variable named "prediction" is created through the utilization of the predict() function. Passed to this function as options are: the model variable, the remaining rows of the randomized "iris" data frame, and the model type.

Next, a table is created which illustrates the differentiation between what was predicted and what the observation occurrence actually equals. The option "predicted = " will always equal your prediction variable. The numbers within the brackets [101:150, ] specify the rows of the randomized data frame which will act as test observations for the model. “raniris” is the data frame from which these observations will be drawn, and “$Species” specifies the data frame variable which will be assessed.

The result of initiating the above lines of code produces the following console output:

            predicted
             setosa versicolor virginica
  setosa         19          0         0
  versicolor      0         13         2
  virginica       0          2        14


This output table is known as a "confusion matrix". Its purpose is to sort the output into a readable format which illustrates the number of correctly predicted outcomes, and the number of incorrectly predicted outcomes, within each category. In this particular case, all setosa observations were correctly predicted. 13 versicolor observations were correctly predicted, with 2 observations misattributed as virginica. 14 virginica observations were correctly attributed, with 2 observations misattributed as versicolor.
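
As a hypothetical follow-up, the model's overall accuracy can be read directly off of the confusion matrix, as it is simply the proportion of observations which fall along the matrix's diagonal:

# Store the confusion matrix, then compute overall model accuracy #

confmat <- table(raniris[101:150,]$Species, predicted = prediction)

sum(diag(confmat)) / sum(confmat)

In this case, the result would be (19 + 13 + 14) / 50 = 0.92, meaning that 92% of the test observations were correctly classified.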

Method of Application (continuous variable) 

Now that we’ve successfully analyzed categorical data, we will progress within our study by also demonstrating rpart’s capacity as it pertains to the analysis of continuous data.

Again, we will be utilizing the “iris” data set. However, in this scenario, we will omit “Species” from our model, and instead of attempting to identify the species of the iris in question, we will attempt to predict the sepal length of an iris plant based on its other attributes. Therefore, in this example, our dependent variable will be “Sepal.Length”.

The main differentiation between the continuous data model and the categorical data model within the “rpart” package is the option which specifies the analytical methodology. Instead of specifying (method=”class”), we will instruct the package function to utilize (method=”anova”). Therefore, the function which will lead to the creation of the model will resemble:

anmodel <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = raniris[1:100,], method="anova")

Once the model is built, let’s take a look at the summary of its internal aspects:

summary(anmodel)

This produces the output:


Call:
rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = raniris[1:100, ], method = "anova")
n= 100

CP nsplit rel error xerror xstd
1 0.57720991 0 1.0000000 1.0240753 0.12908984
2 0.12187301 1 0.4227901 0.4792432 0.07380297
3 0.06212228 2 0.3009171 0.3499328 0.04643313
4 0.03392768 3 0.2387948 0.2920761 0.04577809
5 0.01783361 4 0.2048671 0.2920798 0.04349656
6 0.01614077 5 0.1870335 0.2838212 0.04639387
7 0.01092541 6 0.1708927 0.2792003 0.04602130
8 0.01000000 7 0.1599673 0.2849910 0.04586765

Variable importance
Petal.Length  Petal.Width  Sepal.Width
          46           37           17

Node number 1: 100 observations, complexity param=0.5772099
mean=5.834, MSE=0.614244
left son=2 (49 obs) right son=3 (51 obs)
Primary splits:
Petal.Length < 4.25 to the left, improve=0.57720990, (0 missing)
Petal.Width < 1.15 to the left, improve=0.53758000, (0 missing)
Sepal.Width < 3.35 to the right, improve=0.02830809, (0 missing)
Surrogate splits:
Petal.Width < 1.35 to the left, agree=0.96, adj=0.918, (0 split)
Sepal.Width < 3.35 to the right, agree=0.65, adj=0.286, (0 split)

Node number 2: 49 observations, complexity param=0.06212228
mean=5.226531, MSE=0.1786839
left son=4 (34 obs) right son=5 (15 obs)
Primary splits:
Petal.Length < 3.45 to the left, improve=0.4358197, (0 missing)
Petal.Width < 0.35 to the left, improve=0.3640792, (0 missing)
Sepal.Width < 2.95 to the right, improve=0.1686580, (0 missing)
Surrogate splits:
Petal.Width < 0.8 to the left, agree=0.939, adj=0.8, (0 split)
Sepal.Width < 2.95 to the right, agree=0.878, adj=0.6, (0 split)

Node number 3: 51 observations, complexity param=0.121873
mean=6.417647, MSE=0.3375317
left son=6 (39 obs) right son=7 (12 obs)
Primary splits:
Petal.Length < 5.65 to the left, improve=0.4348743, (0 missing)
Sepal.Width < 3.05 to the left, improve=0.1970339, (0 missing)
Petal.Width < 1.95 to the left, improve=0.1805629, (0 missing)
Surrogate splits:
Sepal.Width < 3.15 to the left, agree=0.843, adj=0.333, (0 split)
Petal.Width < 2.15 to the left, agree=0.824, adj=0.250, (0 split)

Node number 4: 34 observations, complexity param=0.03392768
mean=5.041176, MSE=0.1288927
left son=8 (26 obs) right son=9 (8 obs)
Primary splits:
Sepal.Width < 3.65 to the left, improve=0.47554080, (0 missing)
Petal.Length < 1.35 to the left, improve=0.07911083, (0 missing)
Petal.Width < 0.25 to the left, improve=0.06421307, (0 missing)

Node number 5: 15 observations
mean=5.646667, MSE=0.03715556

Node number 6: 39 observations, complexity param=0.01783361
mean=6.205128, MSE=0.1799737
left son=12 (30 obs) right son=13 (9 obs)
Primary splits:
Sepal.Width < 3.05 to the left, improve=0.1560654, (0 missing)
Petal.Width < 2.05 to the left, improve=0.1506123, (0 missing)
Petal.Length < 4.55 to the left, improve=0.1334125, (0 missing)
Surrogate splits:
Petal.Width < 2.25 to the left, agree=0.846, adj=0.333, (0 split)

Node number 7: 12 observations
mean=7.108333, MSE=0.2257639

Node number 8: 26 observations
mean=4.903846, MSE=0.07344675

Node number 9: 8 observations
mean=5.4875, MSE=0.04859375

Node number 12: 30 observations, complexity param=0.01614077
mean=6.113333, MSE=0.1658222
left son=24 (23 obs) right son=25 (7 obs)
Primary splits:
Petal.Length < 5.15 to the left, improve=0.19929710, (0 missing)
Petal.Width < 1.45 to the right, improve=0.07411631, (0 missing)
Sepal.Width < 2.75 to the left, improve=0.06794425, (0 missing)
Surrogate splits:
Petal.Width < 2.05 to the left, agree=0.867, adj=0.429, (0 split)

Node number 13: 9 observations
mean=6.511111, MSE=0.1054321

Node number 24: 23 observations, complexity param=0.01092541
mean=6.013043, MSE=0.1620038
left son=48 (9 obs) right son=49 (14 obs)
Primary splits:
Petal.Width < 1.65 to the right, improve=0.18010500, (0 missing)
Petal.Length < 4.55 to the left, improve=0.12257150, (0 missing)
Sepal.Width < 2.75 to the left, improve=0.03274482, (0 missing)
Surrogate splits:
Petal.Length < 4.75 to the right, agree=0.783, adj=0.444, (0 split)

Node number 25: 7 observations
mean=6.442857, MSE=0.03673469

Node number 48: 9 observations
mean=5.8, MSE=0.1466667

Node number 49: 14 observations
mean=6.15, MSE=0.1239286 


The largest distinguishing factor between outputs is that, instead of sorting categorically, “rpart” has organized the data by mean value. “MSE” is an abbreviation for “Mean Squared Error”, which measures the average squared deviation of a node’s values from the node’s mean. The larger this value, the greater the spread between the set’s data points.
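
To make this concrete, the root node's reported MSE (0.614244) is simply the average squared distance between each training observation's sepal length and the node's mean. A minimal sketch which should reproduce that figure (the helper variable name is illustrative):

# Reproduce the root node MSE by hand #

trainlength <- raniris[1:100, ]$Sepal.Length

mean((trainlength - mean(trainlength))^2)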


As always, the phenomenon which is demonstrated within the raw output will look better in graphical form. To create an illustration of the model, utilize the code below:

# Note: by default, rpart.plot attempts to round the split values displayed within an ANOVA model’s output graphic #

# To avoid this behavior (and its associated warnings), I have explicitly disabled the “roundint” option #

rpart.plot(anmodel,extra = 101, type =3, roundint = FALSE)


This creates the following output:


In the leaves at the bottom of the graphic, the topmost value represents the mean value of the dependent variable, the n value represents the number of observations which occupy that leaf, and the percentage value represents the number of observations within the leaf, divided by the number of observations within the entire set.

Testing the Model

Now that our decision tree model has been built, let's test its predictive ability with the data which was left absent from our initial analysis.

When assessing non-categorical models for their predictive capacity, there are numerous methodologies which can be employed. In this article, we will be discussing two specifically.

Mean Absolute Error

The first measure of predictive capacity that we will be discussing is known as the Mean Absolute Error. The Mean Absolute Error is the mean of the absolute values of the differences derived from subtracting each predicted value from its corresponding observed value.

https://en.wikipedia.org/wiki/Mean_absolute_error

Within the R platform, deriving this value can be achieved through the utilization of the following code:

# Generate model predictions for the test data #

anprediction <- predict(anmodel, raniris[101:150,])

# Create MAE function #

MAE <- function(actual, predicted) {mean(abs(actual - predicted))}

# Function Source: https://www.youtube.com/watch?v=XLNsl1Da5MA #

# Utilize MAE function #

MAE(raniris[101:150,]$Sepal.Length, anprediction)

Console Output:

[1] 0.2976927

The above output indicates that there is, on average, a difference of 0.298 centimeters between the predicted value of sepal length and the actual value of sepal length.

Root Mean Squared Error

The Root Mean Squared Error is another value utilized to measure the predictive capacity of models. Like the Mean Absolute Error, its formula is applied to the observed values as they appear within the initial data frame, and the predicted values generated by the model.

However, the manner in which the output value is synthesized is less straightforward. The value is generated by taking the square root of the average of the squared differences between each predicted value and its corresponding observed value. As a result, the final output of the Root Mean Squared Error is more difficult to interpret than its Mean Absolute Error counterpart.

The Root Mean Squared Error is more sensitive to large differences between predicted and observed values. With the Mean Absolute Error, given enough observations, the eventual output value is smoothed out enough to provide the appearance of less distance between individual values than is actually the case. However, as was previously mentioned, the Root Mean Squared Error, through the method by which its value is synthesized, preserves a certain amount of that distance variation regardless of the size of the set.

https://en.wikipedia.org/wiki/Root-mean-square_deviation
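
Before turning to the package implementation, the formula can also be written out by hand, mirroring the MAE function above. A minimal sketch (the function name "RMSE" is illustrative):

# Create RMSE function: the square root of the mean of the squared differences #

RMSE <- function(actual, predicted) {sqrt(mean((actual - predicted)^2))}

# Utilize RMSE function on the model test data #

RMSE(raniris[101:150,]$Sepal.Length, anprediction)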

Within the R platform, deriving this value can be achieved through the utilization of the following code:

# Generate model predictions for the test data #

anprediction <- predict(anmodel, raniris[101:150,])

# With the package "metrics" downloaded and enabled #

rmse(raniris[1:100,]$Sepal.Length, anprediction)

# Compute the Root Mean Standard Error (RMSE) of model test data #

prediction <- predict(anmod, raniris[101:150,], type="class")

Console Output:

[1] 1.128444

Decision Tree Nomenclature

As much of the terminology within the field of “machine learning” is synonymously applied regardless of model type, it is important to understand the basic descriptive terms in order to familiarize oneself with the contextual aspects of the subject matter.

In generating the initial graphic with the code:

rpart.plot(model, type = 3, extra = 101)

We were presented with the illustration below:


The “rpart” package, as it pertains to the model output provided, identifies each aspect of the model in the following manner:

# Generate model output with the following code #

model


> model
n= 100


node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 100 65 versicolor (0.31000000 0.35000000 0.34000000)
2) Petal.Length< 2.45 31 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 69 34 versicolor (0.00000000 0.50724638 0.49275362)
6) Petal.Width< 1.65 37 2 versicolor (0.00000000 0.94594595 0.05405405) *
7) Petal.Width>=1.65 32 0 virginica (0.00000000 0.00000000 1.00000000) *

If this identification was provided within a graphical representation of the model, the illustration would resemble the graphic below:


However, universally, the following graphic is a better representation of what each term is utilized to describe within the context of the field of study.

# Illustrate the model #

rpart.plot(model)


The first graphic provides a much more pragmatic representation of the model, a representation which is perfectly in accordance with the manner in which the rpart() function summarizes the data. The latter graphic illustrates the technique which is traditionally synonymous with the way in which a model of this type would be represented.

Therefore, if an individual were discussing this model with an outside researcher, he would refer to the model as possessing 3 leaves and 2 nodes. The tree being in possession of 1 root is essentially inherent. The term “branches” describes the black lines which connect the various other aspects of the model. However, like the root of the tree, the branches themselves do not warrant mention. In summary, when referring to a tree model, it is common practice to define it generally by the number of nodes and leaves it possesses.

Pruning with prune()

There will be instances in which you may wish to simplify a model by removing some of its extraneous nodes. Accomplishing such can be motivated by either a desire to simplify the model, or an attempt to optimize the model’s predictive capacity.

We will apply the pruning function to the second example model that we previously created.

First, we must find the CP value of the model that we wish to prune. This can be achieved through the utilization of the code:

printcp(anmodel)

This presents the following console output:

> printcp(anmodel)

Regression tree:
rpart(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = raniris[1:100, ], method = "anova")

Variables actually used in tree construction:
[1] Petal.Length Petal.Width Sepal.Width

Root node error: 61.424/100 = 0.61424

n= 100

CP nsplit rel error xerror xstd
1 0.577210 0 1.00000 1.04319 0.133636
2 0.121873 1 0.42279 0.52552 0.081797
3 0.062122 2 0.30092 0.39343 0.051912
4 0.033928 3 0.23879 0.32049 0.050067
5 0.017834 4 0.20487 0.32167 0.050154
6 0.016141 5 0.18703 0.29403 0.047955
7 0.010925 6 0.17089 0.29242 0.048231
8 0.010000 7 0.15997 0.29256 0.048205


Each row in the list represents a candidate tree size, with the initial row (1) representing the model’s root. The typical course of action for pruning an “rpart” tree is to first identify the row with the lowest cross-validation error (xerror), which in this case is 0.29242. Once this value has been identified, we must make note of its corresponding CP score (0.010925). It is this value which will be utilized within our pruning function to modify the model.

With the above information ascertained, we can move forward in the pruning process by initiating the following code within R console.

prunedmodel <- prune(anmodel, cp = 0.010925)

In the case of our example, due to the small CP value, no modifications were made to the original model. However, this is not always the case. I encourage you to experiment with this function as it pertains to your own rpart models; the best way to learn is through repetition.
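
As a hypothetical illustration of a pruning which does visibly modify the model, supplying a deliberately larger cp value, one which falls between the second and third CP entries of the table above, should collapse the tree down to its first two splits:

# Prune the model back to its first two splits (illustrative cp value) #

smallermodel <- prune(anmodel, cp = 0.1)

# Illustrate the pruned model #

rpart.plot(smallermodel, extra = 101, type = 3, roundint = FALSE)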

Dealing with Missing Values

Typically, when analyzing real world data sets, there will be instances in which various variable observation values are absent. You should not let this all too common occurrence hinder your model ambitions. Thankfully, within the rpart function, there exists a mechanism for dealing with missing values. However, this mechanism only applies to observations which are missing independent variable values; observations which are missing their dependent variable entries should be removed prior to analysis.

After testing the functionality of the method with data sets from which I had previously removed portions of data, there appeared to be very little impact on model creation or predictive capacity. The underlying algorithms are also built in such a manner that incomplete data sets can be passed through the model to generate predictions.

As for how the underlying functionality of the rpart package estimates its way around missing variable observations: per the package’s manual, rpart constructs “surrogate” splits. When the variable required by a primary split is missing from an observation, the observation is instead routed by a surrogate variable, a substitute whose own split most closely agrees with the primary split among the complete observations.
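
As a minimal sketch of this behavior, we can remove a value from one of our own test observations and confirm that the categorical model from earlier still produces a prediction (the name "testrow" is illustrative):

# Copy a test observation and remove its Petal.Length value #

testrow <- raniris[101, ]

testrow$Petal.Length <- NA

# The model should fall back on a surrogate split (ex. Petal.Width) to predict #

predict(model, testrow, type = "class")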

Conclusion

The basic tree model, as it is discussed within the contents of this article, is often passed over in favor of the random forest model. However, as you will observe in future articles, the basic tree model is not without merit; due to its singular nature, it is the easier model to explain and conceptually visualize. Both of the latter qualities are extremely valuable as they relate to data presentation and research publication. In the next article we will be discussing “Bagging”. Until then, stay subscribed, Data Heads.

Wednesday, September 21, 2022

(Python) Enabling the Nvidia GPU - TensorFlow and Keras Utilization

Though this website does not typically feature articles related to TensorFlow or Keras (machine learning libraries), due to reader requests, in this entry I will illustrate how to enable Nvidia GPU utilization as it pertains to the aforementioned packages.

I will be operating under the following assumptions:

1. You are in possession of a Windows PC which contains a Nvidia GPU. 

2. You are relatively familiar with the TensorFlow and Keras libraries. 

3. You are utilizing the Anaconda Python distribution.

If the above assumptions are correct, then I have designated 4 steps towards the completion of this process:

1. Installation of the most recent Nvidia GPU drivers. 

2. Installation of the “tensorflow-gpu” package library.

3. Installation of the CUDA toolkit.

4. Trouble-shooting.

Installing the most recent Nvidia drivers:

This step is relatively self-explanatory. If you are in possession of a computer which contains a Nvidia GPU, you should have the following program located on your hard drive: “GeForce Experience”. Depending on the type of Nvidia GPU which you possess, the name of the program may vary. To locate this program, or a similar program which achieves the same result, search “Nvidia” from the desktop start bar.

After you have launched the Nvidia desktop interface, you will be asked to create a Nvidia Account. To achieve this, enter the appropriate information within the coinciding menu prompts. Once this has been completed, follow the link within the confirmation e-mail to finalize the creation of your user account.

With your new Nvidia account created, you will possess the ability to access the latest driver updates within the Nvidia console interface. Be sure to update all of the drivers which are listed, prior to proceeding to the next step.


Installing the TensorFlow GPU Package Library

Completing this simple pre-requisite can be achieved by either:

A. Running the following code within the Jupyter Notebook programming environment:

import pip

pip.main(['install', 'tensorflow-gpu'])

B. Running the following code within the “Anaconda Prompt”:

conda install tensorflow-gpu

To reach the “Anaconda Prompt” terminal, type “Anaconda Prompt” into the Windows desktop search bar.

Installing the CUDA toolkit

You are now prepared to complete the final pre-requisite, which is the most complicated of all of the required steps.

First, you must click the link below:

https://developer.nvidia.com/cuda-downloads

The address above will direct you to the Nvidia webpage.

Select the appropriate options which pertain to your operating system from the list of selections. Doing such will present a download link to the version of the CUDA software which is best suited for your PC.


The rest of the installation process is relatively straight-forward.

The download will produce the file below:


(File name will vary based on operating system and version selection)

Double clicking this file icon will begin the installation process.


After clicking through the associated options, the following screen should appear:


If you do not have Microsoft Visual Studio installed on your PC, you will be presented with an installation error. However, if you do not intend to utilize the CUDA software for Visual Studio development, you can continue the installation process without further hesitation.

Once the process is fully completed, the following shortcut icon should appear on your PC’s desktop:


I would now advise that you re-start your PC prior to implementing GPU utilization within your machine learning projects.

Trouble Shooting

I’ve found that GPU-enabled TensorFlow projects, at least in my experience, tend to be more error prone from session to session. However, I accept this shortcoming due to the significant speed increase enabled by GPU utilization.

Utilizing the newly installed GPU implementation is relatively simple, as the implementation is automatically assumed within the model structure. Meaning, the alteration of pre-existing machine learning project code is unnecessary as it pertains to GPU optimization. If you run code which was previously created for the purpose of utilizing the Keras and TensorFlow libraries, then the computer will automatically perform the analysis through the GPU hardware architecture.

To ensure that GPU functionality is enabled, you may run the following lines of code within the Anaconda coding platform:

from tensorflow.python.client import device_lib

from keras import backend as K

print(device_lib.list_local_devices())

K.tensorflow_backend._get_available_gpus()

This should produce output which includes the term: ‘GPU’. If this is the case, then GPU utilization has been successfully enabled.

If, for whatever reason, errors occur as it relates to keras or tensorflow implementation following the installation of the prior programs, try completing any of the following steps to remedy this occurrence.

1. Restart the Anaconda platform and Jupyter Notebook.

2. Uninstall and re-install both the tensorflow and tensorflow-gpu libraries from the Anaconda Prompt (command line). This can be achieved by utilizing the code below:

conda uninstall tensorflow

conda uninstall tensorflow-gpu

conda install tensorflow

conda install tensorflow-gpu


Assuming that these remedies addressed and solved any previously present issue, you should now be prepared to experience the blazing speed enabled by Nvidia GPU utilization.

That’s all for this entry.

Stay busy, Data Heads!

Monday, September 12, 2022

Off the Beaten Path

Between the tremendous amount of work which has found its way to yours truly, and a general lack of relevant content ideas, this blog has been on a temporary hiatus.

However, I plan on resurrecting this site for normal weekly entries moving forward.

There is nothing more depressing than revisiting the ruins of a previously maintained site.

Much of the low hanging fruit as it relates to the topic of basic statistics, and the utilization of various statistical programs, has been thoroughly covered throughout the prior articles. For this reason, I’ve had some trouble deciding what topics ought to be covered in addition to what has already been discussed.

But don’t despair, I have a few ideas as to where we might be going next.

Please stay tuned, as data never sleeps.

-RD

Sunday, July 18, 2021

Pivot Tables (MS-Excel)

You didn’t honestly believe that I would continue to write articles without mentioning every analyst’s favorite Excel technique, did you?

Example / Demonstration:

For this demonstration, we are going to be utilizing the “Removing Duplicate Entries (MS-Excel).csv” data file. This file can be found within the GitHub data repo, upload date: July 12, 2018. If you are too lazy to navigate over to the repo site, the raw .csv data can be found below:

VARA,VARB,VARC,VARD
Mike,1,Red,Spade
Mike,2,Blue,Club
Mike,1,Red,Spade
Troy,2,Green,Diamond
Troy,1,Red,Heart
Archie,2,Orange,Heart
Archie,2,Yellow,Diamond
Archie,2,Orange,Heart
Archie,1,Red,Spade
Archie,1,Blue,Spade
Archie,2,Red,Club
Archie,2,Red,Club
Jack,1,Red,Diamond
Jack,2,Blue,Diamond
Jack,2,Blue,Diamond
Rob,1,Green,Club
Rob,2,Orange,Spade
Brad,1,Red,Heart
Susan,2,Blue,Heart
Susan,2,Yellow,Club
Susan,1,Pink,Heart
Seth,2,Grey,Heart
Seth,1,Green,Club
Joanna,2,Pink,Club
Joanna,1,Green,Spade
Joanna,1,Green,Spade
Bertha,2,Grey,Diamond
Bertha,1,Grey,Diamond
Liz,1,Green,Spade


Let’s get started!

First, we’ll take a nice look at the data as it exists within MS-Excel:


Now we’ll pivot to excellence!

The easiest way to start building pivot tables, is to utilize the “Recommended PivotTables” option button located within the “Insert” menu, listed within Excel’s ribbon menu.


This should bring up the menu below:


Go ahead and select all row entries, across all variable columns.

Once this has been completed, click “OK”.

This should generate the following menu:


Let’s break down each recommendation.

“Sum of VARB by VARD” – This table is summing the total of the numerical values contained within VARB, as they correspond with VARD entries.

“Count of VARA by VARD” – This table is counting the total number of occurrences of categorical values within variable column VARD.

“Sum of VARB by VARC” – This table is summing the total of numerical values contained within VARB, as they correspond with VARC entries.


“Count of VARA by VARC” – This table is counting the total number of occurrences of categorical values within variable column VARC.

“Sum of VARB by VARA” – This table is summing the total of the numerical values contained within VARB, as they correspond with VARA entries.

Now, there may come a time in which none of the above options match exactly what you are looking for. In this case, you will want to utilize the “PivotTable” option button, located within the “Insert” menu, listed within Excel’s ribbon menu.


Go ahead and select all row entries, across all variable columns.

Change the option button to “New Worksheet”, instead of “Existing Worksheet”.

Once this has been completed, click “OK”.

Once this has been accomplished, you’ll be graced with a new menu, on a new Excel sheet (same workbook).

I won’t go into every single output option that you have available, but I will list a few you may want to try yourself. Each output variation can be created by dragging and dropping the variables listed within the topmost box, in varying order, into the boxes below: 


If VARA and VARC are both added to Rows, you will view the categorical occurrences of variable entries from VARC, with VARA acting as the unique ID.

Order matters in each pivot table variable designation place.

So, if we reverse the position of VARA and VARC, and instead list VARC first, followed by VARA, then we will see a table which lists the categorical occurrences of VARA, with VARC acting as a unique ID.

If we include VARA and VARC as rows (in that order), and set the values variable to Sum of VARB, then the output should more closely resemble an accounting sheet, in which each numerical value corresponding with VARA, categorized by VARC, is summed (VARB).


If we instead wanted the count, as opposed to the sum, we could click on the drop down arrow located next to “Count of VARB”, which presents the following options:


From the options listed, we will select “Value Field Settings”.

This presents the following menu, from which we will select “Count”.


The result of following the previously listed steps is illustrated below:


The Pivot Table creation menu also allows for further customization through the addition of column variables.

In the case of our example, we will make the following modifications to our table output:


VARC will now be designated as a column variable, VARA will be a row variable, and the count of VARB will be our values variable.

The result of these modifications is shown below:


Our output format now contains a table which contains the count of each occurrence of each color (VARC), as each color corresponds with each individual listed (VARA) within the original data set. 

In conclusion, the pivot table option within MS-Excel offers a variety of display outputs which can be utilized to present statistical summary data.

The most important skill to develop as it pertains to this feature is the ability to ascertain when a pivot table is necessary for your data project needs.

So with that, we will end this article.

I will see you next time, Data Head.

-RD

Wednesday, July 14, 2021

Getting to Know the Greeks

In today’s article, we are going to go a bit off the beaten path and discuss, The Greek Alphabet!


You might be wondering, why the sudden change of subject content…?

In order to truly master the craft of data science, you will be required to stretch your mind in creative ways. The Greek Alphabet is utilized throughout the fields of statistics, mathematics, finance, computer science, astronomy, and other western intellectual pursuits. For this reason, it really ought to be taught in elementary schools. However, to my knowledge, in most cases, it is not.

The Romans borrowed heavily from Greek Civilization, and contemporary western civilization borrowed heavily from the Romans. Therefore, to truly be a person of culture, you should learn the Greek Alphabet, and really, as much as you possibly can about Ancient Greek Culture. This includes the legends, heroes, and philosophers. We might be getting more into this in other articles, but for today, we will be sticking to the alphabet.

The Greek Alphabet

The best way to learn the Greek alphabet is to be Greek (j/k, but not really). In all other cases, application is probably the best way to commit various Greek letters, as symbols, to memory.

I would recommend drawing each letter in order, uppercase, and lowercase, and saying the name of the letter as it is written.

Let’s try this together!

Α α (Alpha) (Pronounced: AL-FUH) - Utilized in statistics as the symbol which denotes significance level. In finance, it is the percentage return of an investment above or below a predetermined index.

B β (Beta) (Pronounced: BAY-TUH) - In statistics, this symbol is utilized to represent type II errors. In finance, it is utilized to determine asset volatility.

Γ γ (Gamma) (Pronounced: GAM-UH) - In physics, this symbol is utilized to represent particle decay (Gamma Decay). There also exists Alpha Decay, and Beta Decay. The type of decay situationally differs depending on the circumstances.

Δ δ (Delta) (Pronounced: DEL-TUH) - This is currently the most common strain of the novel coronavirus (7/2021). In the field of chemistry, uppercase Delta is utilized to symbolize heat being added to a reaction.

Ε ε (Epsilon) (Pronounced: EP-SIL-ON) - “Machine Epsilon” is utilized in computer science as a way of dealing with floating point values and their assessment within logical statements.

Ζ ζ (Zeta) (Pronounced: ZAY-TUH) - The most common utilization which I have witnessed for this letter is its designation as the variable which represents the Riemann Zeta Function (number theory).

Η η (Eta) (Pronounced: EE-TUH) - I’ve mostly seen this letter designated as the variable for the Dedekind eta function (number theory).

Θ θ (Theta) (Pronounced: THAY-TUH) - Theta is utilized as the symbol to represent a pentaquark, a transient subatomic particle.

Ι ι (Iota) (Pronounced: EYE-OH-TUH) - I’ve never seen this symbol utilized for anything outside of astronomical designations. Maybe if you make it big in science, you could give Iota the love that it so deserves.

Κ κ (Kappa) (Pronounced: CAP-UH) - Kappa is the chosen variable designation for Einstein’s gravitational constant.

Λ λ (Lambda) (Pronounced: LAMB-DUH) - A potential emergent novel coronavirus variant (7/2021). Lowercase Lambda is also utilized throughout the Poisson Distribution function.

Μ μ (Mu) (Pronounced: MEW) - Lowercase Mu is utilized to symbolize the mean of a population (statistics). In particle physics, it can also be applied to represent the elementary particle: Muon.

Ν ν (Nu) (Pronounced: NEW) - As a symbol, this letter represents degrees of freedom (statistics).

Ξ ξ (Xi) (Pronounced: SEE) - In mathematics, uppercase Xi can be utilized to represent the Riemann Xi Function.

Ο ο (Omicron) (Pronounced: OH-MI-CRON) - A symbol which does not get very much love, or use, unlike its subsequent neighbor…

Π π (Pi) (Pronounced: PIE) - In mathematics, lowercase Pi often represents the mathematical real transcendental constant ≈ 3.1415…etc.

Ρ ρ (Rho) (Pronounced: ROW) - In the Black-Scholes model, Rho represents the rate of change of a portfolio with respect to interest rates.

Σ σ (Sigma) (Pronounced: SIG-MA) - Lower case Sigma represents the standard deviation of a population (statistics). Upper case sigma represents a sum function (mathematics).

Τ τ (Tau) (Pronounced: TAW) - Lower case Tau represents an elementary particle within the field of particle physics.

Υ υ (Upsilon) (Pronounced: EEP-SIL-ON) - Does not really get very much use…

Φ φ (Phi) (Pronounced: FAI) - Lowercase Phi is utilized to represent the Golden Ratio.

Χ χ (Chi) (Pronounced: KAI) - Lower case Chi is utilized as a variable throughout the Chi-Square distribution function.

Ψ ψ (Psi) (Pronounced: PSY) - Lower case Psi is used to represent the (generalized) positional states of a qubit within a quantum computer.

Ω ω (Omega) (Pronounced: OHMEGA) - Utilized for just about everything.

Αυτα για τωρα. Θα σε δω την επόμενη φορά! (That’s all for now. I will see you next time!)

-RD

Friday, June 25, 2021

(R) The Levene's Test

In today’s article we will be discussing a technique which is not specifically interesting or pragmatically applicable. Still, for the sake of true data science proficiency, today we will be discussing, THE LEVENE'S TEST!

The Levene's Test is utilized to compare the variances of two separate data sets.

So naturally, our hypothesis would be:

Null Hypothesis: The variance measurements of the two data sets do not significantly differ.

Alternative Hypothesis: The variance measurements of the two data sets do significantly differ.

The Levene's Test Example:

# The leveneTest() Function is included within the “car” package #

library(car)

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

N_LEV <- c(N1, N2)

group <- as.factor(c(rep(1, length(N1)), rep(2, length(N2))))

leveneTest(N_LEV, group)

# The above code is a modification of code provided by StackExchange user: ocram. #

# Source https://stats.stackexchange.com/questions/15722/how-to-use-levene-test-function-in-r #

This produces the output:

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  1.7677 0.2049
      14

Since the p-value of the output exceeds .05, we will not reject the null hypothesis (alpha = .05).

Conclusions:

The Levene’s Test for Equality of Variances did not indicate a significant differentiation in the variance measurement of Sample N1, as compared to the variance measurement of Sample N2, F(1, 14) = 1.78, p = .21.

So, what is the overall purpose of this test? Meaning, when would its application be appropriate? The Levene’s Test is typically utilized as a pre-test prior to the application of the standard T-Test. However, it is uncommon to structure a research experiment in this manner.  Therefore, the Levene’s Test is more so something which is witnessed within the classroom, and not within the field.

Still, if you find yourself in circumstances in which this test is requested, know that it is often required to determine whether a standard T-Test is applicable. If variances are found to be unequal, a Welch’s T-Test is typically preferred as an alternative to the standard T-Test.
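
As a minimal sketch, both variations can be applied to the sample data from our example:

# Welch's T-Test (R's default, var.equal = FALSE) #

t.test(N1, N2)

# The standard (pooled) T-Test, appropriate when variances are equal #

t.test(N1, N2, var.equal = TRUE)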

-----------------------------------------------------------------------------------------------------------------------------

I promise that my next article will be more exciting.

Until next time.

-RD

Friday, June 18, 2021

(R) Imputing Missing Data with the MICE() Package


In today’s article we are going to discuss basic utilization of the MICE package.

The MICE package assists with performing analysis on shoddily assembled data frames.

In the world of data science, the real world, not the YouTube world or the classroom world, data often comes down in a less than optimal state. More often than not, this is the reality of the matter.

Now, it would be easy to throw up your hands and say, “I CAN’T PERFORM ANY SORT OF ANALYSIS WITH ALL OF THESE MISSING VARIABLES”,


~OR~

(Don’t succumb to temptation!) 

Unfortunately, for you, the data scientist, whoever passed you this data expects a product and not your excuses.

Fortunately, for all of us, there is a way forward.

Example:

Let’s say that you were given this small data set for analysis:

The data is provided in an .xls format, because why wouldn’t it be?

For the sake of not having you download an example data file, I have re-coded this data into the R format.

# Create Data Frame: "SheetB" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, NA , 1, NA, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, NA, NA, 13, 6, 2, 18, 12, NA, 13, 9, 14, 18, 6, NA, 5, 2)

VarC <- c(2, NA, 1, 1, NA, 2, 3, 1, 2, NA, 3, 4, 4, NA, 4, 3, 1, 2, 3, NA)

VarD <- c(70, 80, NA, 87, 79, 60, 61, 75, NA, 67, 62, 93, NA, 80, 91, 51, NA, 33, NA, 50)

VarE <- c(980, 800, 983, 925, 821, NA, NA, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)


SheetB <- data.frame(VarA, VarB, VarC, VarD, VarE)


If you would like to see a version of the initial example file with no missing values, the code to create this data frame is below:

# Create Data Frame: "SheetA" #

VarA <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)

VarB <- c(20, 16, 20, 4, 8, 17, 13, 6, 2, 18, 12, 17, 13, 9, 14, 18, 6, 13, 5, 2)

VarC <- c(2, 3, 1, 1, 1, 2, 3, 1, 2, 1, 3, 4, 4, 1, 4, 3, 1, 2, 3, 1)

VarD <- c(70, 80, 90, 87, 79, 60, 61, 75, 92, 67, 62, 93, 74, 80, 91, 51, 64, 33, 77, 50)

VarE <- c(980, 800, 983, 925, 821, 978, 881, 912, 987, 889, 870, 918, 923, 833, 839, 919, 905, 859, 819, 966)

SheetA <- data.frame(VarA, VarB, VarC, VarD, VarE)


In our example, we’ll assume that the sheet which contains all values (“SheetA”) is unavailable to you. Therefore, to perform any sort of meaningful analysis, you will need to either delete all observations which contain missing data variables (DON’T DO IT!), or run an imputation function.


We will opt to do the latter, and the function which we will utilize, is the mice() function.

First, we will initialize the appropriate library:

# Initialize Library #

library(mice)

Next, we will perform the imputation function contained within the library.

# Perform Imputation #

SheetB_Imputed <- mice(SheetB, m=1, maxit = 50, method = 'pmm', seed = 500)

SheetB: is the data frame which is being called by the function.

m = 1: This is the number of imputed data frame variations which will be generated as a result of the mice function. One is all that is necessary for our purposes.

maxit = 50: This is the maximum number of iterations which will occur as the mice function calculates what it determines to be the optimal value of each missing variable cell.

method = 'pmm': This is the imputation method which the function will utilize; “pmm” is an abbreviation for predictive mean matching.

seed: The mice() function partially relies on randomness to generate missing variable values. The seed value can be whatever value you determine to be appropriate; setting it makes the results reproducible.

After performing the above function, you should be greeted with the output below:

iter imp variable
1 1 VarA VarB VarC VarD VarE
2 1 VarA VarB VarC VarD VarE
3 1 VarA VarB VarC VarD VarE
(the same line repeats for iterations 4 through 49)
50 1 VarA VarB VarC VarD VarE


The output is informing you that 50 iterations were performed upon a single imputed data set.

The code below generates a completed data frame: all of the original observed values are retained, and the newly estimated values now occupy the cells which were previously blank.

# Assign Original Values with Imputations to Data Frame #

SheetB_Imputed_Complete <- complete(SheetB_Imputed)


The outcome should resemble the original complete data frame, with every previously missing cell now occupied by an estimated value.

(Beautiful!)
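
If you would like to verify this for yourself, the completed data frame can be viewed, and checked for remaining missing values, with a few lines of base R:

# View the Completed Data Frame #

head(SheetB_Imputed_Complete)

# Confirm That No Missing Values Remain (the sum should equal 0) #

sum(is.na(SheetB_Imputed_Complete))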


A quick warning: the mice() function cannot be utilized on data frames which contain unencoded categorical variable entries (i.e., raw character strings).

An example of this would be a data frame containing a variable, "VARC", whose entries are the raw character strings "Spade", "Club", "Diamond", and "Heart".

To get mice() to work correctly on this data set, you must recode "VARC" prior to proceeding. You could do this by changing each instance of "Spade" to 1, "Club" to 2, “Diamond" to 3, and "Heart" to 4.
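
As a minimal sketch of how this recoding might be accomplished, assuming a hypothetical data frame "SheetC" in which "VARC" holds the suit names as character strings:

# Hypothetical Example: Recode Card Suit Strings to Numeric Codes #

VARC <- c("Spade", "Club", "Heart", "Diamond", "Club")

SheetC <- data.frame(VARC, stringsAsFactors = FALSE)

# match() returns each entry's position within the suit vector #

# "Spade" = 1, "Club" = 2, "Diamond" = 3, "Heart" = 4 #

SheetC$VARC <- match(SheetC$VARC, c("Spade", "Club", "Diamond", "Heart"))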

For more information as it relates to this function, please consult the documentation for the mice package.

That’s all for now, internet.

-RD

Saturday, June 12, 2021

(R) 2-Sample Test for Equality of Proportions

In today’s article, we are going to revisit, in greater detail, a topic which was reviewed within a prior article.

What the 2-Sample Test for Equality of Proportions seeks to achieve is an assessment of whether one survey group’s response proportion significantly differs from that of another.

To illustrate the application of this methodology, I will utilize a prior example which was previously published to this site (10/15/2017).

Example:

A pollster took a survey of 1300 individuals, the results of which indicated that 600 were in favor of Candidate A. A second survey, taken weeks later, showed that 500 individuals out of 1500 voters were now in favor of Candidate A. At a 10% significance level, is there evidence that the candidate's popularity has decreased?

# Model Hypothesis #

# H0: p1 - p2 = 0 #

# (The proportions are the same) # 

# Ha: p1 - p2 > 0 #

# (Candidate A's popularity has decreased) #

# Disable Scientific Notation in R Output #

options(scipen = 999)

# Model Application #

prop.test(x = c(600,500), n=c(1300,1500), conf.level = .95, correct = FALSE)


Which produces the output:

2-sample test for equality of proportions without continuity correction

data: c(600, 500) out of c(1300, 1500)
X-squared = 47.991, df = 1, p-value = 0.000000000004281
alternative hypothesis: two.sided
95 percent confidence interval:
0.09210145 0.16430881
sample estimates:
prop 1 prop 2
0.4615385 0.3333333
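
A side note: our alternative hypothesis is directional (p1 - p2 > 0), and our stated significance level is 10%. As a sketch, the corresponding one-sided call would resemble the following. The chi-square statistic is unchanged, and the one-sided p-value is half of its two-sided counterpart; conf.level is set to .90 to correspond with the 10% significance level, though it affects only the reported interval, not the p-value.

# One-Sided Variant Matching the Directional Alternative Hypothesis #

prop.test(x = c(600,500), n = c(1300,1500), alternative = "greater", conf.level = .90, correct = FALSE)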

We are now prepared to state the details of our model’s application, and the subsequent findings and analysis which occurred as a result of such.

Conclusions:

A 2-Sample Test for Equality of Proportions without Continuity Correction was performed to analyze whether the initial poll results for Candidate A significantly differed from the subsequent poll results gathered weeks later. A 10% significance level (α = .10) was assumed.

There was a significant difference in Candidate A’s favorability score from the initial poll findings: 46% (600/1300), as compared to Candidate A’s favorability score from the subsequent poll findings: 33% (500/1500); χ2 (1, N = 2800) = 47.99, p < .001.

-----------------------------------------------------------------------------------------------------------------------------

That's all for now.

I'll see you next time, Data Heads.

-RD

Saturday, June 5, 2021

(R) Pearson’s Chi-Square Test Residuals and Post Hoc Analysis

In today’s article, we are going to discuss Pearson Residuals. A Pearson Residual is a diagnostic value examined during post hoc analysis. These values can be utilized to further assess Pearson’s Chi-Square Test results.

If you are unfamiliar with Pearson’s Chi-Square Test, or with what post hoc analysis typically entails, I would encourage you to do further research prior to proceeding.

Example:

To demonstrate this post hoc technique, we will utilize a prior article’s example:

The "Smoking : Obesity" Pearson’s Chi-Squared Test Demonstration.

# To build the contingency matrix for a test of independence #

Model <-matrix(c(5, 1, 2, 2),

nrow = 2,

dimnames = list("Smoker" = c("Yes", "No"),

"Obese" = c("Yes", "No")))

# To run the chi-square test #

# 'correct = FALSE' disables the Yates’ continuity correction #

chisq.test(Model, correct = FALSE)


This produces the output:

Pearson's Chi-squared test

data: Model
X-squared = 1.2698, df = 1, p-value = 0.2598

From the output provided, we can easily conclude that our results were not significant.

However, let’s delve a bit deeper into our findings.

First, let’s take a look at the matrix of the model.

Model

Obese
Smoker Yes No
Yes 5 2
No 1 2


Now, let’s take a look at the expected model values.

chi.result <- chisq.test(Model, correct = FALSE)

chi.result$expected


Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2


What does this mean?

The values above represent the counts which we would expect to observe if smoker status and obesity were entirely independent of one another; that is, the expected counts under the null hypothesis.
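
These expected counts can be reproduced by hand. Each expected count is the product of its row total and its column total, divided by the grand total. For instance, 7 of the 10 subjects are smokers, and 6 of the 10 are obese, therefore the expected "Smoker: Yes, Obese: Yes" count is (7 x 6) / 10 = 4.2. The full matrix can be re-constructed in R as follows:

# Reconstruct the Expected Counts from the Marginal Totals #

outer(rowSums(Model), colSums(Model)) / sum(Model)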


From the previously derived values, we can derive the Pearson Residual Values.

print(chi.result$residuals)

Obese
Smoker Yes No
Yes 0.3903600 -0.4780914
No -0.5962848 0.7302967

What we are specifically looking for, as it pertains to the residual output, are values which are greater than +2, or less than -2. If such values were present within any of the above matrix entries, it would indicate a cell whose observed count deviates substantially from its expected count, and which is therefore driving the overall chi-square result.

Each entry within the residual matrix is the observed count, minus the expected count, divided by the square root of the expected count.

Thus: Pearson Residual = (Observed Value – Expected Value) / √(Expected Value)

Observed Values

Obese
Smoker Yes No
Yes 5 2
No 1 2

Expected Values

Obese
Smoker Yes No
Yes 4.2 2.8
No 1.8 1.2

(5 – 4.2) / √ 4.2 = 0.3903600 

(1 – 1.8) / √ 1.8 = -0.5962848

(2 – 2.8) / √ 2.8 = -0.4780914

(2 – 1.2) / √ 1.2 = 0.7302967

~ OR ~

(5 - 4.2) / sqrt(4.2)

(1 - 1.8) / sqrt(1.8)

(2 - 2.8) / sqrt(2.8)

(2 - 1.2) / sqrt(1.2)


[1] 0.39036
[1] -0.5962848
[1] -0.4780914
[1] 0.7302967
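
The same four values can also be derived in a single vectorized step, which reproduces the residual matrix exactly:

# Vectorized Equivalent of the Four Calculations Above #

(Model - chi.result$expected) / sqrt(chi.result$expected)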

The Pearson Residual Values (0.39036, etc.) are the raw residuals scaled by the standard deviation expected under the null hypothesis, and therefore behave approximately as standard normal deviates. It is for this reason that any value greater than +2, or less than -2, would indicate a misapplication of the model. Or, at the very least, indicate that more observational values ought to be collected prior to the model being applied again.
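
It is also worth noting that the object returned by chisq.test() contains standardized residuals, which additionally adjust for the marginal totals, and which follow the standard normal distribution more closely:

# Standardized Residuals #

print(chi.result$stdres)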

The Fisher’s Exact Test as a Post Hoc Analysis for The Pearson's Chi-Square Test

Let’s take our example one step further by applying The Fisher’s Exact Test as a method of post hoc analysis.

Why would we do this?

Assuming that our Chi-Square Test findings were significant, we may want to consider a Fisher’s Exact Test as a method of further assessing the evidence of significance.

A Fisher’s Exact Test computes an exact p-value, rather than relying upon the chi-square distribution’s large-sample approximation. As a consequence, it is the more conservative of the two, and will generally yield a higher p-value than its chi-square counterpart. It is particularly appropriate here, where the expected cell counts are small.


fisher.result <- fisher.test(Model)

print(fisher.result$p.value)


[1] 0.5

<Yikes!>

Conclusions

Now that we have considered our analysis every which way, we can state our findings in APA Format.

This would resemble the following:

A chi-square test of independence was performed to examine the relation between smoking and obesity. The relation between these variables was not found to be significant, χ2 (1, N = 10) = 1.27, p > .05.


In investigating the Pearson Residuals produced from the model application, no value was found to be greater than +2, or less than -2. These findings indicate that no individual cell deviated substantially from its expected count, and that the model was appropriate given the circumstances of the experimental data.

In order to further confirm our experimental findings, a Fisher’s Exact Test was also performed as a method of post hoc analysis. The results indicated a non-significant difference in obesity rates between smokers: 71% (5/7), and non-smokers: 33% (1/3); (p > .05).


-----------------------------------------------------------------------------------------------------------------------------

I hope that you found all of this helpful and entertaining.

Until next time,

-RD

Monday, November 9, 2020

(R) Cohen’s d

In today’s entry, we are going to discuss Cohen’s d, what it is, and when to utilize it. We will also discuss how to appropriately apply the methodology needed to derive this value, through the utilization of the R software package.


(SPSS does not contain the innate functionality necessary to perform this calculation)

Cohen’s d - (What it is):

Cohen’s d is utilized as a method to assess the magnitude of impact as it relates to two sample groups which are subject to differing conditions. For example, if a two sample t-test was being implemented to test a single group which received a drug, against another group which did not receive the drug, then the p-value of this test would determine whether or not the findings were significant.

Cohen’s d, by contrast, would measure the magnitude of that potential impact.

Cohen’s d - (When to use it):

In your statistics class.

You could also utilize this measurement to perform post hoc analysis as it relates to the ANOVA model and the Student’s t-test. However, I have never witnessed its utilization outside of an academic setting.

Cohen’s d – (How to interpret it):

General Interpretation Guidelines:

Greater than or equal to 0.2 = small
Greater than or equal to 0.5 = medium
Greater than or equal to 0.8 = large

Cohen’s d – (How to state your findings):

The effect size for this analysis (d = x.xx) was found to exceed Cohen’s convention for a [small, medium, large] effect (d = .xx).

Cohen’s d – (How to derive it):

# Within the R-Programming Code Space #

##################################

# length of sample 1 (x) #
lenx <-
# length of sample 2 (y) #
leny <-
# mean of sample 1 (x) #
meanx <-
# mean of sample 2 (y)#
meany <-
# SD of sample 1 (x) #
sdx <-
# SD of sample 2 (y) #
sdy <-

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd

##################################


# The above code is a modified version of the code found at: #

# https://stackoverflow.com/questions/15436702/estimate-cohens-d-for-effect-size #
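
For clarity, the calculation which the template above performs can also be expressed as a formula. The template computes Cohen’s d with a pooled standard deviation:

Cohen’s d = |Mean1 – Mean2| / Pooled SD

Pooled SD = √( ((n1 – 1) * SD1^2 + (n2 – 1) * SD2^2) / (n1 + n2 – 2) )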


Cohen’s d – (Example):

FIRST WE MUST RUN A TEST IN WHICH COHEN’S d CAN BE APPLIED AS AN APPROPRIATE POST-HOC TEST METHODOLOGY.
 
Two Sample T-Test


This test is utilized if you randomly sample different sets of items from two separate groups.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures the temperature of water samples to which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, at a 95% confidence level, that his chemical is in some way altering the temperature of the water?

For this, we will use the code:

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)


Which produces the output:

Two Sample t-test

data: N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3007929 4.4492071
sample estimates:
mean of x mean of y
75.250 72.875


# Note: In this case, the 95 percent confidence interval is measuring the difference of the mean values of the samples. #

# An additional option is available when running a two sample t-test: The Welch Two Sample T-Test. To utilize this option, "var.equal = TRUE" must be changed to "var.equal = FALSE". The Welch variant does not assume equal variances between the two samples, and is therefore the more robust option when that assumption is in doubt. A sketch of the modified call follows below. #
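
# Welch Two Sample T-Test Variant #

t.test(N2, N1, alternative = "two.sided", var.equal = FALSE, conf.level = 0.95)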

From the Two Sample t-test output above, we can conclude:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.

Application of Cohen’s d 

length(N1) # 8 #
length(N2) # 8 #

mean(N1) # 72.875 #
mean(N2) # 75.25 #

sd(N1) # 2.167124 #
sd(N2) # 1.669046 #

# length of sample 1 (x) #
lenx <- 8
# length of sample 2 (y) #
leny <- 8
# mean of sample 1 (x) #
meanx <- 72.875
# mean of sample 2 (y)#
meany <- 75.25
# SD of sample 1 (x) #
sdx <- 2.167124
# SD of sample 2 (y) #
sdy <- 1.669046

varx <- sdx^2
vary <- sdy^2
lx <- lenx - 1
ly <- leny - 1
md <- abs(meanx - meany) ## mean difference (numerator)
csd <- lx * varx + ly * vary
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d

cd


Which produces the output:

[1] 1.227908

################################## 

From this output we can conclude:

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).
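
As an optional cross-check, the "effsize" package, available from the CRAN repository, provides a cohen.d() function. A minimal sketch, assuming that the package has been installed:

# Optional Cross-Check via the "effsize" Package #

library(effsize)

cohen.d(N2, N1)

# The reported estimate should approximately match our manually derived value of d = 1.23 #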

Combining both conclusions, our final written product would resemble:

With a p-value of 0.02773 (0.02773 < .05), and a corresponding t-value of 2.4558, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.

The effect size for this analysis (d = 1.23) was found to exceed Cohen’s convention for a large effect (d = .80).

And that is it for this article.

Until next time,

-RD