Reflections of a Data Scientist: 2019

Sunday, August 4, 2019

Model and Method Utilization

There are many model types, methods and techniques demonstrated on this website. In this entry, I will categorize each of them, and briefly describe the scenario which would warrant its utilization.

(Tests of Normality)

Q-Q Plot – A graph which is utilized to assess data for normality.

P-P Plot – A graph which is utilized to assess data for normality.

Shapiro-Wilk Normality Test – A test which is utilized to test data for normality.

(Tests Related to Parametric Model Variable Correlation)

Variance Inflation Factor – A method which tests model variables for multicollinearity.

(Pearson) Coefficient of Correlation – A method which tests variables for correlation.

Partial Correlation - A method which is utilized to measure the correlation between two variables, while also controlling for a third variable.

Distance Correlation – A method which tests model variables for correlation through the utilization of a Euclidean distance formula.

Canonical Correlation – A method which assesses model variables for correlation through the combination of model variables into independent groups.

(Tests Related to Non-Parametric Model Variable Correlation)

Spearman’s Rank Correlation - A non-parametric alternative to the Pearson correlation. This method is utilized in circumstances when either data samples are non-linear, or the data type contained within those samples are ordinal. An example of ordinal data – “survey response data which asked the respondent to rank a particular item on a scale of 1-10”.

Kendall Rank Correlation Coefficient - Like Spearman’s rho, Kendall’s Tau is also utilized in circumstances when either data samples are non-linear, or the data type contained within the samples is ordinal.

(Tests of Significance Amongst Groups)

One Sample T-Test - This test is utilized to compare a sample mean to a specific value. It is used when the dependent variable is measured at the interval or ratio level.

Two Sample T-Test - This test functions in the same manner as the above test. However, in the case of this model, data is randomly sampled from different sets of items from two separate control groups.

The Welch Two Sample T-Test - This test functions in the same manner as the above test. The only difference being, this method is utilized if the two groups cannot be assumed to possess equal variances (a circumstance which often accompanies groups of uneven size).

Paired T-Test – Similar in composition to the Two Sample T-Test, this test is utilized if you are sampling the same set twice, once for each variable.

(Analysis of Variance “ANOVA”)

Analysis of Variance – Also known as ANOVA, this method is utilized to test for significant differences among the means of multiple sample groups by comparing the variance between groups to the variance within groups. In many ways, this test is similar to a t-test; however, ANOVA allows for the comparison of more than two groups.

One Way Analysis of Variance (ANOVA) – An ANOVA model containing a single independent variable.

Two Way Analysis of Variance (ANOVA) - An ANOVA model containing two independent variables.

Repeated-Measures Analysis of Variance (ANOVA) – An ANOVA model containing a single independent variable measured multiple times.

(Exotic Analysis of Variance “ANOVA” Variants)

Analysis of Covariance (ANCOVA) – An ANOVA model which also factors for a covariate value which may impact the system as a whole.

https://statistics.laerd.com/spss-tutorials/ancova-using-spss-statistics.php

Random Effects Analysis of Variance – An ANOVA model which is synthesized from sampling from a greater population in order to determine inference.

https://stat.ethz.ch/education/semesters/as2015/anova/06_Random_Effects.pdf

Multivariate Analysis of Variance (MANOVA) – An ANOVA model containing multiple dependent variables.

https://statistics.laerd.com/spss-tutorials/one-way-manova-using-spss-statistics.php

Multivariate Analysis of Covariance (MANCOVA) – An ANOVA model containing multiple dependent variables, which also factors for a covariate value which may impact the system as a whole.

https://statistics.laerd.com/spss-tutorials/one-way-mancova-using-spss-statistics.php

(Test of Significance for Nonparametric Data)

Friedman Test (Repeated-Measures Analysis of Variance) – The nonparametric alternative to a one way Repeated-Measures ANOVA test.

Wilcoxon Signed Rank Test (One Sample T-Test, Paired T-Test) – The nonparametric alternative to the One Sample T-Test, and the Paired T-Test.

Mann-Whitney U Test (Two Sample T-Test) – A nonparametric alternative to the Two Sample T-Test.

(Tests of Significance Amongst Groups)

Chi-Square – A test which measures categorical significance as it pertains to a binary outcome variable.

McNemar's Test – A test which measures categorical significance, limited to two initial categories, and two categorical outcomes. This test is typically utilized for drug trials.

(Metric to Assess Rate of Agreement Amongst Two Entities)

Cohen’s Kappa – A test which measures the rate of agreement amongst two entities.

(Tests of Significance Amongst Groups Comprised of Survey Questions)

Cronbach’s Alpha - Cronbach’s Alpha is primarily utilized to measure the inter-relatedness of response data collected from sociological surveys. Specifically, the potential differentiation of response information related to certain interrelated categorical survey questions.

(Tests Pertaining to Stationarity and Random Walks)

Dickey-Fuller Test – A methodology of analysis utilized to test data for stationarity.

Phillips-Perron Unit Root Test – A methodology utilized to test data for random walk potential.

(Comparison of Outcome Variables)

Two Step Cluster – A method which assesses model outcome variables through the utilization of a clustering technique.

K-Means - A method which assesses model outcome variables through the utilization of a clustering technique.

Hierarchical Cluster - A method which assesses model outcome variables through the utilization of a hierarchical technique.

K-Nearest Neighbor – A method which compares similarity of outcome variables as determined by the values of the model’s independent variables.

(Reduction of Independent Variables through Variable Synthesis)

Dimension Reduction – A method which creates new variables with values that are determined by the original values of the independent model variables.

(Impact Assessment)

TURF Analysis – A method of analysis typically utilized for product and design studies. This technique assesses the most effective way to reach a sample target demographic.

(Survival Analysis)

Survival Analysis - A statistical methodology which measures the probability of an event occurring within a group over a period of time.

(Sample Distribution Tests)

The Wald Wolfowitz Test - A method for analyzing a single data set in order to determine whether the elements within the data set were sampled independently.

The Wald Wolfowitz Test (2-Sample) - A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

The Kolmogorov-Smirnov Test - A method for analyzing a single data set in order to determine whether the data was sampled from a normally distributed population.

The Kolmogorov-Smirnov Test (2-Sample) - A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

(Outcome Models – Conditions for Utilization)

Linear Regression – Continuous outcome variable. Continuous independent variable(s).

General Linear Mixed Models – Continuous outcome variable. Any type of independent variable(s).

Logistic Regression Analysis – Binary outcome variable. Categorical or continuous independent variable(s).

Discriminant Analysis – Binary outcome variable. Categorical or continuous independent variable(s).

Loglinear Analysis - Binary outcome variable. Categorical independent variable(s).

Partial Least Squares Regression – Any type of outcome variable. Any type of independent variable(s).

Polynomial Regression – Continuous outcome variable. Continuous independent variable(s).

Multinomial Logistical Analysis – Categorical outcome variable. Categorical input variable(s).

Logistical Ordinal Regression – Categorical outcome variable. Categorical input variable(s).

Probit Regression – Binary outcome variable. Categorical or continuous input variable(s).

2-Stage Least Squares Regression - Continuous outcome variable. Continuous independent variable(s).

APA Format

In today’s article, we will discuss the standard methodology which is utilized to report statistical findings. In previous examples featured on this website, model outputs were explained in a more simplistic manner in order to decrease the level of complexity related to such. However, if the purpose of the overall research endeavor is to produce results for publication, then the APA format should be applied to whatever experimental findings are generated from the application of methodologies.

“APA” is an abbreviation for The American Psychological Association. Regardless of the type of research that is being conducted, the formatting standards maintained by the APA, as they apply to statistical research, should always be utilized when presenting data in a professional manner.

Details

All figures which contain decimal values should be rounded to the nearest hundredth (e.g., .105 is reported as .11). P-values are the exception to this rule: in most cases they should be reported to two decimal places, with additional decimal places utilized only when greater specificity is required to illustrate the details of the findings.

Another rule to keep in mind pertains to leading zeroes. A leading zero prior to the decimal point is only required if the represented figure has the potential to exceed 1. If the value cannot exceed 1 (as is the case for p-values and correlation coefficients), then the leading zero is unnecessary.
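The two rules above can be sketched as a small helper. This is only an illustration written in Python with the standard library; the function name apa_number and its parameters are hypothetical and are not part of any APA tooling. Half-up rounding is assumed, since that is what the ".105 becomes .11" example implies.

```python
from decimal import Decimal, ROUND_HALF_UP


def apa_number(value, can_exceed_one=True, places=2):
    """Round half-up to `places` decimals and drop the leading zero
    when the statistic cannot exceed 1 (hypothetical helper)."""
    quantum = Decimal(1).scaleb(-places)  # places=2 -> Decimal("0.01")
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    text = f"{rounded:f}"
    # "0.11" -> ".11" and "-0.11" -> "-.11"
    if not can_exceed_one and text.lstrip("-").startswith("0."):
        text = text.replace("0.", ".", 1)
    return text


print(apa_number(0.105, can_exceed_one=False))  # .11
print(apa_number(0.209))                        # 0.21
```

The value is passed through str() before rounding so that .105 is treated as the decimal figure that was typed, rather than as its binary floating point approximation.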

Below are examples which demonstrate the most common application of the APA format.

Chi-Square

Template:

A chi-square test of independence was performed to examine the relation between CATEGORY and OUTCOME. The relation between these variables was found to be significant at the p < .05 level, χ2 (DEGREES OF FREEDOM, N = SAMPLE SIZE) = X-Squared Value, p = p - value.

- OR -

A chi-square test of independence was performed to examine the relation between CATEGORY and OUTCOME. The relation between these variables was not found to be significant at the p < .05 level, χ2 (DEGREES OF FREEDOM, N = SAMPLE SIZE) = X-Squared Value, p = p - value.

Example:

While working as a statistician at a local university, you are tasked to evaluate, based on survey data, the level of job satisfaction that each member of the staff currently has for their occupational role (Assume a 95% Confidence Interval).

The data that you gather from the surveys is as follows:

General Faculty
130 Satisfied 20 Unsatisfied

Professors
30 Satisfied 20 Unsatisfied

Adjunct Professors
80 Satisfied 20 Unsatisfied

Custodians
20 Satisfied 10 Unsatisfied

# Code #

Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10), nrow = 4, ncol=2)

N <- sum(130, 30, 80, 20, 20, 20, 20, 10)

chisq.test(Model)

N

# Console Output #

Pearson's Chi-squared test

data: Model
X-squared = 18.857, df = 3, p-value = 0.0002926

> N
[1] 330

APA Format:

A chi-square test of independence was performed to examine the relation between occupational role and job satisfaction. The relation between these variables was found to be significant at the p < .05 level, χ2 (3, N = 330) = 18.86, p < .001.
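If you would like to verify the reported statistic independently of R, the Pearson chi-square value can be reproduced by hand. The following is a minimal sketch in Python (standard library only) which assumes the same survey counts as the example; the variable names are my own. It reproduces the statistic and the degrees of freedom, not the p-value.

```python
# Observed counts: (satisfied, unsatisfied) per occupational role.
observed = [
    (130, 20),  # General Faculty
    (30, 20),   # Professors
    (80, 20),   # Adjunct Professors
    (20, 10),   # Custodians
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected,
# where expected = row total * column total / N.
chi_square = sum(
    (obs - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for obs, c in zip(row, col_totals)
)
df = (len(observed) - 1) * (len(col_totals) - 1)

print(f"chi-square({df}, N = {n}) = {chi_square:.2f}")
```

The printed values agree with the X-squared, df and N figures shown in the console output above.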

Tukey HSD

Template:

Post hoc comparisons using the Tukey HSD test indicated that the mean score for the CONDITION A (M = Mean1, SD = Standard Deviation1) was significantly different than CONDITION B (M = Mean2, SD = Standard Deviation2), p = p-value.

Analysis of Variance (ANOVA)

(One Way)

Template:

There was a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

- OR -

There was not a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

Example:

A chef wants to test if patrons prefer a soup which he prepares based on salt content. He prepares a limited experiment in which he creates three types of soup: soup with a low amount of salt, soup with a medium amount of salt, and soup with a high amount of salt. He then serves this soup to his customers and asks them to rate their satisfaction on a scale from 1-8.

Low Salt Soup is rated: 4, 1, 8
Medium Salt Soup is rated: 4, 5, 3, 5
High Salt Soup is rated: 3, 2, 5

(Assume a 95% Confidence Interval)

# Code #

satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)

salt <- c(rep("low",3), rep("med",4), rep("high",3))

salttest <- data.frame(satisfaction, salt)

results <- aov(satisfaction~salt, data=salttest)

summary(results)

# Console Output #

Df Sum Sq Mean Sq F value Pr(>F)
salt 2 1.92 0.958 0.209 0.816
Residuals 7 32.08 4.583

APA Format:

There was not a significant effect of the level of salt content on patron satisfaction at the p < .05 level for the three conditions (F(2, 7) = 0.21, p = .82).
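The F value and both degrees of freedom figures can also be derived by hand from the between-group and within-group sums of squares. Below is a minimal Python sketch (standard library only) which assumes the same soup ratings as the example; the variable names are illustrative.

```python
from statistics import mean

groups = {
    "low": [4, 1, 8],
    "med": [4, 5, 3, 5],
    "high": [3, 2, 5],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_scores)

# Between-group sum of squares: group size * squared distance of the
# group mean from the grand mean.
ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
# Within-group sum of squares: squared distance of each score from its group mean.
ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f_value = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

The two sums of squares correspond to the "Sum Sq" column of the console output (1.92 and 32.08), and the two degrees of freedom are the figures which appear inside F( , ) in the APA statement.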

(Two Way)

Template:

Hypothesis 1:

There was a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

- OR -

There was not a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

Hypothesis 2:

There was a significant effect of the CATEGORY2 on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(2), Degrees of Freedom(4)) = F Value, p = p - value).

- OR -

There was not a significant effect of the CATEGORY2 on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(2), Degrees of Freedom(4)) = F Value, p = p - value).

Hypothesis 3:

There was a statistically significant interaction effect of the CATEGORY1 on the CATEGORY2 at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(3), Degrees of Freedom(4)) = F Value, p = p - value).

- OR -

There was not a statistically significant interaction effect of the CATEGORY1 on the CATEGORY2 at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(3), Degrees of Freedom(4)) = F Value, p = p - value).

Example:

Researchers want to test study habits within two schools as they pertain to student life satisfaction. The researchers also believe that the school that each group of students is attending may also have an impact. Students from each school are assigned study material which, in sum, totals 1 hour, 2 hours, or 3 hours on a daily basis. The satisfaction of each student group is then measured on a scale from 1-10 after a duration of one month.

(Assume a 95% Confidence Interval)

School A:

1 Hour of Study Time: 7, 2, 10, 2, 2
2 Hours of Study Time: 9, 10, 3, 10, 8
3 Hours of Study Time: 3, 6, 4, 7, 1

School B:

1 Hour of Study Time: 8, 5, 1, 3, 10
2 Hours of Study Time: 7, 5, 6, 4, 10
3 Hours of Study Time: 5, 5, 2, 2, 2

satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))

school <- c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))

schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(lm(satisfaction ~ studytime * school, data=schooltest))

summary(results)

Which produces the output:

Df Sum Sq Mean Sq F value Pr(>F)
studytime 2 62.6 31.300 3.809 0.0366 *
school 1 2.7 2.700 0.329 0.5718
studytime:school 2 7.8 3.900 0.475 0.6278
Residuals 24 197.2 8.217
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

APA Format:

There was a significant effect of study time on student satisfaction at the p < .05 level for the three conditions (F(2, 24) = 3.81, p = .04).

There was not a significant effect of the school attended on student satisfaction at the p < .05 level for the two conditions (F(1, 24) = 0.33, p = .57).

There was not a statistically significant interaction effect of the school variable on the study time variable at the p < .05 level (F(2, 24) = 0.48, p = .63).

TukeyHSD(results)

> TukeyHSD(results)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = lm(satisfaction ~ studytime * school, data = schooltest))

$studytime
diff lwr upr p adj
Three Hours-One Hour -1.3 -4.5013364 1.901336 0.5753377
Two Hours-One Hour 2.2 -1.0013364 5.401336 0.2198626
Two Hours-Three Hours 3.5 0.2986636 6.701336 0.0302463

$school
diff lwr upr p adj
SchoolB-SchoolA -0.6 -2.760257 1.560257 0.571817

$`studytime:school`

diff lwr upr p adj
Three Hours:SchoolA-One Hour:SchoolA -0.4 -6.005413 5.2054132 0.9999178
Two Hours:SchoolA-One Hour:SchoolA 3.4 -2.205413 9.0054132 0.4401459
One Hour:SchoolB-One Hour:SchoolA 0.8 -4.805413 6.4054132 0.9976117
Three Hours:SchoolB-One Hour:SchoolA -1.4 -7.005413 4.2054132 0.9696463
Two Hours:SchoolB-One Hour:SchoolA 1.8 -3.805413 7.4054132 0.9157375
Two Hours:SchoolA-Three Hours:SchoolA 3.8 -1.805413 9.4054132 0.3223867
One Hour:SchoolB-Three Hours:SchoolA 1.2 -4.405413 6.8054132 0.9844928
Three Hours:SchoolB-Three Hours:SchoolA -1.0 -6.605413 4.6054132 0.9932117
Two Hours:SchoolB-Three Hours:SchoolA 2.2 -3.405413 7.8054132 0.8260605
One Hour:SchoolB-Two Hours:SchoolA -2.6 -8.205413 3.0054132 0.7067715
Three Hours:SchoolB-Two Hours:SchoolA -4.8 -10.405413 0.8054132 0.1240592
Two Hours:SchoolB-Two Hours:SchoolA -1.6 -7.205413 4.0054132 0.9470847
Three Hours:SchoolB-One Hour:SchoolB -2.2 -7.805413 3.4054132 0.8260605
Two Hours:SchoolB-One Hour:SchoolB 1.0 -4.605413 6.6054132 0.9932117
Two Hours:SchoolB-Three Hours:SchoolB 3.2 -2.405413 8.8054132 0.5052080

twohours <- c(9, 10, 3, 10, 8, 7, 5, 6, 4, 10)
threehours <- c(3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

mean(twohours)
sd(twohours)

mean(threehours)
sd(threehours)

> mean(twohours)
[1] 7.2
> sd(twohours)
[1] 2.616189
>
> mean(threehours)
[1] 3.7
> sd(threehours)
[1] 2.002776

APA Format:

Post hoc comparisons using the Tukey HSD test indicated that at the p < .05 level, the mean satisfaction score of students who studied for Two Hours (M = 7.20, SD = 2.62) was significantly different as compared to the scores of the students who studied for Three Hours (M = 3.70, SD = 2.00), p = .03.

(Repeated Measures)

Template:

Example:

Researchers want to test the impact of reading existential philosophy on a group of 8 individuals. They measure the happiness of the participants three times, once prior to reading, once after reading the materials for one week, and once after reading the materials for two weeks. We will assume an alpha of .05.

Before Reading = 1, 8, 2, 4, 4, 10, 2, 9
After Reading = 4, 2, 5, 4, 3, 4, 2, 1
After Reading (wk. 2) = 5, 10, 1, 1, 4, 6, 1, 8

library(lme4) # You will need to install and enable this package #
library(nlme) # You will also need to install and enable this package #

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8 )

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- c(1,2,3,4,5,6,7, 8)

survey <- data.frame(id, happiness, week)

model <- lme(happiness ~ week, random=~1|id, data=survey)

anova(model)

This method saves some time by producing the output:

numDF denDF F-value p-value
(Intercept) 1 14 37.21053 <.0001
week 2 14 1.04624 0.3772

APA Format:

There was not a significant effect of reading existential philosophy on participant happiness at the p < .05 level for the three conditions (F(2, 14) = 1.05, p = .38).

Student’s T-Test

(One Sample T-Test)

Template:

(Right Tailed)

There was a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the historically assumed mean (M = Historic Mean Value); t(Degrees of Freedom) = t-value, p = p-value.

- OR -

There was not a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the historically assumed mean (M = Historic Mean Value); t(Degrees of Freedom) = t-value, p = p-value.

Example:

A factory employee believes that the cakes produced within his factory are being manufactured with excess amounts of corn syrup, thus altering the taste. 10 cakes were sampled from the most recent batch and tested for corn syrup composition. Typically, each cake should consist of 20% corn syrup. Utilizing a 95% confidence interval, can we assume that the new batch of cakes contains more than a 20% proportion of corn syrup?

The levels of the samples were:

.27, .31, .27, .34, .40, .29, .37, .14, .30, .20

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

t.test(N, alternative = "greater", mu = .2, conf.level = 0.95)

# "alternative =" specifies the type of test that R will perform. "greater" indicates a right tailed test. "less" indicates a left tailed test. "two.sided" indicates a two tailed test. #

One Sample t-test

data: N
t = 3.6713, df = 9, p-value = 0.002572
alternative hypothesis: true mean is greater than 0.2
95 percent confidence interval:
0.244562 Inf
sample estimates:
mean of x
0.289

mean(N)
sd(N)

> mean(N)
[1] 0.289
>
> sd(N)
[1] 0.07665942

APA Format:

A one sample t-test was conducted to compare the level of corn syrup in the current sample batch of cakes, to the assumed historical level of corn syrup contained within previously manufactured cakes.

There was a significant increase in the amount of corn syrup in the recent batch of cakes (M = .29, SD = .08), as compared to the historically assumed mean (M = .20); t(9) = 3.67, p = .003.
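The t-value and degrees of freedom reported above follow directly from the sample mean, the sample standard deviation and the sample size. As a rough cross-check, here is a minimal Python sketch (standard library only) which assumes the same ten corn syrup measurements; the variable names are my own.

```python
from math import sqrt
from statistics import mean, stdev

syrup = [.27, .31, .27, .34, .40, .29, .37, .14, .30, .20]
mu = 0.20  # historically assumed proportion of corn syrup

n = len(syrup)
# One sample t statistic: (sample mean - hypothesized mean) / standard error.
t_value = (mean(syrup) - mu) / (stdev(syrup) / sqrt(n))
df = n - 1

print(f"M = {mean(syrup):.2f}, SD = {stdev(syrup):.2f}, t({df}) = {t_value:.2f}")
```

Note that stdev() is the sample standard deviation (n - 1 in the denominator), which is the same quantity that R's sd() returns.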

(Two Sample T-Test)

Template:

(Two Tailed)

There was a significant difference in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

-OR-

There was not a significant difference in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures temperature in samples to which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)

Two Sample t-test

data: N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3007929 4.4492071
sample estimates:
mean of x mean of y
75.250 72.875

mean(N1)

sd(N1)

mean(N2)

sd(N2)

> mean(N1)
[1] 72.875
>
> sd(N1)
[1] 2.167124
>
> mean(N2)
[1] 75.25
>
> sd(N2)
[1] 1.669046

APA Format:

A two sample t-test was conducted to compare the temperature of water to which the chemical was applied, to the temperature of water to which the chemical was not applied.

There was a significant difference in the temperature of water to which the chemical was applied (M = 72.88, SD = 2.17), as compared to the temperature of water to which the chemical was not applied (M = 75.25, SD = 1.67); t(14) = 2.46, p = .03.
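Because the R call specifies var.equal = TRUE, the test pools the variances of the two groups. The pooled calculation can be sketched by hand as follows; this is a minimal Python illustration (standard library only) with variable names of my own choosing, where "treated" holds the measurements taken with the chemical applied (N1) and "untreated" holds those taken without it (N2).

```python
from math import sqrt
from statistics import mean, variance

treated = [70, 74, 76, 72, 75, 74, 71, 71]    # chemical applied (N1)
untreated = [74, 75, 73, 76, 74, 77, 78, 75]  # chemical not applied (N2)

n1, n2 = len(treated), len(untreated)
df = n1 + n2 - 2

# Pooled variance: the two sample variances weighted by their degrees of freedom.
pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(untreated)) / df

# Difference taken as untreated - treated, matching t.test(N2, N1, ...).
t_value = (mean(untreated) - mean(treated)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"t({df}) = {t_value:.2f}")
```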

(Paired T-Test)

Template:

(Right Tailed)

There was a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

- OR -

There was not a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

Example:

A watch manufacturer believes that by changing to a new battery supplier, the watches which are shipped with an initial battery included will maintain a longer lifespan. To test this theory, twelve watches are tested for duration of lifespan with the original battery.

The same twelve watches are then re-tested for duration with the new battery.

Can the watch manufacturer conclude, that the new battery increases the duration of lifespan for the manufactured watches? (We will assume an alpha value of .05).

For this, we will utilize the code:

N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)
N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

t.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )

Paired t-test

data: N2 and N1
t = 2.4581, df = 11, p-value = 0.01589
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
12.32551 Inf
sample estimates:
mean of the differences
45.75

mean(N1)
sd(N1)

mean(N2)
sd(N2)

> mean(N1)
[1] 325.3333
>
> sd(N1)
[1] 56.84642
>
> mean(N2)
[1] 371.0833
>
> sd(N2)
[1] 51.22758

APA Format:

A paired t-test was conducted to compare the lifespan duration of watches which contained the new battery, to the lifespan of watches which contained the initial battery.

There was a significant increase in the lifespan duration of watches which contained the new battery (M = 371.08, SD = 51.23), as compared to the lifespan of watches which contained the initial battery (M = 325.33, SD = 56.85); t(11) = 2.46, p = .02.
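A paired t-test is, in effect, a one sample t-test performed on the per-watch differences. The following minimal Python sketch (standard library only) assumes the same twelve pairs of measurements as the example, with N1 being the original battery and N2 the new battery; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

original = [376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309]  # N1
new = [337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326]       # N2

# Difference for each watch: new battery lifespan - original battery lifespan.
diffs = [b - a for a, b in zip(original, new)]

n = len(diffs)
t_value = mean(diffs) / (stdev(diffs) / sqrt(n))
df = n - 1

print(f"mean difference = {mean(diffs):.2f}, t({df}) = {t_value:.2f}")
```

The mean of the differences matches the "mean of the differences" figure in the console output, and the degrees of freedom equal the number of pairs minus one.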

Regression Models

Example:

(Standard Regression Model)

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

multiregress <- lm(y ~ x + z)

summary(multiregress)

Call:
lm(formula = y ~ x + z)

Residuals:
Min 1Q Median 3Q Max
-6.4016 -5.0054 -1.7536 0.8713 14.0886

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.1434 12.0381 3.916 0.00578 **
x 0.7808 0.3316 2.355 0.05073 .
z 0.3990 0.4804 0.831 0.43363
---
Residual standard error: 7.896 on 7 degrees of freedom
Multiple R-squared: 0.5249, Adjusted R-squared: 0.3891
F-statistic: 3.866 on 2 and 7 DF, p-value: 0.07394

APA Format:

A linear regression model was utilized to test if variables “x” and “z” significantly predicted outcomes within the observations of “y” included within the sample data set. The results indicated that neither “x” (B = 0.78, p = .051) nor “z” (B = 0.40, p = .434) was a significant predictor at the p < .05 level, and that the overall model does not possess a worthwhile predictive capacity (R2 = .52, F(2, 7) = 3.87, p = .074).

(Non-Standard Regression Model)

Example:

# Model Creation #

Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)

Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0)

Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1)

Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0)

# Summary Creation and Output #

CancerModelLog <- glm(Cancer~ Age + Obese + Smoking, family=binomial)

summary(CancerModelLog)

# Output #

Call:

glm(formula = Cancer ~ Age + Obese + Smoking, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.6096 -0.7471 0.5980 0.8260 1.8485

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.34431 2.25748 -1.038 0.2991
Age 0.02984 0.04055 0.736 0.4617
Obese -0.38924 1.39132 -0.280 0.7797
Smoking 2.54387 1.53564 1.657 0.0976 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 20.728 on 14 degrees of freedom
Residual deviance: 16.807 on 11 degrees of freedom
AIC: 24.807
Number of Fisher Scoring iterations: 4

# Generate Nagelkerke R Squared #

# Download and Enable Package: "BaylorEdPsych" #

PseudoR2(CancerModelLog)

# Console Output #

McFadden Adj.McFadden Cox.Snell Nagelkerke McKelvey.Zavoina Effron Count
0.2328838 -0.2495624 0.2751639 0.3674311 0.3477522 0.3042371 0.8000000
Adj.Count AIC Corrected.AIC
0.5714286 23.9005542 27.9005542

APA Format:

A logistic regression model was utilized to test if a model containing the variables “Age”, “Smoking Status”, and “Obesity”, could predict Cancer outcomes as it pertains to the individuals included within the sample data set. The results indicated that the model does not possess a worthwhile predictive capacity (Nagelkerke R-Square = .37).

Saturday, July 20, 2019

How to Make Beautiful Visuals (MS-Excel)

I am aware that this subject matter may be considered to be very basic. However, as a data scientist, it is not uncommon for the end result of a research endeavor to require, in some way or another, the creation of a presentation of findings.

This, of course, will inevitably lead to the utilization of PowerPoint, which will, almost as a prerequisite, require the utilization of Excel.

Therefore, in today’s article, we will review instructions related to the creation of visual outputs as enabled by MS-Excel.

To illustrate this concept, I have created an example worksheet.

This worksheet can be found within this website’s GitHub Repository.

Basic Column Chart

For our scenario, we’ll assume that your goal is to create an attractive column chart as it relates to the above data. Highlighting the data, selecting the “Insert” ribbon option, and subsequently selecting the top leftmost menu selection button presents us with a rather uninspiring graphical depiction of the underlying data.

Let’s make this graphic look a bit better visually.

First, we’ll make the columns more attractive by changing their texture.

This can be achieved by clicking on the column portion of the graphic.

Next, click on the “Format” option within the ribbon menu.

From the many sub-menu selections, click “Shape Effects”, followed by “Bevel”, subsequently followed by “Circle”.

The result should resemble the following:

Next, I would advise adding data labels. To achieve this, right click on any of the columns within the chart.

From the drop down menu, select “Add Data Labels”, followed by “Add Data Labels”.

The result is a much more informative graphic.

However, for the sake of our example, we’ll assume that the axis needs to be modified so that the scale depicted measures from 0.00 – 4.00.

Select the graph’s axis by first left clicking the axis portion of the graphic.

Next, to modify the axis, right click on the selected axis. From the menu which appears, select “Format Axis”.

From the grey menu which appears on the right side of the screen, enter the axis values which you feel are most appropriate for the graphic.

Finally, to make our graph extra eye-catching, we will copy it from the Excel workbook where it is currently located, and paste it into our PowerPoint template.

However, when pasting, we will be sure to select, from the options available upon right clicking the slide, “Use Destination Theme & Embed Workbook (H)”.

In the case of our example, the final product resembles the following:

Basic 2-D Line Chart

To create a 2-D line chart from the same data, we will again highlight the data, click on the "Insert" ribbon, and select the left topmost option.

This will present a rather uninspiring graphical depiction of the underlying data.

Let’s add some points to our graph to increase its descriptive capacity. This can be achieved by clicking on the line itself, then right clicking to display the following menu. From this menu select “Format Data Series”.

With the “Marker” option selected, you are granted the ability to select the type of point, and the size of the point, which you would prefer to be implemented.

The end result should resemble:

I have already adjusted the axis. However, if you would prefer data labels and a templated format, please follow the instructions within the previous example.

That’s all for now. Stay studious, Data Heads!

Saturday, June 8, 2019

(Python) Joining Distinct Variable Cell Entries with Pandas

Hey, Data Heads! I’m back from an extended hiatus with a quick article to demonstrate a very useful Pandas function.

To understand this entry, you must first have some prior experience with both the Python programming language, and the Pandas Python library. If you are unfamiliar with either of the aforementioned topics, information and demonstrations related to such can be found within previous articles featured on this website.

As you may recall from a much earlier article which discussed the SAS programming platform, a limitation exists within the SAS language which inhibits collapsing the repeated entries of a variable into a single distinct entry, with all of the associated entries from another column variable combined into a single adjacent cell. In prior articles on this topic, I designed a series of macros to accomplish what I have just described. In the case of Python, specifically through the utilization of the Pandas library, this task can be achieved with a single line of code.

Example:

We will begin by enabling the Pandas library. After which, we will import the familiar data set: "SetA", into the allocated memory.

Just as a reminder, if you aren’t in the mood to input the .CSV cell entries yourself, this file, and all others, can be found within this website’s associated GitHub repository.

# Enable Pandas Package #

import pandas

# Specify the appropriate file path for import #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Desktop\\SetA.csv"

# Create a variable to store the data #

pandadataframe = pandas.read_csv(filepath)

# Modify the column variable to the appropriate variable format and type #

pandadataframe['VARA'] = pandadataframe['VARA'].astype('str')

pandadataframe['DATAVAL'] = pandadataframe['DATAVAL'].astype('str')

The function below, which serves as the method for generating the desired result, relies on Python’s string join method, which only accepts strings. The column being joined (DATAVAL) must therefore be of the "string" type. It is for this reason that the two lines of code above this description perform a variable type modification; the grouping column (VARA) is converted as well for consistency.

pandadataframe = pandadataframe.groupby(['VARA'])['DATAVAL'].apply('|'.join).reset_index()

Once the above function has performed its task, we will call the print function in order to display the result.

print(pandadataframe)

Which displays the following output:

VARA DATAVAL
0 A 1|2
1 B 1|2|3
2 E 1|2|3|4
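For readers without the .CSV file on hand, the same steps can be sketched against a small in-memory data frame. The rows below are an assumption: they are a hypothetical stand-in for "SetA", chosen only so that the grouped result matches the output displayed above. The names frame, joined and join_error are likewise illustrative.

```python
import pandas

# Hypothetical stand-in for SetA.csv (assumed rows, not the real file).
rows = {
    "VARA": ["A", "A", "B", "B", "B", "E", "E", "E", "E"],
    "DATAVAL": [1, 2, 1, 2, 3, 1, 2, 3, 4],
}
frame = pandas.DataFrame(rows)

# Joining integers directly fails, which is why the type conversion is needed.
try:
    "|".join([1, 2])
    join_error = None
except TypeError as error:
    join_error = error

# Convert the column being joined to strings.
frame["DATAVAL"] = frame["DATAVAL"].astype("str")

# Group by the distinct VARA entries and join each group's values with "|".
joined = frame.groupby(["VARA"])["DATAVAL"].apply("|".join).reset_index()

print(joined)
```

The delimiter can be changed by swapping "|" for any other string, such as ", ".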

As we have succeeded with our task, all that remains is saving our newly created data set. This can be achieved through the utilization of the code below:

# Choose file pathway designation to indicate where data will be saved #

pandadataframe.to_csv("C:\\Users\\Desktop\\SetAOutput.csv", sep=',', encoding='utf-8', index = False)

The data set, when viewed within MS-Excel will resemble the following image:

I hope that you found this article helpful. Soon I’ll be back with another entry, but not too soon. Until then, stay inquisitive, Data Heads!

Tuesday, February 26, 2019

Trim, Concatenate, Remove Punctuation, Left and Right (MS-Excel)

In today’s entry we will explore, or in the case of concatenate, re-explore some of the more useful text modification functions within MS-Excel.

The example work sheet which we will be utilizing is illustrated below:

This worksheet can be found within this website’s GitHub Repository.

Let’s say that you wanted to create a single cell within the work sheet which contained the following formatted text:

The large cat, sat his large rear on, the tiny mat.

Typing this out ourselves, or manually formatting the text contained within each cell, seems like the direct way of completing this task. However, we will assume that achieving such is impossible in our example scenario.

TRIM()

This function, according to the Microsoft Office website:

“Removes all spaces from text except for single spaces between words. Use TRIM on text that you have received from another application that may have irregular spacing.”
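For readers who also work in Python, the behaviour described in the quote can be approximated with a few lines of code. This is only a sketch: the function name excel_style_trim is hypothetical, and it handles the ordinary space character only, which is my reading of the description above.

```python
def excel_style_trim(text):
    """Approximate TRIM: drop leading, trailing, and repeated spaces."""
    # Splitting on a literal space leaves empty strings wherever spaces
    # were doubled up or sat at either end of the text. Filtering those
    # out and re-joining leaves exactly one space between words.
    return " ".join(part for part in text.split(" ") if part)


print(repr(excel_style_trim("  The   large cat,  ")))
```

Punctuation is left untouched; only the spacing changes.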

Let’s apply this function to each cell entry from columns A to D.

This is achieved by entering “=TRIM()” within each destination cell, with the corresponding source cell supplied as the function’s argument (for example, “=TRIM(A2)”).

The result is as follows:

CONCAT()

Now that the previous step has been completed, we can begin the concatenation process. Within a destination cell (we will use E4), we will type the following code:

=CONCAT(A2, " ", B2, ", ", C2, " ",D2)

Illustrated, this appears as such:

The result being:

We’ve almost completed our task. All that remains is a single modification. We must remove the “;”, at the end of the sentence, and in its place, insert a “.”.

(NOTE: “CONCAT” replaces the “CONCATENATE” function which existed within the older versions of Excel. If the “CONCAT” function is not performing its task, try utilizing the “CONCATENATE” function in the same manner as illustrated above.)
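The concatenation above can also be sketched in Python, where the "+" operator plays the role of CONCAT(). The four cell values below are assumptions: the worksheet itself is only shown as an image, so these hypothetical values were chosen to reproduce the sentence this exercise is building.

```python
# Assumed contents of cells A2 through D2 (hypothetical values).
a2 = "The large cat,"
b2 = "sat his large rear on"
c2 = "the tiny"
d2 = "mat;"

# Mirrors =CONCAT(A2, " ", B2, ", ", C2, " ", D2)
sentence = a2 + " " + b2 + ", " + c2 + " " + d2

print(sentence)
```

As in the worksheet, the separators (" " and ", ") are supplied explicitly; nothing is inserted between the pieces automatically.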

LEFT() and RIGHT()

Though these functions are not immediately useful as they pertain to the completion of our task, they should nevertheless be discussed.

LEFT() and RIGHT() are two separate MS-Excel functions. Each performs a similar task: returning a specified number of characters from an indicated cell.

RIGHT() and LEFT() dictate the end of the text from which the characters are taken.

RIGHT() – Returns the specified number of characters, counting from the right (the end of the text).

LEFT() – Returns the specified number of characters, counting from the left (the beginning of the text).

So, if for example we typed:

=LEFT(A2, 3)

into an empty cell, and A2 contained:

The large cat,

The value within the destination cell would now contain:

The

Likewise, if we were to type:

=RIGHT(A2,4)

into an empty cell, and A2 contained:

The large cat,

The value within the destination cell would now contain:

cat,
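The two examples above map closely onto Python’s string slicing. The sketch below is an approximation only; the function names left and right are hypothetical helpers, not part of any library, and error handling for negative counts is omitted.

```python
def left(text, num_chars=1):
    """Return the first num_chars characters of text."""
    return text[:num_chars]


def right(text, num_chars=1):
    """Return the last num_chars characters of text."""
    # text[-0:] would return the whole string, so zero is handled separately.
    return text[-num_chars:] if num_chars > 0 else ""


cell_a2 = "The large cat,"

print(left(cell_a2, 3))
print(right(cell_a2, 4))
```

If the character count is left out, a single character is returned, which is also how the RIGHT(E2) call later in this entry behaves.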

RemovePuncuation()

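Another useful function, which is unrelated to this exercise, is RemovePuncuation(). Note that this is not a built-in MS-Excel function; it must first be defined within the workbook as a user-defined function (for instance, in VBA) before it can be called. As the name indicates, RemovePuncuation() creates a cell entry which contains the contents of an indicated cell, with all punctuation removed.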

Therefore, if we typed:

=RemovePuncuation(A2)

into an empty cell, and A2 contained:

The large cat,

The value within the destination cell would now contain:

The large cat

This function removes ALL punctuation. Therefore, all periods, commas, apostrophes, semi-colons, etc., would be removed from the text within the destination cell.
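The same idea can be sketched in Python with the standard library. The function name remove_punctuation is hypothetical, and the set of characters removed is Python’s string.punctuation (ASCII punctuation only), which is an assumption about what should count as punctuation here.

```python
import string


def remove_punctuation(text):
    """Return text with every ASCII punctuation character deleted."""
    # str.maketrans with a third argument builds a table that maps each
    # listed character to None, which translate() then drops.
    deletion_table = str.maketrans("", "", string.punctuation)
    return text.translate(deletion_table)


print(remove_punctuation("The large cat,"))
```

Spaces are preserved, so only the punctuation marks themselves disappear.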

Removing the Final Cell Character

We will finish our exercise by creating a new cell entry which contains the contents of our original cell, with the exception of the final character (;).

This can be achieved with the code below:

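=LEFT(E2, LEN(E2)-(RIGHT(E2) = ";")) & "."

Implemented, this resembles the following:

The result being:

The initial function, LEFT(), is specifying the removal of the semi-colon. RIGHT(E2) returns the final character of E2, and the comparison with ";" evaluates to TRUE (treated as 1) or FALSE (treated as 0). That value is subtracted from the length of the text, so the final character is dropped only when it is a semi-colon.

The subsequent portion, & ".", is adding a period in lieu of the removed semi-colon.

E2 is the cell value which is being targeted. This target value can be modified based on what the situation entails. The value within the comparison, “;”, can be modified to whatever the final character is (“.”, “,”, etc.) within the target cell which requires removal.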
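The goal of this final step (drop a trailing semi-colon, then close the sentence with a period) can also be sketched in Python. The function name replace_final_character and its parameters are hypothetical, and the input sentence is the one this exercise has been building.

```python
def replace_final_character(text, unwanted=";", replacement="."):
    """Drop a trailing unwanted character, then append the replacement."""
    # Only remove the last character when it is the one we are targeting.
    if text.endswith(unwanted):
        text = text[:-len(unwanted)]
    return text + replacement


cell_e2 = "The large cat, sat his large rear on, the tiny mat;"

print(replace_final_character(cell_e2))
```

As with the worksheet approach, the targeted character can be changed by passing a different value for unwanted.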