Reflections of a Data Scientist: August 2019

Sunday, August 4, 2019

Model and Method Utilization

There are many model types, methods and techniques demonstrated on this website. In this entry, I will categorize each of them, and provide a brief description of the scenario which would warrant its utilization.

(Tests of Normality)

Q-Q Plot – A graph which is utilized to assess data for normality.

P-P Plot – A graph which is utilized to assess data for normality.

Shapiro-Wilk Normality Test – A test which is utilized to test data for normality.
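
The graphical and formal checks described above can be run together in base R. The following is a minimal sketch, assuming a simulated sample (the vector name "sample_data" and its values are hypothetical and only serve as an illustration).

```r
# Hypothetical sample; replace with your own numeric vector.
set.seed(42)  # makes the simulated data reproducible
sample_data <- rnorm(50, mean = 100, sd = 15)

# Q-Q plot: points which follow the reference line suggest normality.
qqnorm(sample_data)
qqline(sample_data)

# Shapiro-Wilk test: a p-value above .05 provides no evidence against normality.
shapiro_result <- shapiro.test(sample_data)
shapiro_result$p.value
```

The plot and the test complement each other: the test provides a p-value, while the plot shows where any departure from normality occurs.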

(Tests Related to Parametric Model Variable Correlation)

Variance Inflation Factor – A method which tests model variables for multicollinearity (correlation amongst the independent variables of a model).

(Pearson) Coefficient of Correlation – A method which tests variables for correlation.

Partial Correlation - A method which is utilized to measure the correlation between two variables, while also controlling for a third variable.

Distance Correlation – A method which tests model variables for correlation through the utilization of a Euclidean distance formula.

Canonical Correlation – A method which assesses the correlation between two sets of variables by combining the variables within each set into composite variables.

(Tests Related to Non-Parametric Model Variable Correlation)

Spearman’s Rank Correlation - A non-parametric alternative to the Pearson correlation. This method is utilized in circumstances when either the relationship between the data samples is non-linear, or the data contained within those samples is ordinal. An example of ordinal data: “survey response data which asked the respondent to rank a particular item on a scale of 1-10”.

Kendall Rank Correlation Coefficient - Like Spearman’s rho, Kendall’s Tau is utilized in circumstances when either the relationship between the data samples is non-linear, or the data contained within the samples is ordinal.
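
The Pearson, Spearman and Kendall coefficients can all be produced by the base R function cor.test(), with only the "method" argument changing. The following is a sketch which assumes a small set of paired observations (the variable names and values are hypothetical).

```r
# Hypothetical paired observations (not taken from any example on this site).
hours_studied <- c(2, 4, 5, 7, 8, 10, 11, 13)
exam_score <- c(51, 58, 60, 71, 70, 82, 85, 94)

# Parametric: measures the strength of a linear relationship.
pearson <- cor.test(hours_studied, exam_score, method = "pearson")

# Non-parametric: both operate on ranks rather than raw values.
spearman <- cor.test(hours_studied, exam_score, method = "spearman")
kendall <- cor.test(hours_studied, exam_score, method = "kendall")

# Collect the three coefficients for comparison.
c(pearson = unname(pearson$estimate),
  spearman = unname(spearman$estimate),
  kendall = unname(kendall$estimate))
```

Because the rank-based methods ignore the distance between values, a single out-of-order pair lowers them by a fixed amount regardless of how far apart the two values are.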

(Tests of Significance Amongst Groups)

One Sample T-Test - This test is utilized to compare a sample mean to a specific value. It is used when the dependent variable is measured at the interval or ratio level.

Two Sample T-Test - This test functions in the same manner as the above test. However, in the case of this model, the means of two independent groups are compared, with data randomly sampled from each group.

The Welch Two Sample T-Test - This test functions in the same manner as the above test. The only difference being, this method is utilized if the two groups cannot be assumed to have equal variances, which matters most when the groups are also of uneven size.

Paired T-Test – Similar in composition to the Two Sample T-Test, this test is utilized if you are sampling the same set twice, once for each condition.
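
In R, the standard and Welch variants of the two sample test are both produced by t.test(). The Welch variant is the default, and the pooled-variance variant must be requested with var.equal = TRUE. The following is a sketch which assumes two hypothetical groups of uneven size and visibly different spread.

```r
# Hypothetical groups: group_b is larger and far more variable than group_a.
group_a <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.2)
group_b <- c(10.2, 14.9, 9.1, 15.8, 12.7, 8.4, 16.3, 11.0)

# Standard two sample t-test: assumes equal variances.
pooled <- t.test(group_a, group_b, var.equal = TRUE)

# Welch two sample t-test: var.equal = FALSE is the default.
welch <- t.test(group_a, group_b)

pooled$parameter  # degrees of freedom: n1 + n2 - 2
welch$parameter   # adjusted degrees of freedom, typically fractional
```

The two tests report different degrees of freedom, which is a quick way to confirm which variant was actually run.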

(Analysis of Variance “ANOVA”)

Analysis of Variance – Also known as ANOVA, this method is utilized to test for significant differences amongst the means of multiple sample groups, which it achieves by comparing the variance between groups to the variance within groups. In many ways, this test is similar to a t-test; however, ANOVA allows for the comparison of more than two groups.

One Way Analysis of Variance (ANOVA) – An ANOVA model containing a single independent variable.

Two Way Analysis of Variance (ANOVA) - An ANOVA model containing two independent variables.

Repeated-Measures Analysis of Variance (ANOVA) – An ANOVA model containing a single independent variable measured multiple times.

(Exotic Analysis of Variance “ANOVA” Variants)

Analysis of Covariance (ANCOVA) – An ANOVA model which also factors for a covariate value which may impact the system as a whole.

https://statistics.laerd.com/spss-tutorials/ancova-using-spss-statistics.php

Random Effects Analysis of Variance – An ANOVA model which is synthesized from sampling from a greater population in order to determine inference.

https://stat.ethz.ch/education/semesters/as2015/anova/06_Random_Effects.pdf

Multivariate Analysis of Variance (MANOVA) – An ANOVA model containing multiple dependent variables.

https://statistics.laerd.com/spss-tutorials/one-way-manova-using-spss-statistics.php

Multivariate Analysis of Covariance (MANCOVA) – An ANOVA model containing multiple dependent variables. Also factors for a covariate value which may impact the system as a whole.

https://statistics.laerd.com/spss-tutorials/one-way-mancova-using-spss-statistics.php

(Test of Significance for Nonparametric Data)

Friedman Test (Repeated-Measures Analysis of Variance) – The nonparametric alternative to a One Way Repeated-Measures ANOVA test.

Wilcoxon Signed Rank Test (One Sample T-Test, Paired T-Test) – The nonparametric alternative to the One Sample T-Test, and the Paired T-Test.

Mann-Whitney U Test (Two Sample T-Test) – The nonparametric alternative to the Two Sample T-Test.
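
All three of these tests are available in base R. The following is a sketch which reuses the sample values from the paired t-test, two sample t-test and repeated measures examples found later in this entry (the variable names are my own and are only illustrative).

```r
# Wilcoxon Signed Rank Test: paired samples (watch battery lifespans).
original_battery <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)
new_battery <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)
paired_result <- wilcox.test(new_battery, original_battery,
                             paired = TRUE, alternative = "greater")

# Mann-Whitney U Test: wilcox.test() applied to two independent samples.
# exact = FALSE requests the normal approximation, as tied values rule out
# an exact p-value.
treated <- c(70, 74, 76, 72, 75, 74, 71, 71)
untreated <- c(74, 75, 73, 76, 74, 77, 78, 75)
two_sample_result <- wilcox.test(untreated, treated, exact = FALSE)

# Friedman Test: one row per participant, one column per measurement.
happiness <- matrix(c(1, 8, 2, 4, 4, 10, 2, 9,
                      4, 2, 5, 4, 3, 4, 2, 1,
                      5, 10, 1, 1, 4, 6, 1, 8),
                    ncol = 3,
                    dimnames = list(NULL, c("Before", "Week1", "Week2")))
friedman_result <- friedman.test(happiness)

c(paired = paired_result$p.value,
  two_sample = two_sample_result$p.value,
  friedman = friedman_result$p.value)
```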

(Tests of Significance Amongst Groups)

Chi-Square – A test which measures the significance of the association between categorical variables, commonly a grouping category and a binary outcome variable.

McNemar's Test – A test which measures categorical significance, limited to two initial categories, and two categorical outcomes. This test is typically utilized for drug trials.

(Metric to Assess Rate of Agreement Amongst Two Entities)

Cohen’s Kappa – A test which measures the rate of agreement amongst two entities.

(Tests of Significance Amongst Groups Comprised of Survey Questions)

Cronbach’s Alpha - Cronbach’s Alpha is primarily utilized to measure the inter-relatedness of response data collected from sociological surveys. Specifically, the potential differentiation of response information related to certain interrelated categorical survey questions.

(Tests Pertaining to Stationarity and Random Walks)

Dickey-Fuller Test – A methodology of analysis utilized to test data for stationarity.

Phillips-Perron Unit Root Test – A methodology utilized to test data for random walk potential.

(Comparison of Outcome Variables)

Two Step Cluster – A method which assesses model outcome variables through the utilization of a clustering technique.

K-Means - A method which assesses model outcome variables through the utilization of a clustering technique.

Hierarchical Cluster - A method which assesses model outcome variables through the utilization of a hierarchical technique.

K-Nearest Neighbor – A method which compares similarity of outcome variables as determined by the values of the model’s independent variables.

(Reduction of Independent Variables through Variable Synthesis)

Dimension Reduction – A method which creates new variables with values that are determined by the original values of the independent model variables.

(Impact Assessment)

TURF Analysis – A method of analysis typically utilized for product and design studies. This technique assesses the most effective way to reach a sample target demographic.

(Survival Analysis)

Survival Analysis - A statistical methodology which measures the probability of an event occurring within a group over a period of time.

(Sample Distribution Tests)

The Wald Wolfowitz Test - A method for analyzing a single data set in order to determine whether the elements within the data set were sampled independently.

The Wald Wolfowitz Test (2-Sample) - A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

The Kolmogorov-Smirnov Test - A method for analyzing a single data set in order to determine whether the data was sampled from a specified reference distribution, most commonly a normal distribution.

The Kolmogorov-Smirnov Test (2-Sample) - A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.
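
Both forms of the Kolmogorov-Smirnov test are handled by the base R function ks.test(). The following is a sketch which assumes simulated data (the vectors "normal_sample" and "uniform_sample" are hypothetical).

```r
# Hypothetical samples drawn from two differently shaped distributions.
set.seed(1)  # makes the simulated data reproducible
normal_sample <- rnorm(40, mean = 0, sd = 1)
uniform_sample <- runif(40, min = -2, max = 2)

# One sample: compare the data against a fully specified normal distribution.
ks_one <- ks.test(normal_sample, "pnorm", mean = 0, sd = 1)

# Two sample: compare the distributions of two data sets against each other.
ks_two <- ks.test(normal_sample, uniform_sample)

c(one_sample = ks_one$p.value, two_sample = ks_two$p.value)
```

Note that the one sample form expects the parameters of the reference distribution to be specified in advance. If the mean and standard deviation are instead estimated from the same sample, the reported p-value is no longer accurate.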

(Outcome Models – Conditions for Utilization)

Linear Regression – Continuous outcome variable. Continuous independent variable(s).

General Linear Mixed Models – Continuous outcome variable. Any type of independent variable(s).

Logistic Regression Analysis – Binary outcome variable. Categorical or continuous independent variable(s).

Discriminant Analysis – Binary outcome variable. Categorical or continuous independent variable(s).

Loglinear Analysis - Binary outcome variable. Categorical independent variable(s).

Partial Least Squares Regression – Any type of outcome variable. Any type of independent variable(s).

Polynomial Regression – Continuous outcome variable. Continuous independent variable(s).

Multinomial Logistic Analysis – Categorical outcome variable. Categorical input variable(s).

Ordinal Logistic Regression – Ordinal (ranked categorical) outcome variable. Categorical input variable(s).

Probit Regression – Binary outcome variable. Categorical or continuous input variable(s).

2-Stage Least Squares Regression - Continuous outcome variable. Continuous independent variable(s).

APA Format

In today’s article, we will discuss the standard methodology which is utilized to report statistical findings. In previous examples featured on this website, model outputs were explained in a simplified manner in order to reduce complexity. However, if the purpose of the overall research endeavor is to produce results for publication, then the APA format should be applied to whatever experimental findings are generated.

“APA” is an abbreviation for the American Psychological Association. Regardless of the type of research that is being conducted, the formatting standards maintained by the APA, as they apply to statistical research, should be utilized when presenting data in a professional manner.

Details

All figures which contain decimal values should be rounded to the nearest hundredth. Ex. .105 = .11. P-values are the exception to this rule. In most cases, p-values should be reported to two decimal places, with a third decimal place added when a greater amount of specificity is required to illustrate the details of the findings (Ex. p = .003). Very small p-values are reported as p < .001.

Another rule to keep in mind pertains to leading zeroes. A leading zero prior to a decimal place is only required if the represented figure has the potential to exceed “1”. If the value cannot exceed “1”, as is the case with p-values, then a leading zero is unnecessary.
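
The rounding and leading zero rules can be expressed as a pair of small R functions. The following is only a sketch: "apa_num" and "apa_p" are hypothetical helper names of my own, and are not part of any package.

```r
# Hypothetical helpers which apply the rounding and leading zero rules.
apa_num <- function(x, digits = 2, can_exceed_one = TRUE) {
  out <- formatC(round(x, digits), format = "f", digits = digits)
  # Drop the leading zero for values which cannot exceed 1.
  if (!can_exceed_one) out <- sub("^(-?)0\\.", "\\1.", out)
  out
}

apa_p <- function(p) {
  if (p < .001) return("p < .001")
  # Use a third decimal place only when two would hide the detail.
  digits <- if (p < .01) 3 else 2
  paste0("p = ", apa_num(p, digits, can_exceed_one = FALSE))
}

apa_num(18.857)                         # "18.86"
apa_num(0.289, can_exceed_one = FALSE)  # ".29"
apa_p(0.02773)                          # "p = .03"
apa_p(0.002572)                         # "p = .003"
apa_p(0.0002926)                        # "p < .001"
```

Whether a value can exceed 1 is a property of the statistic itself and not of the number, which is why it is passed in as an argument instead of being inferred.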

Below are examples which demonstrate the most common application of the APA format.

Chi-Square

Template:

A chi-square test of independence was performed to examine the relation between CATEGORY and OUTCOME. The relation between these variables was found to be significant at the p < .05 level, χ2 (DEGREES OF FREEDOM, N = SAMPLE SIZE) = X-Squared Value, p = p - value.

- OR -

A chi-square test of independence was performed to examine the relation between CATEGORY and OUTCOME. The relation between these variables was not found to be significant at the p < .05 level, χ2 (DEGREES OF FREEDOM, N = SAMPLE SIZE) = X-Squared Value, p = p - value.

Example:

While working as a statistician at a local university, you are tasked to evaluate, based on survey data, the level of job satisfaction that each member of the staff currently has for their occupational role (Assume a 95% Confidence Interval).

The data that you gather from the surveys is as follows:

General Faculty
130 Satisfied 20 Unsatisfied

Professors
30 Satisfied 20 Unsatisfied

Adjunct Professors
80 Satisfied 20 Unsatisfied

Custodians
20 Satisfied 10 Unsatisfied

# Code #

Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10), nrow = 4, ncol=2)

N <- sum(130, 30, 80, 20, 20, 20, 20, 10)

chisq.test(Model)

N

# Console Output #

Pearson's Chi-squared test

data: Model
X-squared = 18.857, df = 3, p-value = 0.0002926

> N
[1] 330

APA Format:

A chi-square test of independence was performed to examine the relation between occupational role and job satisfaction. The relation between these variables was found to be significant at the p < .05 level, χ2 (3, N = 330) = 18.86, p < .001.
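
The figures required by the template do not need to be copied by hand from the console, as they can be read directly from the object which chisq.test() returns. The following is a sketch which reuses the "Model" matrix from the example above (the remaining variable names are my own).

```r
Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10), nrow = 4, ncol = 2)

result <- chisq.test(Model)

# Pull each reported figure out of the test object.
degrees_of_freedom <- unname(result$parameter)
sample_size <- sum(Model)
chi_square <- unname(result$statistic)

# Assemble the statistical portion of the APA sentence.
apa_line <- sprintf("chi-square(%d, N = %d) = %.2f",
                    as.integer(degrees_of_freedom),
                    as.integer(sample_size),
                    chi_square)

apa_line
result$p.value
```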

Tukey HSD

Template:

Post hoc comparisons using the Tukey HSD test indicated that the mean score for the CONDITION A (M = Mean1, SD = Standard Deviation1) was significantly different than CONDITION B (M = Mean2, SD = Standard Deviation2), p = p-value.

Analysis of Variance (ANOVA)

(One Way)

Template:

There was a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

- OR -

There was not a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

Example:

A chef wants to test if patrons prefer a soup which he prepares based on salt content. He prepares a limited experiment in which he creates three types of soup: soup with a low amount of salt, soup with a medium amount of salt, and soup with a high amount of salt. He then serves this soup to his customers and asks them to rate their satisfaction on a scale from 1-8.

Low Salt Soup is rated: 4, 1, 8
Medium Salt Soup is rated: 4, 5, 3, 5
High Salt Soup is rated: 3, 2, 5

(Assume a 95% Confidence Interval)

# Code #

satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)

salt <- c(rep("low",3), rep("med",4), rep("high",3))

salttest <- data.frame(satisfaction, salt)

results <- aov(satisfaction~salt, data=salttest)

summary(results)

# Console Output #

Df Sum Sq Mean Sq F value Pr(>F)
salt 2 1.92 0.958 0.209 0.816
Residuals 7 32.08 4.583

APA Format:

There was not a significant effect of the level of salt content on patron satisfaction at the p < .05 level for the three conditions (F(2, 7) = 0.21, p = .82).
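
Both degrees of freedom values, the F value and the p-value can be extracted from the summary table instead of being read from the console. The following is a sketch which reuses the soup data from the example above (the variable names "anova_table", "df1", "df2", "f_value" and "p_value" are my own).

```r
satisfaction <- c(4, 1, 8, 4, 5, 3, 5, 3, 2, 5)
salt <- c(rep("low", 3), rep("med", 4), rep("high", 3))
salttest <- data.frame(satisfaction, salt)

results <- aov(satisfaction ~ salt, data = salttest)

# summary() returns a list; its first element is the ANOVA table.
anova_table <- summary(results)[[1]]

df1 <- anova_table[["Df"]][1]           # degrees of freedom for the factor
df2 <- anova_table[["Df"]][2]           # degrees of freedom for the residuals
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

c(df1 = df1, df2 = df2, F = round(f_value, 2), p = round(p_value, 2))
```

The first row of the table always belongs to the factor and the last row to the residuals, which is why the two degrees of freedom values are read from different rows.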

(Two Way)

Template:

Hypothesis 1:

There was a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

- OR -

There was not a significant effect of the CATEGORY on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(1), Degrees of Freedom(2)) = F Value, p = p - value).

Hypothesis 2:

There was a significant effect of the CATEGORY2 on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(2), Degrees of Freedom(4)) = F Value, p = p - value).

- OR -

There was not a significant effect of the CATEGORY2 on the OUTCOME for SCENARIO at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(2), Degrees of Freedom(4)) = F Value, p = p - value).

Hypothesis 3:

There was a statistically significant interaction effect of the CATEGORY1 on the CATEGORY2 at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(3), Degrees of Freedom(4)) = F Value, p = p - value).

- OR -

There was not a statistically significant interaction effect of the CATEGORY1 on the CATEGORY2 at the p < .05 level for the NUMBER OF CONDITIONS (F(Degrees of Freedom(3), Degrees of Freedom(4)) = F Value, p = p - value).

Example:

Researchers want to test study habits within two schools as they pertain to student life satisfaction. The researchers also believe that the school that each group of students is attending may have an impact on study habits. Students from each school are assigned study material which totals 1 hour, 2 hours, or 3 hours on a daily basis. The satisfaction of each student group is measured on a scale from 1-10 after a duration of 1 month.

(Assume a 95% Confidence Interval)

School A:

1 Hour of Study Time: 7, 2, 10, 2, 2
2 Hours of Study Time: 9, 10, 3, 10, 8
3 Hours of Study Time: 3, 6, 4, 7, 1

School B:

1 Hour of Study Time: 8, 5, 1, 3, 10
2 Hours of Study Time: 7, 5, 6, 4, 10
3 Hours of Study Time: 5, 5, 2, 2, 2

satisfaction <- c(7, 2, 10, 2, 2, 8, 5, 1, 3, 10, 9, 10, 3, 10, 8, 7, 5, 6, 4, 10, 3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

studytime <- c(rep("One Hour",10), rep("Two Hours",10), rep("Three Hours",10))

school <- c(rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5), rep("SchoolA",5), rep("SchoolB",5))

schooltest <- data.frame(satisfaction, studytime, school)

results <- aov(lm(satisfaction ~ studytime * school, data=schooltest))

summary(results)

Which produces the output:

Df Sum Sq Mean Sq F value Pr(>F)
studytime 2 62.6 31.300 3.809 0.0366 *
school 1 2.7 2.700 0.329 0.5718
studytime:school 2 7.8 3.900 0.475 0.6278
Residuals 24 197.2 8.217
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

APA Format:

There was a significant effect as it pertains to study time impacting student satisfaction at the p < .05 level for the three conditions (F(2, 24) = 3.81, p = .04).

There was not a significant effect as it relates to the school attended impacting student satisfaction at the p < .05 level for the two conditions (F(1, 24) = 0.33, p = .57).

There was not a statistically significant interaction effect of the school variable on the study time variable at the p < .05 level (F(2, 24) = 0.47, p = .63).

TukeyHSD(results)

> TukeyHSD(results)
Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = lm(satisfaction ~ studytime * school, data = schooltest))

$studytime
diff lwr upr p adj
Three Hours-One Hour -1.3 -4.5013364 1.901336 0.5753377
Two Hours-One Hour 2.2 -1.0013364 5.401336 0.2198626
Two Hours-Three Hours 3.5 0.2986636 6.701336 0.0302463

$school
diff lwr upr p adj
SchoolB-SchoolA -0.6 -2.760257 1.560257 0.571817

$`studytime:school`

diff lwr upr p adj
Three Hours:SchoolA-One Hour:SchoolA -0.4 -6.005413 5.2054132 0.9999178
Two Hours:SchoolA-One Hour:SchoolA 3.4 -2.205413 9.0054132 0.4401459
One Hour:SchoolB-One Hour:SchoolA 0.8 -4.805413 6.4054132 0.9976117
Three Hours:SchoolB-One Hour:SchoolA -1.4 -7.005413 4.2054132 0.9696463
Two Hours:SchoolB-One Hour:SchoolA 1.8 -3.805413 7.4054132 0.9157375
Two Hours:SchoolA-Three Hours:SchoolA 3.8 -1.805413 9.4054132 0.3223867
One Hour:SchoolB-Three Hours:SchoolA 1.2 -4.405413 6.8054132 0.9844928
Three Hours:SchoolB-Three Hours:SchoolA -1.0 -6.605413 4.6054132 0.9932117
Two Hours:SchoolB-Three Hours:SchoolA 2.2 -3.405413 7.8054132 0.8260605
One Hour:SchoolB-Two Hours:SchoolA -2.6 -8.205413 3.0054132 0.7067715
Three Hours:SchoolB-Two Hours:SchoolA -4.8 -10.405413 0.8054132 0.1240592
Two Hours:SchoolB-Two Hours:SchoolA -1.6 -7.205413 4.0054132 0.9470847
Three Hours:SchoolB-One Hour:SchoolB -2.2 -7.805413 3.4054132 0.8260605
Two Hours:SchoolB-One Hour:SchoolB 1.0 -4.605413 6.6054132 0.9932117
Two Hours:SchoolB-Three Hours:SchoolB 3.2 -2.405413 8.8054132 0.5052080

twohours <- c(9, 10, 3, 10, 8, 7, 5, 6, 4, 10)
threehours <- c(3, 6, 4, 7, 1, 5, 5, 2, 2, 2)

mean(twohours)
sd(twohours)

mean(threehours)
sd(threehours)

> mean(twohours)
[1] 7.2
> sd(twohours)
[1] 2.616189
>
> mean(threehours)
[1] 3.7
> sd(threehours)
[1] 2.002776

APA Format:

Post hoc comparisons using the Tukey HSD test indicated that at the p < .05 level, the mean score for the level of satisfaction reported by students who studied for Two Hours (M = 7.20, SD = 2.62) was significantly different from the scores of the students who studied for Three Hours (M = 3.70, SD = 2.00), p = .03.

(Repeated Measures)

Template:

Example:

Researchers want to test the impact of reading existential philosophy on a group of 8 individuals. They measure the happiness of the participants three times, once prior to reading, once after reading the materials for one week, and once after reading the materials for two weeks. We will assume an alpha of .05.

Before Reading = 1, 8, 2, 4, 4, 10, 2, 9
After Reading = 4, 2, 5, 4, 3, 4, 2, 1
After Reading (wk. 2) = 5, 10, 1, 1, 4, 6, 1, 8

library(nlme) # You will need to install and enable this package, it provides the lme() function #

happiness <- c(1, 8, 2, 4, 4, 10, 2, 9, 4, 2, 5, 4, 3, 4, 2, 1, 5, 10, 1, 1, 4, 6, 1, 8 )

week <- c(rep("Before", 8), rep("Week1", 8), rep("Week2", 8))

id <- rep(1:8, times = 3) # Each of the 8 participants is measured three times #

survey <- data.frame(id, happiness, week)

model <- lme(happiness ~ week, random=~1|id, data=survey)

anova(model)

This method saves some time by producing the output:

numDF denDF F-value p-value
(Intercept) 1 14 37.21053 <.0001
week 2 14 1.04624 0.3772

APA Format:

There was not a significant effect of reading existential philosophy on participant happiness at the p < .05 level for the three conditions (F(2, 14) = 1.05, p = .38).

Student’s T-Test

(One Sample T-Test)

Template:

(Right Tailed)

There was a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the historically assumed mean (M = Historic Mean Value); t(Degrees of Freedom) = t-value, p = p-value.

- OR -

There was not a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the historically assumed mean (M = Historic Mean Value); t(Degrees of Freedom) = t-value, p = p-value.

Example:

A factory employee believes that the cakes produced within his factory are being manufactured with excess amounts of corn syrup, thus altering the taste. 10 cakes were sampled from the most recent batch and tested for corn syrup composition. Typically, each cake should be composed of 20% corn syrup. Utilizing a 95% confidence interval, can we assume that the new batch of cakes contains more than a 20% proportion of corn syrup?

The levels of the samples were:

.27, .31, .27, .34, .40, .29, .37, .14, .30, .20

N <- c(.27, .31, .27, .34, .40, .29, .37, .14, .30, .20)

t.test(N, alternative = "greater", mu = .2, conf.level = 0.95)

# "alternative =" specifies the type of test that R will perform. "greater" indicates a right tailed test. "less" indicates a left tailed test. "two.sided" indicates a two tailed test. #

One Sample t-test

data: N
t = 3.6713, df = 9, p-value = 0.002572
alternative hypothesis: true mean is greater than 0.2
95 percent confidence interval:
0.244562 Inf
sample estimates:
mean of x
0.289

mean(N)
sd(N)

> mean(N)
[1] 0.289
>
> sd(N)
[1] 0.07665942

APA Format:

A one sample t-test was conducted to compare the level of corn syrup in the current sample batch of cakes, to the assumed historical level of corn syrup contained within previously manufactured cakes.

There was a significant increase in the amount of corn syrup in the recent batch of cakes (M = .29, SD = .08), as compared to the historically assumed mean (M = .20); t(9) = 3.67, p = .003.

(Two Sample T-Test)

Template:

(Two Tailed)

There was a significant difference in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

-OR-

There was not a significant difference in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

70, 74, 76, 72, 75, 74, 71, 71

He then measures the temperature of samples to which the chemical was not applied.

74, 75, 73, 76, 74, 77, 78, 75

Can the scientist conclude, with a 95% confidence interval, that his chemical is in some way altering the temperature of the water?

N1 <- c(70, 74, 76, 72, 75, 74, 71, 71)

N2 <- c(74, 75, 73, 76, 74, 77, 78, 75)

t.test(N2, N1, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)

Two Sample t-test

data: N2 and N1
t = 2.4558, df = 14, p-value = 0.02773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3007929 4.4492071
sample estimates:
mean of x mean of y
75.250 72.875

mean(N1)

sd(N1)

mean(N2)

sd(N2)

> mean(N1)
[1] 72.875
>
> sd(N1)
[1] 2.167124
>
> mean(N2)
[1] 75.25
>
> sd(N2)
[1] 1.669046

APA Format:

A two sample t-test was conducted to compare the temperature of water to which the chemical was applied, to the temperature of water to which the chemical was not applied.

There was a significant difference in the temperature of the water to which the chemical was applied (M = 72.88, SD = 2.17), as compared to the temperature of the water to which the chemical was not applied (M = 75.25, SD = 1.67); t(14) = 2.46, p = .03.

(Paired T-Test)

Template:

(Right Tailed)

There was a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

- OR -

There was not a significant increase in the GROUP A (M = Mean of GROUP A, SD = Standard Deviation of GROUP A), as compared to the GROUP B (M = Mean of GROUP B, SD = Standard Deviation of GROUP B), t(Degrees of Freedom) = t-value, p = p-value.

Example:

A watch manufacturer believes that by changing to a new battery supplier, the watches which are shipped with an initial battery will maintain a longer lifespan. To test this theory, twelve watches are tested for duration of lifespan with the original battery.

The same twelve watches are then re-tested for duration with the new battery.

Can the watch manufacturer conclude that the new battery increases the duration of lifespan for the manufactured watches? (We will assume an alpha value of .05).

For this, we will utilize the code:

N1 <- c(376, 293, 210, 264, 297, 380, 398, 303, 324, 368, 382, 309)
N2 <- c(337, 341, 316, 351, 371, 440, 312, 416, 445, 354, 444, 326)

t.test(N2, N1, alternative = "greater", paired=TRUE, conf.level = 0.95 )

Paired t-test

data: N2 and N1
t = 2.4581, df = 11, p-value = 0.01589
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
12.32551 Inf
sample estimates:
mean of the differences
45.75

mean(N1)
sd(N1)

mean(N2)
sd(N2)

> mean(N1)
[1] 325.3333
>
> sd(N1)
[1] 56.84642
>
> mean(N2)
[1] 371.0833
>
> sd(N2)
[1] 51.22758

APA Format:

A paired t-test was conducted to compare the lifespan duration of watches which contained the new battery, to the lifespan of watches which contained the initial battery.

There was a significant increase in the lifespan duration of watches which contained the new battery (M = 371.08, SD = 51.23), as compared to the lifespan of watches which contained the initial battery (M = 325.33, SD = 56.85); t(11) = 2.46, p = .02.

Regression Models

Example:

(Standard Regression Model)

x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

multiregress <- lm(y ~ x + z)

summary(multiregress)

Call:
lm(formula = y ~ x + z)

Residuals:
Min 1Q Median 3Q Max
-6.4016 -5.0054 -1.7536 0.8713 14.0886

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.1434 12.0381 3.916 0.00578 **
x 0.7808 0.3316 2.355 0.05073 .
z 0.3990 0.4804 0.831 0.43363
---
Residual standard error: 7.896 on 7 degrees of freedom
Multiple R-squared: 0.5249, Adjusted R-squared: 0.3891
F-statistic: 3.866 on 2 and 7 DF, p-value: 0.07394

APA Format:

A linear regression model was utilized to test if variables “x” and “z” significantly predicted outcomes within the observations of “y” included within the sample data set. The results indicated that neither “x” (B = 0.78, p = .051) nor “z” (B = 0.40, p = .43) was a significant predictor variable at the p < .05 level, and that the overall model did not reach significance (R2 = .52, F(2, 7) = 3.87, p = .07).
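
The R-squared value and the p-value of the overall F-test are not stored as a single printed line, but they can be recovered from the summary object. The following is a sketch which reuses the x, y and z values from the example above (the names "fit_summary", "f" and "overall_p" are my own).

```r
x <- c(27, 34, 22, 30, 17, 32, 25, 34, 46, 37)
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)
z <- c(13, 22, 18, 30, 15, 17, 20, 11, 20, 25)

fit_summary <- summary(lm(y ~ x + z))

# R-squared and adjusted R-squared are stored directly.
fit_summary$r.squared
fit_summary$adj.r.squared

# The F statistic is stored with its degrees of freedom; the overall
# p-value must be computed from the F distribution.
f <- fit_summary$fstatistic
overall_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
unname(overall_p)

# Coefficient table: estimates, standard errors, t values and p-values.
fit_summary$coefficients
```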

(Non-Standard Regression Model)

Example:

# Model Creation #

Age <- c(55, 45, 33, 22, 34, 56, 78, 47, 38, 68, 49, 34, 28, 61, 26)

Obese <- c(1,0,0,0,1,1,0,1,1,0,1,1,0,1,0)

Smoking <- c(1,0,0,1,1,1,0,0,1,0,0,1,0,1,1)

Cancer <- c(1,0,0,1,0,1,0,0,1,1,0,1,1,1,0)

# Summary Creation and Output #

CancerModelLog <- glm(Cancer~ Age + Obese + Smoking, family=binomial)

summary(CancerModelLog)

# Output #

Call:

glm(formula = Cancer ~ Age + Obese + Smoking, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.6096 -0.7471 0.5980 0.8260 1.8485

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.34431 2.25748 -1.038 0.2991
Age 0.02984 0.04055 0.736 0.4617
Obese -0.38924 1.39132 -0.280 0.7797
Smoking 2.54387 1.53564 1.657 0.0976 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 20.728 on 14 degrees of freedom
Residual deviance: 16.807 on 11 degrees of freedom
AIC: 24.807
Number of Fisher Scoring iterations: 4

# Generate Nagelkerke R Squared #

library(BaylorEdPsych) # Download and enable this package, it provides the PseudoR2() function #

PseudoR2(CancerModelLog)

# Console Output #

McFadden Adj.McFadden Cox.Snell Nagelkerke McKelvey.Zavoina Effron Count
0.2328838 -0.2495624 0.2751639 0.3674311 0.3477522 0.3042371 0.8000000
Adj.Count AIC Corrected.AIC
0.5714286 23.9005542 27.9005542

APA Format:

A logistic regression model was utilized to test if a model containing the variables “Age”, “Smoking Status”, and “Obesity”, could predict Cancer outcomes as it pertains to the individuals included within the sample data set. The results indicated that the model does not possess a worthwhile predictive capacity (Nagelkerke R-Square = .37).