Monday, April 30, 2018

(R) Multinomial Logistic Analysis (SPSS)

If you are unfamiliar with the concept of logistic regression, I advise that you review the article pertaining to that topic, previously posted on this site. I make this suggestion because the multinomial logistic model acts as an extension of the logistic regression model. While the logistic regression model can only provide predictive capacity as it pertains to binary outcomes, the multinomial logistic regression model can account for multiple categorical outcomes.

Example (SPSS):

In this demonstration, we will assume that you are attempting to predict an individual’s favorite color based on other aspects of their individuality.

We will begin with our data set:


Which we will assign value labels to produce the following data interface:


To begin our analysis, we must select, from the topmost menu, “Analyze”, then “Regression”, followed by “Multinomial Logistic”.

This sequence of actions should cause the following menu to appear:


Using the topmost arrow button, assign “Color” as the “Dependent” variable. Once this has been completed, utilize the center arrow button to assign the remaining variables (“Gender”, “Smoker”, “Car”) as “Factor(s)”.

Next, click on the button labeled “Reference Category”; this should cause the following menu to appear:


Select the “Custom” option and enter the value of “4” into the box below. After performing these actions, click “Continue”. Once you have returned to the initial menu, click on the button labeled “Save”.

This series of selections should cause the following sub-menu to appear:


Select the box labeled “Predicted category”, then click “Continue”. You will again be returned to the initial menu. From this menu, click “OK”.

This should produce a voluminous output; however, we will only concern ourselves with the following portions:


As was described in a prior article, Pseudo R-Squared methods are utilized to measure model fit when the traditional coefficient of determination is inapplicable. In the case of our example, we will be utilizing the “Nagelkerke” value, which can be assessed on a scale similar to the traditional r-squared metric. Since this model’s Nagelkerke value is .843, we can assume that the model functions as a decent predictor of our dependent variable.

The above output provides us with the model’s estimated coefficients. Though this may appear daunting at first, the information illustrated in the chart is no different from the output generated for a typical linear model.


In the case of our model, we will have three logit equations, with “Blue” serving as the reference category:

Green = (Gender:Female * -35.791) + (Smoker:Yes * 34.774) + (Car:KIA * -17.40) + (Car:Ford * .985) + 17.40

Yellow = (Gender:Female * -36.664) + (Smoker:Yes * 15.892) + (Car:KIA * -35.632) + (Car:Ford * 1.499) + 16.886

Red = (Gender:Female * -19.199) + (Smoker:Yes * 18.880) + (Car:KIA * -37.252) + (Car:Ford * -19.974) + 18.506


As is the case with all logit models, we must transform the output values of each equation to generate the appropriate probabilities.

So, for our first example observation, our equations would resemble:

Observation 1 | Gender: Female | Smoker: No | Car: Chevy

Green = (1 * -35.791) + (0 * 34.774) + (0 * -17.40) + (0 * .985) + 17.40

Yellow = (1 * -36.664) + (0 * 15.892) + (0 * -35.632) + (0 * 1.499) + 16.886

Red = (1 * -19.199) + (0 * 18.880) + (0 * -37.252) + (0 * -19.974) + 18.506


Which produces the values of:

Green = -18.391

Yellow = -19.778

Red = -0.693


To produce probabilities, we have to transform these values through the utilization of the following R code:

Green <- -18.391

Yellow <- -19.778

Red <- -0.693


# Green #

exp(Green) / (1 + exp(Green) + exp(Red) + exp(Yellow))

# Red #

exp(Red) / (1 + exp(Green) + exp(Red) + exp(Yellow))

# Yellow #

exp(Yellow) / (1 + exp(Green) + exp(Red) + exp(Yellow))

# Blue (Reference Category) #

1 / (1 + exp(Green) + exp(Red) + exp(Yellow))


Which produces the following outputs:

[1] 6.867167e-09
[1] 0.333366
[1] 1.715581e-09
[1] 0.666634


Interpretation:

Each output value represents the predicted probability of a category of the dependent variable, with “Blue” serving as the reference category.

P(Green) = 6.867167e-09

P(Yellow) = 1.715581e-09

P(Red) = 0.333366

P(Blue) = 0.666634


Therefore, in the case of the first observation of our example data set, we can assume that the reference category, “Blue”, has the highest likelihood of occurrence.

The predicted values, as a result of the “Save” option, have been output into a column within the original data set. 
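The same “predicted category” logic can be reproduced manually: the predicted category is simply the one with the highest probability. A brief sketch, using the probabilities calculated above for observation 1:

```r
# Probabilities calculated above for observation 1 #
probs <- c(Green = 6.867167e-09, Yellow = 1.715581e-09,
           Red = 0.333366, Blue = 0.666634)

# The predicted category is the category with the highest probability #
predicted <- names(probs)[which.max(probs)]

predicted
```

This returns “Blue”, which matches the conclusion reached above.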


Example (R):

If we wanted to repeat our analysis through the utilization of the “R” platform, we could do so with the following code:

# (With the package: "nnet", downloaded and enabled) #

# Multinomial Logistic Regression #

color <- c("Red", "Blue", "Green", "Blue", "Blue", "Blue", "Green", "Green", "Green", "Yellow")

gender <- c("Female", "Female", "Male", "Female", "Female", "Male", "Male", "Male", "Female", "Male")

smoker <- c("No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "No")

car <-c("Chevy", "Chevy", "Ford", "Ford", "Chevy", "KIA", "Ford", "KIA", "Ford", "Ford")

color <- as.factor(color)
gender <- as.factor(gender)
smoker <- as.factor(smoker)
car <- as.factor(car)

testset <- data.frame(color, gender, smoker, car)

mlr <- multinom(color ~ gender + smoker + car, data=testset )

summary(mlr)


This produces the following output:

Call:
multinom(formula = color ~ gender + smoker + car, data = testset)

Coefficients:
       (Intercept) genderMale  smokerYes   carFord     carKIA
Green  -40.2239699   36.73179  47.085203  21.36387   3.492186
Red     -0.6931559  -17.00881  -3.891315 -20.23802 -11.832468
Yellow -41.0510233   37.33637 -10.943821  21.58634 -22.161372

Std. Errors:
       (Intercept)   genderMale    smokerYes      carFord       carKIA
Green    0.4164966 4.164966e-01 7.125616e-14 3.642157e-01 6.388766e-01
Red      1.2247466 2.899257e-13 1.686282e-23 1.492263e-09 2.899257e-13
Yellow   0.3642157 3.642157e-01 6.870313e-26 3.642157e-01 1.119723e-12

Residual Deviance: 9.364263
AIC: 39.36426
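As a quick sanity check of the output above, the AIC value follows directly from the residual deviance: AIC = deviance + 2k, where k is the number of estimated parameters. Here, the model contains three non-reference equations (Green, Red, Yellow), each with an intercept plus four factor-level coefficients:

```r
# AIC = residual deviance + 2 * (number of estimated parameters) #
deviance <- 9.364263

# Three non-reference equations, each with five coefficients #
k <- 3 * 5

# Matches the reported AIC of 39.36426 #
deviance + 2 * k
```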


To test the model results, the code below can be utilized:

# Test Model #

# Gender : Male #
a <- 0

# Smoker : Yes #
b <- 0

# Car : Ford #
c <- 0

# Car : KIA #
d <- 0

Green <- -40.2239699 + (a * 36.73179) + (b * 47.085203) + (c * 21.36387) + (d * 3.492186)

Red <- -0.6931559 + (a * -17.00881) + (b * -3.891315) + (c * -20.23802) + (d * -11.832468)

Yellow <- -41.0510233 + (a * 37.33637) + (b * -10.943821) + (c * 21.58634) + (d * -22.161372)


# Green #

exp(Green) / (1 + exp(Green) + exp(Red) + exp(Yellow))

# Red #

exp(Red) / (1 + exp(Green) + exp(Red) + exp(Yellow))

# Yellow #

exp(Yellow) / (1 + exp(Green) + exp(Red) + exp(Yellow))

# Blue (Reference Category) #


1 / (1 + exp(Green) + exp(Red) + exp(Yellow))

NOTE: The model’s internal coefficients differ depending on the platform which was utilized to generate the analysis. Though the model predictions do not differ, I would recommend, if publishing findings, utilizing SPSS in lieu of R. The reason for this rationale pertains to the auditing record which SPSS possesses; if data output possesses abnormalities, R, being open source, cannot be held to account. Additionally, as the multinomial function within R is provided by an external package (“nnet”), platform computational errors could have a greater likelihood of occurrence.

Monday, April 23, 2018

(R) Spearman’s Rank Correlation and Kendall Rank Correlation

In previous articles, the topics of ranking, correlation, and non-parametric data were discussed. Spearman’s Rank Correlation, and Kendall Rank Correlation, combine all of these methodologies into singular concepts. In this entry we will be discussing how these correlations are utilized within the R platform, and the appropriate circumstances which warrant their utilization.

Spearman’s Rank Correlation Coefficient

Spearman’s Rank Correlation Coefficient, also referred to as Spearman’s rho, is a non-parametric alternative to the Pearson correlation. The Spearman alternative is utilized in circumstances when either the data samples are non-linear*, or the data type contained within those samples is ordinal**. The output statistic that this method produces is known as “rho”, hence the alternative name by which this method is known (“Spearman’s rho”).

As is the case with many non-parametric alternatives, this procedure utilizes a rank system.

Example:

We are presented with the following data vectors from two survey prompts:

# Create data vector (scale 1-5) #

x <- c(5, 1, 1, 1, 3, 2, 5, 3, 3, 2, 4, 4, 4, 2, 5, 4, 4, 4, 4, 2)

# Create data vector (scale 1-5) #

y <- c(4,5, 4, 3, 1, 1, 5, 4, 5, 4, 3, 4, 3, 4, 5, 5, 3, 3, 5, 4)

# Create Spearman’s Rank Correlation #

cor.test(x, y, method=c("spearman"))


This produces the output:

Spearman's rank correlation rho

data: x and y
S = 1072.1, p-value = 0.4126
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.1939455


From this output, we can first determine that the correlation is not statistically significant, as the p-value = 0.4126, a value which is far above the common alpha level of .05. Next, we will assess the rho value output, which is 0.1939455. This value is measured on a scale similar to the Pearson correlation. Since this value is relatively low, we will assume a weak positive correlation.

*- For an example as to what non-linear data might resemble, please refer to the article “(R) Polynomial Regression”, published April 17, 2018.

**- For example, survey response data which asked the respondent to rank a particular item on a scale of 1-10.
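To make the rank system mentioned above concrete: Spearman’s rho is simply the Pearson correlation computed on the ranks (midranks, when ties are present) of the data. A brief sketch demonstrating this equivalence with our example vectors:

```r
x <- c(5, 1, 1, 1, 3, 2, 5, 3, 3, 2, 4, 4, 4, 2, 5, 4, 4, 4, 4, 2)
y <- c(4, 5, 4, 3, 1, 1, 5, 4, 5, 4, 3, 4, 3, 4, 5, 5, 3, 3, 5, 4)

# Spearman's rho, computed directly #
rho_direct <- cor(x, y, method = "spearman")

# The Pearson correlation of the ranks yields the identical value #
rho_ranks <- cor(rank(x), rank(y))

rho_direct
rho_ranks
```

Both calls return the same rho value of 0.1939455 reported by cor.test() above.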


Kendall Rank Correlation Coefficient

The Kendall Rank Correlation Coefficient, also referred to as Kendall’s Tau, is also a non-parametric alternative to the Pearson correlation. Like Spearman’s rho, Kendall’s Tau is utilized in circumstances when either the data samples are non-linear, or the data type contained within those samples is ordinal. The output statistic that this method produces is known as “tau”. As is the case with many non-parametric alternatives, this procedure utilizes a rank system.

Example:

We are presented with the following data vectors from two survey prompts:

# Create data vector (scale 1-5) #

x <- c(5, 1, 1, 1, 3, 2, 5, 3, 3, 2, 4, 4, 4, 2, 5, 4, 4, 4, 4, 2)

# Create data vector (scale 1-5) #

y <- c(4,5, 4, 3, 1, 1, 5, 4, 5, 4, 3, 4, 3, 4, 5, 5, 3, 3, 5, 4)

# Create Kendall Rank Correlation #

cor.test(x, y, method=c("kendall"))


This produces the output:
Kendall's rank correlation tau

data: x and y
z = 0.84528, p-value = 0.398
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.1617271


From this output, we can first determine that the correlation is not statistically significant, as the p-value = 0.398, a value which is far above the common alpha level of .05. Next, we will assess the tau value output, which is 0.1617271. This value is measured on a scale similar to the Pearson correlation. Since this value is relatively low, we will assume a weak positive correlation.
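Kendall’s tau is itself based on counting concordant and discordant pairs of observations. A minimal sketch of this counting procedure, using a small tie-free pair of vectors of my own choosing (kept short so that the ten pairs can be verified by hand):

```r
x0 <- c(1, 2, 3, 4, 5)
y0 <- c(1, 3, 2, 5, 4)

# Count concordant (C) and discordant (D) pairs #
C <- 0; D <- 0
for (i in 1:4) {
  for (j in (i + 1):5) {
    s <- sign(x0[j] - x0[i]) * sign(y0[j] - y0[i])
    if (s > 0) C <- C + 1 else if (s < 0) D <- D + 1
  }
}

# tau = (C - D) / (total number of pairs) #
tau_manual <- (C - D) / choose(5, 2)

tau_manual
cor(x0, y0, method = "kendall")
```

Both lines return 0.6, as eight of the ten pairs are concordant and two are discordant.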

Conclusion:

While both methods provide similar functionality, the Spearman’s Rank Correlation is utilized far more frequently than the Kendall Rank Correlation. I typically utilize both methodologies, compare the results of each, and then report my findings in a subsequent research composition.

I hope that you have found this article to be informative and interesting. Until next time, stay inquisitive, Data Heads!

(R) Fisher's Exact Test

Previously, a series of articles were featured on this site which specifically addressed the concept of non-parametric tests.

In continuing to explore this subject matter, today's entry will present an additional test of similar nature, the Fisher's Exact Test.

If you are unfamiliar with the premise of non-parametric data types, please search the term “non-parametric” in the search box located on the right.

Fisher's Exact Test

Fisher's Exact Test provides an alternative to the chi-square test, and is particularly appropriate when expected cell counts are too small for the chi-square approximation to be reliable. This test is sometimes confused with the "F-Test", which was named in honor of the biologist and statistician Ronald Fisher. Fisher's Exact Test, however, was derived by Fisher himself, and as such, it bears his name.

Example:

While working as a statistician at a local university, you are tasked to evaluate, based on survey data, the level of job satisfaction that each member of the staff currently has for their occupational role. The data that you gather from the surveys is as follows:

General Faculty
130 Satisfied 20 Unsatisfied (Total 150 Members of General Faculty)

Professors
30 Satisfied 20 Unsatisfied (Total 50 Professors)

First, we will need to input this survey data into R as a matrix. This can be achieved by utilizing the code below:

# Fisher's Exact Test #

model <- matrix(c(130, 30, 20, 20), ncol=2)
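For clarity, the same matrix can also be built with explicit row and column labels (the labels here are my own, matching the survey categories above):

```r
# Fisher's Exact Test: labeled 2 x 2 contingency table #
model <- matrix(c(130, 30, 20, 20), ncol = 2,
                dimnames = list(c("General Faculty", "Professors"),
                                c("Satisfied", "Unsatisfied")))

model
```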


The result should resemble:


To perform the Fisher's exact test, the following code is utilized:

fisher.test(model, conf.level = .95)

This produces the output:

Fisher's Exact Test for Count Data

data: model
p-value = 0.0001467
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.933511 9.623258
sample estimates:
odds ratio 
4.294116

The p-value output will be examined for significance. It is this value which is analyzed to assess the stated hypothesis, which is as follows:

H0: The association is random.
HA: The association is NOT random.


Since the p-value contained within the output (0.0001467), is less than our assumed alpha (.05), we will reject the null hypothesis, and assume that the association is not random.

The odds ratio shown in the output is a conditional maximum likelihood estimate. A simple sample odds ratio can be calculated in the following manner:

a <- (130 / 20)

b <- (30 / 20)

c <- a/b

c


Which produces the output:

[1] 4.333333
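The same figure follows from the standard 2 × 2 cross-product formula. Note that fisher.test() reports a conditional maximum likelihood estimate, which is why its value (4.294116) differs slightly from the simple sample odds ratio computed here:

```r
# Sample odds ratio for a 2 x 2 table: (n11 * n22) / (n12 * n21) #
n11 <- 130  # General Faculty, Satisfied
n12 <- 20   # General Faculty, Unsatisfied
n21 <- 30   # Professors, Satisfied
n22 <- 20   # Professors, Unsatisfied

(n11 * n22) / (n12 * n21)
```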

That’s it for now, Data Heads. Stay subscribed for more interesting articles!

Friday, April 20, 2018

(R) P-P Plot (SPSS)

Sharing many similarities with the Q-Q Plot, the P-P Plot is a less frequently utilized graphical methodology used to compare the cumulative probabilities of data points from a single variable set against those of a normal probability distribution.

Example:

Let’s begin with our sample data set:


To begin our analysis, we must select, from the topmost menu, “Analyze”, then “Descriptive Statistics”, followed by “P-P Plots”.


This should cause the following menu to appear:


Through the utilization of the center arrow, designate “Y” as a variable. Once this has been completed, click “OK”. This should cause the following output to be generated.


Ideally, if the sample data is normally distributed, the plotted points should follow the solid trend line as closely as possible. While the “Q-Q Plot” plots the quantiles of a normal distribution against sample data, the “P-P Plot” plots the cumulative probability distribution of the sample data against a cumulative normal probability distribution.

In almost all cases, to avoid confusion, you should preferably utilize the Q-Q Plot.
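As an aside, the cumulative-probability comparison underlying the P-P plot can also be sketched in base R, without any additional packages (a minimal sketch; the variable names are my own):

```r
y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)

# Theoretical cumulative probabilities of the sorted sample,
# under a normal distribution fitted to the sample #
theoretical <- pnorm(sort(y), mean = mean(y), sd = sd(y))

# Uniform probability points for the empirical axis #
empirical <- ppoints(length(y))

# P-P plot: points should hug the 45-degree line if the data are normal #
plot(empirical, theoretical,
     xlab = "Probability Points", ylab = "Cumulative Probability")
abline(0, 1)
```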

To generate the same output within R, we would utilize the following code:

Example (R):

# With the package ‘qqplotr’ downloaded and enabled #

# Create vector #

y <- c(70, 80, 73, 77, 60, 93, 85, 72, 90, 85)

# Transform vector into data frame #

y <- data.frame(norm = y)

# Create graphical output #

gg <- ggplot(data = y, mapping = aes(sample = norm)) +
stat_pp_band() +
stat_pp_line() +
stat_pp_point() +
labs(x = "Probability Points", y = "Cumulative Probability")
gg


This should produce the following visual output:


That’s all for now. See you soon for more interesting content, Data Heads!

(R) Q-Q Plots (SPSS)

The topic of Q-Q Plots was originally covered in a prior article pertaining to the identification of normally distributed data. In this article, we will again review the concept of the Q-Q plot, which is a graphical methodology for identifying normally distributed data.

I will illustrate this concept with examples generated both within the SPSS platform, and the R platform.

Example (SPSS):

Our data vector, as defined within the R platform:

SDSample <- c(7.00, 5.10, 4.80, 2.90, 4.80, 5.80, 6.40, 6.10, 4.30, 7.20, 5.30, 4.00, 5.50, 5.40, 4.70, 4.50, 5.00, 4.70, 6.10, 5.10, 5.20, 5.00, 4.20, 5.10, 4.90, 5.30, 2.90, 5.80, 3.50, 4.90, 5.80, 6.10, 3.00, 5.90, 4.30, 5.30, 4.70, 6.40, 4.60, 3.50, 5.00, 3.50, 4.10, 5.70, 4.90, 6.10, 5.30, 6.90, 4.60, 4.90, 4.00, 3.90, 4.50, 5.90, 5.20, 7.20, 4.60, 4.40, 5.40, 5.90, 3.10, 5.60, 5.10, 4.40, 4.50, 3.10, 4.50, 6.00, 6.00, 5.10, 7.30, 4.60, 3.20, 4.10, 5.10, 4.90, 5.10, 5.60, 4.10, 5.70, 4.70, 5.70, 5.50, 4.50, 5.20, 5.00, 5.40, 5.10, 3.90, 4.30, 4.10, 4.30, 4.40, 2.40, 5.40, 6.30, 5.50, 4.30, 4.90, 2.90)

Within the SPSS platform, though partially visible, our data variable would resemble:


To begin our analysis, we will first select, from the topmost menu, “Analyze”, then “Descriptive Statistics”, followed by “Q-Q Plots”.


This should cause the following menu to appear:


Through the utilization of the center arrow, designate “SDSample” as a variable. Once this has been completed, click “OK”. This should cause the following output to be generated.


Ideally, if the data is normally distributed, the plotted points should follow the solid trend line as closely as possible.

This particular Q-Q plot is reasonably consistent with normality.

To generate the same output within R, we would utilize the following code:

Example (R):

# Define the data vector #

SDSample <- c(7.00, 5.10, 4.80, 2.90, 4.80, 5.80, 6.40, 6.10, 4.30, 7.20, 5.30, 4.00, 5.50, 5.40, 4.70, 4.50, 5.00, 4.70, 6.10, 5.10, 5.20, 5.00, 4.20, 5.10, 4.90, 5.30, 2.90, 5.80, 3.50, 4.90, 5.80, 6.10, 3.00, 5.90, 4.30, 5.30, 4.70, 6.40, 4.60, 3.50, 5.00, 3.50, 4.10, 5.70, 4.90, 6.10, 5.30, 6.90, 4.60, 4.90, 4.00, 3.90, 4.50, 5.90, 5.20, 7.20, 4.60, 4.40, 5.40, 5.90, 3.10, 5.60, 5.10, 4.40, 4.50, 3.10, 4.50, 6.00, 6.00, 5.10, 7.30, 4.60, 3.20, 4.10, 5.10, 4.90, 5.10, 5.60, 4.10, 5.70, 4.70, 5.70, 5.50, 4.50, 5.20, 5.00, 5.40, 5.10, 3.90, 4.30, 4.10, 4.30, 4.40, 2.40, 5.40, 6.30, 5.50, 4.30, 4.90, 2.90)

# Create the Q-Q plot #

qqnorm(SDSample, main="")
qqline(SDSample)


This should produce the following visual output:


For more information on how to interpret the Q-Q plot, please click on the link below:

http://data.library.virginia.edu/understanding-q-q-plots/

(R) Multivariate Multiple Regression

Multivariate Multiple Regression is a methodology within the R platform that allows for the concurrent estimation of differing regression models, each with its own dependent variable, within the same function call. There is no structural difference in the models themselves as compared to separately fitted multiple regression models. Simply stated, the multivariate multiple regression method is a more convenient process for comparing the inner workings of differing multiple regression models.

This explanation will become clearer following the example below.

Example:

We are given the following data vectors:

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)

y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)

v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)


With such, we are tasked with creating the following models:

y = z + w + v + (intercept)

x = z + w + v + (intercept)


This can typically be achieved with the code below:

testdataframe <- data.frame(x,y,z,w,v)

model1 <- lm(y ~ z + w + v, data = testdataframe)

model2 <- lm(x ~ z + w + v, data = testdataframe)

summary(model1)

summary(model2)


However, if we instead wanted to utilize the multivariate multiple regression functionality, the code would resemble:

testmodel <- lm(cbind(x, y) ~ z + w + v, data = testdataframe)

summary(testmodel)


Which produces the output:

Response x :
Call:
lm(formula = x ~ z + w + v, data = testdataframe)

Residuals:
    Min      1Q  Median      3Q     Max
-3.2385 -2.6632 -0.5016  2.3280  5.5252

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.78843   51.01213   0.466    0.657
z             0.07168    0.26709   0.268    0.797
w            -0.08600    0.04891  -1.758    0.129
v            -0.16200    0.29351  -0.552    0.601

Residual standard error: 3.741 on 6 degrees of freedom
Multiple R-squared: 0.3521, Adjusted R-squared: 0.02811
F-statistic: 1.087 on 3 and 6 DF, p-value: 0.4236


Response y :


Call:
lm(formula = y ~ z + w + v, data = testdataframe)

Residuals:

    Min      1Q Median     3Q    Max
-17.907 -10.900  0.074 13.085 16.669

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  64.7431   226.6371   0.286    0.785
z             0.4171     1.1866   0.351    0.737
w            -0.1636     0.2173  -0.753    0.480
v            -0.4937     1.3040  -0.379    0.718

Residual standard error: 16.62 on 6 degrees of freedom
Multiple R-squared: 0.106, Adjusted R-squared: -0.341
F-statistic: 0.2371 on 3 and 6 DF, p-value: 0.8675


How is this functionality preferable to the more verbose alternative? Since the models co-exist within the same model object, running subsequent functions will display output for both models concurrently.

For example:

# Display coefficient values #

coef(testmodel)

# Display residual standard error #

sigma(testmodel)


This concludes our current article. I encourage you to experiment with this function if its utilization suits your experimental necessities. Until next time, Data Heads!

Tuesday, April 17, 2018

(R) Polynomial Regression

Today we will be discussing a very important method which is often overlooked within most entry level statistics curriculums.

The concept of Polynomial Regression, when applied to data, allows the practitioner to create a model which could potentially provide better predictive capacity than a linear model. The inclusion of exponential terms within a regression formula additionally allows a model to account for non-linear sample data.

Ideally, when graphed, linear data should resemble the illustration below:


However, there may be instances where data is non-linear. In this case, the graphical data output could instead resemble something similar to the following illustration:


A polynomial regression model can account for the data point curvature. Due to such, applying this method can, at times, allow for the creation of a more accurate predictive model.

Example:

# Create data vectors #

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)

w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)

# Create polynomial variables #

# Squared #

x2 <- x^2

# Cubic #

x3 <- x^3

# Quartic #

x4 <- x^4

# Create initial linear model #

lmtestset <- lm(formula = w ~ x)

summary(lmtestset)

anova(lmtestset)

# Create squared polynomial model #

lmtestset <- lm(formula = w ~ x + x2)

summary(lmtestset)

anova(lmtestset)

# Create cubic polynomial model #

lmtestset <- lm(formula = w ~ x + x2 + x3)

summary(lmtestset)

anova(lmtestset)

# Create quartic polynomial model #

lmtestset <- lm(formula = w ~ x + x2 + x3 + x4)

summary(lmtestset)

anova(lmtestset)


As a product of each model’s creation, summary output will be produced. The typical coefficient of determination method for measuring fit can be misleading when assessing models of this type. Therefore, the code above also produces an ANOVA output subsequent to the creation of each individual model.

To select the best fitting model for our data, we must compare the ANOVA outputs generated for each model composition.


What we are specifically seeking, is an increase in significance with each new variable addition. In the case of our example, the following model code produces the corresponding ANOVA output:

# Create squared polynomial model #

lmtestset <- lm(formula = w ~ x + x2)

anova(lmtestset)


Analysis of Variance Table

Response: w 
          Df Sum Sq Mean Sq F value  Pr(>F)
x          1 1878.4 1878.42  4.1571 0.08084 .
x2         1  856.2  856.19  1.8948 0.21107
Residuals  7 3163.0  451.85
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

As is demonstrated within the output, the addition of the “x2” variable did not significantly improve the model, as the p-value for this variable is insignificant at the .05 level. From this output, we can conclude that the original linear model is superior to its polynomial alternative.

However, if the variable “x2” had been significant, we would then test the cubic model, which includes the variable “x3”. We would continue in this manner until we came upon a variable addition which was not significant. At that point, we would utilize the model assessed prior.
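An equivalent, and arguably tidier, way to perform the same stepwise comparison is to fit nested models and compare them directly with anova(); a sketch with our example data:

```r
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)

# Nested models: linear, then linear plus a squared term #
linear    <- lm(w ~ x)
quadratic <- lm(w ~ x + I(x^2))

# Direct nested-model comparison; the F test for the added term #
anova(linear, quadratic)
```

The F test reported here (F = 1.8948, p = 0.21107) matches the sequential test for “x2” shown in the ANOVA table above, confirming that the squared term does not significantly improve the model.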

If, for the sake of example, the quartic polynomial model were the superior model, the linear equation which can be assembled from the provided output would resemble the following:


W = 385.54 + (-3.30 * x) + (-6.25 * x^2) + (1.25 * x^3) + (-0.06 * x^4)

This can be derived from the output below:

# lmtestset <- lm(formula = w ~ x + x2 + x3 + x4) # 


summary(lmtestset)

Call: lm(formula = w ~ x + x2 + x3 + x4)

Residuals:

      1      2      3       4       5      6     7       8      9      10
 27.668  5.832 -3.929 -10.547 -27.332 10.453 8.082 -26.168 21.832  -5.893

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 385.53730   91.44558   4.216  0.00836 **
x            -3.30179  128.28914  -0.026  0.98046
x2           -6.25192   49.92296  -0.125  0.90522
x3            1.24753    6.91815   0.180  0.86398
x4           -0.06333    0.31200  -0.203  0.84715
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Monday, April 16, 2018

(R) Partial Least Squares Regression

The Partial Least Squares Regression model is a methodology which shares numerous similarities with the concept of dimension reduction. Both methods utilize matrix-based functionality to create output, and both seek to explain variance through component analysis.

The output provided by the partial least squares regression methodology includes both variance estimates for the components, and a working model with which to estimate dependent variable values. How this process is achieved is illustrated in the example below.

Example:

(This example requires that the R package “pls” be downloaded and enabled.)

We will begin by defining our vector values:

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)

y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)

v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)

Next, we will combine these vectors into a single data frame.

testset <- data.frame(x, y, z, w, v)

With the package “pls” enabled, we can create the following model:

plstestset <- plsr(w ~ x + y + z + v, data = testset, ncomp=4, validation="CV")

ncomp – Specifies the number of components to include within the model. Typically, we should first set this value to the number of independent variables contained within the model. After producing the initial output, we can then modify our model to include a specified number of components.

validation – This indicates the type of validation method that will be utilized within our model. In the example, “cross validation” was specified.

To view the output provided, we will need to run the code below:

summary(plstestset)


Which produces the following:

Data:       X dimension: 10 4
                Y dimension: 10 1
Fit method: kernelpls
Number of components considered: 4

VALIDATION: RMSEP
Cross-validated using 10 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV           26.98    28.47    38.84    40.81    40.85
adjCV        26.98    28.38    37.67    39.66    39.71

TRAINING: % variance explained
   1 comps  2 comps  3 comps  4 comps
X    76.45    82.76    91.55   100.00
w    12.29    39.46    39.69    39.72

From this output, we can determine that two components should suffice for the creation of an accurate model. If we include extraneous components, we risk introducing a disproportionate amount of prediction error for specificity’s sake.

With the decision made to create a new model containing two components, we will utilize the following code to create such:

plstestset <- plsr(w ~ x + y + z + v, data = testset, ncomp=2, validation="CV")

summary(plstestset)


This produces the output:

> summary(plstestset)
Data:      X dimension: 10 4
               Y dimension: 10 1
Fit method: kernelpls
Number of components considered: 2

VALIDATION: RMSEP
Cross-validated using 10 leave-one-out segments.
       (Intercept)  1 comps  2 comps
CV           26.98    28.47    38.84
adjCV        26.98    28.38    37.67

TRAINING: % variance explained
   1 comps  2 comps
X    76.45    82.76
w    12.29    39.46

Now that we have our model created, we can utilize it to test data which satisfies the model’s parameters. For our example, we will apply the model to the data contained within the original data set.

This can be achieved with the code below:

testset$fit <- predict(plstestset, newdata = testset)[,,2]

testset$fit – This is creating a new column within the data frame which is specifically designated to contain the predicted values of the dependent variable.

plstestset – This is the model which will be utilized to create the predicted dependent variable values.

newdata – This value indicates the data for which the model will be applied.

[,,2] – This indexing indicates that we want to utilize the predictions generated by the two-component model.

If we instead decided that we only wanted to utilize one component, our code would resemble:

testset$fit <- predict(plstestset, newdata = testset)[,,1]
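The indexing above reflects the fact that predict() for a model of this kind returns a three-dimensional array: observations × response variables × number of components. A minimal sketch of the [,,k] indexing, illustrated with a plain array of the same shape (not an actual model object):

```r
# An array shaped like the predict() output for 10 observations,
# 1 response variable, and 2 component counts #
pred <- array(seq_len(10 * 1 * 2), dim = c(10, 1, 2))

# [ , , 2] extracts the "two component" slice: one predicted
# value per observation #
fit2 <- pred[, , 2]

length(fit2)
```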

Running the following lines of code:

testset$fit <- predict(plstestset, newdata = testset)[,,2]

testset

Provides us with the following output:

    x  y   z   w  v      fit
1   8 97 188 366  6 339.7623
2   1 56 184 383 10 377.5278
3   4 97 181 332  6 351.2976
4  10 68 194 331 13 341.2700
5   8 94 190 311 19 331.5256
6  10 66 180 352 12 335.3762
7   3 81 192 356 11 363.3685
8   1 76 184 351 17 363.8741
9   1 86 192 399 18 363.3546
10  2 69 191 357 12 370.6433


The final column contains the model’s predicted values as it pertains to the dependent variable (“w”).

Saturday, April 14, 2018

(R) Canonical Correlation (SPSS)

Canonical Correlation, pronounced “can-non-ick-cal”, is a complicated methodology utilized to measure the correlation between two sets of variables, through new variables created from the combination of each original set. In many ways, the functionality of this method is similar to that of dimension reduction. However, as it pertains to Canonical Correlation, the variable sets are first manually selected by the user, and subsequently combined to create new variables. These new variables share the summed dimensionality of the prior sets.

Once the combined sets have been derived, each original variable set is tested for correlation against the newly derived sets. These derived sets, though their components have already been defined, can be treated as variables in their own right. As such, they exist in a manner similar to latent variables, in that they may illustrate an encompassing macro-phenomenon.

Example:

Let’s begin with our sample data set:


To begin our analysis, we must select, from the topmost menu, “Analyze”, then “Correlate”, followed by “Canonical Correlation”.


We will create two new sets, each of which will represent a macro-phenomenon. “Set 1” will be comprised of the variables “X” and “Y”. “Set 2” will be comprised of the variables “Z” and “W”. Designating the variables can be achieved through the utilization of the two middle arrow buttons.

Once this designation is complete, click “OK”.


This creates the following output:


Various aspects of output are produced from the requested analysis, though, we only need to concern ourselves with the above listed portions.

“Sig.” details the significance of the correlated relationship between the combined variables which were assembled to create each new variable. If either correlation possesses a significance value which is greater than .05, then the analysis must be abandoned. (Assuming an alpha level of .05)

In our example, we will imagine that this is not the case.

As such, we will continue to the next step of the analysis, which is the assessment of each variable set independently,  as it is related to the combined variable sets.

The result of such, would lead to the synthesis of the following summary:

Tests of dimensionality for the canonical correlation analysis indicate that the two canonical dimensions are not statistically significant at the .05 level. Dimension 1 had a canonical correlation of .609, while for dimension 2, the canonical correlation was much lower at .059.

In examining the standardized canonical coefficients for the two dimensions across variable sets, the following conclusions can be derived. For the first canonical dimension, the variable which had the strongest influence was variable “W” (.997), followed by variable “X” (-.890), variable “Y” (-.357), and variable “Z” (-.118). For the second canonical dimension, the variable which had the strongest influence was variable “Z” (.993), followed by variable “Y” (.943), variable “X” (.475), and variable “W” (.088).

If we wanted to repeat our analysis through the utilization of the “R” platform, we could do so with the following code:

(This example requires that the R package: “CCA”, be downloaded and enabled.)

# Enable the required package #

library(CCA)

# Data Vectors #

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)

y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)

v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)

# Vector consolidation into matrices #

xy <- matrix(c(x,y), ncol=2)

zw <- matrix(c(z,w), ncol=2)

# Application of analysis #

cc(xy, zw)


Which produces the output:

$cor
[1] 0.60321616 0.07284239

$names
$names$Xnames
NULL

$names$Ynames
NULL

$names$ind.names
NULL


$xcoef
[,1] [,2]
[1,] -0.23449315 -0.12499512
[2,] -0.02480536 0.06573124

$ycoef
[,1] [,2]
[1,] -0.01825046 0.199544843
[2,] 0.03902046 0.002265523

$scores
$scores$xscores
              [,1]        [,2]
[1,] -1.1968746 0.7831779
[2,] 1.4615973 -1.0368371
[3,] -0.2589020 1.2831584
[4,] -0.9465054 -1.3730183
[5,] -1.1224585 0.5859842
[6,] -0.8968947 -1.5044807
[7,] 0.3724769 0.3564537
[8,] 0.9654901 0.2777877
[9,] 0.7174364 0.9351001
[10,] 0.9046344 -0.3073261

$scores$yscores
            [,1]        [,2]
[1,] 0.468749446 0.1074573
[2,] 1.205099133 -0.6522082
[3,] -0.730193019 -1.3663844
[4,] -1.006469470 1.2254331
[5,] -1.713876856 0.3819432
[6,] 0.068466671 -1.5206187
[7,] 0.005542988 0.8829815
[8,] -0.043555633 -0.7247049
[9,] 1.683422831 0.9803990
[10,] 0.062813910 0.6857021

$scores$corr.X.xscores
           [,1]        [,2]
[1,] -0.9355965 -0.3530712
[2,] -0.4703893 0.8824590

$scores$corr.Y.xscores
             [,1]       [,2]
[1,] -0.03496377 0.072719921
[2,] 0.60070892 0.006634506

$scores$corr.X.yscores
            [,1]         [,2]
[1,] -0.5643669 -0.02571855
[2,] -0.2837464 0.06428042

$scores$corr.Y.yscores
            [,1]           [,2]
[1,] -0.05796226 0.9983188
[2,] 0.99584355 0.0910803

Within the output above, the canonical coefficients can be found beneath the “$xcoef” and “$ycoef” headings.
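If the “CCA” package is unavailable, the canonical correlations can also be reproduced with base R’s cancor() function (part of the built-in stats package). This is a sketch, not a substitute for the CCA output: cancor() scales its coefficients differently than cc(), so only the canonical correlations themselves should match.

```r
# Canonical correlation via base R's cancor() (no extra packages needed).
# Note: cancor() scales its coefficients differently than CCA::cc(),
# so only the canonical correlations ($cor) will match the prior output.

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)
z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)
w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)

xy <- cbind(x, y)
zw <- cbind(z, w)

res <- cancor(xy, zw)

res$cor  # should match the $cor values above (approx. .603 and .073)
```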

That’s all for now, Data Heads! Please stay subscribed for more great articles.

Friday, April 13, 2018

(R) Distance Correlation

In a previous entry, the topic of Nearest Neighbor was discussed. In that article, the internal mechanism which enables its functionality, the Euclidean distance formula, was also discussed.

Distance Correlation functions in a manner similar to Nearest Neighbor, in that, it also utilizes the Euclidean distance formula as a fundamental aspect of its overall synthesis.

Without endeavoring too far into the individual details of the process, we will examine this concept through the following example problem. That being said, the critical details that should be understood and retained are thus:

The product of the distance correlation formula will be a figure which is greater than or equal to zero, and less than or equal to one (0 <= X <= 1).

As a result of this unique synthesis, the figure produced cannot be assessed in the manner in which a Pearson Correlation output is analyzed.
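To illustrate what the formula is doing internally, the sample distance correlation can be sketched from scratch in base R by double-centering the pairwise Euclidean distance matrices. This is a sketch of the standard definition, not the energy package’s internal code; the helper names (double_center, dist_cor) are my own.

```r
# A from-scratch sketch of the sample distance correlation.
# Function names here (double_center, dist_cor) are illustrative only.

double_center <- function(M) {
  # Subtract each row mean and column mean, then add back the grand mean
  t(t(M - rowMeans(M)) - colMeans(M)) + mean(M)
}

dist_cor <- function(a, b) {
  A <- double_center(as.matrix(dist(a)))  # centered distance matrix of a
  B <- double_center(as.matrix(dist(b)))  # centered distance matrix of b
  dcov2  <- mean(A * B)   # squared sample distance covariance
  dvar2a <- mean(A * A)   # squared distance variance of a
  dvar2b <- mean(B * B)   # squared distance variance of b
  sqrt(dcov2 / sqrt(dvar2a * dvar2b))
}

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

dist_cor(x, y)  # should reproduce the dcor(x, y) value shown below
```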

Example:

(This example requires that the R package: “energy”, be downloaded and enabled.)

# Enable the required package #

library(energy)

# Data Vectors #

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

# To apply the appropriate analysis #

dcor(x,y)


This produces the output:

[1] 0.4419842

The above figure is the distance correlation value. Prior to analyzing this figure, we must first assess the significance of the model. This can be achieved with the code below:

dcor.ttest(x, y)

This produces the output:

dcor t-test of independence

data: x and y
T = -0.1138, df = 34, p-value = 0.545
sample estimates:
Bias corrected dcor
          -0.01951283


The figure that is relevant to our purposes, is the p-value.

p-value = 0.545

Ignore the “Bias corrected dcor”, as correcting the bias presents us with a negative value, which cannot exist in a logical sense as distance values cannot be negative.

The p-value of a distance correlation model is interpreted in a manner which is similar to that of the Pearson correlation model. Meaning, that the p-value demonstrates the overall significance of the model.

As it pertains to the model output, interpretation is less straightforward. A model output of “1” would indicate a perfect correlation, while a value of “0” would indicate perfect independence.

This can be demonstrated with the code below:

dcor(x,x)

Which produces the output:

[1] 1

As was demonstrated in the prior article, I would also recommend performing a Pearson correlation test on the same data vectors. The output provided by such can be included within the final written analysis.

As a reminder, this analysis can be achieved through the utilization of the following code:

cor.test(x,y)

Which produces the output:

                 Pearson's product-moment correlation
data: x and y
t = 0.36656, df = 8, p-value = 0.7235
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5452231 0.7013920
sample estimates:
       cor
0.1285237


That’s all for now. Stay tuned, Data Heads!

Monday, April 9, 2018

(R) Partial Correlation (SPSS)

In previous articles, the topic of Pearson Correlation was discussed. In this entry, we will discuss a different aspect pertaining to the subject, Partial Correlation.

Partial Correlation refers to a method which is utilized to measure the correlation between two variables, while also controlling for a third variable. This method of correlation measurement can produce, in certain cases, a more accurate correlation figure.

So, for example, if we were presented with the following data vectors, and we wished to create a partial correlation, we would proceed by utilizing the code below:

(This example requires that the R package: “ppcor”, be downloaded and enabled.)

# Enable the required package #

library(ppcor)

# Data Vectors #

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)
z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

# To perform the necessary analysis #

pcor.test(x, y, z)


This produces the following output:

  estimate p.value statistic n gp Method
1 0.1283774 0.7420369 0.3424887 10 1 pearson


In the case of our example, the correlation of the variables “x” and “y” is being assessed, with “z” being accounted for as the control variable.

estimate – This figure represents the r-value of the output.

statistic – This figure represents the t-value of the output.

p.value – This figure represents the p-value of the output.

n – This figure is the number of observations contained within the test set.

Method – This is the correlation method that was utilized to generate the output.

The “estimate” value is the correlation figure. It is this value, and the value labeled “p.value”, with which we will primarily concern ourselves. “Estimate” is a measurement of the strength of the correlation, while “p.value” measures the significance of the model. Typically, if models are being compared for accuracy, the model which possesses the lower “p.value” should be utilized.
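The “estimate” figure can also be derived by hand from the three pairwise Pearson correlations, which helps illustrate what “controlling for z” actually means. This is a sketch of the standard first-order partial correlation formula, not the ppcor package’s internal code.

```r
# Partial correlation of x and y, controlling for z, computed from
# the three pairwise Pearson correlations (standard first-order formula).

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)
z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)

r_xy_z <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

r_xy_z  # should reproduce the "estimate" value (.1283774) above
```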

If we are reporting our results in APA format, we must also perform a standard Pearson Correlation. For more information as to how to interpret this model, please consult prior articles pertaining to the subject matter.

# Pearson Correlation #

cor.test(x, y)

This produces the output:

Pearson's product-moment correlation

data: x and y
t = 0.36656, df = 8, p-value = 0.7235
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5452231 0.7013920
sample estimates:
       cor
0.1285237


t - This figure represents the t-value of the output.

df – This figure represents the degrees of freedom of the model.

p-value - This figure represents the p-value of the output.

cor - This figure represents the r-value of the output.

To report the findings found within these output figures, we would write the following conclusion:

A partial correlation was utilized to evaluate the null hypothesis that there is no significant relationship between variable “x” and variable “y” (N=10). There was a weak partial correlation between variable “x” (M = 4.8, SD = 3.79) and variable “y” (M = 79, SD = 14.35), controlling for variable “z”, r(7)* = .13, p = .74. Results of the zero order correlation yielded that there was a weak correlation between “x” and “y”, r(8) = .13, p = .72, indicating that controlling for variable “z” had little effect on the strength of the relationship between the two variables.

* Equals the number of observations minus the number of variable categories.
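As a side note, the “statistic” value reported by pcor.test can be recovered from the “estimate” and the degrees of freedom, using the usual conversion of a correlation coefficient into a t-value.

```r
# Recovering the t-statistic from the partial correlation estimate.
# df = n - 2 - (number of control variables) = 10 - 2 - 1 = 7

r  <- 0.1283774  # "estimate" from the pcor.test() output above
df <- 7

t_stat <- r * sqrt(df) / sqrt(1 - r^2)
t_stat  # approximately the "statistic" value from the output above
```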

If we were to perform this same analysis in SPSS, the steps would resemble:

Example (SPSS)

Let’s begin with our sample data set:


From the top most menu, select “Analyze”, then select “Correlate”, followed by “Bivariate”.

This should cause the following menu to populate:


Through the utilization of the arrow button located in the center of the interface, designate “X” and “Y” as the model variables.

When you have completed this step, click “OK”.

This produces the output screen:


With the Pearson Correlation completed, we will now generate a partial correlation model.


From the top most menu, select “Analyze”, then select “Correlate”, followed by “Partial”.

This will cause the following menu to appear:


Through the utilization of the top central arrow button, designate “X” and “Y” as the model variables. 

Once this has been completed, utilize the bottom arrow button to designate “Z” as the control variable.

When finished, click “OK”.

This should generate the output screen:


With both output sections generated, you should now be prepared to complete your written summary.

That’s all for now. Stay inquisitive, Data Heads!

(R) General Linear Mixed Models

A General Linear Mixed Model is a modeling methodology utilized to create linear models which account for categorical variables. This is achieved through the synthesis of a model which contains dynamic intercept values as an aspect of its design. The categorical variables which are included as elements of the model design can be singular or numerous. However, if multiple variables of this nature are included, they will be assessed in categorical combinations.

Example:

We will be utilizing the following data vectors:

w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)
v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)
z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)


These vectors can be combined to create a single data frame:

model <- data.frame(v, w, x, y, z)

If we were to create a simple linear model, the necessary code could be applied:

lmmodeltest <- lm(x ~ y + z, data=model)

summary(lmmodeltest)


Which produces the output:

Call:
lm(formula = x ~ y + z, data = model)

Residuals:
   Min     1Q Median     3Q    Max
-4.027 -2.882 -1.643  2.668  5.623

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.587008  53.572275   0.048    0.963
y            0.034052   0.099425   0.342    0.742
z           -0.002543   0.285784  -0.009    0.993

Residual standard error: 4.267 on 7 degrees of freedom
Multiple R-squared: 0.01653, Adjusted R-squared: -0.2645
F-statistic: 0.05883 on 2 and 7 DF, p-value: 0.9433


Now, we will produce a general linear mixed model which accounts for the variable “w”.

(This variable, which is accounted for but not included as an independent variable within the model, is referred to as a “random effect”.)

# You will need to download and enable the package: “nlme” #

library(nlme)

mmodeltest1 <- lme(x ~ y + z, data=model, random=~1|w)

summary(mmodeltest1)


This produces the output:

Linear mixed-effects model fit by REML
   Data: model
AIC          BIC         logLik
65.41286 65.14241 -27.70643

Random effects:
Formula: ~1 | w
      (Intercept) Residual
StdDev: 3.995423 1.498283

Fixed effects: x ~ y + z
                 Value Std.Error DF t-value p-value
(Intercept) 2.5870077 53.57227 7 0.0482900 0.9628
y             0.0340519 0.09942 7 0.3424887 0.7420
z            -0.0025432 0.28578 7 -0.0088991 0.9931
Correlation:
(Intr) y
y -0.066
z -0.989 -0.081

Standardized Within-Group Residuals:
       Min         Q1        Med        Q3       Max
-0.3313798 -0.2371629 -0.1352219  0.2195812 0.4627224

Number of Observations: 10
Number of Groups: 10


Generally speaking, the models look identical. What differs between the two models are the coefficient values. It is these values that determine the value of the intercept.

For example, the values for the intercept of the original linear model are:

coef(lmmodeltest)

Output:

(Intercept)         y                     z
2.587007686 0.034051914 -0.002543224

However, the values for the general linear mixed model are:

coef(mmodeltest1)

Output:

   (Intercept)          y                    z
311 4.95003237 0.03405191 -0.002543224
331 7.48857278 0.03405191 -0.002543224
332 1.33355478 0.03405191 -0.002543224
351 -0.66296480 0.03405191 -0.002543224
352 7.51706478 0.03405191 -0.002543224
356 0.95902860 0.03405191 -0.002543224
357 0.43833139 0.03405191 -0.002543224
366 4.85601183 0.03405191 -0.002543224
383 -0.06589015 0.03405191 -0.002543224
399 -0.94366472 0.03405191 -0.002543224

As you can observe, the intercept coefficient values, which are sorted categorically, differ. However, the coefficient values of the independent variables remain static. This phenomenon allows for dynamic model creation.

For example, if we were to test the simple linear model with prior data set variables, the outcome for the first observation would be:

yvalue <- 97

zvalue <- 188

interceptvalue <- 2.5870077

xvalue <- (yvalue * 0.034052) + (zvalue * -0.002543) + interceptvalue

xvalue


Which would produce the output:

[1] 5.41192

The actual value of x in the initial set was:

8

If we were to test the generalized mixed linear model with prior data set variables, the outcome for the first observation would be:

yvalue <- 97

zvalue <- 188

interceptvalue <- 4.85601183

xvalue <- (yvalue * .0340519) + (zvalue * -0.0025432) + interceptvalue

xvalue


Which would produce the output:

[1] 7.680925

Which is closer to the actual value of x:

8
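Rather than computing predictions by hand, the group-specific fitted values of the mixed model can be retrieved directly with fitted() (or predict()). The sketch below re-creates the data and model from the code above so that it stands alone; it assumes the “nlme” package is installed.

```r
library(nlme)

# Re-creating the data and model from above so this block is self-contained
v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12)
w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)
z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)

model <- data.frame(v, w, x, y, z)

mmodeltest1 <- lme(x ~ y + z, data = model, random = ~1|w)

# fitted() returns the prediction for each observation, applying that
# observation's group-specific intercept automatically
fitted(mmodeltest1)
```

The first fitted value should match the manual calculation performed above (approximately 7.68).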

To produce a general linear mixed model which accounts for both the variables “w” and “v”, the following code can be utilized:

mmodeltest2 <- lme(x ~ y + z, data=model, random=~1|w/v)

summary(mmodeltest2)


This code produces the output:

Linear mixed-effects model fit by REML
  Data: model
     AIC      BIC    logLik
67.41286 67.08832 -27.70643

Random effects:
Formula: ~1 | w
       (Intercept)
StdDev: 2.916513

Formula: ~1 | v %in% w
       (Intercept) Residual
StdDev: 2.916513 1.093692

Fixed effects: x ~ y + z
                 Value Std.Error DF t-value p-value
(Intercept) 2.5870077 53.57227 7 0.0482900 0.9628
y            0.0340519 0.09942 7 0.3424887 0.7420
z            -0.0025432 0.28578 7 -0.0088991 0.9931

Correlation:
(Intr) y
y -0.066
z -0.989 -0.081

Standardized Within-Group Residuals:
        Min          Q1         Med         Q3        Max
-0.24189513 -0.17312025 -0.09870703 0.16028629 0.33777045

Number of Observations: 10
Number of Groups:
      w v %in% w
      10      10


With the model output produced, we must again generate coefficient values. This can be achieved through the utilization of the following code:

coef(mmodeltest2)

Which produces the output:

     (Intercept)            y                    z
311/19 5.1052676 0.03405191 -0.002543224
331/13 7.8105734 0.03405191 -0.002543224
332/6 1.2512112 0.03405191 -0.002543224
351/17 -0.8764666 0.03405191 -0.002543224
352/12 7.8409371 0.03405191 -0.002543224
356/11 0.8520811 0.03405191 -0.002543224
357/12 0.2971775 0.03405191 -0.002543224
366/6 5.0050705 0.03405191 -0.002543224
383/10 -0.2401681 0.03405191 -0.002543224
399/18 -1.1756067 0.03405191 -0.002543224


Each combination of categorical variables, along with its coinciding intercept value, has been output in the above data stream. If the model were utilized to assess new observational values generated from the same system of observation, the cross-tabulated categorical values would determine which intercept is utilized.

The type of categorical data which would satisfy the “random” element of model synthesis, typically would refer to variables such as: gender, race, ethnicity, zipcode, etc.

That’s all for now, Data Heads! Thanks for visiting! I appreciate your patronage.

Wednesday, April 4, 2018

(R) Loglinear Analysis (SPSS)

Today we will be discussing an incredibly difficult concept, Loglinear Analysis. Loglinear analysis is similar to both logistic regression and the chi-squared methodology of analysis. The loglinear method analyzes data that is structured in a manner which resembles the chi-square model. The main differentiation between the two methods is the utilization of the Poisson distribution as an aspect of the loglinear model. While the chi-square model allows for the testing of a hypothesis, the loglinear method allows us to create a predictive model as an end result of the analysis. The similarities which exist between loglinear analysis and logistic regression are evident in the utilization of the log function. Both models require log transformations of variables in order to be utilized for application.

In research, it is the odds ratio that is the most important aspect of this model’s output. Again, this concept is difficult to conceptualize, and in my personal opinion, is not worth the complexity required to produce the meager products of its synthesis.

Example:

Below is our sample data set:


*A note on structuring data for the purpose of loglinear analysis. When utilizing SPSS to perform this type of analysis, I have found that it is best to start counting categorical cases at “1”, as opposed to “0”. That would mean, in the context of our example, that each prompt bearing a “Yes” label has an underlying value of “1”, and each prompt bearing a “No” label has an underlying value of “2”. *

To perform a Loglinear analysis, select "Analyze" from the top drop down menu, then select "Loglinear". From the next menu, select "General".


This series of selections should populate the following menu:


Using the topmost middle arrow, designate “Smoking” and “Obese” as “Factor(s)”.

Next, click on the button, “Model”.

This should generate the following sub-menu:



Make sure that the box adjacent to “Predicted values” is checked.

After this has been completed, click on the box labeled “Model”.

The following sub-menu should appear:


Select the option “Build terms”.

Once this has been selected, use the middle drop down menu and center arrow to designate “Smoking” and “Obese” as “Interaction” type variables.

Click “Continue” when you have completed the above steps.

Click the “Options” button to generate the following sub-menu:


Be sure to check every option beneath “Display” and then click “Continue”.

This should create the output screens below:


The Pearson Chi-Square value measures the strength of the overall model. In this case, our model should not be seriously considered as being a decent statistical representation of the underlying phenomenon, as the significance value far exceeds .05.


Most of the other output can be disregarded. First, we will consider the model attributes. The model itself, if it were illustrated as a linear equation, would resemble:

Y = .847x1 + .405x2 + .182

The variable Y is the natural logarithm of the predicted number of outcomes which coincide with the various variable combinations. (Here, x1 represents smoking status and x2 represents obesity status.)

For example, if this experiment were repeated with the same number of participants, we could predict the number of individuals within each category.

In the category of non-obese smokers, we would expect to find:

Y = (.847 * 1) + (.405 * 0) + .182

Y = 1.0296


(Remember that 1.0296 is the natural logarithm of the actual figure. Therefore, we must exponentiate the Y-value.)

So:

exp(Y) = 2.798 (Non-Obese Smokers)

This value can be confirmed in the above table labeled, “Cell Counts and Residuals”.

So what does any of this mean, and how would we report these results in a meaningful way? The loglinear model attempts to create probabilities which could potentially be utilized to prepare for future experimental results. In the case of our example, each variable was shown to be an insignificant aspect of the model. This is illustrated by the “Sig” values found adjacent to each corresponding variable within the “Parameter Estimates” table. Taken together, it is unsurprising to witness such a low chi-square value, which acts as a measurement of the overall significance of the model.

As mentioned previously, the true worthwhile output that is gathered from this method exists in the form of estimated odds. To generate these odds, we must perform the following calculations:

(The numbers utilized for these calculations are taken directly from the “Cell Counts and Residuals” table.)

Estimated Odds Obese Smokers

Yes/No

4.2 / 2.8 = 1.5

Estimated Odds Obese Non-Smokers

Yes/No

1.8 / 1.2 = 1.5

Odds Ratio (Estimated Odds of Obese Smokers / Estimated Odds of Obese Non-Smokers)

1.5 / 1.5 = 1
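The estimated odds and the odds ratio above can also be reproduced in R from the fitted cell counts of the main-effects Poisson model (the same data and model specification appear in the R code further down this article). Note that in a main-effects-only model, the fitted odds ratio is necessarily 1.

```r
# Reproducing the estimated odds and odds ratio from the fitted counts
# of the main-effects loglinear model.

Obese   <- c("Yes", "Yes", "No", "No")
Smoking <- c("Yes", "No", "Yes", "No")
Count   <- c(5, 1, 2, 2)

DataModel <- glm(Count ~ Obese + Smoking, family = poisson)

fit <- fitted(DataModel)  # expected cell counts: 4.2, 1.8, 2.8, 1.2

odds_smokers    <- fit[[1]] / fit[[3]]  # obese vs. non-obese, among smokers
odds_nonsmokers <- fit[[2]] / fit[[4]]  # obese vs. non-obese, among non-smokers

odds_ratio <- odds_smokers / odds_nonsmokers
odds_ratio  # equals 1, matching the hand calculation above
```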

So let’s state all of our conclusions in a substantive and succinct written summary:

A loglinear analysis was utilized to produce a model. The Pearson’s chi-squared value of this model was X2(1) = 1.270, p = .261*. This indicates that the model results were not found to be significant**. Estimated odds ratios indicate that members within this study are equally*** likely to be obese whether or not they are smokers.

* Use the Greek letter chi, superscripted with a 2. The corresponding output can be found in the “Goodness-of-Fit Tests” table.
** Your results should be significant.

*** (1 Times)

Here is the code to perform this analysis within the R platform:

Obese <- c("Yes", "Yes", "No", "No")

Smoking <- c("Yes", "No", "Yes", "No")

Count <- c(5, 1, 2, 2)

Data <- data.frame(Count, Obese, Smoking)

DataModel <- glm(Count ~ Obese + Smoking , family = poisson)

# If you’d like to include interaction effects, utilize the code comment below #

# DataModel <- glm(Count ~ Obese + Smoking + Smoking * Obese , family = poisson) #

summary(DataModel)


This produces the output:

Call:
glm(formula = Count ~ Obese + Smoking, family = poisson)

Deviance Residuals:
      1       2       3       4
 0.3789 -0.6515 -0.5041  0.6658

Coefficients:

            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.1823     0.6952   0.262    0.793
ObeseYes      0.4055     0.6455   0.628    0.530
SmokingYes    0.8473     0.6901   1.228    0.219


(Dispersion parameter for poisson family taken to be 1)

Null deviance: 3.3137 on 3 degrees of freedom
Residual deviance: 1.2654 on 1 degrees of freedom
AIC: 17.973

Number of Fisher Scoring iterations: 4


# To test for goodness of fit #

# The model must be re-created in matrix form #

Model <- matrix(c(5, 1, 2, 2),
                nrow = 2,
                dimnames = list("Smoker" = c("Yes", "No"),
                                "Obese" = c("Yes", "No")))

# To run the chi-square test #

# ‘correct = FALSE’ disables the Yates’ continuity correction #

chisq.test(Model, correct = FALSE)

This produces the output:

Pearson's Chi-squared test

data: Model
X-squared = 1.2698, df = 1, p-value = 0.2598


So now you have another model at your disposal, Data Heads. Please come back to visit again soon!