## Monday, March 26, 2018

### (R) McNemar Test (SPSS)

The McNemar Test, pronounced "Mac-Ne-Mar" and not "Mic-Nee-Mar", is a method utilized to test marginal probabilities within paired nominal data sets contained in a 2 x 2 contingency table. Because this test utilizes the chi-squared distribution as an aspect of its methodology, it is often confused with Pearson's Chi-Squared Test. I am aware that all of this may appear very confusing at first; however, as you continue to read this entry, I can assure you that this befuddlement will cease.

Let's begin by illustrating what a McNemar Test is not. We will achieve this through an explanation of the differences between the McNemar and chi-squared tests. The most evident difference between the two tests is that a chi-squared test is not limited to two rows and two columns. Additionally, each data structure, regardless of construction, is assembled to explain and assess a very specific conceptual inquiry.

Within the chi-squared matrix, if a goodness of fit evaluation is not being performed, each row represents a different category, with each column representing a different segment within that category. If the sum is taken across each row, the resulting figure will represent the total count of the values contained within the entire category. This will be illustrated within our chi-squared example problem.

A McNemar Test matrix is limited to 2 columns, each containing 2 rows. Unlike the chi-squared matrix, one single category is split between four separate segments. Therefore, the total count of values within the table category is the sum of all cell values within the table.

WARNING – PRIOR TO UTILIZING ANY OF THE FOLLOWING TESTS, PLEASE READ EACH EXAMPLE THOROUGHLY!!!!!

Example (Chi-Squared):

While working as a statistician at a local university, you are tasked to evaluate, based on survey data, the level of job satisfaction that each member of the staff currently has for their occupational role. The data that you gather from the surveys is as follows:

General Faculty
130 Satisfied 20 Unsatisfied (Total 150 Members of General Faculty)

Professors
30 Satisfied 20 Unsatisfied (Total 50 Professors)

Adjunct Professors
80 Satisfied 20 Unsatisfied (Total 100 Adjunct Professors)

Custodians
20 Satisfied 10 Unsatisfied (Total 30 Custodians)

The question remains, however, as to whether the assigned role of each staff member has any impact on the survey results. To decide this with 95% confidence, you must follow the subsequent steps.

First, we will need to input this survey data into R as a matrix. This can be achieved by utilizing the code below:

Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10), nrow = 4, ncol = 2)
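Optionally, row and column labels can be attached so that the printed matrix mirrors the survey categories. The labels below are assumptions drawn from the example data:

```r
# Same matrix as above, with hypothetical labels assumed from the survey categories
Model <- matrix(c(130, 30, 80, 20, 20, 20, 20, 10),
                nrow = 4, ncol = 2,
                dimnames = list(c("General Faculty", "Professors",
                                  "Adjunct Professors", "Custodians"),
                                c("Satisfied", "Unsatisfied")))

# Print the labeled matrix to the console
Model
```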

The result should resemble:

Once this step has been completed, the next step is as simple as entering the code:

chisq.test(Model)

Console Output:

Pearson's Chi-squared test

data: Model
X-squared = 18.857, df = 3, p-value = 0.0002926

Findings:

Degrees of Freedom (df) - 3
Confidence Interval (CI) - .95
Alpha (a) (1-CI) - .05
Chi Square Test Statistic - 18.857

This creates the hypothesis test parameters:

H0: There is no correlation between job type and job satisfaction (Null Hypothesis). Job type and job satisfaction are independent variables.

HA: There is a correlation between job type and job satisfaction. Job type and job satisfaction are not independent variables.

With the p-value being less than .05 (.0002926 < .05), we can state, with 95% confidence, that there is a correlation between job type and overall satisfaction.

Reject: Null Hypothesis.
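As a sketch of how these figures relate, the p-value and the critical value can be recovered directly from R's chi-squared distribution functions:

```r
# p-value implied by the test statistic at 3 degrees of freedom
# (approximately 0.0002926, matching the console output above)
pchisq(18.857, df = 3, lower.tail = FALSE)

# Critical value at alpha = .05; the test statistic (18.857) far exceeds it
qchisq(0.95, df = 3)
```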

Example (McNemar's Test):

Typically, examples of the McNemar Test are created to emulate drug trials. For that reason, and because this example type best exemplifies all aspects of the test, our example will be structured in a similar manner.

In our fictitious drug trial, individuals are gathered from a select demographic which is particularly susceptible to heart disease. A new drug has been synthesized which was created to prevent the onset of heart disease, and to cure it in individuals who are already afflicted. The data is organized in the following contingency table (2x2).

# Code to create contingency table #

medication <-
    matrix(c(75, 55, 22, 44),
           nrow = 2,
           dimnames = list("Affliction" = c("Heart Disease Prior: Present", "Heart Disease Prior: Absent"),
                           "Drug Trial" = c("After Drug Trial: Present", "After Drug Trial: Absent")))

# Code to print output to the console window #

medication

Which creates the output:

Drug Trial

Affliction                                      After Drug Trial: Present      After Drug Trial: Absent
Heart Disease Prior: Present                           75                        22
Heart Disease Prior: Absent                            55                        44

As you can observe from the output, there is one single category: the patient group for the drug trial.

This group is segmented into four categories.

Those with heart disease prior to the trial, who still were afflicted after the trial: 75

Those without heart disease prior to the trial, who were afflicted with heart disease after the trial: 55

Those with heart disease prior to the trial, who did not have heart disease after the trial: 22

Those without heart disease prior to the trial, who still were un-afflicted after the trial: 44

The total number of participants who participated in this trial: 196 (75 + 55 + 22 + 44)

# Additionally, a more basic method for assembling the matrix can be achieved through the utilization of the following code #

medication <- matrix(c(75, 55, 22, 44), nrow = 2, ncol = 2)

# Code to print output to the console window #

medication

Which creates the output:

[,1] [,2]
[1,]  75    22
[2,]  55    44

To perform the McNemar Test, utilize the following code:

# Code to perform the McNemar Test #

mcnemar.test(medication)

This produces the following output:

McNemar's Chi-squared test with continuity correction

data: medication
McNemar's chi-squared = 13.299, df = 1, p-value = 0.0002656

From this output we can address our hypothesis.

This hypothesis is stated as such:

H0: pb = pc
HA: pb ≠ pc

Or:

Null: The two marginal probabilities for each outcome are the same.

Alternative: The two marginal probabilities for each outcome are NOT the same.

What we are seeking to investigate is whether there was a significant change in the affliction status of the participants after treatment with the drug.

With the p-value being less than .05 (0.0002656 < .05), we will reject the null hypothesis, and we will conclude, with 95% confidence, that the new experimental drug is having a significant impact on trial participants.
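As a check, the continuity-corrected statistic reported by mcnemar.test() can be reproduced by hand from the two discordant cells (22 and 55):

```r
# Discordant cells of the contingency table
b <- 22   # heart disease prior, absent after
c2 <- 55  # absent prior, heart disease after

# Continuity-corrected McNemar statistic: (|b - c| - 1)^2 / (b + c)
statistic <- (abs(b - c2) - 1)^2 / (b + c2)
statistic  # 13.299, matching the console output above

# Corresponding p-value from the chi-squared distribution (df = 1)
pchisq(statistic, df = 1, lower.tail = FALSE)
```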

Now let’s re-create the same calculation within the SPSS platform.

Example (McNemar's Test - SPSS):

The data must be structured slightly differently in SPSS to perform this analysis. Though it cannot be seen from this data view, each row of observational data must be populated to match the previously stated frequencies.

Therefore, there will be a total of 196 rows.

75 Rows will have “Present” listed in the first column, and “Present” listed in the second column.

55 Rows will have “Absent” listed in the first column, and “Present” listed in the second column.

22 Rows will have “Present” listed in the first column, and “Absent” listed in the second column.

44 Rows will have “Absent” listed in the first column, and “Absent” listed in the second column.

Once this data set has been assembled, the finished product should resemble:

There are two methods which can be utilized within SPSS to perform The McNemar Test.

The first method can be performed by following the steps below:

To begin, select “Analyze”, followed by “Nonparametric Tests”, and then “Related Samples”.

This populates the following screen:

Within the options tab, select the two column variables which will be utilized for analysis. In the case of our example, the column variables which will act as “Test Fields” are “PreTrial” and “PostTrial”. The middle arrow button can be utilized to designate this distinction.

Once this has been completed, select the “Settings” tab.

From this tab, select the option “Customize tests”, then select the box to the right of “McNemar’s test”. After these steps are complete, click on the “Run” button.

This should produce the output below:

Double clicking on this output summary produces additional output.

Example (McNemar's Test - SPSS) (cont.):

Below is a separate method which can be utilized within SPSS to perform The McNemar Test. The output produced from this alternative procedure differs from the prior output.

To begin, select “Analyze”, followed by “Descriptive Statistics”, and then “Crosstabs”.

Through the utilization of the middle arrow buttons, we will designate “Pretrial” as our “Row(s)” variable, and “PostTrial” as our “Column(s)” variable.

After clicking the right menu button labeled “Statistics”, select the box adjacent to the option “McNemar”, then click continue. After completing this series of steps, click “OK” from the initial menu.

This should generate the output below:

The crosstabulation diagram presents us with a frequency table pertaining to the column variables contained within the data sheet. The McNemar Test, as you recall from our previous R example, can be performed with this summary information within the R platform.

Prior to discussing the Chi-Squared Test table, I would like to address the WARNING that was issued at the beginning of this article.

*** WARNING ***

In the last article, I issued a similar warning which pertained, in the context of that article, to SPSS not providing correct model output. I hypothesized that this could potentially be due to SPSS shifting model parameters in order to assist the user. However, the danger of this functionality is that SPSS does not explicitly alert the user to this practice. As a result, SPSS may be inadvertently misleading the user by attempting to provide a more appropriate model.

The output for the chi-squared test in this instance is not accurate. In actuality, it is not a chi-squared value at all. The results of the analysis provided by SPSS are the results generated from a Yates Correction. The chart’s footnote states, “Binomial distribution used”; this should be evidence to the vigilant that something strange is happening, as the McNemar Test utilizes a chi-squared distribution.

Therefore, I would not recommend the SPSS package for performing a McNemar test, and would instead utilize the R platform.

That’s all for now. Stay active, Data Heads!

## Friday, March 23, 2018

### (R) Friedman Test (SPSS)

In previous articles, we discussed the concept of non-parametric tests. In the event that you are just tuning in, or you are unfamiliar with the term “non-parametric test”, I will provide you with a brief conceptual overview.

What is a non-parametric test?

A non-parametric test is a method of analysis which is utilized to analyze sets of data which do not comply with a specific distribution type. As a result, this particular type of test is, by design, more robust.

Many tests require, as a prerequisite, that the underlying data be structured in a certain manner. However, these requirements typically do not cause test results to be significantly impacted, as many tests which are parametric in nature have a good deal of robustness included within their models.

Therefore, though I believe that it is important to be familiar with tests of this particular type, I would typically recommend performing their parametric alternatives. The reason for this recommendation relates to the general acceptance and greater familiarity that the parametric tests provide.

Friedman Test

The Friedman Test provides a non-parametric alternative to the one way repeated measures analysis of variance (ANOVA). Like many other non-parametric tests, it utilizes a ranking system to increase the robustness of measurement. In this manner, the test is similar to tests such as the Kruskal-Wallis test. This method of analysis was originally derived by Milton Friedman, an American economist.

Example

Below is our sample data set:

To begin, select “Analyze”, “Nonparametric Tests”, followed by “Legacy Dialogs”, and then “K Related Samples”.

This populates the following screen:

Through the utilization of the center arrow button, designate “Measure1”, “Measure2”, and “Measure3” as “Test Variables”. Make sure that the box “Friedman” is selected beneath the text which reads: “Test Type”.

After clicking “OK”, the following output should be produced:

Since the “Asymp. Sig.” (p-value) equals .667, we will not reject the null hypothesis (.667 > .05). Therefore, we can conclude that the three conditions did not significantly differ.

As a reference, if the original data was analyzed within SPSS through the utilization of a repeated measures analysis of variance (ANOVA) model, the probability outcome would be incredibly similar.

To perform this test within R, we will utilize the following code:

# Create the data vectors to populate each group #

Measure1 <- c(7.00, 5.00, 16.00, 3.00, 19.00, 10.00, 16.00, 9.00, 10.00, 18.00, 6.00, 12.00)

Measure2 <- c(14.00, 12.00, 8.00, 17.00, 18.00, 7.00, 16.00, 10.00, 16.00, 9.00, 10.00, 8.00)

Measure3 <- c(9.00, 13.00, 3.00, 6.00, 2.00, 16.00, 15.00, 7.00, 13.00, 17.00, 9.00, 13.00)

# Create the data matrix necessary to perform the analysis #

results <- matrix(c(Measure1, Measure2, Measure3),
                  ncol = 3,
                  dimnames = list(1:12,
                                  c("Measure1", "Measure2", "Measure3")))

# Perform the analysis through the utilization of The Friedman Test #

friedman.test(results)

This produces the output:

Friedman rank sum test

data: results
Friedman chi-squared = 0.80851, df = 2, p-value = 0.6675
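The statistic above can be reproduced by hand by ranking each participant's three measurements and applying the Friedman formula with the standard tie correction (one participant's row contains a tied pair of 16s):

```r
# Reconstruct the data matrix from the example above
Measure1 <- c(7, 5, 16, 3, 19, 10, 16, 9, 10, 18, 6, 12)
Measure2 <- c(14, 12, 8, 17, 18, 7, 16, 10, 16, 9, 10, 8)
Measure3 <- c(9, 13, 3, 6, 2, 16, 15, 7, 13, 17, 9, 13)
results <- cbind(Measure1, Measure2, Measure3)

# Rank each participant's measurements across conditions; ties receive average ranks
ranks <- t(apply(results, 1, rank))
n <- nrow(results)  # 12 participants
k <- ncol(results)  # 3 conditions

# Uncorrected Friedman statistic from the column rank sums
R <- colSums(ranks)
stat <- 12 / (n * k * (k + 1)) * sum(R^2) - 3 * n * (k + 1)

# Adjust for ties: each tie group of size t contributes (t^3 - t)
tie.sizes <- unlist(apply(ranks, 1, function(r) as.vector(table(r)[table(r) > 1])))
C <- 1 - sum(tie.sizes^3 - tie.sizes) / (n * k * (k^2 - 1))
stat / C  # matches the chi-squared value reported by friedman.test(results)
```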

## Thursday, March 22, 2018

### (R) The Kolmogorov-Smirnov Test & The Wald Wolfowitz Test (SPSS)

The Kolmogorov-Smirnov Test and The Wald Wolfowitz Test are two very similar tests, in that they are both utilized to make inferences pertaining to the distributions of sample data. Additionally, both methods can be employed to analyze either a single data set, or two separate independent sets.

The Wald Wolfowitz Test

A method for analyzing a single data set in order to determine whether the elements within the data set were sampled independently.

Hypothesis Format:

H0: Each element in the sequence is independently drawn from the same distribution. (The elements share a common distribution).

HA: Each element in the sequence is not independently drawn from the same distribution. (The elements do not share a common distribution).

The Wald Wolfowitz Test (2-Sample)

A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

Hypothesis Format:

H0: The two samples were drawn from the same population.

HA: The two samples were not drawn from the same population.

The Kolmogorov-Smirnov Test (Lilliefors Test Variant)

A method for analyzing a single data set in order to determine whether the data was sampled from a normally distributed population.

Hypothesis Format:

H0: The data conforms to a normal probability distribution.

HA: The data does not conform to a normal probability distribution.

The Kolmogorov-Smirnov Test (2-Sample)

A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

Hypothesis Format:

H0: The two samples were drawn from the same population.

HA: The two samples were not drawn from the same population.

In almost all cases, given its robustness, the Kolmogorov-Smirnov Test is the preferred alternative to The Wald Wolfowitz Test. If one sample contains significantly more observations than the other, then model accuracy could potentially be impacted.

WARNING – PRIOR TO UTILIZING ANY OF THE FOLLOWING TESTS, PLEASE READ EACH EXAMPLE THOROUGHLY!!!!!

Example - The Wald Wolfowitz Test (2-Sample) / The Kolmogorov-Smirnov Test (2-Sample)

Below is our sample data set:

It is important to note that SPSS will not perform this analysis unless the data variable that you are utilizing is set to “Nominal”, and the group variable that you are utilizing is set to “Ordinal”.

If this is the case, you can proceed by selecting “Nonparametric Tests” from the “Analyze” menu, followed by “Independent Samples”.

This should cause the following menu to populate. From the “Fields” tab, use the center middle arrow to designate “VarA” as the “Test Fields” variable. Then, use the bottom middle arrow to designate “Group” as the “Groups” variable.

After completing the prior task, from the “Settings” tab, select the option “Customize tests”, then check the boxes adjacent to the options “Kolmogorov-Smirnov (2 samples)” and “Test sequence for randomness (Wald-Wolfowitz for 2 samples)”. Once these selections have been made, click “Run” to proceed.

This should produce the following output:

The results of both tests are displayed, and included with such are the null hypotheses which are implicit within each methodology.

In both cases, two independent data sets are being assessed to determine whether or not they originate from the same distribution. Insinuated within this conclusion is the understanding that if they both do indeed originate from the same distribution, then both data sets may be samples which originate from the same population as well.

Given the significance values produced by both tests, we can conclude that this is likely the case.

We will not reject the null hypothesis, and therefore we can conclude, with 95% confidence, that the two samples were drawn from the same population.

The result that was generated for the Kolmogorov-Smirnov Test can be verified in R with the following code:

x <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00)

y <- c(6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

ks.test(x, y)

Which produces the output:

Two-sample Kolmogorov-Smirnov test

data: x and y
D = 0.22222, p-value = 0.7658
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : cannot compute exact p-value with ties

The warning is making us aware of tied values (duplicate observations) within the data, which prevent the computation of an exact p-value.
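The D statistic itself is simply the largest vertical distance between the two empirical distribution functions, which can be computed directly:

```r
x <- c(28, 21, 7, 46, 8, 2, 12, 7, 24, 22, 44, 15, 14, 34, 38, 24, 25, 26)
y <- c(6, 36, 27, 41, 10, 2, 20, 4, 29, 38, 2, 17, 10, 13, 42, 43, 7, 14)

# Evaluate both empirical CDFs at every observed value and take the largest gap
z <- sort(c(x, y))
D <- max(abs(ecdf(x)(z) - ecdf(y)(z)))
D  # matches the D value reported by ks.test(x, y)
```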

I could not find a function or package within R which can perform a 2-Sample Wald Wolfowitz test. Therefore, we will assume that the SPSS output is correct.

Example - The Kolmogorov-Smirnov Test (One Sample)

For this example, we will be utilizing the same data set which was previously assessed.

We will begin by selecting “Nonparametric Tests” from the “Legacy Dialogs” menu, followed by “1-Sample K-S”.

In this particular case, we will be assuming that there is no group assignment for our set of variables.

After encountering the following menu, utilize the center arrow to designate “VarA” as a “Test Variable”. Beneath the “Test Distribution” dialogue, select “Normal”. Once this has been completed, click “OK”.

The following output should be produced:

!!! WARNING !!!

In writing this article, both as it pertains to this example and the example which follows it, more time was spent assessing output than is typically the case. The reason for this is that the output produced through the utilization of this particular function is incorrect. Or perhaps, alternatively, we could say that the output was calculated in a manner which differs from the way in which it is traditionally derived.

Though the test statistic generated within the output is correct, the “Asymp. Sig. (2-tailed)” value is defining a lower bound, and not a specific value.

In the case of SPSS, the platform is assuming that you would like to utilize Lilliefors Test, a variant of the Kolmogorov-Smirnov Test. This test was derived specifically, with the Kolmogorov-Smirnov Test acting as an internal component, to perform normality analysis on single samples in which the distribution parameters are estimated from the data.

The following code can be utilized to perform this test within the R platform:

Example - Lilliefors Test*

# The package “nortest” must be installed and enabled #
xy <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00, 6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

lillie.test(xy)

This produces the output:

Lilliefors (Kolmogorov-Smirnov) normality test

data: xy
D = 0.11541, p-value = 0.2611

We will not reject the null hypothesis, and therefore we can conclude, with 95% confidence, that the sample above was drawn from a normally distributed population.

To again check the p-value, I re-analyzed the data vector “xy” with another statistical package, and was presented with the following output figures:

p-value = 0.2531

D = 0.1154

Therefore, I would recommend that if you were interested in running this particular test, that the test be performed within the R platform.

*- https://en.wikipedia.org/wiki/Lilliefors_test

Example - The Wald Wolfowitz Test

Again we will be utilizing the prior data set.

We will begin by selecting “Nonparametric Tests” from the “Legacy Dialogs” menu, followed by “Runs”.

This should cause the following menu to populate:

Utilize the center arrow to designate “VarA” as a “Test Variable List”. Beneath the “Cut Point” dialogue, select “Median”. Once this has been completed, click “OK”.

This series of selections should cause console output to populate.

!!! WARNING !!!

I have not the slightest idea as to how the “Asymp. Sig. (2-Tailed)” value was derived. For this particular example, after re-analyzing the data within the R platform, I then re-calculated the formula by hand. While the “Test Value” (Median), “Total Cases”, and “Number of Runs” are accurate, the p-value is not. This of course, is the most important component of the model output.

The only assumption that I can make as it relates to this occurrence pertains to a document that I found while searching for an answer to this quandary.

This text originates from a two page flyer advertising the merits of SPSS. My only conclusion is that, as this flyer insinuates, SPSS might be programmed to shift test methodology on the basis of the data which is being analyzed. While an expert statistician would probably catch these small alterations right away, decide for himself whether or not they are suitable, and then proceed to mention such in his research abstract, a more novice researcher may jeopardize his research by neglecting to notice. That is why, when publishing, if given the option, I will run data through multiple platforms to verify results prior to submission.

If you would like to perform this test, I would recommend utilizing R to do so.

The Wald Wolfowitz Test

# The package “randtests” must be installed and enabled #

xy <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00, 6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

runs.test(xy)

This produces the output:

Runs Test

data: xy
statistic = -1.691, runs = 14, n1 = 18, n2 = 18, n = 36, p-value = 0.09084
alternative hypothesis: nonrandomness

We will not reject the null hypothesis, and therefore we can conclude, with 95% confidence, that each element in the sequence was independently drawn from the same distribution.
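The statistic reported above can also be verified by hand from the run counts, using the normal approximation for the runs test (this sketch assumes no continuity correction, which is consistent with the output above):

```r
R_obs <- 14  # number of runs observed
n1 <- 18     # observations in the first group (below the median)
n2 <- 18     # observations in the second group (above the median)
n <- n1 + n2

# Expected number of runs and its variance under the null of randomness
mu <- 2 * n1 * n2 / n + 1
sigma2 <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))

# Standardized test statistic and two-sided p-value
z <- (R_obs - mu) / sqrt(sigma2)
p <- 2 * pnorm(z)
round(z, 3)  # -1.691, matching the statistic above
```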

That's all for now, Data Heads! Keep visiting for more exciting articles.

## Thursday, March 15, 2018

### (R) Cronbach’s Alpha / Reliability Analysis (SPSS)

Cronbach’s Alpha, AKA tau-equivalent reliability, AKA coefficient alpha, is an estimate of internal consistency reliability. Cronbach’s Alpha is primarily utilized to measure for the correlation of response data collected from sociological surveys. This methodology was originally derived by Lee Joseph Cronbach, an American psychologist.

What this statistical method is seeking to quantify is the reliability of survey questions pertaining to a specific survey category. The alpha statistic, which is the eventual product of the analysis, is derived through a formula which accounts for both the number of questions contained within a category, and their shared correlations.

For example, suppose that you were administering a survey to measure the satisfaction of employees within a particular department as it pertains to supervisor interaction. Typically, you would expect all of the responses of the survey participants, as it relates to the questions of that category, to cluster in a particular manner. If the responses were not arranged in such a way, it would be more difficult to assess the accuracy of the overall categorical rating. If such were the case, it typically would be best to remove the questions containing outlier responses prior to administering the survey during a subsequent cycle.

Example

Below is a data set which contains categorical responses to a fictitious survey.

From the “Analyze” menu, select “Scale”, then select “Reliability Analysis”.

This should populate the following menu:

Using the center middle arrow, designate all question response variables as “Items”.

Next, click on the “Statistics” button.

The sub-menu above should appear. Select the check boxes adjacent to the options “Scale if item deleted” and “Correlations”.

After clicking “Continue”, click “OK” to generate the system output.

Reliability Statistics – This table displays the Cronbach’s Alpha score. According to the various research papers produced by the researcher Nunnally*, the alpha score should typically fall somewhere between .70 - .79 if the research is exploratory, between .80 - .89 if the research is basic, and above .90 in applied research scenarios.

Inter-Item Correlation Matrix – This table is displaying the correlation of each variable as it pertains to the other variables within the analysis.

Item-Total Statistics – The most important aspect of this table is the section which reads “Cronbach’s Alpha if Item Deleted”. This column is presenting exactly what the column header suggests: the value which alpha would assume if the corresponding variable were removed.

One final note on Cronbach’s Alpha: variables are not required to utilize the same scale of measurement. However, it is uncommon to witness the utilization of this methodology given those circumstances.

Analysis within the R platform

I will now briefly illustrate how to obtain the same results in R.

# With package "psy" installed and enabled #

Q1 <- c(2, 3, 5, 1, 5)

Q2 <- c(5, 2, 2, 2, 2)

Q3 <- c(2, 1, 5, 5, 2)

Q4 <- c(1, 3, 5, 4, 4)

Q5 <- c(2, 5, 3, 4, 2)

x <- data.frame(Q1, Q2, Q3, Q4, Q5)

cronbach(x)

Which produces the output:

$sample.size
[1] 5

$number.of.items
[1] 5

$alpha
[1] -0.5255682
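The alpha value can also be verified by hand through the standard formula, which weighs the sum of the individual item variances against the variance of each participant's total score:

```r
Q1 <- c(2, 3, 5, 1, 5)
Q2 <- c(5, 2, 2, 2, 2)
Q3 <- c(2, 1, 5, 5, 2)
Q4 <- c(1, 3, 5, 4, 4)
Q5 <- c(2, 5, 3, 4, 2)
x <- data.frame(Q1, Q2, Q3, Q4, Q5)

k <- ncol(x)  # number of items

# alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
alpha <- (k / (k - 1)) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
alpha  # -0.5255682, matching the output of cronbach(x)
```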

That’s all for now, Data Heads. Stay tuned for future articles!

*- https://youtu.be/EdCdTzpZrVI

## Wednesday, March 14, 2018

### Discriminant Analysis (SPSS)

Discriminant Analysis is, in summary, a less robust variation of Binary Logistical Analysis. In every case I would recommend utilizing Binary Logistical Analysis in lieu of Discriminant Analysis. However, since this is a website dedicated to all things statistical, we will briefly cover this topic.

Discriminant Analysis is a very sensitive modeling methodology, as outliers and group size can potentially cause miscalculation. Additionally, there are various assumptions that must be accounted for prior to application.

These assumptions* are:

Multivariate Normality - Independent variables are normal for each level of the grouping variable.

Homogeneity of Variance - Variances among group variables are the same across levels of predictors.

Multicollinearity - Predictive power can decrease with an increased correlation between predictor variables.

Independence - Participants are assumed to be randomly sampled, and a participant’s score on one variable is assumed to be independent of scores on that variable for all other participants.

Example:

We’ll begin with a familiar sample data set:

From the “Analyze” menu, select “Classify”, then select “Discriminant”.

The following menu should appear. Using the topmost middle arrow, select “Cancer” as the “Grouping Variable”. Using the center arrow, select “Age”, “Obese” and “Smoking” as “Independents”.

Click on the “Define Range” button to populate the following sub-menu. Since “Cancer” is a binary variable, we will set “Minimum” to “0”, and “Maximum” to “1”.

(Note: “1” indicates “Cancer”, and “0” indicates “No Cancer Detected”)

After clicking “Statistics”, check the box adjacent to “Unstandardized”.

Clicking on “Save” will populate the menu below. Check the options labeled “Predicted group membership” and “Probabilities of group membership”.

Once this has been completed, click “OK”.

The following output should be generated:

Wilks’ Lambda – Two useful values are being provided within this table output. The first value is the Wilks’ Lambda value. This value is similar to the coefficient of determination; however, it is interpreted in an inverse manner. Meaning, a value of 0 would equate to perfect correlation. Therefore, if you would like to determine the equivalent r-squared value for interpretation only, you could subtract this value from 1 and consider the difference. The second value worth noting is the Chi-square significance: “Sig”. This value is illustrating the significance of the Wilks’ Lambda. If the p-value is less than .05, we can determine that the derived model is significant in determining a predictive outcome.

Canonical Discriminant Function Coefficients – The values presented in the above table are the components of the predictive model. If we were to construct the model as an equation, it would resemble:

Logit(p) = (Age * .026) + (Obese * -.347) + (Smoking * 2.418) - 2.263

The logit value can be utilized in tandem with the R function “plogis” to generate the probability of a positive outcome. For more information pertaining to this function, please consult the article related to Binary Logistical Analysis that was previously featured on this blog.
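For instance, assuming that “Obese” and “Smoking” are 0/1 indicator variables (a hypothetical coding for illustration), the predicted probability for a hypothetical 50-year-old participant who smokes but is not obese could be sketched as:

```r
# Hypothetical participant: Age = 50, Obese = 0, Smoking = 1 (assumed 0/1 coding)
logit_p <- (50 * .026) + (0 * -.347) + (1 * 2.418) - 2.263

# Convert the logit value into a probability of the positive outcome (Cancer = 1)
plogis(logit_p)  # approximately .81
```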

The final output that we will review is the output that was produced within the original data sheet.

We are presented with three new variables. “Dis_1” represents the model’s predicted outcome given the independent variable data (1 or 0). “Dis2_2” represents the probability (.00 – 1.00) of a positive outcome occurring, and “Dis2_1” represents the probability of a negative outcome occurring.

That’s all for now. Stay subscribed for more analytics and articles. Until next time, Data Heads!

*- https://en.wikipedia.org/wiki/Discriminant_function_analysis

## Friday, March 9, 2018

### Nearest Neighbor / Dimension Reduction (Pt. II) (SPSS)

The “Nearest Neighbor” function is a feature that is included within the SPSS platform. “Nearest Neighbor”, is the term that the creators of the SPSS platform decided to designate to describe a process more commonly known as Euclidean distance measurement.

Euclidean distance measurement is the measurement of the distance between two points within Euclidean space. Euclidean space is essentially non-curved space; non-Euclidean space most commonly refers to spheres.

When the nearest neighbor function is utilized, SPSS will treat each variable row input as a series of co-ordinates. Through the application of the Euclidean distance formula, the system will then analyze the data for the closest points contained within the same Euclidean space.

The Euclidean distance formula can be extended to include any number of dimensions. However, typically, for the sake of data visualization and accuracy, the number of variables commonly utilized for analysis is limited to three.

You can now see how this topic relates to dimension reduction. Dimension reduction allows the practitioner to reduce the number of dimensions contained within the data to a pragmatic amount, and nearest neighbor then enables the practitioner to search for similarities between observation entries.

Example (Dimension Reduction):

Here is our data set from the last example. I am going to make a slight modification which will prove useful during nearest neighbor analysis.

We will proceed with performing the same dimension reduction analysis which was demonstrated in the previous example. However, we will perform two steps differently.

The first is that we will not include all variables within our analysis; the variable “ID” will be excluded.

The next change pertains to the “Extraction” option. Since we do not want to exceed three dimensions of Euclidean space, we will directly specify that SPSS create exactly three component factors.

This step is completely optional; however, for the eventual output to match the output provided here, it must be completed.

To further reduce the dimensional space between variable points, select the "Rotation" option, then select "Quartimax". Once this has been completed, click "Continue".

Finally, we will specify that the component variables be saved to the original data sheet. This option can be enabled from the “Scores” menu.

After the analysis has been completed, the original data sheet should resemble:

Each new “FAC” variable represents a newly derived component, and the score which is contained within each cell represents the component score which coincides with each observation.

Example (Nearest Neighbor):

Now that we have our components defined, let’s move forward in our Euclidean distance analysis.

From the “Analyze” menu, select “Classify”, then select “Nearest Neighbor”.

Select “Variables”, and utilize the middle center arrow to designate all component variables as “Features”. The selected variables are the values which will be analyzed through the utilization of the “Nearest Neighbor” procedure.

(Note: In this particular case, selecting "Normalize scale features" is not required, as the data variables within the "Features:" designation box are already normalized. This process occurred during the "Save as variables" step. However, if there were a scenario in which data variables were not normalized during a prior rotation step, you should enable the "Normalize scale features" option.)

Using the topmost center arrow, designate “ID” as the “Target” variable.

After selecting the “Neighbors” tab, be sure that the value of k is set to “3”. This value specifies the number of relationships which SPSS will assess between each set of variable co-ordinates.

From the “Partitions” tab, modify the value found within the “Training %” box to equal “100”.

Finally, within the “Save” tab, check the box next to the “Predicted Value or category” description.

Once all of these steps are complete, click “OK”.

This should provide the output:

What is being presented in the output screen is a 3-dimensional model which utilizes the component variables as co-ordinate points. If only 2 component variables were utilized, the output would instead include a 2-dimensional model.

Double click on the model image to access the following model viewer:

Clicking on a particular point will make it a focal point. As such, the closest K number of relationships will be illustrated on the graphic.

At this time, you have the option to reduce the K number of relationships that are illustrated. It is important to note that the illustrated relationships are ranked by their distance from the focal point.

From the left menu adjacent to the graphic, adjusting the “View” to display “Neighbor and Distance Table” will present the following chart. The chart displays the “ID” variable for the point that is currently selected, along with its three nearest neighbors, as measured through the Euclidean distance formula. The Euclidean distances between the selected variable and the closest neighbor variables are presented in the rightmost portion of the chart.
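The ranking behind this table can be sketched as follows. The component scores and ID values below are invented for illustration; SPSS performs the equivalent ranking internally using the saved FAC variables:

```python
import math

def euclidean(p, q):
    # Euclidean distance between two equal-length co-ordinate tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical component scores (FAC values) keyed by observation ID
scores = {
    1: (0.5, -1.2, 0.3),
    2: (0.6, -1.0, 0.2),
    3: (-2.1, 0.8, 1.5),
    4: (0.4, -1.3, 0.4),
}

def k_nearest(target_id, k=3):
    # Rank every other observation by distance to the target, nearest first
    others = [(euclidean(scores[target_id], scores[i]), i)
              for i in scores if i != target_id]
    return sorted(others)[:k]

neighbors = k_nearest(1)  # list of (distance, ID) pairs, nearest first
```

The first entry of the returned list corresponds to the predicted nearest neighbor that SPSS writes back to the data sheet when “Predicted Value or category” is enabled.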

Let’s now examine the data sheet output:

The rightmost column was added as a result of the option which we selected from the “Save” tab. What is being displayed in this new column is the ID of the observation that is closest in proximity to each row’s observation, as assessed through the utilization of the Euclidean distance formula.

Conclusion

What is the appropriate and applicable utilization of nearest neighbor? That question is mostly up to the end user. However, nearest neighbor analysis allows for the drawing of similarities between single observations of data. Suppose that we were trying to compare baseball players based on traditionally collected statistics; the above example would provide the perfect format for accomplishing such a task. In addition to being useful, the nearest neighbor function within SPSS provides beautiful output, which is impressive to any set of eyes.

That’s all for now, Data Heads! Stay subscribed for more interesting articles!