Reflections of a Data Scientist: (R) The Kolmogorov-Smirnov Test & The Wald Wolfowitz Test (SPSS)

The Kolmogorov-Smirnov Test and The Wald Wolfowitz Test are similar in that both are used to make inferences about the distributions of sample data. Both methods can be applied either to a single data set or to two independent data sets.

The Wald Wolfowitz Test

A method for analyzing a single data set in order to determine whether the elements within the data set were sampled independently.

Hypothesis Format:

H0: Each element in the sequence is independently drawn from the same distribution. (The elements share a common distribution).

HA: The elements in the sequence are not independently drawn from the same distribution. (The elements do not share a common distribution).

The Wald Wolfowitz Test (2-Sample)

A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

Hypothesis Format:

H0: The two samples were drawn from the same population.

HA: The two samples were not drawn from the same population.

The Kolmogorov-Smirnov Test (Lilliefors Test Variant)

A method for analyzing a single data set in order to determine whether the data was sampled from a normally distributed population.

Hypothesis Format:

H0: The data conforms to a normal probability distribution.

HA: The data does not conform to a normal probability distribution.

The Kolmogorov-Smirnov Test (2-Sample)

A method for analyzing two separate sets of data in order to determine whether they originate from similar distributions.

Hypothesis Format:

H0: The two samples were drawn from the same population.

HA: The two samples were not drawn from the same population.

In almost all cases, given its robustness, the Kolmogorov-Smirnov Test is the preferred alternative to The Wald Wolfowitz Test. Note that if one sample contains significantly more observations than the other, the accuracy of the test could be affected.

WARNING – PRIOR TO UTILIZING ANY OF THE FOLLOWING TESTS, PLEASE READ EACH EXAMPLE THOROUGHLY!!!!!

Example - The Wald Wolfowitz Test (2-Sample) / The Kolmogorov-Smirnov Test (2-Sample)

Below is our sample data set:

It is important to note that SPSS will not perform this analysis unless the data variable that you are using is set to “Nominal”, and the group variable that you are using is set to “Ordinal”.

If this is the case, you can proceed by selecting “Nonparametric Tests” from the “Analyze” menu, followed by “Independent Samples”.

This should cause the following menu to populate. From the “Fields” tab, use the center middle arrow to select “VarA” as the “Test Fields” variable. Then, use the bottom middle arrow to designate “Group” as the “Groups” variable.

After completing the prior task, from the “Settings” tab, select the option “Customize tests”, then check the boxes adjacent to the options “Kolmogorov-Smirnov (2 samples)” and “Test sequence for randomness (Wald-Wolfowitz for 2 samples)”. Once these selections have been made, click “Run” to proceed.

This should produce the following output:

The results of both tests are displayed, along with the null hypotheses which are implicit within each methodology.

In both cases, two independent data sets are being assessed to determine whether they originate from the same distribution. Implicit in this is the understanding that if both do originate from the same distribution, then both data sets may be samples from the same population as well.

Given that the significance values of both tests exceed .05, the data are consistent with this being the case.

At the 5% significance level (95% confidence level), we fail to reject the null hypothesis. The data provide no evidence that the two samples were drawn from different populations.

The result that was generated for the Kolmogorov-Smirnov Test can be verified in R with the following code:

x <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00)

y <- c(6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

ks.test(x, y)

Which produces the output:

Two-sample Kolmogorov-Smirnov test

data: x and y
D = 0.22222, p-value = 0.7658
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : cannot compute exact p-value with ties

The warning is making us aware that the data contains ties (identical values). When ties are present, R cannot compute an exact p-value, so the p-value shown is an approximation.
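For readers curious where the figures in the output above come from, here is a minimal base R sketch. It is offered as an illustration, not as a description of how ks.test works internally. It computes D as the largest gap between the two empirical distribution functions, then approximates the p-value with the asymptotic Kolmogorov series, truncated at 100 terms (an arbitrary choice of mine). The names pooled, D, z, k and p_asymp are my own.

```r
x <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00)
y <- c(6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

# Evaluate both empirical distribution functions at every distinct observed value
pooled <- sort(unique(c(x, y)))

# D is the largest absolute difference between the two step functions
D <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

# Asymptotic two-sided p-value: 2 * sum((-1)^(k-1) * exp(-2 * k^2 * z^2))
n <- length(x)
m <- length(y)
z <- D * sqrt(n * m / (n + m))
k <- 1:100
p_asymp <- 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * z^2))

print(round(D, 5))        # 0.22222
print(round(p_asymp, 4))  # 0.7658
```

Both figures agree with the ks.test output, which suggests that the p-value reported alongside the warning is the asymptotic one.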

I could not find a function or package within R which can perform a 2-Sample Wald Wolfowitz test. Therefore, we will assume that the SPSS output is correct.
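For what it is worth, the two-sample version can be sketched by hand in base R. The function below, ww_two_sample, is a hypothetical helper of my own and not part of any package. It pools both samples, sorts them, counts the runs of group labels, and applies the usual normal approximation with a one-sided (lower tail) p-value. Two assumptions to be aware of: no continuity correction is applied, and ties between the groups are broken by placing "x" before "y". Because our data contains such ties, the run count depends on that choice, so the result should not be expected to match SPSS exactly.

```r
# Hypothetical helper: two-sample Wald-Wolfowitz runs test (normal approximation)
ww_two_sample <- function(x, y) {
  values <- c(x, y)
  labels <- c(rep("x", length(x)), rep("y", length(y)))

  # Sort the pooled values; ties between groups are broken by label ("x" first)
  sorted_labels <- labels[order(values, labels)]

  # A new run starts wherever the group label changes
  runs <- 1 + sum(sorted_labels[-1] != sorted_labels[-length(sorted_labels)])

  n1 <- length(x)
  n2 <- length(y)
  n <- n1 + n2

  # Mean and standard deviation of the number of runs under H0
  mu <- 2 * n1 * n2 / n + 1
  sigma <- sqrt(2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1)))
  z <- (runs - mu) / sigma

  # Few runs suggest that the samples differ, so the lower tail is used
  list(runs = runs, z = z, p.value = pnorm(z))
}

x <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00)
y <- c(6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

print(ww_two_sample(x, y))
```

Treat this as a rough cross-check only. A careful implementation would need to report how much the run count can vary across the possible orderings of the tied values.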

Example - The Kolmogorov-Smirnov Test (One Sample)

For this example, we will be utilizing the same data set which was previously assessed.

We will begin by selecting “Nonparametric Tests” from the “Analyze” menu, then “Legacy Dialogs”, followed by “1-Sample K-S”.

In this particular case, we will be assuming that there is no group assignment for our set of variables.

After encountering the following menu, utilize the center arrow to designate “VarA” as a “Test Variable”. Beneath the “Test Distribution” dialogue, select “Normal”. Once this has been completed, click “OK”.

The following output should be produced:

!!! WARNING !!!

In writing this article, both for this example and for the example which follows it, more time was spent assessing output than is typically the case. The reason is that the output produced by this particular function is incorrect. Or perhaps, alternatively, we could say that the output was calculated in a manner which differs from the way in which it is traditionally derived.

Though the test statistic generated within the output is correct, the “Asymp. Sig. (2-tailed)” value is defining a lower bound, and not a specific value.

In the case of SPSS, the platform is assuming that you would like to utilize Lilliefors Test, which is a more robust version of the Kolmogorov-Smirnov Test. This test was derived specifically, with the Kolmogorov-Smirnov Test acting as an internal component, to perform normality analysis on single samples.

The following code can be utilized to perform this test within the R platform:

Example - Lilliefors Test*

# The package "nortest" must be installed (install.packages("nortest")) and loaded #

library(nortest)

xy <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00, 6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

lillie.test(xy)

This produces the output:

Lilliefors (Kolmogorov-Smirnov) normality test

data: xy
D = 0.11541, p-value = 0.2611

At the 5% significance level (95% confidence level), we fail to reject the null hypothesis. The data provide no evidence that the sample above was drawn from a population which is not normally distributed.

To check the p-value again, I re-analyzed the data vector “xy” with another statistical package, and was presented with the following output figures:
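To illustrate how the Kolmogorov-Smirnov statistic sits inside the Lilliefors test, the following base R sketch recomputes D without the nortest package. It assumes the standard definition of the statistic: the sample is compared against a normal distribution whose mean and standard deviation are estimated from the sample itself. The names n, p_norm, D_plus, D_minus and D are my own. The sketch does not reproduce the p-value, since estimating the parameters from the data is exactly what invalidates the ordinary Kolmogorov-Smirnov p-value and requires the Lilliefors correction.

```r
xy <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00, 6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

n <- length(xy)

# Normal CDF at each sorted observation, using the sample mean and sd
p_norm <- pnorm(sort(xy), mean = mean(xy), sd = sd(xy))

# Largest gap above and below the fitted normal CDF
D_plus <- max(seq_len(n) / n - p_norm)
D_minus <- max(p_norm - (seq_len(n) - 1) / n)
D <- max(D_plus, D_minus)

print(D)
```

The value printed should agree with the D reported by lillie.test above.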

p-value = 0.2531

D = 0.1154

Therefore, I would recommend that if you were interested in running this particular test, that the test be performed within the R platform.

*- https://en.wikipedia.org/wiki/Lilliefors_test

Example - The Wald Wolfowitz Test

Again we will be utilizing the prior data set.

We will begin by selecting “Nonparametric Tests” from the “Analyze” menu, then “Legacy Dialogs”, followed by “Runs”.

This should cause the following menu to populate:

Utilize the center arrow to move “VarA” into the “Test Variable List”. Beneath the “Cut Point” dialogue, select “Median”. Once this has been completed, click “OK”.

This series of selections should cause console output to populate.

!!! WARNING !!!

I have not the slightest idea as to how the “Asymp. Sig. (2-Tailed)” value was derived. For this particular example, after re-analyzing the data within the R platform, I then re-calculated the formula by hand. While the “Test Value” (Median), “Total Cases”, and “Number of Runs” are accurate, the p-value is not. This of course, is the most important component of the model output.

The only assumption that I can make as it relates to this occurrence pertains to a document that I found while searching for an answer to this quandary.

This text originates from a two page flyer advertising the merits of SPSS. My only conclusion is that the flyer implies that SPSS might be programmed to shift test methodology on the basis of the data which is being analyzed. An expert statistician would probably catch these small alterations right away, decide whether or not they are suitable, and then mention them in the research abstract. A more novice researcher may jeopardize their research by failing to notice. That is why, when publishing, if given the option, I will run data through multiple platforms to verify results prior to submission.

If you would like to perform this test, I would recommend utilizing R to do so.

The Wald Wolfowitz Test

# The package "randtests" must be installed (install.packages("randtests")) and loaded #

library(randtests)

xy <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00, 6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

runs.test(xy)

This produces the output:

Runs Test

data: xy
statistic = -1.691, runs = 14, n1 = 18, n2 = 18, n = 36, p-value = 0.09084
alternative hypothesis: nonrandomness

At the 5% significance level (95% confidence level), we fail to reject the null hypothesis. The data provide no evidence against each element in the sequence being independently drawn from the same distribution.
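For those who would like to re-calculate the formula by hand, as I did, the steps can be sketched in base R. This is a minimal sketch under my own assumptions: the cut point is the median, values equal to the cut point are dropped, and the two-sided p-value comes from the normal approximation. The variable names are my own, and this is not the source code of runs.test.

```r
xy <- c(28.00, 21.00, 7.00, 46.00, 8.00, 2.00, 12.00, 7.00, 24.00, 22.00, 44.00, 15.00, 14.00, 34.00, 38.00, 24.00, 25.00, 26.00, 6.00, 36.00, 27.00, 41.00, 10.00, 2.00, 20.00, 4.00, 29.00, 38.00, 2.00, 17.00, 10.00, 13.00, 42.00, 43.00, 7.00, 14.00)

# Dichotomize around the median, dropping values equal to the cut point
cut_point <- median(xy)
kept <- xy[xy != cut_point]
above <- kept > cut_point

# A new run starts wherever the sequence crosses the cut point
runs <- 1 + sum(above[-1] != above[-length(above)])

n1 <- sum(above)
n2 <- sum(!above)
n <- n1 + n2

# Mean and standard deviation of the number of runs under H0
mu <- 2 * n1 * n2 / n + 1
sigma <- sqrt(2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1)))

z <- (runs - mu) / sigma
p_value <- 2 * pnorm(-abs(z))

print(c(runs = runs, n1 = n1, n2 = n2))  # 14, 18, 18
print(round(z, 3))                       # -1.691
```

With 18 values on each side of the median (20.5), the expected number of runs is 19, so observing 14 runs gives the statistic of -1.691 shown in the R output.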

That's all for now, Data Heads! Keep visiting for more exciting articles.


Thursday, March 22, 2018