Reflections of a Data Scientist: (R) Identifying Normally Distributed Data

A good portion of statistics, as it is taught in academic settings, focuses on a single topic: the normal distribution. Normal distributions often appear in very large sets of observations; however, a small set of observations may also exhibit the characteristics of a normal distribution. Once a data set is identified as conforming to this distribution, various inferences can be made about the data, and various modeling techniques can be applied.

For this article, we will be using the sample data set "SDSample":

SDSample <- c(7.00, 5.10, 4.80, 2.90, 4.80, 5.80, 6.40, 6.10, 4.30, 7.20, 5.30, 4.00, 5.50, 5.40, 4.70, 4.50, 5.00, 4.70, 6.10, 5.10, 5.20, 5.00, 4.20, 5.10, 4.90, 5.30, 2.90, 5.80, 3.50, 4.90, 5.80, 6.10, 3.00, 5.90, 4.30, 5.30, 4.70, 6.40, 4.60, 3.50, 5.00, 3.50, 4.10, 5.70, 4.90, 6.10, 5.30, 6.90, 4.60, 4.90, 4.00, 3.90, 4.50, 5.90, 5.20, 7.20, 4.60, 4.40, 5.40, 5.90, 3.10, 5.60, 5.10, 4.40, 4.50, 3.10, 4.50, 6.00, 6.00, 5.10, 7.30, 4.60, 3.20, 4.10, 5.10, 4.90, 5.10, 5.60, 4.10, 5.70, 4.70, 5.70, 5.50, 4.50, 5.20, 5.00, 5.40, 5.10, 3.90, 4.30, 4.10, 4.30, 4.40, 2.40, 5.40, 6.30, 5.50, 4.30, 4.90, 2.90)

Creating a Frequency Histogram

Assuming that "SDSample" contains sample observation data, our first step in checking the data for normality is to create a histogram.

hist(SDSample,
freq = TRUE,
col = "Blue",
xlab = "Vector Values",
ylab = "Frequency",
main = "Frequency Histogram")

The output should resemble:

Shapiro-Wilk Normality Test

After viewing the histogram, we should proceed with a Shapiro-Wilk test. This can be achieved with the following example code:

shapiro.test(SDSample)

This produces the following console output:

Shapiro-Wilk normality test

data: SDSample
W = 0.98523, p-value = 0.3298

But what does this mean?

Without an assumed alpha level, it means nothing at all. However, with a determined alpha level, a hypothesis test can be created which tests for normality.

We will assume an alpha value of .05 (α = .05). The null hypothesis of the Shapiro-Wilk normality test is that the data were drawn from a normally distributed population. The test cannot prove that the data fit a normal distribution; it can only tell us whether there is significant evidence against that assumption.

If P <= .05, we would reject the null hypothesis, meaning that we could state:

At the .05 significance level, the data depart significantly from a normal distribution.

If P > .05, we would fail to reject the null hypothesis, meaning that we could state:

No significant departure from normality was found.

In this case, P = 0.3298, and 0.3298 > .05; therefore, we can state:

No significant departure from normality was found.
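The decision rule above can also be expressed in code. The following is a minimal sketch and not part of the original analysis: "alpha", "sample_data" and "decision" are names chosen for illustration, and "sample_data" is a simulated stand-in so that the block runs on its own. Substitute "SDSample" to reproduce the test above. The sketch relies on shapiro.test() returning a list whose "p.value" element holds the p-value.

```r
# Simulated stand-in data (an assumption for illustration only);
# replace with SDSample to reproduce the article's test.
set.seed(42)
sample_data <- rnorm(100, mean = 5, sd = 1)

alpha <- 0.05                        # assumed significance level
result <- shapiro.test(sample_data)  # returns an "htest" list

# Apply the decision rule to the p-value.
if (result$p.value <= alpha) {
  decision <- "Reject the null hypothesis: significant departure from normality."
} else {
  decision <- "Fail to reject the null hypothesis: no significant departure from normality was found."
}

print(result$p.value)
print(decision)
```

Writing the rule this way keeps the alpha level explicit, which matters because the same p-value can lead to a different decision under a different alpha.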

THE RESULTS OF THIS TEST DO NOT MEAN THAT THIS DATA WAS TAKEN FROM A SOURCE WHICH WAS NORMALLY DISTRIBUTED. NOR DOES IT INDICATE THAT THE DATA ITSELF IS NORMALLY DISTRIBUTED. IT SIMPLY STATES THAT:

Assuming an Alpha Value of .05, and applying the Shapiro-Wilk normality test, no significant departure from normality was found.

However, like most significance tests, the Shapiro-Wilk test is sensitive to sample size: in large samples it may report a statistically significant result even when the departure from normality is too small to matter in practice. Thus, a Q-Q plot should be examined in addition to the test. *
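The effect of sample size can be explored with simulated data. The sketch below is an illustration under stated assumptions, not a result from this article: it draws a small and a large sample from the same modestly heavy-tailed distribution (Student's t with 10 degrees of freedom, chosen arbitrarily) and runs the test on each. The names "small_sample", "large_sample", "p_small" and "p_large" are hypothetical. Because the larger sample gives the test more power, its p-value will typically be much smaller, even though both samples share the same underlying shape.

```r
# Simulated data from a Student's t distribution with 10 degrees of
# freedom. This is an illustrative assumption, not data from the article.
set.seed(42)
small_sample <- rt(50, df = 10)
large_sample <- rt(5000, df = 10)  # shapiro.test() accepts 3 to 5000 values

# Run the same test on both samples and keep only the p-values.
p_small <- shapiro.test(small_sample)$p.value
p_large <- shapiro.test(large_sample)$p.value

print(c(n_50 = p_small, n_5000 = p_large))
```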

Q-Q Plot

As previously mentioned, the size of the data set to which the Shapiro-Wilk normality test is applied can have a significant impact on its results. This is why a Q-Q plot is recommended to double-check the results of the test.

To create a Q-Q plot, use the following sample code:

qqnorm(SDSample, main="")
qqline(SDSample)

This should produce the following output:

Ideally, if the data is normally distributed, the plotted points should follow the solid reference line as closely as possible.

The Q-Q plot is reasonably consistent with normality.
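A visual reading of the Q-Q plot can be paired with a rough numeric summary. The sketch below is one possible approach and not part of the original article: it asks qqnorm() for the plotted coordinates without drawing them, then computes the correlation between the theoretical and sample quantiles. The names "sample_data", "qq" and "qq_cor" are hypothetical, and "sample_data" is simulated so that the block runs on its own; substitute "SDSample" to apply it to the article's data.

```r
# Simulated stand-in data (an assumption for illustration only).
set.seed(42)
sample_data <- rnorm(100, mean = 5, sd = 1)

# qqnorm() returns the plotted coordinates: x = theoretical normal
# quantiles, y = the sample values. plot.it = FALSE suppresses the plot.
qq <- qqnorm(sample_data, plot.it = FALSE)

# Correlation between theoretical and sample quantiles; values near 1
# mean the points lie close to a straight line.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Note that qqline() draws its line through the first and third quartiles, so this correlation is a companion to the plot rather than a measure of distance from that particular line.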

For more information on how to interpret the Q-Q plot, please click on the link below:

http://data.library.virginia.edu/understanding-q-q-plots/

Plotting The Probability Density of A Normal Distribution

Before proceeding with this coding sample, I want to be clear that this method does not produce a kernel density plot. The method presented below will take any set of data points and plot them as if they came from a perfectly normal distribution. The resulting graphic is therefore only an illustration: any data passed through the steps below will be graphed as if it had been drawn from a normal distribution.
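To make this point concrete, the sketch below applies the dnorm() function, which is used later in this section, to a small set of values that are clearly not normal. The values and the names "skewed_values", "density_dnorm" and "density_manual" are hypothetical and chosen only for illustration. The sketch shows that dnorm() simply evaluates the normal density formula for a given mean and standard deviation; it does not check whether the data are actually normal.

```r
# Hypothetical, deliberately non-normal values (strongly right-skewed).
skewed_values <- c(1, 1, 1, 2, 2, 3, 10)

m <- mean(skewed_values)
s <- sd(skewed_values)

# dnorm() evaluates the normal density formula at each value ...
density_dnorm <- dnorm(skewed_values, mean = m, sd = s)

# ... which is the same as computing the formula by hand.
density_manual <- exp(-(skewed_values - m)^2 / (2 * s^2)) / (s * sqrt(2 * pi))

print(all.equal(density_dnorm, density_manual))
```

Plotting "density_dnorm" against "skewed_values" would still trace points on a bell curve, which is why this kind of plot should not be read as evidence of normality.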

First we will need to find the mean of the data vector:

mean(SDSample)

# mean = 4.940 #

Then we need to derive the standard deviation.

sd(SDSample)

# sd = 0.9978765 #

# Now we will assign the "SDSample" vector to vector "x" #

x <- SDSample

# This code produces a new vector which consists of the probability density values of all "SDSample" data vector values. It does so under the assumption that these values occurred within a normal distribution with a mean value of 4.940 and a standard deviation of 0.9978765 #

y <- dnorm(SDSample, mean = 4.940, sd = 0.9978765)

# This code plots the distribution #

plot(x,y, main="Normal Distribution / Mean = 4.940 / SD = .998", ylab="Density", xlab="Value", las=1)

# This code creates a vertical line on the plot which indicates the position of the mean value #

abline(v=4.940)

From this code, we are presented with the image:

Kernel Density Plot

This graphic illustrates the estimated density of the observed values along the x-axis. Notice that the illustration is not perfectly bell shaped. I was not initially going to include this as part of the article, but I feel that it should be presented due to its significance, and also to demonstrate how it differs from the previous example.

d <- density(SDSample)
plot(d, main ="Kernel Density of X")
polygon(d, col="grey", border="blue")

The output would be:

In the next article, we will discuss inferences that can be made pertaining to normally distributed data, and methods which can be utilized to draw further conclusions.

* https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test

Reflections of a Data Scientist

Sunday, August 27, 2017

(R) Identifying Normally Distributed Data - Pt. I
