Reflections of a Data Scientist: (R) Stationary Data and Random Walks

In this entry, we are going to address a subject which is rather complicated at its core. Just as it is complicated, it is also obscure, in that it is rarely discussed in textbooks, or even online for that matter.

Stationary data refers to data which is essentially static, in that the statistical properties of the underlying process which generates the data points (such as its mean and variance) do not change over time, and the process does not possess a directional aspect. Stationary data sets can be random. However, data sets which are random are not required to be stationary.

Investopedia contrasts non-stationary and stationary processes as follows:

"In contrast to the non-stationary process that has a variable variance and a mean that does not remain near, or returns to a long-run mean over time, the stationary process reverts around a constant long-term mean and has a constant variance independent of time."

This phenomenon, when illustrated, would resemble something similar to the graphic below:

In this case, "x" is the time variable, and "y" is the corresponding measurement.

The rules for a stationary data set are thus:

1. The expectation of the process is equal to a constant, meaning, time does not act as a contributing factor. (The mean of the process does not vary across time.)

2. The variance of the series is constant across time.

3. The covariance between two observations within the series must depend only on the distance in time (the lag) which separates them, and not on the point in time at which they were observed.
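The first two rules above can be checked informally by splitting a series into consecutive blocks and comparing the mean and the variance of each block. Below is a minimal sketch in base R, assuming a simulated white noise series. The seed, the block size, and the variable names are illustrative assumptions and are not part of the original examples.

```r
# Minimal sketch: compare the mean and variance across consecutive blocks.
set.seed(42)  # assumed seed, used only to make the run reproducible

n <- 1000
white_noise <- rnorm(n)  # independent draws with mean 0 and variance 1

# Label four consecutive blocks of 250 observations each.
block <- rep(1:4, each = n / 4)

# For a stationary series, these should be similar from block to block.
block_means <- tapply(white_noise, block, mean)
block_vars  <- tapply(white_noise, block, var)

print(round(block_means, 3))
print(round(block_vars, 3))
```

If the block means or block variances differ substantially from one another, the series is unlikely to be stationary. This is an informal check only, and it does not replace the formal tests discussed later in this entry.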

How does this data series differ from a "random walk"?

In the case of a "random walk" series of data, each value of "y" is equal to the previous value plus a random step. In a manner which is similar to the first rule of stationary data, the expectation of a random walk which lacks a constant term remains constant across time. However, because every random step is carried forward into all later values, the variance of the series grows as time progresses, which violates the second rule. A random walk is therefore non-stationary.

This accumulation of steps causes the series to wander away from its starting point for long stretches. If a constant term, known as "drift", is added to each step, the series will also move in a persistent direction. It is this inherent component of the random walk series which causes trends to emerge which may, in some cases, present the illusion of linearity.
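A random walk can be sketched as the running sum of independent random steps. The base R example below simulates many such walks in order to show that the variance across walks grows with time, and that differencing a walk recovers its steps. The seed, the number of walks, and the variable names are assumptions made for illustration.

```r
# Minimal sketch: random walks built as cumulative sums of random steps.
set.seed(123)  # assumed seed, used only to make the run reproducible

n_steps <- 500
n_walks <- 2000

# Each column holds the independent N(0, 1) steps of one walk.
steps <- matrix(rnorm(n_steps * n_walks), nrow = n_steps, ncol = n_walks)

# Accumulate the steps column by column to obtain the walks.
walks <- apply(steps, 2, cumsum)

# Variance across walks at selected time points (in theory, roughly equal to t).
time_points <- c(10, 100, 500)
var_at_t <- apply(walks[time_points, ], 1, var)
print(round(var_at_t, 1))

# Differencing a walk returns its steps (all but the first one).
recovered <- diff(walks[, 1])
print(isTRUE(all.equal(recovered, steps[-1, 1])))
```

The steps themselves are stationary, which is why differencing is a common way of converting random walk data into a stationary series.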

Below is a graphical representation of fictitious random walk data:

To demonstrate the aforementioned concepts, we will attempt two examples:

Example (Stationary Data Series)

# Requires that the package: “tseries”, be downloaded and enabled. It provides adf.test(). PP.test() is included within the base “stats” package. #

library(tseries)

# Create Stationary Data Set #

Value <- arima.sim(model = list(order = c(0,0,0)), n = 1000)

# Alternative Hypothesis (H1): Data is stationary #

adf.test(Value)

# Alternative Hypothesis (H1): Data is not a random walk #

PP.test(Value)

# Plot data points #

plot(Value)

In the case of our randomly generated stationary data, the graphical output is as follows:

The console output is as follows:

Augmented Dickey-Fuller Test

data: Value
Dickey-Fuller = -8.4201, Lag order = 9, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(Value) : p-value smaller than printed p-value

>
> # Alternative Hypothesis (H1): Data is not a random walk #
>
> PP.test(Value)

Phillips-Perron Unit Root Test

data: Value
Dickey-Fuller = -30.432, Truncation lag parameter = 7, p-value = 0.01

Example (Random Walk Data Series)

# Requires that the package: “tseries”, be downloaded and enabled. It provides adf.test(). PP.test() is included within the base “stats” package. #

library(tseries)

# Create Random Walk Set #

Value <- arima.sim(model = list(order = c(0, 1, 0)), n = 1000)

# Alternative Hypothesis (H1): Data is stationary #

adf.test(Value)

# Alternative Hypothesis (H1): Data is not a random walk #

PP.test(Value)

# Plot data points #

plot(Value)

In the case of our randomly generated random walk data, the graphical output is as follows:

The console output is as follows:

Augmented Dickey-Fuller Test

data: Value
Dickey-Fuller = -2.2039, Lag order = 9, p-value = 0.492
alternative hypothesis: stationary

>
> # Alternative Hypothesis (H1): Data is not a random walk #
>
> PP.test(Value)

Phillips-Perron Unit Root Test

data: Value
Dickey-Fuller = -2.4451, Truncation lag parameter = 7, p-value = 0.3899

Methods Utilized and Conclusions

The Augmented Dickey-Fuller Test is a method of analysis utilized to test data sets for stationarity. Its null hypothesis is that the series contains a unit root (as a random walk does), and its alternative hypothesis is that the series is stationary. The lag value, which can be set by the user through the "k" argument, determines the number of lagged difference terms which are included in the test regression to account for serial correlation within the data. I recommend leaving the value at its default setting. If this is the case, the function will assume a lag value of: trunc((length(x)-1)^(1/3)). More information pertaining to this option, and the function itself as it exists within the “tseries” package, can be found by utilizing the following command:

?? adf.test
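To make the default lag value concrete, the sketch below computes it by hand and, if the “tseries” package is installed, passes it to adf.test() through the "k" argument. The helper name default_adf_lag and the seed are assumptions made for illustration. They are not part of the package.

```r
# Hypothetical helper: reproduces the documented default lag of adf.test().
default_adf_lag <- function(x) trunc((length(x) - 1)^(1/3))

set.seed(1)  # assumed seed, used only to make the run reproducible
x <- rnorm(1000)

k_default <- default_adf_lag(x)
print(k_default)  # 9 for a series of 1000 points

# Only run the test itself if "tseries" is available.
if (requireNamespace("tseries", quietly = TRUE)) {
  # The same test twice: default lag, then the lag stated explicitly.
  res_default  <- tseries::adf.test(x)
  res_explicit <- tseries::adf.test(x, k = k_default)
  print(identical(res_default$statistic, res_explicit$statistic))
}
```

The value of 9 matches the "Lag order = 9" line shown within the console output of both examples.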

The Phillips-Perron Unit Root Test is very similar to the previously mentioned test, in that it also tests the null hypothesis of a unit root. For our purposes, the Phillips-Perron Unit Root Test is being utilized to test a time series for random walk potential. The working hypotheses in this scenario would be:

H0: The data set shares similarities with a random walk series of data

H1: The data set does not share similarities with a random walk series of data

For more information on the function utilized, you may call the following command:

??PP.test
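The decision rule implied by the hypotheses above can be sketched as follows. PP.test() is part of the base “stats” package and returns its p-value within the "p.value" element. The seed, the alpha value, and the variable names are assumptions made for illustration.

```r
# Minimal sketch: apply PP.test() and compare each p-value against alpha.
set.seed(2018)  # assumed seed, used only to make the run reproducible

alpha <- 0.05

stationary_series <- rnorm(1000)          # white noise
random_walk       <- cumsum(rnorm(1000))  # accumulated white noise

pp_stationary <- PP.test(stationary_series)
pp_walk       <- PP.test(random_walk)

# Reject H0 (unit root) only when the p-value falls below alpha.
reject_h0 <- c(
  stationary = pp_stationary$p.value < alpha,
  walk       = pp_walk$p.value < alpha
)
print(reject_h0)
```

A value of TRUE indicates that H0 is rejected, meaning that the series does not share similarities with a random walk. A value of FALSE indicates only that H0 could not be rejected. It does not prove that the series is a random walk.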

In our first example series, the p-value for the Augmented Dickey-Fuller Test was reported as .01, and the accompanying warning message indicates that the actual p-value is smaller than the printed value. Thus, assuming an alpha value of .05, we can reject the null hypothesis and state that the data is stationary. Since the p-value related to the Phillips-Perron Unit Root Test was also .01, assuming an alpha value of .05, we can state that the data does not exhibit the patterns typically observed within random walk data.

In our second example series, the p-value for the Augmented Dickey-Fuller Test was 0.492. Thus, assuming an alpha value of .05, we cannot state that the data is stationary. Since the p-value related to the Phillips-Perron Unit Root Test was 0.3899, assuming an alpha value of .05, we cannot reject the hypothesis that the data exhibits the patterns typically observed within a random walk data series. This outcome is consistent with the manner in which the series was generated.

For additional information pertaining to the subject matter discussed within this article, please visit the resources below:

https://www.quora.com/Is-a-random-walk-the-same-thing-as-a-non-stationary-time-series

https://www.youtube.com/watch?v=JytDF8ph2ko

Wednesday, August 1, 2018