Monday, August 31, 2020

(R) Markov Chains

Per Wikipedia, “A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”.

Explained in a less broad manner, a Markov chain describes a system which moves between a fixed set of states, where the probability of moving to each state next depends only on the state which the system currently occupies.

For example, in the case of weather systems, a day which is cloudy may subsequently be followed by a day which is also cloudy, a day without clouds, or a rainy day. However, the probability of each of those outcomes depends on the current day’s weather.

Another example of the applied methodology is the assessment of market share. If company A retains 60% of its current consumers each year but loses the other 40% to company B, while company B retains 20% of its current consumers each year but loses the other 80% to company A, what is the impact of this pattern from one year to the next?

Let’s explore both examples:

First, we’ll create a model which can predict weather.

We’ll assume that the following probabilities appropriately describe the autumn forecasts for weather in Winnipeg.

(Rows: today’s weather. Columns: tomorrow’s weather.)

         Cloudy   Clear   Snowy   Rainy
Cloudy   33%      17%     25%     25%
Clear    25%      50%     12%     13%
Snowy    19%      15%     33%     33%
Rainy    20%      20%     10%     50%

To further understand this probability matrix, assume that currently the day’s forecast in Winnipeg is “Cloudy”. This would typically indicate that the following day would have weather which is either “Cloudy” (33%), “Clear” (17%), “Snowy” (25%), or “Rainy” (25%).

Now, we’ll run the information through RStudio:

EXAMPLE A – Weather Model

# With the libraries ‘markovchain’ and ‘diagram’ downloaded, enable them #

library(markovchain)

library(diagram)

# Create a Transition Matrix #

trans_mat <- matrix(c(.33, .17, .25, .25, .25, .50, .12, .13, .19, .15, .33, .33, .20, .20, .10, .50),nrow = 4, byrow = TRUE)

stateNames <- c("Cloudy","Clear", "Snowy", "Rainy")

row.names(trans_mat) <- stateNames

colnames(trans_mat) <- stateNames

# Check input #

trans_mat

# Console Output #


Cloudy Clear Snowy Rainy
Cloudy 0.33 0.17 0.25 0.25
Clear 0.25 0.50 0.12 0.13
Snowy 0.19 0.15 0.33 0.33
Rainy 0.20 0.20 0.10 0.50
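Before building the chain, it may be worth confirming that the matrix is a valid transition matrix: every entry should lie between 0 and 1, and each row should sum to 1. What follows is a minimal base R sketch of that check. The helper name is_transition_matrix is my own and is not part of any package.

```r
# Re-create the weather transition matrix from the example above
trans_mat <- matrix(c(.33, .17, .25, .25,
                      .25, .50, .12, .13,
                      .19, .15, .33, .33,
                      .20, .20, .10, .50), nrow = 4, byrow = TRUE)
stateNames <- c("Cloudy", "Clear", "Snowy", "Rainy")
dimnames(trans_mat) <- list(stateNames, stateNames)

# Hypothetical helper (not from a package): TRUE when all entries are
# probabilities and every row sums to 1, within floating point tolerance
is_transition_matrix <- function(m, tol = 1e-8) {
  all(m >= 0 & m <= 1) && all(abs(rowSums(m) - 1) < tol)
}

rowSums(trans_mat)
is_transition_matrix(trans_mat)
```

If the check fails, the most likely culprit is a typo in one of the rows.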


# Create a Discrete Time Markov Chain #

disc_trans <- new("markovchain",transitionMatrix=trans_mat, states=c("Cloudy","Clear", "Snowy", "Rainy"), name="Weather")

# Check input #

disc_trans

# Console Output #


Weather
A 4 - dimensional discrete Markov Chain defined by the following states:
Cloudy, Clear, Snowy, Rainy
The transition matrix (by rows) is defined as follows:
Cloudy Clear Snowy Rainy
Cloudy 0.33 0.17 0.25 0.25
Clear 0.25 0.50 0.12 0.13
Snowy 0.19 0.15 0.33 0.33
Rainy 0.20 0.20 0.10 0.50


# Illustrate the Matrix Transitions #

plotmat(trans_mat,pos = NULL,

lwd = 1, box.lwd = 2,

cex.txt = 0.8,

box.size = 0.1,

box.type = "circle",

box.prop = 0.5,

box.col = "light yellow",

arr.length=.1,

arr.width=.1,

self.cex = .4,

self.shifty = -.01,

self.shiftx = .13,

main = "")


This produces the output graphic:



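(As it pertains to the graphic, something important to note is the direction of the arrows. The arrow direction in the graphic is inverted, because plotmat() treats the columns of the matrix as the origin states and the rows as the destination states. Therefore, I would only use the graphic as an auxiliary for personal reference.)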
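If the reversed arrows are a concern, one possible workaround is to hand plotmat() the transposed matrix, since it expects rows to be destinations and columns to be origins. This is a sketch rather than a definitive fix. The name plot_mat is my own, and the plotting call is guarded so that it only runs when ‘diagram’ is installed.

```r
# Re-create the weather transition matrix (rows = from, columns = to)
trans_mat <- matrix(c(.33, .17, .25, .25,
                      .25, .50, .12, .13,
                      .19, .15, .33, .33,
                      .20, .20, .10, .50), nrow = 4, byrow = TRUE)
stateNames <- c("Cloudy", "Clear", "Snowy", "Rainy")
dimnames(trans_mat) <- list(stateNames, stateNames)

# Transpose so that entry [to, from] holds the probability of moving from -> to
plot_mat <- t(trans_mat)

# Cloudy -> Clear sits in row "Cloudy", column "Clear" of trans_mat;
# after transposing it sits in row "Clear", column "Cloudy"
plot_mat["Clear", "Cloudy"]

# Only attempt the plot when the 'diagram' package is available
if (requireNamespace("diagram", quietly = TRUE)) {
  diagram::plotmat(plot_mat, box.type = "circle", box.size = 0.1,
                   box.prop = 0.5, box.col = "light yellow",
                   cex.txt = 0.8, main = "")
}
```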

# We will assume that the current forecast is cloudy by creating the vector below #

Current_state<-c(1, 0, 0, 0)

# Now we will utilize the following code to predict the weather for tomorrow #

steps<-1

finalState<-Current_state*disc_trans^steps

finalState

# Console Output #

Cloudy Clear Snowy Rainy
[1,] 0.33 0.17 0.25 0.25

This output indicates that tomorrow will have a 33% chance of being cloudy, a 17% chance of being clear, a 25% chance of being snowy, and a 25% chance of being rainy.
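For readers curious as to what the multiplication is doing, the same one-step forecast can be reproduced in base R as a row vector multiplied by the transition matrix. This is a sketch which bypasses the ‘markovchain’ object entirely. The variable name next_day is my own.

```r
# Re-create the weather transition matrix from the example above
trans_mat <- matrix(c(.33, .17, .25, .25,
                      .25, .50, .12, .13,
                      .19, .15, .33, .33,
                      .20, .20, .10, .50), nrow = 4, byrow = TRUE)
stateNames <- c("Cloudy", "Clear", "Snowy", "Rainy")
dimnames(trans_mat) <- list(stateNames, stateNames)

# Today is Cloudy: all of the probability sits on the first state
Current_state <- c(1, 0, 0, 0)

# Row vector %*% matrix: each entry is the probability of tomorrow's state
next_day <- Current_state %*% trans_mat
next_day
```

Because all of the probability starts on “Cloudy”, the result is simply the “Cloudy” row of the matrix.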

# Let’s predict the weather for the following day #

steps<-2

finalState<-Current_state*disc_trans^steps

finalState

# Console Output #


Cloudy Clear Snowy Rainy
[1,] 0.2489 0.2286 0.2104 0.3121

With this information, we can estimate that two days from now there is roughly a 25% chance of the day being cloudy, a 23% chance of the day being clear, a 21% chance of the day being snowy, and a 31% chance of the day being rainy.
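A multi-step forecast amounts to applying the transition matrix once per day. The base R sketch below makes the repeated multiplication explicit. It assumes a hypothetical helper, forecast_state, which is my own and not part of ‘markovchain’.

```r
# Re-create the weather transition matrix from the example above
trans_mat <- matrix(c(.33, .17, .25, .25,
                      .25, .50, .12, .13,
                      .19, .15, .33, .33,
                      .20, .20, .10, .50), nrow = 4, byrow = TRUE)
stateNames <- c("Cloudy", "Clear", "Snowy", "Rainy")
dimnames(trans_mat) <- list(stateNames, stateNames)

# Hypothetical helper: push a state vector forward `steps` transitions
forecast_state <- function(state, P, steps) {
  for (i in seq_len(steps)) {
    state <- state %*% P
  }
  state
}

Current_state <- c(1, 0, 0, 0)  # today is Cloudy

two_days <- forecast_state(Current_state, trans_mat, steps = 2)
two_days

# A probability vector should still sum to 1 after any number of steps
sum(two_days)
```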

EXAMPLE B – Market Share

Let’s re-visit our market share example:

Company A offers a product which retains 60% of its current consumers each year but loses the other 40% to company B, while company B retains 20% of its current consumers each year but loses the other 80% to company A. What is the impact of this pattern from one year to the next?

Let’s make a few assumptions.

First, we will assume that the projection given above is accurate.

Next, we’ll assume that the total customer base as it pertains to the product is 60,000,000.

Finally, we’ll assume that Company A possesses 20% of this market (12,000,000 individuals), and Company B possesses 80% of this market (48,000,000 individuals).

# With the libraries ‘markovchain’ and ‘diagram’ downloaded, enable them #

library(markovchain)

library(diagram)

# Create a Transition Matrix #

trans_mat <- matrix(c(0.6,0.4,0.8,0.2),nrow = 2, byrow = TRUE)

stateNames <- c("Company A","Company B")

row.names(trans_mat) <- stateNames

colnames(trans_mat) <- stateNames

# Check input #

trans_mat

# Console Output #


Company A Company B
Company A 0.6 0.4
Company B 0.8 0.2


# Create a Discrete Time Markov Chain # 

disc_trans <- new("markovchain",transitionMatrix=trans_mat, states=c("Company A","Company B"), name="Market Share")

# Check input #

disc_trans

# Console Output #


Market Share
A 2 - dimensional discrete Markov Chain defined by the following states:
Company A, Company B
The transition matrix (by rows) is defined as follows:
Company A Company B
Company A 0.6 0.4
Company B 0.8 0.2


# Illustrate the Matrix Transitions #

plotmat(trans_mat,pos = NULL,

lwd = 1, box.lwd = 2,

cex.txt = 0.8,

box.size = 0.1,

box.type = "circle",

box.prop = 0.5,

box.col = "light yellow",

arr.length=.1,

arr.width=.1,

self.cex = .4,

self.shifty = -.01,

self.shiftx = .13,

main = "")


This produces the output graphic:


(Again, as it pertains to the graphic, note the direction of the arrows: it is inverted. Therefore, I would only use the graphic as an auxiliary for personal reference.)

# We will assume that the market share is as follows #

# This reflects the information provided in the example description above #

Current_state<- c(0.20,0.80)

# Now we will utilize the following code to predict the market share for the next year #

steps<-1

finalState<-Current_state*disc_trans^steps

finalState

# Console Output #


Company A Company B
[1,] 0.76 0.24


As illustrated, one year out, Company A now controls 76% of the market share (45,600,000)*, and Company B controls 24% of the market share (14,400,000).

* Assuming that the overall market neither grows nor shrinks in total individuals. The calculation for the figures is: 60,000,000 * .76 and 60,000,000 * .24.
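The customer counts in the footnote can be reproduced with a line of arithmetic. The sketch below assumes, as the example does, a fixed market of 60,000,000 individuals. The variable names are my own.

```r
# Fixed market size assumed in the example
market_size <- 60000000

# Projected shares one year out, taken from the console output above
share_next_year <- c("Company A" = 0.76, "Company B" = 0.24)

# Convert shares into head counts
customers <- market_size * share_next_year

# Print with thousands separators for readability
format(customers, big.mark = ",", scientific = FALSE)
```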

Similar to our previous example, we can also project the current trend for multiple consecutive time periods.

# The following code predicts the market share two years out #

steps<-2

finalState<-Current_state*disc_trans^steps

finalState

# Console Output #


Company A Company B
[1,] 0.648 0.352


The steady state, in the case of this example, is the equilibrium market share which will be reached if the trends continue ad infinitum.

# Steady state Matrix # 

steadyStates(disc_trans)

# Console Output #


Company A Company B
[1,] 0.6666667 0.3333333


In the long run, Company A in this scenario would control approximately 66.67% of the market share, and Company B would control approximately 33.33% of the market share.
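For the curious, the steady state can also be recovered without ‘markovchain’: it is the left eigenvector of the transition matrix for the eigenvalue 1, rescaled so that its entries sum to 1. The following base R sketch assumes the 2 x 2 matrix used above. The helper name steady_state is hypothetical.

```r
# Re-create the market share transition matrix from the example above
trans_mat <- matrix(c(0.6, 0.4, 0.8, 0.2), nrow = 2, byrow = TRUE)
stateNames <- c("Company A", "Company B")
dimnames(trans_mat) <- list(stateNames, stateNames)

# Hypothetical helper: stationary distribution via eigen decomposition
steady_state <- function(P) {
  e <- eigen(t(P))                    # eigenvectors of t(P) = left eigenvectors of P
  k <- which.min(abs(e$values - 1))   # pick the eigenvalue closest to 1
  v <- Re(e$vectors[, k])
  setNames(v / sum(v), rownames(P))   # rescale to a probability vector
}

pi_hat <- steady_state(trans_mat)
pi_hat

# Applying one more transition should leave the distribution unchanged
pi_hat %*% trans_mat
```

This approach assumes the chain has a single steady state, which holds for the matrix used here.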

Tuesday, August 25, 2020

(R) Exotic Analysis – Distance Correlation T-Test

In prior articles, I explained the various tests of correlation which are available within the R programming language. One of those methods, which was described but is rarely utilized outside of the textbook, is the Distance Correlation T-Test methodology.

In this entry, I will briefly explain when it is appropriate to utilize the distance correlation, and how to appropriately apply the methodology within the R framework.

Now I must begin by stating that what I am about to describe is uncommon, and should only be utilized in situations which absolutely warrant application.

The distance correlation as described within the context of this blog is:

Distance Correlation – A method which tests model variables for correlation through the utilization of a Euclidean distance formula.
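To make the “Euclidean distance” part of that definition concrete, here is a rough base R sketch of the sample distance correlation for two numeric vectors, following the usual double-centering construction. The function name dist_cor is my own. In practice, the dcor() function from the ‘energy’ package would be the one to use. The vectors are the same ones used in the example further down.

```r
# Hypothetical helper: sample distance correlation of two numeric vectors
dist_cor <- function(x, y) {
  # Pairwise Euclidean distances within each variable
  a <- as.matrix(dist(x))
  b <- as.matrix(dist(y))

  # Double-center: subtract row means and column means, add the grand mean
  center <- function(m) {
    sweep(sweep(m, 1, rowMeans(m)), 2, colMeans(m)) + mean(m)
  }
  A <- center(a)
  B <- center(b)

  dcov2 <- mean(A * B)  # squared sample distance covariance
  dvarx <- mean(A * A)  # squared sample distance variance of x
  dvary <- mean(B * B)  # squared sample distance variance of y

  if (dvarx * dvary == 0) return(0)  # a constant variable carries no signal
  sqrt(max(dcov2, 0) / sqrt(dvarx * dvary))
}

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

dist_cor(x, y)  # somewhere between 0 and 1
dist_cor(x, x)  # a variable compared with itself
```

Note that this is the plain (uncorrected) statistic, so it will not match the bias corrected value which the t-test reports.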


So when would I apply the Distance Correlation T-Test? Only in situations in which other correlation methods are inapplicable. An example of such a situation would be one in which one variable is continuous, and the other is categorical.

Example:

(This example requires that the R package: “energy”, be downloaded and enabled.)


# Enable the package #

library(energy)

# Data Vectors #

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

dcor.ttest(x, y)

mean(x)

sd(x)

mean(y)

sd(y)


This produces the output:

dcor t-test of independence

data: x and y
T = -0.1138, df = 34, p-value = 0.545
sample estimates:
Bias corrected dcor
-0.01951283

> mean(x)
[1] 4.8
> sd(x)
[1] 3.794733
> mean(y)
[1] 79
> sd(y)
[1] 14.3527


Conclusion:

The test did not find significant evidence of dependence between X (M = 4.80, SD = 3.79) and Y (M = 79.00, SD = 14.35), t(34) = -0.11, p = .55.

However, you may be wondering, what is the difference between the Distance Correlation T-Test, the Distance Correlation Method, and the Pearson Test of Correlation?

Distance Correlation T-Test – Utilized to test for significance in situations in which one variable is continuous, and the other is categorical. This method can also be utilized in other situations. However, if both variables are continuous, then the Pearson Test of Correlation is most appropriate.

Distance Correlation Method – Utilized to measure the correlation between two variables through the application of the Euclidean Distance Formula. The resulting value is similar to the coefficient of determination in that it ranges from 0 (no correlation) to 1 (perfect correlation). Note that the bias corrected statistic reported by dcor.ttest(), as seen in the output above, can fall slightly below 0.

The Pearson Test of Correlation – Utilized to determine if values are correlated. This method should typically be utilized above all other tests of correlation. However, it is only appropriate to utilize this method when both variables are continuous.
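Since both vectors in the example above are numeric, the Pearson test can be run on the same data for comparison. This sketch uses cor.test() from base R’s ‘stats’ package. Nothing in it comes from ‘energy’, and the variable name pearson is my own.

```r
# Same data vectors as in the distance correlation example
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

# Pearson test of correlation (null hypothesis: true correlation is 0)
pearson <- cor.test(x, y, method = "pearson")

pearson$estimate   # sample correlation coefficient r
pearson$parameter  # degrees of freedom (n - 2)
pearson$p.value    # p-value for the test
```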