Reflections of a Data Scientist: (R) Distance Correlation

A previous entry discussed Nearest Neighbor, along with the mechanism that makes it work: the Euclidean distance formula.

Distance Correlation resembles Nearest Neighbor in that it also relies on the Euclidean distance formula as a fundamental part of its calculation.

Without venturing too far into the individual details of the process, we will examine this concept through the following example problem. The critical details that should be understood and retained are these:
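To make the role of Euclidean distance concrete, here is a minimal base R sketch. The three-value vector is made up purely for illustration. For a single variable, the Euclidean distance between two observations is just the absolute difference between them, and dist() collects every pairwise distance at once.

```r
# Hypothetical toy vector, used only to illustrate pairwise distances
toy <- c(1, 4, 8)

# dist() computes Euclidean distances between every pair of observations;
# as.matrix() expands the result into a full symmetric matrix
toy_dist <- as.matrix(dist(toy))
toy_dist
```

Distance correlation is built from matrices of this kind, one for each variable.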

The distance correlation formula produces a figure which is greater than or equal to zero, and less than or equal to one (0 <= X <= 1).

Because the figure is never negative, it cannot be assessed in the manner in which a Pearson correlation output is analyzed: it carries no direction, only the degree of dependence between the two variables.
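The difference from Pearson can be sketched with a small base R example. The function dcor_sketch and the vectors u and v are hypothetical names introduced here for illustration; the function assumes the usual sample definition of distance correlation (double-centered distance matrices) and is not the code used by any package. The data follow a perfect parabola, so the relationship is strong but not linear.

```r
# Illustrative helper, assuming the standard sample definition
dcor_sketch <- function(a, b) {
  center <- function(v) {
    d <- as.matrix(dist(v))  # pairwise Euclidean distances
    # subtract row means and column means, then add back the grand mean
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  A <- center(a)
  B <- center(b)
  sqrt(mean(A * B) / sqrt(mean(A * A) * mean(B * B)))
}

# Made-up data: v depends entirely on u, but not linearly
u <- -5:5
v <- u^2

pearson_uv  <- cor(u, v)          # the symmetry around zero cancels the linear trend
distance_uv <- dcor_sketch(u, v)  # still registers the dependence

pearson_uv
distance_uv
```

In this sketch the Pearson figure sits at zero while the distance correlation figure is clearly above zero, which is why the two outputs should not be read the same way.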

Example:

(This example requires that the R package “energy” be installed and loaded.)

# Data Vectors #

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

# Load the package and apply the appropriate analysis #

library(energy)
dcor(x, y)

This produces the output:

[1] 0.4419842
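For those curious about what sits behind that single figure, the calculation can be sketched step by step in base R. This assumes the standard sample definition of distance correlation; the helper double_center and the other variable names are illustrative and are not part of the “energy” package. If that assumption holds, the result should agree with the figure above.

```r
x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

# Illustrative helper: build the pairwise Euclidean distance matrix,
# then subtract row means and column means and add back the grand mean
double_center <- function(v) {
  d <- as.matrix(dist(v))
  sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
}

A <- double_center(x)
B <- double_center(y)

dcov2_xy <- mean(A * B)  # squared sample distance covariance of x and y
dvar2_x  <- mean(A * A)  # squared sample distance variance of x
dvar2_y  <- mean(B * B)  # squared sample distance variance of y

# Distance correlation: distance covariance scaled by the distance variances
dcor_manual <- sqrt(dcov2_xy / sqrt(dvar2_x * dvar2_y))
dcor_manual
```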

The above figure is the distance correlation value. Prior to interpreting this figure, we must first test whether the association is statistically significant. This can be achieved with the code below:

dcor.ttest(x, y)

This produces the output:

dcor t-test of independence

data: x and y
T = -0.1138, df = 34, p-value = 0.545
sample estimates:
Bias corrected dcor
-0.01951283

The figure that is relevant to our purposes is the p-value.

p-value = 0.545

The “Bias corrected dcor” can be set aside for this example. It is a bias-corrected estimate of the squared distance correlation and, unlike the plain statistic, it can fall slightly below zero when the sample shows little evidence of dependence. A small negative value is therefore consistent with the large p-value here.

The p-value of a distance correlation test is interpreted in a manner which is similar to that of the Pearson correlation test. It is the probability of observing an association at least this strong if the two variables were in fact independent. At 0.545, we cannot reject independence.

Interpreting the distance correlation value itself is less straightforward. A value of “1” would indicate a perfect correlation, while a value of “0” would indicate complete independence.
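One way to build intuition for such a p-value is a permutation test, sketched below in base R. To be clear, this is not the procedure that dcor.ttest() uses; it is a simpler technique swapped in for illustration. The helper dcor_sketch, the seed, and the number of permutations are all assumptions made for this sketch. Shuffling y breaks any link with x, so the shuffled results show what the statistic looks like under independence.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)

# Illustrative helper, assuming the standard sample definition
dcor_sketch <- function(a, b) {
  center <- function(v) {
    d <- as.matrix(dist(v))
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  A <- center(a)
  B <- center(b)
  sqrt(mean(A * B) / sqrt(mean(A * A) * mean(B * B)))
}

observed <- dcor_sketch(x, y)

# Recompute the statistic on many shuffled copies of y
n_perm   <- 999
permuted <- replicate(n_perm, dcor_sketch(x, sample(y)))

# Share of results (shuffled plus the observed one) at least as large as observed
p_value <- (1 + sum(permuted >= observed)) / (n_perm + 1)
p_value
```

A large share means the observed value is unremarkable under independence, which is the same reading given to the p-value above.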

This can be demonstrated with the code below:

dcor(x,x)

Which produces the output:

[1] 1

As was demonstrated in the prior article, I would also recommend performing a Pearson correlation test on the same data vectors. The output provided by such can be included within the final written analysis.

As a reminder, this analysis can be performed with the following code:

cor.test(x,y)

Which produces the output:

Pearson's product-moment correlation
data: x and y
t = 0.36656, df = 8, p-value = 0.7235
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5452231 0.7013920
sample estimates:
cor
0.1285237

That’s all for now. Stay tuned, Data Heads!

Friday, April 13, 2018