Saturday, April 14, 2018

(R) Canonical Correlation (SPSS)

Canonical Correlation, pronounced “can-non-ick-cal”, is a complicated methodology utilized to measure the correlation of individual variable sets against new variables comprised of the combination of the original sets. In many ways, the functionality of this method is similar to that of dimension reduction. However, as it pertains to Canonical Correlation, variable sets are first manually selected by the user, and then subsequently combined to create new variables. These new variables carry the combined dimensionality of the prior sets.

Once the combined sets have been derived, each original variable set is tested for correlation against the newly derived sets. Though their component segments were defined beforehand, these new sets can be treated as variables in their own right. As such, they exist in a manner similar to latent variables, in that they may illustrate an encompassing macro-phenomenon.
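To make this concrete, below is a minimal sketch, in R, of what “correlating the combined sets” means. It utilizes base R’s cancor() function and arbitrary toy data; the variable names and values are illustrative only, and are not the example data set employed later in this article.

# Two toy variable sets, each comprised of two columns and twenty observations #

set.seed(1)
X <- matrix(rnorm(40), ncol = 2)
Y <- matrix(rnorm(40), ncol = 2)

# cancor() finds the weights which make the combined (canonical) variables
# of each set as highly correlated as possible.
fit <- cancor(X, Y)

# The first canonical variable of each set is a weighted combination of that
# set's columns...
U1 <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef[, 1]
V1 <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef[, 1]

# ...and the correlation between the two combinations is the first canonical
# correlation reported by cancor().
cor(U1, V1)
fit$cor[1]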

Example:

Let’s begin with our sample data set:


To begin our analysis, we must select, from the topmost menu, “Analyze”, then “Correlate”, followed by “Canonical Correlation”.


We will create two new sets, which will represent macro-phenomena. “Set 1” will be composed of the variables “X” and “Y”. “Set 2” will be composed of the variables “Z” and “W”. Designating the variables can be achieved through the utilization of the two middle arrow buttons.

Once this designation is complete, click “OK”.


This creates the following output:


Various aspects of output are produced from the requested analysis, though we only need to concern ourselves with the portions listed above.

“Sig.” details the significance of the correlated relationship between the combined variables which were assembled to create each new data variable. If either correlation’s significance value is greater than .05, then the analysis must be abandoned (assuming an alpha level of .05).

In our example, we will imagine that this is not the case.

As such, we will continue to the next step of the analysis, which is the assessment of each variable set independently, as it relates to the combined variable sets.

The results of this assessment lead to the synthesis of the following summary:

Tests of dimensionality for the canonical correlation analysis indicate that the two canonical dimensions are not statistically significant at the .05 level. Dimension 1 had a canonical correlation of .609, while for dimension 2, the canonical correlation was much lower at .059.
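For reference, the dimension tests described above can be approximated by hand through Bartlett’s chi-square approximation of Wilks’ lambda. The sketch below assumes the canonical correlations quoted above (.609 and .059), ten observations, and two variables per set:

# Canonical correlations, sample size, and variables per set (assumed from the example above) #

rho <- c(0.609, 0.059)
n <- 10; p <- 2; q <- 2

# Test each dimension k by pooling the remaining canonical correlations #

for (k in seq_along(rho)) {
  wilks <- prod(1 - rho[k:length(rho)]^2)            # Wilks' lambda for dimensions k onward
  chisq <- -(n - 1 - (p + q + 1) / 2) * log(wilks)   # Bartlett's approximation
  df <- (p - k + 1) * (q - k + 1)
  cat("Dimensions", k, "to", length(rho), "- p-value:",
      round(pchisq(chisq, df, lower.tail = FALSE), 3), "\n")
}

With these inputs, neither test reaches the .05 level, which is consistent with the summary above.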

In examining the standardized canonical coefficients for the two dimensions across variable sets, the following conclusions can be derived. For the first canonical dimension, the variable which had the strongest influence was variable “W” (.997), followed by variable “X” (-.890), variable “Y” (-.357), and variable “Z” (-.118). For the second canonical dimension, the variable which had the strongest influence was variable “Z” (.993), followed by variable “Y” (.943), variable “X” (.475), and variable “W” (.088).

If we wanted to repeat our analysis through the utilization of the “R” platform, we could do so with the following code:

(This example requires that the R package “CCA” be downloaded and enabled.)
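(If the package has not yet been installed, it can be downloaded from CRAN beforehand with the following line.)

install.packages("CCA")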

# Package enabling #

library(CCA)

# Data Vectors #

x <- c(8, 1, 4, 10, 8, 10, 3, 1, 1, 2)
y <- c(97, 56, 97, 68, 94, 66, 81, 76, 86, 69)
z <- c(188, 184, 181, 194, 190, 180, 192, 184, 192, 191)
w <- c(366, 383, 332, 331, 311, 352, 356, 351, 399, 357)
v <- c(6, 10, 6, 13, 19, 12, 11, 17, 18, 12) # (not utilized within this analysis)

# Vector consolidation into matrices #

xy <- matrix(c(x, y), ncol = 2) # Set 1: "X" and "Y"
zw <- matrix(c(z, w), ncol = 2) # Set 2: "Z" and "W"

# Application of analysis #

cc(xy, zw)


Which produces the output:

$cor
[1] 0.60321616 0.07284239

$names
$names$Xnames
NULL

$names$Ynames
NULL

$names$ind.names
NULL


$xcoef
            [,1]        [,2]
[1,] -0.23449315 -0.12499512
[2,] -0.02480536  0.06573124

$ycoef
            [,1]        [,2]
[1,] -0.01825046 0.199544843
[2,]  0.03902046 0.002265523

$scores
$scores$xscores
            [,1]       [,2]
 [1,] -1.1968746  0.7831779
 [2,]  1.4615973 -1.0368371
 [3,] -0.2589020  1.2831584
 [4,] -0.9465054 -1.3730183
 [5,] -1.1224585  0.5859842
 [6,] -0.8968947 -1.5044807
 [7,]  0.3724769  0.3564537
 [8,]  0.9654901  0.2777877
 [9,]  0.7174364  0.9351001
[10,]  0.9046344 -0.3073261

$scores$yscores
              [,1]       [,2]
 [1,]  0.468749446  0.1074573
 [2,]  1.205099133 -0.6522082
 [3,] -0.730193019 -1.3663844
 [4,] -1.006469470  1.2254331
 [5,] -1.713876856  0.3819432
 [6,]  0.068466671 -1.5206187
 [7,]  0.005542988  0.8829815
 [8,] -0.043555633 -0.7247049
 [9,]  1.683422831  0.9803990
[10,]  0.062813910  0.6857021

$scores$corr.X.xscores
           [,1]       [,2]
[1,] -0.9355965 -0.3530712
[2,] -0.4703893  0.8824590

$scores$corr.Y.xscores
            [,1]        [,2]
[1,] -0.03496377 0.072719921
[2,]  0.60070892 0.006634506

$scores$corr.X.yscores
           [,1]        [,2]
[1,] -0.5643669 -0.02571855
[2,] -0.2837464  0.06428042

$scores$corr.Y.yscores
            [,1]      [,2]
[1,] -0.05796226 0.9983188
[2,]  0.99584355 0.0910803
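Though not discussed above, the “$scores” portion of the output can be read as follows: “$scores$xscores” and “$scores$yscores” contain the canonical variates (the combined variables) for each observation, while the “corr.” matrices contain the correlations between the original variables and those variates. As a quick check, the correlation between the first pair of variates reproduces the first canonical correlation. (A sketch, assuming the result of cc(xy, zw) has been stored under a name such as “res”.)

res <- cc(xy, zw)

# The first canonical variate of each set, correlated against one another,
# returns the first value of $cor.
cor(res$scores$xscores[, 1], res$scores$yscores[, 1])
res$cor[1]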

In the output, the “$cor” element contains the canonical correlations themselves, while the “$xcoef” and “$ycoef” matrices contain the canonical coefficients for each variable set.
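If standardized canonical coefficients, comparable to those discussed in the SPSS portion of this article, are desired from the R results, one common approach is to scale the raw coefficients by the standard deviations of the corresponding variables. (A sketch, reusing the stored result “res” from the previous snippet.)

# Raw coefficients multiplied by each variable's standard deviation yield
# standardized canonical coefficients.
diag(apply(xy, 2, sd)) %*% res$xcoef   # Set 1 ("X" and "Y")
diag(apply(zw, 2, sd)) %*% res$ycoef   # Set 2 ("Z" and "W")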

That’s all for now, Data Heads! Please stay subscribed for more great articles.
