Wednesday, November 7, 2018

(R) Dimension Reduction and K-Nearest Neighbor

Continuing the theme of the prior articles, today’s entry discusses dimension reduction within the R platform.

This demonstration also includes an example of the k-nearest neighbor method of analysis. Both methodologies were previously demonstrated within the SPSS platform.

If you are unfamiliar with these methodologies, or if you wish to re-familiarize yourself, please consult the following articles: Dimension Reduction (SPSS) and Nearest Neighbor / Dimension Reduction (Pt. II) (SPSS).


For this demonstration, we will use the same data set which was previously used to demonstrate the analytical process within SPSS. This data set can be found within this site’s GitHub Repository.

# Load the data into the R platform #

# Be sure to change the ‘filepathway’ so that it matches the file location on your #
# computer #

DimensionReduct <- read.table("C:\\filepathway\\DimensionReduction.csv", fill = TRUE, header = TRUE, sep = "," )

# First, we must remove the ‘ID’ column from the data frame #

DimensionReduct <- DimensionReduct[c(-1)]

# Next we will perform a bit of preliminary analysis with the following code: #

# The option ‘scale. = TRUE’ requests scaling prior to analysis. If the data #
# frame has already been scaled, set this option to ‘FALSE’. #

pca_existing <- prcomp(DimensionReduct, scale. = TRUE)
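As a minimal sketch of what the ‘scale.’ option does, the example below compares letting prcomp() scale the data against scaling it beforehand with scale(). It assumes the built-in USArrests data set as a stand-in (it is not the data set used in this article), and the names pca_a and pca_b are hypothetical.

```r
# Stand-in data: the built-in USArrests data set (an assumption for illustration)

# Option 1: let prcomp() scale the variables
pca_a <- prcomp(USArrests, scale. = TRUE)

# Option 2: scale the variables first, then tell prcomp() not to scale again
pca_b <- prcomp(scale(USArrests), scale. = FALSE)

# Both routes should yield the same component standard deviations
print(all.equal(pca_a$sdev, pca_b$sdev))
```

Either route is acceptable; the point is simply to avoid scaling twice or not at all.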

# Summary output can be produced with the following function: #

summary(pca_existing)

Console Output:

Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
Standard deviation     1.4008 1.3598 1.0840 1.0114 0.87534 0.76901 0.58431 0.54010
Proportion of Variance 0.2453 0.2311 0.1469 0.1279 0.09578 0.07392 0.04268 0.03646
Cumulative Proportion  0.2453 0.4764 0.6233 0.7512 0.84694 0.92086 0.96354 1.00000

Graphical Console Output:

# To view the eigenvalues which were utilized to generate the above graph: #

eigenvals <- pca_existing$sdev^2

# To view proportional eigenvalue output #
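One way to obtain proportional eigenvalues is to divide each eigenvalue by the sum of all eigenvalues. The sketch below assumes the built-in USArrests data set as a stand-in for the article’s data, and the names pca_demo, eigen_demo, prop_demo and cum_demo are hypothetical. With your own analysis, substitute the object returned by prcomp().

```r
# Stand-in data: the built-in USArrests data set (an assumption for illustration)
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigen_demo <- pca_demo$sdev^2

# Proportion of the total variance explained by each component
prop_demo <- eigen_demo / sum(eigen_demo)

# Running total, comparable to the 'Cumulative Proportion' row of summary()
cum_demo <- cumsum(prop_demo)

print(round(prop_demo, 4))
print(round(cum_demo, 4))
```

Passing the prcomp() result to screeplot() draws these eigenvalues as a scree plot.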


# Re-load the data into the R platform #

# Be sure to change the ‘filepathway’ so that it matches the file location on your #
# computer #

DimensionReduct <- read.table("C:\\filepathway\\DimensionReduction.csv", fill = TRUE, header = TRUE, sep = "," )

# First, we must remove the ‘ID’ column from the data frame #

DimensionReduct0 <- DimensionReduct[c(-1)]

# Download and enable the package: ‘psych’, in order to utilize the function: ‘principal’ #

# Download and enable the package: ‘GPArotation’, in order to utilize the #
# option: ‘rotate’ #

# The ‘nfactors’ option indicates the number of dimensional factors to utilize for analysis #

# The ‘rotate’ option indicates the type of rotation methodology which will be applied #

# The code below generates the principal components requested: #

pca <- principal(DimensionReduct0, nfactors=3, rotate = "quartimax")

# We must export these scores into an accessible format. The code below does so: #

pcascores <- data.frame(pca$scores)

# Prior to initiating the K-Nearest Neighbors process, we must isolate the previously #
# removed ‘ID’ variable so that it can later act as the classification factor. #

cl <- DimensionReduct[, 1, drop = TRUE]

# We are now prepared to perform the primary analysis. #

# ‘k = 3’ indicates the number of nearest neighbors to consider for each observation. #

# Download and enable the package: ‘class’, in order to utilize the function: ‘knn’ #

KNN <- knn(pcascores, pcascores, cl, k = 3)
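As a minimal sketch of how knn() is commonly called, the example below holds out part of a data set as a test group and classifies it from the remaining observations. It assumes the built-in iris data set as a stand-in (it is not the data set used in this article), and the names features, labels, test_idx, train_x, test_x and pred are hypothetical.

```r
library(class)  # provides knn()

set.seed(42)  # arbitrary seed so that the split is reproducible

# Stand-in data: the built-in iris data set (an assumption for illustration)
features <- scale(iris[, 1:4])  # scale so that no variable dominates the distances
labels   <- iris$Species

# Hold out 30 observations for testing
test_idx <- sample(nrow(features), 30)
train_x  <- features[-test_idx, ]
test_x   <- features[test_idx, ]

# k = 3: each test point takes the majority class of its 3 nearest training points
pred <- knn(train = train_x, test = test_x, cl = labels[-test_idx], k = 3)

# Cross-tabulate the predicted classes against the actual classes
print(table(predicted = pred, actual = labels[test_idx]))
```

Note that this differs from the call above, which passes the same scores as both the training and the test set.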

# With the analysis completed, we must now assemble all of the outstanding #
# components into a single data frame. #

FinalFrame <- data.frame(DimensionReduct , pcascores)

FinalFrame$KNN <- KNN

This data frame should resemble the following:

# (with the package: ‘plot3D’, downloaded and enabled) #

# Create a 3D graphical representation for the K-Nearest Neighbor analysis #

scatter3D(FinalFrame$RC3, FinalFrame$RC2, FinalFrame$RC1, phi = 0, bty = "g", pch = 20, cex = 2)

# Create data labels for the graphic #

text3D(FinalFrame$RC3, FinalFrame$RC2, FinalFrame$RC1, labels = FinalFrame$ID,
add = TRUE, colkey = FALSE, cex = 1)

If you would prefer a three dimensional graphic which can be rotated, you could utilize the following code:

# Download and enable the package: ‘rgl’, in order to utilize the function: ‘plot3d’ #

plot3d(FinalFrame$RC3, FinalFrame$RC2, FinalFrame$RC1, col = blues9, size=10)

This creates a new window which should resemble the following:

If you drag your mouse cursor across the graphic while holding the left mouse button, you can rotate the image display. 

One final note, unrelated to our example demonstrations: if you are performing a nearest neighbor analysis and your data has not previously been scaled, be sure to scale it before proceeding with the procedure.
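To illustrate why scaling matters for distance-based methods, the sketch below uses a small hypothetical data frame (the values and the names raw and scaled are invented for illustration) in which one variable is measured in much larger units than the other.

```r
# Hypothetical example data: two variables on very different scales
raw <- data.frame(income = c(32000, 54000, 41000, 78000),
                  age    = c(25, 47, 33, 52))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(raw)

print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))

# Distances before scaling are driven almost entirely by 'income';
# after scaling, both variables contribute on a comparable footing
print(dist(raw))
print(dist(scaled))
```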

Tuesday, November 6, 2018

(R) K-Means Cluster

Continuing with the premise of the prior article, we will again explore a methodology which was previously demonstrated within the SPSS platform.

If you are unfamiliar with this particular methodology, or if you wish to re-familiarize yourself, please consult the following article: K-Means Cluster (SPSS).


For this demonstration, we will use the same data set which was previously used to demonstrate the analytical process within the SPSS platform. This data set can be found within this site’s GitHub Repository.

# Load the data into the R platform #

# Be sure to change the ‘filepathway’ so that it matches the file location on your #

# computer #

KMeans <- read.table("C:\\filepathway\\kmeans.csv", fill = TRUE, header = TRUE, sep = "," )

# We’re going to assume that the variables: ‘ZCont_Var1’, ‘ZCont_Var2’, ‘ZCont_Var3’ #

# are not included within the initial data frame. #

# Therefore, we must scale the variables: ‘Cont_Var1’, ‘Cont_Var2’, ‘Cont_Var3’ prior to #
# performing analysis. #

ScaledKMeans <- scale(KMeans[4:6])

# In this example, we are going to create a two cluster model. Also, we will be utilizing #

# an ‘nstart’ value of 10. This figure represents the number of random starting #
# configurations which will be attempted. The model retains the configuration with #
# the lowest total within-cluster sum of squares. #

ScaledKMeansCluster <- kmeans(ScaledKMeans, 2, nstart = 10)
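As a minimal sketch of the role played by ‘nstart’, the example below fits a two cluster model with one random start and again with ten. The data is simulated purely for illustration, and the names demo, demo_scaled, fit_1 and fit_10 are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: two well-separated groups of 25 points each
demo <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
              matrix(rnorm(50, mean = 5), ncol = 2))
demo_scaled <- scale(demo)

# One random start versus ten; with nstart > 1, kmeans() keeps the start
# which yields the lowest total within-cluster sum of squares
fit_1  <- kmeans(demo_scaled, centers = 2, nstart = 1)
fit_10 <- kmeans(demo_scaled, centers = 2, nstart = 10)

print(fit_1$tot.withinss)
print(fit_10$tot.withinss)

# Number of observations assigned to each cluster
print(table(fit_10$cluster))
```

On cleanly separated data such as this, both fits will usually agree. Additional starts matter more when the clusters overlap.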

# Once the model has been created, we will assign the cluster values to a data frame #

# in order to discern which cluster categorizations pertain to each observational value. #

KMeans$ClusterID <- as.factor(ScaledKMeansCluster$cluster)

# Finally, we will graph the results of the analysis. The variables which will represent #

# the scales of the graph’s axes are: Zcont_Var1 and Zcont_Var2. #

# To achieve the desired result, we must download and enable the package: “ggplot2” #

ggplot(KMeans, aes(x = Zcont_Var1, y = Zcont_Var2, color = ClusterID)) + geom_point()

Graphical Console Output:

(R) Hierarchical Cluster

Previously, we discussed how to perform hierarchical cluster analysis within the SPSS platform. In this entry, we will discuss the same method of analysis; however, we will be utilizing the R platform to perform it.

If you are unfamiliar with this particular methodology, or if you wish to re-familiarize yourself, please consult the following article: Hierarchical Cluster (SPSS).


For this demonstration, we will use the same data set which was previously used to demonstrate the process within the SPSS platform. This data set can be found within this site’s GitHub Repository.

# Load the data into the R platform #

# Be sure to change the ‘filepathway’ so that it matches the file location on your #

# computer #

HFrame <- read.table("C:\\filepathway\\hcluster.csv", fill = TRUE, header = TRUE, sep = "," )

# After specifying the variables to analyze (Cont_Var1, Cont_Var2, Cont_Var3), #

# we must utilize the dist() function to create a matrix which contains the distances #
# between the observations. #

clusters0 <- dist(HFrame[, 4:6])

# Now we are ready to prepare our model. "hclust()" is the function which we will #

# utilize to enable model generation. There are other agglomeration methods which #
# can be specified to generate different model variations. However, in the case of our #
# example, we will be utilizing the “average” method, as it produces a model which #
# best resembles the equivalent SPSS output. # 

clusters1 <- hclust(clusters0, method = "average" , members = NULL)

# Next, we will download and enable the package: “ggdendro”. This package allows #

# for the production of enhanced visualizations as it pertains to hierarchical model #
# illustration. #

# Rotate the plot and remove default theme #

ggdendrogram(clusters1, rotate = TRUE, theme_dendro = FALSE)

# The above code should produce an output illustration which resembles an SPSS #

# graphic. #

# As was also the case within the SPSS example demonstration, we will choose 5 #
# clusters from which to classify our data observations. #

# k = 5 is designating the number of clusters to utilize for cluster categorization #

clusters2 <- cutree(clusters1, k = 5)
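The full sequence of dist(), hclust() and cutree() can be sketched on a built-in data set. The example below assumes USArrests as a stand-in (it is not the data set used in this article), and the names d, hc and groups are hypothetical. Unlike the code above, it scales the variables first, which is a choice made for this example only.

```r
# Stand-in data: the built-in USArrests data set (an assumption for illustration)
d  <- dist(scale(USArrests))          # Euclidean distances between observations
hc <- hclust(d, method = "average")   # average-linkage agglomeration

# Cut the tree into 5 clusters; returns one integer cluster label per observation
groups <- cutree(hc, k = 5)

# Number of observations within each cluster
print(table(groups))

# Cluster labels for the first few observations
print(head(groups))
```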

# Finally, we will download and enable the package: “dplyr” in order to utilize the #

# mutate() function. #

# This function allows us to create a new data frame which contains variable #

# observational values and their corresponding categorical distinctions. #

finalcluster <- mutate(HFrame, cluster = clusters2)

# The final data frame will resemble the following illustration #