Reflections of a Data Scientist: K-Means Cluster (SPSS)

The K-Means Cluster method is not exclusive to SPSS. However, several algorithms can be used to generate this type of cluster model, so the models produced by different platforms may differ in structure depending on the underlying algorithm that each program uses.

Both the K-Means Cluster method and the Two Step Cluster method perform the same function, in that they generate cluster grouping models. However, the two methods work differently, and the conclusions reached by each analytical process may differ. Therefore, though each method produces a similar type of model, the models themselves may be vastly different depending on the data and the algorithm used to create them.

It should also be noted that the K-Means Cluster method was not designed to include categorical variables in the model.

Example:

Above is the data set that was created as a result of the prior exercise. We will be using the variables “ZCont_Var1”, “ZCont_Var2”, and “ZCont_Var3” to create our model. The reason is that K-Means groups cases by the distances between them, so variables on a common scale are required; otherwise, the variable with the largest range dominates the result. If variables are originally on different scales, as was the case for “Cont_Var1”, “Cont_Var2” and “Cont_Var3”, they can be standardized with the “Descriptives” method. For more information on this step, please consult the prior article.
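For readers who want to see what standardization does to a column, here is a minimal Python sketch. It assumes the standardized value is the raw value minus the column mean, divided by the sample standard deviation. The raw numbers are hypothetical and are not taken from the data set above.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list of numbers to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [(v - m) / s for v in values]

# Hypothetical raw values standing in for one column such as Cont_Var1.
cont_var1 = [10.0, 20.0, 30.0, 40.0, 50.0]
zcont_var1 = z_scores(cont_var1)
```

After this step every column has a mean of 0 and a standard deviation of 1, so no single variable dominates the distance calculations.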

With the variables properly standardized, we can begin model creation by selecting “Analyze”, then “Classify”, followed by the “K-Means Cluster” option.

This sequence of selections presents us with the following menu:

Unlike the previous model, the K-Means method requires us to manually specify the number of clusters that we wish to create for analysis. The default number is “2”.

Selecting “Save” from the “K-Means Cluster Analysis” menu presents us with the following menu:

Select “Cluster membership” and “Distance from cluster center” from this interface, then click “Continue”.

Next, select “Options” and make sure that the boxes “Initial cluster centers”, “ANOVA table” and “Cluster information for each case” are checked.

Click “Continue” to return to the initial menu, then click “OK”. This will generate our output model.

You can ignore the “Initial Cluster Centers” table.

“Iteration History” is important, as it can be used to gauge the strength of the model. The K-Means method requires that the user select the number of clusters to include in the analysis, so finding an appropriate number of clusters requires some repetition on the part of the user. What “Iteration History” shows is how far each cluster center moved at every iteration of the algorithm. The algorithm has converged once that change reaches .000 for every cluster center (both centers, in this two-cluster example). There will be instances when 10 iterations are attempted and not a single cluster reaches a value of .000. To build the strongest model possible, a user can create multiple models with the same data set, changing only the “Number of Clusters”. As a rule of thumb, the model that needs the fewest iterations to reach .000 across all clusters is the one to implement. Quick convergence alone does not guarantee meaningful clusters, however, so the remaining output tables should be reviewed as well.
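To make the idea of an iteration history concrete, the following is a minimal Python sketch of a standard Lloyd-style k-means loop that records how far each center moves per iteration. It is an illustration only and is not SPSS's own implementation. The function name `k_means`, the four data points, and the choice of initial centers are all hypothetical.

```python
import math

def k_means(points, centers, max_iter=10):
    """Lloyd-style k-means; returns final centers, labels, and per-iteration center changes."""
    history = []
    for _ in range(max_iter):
        # Assignment step: each case joins its nearest center.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its cases.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(tuple(old))  # an empty cluster keeps its center
        # One row of the iteration history: how far each center moved.
        changes = [math.dist(a, b) for a, b in zip(centers, new_centers)]
        history.append(changes)
        centers = new_centers
        if max(changes) == 0.0:  # every center has stopped moving
            break
    return centers, labels, history

# Hypothetical standardized cases with three variables each.
points = [(-1.0, -1.2, -0.9), (-0.8, -1.0, -1.1), (1.0, 0.9, 1.1), (0.9, 1.2, 1.0)]
centers, labels, history = k_means(points, [points[0], points[2]])
```

Each entry in `history` plays the role of one row of the “Iteration History” table. Running the function again with a different number of initial centers mirrors the repetition described above.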

The “Cluster Membership” table lists each observation and the cluster to which it belongs. The “Distance” column gives the distance between the observation and the center of its assigned cluster. The smaller this distance, the more closely the observation resembles its cluster.
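As a rough illustration of how a membership and distance pair could be derived for a single case, here is a short Python sketch that assumes Euclidean distance. The helper name `assign` and the center values are hypothetical.

```python
import math

def assign(case, centers):
    """Return (1-based cluster number, distance from the case to that cluster's center)."""
    distances = [math.dist(case, c) for c in centers]
    nearest = min(range(len(centers)), key=distances.__getitem__)
    return nearest + 1, distances[nearest]

# Hypothetical final cluster centers for three standardized variables.
final_centers = [(-0.9, -1.1, -1.0), (0.95, 1.05, 1.05)]
cluster, distance = assign((1.0, 0.9, 1.1), final_centers)
```

Here the case falls in the second cluster, and `distance` is how far it sits from that cluster's center.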

“Final Cluster Centers” provides the final values of the cluster centers used by the model. This data can be graphed for analytic purposes. To do so, click the “Final Cluster Centers” table a few times until you are able to select the values within it. Highlight all of the cells, and then right-click. A menu should appear. From this menu, select “Create Graph” and then select “Bar”.

This should create the following graph:

Based on this graphic, we can assume that the model is separating the data into meaningfully different clusters. The reason for stating such is that, in the illustration, the clusters mirror one another across the variables: each cluster represents an inverse measurement of the other.

“Distance between Final Cluster Centers” illustrates exactly what the title implies. Larger values are preferable for these cell entries, as they indicate clusters that sit further apart.
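A table of this kind can be sketched in Python by taking the Euclidean distance between every pair of centers. Euclidean distance is an assumption here, and the center values are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical final cluster centers, keyed by cluster number.
final_centers = {1: (-0.9, -1.1, -1.0), 2: (0.95, 1.05, 1.05)}

# Distance between every pair of cluster centers.
center_distances = {
    (a, b): math.dist(final_centers[a], final_centers[b])
    for a, b in combinations(final_centers, 2)
}
```

With two clusters there is a single pair, `(1, 2)`. With more clusters the dictionary holds one entry per pair.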

We have discussed the “ANOVA” process in prior articles. In this output, the “ANOVA” table measures how strongly each variable differs between the clusters. Significance values of less than .05 are optimal.
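The F value behind each row of such a table can be sketched as a one-way ANOVA of a single variable across the clusters. The Python below is a minimal illustration with hypothetical data. It stops at the F statistic, since turning F into a significance value requires the F distribution, which the standard library does not provide. SPSS itself labels these F tests as descriptive, because the clusters were chosen to maximize the differences between them.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic for one variable grouped by cluster label."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of the cases around their own cluster mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical standardized values of one variable and their cluster numbers.
f_value = anova_f([-1.0, -0.8, 1.0, 0.9], [1, 1, 2, 2])
```

A large F value means the variable differs far more between clusters than within them.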

“Number of Cases in each Cluster” shows the number of data points assigned to each cluster. As with the Two Step Cluster method, it is best to have clusters that are approximately equal in size.

Wednesday, January 31, 2018