Reflections of a Data Scientist: January 2018

Wednesday, January 31, 2018

K-Means Cluster (SPSS)

The K-Means Cluster method is not exclusive to SPSS. However, several algorithms can be used to generate this type of cluster model, so the models produced by different platforms may differ depending on the underlying algorithm that each program uses.

Both the K-Means Cluster method and the Two Step Cluster method perform the same function, in that they generate models of cluster groupings. However, the methods used, and the conclusions reached by each analytical process, are different. Therefore, though each method produces a similar type of model, the models themselves may differ vastly depending on the data and the algorithm used to create them.

It should also be noted that the K-Means Cluster method was not designed to include categorical variables, as it relies on distances between numerical values.

Example:

Above is the data set that was created as a result of the prior exercise. We will be using the variables “ZCont_Var1”, “ZCont_Var2”, and “ZCont_Var3” to create our model, because variables of equal scale are required to generate a meaningful model. If variables are originally on different scales, as was the case for “Cont_Var1”, “Cont_Var2” and “Cont_Var3”, they can be standardized with the “Descriptives” method. For more information, please consult the prior article.

With the variables properly standardized, we can begin model creation by selecting “Analyze”, then “Classify”, followed by the “K-Means Cluster” option.

This sequence of selections presents us with the following menu:

Unlike the previous model, the K-Means method requires that we manually specify the number of clusters to create. The default number is “2”.

Selecting “Save” from the “K-Means Cluster Analysis” menu presents us with the following menu:

Select “Cluster membership” and “Distance from cluster center” from this interface, then click “Continue”.

Next, select “Options” and make sure that the boxes “Initial cluster centers”, “ANOVA table” and “Cluster information for each case” are checked.

Click “Continue” to return to the initial menu, then click “OK”. This will generate our output model.

You can ignore the “Initial Cluster Centers” table.

“Iteration History” is important, as it shows how the model converged. Each entry is the change in a cluster center from one iteration to the next, and the algorithm has converged once this change reaches .000 for every cluster center. There will be instances when the maximum of 10 iterations is reached and not a single cluster center has reached a change of .000. Because the K-Means method requires the user to select the number of clusters, finding an appropriate number requires some repetition: create multiple models with the same data set, changing only the “Number of Clusters”. As a rule of thumb, I prefer the model which needs the fewest iterations to reach .000 across clusters, weighed against the other output tables discussed below.
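To make the idea of an iteration history concrete, here is a minimal Python sketch of the generic k-means procedure (Lloyd's algorithm). It is not SPSS's own implementation; the function name `kmeans`, the `tol` parameter, the sample points and the choice of starting centers are all assumptions made for illustration.

```python
import math

def kmeans(points, centers, max_iter=10, tol=0.0):
    """Generic Lloyd's k-means. Returns final centers, labels, and the
    change in each center at every iteration (the 'iteration history')."""
    history, labels = [], []
    for _ in range(max_iter):
        # Assignment step: each case joins its nearest center (Euclidean distance).
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of the cases assigned to it.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        shifts = [math.dist(a, b) for a, b in zip(centers, new_centers)]
        history.append(shifts)
        centers = new_centers
        if max(shifts) <= tol:  # converged: no center moved
            break
    return centers, labels, history

# Hypothetical standardized cases on two variables.
points = [(-1.2, -0.9), (-1.0, -1.1), (-0.8, -1.0),
          (1.0, 0.9), (1.1, 1.2), (0.9, 0.9)]
final_centers, labels, history = kmeans(points, [points[0], points[3]])
for i, shifts in enumerate(history, start=1):
    print(i, [round(s, 3) for s in shifts])
```

With this toy data the centers move on the first iteration and do not move on the second, which corresponds to a row of .000 values in the table.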

The “Cluster Membership” table lists each observation and the cluster to which it belongs. The “Distance” column is the distance between the observation and the center of its assigned cluster. The smaller this distance, the more closely the observation resembles its cluster.

“Final Cluster Centers” provides the final values of the cluster centers used by the model. This data can be graphed for analytic purposes. To do so, click the “Final Cluster Centers” table a few times. Eventually you will be able to select the values contained within the table. Highlight all of the cells, and then right-click. This should cause a menu to appear. From this menu, select “Create Graph” and then select “Bar”.

This should create the following graph:

Based on this graphic, we can assume that the model is separating the data in a meaningful way. The reason for stating such is that each cluster is the inverse of the other across the selected variables: where one cluster center sits above the mean of a variable, the other sits below it.

“Distance between Final Cluster Centers” reports the distance between each pair of final cluster centers. Larger values are preferable, as they indicate clusters which are further apart.
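Both distance figures in the output are ordinary Euclidean distances. The Python sketch below shows the arithmetic, using hypothetical center values and one hypothetical case (none of these numbers come from the example data set).

```python
import math

# Hypothetical final cluster centers on three standardized variables.
centers = {1: (-0.8, -0.7, -0.9), 2: (0.8, 0.7, 0.9)}
case = (-1.0, -0.5, -1.1)  # hypothetical case assigned to cluster 1

# "Distance" in the Cluster Membership table: case to its own cluster center.
case_distance = math.dist(case, centers[1])

# "Distance between Final Cluster Centers": center to center.
center_distance = math.dist(centers[1], centers[2])

print(round(case_distance, 3), round(center_distance, 3))
```

A small case distance together with a large center distance is the pattern one would hope to see in a well-separated model.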

We have discussed the “ANOVA” process in prior articles. In this output, the “ANOVA” table measures how strongly each variable differs between the clusters. A “Sig.” value of less than .05 indicates that the variable is contributing to the separation of the clusters. Because the clusters were chosen to maximize these differences, the F tests should be treated as descriptive only.

“Number of Cases in each Cluster” reports the number of data points assigned to each cluster. As with the Two Step Cluster, it is best to have clusters which are approximately equal in size.
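For readers who want to see where the F value for a single variable comes from, the following Python sketch computes a one-way ANOVA F statistic across clusters. The function name `f_statistic` and the values are hypothetical, and the sketch stops at the F value; it does not compute the significance level.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F: between-cluster mean square over within-cluster mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical standardized values of one variable, split by cluster membership.
cluster_1 = [-1.2, -1.0, -0.8]
cluster_2 = [0.8, 1.0, 1.2]
f_value = f_statistic([cluster_1, cluster_2])
print(round(f_value, 2))
```

A large F value means the cluster means for that variable are far apart relative to the spread within each cluster.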

Tuesday, January 30, 2018

“Descriptives” (SPSS)

“Descriptives” is a method embedded within the SPSS software package. This method can be used to generate descriptive statistics for variables within an SPSS data set. We will use a modified version of the previous data set for our example.

Example:

From the “Analyze” menu, select “Descriptive Statistics”, then choose “Descriptives”.

This will present you with the menu below:

The variables which will be analyzed for output have been moved to the right side of the input screen.

The option “Save standardized values as variables” has been selected. You will have to manually select this option as it is not checked by default. Once this is complete, click “OK”.

The output that is produced is as follows:

The summary statistics table presents summary information for each variable.

The primary reason for this exercise was to illustrate the option selected prior to the output's generation: “Save standardized values as variables”.

If we re-visit our data sheet, we will notice that additional columns have been added.

The data contained in each new column represents the z-score of each value of the corresponding original variable. This may seem like irrelevant information given the example. However, its importance will be demonstrated in the next example.
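A z-score is simply the value minus the variable's mean, divided by the variable's standard deviation. The Python sketch below illustrates the calculation, assuming the sample (n - 1) standard deviation is used; the function name `z_scores` and the values are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values using the mean and the sample (n - 1) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]

# Hypothetical values standing in for one continuous column.
cont_var = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_scores(cont_var)
print([round(v, 4) for v in z])
```

After standardization every variable has a mean of 0 and a standard deviation of 1, which places variables with different units on the same scale.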

Graphing Cluster Output (SPSS)

The single most endearing aspect of creating cluster models is the beautiful and eclectic graphs that can be generated from the output. The methodology illustrated in this example can be used for any cluster model; it is not limited to Two Step Cluster Analysis.

We will be using the data that was generated from our prior example.

Example:

To begin, select “Graphs” from the topmost menu, then select “Chart Builder”.

This should generate the following interface.

From this interface, select “Scatter/Dot” from the “Gallery” option. Drag the variable “Two Step Cluster Number” to the upper right corner of the illustration so that it rests in the box labeled “Set color”. You may drag whatever variables you wish to the left and bottom of the example graph. These variables will represent the X and Y axis of your graphical output.

The initial output will resemble:

Double click on the graphic to enable a customization menu.

In the “Chart Editor” menu, click on the circle to the left of “1”. This will launch an additional menu.

From this menu, by changing the colors of “Fill” and “Border”, you can increase the readability of your output.

We will perform the same step for the variable “1” within the legend.

The final product will resemble the image below.

Typically, if the model is created with care, and is of high “Cluster Quality”, the graphical output will resemble something such as:

Image Source: https://en.wikipedia.org/wiki/Cluster_analysis

In the next article, we will be discussing the SPSS function “Descriptives”. This function is necessary for the creation of K-Means Clusters, which will be the topic of the subsequent article.

Monday, January 29, 2018

Two Step Cluster (SPSS)

Two Step Cluster Analysis is a synthetic methodology that uses algorithms to create groupings based on similarities between collections of variables within a single data set. The procedure itself is not a standard statistical concept, and you will not typically find it discussed within statistics textbooks. Two Step Cluster Analysis exists only within SPSS. The algorithm that creates the output is proprietary, and therefore cannot readily be re-produced by hand. For this reason, I would recommend that the method be used sparingly, and only in certain situations. Additionally, model output should always be provided along with the SPSS syntax (code) and data.

Example:

To create a two step cluster analysis within SPSS, first choose “Analyze” from the top drop down menu. After this option has been selected, choose “Classify”, and then choose “TwoStep Cluster”.

You will be presented with a menu which presents the following options:

For our example model, we will be creating analytic output which includes the categorical variables “Cat_Var1” and “Cat_Var2”, along with the continuous variable “Cont_Var1”. For “Distance Measure”, if your model does not contain categorical variables, and if you wish to manually specify the “Number of Clusters”, it is best to change the option from “Log-likelihood” to “Euclidean”.

TwoStep Cluster Analysis (Menu Explanation)

This menu establishes the parameters to which the model will adhere. If “Determine automatically” is selected, the algorithm will analyze the selected variables and produce the number of groupings that it determines most appropriate for the data.

If “Specify fixed” is selected, the algorithm will attempt to create the number of groupings specified by the user, and this number will be used for the model output. (Reminder: If your model consists of only continuous variables, and you would like to specify the number of groupings, it is best to change “Distance Measure” to “Euclidean”.)

Example (cont.)

If we were to continue with our example and select “Output” from the above menu, we would be presented with the following:

Let’s select “Cont_Var2” from our “Variables” list. This will move the selection to the “Evaluation Fields” box. Selecting “Create cluster membership variable” beneath the “Working Data File” header will write each case's cluster number to the data table once the model has been created.

Clicking on “Continue” from this menu, and “OK” from the prior menu, will provide the following output:

Model Summary (Explanation)

Model Summary

Algorithm – This cell entry is providing the algorithm utilized to create the model.

Inputs – This cell entry is providing the number of inputs utilized to create the model.

Clusters – This cell entry is providing the number of clusters produced by the sorting algorithm.

Cluster Quality

This output is illustrating the overall strength of the model.

(Double clicking on the TwoStep Cluster output provides the following illustration)

The above output is a graphical illustration of the clusters which, combined, represent the model in its entirety.

Chart (Explanation)

Size of Smallest Cluster – This is the number of entries in the smallest cluster. To the right of this value is the percentage of the model which the cluster represents.

Size of Largest Cluster – This is the number of entries in the largest cluster. To the right of this value is the percentage of the model which the cluster represents.

Ratio of Sizes: Largest Cluster to Smallest Cluster – This value is the ratio produced when the size of the largest cluster is divided by the size of the smallest cluster. The value of this ratio should be no greater than 2.
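The three figures above can be reproduced from the cluster membership column with simple counting. The Python sketch below uses a hypothetical membership list; the threshold of 2 is the guideline stated above.

```python
from collections import Counter

# Hypothetical cluster membership values, one per case.
membership = [1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1]

sizes = Counter(membership)        # number of cases in each cluster
total = sum(sizes.values())
largest = max(sizes.values())
smallest = min(sizes.values())
ratio = largest / smallest         # "Ratio of Sizes"

for cluster, size in sorted(sizes.items()):
    print(cluster, size, f"{100 * size / total:.1f}%")
print("ratio:", round(ratio, 2), "acceptable:", ratio <= 2)
```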

If you change the “View” in the menu below the graphical output from “Cluster Sizes” to “Predictor Importance”, you will be presented with the following graphic:

If you change the “View” in the menu below the model summary from “Model Summary” to “Clusters”, you will be presented with the following table:

Within this table, we are presented with the following:

Cluster – Each cluster segmented by a numerical value.

Label – There is no default label provided. However, if you would like to create a label for cluster “1”, this field enables you to do so.

Description - There is no default description provided. If you would like to create a description for cluster “1”, this field enables you to do so.

Size – The size of each cluster relative to the total number of observations contained within the model. It is shown as a percentage of the total model, followed by the number of observations within the cluster in parentheses.

Inputs

Listed in the order of predictive importance are the variables which make up each cluster. If you hover your mouse above any cell, a box will appear which contains a key pertaining to what is represented within the cell.

If a variable is categorical, its most frequent category is listed along with the frequency of its occurrence within the group.

If a variable is continuous, its mean value is listed instead.
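The cell contents described above (the most frequent category with its share for a categorical input, and the mean for a continuous input) can be sketched in Python as follows. The case values, the tuple layout and the function name `profile` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical cases: (cluster number, Cat_Var1 value, Cont_Var1 value).
cases = [
    (1, "A", 10.0), (1, "A", 12.0), (1, "B", 11.0),
    (2, "B", 20.0), (2, "B", 22.0), (2, "A", 24.0),
]

def profile(cases):
    """Per cluster: the modal category with its share, and the continuous mean."""
    summary = {}
    for cluster in sorted({c[0] for c in cases}):
        members = [c for c in cases if c[0] == cluster]
        category, count = Counter(c[1] for c in members).most_common(1)[0]
        summary[cluster] = {
            "mode": category,
            "mode_share": count / len(members),
            "mean": mean(c[2] for c in members),
        }
    return summary

result = profile(cases)
for cluster, stats in result.items():
    print(cluster, stats)
```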

You may recall that, at the beginning of this exercise, we selected “Cont_Var2” as our “Evaluation Field”. These next steps will demonstrate what this accomplished.

On the bottom right side of the menu bar which is displayed beneath the table graphic, there is a button that reads “Display”. Click this button and then select the option “Evaluation Fields”. This should populate “Cont_Var2” within the “Fields” box. Click “OK” after this variable appears.

This adds a bottom row to the chart which contains the previously selected variable.

This variable was not included in the creation of the model, however, its values are displayed as if it were part of the model. This allows for the comparison of non-model variables to the clusters created by the algorithm.

Returning to the initial data set, you will notice that an additional column has been created.

This column indicates the cluster to which each observation belongs. This data can be used to graph the findings of the model, a topic which will be discussed in our next entry.

Sunday, January 28, 2018

“Means” (SPSS)

The “Means” function within SPSS generates summary statistic data and allows for the comparison of categorical means within a data set. This feature is similar to the function summary(), which is found within the “R” platform.

I will demonstrate the premise of the function with an example:

The column: “VAR00001” represents categorical data.

The column: “VAR00002” represents numerical data.

To perform the “Means” function, first select “Analyze” from the top menu, then select “Compare Means”. After this selection has been made, click on the option “Means”.

This should bring up the following menu. I have selected “Options” from this interface, and have chosen additional statistics to be included in the analysis. The “Dependent List” variable will be your numerical variable, and “Layer 1 of 1” will contain your categorical variable.

Clicking “OK” produces the output:

Case Processing Summary

Included – These two columns contain the count (N) and percentage of the total of the number of sample values that were included in the analysis.

Excluded – These two columns contain the count (N) and percentage of the total of the number of sample values that were excluded from the analysis. Excluded values are those numerical values which did not contain input data.

Total – These two columns contain the count (N) and percentage of the total of the number of sample values, both included and excluded.

Report

Mean – The mean of the values contained within the category.

N – The number of observed values which are included within the category.

Std. Deviation – The standard deviation of the values contained within the category.

Minimum – The minimum value of the observed set of values contained within the category.

Maximum - The maximum value of the observed set of values contained within the category.

Median – The median value of the observed set of values contained within the category.

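How to reproduce this analysis within the “R” platform:

# Numerical values grouped by category
Cat1 <- c(12)
Cat2 <- c(11, 13, 14, 19)
Cat3 <- c(17, 18)

# Summary statistics and standard deviation for each category
summary(Cat1)
sd(Cat1)  # returns NA, as Cat1 contains a single value

summary(Cat2)
sd(Cat2)

summary(Cat3)
sd(Cat3)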

Which produces the output:

> summary(Cat1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12 12 12 12 12 12
> sd(Cat1)
[1] NA
>
> summary(Cat2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.00 12.50 13.50 14.25 15.25 19.00
> sd(Cat2)
[1] 3.40343
>
> summary(Cat3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 17.25 17.50 17.50 17.75 18.00
> sd(Cat3)
[1] 0.7071068

Tuesday, January 16, 2018

Summary Independent-Samples T Test (Two Sample T-Test) (SPSS)

In a previous article, I demonstrated within the “R” platform, how to perform a one sample t-test with only summary information. In this article, we will examine how to perform a two sample t-test within SPSS given only summary information.

Using a slightly modified example problem from a prior exercise:

Example:

A scientist creates a chemical which he believes changes the temperature of water. He applies this chemical to water and takes the following measurements:

Summary Data Pertaining to First Set of Measurements

N = 8

SD = 2.167124

Mean = 72.875

He then measures the temperature of samples to which the chemical was not applied.

Summary Data Pertaining to the Second Set of Measurements

N = 8

SD = 1.669046

Mean = 75.25

Can the scientist conclude, at a 95% confidence level, that his chemical is in some way altering the temperature of the water?

To perform this analysis, select “Compare Means” from the “Analyze” drop down, then select “Summary Independent-Samples T Test”.

You should be presented with the menu below. Simply enter the information that was provided in the example problem and click “OK”. You have the option to change the Confidence Level (%), but since we are assuming a 95% confidence level, doing so is unnecessary.

This presents the output:

Typically, unless stated otherwise for academic purposes, you will assume that variances are equal.

From this output we can conclude:

With a p-value of 0.028 (.028 < .05), and a corresponding t-value of -2.456, we can state, at a 95% confidence level, that the scientist's chemical is altering the temperature of the water.
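The t-value in this output can be checked by hand from the summary data alone. The Python sketch below applies the equal-variance (pooled) two-sample formula to the figures given in the example. The function name `pooled_t` is hypothetical, and the critical value is taken from a standard t table rather than computed.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic and degrees of freedom from summary data."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

# Summary data from the example problem.
t_value, df = pooled_t(72.875, 2.167124, 8, 75.25, 1.669046, 8)

T_CRITICAL = 2.145  # two-tailed critical value for alpha = .05 with 14 df (t table)
reject_null = abs(t_value) > T_CRITICAL
print(round(t_value, 3), df, reject_null)
```

The t statistic rounds to -2.456 with 14 degrees of freedom, and its absolute value exceeds the critical value, which agrees with the p-value falling below .05.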

One-Sample T-Test (SPSS)

In further exploring the utilization of the various built in functions of SPSS, today we will be assessing the usage of the One-Sample T-Test.

A One-Sample T-Test measures the significance of a sample data set’s mean against the known, or assumed, mean of a population.

Example:

A high school gym instructor measures how many push-ups each individual student can perform on the school’s intramural day. His results are as follows:

Is the mean of the set, assuming an alpha of .05, significantly different from the national average of push-ups by student (18)?

To perform this calculation in SPSS, first choose “Analyze” from the top menu, then choose “Compare Means”, and finally, select “One-Sample T Test”.

Performing the previous tasks should bring up the menu below. “Test Variable(s)” will be the variable set that you wish to analyze, and “Options” will allow you to change the confidence interval percentage. Since our alpha is .05, we will leave the “Confidence Interval Percentage” at 95%.

This produces the output:

Our hypothesis test for this scenario is:

H0: µ = 18 (The mean number of push-ups for these students is equal to the national average)

H1: µ ≠ 18 (The mean number of push-ups for these students is not equal to the national average)

Since we are looking for a difference in either direction, our test will be two tailed.

With a p value of .166, we cannot reject the null hypothesis, and therefore cannot conclude that the sample mean significantly differs from the population mean.
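The test statistic behind this output is the difference between the sample mean and the assumed population mean, divided by the standard error of the sample mean. The Python sketch below shows the calculation with hypothetical push-up counts. These are not the instructor's data, so the result will not match the SPSS output above. The function name `one_sample_t` is also hypothetical, and the critical value comes from a standard t table.

```python
import math
from statistics import mean, stdev

def one_sample_t(values, mu):
    """One-sample t statistic and degrees of freedom against an assumed mean mu."""
    n = len(values)
    se = stdev(values) / math.sqrt(n)  # standard error of the sample mean
    return (mean(values) - mu) / se, n - 1

# Hypothetical push-up counts for ten students.
push_ups = [15, 20, 17, 22, 19, 14, 18, 21, 16, 23]
t_value, df = one_sample_t(push_ups, 18)

T_CRITICAL = 2.262  # two-tailed critical value for alpha = .05 with 9 df (t table)
reject_null = abs(t_value) > T_CRITICAL
print(round(t_value, 3), df, reject_null)
```

For these hypothetical values the statistic falls well inside the critical bounds, so the null hypothesis would not be rejected.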