Reflections of a Data Scientist: Removing Duplicate Entries (SPSS)

In a previous article, we discussed how to remove duplicate entries from data sets stored within the SAS platform. In this entry, we will discuss how to remove duplicate entries from data sets within the SPSS platform.

Example:

We will be utilizing a familiar data set to demonstrate the capacity of the SPSS program as it pertains to performing this function.

To check for duplicate entries, we must first select “Data” from the topmost menu. We will then select the option “Identify Duplicate Cases”.

This should cause the following menu to populate:

In this menu, we are presented with various options which pertain to variable qualifications and sorting. In the case of this example, we will designate “VARA” and “VARB” as the variables on which duplicate entries are to be identified. Two cases are treated as duplicates only if they match on both variables.

To achieve this, we will utilize the topmost center arrow to designate “VARA” and “VARB” as variables in which to “Define matching cases by”.

After following the aforementioned steps, the menu should resemble the graphic above. Click “OK” to proceed with the exercise.

The following tables are generated to the output screen:

The table entitled “Indicator of each last matching case as Primary” reports the number of duplicate cases and primary cases which were identified within the sample data set.

The data set itself has been modified through the addition of a column which contains information pertaining to each entry.
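
For readers who prefer syntax to menus, the steps above can be approximated as shown below. This is a minimal sketch and not the exact syntax that SPSS generates from the dialog. It assumes that the indicator column is named “PrimaryLast”, which to my knowledge is the default name the dialog assigns; adjust the name if your data set differs.

```spss
/* Sort so that cases matching on VARA and VARB are adjacent */

SORT CASES BY VARA (A) VARB (A).

/* Flag the last case in each matching group: 1 = primary, 0 = duplicate */

MATCH FILES
  /FILE=*
  /BY VARA VARB
  /LAST=PrimaryLast.

VALUE LABELS PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.

/* Count primary and duplicate cases */

FREQUENCIES VARIABLES=PrimaryLast.
```

As with the menu option, the data set is re-sorted by the matching variables as a side effect.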

As you can observe from the data above, cases which contained duplicate entries within columns “VARA” and “VARB” have been identified. Through the utilization of the additional column, you can now decide which cases ought to be deleted from the set.
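
Once you have reviewed the flagged cases, the duplicates can be dropped with a single selection. The sketch below assumes the indicator column is named “PrimaryLast” and is coded 1 for primary cases and 0 for duplicates; substitute your own column name if it differs.

```spss
/* Keep only the primary case from each group of matching cases */

SELECT IF (PrimaryLast = 1).
EXECUTE.
```

SELECT IF permanently removes the unselected cases from the active data set, so it is prudent to save a copy of the original file beforehand.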

* WARNING *

The duplicate identification function utilized by SPSS is sensitive to entry casing. This means that if variable “VARA” contained the entries “JACK” and “Jack”, neither entry would be marked as a duplicate of the other.

Therefore, to avoid such errors, string variable entries should be converted to a consistent case prior to performing the duplicate identification function.

Case modification can be achieved through the utilization of the following syntax:

/* Modify VARA and VARD to contain all upper case entries */

DO REPEAT var = VARA VARD.
COMPUTE var = UPCASE(var).
END REPEAT.
EXECUTE.

/* OR */

/* Modify VARA and VARD to contain all lower case entries */

DO REPEAT var = VARA VARD.
COMPUTE var = LOWER(var).
END REPEAT.
EXECUTE.

That’s all for now, Data Heads. Stay tuned for more exciting articles!


Friday, July 6, 2018
