Reflections of a Data Scientist: Removing Duplicate Entries (MS-Excel)

In previous articles, we discussed the steps required to remove duplicate entries from data sets stored within the SPSS and SAS platforms. In this entry, we will discuss how to achieve the same results as it pertains to data sets which are stored within the Excel platform.

Example:

Again, we will be using a familiar data set to illustrate the functionality of the platform.

The functionality of Excel is a bit limited as compared to the prior platforms. I make this statement based on the limitations which exist as it pertains to the selection of variables which can be identified in tandem. The Excel platform can only identify a single variable in which to filter for duplicate entries. Meaning, that if we desired to sort for duplicate entries across multiple variables, we would be unable to do so within the current medium.

However, if we wished to sort for, and subsequently remove duplicate values within a single column variable, we could do so by following the proceeding steps.

To begin, we will select the variable column in which we wish to utilize as our primary key from which duplicates will be identified. In the case of our example, the variable which we will select will be “VARA”. With this decided, we must highlight all of the rows within the aforementioned variable column.

Once this has been achieved, click on the “Home” tab within the menu ribbon, and then click on “Conditional Formatting”.

Clicking on “Conditional Formatting” should cause this menu to appear:

From the menu, click on the option “Highlight Cell Rules”, followed by “Duplicate Values”.

You should notice the following changes have been made within the work sheet:

“Light Red Fill with Dark Red Text” is the default coloration used by the platform to identify duplicate values. However, you are presented with the option to select a differing methodology of identification within the generated interface.

While Excel lacks the ability to identify unique entries across multiple columns, it does possess the capacity to completely remove entries specified in this manner.

For example, if we wished to create a new workbook which only contained completely unique entries across all columns, we could achieve such by completing the following steps.

First, we would have to select all of the row observations from which we wish to sort.

Once this has been achieved, click on the “Data” tab within the menu ribbon, and then click on “Remove Duplicates”.

This should cause the following menu to appear:

As we want to sort across all variable entries, we will leave all of the variable boxes selected.

Next, click “OK”.

This will generate the following prompt:

And subsequently, the data worksheet should resemble the following:

The transformation indicates that the process has correctly performed its function in removing non-unique entries.

Reflections of a Data Scientist

Thursday, July 12, 2018

Removing Duplicate Entries (MS-Excel)

No comments:

Post a Comment