Friday, May 12, 2017

NODUPKEY / DUPOUT (SAS)

Today’s article will discuss another incredibly important SAS feature. The feature that I will be discussing is NODUPKEY.

NODUPKEY is essentially a way of removing duplicates from a SAS set. When specified, the SAS program will scan a data set for all duplicate values of a variable indicated by the user. Every duplicate value that occurs after the first occurrence of the variable value, will be deleted. A good best practice is to always make a copy of the original set prior to utilizing NODUPKEY.

The basic format for this code statement is:

PROC SORT DATA = <DATASET> NODUPKEY;
BY <VARIABLE>;
RUN;


So if we were to run the following example code:

PROC SORT DATA = SETA NODUPKEY;
BY VARA;
RUN;


On our example set:



The outcome would resemble:



As you can see, the duplicate values for VARA have been removed from the set.

There may be instances where you may want to keep track of the row observations that SAS removed from the data set. For this scenario, DUPOUT can be utilized as an option which can be invoked along with NODUPKEY.

The basic format for this code statement is:

PROC SORT DATA = <DATASET> NODUPKEY DUPOUT= <OUTPUTSETNAME>;
BY <VARIABLE>
RUN;


If we were to run this example code:

PROC SORT DATA = SETA NODUPKEY DUPOUT=DUPOUTA;
BY VARA;
RUN;


On our example set.

The result would be, that within the “DUPOUTA” set, the following entries would exist:



You are not limited to sorting by a single variable when removing duplicate entries. For example, if you wanted to remove entries within a set that had two duplicate variables, you could list two variables in which you wanted to sort by.

PROC SORT DATA = <DATASET> NODUPKEY;
BY <VARIABLE> <VARIABLE>;
RUN;


If we were to run this example code:

PROC SORT DATA = SETA NODUPKEY;
BY VARA VARB;
RUN;


On our example set, the result would be:



If you wish to remove only entries which are EXACT duplicates of other entries, based on all data variables, you have the ability to do such by utilizing the _ALL_ option.

For example:

PROC SORT DATA = <DATASET> NODUPKEY;
BY _ALL_;
RUN;


If we were to run this example code:

PROC SORT DATA = SETA NODUPKEY;
BY _All_;
RUN;


On our example set, the result would be



























The NODUPKEY / DUPOUT sort options are incredibly important, and should be mastered by all SAS users. Remember, always be sure to make a copy of your original data set before utilizing the NODUPKEY option, and always be sure to review your data after the NODUPKEY option has been utilized.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.