Reflections of a Data Scientist: May 2017

Wednesday, May 24, 2017

SAS Arrays (SAS)

Certain aspects of the SAS programming language do not comply with established programming paradigms. This is abundantly clear when utilizing SAS arrays. In this article, I hope to clarify how basic arrays can be utilized within the SAS programming language.

SAS allows you to define the number of elements within an array. However, I am not sure why you would ever want to do this. Instead, I would recommend using the ‘*’ (wildcard character) to define the number of array elements. This enables you to add array elements without having to manually re-count the number of elements each time a new array element is manually added.

We could create an array such as:

ARRAY <ArrayName> {*} /* Number of elements contained within the array */ <Array elements separated by spaces> ;

Do i = <number1> to <number2>; /* Creates and initiates the counter variable, unfortunately, you will need to manually count the array elements. The sum of elements is what the "i" value will equal. */

In this case, the array will scan for values within a data set, and create a variable flag if matching values are found.

If strip(<ARRAYNAME{i}>) in ('1' , '2', '3', '4') then FLAGVAR = 1;

End;

Run;

The strip function removes potential blank spaces from variable observations listed as array elements. This function is utilized to reduce potential errors when matching data values.

As an example, let us utilize the following code and data set:

Data Fruitsetb (drop=i);

Set Fruitseta;

ARRAY FruitArray {*} /* Number of elements contained within the array */ VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 ;

Do i = 1 to 6; /* Creates and initiates the counter variable, unfortunately, you will need to manually count the array elements. The sum of elements is what the "i" value will equal. */

If strip(FruitArray {i}) in('Apples' , 'Oranges', 'Cherries') then FLAGVAR = 1;

End;

Run;

In the example, we are scanning across the variable columns (VAR1, VAR2, VAR3, VAR4, VAR5, VAR6) and looking for observations which contain (Apples OR Oranges OR Cherries). For each row observation which contains a match, a flag variable of 1 is generated. '(drop = i);' removes the newly created counter variable column.

The new set which is created from this code would resemble:

Since you are scanning across sequential variables, you could also utilize the following code and obtain the same results:

Data Fruitsetb (drop=i);

Set Fruitseta;

ARRAY FruitArray {*} /* Number of elements contained within the array */ VAR1-VAR6 ;

Do i = 1 to 6; /* Creates and initiates the counter variable, unfortunately, you will need to manually count the array elements. The sum of elements is what the "i" value will equal. */

If strip(FruitArray {i}) in('Apples' , 'Oranges', 'Cherries') then FLAGVAR = 1;

End;

Run;

This is just a basic introduction to arrays. SAS Arrays have many other uses which can accomplish more difficult tasks. For more information on SAS Arrays, please visit The SAS University website.

Tuesday, May 23, 2017

SAS Macro – Slider (SAS)

Though not included as a part of our initial example set, there will be instances while working with data, where, for cosmetic purposes, you may want to move data variables from the right side of the data set, to the left side of the data set. This can only be achieved if the data values which are being consolidated are of the same data type.

Example Set:

To consolidate variables to the leftmost portion of the data set to eliminate blank variable columns, run the following macro code:

%MACRO SLIDERA;

%do e = 1 %to 4; /* Number of variable columns which need to be consolidated. */
%let y = %eval(&e+1); /* Number of variable columns that need to consolidated plus 1. */

/* Modify the values below to match the data variables of your non-sample set. It is important that variable names are listed in sequential order from left to right. Ex. VAR | VAR1 | VAR2 | VAR3 | etc.*/

If DATAVAL = "" then
do;
if DATAVAL NE "" then
do;
DATAVAL = DATAVAL1;
DATAVAL1 = "";
end;
end;

If DATAVAL&e = "" then
do;
if DATAVAL&y NE "" then
do;
DATAVAL&e = DATAVAL&y;
DATAVAL&y = "";
end;
end;

%end;

%mend;

This code will need to be run once for each movement left that is made per variable. This is achieved as an aspect of the next macro. I would recommend assigning whatever variable value that was initially assigned to variable ‘e’ in the macro above, to the variable value of‘f’ in the macro below.

%MACRO SLIDERB;

%do f = 1 %to 4;

Data SETA ; /* The name of your data set. */
Set SETA ; /* The name of your data set. */

%SLIDERA;

Run;

%end;

%mend;

Now run the macro 'SLIDERB'.

%SLIDERB;

The macro ‘SLIDERB’ contains the macro ‘SLIDERA’. When initialized, ‘SLIDERB’ runs the ‘SLIDERA’ macro a total of 4 times. Each time the macro ‘SLIDERB’ loops, it overwrites the previous data set. In doing so, it moves each value over one time, and then in looping, it re-initializes the process of moving each variable value over again. The final set should resemble:

Given the size of certain data sets and the amount of variables that they may contain, running this particular set of macros may be more efficient if run over-night. I would always recommend copying the initial set that you wish to modify before utilizing this macro.

SAS Macro – U-Turn (SAS) (Pt. 3)

This macro is to be utilized in tandem with the macro detailed in the previous article.

Already, from utilizing the Splitter Macro, we have separated all duplicate entries from the original data set so that they are contained in unique and separate sets. Next, utilizing the Namer Macro, we renamed all of the separate data set variables. While these are novel concepts, for the data to have any real useful purpose, it must be re-joined to the original set.

The goal is to create a set in which data entries that contain similar unique primary identifiers, have their entries listed as additional variables within the same observation. This sounds much more complicated than it actually is.

To achieve this goal, we will be utilizing the U-Turn Macro. The code can be found below:

%macro uturn;

%do i = 1 %to 3; /* Number of Dupout sets */

proc sql;
create table Datasetb1 as
select a.*, b.*
from Datasetb1 as a left join dupout&i as b
on a.VARA = b.VARA&i
;
quit;

%end;

%mend;

The “i” macro variable must be modified to the total number of dupout sets created by the splitter macro.

Illustrated, what is occurring is shown below:

If you have followed the steps properly by compiling and running the macros described throughout the previous articles, your data should resemble our example set:

In the next article, I will demonstrate another macro which performs an-unrelated function as it pertains to our previous examples.

SAS Macro – Namer (SAS) (Pt. 2)

In order to re-join the unique data sets back together into one single set, the variable names must be modified so that data integrity is maintained. To achieve this, compile and run the following macro:

%macro namer;

%do i = 1 %to 3; /* Number of Dupout sets */

Data DupOut&i;
Set DupOut&i;
Rename VARA = VARA&i;
Rename DATAVAL = DATAVAL&i;
Run;

%end;

%mend;

“i” in this particular case is going to represent the total number of sets that were created from the splitter macro.

Illustrated, what this macro accomplishes is the following:

In the next article, we will review the final macro, and the last step required to re-assemble the data set.

SAS Macro – Splitter (SAS) (Pt. 1)

Throughout the previous entries, I promised that I would upload some of the basic macros that I have created in SAS. After scouring large amounts of program code that I’ve written for previous work assignments, I was able to decide on the macros that I found to be most helpful and re-useable. These macros will be posted to this blog.

The first macro that I will be demonstrating is the splitter macro. This macro was created to help with the proper transposition of data observations contained within SAS data sets.

Ex Before:

Ex. After:

The Method:

Before beginning the compilation and processing of the macro, I would recommend creating a copy of the data set that you wish to alter.

/* Create a copy of the original data set */

Data DATASETB1;
Set SETA;
Run;

Next, you will need to remove duplicate entries from the copy of the original data set. Make sure to sort based on a primary and unique identifier.

Proc Sort Data = DATASETB1 nodupkey dupout= DupOut1;
By VARA;
Run;

For the macro to perform its function correctly, you must run the PROC FREQ procedure on the unique identifier that was specified in the previous set. The reason for doing such, is that you need to locate the unique identifier observation with the greatest frequency.

PROC FREQ DATA = DupOut1;
TABLES VARA;
RUN;

We are now ready to create and compile the macro, the code for which can be found below:

%macro splitter;

%do i = 1 %to 2; /* Highest frequency of unique re-occuring variable observation (VARA in our example) minus 2 */
%let h = %eval(&i+1);

proc sort data = DupOut&i nodupkey dupout= DupOut&h;
by VARA;
run;

%end;

%mend;

To reiterate, the “i” variable in the above listed code must be modified to the number produced by the PROC FREQ DATA step. The “i” variable will be the highest frequency occurrence of a unique variable from the PROC FREQ procedure minus 2.

What this code seeks to achieve is the separation of duplicate values from the initial data set and the establishment of separate data sets, each containing a unique list of entries. Such as the illustration below demonstrates:

In the next article, we will review another macro which will help us in re-naming the variables of the newly created data sets. Once we have achieved this step, we can then begin re-assembling the separated data into a single set.

Friday, May 12, 2017

NODUPKEY / DUPOUT (SAS)

Today’s article will discuss another incredibly important SAS feature. The feature that I will be discussing is NODUPKEY.

NODUPKEY is essentially a way of removing duplicates from a SAS set. When specified, the SAS program will scan a data set for all duplicate values of a variable indicated by the user. Every duplicate value that occurs after the first occurrence of the variable value, will be deleted. A good best practice is to always make a copy of the original set prior to utilizing NODUPKEY.

The basic format for this code statement is:

PROC SORT DATA = <DATASET> NODUPKEY;
BY <VARIABLE>;
RUN;

So if we were to run the following example code:

PROC SORT DATA = SETA NODUPKEY;
BY VARA;
RUN;

On our example set:

The outcome would resemble:

As you can see, the duplicate values for VARA have been removed from the set.

There may be instances where you may want to keep track of the row observations that SAS removed from the data set. For this scenario, DUPOUT can be utilized as an option which can be invoked along with NODUPKEY.

The basic format for this code statement is:

PROC SORT DATA = <DATASET> NODUPKEY DUPOUT= <OUTPUTSETNAME>;
BY <VARIABLE>
RUN;

If we were to run this example code:

PROC SORT DATA = SETA NODUPKEY DUPOUT=DUPOUTA;
BY VARA;
RUN;

On our example set.

The result would be, that within the “DUPOUTA” set, the following entries would exist:

You are not limited to sorting by a single variable when removing duplicate entries. For example, if you wanted to remove entries within a set that had two duplicate variables, you could list two variables in which you wanted to sort by.

PROC SORT DATA = <DATASET> NODUPKEY;
BY <VARIABLE> <VARIABLE>;
RUN;

If we were to run this example code:

PROC SORT DATA = SETA NODUPKEY;
BY VARA VARB;
RUN;

On our example set, the result would be:

If you wish to remove only entries which are EXACT duplicates of other entries, based on all data variables, you have the ability to do such by utilizing the _ALL_ option.

For example:

PROC SORT DATA = <DATASET> NODUPKEY;
BY _ALL_;
RUN;

If we were to run this example code:

PROC SORT DATA = SETA NODUPKEY;
BY _All_;
RUN;

On our example set, the result would be

The NODUPKEY / DUPOUT sort options are incredibly important, and should be mastered by all SAS users. Remember, always be sure to make a copy of your original data set before utilizing the NODUPKEY option, and always be sure to review your data after the NODUPKEY option has been utilized.

Thursday, May 4, 2017

DATALINES (SAS)

In today’s article we will be discussing datalines, which is a SAS statement for reading in data. Typically, this statement is used to read in small data sets for testing.

The standard template for the utilization of this statement is:

DATA <nameofthedataset>;
INPUT <VARA> <datatype> <VARB> <datatype> <VARC> <datatype>;
DATALINES;
VAR1 VAR2 VAR3
VARA VAR2 VAR3
;
RUN;

For example, let’s say that we wanted to read in data that refers to various individuals by first and last name to create a new SAS data set. You could utilize the following code to do so:

DATA Nameset;
Length FirstName $20 MiddleInt $5 LastName $50; /* Get into the habit of defining lengths */
INPUT FirstName $ MiddleInt $ LastName $;
DATALINES;
Bob J. Bobowski
Susan L. Patel
Roy M. Peterson
Jamie Q. Jones
;
RUN;

This would result in the creation of the SAS data set “Nameset”, with the following entries:

You also have the ability to utilize various read-in options while creating data sets through the usage of the DATALINES statement. The code for such would resemble:

DATA <nameofthedataset>;
<INFILE DATALINES> <INPUTOPTION>;
INPUT <VARA> <datatype> <VARB> <datatype> <VARC> <datatype>;
DATALINES;
VAR1, VAR2, VAR3
VARA, VAR2, VAR3
;
RUN;

Below is an example of code that utilizes the DELIMITER= option.

DATA Nameset;
INFILE DATALINES DELIMITER=','
Length FirstName $20 MiddleInt $5 LastName $50; /* Get into the habit of defining lengths */
INPUT FirstName $ MiddleInt $ LastName $;
DATALINES;
Bob, J., Bobowski
Susan, L., Patel
Roy, M., Peterson
Jamie, Q., Jones
;
RUN;

The data set data would remain the same, even though the input method differed.

Datalines is a quick and relatively painless method for inputting small SAS data sets. I would recommend familiarizing yourself with this particular concept, as it does come in handy from time to time.

Wednesday, May 3, 2017

The PROC FREQ(uency) Procedure (SAS)

Proc Frequency or PROC FREQ, is a commonly utilized SAS procedure. To utilize this procedure, the basic form is:

PROC FREQ DATA = <nameofdataset>;
TABLE <columntosummarize>;
RUN;

Ex. DataSetA

PROC FREQ DATA = DataSetA;

TABLE Gender;
RUN;

The default output that is produced from this code will display:

1. Each unique variable found within the specified column.

2. The number of occurrences of each variable.

3. The total percentage of the occurrence of each variable when weighed against the total number of occurrences.

4. The sum of frequency counts of the variable value, and all other values listed above it in the table.

5. The cumulative frequency of the value divided by the total number of observations.

If you wanted to output the summary results of multiple variable columns, you may specify the option within the Proc Freq procedure.

PROC FREQ DATA = <nameofdataset>;
TABLE <columntosummarize1> <columntosummarize2> <columntosummarize3>;
RUN;

Ex. DataSetA

PROC FREQ DATA = DataSetA;
TABLE Gender HairColor;
RUN;

Additionally, you are given the option to enable various options that change the display of the data summary output.

Some of these options include:

NOPERCENT - which suppresses cell percentages.

NOCUM - which eliminates the cumulative frequency.

These options are invoked on the second line of the Proc Freq procedure.

PROC FREQ DATA = <nameofdataset>;
TABLE <columntosummarize1> <columntosummarize2> <columntosummarize3> / <option1> <option2> <option3>;
RUN;

Ex. DataSetA

PROC FREQ DATA = DataSetA>;
TABLE HairColor / NOCUM NOPERCENT;
RUN;

Two Way Tables

A two way table is a useful way to sort the frequency of one variable against another. To utilize this procedure, the basic form is:

PROC FREQ DATA = <nameofdataset>;
TABLE <columntosummarize1> * <columntosummarize2>;
RUN;

Ex. DataSetA

PROC FREQ DATA = DataSetA;
TABLE Gender * HairColor;
RUN;

The PROC FREQ procedure is probably the most utilized and most useful SAS feature. Therefore, you should familiarize yourself with the PROC FREQ functionality, and make it a part of your daily SAS usage.