Reflections of a Data Scientist: (R) Using R like SAS

If you are like yours truly, making the transition from SAS to R can leave you longing for certain aspects of the former. It isn’t that R can’t perform many of the similar functions which were native in SAS. However, some of the features that were cornerstones of SAS, are much more obscure in R. This article attempts to highlight some of those key features, and introduce their R equivalents.

# To perform the first series of exercises, please create the sample data frame below: #

ID <- c(1,2,3,4,5,6,7,8,9,10)
Age <- c(19,25,31,30,33,18,22,28,29,30)
H <- c(20,26,18,10,9,12,18,19,20, 11)
HR <- c(5,2,7,5,5,1,1,0,0,10)

BaseballPlayers <- data.frame(ID, Age, H, HR)

Using Conditionals to Subset

In SAS, you can use SQL to subset data sets. Additionally, you also have the ability to create subsetted data sets by utilizing the DATA statement.

For example, if were working with a data set that contained information pertaining to baseball players, and we wanted to create a new set based on players who have hit more than 5 Home Runs, then the code would resemble:

Data HomeRunGreaterFive;
Set BaseballPlayers;
Where HR > 5;
Run;

In R, the code to perform this function would appear as:

HomeRunGreaterFive <- subset(BaseballPlayers, HR > 5)

In SAS, if you wanted to create a new data set based on players who have more than 15 Hits OR 1 Home Run, the code might be assembled such as:

Data HitsAndHomeRuns;
Set BaseballPlayers;
Where H > 15 OR HR > 1;
Run;

In R, the code would look like:

HitsAndHomeRuns <- subset(BaseballPlayers, H > 15 | HR > 1)

Finally, if you were programming in SAS, and wanted to create a new data set containing players who have more than 15 Hits and 1 Home Run, the code could be compiled as:

Data HitsAndHomeRuns;
Set BaseballPlayers;
Where H > 15 AND HR > 1;
Run;

In R, the code would be:

HitsAndHomeRuns <- subset(BaseballPlayers, HR > 5 & H > 4)

Using Conditionals to Delete a Row

This refers to the code statement “delete” in SAS. Such as “IF Age > 30 THEN delete”. Delete, when used in SAS, deletes rows of data that do not match the conditional statement. So, for example, let’s say that you wanted to subset an existing set of baseball players based on player ages. In this scenario, we'll assume that you wanted to create a new set of data that did not include players that were older than 30. This would be achieved with the following SAS code:

Data PlayersYoungerThirty;
Set BaseballPlayers;
If Age > 30 then delete;
Run;

In R, this code would resemble:

PlayersYoungerThirty <- BaseballPlayers[!(BaseballPlayers$Age > 30),]

Dropping Variables

Again we will return to our baseball example. In SAS, you have the ability to remove variables from a data set by utilizing the drop statement. For this scenario, let’s imagine that you wanted to create a new data set that did not include the eighth, ninth, and tenth variables of an existing data set, as those variable contained extraneous data that was un-needed for your summary set. In SAS, this code would probably look something like:

Data NewCleanSet (Drop = ID age);
Set BaseballPlayers;
Run;

In R, we can reference those columns by order, and the code would resemble:

# Remove variables: ID; Age #

NewCleanSet <- BaseballPlayers[c(-1, -2)]

Perform a Left Join

R has its own native equivalents for performing data merges. However, the left join, in my opinion, is the cleanest way to accomplish this task. While a "join" is an aspect of SQL, it can be utilized within SAS through the utilization of PROC SQL. If we were going to utilize “left join” to merge two tables within SAS, through SQL functionality, the code would resemble:

proc sql ;
create table NEWJOINEDTABLE as
select A.*, B.*
from TABLEA as A left join TABLEB as B
on A.DATAONE = B.DATATWO
;
quit;

Now, I’ve done quite a bit of research into how to emulate this functionality within R. The best, hands down approach for accomplishing a similar result, requires the SQLDF package. Once you have that package downloaded, you can perform almost any SQL function within R.

The code above, with the SQLDF package downloaded and enabled, would translate into the R code below:

# Create Example Data Frames: #

DATAONE <- c(1,2,3,4,5)
DATAA <- c("A", "B", "C", "D", "E")
TABLEA <- data.frame(DATAONE, DATAA)

DATATWO <- c(1,2,3,4,5)
DATAB <- c("Spade", "Club", "Diamond", "Heart", "Joker")
TABLEB <- data.frame(DATATWO, DATAB)

# 1. Enable package: 'sqldf' #

library(sqldf)

# 2. Perform Left Join #

NEWJOINEDTABLE <- sqldf('select A.* ,
B.* from TABLEA as A
left join TABLEB as B
on A.DATAONE = B.DATATWO')

Utilizing PROC FREQ

I searched far and wide for a decent replacement for the absolutely superb PROC FREQ statement of SAS. The closest that I could come to the original SAS iteration, requires a package. Therefore, for this method to work, you will need to download the ‘gmodels' package. Also, make sure you have it enabled when running the R code equivalent.

In SAS, the code to generate a frequency table containing home run information from DataTableA is:

Proc Freq Data = DataTableA;
tables HR;
Run;

In R would look like (with 'gmodels' downloaded/enabled):

library(gmodels)

CrossTable(DataTableA$HR)

Adding Leading Zeroes

Always a problem, regardless of system, losing leading zeroes, and subsequently having to re-add them, is truly a burden on any data professional. There is an entire entry on this blog, on how to accomplish this in SAS. Here is how to accomplish re-adding leading zeroes in R. Please be aware, that if your column data contains numerical information, that utilizing this method changes the data to character type.

# Example Data Frame #

modifiedVar <- c(1,2,3,4,5,6,7,8)

DATAFRAME <- data.frame(modifiedVar)

# Code #

DATAFRAME$modifiedVar <- sprintf("%04d", DATAFRAME$modifiedVar)

%04d specifies the total length of the variable.

So if the above code was being utilized to on the following column variables, the following results would occur:

Old Var = 1
New Modified Var = 0001

Old Var = 10
New Modified Var = 0010

Old Var = 100
New Modified Var = 0100

If you wanted to add additional zeroes, you would simply need to change the “%04d” option to a larger value.

Dropping Tables

This code is a native SQL function. However, I still use it within SAS by utilizing the PROC SQL statement.

If you wanted to drop three tables within SAS though the usage of the drop statement, the code would resemble:

Proc SQL;
drop tableA, tableB, tableC;
quit;

To achieve the same result while using R:

rm(DATAFRAME, BaseballPlayers)

Fixing Dates

Finally, we come to dates, which to every SAS user, is the bane of their existence. I will not go through how to modify SAS dates in this article, as there is an independent post dedicated to such on this blog. However, below are four different lines of code.

# Example Data Frames #

DateA <- c("01/20/2020", "02/13/1980", "03/30/1970", "04/13/1991")

DataFrameA <- data.frame(DateA)

###################################################

DateB <- c("01/25/2020", "02/21/1980", "05/30/1970", "09/13/1998")

DataFrameB <- data.frame(DateB)

These two lines are for changing the data type of variable columns, within an existing data frame, to a date type format.

DataFrameA$DateA <- as.Date(DataFrameA$DateA, format="%m/%d/%Y")

DataFrameB$DateB <- as.Date(DataFrameB$DateB, format="%m/%d/%Y")

Once this is accomplished, you have the ability to create a new column, which will contain the number of days elapsed between the two dates:

DataFrameA$DaysDifference <- difftime(DataFrameA$DateA, DataFrameB$DateB, units = 'days')

You could also perform a similar function, and generate a new column which contains the number of weeks elapsed:

DataFrameA$WeeksDifference <- difftime(DataFrameA$DateA, DataFrameB$DateB, units = "weeks")

The next article should be posted within a few days, I have yet to decide on a specific topic to discuss. In the interim, please continue to visit my blog, I appreciate your patronage.

Reflections of a Data Scientist

Thursday, August 3, 2017

(R) Using R like SAS

No comments:

Post a Comment