Reflections of a Data Scientist: (R) Getting Down with "dplyr"

I was originally intending to include a sub-entry within the previous article to discuss "dplyr". "dplyr" is an auxiliary package which simplifies many functions which are innate within the basic R platform. Much of the style of the functional coding within the dplyr package is structured in a manner which is incredibly similar to SQL.

(All of the examples below require the package: "dplyr" to be downloaded and enabled)

The code to generate the example data set which will be utilized within this exercise is as follows:

Person <- c(1, 2, 3, 4, 5, 6, 7)

Gender <- c(1, 1, 1, 0, 0, 0, 0)

HairColor <- c(0, 1, 2, 3, 3, 3, 0)

EyeColor <- c(0, 1, 2, 2, 2, 0, 0)

FavGenre <- c(0, 1, 2, 2, 2, 3, 4)

DataFrameA <- data.frame(Person, Gender, HairColor, EyeColor, FavGenre)

Console Output:

Person Gender HairColor EyeColor FavGenre

1 Seth 1 0 0 0

2 Rob 1 1 1 1

3 Roy 1 2 2 2

4 Jane 0 3 2 2

5 Suzie 0 3 2 2

6 Lisa 0 3 0 3

7 Alexa 0 0 0 4

Reference Data Columns by Name

Let’s say that you are working with the example data frame and you wished to either create a new data frame, or simply wished to generate a summarization of data observations which exist within select variable fields.

The following code will enable these actions:

# Display observation column “Person” #

select(DataFrameA, Person)

# Display observation columns “Person” and “FavGenre” #

select(DataFrameA, Person, FavGenre)

# Display the observation columns for variables between and including “HairColor” and “EyeColor” #

select(DataFrameA, HairColor:EyeColor)

Filtering Observational Data by Variable Values

In this particular instance, we will assume that you are working with the same example data set, however, in this case, you desired to only view observational data which satisfied a pre-conceived variable conditions.

# Display only observations where the variable “Gender” is equal to 1 #

filter(DataFrameA, Gender == 1)

# Display only observations where the variable “Gender” is equal to 1, AND the variable “HairColor” is equal to 2#

filter(DataFrameA, Gender == 1, HairColor == 2)

# Display only observations where the variable “Gender” IS NOT equal to 1, OR the variable “HairColor” is equal to 2#

filter(DataFrameA, Gender != 1 | HairColor == 2)

Sort Data easily with the “arrange” Function

If you have previously worked extensively within the R platform, you’ll understand how difficult it can be to properly sort data. Thankfully, dplyr simplifies this task with the following function.

# Sort the data frame “DataFrameA”, by the variable “Person” (ascending) #

arrange(DataFrameA, Person)

# Sort the data frame “DataFrameA”, by the variable “Person” (descending) #

arrange(DataFrameA, desc(Person))

Simply Re-name Data with the Rename() Function

In previous articles, we discussed the difficulty that surrounds re-naming R column variables. As was the case with “arrange()”, dplyr also provides a simpler alternative with the function “rename()”.

# Re-name the variable “HairColor”, “WigColor”. Results are stored within the data frame: “newdataframe” #

newdataframe <- rename(DataFrameA, WigColor = HairColor)

Create a New Data Variable from an Existing Variable

Another task which dplyr simplifies is the ability to create new variables from existing variables within the same data frame. This is achieved through the utilization of the "mutate()" function.

# Create the new variable: “NewVar” by multiplying the variable “HairColor” by 2 #

# Results are stored within the data frame: “newdataframe” #

newdataframe <- mutate(DataFrameA, NewVar = HairColor * 2)

Create a New Data Frame with Specific Variables

In this example, we will be demonstrating the dplyr function: “select”, which allows for the selection of various existing data frame variables, typically for the purpose of creating a new data frame.

# Create a new data frame: “newdataframe”, which includes the variables: “Person” and “EyeColor” from DataFrameA #

newdataframe <- select(DataFrameA, Person, EyeColor)

Count Distinct Entries

In a similar manner in which SQL allows a user to count distinct variable entries, dplyr also contains a function which allows the user to achieve a similar result: “n_distinct()”.

# Count the distinct number of variable entries for the variable “Person” within DataFrameA #

n_distinct(DataFrameA$Person, na.rm=FALSE)

# Count the distinct number of variable entries for the variable “EyeColor” within DataFrameA #

n_distinct(DataFrameA$EyeColor, na.rm=FALSE)

# In both cases, na.rm=False, designates the option which excludes missing values from the overall count #

Performing Data Joins

Also included within the dplyr package, are functions which enable the user to perform data joins in a manner which is similar to SQL. Though examples of this functionality are not included within this article, more information pertaining to utilization of these commands can be found by running:

??join

within the R input window.

Reflections of a Data Scientist

Tuesday, August 21, 2018

(R) Getting Down with "dplyr"

No comments:

Post a Comment