Tuesday, August 21, 2018

(R) Cleaning R Data with "tidyr"

Often, when encountering data which has been gathered for analysis, you will be completely dependent on a third party's adherence to integrity standards. What commonly occurs as a result of such, is a disastrously organized document filled with discursive values. As is often the case, you, as the analyst, will spend a larger majority of your time organizing data, time which could otherwise be spent producing analytical results. In this article, we will review some of the methods which can be utilized to achieve the former task, and also, introduce a few new methods which are also useful for accomplishing such. As the title indicates, this entry will devoted entirely to the R platform.

Difficult Characters within (.csv)s

Columns which contain string values often create problems for R as it relates to importing .csv files. Overlooking the catalyst for the dilemma, the novice data analyst will often immediately begin by tinkering with the "read.table()" options.

The solution to this dilemma is often far more simplistic than the corrective efforts that the novice attempts to apply.

Commonly, the issue arises from the usage of the following characters being present within string observational data:

"

'

Removing these characters from a (.csv) files prior to attempting input is the most expedient path towards resolution. 

Removing Blank Observation Entries within String Variable Columns

While working with qualitative data, there often arises the need to remove blank observational entries. The reason for this occurrence is commonly due to the issues which emerge from the quantification of the aforementioned data subsequent to data transformation.

To delete all row variable entries, and the corresponding columns associated with such, the following code can be utilized:

# Remove all rows which contain blank observational entries within column "B" #

DataFrameA <- DataFrameA[!(DataFrameA$B == ""),]

# Create Example Data Frame #

A <- c(1, 2, 3, 4, 5, 6, 7, 8)

B <- c("apple", "orange", "grape", "tangerine","" ,"" ,"" , "cherry")

DataFrameA <- data.frame(A, B)

# Print to Console #

print(DataFrameA)


Console Output:

> print(DataFrameA)

A B

1 1 apple

2 2 orange

3 3 grape

4 4 tangerine

5 5

6 6

7 7

8 8 cherry


# Utilize code to remove blank entries #

DataFrameA <- DataFrameA[!(DataFrameA$B == ""),]

# Print to Console #

print(DataFrameA)


Console Output:

> print(DataFrameA)

A B

1 1 apple

2 2 orange

3 3 grape

4 4 tangerine

8 8 cherry


Counting the Total Number of Rows within a Data Frame

We will now assume that you have successfully imported your data into the R platform. However, prior to performing analysis, you may wish to survey the data frame in order to ensure that nothing is amiss. A good way to begin accomplishing this task is to count the total number of rows within the newly imported data frame. The following function will assist you with this task.

# Re-Generate Data Frame #

A <- c(1, 2, 3, 4, 5, 6, 7, 8)

B <- c("apple", "orange", "grape", "tangerine","" ,"" ,"" , "cherry")

DataFrameA <- data.frame(A, B)

# Count the rows within the data set #

NROW(DataFrameA)


Console Output:

[1] 8

Tidying up Data with "tidyr"

One of the most useful packages within the R platform is “tidyr”. This package adds additional functionality that was either absent from the basic R library, or too verbose within its initial incarnation. The following examples demonstrate some of the most useful functions included within the “tidyr” package.

Utilizing “separate()” to Separate Data

The "separate()" command provides the ability to create new data variables within an R data frame from previously conjoined data variables.

# Create the sample data frame #

ID <- c(11111, 22222, 55555)

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

Address1 <- c("123 SAS Lane, Cleveland, OH, 18111-44432",

"456 Cherry Valley, Miami, FL, 69785-33325",

"789 Python Way, Los Angeles, CA, 74715-99925")

BadData <- data.frame(ID, Name, Address1)


When viewed within the R-Studio console window, the data frame resembles the following graphic:


To separate the “Address1” variable into new variables based on comma placement, we will utilize the following code:

# With the tidyr package downloaded and enabled #

# Separate “Address1” Variable into new variables: “AddressLine”, “City”, “State”, “Zip” #

GoodData <- separate(BadData, Address1, c("AddressLine", "City", "State", "Zip"),

sep = ",")


After initiating the function above, the data frame will resemble the following:


We will new take one final step, and separate the “Name” variable into additional new variables based on comma placement.

# With the tidyr package downloaded and enabled #

# Separate “Name” Variable into new variables: “LastName”, “FirstName” #

GoodData0 <- separate(GoodData, Name, c("LastName", "FirstName"), sep = ",")


After initiating the function above, the data frame will resemble the following: 


Re-uniting Data with “unite()”

The inverse function of “separate()” within the “tidyr” package is “unite()”. In the case of our example, we will be re-merging two separate data frame variables into a single variable, based on comma placement.

# Re-create the sample data frame #

ID <- c(11111, 22222, 55555)

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

Address1 <- c("123 SAS Lane, Cleveland, OH, 18111-44432",

"456 Cherry Valley, Miami, FL, 69785-33325",

"789 Python Way, Los Angeles, CA, 74715-99925")

BadData <- data.frame(ID, Name, Address1)

# With the tidyr() package downloaded and enabled #

# Separate “Address1” Variable into new variables: “AddressLine”, “City”, “State”, “Zip” #

GoodData <- separate(BadData, Address1, c("AddressLine", "City", "State", "Zip"), sep = ",")

# Separate “Name” Variable into new variables: “LastName”, “FirstName” #

GoodData0 <- separate(GoodData, Name, c("LastName", "FirstName"), sep = ",")


# Unite the “City” and “State” variables into a single variable labeled “City_State” #

unitedata <- unite(GoodData0, "City_Sate", c(City, State), sep = ", ")

After initiating the function, the data frame will resemble the following:


Transposing Data with “gather()” and “spread()”

In a series of prior articles featured on this website, I created a multi-part macro for the purpose of transposing data within the SAS platform. Thankfully, within R, there is a much simpler solution for solving the complexities of data transposition.

The test data frame which we will be utilizing can be generated with the following code:

# Create the sample data frame #

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

A <- c("Avalue1", "Avalue2", "Avalue3")

B <- c("Bvalue1", "Bvalue2", "Bvalue3")

C <- c("Cvalue1", "Cvalue2", "Cvalue3")

D <- c("Dvalue1", "Dvalue2", "Dvalue3")

TestData <- data.frame(Name, A, B, C, D)

The data frame resembles the following graphic:


However, let’s say, for the sake of our example, that you instead wanted a data frame that resembled the graphic below:


The code to achieve such is as follows:

# With the tidyr package downloaded and enabled #

GatherData <- gather(TestData, NewVar, NewVar2, A, B, C, D)

The template options for this function are:

gather(1,2,3,4…etc)

1 = The data frame referenced.

2 = The first new variable.

3 = The second new variable.

4…etc = The variables passed to the functions from the initial data frame.

To have the “D” variable remain independent, the code to utilize is:

# With the tidyr package downloaded and enabled #

GatherData <- gather(TestData, NewVar, NewVar2, A, B, C)

Which presents the graphic:


Another transposition function which is included within the “tidyr” package is “spread()”. Spread exists as the inverse of the “gather()” function.

Let’s demonstrate the functions capabilities:

# Create the sample data frame #

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

A <- c("Avalue1", "Avalue2", "Avalue3")

B <- c("Bvalue1", "Bvalue2", "Bvalue3")

C <- c("Cvalue1", "Cvalue2", "Cvalue3")

D <- c("Dvalue1", "Dvalue2", "Dvalue3")

TestData <- data.frame(Name, A, B, C, D)

# With the tidyr package downloaded and enabled #

GatherData <- gather(TestData, NewVar, NewVar2, A, B, C, D)

Graphically this data resembles the following:


Now we will utilize the “spread()” function to re-adjust the data.

# With the tidyr package downloaded and enabled #

SpreadData <- spread(GatherData, NewVar, NewVar2)

The template options for this function are:

spread(1, 2, 3)

1 = The data frame referenced.

2 = The first variable referenced.

3 = The second variable referenced.

The output is as follows:


No comments:

Post a Comment

Note: Only a member of this blog may post a comment.