Reflections of a Data Scientist: July 2017

Monday, July 31, 2017

(R) Conditionals

Today we will be discussing conditional statements within R. Conditionals are very easy to understand, and extremely powerful when implemented. In typical fashion, I will first address a coding concept, followed by an example of the code being utilized.

If you are familiar with generally practiced coding standards and paradigms, you should be familiar with conditional statements.

Typically, in other languages, an IF statement would resemble something like:

if (condition is met) DOSOMETHING;

The exact structuring of the statement depends on the coding language.

In R, conditional coding resembles the following:

ifelse(condition, if true do this, if false do this)

Example:

For this example, we will pretend that you are again using the iconic DataFrameA, and in this particular scenario, you want to create a flag variable within a blank data column.

# First we will create our sample data frame with the code below: #

A <- c(1,1,1,2,2,3,3)
B <-c(2,1,3,2,3,3,1)
DataFrameA <- data.frame(A, B)

DataFrameA

#########################################################

DataFrameA

A B
1 2
1 1
1 3
2 2
2 3
3 3
3 1

The code that you will create compares column A to column B, row by row. Wherever the two values in a row match, an 'X' is placed in column C; otherwise, the cell is left blank.

To achieve this, the following line of code can be utilized:

DataFrameA$C <- ifelse(DataFrameA$A == DataFrameA$B, 'X', ' ')

Additionally, if you wanted code that assigns an 'X' value for a match, or a 'Y' value for a non-match, the following code can be utilized:

DataFrameA$C <- ifelse(DataFrameA$A == DataFrameA$B, 'X', 'Y')

In the first example, the newly modified DataFrameA would resemble:

A B C
1 2
1 1 X
1 3
2 2 X
2 3
3 3 X
3 1

In the second example, the newly modified DataFrameA would resemble:

A B C
1 2 Y
1 1 X
1 3 Y
2 2 X
2 3 Y
3 3 X
3 1 Y

A few quick notes on conditionals in R. Please note the use of '==' instead of '=' in the above listed example. In R, '==' tests whether two values are equal, while '=' assigns a value. Also, DataFrameA$C is referring to the column C in DataFrameA, DataFrameA$A is referring to column A in DataFrameA, and DataFrameA$B is referring to column B in DataFrameA.
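
As a hedged sketch of how this can be extended, the block below rebuilds the sample data and nests one ifelse() inside another to produce three possible flags. The column name Flag and its labels are illustrative assumptions, not part of the original example.

```r
# Rebuild the sample data frame from the example above
A <- c(1, 1, 1, 2, 2, 3, 3)
B <- c(2, 1, 3, 2, 3, 3, 1)
DataFrameA <- data.frame(A, B)

# Single condition: 'X' where A equals B, 'Y' otherwise
DataFrameA$C <- ifelse(DataFrameA$A == DataFrameA$B, "X", "Y")

# Nested ifelse(): the second condition is only consulted where the first is FALSE
# (the Flag column and its labels are hypothetical)
DataFrameA$Flag <- ifelse(DataFrameA$A == DataFrameA$B, "Match",
                   ifelse(DataFrameA$A > DataFrameA$B, "A larger", "B larger"))

print(DataFrameA)
```

Because ifelse() evaluates the whole column at once, no loop over the rows is needed.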

These examples are simple, but the applications for this concept are endless. In the next article, we will be discussing some of the similarities between R and SAS, and how to achieve similar functionality in R as it pertains to SAS.

Thursday, July 27, 2017

(R) Data Frame Maintenance

The topic of today's post is: Data Frame Maintenance. In this article, I will demonstrate various techniques that can be utilized to accomplish the tasks associated with such.

Let's say, for example, that you are working with a data frame named: "DataFrameA". For whatever reason, the third column of this particular data frame needs to be re-named. The code to accomplish this task is below:

colnames(DataFrameA)[<#ofcolumntochange>] <- "New Column Name"

So, if you wanted to change the name of the third column of DataFrameA to "DataBlog", the code would resemble:

colnames(DataFrameA)[3] <- "DataBlog"
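
If the position of a column is not known in advance, it can also be renamed by matching on its current name. The sketch below assumes a small, hypothetical three-column data frame; the name "Second" is made up for illustration.

```r
# Hypothetical three-column data frame for illustration
DataFrameA <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# Rename by position, as in the example above
colnames(DataFrameA)[3] <- "DataBlog"

# Rename by current name, which keeps working if the column order changes
colnames(DataFrameA)[colnames(DataFrameA) == "B"] <- "Second"

print(colnames(DataFrameA))
```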

Changing Column Variable Type

Now, let's say that you wanted to change the data type that is contained within a column of an existing data frame. Again, we will use "DataFrameA" for our example.

This code will change a column which contains integers, to a column that contains factors:

DataFrameA$VarA <- as.factor(DataFrameA$VarA)

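This code will change a column which contains factors, to a column that contains integers. Note that calling as.integer directly on a factor returns the internal level codes rather than the original values, so the factor is first converted to characters:

DataFrameA$VarA <- as.integer(as.character(DataFrameA$VarA))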

This code will change a column which contains factors, to a column that contains characters:

DataFrameA$VarA <- as.character(DataFrameA$VarA)
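
One detail worth checking when converting factors is that as.integer, applied directly to a factor, returns the internal level codes. The short sketch below uses a hypothetical VarA vector to show the difference.

```r
# Hypothetical values stored as a factor
VarA <- factor(c(10, 20, 20, 30))

# Applied directly to a factor, as.integer returns the level codes
LevelCodes <- as.integer(VarA)

# Converting to character first recovers the original values
OriginalValues <- as.integer(as.character(VarA))

# Factor to character keeps the labels as text
AsText <- as.character(VarA)

print(LevelCodes)       # 1 2 2 3
print(OriginalValues)   # 10 20 20 30
```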

Stacking Data Frames

Perhaps you want to stack two data frames, one on top of the other.

If each data frame has the same column names, then the following code is ideal:

NewDataFrame <- rbind(topdataframe, bottomdataframe)

If one data frame contains an additional column that is not included within the other, you will need to add the missing column to the data frame before stacking the data.

For example, if Data Frame A contains:

A B C
1 9 4
2 18 8
3 27 12
4 36 16

And Data Frame B contains:

A B
1 3
2 6
3 9
4 12

You would first need to add a column containing missing values to the bottom data frame by running the example code:

DataFrameB$C <- NA

This code modifies Data Frame B so that it resembles:

A B C
1 3 NA
2 6 NA
3 9 NA
4 12 NA

The data frames can now be stacked with the code:

NewDataFrame <- rbind(DataFrameA, DataFrameB)

And the new data frame will resemble:

A B C
1 9 4
2 18 8
3 27 12
4 36 16
1 3 NA
2 6 NA
3 9 NA
4 12 NA
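
When more than one column is missing, the same steps can be written so that the missing columns are found automatically. This is only a sketch built on the two example data frames above; the helper variable MissingColumns is an assumption introduced for illustration.

```r
# The two example data frames from above
DataFrameA <- data.frame(A = 1:4, B = c(9, 18, 27, 36), C = c(4, 8, 12, 16))
DataFrameB <- data.frame(A = 1:4, B = c(3, 6, 9, 12))

# Columns present in DataFrameA but absent from DataFrameB
MissingColumns <- setdiff(names(DataFrameA), names(DataFrameB))

# Add each missing column to DataFrameB, filled with NA
DataFrameB[MissingColumns] <- NA

# The column names now match, so the data frames can be stacked
NewDataFrame <- rbind(DataFrameA, DataFrameB)
print(NewDataFrame)
```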

Adding a Vector as a Column

For this example, we'll pretend that you wanted to add a new column, in the form of a vector, to an existing data frame.

If the column is of the same length, row wise, then adding it to a data frame is simple.

Utilize the code:

DataFrameName$NewColumnName <- NewColumntoAdd

If the column is shorter, row wise, in comparison to the data frame in which it is being added, then you will first have to add additional values to the vector before utilizing the above code.

For example, if NewColumntoAdd is 35 rows in length, and DataFrameA is 36 rows in length, you could add the additional values needed to complete the subsequent task with the following code:

AdditionalDataVector <- rep(NA, times=1) # Or however many NA rows are needed #

NewColumntoAdd <- c(NewColumntoAdd, AdditionalDataVector)

Now you can successfully run the code:

DataFrameName$NewColumnName <- NewColumntoAdd
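
Rather than counting the missing rows by hand, the number of NA values needed can be computed from the two lengths. The sketch below assumes a hypothetical 36-row data frame and a 35-value vector; the variable Shortfall is introduced purely for illustration.

```r
# Hypothetical 36-row data frame and 35-value vector
DataFrameA <- data.frame(ID = 1:36)
NewColumntoAdd <- seq_len(35)

# Number of NA values needed to match the row count of the data frame
Shortfall <- nrow(DataFrameA) - length(NewColumntoAdd)

# Pad the vector, then add it as a column
NewColumntoAdd <- c(NewColumntoAdd, rep(NA, times = Shortfall))
DataFrameA$NewColumnName <- NewColumntoAdd

print(tail(DataFrameA, n = 3))
```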

Re-Ordering Columns within a Data Frame

To accomplish this task you have two options.

The first option is to re-order the column data by column name.

So for example, if you were working on a data frame (DataFrameA), with the column names of ("A", "B", "C", "D"), and you wanted to re-order the columns so that they were displayed such as ("B", "C", "A", "D"), you could run the code:

DataFrameA <- DataFrameA[c("B", "C", "A", "D")]

You also have the option of re-ordering the columns by column number.

If this was the option that you wished to utilize, the code would resemble:

DataFrameA <- DataFrameA[c(2,3,1,4)]
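
For wider data frames, listing every column can become tedious. One possible approach, sketched here with a hypothetical single-row data frame, is to name only the column being moved and let setdiff() supply the rest.

```r
# Hypothetical data frame with columns A through D
DataFrameA <- data.frame(A = 1, B = 2, C = 3, D = 4)

# Move column "C" to the front; the remaining columns keep their order
DataFrameA <- DataFrameA[c("C", setdiff(names(DataFrameA), "C"))]

print(names(DataFrameA))
```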

That is all for this entry. I have not yet decided what the topic for the next post will be, but I promise you that it will contain more helpful R related information.

Sunday, July 16, 2017

(R) Data Frame Extraction

In this article, we will be discussing how to extract data from existing data frames within R.

If you aren’t already familiar with the function of square brackets ( ‘[‘ and ‘]’ ) within R, we will briefly review their usage.

When you encounter square brackets after the name of a data frame in R, the values specified within the brackets are instructing R to query and return data.

[ X , Y ]

Above is an example of how such a query would appear within R code.

X - Specifies Row

Y - Specifies Column

So if a programmer were to write the code:

E <- DataFrameA[ 1 , 2, drop = FALSE]

R would interpret this to mean: return the data from Row: 1, Column: 2, and store this data in ‘E’. Because drop = FALSE is specified, ‘E’ will be a one-cell data frame rather than a bare value.

Leaving either the left or the right position empty in a bracket related query, instructs R to return ALL data.

Therefore:

E <- DataFrameA[ 1 , , drop = FALSE]

Would instruct R to return ALL Column data from Row:1. (And store this data in ‘E’)

While:

E <- DataFrameA[ , 1 ]

Would instruct R to return ALL Row data from Column:1. (And store this data in ‘E’, as a vector, since drop = FALSE is not specified)

The following are examples of code samples which extract data from R Data Frames.

E <- DataFrameA[3, 2, drop = FALSE]

Extracts the third element in the second column of DataFrameA, and stores that element in ‘E’ as a one-cell data frame.

E <- DataFrameA[c(1 , 3), 2, drop = FALSE]

Extracts the data within row 1 and row 3, within column 2, of DataFrameA. The data is then stored in data frame ‘E’.

E <- DataFrameA[5, ]

Extracts row 5, and all column data contained within row 5. The data will be stored in data frame ‘E’.

E <- DataFrameA[ , 8]

Extracts all row data from column 8. Because drop = FALSE is not specified, the data is stored in ‘E’ as a vector (a factor, if column 8 is a factor).
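
The effect of drop = FALSE can be confirmed by checking what each query returns. The sketch below assumes a small, hypothetical two-column data frame; the variable names are made up for illustration.

```r
# Hypothetical data frame for illustration
DataFrameA <- data.frame(A = 1:5, B = c(10, 20, 30, 40, 50))

CellAsFrame  <- DataFrameA[3, 2, drop = FALSE]  # stays a 1 x 1 data frame
CellAsVector <- DataFrameA[3, 2]                # plain value: 30
RowFive      <- DataFrameA[5, ]                 # one-row data frame
ColumnTwo    <- DataFrameA[, 2]                 # numeric vector of length 5

print(class(CellAsFrame))
print(class(ColumnTwo))
```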

Saturday, July 15, 2017

(R) Vector Creation and Extraction

In this article, we will discuss how to create vectors manually, and also, how to create vectors from data contained within existing vectors.

Just as a reminder, a vector is a sequence of data elements of the same basic type. Each element in a sequence is referred to as a component.*

Elements of a vector are displayed to the console such as:

[1] 3 5 7 9

However, just because the data is printed as a row, does not mean that the data cannot be added to an existing data frame as a column. Therefore, though vector data prints to the console horizontally, you can imagine it as also existing vertically, like so:

3
5
7
9

The [1] indicates the beginning of the vector. If the console runs out of horizontal space while printing the vector, the remainder of the vector will be printed on the next line of the console. The new line will begin with a value which indicates the position, within the vector, of the first element on that line.

For Example:

[1] 1 2 5 7 9
[6] 11 13 15 17

Vector Creation

Here are a few examples of vector creating code:

x <- seq(from=2, to=12, by=2)

This creates a vector which contains the values: 2 4 6 8 10 12

The code is instructing R to count from 2 to 12, by 2, and then store the values in vector 'x'.

x <- rep(seq(from=2, to=12, by=2), times=2)

This creates a vector which contains the values: 2 4 6 8 10 12 2 4 6 8 10 12

The code is instructing R to count from 2 to 12, by 2, twice, and then store the values in vector 'x'.

x <- rep(c("o", "m", "g"), times=3)

This creates a vector which contains the values: "o" "m" "g" "o" "m" "g" "o" "m" "g"

This code is instructing R to repeat the values "o", "m", "g", three times, and then store the values in vector 'x'.
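
The rep function also accepts an each argument, which repeats every element in turn instead of repeating the vector as a whole. A brief sketch of the difference, using made-up variable names:

```r
# Count from 2 to 12, by 2
BaseVector <- seq(from = 2, to = 12, by = 2)

# times: the whole vector is repeated (2 4 6 8 10 12 2 4 6 8 10 12)
RepeatedWhole <- rep(BaseVector, times = 2)

# each: every element is repeated in place (2 2 4 4 6 6 8 8 10 10 12 12)
RepeatedEach <- rep(BaseVector, each = 2)

print(RepeatedWhole)
print(RepeatedEach)
```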

Now, let's say that you want to manipulate the data within the vectors. Here are a few methods.

Vector Data Manipulation

Adding Within Vectors

x <- x + 10

This code adds the value of 10 to each value within vector x, and then stores the values within vector 'x'.

So if vector 'x' contained the data: 1 2 3 4 5 6

The above code would modify the vector so that it contained the data: 11 12 13 14 15 16

Multiplying Within Vectors

x <- x * 0

This code multiplies each value within vector 'x' by the value of 0. The values are then subsequently stored within vector 'x'.

So if vector 'x' contained the data: 1 2 3 4 5 6

The above code would modify the vector so that it would contain the data: 0 0 0 0 0 0

If two vectors are of the same length, they may be added, subtracted, multiplied or divided.

For Example:

If vector 'y' contained the values: 2 2 2 2 2

And vector 'x' contained the values: 2 2 2 2 2

The Code:

w <- x + y

Would generate 'w' as a vector, containing the values of: 4 4 4 4 4

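If vectors of different lengths are combined in this way, R recycles the values of the shorter vector until it matches the length of the longer vector.

For Example:

If vector 'y' contained the values: 2 2 2

And vector 'x' contained the values: 2 2 2 2 2

w <- x + y

Would still generate 'w', containing the values: 4 4 4 4 4. However, because 5 is not a multiple of 3, R would also present the user with the warning:

Warning message: In x + y : longer object length is not a multiple of shorter object length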
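
As a hedged illustration of how R handles vectors of mismatched lengths, the sketch below uses made-up values. R reuses (recycles) the shorter vector; a warning is only raised when the longer length is not a multiple of the shorter one, and the result is computed either way.

```r
x <- c(2, 2, 2, 2, 2, 2)

# Length 2 divides length 6: y is recycled three times, with no warning
y <- c(1, 2)
w <- x + y          # 3 4 3 4 3 4

# Length 4 does not divide length 6: R warns, but still returns a result
z <- c(1, 2, 3, 4)
v <- suppressWarnings(x + z)   # 3 4 5 6 3 4

print(w)
print(v)
```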

Extracting From Vectors

Assuming that 'u' is a vector which contains the values of: 1 2 3 4 5 6

q <- u[3]

This example would extract the third value of the vector 'u', and store the data in the vector 'q'.

Therefore, 'q' would contain a value of 3.

q <- u[-3]

In this case, all values but the third value of vector 'u' would be extracted, and the data would be stored in vector 'q'.

Therefore, 'q' would contain the values: 1 2 4 5 6

q <- u[1:2]

Here, the first two values are extracted from u, and stored in vector 'q'.

Therefore, 'q' would contain the values: 1 2

q <- u[c(1,2)]

This is another way of extracting the first two values of 'u', which will then be stored in vector 'q'.

Again, 'q' would contain the values: 1 2

q <- u[-c(1,2)]

In this example, all values from vector 'u' are extracted, except the first and second elements (which, in this case, happen to hold the values 1 and 2).

Therefore, vector 'q' would contain the values: 3 4 5 6

q <- u[u<4]

This method extracts all values within vector 'u' that are less than 4.

Therefore, the vector 'q' would contain the values: 1 2 3
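
Conditions can also be combined, and the which function can be utilized when the positions of the matching values are wanted rather than the values themselves. A minimal sketch, reusing the example vector 'u' with made-up result names:

```r
u <- c(1, 2, 3, 4, 5, 6)

# Values less than 4
LessThanFour <- u[u < 4]                 # 1 2 3

# Two conditions joined with & (both must hold)
BetweenTwoAndSix <- u[u > 2 & u < 6]     # 3 4 5

# Positions, rather than values, of the elements less than 4
Positions <- which(u < 4)                # 1 2 3

print(BetweenTwoAndSix)
```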

* - www.r-tutorial.com/r-introduction/vector

Saturday, July 8, 2017

(R) Checking Data Integrity

I wanted to address, before moving to the topic of data integrity, two additional methods that can be utilized for importing data into the R platform. I do not personally utilize either of these methods due to their reliance on the user interface. The prior methods discussed, leave import records within the code. These recorded pathways will be helpful to the user who must return to a project at a later date.

However, if you were specifically searching for a more user friendly method to utilize when importing data, the following methods may better suit your needs.

The easiest method to utilize when importing data, is the following. This particular data import method leaves absolutely no record for the user, and R-Studio must be installed for this method to successfully execute.

First, you will need to open R-Studio. After this has been accomplished, you will need to click on the drop down menu option that reads: "Import Dataset".

Each data set variation requires that you install a certain R-Package before proceeding. If you have previously installed the required package that is necessary for the file variant, you will be able to proceed with the import.

The other method that can be utilized to import data into R, is a hybrid of code and user interaction.

You will need to run the following code template from the R console.

<datasetname> <- read.table(file.choose(), <importoptions>)

So, if we were to utilize this template on our previous examples, the code would resemble the following:

(Assuming that the file is a .csv)

DataFrameA <- read.table(file.choose(), fill = TRUE, header=TRUE, sep="," )

(Assuming that the file is tab delimited)

DataFrameA <- read.table(file.choose(), fill = TRUE, header=TRUE, sep="\t" )

Either variation will cause your operating system to open the file selection window that is native to your system.

From this user interface, you will be able to select the file that you would like to import into R.

Checking Data Integrity

After your data has been successfully imported into R, you should check the integrity of the data to make sure that all of the original data was imported correctly. Listed below, are some of the commands that can be used to ensure that data integrity was maintained.

<DataFrameName>[c(1, 2, 3), ] - Displays the row data contained within the first three rows of the data frame, and all corresponding column data.

<DataFrameName>[1:3, ] - Performs the same action as the above command. However, if additional rows are required for viewing, this command does not necessitate the selection of each particular row in the command option.

dim(<DataFrameName>) - This command displays the dimensions of the selected data frame. Information is displayed as Row x Column.

summary(<VarName>) - This command produces an abridged statistical summary of all numerical data, and a frequency distribution of all non-numerical data.

levels(<VarName>) - This command displays all of the distinct levels (categories) of the selected factor variable.

class(<VarName>) - This command will indicate the variable type of the specified variable.
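
To see these commands together, the sketch below runs them against a small, hypothetical data frame. The column names Group and Score are assumptions made for illustration.

```r
# Hypothetical data frame with one factor and one numeric column
DataFrameA <- data.frame(
  Group = factor(c("a", "b", "a", "c")),
  Score = c(90, 85, 70, 60)
)

Dimensions  <- dim(DataFrameA)              # 4 rows, 2 columns
ScoreType   <- class(DataFrameA$Score)      # "numeric"
GroupLevels <- levels(DataFrameA$Group)     # "a" "b" "c"
GroupCounts <- summary(DataFrameA$Group)    # frequency of each level

print(DataFrameA[1:3, ])                    # first three rows, all columns
print(GroupCounts)
```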

You have the option of printing the entire data set to the console. However, this is only feasible if the data contained within the data frame is not overly large. If you do choose to print the data frame data to the console, I would recommend enabling the option below before proceeding. This option enlarges console output width, which allows for the printed data to display correctly.

options("width"=200)

The command below prints the data frame to the console:

print(<DataFrameName>)

If the utilization of this command is infeasible due to the size of the data frame, you could instead utilize the head or tail commands.

The head command template is:

head(<DataFrameName>, n=<number of rows to display>)

Executing this command will display the first n number of rows contained within the data frame.

Example:

# Print the first 10 rows of the data set #

head(DataFrameA, n=10)

The tail command template is:

tail(<DataFrameName>, n=<number of rows to display>)

Executing this command will display the last n number of rows contained within the data frame.

Example:

# Print the last 5 rows of the data set #

tail(DataFrameA, n=5)
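
Both commands also accept a negative n, which returns everything except the specified number of rows. The sketch below assumes a hypothetical 20-row data frame.

```r
# Hypothetical 20-row data frame
DataFrameA <- data.frame(ID = 1:20, Value = (1:20) * 2)

FirstRows     <- head(DataFrameA, n = 10)   # rows 1 through 10
LastRows      <- tail(DataFrameA, n = 5)    # rows 16 through 20
AllButLastTwo <- head(DataFrameA, n = -2)   # rows 1 through 18

print(LastRows)
```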

The "fix" command serves a similar purpose to the previous commands listed. In many cases, it may be best to run this command initially when checking for data integrity before proceeding with other commands.

"fix" allows the user to edit, through a graphical user interface which is launched subsequently to the command's execution, individual data entries within the data frame. Additionally, the user will also be presented with the opportunity to change variable type data. The downside to this particular command, is that it only presents the first 5000 rows of data. Also, there will be no record left within the code which indicates whether any data modifications took place.

# Allows the user to edit the first 5000 observations #

fix(DataFrameName)

There are two other commands which you should also familiarize yourself with, though they are very similar in function to the commands which were previously discussed. Those commands being: "sapply", and "str".

"sapply", if utilized in a manner similar to the example below, will display all variables within a data frame, and their corresponding data type, to the console window.

sapply(DataFrameName, class)

"str", if utilized in the manner displayed below, will display dimensional data pertaining to the data frame, level data pertaining to factor type variables, each variable's type, and the first few variable entries for each corresponding variable.

str(DataFrameName)

In the next article, I will discuss how to generate data summaries and measure for frequency within R. Additionally, I will also address how to export R data, and how to save data frames in .rda format.

Monday, July 3, 2017

(R) Establishing Working Directory & Importing Data

This is the first article, of what will probably be many articles, pertaining to R-Software. I am assuming that you are familiar with R-Software, and that you have the software installed. Additionally, I am also assuming that you have RStudio, the IDE, also installed.

Once you have the R Console open, you will first want to set your working directory.

This can be achieved with the command:

setwd("<pathway of working directory>")

For example, you could create a designated folder on your Windows Desktop for such a directory, and make that folder your working directory. The code for such would resemble:

setwd("C:/Users/Name/Desktop/RWorkDirectory")

It is important to note that you will have to change the default "\" to "/" (or double it, as in "\\"), as R treats a single backslash within a character string as an escape character.

The advantage for establishing a working directory, is that it allows for a certain level of convenience in importing, exporting, and saving data.

For example, if you were importing data without establishing a working directory, the code template for such would resemble:

(Assuming that the file is a .csv)

DataFrameA <- read.table("C:/Users/Name/Desktop/RWorkDirectory/Filename.csv", fill = TRUE, header = TRUE, sep = "," )

or

(Assuming that the file is tab delimited)

DataFrameB <- read.table("C:/Users/Name/Desktop/RWorkDirectory/Filename.txt", fill = TRUE, header = TRUE, sep = "\t" )

If you had established the working directory, the code statement would be much shorter:

DataFrameA <- read.table("Filename.csv", fill = TRUE, header=TRUE, sep="," )

or

DataFrameA <- read.table("Filename.txt", fill = TRUE, header=TRUE, sep="\t" )

Import Options

Fill, Header, and Sep are optional statements, but typically their inclusion is necessary. Here is what each option enables:

Fill - This option notifies R that the rows may be of unequal length, and that some records will be missing observational data. In the case of missing data, blank fields will be added to the short rows (these appear as NA in numeric columns) if this option is enabled.

Header - This indicates to R, that the first row of data contains column names.

Sep - This indicates the delimiter that separates each data value. "," indicates a comma separated file, and "\t" indicates a tab delimited file. Additionally, if the data values are separated by some other character (ex. #, @, or |), you can indicate this as an import option by listing it after sep =. Ex. sep = "|".
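
To see the three options working together without needing a file on hand, the sketch below writes a small pipe separated file to a temporary location and reads it back. The file contents and the TempFile variable are made up for illustration.

```r
# Write a small pipe separated file to a temporary location (hypothetical data)
TempFile <- tempfile(fileext = ".txt")
writeLines(c("Name|City|Age", "Ann|Boston|34", "Bob|Chicago"), TempFile)

# header = TRUE: the first line supplies the column names
# sep = "|":     fields are separated by the pipe character
# fill = TRUE:   the short record is padded instead of causing an error
DataFrameA <- read.table(TempFile, fill = TRUE, header = TRUE, sep = "|",
                         stringsAsFactors = FALSE)

print(DataFrameA)   # the Age value for the short record is NA

# Remove the temporary file
unlink(TempFile)
```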

Get Working Directory

If you ever forget where your work directory is located, you can always have it printed to the console by utilizing the command:

getwd()

In our example case, running the above command should output:

C:/Users/Name/Desktop/RWorkDirectory
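
The whole workflow can be tried safely with a sketch like the one below, in which tempdir() stands in for the project folder and the original working directory is restored at the end. The variable OriginalDirectory and the sample data are assumptions made for illustration.

```r
# Remember the current working directory so that it can be restored
OriginalDirectory <- getwd()

# tempdir() stands in for a folder such as C:/Users/Name/Desktop/RWorkDirectory
setwd(tempdir())

# With the working directory set, only the file name is needed
write.csv(data.frame(A = 1:3, B = 4:6), "Filename.csv", row.names = FALSE)
DataFrameA <- read.table("Filename.csv", fill = TRUE, header = TRUE, sep = ",")
print(DataFrameA)

# Clean up and restore the original working directory
file.remove("Filename.csv")
setwd(OriginalDirectory)
```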

In the next article, I will discuss how to check the integrity of newly imported data.