Reflections of a Data Scientist: (R) Checking Data Integrity

Before moving on to the topic of data integrity, I wanted to address two additional methods that can be used to import data into R. I do not personally use either of these methods, because they rely on the user interface. The methods discussed previously leave a record of the import within the code, and that recorded file path is helpful to a user who must return to a project at a later date.

However, if you are looking for a more user-friendly way to import data, the following methods may better suit your needs.

The easiest method of importing data is the following. This method leaves no record of the import within your code, and RStudio must be installed for it to work.

First, open RStudio. Then click the drop-down menu option that reads: "Import Dataset".

Each file type requires that a particular R package be installed before proceeding. If you have already installed the package required for the file type, you will be able to proceed with the import.

The other method that can be used to import data into R is a hybrid of code and user interaction.

You will need to run the following code template from the R console.

<datasetname> <- read.table(file.choose(), <importoptions>)

So, if we were to utilize this template on our previous examples, the code would resemble the following:

(Assuming that the file is a .csv)

DataFrameA <- read.table(file.choose(), fill = TRUE, header=TRUE, sep="," )

(Assuming that the file is tab-delimited)

DataFrameA <- read.table(file.choose(), fill = TRUE, header=TRUE, sep="\t" )

Either variation will cause your operating system to open its native file selection window.

From this window, you will be able to select the file that you would like to import into R.
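
Since my main reservation about file.choose() is that the selected path is never recorded, one possible workaround is to print the path at import time, so that it can be pasted back into the script afterwards. The sketch below is only an illustration: read_and_log is a hypothetical helper name of my own, and the temporary .csv file exists solely so that the example can run without a file dialog.

```r
# Hypothetical helper (not part of base R): read a delimited file and
# report the path that was used, so it can be copied into the script.
read_and_log <- function(path = file.choose(), sep = ",") {
  # file.choose() is only evaluated when no path is supplied.
  message("Importing: ", path)
  read.table(path, fill = TRUE, header = TRUE, sep = sep)
}

# Demonstration with a temporary .csv file, so no dialog is needed.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(90, 85, 77)), tmp, row.names = FALSE)

DataFrameA <- read_and_log(tmp)
```

Calling read_and_log() with no arguments in an interactive session should open the file selection window as before, and the chosen path will then appear in the console.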

Checking Data Integrity

After your data has been imported into R, you should check its integrity to make sure that all of the original data was imported correctly. Listed below are some of the commands that can be used to verify this.

<DataFrameName>[c(1, 2, 3), ] - Displays the first three rows of the data frame, along with all columns.

<DataFrameName>[1:3, ] - Performs the same action as the above command. However, because the rows are given as a range, viewing additional rows does not require listing each row number individually.

dim(<DataFrameName>) - Displays the dimensions of the data frame, as the number of rows followed by the number of columns.

summary(<VarName>) - Produces a brief statistical summary of a numeric variable, or a frequency count for each level of a factor variable. For a character variable, only the length, class, and mode are reported.

levels(<VarName>) - Displays the distinct levels of a factor variable. It returns NULL for variables that are not factors.

class(<VarName>) - Indicates the type of the specified variable.
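
To make the commands above more concrete, here is a minimal sketch that runs each of them against a small data frame. The data frame and its values are made up for illustration, and the variable names (id, group, score) are assumptions of mine rather than anything from the earlier examples.

```r
# Small illustrative data frame (made-up values).
DataFrameA <- data.frame(
  id    = 1:5,
  group = factor(c("a", "b", "a", "c", "b")),
  score = c(90.5, 85, 77.25, 60, 88)
)

# First three rows, selected in the two equivalent ways.
rows_listed <- DataFrameA[c(1, 2, 3), ]
rows_ranged <- DataFrameA[1:3, ]

# Dimensions: number of rows, then number of columns.
frame_dims <- dim(DataFrameA)

# Summaries of a numeric variable and of a factor variable.
score_summary <- summary(DataFrameA$score)
group_counts  <- summary(DataFrameA$group)

# Levels and class of individual variables.
group_levels <- levels(DataFrameA$group)
score_class  <- class(DataFrameA$score)

print(rows_ranged)
print(frame_dims)
print(score_summary)
print(group_counts)
print(group_levels)
print(score_class)
```

Note that individual variables are referenced with the $ operator, as in DataFrameA$score.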

You also have the option of printing the entire data set to the console. However, this is only feasible if the data frame is not overly large. If you do choose to print the data frame to the console, I would recommend setting the option below before proceeding. This option increases the console output width, which allows the printed data to display correctly.

options(width = 200)

The command below prints the data frame to the console:

print(<DataFrameName>)
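
If you would rather not leave the wider console setting in place for the rest of the session, one approach is to save the previous value and restore it afterwards. This is a minimal sketch with a made-up data frame; it relies on the fact that options() returns the old values of whatever it changes.

```r
# Small illustrative data frame (made-up values).
DataFrameA <- data.frame(id = 1:3, score = c(90, 85, 77))

# options() returns the previous settings, which we keep for later.
old_options <- options(width = 200)

print(DataFrameA)

# Restore the console width that was in effect before.
options(old_options)
```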

If this command is impractical due to the size of the data frame, you could instead use the head or tail commands.

The head command template is:

head(<DataFrameName>, n=<number of rows to display>)

Executing this command will display the first n rows of the data frame.

Example:

# Print the first 10 rows of the data set #

head(DataFrameA, n=10)

The tail command template is:

tail(<DataFrameName>, n=<number of rows to display>)

Executing this command will display the last n rows of the data frame.

Example:

# Print the last 5 rows of the data set #

tail(DataFrameA, n=5)
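
The head and tail commands become more useful as an integrity check when paired with a row count taken directly from the source file. The sketch below is one way this might be done, assuming a .csv file with a single header line and no embedded line breaks within fields. The temporary file and its contents are made up so that the example is self-contained.

```r
# Create a temporary .csv file to stand in for the original data.
tmp <- tempfile(fileext = ".csv")
source_data <- data.frame(id = 1:20, value = seq(0.5, 10, by = 0.5))
write.csv(source_data, tmp, row.names = FALSE)

# Import the file in the same way as the earlier examples.
DataFrameA <- read.table(tmp, fill = TRUE, header = TRUE, sep = ",")

# Count the lines in the raw file; subtract one for the header line.
raw_lines <- readLines(tmp)
expected_rows <- length(raw_lines) - 1
rows_match <- nrow(DataFrameA) == expected_rows

# Inspect both ends of the data frame against the source file.
first_rows <- head(DataFrameA, n = 10)
last_rows  <- tail(DataFrameA, n = 5)

print(rows_match)
print(first_rows)
print(last_rows)
```

If rows_match is FALSE, or if the last rows shown by tail do not resemble the end of the source file, the import options (such as sep or header) are worth a second look.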

The "fix" command serves a similar purpose to the commands listed previously. In many cases, it may be best to run this command first when checking data integrity, before proceeding with other commands.

"fix" launches a graphical editor in which the user can edit individual entries within the data frame. The user is also given the opportunity to change variable types. The downside of this command is that it only presents the first 5000 rows of data. Also, no record is left within the code to indicate whether any data modifications took place.

# Allows the user to edit the first 5000 observations #

fix(DataFrameName)
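
Because edits made through "fix" leave no trace, an alternative worth considering is to make the same corrections in code. The following is a minimal sketch with made-up values: it corrects one entry and changes one variable's type, which are the two kinds of changes described above.

```r
# Small illustrative data frame; the negative score is a made-up error.
DataFrameA <- data.frame(
  id    = 1:3,
  score = c(90, -85, 77)
)

# Correct a single entry (row 2 of the score column) in code.
DataFrameA$score[2] <- 85

# Change a variable's type in code rather than through the editor.
DataFrameA$id <- as.character(DataFrameA$id)

print(DataFrameA)
print(class(DataFrameA$id))
```

Corrections made this way remain in the script, so they can be reviewed or repeated when the project is revisited.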

There are two other commands with which you should also familiarize yourself, though they are very similar in function to the commands discussed previously: "sapply" and "str".

"sapply", if used as in the example below, will display each variable within a data frame, along with its class, in the console.

sapply(DataFrameName, class)

"str", if used as in the example below, will display the dimensions of the data frame, each variable's type, the levels of factor variables, and the first few entries of each variable.

str(DataFrameName)

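As a rough illustration of how these two commands can serve as an integrity check, the sketch below compares the classes reported by sapply against the classes one expects the import to have produced. The data frame, its column names, and the expected classes are all assumptions made for the sake of the example.

```r
# Small illustrative data frame (made-up values).
DataFrameA <- data.frame(
  id    = 1:4,
  group = factor(c("a", "b", "a", "b")),
  score = c(90.5, 85, 77.25, 60)
)

# Class of every column, returned as a named character vector.
column_classes <- sapply(DataFrameA, class)
print(column_classes)

# Compact overview: dimensions, types, factor levels, first entries.
str(DataFrameA)

# Assumed expectation of what the import should have produced.
expected_classes <- c(id = "integer", group = "factor", score = "numeric")
types_match <- identical(column_classes, expected_classes)
print(types_match)
```

One caveat: if any column carries more than one class (an ordered factor or a date-time column, for example), sapply returns a list rather than a character vector, and the comparison above would need to be adjusted.
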
In the next article, I will discuss how to generate data summaries and measure frequency within R. I will also address how to export R data, and how to save data frames in .rda format.

Saturday, July 8, 2017
