## Tuesday, August 15, 2017

### (R) Histogram and Box Plot

As promised, today we will be discussing two types of R graphs: the histogram and the box plot. I have also created an R function that can be used to identify outliers.

#### Box Plot

For this example, we will be using the data vector 'F'. Feel free to follow along; the code that creates vector 'F' is below:

F <- c(5,12,9,12,5,6,2,2)

To create a vertical box plot, the following example code can be utilized:

boxplot(F, main="Box Plot", ylab="Box Plot Demo")

F: is the data vector.
'main =' Displays the title of the graph.
'ylab =' Provides the title of the y-axis.

If we were to graph vector 'F' with the code above, the output would resemble:

If you wanted to use the same vector to create a horizontal box plot, you would use this set of code:

boxplot(F, main="Box Plot", xlab="X-Axis title", ylab="Box Plot Demo", horizontal = TRUE)

The outcome of the above code resembles:

In this case, we are adding an x-axis title with 'xlab =' and setting the 'horizontal =' option to TRUE. By default, this option is FALSE.

These are just basic examples of box plots; there are many other features and customizable options that can be used to create the perfect box plot for your needs. If you would like more information on these options, run '?boxplot' within R to open the help page.
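As a minimal sketch of a few of those options, the code below draws two box plots side by side and keeps the object that boxplot() returns invisibly. The second vector 'G' and the color choices are assumptions added purely for illustration; they are not part of the original example.

```r
# Data vector from the example above.
# Note: assigning to F masks R's shorthand for FALSE; the spelled-out
# TRUE and FALSE used in this post are unaffected.
F <- c(5, 12, 9, 12, 5, 6, 2, 2)

# Hypothetical second vector, added only so there is something to compare.
G <- c(3, 4, 4, 5, 6, 7, 7, 8)

# Passing a named list draws one box per list element.
b <- boxplot(list(F = F, G = G),
             main = "Box Plot",
             ylab = "Box Plot Demo",
             col = c("lightblue", "lightgreen"),  # fill color of each box
             border = "darkblue")                 # outline color

# Each column of b$stats holds the numbers one box is drawn from:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
b$stats
```

For vector 'F', the first column of b$stats is 2, 3.5, 5.5, 10.5, 12, which matches fivenum(F) because 'F' contains no outliers.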

#### Tracking Outliers

In R, outliers for box plots are defined as values that fall more than 1.5 * IQR below the first quartile or more than 1.5 * IQR above the third quartile. Though these points appear in the graph, their values are not printed by R when the plot is drawn. To find out what these outlier values are, if such values exist, I have created the following function:

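OutlierFunction <- function(t) {

  # fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
  q1 <- fivenum(t)[2] # Q1
  q3 <- fivenum(t)[4] # Q3

  iqrange <- q3 - q1

  # Fences sit 1.5 * IQR beyond each quartile; '<<-' stores them globally
  out1 <<- (q1 - (iqrange * 1.5))
  out2 <<- (q3 + (iqrange * 1.5))

  # subset() drops missing values on its own, so no na.rm argument is needed
  lowout <<- subset(t, t < out1)
  highout <<- subset(t, t > out2)

}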

Pass the vector or data frame column that you wish to assess into the function with the call:

OutlierFunction(<dataframecolumn or vector>)

The outliers which fall below the lower whisker of the box plot are stored in the global vector 'lowout'. The outliers which fall above the upper whisker of the box plot are stored in the global vector 'highout'. The cutoff values themselves are stored in 'out1' and 'out2'.
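A variant that avoids writing to the global environment can be sketched as follows. The name 'find_outliers' is hypothetical, and the sample vector is the example data with one extreme value (40) appended for illustration, since 'F' itself has no outliers. The sketch returns its results in a list and compares them with base R's boxplot.stats(), which reports the same points in its 'out' component.

```r
# Hypothetical helper: same 1.5 * IQR rule, but results are returned
# in a list instead of being assigned globally.
find_outliers <- function(x) {
  hinges  <- fivenum(x)[c(2, 4)]        # lower and upper hinge (Q1, Q3)
  iqrange <- hinges[2] - hinges[1]
  lower   <- hinges[1] - 1.5 * iqrange  # lower fence
  upper   <- hinges[2] + 1.5 * iqrange  # upper fence
  list(lower_fence = lower,
       upper_fence = upper,
       lowout  = x[!is.na(x) & x < lower],
       highout = x[!is.na(x) & x > upper])
}

# Example data with one illustrative extreme value appended.
x <- c(5, 12, 9, 12, 5, 6, 2, 2, 40)

res <- find_outliers(x)
res$lower_fence        # -5.5
res$upper_fence        # 22.5
res$highout            # 40

# Built-in cross-check: the points a box plot would draw as outliers.
boxplot.stats(x)$out   # 40
```

Returning a list keeps the workspace clean and lets the result be assigned to whatever name suits the analysis.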

#### Histograms

I have created two examples which demonstrate R's capacity to create histograms.

This example demonstrates a histogram which measures density along the Y-Axis:

hist(F,
freq = FALSE,
col = "Green",
xlab = "X-Axis Label",
main = "Hist Demo")

F: is the data vector.
'freq =' Specifies the histogram type.
'col =' Specifies the color of the graph.
'xlab =' Provides the title of the x-axis.
'main =' Displays the title of the graph.

Here is the graphical output for this example code:

This example demonstrates a histogram which measures frequency along the Y-Axis:

hist(F,
freq = TRUE,
breaks = 4,
col = "orange",
xlab = "X-Axis Label",
main = "Hist Demo")

F: is the data vector.
'freq =' Specifies the histogram type.
'breaks =' Suggests the number of cells for the histogram. R treats a single number as a suggestion and may adjust it to produce round break points.
'col =' Specifies the color of the graph.
'xlab =' Provides the title of the x-axis.
'main =' Displays the title of the graph.

Here is the graphical output for this example code:

The main difference between the two is the 'freq =' option. If the option is set to TRUE, the histogram plots frequency. If FALSE, the histogram plots density.
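To make the distinction concrete, here is a small sketch that computes the histogram without drawing it and inspects the numbers behind the bars. The explicit break points are an illustrative assumption; supplying a vector of break points, rather than a single number, is one way to control the cells exactly.

```r
F <- c(5, 12, 9, 12, 5, 6, 2, 2)

# Explicit break points (an illustrative choice) give exactly four
# cells: [0,3], (3,6], (6,9], (9,12].
h <- hist(F, breaks = seq(0, 12, by = 3), plot = FALSE)

# Frequency: the number of values falling in each cell.
h$counts                          # 2 3 1 2
sum(h$counts)                     # 8

# Density: bar heights scaled so that the bar areas add up to 1.
sum(h$density * diff(h$breaks))   # 1
```

The same object is returned invisibly when the histogram is drawn, so both views are always available from one call.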

Additionally, there are times when you may want to add vertical lines that mark measures of central tendency. The code for adding these lines to an existing histogram can be found below:

# Adds a black line with a width of '3' which indicates the mean value #
abline(v=mean(F), col="black", lwd = 3)

# Adds a red line with a width of '3' which indicates the median value #
abline(v=median(F), col="red", lwd = 3)
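Put together, a complete run might look like the sketch below. The legend() call is an addition not found in the original code, included only so the two lines can be told apart on the finished plot.

```r
F <- c(5, 12, 9, 12, 5, 6, 2, 2)

# Draw the histogram first; abline() adds to the current plot.
hist(F,
     freq = TRUE,
     col = "orange",
     xlab = "X-Axis Label",
     main = "Hist Demo")

abline(v = mean(F), col = "black", lwd = 3)    # mean of F is 6.625
abline(v = median(F), col = "red", lwd = 3)    # median of F is 5.5

# Illustrative addition: a legend identifying the two lines.
legend("topright",
       legend = c("Mean", "Median"),
       col = c("black", "red"),
       lwd = 3)
```

Because the mean (6.625) sits to the right of the median (5.5), the two lines also give a quick visual hint that the data are skewed toward higher values.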

In the next entry, I will discuss Stem and Leaf Plots and Central Frequency Plots.