Tuesday, July 10, 2018

(R) Qualitative Data

Given the ever-increasing pace of modern existence, the potential for large-scale dilemmas to emerge seemingly out of nowhere is a challenge that we must all collectively face.

Two significant issues inhibit our ability to appropriately address this larger catalyst: an overall lack of dynamic data, and a system of generalizations which is no longer applicable in the modern environment.

Data collection, analysis, and the subsequent presentation of results require large amounts of time and effort. As a result, by the time we decide which way to steer our allegorical ship, we may find ourselves far removed from our original destination.

Additionally, as it pertains to generalizations, much of the world is now experiencing increased materialistic standards of living. This inevitably leads to an increase in hyper-individuality and, consequently, an increase in sensitivity to outside stimuli. As a result, it is difficult to draw conclusions from the generalities which were previously applicable in the form of quantitative rating scales.

At the present time, the collection of qualitative information, combined with traditional survey analysis methodologies, is the optimal method for preparing dynamic and specialized solutions to rapidly emergent criticism.

The challenge, of course, is how to sculpt this unstructured data into something which can be quantifiably assessed and presented.

In today's article we will discuss how to apply traditional analysis to unstructured qualitative data. I use the term "traditional" because other methods exist which can collect and analyze data with greater efficiency, at least as it pertains to dynamism.*

Example (Word Clouds and Word Frequency):

We’ll suppose that your organization issued a survey which contained qualitative prompts. The data collected is exhibited below:


This data, as it exists within its present incarnation, tells us nothing, and cannot, in its current format, be presented to outside parties.

The following steps rectify this scenario by quantifying the data and generating an orderly report which is easily readable by non-technically oriented individuals.

# Packages required to generate example output: #

# tm – An R-centric text mining package. #

# wordcloud – Generates word cloud illustrations within the R platform. #

# RColorBrewer – Enables additional color schemes for word cloud output. #

# syuzhet – Extracts sentiment and sentiment-derived plot arcs from text. #

# Install and enable all of the packages listed above #
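The installation and loading steps can be performed with the code below:

```r
# Install the required packages (only needed once per machine)
install.packages(c("tm", "wordcloud", "RColorBrewer", "syuzhet"))

# Load the packages into the current session
library(tm)
library(wordcloud)
library(RColorBrewer)
library(syuzhet)
```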


First, we must create the data frame needed to demonstrate the example code:

ID <- c(1,2,3,4,5)

Comment <- c("My life is terrible.", "Give me free stuff!", "Obviously, anyone who is cognitive, can observe the inconsistences which are inherit within your demeanor.", "Sucks. Sucks. Sucks. Sucks!", "K.")

WordCloudEx <- data.frame(ID, Comment)

# The code below ensures that the comment data is stored as a character vector. This step is required, as the functions within the packages listed above will not work properly unless the data is in this format. #

WordCloudEx$Comment <- as.vector(WordCloudEx$Comment)


There will be instances in which you may want to correct the spelling of the collected qualitative data, or remove un-needed expletives. Keep in mind in scenarios such as these that the gsub function, which is native to base R (not the "tm" package), is case sensitive, and that its pattern argument is interpreted as a regular expression, so the period in "Sucks." matches any single character. This is demonstrated below.

# Replace data instances #

# Information is listed twice due to case sensitivity #

WordCloudEx$Comment <- gsub("Sucks.", "stinks", WordCloudEx$Comment)

WordCloudEx$Comment <- gsub("sucks.", "stinks", WordCloudEx$Comment)
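Alternatively, the two calls above can be collapsed into one by setting gsub's ignore.case argument, a base R option. The sketch below uses a standalone string rather than the survey data, and drops the trailing period from the pattern so that both "Sucks." and "Sucks!" are caught without relying on the period-as-wildcard behavior:

```r
# One case-insensitive substitution replaces both case-sensitive calls
cleaned <- gsub("sucks", "stinks", "Sucks. Sucks. sucks. Sucks!",
                ignore.case = TRUE)
cleaned
```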


We must now transform the data into a format which can be read and manipulated by the “tm” package.

# This function morphs the data vector into a format which can be manipulated by the "tm" package #

cloudvar <- Corpus(VectorSource(WordCloudEx$Comment))


If our first goal is to create a word cloud, we must massage certain aspects of the data in order to remove output redundancies. This can be achieved through the code below.

# Remove data redundancies #

# Convert the text to lower case first, so that capitalized stopwords are also caught #

cloudvar <- tm_map(cloudvar, content_transformer(tolower))

# Remove "stopwords" such as: "the", "he", or "she" #

cloudvar <- tm_map(cloudvar, removeWords, stopwords('english'))

# Remove numbers #

cloudvar <- tm_map(cloudvar, removeNumbers)

# Remove punctuation #

cloudvar <- tm_map(cloudvar, removePunctuation)

# Eliminate extra white spaces #

cloudvar <- tm_map(cloudvar, stripWhitespace)
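Before proceeding, it is worth verifying that the cleaning steps behaved as intended. One way to do this, assuming the "cloudvar" corpus created above, is the "tm" package's inspect function:

```r
# Preview the cleaned corpus to confirm the transformations took effect
inspect(cloudvar)

# Or view a single cleaned response (here, the fourth)
content(cloudvar[[4]])
```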

This next code segment was copied from the website listed below.

It creates a variable which contains the frequency of each word within the qualitative data.

The "head" function, which is native to the R programming language, allows for the display of the most frequently used words and their frequencies.

# Function copied from the website below #

# http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know #

dtm <- TermDocumentMatrix(cloudvar)

m <- as.matrix(dtm)

v <- sort(rowSums(m),decreasing=TRUE)

d <- data.frame(word = names(v),freq=v)

# Displays the frequency of the top 10 most utilized words #

head(d, 10)


If you would instead prefer to view the 5 most utilized words, the code would resemble the following:

head(d, 5)

The second argument of the function can be set to any value which you see fit. For example, the following code displays the frequency of the top 12 most utilized words.

head(d, 12)

The output for the code:

head(d, 10)

Would resemble the following:

word freq

stinks stinks 4
life life 1
terrible terrible 1
free free 1
give give 1
stuff stuff 1
anyone anyone 1
can can 1
cognitive cognitive 1
demeanor demeanor 1
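A bar chart offers an alternative view of the same frequencies, and is sometimes easier for outside parties to read than a word cloud. The sketch below assumes the frequency data frame "d" created above, and uses base R graphics:

```r
# Plot the ten most frequent words as a bar chart
barplot(d[1:10, ]$freq, names.arg = d[1:10, ]$word,
        las = 2, col = "steelblue",
        main = "Top 10 Words by Frequency", ylab = "Frequency")
```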


We are now about ready to create our word cloud.

The code which will enable such is as follows:

set.seed(3455)
wordcloud(words = d$word, freq = d$freq, min.freq = 1, scale = c(2, 0.5),
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))


Which produces the following graphic:


Many of the options required to create the word cloud can be left at their default values. However, there are a few aspects of the code which you may want to consider modifying.

set.seed(<number>) – The wordcloud() function relies on random number generation to place words within the graphical output. Changing this value will alter the output illustration.

min.freq – This option indicates the number of times a word must appear prior to it being considered for inclusion within the output.

scale – This option sets the size of the output graphic. If aspects of the graphic are cut off, try inputting different values for each scale element. (ex: (4, 0.2) etc.)

max.words – This option indicates the maximum number of words to be included within the word cloud.

colors – Since we are utilizing the RColorBrewer package, we have additional options which can be utilized to produce varying colorful outputs. Other variations which can be implemented are as follows:

colors = brewer.pal(8, "Accent")
colors = brewer.pal(8, "Dark2")
colors = brewer.pal(12, "Paired")
colors = brewer.pal(9, "Pastel1")
colors = brewer.pal(8, "Pastel2")
colors = brewer.pal(9, "Set1")
colors = brewer.pal(8, "Set2")
colors = brewer.pal(12, "Set3")
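If you are unsure which palette to choose, the RColorBrewer package can preview all of its palettes at once:

```r
# Display every palette available in RColorBrewer
display.brewer.all()

# List palette names alongside their maximum color counts
brewer.pal.info
```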


Example (Response Sentiment Analysis):

While word clouds illustrate the frequency of commonly utilized words, it is also necessary to gauge the overall sentiment of the surveyed sample as it pertains to the given prompt.

This is achieved through the application of sentiment lexicons to the words found within each response. With the NRC lexicon used below, each word is scored by connotation: neutral words contribute "0", positive words "1", and negative words "-1" toward a response's valence. (The get_sentiment() function with method = "syuzhet" instead assigns fractional scores from the "syuzhet" dictionary.)

Typically, the syuzhet functions evaluate data in a way which applies a score to each sentence. Therefore, if the variable which requires analysis contains numerous sentences, then a few additional steps must be taken prior to proceeding.

# Remove sentence completion characters #

WordCloudForm <- gsub("[.]","", WordCloudEx$Comment)
WordCloudForm <- gsub("[!]","", WordCloudForm)
WordCloudForm <- gsub("[?]","", WordCloudForm)

# Collapse the product of the prior procedure into a single sentence #

WordCloudForm <- paste(paste(WordCloudForm, collapse = " "), ".")

# Perform Analysis #

s_v <- get_sentences(WordCloudEx$Comment)

syuzhet_vector <- get_sentiment(s_v, method="syuzhet")

############################################################

nrc_data <- get_nrc_sentiment(s_v)

valence1 <- nrc_data$positive - nrc_data$negative

############################################################

Two variables are produced as a product of the analysis.

nrc_data – This variable contains further analysis through the attribution of each identified word to a fundamental emotion. There are five observations within the set as there are five responses contained within the variable column.

# Calling the “nrc_data” variable produces the following output #

nrc_data

# Output #




Generally, this granular emotion data is less useful for summary reporting.

valence1 – This variable contains the sum of all positive and negative word values contained within the qualitative variable column.

# Calling the “valence1” variable produces the following output #

valence1

# Output #

[1] -1 0 1 0 0

There are numerous applications for this information. You could assess the mean value of the responses with the mean() function, or the standard deviation of the responses with the sd() function. Or, you could produce a general summary of the range with the summary() function.

mean(valence1)

sd(valence1)

summary(valence1) 


Outputs:

> mean(valence1)
[1] 0
> sd(valence1)
[1] 0.7071068
> summary(valence1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1 0 0 0 0 1
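Beyond summary statistics, the valence scores can be used to isolate individual responses for follow-up. The sketch below re-declares the example responses and the valence output shown above so that it is self-contained; base R subsetting and table() do the rest:

```r
# Valence scores and responses, as produced in the example above
valence1 <- c(-1, 0, 1, 0, 0)
Comment <- c("My life is terrible.", "Give me free stuff!",
             "Obviously, anyone who is cognitive, can observe the inconsistences which are inherit within your demeanor.",
             "Sucks. Sucks. Sucks. Sucks!", "K.")

# Isolate responses whose valence is negative for follow-up review
Comment[valence1 < 0]

# Count how many responses fall into each valence category
table(valence1)
```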


If you would prefer, and this is something that I would recommend doing, you could add the “valence1” column to the data frame to track the total associated with each individual response.

The code required to achieve this is:

QualDataOutput <- data.frame(WordCloudEx, valence1)
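Once the valence column is attached, the combined frame can also be exported for distribution to outside parties. The sketch below rebuilds an illustrative version of the frame (mirroring the article's example data before the gsub replacements) so that it runs standalone; the file name is hypothetical:

```r
# Rebuild an illustrative version of the example frame
ID <- c(1, 2, 3, 4, 5)
Comment <- c("My life is terrible.", "Give me free stuff!",
             "Obviously, anyone who is cognitive, can observe the inconsistences which are inherit within your demeanor.",
             "Sucks. Sucks. Sucks. Sucks!", "K.")
valence1 <- c(-1, 0, 1, 0, 0)
QualDataOutput <- data.frame(ID, Comment, valence1)

# Export to a CSV file for distribution (file name is illustrative)
write.csv(QualDataOutput, "QualDataOutput.csv", row.names = FALSE)
```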

Which produces a data frame which resembles:


As you can observe from the newly created data frame, the “Syuzhet” dictionary is not a perfect method for sentiment analysis. However, at this present time, it is one of the better alternatives. 
For more information on the “Syuzhet” dictionary, its entries, and its assignments, please check out the links below:

http://saifmohammad.com/WebPages/lexicons.html

http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

That’s all for now, Data Heads! Stay tuned for more exciting methods of analysis!

* - By "dynamism" I am referring to, as a collective term, the generation of near-instantaneous results. This may be covered in a future article. In practice, it would resemble data mining from various platforms combined with immediate analysis based on pre-established methodology.
