Reflections of a Data Scientist: (R) Qualitative Data

In this article we will briefly revisit the "tm" package in order to discuss one specific function which it contains: findAssocs(). This function assists with word association as it pertains to qualitative data.

Example (Word Correlation):

As the tools for analyzing qualitative data are constantly evolving, and no particular standard has yet become preferred, there are numerous methods which can be utilized, either individually or in tandem, to assess qualitative data. One of these methodologies is the word correlation technique.

Let us assume that the variable "data" contains responses gathered from a qualitative prompt within a survey.

# With the package "tm" installed and loaded #

library(tm)

# Prompt Responses #

data <- c("word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5")

# Data Re-Structuring #

frame <- data.frame(data)

myCorpus <- Corpus(VectorSource(frame$data))

tdm <- TermDocumentMatrix(myCorpus)

Once the data has been structured as a term-document matrix, the findAssocs() function within the "tm" package may be utilized to generate correlation output.

# Example Output Function #

findAssocs(tdm, "word2", 0.1)

# Which Produces Output #

$`word2`

word3 word4 word5

0.61 0.41 0.25

What this essentially illustrates is the strength of the correlation between "word2" and each of the other words found within the response data. Only words whose correlation with "word2" is at least 0.1 appear in the output. This lower limit is the third argument of the function call, which is repeated below.

findAssocs(tdm, "word2", 0.1)

The word that was placed within the function, "word2", as illustrated in the previous line of code, is the word which the function will seek to identify and then compare to the other words within the data variable.

The Math Behind the Function

It is certainly worthwhile to discuss the math which is utilized to generate the output values.

First, let's view our term-document matrix as an ordinary matrix by running the code below:

as.matrix(tdm)

This produces the output:

Docs

Terms 1 2 3 4 5

word1 1 1 1 1 1

word2 0 1 1 1 1

word3 0 0 1 1 1

word4 0 0 0 1 1

word5 0 0 0 0 1

What this output illustrates is which responses (documents) each word appears in. The word "word2" appears in four of the five responses (every response except the first). Likewise, the word "word5" occurs only once, in the fifth response.
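These counts can be checked by hand. The sketch below rebuilds the same matrix in base R, so it does not require "tm"; the object name m is an assumption made for this illustration. It counts the number of responses containing each word, and it also shows that "word1", which appears in every response, has zero variance. A correlation against a constant row is undefined, which is presumably why "word1" is absent from the findAssocs() output shown earlier.

```r
# Rebuild the term-document matrix shown above by hand (base R only).
# "m" is a hypothetical name used for this illustration.
terms <- paste0("word", 1:5)
m <- matrix(
  c(1, 1, 1, 1, 1,
    0, 1, 1, 1, 1,
    0, 0, 1, 1, 1,
    0, 0, 0, 1, 1,
    0, 0, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(Terms = terms, Docs = as.character(1:5))
)

# Number of responses in which each word appears
doc_counts <- rowSums(m)
print(doc_counts)

# "word1" is present in every response, so its row never varies
word1_var <- var(m["word1", ])
print(word1_var)

# Correlating against a constant row yields NA (R also emits a warning,
# which is suppressed here)
word1_cor <- suppressWarnings(cor(m["word2", ], m["word1", ]))
print(word1_cor)
```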

Therefore, when the function calculates the correlation between the words "word2" and "word5", the calculation is equivalent to:

# Correlation - "word2" and "word5" #

cor(c(0,1,1,1,1),c(0,0,0,0,1))
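For reference, cor() computes Pearson's correlation coefficient by default. Writing x for the "word2" row and y for the "word5" row (symbols introduced here purely for illustration), the arithmetic behind this call can be worked through as follows:

```latex
\begin{align*}
r_{xy} &= \frac{\sum_{i=1}^{5}(x_i-\bar{x})(y_i-\bar{y})}
              {\sqrt{\sum_{i=1}^{5}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{5}(y_i-\bar{y})^2}},
  \qquad \bar{x} = 0.8,\quad \bar{y} = 0.2 \\[4pt]
\sum_{i=1}^{5}(x_i-\bar{x})(y_i-\bar{y}) &= (-0.8)(-0.2) + 3\,(0.2)(-0.2) + (0.2)(0.8) = 0.2 \\
\sum_{i=1}^{5}(x_i-\bar{x})^2 &= (-0.8)^2 + 4\,(0.2)^2 = 0.8 \\
\sum_{i=1}^{5}(y_i-\bar{y})^2 &= 4\,(-0.2)^2 + (0.8)^2 = 0.8 \\[4pt]
r_{xy} &= \frac{0.2}{\sqrt{0.8}\,\sqrt{0.8}} = 0.25
\end{align*}
```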

Additional combinations resemble:

# Correlation - "word2" and "word3" #

cor(c(0,1,1,1,1),c(0,0,1,1,1))

# Correlation - "word2" and "word4" #

cor(c(0,1,1,1,1),c(0,0,0,1,1))

# Output #

[1] 0.25

[1] 0.6123724

[1] 0.4082483
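The individual cor() calls can also be folded into a single step. The following is a minimal sketch in base R which reproduces the findAssocs() output from the matrix; it is not the "tm" package's own implementation, and the names m, target, limit, others and assoc are assumptions chosen for illustration. It correlates the "word2" row with every other row, drops undefined values, keeps those at or above the limit, and rounds to two decimal places.

```r
# Term-document matrix from the example, rebuilt by hand (base R only)
m <- matrix(
  c(1, 1, 1, 1, 1,
    0, 1, 1, 1, 1,
    0, 0, 1, 1, 1,
    0, 0, 0, 1, 1,
    0, 0, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(Terms = paste0("word", 1:5), Docs = as.character(1:5))
)

target <- "word2"  # term of interest (hypothetical variable name)
limit  <- 0.1      # lower correlation limit, as in the findAssocs() call

# Correlate the target row with each of the remaining rows.
# suppressWarnings() hides the zero-variance warning caused by "word1".
others <- setdiff(rownames(m), target)
assoc <- suppressWarnings(
  sapply(others, function(term) cor(m[target, ], m[term, ]))
)

# Drop undefined correlations, apply the limit, round, and sort
assoc <- assoc[!is.na(assoc) & assoc >= limit]
assoc <- sort(round(assoc, 2), decreasing = TRUE)
print(assoc)
```

Printing assoc lists word3, word4 and word5 with the values 0.61, 0.41 and 0.25, which matches the findAssocs() output shown earlier.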

This method of demonstrating correlation can be useful in identifying the co-occurrence of words which connote positive or negative sentiment with the words that they ultimately describe. Better methods for performing this task will likely be developed within the upcoming decade. In the interim, stay ponderous!

** Much of the example code, with some modifications, was copied from a post made by user: "rtw30606", which was featured on the R mailing list archive. The link to this post can be found here: http://r.789695.n4.nabble.com/findAssocs-td3845751.html **

Monday, July 16, 2018

(R) Qualitative Data - Pt. (II)
