Reflections of a Data Scientist: July 2018

Tuesday, July 31, 2018

(Python) Basic Data Manipulation

In the previous article, we discussed data types. In this entry we will review a few basic data manipulation techniques. To reiterate, I do understand that this subject matter is extremely foundational, and that is indeed the premise.

To assign a variable:

a = 15

b = 16

To print a variable:

print(a)

To multiply variables:

print(a * b)

To add variables:

print(a + b)

To subtract variables:

print(a - b)

To create an exponential value:

# Square the value of "a" #

print(a ** 2)

To produce a remainder value:

# This operator is also known as a "modulus" #

print(a % b)

To store the result of a calculation:

c = a + b
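As a quick check, the operators above can be collected into one short script. This is only a sketch which reuses the values 15 and 16 from the examples; the names "product", "total", and so on are my own, chosen for illustration.

```python
# Reuse the example values from above
a = 15
b = 16

product = a * b      # multiplication
total = a + b        # addition
difference = a - b   # subtraction
square = a ** 2      # exponent
remainder = a % b    # modulus (remainder after division)

print(product, total, difference, square, remainder)
# → 240 31 -1 225 15
```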

To join multiple strings:

a = "the dog"

b = "is"

c = "brown"

d = a + " " + b + " " + c

# The additional quotes create space between each string variable #

print(d)

Console Output:

the dog is brown
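If the number of strings grows, typing each " " by hand becomes tedious. The sketch below shows two alternatives which I believe produce the same result: the built-in str.join() method, and an f-string (available from Python 3.6 onward). The names "d_join" and "d_fstring" are illustrative only.

```python
a = "the dog"
b = "is"
c = "brown"

# str.join places the separator between every element of the list
d_join = " ".join([a, b, c])

# An f-string embeds the variables directly within the text
d_fstring = f"{a} {b} {c}"

print(d_join)     # → the dog is brown
print(d_fstring)  # → the dog is brown
```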

Assignment Trickery

Let us consider the following lines of code:

a = 5

b = a

a = 6

What would we expect to print to the console if we ran the following statement?

print(b)

You might assume that the value printed for variable “b” would be:

6

However, this is not the case. Instead of the value “6”, the value “5” is printed to the console. The reason is that “b” was assigned the value of “a” prior to the re-assignment. As such, “b” refers to the value which “a” held at that time, not to the variable “a” itself in a dynamic sense.

If we wanted “b” to reflect the updated value of “a”, we could assign it again after “a” has changed:

a = 5

b = a

a = 6

b = a

print(b)

Which would produce the following console output:

6

Fun with Strings

The following are a few functions which are especially useful when modifying string variables.

strip()

The strip function removes blank spaces from the beginning and end of a string variable value:

a = " Strip this string "

print(a.strip())

The console output is as follows:

Strip this string

len()

The length function provides the length of the string variable passed to the function.

a = "What’s going on?"

print(len(a))

The console output is as follows:

16

lower()

The lower function returns a copy of the string in all lower case lettering. The original variable is left unchanged unless the result is assigned back to it.

a = "WHAT’S GOING ON?!"

print(a.lower())

The console output is as follows:

what’s going on?!

upper()

The upper function returns a copy of the string in all upper case lettering. The original variable is left unchanged unless the result is assigned back to it.

a = "make this all upper case!"

print(a.upper())

The console output is as follows:

MAKE THIS ALL UPPER CASE!

split()

The split() function splits the string into a list, breaking it at each occurrence of the specified separator.

a = "split, this, line up!"

print(a.split(",")) # The separator in this case is a comma (",") #

The console output is as follows:

['split', ' this', ' line up!']
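Notice that the second and third elements above still begin with a blank space. A minimal sketch of one way to tidy this, combining split() with the strip() function discussed earlier (the name "parts" is illustrative):

```python
a = "split, this, line up!"

# Split on the comma, then strip the leftover whitespace from each piece
parts = [piece.strip() for piece in a.split(",")]

print(parts)  # → ['split', 'this', 'line up!']
```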

(Python) Python Data Types

In building a foundation for future entries on the topic of Python programming, we must first discuss the data types inherent within the Python programming language. I am aware that this is not the most riveting material. However, bear with me, as a strong understanding of the fundamental aspects is essential for later mastery of the subject.

Numbers

An integer, or an "int", is essentially a whole number. A float is a number which possesses a decimal component. Unlike in many other programming languages, a numeric variable does not need to be assigned a type when it is initially declared.

# integer #

a = 11

# float #

a = 11.11

Strings

A string is a non-numeric value, typically a series of characters. For example, "This is a string" is a string type variable. However, like the example below, a string can also contain numerical characters, such as "1000". Strings receive a different treatment within the programming language, as compared to numeric variables. This will be demonstrated within future articles.

# string #

a = "11"

Lists

A list is a collection of elements. Lists are ordered and modifiable.

# list #

a = ["a", "b", "c", "d"]

# list #

a =[1,2,3,4,5]

# list #

a = ["a","b","c",4,5]

# multi-dimensional list #

a = [["a", "b", "c"], [1, 2, 3]]
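To make "ordered and modifiable" concrete, here is a small sketch. It assumes nothing beyond built-in list behavior; the names "first" and "nested" are illustrative.

```python
a = ["a", "b", "c", "d"]

first = a[0]        # ordered: positions start at 0
a[1] = "z"          # modifiable: replace an element in place
a.append("e")       # modifiable: add an element to the end

print(first)  # → a
print(a)      # → ['a', 'z', 'c', 'd', 'e']

# Elements of a multi-dimensional list are reached with two indexes
nested = [["a", "b", "c"], [1, 2, 3]]
print(nested[1][2])  # → 3
```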

Tuple

A tuple is a collection of elements. Tuples are ordered and un-modifiable.

# Fixed in size #

# tuple #

a = ('a', 'b', 'c', 'd', 'e')
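The sketch below illustrates what "un-modifiable" means in practice: elements can be read by position, but an attempt to replace one raises a TypeError. The flag "modified" is an illustrative name of my own.

```python
a = ('a', 'b', 'c', 'd', 'e')

print(a[0])    # → a
print(len(a))  # → 5

try:
    a[0] = 'z'             # tuples do not support item assignment
    modified = True
except TypeError:
    modified = False

print(modified)  # → False
```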

Dictionary

A dictionary is a collection of key-value pairs which are accessed by key rather than by position. It is modifiable and does not allow for duplicate keys.

# dictionary #

fruit = {'red': 'apple', 'green': 'pear'}
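A short sketch of how a dictionary behaves when read and modified. The entries 'banana' and 'cherry' are hypothetical additions made purely for illustration.

```python
fruit = {'red': 'apple', 'green': 'pear'}

print(fruit['red'])          # look up a value by its key → apple

fruit['yellow'] = 'banana'   # add a new key-value pair
fruit['red'] = 'cherry'      # assigning to an existing key overwrites its value

print(len(fruit))            # → 3 (the key 'red' was not duplicated)
print(fruit['red'])          # → cherry
```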

Checking Data Types

To check a variable type within the Python console, the following function can be utilized:

type()

# Example #

a = 4.44

print(type(a))

Which produces the output:

<class 'float'>

Changing Variable Types

There are a few different options available as it pertains to modifying variable types. Which one to utilize depends on the type of variable that you would like to produce.

str() - To produce a string variable.

float() - To produce a float variable.

int() - To produce an integer variable.

# Example #

a = 4.44

# Modify the variable to a string variable type #

a = str(a)

print(type(a))

# Modify the variable to a float variable type #

a = float(a)

print(type(a))

# Modify the variable to an integer variable type #

a = int(a)

print(type(a))
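Two details of these conversions are worth seeing explicitly, so here is a brief sketch. As far as I am aware, int() discards the decimal portion of a float rather than rounding it, and it cannot directly parse a string which contains a decimal point. The variable names are illustrative.

```python
a = 4.44

as_string = str(a)            # '4.44'
as_float = float(as_string)   # 4.44
as_int = int(as_float)        # 4 (the decimal portion is discarded, not rounded)

print(as_string, as_float, as_int)  # → 4.44 4.44 4

# int() raises a ValueError when given a string containing a decimal point
try:
    int("4.44")
    parsed = True
except ValueError:
    parsed = False

print(parsed)  # → False
```

This is why the example above converts the string to a float before converting it to an integer.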

In the next entry we will begin discussing some of the various options which are available for differing variable types. Stay tuned!

Monday, July 16, 2018

(Python) Anaconda Introduction

Today begins what will likely result in a long series of articles pertaining to the programming language: Python.

Python is a lightweight scripting language which is useful for many purposes. It may not be too much of an exaggeration to state that Python is now almost primarily used for data science purposes. If this is not indeed the case, it is not inaccurate to state that Python is heavily utilized for the purposes of data mining and data science.

You will need to make two installations prior to beginning the exercises featured within the subsequent entries. First, you will need to install the Python programming language. Next, you will need to install an integrated development environment (IDE). The latter will enable your ability to interact with the Python language from outside of the console. These two elements interact in a manner which is similar to the way in which R-Studio and the R programming language communicate.

The subsequent exercises utilize the Anaconda Python distribution. Anaconda Python is a version of Python 3 which includes within its installation, various IDEs and Python packages. As a result of such, you only need to make a single installation. I have chosen this distribution as it includes binaries which enable the R package "Keras", to function.

"Keras" is a machine learning package which will be discussed in later articles.

I would recommend viewing a tutorial video as it relates to exactly how to download and install Anaconda Python.

However, once the program is installed, you can initiate the platform by double clicking the desktop icon:

Or, additionally, it can be accessed from the start menu:

The result should be the following interface:

Spyder is installed as an aspect of the Anaconda package, and it is this particular program that I utilize as my IDE when creating Python exercises. If you require more information as to how this interface functions, there are many resources which can be found online to assist you.

There are additional options as it relates to distribution and IDEs, and you are more than welcome to explore these. In the next article we will begin to discuss Python data types. Stay tuned, Data Heads.

(R) R-Misc. - Pt. (II)

In this entry we will be discussing various topics which are relatively minute in scale but are nonetheless deserving of mention. All the subjects explored within this post pertain to the R programming language and its related functionality.

Increasing or Decreasing the Number of Significant Figures within the “R” Console Output

There are times in which a greater number of significant figures may be desired from output provided by a function within the R platform. The default number of significant figures produced from function output is 7.

However, this can be altered through the utilization of the following code:

options(digits = x)

In this case, x would denote the number of significant digits desired by the user.

Disabling Scientific Notation within the “R” Console Output

There may also be occasions when a particular calculation provides output which utilizes scientific notation.

Ex:

4.5934e-06

To disable this output option to view a standardized display of the value, utilize the following code:

options(scipen = 999)

Clear the Console Window within R-Studio

If an instance ever arises in which you would like to clear the console output within R-Studio, pressing the following key combination in unison will remove all of the previously generated output:

Ctrl + L

That’s all for now. In the next article, we will begin discussing the Python programming language.

(R) Qualitative Data - Pt. (II)

In this article we will briefly re-visit the "tm" package, and in doing such, we will discuss a specific function which it contains. This particular function, which we will be utilizing within this article, assists with word association as it pertains to qualitative data.

Example (Word Correlation):

As the tools to optimally analyze qualitative data are constantly evolving, having yet to reach a point where a particular standard is preferred, there are numerous methods which can be utilized, either individually or in tandem, to assess qualitative data. One of these many methodologies is the word correlation technique.

Let us assume that the variable "data" contains responses gathered from a qualitative prompt within a survey.

# With the package: "tm" downloaded and enabled #

# Prompt Responses #

data <- c("word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5")

# Data Re-Structuring #

frame <- data.frame(data)

myCorpus <- Corpus(VectorSource(frame$data))

tdm <- TermDocumentMatrix(myCorpus)

Once the data has been structured in a manner in which it can be analyzed by the "tm" package, the following function may be utilized to generate correlation output.

# Example Output Function #

findAssocs(tdm, "word2", 0.1)

# Which Produces Output #

$`word2`

word3 word4 word5

0.61 0.41 0.25

What this essentially illustrates is the strength of the correlation between "word2" and each of the other words found within the response data. Only words which possess a correlation of at least 0.1 with "word2" appear in the output. This threshold is the third argument of the function call, which is repeated below.

findAssocs(tdm, "word2", 0.1)

The word placed within the function call, "word2", as is illustrated in the previous line of code, is the term which the function will identify and subsequently compare to the other words within the data variable.

The Math Behind the Function

It is certainly worthwhile to discuss the math which is utilized to generate the output values.

First, let's view our data frame as a matrix by running the code below:

as.matrix(tdm)

This produces the output:

Docs

Terms 1 2 3 4 5

word1 1 1 1 1 1

word2 0 1 1 1 1

word3 0 0 1 1 1

word4 0 0 0 1 1

word5 0 0 0 0 1

What this output illustrates is that "word2" appears once in each of four responses (documents 2 through 5), for four occurrences in total. Likewise, "word5" occurs only once, within document 5.

Therefore, when the function calculates the correlation between the words "word2" and "word5", the calculation resembles:

# Correlation - "word2" and "word5" #

cor(c(0,1,1,1,1),c(0,0,0,0,1))

Additional combinations resemble:

# Correlation - "word2" and "word3" #

cor(c(0,1,1,1,1),c(0,0,1,1,1))

# Correlation - "word2" and "word4" #

cor(c(0,1,1,1,1),c(0,0,0,1,1))

# Output #

[1] 0.25

[1] 0.6123724

[1] 0.4082483
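For readers who would like to see the arithmetic itself, the following is a minimal Python sketch of the Pearson correlation formula applied to the same rows of the matrix. The function name "pearson" is my own; it is written out by hand for illustration and is not how the "tm" package is implemented internally.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of the deviations from each mean
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

# Rows of the term-document matrix shown earlier
word2 = [0, 1, 1, 1, 1]
word3 = [0, 0, 1, 1, 1]
word4 = [0, 0, 0, 1, 1]
word5 = [0, 0, 0, 0, 1]

for name, row in [("word3", word3), ("word4", word4), ("word5", word5)]:
    print(name, round(pearson(word2, row), 2))
# → word3 0.61
# → word4 0.41
# → word5 0.25
```

The rounded values agree with the findAssocs() output shown earlier in the article.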

This method of demonstrating correlation can be useful in identifying the co-occurrence of words which connote positive or negative sentiment with the words that they ultimately describe. Better methods for performing the overall scope of the task will likely be synthesized within the upcoming decade. In the interim, stay ponderous!

** Much of the example code, with some modifications, was copied from a post made by user: "rtw30606", which was featured on the R mailing list archive. The link to this post can be found here: http://r.789695.n4.nabble.com/findAssocs-td3845751.html **

Histograms w/Standard Error Bars (MS-Excel)

In a prior article, I mentioned that in many cases, it may be best to export aggregated data to Excel in order to provide the highest quality graphical outputs. In this article, we will be utilizing Excel to provide high quality graphics as it pertains to histograms with standard error overlays.

The data that we will be employing for the example illustration was also utilized for the same purpose in the article prior.

Example:

Presented below is the data set:

If you did not calculate the mean, standard deviation, or standard error values within a prior platform, the following cell code will assist with the calculations of each.

Mean:

(For Group A)

=AVERAGE(C2:C11)

Standard Deviation:

(For Group A)

=STDEV(C2:C11)

Standard Error:

(For Group A)

=(STDEV(C2:C11))/(SQRT(COUNT(C2:C11)))
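If you would like to verify the worksheet values outside of Excel, the same three quantities can be sketched in Python. This assumes that cells C2:C11 hold the ten Group A measurements listed in the R version of this exercise; the variable names are illustrative. Both Excel's STDEV and Python's statistics.stdev compute the sample standard deviation.

```python
from math import sqrt
from statistics import mean, stdev

# Group A measurements, assumed to match cells C2:C11
group_a = [11, 17, 38, 20, 21, 40, 18, 48, 50, 37]

group_mean = mean(group_a)                 # mirrors =AVERAGE(C2:C11)
group_sd = stdev(group_a)                  # mirrors =STDEV(C2:C11)
group_se = group_sd / sqrt(len(group_a))   # mirrors =STDEV(...)/SQRT(COUNT(...))

print(group_mean)
print(round(group_sd, 4))
print(round(group_se, 4))
```

The Excel formulas above remain the reference for the worksheet itself.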

The product of these calculations should resemble the following:

Next, from the workbook’s mean values, we will create a basic histogram. You should have the ability to perform this task without further instructions. The result will resemble the graphic below.

To add bars which indicate the standard error of the mean, follow the steps below.

Click on the chart graphic, and then select the “Design” tab of the upper ribbon menu. Next, select the “Add Chart Element” option.

This will generate a drop down menu. From this menu select “Error Bars”, followed by “More Error Bars Options”.

This should cause error bars to appear within the graphic.

However, these error bars require the appropriate formatting.

To achieve this, click on the error bar graphic. This should cause a menu to appear on the right side of the Excel window.

In the aforementioned menu, make sure that the option “Custom” is selected, then select “Specify Value”.

This should cause the following prompt to generate:

Click on the arrow located next to each option header, “Positive Error Value” and “Negative Error Value”. This will enable you to select the series which represents each prompt.

For both options, highlight the entire row of data which pertains to the standard error values.

Once this has been completed, the product is a beautiful graphical representation of the data.

I’ll leave the rest of the beautification process to you.

I hope that you enjoyed this article and found it helpful. Many more are on the way.

(R) Histograms w/Standard Error Bars

A patron of the site recently contacted me and asked that we again review the graphing capacity of the R language, specifically addressing R's ability to create standard error bars as auxiliary aspects of output as it applies to histograms.

As I wrote in a previous article, it is often best to utilize Microsoft Excel for all graphical endeavors. However, if for whatever reason you do not have access to the Excel platform, or you desire to check your data prior to exporting the results to Excel, then the following method could be considered as a provisional solution.

Example:

Measurements have been collected from various groups. The data pertaining to the observations from each group are below:

# Data #

GroupA <- c(11, 17, 38, 20, 21, 40, 18, 48, 50, 37)

GroupB <- c(1, 37, 36, 13, 27, 50, 12, 24, 19, 3)

GroupC <- c(14, 33, 29, 25, 29, 36, 46, 7, 20, 38)

GroupD <- c(6, 34, 31, 26, 49, 34, 11, 36, 13, 28)

# Calculate the mean score related to each observational series #

MeanA <- mean(GroupA)

MeanB <- mean(GroupB)

MeanC <- mean(GroupC)

MeanD <- mean(GroupD)

# Create a vector which will contain all of the means measurements #

MeansAll <- c(MeanA, MeanB, MeanC, MeanD)

# Calculate the standard error of the mean related to each observational series #

# Requires package: "Plotrix", downloaded and enabled #

stderrA <- std.error(GroupA)

stderrB <- std.error(GroupB)

stderrC <- std.error(GroupC)

stderrD <- std.error(GroupD)

# Create a vector which will contain all of the standard error measurements #

stderrAll <- c(stderrA, stderrB, stderrC, stderrD)

# Combine all of the aggregate values within a single data frame #

AllData <- data.frame(MeansAll, stderrAll)

# Create a basic bar plot #

barCenters <- barplot(height = AllData$MeansAll, main = "Average Measurement per Group",

xlab = "Group", ylab = "Mean Measurement",

yaxt="n",ylim=c(0,40))

This produces the output:

We'll take a break right now to discuss what is occurring in the code block above. So as not to produce garish output, the default y-axis is being temporarily suppressed (yaxt = "n"). We are also manually setting the limits of the y-axis (0 to 40).

The next bit of code is a clever hack which creates the standard error overlays.

# Create standard error overlays #

arrows(barCenters, AllData$MeansAll-AllData$stderrAll,

barCenters, AllData$MeansAll+AllData$stderrAll, length=0.05, angle=90, code=3)

This produces the output:

# Label y-axis #

ticks<-c(0, 10, 20, 30, 40)

axis(2,at=ticks,labels=ticks)

# Label x-axis #

group <- c("A", "B", "C", "D")

axis(1, at=barCenters, labels=group)

This produces the final output:

That's all for now. Stay tuned for more mathematical content!

Thursday, July 12, 2018

Removing Duplicate Entries (MS-Excel)

In previous articles, we discussed the steps required to remove duplicate entries from data sets stored within the SPSS and SAS platforms. In this entry, we will discuss how to achieve the same results as it pertains to data sets which are stored within the Excel platform.

Example:

Again, we will be using a familiar data set to illustrate the functionality of the platform.

The functionality of Excel is a bit limited as compared to the prior platforms. I make this statement based on the limitations which exist as it pertains to the selection of variables which can be identified in tandem. The Excel platform can only identify a single variable in which to filter for duplicate entries. Meaning, that if we desired to sort for duplicate entries across multiple variables, we would be unable to do so within the current medium.

However, if we wished to sort for, and subsequently remove, duplicate values within a single column variable, we could do so by following the steps below.

To begin, we will select the variable column that we wish to utilize as the primary key from which duplicates will be identified. In the case of our example, the variable which we will select will be “VARA”. With this decided, we must highlight all of the rows within the aforementioned variable column.

Once this has been achieved, click on the “Home” tab within the menu ribbon, and then click on “Conditional Formatting”.

Clicking on “Conditional Formatting” should cause this menu to appear:

From the menu, click on the option “Highlight Cells Rules”, followed by “Duplicate Values”.

You should notice the following changes have been made within the work sheet:

“Light Red Fill with Dark Red Text” is the default coloration used by the platform to identify duplicate values. However, you are presented with the option to select a differing methodology of identification within the generated interface.

While Excel lacks the ability to identify unique entries across multiple columns, it does possess the capacity to completely remove entries specified in this manner.

For example, if we wished to create a new workbook which only contained completely unique entries across all columns, we could achieve such by completing the following steps.

First, we would have to select all of the row observations from which we wish to sort.

Once this has been achieved, click on the “Data” tab within the menu ribbon, and then click on “Remove Duplicates”.

This should cause the following menu to appear:

As we want to sort across all variable entries, we will leave all of the variable boxes selected.

Next, click “OK”.

This will generate the following prompt:

And subsequently, the data worksheet should resemble the following:

The transformation indicates that the process has correctly performed its function in removing non-unique entries.
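The difference between identifying duplicates on a single column and removing rows which are duplicated across all columns can be sketched in a few lines of Python. The rows below are hypothetical and do not reproduce the worksheet shown above; the variable names are likewise illustrative.

```python
# Hypothetical rows; each tuple is (VARA, VARB)
rows = [
    ("Jack", 10),
    ("Jill", 20),
    ("Jack", 10),   # exact duplicate of the first row
    ("Jack", 30),   # same VARA, different VARB
]

# Unique across all columns: keep the first occurrence of each full row
seen_rows = set()
unique_rows = []
for row in rows:
    if row not in seen_rows:
        seen_rows.add(row)
        unique_rows.append(row)

# Unique on a single column (VARA only): keep the first row for each key
seen_keys = set()
unique_by_vara = []
for row in rows:
    if row[0] not in seen_keys:
        seen_keys.add(row[0])
        unique_by_vara.append(row)

print(unique_rows)     # → [('Jack', 10), ('Jill', 20), ('Jack', 30)]
print(unique_by_vara)  # → [('Jack', 10), ('Jill', 20)]
```

Note how the row ("Jack", 30) survives the all-column comparison but not the single-column one.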

Tuesday, July 10, 2018

(R) Qualitative Data

Given the omnipresent increase in velocity which is inherent within the many aspects of modern existence, the potential for large scale dilemmas to occur, seemingly from out of nowhere, is a challenge that we must all collectively face.

Two significant issues inhibit our abilities to appropriately address the much larger catalyst, those being: an overall lack of dynamic data, and a system of generalizations which are no longer applicable in the modern environment.

Data collection, analysis, and the subsequent presentation of such, requires large amounts of time and effort. As a result of such, while deciding which way to steer our allegorical ship, we may find ourselves far removed from our original destination.

Additionally, as it pertains to generalizations, much of the world is now experiencing increased materialistic standards of existence. This inevitably leads to a personal increase in hyper-individuality, and consequently, an increase in sensitivity to outside stimuli. As a result of such, it is difficult to draw conclusions through the utilization of generalities which were previously applicable in the form of quantitative rating scales.

At this time, the collection of qualitative information and the application of traditional survey analytical methodologies are the best methods for preparing dynamic and specialized solutions to rapidly emergent criticism.

The challenge, of course, is how to sculpt this un-structured data into something which can be quantifiably assessed and presented.

In today's article we will discuss the best way to apply traditional analysis to un-structured qualitative data. I use the term "traditional" as other methods do exist which can be utilized to collect and analyze data with greater efficiency, at least, as it pertains to dynamism.*

Example (Word Clouds and Word Frequency):

We’ll suppose that your organization issued a survey which contained qualitative prompts. The data collected is exhibited below:

This data, as it exists within its present incarnation, tells us nothing, and cannot, within its current format, be presented to outside parties.

The following steps rectify this scenario as it pertains to quantifying this data, and generating an orderly report which is easily readable by non-technically orientated individuals.

# Packages required to generate example output: #

# tm – An R centric text mining package. #

# wordcloud – Generates word cloud illustrations within the R platform. #

# RColorBrewer – Enables additional color schemes for word cloud output. #

# syuzhet – Extracts sentiment and sentiment-derived plot arcs from text. #

# Install and enable all of the packages listed above #

First, we must create the data frame needed to demonstrate the example code:

ID <- c(1,2,3,4,5)

Comment <- c("My life is terrible.", "Give me free stuff!", "Obviously, anyone who is cognitive, can observe the inconsistences which are inherit within your demeanor.", "Sucks. Sucks. Sucks. Sucks!", "K.")

WordCloudEx <- data.frame(ID, Comment)

# The code below ensures that the comment data is stored within a vector format. This step is required, as the functions inherent within the aforementioned package libraries, will not function properly unless the data is housed within the proper format. #

WordCloudEx$Comment <- as.vector(WordCloudEx$Comment)

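There will be instances in which you may want to correct the spelling of the collected qualitative data, or remove un-needed expletives. An aspect to keep in mind in scenarios such as these is that the gsub function, which is part of base R rather than the “tm” package, is case sensitive. Note also that gsub treats its pattern as a regular expression, so the period in "Sucks." matches any single character, including the "!" in "Sucks!". This is demonstrated below.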

# Replace data instances #

# Information is listed twice due to case sensitivity #

WordCloudEx$Comment <- gsub("Sucks.", "stinks", WordCloudEx$Comment)

WordCloudEx$Comment <- gsub("sucks.", "stinks", WordCloudEx$Comment)
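The same two behaviors, case sensitivity and the period acting as a wildcard, can be observed with Python's re module, which I am using here only as a side-by-side sketch. The sample string is a hypothetical mixed-case variation of the survey comment, and the names "loose" and "strict" are illustrative.

```python
import re

comment = "Sucks. Sucks. sucks. Sucks!"

# An unescaped "." matches ANY character, so "Sucks!" is replaced as well,
# while the lower case "sucks." is missed by the case-sensitive pattern
loose = re.sub("Sucks.", "stinks", comment)

# Escaping the period restricts the match to a literal ".";
# re.IGNORECASE covers both "Sucks" and "sucks" in a single call
strict = re.sub(r"sucks\.", "stinks", comment, flags=re.IGNORECASE)

print(loose)   # → stinks stinks sucks. stinks
print(strict)  # → stinks stinks stinks Sucks!
```

Which behavior is preferable depends on whether trailing punctuation should be consumed along with the word.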

We must now transform the data into a format which can be read and manipulated by the “tm” package.

# This function morphs the data vector into a format which can be manipulated by the "tm" package #

cloudvar <- Corpus(VectorSource(WordCloudEx$Comment))

If our first goal is to create a word cloud, we must massage certain aspects of the data in order to remove output redundancies. This can be achieved through the utilization of the code below.

# Remove data redundancies #

# Remove "stopwords" such as: "the", "he", or "she" #

cloudvar <- tm_map(cloudvar, removeWords, stopwords('english'))

# Convert the text to lower case #

cloudvar<- tm_map(cloudvar, content_transformer(tolower))

# Remove numbers #

cloudvar <- tm_map(cloudvar, removeNumbers)

# Remove punctuations #

cloudvar<- tm_map(cloudvar, removePunctuation)

# Eliminate extra white spaces #

cloudvar <- tm_map(cloudvar, stripWhitespace)

This next function was copied from the website listed below.

The function itself creates a variable which contains the frequency values of words contained within the qualitative data variable.

The “head” function, which is native to the R programming language, in this instance, allows for the display of the topmost utilized words, and their relative frequencies.

# Function copied from the website below #

# http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know #

dtm <- TermDocumentMatrix(cloudvar)

m <- as.matrix(dtm)

v <- sort(rowSums(m),decreasing=TRUE)

d <- data.frame(word = names(v),freq=v)

# Displays the frequency of the top 10 most utilized words #

head(d, 10)

If you would instead prefer to view the 5 most utilized words, the code would resemble the following:

head(d, 5)

This second value contained within the function can be modified to any value which you see fit. For example, the following code displays the frequency of the top 12 most utilized words.

head(d, 12)

The output for the code:

head(d, 10)

Would resemble the following:

               word freq
stinks       stinks    4
life           life    1
terrible   terrible    1
free           free    1
give           give    1
stuff         stuff    1
anyone       anyone    1
can             can    1
cognitive cognitive    1
demeanor   demeanor    1
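To show what the frequency table is actually counting, here is a rough Python sketch of the same idea using collections.Counter. It is not a port of the "tm" pipeline: the comment list is a shortened copy of the survey data (after the "stinks" substitution), and the stopword set is a tiny hypothetical stand-in for the much longer list which "tm" provides.

```python
import re
from collections import Counter

# Shortened copy of the survey comments, after the "stinks" substitution
comments = [
    "My life is terrible.",
    "Give me free stuff!",
    "stinks stinks stinks stinks",
    "K.",
]

# A tiny, hypothetical stopword list used only for this illustration
stopwords = {"my", "is", "me", "the", "he", "she"}

words = []
for comment in comments:
    # Convert to lower case, then drop punctuation and digits
    cleaned = re.sub(r"[^a-z\s]", "", comment.lower())
    words.extend(w for w in cleaned.split() if w not in stopwords)

freq = Counter(words)

print(freq.most_common(1))  # → [('stinks', 4)]
print(freq["life"])         # → 1
```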

We are now about ready to create our word cloud.

The code which will enable such is as follows:

set.seed(3455)
wordcloud(words = d$word, freq = d$freq, min.freq = 1, scale = c(2, 0.5),
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))

Which produces the following graphic:

Many of the options in the code which are required to create the word cloud, can be left at their default values. However, there are a few of the aspects of the code which you may want to consider for modification.

set.seed(<number>) – The wordcloud() function relies on random number generation as it applies to the placement of the words within the graphical output. Changing this value will alter the output illustration.

min.freq – This option indicates the number of times a word must appear prior to it being considered for inclusion within the output.

scale – This option sets the size of the output graphic. If aspects of the graphics are cut off, try inputting different values for each scale value. (ex: (4, 0.2) etc.)

max.words – This option indicates the maximum number of words to be included within the word cloud.

colors – Since we are utilizing the RColorBrewer package, we have additional options which can be utilized to produce varying colorful outputs. Other variations which can be implemented are as follows:

colors=brewer.pal(8, "Accent"))
colors=brewer.pal(8, "Dark2"))
colors=brewer.pal(12, "Paired"))
colors=brewer.pal(9, "Pastel1"))
colors=brewer.pal(8, "Pastel2"))
colors=brewer.pal(9, "Set1"))
colors=brewer.pal(8, "Set2"))
colors=brewer.pal(12, "Set3"))

Example (Response Sentiment Analysis):

While word clouds illustrate the frequency of commonly utilized words, it is also necessary to gauge the overall sentiment of the surveyed sample as it pertains to the given prompt.

This is achieved through the application of the “Syuzhet” dictionary to words found within each prompt contained within each variable column. Each word is assigned a score as it pertains to connotation. Neutral words receive a score of “0”. Words which are positive receive a score of “1”. Words which are negative receive a score of “-1”.

Typically, the Syuzhet function evaluates data in a way which applies a score to each sentence. Therefore, if the variable which requires analysis contains numerous sentences, then a few additional steps must be taken prior to proceeding.

# Remove sentence completion characters #

WordCloudForm <- gsub("[.]","", WordCloudEx$Comment)
WordCloudForm <- gsub("[!]","", WordCloudForm)
WordCloudForm <- gsub("[?]","", WordCloudForm)

# Transform the product of prior procedure into a single sentence #

WordCloudForm <- paste(WordCloudForm, ".")

# Perform Analysis #

s_v <- get_sentences(WordCloudForm)

syuzhet_vector <- get_sentiment(s_v, method="syuzhet")

############################################################

nrc_data <- get_nrc_sentiment(s_v)

valence1 <- (nrc_data[, 9]*-1) + nrc_data[, 10]

############################################################

Two variables are produced as a product of the analysis.

nrc_data – This variable contains further analysis through the attribution of each identified word to a fundamental emotion. There are five observations within the set as there are five responses contained within the variable column.

# Calling the “nrc_data” variable produces the following output #

nrc_data

# Output #

Generally this data is not considered useful.

valence1 – This variable contains the sum of all positive and negative word values contained within the qualitative variable column.

# Calling the “valence1” variable produces the following output #

valence1

# Output #

[1] -1 0 1 0 0

There are numerous applications for this information. You could assess the mean value of the responses with the mean() function, or the standard deviation of the responses with the sd() function. Or, you could produce a general summary of the range with the summary() function.

mean(valence1)

sd(valence1)

summary(valence1)

Outputs:

> mean(valence1)
[1] 0
> sd(valence1)
[1] 0.7071068
> summary(valence1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     -1       0       0       0       0       1
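The figures above are simple enough to confirm by hand, or with a few lines of Python. The sketch below uses the standard statistics module and the five valence scores shown earlier; it is offered only as a cross-check of the R output.

```python
from statistics import mean, median, stdev

# The five valence scores produced above
valence1 = [-1, 0, 1, 0, 0]

print(mean(valence1))                # → 0
print(round(stdev(valence1), 7))     # → 0.7071068 (sample standard deviation)
print(median(valence1))              # → 0
print(min(valence1), max(valence1))  # → -1 1
```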

If you would prefer, and this is something that I would recommend doing, you could add the “valence1” column to the data frame to track the total associated with each individual response.

The code required to achieve this is:

QualDataOutput <- data.frame(WordCloudEx, valence1)

Which produces a data frame which resembles:

As you can observe from the newly created data frame, the “Syuzhet” dictionary is not a perfect method for sentiment analysis. However, at this present time, it is one of the better alternatives.

For more information on the “Syuzhet” dictionary, its entries, and its assignments, please check out the links below:

http://saifmohammad.com/WebPages/lexicons.html

http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

That’s all for now, Data Heads! Stay tuned for more exciting methods of analysis!

* - What I am referring to, is "dynamism" as a collective term as it is applicable to the generation of near instantaneous results. This may be covered in a future article. However, in practice, this would resemble data mining from various platforms, and immediate analysis from pre-established methodology.

Friday, July 6, 2018

Removing Duplicate Entries (SPSS)

In a previous article, we discussed how to remove duplicate entries which were present within data sets stored within the SAS platform. In this entry, we will discuss how to remove duplicate entries which exist within data sets contained within the SPSS platform.

Example:

We will be utilizing a familiar data set to demonstrate the capacity of the SPSS program as it pertains to performing this function.

To check for duplicate entries, we must first select “Data” from the topmost menu. We will then select the option “Identify Duplicate Cases”.

This should cause the following menu to populate:

In this menu, we are presented with various options which pertain to variable qualifications and sorting. In the case of this example, we will identify variables “VARA” and “VARB”, as variables in which to identify duplicate entries.

To achieve this, we will utilize the topmost center arrow to designate “VARA” and “VARB” as variables in which to “Define matching cases by”.

After following the aforementioned steps, the menu should resemble the graphic above. Click “OK” to proceed with the exercise.

The following tables are generated to the output screen:

The table entitled, “Indicator of each last matching case as Primary”, illustrates the number of duplicate cases which were identified within the sample data set.

The data set itself has been modified through the addition of a column which contains information pertaining to each entry.

As you can observe from the data above, variables which contained duplicate entries within columns “VARA” and “VARB” have been identified. Through the utilization of the additional column, you can now endeavor on deciding which variables ought to be deleted within the set.

* WARNING *

The duplicate removal function utilized by SPSS is sensitive to entry casing. Meaning, that if variable “VARA” contained the entries: “JACK” and “Jack”, neither entry would be marked as a potential duplicate.

Therefore, to avoid errors related to such, string variable entries should be modified prior to performing the duplicate removal function.

Case modification can be achieved through the utilization of the following syntax:

/* Modify VARA and VARD to contain all upper case entries */

DO REPEAT var = VARA VARD.
COMPUTE var = UPCASE(var).

END REPEAT.
EXECUTE.

/* OR */

/* Modify VARA and VARD to contain all lower case entries */

DO REPEAT var = VARA VARD.
COMPUTE var = LOWER(var).
END REPEAT.
EXECUTE.
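The effect of case on duplicate detection is easy to demonstrate outside of SPSS. The Python sketch below uses hypothetical “VARA” entries which differ only by letter case, and counts the distinct values before and after converting them to upper case. The variable names are illustrative.

```python
# Hypothetical VARA entries which differ only by letter case
vara = ["JACK", "Jack", "Jill", "jack"]

# A case-sensitive comparison sees four distinct values
distinct_case_sensitive = len(set(vara))

# Converting to upper case first (the role UPCASE plays in the syntax above)
# collapses the three spellings of "Jack" into a single value
distinct_normalised = len({value.upper() for value in vara})

print(distinct_case_sensitive)  # → 4
print(distinct_normalised)      # → 2
```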

That’s all for now, Data Heads. Stay tuned for more exciting articles!