Thursday, August 23, 2018

(Python) Loops for Data Projects

This article demonstrates and reviews the application of loops within the Python platform. As the title indicates, the examples included in this entry apply only to a limited number of scenarios. Python possesses a rich set of options within the basic platform, so I would heavily recommend researching the topic of loops further if any of this information seems particularly difficult.

The While Loop

The “while loop” is a simple enough concept: while a certain condition remains true, a task is repeated until the condition becomes false.

For example:

# Create counter variable #

i = 0

# Create while loop #


while i != 5:

    print('Loop:', i)

    i = i+1


Which produces the output:

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4


The most difficult aspect of Python “while loops” is adjusting to Python's coding syntax. For more information on the topic of loops, I would suggest performing additional research.
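To illustrate how the condition drives the loop, the same count can be rewritten with an explicit break as the exit condition (a minimal sketch):

```python
# Rewriting the counting loop with an explicit break as the exit condition
i = 0
iterations = []

while True:
    iterations.append(i)
    i = i + 1
    if i == 5:
        # Without this break (or the i = i + 1 update), the loop would never terminate
        break

print(iterations)  # → [0, 1, 2, 3, 4]
```

Forgetting the counter update is the classic way to produce an infinite loop; structuring the exit as an explicit break makes that failure mode easier to spot.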

The For Loop

The “for loop” is similar to the “while loop” in that it evaluates a condition prior to execution. However, due to the way its syntax is structured, the “for loop” allows for greater customization, which is particularly useful in data science projects.

Let’s explore some examples which demonstrate the applicability of the “for loop”.

Using the For Loop to Cycle through a List

# Create List Variable #

# “numbers” is used instead of “list”, which would shadow Python’s built-in list type #

numbers = [0, 1, 2, 3, 4, 5]

# Code the For Loop #

for x in numbers:

    print(x)


Console Output:

0
1
2
3
4
5


Using the For Loop to Cycle through an Index

# Create List Variable #

numbers = [0, 1, 2, 3, 4, 5]

# Code the For Loop #

for index, value in enumerate(numbers):

    print('Index: ', index)

    print('Value: ', value)


Console Output:

Index: 0
Value: 0
Index: 1
Value: 1
Index: 2
Value: 2
Index: 3
Value: 3
Index: 4
Value: 4
Index: 5
Value: 5


Using the For Loop to Cycle through a Two Dimensional List

# Create List Variable #

pairs = [["Key0", 0],
         ["Key1", 1],
         ["Key2", 2],
         ["Key3", 3],
         ["Key4", 4]]

# Code the For Loop #

for x in pairs:

    print(x[0], ":", x[1])


Console Output:

Key0 : 0
Key1 : 1
Key2 : 2
Key3 : 3
Key4 : 4


Using the For Loop to Cycle through a Dictionary

# Create Dictionary #

dictionary = {"Def0":"0", "Def1":"1", "Def2":"2", "Def3":"3", "Def4":"4"}

# Cycle through Dictionary #

for key, entry in dictionary.items():

    print("Value - " + key + " : " + entry)


Console Output:

Value - Def0 : 0
Value - Def1 : 1
Value - Def2 : 2
Value - Def3 : 3
Value - Def4 : 4


Using the For Loop to Cycle through a Numpy Array

# Enable Numpy #

import numpy

# Create List #

numbers = [0, 1, 2, 3, 4, 5]

# Transform list into numpy array #

numpylist = numpy.array(numbers)

# Cycle through the array #

for x in numpylist:

    print(x)


Console Output:

0
1
2
3
4
5
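Numpy arrays need not be one-dimensional. For a two-dimensional array, a plain for loop yields one row at a time, while numpy.nditer() visits every individual element (a brief sketch; the array values are invented):

```python
import numpy

# A two-dimensional numpy array #
grid = numpy.array([[0, 1, 2], [3, 4, 5]])

# Looping directly yields one row at a time #
for row in grid:
    print(row)

# numpy.nditer() visits each individual element instead #
for x in numpy.nditer(grid):
    print(x)
```

This distinction matters when a calculation must touch every cell of a matrix rather than operate row by row.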


Each example independently possesses little significance. However, as we progress through the study of Python and continue to demonstrate example functionality, the overall usefulness of these code samples will become increasingly evident.

(Python) Pip and SQL

As was the case with the R data platform, numerous auxiliary packages also exist within Python which enable additional functionality. In today’s article, we will be discussing the 'pip' package, which allows for the installation and maintenance of auxiliary Python packages.

We will also be briefly discussing, within the contents of this article, the 'pandasql' package, which enables the emulation of SQL related functionality within the Python platform.

Installing a Package with Pip

'Pip' functionality, as it pertains to the appropriate code to utilize, depends upon the Python IDE which is currently being operated.

As it is applicable to the Jupyter Notebook IDE, installing a package through 'pip' utilization would resemble the following:

# Install a package using 'pip' #

import pip

pip.main(['install', 'nameofpackage'])

In the case of our example, in which we wish to install 'pandasql', the code to achieve such would be:

# Install 'pandasql' through 'pip' #

import pip

pip.main(['install', 'pandasql'])

If the code successfully runs, you should receive an output which resembles:

Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3


Update a Package with Pip

There will also be instances in which you wish to update a package which has already been previously installed. 'Pip' can accomplish this through the utilization of the following code:

# Update a package #

import pip

pip.main(['install', '--upgrade', 'pip'])


In the above case, 'pip' itself is being upgraded. The code which is being utilized can be modified so long as it resembles the template below:

# Update a package #

import pip

pip.main(['install', '--upgrade', 'NameofPackage'])


If the code successfully runs, you should receive an output which resembles:

Installing collected packages: pip
Found existing installation: pip 9.0.1
Uninstalling pip-9.0.1:
Successfully uninstalled pip-9.0.1
Successfully installed pip-18.0
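A word of caution: pip.main() is an internal interface, and newer releases of pip no longer support calling it directly. A more durable sketch invokes pip as a module of the running interpreter through subprocess (the helper name below is my own, not part of pip):

```python
import subprocess
import sys

def pip_install(package):
    # Build the command: run the pip belonging to the current interpreter,
    # i.e. python -m pip install <package>
    command = [sys.executable, "-m", "pip", "install", package]
    return command

# To actually perform the installation, pass the command to subprocess:
# subprocess.check_call(pip_install("pandasql"))
print(pip_install("pandasql"))
```

Using sys.executable guarantees that the package lands in the same environment the notebook is running in, which the bare pip.main() call does not.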


Emulating SQL Functionality within Python with ‘PandaSQL’

As the package name implies, data must first be formatted as a pandas data frame. Once this has been accomplished, 'PandaSQL' enables the manipulation of data within the Python platform as if it were an SQL server.

I will not provide an example for this particular package but I will provide the coding template for utilizing its functionality.

In the case of 'PandaSQL', the following code line must always be included prior to writing pseudo-SQL statements.

pysqldf = lambda q: sqldf(q, globals())

Additionally, the query itself is conventionally stored within a string variable designated 'q'.

Finally, 'PandaSQL' code can be written in exactly the same format as regular SQL code. However, the key differentiating factor is that the query must be a triple-quoted string, surrounded by three quotation marks (""") on each side.

Therefore if we were to write some sample code which utilizes the 'PandaSQL' package, the code would resemble:

from pandasql import *

import pandas

pysqldf = lambda q: sqldf(q, globals())


q = """

SELECT

VARA,

VARB

FROM

pandadataframe3

ORDER BY VARA;

"""



df = pysqldf(q)


The output of which would be stored in the Python variable: "df".
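Since no worked example is given above, here is a sketch of what such a query accomplishes, expressed in plain pandas for comparison (the frame "pandadataframe3" and the columns "VARA"/"VARB" are invented for illustration):

```python
import pandas

# Hypothetical frame standing in for pandadataframe3 #
pandadataframe3 = pandas.DataFrame({"VARA": [3, 1, 2],
                                    "VARB": ["c", "a", "b"]})

# Equivalent of: SELECT VARA, VARB FROM pandadataframe3 ORDER BY VARA; #
df = pandadataframe3[["VARA", "VARB"]].sort_values("VARA").reset_index(drop=True)

print(df)
```

Seeing the pandas equivalent side by side clarifies that 'PandaSQL' is a convenience layer, not a separate data store.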

That’s it for now, Data Heads! Stay tuned for more informative articles.

Tuesday, August 21, 2018

(R) Getting Down with "dplyr"

I was originally intending to include a sub-entry within the previous article to discuss "dplyr". "dplyr" is an auxiliary package which simplifies many functions which are innate within the basic R platform. Much of the dplyr package's functional style is structured in a manner similar to SQL.

(All of the examples below require the package: "dplyr" to be downloaded and enabled)

The code to generate the example data set which will be utilized within this exercise is as follows:

Person <- c("Seth", "Rob", "Roy", "Jane", "Suzie", "Lisa", "Alexa")

Gender <- c(1, 1, 1, 0, 0, 0, 0)

HairColor <- c(0, 1, 2, 3, 3, 3, 0)

EyeColor <- c(0, 1, 2, 2, 2, 0, 0)

FavGenre <- c(0, 1, 2, 2, 2, 3, 4)

DataFrameA <- data.frame(Person, Gender, HairColor, EyeColor, FavGenre)


Console Output:

   Person Gender HairColor EyeColor FavGenre
1    Seth      1         0        0        0
2     Rob      1         1        1        1
3     Roy      1         2        2        2
4    Jane      0         3        2        2
5   Suzie      0         3        2        2
6    Lisa      0         3        0        3
7   Alexa      0         0        0        4


Reference Data Columns by Name

Let’s say that you are working with the example data frame and you wish either to create a new data frame, or simply to display the observations which exist within select variable fields.

The following code will enable these actions:

# Display observation column “Person” #

select(DataFrameA, Person)

# Display observation columns “Person” and “FavGenre” #

select(DataFrameA, Person, FavGenre)

# Display the observation columns for variables between and including “HairColor” and “EyeColor” #


select(DataFrameA, HairColor:EyeColor)


Filtering Observational Data by Variable Values 


In this particular instance, we will assume that you are working with the same example data set. However, in this case, you desire to view only the observational data which satisfies pre-conceived variable conditions.

# Display only observations where the variable “Gender” is equal to 1 #

filter(DataFrameA, Gender == 1)

# Display only observations where the variable “Gender” is equal to 1, AND the variable “HairColor” is equal to 2 #

filter(DataFrameA, Gender == 1, HairColor == 2)

# Display only observations where the variable “Gender” IS NOT equal to 1, OR the variable “HairColor” is equal to 2 #

filter(DataFrameA, Gender != 1 | HairColor == 2)


Sort Data easily with the “arrange” Function

If you have previously worked extensively within the R platform, you’ll understand how difficult it can be to properly sort data. Thankfully, dplyr simplifies this task with the following function.

# Sort the data frame “DataFrameA”, by the variable “Person” (ascending) #

arrange(DataFrameA, Person)

# Sort the data frame “DataFrameA”, by the variable “Person” (descending) #

arrange(DataFrameA, desc(Person))


Simply Re-name Data with the Rename() Function

In previous articles, we discussed the difficulty that surrounds re-naming R column variables. As was the case with “arrange()”, dplyr also provides a simpler alternative with the function “rename()”.

# Re-name the variable “HairColor”, “WigColor”. Results are stored within the data frame: “newdataframe” #

newdataframe <- rename(DataFrameA, WigColor = HairColor)


Create a New Data Variable from an Existing Variable

Another task which dplyr simplifies is the ability to create new variables from existing variables within the same data frame. This is achieved through the utilization of the "mutate()" function.

# Create the new variable: “NewVar” by multiplying the variable “HairColor” by 2 #

# Results are stored within the data frame: “newdataframe” #

newdataframe <- mutate(DataFrameA, NewVar = HairColor * 2)


Create a New Data Frame with Specific Variables

In this example, we will be demonstrating the dplyr function: “select”, which allows for the selection of various existing data frame variables, typically for the purpose of creating a new data frame.

# Create a new data frame: “newdataframe”, which includes the variables: “Person” and “EyeColor” from DataFrameA #

newdataframe <- select(DataFrameA, Person, EyeColor)


Count Distinct Entries

In a similar manner in which SQL allows a user to count distinct variable entries, dplyr also contains a function which allows the user to achieve a similar result: “n_distinct()”.

# Count the distinct number of variable entries for the variable “Person” within DataFrameA #

n_distinct(DataFrameA$Person, na.rm=FALSE)

# Count the distinct number of variable entries for the variable “EyeColor” within DataFrameA #

n_distinct(DataFrameA$EyeColor, na.rm=FALSE)

# In both cases, na.rm=FALSE designates that missing values are included within the overall count; specifying na.rm=TRUE would exclude them #

Performing Data Joins

Also included within the dplyr package, are functions which enable the user to perform data joins in a manner which is similar to SQL. Though examples of this functionality are not included within this article, more information pertaining to utilization of these commands can be found by running:

??join

within the R input window.

(R) Cleaning R Data with "tidyr"

Often, when encountering data which has been gathered for analysis, you will be completely dependent on a third party's adherence to integrity standards. What commonly results is a disastrously organized document filled with discursive values. As is often the case, you, as the analyst, will spend the majority of your time organizing data, time which could otherwise be spent producing analytical results. In this article, we will review some of the methods which can be utilized to accomplish that task, and also introduce a few new methods which are useful for such. As the title indicates, this entry will be devoted entirely to the R platform.

Difficult Characters within (.csv)s

Columns which contain string values often create problems for R as it relates to importing .csv files. Overlooking the catalyst for the dilemma, the novice data analyst will often immediately begin by tinkering with the "read.table()" options.

The solution to this dilemma is often far more simplistic than the corrective efforts that the novice attempts to apply.

Commonly, the issue arises from the usage of the following characters being present within string observational data:

"

'

Removing these characters from a (.csv) file prior to attempting import is the most expedient path toward resolution.

Removing Blank Observation Entries within String Variable Columns

While working with qualitative data, there often arises the need to remove blank observational entries, as blank entries commonly cause issues when the data is subsequently quantified or transformed.

To delete all rows which contain blank observational entries within a given column, the following code can be utilized:

# Remove all rows which contain blank observational entries within column "B" #

DataFrameA <- DataFrameA[!(DataFrameA$B == ""),]

# Create Example Data Frame #

A <- c(1, 2, 3, 4, 5, 6, 7, 8)

B <- c("apple", "orange", "grape", "tangerine","" ,"" ,"" , "cherry")

DataFrameA <- data.frame(A, B)

# Print to Console #

print(DataFrameA)


Console Output:

> print(DataFrameA)
  A         B
1 1     apple
2 2    orange
3 3     grape
4 4 tangerine
5 5
6 6
7 7
8 8    cherry


# Utilize code to remove blank entries #

DataFrameA <- DataFrameA[!(DataFrameA$B == ""),]

# Print to Console #

print(DataFrameA)


Console Output:

> print(DataFrameA)
  A         B
1 1     apple
2 2    orange
3 3     grape
4 4 tangerine
8 8    cherry


Counting the Total Number of Rows within a Data Frame

We will now assume that you have successfully imported your data into the R platform. However, prior to performing analysis, you may wish to survey the data frame in order to ensure that nothing is amiss. A good way to begin accomplishing this task is to count the total number of rows within the newly imported data frame. The following function will assist you with this task.

# Re-Generate Data Frame #

A <- c(1, 2, 3, 4, 5, 6, 7, 8)

B <- c("apple", "orange", "grape", "tangerine","" ,"" ,"" , "cherry")

DataFrameA <- data.frame(A, B)

# Count the rows within the data set #

NROW(DataFrameA)


Console Output:

[1] 8

Tidying up Data with "tidyr"

One of the most useful packages within the R platform is “tidyr”. This package adds additional functionality that was either absent from the basic R library, or too verbose within its initial incarnation. The following examples demonstrate some of the most useful functions included within the “tidyr” package.

Utilizing “separate()” to Separate Data

The "separate()" command provides the ability to create new data variables within an R data frame from previously conjoined data variables.

# Create the sample data frame #

ID <- c(11111, 22222, 55555)

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

Address1 <- c("123 SAS Lane, Cleveland, OH, 18111-44432",

"456 Cherry Valley, Miami, FL, 69785-33325",

"789 Python Way, Los Angeles, CA, 74715-99925")

BadData <- data.frame(ID, Name, Address1)


When viewed within the R-Studio console window, the data frame resembles the following graphic:


To separate the “Address1” variable into new variables based on comma placement, we will utilize the following code:

# With the tidyr package downloaded and enabled #

# Separate “Address1” Variable into new variables: “AddressLine”, “City”, “State”, “Zip” #

GoodData <- separate(BadData, Address1, c("AddressLine", "City", "State", "Zip"),

sep = ",")


After initiating the function above, the data frame will resemble the following:


We will now take one final step, and separate the “Name” variable into additional new variables based on comma placement.

# With the tidyr package downloaded and enabled #

# Separate “Name” Variable into new variables: “LastName”, “FirstName” #

GoodData0 <- separate(GoodData, Name, c("LastName", "FirstName"), sep = ",")


After initiating the function above, the data frame will resemble the following: 


Re-uniting Data with “unite()”

The inverse function of “separate()” within the “tidyr” package is “unite()”. In the case of our example, we will be re-merging two separate data frame variables into a single variable, based on comma placement.

# Re-create the sample data frame #

ID <- c(11111, 22222, 55555)

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

Address1 <- c("123 SAS Lane, Cleveland, OH, 18111-44432",

"456 Cherry Valley, Miami, FL, 69785-33325",

"789 Python Way, Los Angeles, CA, 74715-99925")

BadData <- data.frame(ID, Name, Address1)

# With the tidyr() package downloaded and enabled #

# Separate “Address1” Variable into new variables: “AddressLine”, “City”, “State”, “Zip” #

GoodData <- separate(BadData, Address1, c("AddressLine", "City", "State", "Zip"), sep = ",")

# Separate “Name” Variable into new variables: “LastName”, “FirstName” #

GoodData0 <- separate(GoodData, Name, c("LastName", "FirstName"), sep = ",")


# Unite the “City” and “State” variables into a single variable labeled “City_State” #

unitedata <- unite(GoodData0, "City_State", c(City, State), sep = ", ")

After initiating the function, the data frame will resemble the following:


Transposing Data with “gather()” and “spread()”

In a series of prior articles featured on this website, I created a multi-part macro for the purpose of transposing data within the SAS platform. Thankfully, within R, there is a much simpler solution for solving the complexities of data transposition.

The test data frame which we will be utilizing can be generated with the following code:

# Create the sample data frame #

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

A <- c("Avalue1", "Avalue2", "Avalue3")

B <- c("Bvalue1", "Bvalue2", "Bvalue3")

C <- c("Cvalue1", "Cvalue2", "Cvalue3")

D <- c("Dvalue1", "Dvalue2", "Dvalue3")

TestData <- data.frame(Name, A, B, C, D)

The data frame resembles the following graphic:


However, let’s say, for the sake of our example, that you instead wanted a data frame that resembled the graphic below:


The code to achieve such is as follows:

# With the tidyr package downloaded and enabled #

GatherData <- gather(TestData, NewVar, NewVar2, A, B, C, D)

The template options for this function are:

gather(1, 2, 3, 4…etc)

1 = The data frame referenced.

2 = The name of the new “key” variable, which will hold the former column names.

3 = The name of the new “value” variable, which will hold the corresponding values.

4…etc = The columns from the initial data frame which are to be gathered.

To have the “D” variable remain independent, the code to utilize is:

# With the tidyr package downloaded and enabled #

GatherData <- gather(TestData, NewVar, NewVar2, A, B, C)

Which presents the graphic:


Another transposition function which is included within the “tidyr” package is “spread()”. Spread exists as the inverse of the “gather()” function.

Let’s demonstrate the function’s capabilities:

# Create the sample data frame #

Name <- c("Doe, John", "Smith, Steve", "Sas, Randy")

A <- c("Avalue1", "Avalue2", "Avalue3")

B <- c("Bvalue1", "Bvalue2", "Bvalue3")

C <- c("Cvalue1", "Cvalue2", "Cvalue3")

D <- c("Dvalue1", "Dvalue2", "Dvalue3")

TestData <- data.frame(Name, A, B, C, D)

# With the tidyr package downloaded and enabled #

GatherData <- gather(TestData, NewVar, NewVar2, A, B, C, D)

Graphically this data resembles the following:


Now we will utilize the “spread()” function to re-adjust the data.

# With the tidyr package downloaded and enabled #

SpreadData <- spread(GatherData, NewVar, NewVar2)

The template options for this function are:

spread(1, 2, 3)

1 = The data frame referenced.

2 = The “key” variable, whose entries become the new column names.

3 = The “value” variable, whose entries populate the new columns.

The output is as follows:


Thursday, August 16, 2018

(Python) Jupyter Notebook

In a previous entry, we discussed the Python Anaconda distribution and the Spyder IDE. In this article, we will be discussing a different IDE known as Jupyter Notebook. Jupyter Notebook can be utilized through the Anaconda platform as it is installed as an embedded aspect of such.

Jupyter Notebook’s capabilities are especially adept as it relates to the creation of Python programs written for data science purposes. The reason for this is that, like R-Studio, Jupyter Notebook allows for the creation and storage of data variables across code executions. This differs from the capabilities of Spyder, and other IDEs, which release variables from memory at the point of a program’s termination. As a consequence, Jupyter Notebook also enables separate portions of a program to be run within the IDE. This is incredibly useful as it pertains to data projects, as there are often meddlesome aspects of data frames which require re-assessment.

To begin using Jupyter Notebook, you must first launch the Anaconda platform. This can be achieved by double-clicking on the Anaconda desktop shortcut:


Once the initial Anaconda interface has loaded, to initiate the Jupyter Notebook IDE, click on the “Launch” button located below the option associated with Jupyter Notebook.


If your previous efforts were successful, a new tab should open within the default web browser. It is important not to accidentally close this tab when multi-tasking, or you will lose any work which was un-saved up until that point.

At the initial screen, you must select the directory in which you wish to operate. Also, from this menu, you have the ability to select files to load for further editing.


In our case, we will be creating a new file. To achieve this, we will first click the button labeled “New” on the right. This will generate a drop-down menu, from which we will select: “Notebook: Python 3”.


In the graphic below, I have already typed and executed a small block of code.


Each “In [ ]:” represents a space for input which, when run, executes the entirety of the code contained therein. Variables which are created in this manner persist between executions. However, when the IDE session is terminated, the variables which were created during the session will be lost. Output is generated beneath each “In [ ]:” space.

The rest of the platform is rather self-explanatory. The disk icon saves the session file, the plus icon adds additional “In [ ]:” cells, and the menu bar enables various additional options.


To change the title of the file, double-click on the current file's title, located to the right of the “Jupyter” logo. Doing so generates the screen above, which enables the process to be completed.

Finally, you will likely want to save your data. While clicking on the disk icon accomplishes this function, I would recommend utilizing the drop-down file menu instead. From this menu, you can select the desired file format. I would recommend saving (downloading) files in both the Notebook (.ipynb) and Python (.py) formats. The former allows for a greater ease of editing within the Jupyter platform, and the latter allows your file to be read across Python platforms regardless of the IDE.


In the next article, we will continue our journey through Python programming. Thanks for subscribing, and stay tuned!

Tuesday, August 14, 2018

(R) Modifying Strings with “stringr”

As the previous "R" articles have primarily addressed qualitative data assessment, I feel that it is now more than appropriate to discuss string manipulation within the “R” platform. This article will demonstrate the functions available within the “stringr” package, which provides numerous additions allowing for enhanced string manipulation.

Example String Variables

# With the package: "stringr" downloaded and enabled #

# Demo Set A #

DemoA <- c('The white and black cow')

# Demo Set B #

DemoB <- c('Jumped over the moon')

Changing Cases of String Contents

# Modify an entire string to upper case #

a <- str_to_upper(DemoA)

# Modify an entire string to lower case #

b <- str_to_lower(DemoA)

# Modify an entire string to resemble a book title #

c <- str_to_title(DemoA)

# View modifications #

print(a)

print(b)


print(c)

Console Output:

> print(a)
[1] "THE WHITE AND BLACK COW"
> print(b)
[1] "the white and black cow"
> print(c)
[1] "The White And Black Cow"

Concatenate Strings

# Concatenate a String Variable #

a <- str_c(DemoA, DemoB)

# Join strings but separate joined values with " " #

b <- str_c(DemoA, DemoB, sep = " ")

# View modifications #

print(a)

print(b)

Console Output:

> print(a)
[1] "The white and black cowJumped over the moon"
> print(b)
[1] "The white and black cow Jumped over the moon"

Counting and Locating String Elements

# Count the spaces within a string variable #

a <- str_count(DemoA, pattern = " ")

# Extract string aspect which matches the string variable: “black” #

b <- str_extract (DemoA, "black")

# Count the length of a string in its entirety (includes spaces) #

c <- str_length(DemoA)

# View Outputs #

print(a)

print(b)

print(c)


Console Output:

> print(a)
[1] 4
> print(b)
[1] "black"
> print(c)
[1] 23


Removing String Aspects

# Remove the spaces within a string variable #

a <- str_remove_all(DemoA, " ")

# Replace one character within a string variable with a different character #

b <- str_replace_all(DemoA, " ", "-")

# View Outputs #

print(a)

print(b)


Console Output:

> print(a)
[1] "Thewhiteandblackcow"
> print(b)
[1] "The-white-and-black-cow"


Trimming String Elements

# Remove Unnecessary Whitespace #

DemoB <- c('Jumped over the moon ')

a <- str_trim(DemoB, side = c("right"))

DemoB <- c(' Jumped over the moon')

b <- str_trim(DemoB, side = c("left"))


DemoB <- c(' Jumped over the moon ')

c <- str_trim(DemoB, side = c("both"))

print(a)

print(b)

print(c)


Console Output:

> print(a)
[1] "Jumped over the moon"
> print(b)
[1] "Jumped over the moon"
> print(c)
[1] "Jumped over the moon"


Sorting String Arrays

# Sort a string array alphabetically #

fruits <- c('apple', 'oranges', 'cherry', 'grape')

a <- str_sort(fruits)

# Sort a string array in reverse alphabetical order #

b <- str_sort(fruits, decreasing = TRUE)

print(a)

print(b)


Console Output:

> print(a)
[1] "apple" "cherry" "grape" "oranges"
> print(b)
[1] "oranges" "grape" "cherry" "apple"

Friday, August 10, 2018

(Python) Importing Data

In this article, we will discuss how to properly import data frames into Python through the utilization of various Python modules. Throughout this post, there are numerous links to online resources maintained by the module creators. These resources should be utilized as necessary, as they provide important options which may be useful in addressing additional aspects related to package functions.

Importing (.csv) Data as a Multidimensional Numpy Array

There may be instances in which you wish to have a data frame transformed and stored as a multi-dimensional array. Pursuing such an option is typically necessitated by a desire to produce a machine learning model, as models of this type require the aforementioned variable format.

Limitations exist pertaining to this import function. Chief among these are the function's inability to import data which contains non-numerical elements, and its inability to import columns which contain missing entries.

Therefore, for the function featured within the following example to correctly perform its purpose, the data called therein must not contain missing values or non-numerical elements.

# Enable Numpy #

import numpy

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTest.csv"

# Create a variable to store the data #

# The "delimiter" option specifies the delimiter contained within the data file #

# The "skiprows" option indicates that the first row containing variable names will be omitted #

numpyex = numpy.loadtxt(filepath, delimiter=',', skiprows=1)

# Print the result of the data import process to the console #

print(numpyex)


Console Output:

[[ 83. 2036. 803. 544. 243. 28. 843. 46.]

[ 93. 2015. 804. 465. 296. 15. 815. 32.]

[ 49. 1967. 804. 430. 189. 47. 817. 46.]

[ 100. 1957. 802. 511. 256. 42. 561. 37.]

[ 22. 1925. 803. 529. 172. 96. 345. 32.]

[ 31. 1895. 810. 435. 194. 40. 861. 46.]

[ 94. 1889. 802. 503. 228. 7. 883. 46.]

[ 4. 1722. 802. 535. 260. 80. 300. 50.]

[ 25. 1715. 808. 437. 200. 77. 776. 37.]

[ 46. 1704. 809. 445. 310. 52. 410. 53.]

[ 15. 1646. 802. 502. 223. 79. 296. 31.]

[ 74. 1611. 800. 420. 200. 99. 808. 43.]

[ 79. 1429. 805. 504. 185. 67. 806. 50.]

[ 13. 1401. 801. 415. 283. 23. 235. 53.]

[ 4. 1334. 802. 484. 277. 79. 946. 37.]

[ 47. 1290. 807. 428. 171. 15. 481. 42.]

[ 49. 1274. 805. 406. 306. 12. 296. 34.]

[ 7. 1161. 803. 489. 298. 93. 381. 28.]

[ 93. 1132. 805. 415. 195. 31. 221. 40.]

[ 60. 1131. 804. 413. 185. 5. 308. 33.]]
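For readers without the example (.csv) on hand, the same loadtxt pattern can be exercised against a temporary file (a self-contained sketch; the values below are invented):

```python
import os
import tempfile

import numpy

# Write a miniature csv with a header row #
csv_text = "VarA,VarB,VarC\n83,2036,803\n93,2015,804\n"

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as handle:
    handle.write(csv_text)
    filepath = handle.name

# Same options as above: comma delimiter, header row skipped #
data = numpy.loadtxt(filepath, delimiter=',', skiprows=1)

os.remove(filepath)

print(data)
```

Writing and re-reading a tiny file like this is a convenient way to confirm the delimiter and skiprows settings before pointing the function at a real data set.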


Let us consider another example which demonstrates additional options.

# Enable Numpy #

import numpy

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTest.csv"

# Create a variable to store the data #

# The "delimiter" option specifies the delimiter contained within the data file #

# The "skiprows" option indicates that the first row containing variable names will be omitted #

# The "usecols" option specifies which columns will be included within the input #

numpyex = numpy.loadtxt(filepath, delimiter=',', skiprows=1, usecols=[0,1,2,3])

# Print the result of the data import process to the console #

print(numpyex)


Console Output:

[[ 83. 2036. 803. 544.]

[ 93. 2015. 804. 465.]

[ 49. 1967. 804. 430.]

[ 100. 1957. 802. 511.]

[ 22. 1925. 803. 529.]

[ 31. 1895. 810. 435.]

[ 94. 1889. 802. 503.]

[ 4. 1722. 802. 535.]

[ 25. 1715. 808. 437.]

[ 46. 1704. 809. 445.]

[ 15. 1646. 802. 502.]

[ 74. 1611. 800. 420.]

[ 79. 1429. 805. 504.]

[ 13. 1401. 801. 415.]

[ 4. 1334. 802. 484.]

[ 47. 1290. 807. 428.]

[ 49. 1274. 805. 406.]

[ 7. 1161. 803. 489.]

[ 93. 1132. 805. 415.]

[ 60. 1131. 804. 413.]] 


For more information pertaining to this function and its internal options, consult the official “numpy.loadtxt” documentation.

Now, if you absolutely must import array data which contains both string and numerical data, a different function exists within the “numpy” package: “genfromtxt()”. This function also allows for columns which contain missing elements.

# Enable Numpy #

import numpy

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestII.csv"

# Create a variable to store the data #

# The "delimiter" option specifies the delimiter contained within the data file #

# The "skip_header" option indicates that the first row containing variable names will be omitted #

# The "dtype" option indicates that the data type of each element will be automatically decided #

# The "encoding" option specifies which encoding methodology should be employed when decoding the input file #

numpyex2 = numpy.genfromtxt(filepath, delimiter=',', skip_header=1, dtype=None, encoding=None)

# Print the result of the data import process to the console #

print(numpyex2)


Console Output:

[( 83, 2036, 803, 544, 'BMW') ( 93, 2015, 804, 465, 'Volvo')

( 49, 1967, 804, 430, 'Jeep') (100, 1957, 802, 511, 'Subaru')

( 22, 1925, 803, 529, 'Mitsubishi') ( 31, 1895, 810, 435, '')

( 94, 1889, 802, 503, '') ( 4, 1722, 802, 535, '')

( 25, 1715, 808, 437, '') ( 46, 1704, 809, 445, 'Ford')

( 15, 1646, 802, 502, 'Chevy') ( 74, 1611, 800, 420, 'BMW')

( 79, 1429, 805, 504, 'Volvo') ( 13, 1401, 801, 415, 'Jeep')

( 4, 1334, 802, 484, 'Subaru') ( 47, 1290, 807, 428, 'Mitsubishi')

( 49, 1274, 805, 406, 'Toyota') ( 7, 1161, 803, 489, 'Lexus')

( 93, 1132, 805, 415, 'Nissan') ( 60, 1131, 804, 413, 'Honda')]



Importing (.csv) Data as a Panda Data Frame

Typically, given the mixed variable types and the general concerns of data integrity and presentation, you will most likely prefer to import data into Python through the utilization of functions inherent within the "pandas" package.

# Enable Pandas #

import pandas

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestII.csv"

# Create a variable to store the data #

pandadataframe = pandas.read_csv(filepath)

# Print the result of the data import process to the console #

print(pandadataframe)


Console Output:

VarA VarB VarC VarD VarE

0 83 2036 803 544 BMW

1 93 2015 804 465 Volvo

2 49 1967 804 430 Jeep

3 100 1957 802 511 Subaru

4 22 1925 803 529 Mitsubishi

5 31 1895 810 435 NaN

6 94 1889 802 503 NaN

7 4 1722 802 535 NaN

8 25 1715 808 437 NaN

9 46 1704 809 445 Ford

10 15 1646 802 502 Chevy

11 74 1611 800 420 BMW

12 79 1429 805 504 Volvo

13 13 1401 801 415 Jeep

14 4 1334 802 484 Subaru

15 47 1290 807 428 Mitsubishi

16 49 1274 805 406 Toyota

17 7 1161 803 489 Lexus

18 93 1132 805 415 Nissan

19 60 1131 804 413 Honda


If only the first seven rows of data were required, the following code could be utilized to accomplish this task:

# Import only the first seven rows of data from the example data frame #

pandadataframe = pandas.read_csv(filepath, nrows=7)

# Print the result of the data import process to the console #

print(pandadataframe)


Console Output:

VarA VarB VarC VarD VarE

0 83 2036 803 544 BMW

1 93 2015 804 465 Volvo

2 49 1967 804 430 Jeep

3 100 1957 802 511 Subaru

4 22 1925 803 529 Mitsubishi

5 31 1895 810 435 NaN

6 94 1889 802 503 NaN


For more information pertaining to this function and its internal options:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
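Beyond nrows, read_csv() provides further options which are frequently useful, such as usecols for limiting the imported columns. The sketch below demonstrates this against a small in-memory file created through io.StringIO; the variable names and values are hypothetical stand-ins for the example data above.

```python
import io

import pandas

# Simulate a small (.csv) file in memory; a real file path works identically #
csvdata = io.StringIO("VarA,VarB,VarC\n83,2036,803\n93,2015,804\n")

# "usecols" limits the import to specific columns; "nrows" limits the imported rows #
df = pandas.read_csv(csvdata, usecols=['VarA', 'VarB'], nrows=1)

print(df)
```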

Importing (.xlsx) Data as a Panda Data Frame

There may be instances in which you wish to import Microsoft Excel data into the Python platform. The following code will enable you to do so.

# Enable Pandas #

import pandas

# Specify the appropriate file path #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestIII.xlsx"

# Create a variable to store the data #

pandadataframe = pandas.ExcelFile(filepath)

# Print workbook spreadsheet names #

print(pandadataframe.sheet_names)

# Assign "Sheet1" to a variable # 


# (This variable is specifying which sheet of the workbook you will be importing) #

pandadataframesheet1 = pandadataframe.parse('Sheet1')

# Print the result of the data import process to the console #

print(pandadataframesheet1)


Console Output:

['Sheet1']

VarA VarB VarC VarD VarE VarF

0 83 2036 803 544 BMW One

1 93 2015 804 465 Volvo NaN

2 49 1967 804 430 Jeep NaN

3 100 1957 802 511 Subaru One

4 22 1925 803 529 Mitsubishi One

5 31 1895 810 435 Toyota NaN

6 94 1889 802 503 Lexus NaN

7 4 1722 802 535 Nissan NaN

8 25 1715 808 437 Honda NaN

9 46 1704 809 445 Ford One

10 15 1646 802 502 Chevy NaN

11 74 1611 800 420 BMW NaN

12 79 1429 805 504 Volvo One

13 13 1401 801 415 Jeep NaN

14 4 1334 802 484 Subaru NaN

15 47 1290 807 428 Mitsubishi NaN

16 49 1274 805 406 Toyota One

17 7 1161 803 489 Lexus One

18 93 1132 805 415 Nissan NaN

19 60 1131 804 413 Honda NaN

To replace the “NaN” entries within the “VarF” column with “N/A”, you can utilize the following line of code:

# Replace "NaN" values with "N/A" #

pandadataframesheet1.fillna('N/A', inplace=True)

# Print sheet #

print(pandadataframesheet1)

Console Output:


VarA VarB VarC VarD VarE VarF

0 83 2036 803 544 BMW One

1 93 2015 804 465 Volvo N/A

2 49 1967 804 430 Jeep N/A

3 100 1957 802 511 Subaru One

4 22 1925 803 529 Mitsubishi One

5 31 1895 810 435 Toyota N/A

6 94 1889 802 503 Lexus N/A

7 4 1722 802 535 Nissan N/A

8 25 1715 808 437 Honda N/A

9 46 1704 809 445 Ford One

10 15 1646 802 502 Chevy N/A

11 74 1611 800 420 BMW N/A

12 79 1429 805 504 Volvo One

13 13 1401 801 415 Jeep N/A

14 4 1334 802 484 Subaru N/A

15 47 1290 807 428 Mitsubishi N/A

16 49 1274 805 406 Toyota One

17 7 1161 803 489 Lexus One

18 93 1132 805 415 Nissan N/A

19 60 1131 804 413 Honda N/A
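Alternatively, if you wished to drop the rows containing “NaN” entirely rather than replace them, pandas provides the dropna() method. A minimal sketch, assuming a small hypothetical frame mirroring the structure above:

```python
import pandas

# Hypothetical frame mirroring the columns above; "None" marks missing entries #
df = pandas.DataFrame({'VarE': ['BMW', None, 'Jeep'], 'VarF': ['One', None, None]})

# Drop any row containing at least one missing value #
dropped = df.dropna()

print(dropped)
```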

Additionally, if you desired to create a new pandas data frame which contained only "VarA" and "VarB", the code would resemble:

pandasdataframevaravarb = pandadataframesheet1[['VarA', 'VarB']]

print(pandasdataframevaravarb)


Console Output:

VarA VarB

0 83 2036

1 93 2015

2 49 1967

3 100 1957

4 22 1925

5 31 1895

6 94 1889

7 4 1722

8 25 1715

9 46 1704

10 15 1646

11 74 1611

12 79 1429

13 13 1401

14 4 1334

15 47 1290

16 49 1274

17 7 1161

18 93 1132

19 60 1131 


For more information pertaining to this function and its internal options:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Importing (.sas7bdat) Data as a Panda Data Frame

The following example presents a scenario in which a file in the SAS format is imported as a pandas data frame. I cannot provide exercise data for this example, as my SAS license has since expired.

# Enable Pandas #

import pandas

# Specify the appropriate file path #

filepath = "C:\\Users\\Username\\Desktop\\SASFile.sas7bdat"

# Create a variable to store the data #

pandadataframesas = pandas.read_sas(filepath)

# Print the result of the data import process to the console #

print(pandadataframesas)


With SAS imports, there is always the possibility that the data contained within certain variable columns will appear with a mysterious b' prefix preceding each entry.

For example:

b’ 11111

b’ 22222

b’ 33333


To rectify this issue, which is caused by encoding formats, utilize the subsequent code:

dataframename['variablename'] = dataframename['variablename'].str.decode('utf-8')

In a more realistic scenario, the code might resemble something similar to the following:

DataFrameA['id'] = DataFrameA['id'].str.decode('utf-8')
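To see the decoding in action without a SAS file, the behavior can be reproduced on a hypothetical column of byte-string entries:

```python
import pandas

# Hypothetical column containing byte-string entries, as a SAS import might produce #
df = pandas.DataFrame({'id': [b'11111', b'22222', b'33333']})

# Decode each byte-string entry into a regular string #
df['id'] = df['id'].str.decode('utf-8')

print(df['id'].tolist())
```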

For more information pertaining to this function and its internal options:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html#pandas.read_sas

Exporting Python Data

After you have sorted, edited, and analyzed your data, you’ll most likely want to export the finalized version of the data frame to an outside format. The code below will assist you with this task.

# Exporting Data to (.csv) Format #

# Option: 'sep' specifies the delimiter which will be utilized to separate the data file contents #

pandadataframe.to_csv("C:\\Users\\Username\\Desktop\\pandadataframe.csv", sep=',', encoding='utf-8')

# Exporting Data to (.xlsx) Format #

# Option: 'sheet_name' designates the name of the first sheet of the Excel workbook #

pandadataframesheet1.to_excel("C:\\Users\\Username\\Desktop\\pandadataframesheet1.xlsx", sheet_name='Sheet1')
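By default, to_csv() also writes the data frame's row index as the first column of the output file. If that is not desired, the index=False option suppresses it. A sketch writing to an in-memory buffer rather than a file path, with hypothetical values:

```python
import io

import pandas

# Hypothetical data frame to export #
df = pandas.DataFrame({'VarA': [83, 93], 'VarB': [2036, 2015]})

# Write to an in-memory buffer; "index=False" omits the row index column #
buffer = io.StringIO()

df.to_csv(buffer, sep=',', index=False)

print(buffer.getvalue())
```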

For more information pertaining to this function and its internal options:

http://pandas.pydata.org/pandas-docs/version/0.20.2/generated/pandas.DataFrame.to_csv.html

For code and exercise data pertaining to this article, please click on the link below to visit our new GitHub repository:

https://github.com/RDScientist/CodeRepo

Monday, August 6, 2018

(Python) Dictionaries

Having already thoroughly explored the topic of lists within Python, we will now, in a similar manner, explore a data type which is, in many ways, incredibly similar to lists. In a prior entry, I defined Python dictionaries as:

A dictionary is a collection of elements which are unordered. It is modifiable and does not allow duplicate entries.

A Python dictionary is essentially a collection of key-value pairs in which, as stated above, keys cannot be duplicated.

All of this will be elucidated as we work through the following examples.

Single Entry Dictionaries

As mentioned previously, a dictionary variable is very similar to a list variable. We will illustrate this through the creation of a dictionary variable, which can be enabled, through the synthesis of two list variables.

First, let's define each list type variable:

# List 1 #

key = ['key0', 'key1', 'key2', 'key3', 'key4']

# List 2 #

entry = ['entry0', 'entry1', 'entry2', 'entry3', 'entry4']

# Combine lists into a dictionary by utilizing the "dict" and "zip" functions #

dictionaryex = dict(zip(key, entry))

# Print the dictionary variable #

print(dictionaryex)


Console Output:

{'key0': 'entry0', 'key1': 'entry1', 'key2': 'entry2', 'key3': 'entry3', 'key4': 'entry4'}
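The same synthesis can also be expressed as a dictionary comprehension, which may read more naturally when a transformation must be applied along the way:

```python
# List 1 #
key = ['key0', 'key1', 'key2']

# List 2 #
entry = ['entry0', 'entry1', 'entry2']

# Build the dictionary by pairing each key with its corresponding entry #
dictionaryex = {k: e for k, e in zip(key, entry)}

print(dictionaryex)
```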

Additionally, we also have the option of manually creating a dictionary variable. The code to achieve this would resemble:

# Manually create the dictionary variable #

dictionaryex = {'key0': 'entry0', 'key1': 'entry1', 'key2': 'entry2', 'key3': 'entry3', 'key4': 'entry4'}

# Print the dictionary variable #

print(dictionaryex)


Console Output:

{'key0': 'entry0', 'key1': 'entry1', 'key2': 'entry2', 'key3': 'entry3', 'key4': 'entry4'}

Modifying Dictionary Elements

As was the case with lists, there may be instances in which you desire to add elements, remove elements, or change elements. The code to achieve each is as follows.

# Manually create the dictionary variable #

dictionaryex = {'key0': 'entry0', 'key1': 'entry1', 'key2': 'entry2', 'key3': 'entry3', 'key4': 'entry4'}

# Add element to dictionary variable #

dictionaryex['key5'] = 'entry5'

# Modify element association #

dictionaryex['key0'] = 'en-TREE-0'

# Remove element and associated entry #

del(dictionaryex['key2'])

# Print dictionary variable to view modifications #

print(dictionaryex)


Console Output:

{'key0': 'en-TREE-0', 'key1': 'entry1', 'key3': 'entry3', 'key4': 'entry4', 'key5': 'entry5'}
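As an alternative to del(), the pop() method removes a key and returns its entry at the same time, which is useful when the removed value is still needed:

```python
# Manually create the dictionary variable #
dictionaryex = {'key0': 'entry0', 'key1': 'entry1', 'key2': 'entry2'}

# Remove "key2" and capture its entry simultaneously #
removed = dictionaryex.pop('key2')

print(removed)

print(dictionaryex)
```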

Calling Specific Dictionary Elements

There may be instances in which you wish to call a specific element in a dictionary without knowing what that element initially is. To identify a dictionary element and have its value returned, you may utilize the following functions.

# Manually create the dictionary variable #

dictionaryex = {'key0': 'entry0', 'key1': 'entry1', 'key2': 'entry2', 'key3': 'entry3', 'key4': 'entry4'}

# Print out the dictionary keys #

print(dictionaryex.keys())

# Print out the entry pertaining to "key2" #

print(dictionaryex['key2'])


Console Output:

dict_keys(['key0', 'key1', 'key2', 'key3', 'key4'])

entry2

You may be wondering why dictionaries lack a positional indexing feature. The reason lies in the nature of the variable itself: as mentioned previously, "a dictionary is a collection of elements which are unordered", and without this ordering structure, there is no way to call an element by its place within the overall order.
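Since calling a missing key with square brackets raises a KeyError, the get() method provides a safer lookup which returns a supplied default value instead:

```python
# Manually create the dictionary variable #
dictionaryex = {'key0': 'entry0', 'key1': 'entry1'}

# A present key returns its entry; a missing key returns the supplied default #
print(dictionaryex.get('key0', 'not found'))

print(dictionaryex.get('key9', 'not found'))
```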

Multi-Dimensional Dictionaries

There may arise an occasion when you wish to store multiple entries pertaining to a single key within a dictionary variable. We will explore this phenomenon within the examples below.

# Create multi-dimensional dictionary #

keys = { 'skeletonkey': { 'house': 'hauntedhouse', 'inhabitants':'scary ghosts' },

'carkey': { 'house':'trailer', 'inhabitants':'a man' },

'housekey': { 'house':'manor', 'inhabitants':'empty' }}

# Print out the inhabitants associated with the skeleton key #

print(keys['skeletonkey']['inhabitants'])

# Add an additional entry #

keys['generickey'] = {'house':'generic', 'inhabitants':'N/A'}

# Print modified dictionary #


print(keys)

Console Output:

scary ghosts
{'skeletonkey': {'house': 'hauntedhouse', 'inhabitants': 'scary ghosts'}, 'carkey': {'house': 'trailer', 'inhabitants': 'a man'}, 'housekey': {'house': 'manor', 'inhabitants': 'empty'}, 'generickey': {'house': 'generic', 'inhabitants': 'N/A'}}
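To traverse every key and its nested details at once, the items() method can be combined with a for loop. A shortened, hypothetical version of the dictionary above:

```python
# Create multi-dimensional dictionary #
keys = {'skeletonkey': {'house': 'hauntedhouse', 'inhabitants': 'scary ghosts'},
        'carkey': {'house': 'trailer', 'inhabitants': 'a man'}}

# Loop through each key and its nested dictionary of entries #
for keyname, details in keys.items():
    print(keyname, '->', details['house'], '/', details['inhabitants'])
```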

Friday, August 3, 2018

(Python) Lists - Pt. (III)

As was discussed in the prior entry, there may be instances in which you wish to call a specific list element without initially being aware of the element’s place within the list. Conversely, there may also be instances in which you wish to call an element by position without being aware of what element specifically coincides with that position.

Calling Specific List Elements (cont.)

# Index Functions #

# Create list variables #

a = ["apple", "orange", "pear", "kiwi", "mango"]

b = [0, 1, 2, 3, 4]

# Call and display variable by position #

print(b.index(3))

# Call variable and display position #

print(a.index("pear"))


Coinciding Console Output:

3

2


Calling Elements from Multi-Dimensional Lists

Though you are unlikely to encounter this data type variation within your career as a data technician, for the sake of proficiency, we will briefly discuss how to query from multi-dimensional list types.

# Create Multi-Dimensional List Variable #

fruits = [["apple", "orange"], ["kiwi", "mango"], ["cherry", "strawberry"]]


# Import numpy package #

import numpy


# Store the new list variable as numpy array type #

fruitsnumpy = numpy.array(fruits)

# Print out the third row of the numpy array type variable #

print(fruitsnumpy[2,:])

# Print out the second column of the numpy array type variable #

print(fruitsnumpy[:,1])

# Print out the second element of the first list within the numpy array type variable #

print(fruitsnumpy[:1,1])

# Print out the second element of the third list within the numpy array type variable #

print(fruitsnumpy[2,1:])

Associated Console Output:

['cherry' 'strawberry']

['orange' 'mango' 'strawberry']

['orange']

['strawberry']


Utilizing numpy.logical to logically assess Numpy Arrays

Within the numpy package, there exist two variations of a single function which allow for the logical assessment of numpy array data. Data can be assessed against arguments which satisfy either AND or OR scenarios. This is best demonstrated below.

# Utilizing the numpy logical function to logically assess numpy data lists #

# Import numpy package #

import numpy

# Create test set #

testset = [5,10,15,20,25,30,35,40,45,50]

# Transition test set into numpy array #

testset = numpy.array(testset)

# Return the Boolean output related to the assessment below #

# testset > 20 OR testset < 10 #

print(numpy.logical_or(testset > 20, testset < 10))

# Return the Boolean output related to the assessment below #

# testset > 20 AND testset < 45 #

print(numpy.logical_and(testset > 20, testset < 45))


Associated Console Output:

[ True False False False True True True True True True]

[False False False False True True True True False False]
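A third variation, numpy.logical_not, inverts a Boolean assessment, and any of these Boolean arrays can in turn be used to filter the original array directly:

```python
# Import numpy package #
import numpy

# Create a small test set as a numpy array #
testset = numpy.array([5, 10, 15, 20, 25])

# Invert the assessment "testset > 15" #
mask = numpy.logical_not(testset > 15)

print(mask)

# Utilize the Boolean mask to filter the array itself #
print(testset[mask])
```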

Thursday, August 2, 2018

(Python) Lists – Pt. (II)

In this article, we will continue to review various functions within the Python programming language which are related to Python lists.

Additionally, we will also demonstrate functions which are native to the “Numpy” package (pronounced: “Num-Pie”).

Import a Package

To employ the functions included within the “Numpy” package, we must first import the “Numpy” package, which essentially invokes the library and allows the aspects therein to be utilized. If the Anaconda distribution is the Python distribution which you are operating within, the “Numpy” package is included within the distribution itself. If you are not utilizing the Anaconda distribution, the “Numpy” package must be separately downloaded.

To import, or invoke a package (library) for utilization, you must include the following line within your Python code prior to utilizing a function related to such within your code:

import <name of the package>

In the case of “Numpy”, the code would resemble:

import numpy

Numpy Arrays

Numpy arrays differ from typical Python list variables; each variable type possesses benefits dependent on the circumstances at hand.

For a variable to be utilized in a manner which is enabled specifically through the “Numpy” package, it must first be transformed into a numpy array type variable.

# “Numpy” Example #

# Import numpy package #

import numpy

# Create list variables #

a = ["apple", "orange", "pear", "kiwi", "mango"]

b = [0, 1, 2, 3, 4]

# Create numpy array type variables #

anumpy = numpy.array(a)

bnumpy = numpy.array(b)

# What occurs when you multiply a traditional list variable by 2 #

print(b * 2)


Console Output:

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# What occurs when you multiply a numpy array by 2 #

print(bnumpy * 2)


Console Output:

[0 2 4 6 8]

Additionally, “Numpy” provides the option to assess values through comparison operators.

# Identify all values contained within array “b” which are less than 2 #

# Re-create list variables #

b = [0, 1, 2, 3, 4]

# Create numpy array type variable #

bnumpy = numpy.array(b)

# Assess values within the array #

c = bnumpy < 2

# Print the result #

print(c)


Console Output:

[ True True False False False]

# Create an array which satisfies the above assessment #

d = bnumpy[c]

# Print the result #

print(d)


Console Output:

[0 1]

Sorting Python Lists

The following are Python functions which can be utilized to sort Python lists.

# A list can only be sorted if it contains exclusively, numeric values OR string values #

# Create list variables #

a = ["apple", "orange", "pear", "kiwi", "mango"]

b = [0, 1, 2, 3, 4]

# Reverse the list order #

a = a[::-1]

print(a)


Console Output:

['mango', 'kiwi', 'pear', 'orange', 'apple']

# Reverse the list order in place #

b.reverse()

print(b)

Console Output:

[4, 3, 2, 1, 0]

# Re-create list variables #

a = ["apple", "orange", "pear", "kiwi", "mango"]

b = [0, 1, 2, 3, 4]

# Sort list “A” in reverse alphabetical order #

c = sorted(a, reverse=True)

print(c)


Console Output:

['pear', 'orange', 'mango', 'kiwi', 'apple']

# Sort list “B” in order from greatest to least #

d = sorted(b, reverse=True)

print(d)


Console Output:

[4, 3, 2, 1, 0]

# Re-create list variables #

a = ["apple", "orange", "pear", "kiwi", "mango"]

b = [0, 1, 2, 3, 4]

# Sort list "A" in alphabetical order #

c = sorted(a, reverse=False)

print(c)


Console Output:

['apple', 'kiwi', 'mango', 'orange', 'pear']

# Sort list “B” in order from least to greatest #

d = sorted(b, reverse=False)

print(d)


Console Output:

[0, 1, 2, 3, 4]
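The sorted() function also accepts a key option, which sorts by a derived value rather than by the elements themselves, for instance, by string length:

```python
# Create list variable #
a = ["apple", "orange", "pear", "kiwi", "mango"]

# Sort by the length of each string rather than alphabetically #
# (Elements of equal length retain their original relative order) #
c = sorted(a, key=len)

print(c)
```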

Calling Specific List Elements

There may be instances in which you wish to call a specific element in a list without knowing what that element initially is. To identify a list element and have its value returned, you may utilize the following functions.

# Create list variables #

a = ["apple", "orange", "pear", "kiwi", "mango"]

b = [0, 1, 2, 3, 4]

# Print the largest numeric element within a set #

print(max(b))


Console Output:

4

# Print the smallest numeric element within a set #

print(min(b))


Console Output:

0

# Print the last element of a set as sorted alphabetically #

print(max(a))


Console Output:

pear

# Print the first element of a set as sorted alphabetically #

print(min(a))


Console Output:

apple

Wednesday, August 1, 2018

(R) Stationary Data and Random Walks

In this entry, we are going to address a subject which is rather complicated at its core. Just as it is complicated, it is also obscure, in that, it is rarely discussed in textbooks, or even online for that matter.

Stationary data refers to data which is essentially static, in that the underlying process which generates the data points does not possess a directional aspect. Stationary data sets can be random. However, data sets which are random are not required to be stationary.

Investopedia Defines NON-Stationary Data as:

"In contrast to the non-stationary process that has a variable variance and a mean that does not remain near, or returns to a long-run mean over time, the stationary process reverts around a constant long-term mean and has a constant variance independent of time."

This phenomenon, when illustrated, would resemble something similar to the graphic below:


In this case, "x" would be the time variable, and "y" would be the coinciding measurement.

The rules for a stationary data set are thus:

1. The expectation of the process is equal to a constant, meaning, time does not act as a contributing factor. (The mean of the process does not vary across time.)

2. The variance of the series is constant across time.

3. Variance in the series must only be impacted by changes within the “x” variable, and not, by changes within the “y” variable.

How does this data series differ from a "random walk"?

In the case of a "random walk" series of data, the main differentiating assumption pertains to the correlation between variables "x" and "y". Namely, in a manner similar to the first rule of stationary data, the correlation between "x" and "y" will approach "0" over an infinite series of time.

While the value of "0", as it relates to the value of the correlation, may be reached as time progresses, data generated by random walks possess an aspect known as "drift". It is this inherent component of the random walk series which causes trends to emerge which may, in some cases, present the illusion of linearity.

Below is a graphical representation of fictitious random walk data:


To demonstrate the aforementioned concepts, we will attempt two examples:

Example (Stationary Data Series)

# Requires that the package: “tseries”, be downloaded and enabled. #

# Create Stationary Data Set #

Value <- arima.sim(model = list(order = c(0,0,0)), n = 1000)

# Alternative Hypothesis (H1): Data is stationary #

adf.test(Value)

# Alternative Hypothesis (H1): Data is not a random walk #

PP.test(Value)

# Plot data points #

plot(Value)


In the case of our randomly generated stationary data, the graphical output is as follows:


The console output is as follows:

Augmented Dickey-Fuller Test

data: Value
Dickey-Fuller = -8.4201, Lag order = 9, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(Value) : p-value smaller than printed p-value

>
> # Alternative Hypothesis (NA): Data is not a random walk #
>
> PP.test(Value)

Phillips-Perron Unit Root Test

data: Value
Dickey-Fuller = -30.432, Truncation lag parameter = 7, p-value = 0.01


Example (Random Walk Data Series)

# Requires that the package: “tseries”, be downloaded and enabled. #

# Create Random Walk Set #

Value <- arima.sim(model = list(order = c(0, 1, 0)), n = 1000)

# Alternative Hypothesis (H1): Data is stationary #

adf.test(Value)

# Alternative Hypothesis (H1): Data is not a random walk #

PP.test(Value)

# Plot data points #

plot(Value)


In the case of our randomly generated stationary data, the graphical output is as follows:


The console output is as follows:

Augmented Dickey-Fuller Test

data: Value
Dickey-Fuller = -2.2039, Lag order = 9, p-value = 0.492
alternative hypothesis: stationary

>
> # Alternative Hypothesis (NA): Data is not a random walk #
>
> PP.test(Value)

Phillips-Perron Unit Root Test

data: Value
Dickey-Fuller = -2.4451, Truncation lag parameter = 7, p-value = 0.3899


Methods Utilized and Conclusions

The Augmented Dickey-Fuller Test is a methodology of analysis utilized to test data sets for stationarity. The lag value can be set by the user to determine the sensitivity of the model. Typically, this value is set to reflect the number of trend periods which exist within the data. I recommend leaving the value at its default setting. If this is the case, the function will assume a lag value of: trunc((length(x)-1)^(1/3)). More information pertaining to this option, and the function itself as it exists within the “tseries” package, can be found by utilizing the following command:

??adf.test

The Phillips-Perron Unit Root Test is very similar to the previously mentioned test, however, The Phillips-Perron Unit Root Test, for our purposes, is being utilized to test a time series for random walk potential. The working hypothesis in this scenario would be:

H0: The data set shares similarities with a random walk series of data

H1: The data set does not share similarities with a random walk series of data

For more information on the function utilized, you may call the following command:

??PP.test

In our first example series, the p-value for the Augmented Dickey-Fuller Test was .01 (reported as smaller than the printed value), thus indicating that, assuming an alpha value of .05, we can state that the data was stationary. Since the p-value related to the Phillips-Perron Unit Root Test was .01, assuming an alpha value of .05, we can state that the data does not exhibit the patterns typically observed within random walk data.

In our second example series, the p-value for the Augmented Dickey-Fuller Test was 0.492, thus indicating that, assuming an alpha value of .05, we cannot state that the data was stationary. Since the p-value related to the Phillips-Perron Unit Root Test was 0.3899, assuming an alpha value of .05, we cannot reject the possibility that the data exhibits the patterns typically observed within a random walk data series.

For additional information pertaining to the subject matter discussed within this article, please visit the resources below:

https://www.quora.com/Is-a-random-walk-the-same-thing-as-a-non-stationary-time-series

https://www.youtube.com/watch?v=JytDF8ph2ko

(Python) Lists

In this article, we will discuss the concept of "Lists", which is one of the most useful concepts within the Python programming language.

Lists are very similar to vectors within the R programming language. In a prior article, I defined lists as:

A list is a collection of elements. Lists are ordered and modifiable.

For this to possess any meaning at all, we must delve into some example code.

Create a List

# Create List #

a = [ 'apple', 'orange', 'pear', 'kiwi', 'mango']


Lists, within Python, can contain both numeric and string variables.

# Create List with Mixed Variables #

b = [ 'apple', 1, 'pear', 2, 'mango']

# Print Both Lists #

print(a)

print(b)


It is also possible to create a list which consists of multiple lists.

# Create Lists #

a = [ 'apple', 'orange', 'pear', 'kiwi', 'mango']

b = [ 'apple', 1, 'pear', 2, 'mango']

c = [a, b]

# Print New List #

print(c)


Console Output:

[['apple', 'orange', 'pear', 'kiwi', 'mango'], ['apple', 1, 'pear', 2, 'mango']]

Adding Elements to a List

Additional list elements can be added to the end of a list through the utilization of the append() function.

# Create List #

a = [ "apple", "orange", "pear", "kiwi", "mango"]

# Add element: "banana" to list: "a" #

a.append("banana")

# Print list: "a" #

print(a)


Console Output:

['apple', 'orange', 'pear', 'kiwi', 'mango', 'banana']

Referencing Select Elements from a List

To select an element from a list, you must first identify the position of the element. Perhaps counterintuitively, list elements begin at reference position “0”. To illustrate this concept, please view the explanation below:

a = [ "apple", "orange", "pear", "kiwi", "mango"]

“apple” is referenced by element: “0”.

“orange” is referenced by element: “1”.

“pear” is referenced by element: “2”.

“kiwi” is referenced by element: “3”.

“mango” is referenced by element: “4”.

Now, we will demonstrate the way in which to reference an element within a list.

# Print out an initial element from a list #

print(a[0])


Console Output:

apple

# Print out the last element from a list #

print(a[-1])


Console Output:

mango

If you wish to count the list elements in a backwards manner, you have the option of doing so through the utilization of negative values. To illustrate this concept, please view the explanation below:

“apple” is referenced by element: “-5”.

“orange” is referenced by element: “-4”.

“pear” is referenced by element: “-3”.

“kiwi” is referenced by element: “-2”.

“mango” is referenced by element: “-1”.

# Print out the first element from a list #

print(a[-5])


Console Output:

apple

Referencing a Series of Elements from a List

There may be instances in which you would like to reference a series of elements contained within a variable list.

# Create List #

a = [ "apple", "orange", "pear", "kiwi", "mango"]

firsttwoelements = a[0:2]

print(firsttwoelements)


Console Output:

['apple', 'orange']

What the code “a[0:2]” refers to is elements “0” (apple) through “2” (pear), not including element “2”.

lasttwoelements = a[-2:]

print(lasttwoelements)


Console Output:

['kiwi', 'mango']

What the code “a[-2:]” refers to is the elements from “-2” (kiwi) through the end of the list, including element “-2”.

Modifying List Elements

There may be instances in which a list element requires modification, this can be achieved through the utilization of the code below:

# Create List #

a = [ "apple", "orange", "pear", "kiwi", "mango"]

# Modify the initial list element #

a[0] = "dragon fruit"

# Modify the final list element #

a[-1] = "cherry"

# Print the modified list #

print(a)


Console Output:

['dragon fruit', 'orange', 'pear', 'kiwi', 'cherry']

Combining Lists

Let’s say that we wish to combine two lists into a single list which consists of elements of both prior components.

# Create Lists #

a = [ "apple", "orange", "pear", "kiwi", "mango"]

b = [ "apple", 1, "pear", 2, "mango"]

# Combine Lists #

c = a + b

# Print the modified list #

print(c)


Console Output:

['apple', 'orange', 'pear', 'kiwi', 'mango', 'apple', 1, 'pear', 2, 'mango']
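The same combination can be achieved in place through the extend() method, which appends every element of one list onto another without creating a third variable:

```python
# Create Lists #
a = ["apple", "orange"]

b = ["pear", 1]

# Append each element of "b" onto the end of "a" in place #
a.extend(b)

print(a)
```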

Deleting Elements from a List

What if you wish to delete an element from a list? The following functions will allow you to achieve such.

# Create List #

a = [ "apple", "orange", "pear", "kiwi", "mango"]

# Remove the initial list element #

del(a[0])

# Print the modified list #

print(a)

# Remove the last element from the modified list #

a.pop()

# Print the modified list #

print(a)


This produces the output:

['orange', 'pear', 'kiwi', 'mango']

['orange', 'pear', 'kiwi']
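Additionally, when the value to be deleted is known but its position is not, the remove() method deletes the first matching element:

```python
# Create List #
a = ["apple", "orange", "pear", "kiwi", "mango"]

# Delete the first occurrence of "pear", regardless of its position #
a.remove("pear")

print(a)
```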