Reflections of a Data Scientist: September 2018

"Matplotlib" is a Python package with enables the creation of graphical data outputs within the Python platform. As mentioned in a prior article, in most cases, it may be easier and even more aesthetically pleasing to graph data within an Excel workbook. However, there are instances in which "matplotlib" may be able to provide a graphical variation which is un-available within other software suites.

In this article, we will examine a few different variations of graphical outputs which can be produced through the utilization of the "matplotlib" package. With the fundamentals firmly grasped, you should then possess the ability to conduct further research as it pertains to the more eclectic options which enable less frequently utilized visuals.

A link to the package website is featured at the end of this entry. There, you can find numerous demonstrations and templates which illustrate the package's full capabilities.

For demonstrative purposes, we will again be utilizing the data frame: "PythonImportTestIII".

This data frame can be downloaded from this website's GitHub Repository:

GitHub

(Files are sorted in chronological order by coinciding entry date)

The file itself is in .csv format, and must be imported prior to analysis.

All example require that the following lines be included within the initial code file:

# Enable Matplotlib #

import matplotlib.pyplot as plt

# Enable Pandas #

import pandas

Basic Line Graph

Let's start by demonstrating a basic line graph.

# Adjust output dimensions #

plt.figure(figsize=(5,5))

# Sort data by the X-Axis #

# Not performing this steps causes your data to resemble a scribble #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA'])

# Plot the data #

plt.plot(PythonImportTestIII['VarA'], PythonImportTestIII['VarD'])

# Output the newly ceated graphic #

plt.show()

Not incredibly inspiring. Let's add a few more details to make our graph a bit more complete.

# Adjust output dimensions #

plt.figure(figsize=(5,5))

# Sort data by the X-Axis #

# Not performing this steps causes your data to resemble a scribble #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA'])

# Plot the data #

plt.plot(PythonImportTestIII['VarA'], PythonImportTestIII['VarD'])

# Assign axis labels and graph title #

xlab = 'X-Axis Label'

ylab = 'Y-Axis Label'

title = 'Graph Title'

# Assign axis labels and graph title to graphical output #

plt.xlabel(xlab)

plt.ylabel(ylab)

plt.title(title)

# Output the newly created graphic #

plt.show()

Now let's take it to another level by adding a grid to the background of our data graphic.

# Adjust output dimensions #

plt.figure(figsize=(5,5))

# Sort data by the X-Axis #

# Not performing this steps causes your data to resemble a scribble #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA'])

# Plot the data #

plt.plot(PythonImportTestIII['VarA'], PythonImportTestIII['VarD'])

# Assign axis labels and graph title #

xlab = 'X-Axis Label'

ylab = 'Y-Axis Label'

title = 'Graph Title'

# Assign axis labels and graph title to graphical output #

plt.xlabel(xlab)

plt.ylabel(ylab)

plt.title(title)

# Add a grid to the graph #

plt.grid()

# Output the newly created graphic #

plt.show()

Line Graph with Connected Data Points

If you're anything like yours truly, you'd prefer to have data points specifically displayed within the graphical output. To achieve this, utilize the following code:

# Adjust output dimensions #

plt.figure(figsize=(5,5))

# Sort data by the X-Axis #

# Not performing this steps causes your data to resemble a scribble #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA'])

# Plot the scattered data #

plt.scatter(PythonImportTestIII['VarA'], PythonImportTestIII['VarD'])

# Plot the data lines #

plt.plot(PythonImportTestIII['VarA'], PythonImportTestIII['VarD'])

# Assign axis labels and graph title #

xlab = 'X-Axis Label'

ylab = 'Y-Axis Label'

title = 'Graph Title'

# Assign axis labels and graph title to graphical output #

plt.xlabel(xlab)

plt.ylabel(ylab)

plt.title(title)

# Add a grid to the graph #

plt.grid()

# Output the newly created graphic #

plt.show()

Yes, life is beautiful.

Scatter Plot

To create a pure scatter plot, utilize the code below:

# Adjust output dimensions #

plt.figure(figsize=(5,5))

# Plot the scattered data #

plt.scatter(PythonImportTestIII['VarA'], PythonImportTestIII['VarD'])

# Assign axis labels and graph title #

xlab = 'X-Axis Label'

ylab = 'Y-Axis Label'

title = 'Graph Title'

# Assign axis labels and graph title to graphical output #

plt.xlabel(xlab)

plt.ylabel(ylab)

plt.title(title)

# Output the newly created graphic #

plt.show()

Histogram

To create a histogram, try implementing the following code:

# Adjust output dimensions #

plt.figure(figsize=(6,6))

# Create a histogram for the data #

plt.hist(PythonImportTestIII['VarD'])

# Assign axis labels and graph title #

xlab = 'X-Axis Label'

ylab = 'Y-Axis Label'

title = 'Graph Title'

# Assign axis labels and graph title to graphical output #

plt.xlabel(xlab)

plt.ylabel(ylab)

plt.title(title)

# Output the newly created graphic #

plt.show()

Vertical Bar Chart

To create a vertical bar chart, utilize the code below:

# Create a vertical bar chart for the data #

plt.bar(PythonImportTestIII['VarC'], PythonImportTestIII['VarD'])

# Assign axis labels and graph title #

xlab = 'X-Axis Label'

ylab = 'Y-Axis Label'

title = 'Graph Title'

# Assign axis labels and graph title to graphical output #

plt.xlabel(xlab)

plt.ylabel(ylab)

plt.title(title)

# Output the newly created graphic #

plt.show()

Horizontal Bar Chart

The code to plot a horizontal bar chart is only slightly different:

# Create a horizontal bar chart for the data #

plt.barh(PythonImportTestIII['VarC'], PythonImportTestIII['VarD'])

# Assign axis labels and graph title #

xlab = 'X-Axis Label'

ylab = 'Y-Axis Label'

title = 'Graph Title'

# Assign axis labels and graph title to graphical output #

plt.xlabel(xlab)

plt.ylabel(ylab)

plt.title(title)

# Output the newly created graphic #

plt.show()

Conclusion

The fundamental graphics which I have demonstrated are by no means an adequate summarization of what is offered within the "matplotlib" package. All sorts of additional options exist which include, but are not limited to: error bars, color, multiple lines within a single line graph, stacked bar charts, etc. It isn't an exaggeration to state that an entire blog could be dedicated just to demonstrate the various functionalities which exist inherently with the "matplotlib" package.

Therefore, for more information related to the functionality of this package, I would recommend performing independent research related to the topic. Or, you could visit the package creators' website, which includes numerous additional templates and examples:

Matplotlib Organization

That's all for now. Stay tuned, Data Monkeys!

The topic of today's post is: Data Frame Maintenance within the Python platform, specifically pertaining to data imported through the utilization of the “pandas” package.

All of the exercises featured within this article require the following file to be successfully demonstrated:

PythonImportTestIII.csv

This file can be found within the exercise code and example data set repository:

GitHub Repository

Files are sorted based on article date.

Also, for any of these examples to work, you must be sure to include:

import pandas

import numpy

within the first initial few lines of your Python program code.

Be sure to re-import the data set after performing an example which modifies the underlying data structure.

Checking Data Integrity

After your data has been successfully imported into Python, you should check the integrity of the data structure to ensure that all of the original data was imported correctly. Listed below, are some of the commands that can be utilized to ensure that data integrity was maintained.

If the utilization of this command is infeasible due to the size of the data frame, you could instead utilize the head or tail commands.

The head command template is:

<DataFrameName>.head(<number of rows to display>)

Executing this command will display the first n number of rows contained within the data frame.

# Example: #

# Print the first 10 rows of the data set #

PythonImportTestIII.head(10)

The tail command template is:

<DataFrameName>.tail(<number of rows to display>)

Executing this demand will display the last n number of rows contained within the data frame.

# Example: #

# Print the last 5 rows of the data set #

PythonImportTestIII.tail(5)

Adding a List as a Column

For this example, we'll pretend that you wanted to add a new column in the form of a list, to an existing data frame.

# Add Column #

# Create List #

VarG = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

# Modify List into Panda Series #

VarG = pandas.Series(VarG)

# Add Column as Panda Series #

PythonImportTestIII['VarG'] = VarG.values

# Print Results #

print(PythonImportTestIII)

To demonstrate the scenario in which the list possesses a length which is less that observational size of the data frame:

# Add Column of Un-equal Length #

# Create List #

VarH = [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

# Modify List into Panda Series #

newvar = pandas.Series(VarH)

# Rename Column within Data Frame #

newvar.name = 'VarH'

# Add Column to Data Frame #

PythonImportTestIII = pandas.concat([PythonImportTestIII, newvar], axis=1)

# Print Results #

print(PythonImportTestIII)

Adding an Observation to a Data Frame

In the case of adding an additional row (observation) to an already existent data frame, the following code can be utilized.

We must first code the entries that we wish to add in the same manner in which the initial data frame is encoded.

newobservation = pandas.DataFrame({'VarA': ['30'],

'VarB': ['1000'],

'VarC': ['833'],

'VarD': ['400']},

index = [20])

We will print the row observation to illustrate its structure.

print(newobservation)

To provide this addition to the existing data frame, the following code can be utilized:

PythonImportTestIII = pandas.concat([PythonImportTestIII, newobservation])

Again, we will print to illustrate the structure of the amended data frame:

print(PythonImportTestIII)

Changing a Column Name

Let's say, for example, that you are working with the data frame named: "PythonImportTestIII". For whatever reason, the first column of this particular data frame needs to be re-named. The code to accomplish this task is below:

DataFrame.rename(columns={'originalcolumnname':'newcolumnname'}, inplace = True)

So, if you wanted to change the name of the first column of “PythonImportTestIII" to, “DataBlog", the code would resemble:

PythonImportTestIII.rename(columns={'VarA':'DataBlog'}, inplace=True)

print(PythonImportTestIII)

Changing Column Variable Type

Now, let's say that you wanted to change the data type that is contained within a column of an existing data frame. Again, we will use "PythonImportTestIII" for our example.

This code will change a column variable to a "string" type:

PythonImportTestIII['VarA'] = PythonImportTestIII['VarA'].astype('str')

This code will change a column variable to an "integer" type:

PythonImportTestIII['VarA'] = PythonImportTestIII['VarA'].astype('int')

This code will change a column variable to a "float" type:

PythonImportTestIII['VarA'] = PythonImportTestIII['VarA'].astype('float')

To check the current variable types of variables contained within a data frame, utilize the following code:

PythonImportTestIII.dtypes

Re-Ordering Columns within a Data Frame

For example, if you were working on a data frame (“PythonImportTestIII”), with the column names of ("VarA", "VarB", "VarC", "VarD", "VarE", "VarF"), and you wanted to re-order the columns so that they were displayed such as ("VarF", "VarE", "VarA", "VarB", "VarC", "VarD") you could run the code:

PythonImportTestIII = PythonImportTestIII[['VarF', 'VarE', 'VarA', 'VarB', 'VarC', 'VarD']]

Removing Columns from a Data Frame

Assuming that we were still utilizing the same data frame as previously ("PythonImportTestIII"), and we desired to remove certain columns from it, the following code code be utilized:

# Remove Column Variables: "VarE" and "VarF" from PythonImportTestIII #

PythonImportTestIII = PythonImportTestIII.drop(columns=['VarE', 'VarF'])

print(PythonImportTestIII)

Removing Select Rows from a Data Frame

If we desired to instead, remove certain rows within a data frame, we could achieve such subsequent to determining which rows required removal.

# Remove Rows: "0" and "1" from PythonImportTestIII #

PythonImportTestIII = PythonImportTestIII.drop([0, 1])

Create a New Column Variable from Established Column Variable(s)

To create a new column variable from an already existent variable, the code resembles:

# Create a copy of variable: 'VarE', as a new variable: 'VarG' #

PythonImportTestIII['VarG'] = PythonImportTestIII['VarE']

To create a new column variable as a product of existing variables, the code resembles:

# Create a new variable: 'VarH', as the product value of 'VarA' and 'VarB' #

PythonImportTestIII['VarH'] = PythonImportTestIII['VarA'] * PythonImportTestIII['VarB']

This can similarly be achieved with the code:

PythonImportTestIII['VarH'] = PythonImportTestIII[PythonImportTestIII.columns[0]] + PythonImportTestIII[PythonImportTestIII.columns[1]]

Drop a Data Frame or Python Variable

There may arise an instance in which you desire to remove a previously designated variable, for example:

# Create List Variable: 'a' #

a = [0,1,2,3,4,5]

# Print 'a' to Console #

print(a)

# Delete Variable: 'a' #

del a

# Print 'a' to Console #

print(a)

Console Output:

NameError: name 'a' is not defined

In this case, you will notice the error which is displayed in lieu of the variable description. The variable 'a' is now free to be re-assigned as necessary.

Create a Data Frame without Importing Data

If there is ever the case that you wish to create a data frame from scratch, without importing a previously created data structure, the following code can be utilized:

# Create a new Data Frame #

sampledataframe = pandas.DataFrame({

'Column1': [0, 1, 2, 3],

'Column2': [4, 5, 6, 7]

})

# Print to console #

print(sampledataframe)

# This will also achieve a similar result #

# Create Data Variables #

a = [0, 1, 2, 3]

b = [4, 5, 6, 7]

# Create a new Data Frame #

sampledataframe0 = pandas.DataFrame({

'Column1': a,

'Column2': b

})

# Print to console #

print(sampledataframe0)

Stacking Data Frames

Perhaps you want to stack two data frames, one on top of the other. This can be achieved with the example code:

# Stack the Data Frame: "PythonImportTestIII" on top of itself #

PythonImportTestIIIConcat = pandas.concat([PythonImportTestIII, PythonImportTestIII])

# Print to console #

print(PythonImportTestIIIConcat)

If there were instances where variables from one data frame were not present in the other, a 'NaN' would indicate this discrepancy.

Using Conditionals to Create New Data Frame Variables

In the previous article: Pip and SQL ,we discussed how to download and appropriately utilize the wonderful 'pandasql' package. However, no working demonstration was provided within the entry.

There are many ways to conditionally utilize Python's pandas to create new variables and filter through variables based on conditions. However, I have found that the best way to achieve multi-variable query results is through the utilization of SQL emulation with the 'pandasql' package.

In this first scenario, we will be creating a new variable "VarG" and assigning it a value based on the following conditions:

If "VarD" is less than or equal to 450, then "VarG" will be assigned the value: "<= 450"

If "VarD" is greater than 450 and less than 500, then "VarG" will be assigned the value: "451-499"

If "VarD" is greater than or equal to 500, then "VarG" will be assigned the value: ">= 500"

To achieve this, we will be utilizing the following code:

# Requires the "pandasql" package to have been previously downloaded #

from pandasql import *

pysqldf = lambda q: sqldf(q, globals())

q = """

SELECT *,

CASE

WHEN (VarD <= 450) THEN '<= 450'

WHEN (VarD > 450 AND VarD < 500) THEN '451-499'

WHEN (VarD >= 500) THEN '>= 500'

ELSE 'UNKNOWN' END AS VarG

from PythonImportTestIII;

"""

df = pysqldf(q)

print(df)

PythonImportTestIII = df

# Print Data Frame to Console #

print(PythonImportTestIII)

In this next scenario, we will delete row entries which meet the following conditions:

If "VarD" is less than or equal to 450, then "VarG" will be assigned the value: "X"

If "VarD" is greater than or equal to 500, then "VarG" will be assigned the value: "X"

If "VarD" does not satisfy either of the prior conditions, then "VarG" will be assigned the value:" "

# Requires the "pandasql" package to have been previously downloaded #

pysqldf = lambda q: sqldf(q, globals())

q = """

SELECT *,

CASE

WHEN (VarD <= 450) THEN 'X'

WHEN (VarD >= 500) THEN 'X'

ELSE " " END AS VarG

from PythonImportTestIII;

"""

df = pysqldf(q)

PythonImportTestIII = df

# Print Data Frame to Console #

print(PythonImportTestIII)

# Filter Out Row Observations in which variable: 'VarG' equals 'X' #

PythonImportTestIIIFilter = PythonImportTestIII[PythonImportTestIII.VarG != 'X']

# Print Data Frame to Console #

print(PythonImportTestIIIFilter)

# Remove variable: 'VarG' in its entirety #

PythonImportTestIII = PythonImportTestIII.drop(columns=['VarG'])

# Print Data Frame to Console #

print(PythonImportTestIII)

Extracting Rows and Columns from a Data Frame

Finally, we arrive at the simplest demonstrable task, extracting data entries and variables from an imported data frame. If multiple conditions must be met as it pertains to specifying variable ranges, I would recommend utilizing the above examples to prepare the data frame prior to the extraction process.

Extracting Column Data

# Extract Columns by Variable Name #

ExtractedCols = PythonImportTestIII[['VarA', 'VarB']]

# Extract Columns by Variable Position #

ExtractedCols0 = PythonImportTestIII.iloc[:, 0:2]

# Print to Console #

print(ExtractedCols)

print(ExtractedCols0)

Extracting Row Data

# Extract Rows by Obervation Position #

ExtractedRows0 = PythonImportTestIII.iloc[0:5, :]

# Print to Console #

print(ExtractedRows0)

Reset Index

In our prior example demonstrating pandasql, we produced a new data set resembling:

VarA VarB VarC VarD VarE VarF

1 93 2015 804 465 Volvo None

14 4 1334 802 484 Subaru None

17 7 1161 803 489 Lexus One

As you may notice, the left-most column, the 'index' column, is now mis-labeled.

To correct this, we must rest the index values. This is accomplished through the utilization of the following code:

# Correct Index Values #

PythonImportTestIII = PythonImportTestIII.reset_index(drop = True)

# Print to Console #

print(PythonImportTestIII)

Console Output:

VarA VarB VarC VarD VarE VarF

0 93 2015 804 465 Volvo None

1 4 1334 802 484 Subaru None

2 7 1161 803 489 Lexus One

Sorting Data Frames

Now that you have all of your data clean and extracted, you may want to sort it. Below are the functions to accomplish this task, and the options available within each.

# Sort Data #

# Sort by variable: 'VarA' #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA'])

print(PythonImportTestIII)

# Sort by variable: 'VarA' and 'VarB' #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA', 'VarB'])

print(PythonImportTestIII)

# Sort by variable: 'VarA' (descending order) #

PythonImportTestIII = PythonImportTestIII.sort_values(by='VarA', ascending=False)

print(PythonImportTestIII)

# Sort by variable: 'VarA' (put NAs first) #

PythonImportTestIII = PythonImportTestIII.sort_values(by = 'VarA', na_position = 'first')

print(PythonImportTestIII)

Reflections of a Data Scientist

Saturday, September 1, 2018

(Python) Graphing Data with “Matplotlib”

(Python) Data Frame Maintenance