Reflections of a Data Scientist: June 2019

Hey, Data Heads! I’m back from an extended hiatus with a quick article to demonstrate a very useful Pandas function.

To understand this entry, you must first have some prior experience with both the Python programming language, and the Pandas Python library. If you are unfamiliar with either of the aforementioned topics, information and demonstrations related to such can be found within previous articles featured on this website.

As you may recall from a much earlier article which discussed the SAS programming platform, a limitation exists within the SAS language which inhibits the joining of multiple distinct variables into a single cell entry, with all associated entries from other column variables being combined into a single associated variable adjacent to the distinct variable entry. In prior articles on this topic, I designed a series of macros to accomplish what I have just described, however, in the case of Python, specifically through the utilization of the Pandas library, this task can be achieved through a single line of code.

Example:

We will begin by enabling the Pandas library. After which, we will import the familiar data set: "SetA", into the allocated memory.

Just as a reminder, if you aren’t in the mood to input the .CSV cell entries yourself, this file, and all others, can be found within this website’s associated GitHub repository.

# Enable Pandas Package #

import pandas

# Specify the appropriate file path for import #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Desktop\\SetA.csv"

# Create a variable to store the data #

pandadataframe = pandas.read_csv(filepath)

# Modify the column variable to the appropriate variable format and type #

pandadataframe['VARA'] = pandadataframe['VARA'].astype('str')

pandadataframe['DATAVAL'] = pandadataframe['DATAVAL'].astype('str')

The function below, which serves as the method for generating the desired result, can only be utilized if all related variables referenced are of the "string" type. It is for this reason that the two lines of code above this description perform a variable type modification. This ensures that each variable referenced in the code below is a string type variable.

pandadataframe = pandadataframe.groupby(['VARA'])['DATAVAL'].apply('|'.join).reset_index()

Once the above function has performed its task, we will then perform the print function in order to display the results of such.

print(pandadataframe)

Which displays the following output:

VARA DATAVAL
0 A 1|2
1 B 1|2|3
2 E 1|2|3|4

As we have succeeded with our task, all that remains is saving our newly created data set. This can be achieved through the utilization of the code below:

# Choose file pathway designation to indicate where data will be saved #

pandadataframe.to_csv("C:\\Users\\ Desktop\\SetAOutput.csv", sep=',', encoding='utf-8', index = False)

The data set, when viewed within MS-Excel will resemble the following image:

I hope that you found this article helpful. Soon I’ll be back with another entry, but not too soon. Until then, stay inquisitive, Data Heads!

Reflections of a Data Scientist

Saturday, June 8, 2019

(Python) Joining Distinct Variable Cell Entries with Pandas