Reflections of a Data Scientist: (Python) Pip and SQL

As was the case with the R data platform, numerous auxiliary packages also exist within Python which enable additional functionality. In today’s article, we will be discussing the 'pip' package, which allows for the installation and maintenance of auxiliary Python packages.

We will also be briefly discussing, within the contents of this article, the 'pandasql' package, which enables the emulation of SQL related functionality within the Python platform.

Installing a Package with Pip

'Pip' functionality, as it pertains to the appropriate coding to utilize, is determinant upon the Python IDE which is being currently operated.

As it is applicable to the Jupyter Notebook IDE, installing a package through 'pip' utilization would resemble the following:

# Install a package using 'pip' #

import pip

pip.main(['install', 'nameofpackag'])

In the case of our example, in which we wish to install 'pandasql', the code to achieve such would be:

# Install 'pandasql' through 'pip' #

import pip

pip.main(['install', 'pandasql'])

If the code successfully runs, you should receive an output which resembles:

Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3

Update a Package with Pip

There will also be instances in which you wish to update a package which has already been previously installed. 'Pip' can accomplish this through the utilization of the following code:

# Update a package #

import pip

pip.main(['install', '--upgrade', 'pip'])

In the above case, 'pip' itself is being upgraded. The code which is being utilized can be modified so long as it resembles the template below:

# Update a package #

import pip

pip.main(['install', '--upgrade', 'NameofPackage'])

If the code successfully runs, you should receive an output which resembles:

Installing collected packages: pip
Found existing installation: pip 9.0.1
Uninstalling pip-9.0.1:
Successfully uninstalled pip-9.0.1
Successfully installed pip-18.0

Emulating SQL Functionality within Python with ‘PandaSQL’

As the package name implies ('PandaSQL'), data must be formatted within a pre-allotted panda data frame. Once this has been accomplished, 'PandaSQL' enables for the manipulation of data within the Python platform as if it were an SQL server.

I will not provide an example for this particular package but I will provide the coding template for utilizing its functionality.

In the case of 'PandaSQL', the following code line must always be included prior to writing pseudo-SQL statements.

pysqldf = lambda q: sqldf(q, globals())

Additionally, the code must always be stored in a variable designated as 'q'.

Finally, 'PandaSQL' code can be written in exactly the same format as regular SQL code. However, the key differentiating factor, is that the code must be surrounded by three sets of quotation marks (“””).

Therefore if we were to write some sample code which utilizes the 'PandaSQL' package, the code would resemble:

from pandasql import *

import pandas

pysqldf = lambda q: sqldf(q, globals())

q = """

SELECT

VARA,

VARB

FROM

pandadataframe3

ORDER BY VARA;

"""

df = pysqldf(q)

The output of which would be stored in the Python variable: "df".

That’s it for now, Data Heads! Stay tuned for my informative articles.

Reflections of a Data Scientist

Thursday, August 23, 2018

(Python) Pip and SQL

No comments:

Post a Comment