Friday, August 10, 2018

(Python) Importing Data

In this article, we will discuss how to properly import data frames into Python through the utilization of various Python modules. Throughout this post, there are numerous links to online resources maintained by the module creators. These resources should be utilized as necessary, as they provide important options which may be useful in addressing additional aspects related to package functions.

Importing (.csv) Data as a Multidimensional Numpy Array

There may be instances in which you wish to have a data frame transformed and stored as a multi-dimensional array. The reason for perusing such an option would typically be necessitated due to a desire to produce a machine learning model. Models of this type require the aforementioned variable format.

Limitations exist pertaining to this import function. Primarily amongst such, is the functions inability to import data which contains non-numerical elements, and the inability of the function to import columns which contain non-existent entries.

Therefore, for the following function featured within the example to correctly perform its purpose, the data called therein, must not contain missing values, and must not contain non-numerical elements.

# Enable Numpy #

import numpy

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTest.csv"

# Create a variable to store the data #

# The "delimiter" option specifies the delimiter contained within the data file #

# The "skiprows" option indicates that the first row containing variable names will be omitted #

numpyex = numpy.loadtxt(filepath, delimiter=',', skiprows=1)

# Print the result of the data import process to the console #

print(numpyex)


Console Output:

[[ 83. 2036. 803. 544. 243. 28. 843. 46.]

[ 93. 2015. 804. 465. 296. 15. 815. 32.]

[ 49. 1967. 804. 430. 189. 47. 817. 46.]

[ 100. 1957. 802. 511. 256. 42. 561. 37.]

[ 22. 1925. 803. 529. 172. 96. 345. 32.]

[ 31. 1895. 810. 435. 194. 40. 861. 46.]

[ 94. 1889. 802. 503. 228. 7. 883. 46.]

[ 4. 1722. 802. 535. 260. 80. 300. 50.]

[ 25. 1715. 808. 437. 200. 77. 776. 37.]

[ 46. 1704. 809. 445. 310. 52. 410. 53.]

[ 15. 1646. 802. 502. 223. 79. 296. 31.]

[ 74. 1611. 800. 420. 200. 99. 808. 43.]

[ 79. 1429. 805. 504. 185. 67. 806. 50.]

[ 13. 1401. 801. 415. 283. 23. 235. 53.]

[ 4. 1334. 802. 484. 277. 79. 946. 37.]

[ 47. 1290. 807. 428. 171. 15. 481. 42.]

[ 49. 1274. 805. 406. 306. 12. 296. 34.]

[ 7. 1161. 803. 489. 298. 93. 381. 28.]

[ 93. 1132. 805. 415. 195. 31. 221. 40.]

[ 60. 1131. 804. 413. 185. 5. 308. 33.]]


Let us consider another example which demonstrates additional options.

# Enable Numpy #

import numpy

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTest.csv"

# Create a variable to store the data #

# The "delimiter" option specifies the delimiter contained within the data file #

# The "dtype" option indicates that the data file will consist of string variables only #

# The "skiprows" option indicates that the first row containing variable names will be omitted #

# The "usecols" option specifies which rows will be included within the input #

numpyex = numpy.loadtxt(filepath, delimiter=',', dtype=str, skiprows=1, usecols=[0,1,2,3])

# Print the result of the data import process to the console #

print(numpyex)


Console Output:

[[ 83. 2036. 803. 544.]

[ 93. 2015. 804. 465.]

[ 49. 1967. 804. 430.]

[ 100. 1957. 802. 511.]

[ 22. 1925. 803. 529.]

[ 31. 1895. 810. 435.]

[ 94. 1889. 802. 503.]

[ 4. 1722. 802. 535.]

[ 25. 1715. 808. 437.]

[ 46. 1704. 809. 445.]

[ 15. 1646. 802. 502.]

[ 74. 1611. 800. 420.]

[ 79. 1429. 805. 504.]

[ 13. 1401. 801. 415.]

[ 4. 1334. 802. 484.]

[ 47. 1290. 807. 428.]

[ 49. 1274. 805. 406.]

[ 7. 1161. 803. 489.]

[ 93. 1132. 805. 415.]

[ 60. 1131. 804. 413.]] 


For more information pertaining to this function and its internal options:

Now, if you absolutely must import array data which contains both string data and numerical data, a different array function exists within the “numpy” package. This function also allows for columns which contain missing elements.

# Enable Numpy #

import numpy

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestII.csv"

# Create a variable to store the data #

# The "delimiter" option specifies the delimiter contained within the data file #

# The "skip_header" option indicates that the first row containing variable names will be omitted #

# The "dtype" option indicates that the data type of each element will be automatically decided #

# The "encoding" option specifies which encoding methodology should be employed when decoding the input file #

numpyex2 = numpy.genfromtxt(filepath, delimiter=',', skip_header=1, dtype=None, encoding=None)

# Print the result of the data import process to the console #

print(numpyex)


Console Output:

[( 83, 2036, 803, 544, 'BMW') ( 93, 2015, 804, 465, 'Volvo')

( 49, 1967, 804, 430, 'Jeep') (100, 1957, 802, 511, 'Subaru')

( 22, 1925, 803, 529, 'Mitsubishi') ( 31, 1895, 810, 435, '')

( 94, 1889, 802, 503, '') ( 4, 1722, 802, 535, '')

( 25, 1715, 808, 437, '') ( 46, 1704, 809, 445, 'Ford')

( 15, 1646, 802, 502, 'Chevy') ( 74, 1611, 800, 420, 'BMW')

( 79, 1429, 805, 504, 'Volvo') ( 13, 1401, 801, 415, 'Jeep')

( 4, 1334, 802, 484, 'Subaru') ( 47, 1290, 807, 428, 'Mitsubishi')

( 49, 1274, 805, 406, 'Toyota') ( 7, 1161, 803, 489, 'Lexus')

( 93, 1132, 805, 415, 'Nissan') ( 60, 1131, 804, 413, 'Honda')]


For more information pertaining to this function and its internal options:

Importing (.csv) Data as a Panda Data Frame

Typically, due to the common data variable type, and the traditional aspects of data integrity and presentation, you will most likely prefer to import data into Python through the utilization of functions inherit within the "pandas" package.

# Enable Pandas

import pandas

# Specify the appropriate file path #

# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestII.csv"

# Create a variable to store the data #

pandadataframe = pandas.read_csv(filepath)

# Print the result of the data import process to the console #

print(pandadataframe)


Console Output:

VarA VarB VarC VarD VarE

0 83 2036 803 544 BMW

1 93 2015 804 465 Volvo

2 49 1967 804 430 Jeep

3 100 1957 802 511 Subaru

4 22 1925 803 529 Mitsubishi

5 31 1895 810 435 NaN

6 94 1889 802 503 NaN

7 4 1722 802 535 NaN

8 25 1715 808 437 NaN

9 46 1704 809 445 Ford

10 15 1646 802 502 Chevy

11 74 1611 800 420 BMW

12 79 1429 805 504 Volvo

13 13 1401 801 415 Jeep

14 4 1334 802 484 Subaru

15 47 1290 807 428 Mitsubishi

16 49 1274 805 406 Toyota

17 7 1161 803 489 Lexus

18 93 1132 805 415 Nissan

19 60 1131 804 413 Honda


If only the first seven rows of data were required, the following code could be utilized to accomplish this task:

# Import only the first seven rows of data from the example data frame #

pandadataframe = pandas.read_csv(filepath, nrows=7)

# Print the result of the data import process to the console #

print(pandadataframe)


Console Output:

VarA VarB VarC VarD VarE

0 83 2036 803 544 BMW

1 93 2015 804 465 Volvo

2 49 1967 804 430 Jeep

3 100 1957 802 511 Subaru

4 22 1925 803 529 Mitsubishi

5 31 1895 810 435 NaN

6 94 1889 802 503 NaN


For more information pertaining to this function and its internal options:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Importing (.xlsx) Data as a Panda Data Frame

There may be instances in which, you wish to import Microsoft Excel data into the Python coding platform. The following code will enable you to achieve such.

# Enable Pandas

import pandas

# Specify the appropriate file path #

filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestIII.xlsx"

# Create a variable to store the data #

pandadataframe = pandas.ExcelFile(filepath)

# Print workbook spreadsheet names #

print(pandadataframe.sheet_names)

# Assign "Sheet1" to a variable # 


# (This variable is specifying which sheet of the workbook you will be importing) #

pandadataframesheet1 = pandadataframe.parse('Sheet1')

# Print the result of the data import process to the console #

print(pandadataframesheet1)


Console Output:

['Sheet1']

VarA VarB VarC VarD VarE VarF

0 83 2036 803 544 BMW One

1 93 2015 804 465 Volvo NaN

2 49 1967 804 430 Jeep NaN

3 100 1957 802 511 Subaru One

4 22 1925 803 529 Mitsubishi One

5 31 1895 810 435 Toyota NaN

6 94 1889 802 503 Lexus NaN

7 4 1722 802 535 Nissan NaN

8 25 1715 808 437 Honda NaN

9 46 1704 809 445 Ford One

10 15 1646 802 502 Chevy NaN

11 74 1611 800 420 BMW NaN

12 79 1429 805 504 Volvo One

13 13 1401 801 415 Jeep NaN

14 4 1334 802 484 Subaru NaN

15 47 1290 807 428 Mitsubishi NaN

16 49 1274 805 406 Toyota One

17 7 1161 803 489 Lexus One

18 93 1132 805 415 Nissan NaN

19 60 1131 804 413 Honda NaN

To remove the “NaN” entries from the “VarF” column, you can utilize the following line of code:

# Replace "NaN" values with "N/A" #

pandadataframesheet1.fillna('N/A', inplace=True)

# Print sheet #

print(pandadataframesheet1)

Console Output:

['Sheet1']

VarA VarB VarC VarD VarE VarF

0 83 2036 803 544 BMW One

1 93 2015 804 465 Volvo N/A

2 49 1967 804 430 Jeep N/A

3 100 1957 802 511 Subaru One

4 22 1925 803 529 Mitsubishi One

5 31 1895 810 435 Toyota N/A

6 94 1889 802 503 Lexus N/A

7 4 1722 802 535 Nissan N/A

8 25 1715 808 437 Honda N/A

9 46 1704 809 445 Ford One

10 15 1646 802 502 Chevy N/A

11 74 1611 800 420 BMW N/A

12 79 1429 805 504 Volvo One

13 13 1401 801 415 Jeep N/A

14 4 1334 802 484 Subaru N/A

15 47 1290 807 428 Mitsubishi N/A

16 49 1274 805 406 Toyota One

17 7 1161 803 489 Lexus One

18 93 1132 805 415 Nissan N/A

19 60 1131 804 413 Honda N/A

Additionally, if you desired to create a new pandas data frame which contained only "VarA" and "VarB", the code would resemble:

pandasdataframevaravarb = pandadataframesheet1 [['VarA', 'VarB']]

print(pandasdataframevaravarb)


Console Output:

VarA VarB

0 83 2036

1 93 2015

2 49 1967

3 100 1957

4 22 1925

5 31 1895

6 94 1889

7 4 1722

8 25 1715

9 46 1704

10 15 1646

11 74 1611

12 79 1429

13 13 1401

14 4 1334

15 47 1290

16 49 1274

17 7 1161

18 93 1132

19 60 1131 


For more information pertaining to this function and its internal options:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Importing (.sas7bdat) Data as a Panda Data Frame

The following example presents a scenario in which a file within the SAS format is imported as a pandas data frame. I cannot provide exercise data as it pertains to this exercise as my SAS License has since expired.

# Enable Pandas #

import pandas

# Specify the appropriate file path #

filepath = "C:\\Users\\Username\\Desktop\\SASFile.sas7bdat"

# Create a variable to store the data #

pandadataframesas = pandas.read_sas(filepath)

# Print the result of the data import process to the console #

print(pandadataframesas)


With SAS imports, the possibility is always present that the data contained within certain variable columns may appear with a mysterious (b’) occurring prior to each entry.

For example:

b’ 11111

b’ 22222

b’ 33333


To rectify this issue, which is caused by encoding formats, utilize the subsequent code:

dataframename['variablename'] = dataframename['variablename'].str.decode('utf-8')

In a more realistic scenario, the code might resemble something similar to the following:

DataFrameA ['id'] = DataFrameA ['id'] .str.decode('utf-8')

For more information pertaining to this function and its internal options:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html#pandas.read_sas

Exporting Python Data

After you have sorted, edited, and analyzed your data, you’ll most likely want to export the finalized version of the data frame to an outside format. The code below will assist you with this task.

# Exporting Data to (.csv) Format #

# Option: 'sep' specifies the designation which will be utilized to separate the data file contents #

pandadataframe.to_csv("C:\\Users\\Username\\Desktop\\pandadataframe.csv", sep=',', encoding='utf-8')

# Exporting Data to (.xlsx) Format #

# Option: 'sheet_name' designates the name of the first sheet of the Excel workbook #

pandadataframesheet1.to_excel("C:\\Users\\Username\\Desktop\\pandadataframesheet1.xlsx", sheet_name='Sheet1')

For more information pertaining to this function and its internal options:

For more information pertaining to this function and its internal options:

http://pandas.pydata.org/pandas-docs/version/0.20.2/generated/pandas.DataFrame.to_csv.html

For code and exercise data pertaining to this article, please click on the link below to visit our new GitHub repository:

https://github.com/RDScientist/CodeRepo

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.