Data Science in Python: Pandas

In this article I would like to give you a brief introduction to Pandas, one of the most important Python toolkits for data cleaning and processing. I will show you some examples of how Pandas can be used to extract, explore and manipulate data.
You will learn how to read data into a DataFrame, how to query these structures, and how to write a DataFrame to a CSV file.

Why Python and Pandas?

At Webinterpret we use Python and Pandas for data science tasks for a few reasons:

  • Python is one of the fastest-growing languages for data science. With its ever-growing technological stack, it has some very strong data science libraries, including the SciPy ecosystem and numerous packages for machine learning such as XGBoost or TensorFlow.
  • Unlike R and other languages, Python is not just a statistical language: it also has full capabilities for data collection, cleaning, high-performance computing and building production-ready models that can be easily integrated with our applications.

Pandas is an open-source library created by Wes McKinney in 2008, and it is currently probably the most fundamental tool for almost every data scientist using Python. It provides named vector and matrix-like structures that help us see data in tabular form, much like Excel spreadsheets or SQL databases. The concept of a DataFrame in Pandas is similar to a table in relational theory, so with some background in databases you'll find Pandas fairly easy to work with.
Because it's open source and openly accessible, Pandas has a very strong community, which drives the project forward and serves as an excellent point of reference. There are a couple of places you should consider visiting when looking for help. The Pandas documentation is the most comprehensive source of information in the first place. However, often the best resource is simply Stack Overflow, which is used broadly within the Pandas community to post questions about the toolkit. Quite often, questions tagged pandas are answered by actual Pandas developers.

OK, so now let's dig into Pandas.

Preparing the environment

Before we start, we need to set up a virtual environment and install a few packages. For convenience we will use Jupyter Notebook, an open-source, web-based application that lets us work interactively with code and save the work in the form of notebooks.

$ mkdir pandas; cd pandas
$ virtualenv -p python3 env
$ source env/bin/activate
$ pip install pandas pandasql psycopg2 sqlalchemy jupyter pymongo tqdm

Once everything is installed, you can check whether Jupyter Notebook is working. Running the command below should open a new tab in your browser:

$ jupyter notebook

As soon as Jupyter is running and we've created a new notebook, we can start writing code.

DataFrame Creation

First, we need to import pandas and numpy packages:

import numpy as np
import pandas as pd

The Pandas toolkit introduces two very useful data structures: Series and DataFrame.

Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
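For example, a Series can be created from a plain Python list, optionally with labeled indices (a minimal sketch):

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s['b']
20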

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object.

This time we will be working mainly with DataFrames.

To create a DataFrame you can pass an array and, optionally, column names and an index:

df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['A', 'B', 'C'], index=['a', 'b', 'c'])
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9

You can also create a DataFrame by passing a dict of objects that can be converted to Series:

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2

(example from http://pandas.pydata.org/pandas-docs/stable/10min.html)

A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

Data Exploration

from pandasql import load_meat, load_births

# load the example datasets that ship with pandasql
# (US meat production and birth statistics)
meat = load_meat()
births = load_births()

# head() returns the top n rows; the default is 5
births.head()
date births
0 1975-01-01 265775
1 1975-02-01 241045
2 1975-03-01 268849
3 1975-04-01 247455
4 1975-05-01 254545

To get insights into the data we can use two useful functions: info and describe.

births.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 2 columns):
date      408 non-null datetime64[ns]
births    408 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 6.5 KB
births.describe()
births
count 408.000000
mean 320585.502451
std 29449.011996
min 236551.000000
25% 303456.500000
50% 325625.500000
75% 342046.750000
max 387798.000000

There are a few other useful attributes and methods:

births.columns
Index(['date', 'births'], dtype='object')
births.index
RangeIndex(start=0, stop=408, step=1)
births.head().values
array([[Timestamp('1975-01-01 00:00:00'), 265775],
       [Timestamp('1975-02-01 00:00:00'), 241045],
       [Timestamp('1975-03-01 00:00:00'), 268849],
       [Timestamp('1975-04-01 00:00:00'), 247455],
       [Timestamp('1975-05-01 00:00:00'), 254545]], dtype=object)
# transpose the DataFrame
births.head().T
0 1 2 3 4
date 1975-01-01 00:00:00 1975-02-01 00:00:00 1975-03-01 00:00:00 1975-04-01 00:00:00 1975-05-01 00:00:00
births 265775 241045 268849 247455 254545

Sorting

# sort by an axis
births.sort_index(axis=0, ascending=False).head()
date births
407 2012-12-01 340995
406 2012-11-01 320195
405 2012-10-01 347625
404 2012-09-01 361922
403 2012-08-01 359554
# Sorting by values
births.sort_values(by='births', ascending=False).head()
date births
379 2008-08-01 387798
390 2011-07-01 375384
380 2008-09-01 374711
391 2011-08-01 373333
367 2007-08-01 369316

Working with indexes

# Set index
births = births.set_index('date')
births.head()
births
date
1975-01-01 265775
1975-02-01 241045
1975-03-01 268849
1975-04-01 247455
1975-05-01 254545
# Reset the index to get the default integer index back
births = births.reset_index()

Selecting data

# Selecting columns
meat['beef'].head()
0    751.0
1    713.0
2    741.0
3    650.0
4    681.0
Name: beef, dtype: float64
meat[['date', 'beef', 'pork']].head()
date beef pork
0 1944-01-01 751.0 1280.0
1 1944-02-01 713.0 1169.0
2 1944-03-01 741.0 1128.0
3 1944-04-01 650.0 978.0
4 1944-05-01 681.0 1029.0
# selecting by label
meat.loc[0:3, 'beef':'pork']
beef veal pork
0 751.0 85.0 1280.0
1 713.0 77.0 1169.0
2 741.0 90.0 1128.0
3 650.0 89.0 978.0
# selecting by position
meat.iloc[1:6:2, [0,1,4]]
date beef lamb_and_mutton
1 1944-02-01 713.0 72.0
3 1944-04-01 650.0 66.0
5 1944-06-01 658.0 79.0
# selecting by label or position with .ix (deprecated since pandas 0.20; prefer .loc and .iloc):
meat.ix[0:2, 'date':'pork']
date beef veal pork
0 1944-01-01 751.0 85.0 1280.0
1 1944-02-01 713.0 77.0 1169.0
2 1944-03-01 741.0 90.0 1128.0
meat.ix[0:2, 0:4]
date beef veal pork
0 1944-01-01 751.0 85.0 1280.0
1 1944-02-01 713.0 77.0 1169.0
2 1944-03-01 741.0 90.0 1128.0
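Note: the outputs below also include a cumulative column. It is not part of the data returned by load_meat(); it appears to be the row-wise sum of the meat columns, so we recreate it here to keep the examples consistent:

meat['cumulative'] = meat[['beef', 'veal', 'pork', 'lamb_and_mutton',
                           'broilers', 'other_chicken', 'turkey']].sum(axis=1)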
# Filtering rows with a boolean condition
meat[meat.date > '2000'].head()
date beef veal pork lamb_and_mutton broilers other_chicken turkey cumulative
673 2000-02-01 2175.0 18.0 1558.0 20.0 2487.9 NaN 414.9 6673.8
674 2000-03-01 2300.0 20.0 1704.0 24.0 2687.9 NaN 469.7 7205.6
675 2000-04-01 2027.0 17.0 1398.0 23.0 2340.4 NaN 416.5 6221.9
676 2000-05-01 2303.0 19.0 1542.0 17.0 2741.7 NaN 492.3 7115.0
677 2000-06-01 2369.0 18.0 1538.0 17.0 2672.2 NaN 483.4 7097.6
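Multiple conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A short sketch on the same data:

# months after 2000 with beef production above 2300
meat[(meat.date > '2000') & (meat.beef > 2300)].head()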

Manipulating data

meat.sum(axis=1).head()
0    4410.0
1    4062.0
2    4068.0
3    3566.0
4    3788.0
dtype: float64
meat.mean()
beef               1683.463362
veal                 54.198549
pork               1211.683797
lamb_and_mutton      38.360701
broilers           1516.582520
other_chicken        43.033566
turkey              292.814646
cumulative         4384.466989
dtype: float64
meat.std()
beef                501.698480
veal                 39.062804
pork                371.311802
lamb_and_mutton      19.624340
broilers            963.012101
other_chicken         3.867141
turkey              162.482638
cumulative         1983.483767
dtype: float64
# creating a column
meat['foo'] = 'bar'
meat.head()
date beef veal pork lamb_and_mutton broilers other_chicken turkey cumulative foo
0 1944-01-01 751.0 85.0 1280.0 89.0 NaN NaN NaN 2205.0 bar
1 1944-02-01 713.0 77.0 1169.0 72.0 NaN NaN NaN 2031.0 bar
2 1944-03-01 741.0 90.0 1128.0 75.0 NaN NaN NaN 2034.0 bar
3 1944-04-01 650.0 89.0 978.0 66.0 NaN NaN NaN 1783.0 bar
4 1944-05-01 681.0 106.0 1029.0 78.0 NaN NaN NaN 1894.0 bar
meat['example'] = np.random.random(len(meat))
meat.example.head()
0    0.317909
1    0.718370
2    0.026211
3    0.687999
4    0.290987
Name: example, dtype: float64
# Import the garbage collector module
import gc

# Delete the columns we no longer need
del meat['foo']
del meat['example']

# Collect garbage. It's especially important when working with large datasets and limited memory
gc.collect()
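An alternative to del is DataFrame.drop, which returns a new DataFrame rather than mutating it in place. The equivalent of the two del statements above would be (the columns= keyword requires pandas 0.21+):

meat = meat.drop(columns=['foo', 'example'])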

Use SQL syntax with pandasql

Pandasql (and also pysqldf) allows you to query pandas DataFrames using SQL syntax, similarly to sqldf in R. It seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or Pandas.

from pandasql import sqldf

# Specifying locals() or globals() each time can get tedious. You can define a short helper function to fix this.
pysqldf = lambda q: sqldf(q, globals())

# Run a SQL query. It inserts the given DataFrame into an in-memory SQLite database and then reads the result back with pd.read_sql().
# Warning: this is not the most efficient method and it often breaks on larger datasets
pysqldf("SELECT sum(beef), count(1) FROM meat;").head()
sum(beef) count(1)
0 1392224.2 827

More about Pandas vs SQL syntax: http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

Options and settings

births['long_number'] = 2234.0 * 2**100
births['long_number'].head()
0    2.831931e+33
1    2.831931e+33
2    2.831931e+33
3    2.831931e+33
4    2.831931e+33
Name: long_number, dtype: float64
#suppress displaying long numbers in scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)
births['long_number'].head()
0   2831931440909864482943634960809984.00
1   2831931440909864482943634960809984.00
2   2831931440909864482943634960809984.00
3   2831931440909864482943634960809984.00
4   2831931440909864482943634960809984.00
Name: long_number, dtype: float64
#reset option
pd.reset_option('display.float_format')
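To change an option only for a block of code, pandas also provides a context manager:

with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    print(births['long_number'].head())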

More about options and settings at:
http://pandas.pydata.org/pandas-docs/stable/options.html

Getting Data In/Out

#CSV
meteorites = pd.read_csv('Meteorite_Landings.csv')
meteorites.head()
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775000, 6.083330)
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.183330, 10.233330)
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.216670, -113.000000)
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.883330, -99.900000)
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.166670, -64.950000)
# SQL
from sqlalchemy import create_engine
# replace user, pass, port and db with your real connection details
db_uri = 'postgresql://user:pass@localhost:port/db'
engine = create_engine(db_uri)

df = pd.read_sql("select * from tablename limit 100", engine)
#JSON
df = pd.read_json('JEOPARDY_QUESTIONS1.json')
df.head()

# Write out data to CSV
df.to_csv('jeopardy.csv')
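Note that by default to_csv also writes the index as the first column; pass index=False if you don't want it:

df.to_csv('jeopardy.csv', index=False)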

Reading MongoDB / JSON normalization

# read from MongoDB
from pymongo import MongoClient
uri = 'mongodb://user:pass@url/db'
client = MongoClient(uri)
db = client['db_name']
cursor = db.collection.find().limit(100) 
df = pd.DataFrame(list(cursor))
# JSON normalization when dealing with nested documents

# note: in pandas 1.0+ this function is available directly as pd.json_normalize
from pandas.io.json import json_normalize

cursor = db.collection.find()
df = json_normalize(list(cursor))
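To see what json_normalize does without a live MongoDB connection, here is a minimal, self-contained sketch with hypothetical nested documents:

docs = [
    {'name': 'Aachen', 'location': {'lat': 50.775, 'lon': 6.083}},
    {'name': 'Aarhus', 'location': {'lat': 56.183, 'lon': 10.233}},
]
# nested fields are flattened into dot-separated columns:
# name, location.lat, location.lon
json_normalize(docs)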

Visualization Example

# for simple graphics embedded in the notebook use
%matplotlib inline

# for a more interactive plot use
# %matplotlib notebook

import matplotlib
import matplotlib.pyplot as plt

def convert_data_to_quarters():
    '''
    Converts the housing data to quarters
    and returns it as mean values in a dataframe.
    '''
    # read data from a CSV file
    df = pd.read_csv('City_Zhvi_AllHomes.csv')
    # create a multi index of State and RegionName
    df = df.set_index(["State","RegionName"])
    # select only interesting columns
    df = df.iloc[:,49:]
    # convert column names to datetime
    df.columns = pd.to_datetime(df.columns)
    # resample monthly columns into quarters and take the mean
    df = df.resample('Q', axis=1).mean()
    df.columns = list(map(lambda x: '{}q{}'.format(x.year, x.quarter), df.columns))
    
    return df

df = convert_data_to_quarters()
# Change style of a plot
matplotlib.style.use('ggplot')

df.describe().transpose()['mean'].plot()

ny = df.iloc[0]
ny.plot()

Further reading

10 minutes to Pandas:
http://pandas.pydata.org/pandas-docs/stable/10min.html

Pandas vs SQL syntax comparison:
http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html