In this article I would like to give you a brief introduction to Pandas, one of the most important toolkits Python provides for data cleaning and processing. I will show you some examples of how Pandas can be used to extract, explore and manipulate data.
You will learn how to read data into a DataFrame, how to query these structures, and how to write a DataFrame out to a CSV file.
Why Python and Pandas?
At Webinterpret we use Python and Pandas for data science tasks for a few reasons:
- Python is one of the fastest-growing languages for data science. With its ever-growing technological stack, it has some very strong data science libraries, including the SciPy ecosystem and numerous packages for machine learning such as XGBoost or TensorFlow.
- Unlike R and other languages, Python is not just a statistical language: it also has full capabilities for data collection, cleaning, high-performance computing and building production-ready models that can be easily integrated with our applications.
Pandas is an open-source library created by Wes McKinney in 2008, and it is currently probably the most fundamental tool for almost every data scientist using Python. It provides named vector- and matrix-like structures that let us look at data in tabular form, much like in Excel spreadsheets or SQL databases. The concept of a DataFrame in Pandas is similar to a table in relational theory, so with some background in databases you'll find Pandas fairly easy to work with.
Because it's open source and openly accessible, Pandas has a very strong community, which drives the project forward and serves as an excellent point of reference. There are a couple of places you should consider visiting when looking for help. The Pandas documentation is a comprehensive source of information in the first place. However, often the best resource is simply Stack Overflow, which is used broadly within the Pandas community to post questions about the toolkit. Quite often, questions tagged as Pandas-related are answered by actual Pandas developers.
Ok. So now let's dig into Pandas.
Preparing the environment
Before we start, we need to set up a virtual environment and install a few packages. For our convenience we will use Jupyter Notebook, an open-source web-based application that lets us work with code interactively and save the work in the form of notebooks.
$ mkdir pandas; cd pandas
$ virtualenv -p python3 env
$ source env/bin/activate
$ pip install pandas pandasql psycopg2 sqlalchemy jupyter pymongo tqdm
Once everything is installed, you can check whether Jupyter Notebook is working. Running the command below should open a new tab in your browser:
$ jupyter notebook
As soon as Jupyter is running and we've created a new notebook, we can start writing code.
DataFrame Creation
First, we need to import pandas and numpy packages:
import numpy as np
import pandas as pd
Pandas toolkit introduces two very useful data structures: Series and DataFrame.
Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
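For instance, here is a minimal Series with a custom index (a small sketch; the labels and values are arbitrary):
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
s['y']  # access a value by its label -> 20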
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object.
This time we will be working mainly with DataFrames.
To create a DataFrame you can pass an array, and optionally name columns and indexes:
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['A', 'B', 'C'], index=['a', 'b', 'c'])
df
|   | A | B | C |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 4 | 5 | 6 |
| c | 7 | 8 | 9 |
You can also create a DataFrame by passing a dict of objects that can be converted to series-like:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
(example from http://pandas.pydata.org/pandas-docs/stable/10min.html)
|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
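Note that df2 mixes several data types in one DataFrame; you can inspect the type of each column with the dtypes attribute:
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object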
Data Exploration
from pandasql import load_meat, load_births
# load the example datasets bundled with pandasql
meat = load_meat()
births = load_births()
# head(n) returns the first n rows; the default is 5
births.head()
|   | date | births |
|---|---|---|
| 0 | 1975-01-01 | 265775 |
| 1 | 1975-02-01 | 241045 |
| 2 | 1975-03-01 | 268849 |
| 3 | 1975-04-01 | 247455 |
| 4 | 1975-05-01 | 254545 |
To get insights into the data we can use two useful functions: info and describe.
births.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 2 columns):
date 408 non-null datetime64[ns]
births 408 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 6.5 KB
births.describe()
|   | births |
|---|---|
| count | 408.000000 |
| mean | 320585.502451 |
| std | 29449.011996 |
| min | 236551.000000 |
| 25% | 303456.500000 |
| 50% | 325625.500000 |
| 75% | 342046.750000 |
| max | 387798.000000 |
A few other useful attributes:
births.columns
Index(['date', 'births'], dtype='object')
births.index
RangeIndex(start=0, stop=408, step=1)
births.head().values
array([[Timestamp('1975-01-01 00:00:00'), 265775],
[Timestamp('1975-02-01 00:00:00'), 241045],
[Timestamp('1975-03-01 00:00:00'), 268849],
[Timestamp('1975-04-01 00:00:00'), 247455],
[Timestamp('1975-05-01 00:00:00'), 254545]], dtype=object)
# transpose the DataFrame
births.head().T
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| date | 1975-01-01 00:00:00 | 1975-02-01 00:00:00 | 1975-03-01 00:00:00 | 1975-04-01 00:00:00 | 1975-05-01 00:00:00 |
| births | 265775 | 241045 | 268849 | 247455 | 254545 |
Sorting
# sort by the index (axis=0), in descending order
births.sort_index(axis=0, ascending=False).head()
|     | date | births |
|---|---|---|
| 407 | 2012-12-01 | 340995 |
| 406 | 2012-11-01 | 320195 |
| 405 | 2012-10-01 | 347625 |
| 404 | 2012-09-01 | 361922 |
| 403 | 2012-08-01 | 359554 |
# Sorting by values
births.sort_values(by='births', ascending=False).head()
|     | date | births |
|---|---|---|
| 379 | 2008-08-01 | 387798 |
| 390 | 2011-07-01 | 375384 |
| 380 | 2008-09-01 | 374711 |
| 391 | 2011-08-01 | 373333 |
| 367 | 2007-08-01 | 369316 |
Working with indexes
# set the 'date' column as the index
births = births.set_index('date')
births.head()
| date | births |
|---|---|
| 1975-01-01 | 265775 |
| 1975-02-01 | 241045 |
| 1975-03-01 | 268849 |
| 1975-04-01 | 247455 |
| 1975-05-01 | 254545 |
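With the date index in place, rows can be selected by their label:
births.loc['1975-03-01']
births    268849
Name: 1975-03-01 00:00:00, dtype: int64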
# restore the default integer index
births = births.reset_index()
Selecting data
# Selecting columns
meat['beef'].head()
0 751.0
1 713.0
2 741.0
3 650.0
4 681.0
Name: beef, dtype: float64
meat[['date', 'beef', 'pork']].head()
|   | date | beef | pork |
|---|---|---|---|
| 0 | 1944-01-01 | 751.0 | 1280.0 |
| 1 | 1944-02-01 | 713.0 | 1169.0 |
| 2 | 1944-03-01 | 741.0 | 1128.0 |
| 3 | 1944-04-01 | 650.0 | 978.0 |
| 4 | 1944-05-01 | 681.0 | 1029.0 |
# selecting by label; note that label slices are inclusive on both ends
meat.loc[0:3, 'beef':'pork']
|   | beef | veal | pork |
|---|---|---|---|
| 0 | 751.0 | 85.0 | 1280.0 |
| 1 | 713.0 | 77.0 | 1169.0 |
| 2 | 741.0 | 90.0 | 1128.0 |
| 3 | 650.0 | 89.0 | 978.0 |
# selecting by position; integer slices are end-exclusive
meat.iloc[1:6:2, [0,1,4]]
|   | date | beef | lamb_and_mutton |
|---|---|---|---|
| 1 | 1944-02-01 | 713.0 | 72.0 |
| 3 | 1944-04-01 | 650.0 | 66.0 |
| 5 | 1944-06-01 | 658.0 | 79.0 |
# selecting by both label and position with .ix (deprecated in newer Pandas versions; see the note below):
meat.ix[0:2, 'date':'pork']
|   | date | beef | veal | pork |
|---|---|---|---|---|
| 0 | 1944-01-01 | 751.0 | 85.0 | 1280.0 |
| 1 | 1944-02-01 | 713.0 | 77.0 | 1169.0 |
| 2 | 1944-03-01 | 741.0 | 90.0 | 1128.0 |
meat.ix[0:2, 0:4]
|   | date | beef | veal | pork |
|---|---|---|---|---|
| 0 | 1944-01-01 | 751.0 | 85.0 | 1280.0 |
| 1 | 1944-02-01 | 713.0 | 77.0 | 1169.0 |
| 2 | 1944-03-01 | 741.0 | 90.0 | 1128.0 |
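Note that .ix was deprecated (and later removed) in newer Pandas versions, because mixing label- and position-based access proved error-prone. The same selections can be written with .loc and .iloc; a sketch of the equivalents (remember that .iloc is end-exclusive, while .ix treated integer slices on this index inclusively):
meat.loc[0:2, 'date':'pork']  # label-based, inclusive on both ends
meat.iloc[0:3, 0:4]           # position-based, end-exclusive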
# filtering rows with a boolean condition
meat[meat.date > '2000'].head()
|     | date | beef | veal | pork | lamb_and_mutton | broilers | other_chicken | turkey | cumulative |
|---|---|---|---|---|---|---|---|---|---|
| 673 | 2000-02-01 | 2175.0 | 18.0 | 1558.0 | 20.0 | 2487.9 | NaN | 414.9 | 6673.8 |
| 674 | 2000-03-01 | 2300.0 | 20.0 | 1704.0 | 24.0 | 2687.9 | NaN | 469.7 | 7205.6 |
| 675 | 2000-04-01 | 2027.0 | 17.0 | 1398.0 | 23.0 | 2340.4 | NaN | 416.5 | 6221.9 |
| 676 | 2000-05-01 | 2303.0 | 19.0 | 1542.0 | 17.0 | 2741.7 | NaN | 492.3 | 7115.0 |
| 677 | 2000-06-01 | 2369.0 | 18.0 | 1538.0 | 17.0 | 2672.2 | NaN | 483.4 | 7097.6 |
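Conditions can also be combined; wrap each one in parentheses and join them with & (and) or | (or):
meat[(meat.date > '2000') & (meat.beef > 2300)].head()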
Manipulating data
# sum across columns for each row (axis=1)
meat.sum(axis=1).head()
0 4410.0
1 4062.0
2 4068.0
3 3566.0
4 3788.0
dtype: float64
# per-column means (numeric columns only)
meat.mean()
beef 1683.463362
veal 54.198549
pork 1211.683797
lamb_and_mutton 38.360701
broilers 1516.582520
other_chicken 43.033566
turkey 292.814646
cumulative 4384.466989
dtype: float64
# per-column standard deviations
meat.std()
beef 501.698480
veal 39.062804
pork 371.311802
lamb_and_mutton 19.624340
broilers 963.012101
other_chicken 3.867141
turkey 162.482638
cumulative 1983.483767
dtype: float64
# creating a new column with a constant value
meat['foo'] = 'bar'
meat.head()
|   | date | beef | veal | pork | lamb_and_mutton | broilers | other_chicken | turkey | cumulative | foo |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1944-01-01 | 751.0 | 85.0 | 1280.0 | 89.0 | NaN | NaN | NaN | 2205.0 | bar |
| 1 | 1944-02-01 | 713.0 | 77.0 | 1169.0 | 72.0 | NaN | NaN | NaN | 2031.0 | bar |
| 2 | 1944-03-01 | 741.0 | 90.0 | 1128.0 | 75.0 | NaN | NaN | NaN | 2034.0 | bar |
| 3 | 1944-04-01 | 650.0 | 89.0 | 978.0 | 66.0 | NaN | NaN | NaN | 1783.0 | bar |
| 4 | 1944-05-01 | 681.0 | 106.0 | 1029.0 | 78.0 | NaN | NaN | NaN | 1894.0 | bar |
meat['example'] = [np.random.random() for x in range(len(meat))]
meat.example.head()
0 0.317909
1 0.718370
2 0.026211
3 0.687999
4 0.290987
Name: example, dtype: float64
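The list comprehension above works, but the same column can be created more idiomatically with a single NumPy call:
meat['example'] = np.random.random(len(meat))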
# import the garbage collector module
import gc
# delete the temporary columns
del meat['foo']
del meat['example']
# collect garbage; this is especially important when working with large datasets and little memory
gc.collect()
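Alternatively, columns can be removed with drop, which returns a new DataFrame instead of mutating the existing one (a sketch equivalent to the two del statements above):
meat = meat.drop(['foo', 'example'], axis=1)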
Use SQL syntax with pandasql
Pandasql (and also pysqldf) allows you to query Pandas DataFrames using SQL syntax. It works similarly to sqldf in R and seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or Pandas.
from pandasql import sqldf
# Specifying locals() or globals() every time can get tedious. You can define a short helper function to fix this.
pysqldf = lambda q: sqldf(q, globals())
# Run a SQL query. It will load the given DataFrame into an in-memory SQLite database and then read the result back with pd.read_sql().
# Warning: this is not the most efficient method and it often breaks on larger datasets.
pysqldf("SELECT sum(beef), count(1) FROM meat;").head()
|   | sum(beef) | count(1) |
|---|---|---|
| 0 | 1392224.2 | 827 |
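For comparison, the same aggregation expressed in plain Pandas:
meat['beef'].sum(), len(meat)  # -> (1392224.2, 827), matching the SQL result above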
More about Pandas vs SQL syntax: http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
Options and settings
births['long_number'] = 2234.0 * 2**100
births['long_number'].head()
0 2.831931e+33
1 2.831931e+33
2 2.831931e+33
3 2.831931e+33
4 2.831931e+33
Name: long_number, dtype: float64
#suppress displaying long numbers in scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)
births['long_number'].head()
0 2831931440909864482943634960809984.00
1 2831931440909864482943634960809984.00
2 2831931440909864482943634960809984.00
3 2831931440909864482943634960809984.00
4 2831931440909864482943634960809984.00
Name: long_number, dtype: float64
#reset option
pd.reset_option('display.float_format')
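If you only need an option for a single block of code, pd.option_context applies it temporarily and restores the previous value afterwards:
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    print(births['long_number'].head())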
More about options and settings at:
http://pandas.pydata.org/pandas-docs/stable/options.html
Getting Data In/Out
#CSV
meteorites = pd.read_csv('Meteorite_Landings.csv')
meteorites.head()
|   | name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aachen | 1 | Valid | L5 | 21.0 | Fell | 01/01/1880 12:00:00 AM | 50.77500 | 6.08333 | (50.775000, 6.083330) |
| 1 | Aarhus | 2 | Valid | H6 | 720.0 | Fell | 01/01/1951 12:00:00 AM | 56.18333 | 10.23333 | (56.183330, 10.233330) |
| 2 | Abee | 6 | Valid | EH4 | 107000.0 | Fell | 01/01/1952 12:00:00 AM | 54.21667 | -113.00000 | (54.216670, -113.000000) |
| 3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 01/01/1976 12:00:00 AM | 16.88333 | -99.90000 | (16.883330, -99.900000) |
| 4 | Achiras | 370 | Valid | L6 | 780.0 | Fell | 01/01/1902 12:00:00 AM | -33.16667 | -64.95000 | (-33.166670, -64.950000) |
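read_csv accepts many useful options; for example, you can load only selected columns and limit the number of rows (the column names below come from the table above):
meteorites = pd.read_csv('Meteorite_Landings.csv', usecols=['name', 'mass (g)', 'year'], nrows=1000)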
#SQL
from sqlalchemy import create_engine
# placeholder connection string: fill in user, password, port and database name
db_uri = 'postgresql://user:pass@localhost:port/db'
engine = create_engine(db_uri)
df = pd.read_sql("select * from tablename limit 100", engine)
#JSON
df = pd.read_json('JEOPARDY_QUESTIONS1.json')
df.head()
# Write out data to CSV
df.to_csv('jeopardy.csv')
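By default to_csv also writes the index as the first column; pass index=False to omit it:
df.to_csv('jeopardy.csv', index=False)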
Reading from MongoDB / JSON normalization
# read from MongoDB
from pymongo import MongoClient
uri = 'mongodb://user:pass@url/db'
client = MongoClient(uri)
db = client['db_name']
cursor = db.collection.find().limit(100)
df = pd.DataFrame(list(cursor))
#JSON normalization when dealing with nested documents
from pandas.io.json import json_normalize
cursor = db.collection.find()
df = json_normalize(list(cursor))
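To see what json_normalize does, here is a minimal sketch with an inline nested document (the field names are hypothetical): nested dictionaries are flattened into dot-separated column names.
doc = {'_id': 1, 'seller': {'name': 'Alice', 'country': 'US'}}
json_normalize([doc])  # columns: _id, seller.name, seller.country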
Visualization Example
# for simple graphics embedded in the notebook use
%matplotlib inline
# for a more interactive plot use
# %matplotlib notebook
import matplotlib
import matplotlib.pyplot as plt
def convert_data_to_quarters():
    '''
    Converts the housing data to quarters
    and returns it as mean values in a DataFrame.
    '''
    # read data from csv
    df = pd.read_csv('City_Zhvi_AllHomes.csv')
    # create a multi-index of State and RegionName
    df = df.set_index(["State", "RegionName"])
    # select only the columns of interest
    df = df.iloc[:, 49:]
    # convert column names to datetime
    df.columns = pd.to_datetime(df.columns)
    # resample the monthly columns to quarterly means
    df = df.resample('Q', axis=1).mean()
    df.columns = list(map(lambda x: '{}q{}'.format(x.year, x.quarter), df.columns))
    return df
df = convert_data_to_quarters()
# change the plot style
matplotlib.style.use('ggplot')
# plot the mean value across all regions for each quarter
df.describe().transpose()['mean'].plot()
# plot the time series of the first row (region) in the dataset
ny = df.iloc[0]
ny.plot()
Links
10 minutes to Pandas:
http://pandas.pydata.org/pandas-docs/stable/10min.html
Pandas vs SQL: syntax comparison:
http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html