Python Pandas

pandas is a Python library for data manipulation and analysis.

Example

To compute statistics:

import pandas as pd

s = pd.Series([10, 20, 30])
mu = s.mean()    # 20.0
sigma = s.std()  # 10.0 (sample standard deviation)

To import a CSV file:

df = pd.read_csv("example.csv", index_col="UniqueID", usecols=["UniqueID", "Height", "Weight"])
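The same call can be tried without a file on disk. This is a minimal sketch that assumes an in-memory stand-in for example.csv; the rows and the extra Notes column are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical contents standing in for example.csv (invented rows).
csv_text = """UniqueID,Height,Weight,Notes
a1,170,65,first
b2,180,80,second
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="UniqueID",                      # use this column as the row labels
    usecols=["UniqueID", "Height", "Weight"],  # load only these columns
)

print(df.columns.tolist())     # ['Height', 'Weight']
print(df.loc["a1", "Height"])  # 170
```

Note that the column named in index_col must also appear in usecols, as it does in the call above.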

Series

The core data type is pandas.core.series.Series: a one-dimensional array whose default index counts up from 0. Ideally a Series stores elements of a single data type (a.k.a. dtype); if an efficient type cannot be inferred, the dtype falls back to object.

The built-in Python operators perform element-wise math on Series.

import pandas as pd

pd.Series(["foo", "bar", "baz"])  # 0   foo
                                  # 1   bar
                                  # 2   baz
                                  # dtype: object
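The element-wise behavior of the built-in operators mentioned above can be sketched as follows. The variable names are illustrative only.

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([10, 20, 30])

total = a + b   # element-wise addition of two Series
scaled = a * 2  # a scalar is applied to every element
mask = b > 15   # comparisons are element-wise too, giving a boolean Series

print(total.tolist())   # [11, 22, 33]
print(scaled.tolist())  # [2, 4, 6]
print(mask.tolist())    # [False, True, True]
```

When both operands are Series, elements are paired by index label rather than by position; here both use the default index, so the two coincide.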

Series objects have these attributes:

Attribute Names	Description
`axes`	list holding the row index
`iloc`	indexer for selection by integer position
`index`	row labels; a `RangeIndex` by default
`is_unique`	are all elements unique?
`hasnans`	are any elements NaN?
`loc`	indexer for selection by label
`shape`	`(rows,)`
`size`	count of elements
`values`	the elements as a numpy.ndarray
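A short sketch of reading several of these attributes, using a small Series invented for illustration:

```python
import pandas as pd

s = pd.Series([3, 1, 3])

print(s.index)      # RangeIndex(start=0, stop=3, step=1)
print(s.shape)      # (3,)
print(s.size)       # 3
print(s.is_unique)  # False, because 3 appears twice
print(s.hasnans)    # False
print(s.iloc[0])    # 3, selected by integer position
print(s.loc[1])     # 1, selected by index label
```

With the default index, positions and labels happen to be the same integers, so iloc and loc only diverge once the Series is given other labels or is reordered.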

These methods describe the contents of a Series, rather than serving as general programming utilities:

Method Names	Description	Example
`describe`	Creates a `Series` with descriptive statistics
`head`	First N elements	`s.head(5)`
`info`	Prints a summary: index, dtype, non-null count, memory usage	`s.info()`
`tail`	Last N elements	`s.tail(5)`
`value_counts`	Creates a `Series` with counts of unique values
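As a rough illustration of the descriptive methods in the table, assuming a small Series of repeated strings:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])

counts = s.value_counts()  # a new Series indexed by the unique values
first_two = s.head(2)
last_two = s.tail(2)

print(counts["a"])         # 3
print(first_two.tolist())  # ['a', 'b']
print(last_two.tolist())   # ['c', 'a']
```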

These methods return a new Series and, by default, leave the original unchanged:

Method Names	Description	Example
`add`	Element-wise addition
`apply`	Element-wise function mapping	`s.apply(len)`
`copy`	Copy of the data and index (deep by default)	`s.copy()`
`div`	Element-wise division
`map`	Element-wise value mapping; NaN if no match	`s.map({True: 1})`
`mul`	Element-wise multiplication
`sort_index`	Sorted by indices	`s.sort_index(ascending=True)`
`sort_values`	Sorted by values	`s.sort_values(ascending=True)`
`sub`	Element-wise subtraction
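A minimal sketch of a few of these methods, using an invented three-element Series. Each call returns a new Series; s itself is not modified.

```python
import pandas as pd

s = pd.Series([3, 1, 2])

added = s.add(10)                    # same result as s + 10
squared = s.apply(lambda x: x * x)   # call a function on each element
named = s.map({1: "one", 2: "two"})  # 3 has no entry in the dict, so it becomes NaN
by_value = s.sort_values()           # index labels travel with their values

print(added.tolist())           # [13, 11, 12]
print(squared.tolist())         # [9, 1, 4]
print(named.tolist())           # [nan, 'one', 'two']
print(by_value.index.tolist())  # [1, 2, 0]
print(s.tolist())               # [3, 1, 2], unchanged
```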

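These methods reduce a Series to a single value, with the two exceptions noted in the table:

Method Names	Description
`count`	Count non-missing elements
`get`	Element at a given index label, or a default; returns a `Series` when given a list
`max`	Largest element
`mean`	Arithmetic mean
`median`	Median
`min`	Smallest element
`mode`	Most frequent value(s); returns a `Series` because ties are possible
`product`	Product of the elements
`std`	Sample standard deviation
`sum`	Sum of the elements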
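A short sketch of how some of these reductions treat missing data, assuming a Series that contains one missing value:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, None])  # None is stored as NaN

print(s.count())   # 3, the NaN is not counted
print(s.sum())     # 5.0, NaN is skipped by default
print(s.max())     # 2.0
print(s.median())  # 2.0
```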

Describe

s = pd.Series([1,2,3,4,5,6,7,8,9,10])
s.describe()  # count    10.00000
              # mean      5.50000
              # std       3.02765
              # min       1.00000
              # 25%       3.25000
              # 50%       5.50000
              # 75%       7.75000
              # max      10.00000
              # dtype: float64

Get

The get method returns the element or elements whose index labels match the given key.

The optional default keyword argument sets the value returned when there is no match; it defaults to None.

The get method also accepts a list of index labels. A new Series is returned only if every label is found; otherwise the single default value is returned.
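The cases above can be sketched as follows, assuming a Series with string labels chosen for illustration:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.get("a"))                      # 10
print(s.get("z"))                      # None, the implicit default
print(s.get("z", default=-1))          # -1
print(s.get(["a", "b"]).tolist())      # [10, 20]
print(s.get(["a", "z"], default=-1))   # -1, because "z" is missing
```

String labels are used here so that label lookups cannot be confused with integer positions.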

Data Frames

pandas.core.frame.DataFrame builds upon Series: it is a two-dimensional table in which each column is a Series.

For example, the Series methods that return a scalar value are defined on a DataFrame to return a Series instead, holding one scalar value per column.

df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
df.mean()  # a   1.5
           # b   2.5
           # dtype: float64

DataFrame objects have these attributes:

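Attribute Names	Description
`axes`	list holding the row index and the column index
`columns`	`Index` of column names
`dtypes`	`Series` of dtypes, one per column
`iloc`	indexer for selection by integer position
`index`	`Index` of row labels
`loc`	indexer for selection by label
`shape`	`(rows, columns)`
`size`	count of elements
`values`	the elements as a numpy.ndarray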
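A brief sketch of these attributes on a two-by-two DataFrame with animal names as row labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 3]}, index=["tiger", "zebra"])

print(df.shape)              # (2, 2)
print(df.size)               # 4
print(df.columns.tolist())   # ['a', 'b']
print(df.index.tolist())     # ['tiger', 'zebra']
print(df.loc["zebra", "b"])  # 3, selected by row and column labels
print(df.iloc[0, 0])         # 1, selected by integer positions
```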

Others

The library also exposes several implementation details of Series and DataFrame objects: pandas.core.indexes.base.Index (generally returned by a columns attribute), pandas.core.indexes.range.RangeIndex (generally returned by an index attribute), pandas.core.indexing._LocIndexer (returned by a loc attribute), and pandas.core.indexing._iLocIndexer (returned by an iloc attribute).
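One way to see these classes is to inspect the types of the corresponding attributes. This is only a sketch: the underscore-prefixed indexer classes are private, so their names are an implementation detail that may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 3]})

print(type(df.columns).__name__)  # Index
print(type(df.index).__name__)    # RangeIndex (the default when no index is given)
print(type(df.loc).__name__)      # _LocIndexer
print(type(df.iloc).__name__)     # _iLocIndexer
```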
