Python Pandas
pandas is a library for data manipulation.
Example
To compute statistics:
import pandas as pd
s = pd.Series([10, 20, 30])
mu = s.mean()    # arithmetic mean
sigma = s.std()  # sample standard deviation (ddof=1 by default)
To import a CSV file:
df = pd.read_csv("example.csv", index_col="UniqueID", usecols=["UniqueID", "Height", "Weight"])
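The same call can be tried without a file on disk. The following is a minimal sketch that reads from an in-memory string; the column names match the call above, while the data values and the extra Age column are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for example.csv.
csv_text = """UniqueID,Height,Weight,Age
a1,170,65,30
b2,182,80,41
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="UniqueID",                      # use this column as the row index
    usecols=["UniqueID", "Height", "Weight"],  # columns not listed (Age) are skipped
)

print(df.columns.tolist())     # the index column is not among the data columns
print(df.loc["b2", "Height"])  # look up a cell by row label and column name
```

Note that the column named by index_col must also appear in usecols, otherwise it is never read.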
Series
The core data type is pandas.core.series.Series, a one-dimensional labeled array. By default the elements are indexed from 0. Ideally a Series stores elements of a single data type (a.k.a. dtype); if an efficient type cannot be inferred, the dtype falls back to object.
Python's built-in arithmetic and comparison operators work element-wise on a Series.
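As a brief illustration of element-wise operators, here is a sketch with two small Series; the variable names are arbitrary.

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([10, 20, 30])

total = a + b    # 11, 22, 33
scaled = a * 2   # a scalar is applied to every element
mask = b > 15    # comparisons yield a boolean Series
```

Operations between two Series pair elements by index label rather than by position, so both operands here share the default 0, 1, 2 index.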
import pandas as pd
pd.Series(["foo", "bar", "baz"])
# 0    foo
# 1    bar
# 2    baz
# dtype: object
Series objects have these attributes:
Attribute Name | Description |
axes | list containing the row index |
iloc | integer position-based indexer |
index | Index of row labels (a RangeIndex by default) |
is_unique | are all elements unique? |
hasnans | are any elements NaN? |
loc | label-based indexer |
shape | (rows,) |
size | count of elements |
values | numpy.ndarray representation of the elements |
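A short sketch of reading these attributes, using a Series with string labels and one missing value (both chosen only for illustration):

```python
import pandas as pd

s = pd.Series([3.0, 1.0, None, 1.0], index=["w", "x", "y", "z"])

print(s.index)      # the labels; an Index of strings here rather than a RangeIndex
print(s.shape)      # (4,)
print(s.size)       # 4, missing values included
print(s.is_unique)  # False, since 1.0 appears twice
print(s.hasnans)    # True, since None is stored as NaN

by_label = s.loc["x"]    # label-based lookup
by_position = s.iloc[0]  # position-based lookup
```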
These methods describe or inspect a Series, rather than being general programming utilities.
Method Name | Description | Example |
describe | Creates a Series of descriptive statistics | s.describe() |
head | First N elements (5 by default) | s.head(5) |
info | Prints a concise summary (dtype, non-null count, memory usage) | s.info() |
tail | Last N elements (5 by default) | s.tail(5) |
value_counts | Creates a Series with counts of unique values | s.value_counts() |
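To see what these descriptive methods return, consider this sketch on a small Series of strings (the data is made up):

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red", "blue"])

first_two = s.head(2)      # red, blue
last_two = s.tail(2)       # red, blue
counts = s.value_counts()  # indexed by value, most frequent first
summary = s.describe()     # for non-numeric data: count, unique, top, freq
```

Here counts is indexed by the distinct values, so counts["red"] is the number of times "red" occurs.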
These methods create and return a new Series.
Method Name | Description | Example |
add | Element-wise addition | s.add(1) |
apply | Element-wise function mapping | s.apply(len) |
copy | Copy of the Series (deep by default) | s.copy() |
div | Element-wise division | s.div(2) |
map | Element-wise value mapping; NaN if no match | s.map({True: 1}) |
mul | Element-wise multiplication | s.mul(2) |
sort_index | Sorted by indices | s.sort_index(ascending=True) |
sort_values | Sorted by values | s.sort_values(ascending=True) |
sub | Element-wise subtraction | s.sub(1) |
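The following sketch exercises a few of these methods and shows that each call returns a new Series while leaving the original untouched. The sample words and the mapping dictionary are arbitrary.

```python
import pandas as pd

words = pd.Series(["pear", "fig", "banana"])

lengths = words.apply(len)                      # 4, 3, 6
mapped = words.map({"pear": 1, "fig": 2})       # "banana" has no match, so NaN
ordered = lengths.sort_values(ascending=False)  # 6, 4, 3; labels travel with values
plus_ten = lengths.add(10)                      # same result as lengths + 10
```

After sorting, ordered keeps the original index labels (2, 0, 1), which is what makes sort_index useful for restoring the original order.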
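These methods return a value computed from the Series, usually a scalar:
Method Name | Description |
count | Count of non-missing elements |
get | Element at an index label, or a default if absent |
max | Largest element |
mean | Arithmetic mean |
median | Median |
min | Smallest element |
mode | Most frequent value(s); returned as a Series, since ties are possible |
product | Product of the elements |
std | Sample standard deviation (ddof=1 by default) |
sum | Sum of the elements |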
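A minimal sketch of these reductions on a Series containing one missing value, to show that missing values are skipped by default:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, None])

print(s.count())   # 3: the missing value is not counted
print(s.sum())     # 10.0: missing values are skipped by default
print(s.mean())    # 10 / 3
print(s.median())  # 4.0
print(s.max())     # 4.0

modes = s.mode()   # a Series rather than a scalar, because values can tie
```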
Describe
s = pd.Series([1,2,3,4,5,6,7,8,9,10])
s.describe()
# count 10.00000
# mean 5.50000
# std 3.02765
# min 1.00000
# 25% 3.25000
# 50% 5.50000
# 75% 7.75000
# max 10.00000
# dtype: float64
Get
The get method returns one or more elements by index label.
If a label is not found, get returns the value of the optional default keyword argument (None if omitted) instead of raising an error.
The get method also accepts a list of labels. A new Series is returned only if every label is found; otherwise the single default value is returned.
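The behavior described above can be sketched as follows, assuming a recent pandas version in which indexing with a list containing a missing label raises KeyError internally. The labels and values are illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

one = s.get("b")                         # 20
missing = s.get("z")                     # None, the default when none is given
fallback = s.get("z", default=-1)        # -1
several = s.get(["a", "c"])              # a new Series with labels a and c
partial = s.get(["a", "z"], default=-1)  # -1, because "z" is absent
```

String labels are used here deliberately: integer keys on a non-integer index have been treated differently across pandas versions.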
Data Frames
pandas.core.frame.DataFrame builds upon Series: it is a two-dimensional table in which each column can be accessed as a Series.
For example, the Series methods that return a scalar value are defined on a DataFrame to return a Series instead, holding one scalar value per column.
df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra'])
df.mean()
# a    1.5
# b    2.5
DataFrame objects have these attributes:
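Attribute Name | Description |
axes | list of the row index and the column index |
columns | Index of column names |
dtypes | Series of dtypes, one per column |
iloc | integer position-based indexer |
index | Index of row labels (a RangeIndex by default) |
loc | label-based indexer |
shape | (rows, columns) |
size | count of elements |
values | numpy.ndarray representation of the elements |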
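A short sketch of these attributes on a small DataFrame, modeled on the tiger/zebra example above but with a float column added to make dtypes more interesting:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2.5, 3.5]}, index=["tiger", "zebra"])

print(df.columns.tolist())  # ['a', 'b']
print(df.index.tolist())    # ['tiger', 'zebra']
print(df.shape)             # (2, 2)
print(df.size)              # 4
print(df.dtypes)            # one dtype per column

cell = df.loc["zebra", "a"]  # label-based: row "zebra", column "a"
same = df.iloc[1, 0]         # position-based: second row, first column
```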
Others
The module also exposes several implementation details of Series and DataFrame objects: pandas.core.indexes.base.Index (generally returned by the columns attribute), pandas.core.indexes.range.RangeIndex (generally returned by the index attribute when no explicit index is given), pandas.core.indexing._LocIndexer (returned by the loc attribute), and pandas.core.indexing._iLocIndexer (returned by the iloc attribute).
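These types can be inspected directly. The sketch below checks them through the public aliases pd.Index and pd.RangeIndex; the indexer class names are private and could change between pandas versions, so they are only printed here.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 3]})

columns_are_index = isinstance(df.columns, pd.Index)
index_is_range = isinstance(df.index, pd.RangeIndex)  # no explicit index was given

# Private implementation classes; names may differ across versions.
print(type(df.loc).__name__)
print(type(df.iloc).__name__)
```

Supplying explicit labels, as in index=['tiger', 'zebra'], yields a plain Index instead of a RangeIndex.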
