Summary Statistics Using Pandas
Sometimes the default statistics provided by pandas’ describe() method are not enough. In such cases, you can generate custom statistics using the agg() method.
I’ll explain this using the list of numbers below:
import numpy as np
import pandas as pd
# 30 random points from a normal distribution
# with mean = 5 and standard deviation = 15
data = np.random.normal(5, 15, 30)
data_df = pd.DataFrame({"data":data})
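The call above draws fresh random numbers on every run, so your values will differ from the tables in this post. If you want repeatable numbers, one option is a seeded generator. This is a minimal sketch: the seed value 42 and the names `rng`, `seeded_data`, `seeded_df`, and `repeat_data` are my own choices for illustration, and the resulting numbers will not match the tables shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical seed value; any fixed integer gives repeatable draws
rng = np.random.default_rng(seed=42)

# 30 draws from a normal distribution with mean 5 and standard deviation 15
seeded_data = rng.normal(loc=5, scale=15, size=30)
seeded_df = pd.DataFrame({"data": seeded_data})

# Re-creating the generator with the same seed reproduces the same values
repeat_data = np.random.default_rng(seed=42).normal(loc=5, scale=15, size=30)
print(np.array_equal(seeded_data, repeat_data))  # True
```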
The Problem
You can get some statistics using pandas’ describe() method, like below:
data_df.describe()
 | data |
---|---|
count | 30.00 |
mean | 5.92 |
std | 15.47 |
min | -23.72 |
25% | -3.37 |
50% | 4.62 |
75% | 13.86 |
max | 37.45 |
But let’s say you need the following statistics:
- Minimum Value
- Maximum Value
- Range
- Mean
- Median
- Variance
- Interquartile Range (IQR)
The output of describe() didn’t include range, variance, or IQR.
How can you get all the statistics you want?
The Solution
You’ll need to do two things:
- Write custom aggregate functions for statistics like range and IQR
- Use pandas’ agg() method to generate all the statistics
Custom Aggregate Functions
# Input's interquartile range (IQR):
# the distance between the 75th and 25th percentiles
def IQR(column):
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25

# Input's range:
# the difference between the input's maximum and minimum values
#
# range() is already a built-in function in Python,
# so I chose another name for our custom function
def range_f(column):
    return column.max() - column.min()
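Before using these functions with agg(), it can help to check them on a tiny input whose answers are easy to work out by hand. The sketch below repeats the two functions so it runs on its own; the `toy` Series is a made-up example, not part of the original data.

```python
import pandas as pd

def IQR(column):
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25

def range_f(column):
    return column.max() - column.min()

# Hypothetical toy input, chosen so the answers are easy to verify by hand
toy = pd.Series([1, 2, 3, 4, 5])

print(range_f(toy))  # 5 - 1 = 4
print(IQR(toy))      # 4.0 - 2.0 = 2.0
```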
The agg() Method
Let’s prepare a list of all the required statistics.
You can mix a variety of functions in this list:
- pandas’ built-in functions like min, max, etc.
- the custom functions we defined above.
- NumPy aggregate functions. For example, we’ll use NumPy’s var() to calculate variance.
stats_list = [
    'min', 'max',
    range_f,  # custom function
    'mean', 'median', 'std',
    np.var,   # NumPy function
    IQR       # custom function
]
Next, we generate the statistics using pandas’ agg() method.
summary_stats = data_df.agg(func=stats_list)
summary_stats
 | data |
---|---|
min | -23.72 |
max | 37.45 |
range_f | 61.17 |
mean | 5.92 |
median | 4.62 |
std | 15.47 |
var | 239.35 |
IQR | 17.23 |
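One detail worth checking in the table above: the var row is roughly the square of the std row (15.47² ≈ 239.3), which suggests the variance here is the sample variance (ddof=1). NumPy's own var() defaults to the population variance (ddof=0) when called on a plain array, and I believe how pandas treats np.var inside agg() can depend on the pandas version. The sketch below uses a made-up `toy` Series to show the two conventions side by side. Passing the string 'var' or setting ddof explicitly is one way to make the intent unambiguous.

```python
import numpy as np
import pandas as pd

# Hypothetical toy input with an easy-to-check variance
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# NumPy on a plain array: population variance (ddof=0 by default)
population_var = np.var(toy.to_numpy())

# pandas: sample variance (ddof=1 by default)
sample_var = toy.var()

# Either convention can be requested explicitly
explicit_population_var = toy.var(ddof=0)

print(population_var, sample_var, explicit_population_var)  # 2.0 2.5 2.0
```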
Pretty Names
The statistics above look good. But you may not like the default names for the statistics.
You can use custom names:
# custom names for the statistics.
# Make sure they are in the same order
# as in the 'stats_list' variable
pretty_names = [
    'Minimum', 'Maximum', 'Range', 'Mean', 'Median',
    'Standard Deviation', 'Variance', 'IQR'
]
# update the index labels with our custom names
summary_stats.index = pretty_names
summary_stats
 | data |
---|---|
Minimum | -23.72 |
Maximum | 37.45 |
Range | 61.17 |
Mean | 5.92 |
Median | 4.62 |
Standard Deviation | 15.47 |
Variance | 239.35 |
IQR | 17.23 |
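Assigning a list to the index works, but it depends on the names being in exactly the same order as the statistics. An alternative is to map each default label to its display name with rename(), so the order no longer matters. This is a minimal sketch on a made-up `toy_df` with a shorter list of statistics; `name_map` and `pretty_stats` are names I chose for illustration.

```python
import pandas as pd

# Hypothetical small input standing in for the data_df used in this post
toy_df = pd.DataFrame({"data": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy_stats = toy_df.agg(["min", "max", "mean", "median", "std"])

# Map each default label to its display name; the dict order does not matter
name_map = {
    "std": "Standard Deviation",
    "min": "Minimum",
    "max": "Maximum",
    "mean": "Mean",
    "median": "Median",
}

# rename() returns a new DataFrame and leaves toy_stats unchanged
pretty_stats = toy_stats.rename(index=name_map)
print(pretty_stats)
```

For custom functions, the default label is the function's name, so the keys for the functions defined earlier would be 'range_f' and 'IQR'.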
That’s it! You now have the statistics just the way you wanted.