Summary Statistics Using Pandas

Yashmeet Singh · Mar 21, 2022 · 3 minute read

Sometimes the default statistics provided by pandas’ describe() method are not enough. In such cases, you can generate custom statistics using the agg() method.

I’ll explain this using the list of numbers below:

import numpy as np
import pandas as pd
 
# 30 random points from a normal distribution
# with mean = 5 and standard deviation = 15
data = np.random.normal(5, 15, 30)
data_df = pd.DataFrame({"data":data})
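Because the data is random, your numbers will differ from the output shown in this post on every run. If you want repeatable results, one option is a seeded generator. This is only a sketch: the seed value 42 is an arbitrary choice that is not part of the original example, and a seeded run won’t reproduce the exact figures shown here.

```python
import numpy as np
import pandas as pd

# Seeded generator; the seed (42) is an arbitrary, hypothetical choice
rng = np.random.default_rng(seed=42)

# Same distribution as above: mean = 5, standard deviation = 15, 30 points
data = rng.normal(loc=5, scale=15, size=30)
data_df = pd.DataFrame({"data": data})

print(data_df.shape)
```

Running the snippet twice with the same seed gives the same 30 values.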

The Problem

You can get some statistics using pandas’ describe() method, as shown below:

data_df.describe()

	data
count	30.00
mean	5.92
std	15.47
min	-23.72
25%	-3.37
50%	4.62
75%	13.86
max	37.45

But let’s say you need the following statistics:

Minimum Value
Maximum Value
Range
Mean
Median
Variance
Inter Quartile Range (IQR)

The output of describe() doesn’t include range, variance, or IQR.
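To see the gap concretely, note that describe() returns an ordinary DataFrame, so you can inspect its row labels and even derive range and IQR by hand from its rows. The sketch below does that on a small hypothetical sample (not the random data above). It works, but it gets clumsy as the list of statistics grows.

```python
import pandas as pd

# Small hypothetical sample, chosen so the results are easy to verify by hand
sample_df = pd.DataFrame({"data": [1.0, 2.0, 3.0, 4.0, 5.0]})
desc = sample_df.describe()

# Row labels that describe() produces for a numeric column
labels = list(desc.index)

# Range and IQR derived manually from the describe() rows
manual_range = desc.loc["max", "data"] - desc.loc["min", "data"]
manual_iqr = desc.loc["75%", "data"] - desc.loc["25%", "data"]

print(labels)
print(manual_range, manual_iqr)
```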

How can you get all the statistics you want?

The Solution

You’ll need to do two things:

Write custom aggregate functions for statistics like range and IQR
Use pandas’ agg() method to generate all the statistics

Custom Aggregate Functions

# Input's inter quartile range
# it's the distance between 75th and 25th percentiles
def IQR(column): 
    q25, q75 = column.quantile([0.25, 0.75])
    return q75-q25
 
# Input's range
# It's the difference between input's maximum and minimum values
# 
# range() is already a built-in function in Python. 
# So I chose another name for our custom function
def range_f(column):
    return column.max() - column.min()
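Before plugging these functions into agg(), it can help to sanity-check them on input where the answer is obvious. The sketch below repeats the two definitions so it runs on its own, and applies them to a small hypothetical Series.

```python
import pandas as pd

def IQR(column):
    # Distance between the 75th and 25th percentiles
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25

def range_f(column):
    # Difference between the maximum and minimum values
    return column.max() - column.min()

# Hypothetical check data: 25th percentile is 2, 75th percentile is 4
check = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

iqr_value = IQR(check)        # 4 - 2
range_value = range_f(check)  # 5 - 1

print(iqr_value, range_value)
```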

The `agg()` Method

Let’s prepare a list of all the required statistics.

You can mix a variety of functions in this list:

pandas’ built-in functions like min, max, etc.
the custom functions we defined above.
NumPy aggregate functions. For example, we’ll use NumPy’s var() to calculate variance.

stats_list = [
    'min', 'max', 
    range_f, # custom function 
    'mean', 'median',  'std',
    np.var, # numpy function
    IQR   # custom function
]

Next, we generate the statistics using pandas’ agg() method.

summary_stats = data_df.agg(func=stats_list)
summary_stats

	data
min	-23.72
max	37.45
range_f	61.17
mean	5.92
median	4.62
std	15.47
var	239.35
IQR	17.23
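One detail worth checking is which variance you get. In the output above, var (239.35) is roughly std squared (15.47 × 15.47), which points to the sample variance (dividing by n - 1) that pandas uses by default, rather than the population variance (dividing by n) that NumPy computes by default on a plain array. The sketch below compares the two on a small hypothetical Series. As far as I know, recent pandas versions also emit a warning when np.var is passed to agg() and suggest the string 'var' instead, so treat that as something to verify against your installed version.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample, chosen so the results are easy to verify by hand
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# pandas divides by (n - 1) by default: sample variance
sample_var = sample.var()

# NumPy divides by n by default on a plain array: population variance
population_var = np.var(sample.to_numpy())

# Passing the string 'var' to agg() gives the pandas (sample) variance
agg_var = sample.agg("var")

print(sample_var, population_var, agg_var)
```

If you need the population variance inside agg(), a small custom function that calls column.var(ddof=0) makes the choice explicit.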

Pretty Names

The statistics above look good. But you may not like the default names for the statistics.

You can use custom names:

# custom names for the statistics. 
# Make sure they are in the same order 
# as in the 'stats_list' variable
pretty_names = [
    'Minimum', 'Maximum', 'Range', 'Mean', 'Median', 
    'Standard Deviation', 'Variance', 'IQR'
]
 
# update the index labels with our custom names
summary_stats.index = pretty_names
summary_stats

	data
Minimum	-23.72
Maximum	37.45
Range	61.17
Mean	5.92
Median	4.62
Standard Deviation	15.47
Variance	239.35
IQR	17.23
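Assigning a list to the index depends on the names being in exactly the same order as the statistics. An alternative that avoids this is to rename by label with a dictionary. The sketch below shows the idea on a small hypothetical DataFrame with a shortened list of statistics; the name_map dictionary is my own illustration, not part of the original example.

```python
import pandas as pd

def range_f(column):
    # Difference between the maximum and minimum values
    return column.max() - column.min()

# Small hypothetical data and a shortened list of statistics
sample_df = pd.DataFrame({"data": [1.0, 2.0, 3.0, 4.0, 5.0]})
stats = sample_df.agg(["min", "max", range_f, "mean"])

# Map each default label to a pretty name; the order here does not matter
name_map = {
    "mean": "Mean",
    "range_f": "Range",
    "max": "Maximum",
    "min": "Minimum",
}
pretty_stats = stats.rename(index=name_map)

print(pretty_stats)
```

rename() returns a new DataFrame and leaves any label that is missing from the dictionary unchanged.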

That’s it! You now have the statistics just the way you wanted.

Title Image by Pexels
