Image: Track and field with numbered tracks

Summary Statistics Using Pandas

Yashmeet Singh · 3 minute read

Sometimes the default statistics provided by pandas’ describe() method are not enough. In such cases, you can generate custom statistics using the agg() method.

I’ll explain this using the list of numbers below:

import numpy as np
import pandas as pd
 
# 30 random points from a normal distribution
# with mean = 5 and standard deviation = 15
data = np.random.normal(5, 15, 30)
data_df = pd.DataFrame({"data":data})
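Because the points are random, your numbers will differ from the ones shown in this post on every run. If you want repeatable results, one option is a seeded NumPy generator. This is a minimal sketch, not what produced the output in this post; the seed value 42 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed yields the same 30 points on every run.
# The seed (42) is arbitrary and chosen only for illustration.
rng = np.random.default_rng(seed=42)

# Same distribution as above: mean = 5, standard deviation = 15
data = rng.normal(loc=5, scale=15, size=30)
data_df = pd.DataFrame({"data": data})
```

With a fixed seed, rerunning the snippet gives the same data, so the statistics stay stable between runs.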

The Problem

You can get some statistics using pandas’ describe() method, as shown below:

data_df.describe()
data
count 30.00
mean 5.92
std 15.47
min -23.72
25% -3.37
50% 4.62
75% 13.86
max 37.45

But let’s say you need the following statistics:

  • Minimum Value
  • Maximum Value
  • Range
  • Mean
  • Median
  • Variance
  • Interquartile Range (IQR)

The output of describe() doesn’t include range, variance, or IQR.

How can you get all the statistics you want?

The Solution

You’ll need to do two things:

  • Write custom aggregate functions for statistics like range and IQR
  • Use pandas’ agg() method to generate all the statistics

Custom Aggregate Functions

# Input's interquartile range
# It's the distance between the 75th and 25th percentiles
def IQR(column): 
    q25, q75 = column.quantile([0.25, 0.75])
    return q75-q25
 
# Input's range
# It's the difference between input's maximum and minimum values
# 
# range() is already a built-in function in Python. 
# So I chose another name for our custom function
def range_f(column):
    return column.max() - column.min()
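Before plugging these functions into agg(), it can help to check them on a tiny input where the answer is easy to work out by hand. The sketch below repeats the two functions so it runs on its own; the Series named small is a hypothetical example, not part of the data set above.

```python
import pandas as pd

def IQR(column):
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25

def range_f(column):
    return column.max() - column.min()

# Hypothetical five-point input, used only as a sanity check
small = pd.Series([1, 2, 3, 4, 5])

# With pandas' default linear interpolation, the 25th percentile
# is 2.0 and the 75th percentile is 4.0, so the IQR is 2.0
iqr_value = IQR(small)

# Maximum (5) minus minimum (1) gives a range of 4
range_value = range_f(small)

print(iqr_value, range_value)
```

Both functions take a pandas Series and return a single number, which is the shape agg() expects from an aggregate function.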

The agg() Method

Let’s prepare a list of all the required statistics.

You can mix a variety of functions in this list:

  • pandas’ built-in functions like min, max, etc.
  • the custom functions we defined above.
  • NumPy aggregate functions. For example, we’ll use NumPy’s var() to calculate variance.

stats_list = [
    'min', 'max', 
    range_f, # custom function 
    'mean', 'median',  'std',
    np.var, # numpy function
    IQR   # custom function
]

Next, we generate the statistics using pandas’ agg() method.

summary_stats = data_df.agg(func=stats_list)
summary_stats
data
min -23.72
max 37.45
range_f 61.17
mean 5.92
median 4.62
std 15.47
var 239.35
IQR 17.23
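One detail worth knowing about the var row: in the output above it matches the square of std (up to rounding), which points to the sample variance (ddof=1), pandas' default, rather than the population variance (ddof=0), which is NumPy's default on plain arrays. How a NumPy callable such as np.var is handled inside agg() may depend on your pandas version, so treat the sketch below as an illustration of the two definitions rather than a statement about agg() internals. The Series named values is a hypothetical example.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the mean is 5 and the squared deviations sum to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# pandas default: sample variance, divides by n - 1 (ddof=1)
sample_var = values.var()

# NumPy default on a plain array: population variance, divides by n (ddof=0)
population_var = np.var(values.to_numpy())

# Passing the string 'var' to agg() asks for pandas' own variance
agg_var = values.agg('var')

print(sample_var, population_var, agg_var)
```

If you need one definition specifically, being explicit (the string 'var' for pandas' version, or a small custom function that sets ddof) avoids surprises.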

Pretty Names

The statistics above look good. But you may not like the default names for the statistics.

You can use custom names:

# custom names for the statistics. 
# Make sure they are in the same order 
# as in the 'stats_list' variable
pretty_names = [
    'Minimum', 'Maximum', 'Range', 'Mean', 'Median', 
    'Standard Deviation', 'Variance', 'IQR'
]
 
# update the index labels with our custom names
summary_stats.index = pretty_names
summary_stats
data
Minimum -23.72
Maximum 37.45
Range 61.17
Mean 5.92
Median 4.62
Standard Deviation 15.47
Variance 239.35
IQR 17.23
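Assigning a list to the index works only if the names are in exactly the same order as the statistics. An alternative that doesn't depend on order is to rename the labels with a dictionary. This is a minimal sketch on a small hypothetical data set (demo_df) with a shorter list of statistics; the name_map dictionary is my own illustration.

```python
import pandas as pd

def IQR(column):
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25

def range_f(column):
    return column.max() - column.min()

# Small hypothetical data set, used only for illustration
demo_df = pd.DataFrame({"data": [1.0, 2.0, 3.0, 4.0, 5.0]})
demo_stats = demo_df.agg(['min', 'max', range_f, 'median', IQR])

# Map each default label to a pretty name.
# Custom functions are labeled with their function names.
name_map = {
    'min': 'Minimum',
    'max': 'Maximum',
    'range_f': 'Range',
    'median': 'Median',
    'IQR': 'IQR',
}

# rename() returns a new DataFrame and leaves demo_stats unchanged
pretty_stats = demo_stats.rename(index=name_map)
print(pretty_stats)
```

Since each name is tied to its label, reordering or trimming the statistics list later won't silently mislabel a row.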

That’s it! You now have the statistics just the way you wanted.

Title Image by Pexels