
What Is Stratified Sampling and How to Do It Using Pandas?

Yashmeet Singh · 10 minute read

Sampling

Imagine you want to conduct a study to answer the following question:

How many hours does an average American adult spend watching TV each day?

There are over 256 million adults in America. You can’t possibly ask every adult in America about their TV habits.

So what do you do?

You can select and interview a subset (a sample) of all adults in America. If you choose the sample correctly, you can use it to draw conclusions about the entire population of American adults.

In this post, I’ll cover two techniques for selecting samples. First, we’ll discuss Simple Random Sampling (SRS).

Then we’ll see where SRS can go wrong and how Stratified Sampling can alleviate those issues.

Finally, we’ll develop some practical skills. We’ll implement both sampling techniques using Python and Pandas.

Simple Random Sampling (SRS)

In Simple Random Sampling (SRS), everyone in the population has an equal chance of being selected for the sample.

To prepare a sample using SRS, we randomly select the desired number of members from the population.

Let’s say we use SRS to choose a sample of 1000 American adults for the TV study.
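As a minimal sketch of the idea, the snippet below draws a simple random sample with Python's standard random module. The population is a hypothetical list of 100,000 ID numbers standing in for American adults (the real population is far larger), and the seed is fixed only so that the run is repeatable.

```python
import random

# Hypothetical stand-in for the population: 100,000 adult IDs
population = list(range(100_000))

# Fixed seed so the sketch is repeatable (an illustrative choice)
rng = random.Random(42)

# Draw 1000 distinct members; every member is equally likely to be chosen
srs_sample = rng.sample(population, k=1000)

print(len(srs_sample))       # number of selected members
print(len(set(srs_sample)))  # same number: sampling is without replacement
```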

Will this sample accurately represent the population? We’re about to find out!

Trouble with SRS

The pie chart below shows how American adults are distributed across age groups:

Stratified Sampling: Distribution of American adults by age groups

Why is this important?

The average time spent on TV varies widely among these age groups. The younger generations spend significantly less time on TV than the older adults.

Therefore, our sample should have roughly the same proportion of age groups as the entire adult population. If one group is over- or underrepresented, it may exert an undue influence on our conclusions.

SRS cannot guarantee such proportional representation in the samples.

SRS ensures that every adult gets an equal chance to be included in the sample. But SRS chooses each participant at random and independently of others. Therefore, SRS may not maintain the group distribution as we desire.

To illustrate this, I simulated drawing five random samples from the adult population. Here are the proportions of participants by each age group:

TABLE 1: SRS Samples - Percentages By Age Groups

Age Group    Sample 1    Sample 2    Sample 3    Sample 4    Sample 5
Age 19-34    24.4        28.0        30.6        31.7        27.8
Age 35-54    32.4        33.5        32.2        31.2        33.7
Age 55-64    19.4        18.3        17.7        16.1        17.5
Age 65+      23.8        20.2        19.5        21.0        21.0
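
A simulation along these lines can be sketched with the standard library. This is not the code behind Table 1: it assumes a scaled-down synthetic population of 100,000 adults built from the expected age-group percentages used in this post, and the names expected_pct and draw_srs_percentages are illustrative. The numbers will differ from run to run and from the table.

```python
import random
from collections import Counter

# Expected age-group percentages used in this post
expected_pct = {"Age 19-34": 27.8, "Age 35-54": 33.4,
                "Age 55-64": 17.2, "Age 65+": 21.6}

# Scaled-down synthetic population: one label per adult (assumed size)
POPULATION_SIZE = 100_000
population = []
for group, pct in expected_pct.items():
    population += [group] * round(POPULATION_SIZE * pct / 100)

def draw_srs_percentages(population, sample_size, rng):
    """Draw one simple random sample and return the percentage per group."""
    sample = rng.sample(population, k=sample_size)
    counts = Counter(sample)
    return {group: 100 * counts[group] / sample_size for group in expected_pct}

rng = random.Random()  # unseeded on purpose: results change on every run
samples = [draw_srs_percentages(population, 1000, rng) for _ in range(5)]

for i, pcts in enumerate(samples, start=1):
    row = ", ".join(f"{group}: {pct:.1f}" for group, pct in pcts.items())
    print(f"Sample {i} -> {row}")
```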

A couple of things to note:

  • The proportions for the same group can vary considerably across samples. For example, the percentage for the age group 19-34 ranges from 24.4% to 31.7%.

  • An imbalanced sample can skew the results. In Sample 1, the younger generations (ages 19-54) are underrepresented while the older age groups are overrepresented. Thus, the average TV time calculated using this sample will likely be higher than the actual population average.

So SRS cannot guarantee a representative sample. What do we do then?

Stratified Sampling

In Stratified Sampling, we divide the population into groups and then draw proportional samples from each group.[1]

Let’s prepare a sample for the TV study using this technique.

First, we need to answer this question - how many participants do we need to select from each age group?

We already know the expected proportion for each age group, and the sample size is 1000. So we can calculate the count for each age group as follows:

TABLE 2: Desired Participant Count By Age Group

Age Group    Expected Percentage    Number of Participants in a Sample of 1000
Age 19-34    27.8%                  $\frac{27.8}{100} \times 1000 = 278$
Age 35-54    33.4%                  $\frac{33.4}{100} \times 1000 = 334$
Age 55-64    17.2%                  $\frac{17.2}{100} \times 1000 = 172$
Age 65+      21.6%                  $\frac{21.6}{100} \times 1000 = 216$
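
The same arithmetic takes only a few lines of Python. The dictionary name expected_pct is an illustrative choice, and round() is there to absorb floating-point noise in the products.

```python
# Expected percentage of each age group in the population (from Table 2)
expected_pct = {"Age 19-34": 27.8, "Age 35-54": 33.4,
                "Age 55-64": 17.2, "Age 65+": 21.6}
sample_size = 1000

# Participants per group = (percentage / 100) * sample size
counts = {group: round(pct / 100 * sample_size)
          for group, pct in expected_pct.items()}

print(counts)
# {'Age 19-34': 278, 'Age 35-54': 334, 'Age 55-64': 172, 'Age 65+': 216}
print(sum(counts.values()))  # 1000
```

With other percentages or sample sizes, the rounded counts may not add up exactly to the sample size, so it is worth checking the total.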

Next, split the population of all American adults into the age groups. Then apply SRS within each group to select the desired number of participants.

Finally, combine the selected participants from all age groups to prepare the sample. This sample will have the same proportion of age groups as the population.

The figure below summarizes the steps taken to prepare the stratified sample:

Figure to show Stratified Sampling steps: Split into groups, draw random samples from groups and combine
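
The three steps can also be sketched in plain Python. The population here is hypothetical: 100,000 made-up (person_id, age_group) records generated from the expected percentages. The function stratified_sample is an illustrative helper, not a library function.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, group_of, counts, rng):
    """Split records into groups, apply SRS within each group, then combine."""
    # Step 1: split the population into groups (strata)
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(record)

    # Step 2: draw a simple random sample of the desired size from each group
    # Step 3: combine the selections into one sample
    sample = []
    for group, k in counts.items():
        sample.extend(rng.sample(groups[group], k=k))
    return sample

# Hypothetical population of (person_id, age_group) records
expected_pct = {"Age 19-34": 27.8, "Age 35-54": 33.4,
                "Age 55-64": 17.2, "Age 65+": 21.6}
rng = random.Random(0)  # fixed seed, only for repeatability
age_groups = rng.choices(list(expected_pct),
                         weights=list(expected_pct.values()), k=100_000)
population = list(enumerate(age_groups))

# Desired participants per group for a sample of 1000
counts = {group: round(pct / 100 * 1000) for group, pct in expected_pct.items()}

sample = stratified_sample(population, lambda record: record[1], counts, rng)
group_counts = Counter(group for _, group in sample)
for group in expected_pct:
    print(group, group_counts[group])
```

Whichever individuals are picked, each group contributes exactly its target count (278, 334, 172 and 216), which is the property SRS could not guarantee.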


Hands-On Practice with Pandas

Now we know the theory behind SRS and Stratified Sampling. Let’s learn to implement them using Python and Pandas.

First, allow me to introduce the dataset we’ll use today.

Palmer Penguins Dataset

Palmer Penguins is one of the most exciting datasets I’ve come across recently. Here’s the official description:

The dataset contains data for 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

First, let’s load the dataset.

import pandas as pd
dataset = pd.read_csv('penguins.csv')
dataset.head()
  species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen              39.1             18.7              181.0       3750.0    MALE
1  Adelie  Torgersen              39.5             17.4              186.0       3800.0  FEMALE
2  Adelie  Torgersen              40.3             18.0              195.0       3250.0  FEMALE
3  Adelie  Torgersen              36.7             19.3              193.0       3450.0  FEMALE
4  Adelie  Torgersen              39.3             20.6              190.0       3650.0    MALE

Next, check how the penguins are distributed across the islands. We can do that using the Pandas method value_counts():

# Check unique values and their counts 
# for the column 'island'
dataset['island'].value_counts()
Biscoe       164
Dream        123
Torgersen     47
Name: island, dtype: int64 

Let’s convert these raw numbers into proportions using the normalize=True parameter.

# Get ratio instead of raw numbers using normalize=True
expected_ratio = dataset['island'].value_counts(normalize=True)
 
# Round and then convert to percentage
expected_ratio = expected_ratio.round(4)*100
 
# convert to a DataFrame and store in variable 'island_ratios'
# We'll use this variable to compare ratios for samples 
# selected using SRS and Stratified Sampling 
island_ratios = pd.DataFrame({'Expected':expected_ratio})
island_ratios
           Expected
Biscoe        49.10
Dream         36.83
Torgersen     14.07

This is the percentage of rows we have for each island. We expect a sample from this dataset to have a similar distribution across islands.

Let’s test the two sampling techniques we learned today.

Simple Random Sampling (SRS)

We can do Simple Random Sampling (SRS) using the Pandas method sample().

This method has two ways to specify how many items you want to select.

If you know the exact number of items you want, use the parameter n. In the example below, we randomly draw 5 rows from the dataset:

# Choose a Simple Random Sample of 5 items
dataset.sample(n=5)
       species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
144     Adelie      Dream              36.0             17.1              187.0       3700.0  FEMALE
58      Adelie     Biscoe              36.4             17.1              184.0       2850.0  FEMALE
119     Adelie  Torgersen              40.6             19.0              199.0       4000.0    MALE
210  Chinstrap      Dream              43.5             18.1              202.0       3400.0  FEMALE
323     Gentoo     Biscoe              43.5             15.2              213.0       4650.0  FEMALE
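
Because sample() picks rows at random, the rows shown above will differ on every run. If you need repeatable results, sample() also accepts a random_state parameter. The sketch below uses a small made-up DataFrame (not the penguins data) so that it runs on its own; the seed value 42 is arbitrary.

```python
import pandas as pd

# Small made-up DataFrame standing in for the real dataset
toy = pd.DataFrame({
    "island": ["Biscoe"] * 5 + ["Dream"] * 3 + ["Torgersen"] * 2,
    "body_mass_g": [4500, 4600, 3800, 4625, 2850, 3700, 3400, 3900, 3750, 4000],
})

# The same seed selects the same rows every time
first = toy.sample(n=4, random_state=42)
second = toy.sample(n=4, random_state=42)

print(first.equals(second))  # True
```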

Or, you can use the parameter frac to randomly select a fraction of the dataset.

Below, we choose a random sample with 20% of the rows. Then we calculate the proportion of rows for each island.

# Choose an SRS with 20% of the dataset
srs_sample = dataset.sample(frac=0.20)
# Ratio of selected items by the island
srs_ratio = srs_sample['island'].value_counts(normalize=True)
# Convert to percentage
srs_ratio = srs_ratio.round(4)*100
# This sample was drawn using SRS, so name the Series accordingly
srs_ratio.name = 'SRS'
 
# Finally, add the SRS sample proportions as a column to
# the variable island_ratios
island_ratios = pd.concat([island_ratios, srs_ratio], axis=1)
island_ratios
           Expected    SRS
Biscoe        49.10  59.70
Dream         36.83  28.36
Torgersen     14.07  11.94

Check the values for the islands Biscoe and Dream. The proportions generated by SRS are off by roughly 8 to 11 percentage points from the expected values: Biscoe is overrepresented and Dream is underrepresented.

So SRS cannot be relied on when we want the sample to maintain group proportions.

Let’s try an approach that can.

Stratified Sampling

We’ll implement Stratified Sampling using Pandas methods groupby() and apply():

  1. First, use groupby() to split the dataset into 3 groups, one for each island.
  2. Then use apply() to sample 20% of the rows within each group. We use a lambda function to execute sample() on each group.
  3. Next, combine the rows selected from each group to return the final sample. Pandas does this step automatically.
# Stratified Sampling
# Use groupby and apply to select sample 
# which maintains the population group ratios
stratified_sample = dataset.groupby('island').apply(
    lambda x: x.sample(frac=0.20)
)
 
stratified_sample.head()
            species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
island
Biscoe 214   Gentoo  Biscoe              46.1             13.2              211.0       4500.0  FEMALE
       298   Gentoo  Biscoe              43.4             14.4              218.0       4600.0  FEMALE
       19    Adelie  Biscoe              38.8             17.2              180.0       3800.0    MALE
       288   Gentoo  Biscoe              47.5             14.2              209.0       4600.0  FEMALE
       292   Gentoo  Biscoe              49.1             14.5              212.0       4625.0  FEMALE

There is one problem, though: groupby() has added island as an extra index level.

Let’s drop the extra level using the Pandas method droplevel(). We pass 0 because we want to drop the top-level index.

# Remove the extra index added by groupby()
stratified_sample = stratified_sample.droplevel(0)
stratified_sample.head()
     species  island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g     sex
214   Gentoo  Biscoe              46.1             13.2              211.0       4500.0  FEMALE
298   Gentoo  Biscoe              43.4             14.4              218.0       4600.0  FEMALE
19    Adelie  Biscoe              38.8             17.2              180.0       3800.0    MALE
288   Gentoo  Biscoe              47.5             14.2              209.0       4600.0  FEMALE
292   Gentoo  Biscoe              49.1             14.5              212.0       4625.0  FEMALE

The sample looks good now.
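
As an aside, Pandas 1.1 and later let you call sample() directly on the groupby object. It keeps the original index, so no droplevel() step is needed, and it avoids the deprecation warning that recent Pandas releases raise when apply() operates on the grouping column. The sketch below is an alternative to the apply() approach above, not a replacement for it. It uses a made-up DataFrame with the same island counts as our dataset so that it runs on its own, and the penguin_id column is hypothetical.

```python
import pandas as pd

# Made-up data with the same number of rows per island as our dataset
toy = pd.DataFrame({
    "island": ["Biscoe"] * 164 + ["Dream"] * 123 + ["Torgersen"] * 47,
})
toy["penguin_id"] = range(len(toy))  # hypothetical extra column

# Sample 20% of the rows within each island in a single call
stratified = toy.groupby("island").sample(frac=0.20)

print(stratified.index.nlevels)  # 1: no extra index level was added
print(stratified["island"].value_counts())
```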

Next, we calculate the proportion by each island for this sample.

# Ratio of selected items by the island
stratified_ratio = stratified_sample['island'].value_counts(normalize=True)
# Convert to percentage
stratified_ratio = stratified_ratio.round(4)*100
# This sample was drawn using Stratified Sampling, so name the Series accordingly
stratified_ratio.name = 'Stratified'
 
# Add it to the variable island_ratios, which already has
# the expected and SRS proportions
island_ratios = pd.concat([island_ratios, stratified_ratio], axis=1)
island_ratios
           Expected    SRS  Stratified
Biscoe        49.10  59.70       49.25
Dream         36.83  28.36       37.31
Torgersen     14.07  11.94       13.43

The proportions generated by Stratified Sampling look much better! All of them are within 1 percentage point of the expected values.
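
A single comparison could be down to luck, so it can help to repeat the experiment. The sketch below is an illustration rather than part of the analysis above. It assumes a made-up DataFrame with the same island counts as our dataset, draws 200 samples with each technique, and reports the largest deviation from the expected proportions in percentage points. It uses groupby(...).sample(), available in Pandas 1.1 and later, as a shorthand for the apply() approach. The helper name max_deviation and the number of repetitions are arbitrary choices.

```python
import pandas as pd

# Made-up data with the same number of rows per island as our dataset
toy = pd.DataFrame({
    "island": ["Biscoe"] * 164 + ["Dream"] * 123 + ["Torgersen"] * 47,
})
expected = toy["island"].value_counts(normalize=True) * 100

def max_deviation(sample):
    """Largest gap, in percentage points, between sample and expected ratios."""
    ratio = sample["island"].value_counts(normalize=True) * 100
    # An island missing from the sample counts as 0%
    ratio = ratio.reindex(expected.index, fill_value=0)
    return (ratio - expected).abs().max()

srs_devs, strat_devs = [], []
for seed in range(200):
    srs = toy.sample(frac=0.20, random_state=seed)
    strat = toy.groupby("island").sample(frac=0.20, random_state=seed)
    srs_devs.append(max_deviation(srs))
    strat_devs.append(max_deviation(strat))

print(f"SRS: average of the largest deviation = {sum(srs_devs) / len(srs_devs):.2f}")
print(f"Stratified: largest deviation seen    = {max(strat_devs):.2f}")
```

The stratified deviation comes only from rounding each group's 20% to a whole number of rows, so it stays the same across repetitions, while the SRS deviation changes from draw to draw.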

Summary & Next Steps

We covered a lot of ground today. Let’s do a quick recap.

Now you know why we need to do sampling. You are familiar with two ways to select samples from a population.

You understand Simple Random Sampling (SRS) and why it may not produce samples that accurately represent the population. You also know Stratified Sampling, which gives you samples with the desired group proportions.

Finally, you’ve gained valuable practical skills. You can now implement SRS and Stratified Sampling using Python and Pandas.

Here’s what you can do to build on the foundation you’ve laid today:

  • Research other sampling strategies such as Cluster and Systematic sampling.

  • Learn about Sampling Bias. What are the various types of biases? How do you avoid them?

  • Read up on Train Test Split and Cross-Validation. Both use sampling to train and evaluate machine learning models.

Footnotes

  1. Stratum is a fancy term for group in Statistics. Hence the technique is called Stratified Sampling. I wish it was called Grouped Sampling.

Title Image by Pixel-mixer