How to Generate Datasets Using make_classification

Yashmeet Singh · Jul 3, 2022 · 8 minute read

Introduction

Imagine you just learned about a new classification algorithm. And you want to explore it further. Maybe you’d like to try out its hyperparameters to see how they affect performance.

The only problem is - you can’t find a good dataset to experiment with.

Don’t fret. Scikit-Learn has a function built for exactly this!

You can use make_classification() to create a variety of classification datasets. Here are a few possibilities:

Generate binary or multiclass labels.
Create labels with balanced or imbalanced classes.
Produce a dataset that’s harder to classify.

Let’s create a few such datasets. We’ll also build RandomForestClassifier models to classify a few of them.

A First Look

Here are the basic input parameters for the function make_classification():

n_samples: How many observations do you want to generate?
n_features: The number of numerical features.
n_informative: The number of features that are ‘useful.’ Only these features carry the signal that your model will use to classify the dataset.
n_classes: The number of unique classes (values) for the target label.

The function will return a tuple containing two NumPy arrays - the features (X) and the corresponding labels (y).

We’ll explore other parameters as we need them.

Binary Classification Dataset

Let’s generate a dataset with a binary label. That is, a label with only two possible values - 0 or 1.

To do so, set the value of the parameter n_classes to 2.

We’ll create a dataset with 1,000 observations. It’ll have five features, out of which three will be informative. The other two features will be redundant.

from sklearn.datasets import make_classification
 
X, y = make_classification(
    n_samples=1000, # 1000 observations 
    n_features=5, # 5 total features
    n_informative=3, # 3 'useful' features
    n_classes=2, # binary target/label 
    random_state=999 # if you want the same results as mine
)

Let’s convert the output of make_classification() into a pandas DataFrame. It’s easier to analyze a DataFrame than raw NumPy arrays.

import pandas as pd
 
# Create DataFrame with features as columns
dataset = pd.DataFrame(X)
# give custom names to the features
dataset.columns = ['X1', 'X2', 'X3', 'X4', 'X5']
# Now add the label as a column
dataset['y'] = y
 
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      1000 non-null   float64
 1   X2      1000 non-null   float64
 2   X3      1000 non-null   float64
 3   X4      1000 non-null   float64
 4   X5      1000 non-null   float64
 5   y       1000 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 47.0 KB

As expected, the dataset has 1,000 observations, five features (X1, X2, X3, X4, and X5), and the corresponding target label (y).

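We set the parameter n_informative to 3, so only three of the five features carry signal. The other two are redundant: make_classification() builds them as linear combinations of the informative features (n_redundant defaults to 2). Because the function shuffles the columns by default (shuffle=True), the informative features are not necessarily X1, X2, and X3.¹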
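If you want to know which columns are informative, one option is to turn off shuffling. The sketch below assumes shuffle=False and the default shift and scale values; in that case the informative features occupy the first three columns and the redundant ones follow. It then checks that the redundant columns can be rebuilt from the informative ones with a least-squares fit. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps the column order: informative first, then redundant
X_ordered, y_ordered = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    shuffle=False, random_state=999
)

informative = X_ordered[:, :3]
redundant = X_ordered[:, 3:]

# Least-squares fit of the redundant columns on the informative ones
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_error = np.abs(informative @ coef - redundant).max()

print(f"Largest reconstruction error: {max_error:.2e}")
```

A reconstruction error close to zero indicates that the redundant columns add no information beyond what the informative columns already contain. Note that shuffle=False also leaves the rows grouped by class, so shuffle or stratify them before training a model.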

Next, check the unique values and their counts for the label y:

dataset['y'].value_counts()

1    502
0    498
Name: y, dtype: int64

The label has only two possible values (0 and 1). So it’s a binary classification dataset.

Moreover, the counts for both values are roughly equal. Thus, the label has balanced classes.

Here are the first five observations from the dataset:

dataset.head()

	X1	X2	X3	X4	X5	y
0	2.501284	-0.159155	0.672438	3.469991	0.949268	0
1	2.203247	-0.331271	0.794319	3.259963	0.832451	0
2	-1.524573	-0.870737	1.004304	-1.028624	-0.717383	1
3	1.801498	3.106336	1.490633	-0.297404	-0.607484	0
4	-0.125146	0.987915	0.880293	-0.937299	-0.626822	0

An Example Classifier

The generated dataset looks good. Now let’s create a RandomForestClassifier model with default hyperparameters.

We’ll use Cross-Validation and measure the model’s score on key classification metrics:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
 
# initialize classifier
classifier = RandomForestClassifier() 
 
# Run cross validation with 10 folds
scores = cross_validate(
    classifier, X, y, cv=10, 
    # measure score for a list of classification metrics
    scoring=['accuracy', 'precision', 'recall', 'f1']
)
 
scores = pd.DataFrame(scores)
scores.mean().round(4)

fit_time          0.1201
score_time        0.0072
test_accuracy     0.8820
test_precision    0.8829
test_recall       0.8844
test_f1           0.8827
dtype: float64

The model’s Accuracy, Precision, Recall, and F1 Score are around 88%. Not bad for a model built without any hyperparameter tuning!

A Harder Dataset

Let’s create a dataset that won’t be so easy to classify.

You can control the difficulty level of a dataset using the following parameters of the function make_classification():

flip_y: adds noise by assigning a random class to a fraction of the observations. As a result, some labels change from 0 to 1 and vice versa. A higher value reassigns more labels and thus adds more noise. The default value is 0.01.
class_sep: controls the space between the label classes. A smaller value reduces the space and thus makes classification harder. The default value is 1.0.
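To get a feel for flip_y, the sketch below generates the same dataset twice, once without label noise and once with flip_y=0.5, and measures how many labels differ. It assumes shuffle=False so that the rows of both datasets line up, and the seed 42 is arbitrary. Since flip_y assigns a random class to the selected rows instead of strictly inverting them, roughly half of the selected rows in a binary problem end up with their original label.

```python
import numpy as np
from sklearn.datasets import make_classification

# Settings shared by both datasets
common = dict(
    n_samples=10_000, n_features=5, n_informative=3, n_classes=2,
    shuffle=False, random_state=42
)

_, y_clean = make_classification(flip_y=0.0, **common)
_, y_noisy = make_classification(flip_y=0.5, **common)

# Fraction of rows whose label differs between the two datasets
changed = np.mean(y_clean != y_noisy)
print(f"Fraction of labels changed: {changed:.3f}")
```

With two classes, expect the changed fraction to land near half of flip_y rather than at flip_y itself.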

We’ll use a higher value for flip_y and lower value for class_sep to create a challenging dataset.

X, y = make_classification(
    # same as the previous section
    n_samples=1000, n_features=5, n_informative=3, n_classes=2, 
    # flip_y - high value to add more noise
    flip_y=0.1, 
    # class_sep - low value to reduce space between classes
    class_sep=0.5
)

# Check label class distribution
pd.DataFrame(y).value_counts()

1    508
0    492
dtype: int64

The labels 0 and 1 have an almost equal number of observations. So we still have balanced classes.

Classifying The Harder Dataset

Let’s again build a RandomForestClassifier model with default hyperparameters.

This time, we’ll train the model on the harder dataset we just created:

classifier = RandomForestClassifier() 
 
scores = cross_validate(
    classifier, X, y, cv=10, 
    scoring=['accuracy', 'precision', 'recall', 'f1']
)
 
scores = pd.DataFrame(scores)
scores.mean()

fit_time          0.138662
score_time        0.007333
test_accuracy     0.756000
test_precision    0.764619
test_recall       0.760196
test_f1           0.759281
dtype: float64

Accuracy, Precision, Recall, and F1 Score for this model are around 75-76%. That’s a sharp decrease from 88% for the model trained using the easier dataset.

The custom values for parameters flip_y and class_sep worked! They created a dataset that’s harder to classify.²
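To isolate the effect of class_sep, you could sweep it while holding everything else fixed. The following is a minimal sketch, not a benchmark: the seed, the number of trees, and the five folds are arbitrary choices, and the exact scores will vary with them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

accuracy_by_sep = {}
for sep in [0.2, 0.5, 1.0, 2.0]:
    # Same settings and seed each time; only class_sep changes
    X_sep, y_sep = make_classification(
        n_samples=1000, n_features=5, n_informative=3, n_classes=2,
        flip_y=0.01, class_sep=sep, random_state=0
    )
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    accuracy_by_sep[sep] = cross_val_score(
        model, X_sep, y_sep, cv=5, scoring="accuracy"
    ).mean()

for sep, acc in accuracy_by_sep.items():
    print(f"class_sep={sep}: mean accuracy {acc:.3f}")
```

Accuracy should generally rise as class_sep grows, since the class clusters move further apart.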

Imbalanced Dataset

So far, we have created datasets with a roughly equal number of observations assigned to each label class.

What if you wanted a dataset with imbalanced classes? That is, a dataset where one of the label classes occurs rarely?

You can use the parameter weights to control the ratio of observations assigned to each class.

In the code below, the function make_classification() assigns class 0 to 97% of the observations. It’ll label the remaining observations (3%) with class 1.

X, y = make_classification(
    # the usual parameters
    n_samples=1000, n_features=5, n_informative=3, n_classes=2, 
    # Assign class 0 to 97% of observations and class 1 to the remaining 3%
    weights=[0.97], 
)

Let’s confirm the class imbalance:

pd.DataFrame(y).value_counts()

0    964
1     36
dtype: int64

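Sure enough, make_classification() assigned 3.6% of the observations to class 1, close to the 3% we asked for. The small difference comes from the default label noise (flip_y=0.01), which reassigns a few labels at random.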
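The class proportions track weights more closely when label noise is switched off. The sketch below assumes flip_y=0 and an arbitrary seed; it counts the labels with NumPy rather than pandas.

```python
import numpy as np
from sklearn.datasets import make_classification

_, y_exact = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    weights=[0.97],
    flip_y=0,  # no label noise, so counts follow the weights closely
    random_state=0
)

# Number of observations per class, indexed by class label
counts = np.bincount(y_exact)
print(dict(enumerate(counts.tolist())))
```

The count for class 1 should now sit at or very near 30 out of 1,000.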

Classifying Imbalanced Dataset

As before, we’ll create a RandomForestClassifier model with default hyperparameters.

And then train it on the imbalanced dataset:

classifier = RandomForestClassifier() 
 
scores = cross_validate(
    classifier, X, y, cv=10, 
    scoring=['accuracy', 'precision', 'recall', 'f1']
)
 
scores = pd.DataFrame(scores)
scores.mean()

fit_time          0.101848
score_time        0.006896
test_accuracy     0.964000
test_precision    0.250000
test_recall       0.083333
test_f1           0.123333
dtype: float64

We see something funny here. Our model has high Accuracy (96%) but ridiculously low Precision and Recall (25% and 8%)!

This is a classic case of the Accuracy Paradox, which is common with imbalanced classes. A model can score high Accuracy by predicting the majority class almost every time while still missing most of the rare class.
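One way to see why Accuracy misleads here is to score a baseline that ignores the features entirely. The sketch below uses Scikit-Learn's DummyClassifier with the most_frequent strategy on a similarly imbalanced dataset. The seed is arbitrary, and balanced accuracy is included only as one example of a metric that is less sensitive to imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

X_imb, y_imb = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    weights=[0.97], random_state=0
)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_imb, y_imb)
y_pred = baseline.predict(X_imb)

baseline_accuracy = accuracy_score(y_imb, y_pred)
baseline_recall = recall_score(y_imb, y_pred)  # recall for class 1
baseline_balanced = balanced_accuracy_score(y_imb, y_pred)

print(f"Accuracy:          {baseline_accuracy:.3f}")
print(f"Recall (class 1):  {baseline_recall:.3f}")
print(f"Balanced accuracy: {baseline_balanced:.3f}")
```

The baseline reaches Accuracy in the same range as the random forest without finding a single observation of class 1, which is why Precision, Recall, and F1 are the metrics to watch on imbalanced data.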

Multiclass Dataset

So far, we have created labels with only two possible values.

What if you wanted to experiment with multiclass datasets where the label can take more than two values?

You can do that using the parameter n_classes. The code below creates a label with 3 classes:

X, y = make_classification(
    # same parameters as usual 
    n_samples=1000, n_features=5, n_informative=3,
    # create target label with 3 classes
    n_classes=3, 
)

Let’s confirm that the label indeed has 3 classes (0, 1, and 2):

pd.DataFrame(y).value_counts()

1    334
2    333
0    333
dtype: int64

We have balanced classes as well. All three of them have roughly the same number of observations.
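If you want to score a classifier on a three-class dataset like this one, keep in mind that the scoring names 'precision', 'recall', and 'f1' used earlier assume a binary target. A minimal sketch, assuming macro averaging suits your use case, swaps in the averaged variants. The seed, the number of trees, and the fold count are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X_multi, y_multi = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=3,
    random_state=0
)

multi_scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_multi, y_multi, cv=5,
    # macro-averaged metrics treat every class equally
    scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
)

multi_summary = pd.DataFrame(multi_scores).mean().round(4)
print(multi_summary)
```

Macro averaging computes each metric per class and then takes the unweighted mean, so a rare class counts as much as a common one.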

Multiclass Imbalanced Dataset

You can easily create datasets with imbalanced multiclass labels. Just use the parameter n_classes along with weights.

In the code below, we ask make_classification() to assign only 4% of observations to class 0 and to divide the rest equally between the remaining classes (48% each).

X, y = make_classification(
    # same parameters as usual 
    n_samples=1000, n_features=5, n_informative=3,
    # create target label with 3 classes
    n_classes=3, 
    # assign 4% of rows to class 0, 48% to class 1
    # and the rest to class 2
    weights=[0.04, 0.48]
)

Let’s confirm the class imbalance:

pd.DataFrame(y).value_counts()

2    479
1    477
0     44
dtype: int64

Looks good. Class 0 has only 44 observations out of 1,000!

Summary & Next Steps

You should now be able to generate different datasets using Python and Scikit-Learn’s make_classification() function.

You know how to create binary or multiclass datasets. You know the exact parameters to produce challenging datasets.

Lastly, you can generate datasets with imbalanced classes as well.

To gain more practice with make_classification(), you can try the parameters we didn’t cover today. Specifically, explore shift and scale.

By default, make_classification() creates numerical features with similar scales. You can use the parameters shift and scale to control the distribution for each feature.
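As a starting point, here is a small sketch of shift and scale with one value per feature. The specific values are arbitrary, and shuffle=False is assumed so that the columns of the two datasets line up. Scaling is applied after shifting, so each custom feature equals (original + shift) * scale.

```python
import numpy as np
from sklearn.datasets import make_classification

common = dict(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    shuffle=False, random_state=0
)

# Defaults: shift=0.0 and scale=1.0 leave the features untouched
X_base, _ = make_classification(**common)

# One shift and one scale value per feature (arbitrary illustrative values)
shift = np.array([0.0, 5.0, -5.0, 10.0, 100.0])
scale = np.array([1.0, 10.0, 100.0, 0.1, 1000.0])
X_custom, _ = make_classification(shift=shift, scale=scale, **common)

print("Std per feature (default):", np.round(X_base.std(axis=0), 2))
print("Std per feature (custom): ", np.round(X_custom.std(axis=0), 2))
```

Passing None for either parameter asks make_classification() to draw the shift or scale values at random instead.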

Once you’ve created features with vastly different scales, check out how to handle them.

¹ Confirm this by building two models: one with all the inputs and another with only the informative inputs (generate the data with shuffle=False so that the informative features come first). Use the same hyperparameters and values for both models. You should see little or no difference in their test performance.
² You can perform better on the more challenging dataset by tweaking the classifier’s hyperparameters.

Title Image by Desertrose7
