How to Generate Datasets Using make_classification
Introduction
Imagine you just learned about a new classification algorithm and want to explore it further. Maybe you'd like to experiment with its hyperparameters to see how they affect performance.
The only problem is - you can’t find a good dataset to experiment with.
Don't fret. Scikit-Learn has a function just for this!
You can use make_classification() to create a variety of classification datasets. Here are a few possibilities:
- Generate binary or multiclass labels.
- Create labels with balanced or imbalanced classes.
- Produce a dataset that’s harder to classify.
Let’s create a few such datasets. We’ll also build RandomForestClassifier models to classify a few of them.
A First Look
Here are the basic input parameters for the function make_classification():
- n_samples: How many observations do you want to generate?
- n_features: The number of numerical features.
- n_informative: The number of features that are 'useful.' Only these features carry the signal that your model will use to classify the dataset.
- n_classes: The number of unique classes (values) for the target label.
The function will return a tuple containing two NumPy arrays - the features (X) and the corresponding labels (y).
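To confirm what the function returns, here's a minimal sanity check (this snippet is an addition, not part of the original walkthrough); the shapes follow directly from n_samples and n_features:
from sklearn.datasets import make_classification

# A small throwaway dataset, just to inspect the returned arrays
X, y = make_classification(n_samples=100, n_features=4, n_informative=2, n_classes=2)

print(type(X), X.shape)  # <class 'numpy.ndarray'> (100, 4)
print(type(y), y.shape)  # <class 'numpy.ndarray'> (100,)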
We’ll explore other parameters as we need them.
Binary Classification Dataset
Let’s generate a dataset with a binary label. That is, a label with only two possible values - 0 or 1.
To do so, set the value of the parameter n_classes to 2.
We’ll create a dataset with 1,000 observations. It’ll have five features, out of which three will be informative. The other two features will be redundant.
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=1000,    # 1000 observations
    n_features=5,      # 5 total features
    n_informative=3,   # 3 'useful' features
    n_classes=2,       # binary target/label
    random_state=999   # if you want the same results as mine
)
Let’s convert the output of make_classification() into a pandas DataFrame. It’s easier to analyze a DataFrame than raw NumPy arrays.
import pandas as pd
# Create DataFrame with features as columns
dataset = pd.DataFrame(X)
# give custom names to the features
dataset.columns = ['X1', 'X2', 'X3', 'X4', 'X5']
# Now add the label as a column
dataset['y'] = y
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X1 1000 non-null float64
1 X2 1000 non-null float64
2 X3 1000 non-null float64
3 X4 1000 non-null float64
4 X5 1000 non-null float64
5 y 1000 non-null int64
dtypes: float64(5), int64(1)
memory usage: 47.0 KB
As expected, the dataset has 1,000 observations, five features (X1, X2, X3, X4, and X5), and the corresponding target label (y).
We set the parameter n_informative to 3, so three features carry independent signal; the other two are redundant, i.e., random linear combinations of the informative ones.¹ Note that make_classification() shuffles the feature columns by default (shuffle=True), so the informative features won't necessarily be X1, X2, and X3; pass shuffle=False if you want them to come first.
Next, check the unique values and their counts for the label y:
dataset['y'].value_counts()
1 502
0 498
Name: y, dtype: int64
The label has only two possible values (0 and 1). So it’s a binary classification dataset.
Moreover, the counts for both values are roughly equal. Thus, the label has balanced classes.
Here are the first five observations from the dataset:
dataset.head()
|   | X1        | X2        | X3       | X4        | X5        | y |
|---|-----------|-----------|----------|-----------|-----------|---|
| 0 | 2.501284  | -0.159155 | 0.672438 | 3.469991  | 0.949268  | 0 |
| 1 | 2.203247  | -0.331271 | 0.794319 | 3.259963  | 0.832451  | 0 |
| 2 | -1.524573 | -0.870737 | 1.004304 | -1.028624 | -0.717383 | 1 |
| 3 | 1.801498  | 3.106336  | 1.490633 | -0.297404 | -0.607484 | 0 |
| 4 | -0.125146 | 0.987915  | 0.880293 | -0.937299 | -0.626822 | 0 |
An Example Classifier
The generated dataset looks good. Now let’s create a RandomForestClassifier model with default hyperparameters.
We’ll use Cross-Validation and measure the model’s score on key classification metrics:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
# initialize classifier
classifier = RandomForestClassifier()
# Run cross validation with 10 folds
scores = cross_validate(
    classifier, X, y, cv=10,
    # measure score for a list of classification metrics
    scoring=['accuracy', 'precision', 'recall', 'f1']
)
scores = pd.DataFrame(scores)
scores.mean().round(4)
fit_time 0.1201
score_time 0.0072
test_accuracy 0.8820
test_precision 0.8829
test_recall 0.8844
test_f1 0.8827
dtype: float64
The model’s Accuracy, Precision, Recall, and F1 Score are around 88%. Not bad for a model built without any hyperparameter tuning!
A Harder Dataset
Let’s create a dataset that won’t be so easy to classify.
You can control the difficulty level of a dataset using the following parameters of make_classification():
- flip_y: adds noise by flipping a few labels. For example, change a few labels from 0 to 1 and vice versa. A higher value flips more labels and thus adds more noise. The default value is 0.01.
- class_sep: controls the space between the label classes. A smaller value reduces the space and thus makes classification harder. The default value is 1.0.

We'll use a higher value for flip_y and a lower value for class_sep to create a challenging dataset.
X, y = make_classification(
    # same as the previous section
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    # flip_y - high value to add more noise
    flip_y=0.1,
    # class_sep - low value to reduce space between classes
    class_sep=0.5
)
# Check label class distribution
pd.DataFrame(y).value_counts()
1 508
0 492
dtype: int64
The labels 0 and 1 have an almost equal number of observations, so we still have balanced classes.
Classifying The Harder Dataset
Let’s again build a RandomForestClassifier model with default hyperparameters.
This time, we’ll train the model on the harder dataset we just created:
classifier = RandomForestClassifier()
scores = cross_validate(
    classifier, X, y, cv=10,
    scoring=['accuracy', 'precision', 'recall', 'f1']
)
scores = pd.DataFrame(scores)
scores.mean()
fit_time 0.138662
score_time 0.007333
test_accuracy 0.756000
test_precision 0.764619
test_recall 0.760196
test_f1 0.759281
dtype: float64
Accuracy, Precision, Recall, and F1 Score for this model are around 75-76%. That’s a sharp decrease from 88% for the model trained using the easier dataset.
The custom values for the parameters flip_y and class_sep worked! They created a dataset that's harder to classify.²
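If you want to see the difficulty directly, you can plot two of the features against each other and color the points by class. This is a small optional sketch using matplotlib (an addition to the original walkthrough); with a low class_sep and some flipped labels, expect the two clouds of points to overlap noticeably.
import matplotlib.pyplot as plt

# Scatter plot of the first two features, colored by class label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', alpha=0.6)
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('Harder dataset: overlapping classes')
plt.show()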
Imbalanced Dataset
So far, we have created datasets with a roughly equal number of observations assigned to each label class.
What if you wanted a dataset with imbalanced classes? That is, a dataset where one of the label classes occurs rarely?
You can use the parameter weights to control the ratio of observations assigned to each class.
In the code below, the function make_classification() assigns class 0 to 97% of the observations. It’ll label the remaining observations (3%) with class 1.
X, y = make_classification(
    # the usual parameters
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    # Set label 0 for 97% and 1 for the rest (3%) of observations
    weights=[0.97],
)
Let’s confirm the class imbalance:
pd.DataFrame(y).value_counts()
0 964
1 36
dtype: int64
Sure enough, make_classification() assigned about 3% of the observations to class 1.
Classifying Imbalanced Dataset
As before, we’ll create a RandomForestClassifier model with default hyperparameters.
And then train it on the imbalanced dataset:
classifier = RandomForestClassifier()
scores = cross_validate(
    classifier, X, y, cv=10,
    scoring=['accuracy', 'precision', 'recall', 'f1']
)
scores = pd.DataFrame(scores)
scores.mean()
fit_time 0.101848
score_time 0.006896
test_accuracy 0.964000
test_precision 0.250000
test_recall 0.083333
test_f1 0.123333
dtype: float64
We see something funny here. Our model has high Accuracy (96%) but ridiculously low Precision and Recall (25% and 8%)!
This is a classic case of the Accuracy Paradox, which occurs whenever you deal with heavily imbalanced classes: the model can achieve high accuracy simply by predicting the majority class most of the time.
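To make that concrete, compare against a trivial baseline that always predicts the majority class. The sketch below uses scikit-learn's DummyClassifier; the comparison itself is an addition, not part of the original article.
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
import pandas as pd

# A baseline that ignores the features and always predicts
# the most frequent class (here, class 0)
baseline = DummyClassifier(strategy='most_frequent')

baseline_scores = cross_validate(
    baseline, X, y, cv=10,
    scoring=['accuracy', 'precision', 'recall', 'f1']
)

# Accuracy lands around 0.96 just by echoing the majority class,
# while precision, recall, and F1 for class 1 are all zero
print(pd.DataFrame(baseline_scores).mean().round(4))
Whenever your model's accuracy barely beats this kind of baseline, lean on precision, recall, and F1 (or a confusion matrix) instead.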
Multiclass Dataset
So far, we have created labels with only two possible values.
What if you wanted to experiment with multiclass datasets where the label can take more than two values?
You can do that using the parameter n_classes. The code below creates a label with 3 classes:
X, y = make_classification(
    # same parameters as usual
    n_samples=1000, n_features=5, n_informative=3,
    # create target label with 3 classes
    n_classes=3,
)
Let’s confirm that the label indeed has 3 classes (0, 1, and 2):
pd.DataFrame(y).value_counts()
1 334
2 333
0 333
dtype: int64
We have balanced classes as well. All three of them have roughly the same number of observations.
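We won't build a model for this dataset here, but if you want to, note that the binary scorers used earlier ('precision', 'recall', 'f1') don't apply directly when the label has more than two classes. Here's a minimal sketch (an addition to the article) using their macro-averaged variants:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
import pandas as pd

classifier = RandomForestClassifier()

# For multiclass labels, use macro-averaged metrics, which compute
# the score per class and then take the unweighted mean
scores = cross_validate(
    classifier, X, y, cv=10,
    scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
)
print(pd.DataFrame(scores).mean().round(4))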
Multiclass Imbalanced Dataset
You can easily create datasets with imbalanced multiclass labels. Just use the parameter n_classes along with weights.

In the code below, we ask make_classification() to assign only 4% of the observations to class 0 and to divide the rest roughly equally between the remaining two classes (48% each).
X, y = make_classification(
    # same parameters as usual
    n_samples=1000, n_features=5, n_informative=3,
    # create target label with 3 classes
    n_classes=3,
    # assign 4% of rows to class 0, 48% to class 1
    # and the rest to class 2
    weights=[0.04, 0.48]
)
Let’s confirm the class imbalance:
pd.DataFrame(y).value_counts()
2 479
1 477
0 44
dtype: int64
Looks good. Class 0 has only 44 observations out of 1,000!
Summary & Next Steps
You should now be able to generate different datasets using Python and Scikit-Learn’s make_classification() function.
You know how to create binary or multiclass datasets. You know the exact parameters to produce challenging datasets.
Lastly, you can generate datasets with imbalanced classes as well.
To gain more practice with make_classification(), you can try the parameters we didn’t cover today. Specifically, explore shift and scale.
By default, make_classification() creates numerical features with similar scales. You can use the parameters shift and scale to control the distribution for each feature.
Once you’ve created features with vastly different scales, check out how to handle them.
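As a starting point, here's a rough sketch of how shift and scale might be used; passing one value per feature gives each feature its own offset and multiplier, so the columns end up on very different ranges. The specific numbers below are purely illustrative.
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd

# Give each of the five features its own shift and scale so their
# ranges differ by orders of magnitude (illustrative values only)
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    shift=np.array([0.0, 10.0, 100.0, 1000.0, 10000.0]),
    scale=np.array([1.0, 5.0, 25.0, 125.0, 625.0]),
)

# Compare per-feature means and standard deviations
print(pd.DataFrame(X).describe().round(2))
If you then feed such data into a model that's sensitive to feature scale (e.g., k-nearest neighbors or logistic regression), standardizing the columns first, for instance with sklearn.preprocessing.StandardScaler, becomes important.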
Footnotes
1. Confirm this by building two models. One with all the inputs. Another with only the informative inputs. Use the same hyperparameters and their values for both models. You should not see any difference in their test performance.
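A rough sketch of that experiment (an addition, not from the original article): regenerate the dataset with shuffle=False so the informative features are guaranteed to be the first three columns, then compare cross-validated accuracy with and without the rest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# shuffle=False keeps the informative features in the first columns
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    shuffle=False, random_state=999
)

# Same model, two feature sets: all five columns vs. the informative three
all_features = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
informative_only = cross_val_score(RandomForestClassifier(random_state=0), X[:, :3], y, cv=10)

# The two mean accuracies should be very close
print(all_features.mean().round(4), informative_only.mean().round(4))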
2. You can perform better on the more challenging dataset by tweaking the classifier's hyperparameters.
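One way to do that tweaking, sketched with GridSearchCV over a small grid (both the grid and the values are illustrative, not a recommendation):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Recreate a harder dataset like the one from earlier
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    flip_y=0.1, class_sep=0.5, random_state=0
)

# A small, illustrative hyperparameter grid
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring='f1')
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))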