Introduction
In the last article, we learned about K-Fold Cross-Validation, a technique to estimate the predictive power of a machine learning model.
It’s time to put all that theory into practice. We’ll implement K-Fold Cross-Validation using two popular methods from the Scikit-Learn library.
First, let me introduce the dataset we’ll use in this article.
Banknote Dataset
The banknote authentication dataset contains features extracted from images of 1372 genuine and forged banknotes. We want to build a machine learning model to predict whether a given note is real or fake.
Let’s load the dataset using pandas and examine it:
import pandas as pd
# Load the dataset
dataset = pd.read_csv("banknote_auth.csv")
# Pick 5 rows at random and print them
dataset.sample(5)
| | variance | skewness | curtosis | entropy | authentic |
|---|---|---|---|---|---|
| 833 | -2.8267 | -9.04070 | 9.0694 | -0.98233 | 1 |
| 0 | 3.6216 | 8.66610 | -2.8073 | -0.44699 | 0 |
| 894 | -1.8391 | -9.08830 | 9.2416 | -0.10432 | 1 |
| 136 | 5.4380 | 9.46690 | -4.9417 | -3.92020 | 0 |
| 1235 | -3.5359 | 0.30417 | 0.6569 | -0.29570 | 1 |
The first four columns (variance, skewness, curtosis, entropy) contain the features extracted from each banknote image using a wavelet transform tool. These will be the inputs to our model.
The last column, authentic, tells us whether the banknote is fake (0) or genuine (1). It will be the output of our model.
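Before building a model, it’s worth running a couple of quick sanity checks on the data. Here’s a minimal sketch (it assumes the column names shown above):
# How many genuine (1) vs fake (0) notes do we have?
print(dataset['authentic'].value_counts())
# Are there any missing values in any column?
print(dataset.isna().sum())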
Build the Classification Model
Let’s separate the features and the output labels:
features = dataset.drop(columns='authentic')
label = dataset['authentic']
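A quick shape check confirms the split looks right. Since the dataset has 1372 rows and four feature columns, we expect shapes of (1372, 4) and (1372,):
# Expect (1372, 4) for the features and (1372,) for the labels
print(features.shape, label.shape)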
We’ll build the classification model using Scikit-Learn’s RandomForestClassifier. The random forest will have ten trees, each with a maximum depth of 2:
from sklearn.ensemble import RandomForestClassifier
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
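One note before we move on: a random forest is a stochastic model, so the exact scores you get will vary slightly from run to run. If you want reproducible results, you can fix the seed via the random_state parameter (the value 42 below is an arbitrary choice):
# Optional: fix the seed so repeated runs produce the same scores
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10, random_state=42)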
Next, we’ll evaluate the performance of this model using Scikit-Learn’s built-in Cross-Validation functions - cross_val_score() and cross_validate().
cross_val_score()
Scikit-Learn’s helper function cross_val_score() provides a simple implementation of K-Fold Cross-Validation.
This function performs all the necessary steps - it splits the given dataset into K folds, builds multiple models (one for each fold), and evaluates them to provide test scores.
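To make those steps concrete, here’s a rough sketch of what happens under the hood for a classifier. It’s an illustration rather than the library’s exact code; note that Scikit-Learn uses stratified folds by default when the estimator is a classifier:
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=4)
fold_scores = []
for train_idx, test_idx in kfold.split(features, label):
    model = clone(rfclassifier)  # fresh, unfitted copy of the model for each fold
    model.fit(features.iloc[train_idx], label.iloc[train_idx])
    predictions = model.predict(features.iloc[test_idx])
    fold_scores.append(accuracy_score(label.iloc[test_idx], predictions))
print(fold_scores)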
Here are the key parameters of cross_val_score():
- estimator: the model you want to train and evaluate.
- X: the input features to train the model.
- y: the output labels associated with the input features.
- cv: the number of folds. We’ll set it to 4.
- scoring: the metric to evaluate the model. You can pass only one metric to cross_val_score(). We’ll use accuracy as we are dealing with a classification problem.
from sklearn.model_selection import cross_val_score
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
scores = cross_val_score(
estimator=rfclassifier, # model to evaluate
X=features, # input features
y=label, # output labels
cv=4, # how many folds
scoring='accuracy' # model evaluation metric
)
cross_val_score() returns an array containing the accuracy score for each fold:
scores.round(4)
array([0.9417, 0.9038, 0.9184, 0.9096])
The accuracy ranges from 90% to 94% across the folds. We can report that a random forest model with ten trees of maximum depth 2 classifies banknotes with 90-94% accuracy.
Or we can report the mean accuracy:
scores.mean().round(4)
0.9184
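Besides the mean, it’s common to report the standard deviation across folds to convey how much the score varies:
# Mean accuracy plus the spread across the 4 folds
print(f"Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")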
cross_validate()
Scikit-Learn’s function cross_validate() improves upon cross_val_score() by adding a few useful features:
- It can evaluate models using multiple metrics.
- It returns the time taken during the training and evaluation steps.
We’ll use the same parameters we passed to cross_val_score() in the last section. The only difference is that we’ll pass a list of classification metrics to the scoring parameter.
from sklearn.model_selection import cross_validate
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
scores = cross_validate(
estimator=rfclassifier, # model to evaluate
X=features, # input features
y=label, # output labels
cv=4, # how many folds
# list of model evaluation metrics
scoring=['accuracy', 'precision', 'recall'],
)
The function cross_validate() returns a dictionary. Let’s convert it to a pandas DataFrame and print it:
scores = pd.DataFrame(scores)
scores.round(4)
| | fit_time | score_time | test_accuracy | test_precision | test_recall |
|---|---|---|---|---|---|
| 0 | 0.0116 | 0.0031 | 0.8513 | 0.8061 | 0.8750 |
| 1 | 0.0101 | 0.0031 | 0.8863 | 0.8844 | 0.8553 |
| 2 | 0.0107 | 0.0030 | 0.9446 | 0.9589 | 0.9150 |
| 3 | 0.0094 | 0.0027 | 0.9563 | 0.9367 | 0.9673 |
We can see evaluation metrics for each fold (test_accuracy, test_precision, and test_recall). The output also shows the time taken during training (fit_time) and evaluation (score_time) steps.
Finally, you can get the mean values for reporting purposes:
scores.mean().round(4)
fit_time 0.0105
score_time 0.0030
test_accuracy 0.9096
test_precision 0.8965
test_recall 0.9032
dtype: float64
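One more option worth knowing about: cross_validate() can also return the model fitted on each fold if you set return_estimator=True. For example, here’s a small sketch that inspects the feature importances learned on each fold:
results = cross_validate(
    rfclassifier,
    X=features,
    y=label,
    cv=4,
    scoring='accuracy',
    return_estimator=True,  # keep the fitted model from each fold
)
# One fitted RandomForestClassifier per fold
for i, model in enumerate(results['estimator'], start=1):
    print(f"Fold {i} feature importances:", model.feature_importances_.round(3))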
Customizations
You may want to change the labels of the output metrics. You can prepare a dictionary and pass it to the scoring parameter. In the example below, I’ve added the prefix cls_ to all the metrics.
If you also want to compute metrics on the training data, set the parameter return_train_score to True, like below:
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
# Custom labels for the metrics
scoring = {'cls_accuracy': 'accuracy',
'cls_precision': 'precision',
'cls_recall': 'recall'}
scores = cross_validate(
rfclassifier,
X=features,
y=label,
cv=4,
scoring=scoring,
# include train scores
return_train_score=True
)
Let’s convert the output scores to a pandas DataFrame and transpose it so it prints nicely:
scores = pd.DataFrame(scores, index=range(1, 5))
scores.index.name = 'Fold'
scores.columns.name = 'Scores'
scores.round(4).transpose()
| Fold | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| fit_time | 0.0105 | 0.0100 | 0.0100 | 0.0096 |
| score_time | 0.0030 | 0.0029 | 0.0028 | 0.0027 |
| test_cls_accuracy | 0.9446 | 0.9592 | 0.9184 | 0.9038 |
| train_cls_accuracy | 0.9466 | 0.9699 | 0.9320 | 0.9009 |
| test_cls_precision | 0.9463 | 0.9600 | 0.9310 | 0.9286 |
| train_cls_precision | 0.9549 | 0.9631 | 0.9448 | 0.9383 |
| test_cls_recall | 0.9276 | 0.9474 | 0.8824 | 0.8497 |
| train_cls_recall | 0.9236 | 0.9694 | 0.8993 | 0.8315 |
And print the mean scores for reporting purposes:
scores.mean().round(4)
Scores
fit_time 0.0100
score_time 0.0028
test_cls_accuracy 0.9315
train_cls_accuracy 0.9373
test_cls_precision 0.9415
train_cls_precision 0.9503
test_cls_recall 0.9018
train_cls_recall 0.9060
dtype: float64
The test accuracy, precision, and recall are in the 90 - 94% range. That’s not bad for the very first model we built. I’ll leave it to you as an exercise to improve performance by refining the model hyperparameters.
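If you want a starting point for that exercise, GridSearchCV combines Cross-Validation with a search over hyperparameter values. Here’s a minimal sketch; the grid values below are arbitrary examples, not tuned recommendations:
from sklearn.model_selection import GridSearchCV
# Candidate hyperparameter values to try (each combination is cross-validated)
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [2, 4, 8],
}
search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=4,
    scoring='accuracy',
)
search.fit(features, label)
print(search.best_params_, round(search.best_score_, 4))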
What’s Next?
Now you are familiar with the basics of K-Fold Cross-Validation. And you know how to implement it using Python and Scikit-Learn’s helper functions.
Here are a few things you can do to solidify what you’ve learned today:
- Check out the official documentation for cross_val_score() and cross_validate(). Explore the parameters that we didn’t cover in this post.
- Run Cross-Validation for a regression task (for a hint, see the sketch after this list).
- Implement Cross-Validation from scratch in Python.
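If you try the regression exercise, keep in mind that classification metrics like accuracy don’t apply there. Here’s a minimal sketch using Scikit-Learn’s built-in diabetes dataset and the r2 scorer; neither is used elsewhere in this article, they’re just convenient for illustration:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# A small regression dataset that ships with Scikit-Learn
X_reg, y_reg = load_diabetes(return_X_y=True)
# R-squared score for each of the 4 folds
reg_scores = cross_val_score(LinearRegression(), X_reg, y_reg, cv=4, scoring='r2')
print(reg_scores.round(4))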