K-Fold Cross-Validation Using Python and Scikit-Learn
Introduction
In the last article, we learned about K-Fold Cross-Validation, a technique to estimate the predictive power of a machine learning model.
It’s time to put all that theory into practice. We’ll implement K-Fold Cross-Validation using two popular methods from the Scikit-Learn library.
First, let me introduce the dataset we’ll use in this article.
Banknote Dataset
The banknote authentication data set contains data from images of 1372 genuine and forged banknotes. We want to build a machine learning model to predict whether a given note is real or fake.
Let’s load the dataset using pandas and examine it:
import pandas as pd
# Load the dataset
dataset = pd.read_csv("banknote_auth.csv")
# Pick 5 rows at random and print them
dataset.sample(5)
|      | variance | skewness | curtosis | entropy  | authentic |
|------|----------|----------|----------|----------|-----------|
| 833  | -2.8267  | -9.04070 |  9.0694  | -0.98233 | 1         |
| 0    |  3.6216  |  8.66610 | -2.8073  | -0.44699 | 0         |
| 894  | -1.8391  | -9.08830 |  9.2416  | -0.10432 | 1         |
| 136  |  5.4380  |  9.46690 | -4.9417  | -3.92020 | 0         |
| 1235 | -3.5359  |  0.30417 |  0.6569  | -0.29570 | 1         |
The first four columns (variance, skewness, curtosis, entropy) contain the features extracted from each banknote image using the wavelet transform tool. These will be the inputs to our model.
The last column, authentic, tells us whether the banknote is fake (0) or genuine (1). It will be the output of our model.
Build the Classification Model
Let’s separate the features and the output labels:
features = dataset.drop(columns='authentic')
label = dataset['authentic']
We’ll build the classification model using Scikit-Learn’s RandomForestClassifier. The random forest will have ten trees, each with a maximum depth of 2:
from sklearn.ensemble import RandomForestClassifier
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
Next, we’ll evaluate the performance of this model using Scikit-Learn’s built-in Cross-Validation functions - cross_val_score() and cross_validate().
cross_val_score()
Scikit-Learn’s helper function cross_val_score() provides a simple implementation of K-Fold Cross-Validation.
This function performs all the necessary steps - it splits the given dataset into K folds, builds multiple models (one for each fold), and evaluates them to provide test scores.
Here are the key parameters of cross_val_score():
- estimator: the model you want to train and evaluate.
- X: the input features to train the model.
- y: the output labels associated with the input features.
- cv: the number of folds. We’ll set it to 4.
- scoring: the metric to evaluate the model. You can pass only one metric to cross_val_score(). We’ll use accuracy as we are dealing with a classification problem.
from sklearn.model_selection import cross_val_score
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
scores = cross_val_score(
    estimator=rfclassifier,  # model to evaluate
    X=features,              # input features
    y=label,                 # output labels
    cv=4,                    # how many folds
    scoring='accuracy'       # model evaluation metric
)
cross_val_score() returns a NumPy array containing the accuracy score for each fold:
scores.round(4)
array([0.9417, 0.9038, 0.9184, 0.9096])
The accuracy ranges from roughly 90% to 94%. We can report that a random forest model with ten trees, each with a maximum depth of 2, classifies banknotes with 90-94% accuracy.
Or we can report the mean accuracy:
scores.mean().round(4)
0.9184
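For intuition, here is a rough sketch of what cross_val_score() does behind the scenes. Note that for a classifier and an integer cv, Scikit-Learn actually uses stratified folds, so the sketch below uses StratifiedKFold; it is illustrative, not an exact reimplementation.

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
# Split the rows into 4 stratified folds
kfold = StratifiedKFold(n_splits=4)
manual_scores = []
for train_idx, test_idx in kfold.split(features, label):
    # Train a fresh, untrained copy of the model on the training folds
    fold_model = clone(rfclassifier)
    fold_model.fit(features.iloc[train_idx], label.iloc[train_idx])
    # Evaluate it on the held-out fold
    predictions = fold_model.predict(features.iloc[test_idx])
    manual_scores.append(accuracy_score(label.iloc[test_idx], predictions))

Each iteration trains a fresh copy of the model, so the held-out fold never leaks into training.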
cross_validate()
Scikit-Learn’s function cross_validate() improves upon cross_val_score() by adding a few useful features:
- It can evaluate models using multiple metrics.
- It returns the time taken during the training and evaluation steps.
We’ll use the same parameters we passed to cross_val_score() in the last section. The only difference is that we’ll pass a list of classification metrics to the scoring parameter.
from sklearn.model_selection import cross_validate
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
scores = cross_validate(
    estimator=rfclassifier,  # model to evaluate
    X=features,              # input features
    y=label,                 # output labels
    cv=4,                    # how many folds
    # list of model evaluation metrics
    scoring=['accuracy', 'precision', 'recall'],
)
The function cross_validate() returns a dictionary. Let’s convert it to a pandas DataFrame and print it:
scores = pd.DataFrame(scores)
scores.round(4)
|   | fit_time | score_time | test_accuracy | test_precision | test_recall |
|---|----------|------------|---------------|----------------|-------------|
| 0 | 0.0116   | 0.0031     | 0.8513        | 0.8061         | 0.8750      |
| 1 | 0.0101   | 0.0031     | 0.8863        | 0.8844         | 0.8553      |
| 2 | 0.0107   | 0.0030     | 0.9446        | 0.9589         | 0.9150      |
| 3 | 0.0094   | 0.0027     | 0.9563        | 0.9367         | 0.9673      |
We can see evaluation metrics for each fold (test_accuracy, test_precision, and test_recall). The output also shows the time taken during training (fit_time) and evaluation (score_time) steps.
Finally, you can get the mean values for reporting purposes:
scores.mean().round(4)
fit_time 0.0105
score_time 0.0030
test_accuracy 0.9096
test_precision 0.8965
test_recall 0.9032
dtype: float64
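The mean alone hides the spread across folds, so it is often worth reporting the standard deviation alongside it. One way to do that with the same DataFrame:

# Mean and standard deviation of every metric across the 4 folds
scores.agg(['mean', 'std']).round(4)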
Customizations
You may want to change the labels of the output metrics. You can prepare a dictionary and pass it to the scoring parameter. In the example below, I’ve added a prefix, cls_, to all the metrics.
If you also want to compute metrics for the training data, set the parameter return_train_score to True, as shown below:
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
# Custom labels for the metrics
scoring = {'cls_accuracy': 'accuracy',
           'cls_precision': 'precision',
           'cls_recall': 'recall'}
scores = cross_validate(
    rfclassifier,
    X=features,
    y=label,
    cv=4,
    scoring=scoring,
    # include train scores
    return_train_score=True
)
Let’s convert the output scores to a pandas DataFrame and transpose it to print it nicely:
scores = pd.DataFrame(scores, index=range(1, 5))
scores.index.name = 'Fold'
scores.columns.name = 'Scores'
scores.round(4).transpose()
| Scores              | Fold 1 | Fold 2 | Fold 3 | Fold 4 |
|---------------------|--------|--------|--------|--------|
| fit_time            | 0.0105 | 0.0100 | 0.0100 | 0.0096 |
| score_time          | 0.0030 | 0.0029 | 0.0028 | 0.0027 |
| test_cls_accuracy   | 0.9446 | 0.9592 | 0.9184 | 0.9038 |
| train_cls_accuracy  | 0.9466 | 0.9699 | 0.9320 | 0.9009 |
| test_cls_precision  | 0.9463 | 0.9600 | 0.9310 | 0.9286 |
| train_cls_precision | 0.9549 | 0.9631 | 0.9448 | 0.9383 |
| test_cls_recall     | 0.9276 | 0.9474 | 0.8824 | 0.8497 |
| train_cls_recall    | 0.9236 | 0.9694 | 0.8993 | 0.8315 |
And print the mean scores for reporting purposes:
scores.mean().round(4)
Scores
fit_time 0.0100
score_time 0.0028
test_cls_accuracy 0.9315
train_cls_accuracy 0.9373
test_cls_precision 0.9415
train_cls_precision 0.9503
test_cls_recall 0.9018
train_cls_recall 0.9060
dtype: float64
The test accuracy, precision, and recall are in the 90 - 94% range. That’s not bad for the very first model we built. I’ll leave it to you as an exercise to improve performance by refining the model hyperparameters.
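If you want a starting point for that exercise, Scikit-Learn’s GridSearchCV runs K-Fold Cross-Validation for every combination of hyperparameters you list and keeps the best one. The grid below is only an illustrative guess, not a tuned recommendation:

from sklearn.model_selection import GridSearchCV
# Hypothetical grid of hyperparameter values to try
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [2, 4, 8],
}
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=4,                 # 4-fold Cross-Validation for each combination
    scoring='accuracy',
)
grid_search.fit(features, label)
grid_search.best_params_  # best hyperparameter combination found
grid_search.best_score_   # its mean cross-validated accuracy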
What’s Next?
Now you are familiar with the basics of K-Fold Cross-Validation. And you know how to implement it using Python and Scikit-Learn’s helper functions.
Here are a few things you can do to solidify what you’ve learned today:
- Check out the official documentation for cross_val_score() and cross_validate(). Explore the parameters that we didn’t cover in this post.
- Run Cross-Validation for a regression task (a minimal sketch follows this list).
- Implement Cross-Validation from scratch in Python.
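For the regression task, here is a minimal sketch to get you started. It uses Scikit-Learn’s bundled diabetes dataset and a RandomForestRegressor purely as placeholders; the point is that only the model and the scoring metric change:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Any regression dataset works; the bundled diabetes dataset keeps the example self-contained
X_reg, y_reg = load_diabetes(return_X_y=True)
reg_scores = cross_val_score(
    estimator=RandomForestRegressor(n_estimators=10, max_depth=4),
    X=X_reg,
    y=y_reg,
    cv=4,
    scoring='neg_mean_squared_error',  # a regression metric instead of accuracy
)
# Scikit-Learn negates error metrics so that higher is always better;
# flip the sign back to read them as mean squared error per fold
(-reg_scores).round(2)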