 # K-Fold Cross-Validation Using Python and Scikit-Learn

## Introduction

In the last article, we learned about K-Fold Cross-Validation, a technique to estimate the predictive power of a machine learning model.

It’s time to put all that theory into practice. We’ll implement K-Fold Cross-Validation using two popular methods from the Scikit-Learn library.

## Banknote Dataset

The banknote authentication dataset contains data extracted from images of 1,372 genuine and forged banknotes. We want to build a machine learning model that predicts whether a given note is real or fake.

Let’s load the dataset using pandas and examine it:

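``````
import pandas as pd

# Assumption: hypothetical local CSV with the column names shown below
dataset = pd.read_csv('banknote_authentication.csv')

# Pick 5 rows at random and print them
dataset.sample(5)
``````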

| | variance | skewness | curtosis | entropy | authentic |
|---|---|---|---|---|---|
| 833 | -2.8267 | -9.04070 | 9.0694 | -0.98233 | 1 |
| 0 | 3.6216 | 8.66610 | -2.8073 | -0.44699 | 0 |
| 894 | -1.8391 | -9.08830 | 9.2416 | -0.10432 | 1 |
| 136 | 5.4380 | 9.46690 | -4.9417 | -3.92020 | 0 |
| 1235 | -3.5359 | 0.30417 | 0.6569 | -0.29570 | 1 |

The first four columns (`variance`, `skewness`, `curtosis`, `entropy`) contain the features extracted from each banknote image using the wavelet transform tool. These will be the inputs to our model.

The last column, `authentic`, tells us whether the banknote is fake (0) or genuine (1). It will be the output of our model.

## Build the Classification Model

Let’s separate the features and the output labels:

``````
features = dataset.drop(columns='authentic')
label = dataset['authentic']
``````

We’ll build the classification model using Scikit-Learn’s RandomForestClassifier. The random forest will have ten trees, each with a maximum depth of 2:

``````
from sklearn.ensemble import RandomForestClassifier

rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)
``````

Next, we’ll evaluate the performance of this model using two of Scikit-Learn’s built-in Cross-Validation functions: cross_val_score() and cross_validate().

## cross_val_score()

Scikit-Learn’s helper function cross_val_score() provides a simple implementation of K-Fold Cross-Validation.

This function performs all the necessary steps: it splits the given dataset into K folds, builds multiple models (one for each fold), and evaluates them to provide test scores.
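To make those steps concrete, here is a minimal sketch of a hand-written K-Fold loop. It is an illustration under stated assumptions, not Scikit-Learn’s actual implementation: the data is a synthetic stand-in generated with `make_classification` so the snippet runs on its own, and the names `manual_scores` and `fold_model` are made up for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Assumption: synthetic stand-in data with four features (not the banknote data)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

model = RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0)
folds = StratifiedKFold(n_splits=4)  # 4 folds that preserve class proportions

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on the other 3 folds
    predictions = fold_model.predict(X[test_idx])
    manual_scores.append(accuracy_score(y[test_idx], predictions))

manual_scores = np.array(manual_scores)
print(manual_scores.round(4))
```

Each pass trains on three folds and scores on the held-out fold, so the loop yields one accuracy value per fold.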

Here are the key parameters of cross_val_score():

• `estimator`: the model you want to train and evaluate.
• `X`: the input features to train the model.
• `y`: the output labels associated with input features.
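• `cv`: the number of folds. We’ll set it to 4. For a classifier, an integer value makes Scikit-Learn use stratified folds, which keep the class proportions similar in every fold.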
• `scoring`: the metric to evaluate the model. You can pass only one metric to `cross_val_score()`. We’ll use accuracy as we are dealing with a classification problem.

``````
from sklearn.model_selection import cross_val_score

rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)

scores = cross_val_score(
    estimator=rfclassifier,  # model to evaluate
    X=features,              # input features
    y=label,                 # output labels
    cv=4,                    # how many folds
    scoring='accuracy'       # model evaluation metric
)
``````

cross_val_score() returns a NumPy array containing the accuracy score for each fold:

``````
scores.round(4)
``````
``````
array([0.9417, 0.9038, 0.9184, 0.9096])
``````

The accuracy ranges from 90% to 94% across the four folds. We can report that a random forest with ten trees, each with a maximum depth of 2, classifies banknotes with 90-94% accuracy.

Or we can report the mean accuracy:

``````
scores.mean().round(4)
``````
``````
0.9184
``````
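A random forest is randomized, so the scores change from run to run unless the random seeds are fixed. The sketch below shows one way to get repeatable scores and to report the spread alongside the mean. It assumes synthetic stand-in data from `make_classification`, and the seed value 42 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: synthetic stand-in data with four features (not the banknote data)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Fix the seed of the model and of the fold shuffling (42 is arbitrary)
model = RandomForestClassifier(max_depth=2, n_estimators=10, random_state=42)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

# Running the same evaluation again gives identical scores
repeat = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print((scores == repeat).all())
```

Reporting the standard deviation next to the mean gives readers a sense of how much the score varies between folds.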

## cross_validate()

Scikit-Learn’s function cross_validate() improves upon cross_val_score() by adding a few useful features:

• It can evaluate models using multiple metrics.
• It returns the time taken during the training and evaluation steps.

We’ll use the same parameters we passed in the last section for cross_val_score(). The only difference is that we’ll give a list of classification metrics to the parameter `scoring`.

``````
from sklearn.model_selection import cross_validate

rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)

scores = cross_validate(
    estimator=rfclassifier,  # model to evaluate
    X=features,              # input features
    y=label,                 # output labels
    cv=4,                    # how many folds
    # list of model evaluation metrics
    scoring=['accuracy', 'precision', 'recall'],
)
``````

The function cross_validate() returns a dictionary. Let’s convert it to a pandas DataFrame and print it:

``````
scores = pd.DataFrame(scores)
scores.round(4)
``````

| | fit_time | score_time | test_accuracy | test_precision | test_recall |
|---|---|---|---|---|---|
| 0 | 0.0116 | 0.0031 | 0.8513 | 0.8061 | 0.8750 |
| 1 | 0.0101 | 0.0031 | 0.8863 | 0.8844 | 0.8553 |
| 2 | 0.0107 | 0.0030 | 0.9446 | 0.9589 | 0.9150 |
| 3 | 0.0094 | 0.0027 | 0.9563 | 0.9367 | 0.9673 |

We can see evaluation metrics for each fold (test_accuracy, test_precision, and test_recall). The output also shows the time taken during training (fit_time) and evaluation (score_time) steps.
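If you want to inspect the raw dictionary before converting it, a small sketch like the following may help. It assumes synthetic stand-in data from `make_classification`, and the variable name `results` is chosen just for this example. Each key maps to an array with one entry per fold, and each metric name is prefixed with `test_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Assumption: synthetic stand-in data with four features (not the banknote data)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

model = RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0)

results = cross_validate(
    model, X, y, cv=4, scoring=['accuracy', 'precision', 'recall']
)

# Print every key with the shape of its per-fold array
for name, values in results.items():
    print(name, values.shape)
```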

Finally, you can get the mean values for reporting purposes:

``````
scores.mean().round(4)
``````
``````
fit_time          0.0105
score_time        0.0030
test_accuracy     0.9096
test_precision    0.8965
test_recall       0.9032
dtype: float64
``````

### Customizations

You may want to change the labels of the output metrics. To do so, prepare a dictionary that maps each label to a metric and pass it to the parameter `scoring`. In the example below, I’ve added the prefix `cls_` to all the metrics.

If you want to compute metrics for training data, set the parameter `return_train_score` to `True` like below:

``````
rfclassifier = RandomForestClassifier(max_depth=2, n_estimators=10)

# Custom labels for the metrics
scoring = {'cls_accuracy': 'accuracy',
           'cls_precision': 'precision',
           'cls_recall': 'recall'}

scores = cross_validate(
    rfclassifier,
    X=features,
    y=label,
    cv=4,
    scoring=scoring,
    # include train scores
    return_train_score=True
)
``````

Let’s convert the output scores to a pandas DataFrame and transpose it so it prints nicely:

``````
scores = pd.DataFrame(scores, index=range(1, 5))
scores.index.name = 'Fold'
scores.columns.name = 'Scores'

scores.round(4).transpose()
``````

| Scores | Fold 1 | Fold 2 | Fold 3 | Fold 4 |
|---|---|---|---|---|
| fit_time | 0.0105 | 0.0100 | 0.0100 | 0.0096 |
| score_time | 0.0030 | 0.0029 | 0.0028 | 0.0027 |
| test_cls_accuracy | 0.9446 | 0.9592 | 0.9184 | 0.9038 |
| train_cls_accuracy | 0.9466 | 0.9699 | 0.9320 | 0.9009 |
| test_cls_precision | 0.9463 | 0.9600 | 0.9310 | 0.9286 |
| train_cls_precision | 0.9549 | 0.9631 | 0.9448 | 0.9383 |
| test_cls_recall | 0.9276 | 0.9474 | 0.8824 | 0.8497 |
| train_cls_recall | 0.9236 | 0.9694 | 0.8993 | 0.8315 |

And print the mean scores for reporting purposes:

``````
scores.mean().round(4)
``````
``````
Scores
fit_time               0.0100
score_time             0.0028
test_cls_accuracy      0.9315
train_cls_accuracy     0.9373
test_cls_precision     0.9415
train_cls_precision    0.9503
test_cls_recall        0.9018
train_cls_recall       0.9060
dtype: float64
``````

The mean test accuracy, precision, and recall are in the 90-94% range. That’s not bad for the very first model we built. I’ll leave it to you as an exercise to improve performance by refining the model hyperparameters.
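As a possible starting point for that exercise, the sketch below compares a few values of `max_depth` by their mean cross-validated accuracy. The candidate depths (2, 4, 8) are arbitrary illustrative choices, and the data is a synthetic stand-in from `make_classification`, so the numbers it prints say nothing about the banknote dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: synthetic stand-in data with four features (not the banknote data)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

mean_accuracy = {}
for depth in (2, 4, 8):  # hypothetical candidate values
    model = RandomForestClassifier(
        max_depth=depth, n_estimators=10, random_state=0
    )
    scores = cross_val_score(model, X, y, cv=4, scoring='accuracy')
    mean_accuracy[depth] = scores.mean()

# Pick the depth with the highest mean accuracy across folds
best_depth = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, best_depth)
```

Scikit-Learn’s `GridSearchCV` automates this kind of search over several hyperparameters at once.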

## What’s Next?

Now you are familiar with the basics of K-Fold Cross-Validation. And you know how to implement it using Python and Scikit-Learn’s helper functions.

Here are a few things you can do to solidify what you’ve learned today:

• Check out the official documentation for cross_val_score() and cross_validate(). Explore the parameters that we didn’t cover in this post.
• Run Cross-Validation for a regression task.
• Implement Cross-Validation from scratch in Python.
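
For the regression suggestion, a minimal sketch could look like this. It assumes synthetic data from `make_regression` and a plain `LinearRegression` model, both picked only for illustration, and it swaps the accuracy metric for R², since accuracy does not apply to continuous targets.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Assumption: synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = cross_val_score(
    LinearRegression(),  # model to evaluate
    X,
    y,
    cv=4,                # for a regressor, an integer gives plain (unstratified) K-Fold
    scoring='r2'         # regression metric instead of accuracy
)

print(scores.round(4))
print(scores.mean().round(4))
```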

Title Image by Habib