Precision, Recall, and F1 Score: A Practical Guide Using Scikit-Learn

Yashmeet Singh · 7 minute read

Introduction

In the last post, we learned why Accuracy can be a misleading metric for classification problems with imbalanced classes, and how Precision, Recall, and F1 Score can come to our rescue.

It’s time to put all that theory into practice using Python, Scikit-Learn, and Seaborn.

First, let me introduce the dataset we’ll be working with today.

Credit Card Default Dataset

We’ll use the Default dataset from ISLR. The dataset¹ contains credit card debt information for 10,000 consumers and has the following columns:

  • default: indicates whether the consumer defaulted on the debt (0 - didn’t default, 1 - defaulted).
  • student: indicates whether the consumer is a student (0 - No, 1 - Yes).
  • balance: consumer’s credit card balance.
  • income: consumer’s annual income.

Our aim is to build a classification model that predicts whether a consumer will default on their credit card debt.

Let’s dive in!

First Classification Model

Load the dataset

First, we load the dataset using pandas:

import pandas as pd
 
dataset = pd.read_csv('Default.csv')
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  int64  
 1   student  10000 non-null  int64  
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 312.6 KB

As expected, the dataset contains observations for 10K consumers. Moreover, every column has 10K non-null values. So we don’t have any missing data.
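If you want an explicit per-column count of missing values, a quick check like the following confirms it (a minimal sketch using the same dataset DataFrame):

# count missing values per column; all zeros confirms there's no missing data
dataset.isnull().sum()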

Imbalanced Classes

Next, let’s find out how many consumers have defaulted on their loans.

We can use Pandas’ function value_counts() to get the counts of each outcome in default, the output column:

dataset['default'].value_counts()
0    9667
1     333
Name: default, dtype: int64

Only 333 out of 10,000 consumers, or 3.33%, have defaulted on their loans. We are dealing with a dataset with imbalanced classes. This will become important, as we’ll see later.
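To see the imbalance as proportions rather than raw counts, we can normalize the value counts (a small sketch using the same dataset):

# fraction of consumers in each class: roughly 96.7% vs 3.3%
dataset['default'].value_counts(normalize=True)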

Train Test Split

Let’s store input columns (student, balance, and income) as variable X, and output column (default) as y:

# X contains all input columns 
# Use pandas drop() to get all columns except 'default'
X = dataset.drop(columns='default')
 
# y has the output column 
y = dataset['default']

We must ensure that we’ll test our model on data it has never seen during training. So let’s set aside a portion of the available data for testing.

We’ll use Scikit-Learn’s train_test_split, which returns a training set (X_train, y_train) and a test set (X_test, y_test):

from sklearn.model_selection import train_test_split
 
# keep 30% of data for testing using the argument 'test_size'
# Order of the output variables is important
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3)
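
Because the split is random, your exact numbers in the rest of this post may differ slightly from mine. If you want a reproducible split that preserves the 3.33% default rate in both sets, you could pass the stratify and random_state arguments (an optional variation, not used for the results below):

# optional: stratified, reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)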

Build the Model

Let’s train the model using Scikit-Learn’s LogisticRegression. Note that we use only the training data (X_train and y_train) in this phase:

from sklearn.linear_model import LogisticRegression
 
# liblinear solver works well with unscaled data
model = LogisticRegression(solver='liblinear')
# fit the model on the training data 
model.fit(X_train, y_train)

Evaluate the Model

Now that the model is fully trained, let’s measure its performance. We’ll use the model to predict the output (default) for all test inputs, X_test:

# predict y for the test inputs 
y_test_predictions = model.predict(X_test)

Plot Confusion Matrix

Next, let’s generate a Confusion Matrix by comparing the actual test output (y_test) with the model’s predictions (y_test_predictions):

# import all the metrics we'll use later on
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
 
# Generate confusion matrix for the predictions
conf_matrix = confusion_matrix(y_test, y_test_predictions)
conf_matrix
array([[2907,    2],
       [  91,    0]])

The above output can be hard to interpret - it’s just a bunch of numbers in a two-dimensional array.

Instead, we can visualize the Confusion Matrix using Seaborn’s heatmap():

import matplotlib.pyplot as plt
import seaborn as sns
 
plt.figure(figsize=(8,8))
sns.set(font_scale = 1.5)
 
ax = sns.heatmap(
    conf_matrix, # confusion matrix 2D array 
    annot=True, # show numbers in the cells
    fmt='d', # show numbers as integers
    cbar=False, # don't show the color bar
    cmap='flag', # customize color map
    vmax=175 # to get better color contrast
)
 
ax.set_xlabel("Predicted", labelpad=20)
ax.set_ylabel("Actual", labelpad=20)
plt.show()

Confusion Matrix for the base model. Built using seaborn heatmap()
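
If you’d rather not bring in Seaborn, Scikit-Learn has its own plotting helper that produces a similar chart straight from the fitted model (a minimal sketch using the same model and test data):

from sklearn.metrics import ConfusionMatrixDisplay

# plot the confusion matrix directly from the estimator and the test set
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.show()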

Notice the bottom left square (containing the number 91).

The test set contained 91 cases where consumers defaulted on their loans (actual value = 1). Our model couldn’t predict any of them correctly (predicted value = 0).
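
To read those four cells programmatically, we can unpack the matrix into true negatives, false positives, false negatives, and true positives (a small sketch using the conf_matrix computed above):

# for binary labels 0/1, confusion_matrix lays out the cells as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = conf_matrix.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")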

That’s a terrible performance! Let’s see if Accuracy reflects this reality.

accuracy = accuracy_score(y_test, y_test_predictions)
print(f"Accuracy = {accuracy}")
Accuracy = 0.969

No, it doesn’t. A model that gets every positive case wrong shouldn’t have an Accuracy of almost 97%.

This paradoxical phenomenon where a model has high Accuracy but performs poorly is known as the Accuracy Paradox.

That’s why you cannot trust Accuracy to measure the performance of a classification model. That’s especially true when you have imbalanced classes.
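
One way to see the paradox concretely is to compare against a baseline that always predicts the majority class. Scikit-Learn’s DummyClassifier makes this easy (a minimal sketch using the same train and test sets):

from sklearn.dummy import DummyClassifier

# a 'model' that always predicts the most frequent class (0 - no default)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

# its accuracy is simply the share of non-defaulters in the test set, about 97%
print(baseline.score(X_test, y_test))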

We’ll turn to the metrics that can help us identify bad models in such situations.

Precision, Recall, and F1 Score

Let’s calculate Precision, Recall, and F1 Score using Scikit-Learn’s built-in functions - precision_score(), recall_score() and f1_score().

precision = precision_score(y_test, y_test_predictions)
recall = recall_score(y_test, y_test_predictions)
f1score = f1_score(y_test, y_test_predictions)
 
print(f"Precision = {precision}")
print(f"Recall = {recall}")
print(f"F1 Score = {f1score}")
Precision = 0.0
Recall = 0.0
F1 Score = 0.0

All of them are 0. That’s not surprising. We know our model is flawed, as it failed to predict any of the positive cases.
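
If you want all of these numbers for both classes in one place, Scikit-Learn’s classification_report is handy (a quick sketch using the same predictions):

from sklearn.metrics import classification_report

# per-class precision, recall, F1, and support in a single table
print(classification_report(y_test, y_test_predictions, zero_division=0))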

A Better Model

The model we built performed terribly because the input dataset had imbalanced classes.

How can we mitigate the harmful effect of imbalanced classes? One way is to assign a higher weight to the observations that occur infrequently.

LogisticRegression can do that if you set the parameter class_weight to 'balanced'.
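
Under the hood, 'balanced' weights each class inversely to its frequency, roughly n_samples / (n_classes * class_count), so the rare defaulters count for much more during training. We can inspect the weights ourselves (a small sketch using Scikit-Learn's compute_class_weight):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# weight per class: n_samples / (n_classes * class_count)
# the rare class (1 - defaulted) gets a weight of roughly 15
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_train)
print(weights)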

Let’s build another model using this option. Then get new predictions for the test input data X_test.

model = LogisticRegression(
    solver='liblinear', 
    class_weight='balanced' # handle imbalanced classes
)
# fit the model on the training data 
model.fit(X_train, y_train)
# and then predict y for the test inputs 
y_test_predictions = model.predict(X_test)

Next, plot the Confusion Matrix for the test predictions:

conf_matrix = confusion_matrix(y_test, y_test_predictions)
 
plt.figure(figsize=(8,8))
sns.set(font_scale = 1.5)
 
ax = sns.heatmap(
    conf_matrix, annot=True, fmt='d', 
    cbar=False, cmap='tab10', vmax=500 
)
 
ax.set_xlabel("Predicted", labelpad=20)
ax.set_ylabel("Actual", labelpad=20)
plt.show()

Confusion Matrix for the improved model. Built using seaborn heatmap()

The new model correctly predicted 74 of the 91 positive cases. That’s quite an improvement!

Let’s calculate the performance metrics for the new model:

accuracy = accuracy_score(y_test, y_test_predictions)
precision = precision_score(y_test, y_test_predictions)
recall = recall_score(y_test, y_test_predictions)
f1score = f1_score(y_test, y_test_predictions)
 
print(f"Accuracy = {accuracy.round(4)}")
print(f"Precision = {precision.round(4)}")
print(f"Recall = {recall.round(4)}")
print(f"F1 Score = {f1score.round(4)}")
Accuracy = 0.8717
Precision = 0.1674
Recall = 0.8132
F1 Score = 0.2777

Here's the summary of metrics for both models:
Metric       First Model    Second Model
Accuracy     0.969          0.8717
Precision    0.0            0.1674
Recall       0.0            0.8132
F1 Score     0.0            0.2777

The second model has much better overall performance. Although its Accuracy is slightly lower, the other scores went up. Recall, in particular, shot up significantly.

There’s room for improvement, though. Precision is still too low. I’ll leave it as an exercise for you to refine the model further.
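
One common refinement, which I’ll only sketch here, is to keep the model but move its decision threshold: predict_proba gives the probability of default for each consumer, and raising the cutoff above the default 0.5 trades some Recall for better Precision (the 0.7 below is just an illustrative value):

# probability of class 1 (default) for each test consumer
default_proba = model.predict_proba(X_test)[:, 1]

# apply a stricter cutoff than the default 0.5
custom_predictions = (default_proba >= 0.7).astype(int)

print(f"Precision = {precision_score(y_test, custom_predictions):.4f}")
print(f"Recall = {recall_score(y_test, custom_predictions):.4f}")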

Summary & Next Steps

This post showed us how to evaluate classification models using Scikit-Learn and Seaborn.

We built a model that suffered from Accuracy Paradox. Then we measured its performance by plotting the Confusion Matrix and calculating Precision, Recall, and F1 Score.

Next, we used Scikit-Learn’s built-in feature to tackle imbalance in the output classes. That helped us train a better-performing model.

Here are a few suggestions if you want to enhance your knowledge of classification metrics:

Footnotes

  1. We’ll use a slightly modified version of the ISLR Default dataset.

Title Image by anncapictures