 # Precision, Recall, and F1 Score: A Practical Guide Using Scikit-Learn

## Introduction

In the last post, we learned why Accuracy can be a misleading metric for classification problems with imbalanced classes, and how Precision, Recall, and F1 Score can come to our rescue.

It’s time to put all that theory into practice using Python, Scikit-Learn, and Seaborn.

First, let me introduce the dataset we’ll be working with today.

## Credit Card Default Dataset

We’ll use the Default dataset from ISLR. The dataset contains credit card debt information for 10,000 consumers and has the following columns:

• default: indicates whether the consumer defaulted on the debt (0 - didn’t default, 1 - defaulted).
• student: indicates whether the consumer is a student (0 - No, 1 - Yes).
• balance: consumer’s credit card balance.
• income: consumer’s annual income.

Our aim is to build a classification model that predicts whether a consumer will default on their credit card debt.

Let’s dive in!

## First Classification Model

First, we load the dataset using pandas:

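``````import pandas as pd

# Assumed loading step (missing from the listing); 'Default.csv' is a placeholder path
dataset = pd.read_csv('Default.csv')

dataset.info()
``````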
``````<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   default  10000 non-null  int64
 1   student  10000 non-null  int64
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 312.6 KB
``````

As expected, the dataset contains observations for 10K consumers. Moreover, every column has 10K non-null values. So we don’t have any missing data.

### Imbalanced Classes

Next, let’s find out how many consumers have defaulted on their loans.

We can use pandas' `value_counts()` method to get the count of each outcome in `default`, the output column:

``````dataset['default'].value_counts()
``````
``````0    9667
1     333
Name: default, dtype: int64
``````

Only 333 out of 10K consumers, or 3.33%, have defaulted on their loans. We are dealing with a dataset with imbalanced classes. This will become important, as we’ll see later.
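If you'd rather see the class shares directly, `value_counts()` accepts `normalize=True`. Here's a minimal sketch; the Series below is a synthetic stand-in built from the counts shown above, not the real dataset:

```python
import pandas as pd

# Synthetic stand-in with the same class counts as the `default` column
default = pd.Series([0] * 9667 + [1] * 333, name="default")

# normalize=True returns fractions of the total instead of raw counts
proportions = default.value_counts(normalize=True)
print(proportions)
```

The fraction for class 1 is the same 3.33% we worked out by hand.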

### Train Test Split

Let’s store input columns (`student`, `balance`, and `income`) as variable X, and output column (`default`) as y:

``````# X contains all input columns
# Use pandas drop() to get all columns except 'default'
X = dataset.drop(columns='default')

# y has the output column
y = dataset['default']
``````

We must ensure that we’ll test our model on data it has never seen during training. So let’s set aside a portion of the available data for testing.

We’ll use Scikit-Learn’s `train_test_split`, which returns a training set (X_train, y_train) and a test set (X_test, y_test):

``````from sklearn.model_selection import train_test_split

# keep 30% of data for testing using the argument 'test_size'
# Order of the output variables is important
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3)
``````
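One thing to keep in mind with imbalanced classes: a purely random split can leave the training and test sets with slightly different default rates. `train_test_split` has a `stratify` argument that keeps the class ratio the same in both parts. The sketch below is a variation, not what this post uses for its results; the data is a synthetic stand-in with the same class counts, and the `random_state` value is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 9,667 non-defaulters and 333 defaulters
y = np.array([0] * 9667 + [1] * 333)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # keep the class ratio the same in both splits
    random_state=42,   # arbitrary seed, only for reproducibility
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"Default rate in train: {train_rate:.4f}")
print(f"Default rate in test:  {test_rate:.4f}")
```

Both rates should land very close to the overall 3.33%.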

### Build the Model

Let’s train the model using Scikit-Learn’s LogisticRegression. Make sure that we use the training data (X_train and y_train) in this phase:

``````from sklearn.linear_model import LogisticRegression

# liblinear solver works well with unscaled data
model = LogisticRegression(solver='liblinear')
# fit the model on the training data
model.fit(X_train, y_train)
``````

### Evaluate the Model

Now that the model is fully trained, let’s measure its performance. We’ll use the model to predict the output (default) for all test inputs, X_test:

``````# predict y for the test inputs
y_test_predictions = model.predict(X_test)
``````

### Plot Confusion Matrix

Next, let’s generate a Confusion Matrix by comparing the actual test output (y_test) with the model’s predictions (y_test_predictions):

``````# import all the metrics we'll use later on
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

# Generate confusion matrix for the predictions
conf_matrix = confusion_matrix(y_test, y_test_predictions)
conf_matrix
``````
``````array([[2907,    2],
       [  91,    0]])
``````

The above output can be hard to interpret - it’s just a bunch of numbers in a two-dimensional array.
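It helps to know the layout: Scikit-Learn puts the actual classes in rows and the predicted classes in columns, both sorted by label. For our binary problem, we can unpack the four cells by name. A small sketch using the numbers above as a plain nested list:

```python
# Confusion matrix from the output above, as a plain nested list
conf_matrix = [[2907, 2],
               [91, 0]]

# Row 0 = actual class 0, row 1 = actual class 1
# Column 0 = predicted class 0, column 1 = predicted class 1
(tn, fp), (fn, tp) = conf_matrix

print(f"True negatives  (actual 0, predicted 0) = {tn}")
print(f"False positives (actual 0, predicted 1) = {fp}")
print(f"False negatives (actual 1, predicted 0) = {fn}")
print(f"True positives  (actual 1, predicted 1) = {tp}")
```

With the real NumPy array, `conf_matrix.ravel()` gives the same four numbers in the same order.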

We can visualize the Confusion Matrix instead using Seaborn heatmap():

``````import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,8))
sns.set(font_scale = 1.5)

ax = sns.heatmap(
    conf_matrix,  # confusion matrix 2D array
    annot=True,   # show numbers in the cells
    fmt='d',      # show numbers as integers
    cbar=False,   # don't show the color bar
    cmap='flag',  # customize color map
    vmax=175      # to get better color contrast
)

plt.show()
``````

Notice the bottom left square of the heatmap (containing the number 91).

The test set contained 91 cases where consumers defaulted on their loans (actual value = 1). Our model couldn’t predict any of them correctly (predicted value = 0).

That’s a terrible performance! Let’s see if Accuracy reflects this reality.

``````accuracy = accuracy_score(y_test, y_test_predictions)
print(f"Accuracy = {accuracy}")
``````
``````Accuracy = 0.969
``````

No, it doesn’t. A model which gets all the positive cases wrong isn’t supposed to have an Accuracy of almost 97%.

This paradoxical phenomenon where a model has high Accuracy but performs poorly is known as the Accuracy Paradox.

That’s why you cannot trust Accuracy to measure the performance of a classification model. That’s especially true when you have imbalanced classes.
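To see how hollow that 97% is, compare it with a hypothetical "model" that ignores its inputs and predicts 0 for everyone. The sketch below uses only the counts from the confusion matrix above; the baseline is an illustration, not something we trained:

```python
# Counts from the first model's confusion matrix
tn, fp, fn, tp = 2907, 2, 91, 0
total = tn + fp + fn + tp

# Accuracy = correct predictions / all predictions
model_accuracy = (tn + tp) / total

# Hypothetical baseline that always predicts 0:
# it gets every actual negative right and every actual positive wrong
baseline_accuracy = (tn + fp) / total

print(f"Model accuracy    = {model_accuracy:.4f}")
print(f"Baseline accuracy = {baseline_accuracy:.4f}")
```

The do-nothing baseline scores marginally higher than our trained model, which is exactly why Accuracy tells us so little here.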

We’ll turn to the metrics that can help us identify bad models in such situations.

### Precision, Recall, and F1 Score

Let’s calculate Precision, Recall, and F1 Score using Scikit-Learn’s built-in functions - precision_score(), recall_score() and f1_score().

``````precision = precision_score(y_test, y_test_predictions)
recall = recall_score(y_test, y_test_predictions)
f1score = f1_score(y_test, y_test_predictions)

print(f"Precision = {precision}")
print(f"Recall = {recall}")
print(f"F1 Score = {f1score}")
``````
``````Precision = 0.0
Recall = 0.0
F1 Score = 0.0
``````

All of them are 0. That’s not surprising. We know our model is flawed, as it failed to predict any of the positive cases.
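For reference, here's how these numbers fall out of the confusion matrix counts. This is a hand-rolled sketch of the textbook formulas, not Scikit-Learn's implementation; `safe_divide` is a helper name made up for this example:

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0

# Counts from the first model's confusion matrix
tp, fp, fn = 0, 2, 91

# Precision: of everything predicted as 1, how much was really 1?
precision = safe_divide(tp, tp + fp)
# Recall: of everything that was really 1, how much did we catch?
recall = safe_divide(tp, tp + fn)
# F1 Score: harmonic mean of Precision and Recall
f1 = safe_divide(2 * precision * recall, precision + recall)

print(f"Precision = {precision}")
print(f"Recall = {recall}")
print(f"F1 Score = {f1}")
```

With zero true positives, the numerator of every formula is 0, so all three metrics collapse to 0.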

## A Better Model

The model we built performed terribly because the input dataset had imbalanced classes.

How can we mitigate the harmful effect of imbalanced classes? One way is to assign a higher weight to the observations that occur infrequently.

`LogisticRegression` can do that if you use the parameter `class_weight` and set its value to `'balanced'`.

Let’s build another model using this option. Then get new predictions for the test input data X_test.

``````model = LogisticRegression(
    solver='liblinear',
    class_weight='balanced'  # handle imbalanced classes
)
# fit the model on the training data
model.fit(X_train, y_train)
# and then predict y for the test inputs
y_test_predictions = model.predict(X_test)
``````
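What does `'balanced'` actually do? Per the Scikit-Learn documentation, each class gets the weight `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand. The training-set class counts are inferred, not printed anywhere in this post: the full dataset has 9,667 / 333 and the test set has 2,909 / 91, which leaves 6,758 / 242 for training:

```python
# Inferred training-set class counts (full dataset minus test set)
class_counts = {0: 9667 - 2909, 1: 333 - 91}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count of that class)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight = {weight:.4f}")
```

Under these assumptions, each defaulter counts roughly 28 times as much as a non-defaulter during training, so the model can no longer afford to ignore the minority class.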

Next, plot the Confusion Matrix for the test predictions:

``````conf_matrix = confusion_matrix(y_test, y_test_predictions)

plt.figure(figsize=(8,8))
sns.set(font_scale = 1.5)

ax = sns.heatmap(
    conf_matrix, annot=True, fmt='d',
    cbar=False, cmap='tab10', vmax=500
)

plt.show()
``````

The new model correctly predicted 74 of the 91 positive cases. That’s quite an improvement!

Let’s calculate the performance metrics for the new model:

``````accuracy = accuracy_score(y_test, y_test_predictions)
precision = precision_score(y_test, y_test_predictions)
recall = recall_score(y_test, y_test_predictions)
f1score = f1_score(y_test, y_test_predictions)

# format to 4 decimals; works for plain Python floats and NumPy floats alike
print(f"Accuracy = {accuracy:.4f}")
print(f"Precision = {precision:.4f}")
print(f"Recall = {recall:.4f}")
print(f"F1 Score = {f1score:.4f}")
``````
``````Accuracy = 0.8717
Precision = 0.1674
Recall = 0.8132
F1 Score = 0.2777
``````
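As a sanity check, we can rebuild the second model's confusion matrix from numbers already in this post and recompute the metrics by hand. The false positive count is an inference from the reported Precision, not a value read off the heatmap:

```python
# Known from earlier in the post
actual_positives = 91       # first confusion matrix: 91 + 0
actual_negatives = 2909     # first confusion matrix: 2907 + 2
tp = 74                     # "correctly predicted 74 of the 91 positive cases"
fn = actual_positives - tp

# Inferred: Precision 0.1674 = tp / (tp + fp), so tp + fp is about 442
fp = round(tp / 0.1674) - tp
tn = actual_negatives - fp

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy = {accuracy:.4f}")
print(f"Precision = {precision:.4f}")
print(f"Recall = {recall:.4f}")
print(f"F1 Score = {f1:.4f}")
```

The recomputed values match the reported ones to four decimals, so the inferred counts are at least consistent with the output above. They also show where the low Precision comes from: the model now flags a few hundred non-defaulters as defaulters.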

Here's the summary of metrics for both models:

| Metric | First Model | Second Model |
|---|---|---|
| Accuracy | 0.969 | 0.8717 |
| Precision | 0.0 | 0.1674 |
| Recall | 0.0 | 0.8132 |
| F1 Score | 0.0 | 0.2777 |

The second model has much better overall performance. Its Accuracy is lower, but every other score went up, and Recall in particular rose from 0 to 0.81.

There’s room for improvement, though. Precision is still too low. I’ll leave it as an exercise for you to refine the model further.

## Summary & Next Steps

This post showed us how to evaluate classification models using Scikit-Learn and Seaborn.

We built a model that suffered from the Accuracy Paradox. Then we measured its performance by plotting the Confusion Matrix and calculating Precision, Recall, and F1 Score.

Next, we used Scikit-Learn’s built-in feature to tackle imbalance in the output classes. That helped us train a better-performing model.

Here are a few suggestions if you want to enhance your knowledge of classification metrics:

Title Image by anncapictures