Measuring the efficiency of an A.I. with a Confusion Matrix
This post is an English version of another post of mine in Portuguese 🇧🇷🇵🇹
In Machine Learning, we need to verify whether our algorithms are actually learning correctly. To do that, we can look at the number of correct predictions, errors, and other important values that help estimate the efficiency of our model.
To make that possible, we take the results generated by the model and organize them in a confusion matrix, which is nothing more than a table that groups every prediction to show where the model got it right and where it got it wrong.
In the confusion matrix (Figure 1), there are four possible classifications for an element:
True Positive (TP): An element predicted as True that really is True.
False Positive (FP): An element predicted as True, but that should be False.
False Negative (FN): An element predicted as False, but that should be True.
True Negative (TN): An element predicted as False that really is False.
Imagine that you are feeding a series of images from your cell phone to a prediction model. For each image, the model should state whether or not it contains a cat. That answer is then compared with the pre-labeled data (whether the photo really has a cat). In a confusion matrix, the goal is to identify how each of those images was classified.
Assume that the model has already been trained with several photos containing cats and several photos with no pet in them at all, so the AI has seen enough examples to tell whether or not an image contains a cat.
Applying the Confusion Matrix described above to the case of the cats, we can visualize it as follows.
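In table form, every photo falls into one of the four cells:

                          Model predicts a cat      Model predicts no cat
Photo really has a cat    True Positive (TP)        False Negative (FN)
Photo has no cat          False Positive (FP)       True Negative (TN)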
Metrics
To measure the efficiency of our model, the most suitable metrics are:
Accuracy
Usually the first measure people reach for, it captures the proportion of correct answers over all attempts. The main thing is to understand how many images the model got right (an image of a cat classified as a cat, True Positive; an image without cats classified as such, True Negative).
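In terms of the values in the matrix, it is calculated as:
Accuracy = (TP + TN) / (TP + FP + FN + TN)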
Precision
In this metric, the objective is to understand how many of the images flagged as positive really are positive. Therefore, the number of False Positives is placed in the formula, balancing it against the True Positives.
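In terms of the matrix values:
Precision = TP / (TP + FP)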
Recall (or Sensitivity)
In this metric, the focus is on understanding how many of the actual photos of cats were identified as Positive (True Positives). In the formula, that value is balanced against the photos of cats that were erroneously classified as negative (False Negatives). This distinction matters because even with an accuracy close to 100%, a low recall means that the model is leaving many records as False Negatives.
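In terms of the matrix values:
Recall = TP / (TP + FN)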
F1-Score
A bit more elaborate, this formula combines the values indicated by Precision and Recall into a single metric (their harmonic mean).
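Written out:
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)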
Exemplifying
Still with the kitten-identifier model, let’s assume that the model processed 100 images for testing and that they were placed in the Confusion Matrix as follows:
True Positives: 58 images.
False Positives: 6 images.
False Negatives: 17 images.
True Negatives: 19 images.
With that in mind, it is possible to calculate according to the formulas for the metrics listed above.
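For example, here is a quick sketch in Python that plugs these counts into the formulas shown above:

# Counts from the cat-identifier example
TP, FP, FN, TN = 58, 6, 17, 19

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy: {accuracy:.0%}")    # 77%
print(f"Precision: {precision:.0%}")  # 91%
print(f"Recall: {recall:.0%}")        # 77%
print(f"F1-Score: {f1:.0%}")          # 83%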
Having implemented the formulas, the metrics of our model show the following results:
Accuracy: 77%
Precision: 91%
Recall: 77%
F1-Score: 83%
To calculate these metrics, there are two ways we recommend:
1. If you just want to test with ready-made values like I did, the Online Confusion Matrix website performs the calculation as soon as you insert the values into the matrix. It is worth noting that its matrix has inverted axes, swapping the positions of FP and FN. Also note that the Recall measure appears there under the name Sensitivity.
2. If you would rather implement it in Python, we can generate the matrix with the following code:
from sklearn.metrics import confusion_matrix

# 1 = Cat | 0 = Isn't a cat
pics_cats = [1, 0, 1, 1, 0, 0, 1, 1, 1]  # list with the real labels
pred_cats = [0, 0, 1, 1, 0, 1, 1, 1, 0]  # list with the model output

matrix = confusion_matrix(pics_cats, pred_cats)
print("Confusion Matrix: \n", matrix)
In this example, the real labels of the photos are in the first list (pics_cats), where 1 indicates that the photo has a cat and 0 that it does not. The list pred_cats holds the predictions generated by the model. Both have nine elements, and each position in the two lists refers to the same photo.
When running the code it will provide the following output:
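Confusion Matrix: 
 [[2 1]
 [2 4]]

In scikit-learn's layout, each row corresponds to a real label (0 first, then 1) and each column to a predicted label, so this small sample has 2 True Negatives, 1 False Positive, 2 False Negatives and 4 True Positives.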
With other libs, we can make this view a bit more presentable and even customizable:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# 1 = Cat | 0 = Isn't a cat
pics_cats = [1, 0, 1, 1, 0, 0, 1, 1, 1]  # list with the real labels
pred_cats = [0, 0, 1, 1, 0, 1, 1, 1, 0]  # list with the model output

conf_matrix = confusion_matrix(pics_cats, pred_cats)

# Draw the matrix as a heatmap, writing the count inside each cell
sns.heatmap(conf_matrix, annot=True, fmt='d')
plt.xlabel('What was predicted')  # columns hold the predicted labels
plt.ylabel('What is expected')    # rows hold the real labels
plt.show()
Last but not least, we can print the metrics using functions provided by sklearn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Reusing the pics_cats and pred_cats lists defined above
accuracy = accuracy_score(pics_cats, pred_cats)
precision = precision_score(pics_cats, pred_cats)
recall = recall_score(pics_cats, pred_cats)
f1 = f1_score(pics_cats, pred_cats)

print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1-score: ", f1)
Running the code, we have the output with the metrics:
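For the nine sample photos, the Accuracy comes out to 6/9 ≈ 0.67, the Precision to 4/5 = 0.8, the Recall to 4/6 ≈ 0.67 and the F1-score to roughly 0.73. With such a tiny sample the values swing a lot, but the reading is the same as in the 100-image example.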
Conclusions
Understanding how a confusion matrix works is the first step in understanding how the predictions generated by a Machine Learning model are classified. From there, you can compute your metrics and judge whether the performance is satisfactory. Sometimes, a bit naively, we assume that Accuracy is the best measure of performance. Although it is simple and tells part of what we want to know, it is necessary to understand what we should prioritize in our predictions and to analyze the other available metrics.
In more delicate cases, such as predicting diseases like cancer from images, good Accuracy alone is not enough. It is necessary to avoid False Negatives, as they could compromise or even cost a life. A False Positive could be just as bad, but with a more careful medical analysis and other separate tests it could be established that it was nothing more than an isolated failure of the model, making it something easier to work around.
Within prediction models, errors are normal. A model that seems to get 100% of its predictions right is usually a sign that something is wrong. Even though we are looking for the most efficient model possible, it is important to know where the model can fail on the more difficult cases. If every cat needs to be identified in the photos, it is important to know that photos of stuffed animals or other animals may occasionally be classified as cats. If, instead, the goal is to guarantee that only cats are labeled as cats, we accept the chance that a photo of a cat seen from behind will not be classified as a real cat.
By using prediction models and calculating their performance through the metrics listed here, it is possible to understand the strengths and weaknesses of our model. As a Data Scientist, you are in charge not only of implementing the metrics, but also of understanding what conclusions can be drawn from them, and of prioritizing where the model still needs to learn so that it classifies new cases as well as possible.