Classification metrics in a nutshell
I was working on a classification project, more precisely a churn project. After training five machine learning models with very satisfactory accuracy, I found myself in doubt about which model I should put into production.
I needed to research and better understand classification metrics, and this article aims to summarize them.
But first, we need to talk about:
Confusion Matrix
Also known as an error matrix, the confusion matrix is a table layout that allows visualization of the performance of a classification algorithm.
Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class, or vice versa.
Correct classifications of the “no” class are called true negatives (TN), while correct classifications of the “yes” class are called true positives (TP). Incorrect classifications of the “no” class as “yes” are called false positives (FP), and incorrect classifications of the “yes” class as “no” are called false negatives (FN).
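To make the four counts concrete, here is a minimal sketch using scikit-learn’s confusion_matrix with made-up churn labels (1 = “yes”, churned; 0 = “no”); the values are purely illustrative:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical churn labels: 1 = "yes" (churned), 0 = "no"
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# With labels=[0, 1], ravel() returns the counts as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=5 FP=1 FN=1 TP=3
```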
Now that we know a little more about the confusion matrix, let’s move on to the metrics, where you will see why the confusion matrix matters.
Accuracy
Accuracy is the number of correct predictions (TP + TN) over the total number of examples.
Because it only looks at correct predictions over the total, accuracy should not be used in projects where we have imbalanced classes. For example: if 95% of customers did not churn and only 5% did, a model could miss every customer in the churn class “yes” and still have a high accuracy.
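A minimal sketch of that pitfall, assuming scikit-learn and a toy imbalanced dataset: a “model” that always predicts “no churn” still reaches 95% accuracy while missing every churner.

```python
from sklearn.metrics import accuracy_score

# Toy imbalanced labels: 95 customers stayed (0), 5 churned (1)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a lazy "model" that always predicts "no churn"

print(accuracy_score(y_true, y_pred))  # 0.95, yet it misses every churner
```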
Precision
Precision is the number of examples classified as belonging to a class that actually belong to that class, out of everything classified as that class: TP / (TP + FP).
If we throw 100 darts at a target and hit 65 of them, we have a hit rate of 65%. Going back to the churn project, if the model flags a list of 100 customers as churners and 95 of them really canceled the company’s services, our model has a precision of 95%.
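A quick sketch of that churn example with scikit-learn’s precision_score; the labels are made up to mirror the 95-out-of-100 scenario above.

```python
from sklearn.metrics import precision_score

# Hypothetical: 100 customers flagged as churn ("yes"), 95 of them really churned
y_pred = [1] * 100           # everything on the list is predicted as "yes"
y_true = [1] * 95 + [0] * 5  # 95 true churners, 5 mistakes

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 95 / 100 = 0.95
```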
Recall
Like precision, recall shows how well the model can differentiate classes.
It can be read as the number of examples correctly classified as belonging to a class divided by the total number of examples that actually belong to that class, even if some were classified as another: TP / (TP + FN).
It is used when false negatives (FN) are considered more harmful than false positives (FP). For example, in disease prediction the model must find all sick patients at any cost, even if it classifies some healthy patients as sick (FP) in the process. In other words, the model must have high recall, since classifying sick patients as healthy can lead to tragedy.
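A sketch of the disease example with scikit-learn’s recall_score, using hypothetical labels where 1 means “sick” and 0 means “healthy”:

```python
from sklearn.metrics import recall_score

# Hypothetical: 10 sick patients (1) and 90 healthy ones (0)
y_true = [1] * 10 + [0] * 90
# The model catches 9 of the 10 sick patients but also flags 5 healthy ones as sick
y_pred = [1] * 9 + [0] * 1 + [1] * 5 + [0] * 85

print(recall_score(y_true, y_pred))  # TP / (TP + FN) = 9 / 10 = 0.9
```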
F1-Score
F1-Score is the harmonic mean of the model’s precision and recall. That is, a low F1-Score is an indication that the model has either low precision or low recall.
It is used when you want to have a model with both good precision and recall.
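A minimal sketch with scikit-learn’s f1_score, using made-up churn labels chosen so that precision and recall both come out at 0.75:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical churn labels: 1 = "yes", 0 = "no"
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 3 / 4 = 0.75
r = recall_score(y_true, y_pred)     # 3 / 4 = 0.75
print(f1_score(y_true, y_pred))      # harmonic mean: 2 * p * r / (p + r) = 0.75
```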
AUC-ROC
AUC (Area under the Curve)
ROC (Receiver Operating Characteristic)
It measures the area under the curve formed by plotting the true positive rate (the rate of positive examples actually identified as positive) against the false positive rate.
The higher the AUC, the better the model is at predicting false classes as false and true classes as true; in other words, the better its chances of getting it right when deciding whether a patient is sick or whether a customer will leave the company’s base.
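A small sketch with scikit-learn’s roc_auc_score, which works on predicted probabilities (or scores) rather than hard labels; the labels and scores below are made up:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]

# 1.0 is a perfect ranking, 0.5 is no better than random guessing
print(roc_auc_score(y_true, y_score))  # 0.9375 for these made-up scores
```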
Conclusion
These metrics give an overview of how the model behaves on unseen data.
It is necessary to understand the business problem we are trying to solve in order to know which metrics to use. Remember, there is no better or worse metric, only the one that best fits our goal.
I hope this post was useful and that you enjoyed reading it.
Contacts: