Understand Classification Problems in less than 5 minutes

Nicolas Pogeant

4 min read · May 16, 2022

Classification is one of the best-known supervised learning problems in Machine Learning. Let’s try to explain it quickly!

In this article, we will see:

What is a Classification Problem?
The different types of Classification Problems.
How to evaluate a Classification model.

How can we define a Classification Problem?

To understand what Classification is, let’s compare it with Regression.

What makes a classification problem different from a regression one?

The two differ mostly in the target they try to predict. On one side, you want to predict continuous numerical variables, for example the price of coffee based on some features (dataset source: https://www.ico.org/new_historical.asp):

Your target is spread over a continuous range, so it can take an enormous number of possible values, here between 4 US$ per lb and 6.3 US$ per lb.

On the other side, you want to predict classes, for example the variety of a coffee bean:

In this case, the target is not spread over a range but takes one of a finite number of possible classes: 29, from Pacamara to Blue Mountain (dataset source: https://database.coffeeinstitute.org/).

To summarize, a classification problem is a problem where your targets can be defined as classes: Is the traffic light green, yellow or red? What items does this photo contain?

The different types of Classification Problems

We can distinguish 4 types of classification problems:

Binary Classification: predict 0 or 1 (positive or negative sentiment from a tweet).
Multi-class Classification: predict 0, 1, 2 or 3 (a cappuccino, espresso, latte or latte macchiato).
Multi-label Classification: predict 0 and/or 1 (dog and/or cat in a picture).
Multi-output Classification: predict several outputs at once, each of which can be 0, 1 or 2 (multi-class and multi-label combined).
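To make the four types more concrete, here is a minimal sketch of what the targets could look like for three samples. The variable names and values are made up for illustration only.

```python
# Hypothetical targets for three samples, one variable per problem type.

# Binary: each sample gets exactly one of two classes (0 or 1).
binary_targets = [0, 1, 1]

# Multi-class: each sample gets exactly one of four classes (0 to 3),
# e.g. 0 = cappuccino, 1 = espresso, 2 = latte, 3 = latte macchiato.
multiclass_targets = [0, 3, 2]

# Multi-label: each sample gets one 0/1 flag per label, here [dog, cat].
# Both flags can be 1 (dog and cat) or both 0 (neither).
multilabel_targets = [[1, 0], [1, 1], [0, 0]]

# Multi-output: each sample gets several outputs,
# and each output is itself a multi-class value (0, 1 or 2).
multioutput_targets = [[0, 2], [1, 0], [2, 1]]

for name, targets in [
    ("binary", binary_targets),
    ("multi-class", multiclass_targets),
    ("multi-label", multilabel_targets),
    ("multi-output", multioutput_targets),
]:
    print(name, targets)
```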

Beyond this, classification problems are present in Computer Vision, Natural Language Processing, Automatic Speech Recognition…

How to evaluate a Classification model

As the target is a class and most datasets are not perfectly balanced, evaluating a classification model with a single performance measure such as the classic accuracy score is not a good option.

Let’s take an example of a model that classifies photos of dogs and cats:

The accuracy is pretty good: 80%. However, if we take a look at the predictions (y_preds), the model only returned Dogs and no Cats, whereas 2 of the photos were Cats. The issue comes from an imbalanced dataset: the model learns poorly and makes mistakes, but not enough of them to show up in the accuracy score.
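This situation can be sketched in a few lines of plain Python. The data below is an assumption chosen to match the numbers in the text (10 photos, 8 dogs and 2 cats, a model that always answers "dog"), and the accuracy helper is a hypothetical hand-written function.

```python
# Assumed data: 10 photos, 8 dogs and 2 cats.
y_true = ["dog"] * 8 + ["cat"] * 2
# Assumed predictions: the model answers "dog" for every photo.
y_preds = ["dog"] * 10


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


print(accuracy(y_true, y_preds))  # 0.8
print("cat" in y_preds)           # False: no cat was ever predicted
```

The score looks fine even though the model never finds a single cat.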

Thus, we need better types of evaluation to know whether the system is doing well. Let’s see the 2 main ones:

Confusion Matrix

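The Confusion Matrix shows how the elements of each class are sorted by the model: how many 0s have been predicted as 0 or as 1, and vice versa. The cells of the matrix are called True Negatives, True Positives, False Positives and False Negatives.

True Positives are elements of class 1 correctly predicted as 1 by the model. True Negatives are elements of class 0 correctly predicted as 0. False Positives are elements of class 0 wrongly predicted as 1, and False Negatives are elements of class 1 wrongly predicted as 0.

2 ratios exist and allow a better understanding of what the model does:

Precision: TP / (TP + FP)
Recall: TP / (TP + FN)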
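As a rough illustration, the four cells and the two ratios can be computed by hand. The labels and predictions below are made up, and the helper functions are hypothetical names written for this sketch, with class 1 taken as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there is no positive sample."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up example: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)     # 3 2 1 4
print(precision(tp, fp))  # 0.6
print(recall(tp, fn))     # 0.75
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other conventions exist.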

Finally, by combining these two, we obtain a harmonic mean called the F1-score:

F1-score: 2 * (Precision * Recall) / (Precision + Recall)
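A minimal sketch of the F1-score as the harmonic mean of precision and recall. The function name and the input values (precision 0.6, recall 0.75) are hypothetical and only serve as an illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration.
print(round(f1_score(0.6, 0.75), 3))
# A harmonic mean is pulled towards the smaller value:
# a model with perfect precision but poor recall still scores low.
print(round(f1_score(1.0, 0.1), 3))
```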

Note: increasing precision generally decreases recall, and vice versa. This trade-off can be seen on the Precision-Recall curve.
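The trade-off can be illustrated by moving the decision threshold applied to the scores of a model. The scores, labels and thresholds below are made up, and the helper is a hypothetical function, not a library API.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when every score >= threshold is predicted as class 1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores (probability of class 1) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

# Raising the threshold makes the model more selective:
# on this data precision goes up while recall goes down.
for threshold in [0.25, 0.45, 0.65, 0.85]:
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Plotting one (recall, precision) point per threshold gives the Precision-Recall curve.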

ROC Curve

The ROC Curve plots the True Positive Rate against the False Positive Rate. The closer the curve is to the upper left corner, the better your model is (depending on the balance of your data of course).

The AUC is the Area Under the Curve and is also a measure of performance. The higher it is, the better your model is (the maximum AUC is 1).
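Here is a rough sketch of how the ROC points and the AUC could be computed by hand. The data is made up, the function names are hypothetical, and the sketch assumes that all scores are distinct and that both classes are present (ties and empty classes are not handled).

```python
def roc_points(labels, scores):
    """(False Positive Rate, True Positive Rate) points, lowering the threshold one score at a time.

    Assumes distinct scores and at least one sample of each class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up model scores (probability of class 1) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

points = roc_points(labels, scores)
print(points)
print(auc(points))
```

A model that ranks every positive above every negative reaches an AUC of 1, while random scores land around 0.5.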

Thank you for reading this article. I hope you now understand how classification problems are handled in Data Science and how you can evaluate the performance of your machine learning models!
