Dimensionality Reduction — Fight Your Model Complexity

Nicolas Pogeant
8 min read · Jul 3, 2022

In this article, we will dive into dimensionality reduction to understand how it works and why it can be a lifesaver when you have to train your model.

Photo by Ehud Neuhaus on Unsplash

Dimensionality reduction is often introduced with the concept of the curse of dimensionality.

The curse of dimensionality describes the situation where the number of features (the columns, for tabular data) in a dataset is high while the number of samples is comparatively small.

Let's take a dataset composed of 2 features and 12 observations:

If we add one feature, the distances between data points will increase and change the way the model learns:

In a high-dimensional space, points are very distant from each other. Thus, for a dataset with many features, training data points are going to be widely separated. This is where the curse of dimensionality appears: if there are too few data points to train on, having many features results in poor learning and a model that makes bad predictions on new observations.
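To make this concrete, here is a small sketch (my own illustration, not from the original article) that draws a handful of random points and measures how their average pairwise distance grows as we add features:

```python
# Minimal sketch of the curse of dimensionality: with a fixed number of
# samples, the average distance between random points grows as features
# are added, so the data becomes sparser and harder to learn from.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 12  # the small dataset from the example above

for n_features in (2, 3, 10, 100, 1000):
    X = rng.uniform(0, 1, size=(n_samples, n_features))
    # pairwise Euclidean distances between all points
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    mean_dist = dists[np.triu_indices(n_samples, k=1)].mean()
    print(f"{n_features:>5} features -> mean pairwise distance {mean_dist:.2f}")
```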

Increasing the number of features during training leads to a more complex data structure and therefore to possible overfitting, as the model cannot relate observations to each other.

The idea is to counter this problem by letting the machine learn from the best relationships in the data. Indeed, what we know about machine learning is that the algorithm behind a model tries to understand how the data points relate to each other in order to generalize correctly when faced with new data.
This also applies to other types of data such as images, text or audio: the model learns to spot distinctions from the features. However, for the unstructured data mentioned above (images, text, audio…), the number of features explodes (a word is represented in relation to itself but also in relation to others, and more), which leads to dimensions that are impossible to picture.

What are the ways to reduce the dimension of a dataset and face this burden?

There are two main approaches to dimensionality reduction: a linear approach, associated with Projection, and a non-linear approach that overcomes the shortcomings of the first one, Manifold Learning (which in fact also includes projection).

Two Approaches against the Curse of Dimensionality

As we saw before, dealing with dimensionality in a dataset means dealing with the number of features it contains. A solution is to reduce this number. However, the major challenge is to end up with fewer features while keeping the relationships between them and the target. What we want is a representation of our dataset that fits in a lower-dimensional space so that the model learns better (and faster), while the structure of the data and the relationships between points are preserved.

Projection

This first approach reduces the dimension by projecting data points from a higher-dimensional space into a lower-dimensional one while retaining the relationships between the features.

If we apply projection to a 3D dataset, it can be represented in a 2-dimensional subspace as follows:

This example is inspired by a similar one from Aurélien Géron's book.
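As a rough illustration (a sketch of my own, not the book's code), projecting a 3D dataset onto a 2D plane simply amounts to multiplying the centered data by a matrix whose columns are two orthonormal direction vectors spanning that plane:

```python
# Sketch: project 3D points onto the 2D plane that best fits them.
import numpy as np

rng = np.random.default_rng(0)

# 3D data that lies close to a plane (small noise on the third direction)
X3d = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.5, 0.1],
                                            [0.2, 1.0, 0.1]])
X3d += 0.05 * rng.normal(size=X3d.shape)

# Two orthonormal vectors spanning the plane, found via SVD
# (this is also what PCA does under the hood)
X_centered = X3d - X3d.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered)
W = Vt[:2].T                    # shape (3, 2): the projection matrix

X2d = X_centered @ W            # shape (100, 2): the 2D representation
print(X3d.shape, "->", X2d.shape)
```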

One thing worth noting is that the new representation (in 2D) uses only 2 features, whereas the 3D representation used 3. So far this is logical; however, to keep the relationships between the features/variables, projection creates new features that contain the "essence" of the dataset. Principal Component Analysis (PCA) is the best-known projection technique, and we will focus on it in the next part.

One important thing about projection is that it performs well on linearly structured data but has difficulty dealing with non-linear structures. In that case it creates a new representation that does not really preserve the differences between data points.

As you can see, projecting data like this makes machine learning harder rather than easier. Let's move on to the second approach and see how to deal with non-linear structures.

Manifold Learning

This approach in fact includes projection, because a manifold in mathematics is:

A topological space that locally resembles Euclidean space near each point (Wikipedia).

A manifold allows any data structure to be represented in a lower-dimensional space that keeps the same structure, by reassembling the high-dimensional representation into a single local one.

Projection, by contrast, would not really reassemble anything; it would just flatten the 3D structure into a 2D one.

A 2D manifold is a two-dimensional shape that can be bent and folded in a higher-dimensional space: a sphere or a torus, for example.

Manifold Learning rests on two hypotheses:

  • “The manifold hypothesis is that real-world high dimensional data lie on low-dimensional manifolds embedded in the high-dimensional space” (a statement from Manifolds: A Gentle Introduction).
    It means that we can represent the real world, characterized by high-dimensional data, thanks to manifolds.
  • The second hypothesis states that a regression or classification task would be easier to solve if it were carried out in a low-dimensional manifold space (it does not imply that the model will perform better; however, it could learn faster, or learning might simply become feasible at all).
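Scikit-Learn's Swiss roll dataset is a handy way to picture this (a small sketch of my own, not from the article): the points live in 3D, yet each one is fully described by just two intrinsic coordinates, its position along the roll and its height.

```python
# The Swiss roll: a 2D manifold embedded in 3D space.
from sklearn.datasets import make_swiss_roll

X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)

print(X.shape)  # (1000, 3) -> the ambient, high-dimensional space
print(t.shape)  # (1000,)   -> position along the roll (one intrinsic axis)
# The second intrinsic axis is simply the height X[:, 1]: two numbers are
# enough to describe each point, even though it is stored with three.
```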

Now that we know both of the main approaches to dimensionality reduction, let's look at the techniques through two algorithms: PCA and LLE.

The Main Techniques of Dimensionality Reduction

We will dive into techniques that take the structure of a dataset and reduce its dimension.

Principal Component Analysis

This technique is the best-known one and can also be used in unsupervised learning problems.

PCA uses the projection approach to reduce the dimension of the data. It finds the principal components of the dataset's structure and keeps the most useful ones.

Principal components are unit vectors (vectors of length 1 that specify a direction in the feature space):

These components trace the representation of the data structure, as each component contributes to the variance of the data (i.e. keeps information). The vectors c1 and c2 are the first 2 principal components of the dataset.

The explained variance ratio is one of the important metrics in PCA, because it is by looking at the ratio of each component that you choose how many components to keep as the new representation of your dataset. For example, the first component (the best one) might explain 80% of the variance in your data, the second one 10%, and so on. If you decide to keep 90% of the variance, you will then keep those two components and your dataset will now be composed of 2 features, and therefore 2 dimensions. That is dimensionality reduction based on projection.
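In practice (a sketch under my own assumptions, here using the Iris dataset purely as a stand-in for your own features), you can read the explained variance ratio off a fitted PCA and pick the smallest number of components that reaches your threshold:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                 # any numeric feature matrix works here

pca = PCA().fit(X)                   # keep all components, just to inspect them
cumulative = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumulative >= 0.90) + 1   # smallest d preserving >= 90% variance

print(pca.explained_variance_ratio_)    # share of variance per component
print(f"Keep {d} components to preserve at least 90% of the variance")
```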

As these components are unit vectors, the original features of the dataset are no longer used for learning; the new features become Principal Component 1, Principal Component 2, and so on.

Principal Component Analysis is available in Scikit-Learn with the following code:
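(The original snippet is not reproduced here, so the following is a minimal sketch of the usual call, with X standing in for your feature matrix.)

```python
from sklearn.decomposition import PCA

# Reduce X to its first 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# n_components can also be a float between 0 and 1: keep just enough
# components to preserve 90% of the variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
```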

The main issue with PCA is that it works mainly for linear data structures, which is why other techniques exist.

Locally Linear Embedding

LLE is a non-linear dimensionality reduction technique that evaluates the neighborhood of each data point in the high-dimensional space and looks for a lower-dimensional representation that keeps these nearest-neighbor relations.
The Scikit-Learn documentation offers a nice comparison: LLE "can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding".

The technique works in several steps :

  • Find the local linear relations / k-nearest neighbors of each data point and compute weights that minimize the distance between a data point and its neighbors. Thanks to that, the algorithm knows all the local neighborhoods.
  • Once it understands the data, it looks for a new representation of all these local linear relations in a d-dimensional space. To do that, it searches for a position for each point in the new space that respects the relations and the minimized distances found in the first step.

Like PCA, LLE is available in Scikit-Learn:
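(As above, this is a minimal sketch of the usual call rather than the article's original snippet, with X standing in for your feature matrix.)

```python
from sklearn.manifold import LocallyLinearEmbedding

# Reduce X to 2 dimensions, building each local neighborhood from
# the 10 nearest neighbors of every point
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)
```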

Conclusion

Dimensionality reduction is extremely important and useful in a world of ever-larger datasets. Unstructured and complex data often needs to go through dimensionality reduction to allow a model to learn.
Different approaches exist, such as Projection and Manifold Learning, and each one has algorithms that can be applied directly to the data. PCA and LLE are examples, but others also perform well: Isomap, t-SNE, LDA (Linear Discriminant Analysis)…

Thank you for reading this article, I hope you enjoyed it and now have a better picture of how dimensionality reduction works!

Resources:
