Introduction to Dimensionality Reduction
Introduction
Most of you receive assignments through your institution. If you are loaded with about 10 projects, how do you decide which one to complete first? How do you plan your work? Several factors may come into play: which assignment's deadline is closer, which task's marks matter more, whether it is a good time to start, whether you are in the mood to work on a project right now, whether you can spend enough time on it, whether some other task is lying around, or sometimes even the strictness of the professor. Considering all these factors, you will start your work, right?
But do you really need all the mentioned factors to start your assignments? Do you really need to know whether it is the right time to begin your work? Are you willing to shoo away your marks just because you are not in the mood? Of course not; such factors do not play a vital role in your decision-making process.
Similarly, in machine learning, sometimes various features in our dataset are unnecessary for the decision-making process, and we filter them out using dimensionality reduction. Let us have a closer look at it.
What is dimensionality reduction?
In a dataset, the number of features is known as dimensionality.
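For instance, here is a minimal sketch of inspecting a dataset's dimensionality in Python, using the iris dataset that also appears later in this blog:
#the second value of shape is the number of features, i.e., the dimensionality
from sklearn.datasets import load_iris
import pandas as pd
dataset = load_iris()
df = pd.DataFrame(dataset['data'], columns=dataset['feature_names'])
print(df.shape)   #(150, 4): 150 samples, dimensionality 4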
Sometimes, machine learning classification problems depend on a very large number of factors, which makes the training dataset harder to visualize. In some cases, many features are correlated, causing redundancy and over-fitting. Dimensionality reduction algorithms were introduced to address these issues.
Why is dimensionality reduction required?
Dimensionality reduction is applied to preserve only the required and relevant data. The higher the number of features, the harder it is to visualize the training dataset, and the more redundancy there is in the model. Dimensionality reduction is therefore applied to overcome overfitting, remove outliers, and eliminate redundancy. Since the irrelevant and trivial features are dropped and the correlated elements removed, overfitting and regularization also become easier to handle.
Let us take an example of a hypothetical dataset containing various people's height, weight, and age. The 3-D plot will look somewhat like this:
[Figure: 3-D plot of the height, weight, and age dataset]
We are supposed to calculate a person's BMI from this dataset. With the help of dimensionality reduction, the feature Age will be removed, as it is not required and not directly related to calculating BMI. The 3-dimensional dataset is therefore reduced to a 2-dimensional one by removing the Age parameter. The 2-D plot will look like this:
[Figure: 2-D plot of height and weight after removing Age]
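A minimal sketch of this reduction in pandas, using a small hypothetical DataFrame (the column names and values are made up for illustration):
import pandas as pd
#hypothetical 3-D dataset: Height (m), Weight (kg), Age (years)
people = pd.DataFrame({'Height': [1.70, 1.82, 1.65],
                       'Weight': [68, 85, 54],
                       'Age': [25, 41, 33]})
#dropping the Age column reduces the dataset from 3-D to 2-D
people_2d = people.drop(columns=['Age'])
#BMI = Weight / Height^2 needs only the two remaining features
people_2d['BMI'] = people_2d['Weight'] / people_2d['Height'] ** 2
print(people_2d)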
When is dimensionality reduction applied?
Dimensionality reduction is used to prepare data. It is applied after mining and cleaning the data before training the model.
How is dimensionality reduction performed?
It consists of two components:
Feature Selection
Feature selection is a process where essential features are chosen as a subset of the original variables; the model is then created with the help of those variables.
Feature selection can be made in three ways:
- Filter- It grades the original parameters, compares them, and removes the correlated and unnecessary ones (see the sketch after this list).
- Wrapper- It assesses various potential subsets of features, one by one, to find the subset that fits the model best.
- Embedded- It combines the features of the filter and wrapper methods and is responsible for selecting the best subset and embedding/fitting it into the model.
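As a concrete illustration of the filter approach, here is a minimal sketch using scikit-learn's SelectKBest on the iris dataset; the ANOVA F-test scoring function and k=2 are assumptions made for this example:
#filter-method sketch: score every feature, keep only the k best
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
dataset = load_iris()
x, y = dataset['data'], dataset['target']
#f_classif grades each feature with the ANOVA F-test against the labels
selector = SelectKBest(score_func=f_classif, k=2)
x_selected = selector.fit_transform(x, y)
print(x.shape, '->', x_selected.shape)   #(150, 4) -> (150, 2)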
Feature Extraction
Feature extraction is a process in which the dataset is reduced by creating new features based on the original ones. Feature extraction can be done in the following ways:
- Principal Component Analysis- It is based on the principle that when data is mapped from a higher-dimensional space to a lower-dimensional one, the variance of the projected data should be maximized.
- Linear Discriminant Analysis- It reduces the dimensionality of data by projecting the parameters from a higher-dimensional space to a lower-dimensional space while maximizing the separation between classes.
- Generalized Discriminant Analysis- It uses kernel functions to map features to a lower-dimensional space even when the parameters are nonlinearly related.
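As a quick sketch of feature extraction, the following uses scikit-learn's PCA to project the 4-dimensional iris data onto 2 principal components; the choice of n_components=2 is an assumption made for illustration:
#feature-extraction sketch: PCA builds new features from the original ones
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
dataset = load_iris()
#standardize first so no single feature dominates the variance calculation
x_scaled = StandardScaler().fit_transform(dataset['data'])
pca = PCA(n_components=2)
x_reduced = pca.fit_transform(x_scaled)
#each ratio is the share of total variance preserved by a component
print(pca.explained_variance_ratio_)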
Here is a sample code to determine the important features from the iris dataset.
#importing necessary files
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
dataset = load_iris()
#creating the dataframe
df_iris = pd.DataFrame(dataset['data'], columns = dataset['feature_names'])
df_iris.head()
#Checking whether there is any null value
df_iris.isnull().sum()
#checking the variance of all the features
df_iris.var()
#importing the model
from sklearn.ensemble import RandomForestClassifier
#features are the four measurements; the target is the species label
x = df_iris
y = dataset['target']
model = RandomForestClassifier()
model.fit(x,y)
feat_importances= pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nlargest(4).plot(kind='barh')   #iris has only 4 features
plt.show()
This code generates a horizontal bar chart of the feature importances. It can be seen that sepal width is the least important feature; dimensionality reduction would therefore remove that feature first.
Advantages and Disadvantages of dimensionality reduction
Advantages of dimensionality reduction
- It reduces the memory requirements.
- Computational time is comparatively less.
- Reduction in time-space complexity.
- Helps in removing the redundant features.
- Removing correlations between interdependent features provides better classification and hence a better model.
Disadvantages of dimensionality reduction
- It leads to data loss.
- As the dimensionality of the features is reduced, vital elements are sometimes removed along with the redundant ones.
Frequently Asked Questions
- What is PCA used for?
Ans. Principal Component Analysis is used to understand the variance-covariance structure of the data and to reduce its dimensionality by forming uncorrelated linear combinations of the original variables.
- Which dimensionality reduction techniques handle higher dimensionality data well?
Ans. Principal Component Analysis handles high-dimensional datasets particularly well, since it compresses many correlated features into a few uncorrelated components.
- What is the difference between dimensionality reduction and feature selection?
Ans. Feature selection is a subset of dimensionality reduction. While dimensionality reduction reduces the dimension of the data set, feature selection is responsible for selecting the essential parameters from the dataset.
Key Takeaways
Dimensionality reduction is used to remove unnecessary parameters and keep only the vital features for classification problems. This blog covered the basics of dimensionality reduction: why it is used, how it is used, and its various types and categories. If you're interested in going deeper, check out our industry-oriented machine learning course curated by our faculty from Stanford University and industry experts.