SVM Algorithm In Machine Learning


Introduction

Hey! Are you active on social media?

Most probably yes, right?

So have you ever wondered how Instagram, Pinterest, and Facebook ad recommendations work?

You recently searched for a highlighting pen that you wanted for some assignments, and you browsed for those on some online shopping portal or Google.

But now, your social media ads are based on pens only. How?

Your social media accounts take the data from your search history and use it to provide the best recommendations possible. On the platform’s explore page, you can find posts related to your most searched topics. This is the work of a recommendation system built on machine learning algorithms.


Let’s explore some machine learning algorithms to understand the concepts in more depth. Machine learning can be defined as enabling a machine to learn on its own from collected data, using computer algorithms, instead of being explicitly programmed for every case.


According to Stanford University, Machine learning can be categorised into two types:

  1. Supervised learning: The model learns from labelled data. One of the most popular examples is spam classification, where each email is placed in one of two categories: spam or not spam.
  2. Unsupervised learning: The model groups unlabelled data based on patterns. Clustering, one of the most widely used techniques in unsupervised learning, groups similar data points so that analysis can be performed on those groups. Here, the groups are termed clusters.

Apart from these two, there are other learning types, such as reinforcement learning, and application areas, such as recommendation systems, that this article does not cover. Let’s explore one of the famous supervised learning algorithms: the Support Vector Machine.

What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a supervised machine learning algorithm used to solve classification and regression problems, although it is primarily used for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (where ‘N’ is the number of features) that distinctly classifies the data points.

SVM has a vast number of applications. For example, suppose you want to check your chances of getting heart disease from home. You enter data like blood pressure, sugar levels, age, gender, etc. Based on the given data, the algorithm classifies whether the heart is healthy or not. The spam classification example mentioned under supervised learning is another problem that an SVM can solve.

The SVM algorithm plots each data item as a point in an ‘N’-dimensional space (where ‘N’ is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, classification is done by finding the hyperplane that best separates the two classes. So how do we find the hyperplanes and support vectors?


Hyperplanes and Support Vectors

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. In simple terms, a hyperplane is the boundary your machine learning model uses to separate different groups of data.

A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the Support Vectors.

For example, if the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
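To make the idea of "falling on either side" concrete, here is a minimal sketch for the two-feature case, where the hyperplane is a line. The weight vector w, the offset b, and the helper name side_of_hyperplane are illustrative assumptions, not values from this article.

```python
import numpy as np

# Hypothetical hyperplane in a 2-feature space: w . x + b = 0
# (with two features this is simply the line x1 - x2 = 0).
w = np.array([1.0, -1.0])   # normal vector, illustrative values
b = 0.0                     # offset, illustrative value


def side_of_hyperplane(points, w, b):
    """Return +1 or -1 for each point, depending on its side of the hyperplane."""
    scores = points @ w + b
    return np.where(scores >= 0, 1, -1)


points = np.array([[3.0, 1.0],   # x1 > x2
                   [1.0, 4.0],   # x1 < x2
                   [5.0, 2.0]])  # x1 > x2
labels = side_of_hyperplane(points, w, b)
print(labels)
```

The sign of w . x + b is all the classifier needs at prediction time; training is about choosing w and b well.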

3-D visualisation of Hyperplane, note the hyperplane in this case is a 2D plane

Support vectors are the data points closest to the hyperplane, and they influence its position and orientation. Using these support vectors, we maximize the margin of the classifier. Deleting a support vector will change the position of the hyperplane. These are the points that help us build our SVM.
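As a rough illustration of "closest to the hyperplane", the sketch below measures the perpendicular distance of each point to a given hyperplane and picks out the nearest ones. The hyperplane and the toy data are assumed values chosen for this example; a real SVM finds the hyperplane and its support vectors together during training.

```python
import numpy as np

# Hypothetical separating hyperplane x1 + x2 - 5 = 0 (illustrative values)
w = np.array([1.0, 1.0])
b = -5.0

X = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 1.0],    # class -1
              [3.0, 3.0], [4.0, 4.0], [5.0, 3.0]])   # class +1

# Perpendicular distance from each point to the hyperplane
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The points at the smallest distance play the role of support vectors here
margin = distances.min()
support_idx = np.where(np.isclose(distances, margin))[0]
print(support_idx, margin)
```

Here the points (2, 2) and (3, 3) sit nearest to the boundary, one from each class, so they are the ones that would pin the hyperplane in place.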

Support Vectors 3-D working

Cost Function and Gradients

Before moving onto cost function and gradients, let’s first discuss intuition for machine learning. 

Intuition is something that enables you to act without using inference (or knowledge gained from observations or conclusions) to help you decide the right course of action. However, intuition is not the only tool that decisions are based on.

Instead of relying on intuition, one can base decisions on logic. Logic-based (rule-based) decision-making is a method of proceeding from a number of premises and eventually arriving at a decision, conclusion, or course of action through inference.

In the SVM algorithm, the cost function J(theta) is used to train the model. By minimizing the value of J(theta), we fit the SVM to the training data as closely as the regularisation allows. In the equation, the functions cost1 and cost0 refer to the cost for an example where y=1 and the cost for an example where y=0, respectively.
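The equation referred to above is not reproduced in this text. One widely used form that matches the description of cost1 and cost0 is sketched below. The constant C, the number of examples m, the number of features n, and the superscript notation are assumptions made for this sketch, not taken from the article.

```latex
J(\theta) = C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right)
          + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right]
          + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
```

The first sum penalises examples that are on the wrong side of, or too close to, the boundary; the second sum is the regularisation term.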

Gradient descent is a technique for converging on a solution to a problem by choosing an arbitrary solution, measuring the goodness of fit (under a loss function), and then iteratively taking steps to minimize loss (by stepping in the direction of the derivative). Gradient descent is a common technique used to find optimal weights.
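The loop described above can be sketched in a few lines. This is a generic illustration on a simple one-variable function, not the SVM cost itself; the function name gradient_descent, the learning rate, and the step count are illustrative choices.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative, starting from an arbitrary point."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Minimise f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

Each step shrinks the distance to the minimum at x = 3 by a constant factor, so the result ends up very close to 3.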

We are looking to maximize the margin between the data points and the hyperplane in the SVM algorithm. The loss function that helps maximize the margin is hinge loss.

Hinge loss function (function on top can be represented as a function below)
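The figure with the formula is not reproduced here. A standard way to write the two equivalent forms is sketched below, assuming the labels are y in {-1, +1} (a different convention from the y=0 and y=1 used for cost0 and cost1) and that f(x) is the model's raw output.

```latex
c\bigl(x, y, f(x)\bigr) =
\begin{cases}
  0, & \text{if } y \, f(x) \ge 1 \\
  1 - y \, f(x), & \text{otherwise}
\end{cases}
\qquad\Longleftrightarrow\qquad
c\bigl(x, y, f(x)\bigr) = \bigl(1 - y \, f(x)\bigr)_{+} = \max\bigl(0,\; 1 - y \, f(x)\bigr)
```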

The cost is 0 when the predicted value and the actual value have the same sign and the point lies on or beyond the margin, that is, when y·f(x) ≥ 1. Otherwise, we calculate the loss value. We also add a regularisation parameter to the cost function. Its objective is to balance margin maximisation and loss. After adding the regularisation parameter, the cost function looks like:

The loss function for SVM
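The formula itself is missing from this text. One common way to write the regularised hinge-loss objective is sketched below, assuming a linear model f(x) = ⟨x, w⟩, labels y_i in {-1, +1}, n training points, and a regularisation parameter written as λ (the symbol names are assumptions).

```latex
\min_{w} \; \lambda \lVert w \rVert^{2} + \sum_{i=1}^{n} \bigl(1 - y_i \langle x_i, w \rangle\bigr)_{+}
```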

Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights.

Gradients

When there is no misclassification, i.e., our model correctly predicts the class of our data point with a margin of at least 1, we only have to update the weights using the gradient from the regularisation term.

Gradient Update — No misclassification

When there is a misclassification, i.e., our model makes a mistake on the prediction of the class of our data point (or the point falls inside the margin), we include the loss along with the regularisation term to perform the gradient update.

Gradient Update — Misclassification
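The two update rules can be put together in a small from-scratch sketch. It assumes labels in {-1, +1}, a linear model without a bias term (the toy data is centred on the origin so none is needed), and the objective lam * ||w||^2 plus the hinge loss. The function name train_linear_svm and the values of learning_rate, lam, and epochs are illustrative, not taken from this article.

```python
import numpy as np


def train_linear_svm(X, y, learning_rate=0.01, lam=0.01, epochs=200):
    """Per-sample sub-gradient descent on lam * ||w||^2 + hinge loss (labels -1/+1)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(x_i, w) >= 1:
                # Correctly classified with enough margin:
                # only the regularisation term contributes, w = w - a * (2 * lam * w)
                w = w - learning_rate * (2 * lam * w)
            else:
                # Misclassified or inside the margin:
                # w = w + a * (y_i * x_i - 2 * lam * w)
                w = w + learning_rate * (y_i * x_i - 2 * lam * w)
    return w


# Toy, linearly separable data centred on the origin
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-3.0, -3.0],
              [2.0, 1.0], [1.0, 2.0], [3.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

w = train_linear_svm(X, y)
predictions = np.sign(X @ w)
print(w, predictions)
```

For data that is not centred, a common trick is to append a constant feature of 1 to every sample so that one of the weights acts as the bias.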

Here is a tabular summary of all the important terms:

Keyword | Definition
Hyperplanes | Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes.
Support vectors | Support vectors are the data points closest to the hyperplane, and they influence its position and orientation.
Cost function | A cost function is a mechanism used in supervised machine learning. It returns the error between the predicted outcomes and the actual outcomes.
Gradient | A gradient generalises the derivative to a function that has more than one input variable: it is the vector of the function's partial derivatives and points in the direction of steepest increase.

How does SVM work?

The main part of SVM lies in finding the right hyperplane. Here, we walk through five scenarios in which the two classes are represented with different symbols.

  • Identify the right hyper-plane (Scenario-1): We have three hyper-planes (A, B, and C). Now, identify the right hyper-plane to classify stars and circles.
    SVM_2

You need to remember a rule of thumb to identify the right hyper-plane: “Select the hyper-plane which segregates the two classes better.” In this scenario, hyper-plane “B” does this job best.

  • Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B, and C), and all are segregating the classes well. Now, how can we identify the right hyper-plane?
    SVM_3

Here, maximising the distance between the hyper-plane and the nearest data point (of either class) will help us decide the right hyper-plane. This distance is called the margin. Let’s look at the snapshot below:
SVM_4
Above, you can see that the margin for hyper-plane C is larger than for both A and B. Hence, we name C as the right hyper-plane. Another compelling reason for selecting the hyper-plane with a higher margin is robustness: if we select a hyper-plane with a low margin, there is a high chance of misclassification.

  • Identify the right hyper-plane (Scenario-3):
SVM_5

Some of you may have selected hyper-plane B as it has a higher margin than A. But here is the catch: SVM selects the hyper-plane that classifies the classes accurately before maximizing the margin. Here, hyper-plane B has a classification error, and A has classified all points correctly. Therefore, the right hyper-plane is A.
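The selection logic from the first three scenarios (classify correctly first, then prefer the larger margin) can be sketched as follows. The toy data and the three candidate hyper-planes are made-up values for illustration and do not correspond to the figures; a real SVM solves an optimisation problem rather than comparing a handful of candidates.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],    # circles, label -1
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])   # stars, label +1
y = np.array([-1, -1, -1, 1, 1, 1])

# Hypothetical candidate hyper-planes, each given as (w, b) for w . x + b = 0
candidates = {
    "A": (np.array([1.0, 1.0]), -3.5),   # separates, but hugs the circles
    "B": (np.array([1.0, 0.0]), -1.5),   # cuts through the circles
    "C": (np.array([1.0, 1.0]), -5.5),   # separates, midway between the classes
}


def evaluate(w, b):
    """Return (classifies everything correctly?, distance to the nearest point)."""
    scores = X @ w + b
    all_correct = bool(np.all(np.sign(scores) == y))
    margin = float(np.min(np.abs(scores)) / np.linalg.norm(w))
    return all_correct, margin


results = {name: evaluate(w, b) for name, (w, b) in candidates.items()}

# Step 1: keep only hyper-planes with no classification error.
valid = {name: margin for name, (ok, margin) in results.items() if ok}
# Step 2: among those, pick the one with the largest margin.
best = max(valid, key=valid.get)
print(results, best)
```

Candidate B is discarded in step 1 regardless of its distance to the nearest point, which mirrors the reasoning in Scenario-3.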

  • Can we classify two classes? (Scenario-4): Below, we cannot segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

As already mentioned, the one star at the other end is an outlier for the star class. However, the SVM algorithm can tolerate such outliers and still find the hyper-plane with the maximum margin (this is known as a soft margin). Hence, we can say that SVM classification is robust to outliers.

SVM_6
SVM_7

  • Find the hyper-plane to segregate into classes (Scenario-5): In the scenario below, we can’t have a linear hyper-plane between the two classes, so how does SVM classify them? So far, we have only looked at linear hyper-planes.

SVM_8

SVM can solve this problem easily by introducing an additional feature. Here, we will add a new feature, z=x^2+y^2. Now, let’s plot the data points on the x and z axes:

SVM_9

In the above plot, points to consider are:

  • All values of z are non-negative because z is the sum of the squares of x and y.
  • In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
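The effect of the extra feature can be checked with a small sketch. The data below is synthetic (an inner cluster and an outer ring, generated with an assumed random seed), and the threshold on z is picked by hand for illustration rather than learned by an SVM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: circles within radius 1 of the origin, stars in a ring of radius 2 to 3.
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.concatenate([rng.uniform(0.0, 1.0, size=50),    # circles
                        rng.uniform(2.0, 3.0, size=50)])   # stars
x = radii * np.cos(angles)
y_coord = radii * np.sin(angles)
labels = np.concatenate([-np.ones(50), np.ones(50)])

# No straight line in the (x, y) plane separates a cluster from the ring around it,
# but the new feature z = x^2 + y^2 is small for circles and large for stars.
z = x ** 2 + y_coord ** 2

# In the (x, z) plane, a horizontal line z = threshold is a linear separator.
threshold = 2.5   # hand-picked, between 1^2 and 2^2
predicted = np.where(z > threshold, 1, -1)
accuracy = float(np.mean(predicted == labels))
print(accuracy)
```

Kernels such as RBF or polynomial, used in the code that follows, achieve a similar effect without computing the extra features explicitly.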

Code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
 
 
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional
 
    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
 
 
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.
 
    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
 
 
# import some data to play with
iris = datasets.load_iris()
# Take the first two features. We could avoid this by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
 
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
models = (svm.SVC(kernel='linear', C=C),
          svm.LinearSVC(C=C, max_iter=10000),
          svm.SVC(kernel='rbf', gamma=0.7, C=C),
          svm.SVC(kernel='poly', degree=3, gamma='auto', C=C))
models = (clf.fit(X, y) for clf in models)
 
# title for the plots
titles = ('SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel')
 
# Set-up 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2)
plt.subplots_adjust(wspace=0.4, hspace=0.4)
 
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
 
for clf, title, ax in zip(models, titles, sub.flatten()):
    plot_contours(ax, clf, xx, yy,
                  cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)
 
plt.show()

Output

The following 2-D graphs are based on two features, sepal length and sepal width. They show how the decision boundaries differ depending on the kernel and the implementation. For example, LinearSVC minimises the squared hinge loss, while SVC minimises the regular hinge loss.

The four graphs are SVC with a linear kernel, LinearSVC, SVC with an RBF kernel, and SVC with a third-degree polynomial kernel. In each graph, the coloured regions show the class that the fitted model predicts for that part of the feature space.

SVM Classifier with Iris Sepal features

In the above code, we only plotted the decision boundaries. In practice, after plotting and analysing the data, we split it into two parts: a training set and a test set.

Then, we train our model on the training set and evaluate it on the test set. Finally, we compare the predictions with the actual values and print the accuracy of our model.
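A minimal sketch of that workflow with scikit-learn is shown below. It reuses the same two Iris features and the RBF settings from the plotting code; the 70/30 split, the random_state value, and the use of stratify are assumptions made for this example.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, :2]   # sepal length and sepal width, as in the plots above
y = iris.target

# Hold out 30% of the samples for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Train on the training set only.
clf = svm.SVC(kernel='rbf', gamma=0.7, C=1.0)
clf.fit(X_train, y_train)

# Predict on unseen data and compare with the actual labels.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Because only two of the four features are used, the accuracy here will typically be lower than what the full Iris dataset allows.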

Applications of SVM

As mentioned above, SVM has many applications across many fields. Let’s explore them:

  • Data Classification
  • Facial Expression detection
  • Texture Classification
  • Text Classification
  • Cancer Disease Diagnosis
  • Heart Disease prediction
  • Steganography Detection in Digital Images
  • Identification of DNA 
  • Spam Emails classification
  • Bioinformatics
  • Protein Fold and Remote Homology Detection
  • Simple speech recognition

Frequently Asked Questions

Can I teach myself machine learning?

Yes, you can learn machine learning yourself from different resources available online. Some popular universities like the Massachusetts Institute of Technology, Harvard University, etc., provide you with the best machine learning courses.

Is coding required for data science?

Python and R are mostly used for writing the code for data science and machine learning. You can have most of the code automated through libraries. It is essential to learn to code if you want to be a data scientist.

Which is the best website to learn machine learning?

Many websites teach you machine learning for free through courses designed by universities. Coding Ninjas provides you with the topmost machine learning course associated with Amazon, Facebook, Stanford University.

What is SVM used for?

SVM is a supervised machine learning algorithm that can be used for classification or regression problems.

How does SVM predict?

A support vector machine (SVM) assigns a new data point to one of the labelled categories according to which side of the learned hyperplane the point falls on.

Key Takeaways

This article gave a brief overview of machine learning and its types, with some applications. Then we discussed the Support Vector Machine (SVM), including hyperplanes, support vectors, and the cost function. Finally, we walked through how SVM works, with code and applications.

“Programming isn’t about what you know; it’s about what you can figure out.” ~ Chris Pine

By Dharani Mandla