Applying K-Means on Iris Dataset
We’ll be learning about a very famous machine learning algorithm - K-Means and a very popular dataset - Iris Dataset.
In short, K-Means is an unsupervised machine learning algorithm used for clustering. The Iris Dataset is a very well-known dataset used to predict the Iris flower species based on a few given properties.
What is K-Means?
K-Means is an unsupervised machine learning algorithm that is used for clustering problems. Since it is an unsupervised machine learning algorithm, it uses unlabelled data to make predictions.
K-Means is a clustering technique that measures the distance of each unlabelled data point to a set of cluster means (centroids) and uses those distances to group the points into specific clusters.
In detail, K-Means divides unlabelled data points into specific clusters/groups of points. As a result, each data point belongs to only one cluster that has similar properties.
The various steps involved in K-Means are as follows:-
→ Choose the 'K' value where 'K' refers to the number of clusters or groups.
→ Randomly initialize 'K' centroids as each cluster will have one center. So, for example, if we have 7 clusters, we would initialize seven centroids.
→ Now, compute the Euclidean distance of each data point to all the cluster centers. Based on this, assign each data point to its nearest cluster. This is known as the 'E-Step.'
Example: Let us assume we have two points, A(X1, Y1) and B(X2, Y2). Then the Euclidean distance between the two points would be the following:-
d(A, B) = √((X1 − X2)² + (Y1 − Y2)²)
→ Now, update the cluster center locations by taking the mean of the data points assigned. This is known as the 'M-Step.'
→ Repeat the above two steps until convergence, i.e., until the cluster assignments stop changing. Note that K-Means is only guaranteed to reach a local optimum, not a global one.
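The steps above can be sketched in a few lines of NumPy. This is an illustrative, from-scratch version on toy data, not the sklearn implementation we use later:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-Means sketch: the E-step assigns each point to its
    nearest centroid, the M-step moves each centroid to its cluster mean."""
    rng = np.random.default_rng(seed)
    # randomly pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # E-step: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated toy blobs
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(pts, k=2)
print(labels)
```

On data this well separated, the algorithm converges in a couple of iterations, with each blob ending up in its own cluster.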
We will be using the Iris Dataset and applying K-Means on the same.
The Iris Dataset helps predict the Iris flower species based on a few given properties. It consists of 5 features and one target variable.
(i) Id - ID of the flower for differentiating, numerical feature.
(ii) SepalLengthCm - sepal length of the flower, numerical feature.
(iii) SepalWidthCm - sepal width of the flower, numerical feature.
(iv) PetalLengthCm - petal length of the flower, numerical feature.
(v) PetalWidthCm - petal width of the flower, numerical feature.
(vi) Species - iris species, target variable / label.
For simplicity, we will use the existing K-Means implementation from the sklearn library.
Importing Necessary Libraries
Firstly, we will load some basic libraries:-
(i) Numpy - for linear algebra.
(ii) Pandas - for data analysis.
(iii) Seaborn - for data visualization.
(iv) Matplotlib - for data visualization.
(v) KMeans - for using K-Means.
(vi) LabelEncoder - for label encoding.
(vii) classification_report - for generating numerous results.
(viii) accuracy_score - for generating model accuracy.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
#loading dataset
df = pd.read_csv('Iris.csv')  # assumes the Kaggle Iris CSV is in the working directory
Above, we load the data using pandas.
We visualize the dataset by printing the first ten rows of the data frame. We use the head() function for the same.
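As an aside, the Iris data also ships with scikit-learn, so the same kind of frame can be built without the CSV file. A self-contained sketch (note that the column names differ from the Kaggle CSV used in this post):

```python
import pandas as pd
from sklearn.datasets import load_iris

# build a dataframe from scikit-learn's bundled copy of Iris
iris = load_iris(as_frame=True)
df = iris.frame           # four feature columns plus the encoded 'target' column
print(df.head(10))        # first ten rows, as with df.head(10) on the CSV
```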
#finding different class labels
print(df['Species'].unique())
We notice that there are three different classes present in the dataset.
We observe that our dataset consists of 150 rows and six columns.
#finding correlation of features
sns.heatmap(df.corr(numeric_only=True), annot=True)  # correlations of the numeric features
plt.show()
From the heatmap above, larger correlation values are shown in lighter colors and smaller values in darker colors. Note that this light-equals-large mapping comes from the colormap used here; it is not a universal property of heatmaps.
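Numerically, the same correlations can be read off df.corr() directly. For instance, petal length and petal width are very strongly correlated. A self-contained check using scikit-learn's bundled copy of the data (renamed here to match the CSV's column names):

```python
import pandas as pd
from sklearn.datasets import load_iris

df = pd.DataFrame(load_iris().data,
                  columns=['SepalLengthCm', 'SepalWidthCm',
                           'PetalLengthCm', 'PetalWidthCm'])
corr = df.corr()  # pairwise Pearson correlations - the matrix the heatmap draws
print(corr.round(2))
r = corr.loc['PetalLengthCm', 'PetalWidthCm']
```

The petal length/petal width correlation comes out above 0.9, which is why those two cells dominate the heatmap.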
Now, we will use Matplotlib for a scatter plot.
ax = df[df.Species=='Iris-setosa'].plot.scatter(x='SepalLengthCm', y='SepalWidthCm', color='red', label='Iris - Setosa')
df[df.Species=='Iris-versicolor'].plot.scatter(x='SepalLengthCm', y='SepalWidthCm', color='green', label='Iris - Versicolor', ax=ax)
df[df.Species=='Iris-virginica'].plot.scatter(x='SepalLengthCm', y='SepalWidthCm', color='blue', label='Iris - Virginica', ax=ax)
plt.show()
#checking for Null values
print(df.isnull().sum())
We observe that the dataset does not contain any Null values.
We perform label encoding for converting the categorical feature ‘Species’ into a numerical one.
#Label Encoding - for encoding categorical features into numerical ones
encoder = LabelEncoder()
df['Species'] = encoder.fit_transform(df['Species'])
#finding different class labels
print(df['Species'].unique())
As noticeable above, all target values are now numerical.
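LabelEncoder assigns integers in alphabetical order of the class names, so the three species map to 0, 1, and 2. A standalone illustration:

```python
from sklearn.preprocessing import LabelEncoder

species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
           'Iris-setosa', 'Iris-virginica']
encoder = LabelEncoder()
encoded = encoder.fit_transform(species)  # classes are sorted alphabetically
print(list(encoder.classes_))
print(list(encoded))
```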
We drop ‘Id’ as this feature carries no predictive information.
#dropping the Id column
df = df.drop(['Id'], axis=1)
Now, we will divide our data into training data and testing data, with a 3:1 train-test split.
#converting dataframe to np array
data = df.values
np.random.seed(0)
np.random.shuffle(data)  # the raw file is ordered by species, so shuffle before splitting
X = data[:, 0:4]  # the four measurement features
Y = data[:, -1]   # encoded species labels
#train-test split = 3:1 (112 training rows, 38 testing rows)
train_x = X[:112, ]
train_y = Y[:112, ]
test_x = X[112:150, ]
test_y = Y[112:150, ]
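Manual slicing like this only works if the rows have been shuffled first, since the raw Iris file is ordered by species. scikit-learn's train_test_split handles both the shuffling and the 3:1 split in one call; a sketch on the bundled copy of the data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
# test_size=0.25 gives the same 3:1 train-test ratio; shuffling is on by default
train_x, test_x, train_y, test_y = train_test_split(
    X, Y, test_size=0.25, random_state=42)
print(train_x.shape, test_x.shape)
```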
We will build our KMeans model using the sklearn library and then train it on the given iris dataset.
kmeans = KMeans(n_clusters=3)
kmeans.fit(train_x)  # fit the model on the training rows
# training predictions
train_labels = kmeans.predict(train_x)
# testing predictions
test_labels = kmeans.predict(test_x)
Now, we analyze our models and generate the result.
#KMeans model accuracy
print(accuracy_score(train_y, kmeans.predict(train_x)))
print(accuracy_score(test_y, test_labels))
We notice that we get good results on both sets: the training set gives an accuracy of about 99.10%, whereas the testing set gives about 94.73%.
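One caveat when scoring K-Means this way: the cluster IDs it assigns are arbitrary (cluster 0 need not correspond to the species encoded as 0), so in general each cluster must first be mapped to the majority true label inside it before accuracy_score is meaningful. A self-contained sketch of that mapping on the full dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
clusters = kmeans.labels_

# map each cluster ID to the most common true label inside that cluster
mapping = {c: np.bincount(y[clusters == c]).argmax() for c in range(3)}
mapped = np.array([mapping[c] for c in clusters])
print(accuracy_score(y, mapped))
```

On the full Iris data this mapped accuracy lands around 0.89, a typical result for K-Means on this dataset.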
Finally, we will generate a classification report for in-depth analysis.
#classification report for training set
print(classification_report(train_y, kmeans.predict(train_x)))
Frequently Asked Questions
- What is the advantage as well as disadvantage of KMeans?
An advantage of KMeans is that it is computationally very fast. A disadvantage of the same is that it does not work too well with clusters of different sizes.
- What is the importance of clustering in ML?
Clustering helps identify and group similar data points in larger datasets without concern for the specific outcome.
- What does the ‘K’ in K-Means stand for?
‘K’ refers to the number of clusters in K-means.
Congratulations on making it this far! This blog discussed a fundamental overview of K-Means along with the Iris Dataset.
We learned about data loading, data visualization, data preprocessing, and training. Based on this EDA, we made key preprocessing decisions, prepared our model for training, and finally generated its results.
If you are preparing for the upcoming Campus Placements, don’t worry. Coding Ninjas has your back. Visit this link for a carefully crafted and designed course on-campus placements and interview preparation.