# Applying K-Means on Iris Dataset

## Introduction

In this blog, we’ll learn about a well-known machine learning algorithm, K-Means, and apply it to a popular dataset, the Iris Dataset.

In short, K-Means is an unsupervised machine learning algorithm used for clustering. The Iris Dataset is a well-known dataset used to predict the species of an Iris flower from a few measured properties.

## What is K-Means?

K-Means is an unsupervised machine learning algorithm used for clustering problems. Because it is unsupervised, it works on unlabelled data: it groups points by similarity instead of learning from known answers.

K-Means groups data points by their distance to cluster centers (centroids), where each centroid is the mean of the points assigned to it.

In detail, K-Means divides unlabelled data points into K clusters. Each data point belongs to exactly one cluster, the one whose centroid is nearest, so points in the same cluster have similar properties.

## K-Means Algorithm

The steps involved in K-Means are as follows:

→ Choose the 'K' value where 'K' refers to the number of clusters or groups.

→ Randomly initialize 'K' centroids as each cluster will have one center. So, for example, if we have 7 clusters, we would initialize seven centroids.

→ Now, compute the Euclidean distance from each data point to all the cluster centers. Based on this, assign each data point to its nearest cluster. This is known as the 'E-Step.'

Example: Let us assume we have two points, A1(X1, Y1) and B2(X2, Y2). Then the Euclidean distance between the two points is:

$$d(A_1, B_2) = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$$

→ Now, update the cluster center locations by taking the mean of the data points assigned. This is known as the 'M-Step.'

→ Repeat the above two steps until convergence, i.e., until the cluster assignments stop changing. K-Means is only guaranteed to reach a local optimum, so the final clusters can depend on the initial centroids.
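The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the sklearn implementation used later in this blog: the function name `kmeans_sketch`, its parameters, and the toy points are all made up for the example, and initial centroids are simply drawn from the data points.

```python
import numpy as np


def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array; illustrative only."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # E-Step: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # M-Step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return centroids, labels


# Two well-separated toy groups (made-up numbers).
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random initialisation; only the grouping itself is meaningful.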

## Iris Dataset

**Link to the dataset** - https://www.kaggle.com/pralabhpoudel/iris-classification-report-97-accuracy/data?select=Iris.csv

We will apply K-Means to the Iris Dataset.

The Iris Dataset helps predict the Iris flower species based on a few given properties. The CSV has six columns: an Id, four measurements, and one target variable.

(i) Id - row identifier for each flower, numerical.

(ii) SepalLengthCm - sepal length of the flower, numerical feature.

(iii) SepalWidthCm - sepal width of the flower, numerical feature.

(iv) PetalLengthCm - petal length of the flower, numerical feature.

(v) PetalWidthCm - petal width of the flower, numerical feature.

(vi) Species - Iris species, target variable / label.

## Implementation

For simplicity, we will use the existing KMeans implementation from the sklearn library.

## Importing Necessary Libraries

First, we import the libraries we need:

(i) Numpy - for linear algebra.

(ii) Pandas - for data analysis.

(iii) Seaborn - for data visualization.

(iv) Matplotlib - for data visualisation.

(v) KMeans - for using K-Means.

(vi) LabelEncoder - for label encoding.

(vii) classification_report - for per-class precision, recall, and F1 scores.

(viii) accuracy_score - for computing model accuracy.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    from matplotlib import pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import classification_report
    from sklearn.metrics import accuracy_score

## Loading Data

    # loading dataset
    df = pd.read_csv('iris.csv')

Above, we load the data using pandas.

## Visualization

We inspect the dataset by printing the first ten rows of the data frame. We use the head() method for this.

    # visualizing dataset
    df.head(n=10)

**Output**

    # finding different class labels
    np.unique(df['Species'])

**Output**

We notice that there are three different classes present in the dataset.

    df.shape

**Output**

We observe that our dataset consists of 150 rows and six columns.

    df.info()

**Output**

    # finding correlation of features
    # numeric_only=True skips the text column 'Species' (requires pandas >= 1.5)
    correl = df.corr(numeric_only=True)
    sns.heatmap(correl, annot=True)

**Output**

In the heatmap above, larger values are shown in lighter colors and smaller values in darker colors. This holds for seaborn's default colormap; passing a different cmap argument can change or reverse the mapping.

Now, we will draw a scatter plot using pandas' plotting interface, which is built on Matplotlib.

    # one scatter call per species, all drawn on the same axes
    ax = df[df.Species == 'Iris-setosa'].plot.scatter(
        x='SepalLengthCm', y='SepalWidthCm', color='red', label='Iris - Setosa')
    df[df.Species == 'Iris-versicolor'].plot.scatter(
        x='SepalLengthCm', y='SepalWidthCm', color='green', label='Iris - Versicolor', ax=ax)
    df[df.Species == 'Iris-virginica'].plot.scatter(
        x='SepalLengthCm', y='SepalWidthCm', color='blue', label='Iris - Virginica', ax=ax)
    ax.set_title("Scatter Plot")

**Output**

## Preprocessing

### Data imputation

    # checking for Null values
    df.isnull().sum()

**Output**

We observe that the dataset does not contain any Null values.

### Label Encoding

We perform label encoding for converting the categorical feature ‘Species’ into a numerical one.

    # Label Encoding - for encoding categorical features into numerical ones
    encoder = LabelEncoder()
    df['Species'] = encoder.fit_transform(df['Species'])

    df

**Output**

    # finding different class labels
    np.unique(df['Species'])

**Output**

As shown above, all target values are now numerical.
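To make the encoding less of a black box: LabelEncoder sorts the distinct class names and replaces each one with its position in that sorted list. The sketch below reproduces that mapping with plain NumPy on a made-up sample of species names; it is only an illustration of the idea, not the sklearn code itself.

```python
import numpy as np

# Made-up sample of species names, deliberately out of order.
species = np.array(["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"])

# np.unique returns the sorted distinct names and, with return_inverse=True,
# the index of each original value in that sorted list.
classes, codes = np.unique(species, return_inverse=True)

print(classes)         # sorted class names
print(codes)           # integer code for each sample
print(classes[codes])  # decoding: codes mapped back to names
```

With the three Iris species, this ordering means Iris-setosa becomes 0, Iris-versicolor becomes 1, and Iris-virginica becomes 2.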

### Insignificant Features

We drop ‘Id’ because it is only a row identifier and carries no information about the flower.

    # DROPPING ID
    df = df.drop(['Id'], axis=1)

    df.shape

**Output**

### Train-Test Split

Now, we will divide our data into training data and testing data with a 3:1 train-test split. The first 112 rows are used for training and the remaining 38 for testing, in file order without shuffling.

    # converting dataframe to np array
    data = df.values
    X = data[:, 0:4]   # the four measurements only; column 4 is the encoded Species
    Y = data[:, -1]    # encoded Species labels
    print(X.shape)
    print(Y.shape)

    # train-test split = 3:1 (rows taken in file order)
    train_x = X[:112]
    train_y = Y[:112]
    test_x = X[112:150]
    test_y = Y[112:150]
    print(train_x.shape)
    print(train_y.shape)
    print(test_x.shape)
    print(test_y.shape)

**Output**
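One thing to watch with a slice-based split: if the rows in the CSV are ordered by species, as copies of the Iris data commonly are, the last 38 rows will all belong to a single class. A shuffled split avoids this. The sketch below is one possible way to do it with NumPy; the helper name `shuffled_split`, the seed, and the stand-in arrays are assumptions made for the example.

```python
import numpy as np


def shuffled_split(X, Y, train_fraction=0.75, seed=42):
    """Shuffle row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(train_fraction * len(X))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]


# Stand-in arrays with the same shapes as the Iris features and labels.
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
Y_demo = np.repeat([0, 1, 2], 50)

train_x, train_y, test_x, test_y = shuffled_split(X_demo, Y_demo)
print(train_x.shape, test_x.shape)
print(np.bincount(train_y), np.bincount(test_y))  # class counts in each part
```

sklearn also provides train_test_split in sklearn.model_selection, whose shuffle, stratify, and random_state parameters cover the same need and can additionally keep class proportions equal in both parts.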

## Training

We will build our KMeans model using the sklearn library and then train it on the given iris dataset.

    # KMeans
    kmeans = KMeans(n_clusters=3)
    # KMeans is unsupervised: fit() ignores any labels, so only the features are passed
    kmeans.fit(train_x)
    # training predictions
    train_labels = kmeans.predict(train_x)
    # testing predictions
    test_labels = kmeans.predict(test_x)
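After fitting, the model exposes what it learned through a few attributes. The sketch below fits KMeans on made-up toy points rather than the Iris data so that it runs on its own; the n_init and random_state values are choices made for this example (to keep runs repeatable), not settings taken from this blog.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up toy data: two tight groups of three points each.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# n_init and random_state are assumptions for this sketch.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(toy)  # no labels are passed

print(model.labels_)           # cluster id assigned to each training point
print(model.cluster_centers_)  # one centroid per cluster
print(model.inertia_)          # sum of squared distances to the nearest centroid
```

The cluster ids in labels_ are arbitrary numbers; they identify groups, not species.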

### Results

Now, we analyze our model and generate the results.

    # KMeans model accuracy
    # training accuracy
    print(accuracy_score(train_y, train_labels) * 100)
    # testing accuracy
    print(accuracy_score(test_y, test_labels) * 100)

**Output**

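In the original run, the training set gave a score of 99.10 and the testing set a score of 94.73. Keep in mind that K-Means numbers its clusters arbitrarily, so comparing cluster ids directly with the encoded species is only meaningful when the two numberings happen to line up; the scores can therefore change from run to run.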
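Because cluster ids are arbitrary, a common workaround is to relabel each cluster with the most frequent true class among its members before computing accuracy. The sketch below shows the idea on made-up arrays; the helper name `map_clusters_to_classes` is hypothetical, and note that this step uses the true labels, so it is an evaluation aid rather than part of the clustering.

```python
import numpy as np


def map_clusters_to_classes(cluster_ids, true_labels):
    """Relabel each cluster with the most common true class among its members."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels).astype(int)
    mapped = np.empty_like(true_labels)
    for cluster in np.unique(cluster_ids):
        members = cluster_ids == cluster
        mapped[members] = np.bincount(true_labels[members]).argmax()
    return mapped


# Made-up example: the grouping is nearly right, but the ids are permuted.
true_y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
clusters = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

raw_accuracy = np.mean(clusters == true_y)
mapped = map_clusters_to_classes(clusters, true_y)
mapped_accuracy = np.mean(mapped == true_y)
print(raw_accuracy, mapped_accuracy)
```

Here the raw comparison scores only 1 out of 9, while the mapped labels score 8 out of 9, even though the clustering itself is identical in both cases.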

Finally, we will generate a classification report for in-depth analysis.

    # classification report for training set
    print(classification_report(train_y, train_labels))

**Output**

## Frequently Asked Questions

**What are an advantage and a disadvantage of KMeans?**

An advantage of KMeans is that it is computationally fast and scales well to large datasets. A disadvantage is that it does not work well when clusters differ greatly in size or density or are not roughly spherical; it also requires choosing K in advance.

**What is the importance of clustering in ML?**

Clustering helps identify and group similar data points in larger datasets without needing labelled outcomes.

**What does the ‘K’ in K-Means stand for?**

‘K’ refers to the number of clusters in K-means.

## Key Takeaways

Congratulations on making it this far. This blog gave a basic overview of KMeans and applied it to the Iris Dataset.

We covered data loading, data visualisation, data preprocessing, and training. We explored the data first, used that EDA to guide our preprocessing decisions, prepared the data for training, and finally generated the results.

