KDE Plot Visualisation with Pandas & Seaborn


When we have a large amount of data and want to draw insights from it, the first step is usually to visualise it. A good plot shows how the data behaves under a given condition and how each field varies across its values.

KDE stands for Kernel Density Estimate, a graphical way to visualise the probability density of a continuous variable. It lets us examine the data to understand how its variables are distributed.

A KDE plot can answer many questions, such as:

  1. What range do the observations cover?
  2. What is the central tendency of the data?
  3. Is the data skewed in one direction?
  4. Is the data bimodal?
  5. Are there significant outliers?
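
Several of these questions also have simple numeric counterparts that can be checked alongside the plot. The following is a minimal sketch using pandas on a synthetic, seeded sample; the sample itself and the choice of statistics are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic sample, seeded so the numbers are reproducible (illustrative assumption)
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=500))

summary = {
    "min": sample.min(),        # range covered by the observations
    "max": sample.max(),
    "mean": sample.mean(),      # central tendency
    "median": sample.median(),
    "skew": sample.skew(),      # the sign hints at the direction of any skew
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

These numbers complement the plot rather than replace it: bimodality, for instance, is much easier to spot in the curve than in any single statistic.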

A histogram estimates a distribution by binning and counting observations. A KDE plot instead smooths the observations with a Gaussian kernel, producing a continuous density estimate rather than discrete bins. The result is often less cluttered and easier to interpret, especially when several distributions are drawn together. However, a KDE plot can introduce distortions if the underlying distribution is bounded or not smooth.
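
To make the smoothing idea concrete, here is a rough NumPy sketch of a one-dimensional Gaussian kernel density estimate: every observation contributes a small Gaussian bump, and the estimate is the average of those bumps. The function name gaussian_kde_1d and the rule-of-thumb bandwidth are assumptions for illustration; seaborn's own implementation and default bandwidth may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(size=500)
grid = np.linspace(-6.0, 6.0, 601)

# A common rule-of-thumb bandwidth (assumed here, not necessarily seaborn's default)
bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density estimate should integrate to roughly 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the estimate: {area:.3f}")
```

A larger bandwidth gives a smoother curve that can hide real structure, while a smaller one gives a bumpier curve that follows the noise in the sample.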

Introduction to Seaborn

Seaborn is a Python library that works closely with NumPy and pandas (libraries for numerical arrays and tabular data). It is built on top of Matplotlib and lets data scientists create attractive, informative statistical graphs and charts that give a clear picture of the information in the data.

Install seaborn with:
pip install seaborn

Syntax of KDE plot:
seaborn.kdeplot(data)

The same plot can also be drawn with seaborn.displot(); in that case the kind of graph must be specified as kind='kde':
seaborn.displot(data, kind='kde')

Normal KDE plot:

import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(500)
res = sn.kdeplot(data)
plt.show()

This plot is drawn from 500 samples generated with NumPy's random module and stored as a NumPy array; seaborn works well with both NumPy arrays and pandas DataFrames.

We can also set the colour of the curve and shade the area under it to make the plot easier to read.

import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(500)
res = sn.kdeplot(data, color='orange', fill=True)  # fill replaces the deprecated shade parameter
plt.show()

Our task is to create a KDE plot using pandas and seaborn.
Let us create a KDE plot for the iris dataset.

# Import the required libraries
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
import pandas as pd
%matplotlib inline

# Load the iris dataset
iris = datasets.load_iris()

# Build the DataFrame with readable column names
iris_df = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length',
                                      'Petal_Width'])
iris_df['Target'] = iris.target
# Replace the integer target codes with species labels
iris_df['Target'] = iris_df['Target'].map({0: 'Iris_Setosa',
                                           1: 'Iris_Versicolor',
                                           2: 'Iris_Virginica'})

# Plot the KDE of sepal width for Iris_Virginica
sns.kdeplot(iris_df.loc[(iris_df['Target']=='Iris_Virginica'),
                   'Sepal_Width'], color='green', fill=True, label='Iris_Virginica')
plt.legend()  # show the label passed to kdeplot

# Set the x and y labels
plt.xlabel('Sepal Width')
plt.ylabel('Probability Density')

The iris data records each flower's Sepal_Length, Sepal_Width, Petal_Length and Petal_Width in centimetres. On the basis of these four measurements, each flower is classified as Iris_Setosa, Iris_Versicolor or Iris_Virginica. There are 150 entries in total.

The steps we followed to create the KDE plot:

  • We start by importing the required libraries: pandas, seaborn, NumPy, and datasets from sklearn.
  • Next we load the iris dataset from sklearn's datasets module and name it iris.
  • We convert the data to a pandas DataFrame, which will be passed to the kdeplot() function, and name the columns so that each can be identified individually.
  • We add a new column, Target, to the DataFrame to hold the target value of each row.
  • We replace the Target values with labels: the raw values are the codes {0, 1, 2}, which we change to Iris_Setosa, Iris_Versicolor and Iris_Virginica.
  • We call kdeplot() on the Sepal_Width column for the rows whose Target is Iris_Virginica. The curve is green, the area under it is shaded, and the label indicates that the plot is drawn for Iris_Virginica.
  • Finally, we label the x-axis and the y-axis. We don't need to call the show() function because %matplotlib inline has already been set.

We can also draw KDE plots for several target values on the same graph:

# Plot one KDE per species on the same axes
colours = ['blue', 'red', 'green']
for species, colour in zip(iris_df['Target'].unique(), colours):
    sns.kdeplot(iris_df.loc[(iris_df['Target']==species),
                'Sepal_Width'], color=colour, fill=True, label=species)
plt.legend()  # show which curve belongs to which species

# Set the x and y labels
plt.xlabel('Sepal Width')
plt.ylabel('Probability Density')

We can also create a bivariate KDE plot with seaborn. A bivariate plot shows the joint probability distribution of two variables, depicting how one is distributed with respect to the other.

Syntax for a bivariate kdeplot:
seaborn.kdeplot(x=x, y=y)

Bivariate kdeplot on the iris dataset

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
import pandas as pd
%matplotlib inline 

iris = datasets.load_iris() 
iris_df = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length',
                                      'Petal_Width'])
iris_df['Target'] = iris.target
# Replace the integer target codes with species labels
iris_df['Target'] = iris_df['Target'].map({0: 'Iris_Setosa',
                                           1: 'Iris_Versicolor',
                                           2: 'Iris_Virginica'})

# Query the rows for the selected target
iris_virginica = iris_df.query("Target=='Iris_Virginica'")

# Plot the bivariate KDE (recent seaborn versions require x and y as keywords)
# thresh leaves the lowest density level unfilled, as shade_lowest=False used to
sns.kdeplot(x=iris_virginica['Sepal_Length'],
            y=iris_virginica['Sepal_Width'],
            color='b', fill=True, label='Iris_Virginica',
            cmap="Blues", thresh=0.05)

  • To draw a bivariate kdeplot we first run a query that selects every row of the data whose target value is Iris_Virginica.
  • We then call kdeplot with x and y data: the Sepal_Length and Sepal_Width entries of the rows selected for Iris_Virginica.
  • The colour of the graph is set to blue with the Blues colormap, and the contours are filled.

Bivariate kdeplot for multiple samples:

# Select each species by its integer class code in iris.target
iris_versicolor = iris_df[iris.target == 1]
iris_virginica = iris_df[iris.target == 2]

# Plot both bivariate KDEs on the same axes
sns.kdeplot(x=iris_virginica['Sepal_Length'],
            y=iris_virginica['Sepal_Width'],
            color='b', fill=True, label='Iris_Virginica',
            cmap="Blues", thresh=0.05)

sns.kdeplot(x=iris_versicolor['Sepal_Length'],
            y=iris_versicolor['Sepal_Width'],
            color='r', fill=True, label='Iris_Versicolor',
            cmap="Reds", thresh=0.05)

Apart from all this, seaborn's kdeplot can do many other things; for example, it can draw the plot vertically, with the density along the horizontal axis.

import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(500)
res = sn.kdeplot(y=data, color='orange', fill=True)  # passing the data as y draws the plot vertically
plt.show()

A KDE plot can also be drawn using distplot() (deprecated since seaborn 0.11 in favour of displot() and histplot()).
Let us see how the distplot() function works when we want to draw a KDE plot.
Distplot: this function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions.
By default the hist and kde arguments of distplot() are both True, so it draws both a histogram and a KDE curve for the data passed to it. To draw only one of them, set hist or kde to False. Here we want only the KDE plot, so we set hist to False and pass the data to distplot().

Example:

import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(500)
res = sn.distplot(data, hist=False)  # hist=False keeps only the KDE curve
plt.show()

Example 2:

For the iris dataset:
sns.distplot(iris_df.loc[(iris_df['Target']=='Iris_Virginica'), 'Sepal_Width'], hist=False)

This graphical representation shows clearly whether the data is skewed in one direction and also conveys its central tendency.


By Vikas Upadhyay