Spam/Ham Classification using Naive Bayes

Tushar Tangri
Last Updated: May 13, 2022

Introduction

Email and text messages have become a crucial part of our daily lives because they are handy and easy to use. Since the user base is enormous and messages often contain sensitive information, these channels are susceptible to being compromised. Attackers can steal your information, leading to data theft, wasted time, and other security concerns. This creates the need for a filter that can segregate messages into spam and ham.

Ham refers to genuine mail that is important and informative to the user. In contrast, spam is unsolicited mail, often sent from unreliable sources with harmful intentions. In this blog, we will study how to segregate such messages into spam/ham using Naive Bayes in machine learning. But first, let’s start from scratch.

Naive Bayes Classifiers

Naive Bayes classifiers are among the most widely used predictive algorithms in machine learning. They predict the probability that an observation belongs to a class, assuming that the features are independent of one another given the class. The approach is easiest to grasp with binary or categorical input values. Using Bayes' theorem, we can calculate the probability of C given that X has been observed. In this case, X represents the evidence, and C represents the hypothesis. The predictors/features are assumed to be independent; that is, the presence of one feature does not affect the presence of another. This is why the classifier is called naive.
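For reference, Bayes' theorem and the independence assumption described above can be written as follows. Here C is the class (spam or ham) and X is the evidence; the feature symbols x_1, ..., x_n (for example, per-word counts) are notation introduced here for illustration.

```latex
P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)},
\qquad
P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
\quad\Longrightarrow\quad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The denominator P(X) is the same for every class, so it can be dropped when we only need to pick the most probable class.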


The Bayes classifier determines the probability of a particular event (in our example, a message being spam) from the combined probability distributions of several other events (in our case, the appearance of certain words in the message). We'll see the classifier in action later in the article, but first, let's get a sense of the data we'll be dealing with.
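To make the idea concrete, here is a minimal sketch of Bayes' theorem applied to a single word. All of the probabilities below are made-up values chosen only for illustration; they are not taken from the Kaggle dataset.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_spam = 0.2                 # prior: P(spam)
p_ham = 1 - p_spam           # prior: P(ham)
p_word_given_spam = 0.30     # P("free" appears | spam)
p_word_given_ham = 0.02      # P("free" appears | ham)

# Total probability of seeing the word in any message: P("free")
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Posterior: P(spam | "free" appears)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

Even though only 20% of the messages are spam in this toy setup, seeing the word shifts the probability of spam well above one half, because the word is far more likely under spam than under ham.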

Spam/Ham Classification using Naive Bayes

Understanding the dataset 

For spam/ham classification, we take our training dataset from Kaggle. The dataset contains 5,000+ text message samples, each labeled as spam or ham depending on its content.

Step 1: Import all the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

 

Step 2: Import the data and print the head to get a first look at it. In this case, we are using the .csv file from Kaggle named spam.csv.

df_sms = pd.read_csv('spam.csv',encoding='latin-1')
df_sms.head()

 

Output:

 

Data Cleaning

Step 3: As we can see, we have some unwanted columns. Let’s drop them and rename the remaining columns to get a cleaner table.

df_sms = df_sms.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df_sms = df_sms.rename(columns={"v1":"label", "v2":"sms"})
df_sms.head()

 

Output: 

Step 4: Print the total counts of spam and ham to get a better idea of the dataset.

df_sms.label.value_counts()

 

Output:

ham     4825
spam     747
Name: label, dtype: int64

 

Step 5: Since binary labels are easier to compute with, let's convert the labels to numbers: 0 for spam and 1 for ham.

df_sms['Labeling']= df_sms['label'].map({'ham': 1, 'spam':0})
df_sms.head()

 

Output:
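One thing worth knowing about map with a dictionary: any label that is not a key in the dictionary becomes NaN. The sketch below shows one way to guard against that. The toy_df frame and its rows are hypothetical stand-ins for df_sms, and the lower-casing step is an assumption about how stray label variants might be handled.

```python
import pandas as pd

# Toy frame standing in for df_sms; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "Spam"],   # note the stray capitalised "Spam"
    "sms": ["see you at 5", "WIN a free prize", "ok thanks", "claim your reward"],
})

# Normalise the text labels before mapping so variants like "Spam" are not lost.
cleaned = toy_df["label"].str.strip().str.lower()
toy_df["Labeling"] = cleaned.map({"ham": 1, "spam": 0})

# Any label outside the dictionary would become NaN; count them to be sure.
unmapped = int(toy_df["Labeling"].isna().sum())
print(unmapped)
```

If unmapped is greater than zero on real data, it is worth inspecting the offending labels before training.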

Step 6: Let’s find the length of every spam/ham message to help categorize them better. 

df_sms['length'] = df_sms['sms'].apply(len)
df_sms.head()

 

Output: 

 

Step 7: Let’s visualize the data of length using a histogram. 

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
df_sms['length'].plot(bins=50, kind='hist')

 

Output: 

Step 7 (continued): Using a histogram, let’s also visualize the lengths of spam and ham messages separately.

df_sms.hist(column='length', by='label', bins=50,figsize=(10,4))

 

Output: 
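The histograms can be complemented with a numeric summary of message length per class. This is a small sketch on a hypothetical toy_df (the messages are invented); on the real data the same groupby call could be applied to df_sms.

```python
import pandas as pd

# Made-up messages standing in for the real dataset.
toy_df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "sms": [
        "ok",
        "see you soon",
        "WIN a FREE prize now, call 0800 000 000",
        "URGENT! claim your reward today",
    ],
})

# Character length of each message, as in Step 6.
toy_df["length"] = toy_df["sms"].str.len()

# Count, mean, and maximum length for each class.
length_summary = toy_df.groupby("label")["length"].agg(["count", "mean", "max"])
print(length_summary)
```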

Splitting the dataset into train and test

Step 8: Let’s split the data into training and test sets using the sklearn library.

X = df_sms['sms']
Y = df_sms['Labeling']
from sklearn.model_selection import train_test_split as tt
X_train, X_test, Y_train, Y_test = tt(X, Y,test_size=0.2, random_state=100)
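Because the dataset is imbalanced (far more ham than spam), one optional variation is a stratified split, which keeps the class proportions the same in both parts. This is not what the code above does; it is a sketch of an alternative, shown on made-up toy_X and toy_y series.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 placeholder messages, 20% spam (0) and 80% ham (1).
toy_X = pd.Series([f"message {i}" for i in range(20)])
toy_y = pd.Series([0] * 4 + [1] * 16)

# stratify=toy_y asks sklearn to preserve the spam/ham ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=100, stratify=toy_y
)

# Share of ham in each split; both should match the overall 80%.
print(y_tr.mean(), y_te.mean())
```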

 

Step 9: Let’s check the shapes of the training and test splits.

X_train.shape

 

Output: 

(4457,)

 

X_test.shape

 

Output:

(1115,)

 

Exclude Stop Words

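Step 10: Stop words are very common English words, such as "the", "is", and "and", that carry little information about whether a message is spam. We exclude them using sklearn's CountVectorizer with stop_words='english', and fit it on the training messages to build the vocabulary.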

from sklearn.feature_extraction.text import CountVectorizer
vector = CountVectorizer(stop_words ='english')
vector.fit(X_train)

 

Step 11: Print the learned vocabulary. Each word maps to its column index in the count matrix, not to its frequency.

vector.vocabulary_

 

Output:

{'clearing': 1819, 'cars': 1642, 'meant': 4323, 'calculation': 1577, 'units': 6960, 'expensive': 2677, 'started': 6274, 'practicing': 5214, 'accent': 784, 'important': 3549, 'decided': 2183, '4years': 539, 'dental': 2226, 'nmde': 4684, 'exam': 2656, 'idk': 3516, 'sitting': 6023, 'shop': 5939, 'parking': 4954, 'lot': 4124, 'bawling': 1240, 'feel': 2759, ...}
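To see what the vectorizer does on a small scale, here is a sketch with two invented messages. The names toy_messages, toy_vector, and toy_counts are hypothetical; the CountVectorizer settings match the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages for illustration.
toy_messages = [
    "Win a free prize today",
    "Are you free for lunch",
]

toy_vector = CountVectorizer(stop_words="english")
toy_counts = toy_vector.fit_transform(toy_messages)

# vocabulary_ maps each kept word to a column index (not a frequency).
print(toy_vector.vocabulary_)

# Each row is a message, each column a word; the entries are the counts.
print(toy_counts.toarray())
```

Common words such as "are", "you", and "for" are dropped as stop words, so they never get a column, while "free" gets one column shared by both messages.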

 

Step 12: Transform the training and test messages into word-count vectors using the fitted vectorizer.

X_train_transformed =vector.transform(X_train)
X_test_transformed =vector.transform(X_test)

 

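Building the Final Model using Naive Bayes

Step 13: We import the multinomial Naive Bayes classifier from sklearn, fit it on the training vectors, and predict labels and class probabilities for the test vectors.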

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_transformed,Y_train)
y_pred = model.predict(X_test_transformed)
y_pred_prob = model.predict_proba(X_test_transformed)
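The vectorizing and model-fitting steps can also be chained into a single sklearn pipeline, which makes it easy to classify new raw messages. This is an optional sketch on invented data, not part of the original walkthrough; toy_texts, toy_labels, spam_pipeline, and new_messages are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages; 0 = spam, 1 = ham, as in Step 5.
toy_texts = [
    "win a free prize now",
    "free cash claim your prize",
    "urgent call to claim free reward",
    "are we meeting for lunch today",
    "see you at home tonight",
    "thanks for the lovely dinner",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes raw text and then applies multinomial Naive Bayes.
spam_pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
spam_pipeline.fit(toy_texts, toy_labels)

# New raw messages can be passed in directly, without a separate transform step.
new_messages = ["claim your free prize", "lunch at home today"]
predicted = spam_pipeline.predict(new_messages)
print(predicted)
```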

 

Performing the Predictions 

Step 14: To evaluate the model's predictions, we import the confusion matrix, accuracy score, precision score, recall score, and F1 score.

from sklearn.metrics import confusion_matrix,accuracy_score,precision_score,recall_score,f1_score
print(confusion_matrix(Y_test,y_pred))
print()
print(accuracy_score(Y_test,y_pred))

 

Output:

[[135  10]
 [  9 961]]

0.9829596412556054

 

Step 15: Print the precision, recall, and F1 score. Because ham is encoded as 1, these scores treat ham as the positive class.

print(precision_score(Y_test,y_pred))
print()
print(recall_score(Y_test,y_pred))
print()
print(f1_score(Y_test,y_pred))
print()

 

Output:

0.9897013388259527

0.9907216494845361

0.9902112313240599
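The scores above can be reproduced by hand from the confusion matrix printed in Step 14, which is a useful check on how to read it. The sketch below also computes precision and recall for spam (label 0), which sklearn does not report by default; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix from Step 14: rows are true labels, columns are predicted
# labels, both ordered 0 (spam), 1 (ham).
cm = np.array([[135, 10],
               [9, 961]])

# With ham (label 1) as the positive class:
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# The same matrix read with spam (label 0) as the class of interest.
spam_precision = tn / (tn + fn)   # of messages predicted spam, share truly spam
spam_recall = tn / (tn + fp)      # of true spam messages, share that were caught
print(spam_precision, spam_recall)
```

The spam-side scores are noticeably lower than the ham-side ones, which is typical when the class of interest is the minority class.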

 

ROC curve

Step 16: Import the ROC curve functions from sklearn to compute the true and false positive rates and the area under the curve.

from sklearn.metrics import roc_curve, auc

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, y_pred_prob[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
print(roc_auc)

 

Output:

0.9866619267685744

 

Step 17: Print the false positive rates, true positive rates, and thresholds.

print(false_positive_rate)
print()
print(true_positive_rate)
print()
print(thresholds)

 

Output (only the false_positive_rate array is shown):

[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.00689655 0.00689655 0.00689655 0.00689655 0.00689655
 0.00689655 0.00689655 0.00689655 0.00689655 0.00689655 0.00689655
 0.00689655 0.00689655 0.00689655 0.00689655 0.00689655 0.00689655
 0.00689655 0.0137931  0.0137931  0.0137931  0.0137931  0.0137931
 0.0137931  0.0137931  0.0137931  0.0137931  0.0137931  0.02068966
 0.02068966 0.02068966 0.02068966 0.02068966 0.02068966 0.02068966
 0.02068966 0.02068966 0.02068966 0.02068966 0.02068966 0.02068966
 0.02068966 0.02068966 0.02758621 0.02758621 0.02758621 0.02758621
 0.02758621 0.02758621 0.02758621 0.02758621 0.02758621 0.02758621
 0.02758621 0.02758621 0.02758621 0.02758621 0.03448276 0.03448276
 0.03448276 0.03448276 0.03448276 0.03448276 0.03448276 0.03448276
 0.03448276 0.03448276 0.03448276 0.04137931 0.04137931 0.04827586
 0.04827586 0.05517241 0.05517241 0.06206897 0.06206897 0.06896552
 0.06896552 0.06896552 0.06896552 0.08275862 0.08275862 0.09655172
 0.09655172 0.10344828 0.10344828 0.11034483 0.11034483 0.29655172
 0.31034483 0.54482759 0.55862069 0.69655172 0.71034483 0.8
 0.82758621 0.84827586 0.86206897 0.88275862 0.89655172 0.9862069
 1.        ] 

 

Step 18: Collect the thresholds, TPR, and FPR in a DataFrame.

df = pd.DataFrame({'Threshold': thresholds, 
              'TPR': true_positive_rate, 
              'FPR':false_positive_rate
             })
df.head()

 

Output:
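A table like this can be used to pick a decision threshold. One common heuristic, which the original walkthrough does not use, is Youden's J statistic (TPR minus FPR). The sketch below applies it to invented labels and scores; toy_true, toy_scores, and roc_table are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Made-up true labels and predicted probabilities for the positive class.
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
toy_scores = np.array([0.10, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90, 0.95])

fpr, tpr, thr = roc_curve(toy_true, toy_scores)
roc_table = pd.DataFrame({"Threshold": thr, "TPR": tpr, "FPR": fpr})

# Youden's J: the threshold with the largest gap between TPR and FPR.
roc_table["J"] = roc_table["TPR"] - roc_table["FPR"]
best_row = roc_table.loc[roc_table["J"].idxmax()]
print(best_row)
```

Whether this is the right threshold depends on the relative cost of false positives and false negatives, so treat it as a starting point rather than a rule.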

 

Step 19: Plot the ROC curve of the dataset. 

%matplotlib inline  
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.plot(false_positive_rate, true_positive_rate)

 

Output:

Frequently Asked Questions

Q1. Can we use Bernoulli Naive Bayes classification for spam/ham?

Ans: Yes, we can use Bernoulli Naive Bayes classification. Instead of importing MultinomialNB, we import BernoulliNB.

from sklearn.naive_bayes import BernoulliNB
modelB = BernoulliNB()
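Bernoulli Naive Bayes models whether each word is present or absent rather than how often it occurs, so a natural pairing is a vectorizer with binary=True. The following is a minimal sketch on invented data; toy_texts, toy_labels, and binary_vector are hypothetical names, and the binary=True setting is a suggestion, not something the original code uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up training messages; 0 = spam, 1 = ham.
toy_texts = [
    "win a free prize now",
    "free cash claim your prize",
    "urgent call to claim free reward",
    "are we meeting for lunch today",
    "see you at home tonight",
    "thanks for the lovely dinner",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# binary=True records only presence (1) or absence (0) of each word.
binary_vector = CountVectorizer(stop_words="english", binary=True)
toy_features = binary_vector.fit_transform(toy_texts)

modelB = BernoulliNB()
modelB.fit(toy_features, toy_labels)

# Classify a new message with the fitted vectorizer and model.
new_features = binary_vector.transform(["claim your free prize"])
predicted_B = modelB.predict(new_features)
print(predicted_B)
```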

 

Q2. When is the best time to use naive Bayes classification? 

Ans: When the independence assumption holds, a Naive Bayes classifier can outperform alternative models such as logistic regression and requires less training data. It also tends to perform better with categorical input variables than with numerical ones.

 

Q3. Does Naive Bayes come under supervised or unsupervised learning?

Ans: Naive Bayes classification comes under supervised learning. It is supervised because Naive Bayes classifiers are trained on labeled data, i.e., data that has already been categorized into the target classes.

Key Takeaways

Spam messages can be a real headache and cause a lot of inconvenience to users. In this article, we discussed spam/ham classification using Naive Bayes. We first looked at how Naive Bayes works, and then classified messages as spam or ham using code in Python.

To read more such real-world implementations of these concepts, read our blogs on the Coding Ninjas website.
