In today's world, almost everyone uses a mobile phone, and all of us receive messages (SMS/email) on it daily. The catch is that many of these messages are spam, and only a few are ham, i.e. genuine messages.
In this article, we are going to create an SMS spam detection model which will help you to find whether an SMS is spam or not using LSTM.
About the Dataset: We are using the SMS Spam Detection Dataset, which contains SMS texts and their corresponding labels (Spam or Ham).
First of all, we import all the required libraries for data preprocessing and modelling.
import pandas as pd
import numpy as np
import re
import collections
import contractions
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('dark_background')
import nltk
# nltk.download("stopwords")  # uncomment on first run
# nltk.download("wordnet")    # uncomment on first run
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import warnings
warnings.simplefilter(action='ignore', category=Warning)
import keras
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pickle
Importing the SMS spam detection dataset
df = pd.read_csv("spam.csv", encoding='latin-1')
df.head()
df.shape
# output - (5572, 5)
As you can see, our data contains some columns that are not useful to us, so let's drop them.
df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
Also, we are renaming the column names for our convenience.
df.columns = ["SpamHam","Tweet"]
Let’s plot the value counts of both spam and ham SMS.
sns.countplot(x=df["SpamHam"])
The number of ham messages is more than that of spam messages in the data.
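If you want the exact figures behind the plot, value_counts gives them directly; the numbers in the comment below are those of the standard SMS Spam Collection and may differ slightly for other versions of the dataset:

df["SpamHam"].value_counts()
# ham     4825
# spam     747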
Before applying any preprocessing, let's plot the counts of the most frequent words in our dataset. For this, we create a function named word_count_plot.
def word_count_plot(data):
    # count the occurrences of every word across all sentences
    word_counter = collections.Counter([word for sentence in data for word in sentence.split()])
    most_count = word_counter.most_common(30)  # 30 most common words
    # data frame sorted by count
    most_count = pd.DataFrame(most_count, columns=["Word", "Count"]).sort_values(by="Count")
    most_count.plot.barh(x="Word", y="Count", color="green", figsize=(10, 15))

word_count_plot(df["Tweet"])
As you can see most of the words are stopwords. So let’s do some preprocessing techniques on the dataset.
lem = WordNetLemmatizer()

def preprocessing(data):
    sms = contractions.fix(data)  # expanding shortened words (e.g. "I'm" to "I am")
    sms = sms.lower()  # lower casing the sms
    sms = re.sub(r'https?://\S+|www\.\S+', "", sms).strip()  # removing urls
    sms = re.sub("[^a-z ]", "", sms)  # removing symbols and numbers
    sms = sms.split()  # splitting into words
    # lemmatization and stopword removal
    sms = [lem.lemmatize(word) for word in sms if word not in set(stopwords.words("english"))]
    sms = " ".join(sms)
    return sms

X = df["Tweet"].apply(preprocessing)
We have completed the data preprocessing, so let's plot the word counts once again to see the most frequent words.
word_count_plot(X)
Now we can see the most common words other than the stopwords. Let’s continue our preprocessing.
Since our output values (Spam or Ham) are categorical, we have to convert them into numerical form. So we encode them with LabelEncoder.
from sklearn.preprocessing import LabelEncoder

lb_enc = LabelEncoder()
y = lb_enc.fit_transform(df["SpamHam"])
We have converted our output feature into numerical form, but what about the input feature? Let's convert it into numerical form as well, using the Keras Tokenizer followed by padding.
First, let's tokenize our data and convert it into numerical sequences.
tokenizer = Tokenizer()  # initializing the tokenizer
tokenizer.fit_on_texts(X)  # fitting on the sms data
text_to_sequence = tokenizer.texts_to_sequences(X)  # creating the numerical sequences
Let's look at a few texts and their corresponding numerical sequences.
for i in range(5):
    print("Text : ", X[i])
    print("Numerical Sequence : ", text_to_sequence[i])
We can also look up the index number assigned to each word.
tokenizer.index_word # this will output a dictionary of index and words
{1: 'call', 2: 'get', 3: 'ur', 4: 'go', 5: 'free', 6: 'ok', 7: 'ltgt', 8: 'know', 9: 'day', 10: 'got', 11: 'want', 12: 'come', 13: 'like', 14: 'love', 15: 'good', 16: 'time', 17: 'going', 18: 'text', 19: 'send', 20: 'need', 21: 'one', 22: 'today', 23: 'txt', 24: 'home', 25: 'lor', 26: 'see', 27: 'sorry', 28: 'stop', 29: 'r', 30: 'still',......}
This dictionary contains 7774 entries, which means that our data contains 7774 unique words.
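As a small illustrative check (not part of the original walkthrough), we can verify the vocabulary size directly and use index_word to decode a sequence back into words:

vocab_size = len(tokenizer.word_index)
print(vocab_size)  # 7774 unique words after preprocessing

# decode the first numerical sequence back into its words
print([tokenizer.index_word[idx] for idx in text_to_sequence[0]])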
As you can see in text_to_sequence, the sequences have different lengths, which the model cannot handle directly. So we should make all the sequences the same length. For this, we pad the sequences with "0".
max_length_sequence = max([len(i) for i in text_to_sequence])  # finding the length of the longest sequence
padded_sms_sequence = pad_sequences(text_to_sequence, maxlen=max_length_sequence, padding="pre")
padded_sms_sequence
array([[ 0, 0, 0, ..., 10, 3568, 68], [ 0, 0, 0, ..., 1177, 330, 1542], [ 0, 0, 0, ..., 2419, 263, 2420], ..., [ 0, 0, 0, ..., 1028, 7773, 3565], [ 0, 0, 0, ..., 792, 65, 5], [ 0, 0, 0, ..., 2152, 367, 145]], dtype=int32)
The input data is now ready to feed into the model, so let's create the LSTM model for training.
TOT_SIZE = len(tokenizer.word_index) + 1

def create_model():
    lstm_model = Sequential()
    lstm_model.add(Embedding(TOT_SIZE, 32, input_length=max_length_sequence))
    lstm_model.add(LSTM(100))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation="relu"))
    lstm_model.add(Dropout(0.3))
    lstm_model.add(Dense(1, activation="sigmoid"))
    return lstm_model

lstm_model = create_model()
lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
lstm_model.summary()
We have created our LSTM model; now let's train it with the input and output features created earlier.
lstm_model.fit(padded_sms_sequence, y, epochs = 5, validation_split=0.2, batch_size=16)
Both the training accuracy (0.9986) and validation accuracy (0.9839) indicate that our model is very good at telling spam from ham SMS.
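As a quick sanity check, here is a minimal sketch of how the trained model could classify a new message; the predict_sms helper, the example text, and the 0.5 threshold are illustrative assumptions, not part of the original article:

def predict_sms(text):
    cleaned = preprocessing(text)  # reuse the same preprocessing pipeline
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=max_length_sequence, padding="pre")
    prob = lstm_model.predict(padded)[0][0]
    # LabelEncoder sorts labels alphabetically, so "ham" -> 0 and "spam" -> 1
    return "spam" if prob > 0.5 else "ham"

print(predict_sms("Congratulations! You have won a free prize. Call now!"))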
We can save the tokenizer as a pickle file for future use. The Keras model itself is better saved with its built-in save method, since Keras models generally cannot be pickled directly.
pickle.dump(tokenizer, open("sms_spam_tokenizer.pkl", "wb"))
lstm_model.save("lstm_model.h5")
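Later, for example in a deployment script, the saved artifacts could be loaded back along these lines (a minimal sketch; the file names simply match the ones used above):

from keras.models import load_model

tokenizer = pickle.load(open("sms_spam_tokenizer.pkl", "rb"))
lstm_model = load_model("lstm_model.h5")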
Through this article, you should be able to understand and create a text classification model using the LSTM architecture. In future articles, we will look at other text classification techniques and other Natural Language Processing models.
Thank You!..