In today's world, almost everyone uses a mobile phone, and all of us receive messages (SMS/email) on it daily. The catch is that many of these messages are spam, and only a few are ham, i.e. genuine messages.
In this article, we are going to create an SMS spam detection model which will help you to find whether an SMS is spam or not using LSTM.
About the Dataset: We are using the SMS Spam Detection Dataset, which contains SMS texts and their corresponding labels (Spam or Ham).
First of all, we import all the required libraries for data preprocessing and modelling.
import pandas as pd
import numpy as np
import re
import collections
import contractions
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('dark_background')
import nltk
# nltk.download("stopwords")  # uncomment on first run
# nltk.download("wordnet")    # uncomment on first run
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import warnings
warnings.simplefilter(action='ignore', category=Warning)
import keras
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pickle
Importing the SMS spam detection dataset
df = pd.read_csv("spam.csv", encoding='latin-1')
df.head()
df.shape
# output - (5572, 5)
As you can see, our data contains some columns that are not useful to us, so let's drop them.
df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
Also, we are renaming the column names for our convenience.
df.columns = ["SpamHam","Tweet"]
Let’s plot the value counts of both spam and ham SMS.
sns.countplot(x=df["SpamHam"])
The number of ham messages is more than that of spam messages in the data.
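If you want the exact figures behind the plot, value_counts gives them directly; the numbers in the comment below are those of the standard SMS Spam Collection and may differ slightly for other versions of the dataset:

df["SpamHam"].value_counts()
# ham     4825
# spam     747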
Before applying any preprocessing, let's plot the counts of the most frequent words in our dataset. For this, we create a function named word_count_plot.
def word_count_plot(data):
    # count the occurrences of every word across all sentences
    word_counter = collections.Counter([word for sentence in data for word in sentence.split()])
    most_count = word_counter.most_common(30)  # 30 most common words
    # data frame sorted by count
    most_count = pd.DataFrame(most_count, columns=["Word", "Count"]).sort_values(by="Count")
    most_count.plot.barh(x="Word", y="Count", color="green", figsize=(10, 15))

word_count_plot(df["Tweet"])
As you can see most of the words are stopwords. So let’s do some preprocessing techniques on the dataset.
lem = WordNetLemmatizer()

def preprocessing(data):
    sms = contractions.fix(data)  # expanding shortened words (e.g. "I'm" to "I am")
    sms = sms.lower()  # lower casing the sms
    sms = re.sub(r'https?://\S+|www\.\S+', "", sms).strip()  # removing urls
    sms = re.sub("[^a-z ]", "", sms)  # removing symbols and numbers
    sms = sms.split()  # splitting into words
    # lemmatization and stopword removal
    sms = [lem.lemmatize(word) for word in sms if word not in set(stopwords.words("english"))]
    sms = " ".join(sms)
    return sms

X = df["Tweet"].apply(preprocessing)
We have completed the data preprocessing, so let's plot the word counts once again to see the most frequent words.
word_count_plot(X)
Now we can see the most common words other than the stopwords. Let’s continue our preprocessing.
Since our output values (Spam or Ham) are categorical, we have to convert them into numerical form. So we encode them with LabelEncoder.
from sklearn.preprocessing import LabelEncoder

lb_enc = LabelEncoder()
y = lb_enc.fit_transform(df["SpamHam"])
We have converted our output feature into numerical form, but what about the input feature? Let's convert it into numerical form as well, using the Keras Tokenizer followed by padding.
First, let's tokenize our data and convert it into numerical sequences.
tokenizer = Tokenizer()  # initializing the tokenizer
tokenizer.fit_on_texts(X)  # fitting on the sms data
text_to_sequence = tokenizer.texts_to_sequences(X)  # creating the numerical sequences
Let's look at a few texts and their corresponding numerical sequences.
for i in range(5):
    print("Text : ", X[i])
    print("Numerical Sequence : ", text_to_sequence[i])
We can also look up the index number assigned to each word.
tokenizer.index_word # this will output a dictionary of index and words
{1: 'call', 2: 'get', 3: 'ur', 4: 'go', 5: 'free', 6: 'ok', 7: 'ltgt', 8: 'know', 9: 'day', 10: 'got', 11: 'want', 12: 'come', 13: 'like', 14: 'love', 15: 'good', 16: 'time', 17: 'going', 18: 'text', 19: 'send', 20: 'need', 21: 'one', 22: 'today', 23: 'txt', 24: 'home', 25: 'lor', 26: 'see', 27: 'sorry', 28: 'stop', 29: 'r', 30: 'still',......}
This dictionary contains 7774 entries, which means that our data contains 7774 unique words.
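As a small illustrative check (not part of the original walkthrough), we can verify the vocabulary size directly and use index_word to decode a sequence back into words:

vocab_size = len(tokenizer.word_index)
print(vocab_size)  # 7774 unique words after preprocessing

# decode the first numerical sequence back into its words
print([tokenizer.index_word[idx] for idx in text_to_sequence[0]])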
As you can see in text_to_sequence, the sequences have different lengths, which the model cannot handle directly. So we should make all the sequences the same length. For this, we pad the sequences with "0".
max_length_sequence = max([len(i) for i in text_to_sequence])  # finding the length of the longest sequence
padded_sms_sequence = pad_sequences(text_to_sequence, maxlen=max_length_sequence, padding="pre")
padded_sms_sequence
array([[ 0, 0, 0, ..., 10, 3568, 68], [ 0, 0, 0, ..., 1177, 330, 1542], [ 0, 0, 0, ..., 2419, 263, 2420], ..., [ 0, 0, 0, ..., 1028, 7773, 3565], [ 0, 0, 0, ..., 792, 65, 5], [ 0, 0, 0, ..., 2152, 367, 145]], dtype=int32)
The input data is now ready to feed into the model, so let's create the LSTM model for training.
TOT_SIZE = len(tokenizer.word_index) + 1

def create_model():
    lstm_model = Sequential()
    lstm_model.add(Embedding(TOT_SIZE, 32, input_length=max_length_sequence))
    lstm_model.add(LSTM(100))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation="relu"))
    lstm_model.add(Dropout(0.3))
    lstm_model.add(Dense(1, activation="sigmoid"))
    return lstm_model

lstm_model = create_model()
lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
lstm_model.summary()
We have created our LSTM model; now let's train it with the input and output features created earlier.
lstm_model.fit(padded_sms_sequence, y, epochs = 5, validation_split=0.2, batch_size=16)
Both the training accuracy (0.9986) and validation accuracy (0.9839) indicate that our model is very good at telling spam from ham SMS.
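As a quick sanity check, here is a minimal sketch of how the trained model could classify a new message; the predict_sms helper, the example text, and the 0.5 threshold are illustrative assumptions, not part of the original article:

def predict_sms(text):
    cleaned = preprocessing(text)  # reuse the same preprocessing pipeline
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=max_length_sequence, padding="pre")
    prob = lstm_model.predict(padded)[0][0]
    # LabelEncoder sorts labels alphabetically, so "ham" -> 0 and "spam" -> 1
    return "spam" if prob > 0.5 else "ham"

print(predict_sms("Congratulations! You have won a free prize. Call now!"))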
We can save the tokenizer as a pickle file for future use. The Keras model itself is better saved with its built-in save method, since Keras models generally cannot be pickled directly.
pickle.dump(tokenizer, open("sms_spam_tokenizer.pkl", "wb"))
lstm_model.save("lstm_model.h5")
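Later, for example in a deployment script, the saved artifacts could be loaded back along these lines (a minimal sketch; the file names simply match the ones used above):

from keras.models import load_model

tokenizer = pickle.load(open("sms_spam_tokenizer.pkl", "rb"))
lstm_model = load_model("lstm_model.h5")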
Through this article, you should be able to understand and create a text classification model using the LSTM architecture. In future articles, we will look at other text classification techniques and other Natural Language Processing models.
Thank You!..