Social networks have become a popular way for internet users to communicate with friends and family, read news, and discuss current events, and users increasingly store and share personal information on well-known platforms such as Facebook and Twitter. Most existing research on Twitter spam focuses on identifying and blocking spam accounts or spammers. To detect spam users, traditional machine learning algorithms can use information from users' tweets, demographics, shared URLs, and social connections as features.
What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of computer science concerned with understanding natural language. The field studies human language and communication in its various forms and builds systems that take language as input and transform it into a desired output. Applications range from a simple tool like spam detection to more complex programs like Google Translate, which takes text in one language as input and generates a translation in another.
In this article, we will discuss how to build a spam text detector for the spam messages that flood social media platforms such as Twitter. We use well-known machine learning classification algorithms, such as Support Vector Machines (SVM) and different Naïve Bayesian (NB) techniques, to evaluate the proposed features.
Natural Language Processing Pipeline
We will follow the below pipeline, which includes the loading and preprocessing of the Dataset, model training, testing, etc.:
1. Create a semi-supervised learning model based on Naive Bayes and Support Vector Machine classifiers to classify spam tweets.
2. Clean & Preprocess data (in this case, tweets dataset) obtained from Kaggle. Preprocessing involves the seven stages of natural language processing.
3. Pictorial/Graphical representation of this Data.
4. Vectorize the preprocessed data so that we can apply various machine learning algorithms to the data.
5. Calculate various error measurement parameters like F1-Score, Accuracy, etc., and depict these in the form of confusion matrices.
6. Conclude the best algorithm based on the previously calculated error measurement parameters.
7. Apply the final algorithm over live tweets on Twitter using API.
Experiments on a large dataset suggest that the system adapts to new spam patterns and maintains high accuracy for spam identification in a Twitter stream without the semi-supervised model’s corpus having to be constantly updated. However, by the time we discover a rogue user, they may already have infected many additional users. We feel that spam detection at the tweet level therefore complements spam detection at the user level. Because our dataset contains minimal user information, we employed a simple strategy for user-level spam detection.
Past Literature Survey
This section will discuss past papers on Natural Language Processing techniques, their findings, limitations, future scope, etc.
Paper-1
Title:
Semi-Supervised Spam Detection in Twitter Stream (IEEE Xplore 2017)
Findings:
This paper offers S3D, a semi-supervised spam detection framework. S3D uses four lightweight detectors to detect spam tweets in real time and regularly updates its models in batch mode. The experimental findings show that the spam detection framework works well with a semi-supervised technique. The authors found that the method is excellent at capturing novel spamming trends when clusters and tweets are confidently labelled.
Limitations:
Spam detection at the tweet level is a fine-grained method for detecting spam tweets in real time. However, only so much information can be gleaned from a single tweet. In contrast, more discriminative traits can be obtained from a user’s account, historical tweets, and social graph. By the time a rogue user is discovered, however, they may have infected many additional users.
Future Scope:
Although tweet-level spam detection may work in tandem with user-level spam detection, the authors employed a simple strategy to cope with the latter due to the limited user information in their dataset. Nonetheless, user-level spam detection could be incorporated into S3D, which they plan to work on in the future.
Paper-2
Title:
Spam Detection in Twitter Stream (IJSDR Paper 2019)
Findings:
The experiments with this spam detection framework show that the proposed semi-supervised strategy is effective. The authors found that mining the data stream offers a simple approach to detecting spam tweets on Twitter, and that confidently labelled clusters and tweets make the system effective at capturing new spamming patterns. The Porter Stemmer algorithm is also employed, and the social graph can be represented for spam detection on Twitter.
Limitations:
Spam detection at the tweet level is a fine-grained method for detecting spam tweets in real time. However, a single tweet can only provide so much information, whereas a user’s account, tweet history, and social graph can provide more discriminative qualities. By the time a rogue user is found, however, they may have infected many other users.
Future Scope:
Spam detection at the tweet level could feasibly work in tandem with spam detection at the user level. Due to the limited user information in their dataset, the authors used a simple technique to deal with user-level spam detection. However, user-level spam detection could be included in S3D, which they plan to pursue in the future.
Paper-3
Title:
A Neural Network-Based Ensemble Approach for Spam Detection in Twitter (IEEE Xplore 2018)
Findings:
This paper presents a neural network-based ensemble strategy that combines deep learning and classic feature-based algorithms to detect spam at the tweet level. The authors used CNNs to explore multiple word embeddings. The suggested technique outperforms all existing methods on the 1KS10KN and HSpam data sets. Even the model trained with a small number of instances performed admirably when applied to many unseen tweets. The proposed strategy outperformed the baseline methods in all experiments.
Limitations:
Feature-based methods perform poorly compared to deep learning approaches on the HSpam14 data set. The inputs for the deep learning-based algorithms were simply tweets, with no further information. It would be fascinating to explore whether the deep learning methods’ performance could be enhanced further by considering other information about the tweets or their authors.
Future Scope:
A better strategy could be developed to represent the data’s characteristics. Since tweets with no further information were used as inputs for the deep learning-based systems, it would be interesting to see whether the effectiveness of deep learning algorithms could be improved even more by including additional information about the tweets or their writers.
Paper-4
Title:
Detecting spam accounts on Twitter (IEEE Xplore 2018)
Findings:
The main contributions of this paper are as follows. The authors created a collection of novel graph- and content-based features that effectively detect spam accounts on Twitter. They then distinguish spam users from legitimate users using seven machine learning algorithms: K-Nearest Neighbour (kNN), Decision Tree (DT), Naive Bayesian (NB), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGBoost). Finally, a feature ranking method (information gain) was used to rank the top 10 most influential detection features among all the features employed by state-of-the-art techniques.
Limitations:
The paper presents a new and more robust set of features to detect spammers on Twitter and uses seven distinct machine learning algorithms with both graph-based and tweet content-based information. Random Forest (RF) outperforms the other algorithms in the experiments, with an accuracy of 91 percent, a precision of 92 percent, and an F1 score of 91 percent. The performance comparison revealed that the proposed strategy is viable and capable of producing better outcomes than other state-of-the-art alternatives.
Future Scope:
A more effective model that can quickly classify various sorts of spammers within multiple social networks could be developed in the future. The authors also plan to improve the machine learning algorithms and apply their technique to other social networks.
Methods Used for Natural Language Processing
Preprocessing: All data needs cleaning, such as removing redundant variables (variables that do not contribute to the target variable). This phase also includes eradicating duplicates and erroneous values (e.g., ‘Python’, ‘python’, and ‘python3.6’ all belong to the same category).
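As a minimal illustration of this cleaning step (the DataFrame, the text column name, and the example tweets below are assumptions for illustration, not the article’s actual dataset), one could normalize case, strip URLs, mentions, and punctuation, and drop duplicates like this:
import re
import pandas as pd
def clean_tweet(text):
    """Lowercase a tweet and remove URLs, mentions, and punctuation."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"@\w+", " ", text)               # strip @mentions
    text = re.sub(r"[^a-z\s]", " ", text)           # keep letters only
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace
# Hypothetical example data; the real project reads its Kaggle CSV instead.
df = pd.DataFrame({"text": ["Win a FREE iPhone!! http://spam.example", "Lunch at noon?"]})
df = df.drop_duplicates(subset="text")              # eradicate duplicate tweets
df["clean_text"] = df["text"].apply(clean_tweet)
print(df["clean_text"].tolist())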
Then further preprocessing of data includes creating N-Gram models, stemming, and Vectorization.
N-Gram Models: Statistical language models, in their most basic form, assign a probability to sequences of words. The n-gram is the most basic model for assigning probabilities to sentences and word sequences.
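As a small illustrative sketch (the toy corpus below is made up), n-gram features can be extracted with scikit-learn’s CountVectorizer, where ngram_range=(1, 2) produces both unigrams and bigrams:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["free prize waiting for you", "see you at the meeting"]  # toy example tweets
# ngram_range=(1, 2) extracts both unigrams and bigrams as features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # requires scikit-learn >= 1.0
print(counts.toarray())                    # one count vector per document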
Stemming: Stemming is the process of reducing morphological variants of a word to their root/base form. Programs that perform stemming are known as stemming algorithms or stemmers. For example, “chocolates,” “chocolatey,” and “choco” are reduced to the root word “chocolate,” and “retrieval,” “retrieved,” and “retrieves” are reduced to the stem “retrieve.” Stemming is an integral part of the natural language processing pipeline; tokenized words are fed into the stemmer.
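A minimal stemming sketch using NLTK’s PorterStemmer (the stemmer mentioned in the literature above); the token list here is only an illustration:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
tokens = ["chocolates", "chocolatey", "retrieval", "retrieved", "retrieves"]
# Each tokenized word is fed into the stemmer and reduced to its stem,
# e.g. the "retrieval" variants all collapse to the stem "retriev".
stems = [stemmer.stem(token) for token in tokens]
print(stems)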
Vectorization: Vectorization is the traditional method of turning raw data (text) into vectors of real numbers, the format that machine learning models require. This approach has been used since the start of computing, has proven effective in various disciplines, and is widely applied in natural language processing. Vectorization is a feature extraction step in machine learning: the goal of translating text into numerical vectors is to extract identifiable features from the text for the model to learn.
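A short sketch of this step with scikit-learn’s TfidfVectorizer (the same vectorizer imported in the implementation section below); the example corpus is purely illustrative:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["win a free prize now", "are we still on for lunch"]  # toy tweets
vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix
# of real-valued TF-IDF weights that machine learning models can consume directly.
X = vectorizer.fit_transform(corpus)
print(X.shape)                  # (number_of_documents, number_of_terms)
print(vectorizer.vocabulary_)   # mapping from term to column index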
Exploratory Data Analysis: This phase divides the data into appropriate data frames based on a particular attribute of the data. This part includes showcasing ‘spam’ & ‘ham’ words in the form of word clouds.
Naive Bayes: Naive Bayes is a famous classification algorithm based on Bayes’ Theorem with an assumption of independence among predictors. In layman’s language, a Naive Bayes classifier assumes that a particular feature in a class is unrelated to the other features. Bayes’ Theorem provides a method of calculating the posterior probability P(c|x) by using P(c), P(x), and P(x|c).
The equation below expresses it:
P(c|x) = P(x|c) * P(c) / P(x)
Above,
● P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
● P(c) is the prior probability of class.
● P(x|c) is the likelihood which is the probability of the predictor given class.
● P(x) is the prior probability of the predictor.
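As a hand-worked illustration of the formula (the numbers below are made up, not taken from the article’s dataset): suppose 40% of tweets are spam, the word “free” appears in 50% of spam tweets and in 5% of ham tweets. The posterior probability that a tweet containing “free” is spam can then be computed directly:
# Illustrative numbers only (not from the real dataset).
p_spam = 0.40                 # P(c): prior probability of the spam class
p_free_given_spam = 0.50      # P(x|c): likelihood of the word 'free' given spam
p_free_given_ham = 0.05       # likelihood of 'free' given ham
# P(x): total probability of seeing the word 'free' in any tweet.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)
# Bayes' Theorem: P(c|x) = P(x|c) * P(c) / P(x)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # ~0.87, so 'free' strongly suggests spam here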
For Naïve Bayes, we have employed two variations:
▪ Multinomial Naïve Bayes
▪ Bernoulli Naïve Bayes
Support Vector Machines: The support vector machine is another simple algorithm that every machine learning practitioner should have in their arsenal. It is often preferred because it gives high accuracy even with low computational resources. SVMs can be used for both regression and classification tasks, but they are mainly used for classification.
The main goal of this algorithm is to find a hyperplane in N-dimensional space (where N is the number of features) that distinctly classifies the data points. Many possible hyperplanes could separate the two classes of data points. The algorithm’s objective is to find the plane with the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin distance provides reinforcement so that future data points can be classified more confidently.
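A minimal sketch of this idea on a tiny, made-up 2-D dataset (illustrative only): a linear SVC finds the separating hyperplane, and the margin width can be recovered from the learned weight vector as 2 / ||w||:
import numpy as np
from sklearn.svm import SVC
# Two toy, linearly separable classes (made-up points, not real tweet features).
X = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
w = clf.coef_[0]                       # normal vector of the separating hyperplane
margin = 2 / np.linalg.norm(w)         # distance between the two margin boundaries
print("weights:", w, "margin width:", round(margin, 3))
print(clf.predict([[2, 2], [6, 6]]))   # expected: [0 1]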
Error Measurement
In this section, we will discuss the performance parameters like accuracy score, recall score, etc. Error Measurement is the difference between the observed value and the true value of that variable.
True-Positive: Predicted is positive, and it’s true
True-Negative: Predicted is negative, and it’s true
False-Positive: Predicted is positive, but it’s false
False-Negative: Predicted is negative, but it’s false
Precision: Of all the instances we predicted as positive, how many are actually positive?
After cleaning, preprocessing, and organizing the data, we feed it to a model and naturally get output in the form of probabilities. But hold on: how will we assess the effectiveness of our model? Better evaluation leads to better performance, and better performance is what we want. This is where the confusion matrix takes a prominent position. The confusion matrix is a performance measurement for a classification problem whose output can be two or more classes. It is a table containing the four different combinations of predicted and actual values.
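A short sketch of how these four cells and the derived scores can be computed with scikit-learn (the label vectors below are invented for illustration):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = spam, 0 = ham (toy labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))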
Implementation
1. Importing Necessary Libraries:
We will import all the necessary libraries used in this project, including Pickle, Pandas, NumPy, SVC, etc.
import pickle as pk
import pandas as pd
# Numpy library to perform various complex mathematical calculations.
import numpy as np
import tweet_catch as tc
# Matplotlib library to draw various plots.
import matplotlib.pyplot as plt
import seaborn as sns
import ps_preprocess as pp
import warnings
# Library to draw the Word Cloud.
from wordcloud import WordCloud
# All Sklearn Related Libraries.
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.metrics import accuracy_score
2. Read & Pre-Process Dataset:
Firstly, we will read the dataset using the read_csv() function of the Pandas library. After that, we will showcase and visualize the dataset. Further, we will split the tweets into subsets for training and testing purposes: 80% of the data for training and the remaining 20% for testing. Finally, we will visualize the training and testing data separately.
The complete code for training and pre-processing of the dataset can be found here.
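The original code is linked rather than reproduced, but a minimal sketch of this step might look as follows (the file name "tweets.csv" and the column names "tweet" and "label" are assumptions, not the article’s actual schema):
import pandas as pd
from sklearn.model_selection import train_test_split
# Hypothetical file and column names; adjust to the actual Kaggle dataset.
data = pd.read_csv("tweets.csv")
print(data.head())                      # quick look at the dataset
X_train, X_test, y_train, y_test = train_test_split(
    data["tweet"], data["label"],
    test_size=0.2,                      # 80% training, 20% testing
    random_state=42, stratify=data["label"]
)
print(len(X_train), "training tweets,", len(X_test), "testing tweets")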
3. Training of Model: In this section, we will see the code used to train our models. We will train our dataset on Multinomial Naive Bayes, Bernoulli Naive Bayes, and Support Vector Machine classifiers. In the next section, we will also find which model gives the best accuracy and precision; a minimal training sketch is shown below.
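A hedged sketch of the training step (not the article’s full code): it assumes the X_train/X_test/y_train/y_test splits from the previous step, TF-IDF vectorizes the tweets, and fits the three classifiers:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)   # learn vocabulary on training tweets only
X_test_vec = vectorizer.transform(X_test)         # reuse the same vocabulary for testing
models = {
    "Multinomial NB": MultinomialNB(),
    "Bernoulli NB": BernoulliNB(),
    "SVM": SVC(kernel="linear"),
}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    print(name, "accuracy:", round(accuracy_score(y_test, preds), 4))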
Plotting the Word Clouds for Spam and Normal Messages: A word cloud is a graphical representation of the frequency of words used in a document. The more often a word appears in the document, the larger it is displayed in the word cloud. Word clouds, like topic modelling, are often used to visualize the results of text analysis.
The advantages of using word clouds for text analysis are that they can be generated quickly and easily and provide a good overview of the essential topics in a document. The disadvantages of word clouds are that they can be challenging to interpret, and they may not be able to capture all the nuances of a document.
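A minimal word-cloud sketch using the WordCloud library imported earlier; the spam_text string here is a stand-in for the concatenated spam tweets from the real dataset:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# Stand-in for " ".join(...) over the actual spam tweets.
spam_text = "free win winner prize claim urgent cash free offer click"
wc = WordCloud(width=600, height=400, background_color="white").generate(spam_text)
plt.imshow(wc, interpolation="bilinear")   # more frequent words are drawn larger
plt.axis("off")
plt.title("Spam word cloud")
plt.show()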
4. Plotting the Graphs: Now, we will plot the graphs of the Accuracy Score and Confusion Matrix for the MNB Classifier. We will use the famous matplotlib library to plot all the required graphs.
Below we will plot the graphs of the Training Accuracy Score and F1 Score comparison of all three algorithms.
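A short matplotlib sketch of such a comparison chart; the scores below are placeholders, not the article’s actual results:
import numpy as np
import matplotlib.pyplot as plt
models = ["Multinomial NB", "Bernoulli NB", "SVM"]
accuracy = [0.97, 0.95, 0.96]     # placeholder values, not real results
f1 = [0.91, 0.88, 0.90]           # placeholder values, not real results
x = np.arange(len(models))
width = 0.35
plt.bar(x - width / 2, accuracy, width, label="Accuracy")   # one bar group per model
plt.bar(x + width / 2, f1, width, label="F1 score")
plt.xticks(x, models)
plt.ylim(0, 1)
plt.ylabel("Score")
plt.legend()
plt.show()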
In predictive modelling, accuracy and F1 score are two of the most commonly used metrics. But what do they mean?
Accuracy is the proportion of correct predictions out of all the predictions made.
It’s a good metric, but it can be misleading.
F1 score is a measure of a classifier’s performance that considers both precision and recall. Precision is the number of correct positive results divided by the number of all predicted positive results, and recall is the number of correct positive results divided by the number of positive results that should have been returned.
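A tiny worked sketch of these definitions with made-up confusion-matrix counts, to make concrete how the F1 score combines precision and recall and why accuracy alone can mislead:
# Hypothetical counts (illustrative only): 100 actual spam, 900 actual ham tweets.
tp, fp, fn, tn = 80, 10, 20, 890
accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct predictions / all predictions
precision = tp / (tp + fp)                          # correct positives / all predicted positives
recall = tp / (tp + fn)                             # correct positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# Accuracy (~0.97) looks great mainly because ham dominates, while F1 (~0.84)
# reflects performance on the spam class specifically.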
Conclusion
Preprocessing the data effectively is essential to cover all the relevant sources of spam. As we observe in opinion mining and sentiment analysis, a classifier’s accuracy suffers when the algorithm is tricked by failing to identify the inbuilt parameters and features that social media sites provide.
Using N-gram models, stemming, and vectorization further increases the quality of the classifier built to separate malicious from non-malicious data. Among all the classifiers we built, Multinomial Naïve Bayes gave the best accuracy, as it makes use of multiple features, which is the case in any generic malicious dataset.
Key takeaways of this article:
1. Firstly, we discussed the basic understanding of Natural Language Processing, its importance, and its main advantages.
2. We have discussed several research papers on natural language processing for Text Analysis from various organizations. The articles we have discussed are on the topics of different spam detection methods on the Twitter dataset.
3. Then, we discussed the code, which includes loading and pre-processing the dataset, and then training several machine learning models. Finally, we have visualized various performance parameters like Accuracy Score, Confusion Matrix, etc.
This is all for today. Thanks for reading.