Detect Cyberbullying Using Topic Modeling and Sentiment Analysis

Avikumar talaviya Last Updated : 12 Oct, 2024

11 min read

Introduction

With internet penetration rising across the world and social media companies growing rapidly, users increasingly rely on social media platforms to interact with like-minded individuals and to follow their favourite celebrities and influencers. With the increased use of social media, there has been a significant rise in cyberbullying cases as well. According to the youth activism non-profit organization DoSomething, about 37% of teenagers between the ages of 12 and 17 have been bullied online, and 23% of students have said that they have done something cruel or mean to someone else. Given this rise, it is important to monitor and control such cases to avoid greater harm to the minds of young people.

In this article, we cover an unsupervised learning method, Topic Modeling, and a supervised learning method, Sentiment Classification, and apply both to a dataset of tweets. Real-world text data comes with a large number of unique tokens, which makes it hard to comprehend. Labeling textual instances for supervised classification is difficult and costly, whereas unsupervised learning methods need no labels. This article explores the advantages of Topic Modeling over supervised learning methods for large text corpora, with a hands-on project implementation. So let’s dive in.

This article was published as a part of the Data Science Blogathon.

What is Topic Modeling?
Types of Topic Modeling Techniques
Applications of Topic Modeling
What is Sentiment Classification — a Supervised Classification?
Applications of the Sentiment Classification
Differences between Topic Modeling and Supervised Sentiment Classification
Hands-on Project Implementation Using Python
Conclusion

What is Topic Modeling?

Topic Modeling is an unsupervised machine learning approach to extract frequently discussed topics from a text corpus. Unlike a supervised learning method, an unsupervised learning method does not need a label for each document in the training corpus. Each topic is a composition of words that occur in the documents. The corpus of documents, or text corpus, contains multiple topics that depend on the context of the text data.

We will also look at the Topic Modeling methods that are used in the industry. There are mainly two types of Topic Modeling techniques:

Traditional Topic Modeling
Neural Topic Modeling

Let’s look at the Traditional Topic Modeling techniques and their applications in industries.

1. Traditional Topic Modeling

These types of Topic Modeling techniques are based on statistics and probabilistic models. These techniques assume that each document contains a set of topics and each topic is distributed over words. In these techniques, models are trained using Matrix Factorization techniques or statistical inference.

Among Matrix Factorization techniques, we have the Non-Negative Matrix Factorization (NNMF) model. Among statistical and probabilistic methods, we have the Latent Dirichlet Allocation (LDA) technique, which is widely used in topic modeling tasks.

Non-Negative Matrix Factorization aims to reduce a high-dimensional dataset into a lower-dimensional dataset composed of non-negative vectors. This helps capture the essential structure and variability of the dataset and identify a set of topics and themes that can explain the word frequencies in the document-term matrix.

In contrast, Latent Dirichlet Allocation aims to identify hidden topics in a large text corpus using a probabilistic generative model. It assumes that each document is a mixture of topics and that each topic is a probability distribution over words. The algorithm then infers, from the words observed in each document, how probable each topic is for that document.

Using the coherence metric to measure the performance of the LDA model

The coherence metric is used to measure how well topics are identified in a given text corpus. When we talk about ‘Coherence’, we talk about how well the identified topics agree with the reference corpus.

Topic coherence assesses how well a topic is supported by a text corpus or a reference text. It uses statistics and probability to compare the distribution of words and topics of a given corpus. It then assigns a coherence score to each topic. Finally, it aggregates all the individual scores to give a single coherence score to the model.

The intuition behind the topic coherence metric

To understand topic coherence without heavy math and statistics: the method takes the selected topics and a reference corpus as input. It then segments the top words of each topic into pairs and calculates the probabilities of those words in the text corpus. Next, it calculates confirmation measures that tell us how strongly the words in each pair support one another in the text corpus. Finally, all the confirmation measures are aggregated (typically averaged) into a single topic coherence score. For the C_v measure used later in this article, the score usually falls in the range of 0 to 1, and a score closer to 1 means better performance in Topic Modeling.
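The segment, estimate, confirm, and aggregate steps can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not the C_v measure that Gensim implements: it uses document-level co-occurrence and normalized pointwise mutual information (NPMI) as the confirmation measure, so its scores range from -1 to 1. The function name and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def simple_topic_coherence(topic_words, documents):
    """Average NPMI over all word pairs of a topic (simplified sketch)."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    # Segmentation: every pair of words in the topic.
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # the words never co-occur
        elif p12 == 1:
            scores.append(1.0)    # the words co-occur in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))  # confirmation measure
    # Aggregation: mean of the pairwise confirmation scores.
    return sum(scores) / len(scores)

toy_documents = [
    ["bully", "hate", "online"],
    ["bully", "hate", "school"],
    ["goal", "team", "match"],
    ["team", "match", "win"],
]

coherent_score = simple_topic_coherence(["bully", "hate"], toy_documents)
mixed_score = simple_topic_coherence(["bully", "team"], toy_documents)
print(coherent_score, mixed_score)
```

A topic whose words tend to appear in the same documents scores higher than a topic that mixes words from unrelated documents. That is the intuition the coherence metric formalizes.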

We will look at hands-on Topic Modeling with code examples, including the implementation of the coherence metric, in the project implementation section.

2. Neural Topic Modeling

Neural Topic Modeling uses neural networks to capture complex relationships between words in the text corpus. Unlike Traditional Topic Modeling techniques, it does not rely only on word frequencies or TF-IDF weights to identify topics. Neural network-based Topic Modeling techniques can capture the context of the text corpus, which Traditional Topic Modeling methods largely ignore.

Types of Topic Modeling Techniques

Two commonly used Neural Topic Modeling techniques are:

Contextualized Topic Modeling — It incorporates contextualized embeddings of the text corpus, which reflect words and the words around them, to better capture topics.
BERTopic — It uses embeddings from a pre-trained BERT-style model to represent the text, groups similar documents together, and extracts topics from large collections of documents.

Applications of Topic Modeling

Now that we have learned what topic modeling is, let’s look at some of its applications in industry.

Marketing — Topic Modeling can be used to analyze customer reviews and feedback to discover the sentiments of the customers along with identifying new trends in the text corpus.
Healthcare — In the healthcare sector Topic Modeling can be used to analyze medical records, identify patterns, and extract relevant information.
Legal — Topic Modeling can be used to analyze legal documents, identify key issues, and extract relevant information.

What is Sentiment Classification — a Supervised Classification?

Sentiment Classification is a Natural Language Processing (NLP) technique used to classify text data according to the sentiment expressed in the text, such as positive, negative, or neutral. In the context of cyberbullying, Sentiment Classification can be used to identify whether the sentiment of a text is indicative of bullying behavior. We want to classify each tweet as either a positive tweet or a negative tweet that indicates bullying behavior. We will look at code examples to achieve this in a later section.

Applications of the Sentiment Classification

Sentiment analysis has many applications in the industry. Let’s look at some of them:

Social Media Monitoring – Companies use social media to engage with customers and maintain their online presence. It is important to monitor customer engagement and conversation on social media platforms to measure how well companies’ products and services are received.
Customer Support Ticket Analysis – Companies have online customer support ticket systems to manage queries and concerns. By analyzing support ticket conversations, companies can monitor feedback from customers and measure their sentiment.
Brand Monitoring and Management – Sentiment Analysis techniques can be used to monitor brands across social media platforms and other online channels.

Depending upon each application, text samples need to be classified as very positive, positive, neutral, negative, or very negative.

Differences Between Topic Modeling and Supervised Sentiment Classification

One of the major differences between topic modeling and sentiment classification is their learning method itself. Topic Modeling is an unsupervised learning technique while Sentiment Classification is a supervised learning technique. Let’s look at some other differences:

Topic Modeling	Sentiment Classification
There is no need to label large text documents	One has to label a large number of samples
It can identify complex word similarities within one document	It is not possible to identify similarities within one single document
It has a lower cost of modeling and inference due to its ease and flexibility	It has a higher cost of modeling due to manual labeling of text samples


While Sentiment Analysis is a popular approach used widely in industry, it has drawbacks that cannot be avoided. The cost of labeling each text document grows significantly with the size of the corpus, which may make the approach unviable. In a large text corpus, each document may also cover several topics, which is impractical to capture with a single label in a supervised learning approach. Topic Modeling can identify and capture such relationships within the documents and cluster the topics accordingly.

Hands-on Project Implementation Using Python

In this section, we will look at the implementation of Topic Modeling using the Gensim library of Python. We will also compare Topic Modeling with the Sentiment Classification technique.

Topic Modeling Using the ‘Gensim’ Library

First, we will load the dataset of cyberbullying tweets data. The dataset is annotated as ‘none’, ‘racism’, and ‘sexism’ categories. Labels are assigned as a ‘0’ for the non-bullying tweets and a ‘1’ for bullying tweets in the dataset.

Let’s read the dataset and run a topic modeling pipeline on the textual data, interpreting the result with pyLDAvis. We will also measure performance using the coherence metric to find an optimal number of topics in the dataset.

Python Code:

import pandas as pd
import gensim
import nltk
import pyLDAvis.gensim  # in pyLDAvis 3.x and later this module is named pyLDAvis.gensim_models
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# download the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer

# read the dataset using the read_csv method
df = pd.read_csv("twitter_parsed_dataset.csv")
print(df)

# helper functions used by the topic modeling pipeline
# stop word removal function
stopwrds = set(stopwords.words('english'))
def remove_stowords(text, cores=2):
    # lowercase the text and drop English stop words ('cores' is unused)
    sample = text.lower()
    sample = [word for word in sample.split() if word not in stopwrds]
    sample = ' '.join(sample)

    return sample

# lemmatization function
lemmatizer = WordNetLemmatizer()
def lemma_clean_text(text, cores=1):
    # lowercase and lemmatize each whitespace-separated token ('cores' is unused)
    sample = text.split()
    sample = [lemmatizer.lemmatize(word.lower()) for word in sample]
    sample = ' '.join(sample)

    return sample

As we can see in the dataset output, the text column is a series of tweets with annotations and labels. There are more than 16000 rows in the dataset, so labeling each tweet would have been a costly task, and that cost has to be taken into account in a data science project. Topic Modeling does not require such labels, so it saves the company or client that cost when identifying the most prevalent topics in the dataset.

Let’s implement the Topic Modeling pipeline in the next step:

# define pre-processing function to model topics based on annotation
def preprocess_topic(df, topic):
    """ Preprocessing function to model text data based on given topics.
    
    args:
    df = input dataframe with 'Annotation' and 'cleaned_text' columns
    topic = input topic "none", "sexism", or "racism"
    
    returns:
    corpus of words under given topic (a list of token lists)
    """
    corpus = []
    # keep only the tweets annotated with the requested topic;
    # an unknown topic matches no rows and yields an empty corpus
    for doc in df[df['Annotation'] == topic]['cleaned_text']:
        stop_word_removal = remove_stowords(doc)
        lemmatized_sample = lemma_clean_text(stop_word_removal)
        words = lemmatized_sample.split()
        corpus.append(words)

    return corpus

The above code takes one of the annotations in our dataset as input and builds the corpus for Topic Modeling from that subset of the text.

(Note: The above code reads already pre-processed text from the ‘cleaned_text’ column of the data frame. I have linked the code repository at the end of this article with more details.)

# corpus of the words
corpus = preprocess_topic(ndf, 'sexism')

# create BOW model from corpus
dic=gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]

# create LDA model using gensim library
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                   num_topics = 4,
                                   id2word = dic,
                                   passes = 10,
                                   workers = 2)

lda_model.show_topics()

The above code stores the text corpus in the ‘corpus’ object and creates the dictionary of the text corpus. In the next step, we instantiate the ‘LdaMulticore’ class of the ‘gensim.models’ module to model the text data and generate 4 topics from the training dataset. Finally, we call ‘lda_model.show_topics()’ to see the 4 topics.

As the output of the training pipeline, the model generates a list of tuples containing the 4 topics found in the text corpus along with their word distributions.

Visual Interpretation of Topic Modeling Output

# visualizing the topics
# (with pyLDAvis 3.x and later, use pyLDAvis.gensim_models instead of pyLDAvis.gensim)
def plot_lda_vis(lda_model, bow_corpus, dic):
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dic)
    return vis

plot_lda_vis(lda_model, bow_corpus, dic)

In the above visualization, each topic is shown on an intertopic distance map which explains how far each topic is from the others. On the right side, a bar chart of word frequency is shown with the most salient terms occurring in the text corpus.

Using ‘pyLDAvis’ we can visualize the distribution of the topics and words in the text corpus to make it more interpretable for the stakeholders.

Calculating the Coherence Metric of the Model

# assessing coherence metric of the model
from gensim.models.coherencemodel import CoherenceModel

topics = [['prophet', 'slavery', 'violence', 'fear'],
           ['people', 'religion', 'slave', 'hate', 'like'],
           ['like', 'murder', 'people', 'prophet'],
           ['war', 'humanity', 'religion', 'salon', 'world']]

# Coherence model
cm = CoherenceModel(topics=topics, 
                    texts=corpus,
                    coherence='c_v',  
                    dictionary=dic)

coherence_per_topic = cm.get_coherence_per_topic()
coherence_per_topic

--------------------------------[output]--------------------------------------
[0.24646713695437958,
 0.17976752238536964,
 0.32051023235616505,
 0.33402730347565524]

To calculate the coherence metric of our topic model, we can use the ‘CoherenceModel’ class of the ‘gensim.models.coherencemodel’ module. With the parameters set as shown above, we get the coherence score of each topic in our corpus. Under the hood, the class implements the coherence pipeline that we saw in the earlier section.

Now let’s visualize the coherence score of each topic using the seaborn library.

import matplotlib.pyplot as plt
import seaborn as sns

# plotting coherence score
topics_str = [ '\n '.join(t) for t in topics ]
data_topic_score = pd.DataFrame( data=zip(topics_str, coherence_per_topic), 
                                columns=['Topic', 'Coherence'] )
data_topic_score = data_topic_score.set_index('Topic')

# plotting a seaborn heatmap on a matplotlib axis
fig, ax = plt.subplots( figsize=(2,6) )
ax.set_title("Topics coherence\n $C_v$")
sns.heatmap(data=data_topic_score, annot=True, square=True,
            cmap='Reds', fmt='.2f',
            linecolor='black', ax=ax )
plt.yticks( rotation=0 )
ax.set_xlabel('')
ax.set_ylabel('')
plt.show()

In the above example, topic coherence is still low. So to improve the model performance one can try a different number of topics to train the topic model and find the optimal number of topics in the dataset.

Sentiment Classification using TF-IDF vectorization

To detect cyberbullying in a text corpus of tweets, Sentiment Classification can be used to classify each tweet as either containing or not containing bullying behavior. This can be achieved by training supervised learning algorithms like Multinomial Naive Bayes or Support Vector Machines. We will implement the Naive Bayes algorithm to classify the sentiment of each tweet.

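from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X = ndf['correct_text']
y = ndf['oh_label']

# train and test split the dataset
X_trn, X_tst, y_trn, y_tst = train_test_split(X,y, random_state=42)

# tfidf object
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,2), 
                        max_features=5000)

# vectorization using tf-idf
X_trn_vect = tfidf.fit_transform(X_trn)
X_tst_vect = tfidf.transform(X_tst)

# converting sparse matrices into pandas dataframes
# (scikit-learn versions before 1.0 use get_feature_names() instead)
x_t1 = pd.DataFrame(X_trn_vect.toarray(),columns=tfidf.get_feature_names_out())
x_t2 = pd.DataFrame(X_tst_vect.toarray(),columns=tfidf.get_feature_names_out())

# applying MultinomialNB algorithms
clf = MultinomialNB()
clf.fit(x_t1, y_trn)
pred = clf.predict(x_t2)

# LOG LOSS of the model
# note: the loss is computed here on hard 0/1 predictions, which inflates the value;
# passing clf.predict_proba(x_t2) instead scores the predicted probabilities
print("logloss: %0.3f " % log_loss(y_tst.values, pred))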

-------------------------------[Output]-------------------------------------
logloss: 8.241

The code performs Sentiment Classification using the Multinomial Naive Bayes algorithm on a dataset consisting of two columns: the first contains the text data (X) and the second contains the corresponding labels (y).

The dataset is split into train and test data, followed by TF-IDF vectorization using the sklearn library. Then we use the Multinomial Naive Bayes classifier to build a classification model and evaluate it on the test data.
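Log loss is normally computed from predicted probabilities rather than from hard class labels. The sketch below shows that variant on a handful of invented sentences; the texts, labels, and variable names are hypothetical and stand in for the tweet dataset, and a scikit-learn `Pipeline` is used so that vectorization and classification are fitted together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = bullying, 0 = not bullying.
train_texts = [
    "you are stupid and ugly",
    "nobody likes you loser",
    "shut up you idiot",
    "you are such a stupid loser",
    "great game today everyone",
    "loved the new movie",
    "happy birthday have a great day",
    "thanks for the help today",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

test_texts = ["you are a stupid idiot", "have a great game"]
test_labels = [1, 0]

# TF-IDF vectorization followed by Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Score the predicted class probabilities instead of hard labels.
proba = model.predict_proba(test_texts)
loss = log_loss(test_labels, proba, labels=[0, 1])
print("logloss: %0.3f" % loss)
```

Because the classifier is penalized in proportion to its confidence, a log loss computed this way is usually far smaller than one computed from 0/1 predictions, and it is more informative when comparing models.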

Github code repository: Sentiment classificatio

Conclusion

Topic Modeling and Sentiment Analysis are two crucial Natural Language Processing techniques for analyzing large text corpora. Both are used to extract insights from text data, but they differ in their approach and goals.

Sentiment Classification is a technique used to classify the sentiment expressed in a piece of text as positive, neutral, or negative. This is achieved using supervised learning algorithms, such as Naive Bayes.

Topic Modeling, on the other hand, is a technique used to identify the underlying topics in a large corpus of text. This is achieved using unsupervised learning algorithms, such as Latent Dirichlet Allocation (LDA). Topic Modeling is useful for applications such as content analysis, trend analysis, and document clustering. Let’s look at the key takeaways from this article.

Topic Modeling is an unsupervised learning technique for identifying patterns and relationships within the data.
Sentiment Analysis is limited to identifying sentiment polarity, whereas Topic Modeling can identify complex themes and subtopics within the data. This makes Topic Modeling preferable for the analysis of large text corpora.
We learned about the coherence metric to measure the performance of the model.
We also implemented a Topic Modeling pipeline using the Gensim library of Python.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Avikumar talaviya

I specialize in data science and machine learning with hands-on experience in working on various end-to-end data science projects. I am the chapter co-lead of the Mumbai local chapter of Omdena. I am also a kaggle master and educator ambassador at streamlit with volunteers around the world.
