With the rise of internet penetration across the world, followed by the rapid growth of social media companies, users increasingly turn to social media platforms to interact and engage with like-minded individuals and to follow their favourite celebrities and influencers. This increased use of social media has brought a significant rise in cyberbullying. According to the youth activism non-profit organization DoSomething, about 37% of teenagers between the ages of 12 and 17 have been bullied online, and 23% of students say they have done something cruel or mean to someone else. Given this rise in cyberbullying cases, it is important to monitor and control such behavior to avoid greater harm to the minds of young people.
In this article, we will cover Topic Modeling, an unsupervised learning method, and Sentiment Classification, a supervised learning method, to identify topics and sentiments in a dataset. Real-world text data comes with a large number of unique tokens, which can be complex to comprehend, and labeling textual instances for supervised classification is difficult and costly compared with unsupervised learning methods. This article explores the importance of Topic Modeling for large text corpora over supervised learning methods, with a hands-on project implementation. So let's dive into the article.
This article was published as a part of the Data Science Blogathon.
Topic Modeling is an unsupervised machine learning approach used to extract frequently discussed topics from a text corpus. Unlike a supervised learning method, an unsupervised learning method does not have any labels associated with the documents in the training corpus. Each topic consists of a composition of words found in the documents, and the text corpus contains multiple topics that depend on the context of the text data.
In addition, we will learn various methods for Topic Modeling that are used in the industry. There are mainly two types of Topic Modeling techniques:
Let’s look at the Traditional Topic Modeling techniques and their applications in industries.
These Topic Modeling techniques are based on statistics and probabilistic models. They assume that each document contains a set of topics and that each topic is a distribution over words. In these techniques, models are trained using matrix factorization or statistical inference.
Among matrix factorization techniques, we have the Non-Negative Matrix Factorization (NNMF) model, while among statistical and probabilistic methods we have the Latent Dirichlet Allocation (LDA) modeling technique, which is widely used in topic modeling tasks.
Non-Negative Matrix Factorization aims to reduce a high-dimensional dataset into a lower-dimensional one composed of non-negative vectors. This helps capture the essential structure and variability of the dataset and identify a set of topics and themes that can explain the word frequencies in the document-term matrix.
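To make this concrete, here is a minimal sketch of NNMF-based topic extraction using scikit-learn. The toy corpus, topic count, and variable names here are illustrative assumptions, not part of this article's dataset:

```python
# A minimal sketch of NNMF-based topic extraction with scikit-learn.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# illustrative toy corpus (an assumption for this sketch)
docs = [
    "the referee stopped the football match",
    "the striker scored in the final match",
    "the court ruled on the new tax law",
    "lawyers argued the case before the court",
]

# build a document-term matrix and factorize it into two non-negative matrices
tfidf = TfidfVectorizer(stop_words='english')
dtm = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, random_state=42)
doc_topic = nmf.fit_transform(dtm)   # document-topic weights
topic_word = nmf.components_         # topic-word weights

# print the top words of each discovered topic
terms = tfidf.get_feature_names_out()
for idx, topic in enumerate(topic_word):
    top_words = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {idx}: {top_words}")
```

The non-negativity constraint is what makes the factors interpretable: each topic is an additive combination of words, and each document an additive combination of topics.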
In contrast, Latent Dirichlet Allocation aims to identify hidden topics in a large text corpus using a probabilistic generative model. It assumes each document is a mixture of topics and each topic is a distribution over words, and the algorithm infers the probability of each topic from the words in each document.
The coherence metric is used to measure how well topics are identified in a given text corpus. When we talk about 'coherence', we talk about how well the identified topics are supported by co-occurrence statistics from a reference corpus.
Topic coherence assesses how well a topic is supported by a text corpus or a reference text. It uses statistics and probability to compare the distributions of words and topics in the given corpus, assigns a coherence score to each topic, and finally aggregates the individual scores into a single coherence score for the model.
To understand topic coherence simply, without heavy math and statistics: the method takes the selected topics and a reference corpus as input. It segments each topic into pairs of words and calculates the probabilities of those words in the reference corpus. It then computes confirmation measures, which tell us how well each word pair is supported by the corpus. Finally, all confirmation measures are aggregated into a topic coherence score, which for the c_v measure lies in the range of 0 to 1; a score closer to 1 means better Topic Modeling performance.
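As a hedged illustration of the pairwise idea, the sketch below computes a simplified PMI-style confirmation measure for one topic over a toy reference corpus (the corpus and topic words are assumptions; gensim's c_v measure is more elaborate, using sliding windows and normalized PMI, which is what scales it to the 0 to 1 range):

```python
# A minimal, illustrative sketch of pairwise topic coherence (PMI-style).
import math
from itertools import combinations

# toy reference corpus and topic words (assumptions for illustration)
reference_corpus = [
    {"cat", "dog", "pet", "food"},
    {"dog", "park", "walk"},
    {"cat", "pet", "vet"},
]
topic_words = ["cat", "dog", "pet"]

def doc_prob(words):
    """Fraction of documents containing all of the given words."""
    hits = sum(1 for doc in reference_corpus if all(w in doc for w in words))
    return hits / len(reference_corpus)

# segment the topic into word pairs and compute a confirmation measure (PMI)
scores = []
for w1, w2 in combinations(topic_words, 2):
    p_joint = doc_prob([w1, w2])
    pmi = math.log((p_joint + 1e-12) / (doc_prob([w1]) * doc_prob([w2])))
    scores.append(pmi)

# aggregate the pairwise scores into a single score for the topic
print("topic coherence:", sum(scores) / len(scores))
```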
We will look at hands-on Topic Modeling, including the implementation of the coherence metric, with code examples in the implementation section.
Neural Topic Modeling uses neural networks to capture complex relationships between words in the text corpus. Unlike Traditional Topic Modeling techniques, it does not rely on word frequencies or TF-IDF to identify the most frequently occurring words, or topics in our case. Neural network-based techniques can capture the context of the text corpus, which is not possible with Traditional Topic Modeling methods.
Two widely used Neural Topic Modeling techniques are Top2Vec and BERTopic; a short sketch of the latter follows below.
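As a hedged illustration, the sketch below shows how a neural technique such as BERTopic can be used. It assumes `pip install bertopic` and borrows the public 20 newsgroups sample corpus; it is a sketch, not part of this article's pipeline:

```python
# A hedged sketch of neural Topic Modeling with BERTopic.
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# a sample corpus; any list of raw text documents works here
docs = fetch_20newsgroups(subset='train',
                          remove=('headers', 'footers', 'quotes')).data[:1000]

# BERTopic embeds documents with a transformer, clusters the embeddings,
# and extracts topic words per cluster, capturing context beyond raw counts
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```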
Now that we have learned what topic modeling is, let's look at some of its applications in industry.
Sentiment Classification is a Natural Language Processing (NLP) technique used to classify text data according to the sentiment expressed in the text, such as positive, negative, or neutral. In the context of cyberbullying, Sentiment Classification can be used to identify whether the sentiment of a text is indicative of bullying behavior. We want to classify each tweet as positive, or as negative and indicative of bullying behavior. We will look at code examples to achieve this in a later section.
Sentiment analysis has many applications in the industry. Let’s look at some of them:
Depending upon each application, text samples need to be classified as very positive, positive, neutral, negative, or very negative.
One of the major differences between topic modeling and sentiment classification is their learning method itself. Topic Modeling is an unsupervised learning technique while Sentiment Classification is a supervised learning technique. Let’s look at some other differences:
| Topic Modeling | Sentiment Classification |
|---|---|
| There is no need to label large text documents | One has to label large numbers of text samples |
| It can identify complex word similarities within one document | It is not possible to identify similarities within a single document |
| It has a lower cost of modeling and inference due to its ease and flexibility | It has a higher cost of modeling due to manual labeling of text samples |
While Sentiment Analysis is a popular approach widely used in industry, it has drawbacks that cannot be avoided. The cost of labeling each text document rises significantly, which may not be a viable option. In a large text corpus, each document may touch on different topics, which is impractical to label in a supervised learning approach. Topic Modeling can identify and capture such relationships within documents and cluster the topics accordingly.
In this section, we will look at the implementation of Topic Modeling using the Gensim library in Python. We will also compare Topic Modeling with the Sentiment Classification technique.
First, we will load the dataset of cyberbullying tweets. The dataset is annotated with 'none', 'racism', and 'sexism' categories, and labels are assigned as '0' for non-bullying tweets and '1' for bullying tweets.
Let’s read the dataset and perform a topic modeling pipeline on textual data with its interpretation using LDAviz. We will also measure performance using the coherence metric to find an optimal number of topics in the dataset.
Python Code:
import pandas as pd
import gensim
import nltk
import pyLDAvis.gensim  # in newer pyLDAvis versions this module is pyLDAvis.gensim_models
from gensim.models.coherencemodel import CoherenceModel
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# download the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')
# read the dataset using the read_csv method
df = pd.read_csv("twitter_parsed_dataset.csv")
print(df)

# write down the following helper functions in your python environment
# stop word removal function
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    sample = text.lower()
    sample = [word for word in sample.split() if word not in stop_words]
    return ' '.join(sample)
# lemmatization function
lemmatizer = WordNetLemmatizer()

def lemma_clean_text(text):
    sample = [lemmatizer.lemmatize(word.lower()) for word in text.split()]
    return ' '.join(sample)
As we can see in the dataset output, the text column is a series of tweets with annotations and labels. There are more than 16,000 rows in the dataset, so labeling each tweet would be a costly task that increases the overall cost of a data science project. Topic Modeling does not require such labels, so it saves the company or client the cost of identifying the most prevalent topics in the dataset.
Let’s implement the Topic Modeling pipeline in the next step:
# define a pre-processing function to model topics based on annotation
def preprocess_topic(df, topic):
    """ Pre-processing function to model text data based on a given topic.
    args:
        df = input dataframe
        topic = input topic: "none", "sexism", or "racism"
    returns:
        corpus of tokenized documents under the given topic
    """
    corpus = []
    # keep only the documents annotated with the requested topic
    for doc in df[df['Annotation'] == topic]['cleaned_text']:
        no_stopwords = remove_stopwords(doc)
        lemmatized = lemma_clean_text(no_stopwords)
        corpus.append(lemmatized.split())
    return corpus
The above code takes a user-chosen annotation from our dataset and builds a corpus from that subset of the tweets, on which we perform Topic Modeling.
(Note: the above code uses the pre-processed 'cleaned_text' column from the original data frame. I have linked the code repository, with more details, at the end of this article.)
# corpus of the words
corpus = preprocess_topic(ndf, 'sexism')

# create a BOW model from the corpus
dic = gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]

# create the LDA model using the gensim library
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=4,
                                       id2word=dic,
                                       passes=10,
                                       workers=2)
lda_model.show_topics()
The above code stores the text corpus in the 'corpus' object and creates a dictionary of the text corpus. In the next step, we call the 'LdaMulticore' class of the 'gensim.models' module to model the text data and generate 4 topics from the training dataset. Finally, we call 'lda_model.show_topics()' to inspect the 4 topics.
As output of the training pipeline, the model generates a list of tuples containing the 4 most prevalent topics in the text corpus along with their word distributions.
# visualizing the topics
def plot_lda_vis(lda_model, bow_corpus, dic):
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dic)
    return vis

plot_lda_vis(lda_model, bow_corpus, dic)
In the above visualization, each topic is shown on an intertopic distance map which explains how far each topic is from the others. On the right side, a bar chart of word frequency is shown with the most salient terms occurring in the text corpus.
Using ‘pyLDAvis’ we can visualize the distribution of the topics and words in the text corpus to make it more interpretable for the stakeholders.
# assessing the coherence metric of the model
from gensim.models.coherencemodel import CoherenceModel

topics = [['prophet', 'slavery', 'violence', 'fear'],
          ['people', 'religion', 'slave', 'hate', 'like'],
          ['like', 'murder', 'people', 'prophet'],
          ['war', 'humanity', 'religion', 'salon', 'world']]

# Coherence model
cm = CoherenceModel(topics=topics,
                    texts=corpus,
                    coherence='c_v',
                    dictionary=dic)

coherence_per_topic = cm.get_coherence_per_topic()
coherence_per_topic
--------------------------------[output]--------------------------------------
[0.24646713695437958,
0.17976752238536964,
0.32051023235616505,
0.33402730347565524]
To calculate the coherence metric of our topic model, we use the 'CoherenceModel' class of the 'gensim.models.coherencemodel' module. By setting the parameters as shown above, we get the coherence score of each topic in our corpus. Under the hood, the class implements the coherence metric pipeline we saw in the earlier section.
Now let’s visualize the coherence score of each topic using the seaborn library.
# plotting the coherence score
import matplotlib.pyplot as plt
import seaborn as sns

topics_str = ['\n '.join(t) for t in topics]
data_topic_score = pd.DataFrame(data=zip(topics_str, coherence_per_topic),
                                columns=['Topic', 'Coherence'])
data_topic_score = data_topic_score.set_index('Topic')

# plotting using a seaborn heatmap
fig, ax = plt.subplots(figsize=(2, 6))
ax.set_title("Topics coherence\n $C_v$")
sns.heatmap(data=data_topic_score, annot=True, square=True,
            cmap='Reds', fmt='.2f',
            linecolor='black', ax=ax)
plt.yticks(rotation=0)
ax.set_xlabel('')
ax.set_ylabel('')
fig.show()
In the above example, topic coherence is still low. To improve model performance, one can train topic models with different numbers of topics and find the optimal number for the dataset, as the sketch below shows.
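A minimal sketch of such a search could look like the following. It reuses `bow_corpus`, `dic`, and `corpus` from the pipeline above; the candidate range of topic counts is an assumption:

```python
# A hedged sketch: sweep over candidate topic counts and pick the one
# with the highest average c_v coherence.
coherence_scores = {}
for k in range(2, 9):
    model_k = gensim.models.LdaMulticore(bow_corpus,
                                         num_topics=k,
                                         id2word=dic,
                                         passes=10,
                                         workers=2)
    cm_k = CoherenceModel(model=model_k,
                          texts=corpus,
                          dictionary=dic,
                          coherence='c_v')
    coherence_scores[k] = cm_k.get_coherence()

best_k = max(coherence_scores, key=coherence_scores.get)
print("coherence per k:", coherence_scores)
print("optimal number of topics:", best_k)
```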
To detect cyberbullying in a text corpus of tweets, Sentiment Classification can be used to classify each tweet as either containing or not containing bullying behavior. This can be achieved by training supervised learning algorithms such as Multinomial Naive Bayes or Support Vector Machines. We will implement the Naive Bayes algorithm to classify the sentiment of each tweet.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import log_loss

X = ndf['correct_text']
y = ndf['oh_label']

# train and test split the dataset
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=42)

# tfidf object
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
                        max_features=5000)

# vectorization using tfidf
X_trn_vect = tfidf.fit_transform(X_trn)
X_tst_vect = tfidf.transform(X_tst)

# converting the sparse matrices into pandas dataframes
# (use get_feature_names() on scikit-learn versions older than 1.0)
x_t1 = pd.DataFrame(X_trn_vect.toarray(), columns=tfidf.get_feature_names_out())
x_t2 = pd.DataFrame(X_tst_vect.toarray(), columns=tfidf.get_feature_names_out())

# applying the MultinomialNB algorithm
clf = MultinomialNB()
clf.fit(x_t1, y_trn)
pred = clf.predict(x_t2)

# LOG LOSS of the model
print("logloss: %0.3f " % log_loss(y_tst.values, pred))
-------------------------------[Output]-------------------------------------
logloss: 8.241
The code performs Sentiment Classification using the Multinomial Naive Bayes algorithm on a dataset consisting of two columns: one containing the text data (X) and the other containing the corresponding labels (y).
The dataset is then split into train and test sets, followed by TF-IDF vectorization using the sklearn library. Finally, we train a Multinomial Naive Bayes classifier and evaluate it on the test set.
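One could also extend the evaluation beyond log loss. The hedged sketch below reuses `clf`, `x_t2`, and `y_tst` from above; the metric choices are assumptions, not part of the original pipeline:

```python
# A hedged sketch of additional evaluation for the classifier above.
from sklearn.metrics import accuracy_score, classification_report, log_loss

pred = clf.predict(x_t2)
print("accuracy: %0.3f" % accuracy_score(y_tst, pred))

# log_loss is better computed on predicted probabilities than on hard labels
proba = clf.predict_proba(x_t2)
print("logloss (probabilities): %0.3f" % log_loss(y_tst, proba))

# per-class precision/recall helps on imbalanced bullying vs. non-bullying data
print(classification_report(y_tst, pred))
```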
Github code repository: Sentiment classification
Topic Modeling and Sentiment Analysis are two crucial Natural Language Processing techniques used to analyze large text corpora. While both are used to extract insights from text data, they differ in their approach and goals.
Sentiment Classification is a technique used to classify the sentiment expressed in a piece of text as positive, neutral, or negative. This is achieved using supervised learning algorithms, such as Naive Bayes.
Topic Modeling, on the other hand, is a technique used to identify the underlying topics in a large corpus of text. This is achieved using unsupervised learning algorithms, such as Latent Dirichlet Allocation (LDA). Topic Modeling is useful for applications such as content analysis, trend analysis, and document clustering. Let's look at the key takeaways from this article.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.