Topic Modeling with ML Techniques

Anuradha Mahato Last Updated : 15 Jun, 2023
10 min read

Introduction

Topic modeling is a method for identifying the themes that exist in large sets of text data. It is an unsupervised learning technique in which the model tries to discover underlying topics without ground-truth labels. It is helpful in a wide range of industries, including healthcare, finance, and marketing, where there is a lot of text-based data to analyze. Using topic modeling, organizations can quickly gain valuable insights from the topics that matter most to their business, helping them make better decisions and improve their products and services.

This article was published as a part of the Data Science Blogathon.

Project Description

Topic modeling is valuable for numerous industries, including but not limited to finance, healthcare, and marketing. It is especially useful for industries that deal with huge amounts of unstructured text data, such as customer reviews, social media posts, or medical records, since it can drastically reduce the time and labor required to analyze that data manually.

For example, in the healthcare industry, topic modeling can identify common themes or patterns in patient records that can help improve patient outcomes, identify risk factors, and guide clinical decision-making. In finance, topic modeling can analyze news articles, financial reports, and other text data to identify trends, market sentiment, and potential investment opportunities.

In the marketing industry, topic modeling can analyze customer feedback, social media posts, and other text data to identify customer needs and preferences and to develop targeted marketing campaigns. This can help companies improve customer satisfaction, increase sales, and gain a competitive edge.

In general, topic modeling can help to gain insights from large amounts of text data quickly and efficiently. By identifying key topics or themes, organizations can make informed decisions, improve their products and services, and gain a competitive advantage in their respective industries.

Problem Statement

The aim is to perform topic modeling on the "A Million Headlines" news dataset, a collection of over one million news article headlines published by the ABC.

Using LDA, this project aims to identify the main topics and themes covered in the news headlines dataset. LDA is a probabilistic generative model that assumes each document is a mixture of several topics. The technique has its advantages as well as disadvantages, and the project explores how well it is suited to analyzing the news headlines dataset.

By identifying the main themes in the news headlines dataset, the project aims to provide insights into the types of news stories the ABC covers. Journalists, editors, and media organizations can use this information to better understand their audience and to tailor their news coverage to the needs and interests of their readers.

Dataset Description

The dataset contains a large collection of news headlines published over a period of nineteen years, between February 19, 2003, and December 31, 2021. The data is sourced from the Australian Broadcasting Corporation (ABC), a reputable news organization in Australia. The dataset is provided in CSV format and contains two columns: “publish_date” and “headline_text“.

The “publish_date” column provides the date when the news article was published, in the YYYYMMDD format. The “headline_text” column contains the text of the headline, written in ASCII, English, and lowercase.

Project Plan

The project steps for applying topic modeling to the news headlines dataset are as follows:

1. Exploratory Data Analysis: The first step is analyzing the data to understand the distribution of headlines over time, the frequency of different words and phrases, and other patterns in the data. You can also visualize the data using charts and graphs to gain further insights.

2. Data Pre-processing: The next step is cleaning and preprocessing the text to remove stop words, punctuation, etc. It also involves tokenization, stemming, and lemmatization to standardize the text data and make it suitable for analysis (see the short sketch after this list).

3. Topic Modeling: The core of the project is applying a technique such as LDA to identify the main topics and themes in the news headlines dataset. This requires selecting appropriate parameters for the topic modeling algorithm, for example the number of topics, the size of the vocabulary, and the similarity measure.

4. Topic Interpretation: After identifying the main topics, the next step is interpreting the topics and assigning human-readable labels to them. It includes analyzing the top words and phrases associated with each topic and identifying the main themes and trends.

5. Evaluation: The final step involves evaluating the performance of the topic modeling algorithm using metrics such as coherence score and perplexity, identifying the limitations and challenges of the approach, and proposing possible solutions.
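As a quick illustration of step 2, the short sketch below (a minimal example on a made-up headline, separate from the project pipeline, which later uses gensim and spaCy) shows how tokenization, stemming, and lemmatization differ, using NLTK:

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

headline = "police investigating crashes on flooded roads"

tokens = word_tokenize(headline)                              # split the headline into word tokens
stems = [PorterStemmer().stem(t) for t in tokens]             # crude suffix stripping
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]   # dictionary-based base forms

print(tokens)   # ['police', 'investigating', 'crashes', 'on', 'flooded', 'roads']
print(stems)    # e.g. ['polic', 'investig', 'crash', 'on', 'flood', 'road']
print(lemmas)   # e.g. ['police', 'investigating', 'crash', 'on', 'flooded', 'road']

Stemming simply chops suffixes (producing non-words like "polic"), while lemmatization maps tokens to dictionary base forms, which is why the project later relies on lemmatization for cleaner topics.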

Steps for The Project

First, importing the necessary libraries.

import numpy as np
import pandas as pd
from IPython.display import display
from tqdm import tqdm
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
import scipy.stats as stats

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE
from wordcloud import WordCloud, STOPWORDS

from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook
output_notebook()

%matplotlib inline

Loading the CSV data into a dataframe while parsing the dates into a usable format.

path = '/content/drive/MyDrive/topic_modeling/abcnews-date-text.csv' #path of your dataset
df = pd.read_csv(path, parse_dates=[0], infer_datetime_format=True)

reindexed_data = df['headline_text']
reindexed_data.index = df['publish_date']
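If infer_datetime_format does not parse the compact YYYYMMDD integers correctly on your pandas version, a safe alternative (a small sketch reusing the same path variable as above) is to parse the column explicitly:

# Parse the YYYYMMDD integers explicitly instead of relying on inference
df = pd.read_csv(path)
df['publish_date'] = pd.to_datetime(df['publish_date'].astype(str), format='%Y%m%d')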

Taking a glimpse of the loaded data through the first five rows.

df.head()
"

There are 2 columns named publish_date and headline_text as mentioned above in the dataset description.

df.info() #general description of data
"

We can see that there are 1,244,184 rows in the dataset with no null values.

Later, we will use a sample of 100,000 rows of the data to keep the LDA model computationally feasible.

Exploratory Data Analysis

Starting with visualizing the top 15 words in the data without including stopwords.

def get_top_n_words(n_top_words, count_vectorizer, text_data):
    '''
    returns a tuple of the top n words in a sample and their 
    accompanying counts, given a CountVectorizer object and text sample
    '''
    vectorized_headlines = count_vectorizer.fit_transform(text_data.values)
    vectorized_total = np.sum(vectorized_headlines, axis=0)
    word_indices = np.flip(np.argsort(vectorized_total)[0,:], 1)
    word_values = np.flip(np.sort(vectorized_total)[0,:],1)
    
    word_vectors = np.zeros((n_top_words, vectorized_headlines.shape[1]))
    for i in range(n_top_words):
        word_vectors[i,word_indices[0,i]] = 1

    words = [word[0].encode('ascii').decode('utf-8') for 
             word in count_vectorizer.inverse_transform(word_vectors)]
    return (words, word_values[0,:n_top_words].tolist()[0])
    
# CountVectorizer builds a bag-of-words representation: each headline becomes a vector of word counts
count_vectorizer = CountVectorizer(max_df=0.8, min_df=2,stop_words='english')
words, word_values = get_top_n_words(n_top_words=15,
                                     count_vectorizer=count_vectorizer, 
                                     text_data=reindexed_data)

fig, ax = plt.subplots(figsize=(16,8))
ax.bar(range(len(words)), word_values);
ax.set_xticks(range(len(words)));
ax.set_xticklabels(words, rotation='vertical');
ax.set_title('Top words in headlines dataset (excluding stop words)');
ax.set_xlabel('Word');
ax.set_ylabel('Number of occurrences');
plt.show()
Figure: Top words in the headlines dataset (excluding stop words)
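To make the bag-of-words idea concrete, here is a tiny, self-contained illustration (on made-up toy headlines, not the project data) of what CountVectorizer produces:

from sklearn.feature_extraction.text import CountVectorizer

toy_headlines = ['fire crews battle bushfire', 'police investigate car fire']
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_headlines)  # sparse document-term count matrix

print(toy_vectorizer.get_feature_names_out())  # vocabulary learned from the toy corpus
                                               # (use get_feature_names() on older scikit-learn)
print(toy_counts.toarray())                    # one row of word counts per headline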

Now, doing part of speech tagging for the headlines.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tagged_headlines = [TextBlob(reindexed_data[i]).pos_tags for i in range(reindexed_data.shape[0])]
tagged_headlines[10] # checking the headline at index 10
"
tagged_headlines_df = pd.DataFrame({'tags':tagged_headlines})

word_counts = [] 
pos_counts = {}

for headline in tagged_headlines_df[u'tags']:
    word_counts.append(len(headline))
    for tag in headline:
        if tag[1] in pos_counts:
            pos_counts[tag[1]] += 1
        else:
            pos_counts[tag[1]] = 1
            
print('Total number of words: ', np.sum(word_counts))
print('Mean number of words per headline: ', np.mean(word_counts))

Output

Total number of words: 8166553

Mean number of words per headline: 6.563782366595294

Checking if the distribution is normal.

y = stats.norm.pdf(np.linspace(0,14,50), np.mean(word_counts), np.std(word_counts))

fig, ax = plt.subplots(figsize=(8,4))
ax.hist(word_counts, bins=range(1,14), density=True);
ax.plot(np.linspace(0,14,50), y, 'r--', linewidth=1);
ax.set_title('Headline word lengths');
ax.set_xticks(range(1,14));
ax.set_xlabel('Number of words');
plt.show()
Figure: Distribution of headline word lengths with a fitted normal curve

Visualizing the proportion of top 5 used parts of speech.

# importing libraries
import matplotlib.pyplot as plt
import seaborn as sns
  
# declaring data
pos_sorted_types = sorted(pos_counts, key=pos_counts.__getitem__, reverse=True)
pos_sorted_counts = sorted(pos_counts.values(), reverse=True)
  
top_five = pos_sorted_types[:5]
data = pos_sorted_counts[:5]
# declaring exploding pie
explode = [0, 0.1, 0, 0, 0]
# define Seaborn color palette to use
palette_color = sns.color_palette('dark')
  
# plotting data on chart
plt.pie(data, labels=top_five, colors=palette_color, explode=explode,
         autopct='%.0f%%')
  
# displaying chart
plt.show()
Figure: Proportion of the top five part-of-speech tags

Here, it’s visible that about 50% of the words in the headlines are nouns, which sounds reasonable for news headlines.

Pre-processing

First, sampling 100,000 headlines and converting the sentences into lists of words.

import gensim

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True removes punctuations
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  
text_sample = reindexed_data.sample(n=100000, random_state=0).values
data = text_sample.tolist()
data_words = list(sent_to_words(data))

print(data_words[0])

Making bigram and trigram models.

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) 
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  
# higher threshold fewer phrases.
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
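As an optional sanity check (not required by the pipeline), you can pass a tokenized headline through the phraser; word pairs that occur together frequently enough may be joined with an underscore, depending on the sample:

# Inspect what the bigram model does to one tokenized headline
print(data_words[0])
print(bigram_mod[data_words[0]])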

In this step, we will remove stop words, form bigrams and trigrams, and lemmatize the text.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]\
                                                               for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc \
                                     if token.pos_ in allowed_postags])
    return texts_out
# !python -m spacy download en_core_web_sm
import spacy

# Remove Stop Words
data_words_nostops = remove_stopwords(text_sample)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, \
                             allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
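Each entry of the corpus is a bag-of-words representation of one headline, i.e. a list of (token_id, count) pairs. A quick optional check shows how the ids map back to words:

# Inspect the first document in the corpus
print(corpus[0])  # e.g. [(0, 1), (1, 1), (2, 1)]
print([(id2word[token_id], count) for token_id, count in corpus[0]])  # readable form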

Topic Modeling

Applying the LDA model, assuming 15 topics across the whole dataset.

num_topics = 15

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=num_topics, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=0.01,
                                           eta=0.9)
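Once the model is trained, you can also query the topic mixture of an individual headline (a small usage example beyond the original walkthrough); each pair returned is (topic_id, probability):

# Topic distribution for the first headline in the corpus
doc_topics = lda_model.get_document_topics(corpus[0])
print(sorted(doc_topics, key=lambda pair: -pair[1]))  # most probable topics first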

Topic Interpretation

from pprint import pprint

# Print the keywords for the 15 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
Output:

[(0,
  '0.046*"new" + 0.034*"fire" + 0.020*"year" + 0.018*"ban" + 0.016*"open" + '
  '0.014*"set" + 0.011*"consider" + 0.009*"security" + 0.009*"name" + '
  '0.008*"melbourne"'),
 (1,
  '0.021*"urge" + 0.020*"attack" + 0.016*"government" + 0.014*"lead" + '
  '0.014*"driver" + 0.013*"public" + 0.011*"want" + 0.010*"rise" + '
  '0.010*"student" + 0.010*"funding"'),
 (2,
  '0.019*"day" + 0.015*"flood" + 0.013*"go" + 0.013*"work" + 0.011*"fine" + '
  '0.010*"launch" + 0.009*"union" + 0.009*"final" + 0.007*"run" + '
  '0.006*"game"'),
 (3,
  '0.023*"australian" + 0.023*"crash" + 0.016*"health" + 0.016*"arrest" + '
  '0.013*"fight" + 0.013*"community" + 0.013*"job" + 0.013*"indigenous" + '
  '0.012*"victim" + 0.012*"support"'),
 (4,
  '0.024*"face" + 0.022*"nsw" + 0.018*"council" + 0.018*"seek" + 0.017*"talk" '
  '+ 0.016*"home" + 0.012*"price" + 0.011*"bushfire" + 0.010*"high" + '
  '0.010*"return"'),
 (5,
  '0.068*"police" + 0.019*"car" + 0.015*"accuse" + 0.014*"change" + '
  '0.013*"road" + 0.010*"strike" + 0.008*"safety" + 0.008*"federal" + '
  '0.008*"keep" + 0.007*"problem"'),
 (6,
  '0.042*"call" + 0.029*"win" + 0.015*"first" + 0.013*"show" + 0.013*"time" + '
  '0.012*"trial" + 0.012*"cut" + 0.009*"review" + 0.009*"top" + 0.009*"look"'),
 (7,
  '0.027*"take" + 0.021*"make" + 0.014*"farmer" + 0.014*"probe" + '
  '0.011*"target" + 0.011*"rule" + 0.008*"season" + 0.008*"drought" + '
  '0.007*"confirm" + 0.006*"point"'),
 (8,
  '0.047*"say" + 0.026*"water" + 0.021*"report" + 0.020*"fear" + 0.015*"test" '
  '+ 0.015*"power" + 0.014*"hold" + 0.013*"continue" + 0.013*"search" + '
  '0.012*"election"'),
 (9,
  '0.024*"warn" + 0.020*"worker" + 0.014*"end" + 0.011*"industry" + '
  '0.011*"business" + 0.009*"speak" + 0.008*"stop" + 0.008*"regional" + '
  '0.007*"turn" + 0.007*"park"'),
 (10,
  '0.050*"man" + 0.035*"charge" + 0.017*"jail" + 0.016*"murder" + '
  '0.016*"woman" + 0.016*"miss" + 0.016*"get" + 0.014*"claim" + 0.014*"school" '
  '+ 0.011*"leave"'),
 (11,
  '0.024*"find" + 0.015*"push" + 0.015*"drug" + 0.014*"govt" + 0.010*"labor" + '
  '0.008*"state" + 0.008*"investigate" + 0.008*"threaten" + 0.008*"mp" + '
  '0.008*"world"'),
 (12,
  '0.028*"court" + 0.026*"interview" + 0.025*"kill" + 0.021*"death" + '
  '0.017*"die" + 0.015*"national" + 0.014*"hospital" + 0.010*"pay" + '
  '0.009*"announce" + 0.008*"rail"'),
 (13,
  '0.020*"help" + 0.017*"boost" + 0.016*"child" + 0.016*"hit" + 0.016*"group" '
  '+ 0.013*"case" + 0.011*"fund" + 0.011*"market" + 0.011*"appeal" + '
  '0.010*"local"'),
 (14,
  '0.036*"plan" + 0.021*"back" + 0.015*"service" + 0.012*"concern" + '
  '0.012*"move" + 0.011*"centre" + 0.010*"inquiry" + 0.010*"budget" + '
  '0.010*"law" + 0.009*"remain"')]

Evaluation

1. Calculating the coherence score, a measure of how semantically similar the top words within each topic are. The c_v coherence used here typically ranges between 0 and 1, with higher values indicating more interpretable topics.

from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized,\
                                    dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Output

Coherence Score: 0.38355488160129025

2. Calculating the perplexity, which measures how well the model’s probability distribution predicts a held-out sample (a lower perplexity indicates a better model).

perplexity = lda_model.log_perplexity(corpus)

print(perplexity)

Output

-10.416591518443418
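Note that gensim’s log_perplexity returns the per-word likelihood bound rather than the perplexity itself; the corresponding perplexity can be recovered as 2 raised to the negative bound. A minimal sketch using the value above:

# gensim reports the per-word likelihood bound; perplexity = 2 ** (-bound)
bound = lda_model.log_perplexity(corpus)
print('Per-word bound:', bound)
print('Perplexity:', 2 ** (-bound))  # roughly 1.4e3 for the bound printed above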

We can see that the coherence score is fairly low, yet the model still surfaces recognizable themes, and the score can likely be improved through hyperparameter tuning. The perplexity is also reasonably low, which is consistent with the roughly normal distribution of headline lengths observed in the exploratory data analysis section.
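One simple way to do that tuning (a sketch reusing the settings above, not an exhaustive search) is to retrain the model for a few candidate topic counts and compare their coherence scores; note that this retrains the model several times, so it can be slow on 100,000 headlines:

from gensim.models import CoherenceModel

for k in [5, 10, 15, 20]:
    model_k = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word,
                                         num_topics=k, random_state=100,
                                         chunksize=100, passes=10,
                                         alpha=0.01, eta=0.9)
    cm = CoherenceModel(model=model_k, texts=data_lemmatized,
                        dictionary=id2word, coherence='c_v')
    print(k, cm.get_coherence())  # pick the topic count with the highest coherence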

Conclusion

Topic modeling is an unsupervised learning technique for identifying themes in large sets of data. It is useful in various domains such as healthcare, finance, and marketing, where there is a huge amount of text-based data to analyze. In this project, we applied topic modeling to the “A Million Headlines” dataset, which consists of over one million news article headlines published by the ABC. The aim was to use the Latent Dirichlet Allocation (LDA) algorithm, a probabilistic generative model, to identify the main topics in the dataset.

The project plan involves several steps: exploratory data analysis to understand the data distribution, preprocessing the text by removing stop words, punctuation, etc., and applying techniques like tokenization, stemming, and lemmatization. The essence of the project revolves around topic modeling, leveraging LDA to identify the primary topics and themes within the news headlines. We analyze associated words and phrases to interpret the topics and assign human-readable labels to them. The evaluation of topic modeling algorithms encompasses metrics such as coherence score and perplexity, while also taking into account the limitations of the approach.

Key Takeaways

  • Topic Modeling is an effective way of finding broad themes from the data with Machine Learning (ML) without labels.
  • It has a wide range of applications from healthcare to recommender systems.
  • LDA is one effective way of implementing topic modeling.
  • Coherence score and perplexity are effective evaluation metrics for checking the performance of topic modeling through ML models.

Frequently Asked Questions

Q1. What is topic modeling in ML?

A. Topic modeling in ML refers to a technique that automatically extracts underlying themes or topics from a collection of text documents. It helps uncover latent patterns and structures, enabling tasks like document clustering, text summarization, and content recommendation in natural language processing (NLP) and machine learning.

Q2. What is topic modeling with examples?

A. Topic modeling, with an example, involves extracting topics from a set of news articles. The algorithm identifies topics such as “politics,” “sports,” and “technology” based on word co-occurrence patterns. This helps organize and categorize articles, making browsing and searching for specific topics of interest easier.

Q3. What is the best algorithm for topic modeling?

A. The best algorithm for topic modeling depends on the specific requirements and characteristics of the dataset. Popular algorithms include Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Latent Semantic Analysis (LSA). Each algorithm has its strengths and weaknesses, so the choice should align with the task at hand.

Q4. Is topic modeling an NLP technique?

A. Yes, topic modeling is a technique commonly used in natural language processing (NLP). It leverages machine learning algorithms to identify and extract topics from text data, allowing for better understanding, organization, and analysis of textual information. It aids in various NLP tasks, including text classification, sentiment analysis, and information retrieval.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I'm a software engineer (data analytics and AI) who majored in Data Science and Engineering at the Indian Institute of Science Education and Research Bhopal, with an interest in Deep Learning and Computer Vision. My hobbies include writing and playing badminton.

