Topic modelling is a natural language processing technique that uncovers hidden themes or topics within large text datasets by analyzing word patterns and grouping similar documents together. Methods like Latent Dirichlet Allocation (LDA) help organize unstructured text, such as articles or social media posts, making it easier to understand. By reading this article, you’ll gain a clear understanding of how topic modelling works, its real-world applications, and how it can save time, improve data analysis, and reveal valuable trends in text data, making it a powerful tool for anyone working with large volumes of information.
Topic modeling is a technique in natural language processing (NLP) for identifying abstract topics or themes in a collection of documents. It uncovers hidden patterns by grouping words that frequently occur together, so the main ideas in a large text dataset can be found without reading every document manually. Common algorithms for topic modeling include Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF).
Both topic modeling and clustering are unsupervised learning techniques used to group data, but they serve different purposes and work differently.
| Feature | Topic Modeling | Clustering |
| --- | --- | --- |
| Definition | Identifies hidden topics in text data. | Groups similar data points based on features. |
| Purpose | Finds themes in a collection of documents. | Organizes data into meaningful groups. |
| Data Type | Primarily used for text analysis. | Can be applied to text, numerical, and image data. |
| Methods Used | LDA, LSA, NMF. | K-Means, Hierarchical Clustering, DBSCAN. |
| Output | Topics represented by word distributions. | Groups (clusters) of similar data points. |
There are several approaches for obtaining topics from text, such as Term Frequency-Inverse Document Frequency (TF-IDF) and Non-Negative Matrix Factorization (NMF). Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique, and it is the one discussed in this article.
LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
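To make the generative story concrete, here is a minimal sketch of how a document could be produced from a mixture of topics. The two topics, their word probabilities, and the helper `generate_document` are made up for illustration. Full LDA would also draw the topic mixture itself from a Dirichlet prior; here it is fixed by hand for simplicity.

```python
import random

# Hypothetical toy model: two topics, each a probability distribution over words.
topics = {
    "health": {"sugar": 0.5, "doctor": 0.3, "stress": 0.2},
    "family": {"father": 0.4, "sister": 0.4, "driving": 0.2},
}


def generate_document(topic_mixture, n_words, rng):
    """Generate words by repeatedly picking a topic, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # Step 1: choose a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: choose a word according to that topic's word distribution.
        vocab = list(topics[topic])
        word_weights = [topics[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)
doc = generate_document({"health": 0.8, "family": 0.2}, n_words=10, rng=rng)
print(doc)
```

LDA works in the opposite direction: it only sees documents like `doc` and tries to recover plausible topics and mixtures.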
LDA can be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. Consider a corpus of N documents D1, D2, D3 … DN and a vocabulary of M words W1, W2 … WM. The value of cell (i, j) gives the frequency count of word Wj in document Di.
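As a small illustration of the document-term matrix, the sketch below builds one by hand for three made-up, already tokenized documents. The names `docs`, `vocab` and `dtm` are chosen for this example only.

```python
from collections import Counter

# Three toy documents, already tokenized (made-up data).
docs = [
    ["sugar", "bad", "sugar", "father"],
    ["father", "driving", "sister"],
    ["driving", "stress", "pressure"],
]

# The vocabulary fixes the column order of the matrix.
vocab = sorted({word for doc in docs for word in doc})

# One row per document, one column per vocabulary word, cells hold counts.
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts.get(word, 0) for word in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Most cells are zero even in this tiny example, which is why real document-term matrices are usually stored in a sparse format.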
LDA converts this document-term matrix into two lower-dimensional matrices, M1 and M2.
M1 is a document-topic matrix and M2 is a topic-term matrix, with dimensions (N, K) and (K, M) respectively, where N is the number of documents, K is the number of topics and M is the vocabulary size.
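The shapes can be checked with a tiny numeric sketch. The values below are invented, and the helper `matmul` is written out only to keep the example dependency-free. Under this matrix view, multiplying M1 by M2 gives, for each document, the word distribution implied by its topic mixture.

```python
# Made-up values: N = 2 documents, K = 2 topics, M = 3 vocabulary words.
M1 = [[0.9, 0.1],
      [0.2, 0.8]]        # document-topic matrix, shape (N, K); each row sums to 1
M2 = [[0.7, 0.2, 0.1],
      [0.1, 0.3, 0.6]]   # topic-term matrix, shape (K, M); each row sums to 1


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Shape (N, M): one word distribution per document.
doc_word = matmul(M1, M2)
for row in doc_word:
    print([round(p, 2) for p in row], "sum =", round(sum(row), 2))
```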
Notice that these two matrices already provide topic-word and document-topic distributions. However, these distributions need to be improved, which is the main aim of LDA. LDA uses sampling techniques to improve these matrices.
It iterates through each word “w” in each document “d” and tries to replace the current topic-word assignment with a new one. A new topic “t” is assigned to word “w” with a probability P, which is the product of two probabilities, p1 and p2.
For every topic, two probabilities p1 and p2 are calculated. p1 = p(topic t | document d) is the proportion of words in document d that are currently assigned to topic t. p2 = p(word w | topic t) is the proportion of assignments to topic t, over all documents, that come from word w.
The current topic-word assignment is then replaced by a new topic, chosen with probability proportional to the product of p1 and p2. In this step, the model assumes that all existing word-topic assignments except the one for the current word are correct. The product is essentially the probability that topic t generated word w, so it makes sense to resample the current word’s topic using it.
After a number of iterations, a steady state is achieved where the document topic and topic term distributions are fairly good. This is the convergence point of LDA.
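The sampling loop described above can be sketched as a small collapsed Gibbs sampler. This is an illustrative sketch, not the implementation any particular library uses: the function name `gibbs_lda`, the toy documents, and the default smoothing values `alpha` and `beta` are assumptions made for the example. The smoothing terms keep every topic's probability above zero, and p1 is left unnormalized because its denominator is the same for every topic.

```python
import random


def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns document-topic and topic-word counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]               # words in doc d assigned to topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # word w assigned to topic t
    topic_total = [0] * n_topics                              # all words assigned to topic t

    # Start from a random topic assignment for every word.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            t = rng.randrange(n_topics)
            doc_assignments.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment; all others are assumed correct.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                weights = []
                for t in range(n_topics):
                    p1 = doc_topic[d][t] + alpha  # proportional to p(topic t | document d)
                    p2 = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)  # p(word w | topic t)
                    weights.append(p1 * p2)

                # Resample the topic with probability proportional to p1 * p2.
                t_new = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return doc_topic, topic_word


toy_docs = [
    ["sugar", "health", "sugar", "doctor"],
    ["father", "sister", "driving", "father"],
    ["health", "doctor", "sugar"],
    ["sister", "driving", "father"],
]
doc_topic, topic_word = gibbs_lda(toy_docs, n_topics=2)
print(doc_topic)
```

Each row of `doc_topic` shows how many words of that document ended up in each topic. Normalizing the rows of the two count tables gives the document-topic and topic-term distributions.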
Alpha and beta hyperparameters: alpha represents document-topic density and beta represents topic-word density. With a higher alpha, documents are composed of more topics; with a lower alpha, documents contain fewer topics. Similarly, with a higher beta, topics are composed of a large number of words in the corpus; with a lower beta, they are composed of few words.
Number of iterations / passes: the maximum number of iterations allowed for the LDA algorithm to converge.
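The effect of alpha can be seen by drawing topic mixtures from a symmetric Dirichlet prior, which is the prior standard LDA places on each document's topic proportions. The sketch below is illustrative: the helper names, the choice of five topics, and the two alpha values are assumptions for this example. It measures how much of a document the single largest topic takes up on average.

```python
import random


def sample_dirichlet(concentration, n_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [x / total for x in draws]


def average_largest_share(concentration, n_topics=5, n_samples=1000, seed=0):
    """Average share of the dominant topic over many sampled mixtures."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(concentration, n_topics, rng)) for _ in range(n_samples)]
    return sum(largest) / n_samples


low_alpha_share = average_largest_share(0.1)    # low alpha: few topics per document
high_alpha_share = average_largest_share(10.0)  # high alpha: many topics per document
print(round(low_alpha_share, 2), round(high_alpha_share, 2))
```

With a low alpha, one topic tends to dominate each document, while a high alpha spreads the weight more evenly across topics. Beta plays the same role for the word distribution of each topic (gensim exposes it under the name `eta`).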
Here are the sample documents that together form the corpus.
```python
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]
```
Cleaning is an important step before any text mining task. In this step, we remove punctuation and stopwords and normalize the corpus.
```python
import string

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Requires the NLTK data packages 'stopwords' and 'wordnet',
# e.g. nltk.download('stopwords') and nltk.download('wordnet').
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()


def clean(doc):
    # Lowercase the text and drop stopwords.
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove punctuation characters.
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word.
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized


doc_clean = [clean(doc).split() for doc in doc_complete]
```
All the text documents combined are known as the corpus. To run any mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns in the entire document-term (DT) matrix. Python provides many great libraries for text mining; gensim is one such clean and well-designed library for handling text data. It is scalable, robust and efficient. The following code shows how to convert a corpus into a document-term matrix.
```python
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
```
The next step is to create an object for the LDA model and train it on the document-term matrix. Training also requires a few parameters as input, which are explained in the section above. The gensim module supports both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents.
```python
# Creating the object for the LDA model using the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

print(ldamodel.print_topics(num_topics=3, num_words=3))
# Output:
# ['0.168*health + 0.083*sugar + 0.072*bad',
#  '0.061*consume + 0.050*drive + 0.050*sister',
#  '0.049*pressur + 0.049*father + 0.049*sister']
```
Each line is a topic with its individual topic terms and weights. Topic 1 can be labeled Bad Health, and Topic 3 can be labeled Family.
The results of topic models depend entirely on the features (terms) present in the corpus. The corpus is represented as a document-term matrix, which in general is very sparse. Reducing the dimensionality of the matrix can improve the results of topic modeling. Based on my practical experience, there are a few approaches that do the trick.
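One simple way to shrink the document-term matrix is to drop terms that appear in very few or in almost all documents. The sketch below is one possible illustration, not necessarily the approach the author had in mind. The function `prune_by_document_frequency`, its thresholds, and the toy documents are assumptions for this example.

```python
from collections import Counter


def prune_by_document_frequency(tokenized_docs, min_docs=2, max_fraction=0.8):
    """Keep only terms that are neither too rare nor too common across documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term once per document
    keep = {
        term for term, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_fraction
    }
    return [[term for term in doc if term in keep] for doc in tokenized_docs]


toy_docs = [
    ["sugar", "bad", "father", "say"],
    ["father", "driving", "sister", "say"],
    ["driving", "stress", "say"],
    ["sugar", "health", "say"],
]
pruned = prune_by_document_frequency(toy_docs)
print(pruned)
```

Here "say" is removed because it occurs in every document, and terms that occur in only one document are removed as too rare. In gensim, `Dictionary.filter_extremes` offers a similar filter through its `no_below` and `no_above` parameters.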
LDA can sometimes also be used as a feature selection technique. Take the example of a text classification problem where the training data contains documents organized by category. If LDA is run separately on each category's documents, then removing the topic terms that are common across the results of different categories leaves the best features for each category.
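The filtering step can be sketched with plain set operations. The top topic terms per category below are invented placeholders standing in for the output of LDA runs on each category; only the removal of shared terms is shown.

```python
from collections import Counter

# Hypothetical top topic terms per category, as if LDA had been run on each one.
topic_terms_by_category = {
    "health": {"sugar", "doctor", "stress", "people", "say"},
    "family": {"father", "sister", "driving", "people", "say"},
}

# Count in how many categories each term appears.
category_count = Counter(
    term for terms in topic_terms_by_category.values() for term in terms
)
shared_terms = {term for term, count in category_count.items() if count > 1}

# Keep only the terms that are specific to a single category.
features_by_category = {
    category: terms - shared_terms
    for category, terms in topic_terms_by_category.items()
}
print(features_by_category)
```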
With this, we come to the end of this tutorial on topic modeling. I hope it helps you improve your knowledge of working with text data. To get the most out of this tutorial, I’d suggest you run the code side by side and check the results.
A. Topic modeling is used to uncover hidden patterns and thematic structures within a collection of documents. It aids in understanding the main themes and concepts present in the text corpus without relying on pre-defined tags or training data. By extracting topics, researchers can gain insights, summarize large volumes of text, classify documents, and facilitate various tasks in text mining and natural language processing.
A. The technique commonly used in topic modeling is Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model that assigns words to topics and topics to documents, allowing the discovery of latent topics within a text corpus. It is a widely adopted method for topic modeling in natural language processing.
A. While topic modeling involves the identification of clusters or groups of similar words within a body of text, it is not strictly considered a clustering technique in the traditional sense. Topic modeling aims to discover the underlying thematic structures or topics within a text corpus, which goes beyond the notion of clustering based solely on word similarity. It uses statistical models, such as Latent Dirichlet Allocation (LDA), to assign words to topics and topics to documents, providing a way to explore the latent semantic relationships in the data.
