Hacks to perform faster Text Mining in R

Tavish Srivastava Last Updated : 11 Dec, 2015


Introduction

Data science demands versatility: move away from your regular methods, challenge your ways of working and explore more efficient ways of doing things. Looking back at my initial years in data science, I too was trapped by the devil of ‘complacency’. At one point, I was not challenging myself enough or experimenting with how I worked. I accepted things as they were, until I realized that ‘Complacency is a state of mind that exists only in retrospective: it has to be shattered before being ascertained’. Now, whenever possible, I challenge my ways of working with the aim of doing things faster and more efficiently. It helps me discover new ways of working in data science.

Text mining is one of the most frequent yet challenging exercises faced by data science beginners and analytics experts alike. The biggest challenge is that you need to assess the underlying patterns in the text thoroughly, and often manually. For example, it is common to delete numbers from the text before doing any kind of text mining. But what if we want to extract something like “24/7”? Hence, the text cleansing exercise has to be tailored to the objective of the exercise and the type of text patterns.
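To make the “24/7” point concrete, here is a minimal sketch in base R. The sample strings are made up for illustration, and the pattern is only one possible way to protect slash-separated numbers; your own text may need a different rule.

```r
# Hypothetical sample text (not from the article's data)
text <- c("Open 24/7 since 1999", "Call 555 1234 now")

# Remove runs of digits, except those that touch a "/" (so "24/7" survives).
# The lookbehind/lookahead need perl = TRUE.
cleaned <- gsub("(?<![0-9/])[0-9]+(?![0-9/])", "", text, perl = TRUE)

# Collapse the extra spaces left behind by the removal
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

A blanket rule such as removing every digit would have turned “24/7” into “/”, which is why the cleaning rule has to follow the objective.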

Broadly, we work on two aspects of text mining:

Sentiment Mining: Here, we are more concerned with deciphering the sentiment of the author.
Subject Extraction: Here, we wish to pull out the main subject of the chosen text. This is done prior to sentiment mining.

You can find numerous ways on the internet to do sentiment analysis. Subject extraction, however, is very specific to the context. In this article, I share the top 4 hacks applied in the industry to do subject extraction in R. For ease, I’ve also highlighted the strengths and weaknesses of each trick.

Top 4 Hacks in R

1. Keyword Match Algorithm

This is one of the most powerful tools for text mining. Let’s first look at the R code to execute this step:

# Import the list of keywords: the Keywords column holds the text to search for
# and the Merchant_Name column holds the tag to populate on a match
ss <- read.csv("keywords.csv")
Keywords <- as.character(ss$Keywords)
tags <- as.character(ss$Merchant_Name)

# Data1 is the complete data from which you are trying to extract the text
# (text in column 1, tag in column 2, match flag in column 4).
# We look at the text line by line. Later keywords overwrite earlier tags,
# so put the keywords you want to win at the end of the list.
for (i in seq_along(Keywords)) {
  for (j in seq_len(nrow(Data1))) {
    # Here is where you do the actual search
    if (grepl(Keywords[i], Data1[j, 1])) {
      Data1[j, 2] <- tags[i]
      # Flag 1 to those observations where you find a match
      Data1[j, 4] <- 1
    }
  }
}

Now let’s try to see the strengths and weaknesses of this algorithm.

Strengths

It is highly effective at extracting keywords from words that are not well separated. For instance, this algorithm can pull out “Tavish” from “#DataScientistTavishSrivastava”.
The algorithm lets you assign a priority order to the keywords. For instance, if I need to give “Tavish” higher priority than “Srivastava” in the above hash-tag, it can easily be done.

Weaknesses

It needs a pre-defined list of keywords to search for.
It can produce many misclassified cases. For instance, if you want to search for “APE” in the text, you will also erroneously tag “CAPE” as “APE”.
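The strengths and weaknesses above can be reproduced on a toy dataset. The sketch below is a vectorised variant of the loop, not the original code: the keyword list, the tags and the three sample rows are all hypothetical, and `fixed = TRUE` is an assumption that the keywords are plain strings rather than regular expressions.

```r
# Hypothetical keyword list; later entries take priority because they overwrite
keywords <- c("SRIVASTAVA", "TAVISH", "APE")
tags     <- c("Srivastava", "Tavish", "Ape")

# Hypothetical data: the text to search, plus empty tag and flag columns
Data1 <- data.frame(
  text = c("#DATASCIENTISTTAVISHSRIVASTAVA", "RED CAPE SALE", "NOTHING HERE"),
  tag = NA_character_,
  stringsAsFactors = FALSE
)
Data1$matched <- 0

for (i in seq_along(keywords)) {
  # grepl() is vectorised, so one call checks every row for this keyword
  hit <- grepl(keywords[i], Data1$text, fixed = TRUE)
  Data1$tag[hit] <- tags[i]
  Data1$matched[hit] <- 1
}

print(Data1)
```

The first row ends up tagged “Tavish” because that keyword comes after “SRIVASTAVA” in the list, and the second row shows the “CAPE” misclassification: it is tagged “Ape” even though the word “APE” never appears on its own.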

2. Word Match Algorithm

This is the fix for the second weakness (misclassified cases) of the previous algorithm. In this algorithm, we try to match whole words instead of keywords. Here is the R code:

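library(stringr)  # provides word()

words <- read.csv("word_match.csv")
# Named match_words so that it does not shadow stringr's word() function
match_words <- as.character(words$Keywords)
tags <- as.character(words$Tag)

for (i in seq_along(match_words)) {
  for (j in seq_len(nrow(Data1))) {
    # Compare the first space-separated word of the text with the word to match
    if (word(unlist(Data1[j, 1]), 1) == match_words[i]) {
      Data1[j, 2] <- tags[i]
      Data1[j, 4] <- 1
    }
  }
}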

Strengths

It works very well for finding well separated words. For instance, this algorithm can effortlessly pull out “Tavish” from “Tavish Srivastava”.
This algorithm also allows a priority order among the words. For instance, if I need to give “Tavish” higher priority than “Srivastava” in the above name, it can easily be done.

Weaknesses

It needs a pre-defined list of words to search for.
It only captures the first well separated word. The algorithm can be modified to search among all words though.
It misses words that are not well separated.
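As a rough sketch of the modification mentioned above (searching among all words rather than only the first), the text can be split on whitespace and each token compared as a whole. This uses base R only; the word list and sample texts are hypothetical.

```r
# Hypothetical words to match and the tags to assign
match_words <- c("APE", "TAVISH")
tags        <- c("Ape", "Tavish")

# Hypothetical sample texts
texts <- c("RED CAPE SALE", "GREAT APE HOUSE", "TAVISH SRIVASTAVA")

# Split every text into its space-separated words
tokens <- strsplit(texts, "\\s+")

tag <- rep(NA_character_, length(texts))
for (i in seq_along(match_words)) {
  # A row is a hit only if one of its words equals the target exactly
  hit <- vapply(tokens, function(tok) match_words[i] %in% tok, logical(1))
  tag[hit] <- tags[i]
}

print(tag)
```

Here “RED CAPE SALE” stays untagged because “CAPE” is not equal to “APE”, while “APE” is found even though it is the second word of its row.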

3. Regular Expressions

This method needs extensive research on the sentence structures. For ease of understanding, I’ve taken an uncomplicated example of “www.dummyvalue.com”. Here is the code:

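library(stringr)  # provides str_locate() and fixed()

for (i in seq_len(nrow(Data1))) {
  text <- as.character(unlist(Data1[i, 1]))
  # The text is assumed to be upper case, e.g. "WWW.DUMMYVALUE.COM"
  if (grepl("WWW.", text, fixed = TRUE) && grepl(".COM", text, fixed = TRUE)) {
    start <- str_locate(text, fixed("WWW."))[2]  # position of the dot after "WWW"
    end <- str_locate(text, fixed(".COM"))[1]    # position of the dot before "COM"
    # Take only the part between the two dots; paste() adds the dots back
    Data1[i, 2] <- paste("www", tolower(substr(text, start + 1, end - 1)), "com", sep = ".")
    Data1[i, 4] <- 1
  }
}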

Strengths

It does not need any kind of list to start with.
Usually, it turns out to be highly accurate if you are able to find a strong regular expression.

Weaknesses

It needs deep research to create a good regular expression.
If the data is not well structured, this method is able to tag only a very small number of observations.
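The same extraction can also be written as a single pattern. The sketch below uses base R's regexec() and regmatches(); the pattern and the sample strings are assumptions for illustration, and a real dataset would likely need a broader expression (other domain endings, missing “www”, and so on).

```r
# Hypothetical sample texts
texts <- c("PAYMENT TO WWW.DUMMYVALUE.COM REF 123", "NO URL HERE")

# "www." + a run of letters, digits or hyphens + ".com", in any case
pattern <- "www\\.([a-z0-9-]+)\\.com"
m <- regmatches(texts, regexec(pattern, texts, ignore.case = TRUE))

# Each element of m is character(0) when there is no match,
# otherwise c(full match, captured group)
domain <- vapply(
  m,
  function(x) if (length(x) == 2) tolower(x[1]) else NA_character_,
  character(1)
)

print(domain)
```

Rows without a match come back as NA, which mirrors the weakness above: on poorly structured text only a few rows will be tagged.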

4. Word Association

I bet this method is good enough to challenge you intellectually. So that you can work on it yourself, instead of giving away the entire code, I’ve provided the step-by-step method. If you still find it difficult, mention your request for code in the comment section below.

Step 1: Find the most frequent words that could be what you are looking for.

Step 2: Find the words most strongly associated with these frequently occurring words.

Step 3: For each of the pairs, find the best frequency-association pair (this will need a number of iterations).
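As a starting point only, Steps 1 and 2 could be sketched in base R as below. This is one possible reading of the steps, not the author's implementation: the sample documents are made up, and measuring association as the correlation between word-occurrence indicators is an assumption. Step 3, the iterative tuning, is left open.

```r
# Hypothetical documents
docs <- c("PAYMENT TO ACME STORES", "ACME STORES REFUND",
          "PAYMENT TO GLOBEX CORP", "GLOBEX CORP PAYMENT")

tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Binary document-term matrix: one row per document, one column per word
dtm <- t(vapply(tokens, function(tok) as.numeric(vocab %in% tok),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Step 1: number of documents each word appears in, most frequent first
freq <- sort(colSums(dtm), decreasing = TRUE)

# Step 2: association between words, measured here as the correlation
# of their occurrence indicators across documents
assoc <- cor(dtm)
diag(assoc) <- NA  # ignore each word's association with itself

# The word most associated with "ACME"
best_partner <- names(which.max(assoc["ACME", ]))

print(freq)
print(best_partner)
```

In this toy data the most frequent word (“PAYMENT”) is not a subject at all, while “ACME” and “STORES” always occur together, which hints at why frequency and association have to be weighed against each other in Step 3.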

Strengths

No dictionary is required.
If parameters are optimized well, it can be highly predictive.
It can act as a feedback to other algorithms.
You can use this algorithm even if you don’t know the language of the text.

Weaknesses

It is sometimes not very precise on the subject name. It tends to capture even those trends which do not mean anything significant.

End Notes

I hope you find these 4 hacks useful enough to speed up your text mining process. I’d encourage you to take a shot at the code for the last algorithm and share it in the comment box below. This list is in no way exhaustive of what can be done in subject extraction.

All these algorithms can be used together on the same text to boost performance. However, in those cases you need to create decision points for when to use which algorithm.
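One simple form of decision point is a fixed order of preference: try the strictest method first and fall back to looser ones only when it finds nothing. The sketch below assumes that order (regular expression, then whole word, then keyword); the function name tag_one, the lists and the sample texts are all hypothetical.

```r
# Hypothetical lists for the word match and keyword match steps
match_words  <- c("APE");    word_tags    <- c("Ape")
keywords     <- c("TAVISH"); keyword_tags <- c("Tavish")

tag_one <- function(text) {
  # 1. Regular expression: most precise, so it is tried first
  m <- regmatches(text, regexec("WWW\\.[A-Z0-9-]+\\.COM", text))[[1]]
  if (length(m) > 0) return(tolower(m[1]))

  # 2. Word match: exact comparison against space-separated words
  tokens <- strsplit(text, "\\s+")[[1]]
  hit <- which(match_words %in% tokens)
  if (length(hit) > 0) return(word_tags[hit[1]])

  # 3. Keyword match: substring search, the loosest fallback
  hit <- which(vapply(keywords, grepl, logical(1), x = text,
                      fixed = TRUE, USE.NAMES = FALSE))
  if (length(hit) > 0) return(keyword_tags[hit[1]])

  NA_character_
}

texts <- c("PAYMENT WWW.DUMMYVALUE.COM", "GREAT APE HOUSE",
           "#DATASCIENTISTTAVISHSRIVASTAVA", "RED CAPE SALE")
results <- vapply(texts, tag_one, character(1), USE.NAMES = FALSE)

print(results)
```

Keeping “APE” in the word list rather than the keyword list is what stops “RED CAPE SALE” from being tagged here; which list a term belongs to is itself one of the decisions you have to make.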

Did you find this article helpful? Please share your opinions / thoughts in the comments section below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Tavish Srivastava

Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.


Aiswarya

Hi Tavish, Very illuminating article. It would be nice if you can provide the csv files that you have used. I would like to try the same and understand better the strengths and weakness of each algo that you have used.

Nitin

Hi tavish, Can you please provide the excel sheet for the above. It would give better understanding in following the above

TheR.Enthusiast

Great post; however, I wish you did it with an actual dataset so it would have some level of reproducibility.
