Data science demands versatility: move away from your regular methods, challenge your ways of working, and explore more efficient ways of doing things. Looking back at my initial years in data science, I too was trapped by the devil of ‘complacency’. At one point I was not challenging myself enough; I wasn’t experimenting with how I worked. I accepted things as they were, until I realized that complacency is a state of mind that exists only in retrospect: it has to be shattered before it can be ascertained. Now, whenever possible, I challenge my ways of working with the aim of becoming faster and more efficient. It helps me discover new ways of working in data science.
Text mining is one of the most frequent yet challenging exercises for beginners in data science and analytics experts alike. The biggest challenge is that you need to assess the underlying patterns in the text thoroughly, and largely manually. For example, it is pretty common to delete numbers from text before doing any kind of text mining, but what if you want to extract something like “24/7”? The text cleansing exercise is therefore highly personalized, depending on the objective of the exercise and the type of text patterns involved.
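As a minimal illustration (the transaction string and placeholder are made up), one workaround is to shield the pattern you want to keep behind a placeholder before stripping the remaining digits:

txt <- "OPEN 24/7 CALL 18005551234"            # made-up example string

txt <- gsub("24/7", "TWENTYFOURSEVEN", txt)    # shield the pattern we want to keep
txt <- gsub("[0-9]+", "", txt)                 # now strip all remaining numbers
txt <- gsub("TWENTYFOURSEVEN", "24/7", txt)    # restore the shielded pattern
txt                                            # "OPEN 24/7 CALL "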
Broadly, we work on two aspects of text mining: sentiment analysis and subject extraction.
You may find numerous ways on the internet to do sentiment analysis; subject extraction, however, is very specific to the context. In this article, I have shared the top 4 hacks used in the industry to do subject extraction in R. For ease of reference, I’ve also highlighted the strengths and weaknesses of each trick.
The first hack, keyword matching, is the most powerful tool for this kind of text mining. Let’s first look at the R code that executes this step:
# Import the list of keywords: the first column holds the keyword you wish to match,
# the second the tag you need to populate
ss       <- read.csv("keywords.csv")
Keywords <- as.character(ss$Keywords)
tags     <- as.character(ss$Merchant_Name)

# Data1 is the complete data from which you are trying to extract the text;
# we look at the text line by line
for (i in 1:length(Keywords)) {
  for (j in 1:nrow(Data1)) {
    if (grepl(Keywords[i], Data1[j, 1])) {   # here is where you do the actual search
      Data1[j, 2] <- tags[i]
      Data1[j, 4] <- 1                       # flag the observations where you find a match
    }
  }
}
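As a side note, offered purely as a sketch rather than part of the method above: grepl() is vectorized over its text argument, so the inner loop over rows can be dropped, which matters once Data1 grows large:

for (i in 1:length(Keywords)) {
  hits <- grepl(Keywords[i], Data1[, 1])   # one vectorized pass over all rows
  Data1[hits, 2] <- tags[i]
  Data1[hits, 4] <- 1
}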
Now let’s look at the strengths and weaknesses of this algorithm.
Strengths
Weaknesses
This hack fixes the second weakness (mis-classified cases) of the previous algorithm: instead of matching keywords anywhere in the text, we match words. Here is the R code:
library(stringr)   # provides word()

# Import the word list: the first column holds the word to match, the second the tag
words <- read.csv("word_match.csv")
word  <- as.character(words$Keywords)
tags  <- as.character(words$Tag)

# Compare the first word of each line of text against the word list
for (i in 1:length(word)) {
  for (j in 1:nrow(Data1)) {
    if (word(unlist(Data1[j, 1]), 1) == word[i]) {   # word() here is the stringr function
      Data1[j, 2] <- tags[i]
      Data1[j, 4] <- 1   # flag the observations where you find a match
    }
  }
}
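The heavy lifting here is done by stringr’s word(), which returns the n-th word of a string, so the comparison above only ever looks at the first word of each line. A quick illustration on a made-up transaction string:

library(stringr)
word("STARBUCKS COFFEE SEATTLE 0456", 1)   # returns "STARBUCKS"

Note that the character vector word and the function word() coexist here; since the local word is not a function, R resolves the call to the stringr function.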
Strengths
Weaknesses
This method needs extensive research on the sentence structures involved. For ease of understanding, I’ve taken the uncomplicated example of extracting “www.dummyvalue.com”. Here is the code:
library(stringr)   # provides str_locate()

for (i in 1:nrow(Data1)) {
  # Process only the lines that contain both "WWW" and "COM"
  if (grepl("WWW", Data1[i, 1]) & grepl("COM", Data1[i, 1])) {
    start <- str_locate(unlist(Data1[i, 1]), "WWW")[2]   # position of the last character of "WWW"
    end   <- str_locate(unlist(Data1[i, 1]), "CO")[1]    # position of the first "CO" (the "COM" in this example)
    # Stitch the extracted middle part back into a proper URL
    Data1[i, 2] <- paste("www", tolower(substr(unlist(Data1[i, 1]), start + 1, end - 1)), "com", sep = ".")
    Data1[i, 4] <- 1
  }
}
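If the pattern is as regular as this, a regex with a capture group can do the locating and substring work in one step. Here is a sketch using stringr’s str_match() on a made-up string:

library(stringr)

txt <- "PAYMENT AT WWWDUMMYVALUECOM 0105"          # made-up example string
mid <- str_match(txt, "WWW([A-Z0-9]+)COM")[, 2]    # capture the part between "WWW" and "COM"
paste("www", tolower(mid), "com", sep = ".")       # "www.dummyvalue.com"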
Strengths
Weaknesses
I bet this method is good enough to challenge you intellectually. So that you can work it out yourself, I’ve laid out the method step by step instead of giving away the entire code. If you still find it difficult, request the code in the comments section below.
Step 1: Find the most frequent words, which could plausibly be what you are looking for.
Step 2: Find the words most strongly associated with these frequently occurring words.
Step 3: For each pair, find the best frequency-association combination (this will take a number of iterations).
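As a starting point only, and not the full solution, here is a minimal sketch of Steps 1 and 2 using the tm package, assuming the raw text sits in the first column of Data1; both cutoffs are arbitrary and Step 3 is left to you:

library(tm)

corpus <- VCorpus(VectorSource(Data1[, 1]))        # build a corpus from the text column
tdm    <- TermDocumentMatrix(corpus)               # term-document matrix of the corpus

freq_terms <- findFreqTerms(tdm, lowfreq = 50)     # Step 1: frequent words (50 is an arbitrary cutoff)
assocs     <- findAssocs(tdm, freq_terms, 0.3)     # Step 2: associated words (0.3 correlation is arbitrary)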
Strengths
Weaknesses
I hope you find these 4 hacks useful enough to speed up your text mining process. I’d encourage you to take a shot at coding the last algorithm and share it in the comments box below. This list is in no way exhaustive of what can be done in subject extraction.
All these algorithms can be used together on the same text to boost performance. In that case, however, you need to create decision points for when to use which algorithm.
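For instance, here is a hypothetical two-pass chain in which the flag column acts as the decision point, so each later hack only touches the rows the earlier ones left unmatched (Keywords, tags, and word come from the earlier snippets; word_tags is a rename to avoid clashing with the first hack’s tags; Data1[, 1] is assumed to be character):

library(stringr)

Data1[, 4] <- 0                                            # 0 = not matched yet

# Pass 1: keyword match (hack 1)
for (i in 1:length(Keywords)) {
  hits <- Data1[, 4] != 1 & grepl(Keywords[i], Data1[, 1])
  Data1[hits, 2] <- tags[i]
  Data1[hits, 4] <- 1
}

# Pass 2: first-word match (hack 2), only where pass 1 found nothing
for (i in 1:length(word)) {
  hits <- Data1[, 4] != 1 & word(Data1[, 1], 1) == word[i]
  hits[is.na(hits)] <- FALSE                               # guard against empty strings
  Data1[hits, 2] <- word_tags[i]
  Data1[hits, 4] <- 1
}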
Did you find this article helpful? Please share your opinions / thoughts in the comments section below.
Hi Tavish, very illuminating article. It would be nice if you could provide the CSV files you used. I would like to try the same and better understand the strengths and weaknesses of each algorithm.
Hi Tavish, can you please provide the Excel sheet for the above? It would make the steps easier to follow.
Great post; however, I wish you had done it with an actual dataset so it would have some level of reproducibility.