In the past, if you were not particularly happy with a service or a product, you would go to the service provider or the shop and lodge a complaint. With businesses moving online and operating at enormous scale, lodging complaints in person may not always be possible. Electronic channels such as email, social media and, in particular, websites like www.consumercomplaints.in that focus on such issues are widely used platforms to vent anger as well as to publicize the issue in the expectation of quick action.
Keeping a close watch on complaints on such sites has become imperative for businesses such as banks. This article looks at ways to structure these unstructured complaints in an actionable form.
In a typical case, a bank may be interested in classifying the complaints into categories such as “Loans”, “Fixed Deposits”, “Credit Cards”, etc., so that they can be forwarded to the respective departments. Another important feature would be to summarize long complaints so that further actions can be formulated quickly. Sentiment analysis of such complaints is typically not very useful, as most of them are highly negative anyway. This article proposes a way to classify and summarize customer complaints seen on a consumer complaints website.
A Natural Language Processing (NLP) pipeline is utilized to structure the text, with stages such as scraping the complaints, extracting features (subject, counts, known categories) and summarization.
The following sections describe all these stages, with more elaboration on the core feature-extraction process: summarization.
Various tools and libraries can be used to scrape reviews from websites. Python has libraries such as “requests” and “BeautifulSoup” to take care of these tasks. Useful tutorials can be found at Tutorial1 and Tutorial2. The output of this stage is a set of text files with one complaint in each; a minimal scraping sketch follows the sample complaint below. A sample complaint looks as follows (some of the text has been masked, for the sake of confidentiality):
PIN not received.
17 Reviews
<CUSTOMER USER NAME>
Hello,
I have issued a new Debit/ATM card for my account no. 039309999999. This charged around Rs 240 (so unreasonable amount) for it on my account and
Bank officer tells this is service charge. The card was delivered to my home however pin did not come to home. : : : BANK charges so much for such small things with pathetic service. Even an account statement incurs around Rs. 150 or 200 (i don't remember exactly). Is there anybody from BANK who can take responsibility and look into this matter? Regards,
XXXX
16 Comments Updated: Mar 19, 2010
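As mentioned above, a minimal sketch of the scraping step could look as follows. The URL and the “complaint” CSS class are hypothetical; the actual selectors depend on the page structure at the time of scraping.

import requests
from bs4 import BeautifulSoup

# Hypothetical URL and CSS class; inspect the actual page to find the right ones
url = "https://www.consumercomplaints.in/bycompany/some-bank"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Each complaint is assumed to sit in its own container element
for i, block in enumerate(soup.find_all("div", class_="complaint")):
    text = block.get_text(separator="\n", strip=True)
    with open("complaint_{}.txt".format(i), "w") as f:
        f.write(text)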
Apart from the core text of the complaint, useful features are:
The subject, being the first line of the complaint, is easy to extract. It forms a one-line gist of the issue.
Other features, such as the number of reviews, comments, etc., typically appear in a fixed format and can be extracted with regular expressions. Tutorials like this can be used to learn how to extract them.
Extraction of known categories such as “Loans”, “Fixed Deposits” and “Credit Cards” can also be done by matching pre-defined keywords with regular expressions, as sketched below.
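As a minimal sketch (the patterns follow the fixed format visible in the sample complaint above; the keyword lists are illustrative, not the article's actual configuration):

import re

# Patterns based on the fixed format seen in the sample complaint
review_pat = re.compile(r'(\d+)\s+Reviews')
comment_pat = re.compile(r'(\d+)\s+Comments\s+Updated:\s*(.+)')

# Illustrative keyword lists for category matching
category_keywords = {
    'Credit Cards': ['credit card', 'debit card', 'atm card', 'pin'],
    'Loans': ['loan', 'emi'],
    'Fixed Deposits': ['fixed deposit', 'deposit'],
}

def extract_features(complaint_text):
    features = {}
    m = review_pat.search(complaint_text)
    if m:
        features['num_reviews'] = int(m.group(1))
    m = comment_pat.search(complaint_text)
    if m:
        features['num_comments'] = int(m.group(1))
        features['updated'] = m.group(2).strip()
    # Category by keyword matching
    lower = complaint_text.lower()
    features['categories'] = [cat for cat, words in category_keywords.items()
                              if any(w in lower for w in words)]
    return features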
Customer complaints can be very long, and at such volumes it is impossible to read through all of them manually. Effective summarization compresses the text into a few meaningful lines.
The following section elaborates one way of summarizing customer complaints.
Text summaries can be Abstractive or Extractive. In Abstractive summarization, the summary is constructed by employing words and phrases which are (typically) NOT in the original text, whereas in Extractive summarization, a few highly representative sentences are picked from the original text and ordered to form the summary. The proposed method is of the Extractive type.
Given a document D, having sentences (s₁, s₂, …, sₙ), return Y, a set of K important sentences from D. Extractive text summarization is thus a binary classification problem: out of n sentences, K are labelled True (meaning they are part of the summary) and the rest False. So, the problem boils down to determining whether a sentence sᵢ should be labelled True or False.
The labelling decision depends on various factors, called summary features. In the overall process, these features are computed for each sentence, their weighted sum gives the sentence a rank, and the top K ranked sentences are chosen as the set Y, representing the summary.
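In equation form, for a sentence sᵢ with normalized feature values f₁(sᵢ), …, f₆(sᵢ) (one per feature listed below) and weights w₁, …, w₆:

Rank(sᵢ) = w₁·f₁(sᵢ) + w₂·f₂(sᵢ) + … + w₆·f₆(sᵢ)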
In the current method, the following features are incorporated to arrive at the rank of a sentence:

1. TF-ISF: Term Frequency-Inverse Sentence Frequency, the sentence-level analogue of TF-IDF, in which each sentence is treated as a document. Sentences containing words that are frequent within the sentence but rare across sentences score higher.
import math

def tf(word, doc):
    # Term frequency: occurrences of the word relative to sentence length
    count = doc.count(word)
    total = len(doc)
    tf_score = count / float(total)
    return tf_score

def n_containing(word, docs):
    # Number of sentences containing the word
    count = 0
    for doc in docs:
        if doc.count(word) > 0:
            count += 1
    return count

def isf(word, docs):
    # Inverse sentence frequency, analogous to IDF with sentences as documents
    doc_count = n_containing(word, docs)
    ratio = len(docs) / float(1 + doc_count)
    return math.log(ratio)

def tfisf(word, doc, docs):
    tf_score = tf(word, doc)
    isf_score = isf(word, docs)
    return tf_score * isf_score

def normalize(scores):
    # Min-max normalization to the 0-1 range
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def compute_tfisf_scores(sentences):
    # Average TF-ISF of the words in each sentence, normalized to 0-1
    tfisf_scores = []
    for sent in sentences:
        sentence_score = 0
        for word in sent:
            sentence_score += tfisf(word, sent, sentences)
        sentence_score /= float(len(sent))
        tfisf_scores.append(sentence_score)
    return normalize(tfisf_scores)
2. Length: The number of words in a sentence can indicate its importance. Very short sentences are less important, as they may not represent the gist of the whole text.
3. Position: Sentences occurring at the beginning and towards the end carry more meaning than the middle ones. The first sentence is of utmost importance.
4. Proper Nouns: Sentences containing proper nouns (POS tag “NNP”) are important, as they contain names of places, persons, etc.
5. Cue Words: Domain-specific words such as “Undelivered”, “Fraud”, etc. suggest important sentences. So, sentences having more such words are given more weight.
6. Topic Words: Topic words are derived as the central words of the whole text; they could be words such as “Debit”, “Loan”, etc. Sentences aligned more with them are central to the text and thus more eligible to be part of the summary. They can be identified with LDA, as shown below (a combined sketch for scoring the remaining features follows the LDA snippet).
from gensim import corpora, models

def identify_lda_topics(cleaned_sentences):
    # Build a bag-of-words corpus from the tokenized, cleaned sentences
    dictionary = corpora.Dictionary(cleaned_sentences)
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in cleaned_sentences]
    # Fit an LDA model and collect the top words of each topic
    ldamodel = models.ldamodel.LdaModel(doc_term_matrix, num_topics=6,
                                        id2word=dictionary, passes=5)
    topic_names = []
    for topic in ldamodel.show_topics(num_topics=6, formatted=False, num_words=6):
        topicwords = [w for (w, val) in topic[1]]
        topic_names += topicwords
    return list(set(topic_names))
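For completeness, here is a minimal sketch for scoring the remaining features. The exact scoring functions are assumptions in the spirit of the descriptions above, not the article's own code; it reuses the normalize helper from the TF-ISF snippet, and NLTK's pos_tag is used for proper nouns (it requires the tagger data to be downloaded).

import nltk

def compute_feature_scores(sentences, cue_words, topic_words):
    # sentences: list of tokenized sentences; cue_words/topic_words: lowercase sets
    n = len(sentences)
    # Length: word count per sentence
    length_scores = [len(sent) for sent in sentences]
    # Position: ends score highest, middle lowest (one simple choice)
    position_scores = [abs(i - (n - 1) / 2.0) / ((n - 1) / 2.0) if n > 1 else 1.0
                       for i in range(n)]
    # Proper nouns: count of NNP-tagged tokens per sentence
    proper_noun_scores = [sum(1 for _, tag in nltk.pos_tag(sent) if tag == 'NNP')
                          for sent in sentences]
    # Cue and topic words: counts of matching keywords per sentence
    cue_scores = [sum(1 for w in sent if w.lower() in cue_words)
                  for sent in sentences]
    topic_scores = [sum(1 for w in sent if w.lower() in topic_words)
                    for sent in sentences]
    return (normalize(length_scores), normalize(position_scores),
            normalize(proper_noun_scores), normalize(cue_scores),
            normalize(topic_scores))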
Each feature's values are normalized to lie in the range 0 to 1.
Rank is computed for each sentence as the weighted sum of its features. Values of the weights can be derived either empirically or by employing machine/deep learning algorithms such as Naïve Bayes, Logistic Regression, Support Vector Machines, etc. The current method computes Rank as:
rank_scores = weight_TfIsf * df['TfIsf'] + \
              weight_Length * df['Length'] + \
              weight_Position * df['Position'] + \
              weight_ProperNouns * df['ProperNouns'] + \
              weight_TopicWords * df['TopicWords'] + \
              weight_CueWords * df['CueWords']
A data frame is populated with the summary features, one row per sentence, with columns such as TfIsf, Length, Position, ProperNouns, TopicWords, CueWords and the computed Rank.
The data frame is then sorted based on the “Rank” and fed for summary generation.
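A minimal sketch of these two steps, assuming the per-sentence score lists and weights from the snippets above (the variable names are illustrative):

import pandas as pd

# Each *_scores list holds one normalized value per sentence
df = pd.DataFrame({
    'Sentence': raw_sentences,
    'TfIsf': tfisf_scores,
    'Length': length_scores,
    'Position': position_scores,
    'ProperNouns': proper_noun_scores,
    'TopicWords': topic_scores,
    'CueWords': cue_scores,
})
# Rank is the weighted sum shown above, computed once the frame exists
df['Rank'] = rank_scores
df = df.sort_values(by='Rank', ascending=False)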
While collecting the K top-ranked sentences, care is taken that sentences “similar” to already selected ones are not added to the set Y. This avoids near-duplicate sentences in the summary. The similarity measure used in the method is based on TF-ISF cosine similarity, as shown below.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def compute_similarity(sent1, sent2):
    # Build TF-IDF vectors for the two sentences
    sentences = [sent1, sent2]
    c = CountVectorizer()
    bow_matrix = c.fit_transform(sentences)
    normalized_matrix = TfidfTransformer().fit_transform(bow_matrix)
    # Dot product of L2-normalized vectors = cosine similarity
    similarity_graph = normalized_matrix * normalized_matrix.T
    # Returns 1 minus the cosine similarity, i.e., a distance:
    # values near 0 indicate near-duplicate sentences
    return 1 - (similarity_graph[0, 1])
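For completeness, a minimal selection loop could look as follows; the distance threshold is an assumed value, not from the article, and df is the data frame sorted by “Rank” above:

def select_summary(df, K=3, min_distance=0.05):
    # df is assumed sorted by 'Rank', descending, with a 'Sentence' column
    selected = []
    for _, row in df.iterrows():
        sent = row['Sentence']
        # compute_similarity returns a distance; skip sentences too close
        # to anything already selected
        if all(compute_similarity(sent, s) >= min_distance for s in selected):
            selected.append(sent)
        if len(selected) == K:
            break
    return selected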
And the resultant 3-line (K = 3) summary (Y) is:
PIN not received. This charged around Rs 240 (so unreasonable amount) for it on my account and Bank officer tells this is service charge. The card was delivered to my home however pin did not come to home.
Each complaint is now ready with features such as its summary and its classification category, which identifies the department to which it can be forwarded. The concerned department can sort complaints by the amount involved, the number of reviews and/or comments, and start addressing the issues by reading the summary. If more details are needed, the original complaint can be consulted.
The current article presents an automatic summarization method. It extracts features from sentences and picks the top-ranked sentences as the summary. Features such as “Cue Words” give flexibility for customization specific to the given domain. The topic words used are the centroids of the clusters of words; sentences having such central words form the gist of the original text. The overall rank of a sentence captures the effect of all these features in their relative importance. The proposed method can be further developed to incorporate additional features, and machine/deep learning algorithms could derive more accurate weights for ranking the sentences.
This article was contributed by Yogesh H. Kulkarni, who is the second rank holder of Blogathon 2. Stay tuned to read the rest of the articles.
May I ask, how did you get the Cue Words? I was using Penn Treebank P.O.S. tags, but they didn't show Cue Words.
Cue words are keywords important to the domain and are supplied to this process. For banks, as mentioned in the article, the cue words could be “Undelivered”, “Fraud”, etc. So, sentences having more of such cue words can be deemed important. This arrangement gives a facility to inject domain expertise into the summarization process.