In the past, if you were not particularly happy with a service or a product, you would go to the service provider or the shop and lodge a complaint. With businesses moving online and operating at enormous scale, lodging complaints in person may not always be possible. Electronic channels such as email, social media and, in particular, websites like www.consumercomplaints.in that focus on such issues are widely used platforms to vent anger as well as to publicize the issue in the expectation of quick action.
Keeping a close watch on complaints on such sites has become imperative for businesses such as banks. This article looks at ways to structure these unstructured complaints in an actionable form.
In a typical case, a bank may be interested in classifying the complaints into categories such as “Loans”, “Fixed Deposits”, “Credit Cards”, etc., so that they can be forwarded to the respective departments. Another important feature would be to summarize long complaints so that further actions can be formulated quickly. Sentiment analysis of such complaints is typically not very useful, as most of them are highly negative anyway. This article proposes a way to classify and summarize customer complaints seen on a consumer complaints website.
A Natural Language Processing (NLP) pipeline is utilized to structure the text, with stages such as scraping the complaints, extracting features (subject, counts, known categories) and summarization.
The following sections describe all these stages, with more elaboration on the core feature-extraction process: summarization.
Various tools and libraries can be used to scrape reviews from websites. Python has libraries such as “requests” and “BeautifulSoup” to take care of these tasks. Useful tutorials can be found at Tutorial1 and Tutorial2. The output of this stage is a set of text files with one complaint in each; a minimal scraping sketch follows the sample complaint below. A sample complaint looks as follows (some of the text has been masked, for the sake of confidentiality):
PIN not received.
17 Reviews
<CUSTOMER USER NAME>
Hello,
I have issued a new Debit/ATM card for my account no. 039309999999. This charged around Rs 240 (so unreasonable amount) for it on my account and
Bank officer tells this is service charge. The card was delivered to my home however pin did not come to home. : : : BANK charges so much for such small things with pathetic service. Even an account statement incurs around Rs. 150 or 200 (i don't remember exactly). Is there anybody from BANK who can take responsibility and look into this matter? Regards,
XXXX
16 Comments Updated: Mar 19, 2010
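As mentioned above, a minimal sketch of the scraping step could look as follows. The URL and the “complaint” CSS class are hypothetical; the actual selectors depend on the page structure at the time of scraping.

import requests
from bs4 import BeautifulSoup

# Hypothetical URL and CSS class; inspect the actual page to find the right ones
url = "https://www.consumercomplaints.in/bycompany/some-bank"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Each complaint is assumed to sit in its own container element
for i, block in enumerate(soup.find_all("div", class_="complaint")):
    text = block.get_text(separator="\n", strip=True)
    with open("complaint_{}.txt".format(i), "w") as f:
        f.write(text)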
Apart from the core text of the complaint, useful features are:
The subject, being the first line of the complaint, is easy to extract. It forms a one-line gist of the issue.
Other features, such as the number of reviews, comments, etc., typically appear in a fixed format and can be extracted with regular expressions. Tutorials like this can be used to learn how to extract them.
Extraction of known categories such as “Loans”, “Fixed Deposits” and “Credit Cards” can also be done by matching pre-defined keywords with regular expressions, as sketched below.
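As a minimal sketch (the patterns follow the fixed format visible in the sample complaint above; the keyword lists are illustrative, not the article's actual configuration):

import re

# Patterns based on the fixed format seen in the sample complaint
review_pat = re.compile(r'(\d+)\s+Reviews')
comment_pat = re.compile(r'(\d+)\s+Comments\s+Updated:\s*(.+)')

# Illustrative keyword lists for category matching
category_keywords = {
    'Credit Cards': ['credit card', 'debit card', 'atm card', 'pin'],
    'Loans': ['loan', 'emi'],
    'Fixed Deposits': ['fixed deposit', 'deposit'],
}

def extract_features(complaint_text):
    features = {}
    m = review_pat.search(complaint_text)
    if m:
        features['num_reviews'] = int(m.group(1))
    m = comment_pat.search(complaint_text)
    if m:
        features['num_comments'] = int(m.group(1))
        features['updated'] = m.group(2).strip()
    # Category by keyword matching
    lower = complaint_text.lower()
    features['categories'] = [cat for cat, words in category_keywords.items()
                              if any(w in lower for w in words)]
    return features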
Customer complaints can be very long, and at such volumes it is impossible to read through all of them manually. Effective summarization compresses the text into a few meaningful lines.
The following section elaborates one way of summarizing customer complaints.
Text summaries can be Abstractive or Extractive. In Abstractive summarization, the summary is constructed by employing words and phrases which are (typically) NOT in the original text, whereas in Extractive summarization, a few highly representative sentences are picked from the original text and ordered to form the summary. The proposed method is of the Extractive type.
Given a document D, having sentences (s₁, s₂, …, sₙ), return Y, a set of K important sentences from D. Extractive text summarization is thus a binary classification problem: out of n sentences, K are labelled True (meaning they are part of the summary) and the rest False. So, the problem boils down to determining whether a sentence sᵢ should be labelled True or False.
The labelling decision depends on various factors, called summary features. In the overall process, these features are computed for each sentence, their weighted sum gives the sentence a rank, and the top K ranked sentences are chosen as the set Y, representing the summary.
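In equation form, for a sentence sᵢ with normalized feature values f₁(sᵢ), …, f₆(sᵢ) (one per feature listed below) and weights w₁, …, w₆:

Rank(sᵢ) = w₁·f₁(sᵢ) + w₂·f₂(sᵢ) + … + w₆·f₆(sᵢ)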
In the current method, the following features are incorporated to arrive at the rank of a sentence:

1. TF-ISF: Term Frequency-Inverse Sentence Frequency, the sentence-level analogue of TF-IDF, in which each sentence is treated as a document. Sentences containing words that are frequent within the sentence but rare across sentences score higher.
import math

def tf(word, doc):
    # Term frequency: occurrences of the word relative to sentence length
    count = doc.count(word)
    total = len(doc)
    tf_score = count / float(total)
    return tf_score

def n_containing(word, docs):
    # Number of sentences containing the word
    count = 0
    for doc in docs:
        if doc.count(word) > 0:
            count += 1
    return count

def isf(word, docs):
    # Inverse sentence frequency, analogous to IDF with sentences as documents
    doc_count = n_containing(word, docs)
    ratio = len(docs) / float(1 + doc_count)
    return math.log(ratio)

def tfisf(word, doc, docs):
    tf_score = tf(word, doc)
    isf_score = isf(word, docs)
    return tf_score * isf_score

def normalize(scores):
    # Min-max normalization to the 0-1 range
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def compute_tfisf_scores(sentences):
    # Average TF-ISF of the words in each sentence, normalized to 0-1
    tfisf_scores = []
    for sent in sentences:
        sentence_score = 0
        for word in sent:
            sentence_score += tfisf(word, sent, sentences)
        sentence_score /= float(len(sent))
        tfisf_scores.append(sentence_score)
    return normalize(tfisf_scores)
2. Length: The number of words in a sentence can indicate its importance. Very short sentences are less important, as they may not represent the gist of the whole text.
3. Position: Sentences occurring at the beginning and towards the end carry more meaning than the middle ones. The first sentence is of utmost importance.
4. Proper Nouns: Sentences containing proper nouns (POS tag “NNP”) are important, as they contain names of places, persons, etc.
5. Cue Words: Domain-specific words such as “Undelivered”, “Fraud”, etc. suggest important sentences. So, sentences having more such words are given more weight.
6. Topic Words: Topic words are derived as the central words of the whole text; they could be words such as “Debit”, “Loan”, etc. Sentences aligned more with them are central to the text and thus more eligible to be part of the summary. They can be identified with LDA, as shown below (a combined sketch for scoring the remaining features follows the LDA snippet).
from gensim import corpora, models

def identify_lda_topics(cleaned_sentences):
    # Build a bag-of-words corpus from the tokenized, cleaned sentences
    dictionary = corpora.Dictionary(cleaned_sentences)
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in cleaned_sentences]
    # Fit an LDA model and collect the top words of each topic
    ldamodel = models.ldamodel.LdaModel(doc_term_matrix, num_topics=6,
                                        id2word=dictionary, passes=5)
    topic_names = []
    for topic in ldamodel.show_topics(num_topics=6, formatted=False, num_words=6):
        topicwords = [w for (w, val) in topic[1]]
        topic_names += topicwords
    return list(set(topic_names))
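For completeness, here is a minimal sketch for scoring the remaining features. The exact scoring functions are assumptions in the spirit of the descriptions above, not the article's own code; it reuses the normalize helper from the TF-ISF snippet, and NLTK's pos_tag is used for proper nouns (it requires the tagger data to be downloaded).

import nltk

def compute_feature_scores(sentences, cue_words, topic_words):
    # sentences: list of tokenized sentences; cue_words/topic_words: lowercase sets
    n = len(sentences)
    # Length: word count per sentence
    length_scores = [len(sent) for sent in sentences]
    # Position: ends score highest, middle lowest (one simple choice)
    position_scores = [abs(i - (n - 1) / 2.0) / ((n - 1) / 2.0) if n > 1 else 1.0
                       for i in range(n)]
    # Proper nouns: count of NNP-tagged tokens per sentence
    proper_noun_scores = [sum(1 for _, tag in nltk.pos_tag(sent) if tag == 'NNP')
                          for sent in sentences]
    # Cue and topic words: counts of matching keywords per sentence
    cue_scores = [sum(1 for w in sent if w.lower() in cue_words)
                  for sent in sentences]
    topic_scores = [sum(1 for w in sent if w.lower() in topic_words)
                    for sent in sentences]
    return (normalize(length_scores), normalize(position_scores),
            normalize(proper_noun_scores), normalize(cue_scores),
            normalize(topic_scores))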
Each feature's values are normalized to lie in the range 0 to 1.
Rank is computed for each sentence as the weighted sum of its features. Values of the weights can be derived either empirically or by employing machine/deep learning algorithms such as Naïve Bayes, Logistic Regression, Support Vector Machines, etc. The current method computes Rank as:
rank_scores = weight_TfIsf * df['TfIsf'] + \
              weight_Length * df['Length'] + \
              weight_Position * df['Position'] + \
              weight_ProperNouns * df['ProperNouns'] + \
              weight_TopicWords * df['TopicWords'] + \
              weight_CueWords * df['CueWords']
A data frame is populated with the summary features, one row per sentence, with columns such as TfIsf, Length, Position, ProperNouns, TopicWords, CueWords and the computed Rank.
The data frame is then sorted based on the “Rank” and fed for summary generation.
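A minimal sketch of these two steps, assuming the per-sentence score lists and weights from the snippets above (the variable names are illustrative):

import pandas as pd

# Each *_scores list holds one normalized value per sentence
df = pd.DataFrame({
    'Sentence': raw_sentences,
    'TfIsf': tfisf_scores,
    'Length': length_scores,
    'Position': position_scores,
    'ProperNouns': proper_noun_scores,
    'TopicWords': topic_scores,
    'CueWords': cue_scores,
})
# Rank is the weighted sum shown above, computed once the frame exists
df['Rank'] = rank_scores
df = df.sort_values(by='Rank', ascending=False)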
While collecting the K top-ranked sentences, care is taken that sentences “similar” to already selected ones are not added to the set Y. This avoids near-duplicate sentences in the summary. The similarity measure used in the method is based on TF-ISF cosine similarity, as shown below.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def compute_similarity(sent1, sent2):
    # Build TF-IDF vectors for the two sentences
    sentences = [sent1, sent2]
    c = CountVectorizer()
    bow_matrix = c.fit_transform(sentences)
    normalized_matrix = TfidfTransformer().fit_transform(bow_matrix)
    # Dot product of L2-normalized vectors = cosine similarity
    similarity_graph = normalized_matrix * normalized_matrix.T
    # Returns 1 minus the cosine similarity, i.e., a distance:
    # values near 0 indicate near-duplicate sentences
    return 1 - (similarity_graph[0, 1])
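For completeness, a minimal selection loop could look as follows; the distance threshold is an assumed value, not from the article, and df is the data frame sorted by “Rank” above:

def select_summary(df, K=3, min_distance=0.05):
    # df is assumed sorted by 'Rank', descending, with a 'Sentence' column
    selected = []
    for _, row in df.iterrows():
        sent = row['Sentence']
        # compute_similarity returns a distance; skip sentences too close
        # to anything already selected
        if all(compute_similarity(sent, s) >= min_distance for s in selected):
            selected.append(sent)
        if len(selected) == K:
            break
    return selected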
And the resultant 3-line (K = 3) summary (Y) is:
PIN not received. This charged around Rs 240 (so unreasonable amount) for it on my account and Bank officer tells this is service charge. The card was delivered to my home however pin did not come to home.
Each complaint is now ready with features such as its summary and its classification category, which identifies the department to which it can be forwarded. The concerned department can sort complaints by the amount involved, the number of reviews and/or comments, and start addressing the issues by reading the summary. If more details are needed, the original complaint can be consulted.
The current article presents an automatic summarization method. It extracts features from sentences and picks the top-ranked sentences as the summary. Features such as “Cue Words” give flexibility for customization specific to the given domain. The topic words used are the centroids of the clusters of words; sentences having such central words form the gist of the original text. The overall rank of a sentence captures the effect of all these features in their relative importance. The proposed method can be further developed to incorporate additional features, and machine/deep learning algorithms could derive more accurate weights for ranking the sentences.
This article was contributed by Yogesh H. Kulkarni, who is the second rank holder of Blogathon 2. Stay tuned to read the rest of the articles.
May I ask, how did you get the Cue Words? I was using Penn Treebank P.O.S. tags, but they didn't show Cue Words.
Cue words are keywords important to the domain and are supplied to this process. For banks, as mentioned in the article, the cue words could be “Undelivered”, “Fraud”, etc. So, sentences having more of such cue words can be deemed important. This arrangement gives a facility to inject domain expertise into the summarization process.