This article was published as a part of the Data Science Blogathon.
Hey Folks!
In this article, we are going to solve a real business problem, Consumer Complaint Segregation, using basic concepts of NLP in a detailed manner.
I assume you are already comfortable with the basics of natural language processing, such as feature extraction, working with raw text data, and training models on textual data.
I have already written a series of detailed articles on NLP starting from zero, so if you are not comfortable with the basics of NLP you can refer to my articles.
The Consumer Financial Protection Bureau is an organization that receives thousands of consumer complaints about financial services (mortgages, student loans, etc.) and products (e.g. credit cards, debit cards) and forwards them for a response.
Complaints need to be segregated and delivered to the concerned department. Automating this reduces the response time of complaints, since we are reducing the human intervention needed to classify each complaint.
So we need to build a model that reads a complaint and tells us the concerned department; for example, mortgage-related complaints must be forwarded to the mortgage department, and credit card complaints to the banking products department.
The goal of this project is to segregate complaints into their corresponding product or category department.
Since there are more than two categories of complaints, this is a multi-class classification problem, and it can be solved using NLP and machine learning algorithms.
By using a machine learning model we can classify complaints automatically, thereby reducing the human intervention needed and reducing the response time of complaints.
Note: Text classification is an example of supervised machine learning because we are working with labelled data for training and testing purposes.
Let us start working on this Project.
We are going to work on the Consumer Finance Complaints Dataset provided by the Bureau of Consumer Financial Protection.
You can download the dataset using this link or you can create a cloud Notebook and work instantly.
import pandas as pd
import numpy as np

df = pd.read_csv("../input/consumer-complaint-database/rows.csv", low_memory=False)
df.head()
This dataset contains many columns, but we need to focus on only two: Product and Consumer complaint narrative.
Product → Category of the complaint
Consumer complaint narrative → Consumer's complaint text

df1 = df[['Product', 'Consumer complaint narrative']]
df1.columns = ['Product', 'Consumer complaint']

We have renamed "Consumer complaint narrative" to "Consumer complaint" and kept the result in a data frame df1.
In the dataset there are many complaints with no text body; these records are of no use for training, so we need to filter them out.
df1 = df1[df1['Consumer complaint'].isna() != True]
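As a quick sanity check (a minimal sketch; the exact counts depend on the dataset version you download), you can compare the number of rows before and after the filtering:

# Count complaints before and after dropping rows without a text body
print("Total complaints:", len(df))
print("Complaints with text:", len(df1))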
Under Feature Engineering we do some data manipulation in order to train the model efficiently and get a better insight into the data.
pd.DataFrame(df1.Product.unique()).values
In our dataset, we have 18 different complaint categories, and some of them are interrelated; for instance, "Credit card or prepaid card", "Prepaid card", and "Credit card" are closely related.
Hence we need to rename the categories in order to merge related categories.
# Renaming categories to merge related ones
df1.replace({'Product':
             {'Credit reporting': 'Credit reporting, repair, or other',
              'Credit reporting, credit repair services, or other personal consumer reports': 'Credit reporting, repair, or other',
              'Credit card': 'Credit card or prepaid card',
              'Prepaid card': 'Credit card or prepaid card',
              'Payday loan': 'Payday loan, title loan, or personal loan',
              'Money transfer': 'Money transfer, virtual currency, or money service',
              'Virtual currency': 'Money transfer, virtual currency, or money service'}},
            inplace=True)
After renaming we have only 13 categories / Products to classify.
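You can verify this with a quick check (a small sketch, assuming the replace above ran successfully):

# Confirm the number of distinct categories after merging
print(df1['Product'].nunique())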
Note: The dataset we are working on is very large (it contains about 1.2 million rows), so training on all of it would be very time-consuming. We will therefore take a sample of only 10,000 rows for training in order to save time.
df2 = df1.sample(10000, random_state=1).copy()
Here df2 is the dataset we will work with from now on; it contains 10,000 rows and 2 columns.
Converting the categories (Product) into numbers

df2['category_id'] = df2['Product'].factorize()[0]

We have added a new column "category_id" that will contain the category number.

category_id = df2[['Product', 'category_id']].drop_duplicates()
id_2_category = dict(category_id[['category_id', 'Product']].values)

We will use the id_2_category dictionary for converting a class id back to its class label.
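For example (a minimal illustration; the exact mapping depends on the order in which categories appear in your sample), looking up a class id returns its product name:

# Hypothetical lookup: class id 0 maps to whichever product appeared first in the sample
print(id_2_category[0])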
Under EDA we explore our data and plot charts to understand the relationships and gain various insights into the data.
Plotting the Product/Categories vs Number of Complaints
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(8, 6))
colors = ['grey', 'grey', 'grey', 'grey', 'grey', 'grey', 'grey', 'grey', 'grey',
          'grey', 'darkblue', 'darkblue', 'darkblue']
df2.groupby('Product')['Consumer complaint'].count().sort_values().plot.barh(
    ylim=0, color=colors, title='NUMBER OF COMPLAINTS IN EACH PRODUCT CATEGORY\n')
plt.xlabel('Number of occurrences', fontsize=10)
You can see that "Credit reporting, repair, or other" has the maximum number of supporting records, while "Other financial services" has very few. This shows a clear data imbalance, which can be fixed by sampling an equal number of records from each category/product.
Now we need to convert the Complaint text into some vectors since the machine can’t understand the textual data. This process is called Feature Extraction.
We are going to use the TF-IDF (Term Frequency - Inverse Document Frequency) vectorizer for feature extraction. If you are not comfortable with feature extraction, you can refer to my earlier article.
TF-IDF evaluates how important a word is to a document within a collection or group of documents.
Note: After removing punctuation and lowercasing the words, we can proceed to the feature extraction step. The TF-IDF vectorizer can handle the stopwords on its own.
Term Frequency: This tells how often a given word occurs within a document.
Inverse Document Frequency: This works against the Term Frequency by down-weighting common words. If a given word appears in many of the documents, it will have a low IDF score.
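To make this concrete, here is a tiny self-contained sketch on a made-up three-document corpus (the documents and words are purely illustrative): a word like "loan" that appears in every document receives a lower weight than a word like "mortgage" that appears in only one.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy corpus: 'loan' appears in every document, 'mortgage' in only one
docs = ["loan payment late", "student loan harassment", "mortgage loan application"]
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(docs)
# Note: on older scikit-learn versions use get_feature_names() instead
print(pd.DataFrame(toy_matrix.toarray(), columns=toy_tfidf.get_feature_names_out()))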
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2), stop_words='english')

# Vectorization
features = tfidf.fit_transform(df2['Consumer complaint']).toarray()
labels = df2.category_id
min_df: removes words from the vocabulary that appear in fewer than 'min_df' documents.
sublinear_tf=True: scales the term frequency on a logarithmic scale.
stop_words: removes the stopwords of the mentioned language.
ngram_range=(1, 2): both unigrams and bigrams will be considered.
max_df: removes words from the vocabulary that appear in more than 'max_df' documents.

Splitting the Dataset
Splitting the dataset into training and testing partitions. 75% of the records will be used for training and the rest will be used for testing purposes.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                     test_size=0.25, random_state=0)
We are going to use LinearSVC since it performs well on this task; you can try other models as well and compare their performance.
from sklearn.svm import LinearSVC

model = LinearSVC()
model.fit(X_train, y_train)
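If you do want to compare LinearSVC against a few other common baselines, a quick cross-validation sketch could look like the following (the candidate models here are only suggestions, not part of the original pipeline):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Compare candidate classifiers with 5-fold cross-validated accuracy
candidates = [LinearSVC(), LogisticRegression(max_iter=1000), MultinomialNB()]
for clf in candidates:
    scores = cross_val_score(clf, features, labels, scoring='accuracy', cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))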
In order to see how our model performs, we will use sklearn's metrics module. We will use a classification report and a confusion matrix.
from sklearn import metrics
from sklearn.metrics import classification_report

# Classification report
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred,
                                    target_names=df2['Product'].unique()))
As you can observe, classes having more support (data rows) have a better f1-score. This happens because those classes are trained on more data. To fix this issue we should balance the data, as discussed earlier.
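One simple way to do that (a rough sketch, not part of the original pipeline; the cutoff of 500 rows per category is an arbitrary illustration) is to undersample every category to roughly the same number of rows before vectorizing:

# Hypothetical balancing step: keep at most N rows per category
N = 500
df_balanced = (df2.groupby('Product', group_keys=False)
                  .apply(lambda g: g.sample(min(len(g), N), random_state=1)))
print(df_balanced['Product'].value_counts())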
Classes like 'Mortgage', 'Student loan', and 'Credit reporting, repair, or other' can be classified with more precision.
Plotting Confusion Matrix
import seaborn as sns
sns.set()
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(conf_mat, annot=True, cmap="Blues", fmt='d',
            xticklabels=category_id.Product.values,
            yticklabels=category_id.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title("CONFUSION MATRIX - LinearSVC\n", size=16);
It is clearly observable from the confusion matrix that classes such as credit reporting and debt collection are predicted with higher precision than the others.
It's time to try our model on some predictions. We will pass a complaint text and our model will classify it into its complaint class.
complain = """I have been enrolled back in 2019 to Indian University. Few days ago , i have been harassed by Navient. I have already faxed the paperwork providing them with everything they wanted. And still getting phone calls for payments. Furthermore, Navient is now reporting to the credit bureaus that I am late for the payment. At this point, Navient needs take their act together to avoid me taking further steps"""
We can't pass raw text directly to the trained model for prediction; we first need to use our fitted vectorizer for feature extraction, and only then can we pass the resulting features for prediction.
complaint_id = model.predict(tfidf.transform([complain]))
print("complain", id_2_category[complaint_id[0]])
output:
complain student loan
It’s clearly visible that our model has predicted accurately.
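For convenience, you could wrap the vectorize-then-predict steps into a small helper function (a hypothetical utility, not part of the original code):

def predict_category(text):
    # Vectorize a raw complaint and return the predicted product label
    features_single = tfidf.transform([text])
    class_id = model.predict(features_single)[0]
    return id_2_category[class_id]

print(predict_category(complain))   # expected to print the student loan category, as shown above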
In this article we solved a business problem using NLP. We used various concepts like data cleaning, EDA, feature engineering, and feature extraction, and successfully built a model for segregating complaint types.
Further, you can try different models like BERT or an LSTM for classification. You can also try word embeddings, since they often hold an upper hand compared to the TF-IDF vectorizer.
In the next article, we will cover Text summarization for subject notes with simple implementation with Python.
Read more articles on NLP on our website
If you have any suggestions or questions for me feel free to hit me on my Linkedin.
Thanks for reading !!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.