Step-by-step Explanation of Text Classification

Adnan Last Updated : 12 Oct, 2024

This article was published as a part of the Data Science Blogathon.

Introduction

Suppose you work in an IT firm as a support desk specialist and receive hundreds of support tickets every day. The first thing you do with each ticket is classify it into one of the categories you have developed, such as “Credentials expired”, “Operating System Faulty”, “Hardware malfunctioning”, etc. Categorizing each support ticket manually would take a lot of time and effort. With machine learning and text classification algorithms, you can automate this task and save many hours of manual work.

What is Text Classification?

Text classification is a machine learning task in which a model assigns categories to input text. These categories are predefined and customizable; in the support desk example above, “Operating System Faulty”, “Hardware Malfunctioning”, and “Credentials expired” are all predefined categories into which you want your existing and new input data to be sorted.

Applications of Text Classification

There are various applications of text classification. A few of them include:

Support ticket classification used by IT companies
Movies or TV shows classification based on their genres
Journal papers classification based on their field of research

and so on…

Machine Learning Models for Text Classification

There are currently various Machine Learning models that are used for Text Classification Problems, such as:

Support Vector Machine
Naive Bayes Algorithm
Logistic Regression

You may have mostly seen these models applied to numeric data. For text classification, we first need to convert the text data into numerical data, which is where vectorization comes in. Before moving forward, let us briefly look at these models.

Support Vector Machine

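Support Vector Machine (SVM) is a supervised algorithm used for both classification and regression. It looks for the decision boundary that separates the classes with the widest possible margin, which helps it generalize and limits overfitting. SVMs tend to work well on small to medium-sized datasets with many features, which is typical of vectorized text, while training can become slow on very large datasets. Common SVM applications include image recognition, Customer Relationship Management (CRM) tools, text classification, and information extraction.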

Naive Bayes Algorithm

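The Naive Bayes (NB) algorithm is based on Bayes' theorem and works on the principle of conditional probability, which measures the probability of an event given that another event has occurred. It is called “naive” because it assumes that the features (here, the words) are independent of one another given the category.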

Source: https://www.kdnuggets.com/2020/06/naive-bayes-algorithm-everything.html
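For reference, Bayes' theorem can be written as below for two events A and B with P(B) > 0. The second line is an illustration of how the rule is typically applied to text under the naive independence assumption; the symbols c (a category) and w_1, ..., w_n (the words of a document) are notation introduced here, not taken from the original article.

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]
\[
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
\]
```

The classifier picks the category c with the largest value of the right-hand side.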

Logistic Regression

Logistic Regression is a supervised learning algorithm that predicts the probability of an event or an outcome. It is most commonly used for binary classification of the input data, such as deciding whether an email is spam or not, or whether a person likes hamburgers. It also extends to more than two classes, as in the movie genre example later in this article.

The logistic regression model is based on a Logistic function which is defined as:

Logistic function: f(x) = 1 / (1 + e^(-x))
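As a quick illustration, the logistic function can be sketched in a few lines of Python. The function name logistic is chosen here for the example; it is not part of any library.

```python
import math

def logistic(x: float) -> float:
    """Logistic (sigmoid) function: maps a real number into the range (0, 1)."""
    # Note: math.exp can overflow for very large negative x; fine for a sketch
    return 1.0 / (1.0 + math.exp(-x))

# Negative inputs give values below 0.5, positive inputs give values above 0.5
for x in (-4.0, 0.0, 4.0):
    print(x, round(logistic(x), 3))
```

The output can be read as a probability, which is why a threshold such as 0.5 is commonly used to turn it into a yes/no decision.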

Vectorization

Text vectorization is the process of converting text data into numerical data. Common vectorization techniques include:

Bag of Words (Term Frequency)
- It measures how often each word occurs in a document. BoW yields two things: a vocabulary of words and the count of each of those words in a document.
Binary Term Frequency
- It records only whether a vocabulary word is present: we get a 1 if the word occurs in the document and a 0 if it does not, hence the name Binary Term Frequency.
Term Frequency (L1 Normalized)
- It measures how frequently a word occurs in a document relative to the document's length: the count of the word divided by the total number of words in that document.
TF-IDF (L2 Normalized)
- (Count of a word in a document / total word count in that document) * log(total number of documents / number of documents containing the given word). Each document's vector is then scaled to unit (L2) length.
Word2Vec
- Word2Vec is an algorithm that learns word embeddings using a shallow neural network.

In this article, we will focus on Text Classification using a combination of TF-IDF Vectorization and Logistic Regression. Let us first have a brief introduction to TF-IDF Vectorizer and Logistic Regressor.

Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer

Using the TF-IDF model, we can score the significance of each word in a document based on how often it appears in that document and how rare it is across the whole collection of documents. This composite score is calculated by multiplying the Term Frequency (TF) factor by the Inverse Document Frequency (IDF) factor.

Term Frequency (TF): This factor is the share of a document's words accounted for by a given word, and is calculated as:

TF = count of the word in a document / total number of words in that document

Inverse Document Frequency (IDF): This factor is the log of the ratio of the total number of documents to the number of documents that contain that particular word. It is calculated as:

IDF = log(total number of documents / number of documents containing the given word)

A high TF-IDF value means the word appears often in that document but rarely in the rest of the corpus, so it is distinctive for that document. A low value means the word is common across documents. For example, commonly occurring words such as “and,” “the,” and “is” all have a very low TF-IDF value, close to zero.
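To make the arithmetic concrete, here is a minimal sketch that computes TF, IDF, and their product by hand, following the two formulas above. The three documents and the helper names tf, idf, and tf_idf are invented for this example. scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus, split into words on whitespace
docs = [
    "the hardware is faulty",
    "the password is expired",
    "the password reset failed",
]
tokenized = [d.split() for d in docs]

def tf(word, doc_tokens):
    # Count of the word in the document / total words in the document
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, all_docs):
    # log(total documents / documents containing the word);
    # assumes the word occurs in at least one document
    containing = sum(1 for tokens in all_docs if word in tokens)
    return math.log(len(all_docs) / containing)

def tf_idf(word, doc_tokens, all_docs):
    return tf(word, doc_tokens) * idf(word, all_docs)

# "the" is in every document, "password" in two, "hardware" in one
print("the:", tf_idf("the", tokenized[0], tokenized))
print("password:", round(tf_idf("password", tokenized[1], tokenized), 4))
print("hardware:", round(tf_idf("hardware", tokenized[0], tokenized), 4))
```

Because “the” appears in all three documents, its IDF is log(1) = 0 and its score vanishes, while “hardware”, which appears in only one document, gets the highest score of the three words.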

Step 1: Vectorization using TF-IDF Vectorizer

Let us take a real-life example of text data and vectorize it using a TF-IDF vectorizer. We will be using Jupyter Notebook and Python for this example. So let us first import the necessary libraries in Jupyter.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

TfidfVectorizer is the class we need, imported from sklearn.feature_extraction.text. ENGLISH_STOP_WORDS is scikit-learn's built-in list of common English words; it is the list the vectorizer uses when we pass stop_words='english'.

For this example, we will use the publicly available Internet Movie Database (IMDB) movie titles and genres dataset, which is available for download from IMDb. I downloaded the file named “title.basics.tsv.gz”. This is a huge file, around 150MB, with millions of rows. For simplicity, I took only the first 1000+ entries and split them into two files: the first 1028 rows form the training dataset, imdb_train.csv (1028 is an arbitrary number with no logic behind it), and the remaining 18 entries form the test dataset, imdb_test.csv. We will first train on the training dataset and then let the model classify the 18 unseen test movies into their genres. Finally, we will check how our model did by comparing a randomly chosen movie’s predicted genre with its actual genre.

Step 2: Loading and Visualizing the Dataset

Let us load and display the training dataset as follows:

Python Code:

import pandas as pd

train_data = pd.read_csv('imdb_train.csv')
print(train_data.shape)
print(train_data.head())
print(train_data['genres'].unique())

The output shows the number of movie titles in the training set along with their genres. These titles are classified into 17 different genres.

Step 3: Vectorization

We will first collect the movie titles into an array, which will serve as our corpus.

corpus = train_data['primaryTitle'].values
corpus

Then, we will vectorize our corpus

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())

First, we create the vectorizer object with vectorizer = TfidfVectorizer(stop_words='english'). Next, we convert the input text into a TF-IDF matrix with X = vectorizer.fit_transform(corpus). Finally, we print the words that make up the columns of the TF-IDF matrix.

# Show the TF-IDF matrix as a table with one column per vocabulary word
Vector_Text=pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
Vector_Text['originalText']=pd.Series(corpus)
Vector_Text

In the previous step, we visualized the document-term matrix produced by TF-IDF. Now let us add the genres column back to the vectorized table.

# Rebuild the feature table and append the target column as the last column
ML_Data=pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
ML_Data['genres']=train_data['genres']
ML_Data.head()

Step 4: Data Formatting

Now let us separate the predictor columns (the TF-IDF features) from the target column (the genres):

Target=ML_Data.columns[-1]
Predictors=ML_Data.columns[:-1]
X=ML_Data[Predictors].values
y=ML_Data[Target].values

Step 5: Logistic Regression for Classification

The logistic regression model estimates the probability of an outcome from a set of independent variables, which here are the TF-IDF features. We could try other classification models, such as Naive Bayes or Decision Trees, but for simplicity we use Logistic Regression here. Readers are encouraged to try the other models and comment if they produce a better result.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit a logistic regression classifier on the TF-IDF features
clf = LogisticRegression(C=5, solver='newton-cg', penalty='l2')
LOG = clf.fit(X, y)

# Predict on the same rows the model was trained on
pred = LOG.predict(X)
Test_Data = pd.DataFrame(data=X, columns=Predictors)
Test_Data['TargetVariable'] = y
Test_Data['Prediction'] = pred
print(Test_Data.head())

print(metrics.classification_report(y, pred))
print(metrics.confusion_matrix(y, pred))  # rows: actual, columns: predicted

# Weighted F1 on the training data; this overstates performance on unseen titles
F1_Score = metrics.f1_score(y, pred, average='weighted')
print('Weighted F1 score of the model on the training data:', round(F1_Score, 2))
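The scores above are computed on the same rows the model was trained on. One common way to get a more honest estimate is to hold out part of the data before training. The sketch below shows that idea on a small invented dataset; the titles, genres, and variable names are hypothetical, and the Pipeline is a convenience swapped in for the article's separate vectorize-then-fit steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data (not taken from the IMDB files)
titles = [
    "A Day at the Circus", "The Clown and the Circus",
    "Circus Parade Today", "Night at the Circus",
    "Storm Over the Harbor", "Harbor Storm Rising",
    "The Harbor at Night", "Storm on the Coast",
]
genres = ["Comedy"] * 4 + ["Drama"] * 4

# Keep 25% of the rows aside; stratify so both genres appear in each split
X_train, X_val, y_train, y_val = train_test_split(
    titles, genres, test_size=0.25, random_state=0, stratify=genres
)

# The pipeline fits the vectorizer on the training titles only
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(C=5, solver="newton-cg"),
)
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
val_f1 = f1_score(y_val, val_pred, average="weighted")
print("Weighted F1 on held-out titles:", round(val_f1, 2))
```

With a dataset this small the score itself means little; the point is only that the vectorizer and the model never see the held-out titles during fitting.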

Step 6: Predictions for New Movie Labels

In this step, we will load the test dataset and see how well our model predicts the movies’ genres. We will define a function that converts the input titles into numeric vectors with the fitted vectorizer and returns the predicted genre for each title.

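def genres_test(input_text):
    # Reuse the fitted vectorizer so new titles map onto the training vocabulary
    X=vectorizer.transform(input_text)
    # LOG is the logistic regression model fitted in the previous step
    Prediction=LOG.predict(X)
    Result=pd.DataFrame(data=input_text, columns=['title'])
    Result['Prediction']=Prediction
    return Result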

Now, let’s call the function

movie_name=["Flores y perlas"]
predicted_genre=genres_test(input_text=movie_name)
predicted_genre

Results of Text Classification

Now let us compare the predicted genre with the original genre of the same title in our dataset.

test_data=pd.read_csv('imdb_test.csv')
test_data

We can see in row number 9 that the actual genre of the movie “Flores y perlas” is also “Drama”.
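Checking a single title is a spot check. A natural extension is to line up the predictions for every test row against the actual genres. The sketch below shows the idea with invented stand-in data: only the “Flores y perlas” row comes from the article, while the other titles, their genres, and the prediction list are hypothetical. In the article's setup, the predictions would instead come from calling genres_test on the test titles.

```python
import pandas as pd

# Stand-in for test_data; only the first row is from the article
toy_test = pd.DataFrame({
    "primaryTitle": ["Flores y perlas", "Title B", "Title C"],
    "genres": ["Drama", "Short", "Comedy"],
})

# Stand-in for the model's output on those titles (hypothetical values)
toy_predictions = ["Drama", "Short", "Drama"]

# Put actual and predicted genres side by side and flag the matches
comparison = toy_test.assign(Prediction=toy_predictions)
comparison["Correct"] = comparison["genres"] == comparison["Prediction"]

# Share of test titles whose predicted genre matches the actual genre
accuracy = comparison["Correct"].mean()
print(comparison)
print("Accuracy:", round(accuracy, 2))
```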

Conclusion

In this article, we started by defining what text classification is in the field of machine learning and what its applications are. Then, we saw how text classification is carried out by first vectorizing the text data using a vectorizer such as Word2Vec, Bag of Words, or TF-IDF, and then applying a classical classification method, such as Naive Bayes, Decision Trees, or Logistic Regression.

We used a reduced IMDB movies dataset with just the movie titles and their genres. We trained the model on a portion of the dataset and then gave it new, unseen titles to predict their genres; for the title we checked, the predicted genre matched the actual one.

Key takeaways from this article are:

Text classification is an important machine learning task. It has multiple applications, such as support ticket classification used by IT companies, movie or TV show classification based on genre, and journal paper classification based on field of research.
Text classification is a two-step process. First, we need to convert the input text into vectors and then classify those vectors using a classification algorithm.
Various vectorization algorithms are available such as TF-IDF, Word2Vec, Bag of Words, etc. Similarly, various classification algorithms include Logistic Regression, Naive Bayes, and Decision Trees/Random Forest.

