Have you ever solved a Machine Learning problem in just one go?
Solving a problem using machine learning isn’t straightforward. It involves a series of steps to arrive at an accurate solution. This sequence of steps for solving an ML problem is known as the ML Pipeline (or ML Cycle).
As shown in the figure, the Machine Learning pipeline consists of different steps like:
Understand Problem Statement, Hypothesis Generation, Exploratory Data Analysis, Data Preprocessing, Feature Engineering, Feature Selection, Model Building, Model Tuning, and Model Deployment.
I would recommend going through the articles below for an in-depth understanding of the Machine Learning pipeline:
The process of solving a machine learning problem demands a lot of time and human effort. Hip Hip Hooray! It’s no longer a tedious and time-consuming process, thanks to AutoML, which provides near-instant solutions to ML problems.
AutoML is all about automatically building a high-performance model with minimal human intervention.
AutoML libraries offer low-code and no-code programming.
You’ve probably heard of the terms “low-code” and “no-code.”
Though no-code platforms make it simple to train a machine learning model using a drag-and-drop interface, they are limited in terms of flexibility. Low-code ML, on the other hand, is the sweet spot: it offers both flexibility and easy-to-use code.
In this article, let us understand how to build a text classification model within a few lines of code using PyCaret, a low-code AutoML library.
PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within a few minutes.
PyCaret is essentially a low-code library that replaces hundreds of lines of scikit-learn code with 5-6 lines. It increases a team’s productivity and helps the team focus on understanding the problem and on feature engineering rather than model optimization.
PyCaret is built on top of the scikit-learn library. As a result, all the machine learning algorithms available in scikit-learn are available in PyCaret. As of now, PyCaret can solve problems related to Classification, Regression, Clustering, Anomaly Detection, Text Classification, Association Rule Mining, and Time Series.
Now, let us discuss the reasons behind using PyCaret.
PyCaret automatically builds a benchmark model for a given dataset within 5-6 lines of code. Let’s see how PyCaret simplifies each step in the machine learning pipeline.
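To give you a flavor, here’s a minimal sketch of PyCaret’s classification workflow; the file path and the target column name 'target' are placeholders, not from this article’s dataset:

```python
# A minimal sketch of PyCaret's classification workflow; the file path
# and the target column name 'target' are placeholders.
import pandas as pd
from pycaret.classification import setup, compare_models

df = pd.read_csv("your_dataset.csv")  # placeholder path

# setup() infers column types and builds the preprocessing pipeline
clf = setup(data=df, target="target", session_id=42)

# compare_models() trains the available algorithms and ranks them by
# cross-validated performance, returning the best-performing model
best_model = compare_models()
```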
Next, we will focus on solving a text classification problem in PyCaret.
Let’s solve a text classification problem in PyCaret using two different techniques: topic modeling and bag of words. I will touch upon each approach in detail.
Topic Modeling, as the name suggests, is a technique to identify the different topics present in text data.
Topics are defined as a repeating group of statistically significant tokens (or words) in a corpus. Here, statistically significant means words that are important within a document; generally, frequently occurring words with higher TF-IDF scores are considered statistically significant.
Topic modeling is an unsupervised technique to automatically find the hidden topics in text data. It can also be thought of as a text mining approach for finding recurring patterns in text documents.
A common use case of topic modeling is as follows:
Let’s say you work for a legal firm engaged by a company where some money has been embezzled, and you know that key information is lying in the emails sent around the company.
As explained earlier, the objective of topic modeling is to extract different topics from the raw text. But, what’s the underlying algorithm to achieve it?
This brings us to the different algorithms/techniques for topic modeling: Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NNMF), and Latent Semantic Analysis (LSA).
I would recommend going through the following resources to read about these algorithms in detail:
Coming to topic modeling itself, it’s a 2-step process:
1. Identify the important topics present in the corpus.
2. Assign each document a score for every topic.
Having understood topic modeling, let’s see how to solve text classification with it, using an example.
Consider a corpus of documents. A topic modeling algorithm such as LDA first identifies the most important topics in the documents, and then assigns each document a score for every topic.
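To make this concrete, here’s a minimal sketch using scikit-learn’s LDA implementation on a toy corpus of my own (not the article’s data):

```python
# A minimal sketch of topic modeling with LDA in scikit-learn;
# the toy corpus below is illustrative, not the article's dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the team won the cricket match",
    "the election results were announced today",
    "the batsman scored a century in the match",
    "the minister addressed the press after the election",
]

# Convert the documents into a document-term frequency matrix
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)

# Fit LDA with 2 topics and score each document against every topic
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(dtm)

# Each row is a document, each column a topic score
print(pd.DataFrame(doc_topic, columns=["topic_1", "topic_2"]))
```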
This document-topic score matrix acts as the feature matrix for a machine learning algorithm. Next, we’ll look at the bag of words approach.
Bag Of Words (BOW) is another popular algorithm for representing text as numbers. It relies on the frequency of words in a document. BOW has numerous applications, such as document classification, topic modeling, and text similarity. In BOW, every document is represented by the frequencies of the words it contains, so a word’s frequency reflects its importance in the document.
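As a quick illustration, here’s a minimal sketch of the BOW representation, built with scikit-learn’s CountVectorizer on two toy sentences of my own:

```python
# A minimal sketch of the Bag Of Words representation;
# the two sentences are toy examples.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the game is fun", "the game is boring and the story is weak"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# Each row is a document, each column a word, each cell a word count
print(pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out()))
```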
Go through the article below for a detailed understanding of Bag Of Words:
In the next section, we will solve the text classification problem in PyCaret.
Let us understand the problem statement prior to solving it.
Understanding Problem Statement
Steam is a video game digital distribution service with a vast global community of gamers. Many gamers write reviews on a game’s page and can choose whether they would recommend the game to others. Determining this sentiment automatically from the review text would help Steam tag reviews extracted from other forums across the internet and better judge the popularity of games.
Given the review text and the user’s recommendation, the task is to predict whether the reviewer recommended the game titles in the test set, on the basis of the review text and other information.
In simpler terms, the task at hand is to identify whether a given user review is good or bad. You can download the dataset from here.
For classifying the Steam game reviews using PyCaret, I’ve discussed 2 different approaches in the article.
We will implement the BOW approach now.
Note: This tutorial is implemented on Google Colab; I would recommend running the code there.
You can install PyCaret just like any other Python library.
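For example, in a Colab notebook cell:

```python
# Install PyCaret from PyPI; the '!' runs a shell command in Colab/Jupyter
!pip install pycaret
```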
As PyCaret doesn’t support a count vectorizer, import CountVectorizer from sklearn.feature_extraction.text.
Then, I initialize a CountVectorizer object named ‘tf_vectorizer’.
What exactly does the fit_transform function do to your data? It learns the vocabulary of the corpus and returns the document-term matrix in a single step. Let’s convert the output of fit_transform to a data frame.
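Putting those steps together, a sketch of the feature extraction might look like this; the file name 'train.csv' and the text column 'user_review' are my assumptions about the dataset’s schema:

```python
# A sketch of the BOW feature extraction; 'train.csv' and the text
# column 'user_review' are assumptions about the dataset's schema.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

train = pd.read_csv("train.csv")

# Initialize the count vectorizer; max_features caps the vocabulary size
tf_vectorizer = CountVectorizer(max_features=5000, stop_words="english")

# fit_transform learns the vocabulary and returns the sparse
# document-term matrix in a single step
tf_features = tf_vectorizer.fit_transform(train["user_review"])

# Convert the sparse matrix into a data frame of word-count features
features_df = pd.DataFrame(
    tf_features.toarray(),
    columns=tf_vectorizer.get_feature_names_out(),
)
```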
Now, concatenate the features and the target along the columns.
Next, we will split the dataset into train and test data.
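Continuing the sketch, with the target column name 'user_suggestion' again being my assumption about the dataset’s schema:

```python
# Continuing the sketch; 'user_suggestion' is my assumption for the
# name of the target column.
from sklearn.model_selection import train_test_split

# Concatenate the BOW features and the target along the columns
data = pd.concat(
    [features_df, train["user_suggestion"].reset_index(drop=True)], axis=1
)

# Split into training and hold-out sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
```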
Now that feature extraction is done, let’s use these features to build different models. The next step is to set up the environment in PyCaret.
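Here’s a minimal sketch of those PyCaret steps, using the same column-name assumptions as above:

```python
# A sketch of the PyCaret steps, with the same column-name assumptions.
from pycaret.classification import setup, create_model, tune_model

# setup() initializes the PyCaret environment and preprocessing pipeline
clf = setup(data=train_data, target="user_suggestion", session_id=42)

# Train a baseline LightGBM model with cross-validation
lightgbm = create_model("lightgbm")

# Tune its hyperparameters over PyCaret's predefined search grid
tuned_lightgbm = tune_model(lightgbm)
```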
From the above output, we can observe that the tuned model’s metrics are better than the base model’s.
Here, I’ve used the tuned model, ‘tuned_lightgbm’, to predict the flag values for our processed dataset.
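A sketch of that prediction step; predict_model returns the data with the prediction columns appended:

```python
# Generate predictions on the hold-out data with the tuned model;
# predict_model returns the data with prediction columns appended
from pycaret.classification import predict_model

predictions = predict_model(tuned_lightgbm, data=test_data)
print(predictions.head())
```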
PyCaret, which trains machine learning models in a low-code environment, piqued my interest. From your preferred notebook environment, PyCaret helps you go from preparing data to deploying models in minutes. Before using PyCaret, I tried other traditional methods to solve the JanataHack NLP hackathon problem, but the results weren’t very satisfactory!
PyCaret has proved to be remarkably fast and efficient in comparison with other open-source machine learning libraries, and it has the added advantage of replacing several lines of code with just a few.
Here’s one more thing to notice: if you skip the first part of my approach, where I applied the count vectorizer embedding to the dataset, and move straight to setting up and creating models with PyCaret, then all the transformations, such as one-hot encoding and imputing missing values, happen behind the scenes automatically, and you still end up with a data frame of predictions, just like the one we got!
I hope I’ve made clear my overall approach for the hackathon.