Learn how to perform text classification using PyTorch
Grasp the importance of the packed padding feature
Understand the key points involved in solving a text classification problem
Introduction
I always turn to state-of-the-art architectures to make my first submission in data science hackathons. Implementing state-of-the-art architectures has become quite easy thanks to deep learning frameworks such as PyTorch, Keras, and TensorFlow. These frameworks provide an easy way to implement complex model architectures and algorithms with minimal knowledge of the underlying concepts and coding skills. In short, they are a goldmine for the data science community!
In this article, we will use PyTorch, which is well known for its fast computation. We will first walk through the key points involved in solving a text classification problem, and then implement our first text classifier in PyTorch!
Note: I highly recommend going through the article below before moving forward with this one.
Table of Contents
1. Why PyTorch for Text Classification?
2. Understanding the Problem Statement
3. Implementation – Text Classification in PyTorch
Why PyTorch for Text Classification?
Before we dive deeper into the technical concepts, let us quickly familiarize ourselves with the framework that we are going to use – PyTorch. The basic unit of PyTorch is the tensor, similar to the NumPy array in Python. There are a number of benefits to using PyTorch, but the two most important are:
Dynamic networks – the architecture can change during runtime
Distributed training across GPUs
I am sure you are wondering – why should we use PyTorch for working with text data? Let us discuss some incredible features of PyTorch that make it different from other frameworks, especially while working with text data.
1. Dealing with Out of Vocabulary words
A text classification model is trained on a fixed vocabulary. But during inference, we might come across words that are not present in the vocabulary. These are known as Out of Vocabulary words, and skipping them can be a critical issue as it results in a loss of information.
In order to handle Out of Vocabulary words, PyTorch supports a handy feature that replaces the rare words in our training data with an unknown token. This, in turn, helps us tackle the problem of Out of Vocabulary words.
Apart from handling Out Of Vocabulary words, PyTorch also has a feature that can handle sequences of variable length!
2. Handling Variable Length sequences
Have you heard how a Recurrent Neural Network is capable of handling variable-length sequences? Ever wondered how to implement it? PyTorch comes with a useful feature, the 'packed padding sequence', that implements a dynamic Recurrent Neural Network.
Padding is the process of adding an extra token, called the padding token, at the beginning or end of a sentence. Since the number of words in each sentence varies, we convert the variable-length input sentences into sentences of the same length by adding padding tokens.
Padding is required since most frameworks support static networks, i.e. the architecture remains the same throughout model training. Although padding solves the issue of variable-length sequences, there is another problem with this idea – the architecture processes these padding tokens like any other information/data. When a padded sequence is fed to an RNN, the last element, which is a padding token, is also used while generating the output. This is taken care of by the packed padding sequence in PyTorch.
Packed padding ignores the input timesteps corresponding to padding tokens. These values are never shown to the Recurrent Neural Network, which helps us build a dynamic Recurrent Neural Network, as the short sketch below illustrates.
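This toy example (the shapes, vocabulary size, and values are made up purely for demonstration) shows how pack_padded_sequence drops the padded timesteps before the LSTM ever sees them:
# deal with tensors and neural network layers
import torch
import torch.nn as nn

# a batch of 2 sequences padded to length 4 (index 0 is the padding token)
padded = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 0, 0]])
lengths = torch.tensor([4, 2])  # true lengths before padding, sorted descending

embedding = nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

embedded = embedding(padded)  # shape: [2, 4, 8]
# the padded timesteps are removed here, so the LSTM never processes them
packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths, batch_first=True)
packed_output, (hidden, cell) = lstm(packed)
# unpack to recover a padded tensor of outputs if needed
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)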
3. Wrappers and Pretrained models
State-of-the-art architectures are regularly released for the PyTorch framework. Hugging Face released Transformers, which provides more than 32 state-of-the-art architectures for Natural Language Understanding and Generation!
Not only this, PyTorch also provides pretrained models for several tasks like Text to Speech, Object Detection and so on, which can be executed within a few lines of code.
Incredible, isn’t it? These are some really useful features of PyTorch among many others. Let us now use PyTorch for a text classification problem.
Understanding the problem statement
As a part of this article, we are going to work on a really interesting problem.
Quora wants to keep track of insincere questions on their platform so as to make users feel safe while sharing their knowledge. An insincere question in this context is defined as a question intended to make a statement rather than looking for helpful answers. To break this down further, here are some characteristics that can signify that a particular question is insincere:
Has a non-neutral tone
Is disparaging or inflammatory
Isn’t grounded in reality
Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers
The training data includes the question that was asked and a flag denoting whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise, i.e. they are not guaranteed to be perfect. Our task will be to identify whether a given question is 'insincere'. You can download the dataset for this from here.
It is time to code our own text classification model using PyTorch.
Implementation – Text Classification in PyTorch
Let us first import all the necessary libraries required to build a model. Here is a brief overview of the packages/libraries we are going to use-
The torch package is used to define tensors and perform mathematical operations on them
TorchText is a Natural Language Processing (NLP) library in PyTorch. It contains scripts for preprocessing text and loaders for a few popular NLP datasets.
Python Code:
#deal with tensors
import torch
#handling text data
from torchtext import data
In order to make the results reproducible, I have specified a seed value. Since a deep learning model might produce different results on each execution due to the randomness involved, it is important to specify the seed value.
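A minimal sketch of the seeding step (the seed value 2019 is an arbitrary choice):
# set the seed so that runs are reproducible
SEED = 2019

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True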
Now, let us see how to preprocess the text using field objects. There are 2 different types of field objects – Field and LabelField. Let us quickly understand the difference between the two-
Field: The Field object from the data module is used to specify the preprocessing steps for each column in the dataset.
LabelField: The LabelField object is a special case of the Field object, used only for classification tasks. It sets unk_token to None and sequential to False by default.
Before we use Field, let us look at the different parameters of Field and what are they used for.
Parameters of Field:
Tokenize: specifies the way of tokenizing the sentence, i.e. converting the sentence into words. I am using the spaCy tokenizer since it uses a robust tokenization algorithm
Lower: converts text to lowercase
batch_first: the first dimension of the input and output tensors is always the batch size
include_lengths: returns the lengths of the sentences along with the padded batch; we will need these lengths later for pack padding
# define the preprocessing for the question text and the label
TEXT = data.Field(tokenize='spacy', batch_first=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.float, batch_first=True)
Next, we are going to create a list of tuples, where the first value in each tuple is a column name and the second value is one of the field objects defined above. We will arrange the tuples in the order of the columns of the CSV file, using (None, None) to ignore a column.
Let us read only the required columns – question and label.
In the following code block, I have loaded the custom dataset using the field objects defined above.
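A sketch of this step, assuming the training file is named quora.csv (adjust the path and column order to match your download):
fields = [(None, None), ('text', TEXT), ('label', LABEL)]

# load the custom dataset from the csv file
training_data = data.TabularDataset(
    path='quora.csv',
    format='csv',
    fields=fields,
    skip_header=True
)

# peek at one preprocessed example
print(vars(training_data.examples[0]))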
Let us now split the dataset into training and validation data
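A sketch of the split; the 70/30 ratio is a common default, not a prescribed value:
import random

# split with a fixed random state so the split is reproducible
train_data, valid_data = training_data.split(
    split_ratio=0.7, random_state=random.seed(SEED))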
The next step is to build the vocabulary for the text and convert the text into integer sequences. The vocabulary contains all the unique words in the text, and each unique word is assigned an index. The relevant parameters are listed below.
Parameters:
1. min_freq: Ignores words whose frequency in the training data is less than the specified value and maps them to the unknown token.
2. Two special tokens, known as the unknown token and the padding token, will be added to the vocabulary:
The unknown token is used to handle Out of Vocabulary words
The padding token is used to make input sequences the same length
Let us build the vocabulary and initialize the words with the pretrained embeddings. Ignore the vectors parameter if you wish to randomly initialize the embeddings.
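A sketch using GloVe vectors; min_freq=3 and glove.6B.100d are illustrative choices:
# build the vocabulary from the training data and attach pretrained vectors
TEXT.build_vocab(train_data, min_freq=3, vectors='glove.6B.100d')
LABEL.build_vocab(train_data)

print("Size of TEXT vocabulary:", len(TEXT.vocab))
print("Size of LABEL vocabulary:", len(LABEL.vocab))
# ten most frequent words in the vocabulary
print(TEXT.vocab.freqs.most_common(10))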
Now we will prepare batches for training the model. BucketIterator forms the batches in such a way that a minimum amount of padding is required.
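A sketch of the iterators; the batch size of 64 is an assumption, and sort_within_batch=True keeps each batch sorted by length, which pack padding requires:
# use the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 64

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),  # group questions of similar length
    sort_within_batch=True,          # required for pack_padded_sequence
    device=device)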
It is now time to define the architecture to solve the binary classification problem. The nn module from torch provides the base class, nn.Module, for all models. This means that every model we build must be a subclass of nn.Module.
I have defined 2 functions here: init as well as forward. Let me explain the use case of both of these functions-
1. Init: Whenever an instance of a class is created, the init function is automatically invoked; hence, it is called the constructor. The arguments passed to the class are initialized by the constructor. We will define all the layers that we will be using in the model here.
2. Forward: The forward function defines the forward pass of the inputs.
Finally, let's understand in detail the different layers used for building the architecture, along with their parameters-
Embedding layer: Embeddings are extremely important for any NLP-related task since they represent a word in a numerical format. The embedding layer creates a lookup table where each row holds the embedding of one word, and it converts the integer sequence into a dense vector representation. Here are the two most important parameters of the embedding layer –
num_embeddings: No. of unique words in dictionary
embedding_dim: No. of dimensions for representing a word
LSTM: LSTM is a variant of RNN that is capable of capturing long-term dependencies. Given below are some important parameters of this layer that you should be familiar with:
input_size : Dimension of input
hidden_size : Number of hidden nodes
num_layers : Number of layers to be stacked
batch_first : If True, then the input and output tensors are provided as (batch, seq, feature)
dropout: If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
bidirectional: If True, introduces a bidirectional LSTM
Linear Layer: Linear layer refers to dense layer. The two important parameters here are described below:
in_features : No. of input features
out_features: No. of output features
Pack Padding: As already discussed, pack padding is used to define a dynamic recurrent neural network. Without pack padding, the padding inputs are also processed by the RNN, which then returns the hidden state of the padded element. Pack padding is an awesome wrapper that never shows the padded inputs to the RNN. It simply ignores these values and returns the hidden state of the last non-padded element.
Now that we have a good understanding of all the blocks of the architecture, let us go to the code!
I will start with defining all the layers of the architecture:
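A sketch of the architecture consistent with the layers described above (the class and argument names are my own choices):
import torch.nn as nn

class classifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout):
        super().__init__()
        # embedding layer: lookup table of word vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # lstm layer
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            bidirectional=bidirectional,
                            dropout=dropout,
                            batch_first=True)
        # dense layer: *2 because forward and backward states are concatenated
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        # sigmoid activation for the binary output
        self.act = nn.Sigmoid()

    def forward(self, text, text_lengths):
        embedded = self.embedding(text)  # [batch, seq_len, emb_dim]
        # pack the padded batch; lengths must live on the CPU in recent PyTorch
        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded, text_lengths.cpu(), batch_first=True)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        # concatenate the final forward and backward hidden states
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        return self.act(self.fc(hidden))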
The next step would be to define the hyperparameters and instantiate the model. Here is the code block for the same:
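The hyperparameter values below are illustrative choices, not prescribed values:
# define hyperparameters
size_of_vocab = len(TEXT.vocab)
embedding_dim = 100   # must match the pretrained vectors (glove.6B.100d)
num_hidden_nodes = 32
num_output_nodes = 1
num_layers = 2
dropout = 0.2

# instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes,
                   num_output_nodes, num_layers,
                   bidirectional=True, dropout=dropout)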
Let us look at the model summary and initialize the embedding layer with the pretrained embeddings
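A sketch of this step:
# architecture summary
print(model)

# count the trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

# copy the pretrained GloVe vectors into the embedding layer
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)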
Here I have defined the optimizer, loss and metric for the model:
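A sketch assuming Adam as the optimizer and binary cross-entropy as the loss, with a simple accuracy metric:
import torch.optim as optim

# optimizer and loss
optimizer = optim.Adam(model.parameters())
criterion = nn.BCELoss()

# metric: fraction of predictions that match the labels
def binary_accuracy(preds, y):
    rounded_preds = torch.round(preds)  # round sigmoid outputs to 0 or 1
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

# push the model and loss to the GPU if available
model = model.to(device)
criterion = criterion.to(device)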
Training phase: model.train() sets the model on the training phase and activates the dropout layers.
Inference phase: model.eval() sets the model on the evaluation phase and deactivates the dropout layers.
Here is the code block to define a function for training the model
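A sketch of the training function; it relies on the iterators, model, optimizer, and metric defined above:
def train(model, iterator, optimizer, criterion):
    epoch_loss, epoch_acc = 0, 0
    model.train()  # activate the dropout layers

    for batch in iterator:
        optimizer.zero_grad()  # reset gradients after every batch
        # batch.text is a (token ids, lengths) pair because include_lengths=True
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze()
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()   # backpropagate the loss
        optimizer.step()  # update the weights
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)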
So we have a function to train the model, but we will also need a function to evaluate it. Let's do that:
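The evaluation function mirrors the training function, minus the gradient updates:
def evaluate(model, iterator, criterion):
    epoch_loss, epoch_acc = 0, 0
    model.eval()  # deactivate the dropout layers

    with torch.no_grad():  # no gradients needed during evaluation
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths).squeeze()
            epoch_loss += criterion(predictions, batch.label).item()
            epoch_acc += binary_accuracy(predictions, batch.label).item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)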
Finally, we will train the model for a certain number of epochs and save the best model after every epoch.
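A sketch of the loop; the epoch count and checkpoint file name are assumptions:
N_EPOCHS = 5
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    # save the weights whenever the validation loss improves
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    print(f'Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f' Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')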
Let us load the best model and define an inference function that accepts a user-defined input and makes predictions.
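A sketch of inference; the spaCy model name 'en' matches older spaCy releases (newer ones use 'en_core_web_sm'):
import spacy
nlp = spacy.load('en')

# load the best saved weights
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))
model.eval()

def predict(model, sentence):
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  # tokenize
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]          # token -> id
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1).T          # reshape to [1, seq_len]
    length_tensor = torch.LongTensor(length)
    prediction = model(tensor, length_tensor)
    return prediction.item()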
Amazing! Let us use this model to make predictions for a few questions:
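The example questions below are made up for illustration; a score close to 1 suggests an insincere question, while a score close to 0 suggests a sincere one:
# non-neutral, statement-like question (hypothetical example)
print(predict(model, "Why are all politicians corrupt liars?"))

# genuine information-seeking question (hypothetical example)
print(predict(model, "What is the best resource to learn deep learning?"))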
We have seen how to build our own text classification model in PyTorch and learnt the importance of pack padding. You can play around with the hyperparameters of the Long Short-Term Memory (LSTM) model, such as the number of hidden nodes and the number of hidden layers, to improve the performance even further.
If you have any queries or feedback, leave them in the comments section and I will get back to you.
Getting the following error -
NameError: name 'TEXT' is not defined
while running - fields = [(None, None), ('text',TEXT),('label', LABEL)]
Could you please tell me how to initialise TEXT and LABEL?
Great tutorial btw.
Ajay
Thanks Aravind for sharing this wonderful topic. I am getting an error at fields = [(None, None), ('text',TEXT),('label', LABEL)]:
NameError: name 'TEXT' / 'LABEL' is not defined
Hi, Thanks for pointing it out. I just missed a snippet of code. I have updated it now.