With the advancement of deep learning, neural network architectures like recurrent neural networks (RNNs and LSTMs) and convolutional neural networks (CNNs) have shown decent improvements in performance on several Natural Language Processing (NLP) tasks like text classification, language modeling, machine translation, etc.
However, the performance of deep learning models in NLP pales in comparison to the performance of deep learning in computer vision.
One of the main reasons for this slow progress could be the lack of large labeled text datasets. Most of the labeled text datasets are not big enough to train deep neural networks because these networks have a huge number of parameters and training such networks on small datasets will cause overfitting.
Another important reason for NLP lagging behind computer vision was the lack of transfer learning in NLP. Transfer learning has been instrumental in the success of deep learning in computer vision. This happened due to the availability of huge labeled datasets like ImageNet, on which deep CNN-based models were trained and later used as pre-trained models for a wide range of computer vision tasks.
That was not the case with NLP until 2017, when the Transformer model was introduced by Google. Ever since, transfer learning in NLP has been helping solve many tasks with state-of-the-art performance.
In this article, I explain how to fine-tune BERT for text classification.
Transfer learning is a technique where a deep learning model trained on a large dataset is used to perform similar tasks on another dataset. We call such a deep learning model a pre-trained model. The most renowned examples of pre-trained models are the computer vision deep learning models trained on the ImageNet dataset. So, it is better to use a pre-trained model as a starting point to solve a problem rather than building a model from scratch.
This breakthrough of transfer learning in computer vision occurred around 2012-13. However, with recent advances, transfer learning has become a viable option in NLP as well.
Most of the tasks in NLP such as text classification, language modeling, machine translation, etc. are sequence modeling tasks. The traditional machine learning models and neural networks cannot capture the sequential information present in the text. Therefore, people started using recurrent neural networks (RNN and LSTM) because these architectures can model sequential information present in the text.
A typical RNN
However, these recurrent neural networks have their own set of problems. One major issue is that RNNs cannot be parallelized because they take one input at a time. In the case of a text sequence, an RNN or LSTM takes one token at a time as input, so it passes through the sequence token by token. Hence, training such a model on a big dataset takes a lot of time.
So, the need for transfer learning in NLP was at an all-time high. In 2017, the Transformer was introduced by Google in the paper “Attention Is All You Need”, which turned out to be a groundbreaking milestone in NLP.
The Transformer – Model Architecture (Source: https://arxiv.org/abs/1706.03762)
Soon a wide range of transformer-based models started coming up for different NLP tasks. There are multiple advantages of using transformer-based models, but the most important ones are:
First Benefit
These models do not process an input sequence token by token; rather, they take the entire sequence as input in one go. This is a big improvement over RNN-based models because the model can now be accelerated by GPUs.
Second Benefit
We don’t need labeled data to pre-train these models: we just have to provide a huge amount of unlabeled text data to train a transformer-based model. We can then use this trained model for other NLP tasks like text classification, named entity recognition, text generation, etc. This is how transfer learning works in NLP.
BERT and GPT-2 are the most popular transformer-based models and in this article, we will focus on BERT and learn how we can use a pre-trained BERT model to perform text classification.
What is Model Fine-Tuning?
BERT (Bidirectional Encoder Representations from Transformers) is a big neural network architecture with a huge number of parameters that can range from 100 million to over 300 million. So, training a BERT model from scratch on a small dataset would result in overfitting.
So, it is better to use a pre-trained BERT model that was trained on a huge dataset, as a starting point. We can then further train the model on our relatively smaller dataset and this process is known as model fine-tuning.
Different Fine-Tuning Techniques
Train the entire architecture – We can further train the entire pre-trained model on our dataset and feed the output to a softmax layer. In this case, the error is back-propagated through the entire architecture and the pre-trained weights of the model are updated based on the new dataset.
Train some layers while freezing others – Another way to use a pre-trained model is to train it partially. What we can do is keep the weights of the initial layers of the model frozen while we retrain only the higher layers. We can experiment to find out how many layers to freeze and how many to train.
Freeze the entire architecture – We can even freeze all the layers of the model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training.
In this tutorial, we will use the third approach. We will freeze all the layers of BERT during fine-tuning and append a dense layer and a softmax layer to the architecture.
Overview of BERT
You’ve heard about BERT, you’ve read about how incredible it is, and how it’s potentially changing the NLP landscape. But what is BERT in the first place?
Here’s how the research team behind BERT describes the NLP framework:
“BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.”
That sounds way too complex as a starting point. But it does summarize what BERT does pretty well, so let’s break it down.
Firstly, BERT stands for Bidirectional Encoder Representations from Transformers. Each word in the name carries meaning, and we will encounter them one by one in this article. For now, the key takeaway from this line is that BERT is based on the Transformer architecture. Secondly, BERT is pre-trained on a large corpus of unlabeled text, including the entire Wikipedia (that’s 2,500 million words!) and Book Corpus (800 million words).
This pre-training step is half the magic behind BERT’s success. As we train a model on a large text corpus, it starts to pick up a deeper, more intimate understanding of how language works. This knowledge is a Swiss Army knife that is useful for almost any NLP task.
Thirdly, BERT is a “deep bidirectional” model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.
To learn more about the BERT architecture and its pre-training tasks, you may like to read the article below:
Now we will fine-tune a BERT model to perform text classification with the help of the Transformers library. You should have a basic understanding of defining, training, and evaluating neural network models in PyTorch. If you want a quick refresher on PyTorch then you can go through the article below:
We have a collection of SMS messages. Some of these messages are spam and the rest are genuine. Our task is to build a system that would automatically detect whether a message is spam or not.
The dataset that we will be using for this use case can be downloaded from here (right-click and click on “Save link as…”).
I suggest you use Google Colab to perform this task so that you can use the GPU. Firstly, activate the GPU runtime on Colab by clicking on Runtime -> Change runtime type -> Select GPU.
Install Transformers Library
We will then install Hugging Face’s transformers library. This library lets you import a wide range of transformer-based pre-trained models. Just execute the code below to install it.
!pip install transformers
Import Libraries
You would have to upload the downloaded spam dataset to your Colab runtime. Then read it into a pandas dataframe.
The dataset consists of two columns – “label” and “text”. The column “text” contains the message body and “label” is a binary variable where 1 means spam and 0 means the message is not spam.
Now we will split this dataset into three sets – train, validation, and test.
We will fine-tune the model using the train set and the validation set, and make predictions for the test set.
Import BERT Model and BERT Tokenizer
We will import the BERT-base model that has 110 million parameters. There is an even bigger BERT model called BERT-large that has 345 million parameters.
Python Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast
# Read dataset
df = pd.read_csv("spamdata_v2.csv")
df.head()
# Train test split
train_text, temp_text, train_labels, temp_labels = train_test_split(df['text'], df['label'], random_state=2018, test_size=0.3, stratify=df['label'])
val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, random_state=2018, test_size=0.5, stratify=temp_labels)
# import BERT-base pretrained model
bert = AutoModel.from_pretrained('bert-base-uncased')
# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# sample data
text = ["this is a bert model tutorial", "we will fine-tune a bert model"]
# encode text
sent_id = tokenizer.batch_encode_plus(text, padding=True)
# output
print(sent_id)
Let’s see how this BERT tokenizer works. In the sample code above, we encoded a couple of sentences using the tokenizer.
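The printed output looks something like the following. The exact token ids depend on the BERT vocabulary, so treat this as an illustration of the structure rather than the literal values:
{'input_ids': [[101, ..., 102, 0, 0], [101, ..., 102]], 'attention_mask': [[1, ..., 1, 0, 0], [1, ..., 1]]}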
As you can see, the output is a dictionary with two items.
‘input_ids’ contains the integer sequences of the input sentences. The integers 101 and 102 are special tokens ([CLS] and [SEP]) added at the start and end of each sequence, and 0 represents the padding token.
‘attention_mask’ contains 1’s and 0’s. It tells the model to pay attention to the tokens corresponding to the mask value of 1 and ignore the rest.
Tokenize the Sentences
Since the messages (text) in the dataset are of varying lengths, we will use padding to make all the messages the same length. We could use the maximum sequence length as the padding length. However, it is also worth looking at the distribution of the sequence lengths in the train set to find the right padding length.
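Here is a quick sketch of this check, assuming the train_text split from earlier (the histogram renders via pandas/matplotlib):
# get the number of words in each training message
seq_len = [len(str(i).split()) for i in train_text]
pd.Series(seq_len).hist(bins=30)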
We can clearly see that most of the messages have a length of 25 words or less, whereas the maximum length is 175. So, if we select 175 as the padding length, then all the input sequences will have a length of 175, and most of the tokens in those sequences will be padding tokens, which are not going to help the model learn anything useful; on top of that, it will make the training slower.
Therefore, we will set 25 as the padding length.
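A sketch of the tokenization step. The padding and truncation arguments reflect the current tokenizer API (older versions of the library used pad_to_max_length=True instead):
max_seq_len = 25
# tokenize and encode sequences in the training set
tokens_train = tokenizer.batch_encode_plus(train_text.tolist(), max_length=max_seq_len, padding='max_length', truncation=True)
# tokenize and encode sequences in the validation set
tokens_val = tokenizer.batch_encode_plus(val_text.tolist(), max_length=max_seq_len, padding='max_length', truncation=True)
# tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(test_text.tolist(), max_length=max_seq_len, padding='max_length', truncation=True)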
So, we have now converted the messages in train, validation, and test set to integer sequences of length 25 tokens each.
Next, we will convert the integer sequences to tensors.
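A minimal sketch of the conversion, assuming the tokens_* dictionaries from the previous step:
import torch
# convert the integer sequences and labels to PyTorch tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())
val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())
test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())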
Now we will create dataloaders for both train and validation set. These dataloaders will pass batches of train data and validation data as input to the model during the training phase.
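A sketch of the dataloader setup; the batch size of 32 is an assumption you can tune:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
batch_size = 32
# sample training batches randomly, validation batches sequentially
train_data = TensorDataset(train_seq, train_mask, train_y)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
val_data = TensorDataset(val_seq, val_mask, val_y)
val_dataloader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)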
If you recall, I mentioned earlier in this article that I would freeze all the layers of the model before fine-tuning it. So, let’s do that first.
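A minimal sketch:
# freeze all the parameters of the pre-trained BERT model
for param in bert.parameters():
    param.requires_grad = False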
This will prevent the model weights from being updated during fine-tuning. If you wish to fine-tune even the pre-trained weights of the BERT model, then you should not execute the code above.
Moving on, let’s now define our model architecture.
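Here is one way the architecture could look, a sketch of the approach described earlier: a dense layer and a (log-)softmax output stacked on top of BERT’s pooled [CLS] representation. The hidden size of 512 and the dropout rate of 0.1 are assumptions:
import torch.nn as nn

class BERT_Arch(nn.Module):
    def __init__(self, bert):
        super(BERT_Arch, self).__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)   # 768 is the hidden size of BERT-base
        self.fc2 = nn.Linear(512, 2)     # two output classes: spam / not spam
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask):
        # cls_hs is the pooled representation of the [CLS] token
        _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return self.softmax(x)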
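We then instantiate the model and push it to the GPU (falling back to the CPU if none is available):
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BERT_Arch(bert)
model = model.to(device)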
We will use AdamW as our optimizer. It is an improved version of the Adam optimizer. To learn more about it, do check out this paper.
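A minimal sketch; the learning rate of 1e-5 is an assumption (small learning rates are typical for fine-tuning):
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-5)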
There is a class imbalance in our dataset. The majority of the observations are not spam. So, we will first compute class weights for the labels in the train set and then pass these weights to the loss function so that it takes care of the class imbalance.
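A sketch using scikit-learn's helper:
from sklearn.utils.class_weight import compute_class_weight
# compute balanced class weights from the training labels
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)
print('Class weights:', class_weights)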
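We then pass these weights to the loss function. NLLLoss pairs with the LogSoftmax output of the model sketched above; the number of epochs is an assumption:
# convert class weights to a tensor on the right device
weights = torch.tensor(class_weights, dtype=torch.float).to(device)
# define a weighted loss function
cross_entropy = nn.NLLLoss(weight=weights)
# number of training epochs
epochs = 10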
So far, we have defined the model architecture, specified the optimizer and the loss function, and prepared our dataloaders. Now we have to define a couple of functions to train (fine-tune) and evaluate the model, respectively.
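Here is a sketch of a training function under the setup above (gradient clipping at 1.0 is a common default, not something mandated by BERT):
def train():
    model.train()
    total_loss = 0
    total_preds = []
    for step, batch in enumerate(train_dataloader):
        # progress update after every 50 batches
        if step % 50 == 0 and step != 0:
            print('Batch {} of {}.'.format(step, len(train_dataloader)))
        # push the batch to the device
        sent_id, mask, labels = [r.to(device) for r in batch]
        model.zero_grad()
        preds = model(sent_id, mask)
        loss = cross_entropy(preds, labels)
        total_loss += loss.item()
        loss.backward()
        # clip gradients to prevent them from exploding
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_preds.append(preds.detach().cpu().numpy())
    avg_loss = total_loss / len(train_dataloader)
    return avg_loss, np.concatenate(total_preds, axis=0)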
We will use the following function to evaluate the model. It will use the validation set data.
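A matching evaluation sketch; autograd is disabled since no weights are updated here:
def evaluate():
    print('\nEvaluating...')
    model.eval()
    total_loss = 0
    total_preds = []
    for step, batch in enumerate(val_dataloader):
        sent_id, mask, labels = [t.to(device) for t in batch]
        with torch.no_grad():
            preds = model(sent_id, mask)
            loss = cross_entropy(preds, labels)
            total_loss += loss.item()
            total_preds.append(preds.detach().cpu().numpy())
    avg_loss = total_loss / len(val_dataloader)
    return avg_loss, np.concatenate(total_preds, axis=0)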
Now we will finally start fine-tuning the model.
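A sketch of the training loop; the checkpoint file name saved_weights.pt is an assumption:
best_valid_loss = float('inf')
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    train_loss, _ = train()
    valid_loss, _ = evaluate()
    # save the weights of the best model so far
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')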
Training Loss: 0.592
Validation Loss: 0.567
Epoch 5 / 10
Batch 50 of 122.
Batch 100 of 122.
Evaluating...
Training Loss: 0.566
Validation Loss: 0.543
Epoch 6 / 10
Batch 50 of 122.
Batch 100 of 122.
Evaluating...
Training Loss: 0.552
Validation Loss: 0.525
Epoch 7 / 10
Batch 50 of 122.
Batch 100 of 122.
Evaluating...
Training Loss: 0.525
Validation Loss: 0.498
Epoch 8 / 10
Batch 50 of 122.
Batch 100 of 122.
Evaluating...
Training Loss: 0.507
Validation Loss: 0.477
Epoch 9 / 10
Batch 50 of 122.
Batch 100 of 122.
Evaluating...
Training Loss: 0.488
Validation Loss: 0.461
Epoch 10 / 10
Batch 50 of 122.
Batch 100 of 122.
Evaluating...
Training Loss: 0.474
Validation Loss: 0.454
You can see that the validation loss is still decreasing at the end of the 10th epoch. So, you may try a higher number of epochs. Now let’s see how well it performs on the test dataset.
Make Predictions
To make predictions, we will first load the best model weights, which were saved during the training process.
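A minimal sketch, assuming the checkpoint name used in the training loop above:
# load weights of the best model
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))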
Once the weights are loaded, we can use the fine-tuned model to make predictions on the test set.
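A sketch of the prediction step. The test set here is small enough for a single forward pass; for larger sets you would batch it:
# get predictions for the test data
with torch.no_grad():
    preds = model(test_seq.to(device), test_mask.to(device))
    preds = preds.detach().cpu().numpy()
# pick the class with the highest score for each message
preds = np.argmax(preds, axis=1)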
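Finally, we check the model's performance using the classification_report imported earlier:
print(classification_report(test_y, preds))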
Both recall and precision for class 0 are quite high, which means that the model predicts this class pretty well. However, our objective is to detect spam messages, so misclassifying class 1 (spam) samples is the bigger concern. If you look at the recall for class 1, it is 0.90, which means that the model was able to correctly classify 90% of the spam messages. However, the precision is a bit on the lower side for class 1, which means that the model misclassifies some of the class 0 messages (not spam) as spam.
To summarize, in this article we fine-tuned a pre-trained BERT model to perform text classification on a very small dataset. I urge you to fine-tune BERT on a different dataset and see how it performs. You can also perform multiclass or multi-label classification with the help of BERT. In addition, if you have a bigger dataset, you can even train the entire BERT architecture.
In case you are looking for a roadmap to becoming an expert in NLP, read the following article:
Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.
Dinesh
Hi,
Thanks a lot for very detailed explanation. I have some doubt around precision/recall inference you made. You mentioned recall for class 1 is high (0.90) so model will correctly identify spam 90% of time , however doesn't that metric should be accuracy ? Also you mentioned precision is on lower side (0.39) which means we would misclassify non-spam as spam , however how do you interpret high precision for class 0 in same context ? if possible please expand little more on precision/recall for both spam (1) & ham(class 0).
Thanks & Regards
Prateek Joshi
Hi Dinesh, I have said that the model was able to correctly classify 90% of the spam messages. It means that if there were 100 spam messages in the unseen dataset then the model would have classified 90 of them as spam. The same logic can be applied for the ham (class 0).
John ODonovan
Hi Dinesh,
Nice tutorial, thanks! I got similar results to you. (I used my own GPU box instead of Colab)
              precision    recall  f1-score   support
           0       0.97      0.87      0.91       724
           1       0.48      0.81      0.61       112
    accuracy                           0.86       836
   macro avg       0.73      0.84      0.76       836
weighted avg       0.90      0.86      0.87       836
Tegene
Hi Prateek !
Thank you very much for your hands-on explanation of such a complex concept. I am enjoying your tutorials as well.
Currently, I am working on a project on transfer learning for next-word prediction for one of my local languages. The language uses Latin (English) letters and very long suffixes when inflected; it is also a low-resource language.
so which method would you recommend?
if you can please put the steps for me.
thanks in advance
Prateek Joshi
Hi, We can solve the next word prediction problem with the help of a language model. The good thing is that you don't need a labeled dataset to train a language model. So, you can either train a language model from scratch or use a pre-trained model such as GPT-2. But for your language I don't think there would be any pre-trained model available. So, you should try to collect as much data as possible and train a language model from scratch to predict the next word.