Sentiment Analysis is a supervised Machine Learning technique used to analyze and predict the polarity (positive or negative) of the sentiment expressed in a text.
It is often used by businesses and companies to understand their users’ experiences, emotions, responses, etc., so that they can improve the quality and flexibility of their products and services.
Now, let’s dive deep into understanding how this sentiment analysis technique is used by machine learning engineers to examine sentiments of various texts.
Data is the heart of every machine learning problem; no Machine Learning problem can be solved without data. There’s even an old saying:
“THE MORE DATA, THE BETTER!”
There are many open sources of data that can be used for training ML models, so it’s a personal choice whether to gather data yourself or use open datasets to train your algorithm.
Text-based datasets are generally distributed in JSON or CSV formats, so to use them we can load the data into a Python list, dictionary, or data frame object.
Data should be split into train, validation, and test sets, commonly in a 60/20/20 or 70/15/15 ratio.
The popular Twitter dataset can be downloaded from here.
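For illustration, here is a minimal sketch of loading and splitting such a dataset with pandas and scikit-learn; the file name ‘tweets.csv’ and the column names ‘text’ and ‘label’ are hypothetical, so adjust them to your own data:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names: "text" holds the tweet, "label" is 0/1.
df = pd.read_csv("tweets.csv")

# First carve out 70% for training, then split the remaining 30% in half
# to get a 70/15/15 train/validation/test split.
train_df, temp_df = train_test_split(df, test_size=0.30, random_state=42, stratify=df["label"])
val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=42, stratify=temp_df["label"])

print(len(train_df), len(val_df), len(test_df))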
Every Machine Learning task should have a pipeline. Pipelines split your machine learning workflow into independent, reusable, modular parts that can then be chained together to iteratively improve the model and arrive at a successful algorithm.
We’ll follow a basic pipeline structure for our problem so that a reader can easily understand each part of the workflow. Our pipeline will include the following steps: data preprocessing, vocabulary creation (tokenization), feature extraction, model training, and model testing.
Preprocessing is an important step in our pipeline. It removes words and punctuation that carry no sentimental value, which shrinks the data to the words that do carry sentimental weight and can significantly reduce training time. Preprocessing includes handling of:
Stop words are words that do not carry any semantic or sentimental value/weight in a sentence, e.g.: and, is, the, you, etc.
How to process them? We’ll create a list including all possible stop words like
[‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’, ‘there’, ‘about’, ‘once’, ‘during’, ‘out’, ‘very’, ‘having’, ‘with’, ‘they’, ‘own’, ‘an’, ‘be’, ‘some’, ‘for’, ‘do’, ‘its’, ‘yours’, ‘such’, ‘into’, ‘of’, ‘most’, ‘itself’, ‘other’, ‘off’, ‘is’, ‘s’, ‘am’, ‘or’, ‘who’, ‘as’, ‘from’, ‘him’, ‘each’, ‘the’, ‘themselves’, ‘until’, ‘below’, ‘are’, ‘we’, ‘these’, ‘your’, ‘his’, ‘through’, ‘don’, ‘nor’, ‘me’, ‘were’, ‘her’, ‘more’, ‘himself’, ‘this’, ‘down’, ‘should’, ‘our’, ‘their’, ‘while’, ‘above’, ‘both’, ‘up’, ‘to’, ‘ours’, ‘had’, ‘she’, ‘all’, ‘no’, ‘when’, ‘at’, ‘any’, ‘before’, ‘them’, ‘same’, ‘and’, ‘been’, ‘have’, ‘in’, ‘will’, ‘on’, ‘does’, ‘yourselves’, ‘then’, ‘that’, ‘because’, ‘what’, ‘over’, ‘why’, ‘so’, ‘can’, ‘did’, ‘not’, ‘now’, ‘under’, ‘he’, ‘you’, ‘herself’, ‘has’, ‘just’, ‘where’, ‘too’, ‘only’, ‘myself’, ‘which’, ‘those’, ‘i’, ‘after’, ‘few’, ‘whom’, ‘t’, ‘being’, ‘if’, ‘theirs’, ‘my’, ‘against’, ‘a’, ‘by’, ‘doing’, ‘it’, ‘how’, ‘further’, ‘was’, ‘here’, ‘than’]
Now we’ll iterate through each example in our data and remove every word that is present in the stop-word list.
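In practice you would usually take the stop-word list from a library rather than typing it out; here is a small sketch using NLTK’s English stop-word list (the example sentence is made up):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                      # one-time download of the word list
stop_words = set(stopwords.words("english"))

def remove_stop_words(text):
    # Keep only the tokens that are not in the stop-word list.
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

print(remove_stop_words("I am very happy with the new phone"))  # -> "happy new phone"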
Punctuation marks are symbols that we use to structure or emphasize our text, e.g.: !, @, #, $, etc.
How to process them? We’ll handle them just as we handled the stop words: create a list of punctuation symbols and strip them from each example.
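One possible way to do this is with Python’s built-in string.punctuation set rather than a hand-written list; the example text is made up:

import string

def remove_punctuation(text):
    # str.translate drops every character listed in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Great product!!! #deal @ 50% off"))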
URLs are links that start with the HTTP protocol declaration, e.g. ‘https://….’, and handles are used to mention people on social media, e.g. ‘@user’. Both of them carry no sentimental significance.
How to process them? Process them by creating a function ‘process_handles_urls()’ which takes our train data and eliminates words starting with ‘https://’ or ‘@’ from each example.
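A possible sketch of such a function, using regular expressions (the function name comes from the text above; the example tweet is made up):

import re

def process_handles_urls(text):
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove @handles
    return text.strip()

print(process_handles_urls("Loving the update @support https://example.com"))  # -> "Loving the update"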
Stemming is the process of reducing a word to its base stem. e.g. ‘turn’ is the stem of ‘turning’, ‘turns’, ‘turned’, etc. Since the stem carries the same sentimental value as all of its suffixed forms, we can reduce each word to its stem, which reduces our vocabulary size and training time as well.
How to process them? Process them by creating a function ‘do_stemming()’ which takes the data and stems the words of each example.
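A short sketch of such a function using NLTK’s PorterStemmer (one of several available stemmers):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def do_stemming(text):
    # Reduce every word to its stem, e.g. "turning", "turned" -> "turn".
    return " ".join(stemmer.stem(word) for word in text.split())

print(do_stemming("turning turns turned"))  # -> "turn turn turn"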
We should use the same letter case for every word in the data, so that ‘Word’, ‘WORD’, and ‘word’ are all represented by a single form, i.e. lower case. This also helps reduce the vocabulary size and eliminates repetition of words.
How to process them? Iterate through each example and use the .lower() method to convert every piece of text to lower case.
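For example (with made-up sentences):

# Lower-casing every example so "Word", "WORD", and "word" map to the same token.
examples = ["GREAT phone", "Terrible Battery Life"]
examples = [text.lower() for text in examples]
print(examples)  # ['great phone', 'terrible battery life']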
After preprocessing the data, it’s time to create a vocabulary that stores each unique word and assigns a numeric value to each distinct word (this is also called tokenization).
We’ll use this vocabulary dictionary for feature extraction.
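A minimal sketch of building such a vocabulary dictionary (the tiny corpus below is a made-up example):

corpus = ["happy great day", "sad bad day", "great great product"]

vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)     # assign the next free integer id

print(vocab)  # {'happy': 0, 'great': 1, 'day': 2, 'sad': 3, 'bad': 4, 'product': 5}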
One of the problems when working with language processing is that machine learning algorithms cannot work on raw text directly. So, we need feature extraction techniques to convert text into a matrix (or vector) of numerical features.
Let’s take some positive and negative tweet examples:
NOTE: The above examples are not processed, so we’ll process them first before moving on to further steps.
Sparse representation is a naive approach to extracting features from text. We iterate through the whole dataset and, for each word that occurs in a given text example, assign a 1 at that word’s position in the vocabulary list, and a 0 for every word that does not occur. So, our feature matrix will have rows = total sentences in our data and columns = total words in the vocabulary.
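A sketch of this representation for a tiny made-up corpus and vocabulary:

import numpy as np

def sparse_features(corpus, vocab):
    # One row per sentence, one column per vocabulary word:
    # 1 if the word occurs in the sentence, 0 otherwise.
    matrix = np.zeros((len(corpus), len(vocab)), dtype=int)
    for i, sentence in enumerate(corpus):
        for word in sentence.split():
            if word in vocab:
                matrix[i, vocab[word]] = 1
    return matrix

corpus = ["happy great day", "sad bad day"]
vocab = {"happy": 0, "great": 1, "day": 2, "sad": 3, "bad": 4}
print(sparse_features(corpus, vocab))
# [[1 1 1 0 0]
#  [0 0 1 1 1]]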
A frequency dictionary keeps track of the Positive and Negative frequencies of each word in our data.
Positive Frequency: the number of times a word occurs in sentences with positive sentiment.
Negative Frequency: the number of times a word occurs in sentences with negative sentiment.
Using the frequency dictionary for feature extraction, we can reduce each row of the feature matrix, which represents one sentence and which under sparse representation has as many dimensions as there are words in the vocabulary, to just three dimensions.
The process almost looks like:
Now we have a 3-dimensional feature vector for our tweet, made up of a bias term, the sum of the positive frequencies of its words, and the sum of their negative frequencies; it looks like:
Xm = [1, 8, 11]
Now we’ll iterate through every example and extract its features, then we’ll stack those features into the feature matrix that we can use for training (a sketch of this follows). In the end, we have a feature matrix like –
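A minimal sketch of the frequency dictionary and the per-sentence feature extraction (the sentences and labels below are made up; 1 = positive, 0 = negative):

import numpy as np

def build_freqs(sentences, labels):
    # (word, label) -> number of times the word appears with that label.
    freqs = {}
    for sentence, label in zip(sentences, labels):
        for word in sentence.split():
            freqs[(word, label)] = freqs.get((word, label), 0) + 1
    return freqs

def extract_features(sentence, freqs):
    # [bias, sum of positive frequencies, sum of negative frequencies]
    pos = sum(freqs.get((word, 1), 0) for word in sentence.split())
    neg = sum(freqs.get((word, 0), 0) for word in sentence.split())
    return np.array([1.0, pos, neg])

sentences = ["happy great day", "sad bad day"]
labels = [1, 0]
freqs = build_freqs(sentences, labels)
print(extract_features("great day", freqs))   # -> [1. 2. 1.]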
Logistic regression models the probabilities of classification problems with two possible outcomes. It’s an extension of the linear regression model to classification.
Logistic Regression uses a sigmoid function to map the output of our linear function θᵀx to a value h between 0 and 1, with some threshold (usually 0.5) to differentiate between the two classes, such that if h > 0.5 it’s the positive class, and if h < 0.5 it’s the negative class. (Explaining logistic regression in full is beyond the scope of this article.)
Training our model will follow these steps:
We initialize our parameter vector θ, which we use in the sigmoid; we then compute the gradient, use it to update θ, and calculate the cost. We keep repeating these steps until the cost converges to a minimum. A bare-bones sketch of this loop follows.
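A bare-bones gradient-descent sketch for logistic regression on the 3-dimensional features built above (X has shape (m, 3), y has shape (m, 1); the learning rate and iteration count are arbitrary illustrative values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, theta, alpha=1e-4, num_iters=1000):
    m = len(y)
    cost = None
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                                    # predictions
        cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))  # cross-entropy cost
        grad = X.T @ (h - y) / m                                  # gradient of the cost
        theta = theta - alpha * grad                              # update step
    return theta, cost

theta0 = np.zeros((3, 1))   # one weight per feature: bias, pos, neg
# theta, cost = gradient_descent(X_train, y_train, theta0)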
To test our model, we’ll use our validation set: with the learned θ we compute h = sigmoid(θᵀx) for each validation example, predict the positive class when h > 0.5 and the negative class otherwise, and compare the predictions against the true labels to measure accuracy.
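A possible sketch of this evaluation step (assuming X_val and y_val were built with the feature extractor above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta):
    # Positive class whenever the predicted probability exceeds 0.5.
    return (sigmoid(X @ theta) > 0.5).astype(int)

def accuracy(X, y, theta):
    # Fraction of validation examples whose predicted label matches the true label.
    return float(np.mean(predict(X, theta) == y))

# print("validation accuracy:", accuracy(X_val, y_val, theta))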
Glad you made it till here! If you’re a beginner in Natural Language Processing, I hope I was able to give you a glimpse of how things work under the hood and prepare you to take on more complex and advanced topics; and if you’re a practitioner, I hope I was able to brush up your basics.
Natural Language Processing is a vast domain of AI; its applications appear in various paradigms such as chatbots, sentiment analysis, machine translation, autocorrect, etc. I’ve covered just a grain of the topic, so if you want to learn more, there are various e-learning platforms and freely available articles, papers, etc. that can help you go further in the journey.
References: Natural Language Processing Specialization