By most industry estimates, around 80% of the world's data is unstructured (images, videos, text, etc.). It comes from social media posts and tweets, call transcripts, survey and interview responses, and text across blogs, forums, and news sites.
Reading all the text across the web and finding patterns in it is humanly impossible, yet businesses need to analyze this data to act on it.
One such process of drawing insights from textual data is sentiment analysis. To obtain data for sentiment analysis, one can scrape content directly from web pages using various web scraping techniques.
Learning Objectives
Understand the importance of sentiment analysis in processing unstructured textual data and its applications in various fields such as market analysis, social media monitoring, customer feedback analysis, and market research.
Differentiate between rule-based and machine learning-based approaches in sentiment analysis, focusing on the rule-based (lexicon-based) approach.
Gain practical knowledge of the rule-based approach by implementing TextBlob, VADER, and SentiWordNet for sentiment analysis in Python.
Learn essential data preprocessing steps for text analysis, including cleaning text, tokenization, enrichment (POS tagging), stopwords removal, and obtaining stem words.
Explore and compare sentiment analysis results using different lexicon-based techniques (TextBlob, VADER, SentiWordNet) through visualizations and understand the significance of choosing the right approach for specific applications.
Sentiment Analysis (also known as opinion mining or emotion AI) is a sub-field of NLP that measures the inclination of people’s opinions (Positive/Negative/Neutral) within the unstructured text.
Sentiment Analysis can be performed using two approaches: Rule-based and Machine Learning based.
A few applications of sentiment analysis algorithms:
Market analysis
Social media monitoring
Customer feedback analysis – Brand sentiment or reputation analysis
Market research
What is Natural Language Processing (NLP)?
Natural language is the way humans communicate with each other, whether as speech or text. NLP is the automatic manipulation of natural language by software. It is an umbrella term combining Natural Language Understanding (NLU) and Natural Language Generation (NLG).
NLP = NLU + NLG
Some of the Python Natural Language Processing (NLP) libraries are:
Natural Language Toolkit (NLTK)
TextBlob
SpaCy
Gensim
CoreNLP
I hope we now have a basic understanding of the terms sentiment analysis and NLP.
This article focuses on the rule-based approach to sentiment analysis.
Rule-based approach
This is a practical approach to analyzing text without training a machine learning model. The approach applies a set of rules by which text is labeled as positive/negative/neutral. These rules rely on lexicons, collections of words and phrases with pre-assigned sentiment scores, which is why the rule-based approach is also called the lexicon-based approach.
Widely used lexicon-based approaches are TextBlob, VADER, and SentiWordNet.
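To make the lexicon idea concrete, here is a minimal toy sketch of a lexicon-based scorer. The four-word lexicon below is hypothetical, purely for illustration; real libraries like VADER use far richer word lists and rules.

# A toy lexicon: words mapped to hand-assigned sentiment scores
lexicon = {'good': 1.0, 'great': 1.5, 'bad': -1.0, 'terrible': -1.5}

def toy_sentiment(text):
    # Sum the scores of any lexicon words found in the text
    score = sum(lexicon.get(word, 0.0) for word in text.lower().split())
    return 'Positive' if score > 0 else 'Negative' if score < 0 else 'Neutral'

print(toy_sentiment('The product was great'))  # Positive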
Data preprocessing steps:
Cleaning the text
Tokenization
Enrichment – POS tagging
Stopwords removal
Obtaining the stem words
Before deep-diving into the above steps, let me import the text data from a .txt file.
# install and import pandas library
import pandas as pd
# Creating a pandas dataframe from reviews.txt file
data = pd.read_csv('reviews.txt', sep='\t')
print(data.head())
# Note: your output may differ, since the reviews were scraped from a website.
This doesn't look clean. So, we will now drop the "Unnamed: 0" column using the df.drop function, as sketched below.
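A minimal sketch of that step, assuming the stray index column is indeed named "Unnamed: 0"; the result is stored in mydata, the dataframe used in the rest of the article:

# Drop the unwanted index column and keep the review text
mydata = data.drop('Unnamed: 0', axis=1)
mydata.head()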
Our dataset has a total of 240 observations (reviews).
Step 1: Cleaning the text
In this step, we need to remove special characters and numbers from the text. We can use Python's regular expression library, re.
import re

# Define a function to clean the text
def clean(text):
    # Remove everything except letters, collapsing the removed runs into single spaces
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text
# Cleaning the text in the review column
mydata['Cleaned Reviews'] = mydata['review'].apply(clean)
mydata.head()
Explanation: “clean” is the function that takes text as input and returns the text without any punctuation marks or numbers in it. We applied it to the ‘review’ column and created a new column, ‘Cleaned Reviews,’ with the cleaned text.
Great. If you inspect the 'Cleaned Reviews' column, all the special characters and numbers have been removed.
Step 2: Tokenization
Tokenization is the process of breaking text into smaller pieces called tokens. It can be performed at the sentence level (sentence tokenization) or the word level (word tokenization).
I will be performing word-level tokenization using the nltk function word_tokenize().
Note: As our text data is a little large, first I will illustrate steps 2-5 with small example sentences.
Let's say we have the sentence "This is an article on Sentiment Analysis". It can be broken down into small pieces (tokens), as in the sketch below.
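A quick sketch (the punkt tokenizer models must be downloaded once):

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

print(word_tokenize('This is an article on Sentiment Analysis'))
# ['This', 'is', 'an', 'article', 'on', 'Sentiment', 'Analysis']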
Step 3: Enrichment – POS tagging
Parts of speech (POS) tagging converts each token into a tuple of the form (word, tag). POS tagging preserves each word's context and is required for lemmatization.
This can be achieved by using the nltk pos_tag function.
Below are the POS tags of the example sentence "This is an article on Sentiment Analysis".
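A sketch using nltk's pos_tag, which needs the averaged_perceptron_tagger resource; the exact tags may vary slightly with the tagger version:

nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

print(pos_tag(word_tokenize('This is an article on Sentiment Analysis')))
# e.g. [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('article', 'NN'),
#       ('on', 'IN'), ('Sentiment', 'NN'), ('Analysis', 'NN')]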
Check out the Penn Treebank tag set for the full list of possible POS tags.
Step 4: Stopwords removal
Stopwords are words in English that carry very little useful information, and we remove them as part of text preprocessing. nltk ships stopword lists for many languages.
You can inspect the English stopword list, and remove stopwords from our example sentence, as in the sketch below.
The stopwords This, is, an, and on are removed, and the output sentence is 'article Sentiment Analysis'.
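A minimal sketch (the stopwords corpus must be downloaded once):

nltk.download('stopwords')
from nltk.corpus import stopwords

print(stopwords.words('english')[:10])  # peek at the first few English stopwords
tokens = word_tokenize('This is an article on Sentiment Analysis')
print([t for t in tokens if t.lower() not in stopwords.words('english')])
# ['article', 'Sentiment', 'Analysis']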
Step 5: Obtaining the stem words
A stem is the part of a word responsible for its lexical meaning. The two popular techniques for obtaining root/stem words are stemming and lemmatization.
The key difference is that stemming often gives meaningless root words, as it simply chops characters off the end, while lemmatization gives meaningful root words. However, lemmatization requires POS tags for the words.
An example illustrating the difference between stemming and lemmatization is sketched below.
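A minimal sketch using nltk's PorterStemmer and WordNetLemmatizer (the wordnet corpus must be downloaded once):

nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('glanced'))                   # glanc  -> the stem
print(lemmatizer.lemmatize('glanced', pos='v'))  # glance -> the lemma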
In the example, the output from stemming is the stem, and the output from lemmatization is the lemma.
For the word glanced, the stem glanc is meaningless, whereas the lemma glance is perfect.
Now that we understand steps 2-5 through simple examples, let us bounce back to our actual problem without any further delay.
Code for Steps 2-4: Tokenization, POS tagging, Stopwords removal
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.corpus import wordnet

# POS tagger dictionary: maps the first letter of a Penn Treebank tag to a wordnet POS constant
pos_dict = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}
stop_words = set(stopwords.words('english'))

def token_stop_pos(text):
    tags = pos_tag(word_tokenize(text))
    newlist = []
    for word, tag in tags:
        if word.lower() not in stop_words:
            # Keep the word together with its wordnet POS (None if not J/V/N/R)
            newlist.append((word, pos_dict.get(tag[0])))
    return newlist
mydata['POS tagged'] = mydata['Cleaned Reviews'].apply(token_stop_pos)
mydata.head()
Explanation: token_stop_pos is the function that takes the text and performs tokenization, removes stopwords, and tags the words to their POS. We applied it to the ‘Cleaned Reviews’ column and created a new column for ‘POS tagged’ data.
As mentioned earlier, to obtain an accurate lemma, the WordNetLemmatizer requires POS tags in the form 'n', 'a', etc. But the tags returned by pos_tag are Penn Treebank tags such as 'NN' and 'JJ'.
To map pos_tag output to wordnet tags, we created the dictionary pos_dict. Any pos_tag that starts with J is mapped to wordnet.ADJ, any pos_tag that starts with R is mapped to wordnet.ADV, and so on.
Our tags of interest are noun, adjective, adverb, and verb; anything outside these four is mapped to None.
Looking at the 'POS tagged' column, each word is now paired with its POS from pos_dict.
Code for Step 5: Obtaining the stem words – Lemmatization
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(pos_data):
    lemma_rew = ""
    for word, pos in pos_data:
        if not pos:
            # No wordnet POS available: keep the word as-is
            lemma = word
        else:
            lemma = wordnet_lemmatizer.lemmatize(word, pos=pos)
        lemma_rew = lemma_rew + " " + lemma
    return lemma_rew
mydata['Lemma'] = mydata['POS tagged'].apply(lemmatize)
mydata.head()
Explanation: lemmatize is a function that takes the (word, pos) tuples produced by token_stop_pos and returns the lemma of each word based on its POS. We applied it to the 'POS tagged' column and created a column 'Lemma' to store the output.
Yay, after a long journey, we are done with the text preprocessing.
Now, take a minute to look at the ‘review’, ‘Lemma’ columns and observe how the text is processed.
Our final data looks clean as we are done with the data preprocessing. Take a short break and return to continue with the real task.
Sentiment Analysis Using TextBlob
TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.
The two measures that are used to analyze the sentiment are:
Polarity – how positive or negative the opinion is
Subjectivity – how subjective (opinion-like) the text is
TextBlob(text).sentiment gives us both values. Polarity ranges from -1 to 1 (-1 is most negative, 0 is neutral, 1 is most positive). Subjectivity ranges from 0 to 1 (0 is very objective and 1 is very subjective).
Example of TextBlob sentiment
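A quick sketch of what TextBlob returns for a single sentence (the exact values come from TextBlob's default PatternAnalyzer lexicon):

from textblob import TextBlob
print(TextBlob('The article was excellent').sentiment)
# Sentiment(polarity=1.0, subjectivity=1.0)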
Python Code
from textblob import TextBlob

# function to calculate subjectivity
def getSubjectivity(review):
    return TextBlob(review).sentiment.subjectivity

# function to calculate polarity
def getPolarity(review):
    return TextBlob(review).sentiment.polarity

# function to label the reviews based on the polarity score
def analysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'
Explanation: Functions were created to obtain polarity and subjectivity values and to label the review based on the polarity score.
We now create a new data frame with the review and Lemma columns and apply the above functions, as sketched below.
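A minimal sketch of that step, assuming the column names used earlier (the 'Analysis' label column name is our own choice here):

fin_data = pd.DataFrame(mydata[['review', 'Lemma']])
fin_data['Subjectivity'] = fin_data['Lemma'].apply(getSubjectivity)
fin_data['Polarity'] = fin_data['Lemma'].apply(getPolarity)
fin_data['Analysis'] = fin_data['Polarity'].apply(analysis)
fin_data.head()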
Sentiment Analysis Using VADER
VADER stands for Valence Aware Dictionary and Sentiment Reasoner. VADER not only tells whether a statement is positive or negative, it also reports the intensity of the emotion.
The pos, neg, and neu intensities sum to 1. Compound ranges from -1 to 1 and is the metric used to draw the overall sentiment:
positive if compound >= 0.5
neutral if -0.5 < compound < 0.5
negative if compound <= -0.5
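A quick sketch of the raw scores VADER returns for one sentence:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
print(SentimentIntensityAnalyzer().polarity_scores('The article was great'))
# a dict of the form {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}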
Python Code
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# function to calculate the vader compound score
def vadersentimentanalysis(review):
    vs = analyzer.polarity_scores(review)
    return vs['compound']

fin_data['Vader Sentiment'] = fin_data['Lemma'].apply(vadersentimentanalysis)
# function to label the reviews based on the compound score
def vader_analysis(compound):
    if compound >= 0.5:
        return 'Positive'
    elif compound <= -0.5:
        return 'Negative'
    else:
        return 'Neutral'

fin_data['Vader Analysis'] = fin_data['Vader Sentiment'].apply(vader_analysis)
fin_data.head()
Explanation: We created functions to obtain the VADER scores and to label the reviews based on the compound scores.
Count the number of positive, negative, and neutral reviews, as sketched below.
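A one-line sketch using pandas:

fin_data['Vader Analysis'].value_counts()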
Sentiment Analysis Using SentiWordNet
SentiWordNet builds on the WordNet database. To use it, we first need the POS tag and lemma of each word; the lemma and POS are then used to look up the word's synonym sets (synsets). We then obtain the positive, negative, and objective scores, either for all possible synsets or just for the first synset, and label the text.
If the positive score > the negative score, the sentiment is positive.
If the positive score < the negative score, the sentiment is negative.
If the positive score = the negative score, the sentiment is neutral.
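A quick sketch of the scores SentiWordNet stores for a single synset (the sentiwordnet corpus must be downloaded once):

nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

breakdown = swn.senti_synset('good.a.01')
print(breakdown.pos_score(), breakdown.neg_score(), breakdown.obj_score())
# the positive, negative, and objective scores always sum to 1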
Python Code
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

def sentiwordnetanalysis(pos_data):
    sentiment = 0
    tokens_count = 0
    for word, pos in pos_data:
        if not pos:
            continue
        lemma = wordnet_lemmatizer.lemmatize(word, pos=pos)
        if not lemma:
            continue
        synsets = wordnet.synsets(lemma, pos=pos)
        if not synsets:
            continue
        # Take the first sense, the most common
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())
        sentiment += swn_synset.pos_score() - swn_synset.neg_score()
        tokens_count += 1
    if not tokens_count:
        return 'Neutral'  # no scorable tokens in this review
    if sentiment > 0:
        return 'Positive'
    elif sentiment == 0:
        return 'Neutral'
    else:
        return 'Negative'

fin_data['SWN analysis'] = mydata['POS tagged'].apply(sentiwordnetanalysis)
fin_data.head()
Explanation: We created a function that, for each word, obtains the positive and negative scores of its first synset, then labels the text by computing the sentiment as the difference between the total positive and negative scores.
Count the number of positive, negative, and neutral reviews, as sketched below.
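Again a one-line sketch using pandas:

fin_data['SWN analysis'].value_counts()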
We have seen the implementation of sentiment analysis using some of the popular lexicon-based techniques. Now, let us quickly do some visualization and compare the results.
Visual representation of the TextBlob, VADER, and SentiWordNet results:
We will plot the count of positive, negative, and neutral reviews for all three techniques, as sketched below.
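A minimal plotting sketch using matplotlib, assuming the label column names used above ('Analysis', 'Vader Analysis', 'SWN analysis'); one pie chart per technique:

import matplotlib.pyplot as plt

tb_counts = fin_data['Analysis'].value_counts()
vader_counts = fin_data['Vader Analysis'].value_counts()
swn_counts = fin_data['SWN analysis'].value_counts()

plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.title('TextBlob results')
plt.pie(tb_counts.values, labels=tb_counts.index, autopct='%1.1f%%')
plt.subplot(1, 3, 2)
plt.title('VADER results')
plt.pie(vader_counts.values, labels=vader_counts.index, autopct='%1.1f%%')
plt.subplot(1, 3, 3)
plt.title('SentiWordNet results')
plt.pie(swn_counts.values, labels=swn_counts.index, autopct='%1.1f%%')
plt.show()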
If we observe the resulting charts, the TextBlob and SentiWordNet results look fairly close, while the VADER results show a large variation.
Conclusion
In conclusion, rule-based sentiment analysis exemplifies how carefully curated lexicons and rules can decipher sentiment from text data without any model training, which is valuable across data science and artificial intelligence. By analyzing product reviews on Amazon or customer feedback, these techniques highlight the significance of understanding positive and negative sentiment for data-driven decision-making.
Beyond the rule-based approach, classifiers such as Naive Bayes and neural networks show how machine learning algorithms tackle sentiment classification, turning unstructured text into meaningful sentiment scores, and can extract more nuanced insights from online reviews and news articles.
Ultimately, sentiment analysis is a powerful tool in the arsenal of data science, enabling a deeper understanding of public opinion across various domains. As technology evolves, integrating machine learning with natural language processing will further enhance our ability to analyze and interpret the vast amounts of text data produced daily.
Key Takeaways
Sentiment Analysis Overview: The blog provides an overview of sentiment analysis, emphasizing its importance in analyzing unstructured data such as social media posts, call transcripts, and reviews for applications like market analysis and customer feedback analysis.
NLP and Sentiment Analysis Libraries: It introduces Natural Language Processing (NLP) and popular Python libraries like NLTK, TextBlob, SpaCy, Gensim, and CoreNLP for sentiment analysis, highlighting their role in understanding and processing natural language.
Rule-Based Sentiment Analysis: The focus is on the Rule-Based Sentiment Analysis, specifically the lexicon-based approach, using tools like TextBlob, VADER, and SentiWordNet. The blog details the data preprocessing steps, including cleaning text, tokenization, POS tagging, stopword removal, and obtaining stem words.
Implementation of Sentiment Analysis: It walks through the implementation of sentiment analysis using TextBlob, VADER, and SentiWordNet. Each method shows the steps of analysis, labeling reviews as positive, negative, or neutral, and counting the occurrences of each sentiment.
Comparison and Visualization: The blog concludes by comparing results from TextBlob, VADER, and SentiWordNet. Visual representations, like pie charts, are used to compare the count of positive, negative, and neutral reviews for each sentiment analysis technique. The conclusion emphasizes the role of sentiment analysis in data science and artificial intelligence.
Frequently Asked Questions
Q1. How do positive sentiment and negative words impact sentiment analysis?
Ans. Positive sentiment and negative words are key in sentiment analysis, helping algorithms classify the overall sentiment of texts like customer reviews. This distinction allows for a nuanced understanding of opinions and emotions expressed in textual data.
Q2. Why is labeled data important for sentiment analysis?
Ans. Labeled data trains sentiment analysis models by providing examples of texts tagged with their corresponding sentiments. This enables the models to learn from actual sentiment expressions, improving their accuracy in identifying sentiments in new, unlabeled texts.
Q3. Can sentiment analysis be applied to movie reviews?
Ans. Yes, sentiment analysis is widely applied to movie reviews to gauge audience sentiment and preferences. It helps filmmakers and marketers understand viewer reception and tailor their strategies accordingly.
Q4. How do embeddings enhance sentiment analysis algorithms?
Ans. Embeddings capture semantic relationships between words, enriching sentiment analysis algorithms with a deeper understanding of language nuances. This leads to more accurate sentiment predictions, especially in complex texts.
Q5. What role do support vector machines (SVMs) play in sentiment analysis?
Ans. Support vector machines (SVMs) classify texts into sentiment categories by finding the hyperplane that best separates the classes in a high-dimensional feature space. SVMs are particularly good at managing high-dimensional data and resisting overfitting, making them reliable for sentiment analysis tasks.