Over time, data has grown exponentially: digitalization in every area generates enormous amounts of data every second. Much of this data is unstructured, i.e., it has no predefined format; this includes images, videos, audio, text, and so on. Among these, text is the most commonly generated type of unstructured data. Everything from a machine's log file to a customer's product review is text. Companies and enterprises use this data to make decisions, such as modifying existing policies, creating new products, or formulating new offers to the community. Although the generated text is human-readable, machines cannot understand it directly. Natural language processing, or NLP, was therefore introduced to make it machine-friendly. This article will teach us how to analyze text to find its sentiment score.
Today, text data is widely used to understand how customers and users feel about products and services. For example, companies use product reviews to understand customers' opinions of their products; based on such analyses, customers can be segmented and targeted decisions made for the unhappy ones. The uses of text are endless, and data scientists worldwide analyze these trends. You can also use star ratings to gauge the sentiment of users' feedback on products or services, but analyzing the text of the feedback itself gives a better picture of the reaction.
To understand the emotion behind a text, we can analyze it with NLP techniques to find patterns. However, repeating that analysis over many texts takes a long time, and new texts keep arriving in the meantime. In such cases, calculating a sentiment score is useful: a positive score indicates a positive text, and a negative score indicates a negative one. Automating this sentiment score calculation helps us analyze texts and make decisions quickly. We will use both our own methods and some pre-defined rule-based methods to calculate sentiment scores in a jiffy.
The first method calculates sentiment scores from positive and negative word counts with normalization. We will classify and count the negative and positive words in each text, then take the ratio of the difference between the positive and negative word counts to the total word count. We will be using the Amazon Cell Phone Reviews dataset from Kaggle.
To calculate the sentiment, here we will use the formula:
Sentiment Score = (Positive Word Count − Negative Word Count) / Total Word Count
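For example, a review whose preprocessed text contains 10 words, 3 of them positive and 1 negative, scores (3 − 1) / 10 = 0.20, while a review with more negative than positive words gets a score below zero.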
First, we will import the required libraries.
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
# One-time NLTK resources, if not already present:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
Now, we will import the dataset. Of all the columns present in this dataset, we will use the ‘body’ column, which contains the cell phone reviews.
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
Next, we will create an instance of WordNetLemmatizer, which we will use in the next step. We will also load the English stop-word list from the NLTK library.
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')
Next, we will define a function to preprocess the text. In this function, we first convert the input string to lowercase. Then, we keep only the letters, using a regular expression. Next, we tokenize the string. After tokenizing, we remove the stop words. Finally, we lemmatize the remaining words and return them.
def text_prep(x: str) -> list:
    # lowercase, keep letters only, tokenize, drop stop words, lemmatize
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize
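As a quick check, we can run the function on a single toy review of our own (the exact tokens depend on NLTK's stop-word list and lemmatizer version):
text_prep("The battery life is amazing, but the camera isn't great!")
# likely -> ['battery', 'life', 'amazing', 'camera', 'great']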
We will apply this function to every text in the ‘body‘ column.
preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag
Next, we will calculate the number of resultant words for each text. This will be helpful later when we calculate the Sentiment Score.
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))
In the next step, we define the generic negative-word and positive-word lists. For this, we use the positive-words.txt and negative-words.txt files, known collectively as the Opinion Lexicon by their authors. The files can be downloaded from here.
# The opinion-lexicon files are latin-1 encoded and begin with a ';'-prefixed header, which we skip
with open('negative-words.txt', encoding='latin-1') as file:
    neg_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]
with open('positive-words.txt', encoding='latin-1') as file:
    pos_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]
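One optional tweak (our own suggestion, not part of the original recipe): membership tests against a Python list scan the whole list, and the Opinion Lexicon has several thousand entries, so converting the lists to sets makes the counting step below much faster without changing the results.
pos_words = set(pos_words)  # O(1) membership tests instead of O(n)
neg_words = set(neg_words)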
Next, we will count the positive and negative words in each text: all the words in preprocess_txt that appear in the positive word list pos_words, and similarly all those that appear in the negative word list neg_words. We will add these counts to the main Pandas DataFrame.
num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
Finally, we will calculate the sentiment. We will create a ‘sentiment‘ column and perform the calculation for each text. Note that a review whose preprocessed text is empty has a total_len of 0; pandas then produces a NaN or infinite value for that row rather than raising an error, so you may want to filter such rows out.
df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
Putting it all together:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
# One-time NLTK resources, if not already present:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

def text_prep(x: str) -> list:
    # lowercase, keep letters only, tokenize, drop stop words, lemmatize
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))

# The opinion-lexicon files are latin-1 encoded and begin with a ';'-prefixed header
with open('negative-words.txt', encoding='latin-1') as file:
    neg_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]
with open('positive-words.txt', encoding='latin-1') as file:
    pos_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg

df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
df.head()
On executing this code, we get the first five rows of the DataFrame, now carrying the preprocess_txt, total_len, pos_count, neg_count, and sentiment columns.
The second method calculates sentiment scores from positive and negative word counts with semi-normalization. Here, the sentiment score is the ratio of the positive word count to the negative word count plus one:
Sentiment Score = Positive Word Count / (Negative Word Count + 1)
Since we take a ratio rather than a difference, the score is always zero or greater, and adding 1 to the denominator avoids a ZeroDivisionError. Let’s start with the implementation. The code is the same as in the previous part up to the negative and positive word counts, except that this time we don’t need the total word count, so we omit that part.
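For example, a review containing 4 positive and 1 negative words scores 4 / (1 + 1) = 2.00, while a review with no positive words scores 0.00 regardless of how many negative words it contains.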
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
# One-time NLTK resources, if not already present:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

def text_prep(x: str) -> list:
    # lowercase, keep letters only, tokenize, drop stop words, lemmatize
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag

# The opinion-lexicon files are latin-1 encoded and begin with a ';'-prefixed header
with open('negative-words.txt', encoding='latin-1') as file:
    neg_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]
with open('positive-words.txt', encoding='latin-1') as file:
    pos_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
Next, we will apply the formula and add the resulting values to the main DataFrame.
df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2)
Putting it all together:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
# One-time NLTK resources, if not already present:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

def text_prep(x: str) -> list:
    # lowercase, keep letters only, tokenize, drop stop words, lemmatize
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag

# The opinion-lexicon files are latin-1 encoded and begin with a ';'-prefixed header
with open('negative-words.txt', encoding='latin-1') as file:
    neg_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]
with open('positive-words.txt', encoding='latin-1') as file:
    pos_words = [w.strip() for w in file if w.strip() and not w.startswith(';')]

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg

df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2)
df.head()
On executing this code, we get the first five rows of the DataFrame with the pos_count, neg_count, and sentiment columns added.
The third method uses the VADER SentimentIntensityAnalyzer to calculate the sentiment score. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool, and NLTK's SentimentIntensityAnalyzer is built on its lexicon. VADER determines whether a text is positive, neutral, or negative, and produces four output scores: positive, negative, neutral, and compound.
Here, we will make use of the compound score. The compound score sums the valence scores of every lexicon word found in the text, adjusts the sum with VADER's heuristic rules, and normalizes it to the range -1 to 1. General text preprocessing before the calculation is not recommended, because VADER deliberately uses signals such as capitalization, punctuation, and negation words to gauge intensity, and stripping them can distort the results. You can also classify each text according to your needs using these VADER scores.
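As a point of reference (this detail comes from the vaderSentiment source code, not from the article's own code): the raw valence sum x is normalized into the -1 to 1 range as x / √(x² + α), with α = 15 in the reference implementation.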
Now, we will use VADER to calculate the sentiment score. Since this process differs from the previous two methods, we will implement it from scratch.
First, we will import the libraries. We will use the NLTK library to import the SentimentIntensityAnalyzer.
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# One-time NLTK resource, if not already present: nltk.download('vader_lexicon')
Next, we will create the instance of SentimentIntensityAnalyzer.
sent = SentimentIntensityAnalyzer()
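As a quick sanity check, we can score a single sentence of our own (exact numbers may vary slightly across VADER versions):
print(sent.polarity_scores("This phone is absolutely great!"))
# prints a dict with 'neg', 'neu', 'pos', and 'compound' keys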
Next, we will read the Cell Phone Review Data, which we have been using previously.
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
Finally, we will calculate the compound polarity score for each text and add it to the main DataFrame.
# str() guards against missing reviews (NaN), which polarity_scores cannot handle
polarity = [round(sent.polarity_scores(str(i))['compound'], 2) for i in df['body']]
df['sentiment_score'] = polarity
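If you also want a categorical label, a common convention (suggested in the VADER authors' documentation, and adjustable to your needs) is to call a compound score of 0.05 or above positive, -0.05 or below negative, and anything in between neutral:
df['label'] = df['sentiment_score'].map(
    lambda c: 'positive' if c >= 0.05 else ('negative' if c <= -0.05 else 'neutral'))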
Putting it all together:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# One-time NLTK resource, if not already present: nltk.download('vader_lexicon')

sent = SentimentIntensityAnalyzer()
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])

# Compound polarity score, rounded to two decimals, for every review
polarity = [round(sent.polarity_scores(str(i))['compound'], 2) for i in df['body']]
df['sentiment_score'] = polarity
df.head()
On executing this code, we get the first five rows of the DataFrame with the new sentiment_score column.
In this article, we learned three different ways of calculating sentiment scores for cell phone reviews. The first two methods compute the score ourselves from word counts, while the third is rule-based and produces the score for us. Each method has its pros and cons, and there are many other ways to calculate a sentiment score; you can even define your own. VADER is more robust and precise with internet slang, which is widely used today. One could try building an even stronger evaluation method covering a broader range of words.
Frequently Asked Questions
Q1. How is sentiment measured?
A. Sentiment is measured by analyzing a text’s positive, negative, or neutral tone using natural language processing (NLP) techniques. This can involve counting positive and negative words or using algorithms like VADER, which assign a sentiment score based on the overall emotional context of the text.
Q2. How do you get a good sentiment score?
A. To get a good sentiment score, preprocess the text data by removing noise (stop words, special characters), apply sentiment analysis algorithms, and use lexicons or machine learning models to accurately capture the tone. Refining models with domain-specific data improves precision in detecting sentiments.
Q3. What is a sentiment ratio?
A. A sentiment ratio is the proportion of positive words to negative words within a text. It helps quantify the balance of emotions expressed: a higher ratio indicates a more positive sentiment, while a lower ratio reflects a more negative or neutral sentiment in the analyzed text.
Q4. What is a sentiment value?
A. A sentiment value is a numerical representation of the sentiment expressed in a text. It typically ranges from -1 to 1, where negative values indicate negative sentiment, positive values indicate positive sentiment, and values close to 0 reflect neutral sentiment.