Sentiment analysis is a powerful application of Natural Language Processing (NLP) that identifies the emotional tone of text. By classifying text as positive, negative, or neutral, it serves industries ranging from social media monitoring to market research. This article demonstrates how to perform sentiment analysis on IMDB movie reviews using Long Short-Term Memory (LSTM) networks.
This article was published as a part of the Data Science Blogathon.
Sentiment Analysis is an NLP application that identifies a text corpus’s emotional or sentimental tone or opinion. Usually, emotions or attitudes toward a topic can be positive, negative, or neutral. This makes sentiment analysis a text classification task. Examples of positive, negative, and neutral expressions are:
“I enjoyed the movie!” – Positive
“I am not sure if I liked the movie.” – Neutral
“It was the most terrible movie I have ever seen.” – Negative
Sentiment analysis is a potent tool with varied applications across industries. It is helpful for social media and brand monitoring, customer support and feedback analysis, market research, etc. By performing sentiment analysis on initial customer feedback, you can identify a new product’s target audience or demographics and evaluate the success of a marketing campaign. As sentiment analysis grows increasingly useful in the industry, we must learn how to perform it.
Recurrent neural networks (RNNs) are a class of artificial neural networks that process sequences of arbitrary length by carrying a hidden state from one time step to the next. However, because of the vanishing gradient problem, standard RNNs struggle to learn long-term dependencies in lengthy sequences. Researchers proposed several RNN variants, notably LSTM, to address this issue. LSTM networks extend RNNs with a gated cell state, which lets them learn sequential (temporal) data and its long-term dependencies more reliably than standard RNNs. They are commonly used in deep learning applications such as stock forecasting, speech recognition, and natural language processing.
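To make the gating idea concrete, here is a minimal sketch of a single LSTM time step with scalar input and state, written in plain Python. The weights are hand-picked, hypothetical values (not learned ones), chosen so that the forget gate stays open and the input gate stays shut. Under that assumption, the value stored in the cell state should survive many time steps almost unchanged, which is the behavior that lets LSTMs hold on to long-term information.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for scalar input and scalar state.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        wx, wh, b = w[name]
        return activation(wx * x + wh * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # additive cell-state update
    h = o * math.tanh(c)               # hidden state passed to the next step
    return h, c

# Hypothetical weights: forget gate biased open, input gate biased shut
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h, c = 0.0, 1.0                        # start with a value stored in the cell
for x in [0.5, -0.3, 0.8] * 10:        # 30 time steps of arbitrary input
    h, c = lstm_step(x, h, c, weights)
print("Cell state after 30 steps:", c)
```

In a trained network the gate values depend on the input and the previous hidden state, so the model learns when to keep, overwrite, or expose what it has stored.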
We will analyze sentiment in 50k IMDB movie reviews, comprising 25k positive and 25k negative reviews, ensuring a balanced dataset. You can download the dataset from here. We start by importing the necessary packages for text manipulation and model building.
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import math
import nltk
We load the dataset into a pandas dataframe with the help of the following code:
data = pd.read_csv('IMDB Dataset.csv')
data
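The dataframe has two columns: review, which holds the raw review text, and sentiment, which holds the label ('positive' or 'negative').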
The first step in sentiment analysis with LSTM is to remove HTML tags, URLs, and non-alphanumeric characters from the reviews. We do that with the remove_tags function, which relies on regular expressions for string manipulation.
def remove_tags(string):
    removelist = ""  # extra characters to keep, if any
    result = re.sub(r'<.*?>', '', string)  # remove HTML tags
    result = re.sub(r'https?://\S+', '', result)  # remove URLs
    result = re.sub(r'[^\w' + removelist + ']', ' ', result)  # replace non-alphanumeric characters with spaces
    result = result.lower()
    return result
data['review'] = data['review'].apply(remove_tags)
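As a quick sanity check, the sketch below applies the same three cleaning steps to a made-up review string. The function name clean_review, the sample text, and the exact patterns (`<.*?>` for tags, `https?://\S+` for URLs, `[^\w]` for everything that is not a word character) are assumptions chosen for illustration, so treat it as one reasonable way to write the step rather than the only one.

```python
import re

def clean_review(text):
    text = re.sub(r'<.*?>', '', text)          # drop HTML tags such as <br />
    text = re.sub(r'https?://\S+', '', text)   # drop URLs up to the next whitespace
    text = re.sub(r'[^\w]', ' ', text)         # replace punctuation with spaces
    return text.lower()

sample = 'Great movie!<br /><br />See https://example.com for more.'
cleaned = clean_review(sample)
print(cleaned.split())
```

Splitting the result on whitespace is a convenient way to inspect it, because the punctuation replacement can leave runs of spaces behind.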
We also need to remove stopwords from the corpus. Commonly used words like ‘and’, ‘the’, and ‘at’ are stopwords that do not add any special meaning or significance to a sentence. NLTK provides a list of stopwords, and you can remove them from the corpus using the following code:
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
data['review'] = data['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
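The filtering step can be tried out without downloading anything. The sketch below uses a small hand-picked stopword set as a stand-in for NLTK's English list; both the set and the helper name drop_stopwords are hypothetical.

```python
# Small hand-picked stopword set standing in for NLTK's English list
toy_stop_words = {"the", "and", "at", "was", "a", "of"}

def drop_stopwords(text, stop_words):
    # keep only the tokens that are not in the stopword set
    return " ".join(w for w in text.split() if w not in stop_words)

filtered = drop_stopwords("the acting was great and the plot was thin", toy_stop_words)
print(filtered)
```

One thing worth checking on real data: NLTK's English list includes negations such as "not" and "no", so removing every stopword can flip the apparent meaning of a phrase like "not good". Whether that matters for a given model is an empirical question.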
We now perform lemmatization on the text. Lemmatization is a useful technique in NLP to obtain the root form of words, known as lemmas. For example, the words “reading,” “reads,” and “read” all reduce to the lemma “read.” Collapsing inflected forms into a single token shrinks the vocabulary the model has to learn while preserving most of the meaning. We perform lemmatization using the WordNetLemmatizer() from nltk. The text is first broken into words using the WhitespaceTokenizer() from nltk. We write a function lemmatize_text to perform lemmatization on the individual tokens. Note that lemmatize() treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as “reading” are left unchanged by the call below.
nltk.download('wordnet')  # WordNet data required by WordNetLemmatizer
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    # lemmatize each whitespace-separated token and rejoin with single spaces
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))
data['review'] = data.review.apply(lemmatize_text)
data
The processed data for the LSTM model for sentiment analysis now consists of lowercase, lemmatized reviews with tags, URLs, punctuation, and stopwords removed.
The next step in sentiment analysis with LSTM is to print some basic statistics about the dataset and check if it has an equal number of all labels to ensure balance. Ideally, a balanced dataset is preferable, as a severely imbalanced dataset can be challenging to model and require specialized techniques.
s = 0.0
for i in data['review']:
    word_list = i.split()
    s = s + len(word_list)
print("Average length of each review : ", s/data.shape[0])
pos = 0
for i in range(data.shape[0]):
    if data.iloc[i]['sentiment'] == 'positive':
        pos = pos + 1
neg = data.shape[0] - pos
print("Percentage of reviews with positive sentiment is " + str(pos/data.shape[0]*100) + "%")
print("Percentage of reviews with negative sentiment is " + str(neg/data.shape[0]*100) + "%")
>>Average length of each review : 119.57112
>>Percentage of reviews with positive sentiment is 50.0%
>>Percentage of reviews with negative sentiment is 50.0%
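The average length alone does not say how many reviews would be cut off at a given maximum sequence length. A minimal sketch of that check follows; the helper name coverage and the toy length values are made up for illustration, and on the real data the lengths would come from counting the tokens in each review.

```python
def coverage(lengths, max_length):
    """Fraction of reviews that fit within max_length tokens without truncation."""
    return sum(1 for n in lengths if n <= max_length) / len(lengths)

# Made-up review lengths (in tokens) for illustration only
toy_lengths = [40, 85, 120, 150, 190, 230, 310, 95, 60, 175]

for candidate in (100, 200, 300):
    print(candidate, coverage(toy_lengths, candidate))
```

A maximum length that covers most reviews keeps truncation rare without padding every sequence out to the longest outlier.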
In this step of sentiment analysis using LSTM, we use the LabelEncoder() from sklearn.preprocessing to convert the labels (‘positive’ and ‘negative’) into 1s and 0s, respectively.
reviews = data['review'].values
labels = data['sentiment'].values
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
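LabelEncoder assigns integer ids to the class names in sorted order, which is why 'negative' becomes 0 and 'positive' becomes 1. The plain-Python sketch below mimics that mapping on a toy label list; it illustrates the idea and is not scikit-learn's implementation.

```python
toy_labels = ["positive", "negative", "negative", "positive"]

classes = sorted(set(toy_labels))                     # ['negative', 'positive']
class_to_id = {name: i for i, name in enumerate(classes)}
toy_encoded = [class_to_id[name] for name in toy_labels]

print(class_to_id)
print(toy_encoded)
```

With the real encoder, encoder.classes_ holds the same sorted class names, so it can be used to confirm which integer corresponds to which sentiment.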
Finally, we split the dataset into train and test parts using train_test_split from sklearn.model_selection. We use 80% of the dataset for training and 20% for testing, and stratify on the labels so that both parts stay balanced.
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels, test_size=0.2, stratify=encoded_labels)
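To see what stratification does, here is a small pure-Python sketch that splits each class separately so the held-out part keeps the same class ratio as the whole dataset. The function stratified_split and its seed parameter are hypothetical and only approximate the idea behind the stratify argument. Keep in mind that train_test_split holds out 25% of the data unless test_size is set explicitly.

```python
import random

def stratified_split(items, labels, test_size=0.2, seed=0):
    rng = random.Random(seed)
    # group the items by label so each class is split on its own
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test += [(x, label) for x in group[:n_test]]
        train += [(x, label) for x in group[n_test:]]
    return train, test

toy_items = list(range(100))
toy_item_labels = [1] * 50 + [0] * 50
toy_train, toy_test = stratified_split(toy_items, toy_item_labels)

print(len(toy_train), len(toy_test))
print(sum(1 for _, y in toy_test if y == 1), "positive examples in the test part")
```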
Before feeding the data into the LSTM model for sentiment analysis, we must tokenize and pad it.
# Hyperparameters of the model
vocab_size = 3000 # choose based on statistics
oov_tok = '<OOV>' # placeholder token for out-of-vocabulary words
embedding_dim = 100
max_length = 200 # choose based on statistics, for example 150 to 200
padding_type = 'post'
trunc_type = 'post'
# tokenize sentences
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
# convert test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
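The following plain-Python sketch approximates what the tokenizer and the padding step do, under a few stated assumptions: index 0 is reserved for padding, index 1 for an out-of-vocabulary token (assumed here to be named `<OOV>`), and the remaining words are numbered by descending frequency. The helper names are hypothetical, and the code mimics the behavior on whitespace-split text rather than reproducing the Keras implementation.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    counts = Counter(w for t in texts for w in t.split())
    # index 0 is reserved for padding, index 1 for the OOV token
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    oov_id = word_index[oov_token]
    seq = []
    for w in text.split():
        idx = word_index.get(w, oov_id)
        # words ranked at or beyond num_words are treated as out-of-vocabulary
        seq.append(idx if idx < num_words else oov_id)
    return seq

def pad_post(seq, max_length):
    seq = seq[:max_length]                       # 'post' truncation
    return seq + [0] * (max_length - len(seq))   # 'post' padding

toy_texts = ["good movie", "good acting", "bad movie"]
toy_index = build_word_index(toy_texts)
toy_seq = to_sequence("good plot bad", toy_index, num_words=4)
print(toy_seq)
print(pad_post(toy_seq, 5))
```

Fitting the word index on the training sentences only, as the article does, keeps information from the test set out of the vocabulary.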
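The next step in sentiment analysis using LSTM is to build a Keras sequential model. It is a linear stack of the following layers:
Embedding: Maps each of the 3,000 vocabulary indices to a 100-dimensional vector.
Bidirectional LSTM: Reads the sequence in both directions with 64 units per direction.
Dense (24 units, ReLU): A fully connected hidden layer.
Dense (1 unit, sigmoid): Outputs the probability that the review is positive.
The code for building the model: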
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# model summary
model.summary()
We compile the LSTM model for sentiment analysis with binary cross-entropy loss and the Adam optimizer, given that we have a binary classification problem. Binary cross-entropy compares the predicted probabilities to the actual class labels (0 or 1), and Adam is an adaptive variant of stochastic gradient descent that is widely used to train deep learning models. We use accuracy as the primary performance metric. Calling model.summary() prints each layer with its output shape and parameter count.
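As a rough cross-check on the model summary, the parameter counts can be worked out by hand from the layer sizes used above. The sketch assumes the standard LSTM parameterization, in which each of the four gates has input weights, recurrent weights, and a single bias vector.

```python
vocab_size, embedding_dim = 3000, 100
lstm_units, dense_units = 64, 24

# one embedding vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# 4 gates, each with input weights, recurrent weights, and a bias
one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * one_direction            # forward and backward LSTMs

# the bidirectional layer concatenates both directions: 2 * 64 = 128 inputs
dense_hidden_params = (2 * lstm_units) * dense_units + dense_units
dense_output_params = dense_units * 1 + 1

total_params = (embedding_params + bilstm_params
                + dense_hidden_params + dense_output_params)

print("Embedding:", embedding_params)
print("Bidirectional LSTM:", bilstm_params)
print("Dense (24):", dense_hidden_params)
print("Dense (1):", dense_output_params)
print("Total:", total_params)
```

Most of the parameters sit in the embedding layer, so vocab_size and embedding_dim are the first knobs to consider when the model needs to be smaller.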
Now, let us train the sentiment analysis model using LSTM for five epochs.
num_epochs = 5
history = model.fit(train_padded, train_labels,
                    epochs=num_epochs, verbose=1,
                    validation_split=0.1)
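The loss value reported during training is binary cross-entropy. A minimal sketch of how it is computed for a handful of predictions is shown below; the function is a plain-Python illustration with a small clipping constant (eps, an assumed value) to avoid taking the logarithm of zero, and it is not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# the same two labels, predicted confidently right and confidently wrong
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print("Confident and right:", confident_right)
print("Confident and wrong:", confident_wrong)
```

The loss grows sharply when the model is confident and wrong, which is what pushes the predicted probabilities toward the correct labels during training.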
We evaluate the LSTM model for sentiment analysis by calculating its accuracy. We determine classification accuracy by dividing the number of correct predictions by the total number of predictions.
prediction = model.predict(test_padded)
# Get labels based on probability 1 if p>= 0.5 else 0
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
print("Accuracy of prediction on test set : ", accuracy_score(test_labels, pred_labels))
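The thresholding and accuracy calculation can be checked by hand on a few made-up probabilities. The helper names and the numbers below are hypothetical; the logic is the same "1 if p >= 0.5 else 0" rule followed by the ratio of correct predictions to total predictions.

```python
def to_labels(probabilities, threshold=0.5):
    # a probability at or above the threshold counts as the positive class
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_probs = [0.91, 0.12, 0.50, 0.49, 0.77]
toy_truth = [1, 0, 1, 1, 0]
toy_preds = to_labels(toy_probs)

print(toy_preds)
print(accuracy(toy_truth, toy_preds))
```

Because the dataset is balanced, accuracy is a reasonable headline metric here; on skewed data, precision and recall from classification_report would be more informative.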
The prediction accuracy on the test set is 87.27%! You can improve the accuracy further by tuning the model hyperparameters, adjusting the model architecture, or changing the train-test split ratio. You could also train the model for a larger number of epochs; we stopped at five because of the computational time. Ideally, training would continue until the training and validation losses converge.
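One common way to decide when to stop training is to watch the validation loss and stop once it has not improved for a few epochs. The sketch below shows that patience-based rule in plain Python on made-up loss values; the function name and the numbers are hypothetical. Keras offers the same idea through the keras.callbacks.EarlyStopping callback, which takes monitor and patience arguments.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0     # new best loss: reset the counter
        else:
            waited += 1                # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

# Made-up validation losses: improvement stalls after epoch 3
toy_val_losses = [0.52, 0.41, 0.36, 0.37, 0.39, 0.45]
print(stopping_epoch(toy_val_losses))
```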
We can use our trained LSTM model for sentiment analysis to determine the sentiment of new, unseen movie reviews that are not present in the dataset. Before feeding each new text to the model, you must tokenize and pad it in the same way as the training data. The model.predict() function returns the probability that a review is positive. If the probability is at least 0.5, we consider the review positive; otherwise, we consider it negative.
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming",
            "I have never seen a terrible movie like this",
            "the movie plot is terrible but it had good acting"]
# convert to a sequence
sequences = tokenizer.texts_to_sequences(sentence)
# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)
# Get labels based on probability 1 if p>= 0.5 else 0
prediction = model.predict(padded)
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
for i in range(len(sentence)):
    print(sentence[i])
    if pred_labels[i] == 1:
        s = 'Positive'
    else:
        s = 'Negative'
    print("Predicted sentiment : ", s)
The output looks very promising!
We demonstrated how to perform sentiment analysis with Long Short-Term Memory (LSTM) networks on IMDB movie reviews. LSTM networks are Recurrent Neural Networks (RNNs) adept at handling sequential data and capturing long-term dependencies. Sentiment analysis, combined with LSTM networks, provides a powerful framework for understanding and leveraging the emotional tones in textual data. This capability is invaluable for making data-driven decisions in business and research contexts.
Q. What is sentiment analysis?
A. Sentiment analysis is a technique used to determine the emotional tone behind a body of text. It analyzes whether the sentiment expressed is positive, negative, or neutral. It is commonly used to analyze social media posts, reviews, and feedback in order to gauge public opinion and improve customer experience.
Q. What are the main types of sentiment analysis?
A. The three main types of sentiment analysis are:
Fine-grained Sentiment Analysis: Classifies emotions as very positive, positive, neutral, negative, or very negative.
Aspect-based Sentiment Analysis: Identifies sentiment toward specific aspects of a subject.
Intent-based Sentiment Analysis: Detects whether the sentiment is an opinion, inquiry, command, or suggestion.
Q. What is LSTM?
A. Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) designed to remember information for extended periods, which traditional RNNs struggle with. LSTM excels in tasks where understanding sequences or temporal dependencies is essential, such as language processing, which makes it widely used in speech recognition and text analysis.
Q. How is LSTM used in NLP?
A. In NLP, LSTM networks capture long-range dependencies in text, making them well suited to processing sequential data like sentences and paragraphs. LSTM models can remember earlier words to improve the accuracy of language modeling, sentiment analysis, machine translation, and text generation tasks, enhancing contextual understanding.
Q. What are LSTM models used for?
A. LSTM models are primarily used for tasks involving sequence prediction and data with temporal dependencies. They are commonly applied in natural language processing, speech recognition, time series forecasting, and video analysis, where retaining past information over long sequences is crucial for accurate predictions.
This is a great article, but a few updates are needed: 1) the pad_sequences import has changed from "from keras.preprocessing.sequence import pad_sequences" to "from tensorflow.keras.preprocessing.sequence import pad_sequences"; 2) the regular expression that removes non-alphanumeric characters needs to change from "result = re.sub(r'[^w'+removelist+']', ' ',result)" to "result = re.sub(r'[^ \w'+removelist+']', ' ', result)".