I work on different Natural Language Processing (NLP) problems (the perks of being a data scientist!). Each NLP problem is a unique challenge in its own way. That’s just a reflection of how complex, beautiful and wonderful the human language is.
But one thing that has always been a thorn in an NLP practitioner's side is the inability of machines to understand the true meaning of a sentence. Yes, I'm talking about context. Traditional NLP techniques and frameworks were great when asked to perform basic tasks, but things quickly went south when we tried to add context to the situation.
The NLP landscape has significantly changed in the last 18 months or so. NLP frameworks like Google’s BERT and Zalando’s Flair, along with the ELMo model, are able to parse through sentences and grasp the context in which they were written.
In this article, you will learn what ELMo embeddings are, how they differ from models like BERT, and how to use them for text classification in Python. Along the way, we'll touch on the original "Deep contextualized word representations" paper to understand why these dynamic, context-aware embeddings work so well in natural language processing tasks.
One of the biggest breakthroughs in this regard came thanks to ELMo, a state-of-the-art NLP framework developed by the Allen Institute for AI (the team behind AllenNLP). By the time you finish this article, you too will have become a big ELMo fan, just as I did.
In this article, we will explore ELMo (Embeddings from Language Models) and use it to build a mind-blowing NLP model using Python on a real-world dataset.
Note: This article assumes you are familiar with the different types of word embeddings and LSTM architecture. You can refer to the below articles to learn more about the topics:
No, the ELMo we are referring to isn’t the character from Sesame Street! A classic example of the importance of context.
ELMo is a way of representing words as deep contextualized vectors, often referred to as ELMo embeddings. These word embeddings help achieve state-of-the-art (SOTA) results in several NLP tasks:
NLP scientists globally have started using ELMo for various NLP tasks, both in research as well as the industry. You must check out the original ELMo research paper.
I don't usually ask people to read research papers because they can often come across as heavy and complex, but I'm making an exception for ELMo. This one is a really cool explanation of how ELMo was designed.
Let’s get an intuition of how ELMo works underneath before we implement it in Python. Why is this important?
Well, picture this. You've successfully copied the ELMo code from GitHub into Python and managed to build a model on your custom text data. You get average results, so you need to improve the model. How will you do that if you don't understand the architecture of ELMo? What parameters will you tweak if you haven't studied it?
This line of thought applies to all machine learning algorithms. You need not get into their derivations but you should always know enough to play around with them and improve your model.
Now, let’s come back to how ELMo works.
As I mentioned earlier, ELMo word vectors are computed on top of a two-layer bidirectional language model (biLM). Each of the two stacked layers makes two passes over the input: a forward pass and a backward pass:
As the input to the biLM is computed from characters rather than words, it captures the inner structure of the word. For example, the biLM will be able to figure out that terms like beauty and beautiful are related at some level without even looking at the context they often appear in. Sounds incredible!
Unlike traditional word embeddings such as word2vec and GloVe, the ELMo vector assigned to a token or word is actually a function of the entire sentence containing that word. Therefore, the same word can have different word vectors under different contexts.
I can imagine you asking – how does knowing that help me deal with NLP problems? Let me explain this using an example.
Suppose we have a couple of sentences:

1. I read the book yesterday.
2. Can you read the letter now?
Take a moment to ponder the difference between these two. The verb “read” in the first sentence is in the past tense. And the same verb transforms into present tense in the second sentence. This is a case of Polysemy wherein a word could have multiple meanings or senses.
Language is such a wonderfully complex thing.
Traditional word embeddings come up with the same vector for the word “read” in both the sentences. Hence, the system would fail to distinguish between the polysemous words. These word embeddings just cannot grasp the context in which the word was used.
ELMo word vectors successfully address this issue. ELMo word representations take the entire input sentence into account when calculating each word's embedding. Hence, the term “read” would have different ELMo vectors under different contexts (we will verify this empirically once we have loaded ELMo later in the article). For a detailed comparison of ELMo with models like GPT, it is worth referring to articles that specifically address the differences between these NLP models.
And now the moment you have been waiting for – implementing ELMo in Python! Let’s take this step-by-step.
The first step towards dealing with any data science challenge is defining the problem statement. It forms the base for our future actions.
For this article, we already have the problem statement in hand:
Sentiment analysis remains one of the key problems that has seen extensive application of natural language processing (NLP). This time around, given the tweets from customers about various tech firms who manufacture and sell mobiles, computers, laptops, etc., the task is to identify if the tweets have a negative sentiment towards such companies or products.
It is clearly a binary text classification task wherein we have to predict the sentiments from the extracted tweets.
Here’s a breakdown of the dataset we have:
You can download the dataset from this page. Note that you will have to register or sign-in to do so.
Caution: Most profane and vulgar terms in the tweets have been replaced with “$&@*#”. However, please note that the dataset might still contain text that could be considered profane, vulgar, or offensive.
Alright, let’s fire up our favorite Python IDE and get coding!
Import the libraries we’ll be using throughout our notebook:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle
pd.set_option('display.max_colwidth', 200)
# read data
train = pd.read_csv("train_2kmZucJ.csv")
test = pd.read_csv("test_oJQbWVk.csv")
train.shape, test.shape
Output: ((7920, 3), (1953, 2))
The train set has 7,920 tweets while the test set has only 1,953. Now let’s check the class distribution in the train set:
train['label'].value_counts(normalize = True)
Output:
0 0.744192
1 0.255808
Name: label, dtype: float64
Here, 1 represents a negative tweet while 0 represents a non-negative tweet.
Let’s take a quick look at the first 5 rows in our train set:
Python Code:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# adjust the filename to match your local copy of the training data
train = pd.read_csv("train.csv")
print(train.shape)
print("======")
print(train.head())

# plot the class distribution of the target variable
sns.countplot(x=train['label'])
plt.show()
We have three columns to work with. The column ‘tweet’ is the independent variable while the column ‘label’ is the target variable.
We would have a clean and structured dataset to work with in an ideal world. But things are not that simple in NLP (yet).
We need to spend a significant amount of time cleaning the data to make it ready for the model-building stage. Cleaner text makes feature extraction easier and the resulting features more informative. You'll see a meaningful improvement in your model's performance as your data quality improves.
So let’s clean the text we’ve been given and explore it.
There seem to be quite a few URL links in the tweets. They are not telling us much (if anything) about the sentiment of the tweet so let’s remove them.
# remove URL's from train and test
train['clean_tweet'] = train['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
test['clean_tweet'] = test['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
We have used Regular Expressions (or RegEx) to remove the URLs.
Note: You can learn more about Regex in this article.
We’ll go ahead and do some routine text cleaning now.
# remove punctuation marks
punctuation = '!"#$%&()*+-/:;<=>?@[\\]^_`{|}~'
train['clean_tweet'] = train['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in set(punctuation)))
test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in set(punctuation)))
# convert text to lowercase
train['clean_tweet'] = train['clean_tweet'].str.lower()
test['clean_tweet'] = test['clean_tweet'].str.lower()
# remove numbers
train['clean_tweet'] = train['clean_tweet'].str.replace("[0-9]", " ", regex=True)
test['clean_tweet'] = test['clean_tweet'].str.replace("[0-9]", " ", regex=True)
# remove whitespaces
train['clean_tweet'] = train['clean_tweet'].apply(lambda x:' '.join(x.split()))
test['clean_tweet'] = test['clean_tweet'].apply(lambda x: ' '.join(x.split()))
I'd also like to normalize the text, aka perform text normalization. This helps in reducing a word to its base form. For example, the base form of the words 'produces', 'produced', and 'producing' is 'produce'. It happens quite often that the different forms of a word are not really that important, and we only need to know its base form.
We will lemmatize (normalize) the text by leveraging the popular spaCy library.
# import spaCy's language model
# (on newer spaCy versions, load 'en_core_web_sm' instead of the 'en' shortcut)
nlp = spacy.load('en', disable=['parser', 'ner'])

# function to lemmatize text
def lemmatization(texts):
    output = []
    for i in texts:
        # replace every token with its lemma and stitch the sentence back together
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output
Lemmatize tweets in both the train and test sets:
train['clean_tweet'] = lemmatization(train['clean_tweet'])
test['clean_tweet'] = lemmatization(test['clean_tweet'])
Let’s have a quick look at the original tweets vs our cleaned ones:
train.sample(10)
Check out the above columns closely. The tweets in the ‘clean_tweet’ column appear to be much more legible than the original tweets.
However, I feel there is still plenty of scope for cleaning the text. I encourage you to explore the data as much as you can and find more insights or irregularities in the text.
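For instance, one thing worth experimenting with is removing common English stopwords. Below is a minimal sketch of such a pass; the remove_stopwords helper and the tweet_no_stop column are my own additions, not part of the original pipeline, and whether dropping stopwords actually helps a contextual model like ELMo (which benefits from seeing the full sentence) is something you would need to validate empirically.

from spacy.lang.en.stop_words import STOP_WORDS

# optional extra cleaning pass: drop common English stopwords from the lemmatized tweets
def remove_stopwords(texts):
    return [' '.join(word for word in t.split() if word not in STOP_WORDS) for t in texts]

# stored in a separate column so the original pipeline stays untouched
train['tweet_no_stop'] = remove_stopwords(train['clean_tweet'])
test['tweet_no_stop'] = remove_stopwords(test['clean_tweet'])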
Wait, what does TensorFlow have to do with our tutorial?
TensorFlow Hub is a library that enables transfer learning by allowing the use of many machine learning models for different tasks. ELMo is one such example. That’s why we will access ELMo via TensorFlow Hub in our implementation.
Before we do anything else though, we need to install TensorFlow Hub. You must install or upgrade your TensorFlow package to at least 1.7 to use TensorFlow Hub:
$ pip install "tensorflow>=1.7.0"
$ pip install tensorflow-hub
We will now import the pretrained ELMo model. A note of caution: the model is over 350 MB in size, so it might take you a while to download it.
import tensorflow_hub as hub
import tensorflow as tf
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
I will first show you how we can get ELMo vectors for a sentence. All you have to do is pass a list of string(s) to the elmo object.
# just a random sentence
x = ["Roasted ants are a popular snack in Columbia"]
# Extract ELMo features
embeddings = elmo(x, signature="default", as_dict=True)["elmo"]
embeddings.shape
Output:
TensorShape([Dimension(1), Dimension(8), Dimension(1024)])
The output is a 3-dimensional tensor of shape (1, 8, 1024): the first dimension is the number of input strings (1 sentence here), the second is the maximum number of tokens among the inputs (our sentence has 8 words), and the third is the length of the ELMo vector.
Hence, every word in the input sentence has an ELMo vector of size 1024.
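To see the context sensitivity discussed earlier in action, here is a minimal sketch that compares the vectors ELMo assigns to the word “read” in a past-tense and a present-tense sentence; the sentence pair and the token positions are hard-coded purely for illustration:

from scipy.spatial.distance import cosine

# the same pair of example sentences discussed earlier
sentences = ["I read the book yesterday", "Can you read the letter now"]
embeddings = elmo(sentences, signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    vectors = sess.run(embeddings)

# "read" is the 2nd token of the first sentence and the 3rd token of the second
read_past = vectors[0, 1, :]
read_present = vectors[1, 2, :]

# a cosine similarity noticeably below 1 shows the two occurrences get different vectors
print("cosine similarity:", 1 - cosine(read_past, read_present))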
Let’s go ahead and extract ELMo vectors for the cleaned tweets in the train and test datasets. However, to arrive at the vector representation of an entire tweet, we will take the mean of the ELMo vectors of constituent terms or tokens of the tweet.
Let’s define a function for doing this:
def elmo_vectors(x):
    # contextual ELMo representation for every token in every tweet
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return the average of the ELMo features across tokens,
        # i.e. one 1024-dimensional vector per tweet
        return sess.run(tf.reduce_mean(embeddings, 1))
You might run out of computational resources (memory) if you use the above function to extract embeddings for the tweets in one go. As a workaround, split both train and test set into batches of 100 samples each. Then, pass these batches sequentially to the function elmo_vectors( ).
I will keep these batches in a list:
list_train = [train[i:i+100] for i in range(0,train.shape[0],100)]
list_test = [test[i:i+100] for i in range(0,test.shape[0],100)]
Now, we will iterate through these batches and extract the ELMo vectors. Let me warn you, this will take a long time.
# Extract ELMo embeddings
elmo_train = [elmo_vectors(x['clean_tweet']) for x in list_train]
elmo_test = [elmo_vectors(x['clean_tweet']) for x in list_test]
Once we have all the vectors, we can concatenate them back to a single array:
elmo_train_new = np.concatenate(elmo_train, axis = 0)
elmo_test_new = np.concatenate(elmo_test, axis = 0)
I would advise you to save these arrays, as it took us a long time to get the ELMo vectors for them. We will save them as pickle files:
# save elmo_train_new
pickle_out = open("elmo_train_03032019.pickle","wb")
pickle.dump(elmo_train_new, pickle_out)
pickle_out.close()
# save elmo_test_new
pickle_out = open("elmo_test_03032019.pickle","wb")
pickle.dump(elmo_test_new, pickle_out)
pickle_out.close()
Use the following code to load them back:
# load elmo_train_new
pickle_in = open("elmo_train_03032019.pickle", "rb")
elmo_train_new = pickle.load(pickle_in)
# load elmo_test_new
pickle_in = open("elmo_test_03032019.pickle", "rb")
elmo_test_new = pickle.load(pickle_in)
Let’s build our NLP model with ELMo!
We will use the ELMo vectors of the train dataset to build a classification model. Then, we will use the model to make predictions on the test set. But before all of that, we will split elmo_train_new into training and validation sets to evaluate our model prior to the testing phase.
from sklearn.model_selection import train_test_split
xtrain, xvalid, ytrain, yvalid = train_test_split(elmo_train_new,
train['label'],
random_state=42,
test_size=0.2)
Since our objective is to set a baseline score, we will build a simple logistic regression model using ELMo vectors as features:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
lreg = LogisticRegression()
lreg.fit(xtrain, ytrain)
Prediction time! First, on the validation set:
preds_valid = lreg.predict(xvalid)
We will evaluate our model by the F1 score metric since this is the official evaluation metric of the contest.
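For reference, the F1 score is the harmonic mean of precision and recall. Here is a quick sketch of computing it by hand on the validation predictions; it should match the f1_score value reported just below (for the positive class):

from sklearn.metrics import precision_score, recall_score

precision = precision_score(yvalid, preds_valid)
recall = recall_score(yvalid, preds_valid)

# F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall
print(2 * precision * recall / (precision + recall))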
f1_score(yvalid, preds_valid)
Output: 0.789976
The F1 score on the validation set is pretty impressive. Now let’s proceed and make predictions on the test set:
# make predictions on test set
preds_test = lreg.predict(elmo_test_new)
Prepare the submission file which we will upload on the contest page:
# prepare submission dataframe
sub = pd.DataFrame({'id':test['id'], 'label':preds_test})
# write predictions to a CSV file
sub.to_csv("sub_lreg.csv", index=False)
These predictions give us a score of 0.875672 on the public leaderboard. That is frankly pretty impressive given that we only did fairly basic text preprocessing and used a very simple model. Imagine what the score could be with more advanced techniques. Try them out on your end and let me know the results!
We just saw firsthand how effective ELMo can be for text classification. Coupled with a more sophisticated model, it would likely give even better results. And the application of ELMo is not limited to text classification; you can use it whenever you have to vectorize text data.
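As a small illustration, here is a sketch of a semantic-similarity lookup over the training tweets, reusing the elmo_vectors() helper and the elmo_train_new array from above; the query sentence is made up for the example:

from sklearn.metrics.pairwise import cosine_similarity

# a made-up query sentence, vectorized with the same helper used for the tweets
query = pd.Series(["my new phone battery drains way too fast"])
query_vec = elmo_vectors(query)

# rank training tweets by cosine similarity to the query and show the top 5
scores = cosine_similarity(query_vec, elmo_train_new)[0]
top_idx = scores.argsort()[::-1][:5]
print(train['clean_tweet'].iloc[top_idx])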
Below are a few more NLP tasks where we can utilize ELMo:
ELMo is undoubtedly a significant step forward in NLP and is here to stay. Given the sheer pace at which research in NLP is progressing, other new state-of-the-art word embeddings have also emerged in the last few months, like Google's BERT and Zalando's Flair. Exciting times ahead for NLP practitioners!
I strongly encourage you to use ELMo on other datasets and experience the performance boost yourself. If you have any questions or want to share your experience with me and the community, please do so in the comments section below. You should also check out the below NLP related resources if you’re starting out in this field:
Q. What is a bag of words in NLP?
A. A bag of words is a representation technique in Natural Language Processing (NLP) where the text is represented as an unordered set of words, disregarding grammar and word order.
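To make that concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the two toy sentences are made up), which builds exactly such a bag-of-words representation:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the phone is great", "the battery of the phone is bad"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# each row counts how often each vocabulary word appears, ignoring word order
print(vectorizer.get_feature_names_out())   # use get_feature_names() on older scikit-learn
print(bow.toarray())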
Q. What is the difference between a bidirectional and a unidirectional LSTM?
A. A bidirectional Long Short-Term Memory (LSTM) processes input data in both forward and backward directions, capturing context from both past and future, whereas a unidirectional LSTM processes data only in one direction.
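As an illustration, here is a tiny Keras sketch (layer sizes are arbitrary and purely for demonstration) showing how wrapping an LSTM layer in Bidirectional makes it process the sequence in both directions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

# toy sizes, chosen only for illustration
model = Sequential([
    Embedding(input_dim=10000, output_dim=64),  # map word indices to 64-d vectors
    Bidirectional(LSTM(32)),                    # reads the sequence forwards and backwards
    Dense(1, activation='sigmoid')              # e.g. a binary sentiment output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])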
Q. What is an encoder in NLP?
A. In NLP, an encoder is a component in sequence-to-sequence models that transforms input data into a fixed-dimensional representation, often used in tasks like machine translation.
Q. What is fine-tuning in NLP?
A. Fine-tuning in NLP refers to the process of adjusting a pre-trained model on a specific task or domain to enhance its performance for a particular application.
Q. What is model architecture in NLP?
A. Model architecture in NLP refers to the overall structure and design of a neural network, specifying the arrangement of layers, connections, and operations within the model.
This line in the lemmatization(texts) function is not working: s = [token.lemma_ for token in nlp(i)] (NameError: name 'nlp' is not defined). Have run all the code up to this function. Pls advise.
Hi Sanjoy, thanks for pointing it out. nlp is a language model imported using spaCy by executing this code:
nlp = spacy.load('en', disable=['parser', 'ner'])
I have updated the same in the blog as well.

Just a quick heads up, in the end notes there is a typo: Falando -> Zalando. Thanks for the tutorial, keep 'em coming!
Interesting!!
Wonderful article. Thanks. Can you point me to a resource like yours where ELMo/BERT/ULMFiT/or any others is used in NER and /or Text Summarization?