The last couple of years have been incredible for Natural Language Processing (NLP) as a domain! We have seen multiple breakthroughs – ULMFiT, ELMo, Facebook's PyText and Google's BERT, among many others. These have rapidly accelerated the state-of-the-art research in NLP (and language modeling in particular).
We can now predict the next sentence, given a sequence of preceding words.
What's even more important is that machines are now beginning to understand the key element that had eluded them for so long.
Context! Understanding context has broken down barriers that had prevented NLP techniques from making headway before. And today, we are going to talk about one such library – Flair.
Until now, words were represented either as a sparse matrix or as word embeddings such as GloVe, BERT and ELMo, and the results have been pretty impressive. But there is always room for improvement, and Flair aims to fill that gap.
In this article, we will first understand what Flair is and the concept behind it. Then we’ll dive into implementing NLP tasks using Flair. Get ready to be impressed by its accuracy!
Please note that this article assumes familiarity with NLP concepts. You can go through the below articles if you need a quick refresher:
Flair is a simple natural language processing (NLP) library developed and open-sourced by Zalando Research. Flair’s framework builds directly on PyTorch, one of the best deep learning frameworks out there. The Zalando Research team has also released several pre-trained models for the following NLP tasks:
All of this looks promising. But what truly caught my attention was when I saw Flair outperforming several state-of-the-art results in NLP. Check out this table:
Note: F1 score is an evaluation metric primarily used for classification tasks. It’s often used in machine learning projects over the accuracy metric when evaluating models. The F1 score takes into consideration the distribution of the classes present.
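As a quick illustration, here is a minimal sketch of how the F1 score relates to precision and recall, using scikit-learn. The toy labels below are made up purely for demonstration:

from sklearn.metrics import precision_score, recall_score, f1_score

# toy ground-truth and predicted labels, purely for illustration
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall
print(2 * p * r / (p + r))
print(f1_score(y_true, y_pred))       # same value, computed directly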
There are plenty of awesome features packaged into the Flair library. Here’s my pick of the most prominent ones:
Context is so vital when working on NLP tasks. Learning to predict the next character based on previous characters forms the basis of sequence modeling.
Contextual String Embeddings leverage the internal states of a trained character language model to produce a novel type of word embedding. In simple terms, the same word gets a different embedding depending on the text that surrounds it, so words can carry different meanings in different sentences.
Note: A language model (word-level or character-level) is a probability distribution over sequences of words or characters, such that every new word or character depends on the words or characters that came before it. Have a look here to know more about it.
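Written out, this is just the chain rule of probability (with characters in place of words for a character-level model):

P(w1, w2, …, wn) = P(w1) × P(w2 | w1) × … × P(wn | w1, …, wn-1)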
Let's look at an example to understand this. Take the word "book": in "I read a book on NLP" it is a noun, while in "Please book a cab" it is a verb. A classic word embedding such as GloVe assigns "book" a single vector regardless of how it is used, whereas a contextual string embedding produces a different vector for each occurrence, because the characters surrounding the word differ.
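Here is a minimal sketch of that idea with Flair. The sentences and the choice of 'news-forward-fast' (one of the pre-trained character models shipped with the library) are my own, not from the original example:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedding = FlairEmbeddings('news-forward-fast')

s1 = Sentence('I read a book on NLP')
s2 = Sentence('Please book a cab')
embedding.embed(s1)
embedding.embed(s2)

# the two vectors for "book" differ because the surrounding characters differ
print(s1.tokens[3].embedding[:5])   # "book" in the first sentence
print(s2.tokens[1].embedding[:5])   # "book" in the second sentence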
Language is such a wonderful yet complex thing. You can read more about Contextual String Embeddings in this Research Paper.
It’s time to put Flair to the test! We’ve seen what this awesome library is all about. Now let’s see firsthand how it works on our machines.
We’ll use Flair to perform all the below NLP tasks in Python:
We will be using Google Colaboratory for running our code. One of the best things about Colab is that it provides GPU support for free! It is pretty handy for training deep learning models.
All you need is a stable internet connection.
We’ll be working on the Twitter Sentiment Analysis practice problem. Go ahead and download the dataset from there (you’ll need to register/log in first).
The problem statement posed by this challenge is:
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.
Overview of steps:
Step 1: Import the data into the local Environment of Colab:
Step 2: Installing Flair
Step 3: Preparing text to work with Flair
Step 4: Word Embeddings with Flair
Step 5: Vectorizing the text
Step 6: Partitioning the data for Train and Test Sets
Step 7: Time for predictions!
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
file_id = '1GhyH4k9C4uPRnMAMKhJYOqa-V9Tqt4q8'  ### File ID ###
data = drive.CreateFile({'id': file_id})
#print('Downloaded content "{}"'.format(data.GetContentString()))
You can find the file ID in the shareable link of the dataset file in the drive.
Importing the dataset into the Colab notebook:
import io
import pandas as pd

data = pd.read_csv(io.StringIO(data.GetContentString()))
data.head()
All the emoticons and symbols have been removed from the data and the characters have been converted to lowercase. Additionally, our dataset has already been divided into train and test sets. You can download this clean dataset from here.
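If you are starting from the raw tweets instead, a rough sketch of the same kind of cleaning could look like this. The regex and the 'tweet' column name are assumptions based on our dataset, not the exact script used to prepare the clean file:

import re

def clean_tweet(tweet):
    # keep only letters and spaces, dropping emoticons, symbols and numbers
    tweet = re.sub('[^a-zA-Z]', ' ', tweet)
    # collapse multiple spaces and convert to lowercase
    return re.sub(' +', ' ', tweet).strip().lower()

data['tweet'] = data['tweet'].apply(clean_tweet)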
# download the flair library #
!pip install flair
import flair
import torch
A Brief look at Flair Data Types
There are two types of objects central to this library – Sentence and Token objects. A Sentence holds a textual sentence and is essentially a list of Tokens:
from flair.data import Sentence

# create a sentence #
sentence = Sentence('Blogs of Analytics Vidhya are Awesome.')

# print the sentence to see what's in it #
print(sentence)
# extracting the tweet part #
text = data['tweet']

## txt is a list of tweets ##
txt = text.tolist()
print(txt[:10])
Feel free to first go through this article if you’re new to word embeddings: An Intuitive Understanding of Word Embeddings.
## Importing the Embeddings ##
from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
from flair.embeddings import BertEmbeddings
from flair.embeddings import ELMoEmbeddings

### Initialising embeddings (un-comment to use others) ###
#glove_embedding = WordEmbeddings('glove')
#character_embeddings = CharacterEmbeddings()
flair_forward = FlairEmbeddings('news-forward-fast')
flair_backward = FlairEmbeddings('news-backward-fast')
#bert_embedding = BertEmbeddings()
#elmo_embedding = ELMoEmbeddings()

stacked_embeddings = StackedEmbeddings(embeddings=[
    flair_forward,
    flair_backward
])
You would have noticed that we imported some of the most popular word embeddings above. Awesome! You can un-comment the relevant lines to use any of the other embeddings.
Now you might be asking – What in the world are “Stacked Embeddings”? Here, we can combine multiple embeddings to build a powerful word representation model without much complexity. Quite like ensembling, isn’t it?
We are stacking only the two Flair embeddings in this article to keep the computation time down. Feel free to play around with this and the other embeddings by using any combination you like.
Testing the stacked embeddings:
# create a sentence #
sentence = Sentence('Analytics Vidhya blogs are Awesome.')

# embed the words in the sentence #
stacked_embeddings.embed(sentence)

for token in sentence:
    print(token.embedding)

# data type and size of the embedding #
print(type(token.embedding))

# storing size (length) #
z = token.embedding.size()[0]
We’ll be showcasing this using two approaches.
Mean of Word Embeddings within a Tweet
We will be calculating the following in this approach. For each tweet:
Generate the embedding of every word in the tweet
Take the mean of these word embeddings as the tweet's vector
from tqdm import tqdm  ## tracks progress of the loop ##

# creating a tensor for storing sentence embeddings #
s = torch.zeros(0, z)

# iterating over sentences (tqdm tracks progress) #
for tweet in tqdm(txt):
    # empty tensor for words #
    w = torch.zeros(0, z)
    sentence = Sentence(tweet)
    stacked_embeddings.embed(sentence)
    # for every word #
    for token in sentence:
        # storing the word embedding of each word in the sentence #
        w = torch.cat((w, token.embedding.view(-1, z)), 0)
    # storing the sentence embedding (mean of the embeddings of all words) #
    s = torch.cat((s, w.mean(dim=0).view(-1, z)), 0)
Document Embedding: Vectorizing the entire Tweet
from flair.embeddings import DocumentPoolEmbeddings

### initialize the document embeddings, mode = mean ###
document_embeddings = DocumentPoolEmbeddings([
    flair_backward,
    flair_forward
])

### Vectorising text ###
# embed the first tweet to get the size of the document embedding #
sentence = Sentence(txt[0])
document_embeddings.embed(sentence)
z = sentence.embedding.size()[-1]

# creating a tensor for storing sentence embeddings #
s = torch.zeros(0, z)

# iterating over sentences #
for tweet in tqdm(txt):
    sentence = Sentence(tweet)
    document_embeddings.embed(sentence)
    # adding the document embedding to the tensor #
    s = torch.cat((s, sentence.embedding.view(-1, z)), 0)
You can choose either approach for your model. Now that our text is vectorised, we can feed it to our machine learning model!
## tensor to numpy array ##
X = s.numpy()

## test and train sets ##
test = X[31962:, :]
train = X[:31962, :]

# extracting labels of the training set #
target = data['label'][data['label'].isnull() == False].values
Defining custom F1 evaluator for XGBoost
from sklearn.metrics import f1_score

def custom_eval(preds, dtrain):
    labels = dtrain.get_label().astype(int)
    preds = (preds >= 0.3).astype(int)
    return [('f1_score', f1_score(labels, preds))]
Building the XGBoost model
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

### Splitting the training set ###
x_train, x_valid, y_train, y_valid = train_test_split(train, target,
                                                      random_state=42,
                                                      test_size=0.3)

### XGBoost compatible data ###
dtrain = xgb.DMatrix(x_train, y_train)
dvalid = xgb.DMatrix(x_valid, label=y_valid)

### defining parameters ###
params = {
    'colsample_bytree': 0.5,
    'eta': 0.1,
    'max_depth': 8,
    'min_child_weight': 6,
    'objective': 'binary:logistic',
    'subsample': 0.9
}

### Training the model ###
xgb_model = xgb.train(
    params,
    dtrain,
    feval=custom_eval,
    num_boost_round=1000,
    maximize=True,
    evals=[(dvalid, "Validation")],
    early_stopping_rounds=30
)
Our model has been trained and is ready for evaluation! Note: The parameters were taken from this Notebook.
### Reformatting the test set for XGB ###
dtest = xgb.DMatrix(test)

### Predicting ###
predict = xgb_model.predict(dtest)
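To turn these probabilities into a submission file, one way is to apply the probability threshold and write out the id/label columns. This is only a sketch: the 'id' column and the file name are assumptions, so check the exact submission format on the practice problem page:

import pandas as pd

# probabilities >= 0.2 are labelled as hate speech (1), the rest as 0
test_pred = (predict >= 0.2).astype(int)

# assuming the combined dataset keeps the original 'id' column for the test rows
submission = pd.DataFrame({'id': data['id'][31962:].values, 'label': test_pred})
submission.to_csv('submission.csv', index=False)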
I uploaded the predictions to the practice problem page with 0.2 as the probability threshold:
Word Embedding | F1-Score
GloVe | 0.53
flair-forward-fast | 0.45
flair-backward-fast | 0.48
Stacked (flair-forward-fast + flair-backward-fast) | 0.54
Note: According to Flair's official documentation, stacking the Flair embeddings with other embeddings often yields even better results. But there is a catch…
It might take a VERY LONG time to compute on a CPU. I highly recommend leveraging a GPU for faster results. You can use the free one within Colab!
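A quick way to confirm that a GPU runtime is actually active in Colab (Runtime > Change runtime type > GPU) is to check CUDA availability through PyTorch, which Flair runs on:

import torch

# True when a GPU runtime is attached; Flair will then run its models on the GPU
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')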
We will be using a subset of the CoNLL-2003 dataset, which is a pre-tagged dataset in English. Download the dataset from here.
Overview of steps:
Step 1: Importing the dataset
Step 2 : Extracting Sentences and PoS Tags from the dataset
Step 3: Tagging the text using NLTK and Flair
Step 4: Evaluating the PoS tags from NLTK and Flair against the tagged dataset
### file was uploaded manually to the local environment of Colab ###
data = open('pos-tagged_corpus.txt', 'r')
txt = data.read()
#print(txt)
The data file contains one word per line, with empty lines representing sentence boundaries.
### converting the text into a list of (words with their tags) ###
txt = txt.split('\n')

### removing DOCSTART (document header) ###
txt = [x for x in txt if x != '-DOCSTART- -X- -X- O']

### check ###
for i in range(10):
    print(txt[i])
    print('-'*10)

### Extracting Sentences ###
# initialize an empty list for storing words #
words = []
# initialize an empty list for storing sentences #
corpus = []

for i in tqdm(txt):
    ## if a blank line is encountered ##
    if i == '':
        ## the previous words form a sentence ##
        corpus.append(' '.join(words))
        ## refresh the word list ##
        words = []
    else:
        ## word at index 0 ##
        words.append(i.split()[0])

# did it work? #
for i in range(10):
    print(corpus[i])
    print('-'*10)

### Extracting POS ###
# initialize an empty list for storing word pos #
w_pos = []
# initialize an empty list for storing sentence pos #
POS = []

for i in tqdm(txt):
    ## a blank line = new sentence ##
    if i == '':
        ## the previous tags form a sentence POS ##
        POS.append(' '.join(w_pos))
        ## refresh the tag list ##
        w_pos = []
    else:
        ## pos tag at index 1 ##
        w_pos.append(i.split()[1])

# did it work? #
for i in range(10):
    print(corpus[i])
    print(POS[i])

### Removing blanks from sentences and pos ###
corpus = [x for x in corpus if x != '']
POS = [x for x in POS if x != '']

### check ###
for i in range(10):
    print(corpus[i])
    print(POS[i])
We have extracted the essential aspects we require from the dataset. Let's move on to step 3.
First, import the required libraries:
import nltk
nltk.download('tagsets')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize
This will download all the necessary files to tag the text using NLTK.
### Tagging the corpus with NLTK ###
# for storing results #
nltk_pos = []

## for every sentence ##
for i in tqdm(corpus):
    # tokenize the sentence #
    text = word_tokenize(i)
    # tag the words #
    z = nltk.pos_tag(text)
    # store #
    nltk_pos.append(z)
The PoS tags are in this format:
[('token_1', 'tag_1'), ... , ('token_n', 'tag_n')]
Let's extract the PoS tags from this:
### Extracting the final pos tags produced by nltk into a list ###
tmp = []
nltk_result = []

## every tagged sentence ##
for i in tqdm(nltk_pos):
    tmp = []
    ## every word ##
    for j in i:
        ## append the tag (at index 1) ##
        tmp.append(j[1])
    # join the tags of every sentence #
    nltk_result.append(' '.join(tmp))

### check ###
for i in range(10):
    print(nltk_result[i])
    print(corpus[i])
The NLTK tags are ready for business.
Importing the libraries first:
!pip install flair
from flair.data import Sentence
from flair.models import SequenceTagger
Tagging using Flair
# initiating the tagger object #
pos = SequenceTagger.load('pos-fast')

# for storing pos tagged strings #
f_pos = []

## for every sentence ##
for i in tqdm(corpus):
    sentence = Sentence(i)
    pos.predict(sentence)
    ## append the tagged sentence ##
    f_pos.append(sentence.to_tagged_string())

### check ###
for i in range(10):
    print(f_pos[i])
    print(corpus[i])
The result is in the below format:
token_1 <tag_1> token_2 <tag_2> ………………….. token_n <tag_n>
Note: We can use different taggers available within the Flair library. Feel free to tinker around and experiment. You can find the list here.
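For instance, here is a minimal sketch of swapping in a different pre-trained model, the named entity recognition tagger 'ner' from Flair's list of pre-trained models (the example sentence is my own):

from flair.data import Sentence
from flair.models import SequenceTagger

# load the named entity recognition model instead of 'pos-fast'
ner_tagger = SequenceTagger.load('ner')

sentence = Sentence('Zalando Research is based in Berlin')
ner_tagger.predict(sentence)
print(sentence.to_tagged_string())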
Extract the sentence-wise tags as we did with NLTK:
import re

### Extracting POS tags ###
## for every sentence, by index ##
for i in tqdm(range(len(f_pos))):
    ## for every word in the ith sentence ##
    for j in corpus[i].split():
        ## remove that word from the ith tagged sentence in f_pos ##
        f_pos[i] = str(f_pos[i]).replace(j, "", 1)

    ## removing < > symbols ##
    for j in ['<', '>']:
        f_pos[i] = str(f_pos[i]).replace(j, "")

    ## removing redundant spaces ##
    f_pos[i] = re.sub(' +', ' ', str(f_pos[i]))
    f_pos[i] = str(f_pos[i]).lstrip()

### check ###
for i in range(10):
    print(f_pos[i])
    print(corpus[i])
Aha! We have finally tagged the corpus and extracted them sentence-wise. We are free to remove all the punctuation and special symbols.
### Removing symbols and redundant spaces ###
## for every sentence, by index ##
for i in tqdm(range(len(corpus))):
    # removing symbols #
    corpus[i] = re.sub('[^a-zA-Z]', ' ', str(corpus[i]))
    POS[i] = re.sub('[^a-zA-Z]', ' ', str(POS[i]))
    f_pos[i] = re.sub('[^a-zA-Z]', ' ', str(f_pos[i]))
    nltk_result[i] = re.sub('[^a-zA-Z]', ' ', str(nltk_result[i]))

    ## removing HYPH and SYM (they are tags for symbols) ##
    f_pos[i] = str(f_pos[i]).replace('HYPH', "")
    f_pos[i] = str(f_pos[i]).replace('SYM', "")
    POS[i] = str(POS[i]).replace('SYM', "")
    POS[i] = str(POS[i]).replace('HYPH', "")
    nltk_result[i] = str(nltk_result[i].replace('HYPH', ''))
    nltk_result[i] = str(nltk_result[i].replace('SYM', ''))

    ## removing redundant spaces ##
    POS[i] = re.sub(' +', ' ', str(POS[i]))
    f_pos[i] = re.sub(' +', ' ', str(f_pos[i]))
    corpus[i] = re.sub(' +', ' ', str(corpus[i]))
    nltk_result[i] = re.sub(' +', ' ', str(nltk_result[i]))
We have tagged the corpus using NLTK and Flair, extracted and removed all the unnecessary elements. Let’s see it for ourselves:
for i in range(1000):
    print('corpus ' + corpus[i])
    print('actual ' + POS[i])
    print('nltk ' + nltk_result[i])
    print('flair ' + f_pos[i])
    print('-'*50)
OUTPUT:
corpus SOCCER JAPAN GET LUCKY WIN CHINA IN SURPRISE DEFEAT
actual NN NNP VB NNP NNP NNP IN DT NN
nltk NNP NNP NNP NNP NNP NNP NNP NNP NNP
flair NNP NNP VBP JJ NN NNP IN NNP NNP
--------------------------------------------------
corpus Nadim Ladki
actual NNP NNP
nltk NNP NNP
flair NNP NNP
--------------------------------------------------
corpus AL AIN United Arab Emirates
actual NNP NNP NNP NNPS CD
nltk NNP NNP NNP VBZ JJ
flair NNP NNP NNP NNP CD
That looks convincing!
Here, we are doing word-wise evaluation of the tags with the help of a custom-made evaluator.
corpus Japan coach Shu Kamo said The Syrian own goal proved lucky for us
actual NNP NN NNP NNP VBD POS DT JJ JJ NN VBD JJ IN PRP
nltk NNP VBP NNP NNP VBD DT JJ JJ NN VBD JJ IN PRP
flair NNP NN NNP NNP VBD DT JJ JJ NN VBD JJ IN PRP
Note that in the example above, the actual tag sequence contains an extra tag (the POS tag, left over from a symbol that was stripped from the text) which the NLTK and Flair outputs do not produce, so the sequences end up with different lengths. We will therefore skip sentences whose tag sequences are of unequal length when evaluating.
### EVALUATION FUNCTION ###
def pos_eval(x, y):
    # correct matches #
    count = 0
    # total comparisons made #
    comp = 0
    ## for every sentence index in the dataset ##
    for i in range(len(x)):
        x_tags = x[i].split()
        y_tags = y[i].split()
        ## only compare if the tag sequence lengths match ##
        if len(x_tags) == len(y_tags):
            ## compare each tag ##
            for j in range(len(x_tags)):
                if x_tags[j] == y_tags[j]:
                    ## match! ##
                    count = count + 1
                comp = comp + 1
    return (count/comp)*100
Finally we evaluate the POS tags of NLTK and Flair against the POS tags provided by the dataset.
print("nltk Score ", eval2(POS,nltk_result)) print("Flair Score ", eval2(POS,f_pos))
Our Result:
NLTK Score: 85.38654023442645
Flair Score: 90.96172124773179
Well, well, well. I can see why Flair has been getting so much attention in the NLP community.
Flair clearly provides an edge in word embeddings and stacked word embeddings. These can be implemented without much hassle thanks to its high-level API. The Flair embedding is something to keep an eye on in the near future.
I love that the Flair library supports multiple languages. The developers are also currently working on frame detection using Flair. The future looks really bright for this library.
I personally enjoyed working with and learning the ins and outs of this library. I hope you found the tutorial useful and will be using Flair to your advantage the next time you take up an NLP challenge.