The amount of text data being generated in the world is staggering. Google processes more than 40,000 searches EVERY second! According to a Forbes report, every single minute we send 16 million text messages and post 510,000 comments on Facebook. For a layman, it is difficult to even grasp the sheer magnitude of data out there.
News sites and other online media alone generate tons of text content on an hourly basis. Analyzing patterns in that data can become daunting if you don’t have the right tools. Here we will discuss one such approach to entity recognition, using a technique called Conditional Random Fields (CRF).
This article explains the concept and Python implementation of Conditional Random Fields on a self-annotated dataset. This is a really fun concept and I’m sure you’ll enjoy taking this ride with me!
Entity recognition has seen a recent surge in adoption along with the broader interest in Natural Language Processing (NLP). An entity can generally be defined as a part of text that is of interest to the data scientist or the business. Examples of frequently extracted entities are names of people, addresses, account numbers, locations, etc. These are only simple examples and one could come up with one’s own entities for the problem at hand.
To take a simple application of entity recognition, if there’s any text with “London” in the dataset, the algorithm would automatically categorize or classify that as a location (you must be getting a general idea of where I’m going with this).
Let’s take a simple case study to understand our topic in a better way.
Suppose that you are part of an analytics team in an insurance company where each day, the claims team receives thousands of emails from customers regarding their claims. The claims operations team goes through each email and updates an online form with the details before acting on them.
You are asked to work with the IT team to automate the process of pre-populating the online form. For this task, the analytics team needs to build a custom entity recognition algorithm.
To identify entities in text, one must be able to identify the pattern. For example, if we need to identify the claim number, we can look at the words around it such as “my id is” or “my number is”, etc. Let us examine a few approaches for identifying such patterns.
The bag of words (BoW) approach works well for many text classification problems. This approach assumes that the presence or absence of words matters more than their sequence. However, there are problems such as entity recognition and part-of-speech tagging where word sequences matter as much, if not more. For example, “flight from London to Paris” and “flight from Paris to London” contain exactly the same words, but the departure and arrival cities can only be told apart from the word order. Conditional Random Fields (CRF) come to the rescue here, as they use word sequences rather than just the words themselves.
Let us now understand how CRF is formulated.
Below is the formula for CRF where Y is the hidden state (for example, part of speech) and X is the observed variable (in our example this is the entity or other words around it).
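One common way to write the (linear-chain) CRF, in notation consistent with the description above, is:

P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Big)

Here the f_k are feature functions (like the word-level features we will build later in Python) and the λ_k are their weights, learned during training.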
Broadly speaking, there are two components to the CRF formula: the normalization term Z(X), which sums over all possible label sequences so that the result is a valid probability, and the weighted feature functions, where each feature captures a property of a word and its neighbours and its weight is learned during training.
Now that you are aware of the CRF model, let us curate the training data. The first step to doing this is annotation. Annotation is the process of tagging words with their corresponding entity tags. For simplicity, let us suppose that we only need two entities to populate the online form, namely the claimant name and the claim number.
The following is a sample email received as is. Such emails need to be annotated so that the CRF model can be trained. The annotated text needs to be in an XML format. Although you may choose to annotate the documents in your way, I’ll walk you through the use of the GATE architecture to do the same.
Email received:
“Hi,
I am writing this email to claim my insurance amount. My id is abc123 and I claimed it on 1st January 2018. I did not receive any acknowledgement. Please help.
Thanks,
randomperson”
Annotated Email:
“<document>Hi, I am writing this email to claim my insurance amount. My id is <claim_number>abc123</claim_number> and I claimed on 1st January 2018. I did not receive any acknowledgement. Please help. Thanks, <claimant>randomperson</claimant></document>”
Let us understand how to use the General Architecture for Text Engineering (GATE). Please follow the below steps to install GATE.
Once the installation is complete, you are ready to train and build your own CRF module. Let’s do this!
#invoke libraries
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.model_selection import train_test_split
import pycrfsuite
import os, os.path, sys
import glob
from xml.etree import ElementTree
import numpy as np
from sklearn.metrics import classification_report
Let’s define and build a few functions.
#this function appends all annotated files
def append_annotations(files):
    xml_files = glob.glob(files + "/*.xml")
    new_data = ""
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        #convert the XML tree back to a string and append it
        temp = ElementTree.tostring(data, encoding="unicode")
        new_data += temp
    return new_data

#this function removes special characters and punctuation
def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

#functions for extracting features in documents
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

def get_labels(doc):
    return [label for (token, postag, label) in doc]
Now we will import the annotated training data.
files_path = "D:/Annotated/"
allxmlfiles = append_annotations(files_path)
soup = bs(allxmlfiles, "html5lib")

#identify the tagged elements
docs = []
sents = []
for d in soup.find_all("document"):
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, 'NA'))
        else:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, wrd.name))
        sents = sents + tags  #puts all the tokens of a document in one list
    docs.append(sents)  #appends all the individual documents into one list
Generate features. These are default features similar to those used by the NER algorithm in NLTK. You can modify them for your own use case.
data = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

def word2features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]
    # Common features for all words. You may add more features here based on your custom use case
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]
    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')
    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')
    return features
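To get a feel for what these features look like, here is a small illustrative check. The sample_doc below is a hypothetical hand-built example (a list of word, POS tag, label triples), not part of the dataset.

# Hypothetical example document for inspecting the features of one token
sample_doc = [('My', 'PRP$', 'NA'), ('id', 'NN', 'NA'),
              ('is', 'VBZ', 'NA'), ('abc123', 'NN', 'claim_number')]

print(word2features(sample_doc, 3))
# Produces features such as 'word.lower=abc123', 'word[-3:]=123',
# 'word.isdigit=False', '-1:word.lower=is' and 'EOS'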
Now we’ll build the features and create the train and test sets.
X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
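The tagger in the next step expects a trained model file named 'crf.model'. Training is done with pycrfsuite's Trainer; the sketch below is a minimal version, where the regularization values and iteration limit are illustrative choices rather than tuned settings.

# Train a CRF model on the training sequences and save it as 'crf.model'
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 0.1,                             # coefficient for L1 regularization
    'c2': 0.01,                            # coefficient for L2 regularization
    'max_iterations': 200,                 # stop earlier if training converges
    'feature.possible_transitions': True   # also consider transitions not seen in training
})
trainer.train('crf.model')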
Let’s test our model.
tagger = pycrfsuite.Tagger()
tagger.open('crf.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]
You can inspect any predicted value by selecting the corresponding row number “i”.
i = 0
for x, y in zip(y_pred[i], [x[1].split("=")[1] for x in X_test[i]]):
    print("%s (%s)" % (y, x))
Check the performance of the model.
# Create a mapping of labels to indices (each class gets its own index
# so that the classification report below can name all three classes)
labels = {"NA": 0, "claim_number": 1, "claimant": 2}

# Convert the sequences of tags into a 1-dimensional array
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])
Print out the classification report. Based on the model performance, build better features to improve the performance.
print(classification_report(
    truths, predictions,
    labels=[1, 2, 0],
    target_names=["claim_number", "claimant", "NA"]))
#predict new data
with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:
    soup_test = bs(infile, "html5lib")

docs = []
sents = []
for d in soup_test.find_all("document"):
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, 'NA'))
        else:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, wrd.name))
        sents = sents + tags  #puts all the sentences of a document in one element of the list
    docs.append(sents)  #appends all the individual documents into one list

data_test = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

data_test_feats = [extract_features(doc) for doc in data_test]
tagger.open('crf.model')
newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]

# Let's check the predicted data
i = 0
for x, y in zip(newdata_pred[i], [x[1].split("=")[1] for x in data_test_feats[i]]):
    print("%s (%s)" % (y, x))
By now, you would have understood how to annotate training data, how to use Python to train a CRF model, and finally how to identify entities from new text. Although this example uses only a basic set of features, you can come up with your own set of features to improve the accuracy of the model.
To summarize, here are the key points that we have covered in this article:
- Entity recognition identifies the pieces of text, such as claimant names or claim numbers, that matter for a given business problem.
- Unlike the bag-of-words approach, Conditional Random Fields take word sequences into account, which makes them well suited to entity recognition.
- Training data can be annotated in XML format, for example using the GATE architecture.
- A CRF model can be trained in Python with pycrfsuite, evaluated with a classification report, and then used to extract entities from new text.
Sidharth Macherla – Independent Researcher, Natural Language Processing