The amount of text data being generated in the world is staggering. Google processes more than 40,000 searches EVERY second! According to a Forbes report, every single minute we send 16 million text messages and post 510,000 comments on Facebook. For a layman, it is difficult to even grasp the sheer magnitude of data out there.
News sites and other online media alone generate tons of text content on an hourly basis. Analyzing patterns in that data can become daunting if you don’t have the right tools. Here we will discuss one such approach to entity recognition, using a technique called Conditional Random Fields (CRF).
This article explains the concept and Python implementation of Conditional Random Fields on a self-annotated dataset. This is a really fun concept and I’m sure you’ll enjoy taking this ride with me!
Entity recognition has seen a recent surge in adoption along with the broader interest in Natural Language Processing (NLP). An entity can generally be defined as a part of text that is of interest to the data scientist or the business. Examples of frequently extracted entities are names of people, addresses, account numbers, locations, etc. These are only simple examples and one could come up with one’s own entities for the problem at hand.
To take a simple application of entity recognition, if there’s any text with “London” in the dataset, the algorithm would automatically categorize or classify that as a location (you must be getting a general idea of where I’m going with this).
Let’s take a simple case study to understand our topic in a better way.
Suppose that you are part of an analytics team in an insurance company where each day, the claims team receives thousands of emails from customers regarding their claims. The claims operations team goes through each email and updates an online form with the details before acting on them.
You are asked to work with the IT team to automate the process of pre-populating the online form. For this task, the analytics team needs to build a custom entity recognition algorithm.
To identify entities in text, one must be able to identify the pattern. For example, if we need to identify the claim number, we can look at the words around it such as “my id is” or “my number is”, etc. Let us examine a few approaches for identifying such patterns.
The bag of words (BoW) approach works well for many text classification problems. This approach assumes that the presence or absence of words matters more than their sequence. However, there are problems such as entity recognition and part-of-speech tagging where word sequences matter as much, if not more. For example, “flight from London to Paris” and “flight from Paris to London” contain exactly the same words, but the departure and arrival cities can only be told apart from the word order. Conditional Random Fields (CRF) come to the rescue here, as they use word sequences rather than just the words themselves.
Let us now understand how CRF is formulated.
Below is the formula for CRF where Y is the hidden state (for example, part of speech) and X is the observed variable (in our example this is the entity or other words around it).
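One common way to write the (linear-chain) CRF, in notation consistent with the description above, is:

P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Big)

Here the f_k are feature functions (like the word-level features we will build later in Python) and the λ_k are their weights, learned during training.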
Broadly speaking, there are two components to the CRF formula: the normalization term Z(X), which sums over all possible label sequences so that the result is a valid probability, and the weighted feature functions, where each feature captures a property of a word and its neighbours and its weight is learned during training.
Now that you are aware of the CRF model, let us curate the training data. The first step to doing this is annotation. Annotation is the process of tagging words with their corresponding entity tags. For simplicity, let us suppose that we only need two entities to populate the online form, namely the claimant name and the claim number.
The following is a sample email received as is. Such emails need to be annotated so that the CRF model can be trained. The annotated text needs to be in an XML format. Although you may choose to annotate the documents in your way, I’ll walk you through the use of the GATE architecture to do the same.
Email received:
“Hi,
I am writing this email to claim my insurance amount. My id is abc123 and I claimed it on 1st January 2018. I did not receive any acknowledgement. Please help.
Thanks,
randomperson”
Annotated Email:
“<document>Hi, I am writing this email to claim my insurance amount. My id is <claim_number>abc123</claim_number> and I claimed on 1st January 2018. I did not receive any acknowledgement. Please help. Thanks, <claimant>randomperson</claimant></document>”
Let us understand how to use the General Architecture for Text Engineering (GATE). Please follow the below steps to install GATE.
Once the installation is complete, you are ready to train and build your own CRF module. Let’s do this!
#invoke libraries
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.model_selection import train_test_split
import pycrfsuite
import os, os.path, sys
import glob
from xml.etree import ElementTree
import numpy as np
from sklearn.metrics import classification_report
Let’s define and build a few functions.
#this function appends all annotated files
def append_annotations(files):
    xml_files = glob.glob(files + "/*.xml")
    new_data = ""
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        #convert the XML tree back to a string and append it
        temp = ElementTree.tostring(data, encoding="unicode")
        new_data += temp
    return new_data

#this function removes special characters and punctuation
def remov_punct(withpunct):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

#functions for extracting features in documents
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

def get_labels(doc):
    return [label for (token, postag, label) in doc]
Now we will import the annotated training data.
files_path = "D:/Annotated/"
allxmlfiles = append_annotations(files_path)
soup = bs(allxmlfiles, "html5lib")

#identify the tagged elements
docs = []
sents = []
for d in soup.find_all("document"):
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, 'NA'))
        else:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, wrd.name))
        sents = sents + tags  #puts all the tokens of a document in one list
    docs.append(sents)  #appends all the individual documents into one list
Generate features. These are default features similar to those used by the NER algorithm in NLTK. You can modify them for your own use case.
data = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

def word2features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]
    # Common features for all words. You may add more features here based on your custom use case
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]
    # Features for words that are not at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')
    # Features for words that are not at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')
    return features
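To get a feel for what these features look like, here is a small illustrative check. The sample_doc below is a hypothetical hand-built example (a list of word, POS tag, label triples), not part of the dataset.

# Hypothetical example document for inspecting the features of one token
sample_doc = [('My', 'PRP$', 'NA'), ('id', 'NN', 'NA'),
              ('is', 'VBZ', 'NA'), ('abc123', 'NN', 'claim_number')]

print(word2features(sample_doc, 3))
# Produces features such as 'word.lower=abc123', 'word[-3:]=123',
# 'word.isdigit=False', '-1:word.lower=is' and 'EOS'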
Now we’ll build the features and create the train and test sets.
X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
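The tagger in the next step expects a trained model file named 'crf.model'. Training is done with pycrfsuite's Trainer; the sketch below is a minimal version, where the regularization values and iteration limit are illustrative choices rather than tuned settings.

# Train a CRF model on the training sequences and save it as 'crf.model'
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 0.1,                             # coefficient for L1 regularization
    'c2': 0.01,                            # coefficient for L2 regularization
    'max_iterations': 200,                 # stop earlier if training converges
    'feature.possible_transitions': True   # also consider transitions not seen in training
})
trainer.train('crf.model')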
Let’s test our model.
tagger = pycrfsuite.Tagger()
tagger.open('crf.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]
You can inspect any predicted value by selecting the corresponding row number “i”.
i = 0
for x, y in zip(y_pred[i], [x[1].split("=")[1] for x in X_test[i]]):
    print("%s (%s)" % (y, x))
Check the performance of the model.
# Create a mapping of labels to indices (each class gets its own index
# so that the classification report below can name all three classes)
labels = {"NA": 0, "claim_number": 1, "claimant": 2}

# Convert the sequences of tags into a 1-dimensional array
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])
Print out the classification report. Based on the model performance, build better features to improve the performance.
print(classification_report(
    truths, predictions,
    labels=[1, 2, 0],
    target_names=["claim_number", "claimant", "NA"]))
#predict new data
with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:
    soup_test = bs(infile, "html5lib")

docs = []
sents = []
for d in soup_test.find_all("document"):
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, 'NA'))
        else:
            withoutpunct = remov_punct(wrd)
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token, wrd.name))
        sents = sents + tags  #puts all the sentences of a document in one element of the list
    docs.append(sents)  #appends all the individual documents into one list

data_test = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]
    tagged = nltk.pos_tag(tokens)
    data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

data_test_feats = [extract_features(doc) for doc in data_test]
tagger.open('crf.model')
newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]

# Let's check the predicted data
i = 0
for x, y in zip(newdata_pred[i], [x[1].split("=")[1] for x in data_test_feats[i]]):
    print("%s (%s)" % (y, x))
By now, you would have understood how to annotate training data, how to use Python to train a CRF model, and finally how to identify entities from new text. Although this example uses only a basic set of features, you can come up with your own set of features to improve the accuracy of the model.
To summarize, here are the key points that we have covered in this article:
- Entity recognition identifies the pieces of text, such as claimant names or claim numbers, that matter for a given business problem.
- Unlike the bag-of-words approach, Conditional Random Fields take word sequences into account, which makes them well suited to entity recognition.
- Training data can be annotated in XML format, for example using the GATE architecture.
- A CRF model can be trained in Python with pycrfsuite, evaluated with a classification report, and then used to extract entities from new text.
Sidharth Macherla – Independent Researcher, Natural Language Processing